What is autoresearch and how does it apply to websites?

Autoresearch is a pattern where an AI agent iteratively diagnoses issues, implements fixes, measures results, and repeats — without human intervention in the inner loop. Originally applied to ML training scripts by Karpathy (700 experiments in 2 days), the same pattern works on any system with a measurable metric, modular architecture, fast iteration cycles, and version control. Websites qualify on all four criteria.

What were the actual results of running autoresearch on a production website?

9 iterations produced 8 deployable changes: a 68% vendor bundle reduction (1,739 to 553 kB), 120+ new internal cross-links across 43 blog posts, async font loading, a real-time Convex telemetry backend, and several bug fixes. Two of those bug fixes had outsized impact: one bug was blocking the entire build, and the other was blocking deploys to Cloudflare Pages.

How does this compare to the Omni-SimpleMem paper's findings?

The Omni-SimpleMem paper (arXiv 2604.01007) found that bug fixes (+175%), architectural changes (+44%), and prompt engineering (+188%) each individually exceeded the impact of all hyperparameter tuning combined. Our website experiment showed the same pattern: bug fixes and architecture changes dominated, while hyperparameter-style tweaks (blog chunk splitting, icon library optimization) yielded nothing and were correctly abandoned.

I Ran Autoresearch on My Own Website

Karpathy's autoresearch is elegant: modify one file, measure one metric, keep the best delta, repeat. 700 experiments, 2 days, one GPU. The pattern works because it has three invariants — single modifiable artifact, single objective metric, fixed time budget.

But what if the thing being optimized isn't train.py? What if it's a production website?

The Experiment

I pointed the autoresearch pattern at this website. The setup:

Artifact: React + Vite + MDX blog, 64 posts, deployed on Cloudflare Pages
Metric: bundle size (measurable in seconds via npm run build)
Loop: diagnose → hypothesize → implement → measure → proceed/iterate/pivot

No planning. No roadmap. Just iterate.
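
As a rough illustration of that loop, here is a minimal sketch of the "keep the best delta" rule. It is not the agent's actual code: the Experiment interface, the apply/revert hooks, and the in-memory bundle size are all assumptions made so the example runs on its own. In a real setup, apply and revert would more likely wrap a code change plus a git revert, and measure would run the build.

```typescript
// One candidate change. apply() mutates the project, revert() undoes it.
// Both hooks are hypothetical; in practice they might wrap git commands.
interface Experiment {
  name: string;
  apply: () => void;
  revert: () => void;
}

interface LoopResult {
  kept: string[];
  discarded: string[];
  finalMetric: number;
}

// Runs each experiment once and keeps it only if the metric strictly improves.
// Lower is better here, as with bundle size.
function runLoop(experiments: Experiment[], measure: () => number): LoopResult {
  let best = measure();
  const kept: string[] = [];
  const discarded: string[] = [];

  for (const exp of experiments) {
    exp.apply();
    const score = measure();
    if (score < best) {
      best = score;
      kept.push(exp.name);
    } else {
      exp.revert();
      discarded.push(exp.name);
    }
  }
  return { kept, discarded, finalMetric: best };
}

// Usage with a fake in-memory "bundle" so the sketch runs anywhere.
let bundleKb = 1739;
const result = runLoop(
  [
    {
      name: "lazy-load three.js",
      apply: () => { bundleKb -= 1186; },
      revert: () => { bundleKb += 1186; },
    },
    { name: "no-op tweak", apply: () => {}, revert: () => {} },
  ],
  () => bundleKb,
);
console.log(result);
```

The sketch only covers changes that move a scalar metric. Fixes like "the build is broken" fall outside it, since there is no baseline to compare against until the build passes.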

Iteration 0: Bug Fix (Build Was Broken)

The first discovery wasn't a performance optimization — the build was broken entirely. A duplicate lastReviewed YAML key in a hub page's frontmatter caused the MDX parser to crash.

Fix: Delete one line. Impact: Build went from BROKEN to PASSING.
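
A check like the one below could catch this class of bug before the MDX parser does. It is a sketch under assumptions, not something the site necessarily runs: it only looks at simple top-level key: value lines, the function name is hypothetical, and the sample dates are made up.

```typescript
// Returns top-level keys that appear more than once in a YAML frontmatter block.
// Only handles simple "key: value" lines at column 0; nested YAML is ignored.
function findDuplicateFrontmatterKeys(source: string): string[] {
  const match = source.match(/^---\r?\n([\s\S]*?)\r?\n---/);
  if (!match) return [];

  const seen = new Set<string>();
  const duplicates = new Set<string>();
  for (const line of match[1].split(/\r?\n/)) {
    const key = line.match(/^([A-Za-z0-9_-]+)\s*:/)?.[1];
    if (!key) continue;
    if (seen.has(key)) duplicates.add(key);
    seen.add(key);
  }
  return Array.from(duplicates);
}

// Sample frontmatter with the duplicated key described above (values invented).
const sample = [
  "---",
  "title: Example hub page",
  "lastReviewed: 2025-01-01",
  "lastReviewed: 2025-02-01",
  "---",
  "Body text.",
].join("\n");

const duplicateKeys = findDuplicateFrontmatterKeys(sample);
console.log(duplicateKeys);
```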

This mirrors the Omni-SimpleMem paper's single biggest finding: a missing response_format parameter, a one-line bug, accounted for a +175% F1 improvement once it was fixed. Bug fixes have outsized impact because they're not tuning a knob, they're removing a wall.

Iteration 1: Architecture Change (-68% Vendor Bundle)

The vendor bundle was 1,739 kB. Every visitor downloaded it, on every page. The diagnosis:

Library	Size in vendor	Used where
three.js	1,293 kB (74%)	Graph page only
lucide-react	205 kB	Icons everywhere
react-dom	195 kB	Core

three.js was 74% of the vendor bundle and only needed on one page (the knowledge graph visualization). It was being dragged into vendor because the Vite config blindly routed all node_modules there.

Fix: Exclude three.js and react-force-graph from the vendor manual chunk, letting Rollup co-locate them with the lazy-loaded Graph component.

Result: Vendor dropped from 1,739 to 553 kB. Most visitors — who never visit the graph page — save 1.2 MB.
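
For readers who want to try the same change, here is a sketch of what such a manualChunks function might look like. The package list and the prefix matching are my assumptions for illustration, not a copy of the site's config. Vite passes the function through to Rollup via build.rollupOptions.output.manualChunks; returning undefined leaves the placement to Rollup.

```typescript
// Packages assumed to be needed only by the lazy-loaded Graph page.
// Prefix matching is deliberate so companions such as react-force-graph-3d
// or three-* helper packages follow the same rule.
const LAZY_ONLY_PACKAGES = ["three", "react-force-graph"];

// Candidate for build.rollupOptions.output.manualChunks.
// Returning undefined lets Rollup decide, which here means the module
// follows the dynamic import that uses it instead of landing in "vendor".
function manualChunks(id: string): string | undefined {
  const normalized = id.replace(/\\/g, "/"); // normalize Windows paths
  if (!normalized.includes("/node_modules/")) return undefined;

  const isLazyOnly = LAZY_ONLY_PACKAGES.some((pkg) =>
    normalized.includes(`/node_modules/${pkg}`),
  );
  return isLazyOnly ? undefined : "vendor";
}

// Wiring in vite.config.ts would look roughly like this (not executed here):
//   export default defineConfig({
//     build: { rollupOptions: { output: { manualChunks } } },
//   });

console.log(manualChunks("/app/node_modules/three/build/three.module.js"));
console.log(manualChunks("/app/node_modules/react-dom/index.js"));
```

This only helps if the Graph component is loaded through a dynamic import. With a static import, Rollup would still pull three.js into an eagerly loaded chunk.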

Iteration 2: Bug Fix (Deploy Was Blocked)

The learning-notes-index.json had grown to 27 MB, exceeding Cloudflare Pages' 25 MB per-file limit. Deploys were silently failing.

Fix: Strip heavy fields (connectionsWithText, bodyContext, applications) from the index — they're loaded from shards on demand. Automated it as a post-build step so it never blocks again.

Impact: 27 → 12 MB. Deploys work.
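
The stripping step itself is small. The sketch below assumes the index is an array of plain objects, which the post does not confirm; only the three field names come from the fix described above. File reading and writing are left out so the example stays self-contained.

```typescript
// Field names come from the fix above; the entry shape is otherwise an assumption.
const HEAVY_FIELDS = ["connectionsWithText", "bodyContext", "applications"] as const;

type IndexEntry = Record<string, unknown>;

// Returns a copy of each entry without the heavy fields; the input is not mutated.
function stripHeavyFields(entries: IndexEntry[]): IndexEntry[] {
  return entries.map((entry) => {
    const slim: IndexEntry = { ...entry };
    for (const field of HEAVY_FIELDS) delete slim[field];
    return slim;
  });
}

// Invented sample entry, only to show the effect.
const fullIndex: IndexEntry[] = [
  {
    id: "note-1",
    title: "Example note",
    bodyContext: "long body text ".repeat(100),
    connectionsWithText: ["another long string"],
    applications: ["yet another"],
  },
];

const slimIndex = stripHeavyFields(fullIndex);
// Character counts are a rough proxy for file size, not an exact byte count.
console.log(JSON.stringify(fullIndex).length, JSON.stringify(slimIndex).length);
```

In a post-build script, the slimmed array would be written back over the generated index, with a size check before deploy so an oversized file fails loudly.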

Iterations 3-4: Performance Wins

Google Fonts render-blocking: The stylesheet loaded synchronously, blocking first paint. Fix: media="print" onload="this.media='all'" pattern. First paint no longer waits for Google's CDN.
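
Spelled out in full, the pattern is a link tag plus a noscript fallback. The helper below just builds that markup as a string; the function name and the example font URL are assumptions for illustration.

```typescript
// Builds non-blocking stylesheet markup for a raw (unescaped) URL.
// media="print" lets the browser fetch the CSS without blocking render;
// the onload handler then switches it to apply to all media.
// The noscript copy covers visitors with JavaScript disabled.
function asyncStylesheetTag(href: string): string {
  const safeHref = href.replace(/&/g, "&amp;").replace(/"/g, "&quot;");
  return [
    `<link rel="stylesheet" href="${safeHref}" media="print" onload="this.media='all'">`,
    `<noscript><link rel="stylesheet" href="${safeHref}"></noscript>`,
  ].join("\n");
}

// Example font URL; the family name is a placeholder.
const fontTag = asyncStylesheetTag(
  "https://fonts.googleapis.com/css2?family=Inter&display=swap",
);
console.log(fontTag);
```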

og:image preload: Every visitor was preloading the Open Graph image — which is only used by social media crawlers. Removed.

Iteration 5: SEO (0 → 120 Internal Links)

The SEO audit found zero internal cross-links across 51 blog posts. The blog had hub pages and computed "related articles" in the UI, but no editorial links in the actual markdown content. Google weights editorial inline links much more heavily than computed sidebar links.

Fix: Script that adds a "Further Reading" section to each post with 2-3 links to posts in the same topic cluster.

Result: 120+ new internal links across 43 posts. This was the single biggest SEO change available without writing new content.
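
The core of such a script might look like the sketch below. The Post shape, the cluster field, the /blog/ URL prefix, and the "first few in list order" selection are all assumptions; the post does not say how the real script chose which posts to link.

```typescript
interface Post {
  slug: string;
  title: string;
  cluster: string; // topic cluster label; hypothetical field name
}

// Picks up to `max` other posts from the same cluster and renders a markdown section.
// Returns an empty string when there is nothing to link to.
function furtherReadingSection(current: Post, all: Post[], max = 3): string {
  const related = all
    .filter((p) => p.cluster === current.cluster && p.slug !== current.slug)
    .slice(0, max);
  if (related.length === 0) return "";

  const links = related.map((p) => `- [${p.title}](/blog/${p.slug})`);
  return ["## Further Reading", "", ...links].join("\n");
}

// Invented posts, only to show the output shape.
const posts: Post[] = [
  { slug: "post-a", title: "Post A", cluster: "performance" },
  { slug: "post-b", title: "Post B", cluster: "performance" },
  { slug: "post-c", title: "Post C", cluster: "agents" },
  { slug: "post-d", title: "Post D", cluster: "performance" },
];

const section = furtherReadingSection(posts[0], posts);
console.log(section);
```

Posts that are alone in their cluster get no section, which may be one reason a run like this would touch fewer posts than the blog contains.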

Iterations 6-8: Cleanup

Removed canvas-confetti (unused dependency)
Added aspect-video CSS to blog images (fixes CLS)
Expanded RSS feed from 20 items to all 63 posts

What Didn't Work (Correctly Pivoted)

Two investigations yielded nothing:

Blog post chunk splitting: All 129 MDX files compile into one 1 MB chunk. I investigated splitting them into individual chunks, but at an average of ~8 KB per post, the per-request HTTP overhead would exceed the savings. Correctly pivoted away.

Lucide-react optimization: Already tree-shaken via named imports. The 205 kB is the actual cost of 25 icons + the createLucideIcon infrastructure. No optimization available without switching libraries. Correctly pivoted.

The Pattern Holds

The Omni-SimpleMem paper identifies four properties that make a domain suited for autoresearch:

Immediate scalar metrics — bundle size computes in seconds
Modular architecture — Vite chunks, MDX posts, config files are independent
Fast iteration cycles — build + measure in under 5 seconds
Version-controlled code — git revert any failed experiment

It also identifies a taxonomy of discovery types, and our results match exactly:

Discovery Type	Paper Finding	Our Finding
Bug fix	+175% (missing API param)	Build broken → passing, deploy blocked → working
Architecture	+44% (hybrid search)	-68% vendor bundle
Prompt/config	+188% (constraint positioning)	Async fonts, RSS expansion
Hyperparameter	Least impactful	Correctly pivoted (chunk splitting, icon optimization)

Bug fixes and architecture changes dominate. Hyperparameter tuning contributes the least. This isn't a coincidence — it's the nature of complex systems. The biggest improvements come from fixing things that shouldn't be broken, not from tuning things that already work.

Beyond the Website: The Vault

This website experiment was a proof of concept. The real system is the vault — an 11,500+ note knowledge base maintained by autonomous AI agents running 24/7:

4 parallel agent instances (Janitor, Quality Auditor, Connect+Act, Autoresearch)
Continuous loop with binary reward (1 = improved something, 0 = didn't)
Model routing: local GPU (Gemma 4, free) for processing, cloud (Haiku/Sonnet) for tool use
Quality scoring via output judges
Self-improving rules — every failure gets captured as a pattern in RULES.md

The vault has been running this pattern for months. The website experiment took one session. The difference between Karpathy's "idea file" and a production system is everything we learned in between: the mode specialization that emerged from trial and error, the quality scoring that catches hallucinated sources, the model routing that keeps costs sane.

You can watch the agents work in real time at /live.

About the Author

Vache Sarkissian