Vache prompts. Claude codes.How it works

I Ran Autoresearch on My Own Website

·5 min read·by Vache Sarkissian
Updated June 3, 2026
·
Reviewed April 11, 2026
autoresearchperformanceseoconvexweb-optimization
📚Top of Funnel

Written by Claude (Opus 4.6) Vache prompted, reviewed, and published. The data and benchmarks are real; the prose is AI-generated.

Karpathy's autoresearch is elegant: modify one file, measure one metric, keep the best delta, repeat. 700 experiments, 2 days, one GPU. The pattern works because it has three invariants — single modifiable artifact, single objective metric, fixed time budget.

But what if the thing being optimized isn't train.py? What if it's a production website?

The Experiment

I pointed the autoresearch pattern at this website. The setup:

  • Artifact: React + Vite + MDX blog, 64 posts, deployed on Cloudflare Pages
  • Metric: bundle size (measurable in seconds via npm run build)
  • Loop: diagnose → hypothesize → implement → measure → proceed/iterate/pivot

No planning. No roadmap. Just iterate.

Iteration 0: Bug Fix (Build Was Broken)

The first discovery wasn't a performance optimization — the build was broken entirely. A duplicate lastReviewed YAML key in a hub page's frontmatter caused the MDX parser to crash.

Fix: Delete one line. Impact: Build went from BROKEN to PASSING.

This mirrors the Omni-SimpleMem paper's single biggest finding: a missing response_format parameter — a one-line bug — caused +175% F1 improvement. Bug fixes have outsized impact because they're not tuning a knob, they're removing a wall.

Iteration 1: Architecture Change (-68% Vendor Bundle)

The vendor bundle was 1,739 kB. Every visitor downloaded it, on every page. The diagnosis:

LibrarySize in vendorUsed where
three.js1,293 kB (74%)Graph page only
lucide-react205 kBIcons everywhere
react-dom195 kBCore

three.js was 74% of the vendor bundle and only needed on one page (the knowledge graph visualization). It was being dragged into vendor because the Vite config blindly routed all node_modules there.

Fix: Exclude three.js and react-force-graph from the vendor manual chunk, letting Rollup co-locate them with the lazy-loaded Graph component.

Result: Vendor dropped from 1,739 to 553 kB. Most visitors — who never visit the graph page — save 1.2 MB.

Iteration 2: Bug Fix (Deploy Was Blocked)

The learning-notes-index.json had grown to 27 MB, exceeding Cloudflare Pages' 25 MB per-file limit. Deploys were silently failing.

Fix: Strip heavy fields (connectionsWithText, bodyContext, applications) from the index — they're loaded from shards on demand. Automated it as a post-build step so it never blocks again.

Impact: 27 → 12 MB. Deploys work.

Iterations 3-4: Performance Wins

Google Fonts render-blocking: The stylesheet loaded synchronously, blocking first paint. Fix: media="print" onload="this.media='all'" pattern. First paint no longer waits for Google's CDN.

og:image preload: Every visitor was preloading the Open Graph image — which is only used by social media crawlers. Removed.

Iteration 5: SEO (0 → 120 Internal Links)

The SEO audit found zero internal cross-links across 51 blog posts. The blog had hub pages and computed "related articles" in the UI, but no editorial links in the actual markdown content. Google treats editorial inline links much more strongly than computed sidebar links.

Fix: Script that adds a "Further Reading" section to each post with 2-3 links to posts in the same topic cluster.

Result: 120+ new internal links across 43 posts. The single biggest SEO change you can make without writing new content.

Iterations 6-8: Cleanup

  • Removed canvas-confetti (unused dependency)
  • Added aspect-video CSS to blog images (fixes CLS)
  • Expanded RSS feed from 20 items to all 63 posts

What Didn't Work (Correctly Pivoted)

Two investigations yielded nothing:

Blog post chunk splitting: All 129 MDX files compile into one 1 MB chunk. I investigated splitting them into individual chunks — but at ~8 KB per post average, the HTTP request overhead would exceed the savings. Correctly pivoted away.

Lucide-react optimization: Already tree-shaken via named imports. The 205 kB is the actual cost of 25 icons + the createLucideIcon infrastructure. No optimization available without switching libraries. Correctly pivoted.

The Pattern Holds

The Omni-SimpleMem paper identifies four properties that make a domain suited for autoresearch:

  1. Immediate scalar metrics — bundle size computes in seconds
  2. Modular architecture — Vite chunks, MDX posts, config files are independent
  3. Fast iteration cycles — build + measure in under 5 seconds
  4. Version-controlled code — git revert any failed experiment

It also identifies a taxonomy of discovery types, and our results match exactly:

Discovery TypePaper FindingOur Finding
Bug fix+175% (missing API param)Build broken → passing, deploy blocked → working
Architecture+44% (hybrid search)-68% vendor bundle
Prompt/config+188% (constraint positioning)Async fonts, RSS expansion
HyperparameterLeast impactfulCorrectly pivoted (chunk splitting, icon optimization)

Bug fixes and architecture changes dominate. Hyperparameter tuning contributes the least. This isn't a coincidence — it's the nature of complex systems. The biggest improvements come from fixing things that shouldn't be broken, not from tuning things that already work.

Beyond the Website: The Vault

This website experiment was a proof of concept. The real system is the vault — an 11,500+ note knowledge base maintained by autonomous AI agents running 24/7:

  • 4 parallel agent instances (Janitor, Quality Auditor, Connect+Act, Autoresearch)
  • Continuous loop with binary reward (1 = improved something, 0 = didn't)
  • Model routing: local GPU (Gemma 4, free) for processing, cloud (Haiku/Sonnet) for tool use
  • Quality scoring via output judges
  • Self-improving rules — every failure gets captured as a pattern in RULES.md

The vault has been running this pattern for months. The website experiment took one session. The difference between Karpathy's "idea file" and a production system is everything we learned in between: the mode specialization that emerged from trial and error, the quality scoring that catches hallucinated sources, the model routing that keeps costs sane.

You can watch the agents work in real time at /live.

Further Reading

Sources

About the Author

Vache Sarkissian

Building research infrastructure and products at the intersection of knowledge systems and machine learning. Creator of Linesheet Pro, vault-search, and the vachsark learning engine.

View Full Bio →
© 2026 Vache Sarkissian·Built with Claude Code
vachsark.com