Andrej Karpathy's autoresearch idea is elegant: run 100 experiments a night on one GPU, measure which change improves model loss the most, commit the best delta, repeat. It's self-improvement on a feedback loop — no humans second-guessing, just measurement and gradient following.
But autoresearch assumes you're optimizing a trained model. What if the thing being optimized is not a neural network, but a vault — a collection of prompts, scripts, configs, and training data that drive an autonomous agent system?
That's what we built today.
The Vault as a Model
A personal knowledge vault with 13 fine-tuned local models running autonomous tasks (knowledge scoring, goal planning, note creation, spaced repetition, blog drafting, and more) is, in some sense, a machine learning model:
- Weights = prompts, scripts, configs, and training data
- Inference = one cycle of the heartbeat system (44 autonomous tasks running daily/weekly)
- Output = knowledge notes, blog posts, research insights, implementation goals
- Loss function = quality scores (1–5 per task, assigned by output-judge after each run)
- Training step = one improvement cycle (fix a bottleneck, measure the quality delta)
The vault has the same learning problem as any model: the weights (prompts, scripts, configs) are not perfectly tuned. Some tasks consistently score 2.5/5. Some are broken outright (zero notes produced in 8 days). Some have cross-task bugs that only show up when you trace data flow across the entire pipeline.
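To make the "quality score as loss" idea concrete, here is a minimal sketch of picking the next bottleneck. It assumes the output-judge scores are logged to a tab-separated file with date, task, and score columns; that layout and the function name are illustrative assumptions, not the vault's actual schema.

```sh
# Hypothetical log format (an assumption, not the vault's real schema):
#   <date>\t<task>\t<score 1-5>
# Prints "<task>\t<mean>" for the task with the lowest mean score.
lowest_scoring_task() {
  awk -F '\t' '
    { sum[$2] += $3; n[$2]++ }
    END {
      best = ""
      for (t in sum) {
        mean = sum[t] / n[t]
        if (best == "" || mean < min) { min = mean; best = t }
      }
      if (best != "") printf "%s\t%.2f\n", best, min
    }
  ' "$1"
}
```

Ties are broken arbitrarily here; a real loop would probably also weight recent runs more heavily than old ones.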
Can we apply autoresearch to improve the vault?
Yes. And the results are striking.
What We Built: Three Execution Modes
We started with a single interactive skill (/autoresearch) for one improvement cycle. But one cycle catches only the most obvious bottleneck. The real value is in chaining cycles together — feeding the output of one fix as the diagnosis for the next.
Option 1: Interactive Skill (Single Cycle)
sonnet-proposal (read quality scores → identify bottleneck)
→ human applies fix
→ sonnet-verification (confirm the fix is correct)
Works for one improvement. Useful for targeted diagnosis. But the vault has 21 failing tasks, 100+ potential fixes. One cycle misses systemic patterns.
Option 2: Background Agent (3-Cycle Chain)
Same as Option 1, but looped:
Cycle 1: diagnose lowest task (blog-drafter) → fix prompt → verify
Cycle 2: diagnose next (knowledge-spacer) → fix config → verify
Cycle 3: diagnose next (morning-briefing) → fix script → verify
Result: 59 tool calls, 6 minutes, 3 targeted fixes logged. The agent found three different root causes (prompt, config, script) and fixed them without human intervention. This is where autoresearch starts earning its keep.
Option 3: Team Session (Parallel Agents)
Spawn multiple agents in parallel:
- vault-autoresearch-1 chaining cycles on tasks 1–7
- vault-scout-linesheet scanning the Linesheet codebase for bugs (3 agents, one per domain)
- vault-scout-fitnessrewards scanning FitnessRewards
- gpu-experiment-runner queuing and running training experiments
All running at the same time, reporting back into a consolidated findings file.
Result: 30+ improvements in ~2 hours.
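The fan-out and consolidation pattern can be sketched with plain background jobs. In this sketch `run_agent` is a stand-in that only writes one line; how a real agent is launched is not shown in this post, so treat everything except the agent names as an assumption.

```sh
# run_agent is a placeholder for launching one agent (hypothetical).
# $1 = agent name, $2 = per-agent output file
run_agent() {
  echo "$1: finished scan" > "$2"
}

# run_team FINDINGS_FILE AGENT...
run_team() {
  findings=$1; shift
  workdir=$(mktemp -d)
  for agent in "$@"; do
    run_agent "$agent" "$workdir/$agent.out" &   # each agent runs in the background
  done
  wait                                           # block until every agent has exited
  cat "$workdir"/*.out > "$findings"             # merge once, after all writers are done
  rm -rf "$workdir"
}
```

Giving each agent its own output file and merging after `wait` avoids interleaved writes to the shared findings file.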
The Numbers: What Actually Happened
Prompt Fixes (14)
These were the highest-yield fixes. A single prompt edit — usually adding an explicit rule, example, or constraint — often moved a task from 3.0→4.2 quality in one cycle.
Examples:
- blog-drafter-vault (Cycle 1): Task was writing thin briefs (211 tokens, 3 facts only). Prompt said "include key facts" but didn't specify how many. Fix: expanded template to require 5 facts, specific filenames (not category tags), and measurable metrics. Expected jump from 3.1→4.0.
- morning-briefing (Cycle 2): num_predict: 300 was truncating output mid-section. The model was leaking conditional formatting instructions into the output ("(or 'quiet — no commits' if 0)"). Fix: raised num_predict to 600 and added the /no_think flag to suppress Qwen3 thinking overhead. Score 3.0→3.8.
- session-reflection (Cycle 1): The model was copying its example verbatim every single day. The training data included a specific example with real commit names ("2026-03-10: Precision over scope"), and the small local model learned to reproduce it exactly. Three days of identical output. Fix: replaced the example with structure-only placeholders and added a CRITICAL: DO NOT COPY directive.
- research-radar (Cycle 2): Model was outputting markdown code fences and literal [Tool] placeholders. Prompt showed a template-style example that the model interpreted as literal output. Fix: replaced example with concrete tool names and added a DO NOT output code fences rule.
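Several of these failures are mechanical enough to catch before the judge ever sees them. The sketch below is one possible pre-check, not part of the system described here: `lint_output` flags the research-radar symptoms (code fences, literal `[Tool]`), and `find_repeats` flags byte-identical outputs like the repeated session reflections. Function names and patterns are illustrative, and filenames are assumed to contain no whitespace.

```sh
# Returns non-zero (and prints a reason) if a task output shows
# template-leak symptoms. Patterns are illustrative assumptions.
lint_output() {
  rc=0
  if grep -q '^`\{3\}' "$1"; then          # line starting with three backticks
    echo "code fence in output: $1"; rc=1
  fi
  if grep -Fq '[Tool]' "$1"; then          # placeholder copied verbatim
    echo "literal [Tool] placeholder: $1"; rc=1
  fi
  return $rc
}

# Prints one "identical: <files>" line per group of files that share
# the same checksum and size (e.g. the same reflection on several days).
find_repeats() {
  cksum "$@" | awk '
    { k = $1 ":" $2; n[k]++; files[k] = files[k] " " $3 }
    END { for (k in n) if (n[k] > 1) print "identical:" files[k] }'
}
```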
Script Fixes (6)
Some of the best finds were bugs buried in collector scripts that ran thousands of times.
knowledge-spacer.sh had a process fork storm:
# OLD: 4 sed invocations per note
for note in Knowledge/*.md; do
last_reviewed=$(sed -n '/^last_reviewed:/p' "$note")
sr_interval=$(sed -n '/^sr_interval:/p' "$note")
sr_ease=$(sed -n '/^sr_ease:/p' "$note")
review_count=$(sed -n '/^review_count:/p' "$note")
done
# 879 knowledge notes × 4 sed calls = 3,516 process forks per run
Fix: replaced 4 sed calls with a single awk pass:
# NEW: single awk extracts all 4 fields once
awk '/^---$/{if(++fences==2) exit} fences==1{
if(/^last_reviewed:/) last_reviewed=$0
if(/^sr_interval:/) sr_interval=$0
if(/^sr_ease:/) sr_ease=$0
if(/^review_count:/) review_count=$0
}' "$note"
Result: 75% fewer process forks per run (3,516 → 879). The spacer used to spend 10–20s on startup just forking sed. Now it's instant.
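The awk pass above only assigns awk variables; the values still have to get back into the shell. One way to do that, sketched here under the assumption that each front-matter value is a single whitespace-free token, is to print all four on one line. The function name and the "-" placeholder for missing fields are illustrative choices.

```sh
# Extracts the four spaced-repetition fields from a note's YAML front matter
# using one awk process. Prints: last_reviewed sr_interval sr_ease review_count
# Assumes single-token values; missing or empty fields print "-".
read_sr_fields() {
  awk '
    /^---$/ { if (++fences == 2) exit; next }   # stop at the closing fence
    fences == 1 {
      i = index($0, ":")
      if (i == 0) next
      v = substr($0, i + 1)
      sub(/^[ \t]+/, "", v)                      # drop space after the colon
      val[substr($0, 1, i - 1)] = v
    }
    END {
      n = split("last_reviewed sr_interval sr_ease review_count", keys, " ")
      for (k = 1; k <= n; k++) {
        v = ((keys[k] in val) && val[keys[k]] != "") ? val[keys[k]] : "-"
        printf "%s%s", v, (k < n ? " " : "\n")
      }
    }
  ' "$1"
}

# Usage inside the loop (one awk per note instead of four sed calls):
#   set -- $(read_sr_fields "$note")
#   last_reviewed=$1 sr_interval=$2 sr_ease=$3 review_count=$4
```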
knowledge-quiz.sh had a regex that ate entire sections:
# OLD: stopped only at headings that do not start with K, so K-headings slipped through
awk '/^## [^K]/{exit} {print}'
# Broke on "## Key Sources", "## Key Facts": these leaked into the quiz
Fix: changed to explicit section boundary:
# NEW: extract only "## Key Points" section
awk '/^## Key Points/{found=1; next} found && /^## /{exit} found{print}'
fashion-signals.sh was strangled by an outdated assumption:
# OLD: head -10 (commented: "fit Ollama 4096 token context")
head -10 signals_list.txt # sends 10 items
# Model: "With 10 items, everything appears count=1. No trends visible."
The comment dated from when Ollama's default context was 4096 tokens. Gemma3 has a 131k context. With only 10 items, every signal appeared once, making aggregation useless.
Fix: head -10 → head -50. Now we get real cross-item patterns.
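The effect of the cutoff is easy to reproduce. The sketch below counts repeats among the first N lines of a list, assuming one signal per line; the function name is illustrative and the real collector's input format may differ.

```sh
# signal_counts FILE N -> "<count> <signal>" lines, most frequent first.
# Only the first N lines of FILE are considered.
signal_counts() {
  head -n "$2" "$1" | sort | uniq -c | sort -rn |
    awk '{ c = $1; $1 = ""; sub(/^ /, ""); print c, $0 }'   # strip uniq padding
}
```

If the first 10 lines happen to be distinct, every count is 1 and there is nothing to aggregate; a larger window is what lets repeats show up at all.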
output-judge.sh was hitting its context ceiling:
# OLD: send all tasks, all output
# 44 tasks × 80 lines/task = 3,520 lines, which overflows the 8192-token context limit
# → model hits limit, produces garbage, EOF error
Fix: head -80 → head -40 per task. 40 lines is enough to evaluate quality; 80 was causing truncation. Model now completes cleanly.
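A budget check before calling the judge would have caught this earlier. The following is a rough sketch: the `*.log` layout, the `###` header, and the four-characters-per-token estimate are all assumptions, and the estimate is a heuristic rather than the model's real tokenizer.

```sh
# build_judge_input DIR LINES_PER_TASK
# Concatenates the first LINES_PER_TASK lines of every *.log in DIR,
# each preceded by a "### <task>" header (header format is an assumption).
build_judge_input() {
  for f in "$1"/*.log; do
    [ -e "$f" ] || continue
    echo "### $(basename "$f" .log)"
    head -n "$2" "$f"
  done
}

# Reads stdin and prints a ballpark token count at ~4 characters per token.
estimate_tokens() {
  chars=$(wc -c | tr -d ' ')
  echo $(( (chars + 3) / 4 ))
}

# Example: build_judge_input heartbeat/logs 40 | estimate_tokens
```

Comparing that estimate against the context limit before each run turns a silent truncation into a visible warning.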
Model Swaps (2)
Sometimes the best fix is just use a different model.
knowledge-seeder-gemini: was producing identical garbage output ("Positional Encoding" text continuation) for 8+ days. Root cause: vault-unified-4b-q8 (a unified task model) can't follow the === NOTE: === format — it kept summarizing context instead of generating notes.
Fix: Swap to gemma3:12b (better at format following). Next run: properly formatted notes.
blog-writer: was selecting qwen2.5-coder:14b 20% of the time via exploration. This code model tends to stop at around 410 tokens (its training data is code examples), which is far too short for prose. Result: two consecutive 241-word failures (rejected by publisher for being under the 600-word minimum).
Fix: Remove qwen2.5-coder from the candidate pool. Now it only chooses between gemma3:12b and qwen3:8b, both of which handle 600+ word prose.
lesson-distiller: This one was subtle. The prompt told the model to "use the Write tool to create Knowledge/practice--*.md", but the Write tool was removed from the allowed toolset on 2026-03-16. The model had no way to act, so it output a narrative consulting report instead of the expected === PRACTICE NOTE: === blocks.
Fix (Cycle 2): Rewrote prompt to output structured text blocks instead of calling a tool.
Fix (Cycle 3): Realized there was no postprocessor (_postproc-lesson-distiller.sh) to parse the text blocks and write the files. Created one. The flywheel went from broken to functional.
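For illustration, a postprocessor of this kind could look like the sketch below. It is not the actual _postproc-lesson-distiller.sh: it assumes each block starts with a header of the form `=== PRACTICE NOTE: <slug> ===`, which this post does not confirm, and the function name and slug rules are made up for the example.

```sh
# split_practice_notes INPUT OUTDIR
# Assumed block format (not confirmed by the source):
#   === PRACTICE NOTE: some-slug ===
#   ...note body until the next header or end of input...
# Writes each block to OUTDIR/practice--<slug>.md; text before the
# first header (model chatter) is ignored.
split_practice_notes() {
  mkdir -p "$2"
  awk -v outdir="$2" '
    /^=== PRACTICE NOTE: .* ===$/ {
      if (out != "") close(out)
      slug = $0
      sub(/^=== PRACTICE NOTE: */, "", slug)
      sub(/ *===$/, "", slug)
      slug = tolower(slug)
      gsub(/[^a-z0-9]+/, "-", slug)        # keep filenames safe
      gsub(/^-+|-+$/, "", slug)
      out = (slug == "") ? "" : outdir "/practice--" slug ".md"
      next
    }
    out != "" { print > out }
  ' "$1"
}
```

The point is the one made above: structured text output only helps if something downstream parses it and writes the files.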
Config Tweaks (3)
Sometimes the fix is just tune the knob.
| Task | Fix | Before | After |
|---|---|---|---|
| knowledge-spacer | raised num_predict 300→600 | truncated output | complete output |
| morning-briefing | added /no_think flag | thinking overhead wasting tokens | deterministic output |
| fashion-signals | raised head -10→-50 | all signals count=1 | real patterns |
Data Quality Fixes (3)
Three training data audits found issues in the pipelines feeding models:
- deepener-4b: Only 5 original training examples. Mined 13 new examples from heartbeat logs (2026-03-07 to 2026-03-18). Total: 18 examples (5–15 items each, good coverage).
- planner-4b: 18 examples with a 50/50 balance of HAS-GOALS and NO-CHANGE cases. Too many thin "already up to date" examples (3 removed). Final: 15 examples, 60/40 split.
- spacer: 310 total examples across 3 files. Format validation: 100% pass rate. No data cleaning needed: the 33%→100% format improvement in the first experiment came from epoch tuning, not data quality.
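The planner-4b split is simple arithmetic: 18 examples at 50/50 is 9 and 9, and removing 3 thin NO-CHANGE examples leaves 9 and 6, or 60/40. A small helper like the one below can keep that check honest as data changes; it assumes a file with one label per line, which is an illustrative format rather than the real training-data layout.

```sh
# label_split FILE -> "<label> <percent>%" per label, sorted by label.
# FILE is assumed to hold one label per line.
label_split() {
  sort "$1" | uniq -c |
    awk '{ n[$2] = $1; total += $1 }
         END { for (l in n) printf "%s %d%%\n", l, int(100 * n[l] / total + 0.5) }' |
    LC_ALL=C sort
}
```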
GPU Experiments: 34 Training Runs
While the prompt/script fixes were running, the GPU was running a parallel sweep of hyperparameter experiments on four models:
Spacer (format score went 33%→100%):
- EXP-001: epochs 5→10 ✓ metric 0.33→1.0 (KEEP)
- EXP-002: epochs 5→15 ✗ metric 0.33→0.0 (overfit, discard)
- EXP-003: lr 5e-5→1e-4 ✗ metric 0.33→0.0 (discard)
- EXP-004: lora_rank 16→32 ✗ metric 0.33→0.0 (discard)
Simple lesson: epochs were the bottleneck. Doubled them, problem solved.
TIES-density (model merging, swept 0.1–0.7):
- EXP-008 at density=0.2 ✓ metric 0.675
- EXP-009 at density=0.4 ✓ metric 0.7
- EXP-010 at density=0.5 ✓ metric 0.7078 (best)
Sweet spot at 0.5. Higher density (0.6+) caused task interference; lower (0.1-0.3) underfit.
Deepener-4b (post-fix):
- EXP-20: epochs 5→15 ✓ metric 0.0→0.5 (KEEP)
- EXP-20-002: lr 5e-5→1e-4 ✓ metric 0.0→0.5 (both work)
Both fixes work independently. Use both.
Seeder-4b-v4 (had config issues):
- EXP-20-014: epochs 5→8 with v4 config ✓ metric 0.0→0.75 (KEEP)
Old config name was stale. Switching to v4 + epochs fixed it.
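The keep/discard rule running through these experiments is just a comparison against the baseline. As a minimal sketch, assuming results are logged as tab-separated experiment id, baseline metric, and new metric, with higher being better (the file layout is an assumption):

```sh
# judge_experiments FILE
# Input lines (hypothetical TSV): <exp_id>\t<baseline_metric>\t<new_metric>
# An experiment is kept only if it strictly improves on its baseline.
judge_experiments() {
  awk -F '\t' '{ printf "%s\t%s\n", $1, ($3 + 0 > $2 + 0 ? "KEEP" : "DISCARD") }' "$1"
}
```

Fed the spacer rows above, this keeps EXP-001 (0.33→1.0) and discards EXP-002 (0.33→0.0).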
Code Scout Reports: 30 Issues Across Linesheet and FitnessRewards
The project scouts (separate agents reading codebases) found 30 actionable issues across Linesheet and FitnessRewards:
Linesheet:
- 2 query purity violations (new Date() fallbacks in read handlers, which break Convex reactive caching)
- 4 dead public query functions (diagnostic image queries, never called, should be internalQuery)
- 1 TS2589 type error (excessive type instantiation on customers page)
- 3 useEffect antipatterns (form state sync; not bugs, and low risk)
- 3 Promise.all fan-outs that could be parallelized better
FitnessRewards Analytics:
- Missing by_tenant_account index usage (full-table scans that should use indexed queries)
- Unbounded poGroupItems.isReceived query (comment says "180 days" but queries ALL received items)
- Duplicate computation between getInventoryIntelligence and getHoldingCostInsights (both doing 90-day line item scans)
- Auth inconsistency (some functions return [] on auth failure, others should use requireAuth)
Total actionable items: 30. Most are quick (1–5 minute) fixes that improve performance, reduce surface area, or tighten auth. None are critical bugs, but collectively they represent ~50 minutes of technical debt that autoresearch surfaced in minutes.
The Dependency Map: Why This Matters
The vault's 44 autonomous tasks form a complex data flow:
knowledge-seeder → Knowledge/*.md
→ knowledge-judge → goals.md
→ goal-planner → auto-implement
→ episodes.md → implement-review
blog-drafter-vault ─┐
blog-drafter-projects ├→ heartbeat/logs/*
blog-drafter-research ┘
→ blog-article-adapter → research-article-*
→ blog-writer (WRONG: receives THIN BRIEF from drafter)
The bug in blog-drafter → blog-writer was found by tracing this chain. When you zoom out and see the whole pipeline, cross-task bugs pop out:
- Drafter was writing 211-token briefs (SIZE:small = 300–500, but defaulting to ~200)
- Writer had a minimum 600-word requirement
- Inadequate brief from drafter → under-length article from writer → rejected by blog-publisher
The fix wasn't in one task. It was in seeing the interface mismatch between two tasks and correcting both.
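An interface mismatch like this can be guarded with an explicit contract check between the two tasks. The sketch below rejects a brief that is too short before the writer runs; the function name and the use of a word count as the measure are assumptions, and the threshold is whatever the downstream task actually needs.

```sh
# check_brief FILE MIN_WORDS -> exit status 0 if the brief is long enough.
check_brief() {
  words=$(wc -w < "$1" | tr -d ' ')
  if [ "$words" -lt "$2" ]; then
    echo "brief too thin: $words words (< $2): $1" >&2
    return 1
  fi
}

# Example (hypothetical path and threshold):
#   check_brief drafts/today-brief.md 300 || exit 1
```

Failing fast at the boundary makes the drafter the visible culprit, instead of a rejection two steps downstream.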
This is why autoresearch on a vault is powerful. Neural nets have losses and gradients. Vaults have data flow. Autoresearch teaches the system to trace data flow and identify where the pipeline leaks.
Cost: Free GPU, 5% API Usage
- GPU experiments: 34 runs, ~37–53 GPU minutes each, ran overnight on local RX 9070 XT = $0
- Claude API: ~150 tool calls across all agents = ~2–5% of daily usage budget
- Human time: ~2 hours for setup + monitoring (agents ran autonomously)
Total cost: essentially free.
What's Next
- Weekly Opus quality gate (Sundays 04:30): Have Opus review the entire week's changes and recommend the top 3 improvements for the next cycle.
- /loop command: Continuous improvement mode. Instead of one cycle, keep chaining improvements until quality plateaus (e.g., "all tasks ≥ 3.8 or no improvement for 3 consecutive cycles").
- Qwen3 upgrade experiments: Retrain deepener, quizzer, and planner on Qwen3-4B base (now that we know 4B beats 8B under our VRAM constraints). Should improve format compliance and reasoning quality.
- Cross-vault autoresearch: Apply the same system to Linesheet, FitnessRewards, and other active codebases. Each project's CLAUDE.md becomes a config file for autoresearch.
The vault is a machine learning problem. It has weights (prompts, scripts, configs), a loss function (quality scores), and a training signal (every cycle of output-judge). Autoresearch treats it that way: not as a set of files to debug, but as a model to optimize.
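The plateau rule proposed for /loop can be written down directly. This is a sketch under assumptions: it reads a file with one number per line, the lowest task score after each cycle, so "all tasks ≥ 3.8" becomes "lowest score ≥ 3.8", and "no improvement" means a cycle that fails to beat the best lowest score seen so far. The function name and defaults are illustrative.

```sh
# should_stop FILE [TARGET] [PATIENCE]
# FILE holds one number per line: the lowest task score after each cycle.
# Exit status 0 means "stop": either the latest score reached TARGET, or the
# last PATIENCE cycles in a row failed to beat the best score so far.
should_stop() {
  awk -v target="${2:-3.8}" -v patience="${3:-3}" '
    NR == 1 { best = $1; last = $1; stalled = 0; next }
    {
      last = $1
      if ($1 > best) { best = $1; stalled = 0 } else stalled++
    }
    END { exit !(NR > 0 && (last >= target || stalled >= patience)) }
  ' "$1"
}

# Example: until should_stop scores.txt; do <run one improvement cycle>; done
```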
The payoff is compounding. Each fix reduces noise in the training signal. Cleaner signals let the next cycle make better decisions. After 30 improvements, the vault runs faster, produces better output, and requires less human oversight.
What if your entire infrastructure improved itself overnight?
It just did.