What is vault autoresearch and how does it differ from traditional autoresearch?

Vault autoresearch applies Andrej Karpathy's autoresearch methodology to a personal knowledge vault instead of a neural network. Rather than treating a trained model as the optimization target, it treats the vault's prompts, scripts, configs, and training data as 'weights' that can be improved. Quality scores (1-5 per task) act as the loss function, and improvement cycles are training steps. This enables the vault to diagnose and fix its own bottlenecks automatically.

What are the three execution modes for vault autoresearch?

Option 1 (Interactive Skill) performs a single diagnosis-fix-verify cycle with human oversight. Option 2 (Background Agent) chains 3+ cycles automatically, fixing multiple bottlenecks in sequence without human intervention. Option 3 (Team Session) spawns multiple agents in parallel across different domains (vault tasks, Linesheet codebase, FitnessRewards, GPU experiments) to land 30+ improvements in a single session.

What types of improvements did the autoresearch system find?

The system discovered five categories of improvements: Prompt Fixes (14 instances of adding explicit rules or constraints), Script Fixes (6 bottlenecks like subprocess overhead and regex errors), Model Swaps (2 cases where changing the underlying model resolved format-following issues), Config Tweaks (3 parameter tuning improvements), and Data Quality Fixes (3 training data audits).

What were the concrete results of running vault autoresearch?

In ~2 hours, the system identified and applied 30+ improvements across all categories. Specific wins included cutting knowledge-spacer subprocess forks by 75%, eliminating a knowledge-quiz regex bug that broke section extraction, fixing a prompt example that the model copied verbatim for days, and improving GPU experiment outcomes (e.g., spacer format score 33%→100%, and a TIES-density sweep that found its sweet spot at 0.5).

Vault Autoresearch: A Personal AI Learns From Itself

Andrej Karpathy's autoresearch idea is elegant: run 100 experiments a night on one GPU, measure which change improves model loss the most, commit the best delta, repeat. It's self-improvement on a feedback loop — no humans second-guessing, just measurement and gradient following.

But autoresearch assumes you're optimizing a trained model. What if the thing being optimized is not a neural network, but a vault — a collection of prompts, scripts, configs, and training data that drive an autonomous agent system?

That's what we built today.

The Vault as a Model

A personal knowledge vault with 13 fine-tuned local models running autonomous tasks (knowledge scoring, goal planning, note creation, spaced repetition, blog drafting, and more) is, in some sense, a machine learning model:

Weights = prompts, scripts, configs, and training data
Inference = one cycle of the heartbeat system (44 autonomous tasks running daily/weekly)
Output = knowledge notes, blog posts, research insights, implementation goals
Loss function = quality scores (1–5 per task, assigned by output-judge after each run)
Training step = one improvement cycle (fix a bottleneck, measure the quality delta)

The vault has the same learning problem as any model: the weights (prompts, scripts, configs) are not perfectly tuned. Some tasks consistently score 2.5/5. Some are broken outright (zero notes produced in 8 days). Some have cross-task bugs that only show up when you trace data flow across the entire pipeline.

Can we apply autoresearch to improve the vault?

Yes. And the results are striking.

What We Built: Three Execution Modes

We started with a single interactive skill (/autoresearch) for one improvement cycle. But one cycle catches only the most obvious bottleneck. The real value is in chaining cycles together — feeding the output of one fix as the diagnosis for the next.

Option 1: Interactive Skill (Single Cycle)

sonnet-proposal (read quality scores → identify bottleneck)
→ human applies fix
→ sonnet-verification (confirm the fix is correct)

Works for one improvement and is useful for targeted diagnosis. But the vault has 21 failing tasks and 100+ potential fixes, and a single cycle misses systemic patterns.
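
The diagnosis step amounts to finding the task with the worst average score. A minimal sketch, assuming a hypothetical score log with one "task score" pair per line (the real output-judge format is not shown in this post, and lowest_task is an illustrative name):

```shell
#!/bin/sh
# Hypothetical input format (an assumption): one "task score" pair per line,
# e.g. "blog-drafter-vault 3.1". Reads stdin and prints the task with the
# lowest mean score, followed by that mean.
lowest_task() {
  awk '
    NF == 2 { sum[$1] += $2; n[$1]++ }
    END {
      best = ""
      for (t in sum) {
        mean = sum[t] / n[t]
        if (best == "" || mean < best_mean) { best = t; best_mean = mean }
      }
      if (best != "") printf "%s %.2f\n", best, best_mean
    }
  '
}
```

Averaging over several runs, rather than taking the latest score, keeps one unlucky run from sending the cycle after the wrong task.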

Option 2: Background Agent (3-Cycle Chain)

Same as Option 1, but looped:

Cycle 1: diagnose lowest task (blog-drafter) → fix prompt → verify
Cycle 2: diagnose next (knowledge-spacer) → fix config → verify
Cycle 3: diagnose next (morning-briefing) → fix script → verify

Result: 59 tool calls, 6 minutes, 3 targeted fixes logged. The agent found three different root causes (prompt, config, script) and fixed them without human intervention. This is where autoresearch starts earning.
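
Chaining is a loop around the single cycle: pick the lowest-scoring task that has not been handled yet, fix, verify, repeat. A rough sketch, where propose_fix and verify_fix are hypothetical placeholders for the agent's real steps and the scores file layout (task name, then mean score) is an assumption:

```shell
#!/bin/sh
# propose_fix and verify_fix are hypothetical stand-ins for the agent's
# diagnose/fix and verification steps.
propose_fix() { echo "fix for $1"; }   # placeholder: would call the model
verify_fix()  { return 0; }            # placeholder: would re-check the output

# $1 = scores file ("task mean_score" per line), $2 = number of cycles
run_chain() {
  scores=$1 cycles=$2 done_list=""
  i=1
  while [ "$i" -le "$cycles" ]; do
    # lowest-scoring task that was not handled in an earlier cycle
    task=$(sort -k2,2n "$scores" | awk -v skip="$done_list" '
      BEGIN { n = split(skip, s, " "); for (j = 1; j <= n; j++) seen[s[j]] = 1 }
      !($1 in seen) { print $1; exit }')
    [ -n "$task" ] || break
    fix=$(propose_fix "$task")
    if verify_fix "$task"; then
      echo "cycle $i: $task -> $fix (verified)"
    else
      echo "cycle $i: $task -> $fix (verification failed)"
    fi
    done_list="$done_list $task"
    i=$((i + 1))
  done
}
```

Tracking handled tasks matters because scores only refresh after the next output-judge run; without the skip list, every cycle would pick the same task again.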

Option 3: Team Session (Parallel Agents)

Spawn multiple agents in parallel:

vault-autoresearch-1 chaining cycles on tasks 1–7
vault-scout-linesheet scanning the Linesheet codebase for bugs (3 agents, one per domain)
vault-scout-fitnessrewards scanning FitnessRewards
gpu-experiment-runner queuing and running training experiments

All running at the same time, reporting back into a consolidated findings file.

Result: 30+ improvements in ~2 hours.
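
The fan-out itself needs nothing more exotic than background jobs and a wait. A sketch under stated assumptions: run_agent is a hypothetical placeholder (a real one would start an agent session), and the per-agent temp files are our illustration of how a consolidated findings file could be assembled safely:

```shell
#!/bin/sh
# run_agent is a hypothetical stand-in for launching one agent.
# $1 = agent name, $2 = file the agent writes its findings to
run_agent() {
  echo "[$1] finding: placeholder" > "$2"
}

# $1 = consolidated findings file, remaining args = agent names
run_team() {
  findings=$1; shift
  tmpdir=$(mktemp -d)
  for agent in "$@"; do
    # each agent writes to its own file, so parallel writers never interleave
    run_agent "$agent" "$tmpdir/$agent.txt" &
  done
  wait   # block until every background agent has exited
  cat "$tmpdir"/*.txt > "$findings"
  rm -rf "$tmpdir"
}
```

Merging only after wait returns avoids half-written findings; agents appending directly to one shared file could interleave their output.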

The Numbers: What Actually Happened

Prompt Fixes (14)

These were the highest-yield fixes. A single prompt edit — usually adding an explicit rule, example, or constraint — often moved a task from 3.0→4.2 quality in one cycle.

Examples:

blog-drafter-vault (Cycle 1): Task was writing thin briefs (211 tokens, 3 facts only). Prompt said "include key facts" but didn't specify how many. Fix: expanded template to require 5 facts, specific filenames (not category tags), and measurable metrics. Expected jump from 3.1→4.0.
morning-briefing (Cycle 2): num_predict: 300 was truncating output mid-section. The model was leaking conditional formatting instructions into the output ("(or 'quiet — no commits' if 0)"). Fix: raised num_predict to 600 and added /no_think flag to suppress Qwen3 thinking overhead. Score 3.0→3.8.
session-reflection (Cycle 1): The model was copying its example verbatim every single day. The prompt included a specific example with real commit names ("2026-03-10: Precision over scope"), and the small local model reproduced it exactly, giving three days of identical output. Fix: replaced the example with structure-only placeholders and added a CRITICAL: DO NOT COPY directive.
research-radar (Cycle 2): Model was outputting markdown code fences and literal [Tool] placeholders. Prompt showed a template-style example that the model interpreted as literal output. Fix: replaced example with concrete tool names and added DO NOT output code fences rule.

Script Fixes (6)

Some of the best finds were bugs buried in collector scripts that ran thousands of times.

knowledge-spacer.sh had a process fork storm:

# OLD: 4 sed invocations per note
for note in Knowledge/*.md; do
  last_reviewed=$(sed -n '/^last_reviewed:/p' "$note")
  sr_interval=$(sed -n '/^sr_interval:/p' "$note")
  sr_ease=$(sed -n '/^sr_ease:/p' "$note")
  review_count=$(sed -n '/^review_count:/p' "$note")
done
 
# 879 knowledge notes × 4 sed calls = 3,516 process forks per run

Fix: replaced 4 sed calls with a single awk pass:

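# NEW: a single awk pass extracts all 4 fields
awk '/^---$/{if(++fences==2) exit} fences==1{
  if(/^last_reviewed:/) last_reviewed=$0
  if(/^sr_interval:/) sr_interval=$0
  if(/^sr_ease:/) sr_ease=$0
  if(/^review_count:/) review_count=$0
}
# exit still runs END, so the captured lines are printed once, in a fixed order
END{print last_reviewed; print sr_interval; print sr_ease; print review_count}' "$note"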

Result: 75% fewer process forks (3,516 → 879 per run). The spacer used to run 10–20s slower on startup just forking sed. Now it's instant.
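
For the shell to use the extracted fields, awk has to hand them back on stdout. One way this could look, as a sketch rather than the script's actual code: read_sr_fields is a hypothetical helper, and unlike the sed version it returns only the values (not the whole "key: value" line), joined with "|" so that empty fields survive the split:

```shell
#!/bin/sh
# Print the four spaced-repetition fields of one note as
# "last_reviewed|sr_interval|sr_ease|review_count" using a single awk process.
read_sr_fields() {
  awk '
    /^---$/ { if (++fences == 2) exit; next }   # stop at the closing fence
    fences == 1 {
      key = $0; sub(/:.*/, "", key)             # text before the first colon
      val = $0; sub(/^[^:]*:[ \t]*/, "", val)   # text after it, trimmed
      if (key == "last_reviewed" || key == "sr_interval" ||
          key == "sr_ease" || key == "review_count") f[key] = val
    }
    END {
      printf "%s|%s|%s|%s\n", f["last_reviewed"], f["sr_interval"],
                              f["sr_ease"], f["review_count"]
    }
  ' "$1"
}

# Usage inside the loop: one fork per note instead of four.
# IFS='|' read -r last_reviewed sr_interval sr_ease review_count <<EOF
# $(read_sr_fields "$note")
# EOF
```

Stopping at the second --- fence means body text that happens to start with one of the field names is never picked up.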

knowledge-quiz.sh had a regex that let entire sections leak into the quiz:

# OLD: stopped at /^\## [^K]/ — missed sections starting with K
awk '/^\## [^K]/{exit} {print}'
 
# Broke on "## Key Sources", "## Key Facts" — these leaked into quiz

Fix: changed to explicit section boundary:

# NEW: extract only the "## Key Points" section
# (# needs no escaping in an awk regex; gawk warns about \#)
awk '/^## Key Points/{found=1; next} found && /^## /{exit} found{print}'
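
The difference is easiest to see by running both patterns against the same note. A small sketch with the two awk programs wrapped in functions (old_extract and new_extract are illustrative names, and the sample note in the usage comment is made up):

```shell
#!/bin/sh
# OLD behaviour: print until the first "## " heading that does not start with K.
old_extract() { awk '/^## [^K]/ { exit } { print }' "$1"; }

# NEW behaviour: print only the body of the "## Key Points" section.
new_extract() {
  awk '/^## Key Points/ { found = 1; next }   # start after the heading line
       found && /^## /  { exit }              # any later H2 ends the section
       found            { print }' "$1"
}

# Example note to try both on:
#   ## Key Points
#   - point one
#   ## Key Sources
#   - source one
#   ## Related
# old_extract keeps going through "## Key Sources"; new_extract stops before it.
```

An explicit start-and-stop pair is sturdier than a negated character class, because it no longer depends on what the other headings happen to be called.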

fashion-signals.sh was strangled by an outdated assumption:

# OLD: head -10 (commented: "fit Ollama 4096 token context")
head -10 signals_list.txt  # sends 10 items
 
# Model: "With 10 items, everything appears count=1. No trends visible."

The comment was from when Ollama's default context was 4096. Gemma3 has 131k context. With only 10 items, every signal appeared once, making aggregation useless.

Fix: head -10 → head -50. Now we get real cross-item patterns.
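
Why the window size matters is easy to check with a frequency count. A sketch, assuming signals_list.txt holds one signal per line (the real format may differ; signal_counts and top_count are illustrative names):

```shell
#!/bin/sh
# Count how often each signal appears in the first N lines of a list.
# $1 = window size, $2 = file with one signal per line (assumed format)
signal_counts() {
  head -n "$1" "$2" | sort | uniq -c | sort -rn
}

# Highest count in the window. A result of 1 means no signal repeats,
# so there is nothing to aggregate into a trend.
top_count() {
  signal_counts "$1" "$2" | awk 'NR == 1 { print $1 }'
}

# Usage: compare top_count 10 signals_list.txt with top_count 50 signals_list.txt
```

If top_count returns 1 for the window being sent to the model, the model is being asked to find trends in data that has none.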

output-judge.sh was hitting its context ceiling:

# OLD: send all tasks, 80 lines of output each
# 44 tasks × 80 lines/task = 3,520 lines, more than the 8192-token context can hold
# → model hits the limit, produces garbage, EOF error

Fix: head -80 → head -40 per task. 40 lines is enough to evaluate quality; 80 was causing truncation. Model now completes cleanly.
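
In numbers, the cap halves the judge's input from 44 × 80 = 3,520 lines to 44 × 40 = 1,760. A sketch of how such a capped input could be assembled, assuming one log file per task in a directory (the *.log layout and the build_judge_input name are assumptions, not the script's real structure):

```shell
#!/bin/sh
# Concatenate the first $2 lines of every log in directory $1,
# each preceded by a one-line header naming the task.
build_judge_input() {
  dir=$1 cap=$2
  for log in "$dir"/*.log; do
    [ -f "$log" ] || continue          # skip the literal pattern if no logs exist
    echo "=== $(basename "$log" .log) ==="
    head -n "$cap" "$log"
  done
}

# Usage: build_judge_input heartbeat/logs 40 | wc -l
```

Piping the result through wc -l before sending it gives a cheap early warning when new tasks push the total back toward the context limit.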

Model Swaps (2)

Sometimes the best fix is just to use a different model.

knowledge-seeder-gemini: was producing identical garbage output ("Positional Encoding" text continuation) for 8+ days. Root cause: vault-unified-4b-q8 (a unified task model) can't follow the === NOTE: === format — it kept summarizing context instead of generating notes.

Fix: Swap to gemma3:12b (better at format following). Next run: properly formatted notes.

blog-writer: was selecting qwen2.5-coder:14b 20% of the time via exploration. This code model tends to stop at around 410 tokens (its training data is code examples), which is far too short for prose. It caused two consecutive 241-word failures, both rejected by the publisher for falling under the 600-word minimum.

Fix: Remove qwen2.5-coder from the candidate pool. Now it only chooses between gemma3:12b and qwen3:8b, both of which handle 600+ word prose.

lesson-distiller: This one was subtle, and the fix turned out to be a prompt and pipeline change rather than a model swap. The prompt told the model to "use the Write tool to create Knowledge/practice--*.md", but the Write tool was removed from the allowed toolset on 2026-03-16. The model had no way to act, so it produced a narrative consulting report instead of the expected === PRACTICE NOTE: === blocks.

Fix (Cycle 2): Rewrote prompt to output structured text blocks instead of calling a tool.

Fix (Cycle 3): Realized there was no postprocessor (_postproc-lesson-distiller.sh) to parse the text blocks and write the files. Created one. The flywheel went from broken to functional.
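
A postprocessor of this kind can be a few lines of awk. The sketch below is not the actual _postproc-lesson-distiller.sh; it assumes each block starts with a header of the form "=== PRACTICE NOTE: some-slug ===" and runs until the next header, which is our guess at the format:

```shell
#!/bin/sh
# Assumed block format (not confirmed by the post):
#   === PRACTICE NOTE: some-slug ===
#   ...note body...
# Reads model output on stdin and writes $1/practice--<slug>.md per block.
split_practice_notes() {
  outdir=$1
  mkdir -p "$outdir"
  awk -v dir="$outdir" '
    /^=== PRACTICE NOTE: .* ===$/ {
      if (file != "") close(file)
      slug = $0
      sub(/^=== PRACTICE NOTE: */, "", slug)
      sub(/ *===$/, "", slug)
      gsub(/[^A-Za-z0-9_-]/, "-", slug)   # keep the filename safe
      file = dir "/practice--" slug ".md"
      next
    }
    file != "" { print > file }           # text before the first header is dropped
  '
}

# Usage: some_model_call | split_practice_notes Knowledge
```

Dropping everything before the first header is deliberate: small models often open with a sentence of preamble, and that should not end up in a note.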

Config Tweaks (3)

Sometimes the fix is just to tune a knob.

Task	Fix	Before	After
knowledge-spacer	raised num_predict 300→600	truncated output	complete output
morning-briefing	added /no_think flag	thinking overhead wasting tokens	no thinking tokens eating the output budget
fashion-signals	raised head -10→-50	all signals count=1	real patterns

Data Quality Fixes (3)

Three training data audits found issues in the pipelines feeding models:

deepener-4b: Only 5 original training examples. Mined 13 new examples from heartbeat logs (2026-03-07 to 2026-03-18). Total: 18 examples with 5–15 items each, which gives good coverage.
planner-4b: 18 examples with a 50/50 balance of HAS-GOALS and NO-CHANGE cases (9 each). Too many of the NO-CHANGE cases were thin "already up to date" examples, so 3 were removed. Final: 15 examples, a 60/40 split (9 HAS-GOALS, 6 NO-CHANGE).
spacer: 310 total examples across 3 files. Format validation: 100% pass rate. No data cleaning needed — the 33%→100% format improvement in the first experiment came from epoch tuning, not data quality.
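
A balance check like the planner-4b one is a one-liner once each example carries its label. A sketch, assuming a hypothetical layout with one example per line and the label as the first field (the real training files are not shown here):

```shell
#!/bin/sh
# Print the share of each label in a training set, sorted by label.
# Assumes one example per line with the label as the first field.
label_split() {
  awk '{ n[$1]++; total++ }
       END { for (l in n) printf "%s %d%%\n", l, n[l] * 100 / total }' "$1" | sort
}

# Usage: label_split planner_examples.txt   (hypothetical filename)
```

Running it before and after an audit makes the shift in balance explicit instead of leaving it to a manual count.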

GPU Experiments: 34 Training Runs

While the prompt/script fixes were running, the GPU was running a parallel sweep of hyperparameter experiments on four models:

Spacer (format score went 33%→100%):

EXP-001: epochs 5→10 ✓ metric 0.33→1.0 (KEEP)
EXP-002: epochs 5→15 ✗ metric 0.33→0.0 (overfit, discard)
EXP-003: lr 5e-5→1e-4 ✗ metric 0.33→0.0 (discard)
EXP-004: lora_rank 16→32 ✗ metric 0.33→0.0 (discard)

Simple lesson: epochs were the bottleneck. Doubled them, problem solved.
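
The keep/discard decision in that log is a plain comparison against the baseline metric. A minimal sketch (verdict is an illustrative name; the real runner's logic is not shown in this post):

```shell
#!/bin/sh
# Keep an experiment only if its metric beats the baseline.
# Shell arithmetic is integer-only, so the comparison is delegated to awk.
# $1 = baseline metric, $2 = new metric
verdict() {
  awk -v base="$1" -v new="$2" \
    'BEGIN { if (new + 0 > base + 0) print "KEEP"; else print "DISCARD" }'
}

# Usage with the EXP-001 and EXP-002 numbers:
#   verdict 0.33 1.0
#   verdict 0.33 0.0
```

A strict greater-than means a change that merely ties the baseline is discarded, which keeps the config from accumulating no-op changes.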

TIES-density (multi-task model merging, swept 0.1–0.7):

EXP-008 at density=0.2 ✓ metric 0.675
EXP-009 at density=0.4 ✓ metric 0.7
EXP-010 at density=0.5 ✓ metric 0.7078 (best)

Sweet spot at 0.5. Higher density (0.6+) caused task interference; lower (0.1-0.3) underfit.
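
Picking the winner of a sweep is a single pass over the results. A sketch, assuming the results are available as "density metric" pairs, one per line (best_density is an illustrative name):

```shell
#!/bin/sh
# Print the row with the highest metric from a sweep.
# Input on stdin: "density metric" per line.
best_density() {
  awk '{ if (NR == 1 || $2 + 0 > best) { best = $2 + 0; d = $1; m = $2 } }
       END { if (NR) print d, m }'
}

# Usage with the three runs listed above:
#   printf '%s\n' '0.2 0.675' '0.4 0.7' '0.5 0.7078' | best_density
```

With gaps this small between neighbouring densities (0.7 versus 0.7078), repeating the top candidates would be a reasonable check before fixing the value.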

Deepener-4b (post-fix):

EXP-20: epochs 5→15 ✓ metric 0.0→0.5 (KEEP)
EXP-20-002: lr 5e-5→1e-4 ✓ metric 0.0→0.5 (both work)

Both fixes work independently. Use both, but confirm with a combined run that the gains stack.

Seeder-4b-v4 (had config issues):

EXP-20-014: epochs 5→8 with v4 config ✓ metric 0.0→0.75 (KEEP)

Old config name was stale. Switching to v4 + epochs fixed it.

Code Scout Reports: 30 Issues Across Linesheet and FitnessRewards

The project scouts (separate agents reading codebases) found 30 actionable issues across Linesheet and FitnessRewards:

Linesheet:

2 query purity violations (new Date() fallbacks in read handlers — breaks Convex reactive caching)
4 dead public query functions (diagnostic image queries, never called, should be internalQuery)
1 TS2589 type error (excessive type instantiation on customers page)
3 useEffect antipatterns (form state sync: not bugs, just low-risk cleanup)
3 Promise.all fan-outs that could be parallelized better

FitnessRewards Analytics:

Missing by_tenant_account index usage (full-table scans that should use indexed queries)
Unbounded poGroupItems.isReceived query (comment says "180 days" but queries ALL received items)
Duplicate computation between getInventoryIntelligence and getHoldingCostInsights (both doing 90-day line item scans)
Auth inconsistency (some functions return [] on auth failure, others should use requireAuth)

Total actionable items: 30. Most are quick (1–5 minute) fixes that improve performance, reduce surface area, or tighten auth. None are critical bugs, but collectively they represent ~50 minutes of technical debt that autoresearch surfaced in minutes.

The Dependency Map: Why This Matters

The vault's 44 autonomous tasks form a complex data flow:

knowledge-seeder → Knowledge/*.md
  → knowledge-judge → goals.md
  → goal-planner → auto-implement
  → episodes.md → implement-review

blog-drafter-vault    ─┐
blog-drafter-projects ─┼→ heartbeat/logs/*
blog-drafter-research ─┘
  → blog-article-adapter → research-article-*
  → blog-writer (WRONG: receives THIN BRIEF from drafter)

The bug in blog-drafter → blog-writer was found by tracing this chain. When you zoom out and see the whole pipeline, cross-task bugs pop out:

Drafter was writing 211-token briefs (SIZE:small = 300–500, but defaulting to ~200)
Writer had a minimum 600-word requirement
Drafter fed an inadequate brief → writer crashed → blog-publisher rejected the post

The fix wasn't in one task. It was in seeing the interface mismatch between two tasks and correcting both.
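
One way to keep such a mismatch from going unnoticed is a guard at the hand-off between stages. A sketch only: check_min_words is a hypothetical helper, and word count stands in for token count, which is an approximation:

```shell
#!/bin/sh
# Guard between two pipeline stages: fail loudly if an artifact is too short.
# $1 = file to check, $2 = minimum number of words
check_min_words() {
  words=$(wc -w < "$1" | tr -d ' ')
  if [ "$words" -lt "$2" ]; then
    echo "too short: $words words, need at least $2" >&2
    return 1
  fi
}

# Usage: check_min_words brief.md 300 || exit 1   (brief.md is a placeholder name)
```

Failing at the drafter's output makes the error show up next to its cause, rather than two stages later at the publisher.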

This is why autoresearch on a vault is powerful. Neural nets have losses and gradients. Vaults have data flow. Autoresearch teaches the system to trace data flow and identify where the pipeline leaks.

Cost: Free GPU, 5% API Usage

GPU experiments: 34 runs, ~37–53 GPU minutes each, ran overnight on local RX 9070 XT = $0
Claude API: ~150 tool calls across all agents = ~2–5% of daily usage budget
Human time: ~2 hours for setup + monitoring (agents ran autonomously)

Total cost: essentially free.

What's Next

Weekly Opus quality gate (Sundays 04:30): Have Opus review the entire week's changes and recommend the top 3 improvements for the next cycle.
/loop command: Continuous improvement mode. Instead of one cycle, keep chaining improvements until quality plateaus (e.g., "all tasks ≥ 3.8 or no improvement for 3 consecutive cycles").
Qwen3 upgrade experiments: Retrain deepener, quizzer, and planner on Qwen3-4B base (now that we know 4B beats 8B under our VRAM constraints). Should improve format compliance and reasoning quality.
Cross-vault autoresearch: Apply the same system to Linesheet, FitnessRewards, and other active codebases. Each project.CLAUDE.md becomes a config file for autoresearch.

The vault is a machine learning problem. It has weights (prompts, scripts, configs), a loss function (quality scores), and a training signal (every cycle of output-judge). Autoresearch treats it that way: not as a set of files to debug, but as a model to optimize.

The payoff is compounding. Each fix reduces noise in the training signal. Cleaner signals let the next cycle make better decisions. After 30 improvements, the vault runs faster, produces better output, and requires less human oversight.

What if your entire infrastructure improved itself overnight?

It just did.