Two days ago I wrote about ClaudeSearch — a system that treats your vault infrastructure the same way autoresearch treats a neural network: run improvement cycles, measure quality, commit the best delta, repeat. The vault is the model. Quality scores are the loss function. Each fix is a training step.
That post was written after Day 1, when the architecture was fresh and the numbers were early. Now there's actual data, and the self-healing loop has been validated end-to-end. Here's what Day 2 looked like.
The Baseline and Where We Are Now
Before ClaudeSearch, I ran a quality audit of all 42 heartbeat tasks. Each task output was scored 1-5 by the output-judge agent (a local LLM that evaluates whether the output is complete, correct, and useful). The scores went into quality-baseline-2026-03-19.json.
Average score across 42 tasks: 3.76.
After two days of improvement cycles: 4.35.
That's a 0.59 point improvement — roughly 16% — across 42 tasks in 48 hours. Some tasks jumped dramatically (the blog drafting pipeline went from 3.1 to 4.4 after a prompt fix and a model swap). Some barely moved. A few revealed that the quality scoring itself was the problem, not the task.
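The arithmetic behind those figures can be checked in a few lines. This is only a sketch: the function name is made up for illustration, and the inputs are the averages and the blog-drafting scores quoted above.

```typescript
// Relative improvement between two scores on the 1-5 scale.
// relativeImprovement is a hypothetical helper, not part of ClaudeSearch.
function relativeImprovement(before: number, after: number): number {
  return (after - before) / before;
}

const baselineAvg = 3.76; // average across the audited tasks before ClaudeSearch
const currentAvg = 4.35; // average after two days of improvement cycles

const absoluteGain = currentAvg - baselineAvg;
const relativeGain = relativeImprovement(baselineAvg, currentAvg);

console.log(absoluteGain.toFixed(2)); // "0.59"
console.log((relativeGain * 100).toFixed(1)); // "15.7", i.e. roughly 16%

// The blog drafting pipeline, 3.1 -> 4.4, is a much larger relative jump.
const blogGain = relativeImprovement(3.1, 4.4);
console.log((blogGain * 100).toFixed(1)); // "41.9"
```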
But the number matters less than what generated it. The improvements came from a mix of:
- 14 prompt fixes (highest yield per fix)
- 6 script fixes (some of which had been broken for weeks)
- 5 config changes (model routing, num_predict, num_ctx)
- 4 cross-task interface fixes (data format mismatches between pipeline stages)
No single fix was dramatic. The cumulative effect was.
What the Self-Healing Loop Actually Did
The autonomous pipeline looks like this:
output-judge (23:15) → scores all task outputs
quality-regression (23:25) → auto-queues training experiments for degraded tasks
quality-triage (23:35) → detects persistent low scorers (3+ days below threshold)
auto-fix (23:40) → pattern-matches and applies mechanical fixes
goal-planner (23:45) → turns findings into actionable goals
auto-implement (00:00) → executes goals
This runs every night. The question from Day 1 was: does it actually improve things, or does it just churn?
Day 2 answer: it improves things, with caveats.
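One way to picture the nightly schedule is as an ordered list of stages with a sanity check that each stage starts after the previous one. The stage names, times, and roles below come from the listing above; the Stage type and both helper functions are assumptions made for this sketch, not the real heartbeat schedule format.

```typescript
interface Stage {
  id: string;
  time: string; // "HH:MM", local time
  role: string;
}

const nightlyStages: Stage[] = [
  { id: "output-judge", time: "23:15", role: "scores all task outputs" },
  { id: "quality-regression", time: "23:25", role: "auto-queues training experiments for degraded tasks" },
  { id: "quality-triage", time: "23:35", role: "detects persistent low scorers" },
  { id: "auto-fix", time: "23:40", role: "pattern-matches and applies mechanical fixes" },
  { id: "goal-planner", time: "23:45", role: "turns findings into actionable goals" },
  { id: "auto-implement", time: "00:00", role: "executes goals" },
];

// Minutes elapsed since the start hour, wrapping past midnight so that
// 00:00 sorts after 23:45 rather than before 23:15.
function minutesAfterStart(time: string, startHour: number = 23): number {
  const parts = time.split(":");
  const total = parseInt(parts[0], 10) * 60 + parseInt(parts[1], 10);
  const start = startHour * 60;
  return total >= start ? total - start : total + 24 * 60 - start;
}

// True when every stage starts strictly after the one before it.
function isInOrder(stages: Stage[]): boolean {
  for (let i = 1; i < stages.length; i++) {
    if (minutesAfterStart(stages[i].time) <= minutesAfterStart(stages[i - 1].time)) {
      return false;
    }
  }
  return true;
}

console.log(isInOrder(nightlyStages)); // true
```

The midnight wrap is the only subtle part: a naive string or hour comparison would put auto-implement first.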
What worked well: The auto-fix stage handles one class of bug extremely well: mechanical pattern mismatches. If the morning briefing is truncating at 300 tokens, raising num_predict is a deterministic fix. If a prompt is producing literal template placeholders, adding a "DO NOT output conditional text" directive is a deterministic fix. The auto-fix stage identified 11 of these patterns and applied the fix correctly in 9 of them.
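A rule table for this kind of mechanical fix might look roughly like the sketch below. Everything in it is an assumption for illustration: the TaskRun fields, the placeholder syntax the regex looks for, and the choice to double num_predict. It is not the actual auto-fix implementation.

```typescript
// Hypothetical record of one task run.
interface TaskRun {
  task: string;
  output: string;
  tokensUsed: number; // tokens the model actually generated
  numPredict: number; // configured generation limit
}

interface Fix {
  kind: "raise_num_predict" | "add_prompt_directive";
  detail: string;
}

function proposeFixes(run: TaskRun): Fix[] {
  const fixes: Fix[] = [];

  // Generation hit the configured limit, so the output was probably cut off.
  if (run.tokensUsed >= run.numPredict) {
    fixes.push({
      kind: "raise_num_predict",
      detail: "num_predict " + run.numPredict + " -> " + run.numPredict * 2, // doubling is an assumed policy
    });
  }

  // Literal template placeholders leaked into the output.
  // The syntaxes matched here ({{name}}, [PLACEHOLDER ...], [TODO ...]) are assumed.
  if (/\{\{[^}]+\}\}|\[(?:PLACEHOLDER|TODO)[^\]]*\]/.test(run.output)) {
    fixes.push({
      kind: "add_prompt_directive",
      detail: "DO NOT output conditional text",
    });
  }

  return fixes;
}
```

Each rule is a cheap, deterministic check paired with a known remedy, which is why this stage can run without a human in the loop. Anything that matches no rule falls through to the later stages.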
What didn't work: The goal-planner occasionally generated goals that were too vague to implement autonomously. "Improve knowledge-spacer quality" is not actionable. "Add --incremental flag to knowledge-spacer.sh to skip notes reviewed in the last 14 days" is. The solution was adding a specificity requirement to the goal-planner prompt — goals must include a file path, a specific change, and a verification command.
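The same requirement could also be enforced in code before a goal reaches auto-implement. The sketch below assumes a Goal shape with three optional fields. The field names and the bash -n verification command are illustrative choices, not the real goal format.

```typescript
// Hypothetical goal shape; only description is guaranteed to be present.
interface Goal {
  description: string;
  filePath?: string;
  change?: string;
  verifyCommand?: string;
}

function isBlank(value: string | undefined): boolean {
  return value === undefined || value.trim() === "";
}

// Names of the required fields that are missing or empty.
function missingFields(goal: Goal): string[] {
  const missing: string[] = [];
  if (isBlank(goal.filePath)) missing.push("filePath");
  if (isBlank(goal.change)) missing.push("change");
  if (isBlank(goal.verifyCommand)) missing.push("verifyCommand");
  return missing;
}

function isActionable(goal: Goal): boolean {
  return missingFields(goal).length === 0;
}

const vagueGoal: Goal = { description: "Improve knowledge-spacer quality" };

const specificGoal: Goal = {
  description: "Skip notes reviewed in the last 14 days",
  filePath: "knowledge-spacer.sh",
  change: "Add --incremental flag",
  verifyCommand: "bash -n knowledge-spacer.sh", // syntax check only; an assumed example
};

console.log(isActionable(vagueGoal)); // false
console.log(isActionable(specificGoal)); // true
```

A check like this does not judge whether the goal is a good idea. It only rejects goals that cannot be executed or verified mechanically.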
The surprise: One task — session-reflection — had been producing identical output every single day for three days. Same exact text, same structure, same fake commit names from the training example. The auto-fix stage caught it on Day 2 and fixed it (replaced the concrete example with structure-only placeholders, added a CRITICAL directive). Before ClaudeSearch, this would have kept running undetected.
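Detecting this kind of stuck output does not need a model at all. A minimal sketch, assuming the last few days of a task's output are available as strings (oldest first), is to compare normalized text across consecutive days. How the real auto-fix stage detected it is not specified here, so treat this as one possible approach.

```typescript
// Collapse whitespace so trivial formatting differences do not hide a repeat.
function normalizeOutput(text: string): string {
  return text.trim().replace(/\s+/g, " ");
}

// Length of the trailing run of identical outputs, most recent last.
function identicalRunLength(outputs: string[]): number {
  if (outputs.length === 0) return 0;
  const latest = normalizeOutput(outputs[outputs.length - 1]);
  let run = 1;
  for (let i = outputs.length - 2; i >= 0; i--) {
    if (normalizeOutput(outputs[i]) !== latest) break;
    run++;
  }
  return run;
}

// minDays = 3 mirrors the three-day repeat described above.
function isStuck(outputs: string[], minDays: number = 3): boolean {
  return identicalRunLength(outputs) >= minDays;
}
```

A reflection task that is actually reading the day's sessions should almost never produce byte-identical text twice, so even a short run is a strong signal.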
The 24-Issue FitnessRewards Audit
One of the ClaudeSearch execution modes is project scout: spawn 3 agents against a codebase in parallel, each scanning a different domain (data layer, UI patterns, auth/security), then consolidate findings.
The FitnessRewards scout found 24 issues. The most instructive ones:
The 365-query N+1: The monthly wrapped feature was fetching workout data day-by-day in a loop — 365 sequential Convex queries to build one user's annual stats. The fix is a single indexed query plus in-memory aggregation. This is the kind of bug that survives code review because the loop looks reasonable and the performance difference only shows up with real data at scale.
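The shape of the fix, sketched without any Convex-specific API: fetch the whole date range once, then aggregate in memory. The Workout shape and the RangeFetcher callback are stand-ins for whatever the real indexed query returns, and the fetch is synchronous here only to keep the example short.

```typescript
// Hypothetical workout record; date is "YYYY-MM-DD".
interface Workout {
  date: string;
  durationMinutes: number;
}

// Stand-in for a single indexed range query.
type RangeFetcher = (startDate: string, endDate: string) => Workout[];

// One pass over the fetched rows, grouping minutes by day.
function totalsByDay(workouts: Workout[]): Record<string, number> {
  const totals: Record<string, number> = {};
  for (const w of workouts) {
    totals[w.date] = (totals[w.date] || 0) + w.durationMinutes;
  }
  return totals;
}

function annualStats(
  fetchRange: RangeFetcher,
  year: number
): { activeDays: number; totalMinutes: number } {
  // One call covering the whole year instead of one call per day.
  const workouts = fetchRange(year + "-01-01", year + "-12-31");
  const perDay = totalsByDay(workouts);
  const days = Object.keys(perDay);
  let totalMinutes = 0;
  for (const day of days) {
    totalMinutes += perDay[day];
  }
  return { activeDays: days.length, totalMinutes: totalMinutes };
}
```

The aggregation cost moves from the database round trip to a loop over rows already in memory, which is usually the cheaper side of the trade for a year of one user's data.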
Unclamped deltas on workout edits: When a user edits a workout to a shorter duration, XP is recalculated. The delta (new XP - old XP) can be negative. Without Math.max(0, totalXP + delta), the user's total XP can go below zero, and Math.floor(Math.sqrt(negativeNumber / 100)) is NaN. NaN then propagates into the level display and achievement checks. The bug is invisible until a specific sequence of events.
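The failure and the clamp are easy to reproduce in isolation. The level formula and the Math.max clamp are the ones quoted above; the function names are made up for this sketch.

```typescript
// Level formula as described: floor(sqrt(totalXP / 100)).
function levelFromXp(totalXp: number): number {
  return Math.floor(Math.sqrt(totalXp / 100));
}

// Apply an XP delta without letting the total drop below zero.
function applyXpDelta(totalXp: number, delta: number): number {
  return Math.max(0, totalXp + delta);
}

const totalXp = 50;
const delta = -200; // workout edited to a much shorter duration

// Unclamped: the total goes negative and the level becomes NaN.
console.log(levelFromXp(totalXp + delta)); // NaN

// Clamped: the total bottoms out at 0 and the level is 0.
console.log(levelFromXp(applyXpDelta(totalXp, delta))); // 0
```

Math.sqrt of a negative number returns NaN rather than throwing, so nothing fails loudly at the point of the bug. The NaN only surfaces later, in whatever reads the level.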
Soft-delete filter gaps: The schema has an isDeleted field on workouts, but goal progress calculations were using a query that didn't filter isDeleted !== true. Deleted workouts were still counting toward weekly goals.
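A sketch of the missing filter, using a hypothetical workout shape and measuring goal progress in minutes purely for illustration:

```typescript
// Hypothetical record; isDeleted may be absent on rows that were never deleted.
interface GoalWorkout {
  id: string;
  durationMinutes: number;
  isDeleted?: boolean;
}

// Keep everything that is not explicitly marked deleted.
function activeWorkouts(workouts: GoalWorkout[]): GoalWorkout[] {
  return workouts.filter((w) => w.isDeleted !== true);
}

// Goal progress counts only active workouts.
function weeklyProgressMinutes(workouts: GoalWorkout[]): number {
  return activeWorkouts(workouts).reduce((sum, w) => sum + w.durationMinutes, 0);
}
```

Comparing against true, rather than checking for false, keeps rows where the field is missing, which matters when the field was added after some workouts already existed.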
None of these were caught in code review. The scout found all three in under 10 minutes.
What 175 Improvements in 48 Hours Actually Feels Like
The honest answer: it feels like aggressive debt repayment, not magic.
The vault had accumulated bugs, misconfigurations, and outdated patterns across 44 autonomous tasks, 13 fine-tuned models, and 6 active projects. Most of them were small. None were catastrophic. Together, they were holding the average quality score down at 3.76. ClaudeSearch found them systematically and fixed the mechanical ones autonomously.
The 175 improvements break down as:
- 102 vault improvements (prompts, scripts, configs, schedules)
- 53 knowledge note fixes (format, frontmatter, missing metadata)
- 29 project code fixes across Linesheet, FitnessRewards, and MagicLog
The distribution matters. If ClaudeSearch had only found project code bugs, it would be a good code review tool. If it had only found vault config issues, it would be a good devops tool. Finding things across all three layers — and correlating them (a knowledge note format bug was causing the knowledge-analyst to miss 85% of available notes, which degraded the morning briefing, which scored 2.8) — is what makes it different.
The Self-Improvement Catch
There's a meta-problem with any system that improves itself: you have to trust the quality measurement. If output-judge scores inaccurately, the entire feedback loop optimizes for the wrong thing.
We found two cases where the quality scores were wrong:
- output-judge was being too strict about format: A task that produced excellent content in a slightly different structure was scoring 2.5. The content was fine; the format didn't match a hardcoded example in the judge prompt. Fix: made the judge evaluate content quality and completeness separately from format.
- The judge couldn't handle empty outputs: If a task produced zero output (due to a script bug), the judge would score it 3.0 ("incomplete but partially useful") instead of 1.0 ("broken"). This masked several broken tasks. Fix: added an explicit check for empty/near-empty outputs at the judge prompt level.
Fixing the measurement fixed the signal. Several tasks that were scoring 3.0 because of empty output started scoring 1.0 and immediately got flagged for repair.
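The empty-output fix described above was made in the judge prompt. A complementary option, sketched here as an assumption rather than as what ClaudeSearch does, is a code-level guard that runs before the model is called at all. The 40-character threshold, the JudgeScores shape, and the llmJudge callback are all hypothetical.

```typescript
// Content and format are scored separately, mirroring the first fix above.
interface JudgeScores {
  content: number; // 1-5
  format: number; // 1-5
}

const MIN_OUTPUT_CHARS = 40; // assumed cutoff for "near-empty"

function isEffectivelyEmpty(output: string): boolean {
  return output.trim().length < MIN_OUTPUT_CHARS;
}

// llmJudge stands in for the real model call.
function scoreOutput(
  output: string,
  llmJudge: (text: string) => JudgeScores
): JudgeScores {
  // Broken tasks get the floor score without ever reaching the model.
  if (isEffectivelyEmpty(output)) {
    return { content: 1.0, format: 1.0 };
  }
  return llmJudge(output);
}
```

A deterministic guard removes one way for the judge to be wrong: the model cannot talk itself into "incomplete but partially useful" for an output it never sees.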
The Open Source Version
ClaudeSearch is open source at github.com/vachsark/claudesearch. The core ideas are general:
- Define quality metrics for your automated processes
- Score outputs systematically after each run
- Identify the lowest-scoring processes
- Apply mechanical fixes autonomously where patterns are clear
- Escalate novel problems for human attention
- Loop
The vault-specific pieces (Convex patterns, Ollama model routing, heartbeat schedule format) are implementation details. The architecture works for any system with: automated processes that produce outputs, a way to score those outputs, and enough pattern regularity that some fixes can be automated.
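Stripped of the vault specifics, one pass of that loop can be sketched in a few lines. The types, the threshold parameter, and the tryMechanicalFix callback are assumptions for illustration. In a real system the scores would come from a judge and the callback would hold the pattern rules.

```typescript
interface ProcessScore {
  processId: string;
  score: number;
}

// Returns true if a known fix pattern matched and was applied.
type FixAttempt = (processId: string) => boolean;

interface CycleResult {
  fixed: string[];
  escalated: string[];
}

function runCycle(
  scores: ProcessScore[],
  threshold: number,
  tryMechanicalFix: FixAttempt
): CycleResult {
  const result: CycleResult = { fixed: [], escalated: [] };

  // Lowest-scoring processes first; filter returns a copy, so the input is not reordered.
  const lowest = scores
    .filter((s) => s.score < threshold)
    .sort((a, b) => a.score - b.score);

  for (const s of lowest) {
    if (tryMechanicalFix(s.processId)) {
      result.fixed.push(s.processId);
    } else {
      result.escalated.push(s.processId); // novel problem: needs human attention
    }
  }
  return result;
}
```

Running this after every scoring pass gives the loop. The interesting engineering lives inside the scoring function and the fix rules, not in the loop itself.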
What's Next
The 3.76 → 4.35 improvement is real but incomplete. The ceiling isn't 5.0 — some tasks have inherent quality limits from the underlying model capabilities. The next focus is on tasks that are stuck in the 3.5-4.0 range despite prompt fixes: those usually need better training data or a model swap, not more prompt engineering.
The autonomous training pipeline (experiment-queue-seeder → nightly-experiment → training-signal-collector) is now integrated with the quality system. When a task stays below 3.8 for 3 consecutive days and prompt fixes have been exhausted, it auto-queues a fine-tuning experiment. The first results from that pipeline should be visible within a week.
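The trigger condition reduces to a small predicate. In this sketch the TaskHistory shape is hypothetical, scores are ordered oldest first, and the defaults are the 3.8 threshold and 3-day window stated above.

```typescript
interface TaskHistory {
  recentScores: number[]; // one score per day, oldest first
  promptFixesExhausted: boolean;
}

function shouldQueueFineTune(
  taskHistory: TaskHistory,
  threshold: number = 3.8,
  days: number = 3
): boolean {
  if (!taskHistory.promptFixesExhausted) return false;
  if (taskHistory.recentScores.length < days) return false;

  // Every one of the most recent `days` scores must be below the threshold.
  const lastDays = taskHistory.recentScores.slice(-days);
  return lastDays.every((score) => score < threshold);
}
```

Requiring consecutive days keeps a single bad night from queuing an expensive experiment, and the promptFixesExhausted gate keeps fine-tuning as the last resort rather than the first.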
The self-healing loop is working. The question is how far it can go without human intervention. Day 3 will tell.