Vache prompts. Claude codes.How it works

Overnight Results: From 3.76 to 4.35

·7 min read·by Vache Sarkissian
Updated June 3, 2026
·
Reviewed March 29, 2026
vaultautoresearchautomationlocal-modelsknowledge-engine
📚Top of Funnel

Written by Claude (Opus 4.6) Vache prompted, reviewed, and published. The data and benchmarks are real; the prose is AI-generated.

I went to sleep. The system did not.

By 07:00, the heartbeat had run 17 tasks, the research pipeline had spent $5.04 on a 40-minute session, auto-implement had patched a real production bug, and the vault's quality score had climbed from 3.76 to 4.35 — a +0.59 jump overnight. The day before yesterday, the baseline was 3.76 across 42 measured tasks. This morning: 4.35.

Here's what actually happened.

The Numbers

449 knowledge notes modified in 24 hours. That's not a typo — it includes the knowledge-moc-updater touching 246 MOC files, the research session producing 11 new notes, the knowledge-seeder creating 3 more, and reweave adding 2 semantic wikilinks. The vault grew by roughly half a percent of its total size in one night.

Quality by task this morning:

TaskScoreStatus
auto-implement5/5Production fix deployed
claude-update5/52.1.80 confirmed
heartbeat-maintenance5/55 fixes applied
knowledge-moc-updater5/5246 notes updated
knowledge-seeder5/53 notes + 12 wikilinks
research-session-runner5/511 notes, $5.04
topic-planner5/536 topics queued
vault-health5/5Clean
vault-reindex5/515 stale entries pruned
morning-briefing4/519 commits, clean
daily-context4/54,744 bytes written
deps-monitor4/514 projects scanned
reweave4/52 new links
credentials-check3/5Sentry token still missing
test-runner2/56 project failures (ongoing)

Overall: 4.35/5. The baseline was 3.76 the morning before autoresearch ran. The delta is the autoresearch work landing in production — prompt fixes, model routing changes, script rewrites — expressing themselves in actual task output quality.

What auto-implement Actually Shipped

The auto-implement task is the sharpest signal in the system. It doesn't generate text — it deploys code. The goal-planner reads quality scores, identifies fixable bottlenecks, and queues implementation goals. auto-implement picks up the highest-priority goal and executes it with full tool access.

Last night's goal: fix benchmark-report.sh, which had been failing for 9+ consecutive days. The root cause: vault-search had no timeout. When Ollama was busy with a GPU-locked training run, the embeddings call would hang for 60+ seconds, and the benchmark script would silently fail.

The fix that shipped:

run_vault_search() {
  timeout 35s python3 _scripts/vault-search.py "$1" 2>/dev/null
  if [ $? -ne 0 ]; then
    timeout 25s python3 _scripts/vault-search.py "$1" --no-graph 2>/dev/null
    if [ $? -ne 0 ]; then
      # Fallback: read cached timing from last successful run
      read_search_timing_cache
    fi
  fi
}

35-second first attempt, 25-second retry without graph context, then cached fallback. The cache file search-timing-cache.txt is written on every successful run. This makes benchmark-report robust to GPU contention — which is always present, because the same machine is running training experiments overnight.

The auto-implement session report: "Root cause: vault-search had no timeout, so GPU-busy Ollama embeddings hung 60s+. Fix: wrapped python3 call in run_vault_search() with timeout 35s (first attempt) and 25s (retry). On both failures, reads SEARCH_TIMING_MS/SEARCH_PCT from search-timing-cache.txt. On success, writes cache for future fallback. Live test: vault-search returned 50%@9898ms within the 35s window. Acceptance test passes."

benchmark-report has been stuck at 2/5 for weeks. That changes tonight.

The $5.04 Research Session

The research-session-runner runs three times a week. Last night's session: 15 topics across three parallel teams (business/commerce, AI/systems, language + design), 40 minutes, $5.04 in API cost.

The session produced 11 notes. Two made it through skeptic review as synthesis notes — the highest standard:

synthesis--information-foraging-buyer-behavior — Pirolli's marginal value theorem (originally from foraging ecology) formally describes why B2B buyers abandon catalogs. The optimal foraging model predicts patch-leaving time from information density and switching cost. This generates testable linesheet UX predictions: if buyers behave like foragers, reducing switching friction should increase catalog dwell time. Score: 7/10.

synthesis--grpo-sell-through-benchmarking — GRPO (the reinforcement learning algorithm) and fashion cohort benchmarking independently converged on the same solution to distribution-shift brittleness: z-score normalization relative to a reference group. GRPO does it to stabilize reward variance across training batches; fashion analytics does it to make sell-through rates comparable across seasons. Same math, different domains. Score: 7/10.

The skeptic rejected 7 other synthesis candidates as "over-rewarding well-bounded trivial insights." Only 2 of 11 cleared the bar. That's intentional — the system runs strict so the notes that make it through mean something.

Also: the design cluster was born. Four notes created from zero: typography systems, digital color spaces, information architecture, design systems. The vault had no design coverage before last night.

The Knowledge Flywheel

This is the part that took months to close.

The loop looks like this: research notes feed into the knowledge analyst, which identifies gaps and generates goals, which feed into goal-planner, which queues implementation tasks, which auto-implement executes. The output — deployed code, fixed bugs, new features — then shows up in quality scores, which the analyst reads to close the cycle.

Three examples of the flywheel producing real shipped work:

Grace day streak (FitnessRewards): Research notes on habit formation and streak mechanics surfaced in knowledge-analyst review. Goal: implement grace days so a missed workout doesn't break a 30-day streak. Shipped: the grace day system is now in production in fitness-app. Users can configure a 1-2 day buffer before their streak resets.

CVaR for portfolio risk (Finance project): A research note on conditional value at risk — the expected loss in the worst 5% of scenarios, more conservative than VaR — was flagged as applicable to the portfolio analytics module. Goal: replace VaR with CVaR as the primary risk metric. Shipped.

Submittal workflow (Linesheet): Research into B2B procurement workflows surfaced the insight that buyers need a structured submittal step — a formal "I'm requesting samples/orders for these items" action distinct from casual browsing. This became a feature goal, which became the submittal workflow in Linesheet.

None of these started as feature ideas. They started as research notes, traveled through the analysis pipeline, and emerged as shipped code.

The Blog Pipeline: Not Running Yet

The blog pipeline — blog-drafter-vault, blog-writer, blog-publisher — is scheduled for 21:45–22:30 nightly. It has never run. The schedule.md entries show Last run: never for all three tasks.

The pipeline is configured and waiting. The issue is timing: these tasks run after the nightly training window, and the system hasn't hit that window cleanly yet. The previous run history shows the word-count gate as the recurring failure mode — blog-writer was producing 241-word drafts against a 300-word minimum, so blog-publisher was rejecting them. That was fixed in autoresearch (num_predict raised to 6144, model routing adjusted). The pipeline should clear on the next run.

Until then: this post was written manually. The signal is real even if the delivery pipe isn't flowing yet.

What 4.35 Means

The 3.76 baseline was measured across 42 tasks before autoresearch ran. The fixes landed — prompt edits, model routing changes, script rewrites, the vault-search timeout patch — and the quality scores followed.

The improvement isn't evenly distributed. test-runner is still failing (2/5) across 6 projects, a known issue with TypeScript errors that auto-implement hasn't reached yet. credentials-check is 3/5 because the Sentry token is missing. These are the floor.

The ceiling is tasks like knowledge-seeder (5/5), which created 3 precise notes with 12 wikilinks in 163 seconds, or research-session-runner (5/5), which ran a 40-minute research cycle, had a critic review all outputs, ran a skeptic pass, and produced 11 notes with only 2 fabricated citations removed.

The average is moving because the floor is rising. That's what autoresearch is designed to do: find the 2/5 tasks, fix the root cause, and measure whether the fix actually landed. Slowly, the distribution shifts right.

3.76 → 4.35 overnight. Tonight it runs again.

Further Reading

About the Author

Vache Sarkissian

Building research infrastructure and products at the intersection of knowledge systems and machine learning. Creator of Linesheet Pro, vault-search, and the vachsark learning engine.

View Full Bio →
© 2026 Vache Sarkissian·Built with Claude Code
vachsark.com