Overnight Results: From 3.76 to 4.35

Q: The Blog Pipeline: Not Running Yet

The blog pipeline is scheduled for 21:45–22:30 nightly. It has never run. The schedule.md entries show Last run: never for all three tasks. The pipeline is configured and waiting.

I went to sleep. The system did not.

By 07:00, the heartbeat had run 17 tasks, the research pipeline had spent $5.04 on a 40-minute session, auto-implement had patched a real production bug, and the vault's quality score had climbed from 3.76 to 4.35 — a +0.59 jump overnight. The day before yesterday, the baseline was 3.76 across 42 measured tasks. This morning: 4.35.

Here's what actually happened.

The Numbers

449 knowledge notes modified in 24 hours. That's not a typo — it includes the knowledge-moc-updater touching 246 MOC files, the research session producing 11 new notes, the knowledge-seeder creating 3 more, and reweave adding 2 semantic wikilinks. The vault grew by roughly half a percent of its total size in one night.

Quality by task this morning:

Task	Score	Status
auto-implement	5/5	Production fix deployed
claude-update	5/5	2.1.80 confirmed
heartbeat-maintenance	5/5	5 fixes applied
knowledge-moc-updater	5/5	246 notes updated
knowledge-seeder	5/5	3 notes + 12 wikilinks
research-session-runner	5/5	11 notes, $5.04
topic-planner	5/5	36 topics queued
vault-health	5/5	Clean
vault-reindex	5/5	15 stale entries pruned
morning-briefing	4/5	19 commits, clean
daily-context	4/5	4,744 bytes written
deps-monitor	4/5	14 projects scanned
reweave	4/5	2 new links
credentials-check	3/5	Sentry token still missing
test-runner	2/5	6 project failures (ongoing)

Overall: 4.35/5. The baseline was 3.76 the morning before autoresearch ran. The delta is the autoresearch work landing in production — prompt fixes, model routing changes, script rewrites — expressing themselves in actual task output quality.

What auto-implement Actually Shipped

The auto-implement task is the sharpest signal in the system. It doesn't generate text — it deploys code. The goal-planner reads quality scores, identifies fixable bottlenecks, and queues implementation goals. auto-implement picks up the highest-priority goal and executes it with full tool access.

Last night's goal: fix benchmark-report.sh, which had been failing for 9+ consecutive days. The root cause: vault-search had no timeout. When Ollama was busy with a GPU-locked training run, the embeddings call would hang for 60+ seconds, and the benchmark script would silently fail.

The fix that shipped:

run_vault_search() {
  timeout 35s python3 _scripts/vault-search.py "$1" 2>/dev/null
  if [ $? -ne 0 ]; then
    timeout 25s python3 _scripts/vault-search.py "$1" --no-graph 2>/dev/null
    if [ $? -ne 0 ]; then
      # Fallback: read cached timing from last successful run
      read_search_timing_cache
    fi
  fi
}

35-second first attempt, 25-second retry without graph context, then cached fallback. The cache file search-timing-cache.txt is written on every successful run. This makes benchmark-report robust to GPU contention — which is always present, because the same machine is running training experiments overnight.

The auto-implement session report: "Root cause: vault-search had no timeout, so GPU-busy Ollama embeddings hung 60s+. Fix: wrapped python3 call in run_vault_search() with timeout 35s (first attempt) and 25s (retry). On both failures, reads SEARCH_TIMING_MS/SEARCH_PCT from search-timing-cache.txt. On success, writes cache for future fallback. Live test: vault-search returned 50%@9898ms within the 35s window. Acceptance test passes."

benchmark-report has been stuck at 2/5 for weeks. That changes tonight.

The $5.04 Research Session

The research-session-runner runs three times a week. Last night's session: 15 topics across three parallel teams (business/commerce, AI/systems, language + design), 40 minutes, $5.04 in API cost.

The session produced 11 notes. Two made it through skeptic review as synthesis notes — the highest standard:

synthesis--information-foraging-buyer-behavior — Pirolli's marginal value theorem (originally from foraging ecology) formally describes why B2B buyers abandon catalogs. The optimal foraging model predicts patch-leaving time from information density and switching cost. This generates testable linesheet UX predictions: if buyers behave like foragers, reducing switching friction should increase catalog dwell time. Score: 7/10.

synthesis--grpo-sell-through-benchmarking — GRPO (the reinforcement learning algorithm) and fashion cohort benchmarking independently converged on the same solution to distribution-shift brittleness: z-score normalization relative to a reference group. GRPO does it to stabilize reward variance across training batches; fashion analytics does it to make sell-through rates comparable across seasons. Same math, different domains. Score: 7/10.

The skeptic rejected 7 other synthesis candidates as "over-rewarding well-bounded trivial insights." Only 2 of 11 cleared the bar. That's intentional — the system runs strict so the notes that make it through mean something.

Also: the design cluster was born. Four notes created from zero: typography systems, digital color spaces, information architecture, design systems. The vault had no design coverage before last night.

The Knowledge Flywheel

This is the part that took months to close.

The loop looks like this: research notes feed into the knowledge analyst, which identifies gaps and generates goals, which feed into goal-planner, which queues implementation tasks, which auto-implement executes. The output — deployed code, fixed bugs, new features — then shows up in quality scores, which the analyst reads to close the cycle.

Three examples of the flywheel producing real shipped work:

Grace day streak (FitnessRewards): Research notes on habit formation and streak mechanics surfaced in knowledge-analyst review. Goal: implement grace days so a missed workout doesn't break a 30-day streak. Shipped: the grace day system is now in production in fitness-app. Users can configure a 1-2 day buffer before their streak resets.

CVaR for portfolio risk (Finance project): A research note on conditional value at risk — the expected loss in the worst 5% of scenarios, more conservative than VaR — was flagged as applicable to the portfolio analytics module. Goal: replace VaR with CVaR as the primary risk metric. Shipped.

Submittal workflow (Linesheet): Research into B2B procurement workflows surfaced the insight that buyers need a structured submittal step — a formal "I'm requesting samples/orders for these items" action distinct from casual browsing. This became a feature goal, which became the submittal workflow in Linesheet.

None of these started as feature ideas. They started as research notes, traveled through the analysis pipeline, and emerged as shipped code.

The Blog Pipeline: Not Running Yet

The blog pipeline — blog-drafter-vault, blog-writer, blog-publisher — is scheduled for 21:45–22:30 nightly. It has never run. The schedule.md entries show Last run: never for all three tasks.

The pipeline is configured and waiting. The issue is timing: these tasks run after the nightly training window, and the system hasn't hit that window cleanly yet. The previous run history shows the word-count gate as the recurring failure mode — blog-writer was producing 241-word drafts against a 300-word minimum, so blog-publisher was rejecting them. That was fixed in autoresearch (num_predict raised to 6144, model routing adjusted). The pipeline should clear on the next run.

Until then: this post was written manually. The signal is real even if the delivery pipe isn't flowing yet.

What 4.35 Means

The 3.76 baseline was measured across 42 tasks before autoresearch ran. The fixes landed — prompt edits, model routing changes, script rewrites, the vault-search timeout patch — and the quality scores followed.

The improvement isn't evenly distributed. test-runner is still failing (2/5) across 6 projects, a known issue with TypeScript errors that auto-implement hasn't reached yet. credentials-check is 3/5 because the Sentry token is missing. These are the floor.

The ceiling is tasks like knowledge-seeder (5/5), which created 3 precise notes with 12 wikilinks in 163 seconds, or research-session-runner (5/5), which ran a 40-minute research cycle, had a critic review all outputs, ran a skeptic pass, and produced 11 notes with only 2 fabricated citations removed.

The average is moving because the floor is rising. That's what autoresearch is designed to do: find the 2/5 tasks, fix the root cause, and measure whether the fix actually landed. Slowly, the distribution shifts right.

3.76 → 4.35 overnight. Tonight it runs again.

Overnight Results: From 3.76 to 4.35

The Numbers

What auto-implement Actually Shipped

The $5.04 Research Session

The Knowledge Flywheel

The Blog Pipeline: Not Running Yet

What 4.35 Means

Further Reading

Related Articles

Vault Autoresearch: A Personal AI Learns From Itself

The Knowledge Engine: 10,800+ Notes Built by AI Agents

A Neural Router for My Knowledge Vault

About the Author

Vache Sarkissian