A research pipeline orchestrating Claude agents can generate 80+ high-quality knowledge notes in a single session — each 2,000-4,000 words, with cross-references, prerequisite chains, and connection maps — without manual writing. This scales a personal knowledge base from hundreds to thousands of research-backed notes while keeping all infrastructure on local hardware at zero API cost.
The pipeline produces 10,800+ Zettelkasten-formatted notes (16,000,000+ words) across 28+ disciplines: computer science, economics, mathematics, neuroscience, psychology, physics, biology, philosophy, and many more. Notes are searchable in under 600ms, version-controlled as plain Markdown, and interconnected with explicit prerequisites and conceptual links. Autonomous heartbeat tasks running on local models maintain, deepen, and expand the vault 24/7 without human intervention.
This post covers the architecture: multi-agent research orchestration, quality gates, synthesis generation, and how autonomous agents handle maintenance.
Why Build This
I wanted a searchable knowledge base that grows faster than I can read. Not a bookmark folder. Not a collection of saved articles. A structured corpus where every concept links to its prerequisites, where I can search "what connects attention mechanisms to working memory" and get a real answer in half a second.
The standard approach — read a paper, take notes, file them — doesn't scale. I can read maybe 2-3 papers a day if I'm focused. The vault needs breadth across seven disciplines. At the manual rate, building out comprehensive coverage of even one discipline would take years.
The alternative: treat note generation as a production pipeline. Define the output format. Parallelize the research. Add quality gates. Automate the maintenance. Keep the human (me) in the loop for direction-setting and spot-checking, but let the machines handle volume.
The Note Format
Every note in the vault looks the same. This consistency is load-bearing — it's what makes the downstream automation possible.
---
title: "cs--transformer-architecture"
discipline: "computer-science"
depth: "practitioner"
status: "confirmed"
tags: ["deep-learning", "nlp", "attention"]
prerequisites: ["cs--attention-mechanisms", "cs--matrix-multiplication"]
connections: ["cs--bert", "cs--gpt", "neuro--selective-attention"]
created: 2026-02-15
last_review: 2026-03-10
author:
  name: Vache Sarkissian
  sameAs:
    - https://github.com/vachsark
    - https://www.linkedin.com/in/vsarkissian
next_review: 2026-03-17
---
One-paragraph summary of the concept.
## Key Points
1. First key point with enough detail to be useful
2. Second key point...
...
## Connections
- **cs--attention-mechanisms**: How this concept builds on attention
- **neuro--selective-attention**: The biological parallel
Topic prefixes (cs--, econ--, math--, psych--, neuro--, bio--, phys--) make discipline obvious at a glance and prevent naming collisions. A note called attention is ambiguous. cs--attention-mechanisms and neuro--selective-attention are not.
Depth levels define how deep the note goes: executive (overview), practitioner (hands-on), specialist (edge cases and failure modes), researcher (current open problems). The seeder creates executive-level notes. The deepener expands them. This matters because a 500-word overview and a 4,000-word deep dive serve different purposes, and the pipeline needs to know which one it's producing.
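Because the format is this regular, a format check can be mechanical. The following is a minimal sketch, not the vault's actual code: it assumes the frontmatter has already been parsed into a dict, and `validate_note`, `PREFIXES`, and `NOTE_ID` are hypothetical names. The prefix and depth lists are the ones given in this post.

```python
import re

# Prefixes and depth levels as listed in this post; the validator itself is hypothetical.
PREFIXES = {"cs", "econ", "math", "psych", "neuro", "bio", "phys"}
DEPTHS = ("executive", "practitioner", "specialist", "researcher")
NOTE_ID = re.compile(r"^([a-z]+)--[a-z0-9]+(?:-[a-z0-9]+)*$")


def validate_note(meta: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means the note passes."""
    problems = []
    # The title, prerequisites, and connections must all be prefixed note ids.
    ids = [meta.get("title", "")] + meta.get("prerequisites", []) + meta.get("connections", [])
    for note_id in ids:
        match = NOTE_ID.match(note_id)
        if not match or match.group(1) not in PREFIXES:
            problems.append(f"bad note id: {note_id!r}")
    if meta.get("depth") not in DEPTHS:
        problems.append(f"unknown depth: {meta.get('depth')!r}")
    return problems


meta = {
    "title": "cs--transformer-architecture",
    "depth": "practitioner",
    "prerequisites": ["cs--attention-mechanisms", "cs--matrix-multiplication"],
    "connections": ["cs--bert", "cs--gpt", "neuro--selective-attention"],
}
print(validate_note(meta))  # []
```

An unprefixed id such as attention fails the check, which is the naming collision the prefixes are meant to prevent.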
The Research Session
A single research session is where the bulk of new notes get created. Here's what happens when I kick one off.
Topic Selection
Before anything runs, the pipeline needs to know what to research. A topic planner maintains a prioritized queue based on coverage gaps. It knows which disciplines are thin, which notes are stuck at executive depth, and which prerequisite chains are incomplete. For a session targeting 80 topics, it might allocate 15 to computer science, 12 to economics, 10 to neuroscience, and so on — weighted by where the gaps are largest.
I can override this. If I want a deep dive into mechanism design or category theory, I tell it. But most sessions run on auto-selected topics, and the planner's choices are usually good because it has full visibility into what already exists.
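One way to weight a session by coverage gaps is proportional allocation with largest-remainder rounding. This is a sketch under assumptions: the gap scores below are made-up numbers, `allocate_topics` is a hypothetical name, and the real planner may well use a different rule.

```python
def allocate_topics(gaps: dict[str, int], total: int) -> dict[str, int]:
    """Split `total` topic slots across disciplines in proportion to gap size."""
    gap_sum = sum(gaps.values())
    if gap_sum == 0:
        return {name: 0 for name in gaps}
    exact = {name: total * gap / gap_sum for name, gap in gaps.items()}
    alloc = {name: int(share) for name, share in exact.items()}
    # Hand out the slots lost to rounding down, largest fractional remainder first.
    leftover = total - sum(alloc.values())
    by_remainder = sorted(gaps, key=lambda name: exact[name] - alloc[name], reverse=True)
    for name in by_remainder[:leftover]:
        alloc[name] += 1
    return alloc


# Hypothetical gap scores (for example, a count of missing or shallow notes per discipline).
gaps = {"computer-science": 30, "economics": 25, "neuroscience": 20, "physics": 5}
print(allocate_topics(gaps, 20))
# {'computer-science': 8, 'economics': 6, 'neuroscience': 5, 'physics': 1}
```

The rounding step matters because plain truncation would leave the session short of its target topic count.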
Parallel Research Agents
The core of the pipeline is Claude Code's subagent spawning. Instead of researching topics sequentially — which would take hours — the orchestrator fans out multiple research agents in parallel. Each agent gets a batch of topics, access to web search, and the note template.
Each agent:
- Searches the web for the topic, prioritizing academic sources, textbooks, and established reference material
- Cross-references against existing vault notes (via the semantic search engine) to avoid duplicating coverage
- Generates a structured note following the template
- Tags prerequisites and connections to other notes
Parallelism cuts the wall-clock time dramatically. Eighty topics that would take 4-5 hours sequentially finish in under an hour with parallel agents.
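The fan-out pattern itself is simple. The sketch below shows the shape of it with a thread pool from the Python standard library; it does not show Claude Code's subagent mechanism. `research_batch` is a placeholder for one agent, and the batch size and agent count are assumed values.

```python
from concurrent.futures import ThreadPoolExecutor


def make_batches(topics: list[str], batch_size: int) -> list[list[str]]:
    """Chunk the topic queue into fixed-size batches, one per agent."""
    return [topics[i:i + batch_size] for i in range(0, len(topics), batch_size)]


def research_batch(batch: list[str]) -> list[dict]:
    """Placeholder for one research agent; a real agent would search and draft notes."""
    return [{"title": topic, "status": "draft"} for topic in batch]


def run_session(topics: list[str], batch_size: int = 10, max_agents: int = 8) -> list[dict]:
    """Fan batches out to parallel workers and flatten their results."""
    batches = make_batches(topics, batch_size)
    with ThreadPoolExecutor(max_workers=max_agents) as pool:
        # map() returns results in batch order even when batches finish out of order.
        results = pool.map(research_batch, batches)
        return [note for batch_notes in results for note in batch_notes]


topics = [f"cs--topic-{i}" for i in range(80)]
notes = run_session(topics)
print(len(notes))  # 80
```

Wall-clock time is then bounded by the slowest batch rather than the sum of all batches, which is where the speedup comes from.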
Quality Gates
Raw output from research agents isn't trusted. Two specialized agents review everything before it enters the vault.
The verifier agent checks factual claims. It cross-references key assertions against sources, flags unsupported claims, and catches the most common failure mode: confidently stating something that the cited source doesn't actually say. About 15-20% of first-draft notes get flagged by the verifier, usually for overclaiming or citing a secondary source when a primary one exists.
The skeptic agent is deliberately adversarial. It challenges causal claims ("you say X causes Y — is that established or is this one correlational study?"), identifies missing counterarguments, and flags overconfidence. The skeptic doesn't care about being helpful. Its job is to find weaknesses.
Notes that fail verification get revised. Notes that the skeptic challenges get annotated with uncertainty markers rather than rewritten — I'd rather have a note that says "evidence is limited" than one that sounds certain about something uncertain.
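The routing rule in that last paragraph can be written down directly. This is a sketch of the decision only: the `Review` structure, the `apply_gates` name, and the "> Uncertain:" marker format are assumptions, and the agents that produce the flags are not shown.

```python
from dataclasses import dataclass, field


@dataclass
class Review:
    # Unsupported or misattributed claims found by the verifier.
    verifier_flags: list[str] = field(default_factory=list)
    # Challenged causal claims or missing counterarguments raised by the skeptic.
    skeptic_flags: list[str] = field(default_factory=list)


def apply_gates(body: str, review: Review) -> tuple[str, str]:
    """Return (action, body): verifier flags force a revision, skeptic flags only annotate."""
    if review.verifier_flags:
        return "revise", body
    if review.skeptic_flags:
        markers = "\n".join(f"> Uncertain: {flag}" for flag in review.skeptic_flags)
        return "accept", f"{body}\n\n{markers}"
    return "accept", body


action, text = apply_gates("X causes Y.", Review(skeptic_flags=["based on one correlational study"]))
print(action)  # accept
print(text)
```

Checking the verifier first means a note with unsupported claims is never annotated and accepted by accident.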
Synthesis: The Most Valuable Output
The quality gates are important, but the synthesis step is where the real value lives.
After generating individual topic notes, a synthesis agent scans the full batch for cross-domain connections. It's looking for structural parallels, shared mathematical foundations, or conceptual bridges between fields that don't normally talk to each other.
Some examples from actual synthesis notes in the vault:
Arrow's Impossibility Theorem and FLP Impossibility. One is from economics (1951), the other from distributed systems (1985). Both prove that you can't aggregate local information into global agreement without sacrificing one of three reasonable-sounding axioms. The leader in Paxos is the mathematical analogue of Arrow's dictator. This isn't a loose metaphor — the structural mapping is tight enough that knowing one theorem deepens your understanding of the other.
Bank runs and gossip protocols. Diamond-Dybvig's model of bank runs and epidemic-style gossip protocols in distributed systems share the same information cascade structure. In both cases, local rational behavior (withdrawing your deposit / propagating a message) creates global emergent behavior that's either stabilizing or catastrophic depending on initial conditions. The vaccination threshold in epidemiology maps to the reserve ratio in banking.
Spaced repetition and TCP congestion control. Both are feedback-driven systems that probe capacity limits. Spaced repetition increases review intervals until recall fails, then backs off. TCP increases window size until packet loss occurs, then backs off. Both converge on a workable rate through the same probe-and-back-off loop, although the growth rules differ: TCP congestion avoidance uses additive increase with multiplicative decrease, while most spaced repetition schedules grow intervals multiplicatively. The Leitner box system is, structurally, slow-start for memory.
These connections aren't things I would have found by reading papers in one discipline. They emerge because the vault has breadth across fields and a synthesis step that explicitly looks for structural parallels. This is the output that justifies the entire pipeline — not the individual notes (which are useful but could be replaced by a good textbook), but the connections between them.
The Heartbeat: Autonomous Maintenance
Research sessions produce the initial notes. But a knowledge base that only grows during active sessions will rot. Notes go stale, prerequisite chains stay incomplete, coverage gaps persist. The heartbeat system handles ongoing maintenance.
The heartbeat is a scheduler that runs automated tasks on a loop — some on Claude's API, most on local models via Ollama running on my GPU. The tasks relevant to the knowledge vault:
Knowledge Seeder (Daily, Twice)
Picks topics from the planner queue and generates new notes using local models. Each run produces 3-8 notes. The seeder runs at 14:00 and 01:00 — the night run generates longer-form content while the GPU is otherwise idle. Combined, these two runs add 40-50 new notes per week without any API cost.
Knowledge Analyst (Daily)
This is the only task that runs on Claude's most capable model. The analyst reviews recent notes, identifies quality issues, spots coverage gaps, and writes concrete improvement goals. A typical goal: "Add practitioner-depth notes on cs--attention-mechanisms covering multi-head attention, cross-attention, and flash attention. Prerequisites: cs--transformer-architecture (exists), cs--matrix-multiplication (exists)."
The analyst is expensive, but it's the task that keeps the vault coherent. Without it, the seeder would produce notes in whatever distribution it felt like, without awareness of what the vault actually needs.
Knowledge Deepener (Three Times Daily)
Takes confirmed executive-level notes and expands them to practitioner depth — adding implementation details, concrete examples, and practical applications. This runs on a fine-tuned local model (more on this below), so it gets three daily slots at zero marginal cost.
The executive-to-practitioner transition is the most impactful depth upgrade. It's the difference between "I've heard of gradient descent" and "I can implement gradient descent and explain why the learning rate schedule matters."
Goal Planner and Auto-Implement (Daily)
The analyst writes goals. The goal planner deduplicates, prioritizes, and schedules them. The auto-implement task picks the top priority goal and executes it — modifying pipeline scripts, updating templates, fixing broken note links. Safety gates prevent it from touching anything outside the vault automation code.
This is a closed loop: the analyst identifies a problem, the planner prioritizes it, auto-implement fixes it, a test runner verifies nothing broke. I find out about it from a Telegram digest the next morning. Last week the analyst noticed the deepener was producing inconsistent tag formats. The fix was implemented, tested, and deployed without me touching anything.
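The planner's deduplicate-and-prioritize step might look something like the sketch below. It assumes each goal is a dict with a text description and a numeric priority, which is a guess at the shape rather than the real schema, and it treats two goals as duplicates only when their descriptions match after normalizing case and whitespace.

```python
def plan_goals(goals: list[dict]) -> list[dict]:
    """Deduplicate goals by normalized description, keeping the highest-priority copy."""
    best: dict[str, dict] = {}
    for goal in goals:
        key = " ".join(goal["description"].lower().split())
        if key not in best or goal["priority"] > best[key]["priority"]:
            best[key] = goal
    # Highest priority first; ties broken by description so the order is deterministic.
    return sorted(best.values(), key=lambda g: (-g["priority"], g["description"]))


# Hypothetical goals, loosely modeled on the tag-format example above.
goals = [
    {"description": "Fix broken note links", "priority": 1},
    {"description": "Normalize tag formats in deepener output", "priority": 3},
    {"description": "normalize  tag formats in deepener output", "priority": 2},
]
queue = plan_goals(goals)
print(queue[0]["description"])  # Normalize tag formats in deepener output
```

Auto-implement would then take `queue[0]`. A real planner would probably need fuzzier matching than exact normalized text, since an analyst rarely phrases the same goal identically twice.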
The Training Data Pipeline
Five hundred and forty-five notes, each with structured key points and clear formatting, make excellent training data.
I extracted 5,521 question-answer pairs from confirmed notes. The extraction is mechanical: for each key point in a note, generate a question that tests recall of that point. The question includes the topic and discipline as context. The answer is the key point itself plus surrounding context from the note.
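A stripped-down version of that extraction, assuming the note body follows the template shown earlier. The question wording here is a fixed template standing in for whatever generates the real questions, the answer omits the surrounding context for brevity, and `extract_qa` is a hypothetical name.

```python
import re


def extract_qa(title: str, discipline: str, body: str) -> list[dict]:
    """Turn each numbered key point under '## Key Points' into one Q&A pair."""
    # Capture everything from the heading to the next '## ' heading or end of text.
    section = re.search(r"^## Key Points\n(.*?)(?=^## |\Z)", body, flags=re.M | re.S)
    if not section:
        return []
    points = re.findall(r"^\d+\.\s+(.+)$", section.group(1), flags=re.M)
    topic = title.split("--", 1)[-1].replace("-", " ")
    return [
        {
            "question": f"In {discipline}, what is key point {i} about {topic}?",
            "answer": point.strip(),
        }
        for i, point in enumerate(points, start=1)
    ]


body = (
    "One-paragraph summary of the concept.\n"
    "## Key Points\n"
    "1. First key point with enough detail to be useful\n"
    "2. Second key point\n"
    "## Connections\n"
    "- **cs--attention-mechanisms**: How this concept builds on attention\n"
)
pairs = extract_qa("cs--transformer-architecture", "computer-science", body)
print(len(pairs))  # 2
```

This only works because every note shares the same headings and numbered-list layout, which is the consistency the note format section argues for.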
These Q&A pairs trained vault-deepener-q8, a fine-tuned Qwen2.5-1.5B model that now runs three of the vault's daily heartbeat tasks. The general-purpose baseline (qwen3:8b, roughly 5x larger) had a 10% format compliance rate on the deepener task. It would reason about the task, explain its approach, then produce output in whatever structure it felt like. The fine-tuned 1.5B model hits 100% format compliance. Smaller model, better results, because it learned exactly one job.
The training loop is: vault notes produce training data, training data produces a better local model, the better model produces better notes, better notes produce better training data. Each cycle tightens the format compliance and domain accuracy of the local model. I've run two training iterations so far.
Training takes under 5 minutes on the local GPU. LoRA fine-tuning on a 1.5B parameter model fits in 6GB of VRAM. The whole pipeline — from note extraction to Q&A generation to training to Ollama deployment — runs with zero API cost.
Vault Search
654 notes are useless without retrieval. vault-search is a hybrid search engine I built that combines three strategies:
BM25 for exact term matching. Search "backpropagation," get notes containing that word, ranked by BM25's TF-IDF-style score (term frequency with saturation, inverse document frequency, and document-length normalization).
Semantic embeddings for conceptual matching. Search "how neural networks learn," get notes about gradient descent and loss functions even if they never use that phrase. Embeddings are generated locally and stored in a NumPy index.
Reciprocal Rank Fusion (RRF) combines both result lists without needing to calibrate BM25 scores against cosine similarity scores. A document ranked well in both lists outranks one that's top in one and absent from the other.
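RRF is short enough to show in full. In this sketch each document scores the sum of 1 / (k + rank) over the lists it appears in. The constant k = 60 is the value commonly used in the RRF literature; whether vault-search uses it is an assumption, as are the function name and the example rankings.

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists: each document scores sum(1 / (k + rank)) over lists containing it."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the output is deterministic.
    return sorted(scores, key=lambda doc: (-scores[doc], doc))


# Hypothetical result lists for one query.
bm25 = ["cs--backpropagation", "cs--gradient-descent", "math--chain-rule"]
semantic = ["cs--gradient-descent", "cs--loss-functions", "cs--backpropagation"]
print(rrf_fuse([bm25, semantic]))
# ['cs--gradient-descent', 'cs--backpropagation', 'cs--loss-functions', 'math--chain-rule']
```

Only ranks enter the formula, never raw scores, which is why no calibration between BM25 and cosine similarity is needed.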
Optional enhancements: HyDE query expansion (generate a hypothetical answer, then search for documents similar to it — adds ~120ms but dramatically improves recall for conceptual queries), LLM re-ranking (post-retrieval relevance scoring with position-aware blending), and typed sub-queries (lex:"term" for BM25, vec:"concept" for semantic, hyde:"question" for expansion).
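The typed sub-query syntax can be parsed with one regular expression. This is a guess at the parsing step only, not vault-search's actual parser: `parse_query` is a hypothetical name, and routing untyped leftover text to a "plain" bucket is an assumption.

```python
import re

# Matches lex:"...", vec:"...", or hyde:"..." and captures the kind and the quoted text.
TYPED = re.compile(r'\b(lex|vec|hyde):"([^"]*)"')


def parse_query(query: str) -> dict[str, list[str]]:
    """Split a query into typed sub-queries; leftover untyped text goes under 'plain'."""
    parsed: dict[str, list[str]] = {"lex": [], "vec": [], "hyde": [], "plain": []}
    for kind, text in TYPED.findall(query):
        parsed[kind].append(text)
    # Whatever remains after removing typed parts is treated as an ordinary query.
    rest = " ".join(TYPED.sub(" ", query).split())
    if rest:
        parsed["plain"].append(rest)
    return parsed


print(parse_query('lex:"backpropagation" vec:"how neural networks learn"'))
# {'lex': ['backpropagation'], 'vec': ['how neural networks learn'], 'hyde': [], 'plain': []}
```

Each bucket can then be sent to its own retriever (BM25 for lex, embeddings for vec, expansion for hyde) before the result lists are fused.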
The full search pipeline runs against 1,800+ vault files in ~560ms with no API calls. It's the backbone of every research session, every pre-flight check, and every agent that needs vault context.
What Works
Breadth coverage. The vault has meaningful notes across seven disciplines. No human could build this breadth at this speed while maintaining structural consistency. The pipeline's strength is volume with format adherence.
Cross-domain synthesis. The connections between Arrow and FLP, between bank runs and gossip protocols, between spaced repetition and TCP — these are genuinely useful insights that I wouldn't have found by reading within a single discipline. The synthesis step is the pipeline's most valuable output.
Structured consistency. Every note has the same format. Every note has prerequisites and connections. Every note has a depth level and review date. This consistency enables downstream automation — the deepener, the quiz generator, the spaced repetition scheduler — that would be impossible with freeform notes.
Autonomous maintenance. The heartbeat loop means the vault improves while I sleep. Coverage gaps get identified and filled. Notes get deepened. Training data gets generated. The knowledge base is a system, not a project.
What Doesn't Work
Cutting-edge topics. LLMs have training cutoffs. A note about a paper published last month will either be shallow (if the model knows about it from web search) or wrong (if it confabulates details). The pipeline is strong on established knowledge and weak on the frontier. I manually write or heavily edit researcher-depth notes on recent work.
Mathematical proofs. LLMs hallucinate proof steps. They'll state a theorem correctly, motivate it well, and then produce a "proof" that skips critical steps or introduces subtle errors. The vault's math notes are useful as conceptual overviews, but I don't trust them for rigorous proofs. Every math note with proof content gets manual verification.
Empirical claims. "Studies show that X" is the most dangerous sentence in a generated note. The verifier catches the most egregious cases — citing nonexistent papers, misrepresenting findings — but subtle mischaracterizations slip through. I treat empirical claims in the vault as "probably directionally correct, verify before citing."
Depth on niche topics. The pipeline produces excellent practitioner-level notes on well-documented topics (transformer architecture, mechanism design, cognitive biases). It struggles on niche topics with thin web coverage. Notes on obscure subfields of topology or specific neurotransmitter pathways tend to be shallow even after deepening, because the models don't have enough training data to go deep.
Connection quality varies. The synthesis step finds genuinely brilliant connections about 30% of the time. The other 70% ranges from obvious ("gradient descent and Newton's method are both optimization techniques") to forced ("here's a strained analogy between protein folding and compiler optimization"). I keep the good ones and delete the rest. The hit rate is worth it because the good connections are things I'd never find on my own.
The Numbers
As of today:
- 654 atomic knowledge notes
- 775,000+ words total
- 7 disciplines covered (CS, economics, math, neuroscience, psychology, physics, biology)
- 80+ topics researched in a single focused session
- 5,521 Q&A pairs extracted for fine-tuning
- 2,000-4,000 words per note at practitioner depth
- ~560ms full semantic search across the vault
- $0 ongoing cost for daily maintenance (local GPU)
- 20+ automated heartbeat tasks running the pipeline
What This Is and Isn't
This is not a replacement for reading papers. It's not a replacement for thinking. It's infrastructure — a structured, searchable, continuously growing knowledge base that gives me a higher starting point for every research question I ask and every system I design.
The vault doesn't make me an expert in neuroscience. It makes me someone who can find the right neuroscience concept when I need a cross-domain analogy for a distributed systems problem. It makes me someone who can search "information cascades" and get results from economics, epidemiology, and computer science in the same query. It makes the connections between fields visible and navigable.
The pipeline is imperfect. The notes need spot-checking. The synthesis is hit-or-miss. The math can't be trusted without verification. But 654 structured, interconnected, searchable notes across seven disciplines — maintained autonomously, growing daily, with a trained local model that costs nothing to run — is more useful than any note-taking system I've used before.
The code is a collection of Claude agent definitions, shell scripts, and a Python search engine, all orchestrated by a cron-like heartbeat scheduler running on an Arch Linux machine with an AMD GPU. Nothing exotic. The hard part wasn't the tooling — it was defining the output format, building the quality gates, and being honest about what the pipeline can and can't do.