I have 10,800+ atomic knowledge notes in my vault. They span computer science, economics, neuroscience, psychology, mathematics, physics, biology, and 20+ other disciplines. I didn't write most of them. A pipeline of AI agents — some running on Anthropic's API, some on a fine-tuned 1.5B model on my local GPU — seeds, reviews, deepens, and maintains the entire corpus autonomously.
This isn't a novelty project. It's the knowledge infrastructure that feeds my learning page, informs my research sessions, and provides context for every coding task I do. The notes are Zettelkasten-style: atomic (one concept per file), interconnected (explicit prerequisite and connection links), and structured (frontmatter with depth level, status, discipline tags, and review dates). They live as plain Markdown in an Obsidian vault, version-controlled and searchable.
The Pipeline
Six automated tasks form the core production loop. They run on overlapping schedules, each handling one stage of the knowledge lifecycle.
Seeder — Sonnet, twice daily (14:00 and 01:00). The seeder consults a topic planner that maintains a prioritized queue of concepts to cover. It pulls from web research, arXiv papers (via a custom arxiv-fetch.sh script that queries the API and extracts abstracts), and cross-references against existing notes to avoid duplication. Each run produces 3-8 new notes. The burst run at 01:00 generates longer-form content while the GPU is otherwise idle.
Every note follows a strict template: YAML frontmatter (title, discipline, depth level, status, tags, prerequisites, connections), a summary paragraph, numbered key points, and a connections section explaining how the concept relates to other notes in the vault. The seeder doesn't freestyle — it fills a template, which makes downstream processing predictable.
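A fixed template is what lets a cheap check sit in front of every downstream task. Here is a minimal sketch of such a check, assuming the frontmatter keys are spelled exactly as in the list above (the real key names in the vault may differ) and using a naive line-based parse instead of a full YAML library:

```python
# Hypothetical key names; the vault's actual schema may spell these differently.
REQUIRED_FIELDS = ("title", "discipline", "depth", "status", "tags",
                   "prerequisites", "connections")


def parse_frontmatter(note_text: str) -> dict[str, str]:
    """Naively parse `key: value` lines between the leading '---' fences."""
    lines = note_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    fields: dict[str, str] = {}
    for line in lines[1:]:
        if line.strip() == "---":  # closing fence ends the frontmatter
            break
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def missing_fields(note_text: str) -> list[str]:
    """Return the required frontmatter fields that the note does not define."""
    fields = parse_frontmatter(note_text)
    return [name for name in REQUIRED_FIELDS if name not in fields]


# Illustrative note; the content is made up for this example.
example_note = """---
title: Backpropagation
discipline: cs
depth: executive
status: seedling
tags: [optimization, machine-learning]
prerequisites: [cs--gradient-descent]
---
Summary paragraph goes here.
"""

print(missing_fields(example_note))  # ['connections']
```

A real pipeline would likely use a proper YAML parser so that nested lists and multi-line values are handled correctly.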
Judge — Sonnet, daily. New notes get a quality review within 24 hours. The judge checks for factual accuracy, completeness relative to the depth level, proper prerequisite linking, and whether the key points actually support the summary. Notes that pass move from seedling to confirmed status. Notes that fail get flagged with specific issues — "missing prerequisite link to cs--gradient-descent" or "key point 3 contradicts established consensus" — and re-enter the seeder queue for revision.
The judge rejects about 15% of notes on the first pass. Most failures are incomplete prerequisite chains or overly broad claims at the executive depth level that should be scoped more tightly.
Deepener — fine-tuned Qwen2.5-1.5B, three times daily. This is the workhorse. The deepener takes confirmed executive-level notes and expands them to practitioner depth, adding implementation details, concrete examples, and practical applications. It runs on the local GPU at zero API cost, which is why it gets three daily slots — the marginal cost of running it more is just electricity.
I fine-tuned this model specifically for the deepening task. The untuned baseline, qwen3:8b, had a 10% format compliance rate on my note template: it would hallucinate sections, skip frontmatter fields, or produce free-form essays instead of structured notes. The fine-tuned 1.5B model hits 100% format compliance. Smaller model, better results, because it learned exactly one job.
Spacer — fine-tuned Qwen2.5-1.5B, daily. Schedules spaced repetition reviews based on note difficulty, depth level, and last review date. It assigns review intervals following a modified Leitner system: new notes review in 1 day, then 3, 7, 21 days as they progress through review stages. The spacer reads note metadata and writes next_review dates into frontmatter — no separate database, no app, just Markdown fields that the learning page reads at build time.
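The interval ladder can be sketched in a few lines. This assumes classic Leitner behavior, where a passed review moves the note up one stage and a failed review sends it back to the start; the "modified" rules the spacer actually applies (difficulty and depth adjustments) are not modeled here, and the function name is hypothetical:

```python
from datetime import date, timedelta

# Review intervals in days for stages 0..3, as listed above.
INTERVALS_DAYS = (1, 3, 7, 21)


def next_review(reviewed_on: date, stage: int, passed: bool) -> tuple[int, date]:
    """Return (new_stage, next_review_date) after one review.

    Assumption: passing advances one stage (capped at the last interval),
    failing resets to stage 0, as in a plain Leitner box.
    """
    if passed:
        new_stage = min(stage + 1, len(INTERVALS_DAYS) - 1)
    else:
        new_stage = 0
    return new_stage, reviewed_on + timedelta(days=INTERVALS_DAYS[new_stage])


stage, due = next_review(date(2025, 1, 1), stage=0, passed=True)
print(stage, due.isoformat())  # 1 2025-01-04
```

The returned date is what would be written back into the note's next_review frontmatter field.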
MOC Updater — daily, no model needed. Regenerates Maps of Content files that serve as discipline-level indexes. Each MOC groups notes by sub-topic, shows depth distribution, and highlights notes missing connections. It's a pure script — reads frontmatter, computes relationships, writes Markdown. Takes about 2 seconds.
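To make "reads frontmatter, computes relationships, writes Markdown" concrete, here is a rough sketch of a MOC builder. It starts from already-parsed metadata, and the "subtopic" field and the output layout are assumptions for illustration, not the actual script:

```python
from collections import Counter, defaultdict

# Stand-in for parsed frontmatter; "subtopic" is a hypothetical field name.
notes = [
    {"id": "cs--gradient-descent", "subtopic": "optimization",
     "depth": "practitioner", "connections": ["cs--matrix-multiplication"]},
    {"id": "cs--transformer-architecture", "subtopic": "deep-learning",
     "depth": "executive", "connections": ["cs--attention-mechanisms"]},
    {"id": "cs--attention-mechanisms", "subtopic": "deep-learning",
     "depth": "executive", "connections": []},
]


def build_moc(discipline: str, notes: list[dict]) -> str:
    """Render a Map of Content as Markdown from note metadata."""
    by_subtopic: dict[str, list[str]] = defaultdict(list)
    for note in notes:
        by_subtopic[note["subtopic"]].append(note["id"])
    depth_counts = Counter(note["depth"] for note in notes)
    unconnected = [note["id"] for note in notes if not note["connections"]]

    lines = [f"# {discipline} MOC", ""]
    for subtopic in sorted(by_subtopic):
        lines.append(f"## {subtopic}")
        lines.extend(f"- [[{note_id}]]" for note_id in sorted(by_subtopic[subtopic]))
        lines.append("")
    lines.append("## Depth distribution")
    lines.extend(f"- {depth}: {count}" for depth, count in sorted(depth_counts.items()))
    lines.append("")
    lines.append("## Missing connections")
    lines.extend(f"- [[{note_id}]]" for note_id in unconnected)
    return "\n".join(lines)


moc = build_moc("cs", notes)
print(moc)
```

Because nothing here calls a model, a run over the whole vault is bounded by file I/O, which fits the couple-of-seconds runtime.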
Quiz Generator — fine-tuned Qwen2.5-1.5B, daily. Creates retrieval practice questions from confirmed notes. Each question tests recall of key points, not recognition — you see the topic and discipline, then try to reconstruct the key ideas before revealing the answer. The questions feed into the Leitner-box quiz system on my learning page.
The Autonomous Loop
Content generation is only half the system. The other half is self-improvement — a closed loop where the system identifies its own weaknesses and fixes them.
Knowledge Analyst runs weekly on Opus. It's the most expensive task in the entire heartbeat system, and the only one that justifies Opus-tier reasoning. The analyst reviews the full note corpus, identifies quality gaps (disciplines with thin coverage, notes stuck at executive depth for too long, broken prerequisite chains, topics that should exist but don't), and writes concrete improvement goals to heartbeat/goals.md.
These aren't vague suggestions. They're structured goals with scope tags, priority levels, and success criteria: "Add 3 practitioner-depth notes on cs--attention-mechanisms covering multi-head attention, cross-attention, and flash attention. Prerequisites: cs--transformer-architecture (exists), cs--matrix-multiplication (exists)."
Goal Planner runs daily on Sonnet. It deduplicates, prioritizes, and schedules goals. If the analyst wrote two goals that overlap, the planner merges them. If a goal depends on prerequisites that don't exist yet, the planner reorders to create the prerequisites first.
Auto-Implement runs daily on Sonnet. It picks the top-priority scope:vault goal and executes it — writing code changes to heartbeat scripts, updating templates, fixing broken note links, or adjusting pipeline parameters. Safety gates prevent it from touching production project code, modifying CLAUDE.md, pushing to git, installing packages, or deleting files. It gets three attempts per goal before escalating.
Test Runner runs daily after auto-implement. It catches regressions — did the code change break any existing heartbeat tasks? Did the template changes produce invalid frontmatter? If tests fail, the change gets rolled back automatically.
Implement Review runs daily on qwen2.5-coder:14b. It reviews the auto-implemented changes for code quality, convention compliance, and potential issues the test runner wouldn't catch. Think of it as an async code review by a local model.
The result is a system that improves its own tooling without human intervention. Last week, the analyst identified that the deepener was producing notes with inconsistent tag formats. The goal planner prioritized it. Auto-implement fixed the template validation in the deepener script. The test runner confirmed nothing broke. I found out about it from a Telegram digest the next morning.
Search
10,800+ notes are useless if you can't find what you need in under a second. vault-search is a hybrid retrieval system I built and open-sourced that combines three strategies:
BM25 (lexical) for exact term matching. If you search "backpropagation," you get notes with that exact word, ranked by BM25's term-frequency and inverse-document-frequency weighting.
Semantic embeddings for conceptual matching. If you search "how neural networks learn," you get notes about gradient descent, loss functions, and optimization — even if they never use the phrase "how neural networks learn." Embeddings are generated locally by qwen3-embedding:0.6b and stored in a numpy index.
RRF fusion combines the two result lists using Reciprocal Rank Fusion, which handles the score normalization problem elegantly — no need to calibrate BM25 scores against cosine similarity scores. A document ranked #2 in both lists outranks a document ranked #1 in one and #50 in the other.
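Reciprocal Rank Fusion is small enough to show in full. The sketch below uses k = 60, the constant commonly used with RRF; whether vault-search uses the same value is an assumption. It reproduces the ranking claim above with two 50-item lists:

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked lists: each document scores sum(1 / (k + rank)) over the lists."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# "a" is ranked #2 in both lists; "b" is #1 in BM25 and #50 in the vector list.
bm25_ranking = ["b", "a"] + [f"x{i}" for i in range(48)]
vector_ranking = ["y0", "a"] + [f"y{i}" for i in range(1, 48)] + ["b"]

fused = rrf_fuse([bm25_ranking, vector_ranking])
print(fused[0][0], fused[1][0])  # a b
```

Only ranks enter the formula, so the raw BM25 and cosine scores never need to be put on a common scale.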
On top of this, three optional features:
- HyDE query expansion: For vague conceptual queries, the system generates a hypothetical answer using a local LLM, then searches for documents similar to that answer. Adds ~120ms but dramatically improves recall for questions like "what connects attention and working memory?"
- LLM re-ranking: Post-retrieval re-ranking using qwen3:8b to score relevance. Position-aware blending gives 60% weight to retrieval scores for top-5 results and 60% weight to reranker scores for the rest.
- Typed sub-queries: lex:"backpropagation" forces BM25, vec:"learning algorithms" forces semantic, hyde:"why do transformers scale?" forces HyDE expansion. Mix them for precision.
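The position-aware blending in the re-ranking step could look roughly like this. The sketch assumes both scores are already normalized to the range 0 to 1 and that the remaining 40% of the weight goes to the other score; the function and parameter names are hypothetical:

```python
def blend_scores(
    candidates: list[tuple[str, float, float]],
    top_n: int = 5,
    favored_weight: float = 0.6,
) -> list[tuple[str, float]]:
    """Blend retrieval and reranker scores by retrieval position.

    `candidates` holds (doc_id, retrieval_score, rerank_score) tuples in
    retrieval order. The first `top_n` favor the retrieval score; the rest
    favor the reranker score. Both scores are assumed to lie in [0, 1].
    """
    blended = []
    for position, (doc_id, retrieval, rerank) in enumerate(candidates):
        w_retrieval = favored_weight if position < top_n else 1.0 - favored_weight
        score = w_retrieval * retrieval + (1.0 - w_retrieval) * rerank
        blended.append((doc_id, score))
    return sorted(blended, key=lambda item: item[1], reverse=True)


# Made-up scores: "f" sits at position 6 but the reranker rates it highly.
candidates = [
    ("a", 0.9, 0.2), ("b", 0.8, 0.9), ("c", 0.7, 0.1),
    ("d", 0.6, 0.5), ("e", 0.5, 0.4), ("f", 0.4, 1.0),
]
ranked = blend_scores(candidates)
print([doc_id for doc_id, _ in ranked[:2]])  # ['b', 'f']
```

Trusting retrieval more near the top keeps a noisy reranker from demoting strong matches, while still letting it promote results from further down the list.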
The whole system searches 1,800 vault files in ~560ms with no API calls. It's the backbone of every research session and every agent that needs vault context.
Depth Levels
Notes aren't one-size-fits-all. The depth system defines four levels that map to increasing expertise:
Executive — the overview. What is this concept, why does it matter, where does it fit in the discipline. 3-5 key points. No implementation details. This is what the seeder produces initially.
Practitioner — hands-on knowledge. How to apply the concept, what the tradeoffs are, concrete examples. 5-8 key points with code snippets or formulas where relevant. The deepener automates the executive-to-practitioner transition.
Specialist — deep understanding. Edge cases, failure modes, connections to adjacent fields, historical context. 8-12 key points. These require longer research cycles and typically come from the weekly deep-dive task or manual research sessions.
Researcher — cutting-edge. Current open problems, recent papers, active debates, speculative extensions. These notes cite specific papers and are updated when new research lands. The arXiv integration feeds this level.
The depth distribution right now: roughly 40% executive, 35% practitioner, 20% specialist, 5% researcher. The deepener is steadily converting executive notes to practitioner, which is the most impactful transition — it's the difference between "I've heard of this" and "I can work with this."
What's Next
Three things I'm building toward.
Dynamic prerequisite chains. The prerequisite links exist but they're sparse — maybe 30% of notes have them. I want every note to have a complete prerequisite chain so the system can generate optimal learning paths. "You want to understand transformer architecture? Start here, then here, then here." The analyst is already writing goals to fill these gaps.
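Once the chains are complete, a learning path is a topological ordering of a note's transitive prerequisites. Here is a minimal sketch using Python's standard graphlib module (Python 3.9+). The prerequisite map is illustrative, in particular the edge from cs--transformer-architecture to cs--matrix-multiplication and the econ--game-theory ID are assumptions, and learning_path is a hypothetical name:

```python
from graphlib import TopologicalSorter

# Illustrative prerequisite map: note id -> ids that must be learned first.
prerequisites = {
    "cs--attention-mechanisms": ["cs--transformer-architecture",
                                 "cs--matrix-multiplication"],
    "cs--transformer-architecture": ["cs--matrix-multiplication"],
    "econ--game-theory": [],  # unrelated note, should not appear in the path
}


def learning_path(target: str, prerequisites: dict[str, list[str]]) -> list[str]:
    """Return the target and its transitive prerequisites, prerequisites first."""
    needed: set[str] = set()
    stack = [target]
    while stack:  # walk the prerequisite links back from the target
        note = stack.pop()
        if note in needed:
            continue
        needed.add(note)
        stack.extend(prerequisites.get(note, ()))
    graph = {note: set(prerequisites.get(note, ())) for note in needed}
    # static_order() yields each node after all of its predecessors and
    # raises graphlib.CycleError if the links form a cycle.
    return list(TopologicalSorter(graph).static_order())


path = learning_path("cs--attention-mechanisms", prerequisites)
print(" -> ".join(path))
```

A cycle in the prerequisite links surfaces as an exception here, which is itself a useful signal for a broken chain.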
Cross-discipline synthesis. The most interesting insights sit at discipline boundaries — how neuroscience research on attention maps to transformer architecture, how economic game theory applies to multi-agent systems. A planned vault-synthesis task will systematically identify and generate these bridge notes.
Research pipeline integration. The research pipeline (multi-wave research sessions with verification and skeptic agents) currently produces reports that sit in scratch files. Connecting it to the knowledge engine means every research session automatically produces atomic notes that enter the seeder-judge-deepener cycle. Research becomes knowledge becomes searchable context becomes better research. The flywheel.
The knowledge engine is infrastructure. It's not the thing I ship — it's the thing that makes everything I ship better. Every coding session has richer context. Every research question starts from a higher baseline. Every decision draws on a structured, searchable, continuously improving knowledge base that grows while I sleep.