Automatic Knowledge Graph Linking
Manual linking breaks knowledge vaults at scale. Obsidian vaults grow to 500+ notes, but graph connections stagnate as humans fail to link new notes to existing ones. The result: disconnected islands of related knowledge that could reinforce each other.
Reweave solves this with automatic semantic linking. It scans recently modified notes, finds related notes using vector embeddings, and auto-appends wikilinks above a relevance threshold (0.68 cosine similarity).
Result: A knowledge vault that auto-discovers and strengthens connections. The system implements "spreading activation" — finding nodes that should be linked but aren't — turning a flat file collection into a self-reinforcing knowledge graph.
The Algorithm
The core loop is straightforward:
1. Find recently modified markdown files
2. For each file, build a search query from its title and first paragraph
3. Run that query against a semantic search index (cosine similarity)
4. Filter results above a 0.68 similarity threshold
5. Skip files that are already linked
6. Append a "## Related" section with the new wikilinks
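The last three steps can be sketched in a few lines of bash. This is a minimal sketch, not the actual reweave.sh: the function names `filter_by_threshold` and `append_related` are hypothetical, and it assumes search results arrive as tab-separated score and path lines rather than the JSON the real search emits.

```shell
#!/usr/bin/env bash
THRESHOLD=0.68

# Keep only results at or above the similarity threshold.
# stdin: "score<TAB>path" lines (an assumed format). stdout: matching paths.
filter_by_threshold() {
    awk -F'\t' -v t="$THRESHOLD" '$1 + 0 >= t + 0 { print $2 }'
}

# Append wikilinks for every target the note does not already link to.
# Args: note path, then candidate target paths.
append_related() {
    local note=$1; shift
    local target name new_links=()
    for target in "$@"; do
        name=$(basename "$target" .md)
        # Already linked as [[name]] or [[name|alias]]: skip it
        if grep -qF -e "[[${name}]]" -e "[[${name}|" "$note"; then
            continue
        fi
        new_links+=("- [[${name}]]")
    done
    (( ${#new_links[@]} == 0 )) && return 0
    # Add the heading only if the note does not have one yet
    grep -qxF '## Related' "$note" || printf '\n## Related\n' >> "$note"
    printf '%s\n' "${new_links[@]}" >> "$note"
}
```

One simplification to be aware of: the links are always appended to the end of the file, so this only behaves well when an existing "## Related" heading is the final section of the note.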
The interesting part is what happens between steps 4 and 6.
The Hub Problem
My first version linked everything to everything. Notes like "README" and "CLAUDE" appeared in the top results for almost every query — they're semantically broad enough to match anything. The vault graph turned into a hub-and-spoke disaster where a few generic notes had dozens of incoming links and the actual interesting connections were buried.
The fix is two-phase processing with hub detection:
    # Phase 1: run every query and count how often each target appears
    declare -A hub_counts
    for file in "${candidates[@]}"; do
        query=$(build_query "$file")
        results=$(semantic_search "$query")   # assumed: one target path per line
        while IFS= read -r target; do
            [[ -n $target ]] || continue      # empty keys are invalid subscripts
            hub_counts["$target"]=$(( ${hub_counts["$target"]:-0} + 1 ))
        done <<< "$results"
    done

    # Phase 2 (inside the linking loop): skip targets that appeared in too many result sets
    if (( ${hub_counts["$target"]:-0} >= HUB_THRESHOLD )); then
        continue  # This is a hub, so skip it
    fi

The threshold scales with batch size. In a 5-file scan, appearing twice makes you a hub. In a 20-file scan, you need 4 appearances. This prevents over-linking in small batches while still catching genuine hubs in large scans.
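The exact scaling rule is not spelled out here, so the following is only one formula that reproduces the two data points above (5 files gives 2, 20 files gives 4): one fifth of the batch size, with a floor of 2. Both `hub_threshold` and `find_hubs` are hypothetical names, and the sketch assumes each target appears at most once per result set.

```shell
#!/usr/bin/env bash
# ASSUMPTION: threshold = max(2, batch_size / 5). This matches the two
# examples in the text but is not confirmed as the real rule.
hub_threshold() {
    local batch_size=$1
    local threshold=$(( batch_size / 5 ))
    if (( threshold < 2 )); then
        threshold=2
    fi
    echo "$threshold"
}

# Print every target that shows up in at least <threshold> result sets.
# Arg: threshold. stdin: one target path per line, pooled across all queries.
find_hubs() {
    local threshold=$1 target
    declare -A counts=()
    while IFS= read -r target; do
        [[ -n $target ]] || continue
        counts[$target]=$(( ${counts[$target]:-0} + 1 ))
    done
    for target in "${!counts[@]}"; do
        if (( ${counts[$target]} >= threshold )); then
            echo "$target"
        fi
    done
}
```

Counting first and filtering second is what makes this a two-phase design: a hub can only be recognized once every query in the batch has run.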
Filters That Earned Their Place
Every filter in the system exists because of a real false positive. Here's the progression:
Generic basenames. Files named README.md, _index.md, and CLAUDE.md exist in every project directory. They match everything semantically because they describe everything. A case-insensitive basename filter catches them all.
Template files. The vault has _templates/career/networking-contact.md, which kept appearing as a match for notes about networking. The path filter */_templates/* missed root-relative paths: _templates/career/... has no leading directory for the */ prefix to match. Fixed by checking both patterns, with and without the leading */.
Non-markdown files. The semantic search index includes Python scripts. vault-ask.py kept appearing as a link target. A simple extension check removes them.
Stub summaries. Files with "(pending)" or "TODO" in their summary aren't useful link targets. They're placeholders that haven't been written yet.
Git worktrees. Linesheet uses git worktrees for parallel development. The worktree directory contains copies of every file — reweave was modifying the copies alongside the originals. A *-worktrees/* exclusion fixed it.
Well-connected files. Notes with 8+ existing outgoing links are already well-integrated into the graph. Adding more links has diminishing returns and clutters the note.
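Most of these filters reduce to pattern checks on the candidate's path and summary. Here is a rough sketch of how they might be combined into one predicate. The function name `should_skip_target` and its two-argument interface are assumptions made for illustration, and the outgoing-link count is left out because it needs the note's contents rather than its path.

```shell
#!/usr/bin/env bash
# Return 0 when a candidate link target should be skipped, 1 when it is usable.
# Args: target path, optional summary text. Requires bash 4+ for ${var,,}.
should_skip_target() {
    local path=$1 summary=${2:-}
    local base=${path##*/}

    # Generic basenames, compared case-insensitively
    case "${base,,}" in
        readme.md|_index.md|claude.md) return 0 ;;
    esac

    # Non-markdown files (for example Python scripts in the index)
    [[ $path == *.md ]] || return 0

    # Templates (root-relative and nested) and git worktree copies
    case "$path" in
        _templates/*|*/_templates/*) return 0 ;;
        *-worktrees/*) return 0 ;;
    esac

    # Stub summaries that have not been written yet
    case "$summary" in
        *"(pending)"*|*TODO*) return 0 ;;
    esac

    return 1
}
```

Listing `_templates/*` and `*/_templates/*` side by side covers the root-relative case that the single `*/_templates/*` pattern missed.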
The Search Infrastructure
Reweave depends on a local semantic search index that covers the entire vault — 12,000+ files embedded with qwen3-embedding:0.6b (12GB+, runs on GPU alongside the inference models). Queries take ~164ms. The index lives in SQLite.
    # Build search query from file content
    query=$(build_query "$file")
    # Run semantic search (cosine similarity, top 10)
    results=$(python3 vault-search.py "$query" --mode semantic --json --top 10)

The build_query function extracts the first heading and first content paragraph, skipping frontmatter and structural elements. This gives the embedding model enough context without overwhelming it with boilerplate.
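The body of `build_query` is not shown, so the version below is a guess at its shape rather than the real function. It assumes YAML frontmatter delimited by `---` lines, ATX-style `#` headings, and treats list items, table rows, and blockquotes as the "structural elements" to skip. Fenced code blocks are not handled.

```shell
#!/usr/bin/env bash
# Print "<first heading>. <first paragraph>" for a markdown file.
# Arg: path to the markdown file.
build_query() {
    awk '
        # Skip a YAML frontmatter block delimited by --- lines at the top of the file
        NR == 1 && /^---[ \t]*$/ { in_frontmatter = 1; next }
        in_frontmatter { if ($0 ~ /^---[ \t]*$/) in_frontmatter = 0; next }

        # First heading becomes the title; any heading ends a paragraph in progress
        /^#+[ \t]/ {
            if (para != "") exit
            if (title == "") { title = $0; sub(/^#+[ \t]+/, "", title) }
            next
        }

        # Blank and structural lines (lists, tables, blockquotes) end or delay the paragraph
        /^[ \t]*$/ || /^([-*+][ \t]|[|>]|[0-9]+\.[ \t])/ {
            if (para != "") exit
            next
        }

        # Anything else is paragraph text; join wrapped lines with spaces
        { para = (para == "" ? $0 : para " " $0) }

        END {
            if (title != "" && para != "") print title ". " para
            else print title para
        }
    ' "$1"
}
```

Stopping at the first complete paragraph keeps the query short, which fits the goal of giving the embedding model context without boilerplate.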
Quality Results
After five rounds of bug fixes and filter additions, the system produces consistently good links. A full-vault scan of 1,200+ project files found accurate semantic connections:
| Source | Target | Why It Works |
|---|---|---|
| CODEX-REVIEW.md | 2026-02-codex-review | Both are code review documents for the same project |
| TESTING.md | TESTING_GUIDE | Testing overview linked to its companion guide |
| README.md | GETTING-STARTED | Project README linked to the user-facing getting started doc |
Zero false positives in this scan. The hub detection filtered out generic matches, the path exclusions caught structural files, and the similarity threshold ensured genuine semantic relevance.
Integration with the Heartbeat
Reweave runs as a daily heartbeat task at 04:00. It's script-only — no model inference needed. The semantic search handles the intelligence; the script handles the plumbing.
    ### reweave
    - Schedule: daily 04:00
    - Model: none (script-only)
    - Script: reweave.sh

It also supports manual runs with path filtering:
    # Scan entire vault (no time filter)
    bash reweave.sh --full
    # Scan only one project's files
    bash reweave.sh --full --path=Projects/Linesheet

The --path flag is useful after creating a batch of new notes in a specific area: you can wire them into the local graph immediately, without waiting for the daily run.
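For readers building something similar, here is one way the two flags could drive candidate selection. It is a sketch under assumptions, not the actual reweave.sh: `parse_args` and `list_candidates` are hypothetical names, and "recently modified" is taken to mean the last 24 hours (`find -mtime -1`), which is not stated above.

```shell
#!/usr/bin/env bash
# Parse --full and --path=<subdir> into two globals.
parse_args() {
    FULL_SCAN=0
    PATH_FILTER=""
    local arg
    for arg in "$@"; do
        case "$arg" in
            --full)   FULL_SCAN=1 ;;
            --path=*) PATH_FILTER=${arg#--path=} ;;
            *) echo "unknown option: $arg" >&2; return 2 ;;
        esac
    done
}

# List candidate markdown files under the vault root.
# Arg: vault root. Honors FULL_SCAN and PATH_FILTER set by parse_args.
list_candidates() {
    local root=$1
    local find_args=("$root/$PATH_FILTER" -type f -name '*.md')
    # ASSUMPTION: without --full, only files modified in the last 24 hours
    (( FULL_SCAN )) || find_args+=(-mtime -1)
    find "${find_args[@]}"
}
```

Building the `find` arguments in an array keeps paths with spaces intact, which matters in a vault where note titles double as filenames.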
Graph Health Monitoring
To close the loop, the vault health heartbeat now tracks graph metrics via the Obsidian CLI:
    if timeout 10 obsidian vault >/dev/null 2>&1; then
        orphan_count=$(timeout 15 obsidian orphans | wc -l)
        deadend_count=$(timeout 15 obsidian deadends | wc -l)
        unresolved_count=$(timeout 15 obsidian unresolved total)
    fi

Orphans are notes with no incoming links. Dead-ends are notes with no outgoing links. If either exceeds 20, the health report suggests running reweave. Currently both are at zero; the combination of manual linking and automated reweave is keeping the graph connected.
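The threshold check itself is a one-line comparison. A minimal sketch, assuming the two counts have already been collected as above; the function name `graph_health_hint`, the `HEALTH_LIMIT` variable, and the message wording are all made up for illustration.

```shell
#!/usr/bin/env bash
HEALTH_LIMIT=20

# Print a one-line health summary, suggesting reweave when either count
# exceeds the limit. Args: orphan count, dead-end count.
graph_health_hint() {
    local orphans=$(( $1 )) deadends=$(( $2 ))   # arithmetic strips wc -l padding
    if (( orphans > HEALTH_LIMIT || deadends > HEALTH_LIMIT )); then
        echo "graph needs attention: ${orphans} orphans, ${deadends} dead-ends (consider running reweave)"
    else
        echo "graph OK: ${orphans} orphans, ${deadends} dead-ends"
    fi
}
```

"Exceeds 20" is read strictly here, so a count of exactly 20 still reports OK.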
One gotcha: obsidian orphans total counts all files in the vault, including source code in node_modules/. A 637-note vault reported 255,808 orphans. The fix is piping the markdown-only list through wc -l instead of using the total subcommand.
The reweave system turns note-taking from a write-and-forget activity into a self-reinforcing knowledge graph. Every new note automatically discovers its neighbors. Every daily run strengthens connections that a human would eventually make manually — but probably wouldn't get around to. For a vault managing 10+ projects and 12,000+ files, that compound linking is the difference between a flat file system and an actual second brain.