
vault-search: Building Hybrid Retrieval That Actually Works on Local Hardware

By Vache Sarkissian · 6 min read
Updated June 3, 2026 · Reviewed March 29, 2026
Tags: search, retrieval, local-ai, ollama, open-source, nlp

Written by Claude (Opus 4.6). Vache prompted, reviewed, and published. The data and benchmarks are real; the prose is AI-generated.

I've been building a personal knowledge vault for a couple of years — 800+ markdown notes, code files, project docs, research notes. The standard approach to searching it is grep. Grep is fast and reliable, but it only finds what you already know to search for. If you write a note about "temporal difference learning" and later search for "how reinforcement learning handles delayed feedback," you get nothing.

So I built vault-search: hybrid BM25 + semantic search with RRF fusion, HyDE query expansion, a knowledge graph for entity traversal, and typed sub-queries. It runs entirely on local hardware via Ollama. No API cost. No data leaving your machine.

Here's what's interesting about building it.

Why Hybrid Instead of Just Semantic

Pure semantic search (embedding similarity) is good at conceptual queries but terrible at exact identifiers. If you search for getTenantId, a semantic model might surface files about "authentication patterns" and "multi-tenant architecture" — relevant concept, wrong function. You wanted the file with that specific function name.

BM25 (keyword matching) is the opposite: precise for exact terms, blind to synonyms and concepts. "Reinforcement learning" and "RL" are the same thing to a human, different tokens to BM25.

Hybrid search gets the best of both. The question is how to combine them. vault-search uses Reciprocal Rank Fusion:

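from collections import defaultdict

def rrf_score(rank, k=60):
    # Reciprocal rank: earlier positions contribute more; k damps the gap between top ranks
    return 1.0 / (k + rank)
 
def fuse(semantic_results, bm25_results):
    # Each input is a best-first list of (file, raw_score); the raw scores are ignored
    scores = defaultdict(float)
    # Ranks are 1-based, as in the standard RRF formula
    for rank, (file, _) in enumerate(semantic_results, start=1):
        scores[file] += rrf_score(rank)
    for rank, (file, _) in enumerate(bm25_results, start=1):
        scores[file] += rrf_score(rank)
    # Highest fused score first
    return sorted(scores.items(), key=lambda x: -x[1])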

RRF doesn't require you to calibrate score scales between the two systems. A semantic similarity score of 0.73 and a BM25 score of 8.2 have no natural common unit. But rank 3 in one list and rank 3 in the other mean the same thing, and RRF treats them that way.
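
To make the effect concrete, here is a small self-contained sketch that fuses any number of ranked lists. The function name rrf_fuse and the file names are made up for illustration, and it assumes 1-based ranks with k=60; it is not the vault-search source.

```python
from collections import defaultdict

def rrf_fuse(ranked_lists, k=60):
    """Fuse best-first lists of document ids with Reciprocal Rank Fusion."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        # Assumption: ranks start at 1, so the top hit contributes 1 / (k + 1)
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] += 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores.items(), key=lambda item: -item[1])

# Hypothetical result lists for a query like getTenantId
semantic = ["auth-patterns.md", "tenant.ts", "multi-tenant.md"]
bm25 = ["tenant.ts", "changelog.md"]

fused = rrf_fuse([semantic, bm25])
for doc, score in fused:
    print(f"{score:.4f}  {doc}")
```

tenant.ts is only second in the semantic list, but it appears in both lists, so its two contributions (1/62 + 1/61) beat any file that shows up in just one.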

The BM25 implementation uses SQLite FTS5 with trigram tokenization, which gives you substring matching for free.
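
A minimal sketch of what that looks like with Python's built-in sqlite3 module. The table name chunks_fts, its columns, and both helper functions are assumptions for illustration, and the trigram tokenizer needs SQLite 3.34 or newer.

```python
import sqlite3

def build_trigram_index(rows):
    """Create an in-memory FTS5 table with trigram tokenization.

    rows is an iterable of (path, body) pairs. Table and column names
    are hypothetical, not the vault-search schema.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE VIRTUAL TABLE chunks_fts "
        "USING fts5(path UNINDEXED, body, tokenize='trigram')"
    )
    conn.executemany("INSERT INTO chunks_fts(path, body) VALUES (?, ?)", rows)
    return conn

def lexical_search(conn, term, limit=10):
    """Return (path, bm25_score) pairs, best match first."""
    # Quote the term so FTS5 reads it as one string, not as query syntax
    quoted = '"' + term.replace('"', '""') + '"'
    return conn.execute(
        "SELECT path, bm25(chunks_fts) AS score FROM chunks_fts "
        "WHERE chunks_fts MATCH ? ORDER BY score LIMIT ?",
        (quoted, limit),
    ).fetchall()
```

FTS5's bm25() returns lower values for better matches, which is why the query sorts ascending. With trigram tokenization, searching for TenantId finds a chunk containing getTenantId even though it is only part of the identifier.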

HyDE: Searching With a Fake Answer

One of the most counterintuitive retrieval tricks: if you embed the query, you get a vector that represents the question. If you embed a hypothetical answer to the query, you get a vector that lands much closer to where real answers live in the embedding space.

This is HyDE (Hypothetical Document Embeddings):

import ollama

EXPAND_MODEL = "qwen3:8b"  # local model used for query expansion

def hyde_expand(query: str) -> str:
    # Ask the local model for a plausible answer paragraph to embed
    prompt = f"""Write a short paragraph that would answer this question.
    Be specific and technical. Use terminology that would appear in a document
    answering this question.
 
    Question: {query}
 
    Answer:"""
    response = ollama.generate(model=EXPAND_MODEL, prompt=prompt)
    return response["response"]
 
# Then embed the hypothetical answer instead of (or in addition to) the query.
# embed() is the index's own embedding helper, defined elsewhere.
expanded = hyde_expand(query)
query_vec = embed(expanded)

The local model (qwen3:8b) generates a plausible answer paragraph. That paragraph gets embedded. The resulting vector is semantically closer to real documents than the raw question embedding — because real documents are also phrased as answers, not questions.
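
If you want the "in addition to" variant, one simple option is to blend the two vectors before searching. This is a sketch of that idea, not what vault-search necessarily does: the function blend_query_vectors and the hyde_weight parameter are hypothetical.

```python
import numpy as np

def blend_query_vectors(query_vec, hyde_vec, hyde_weight=0.5):
    """Mix the raw-query embedding with the HyDE embedding.

    Assumption: both vectors come from the same embedding model and are
    non-zero. hyde_weight=0 uses only the query, 1 uses only the answer.
    """
    q = np.asarray(query_vec, dtype=np.float32)
    h = np.asarray(hyde_vec, dtype=np.float32)
    # Normalize first so neither vector dominates because of its length
    q = q / np.linalg.norm(q)
    h = h / np.linalg.norm(h)
    mixed = (1.0 - hyde_weight) * q + hyde_weight * h
    # Re-normalize so downstream cosine similarity behaves as usual
    return mixed / np.linalg.norm(mixed)
```

Keeping some weight on the raw query is a hedge against the generated paragraph drifting off topic.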

HyDE is available as --expand and is combined with --rerank for the full quality pipeline. On its own it adds ~2-3 seconds (one LLM generation). With --rerank it's closer to 6-8 seconds total. For most queries the default hybrid mode at 0.5s is fast enough.

Typed Sub-queries: Mixing Strategies per Term

Sometimes you know exactly what retrieval strategy you want for different parts of a query. vault-search supports typed prefixes:

# Exact function name + semantic concept
python3 vault-search.py 'lex:"getTenantId" vec:"auth mutation security"'
 
# Question-style for one term, hybrid for the rest
python3 vault-search.py 'hyde:"how does caching work" convex reactive'

| Prefix | Strategy | When to use |
| --- | --- | --- |
| lex:"term" | BM25 keyword only | Exact identifiers, function names, error codes |
| vec:"concept" | Embedding only | Conceptual/semantic search |
| hyde:"question" | HyDE + embedding | When you have a question, not a topic |
| (none) | Hybrid BM25+semantic | Default, works for most queries |

The sub-query results are fused together via RRF at the end, so a single query can mix all three strategies and get a coherent merged ranking.
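
Splitting a query string into typed parts could look roughly like this. The regex, the parse_query name, and the "hybrid" label for untyped terms are assumptions for illustration; the real parser may differ.

```python
import re

# Matches lex:"...", vec:"..." and hyde:"..." segments
TYPED = re.compile(r'\b(lex|vec|hyde):"([^"]*)"')

def parse_query(raw):
    """Split a query into (strategy, text) pairs.

    Anything left over after removing the typed segments is grouped
    into one default sub-query, labeled "hybrid" here.
    """
    subqueries = [(m.group(1), m.group(2)) for m in TYPED.finditer(raw)]
    rest = TYPED.sub(" ", raw).split()
    if rest:
        subqueries.append(("hybrid", " ".join(rest)))
    return subqueries

print(parse_query('hyde:"how does caching work" convex reactive'))
```

Each pair would then run through its own retrieval path, and the resulting ranked lists go into the RRF fusion step.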

The Knowledge Graph Layer

Embeddings tell you similarity. The knowledge graph tells you structure. When you search for "reinforcement learning," the graph can tell you that this concept:

  • builds on: cybernetics
  • applies to: reward hacking
  • relates to: RLHF
  • ← relates to: markov decision process

That's not captured in any individual note. It's the accumulated structure extracted from all notes.

── Graph Context ──
  reinforcement learning (concept, 28 connections)
    → applies_to → reward hacking
    → builds_on → cybernetics
    ← markov decision process ← relates_to
  reinforcement learning from human feedback (technique, 10 connections)
    → relates_to → reward hacking
    ← direct preference optimization ← relates_to

The graph is extracted by sending each note to a local LLM with a structured prompt asking for entities and relationships. The results are stored in SQLite alongside the search index. No external graph database — just two tables (entities, relations).
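
A two-table layout along those lines might look like the sketch below. Only the table names (entities, relations) come from the description above; the column names and the add_relation and graph_context helpers are assumptions.

```python
import sqlite3

# Hypothetical columns; only the two table names are given in the text
SCHEMA = """
CREATE TABLE IF NOT EXISTS entities (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE,
    kind TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS relations (
    source_id INTEGER NOT NULL REFERENCES entities(id),
    target_id INTEGER NOT NULL REFERENCES entities(id),
    rel_type  TEXT NOT NULL,
    UNIQUE (source_id, target_id, rel_type)
);
"""

def add_relation(conn, source, source_kind, rel_type, target, target_kind):
    """Insert both entities if missing, then the edge between them."""
    ids = []
    for name, kind in ((source, source_kind), (target, target_kind)):
        conn.execute(
            "INSERT OR IGNORE INTO entities(name, kind) VALUES (?, ?)", (name, kind)
        )
        row = conn.execute("SELECT id FROM entities WHERE name = ?", (name,)).fetchone()
        ids.append(row[0])
    conn.execute(
        "INSERT OR IGNORE INTO relations VALUES (?, ?, ?)", (ids[0], ids[1], rel_type)
    )

def graph_context(conn, name):
    """Return (outgoing, incoming) edges as lists of (rel_type, other_entity)."""
    outgoing = conn.execute(
        "SELECT r.rel_type, t.name FROM relations r "
        "JOIN entities s ON s.id = r.source_id "
        "JOIN entities t ON t.id = r.target_id "
        "WHERE s.name = ? ORDER BY r.rel_type, t.name",
        (name,),
    ).fetchall()
    incoming = conn.execute(
        "SELECT r.rel_type, s.name FROM relations r "
        "JOIN entities s ON s.id = r.source_id "
        "JOIN entities t ON t.id = r.target_id "
        "WHERE t.name = ? ORDER BY r.rel_type, s.name",
        (name,),
    ).fetchall()
    return outgoing, incoming
```

Two joins per direction is enough for the one-hop context shown in the search output, with no recursive queries needed.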

The hard part is normalization. LLMs invent creative relationship types: "draws_from," "is_informed_by," "takes_inspiration_from." These are all basically builds_on. vault-graph normalizes 500+ LLM-invented relationship types down to 15 canonical ones, so the graph stays consistent across thousands of notes extracted by slightly different prompts.
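
The normalization step can be as simple as a lookup table with a fallback. In this sketch the three aliases come from the examples above; the CANONICAL subset, the normalize_relation name, and the choice of relates_to as the fallback are assumptions.

```python
import re

# A small subset of canonical types, for illustration only
CANONICAL = {"builds_on", "applies_to", "relates_to"}

# LLM-invented variants mapped onto canonical types
ALIASES = {
    "draws_from": "builds_on",
    "is_informed_by": "builds_on",
    "takes_inspiration_from": "builds_on",
}

def normalize_relation(raw, fallback="relates_to"):
    """Map a free-form relationship label onto a canonical type."""
    # Lowercase and collapse spaces, hyphens, etc. into single underscores
    key = re.sub(r"[^a-z0-9]+", "_", raw.strip().lower()).strip("_")
    if key in CANONICAL:
        return key
    # Unknown labels fall back to the most generic relation (an assumption)
    return ALIASES.get(key, fallback)

print(normalize_relation("Takes inspiration from"))
```

Logging the labels that hit the fallback is an easy way to find new aliases worth adding to the table.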

The 20x Speedup

When I first deployed vault-search on the full vault (12,000+ files, 45,000+ chunks), queries took 9.9 seconds, which is unusable for interactive search.

The culprit: index bloat. The SQLite database had accumulated duplicate chunks, stale embeddings from renamed files, and chunk records for files that had been deleted. The BM25 index was scanning thousands of dead records. The embedding similarity was comparing against vectors that no longer corresponded to real files.

The fix was a cleanup pass: drop orphaned chunks (chunk references without a corresponding file record), deduplicate by content hash, rebuild the FTS5 index, and vacuum the database.
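
Here is a sketch of those four steps against a hypothetical schema. The files, chunks, and chunks_fts tables and their columns are assumptions, and deduplication is done per file and content hash so identical text in two different files is left alone.

```python
import sqlite3

# Hypothetical schema: chunks_fts is an external-content FTS5 table over chunks
SCHEMA = """
CREATE TABLE files (id INTEGER PRIMARY KEY, path TEXT);
CREATE TABLE chunks (
    id INTEGER PRIMARY KEY, file_id INTEGER, content TEXT, content_hash TEXT
);
CREATE VIRTUAL TABLE chunks_fts
    USING fts5(content, content='chunks', content_rowid='id');
"""

def cleanup_index(conn):
    """Remove dead rows, rebuild the FTS index, and compact the file.

    Returns (orphans_removed, duplicates_removed).
    """
    cur = conn.cursor()
    # 1. Orphans: chunks whose file record no longer exists
    cur.execute("DELETE FROM chunks WHERE file_id NOT IN (SELECT id FROM files)")
    orphans = cur.rowcount
    # 2. Duplicates: keep the oldest chunk per (file, content hash)
    cur.execute(
        "DELETE FROM chunks WHERE id NOT IN "
        "(SELECT MIN(id) FROM chunks GROUP BY file_id, content_hash)"
    )
    duplicates = cur.rowcount
    # 3. Rebuild the FTS5 index from the surviving chunk rows
    cur.execute("INSERT INTO chunks_fts(chunks_fts) VALUES ('rebuild')")
    conn.commit()
    # 4. VACUUM must run outside a transaction, hence the commit above
    conn.execute("VACUUM")
    return orphans, duplicates
```

Running a pass like this at the end of every indexing job keeps dead rows from accumulating in the first place.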

After cleanup: 0.5 seconds for a hybrid search including graph context lookup. That's roughly a 20x improvement (9.9s down to 0.5s) from removing noise, not from changing any algorithms.

The lesson: retrieval performance degrades silently as the index ages. Build incremental cleanup into your indexing pipeline, not just incremental addition.

Why SQLite

Every piece of this system uses SQLite. The embeddings, the BM25 index (FTS5), the knowledge graph, the chunk records — all in one file per indexed directory.

This was a deliberate choice against the obvious alternatives (Chroma, Qdrant, Neo4j). Those are all fine tools, but they add infrastructure: a server to run, a port to expose, dependencies to manage. SQLite is a single file that travels with your notes. Back it up with cp. Move it with mv. Inspect it with any SQLite browser.

The downside is that SQLite has no built-in vector index such as HNSW, so similarity search means a full scan. For 14,000 chunks, a full scan with NumPy batch cosine similarity takes about 180ms. That's acceptable. If your vault has 200,000 chunks, you'd want a proper vector index. For personal knowledge management, SQLite is the right call.
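
For reference, a brute-force scan of this kind is only a few lines of NumPy. This is a sketch under the assumption that all chunk embeddings are already loaded into one matrix with no all-zero rows; the function name top_k_cosine is hypothetical.

```python
import numpy as np

def top_k_cosine(query_vec, chunk_matrix, k=10):
    """Return the k most similar rows as (row_index, cosine_similarity)."""
    q = np.asarray(query_vec, dtype=np.float32)
    m = np.asarray(chunk_matrix, dtype=np.float32)
    # Normalize so a plain dot product equals cosine similarity
    q = q / np.linalg.norm(q)
    m = m / np.linalg.norm(m, axis=1, keepdims=True)
    sims = m @ q  # one matrix-vector product covers every chunk
    k = min(k, sims.shape[0])
    # argpartition finds the top k without sorting the whole array
    top = np.argpartition(-sims, k - 1)[:k]
    top = top[np.argsort(-sims[top])]
    return [(int(i), float(sims[i])) for i in top]
```

The cost grows linearly with the number of chunks, which is why this stays comfortable at personal-vault scale and stops being attractive well before hundreds of thousands of rows.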

What's Next

vault-search is open source at github.com/vachsark/vault-search. The roadmap includes:

  • Chunk-level search results: Return the specific section that matched, not just the file, with section headings preserved in the result
  • Session context: Let the search system accumulate context across a session — "find me more files like the last three I looked at"
  • Reranker fine-tuning: Use the existing vault graph as training signal to fine-tune the reranker on your personal corpus

The core design — pure Python, no external services, Ollama for all inference — stays. The goal is a tool you can drop into any notes directory and have working in five minutes.

About the Author

Vache Sarkissian

Building research infrastructure and products at the intersection of knowledge systems and machine learning. Creator of Linesheet Pro, vault-search, and the vachsark learning engine.
