Why use hybrid search instead of just semantic search?

Hybrid search combines BM25 keyword matching (precise for exact terms but blind to synonyms) with semantic embeddings (good at conceptual queries but terrible at exact identifiers). Using Reciprocal Rank Fusion, hybrid search gets the best of both worlds—finding exact function names while also capturing conceptual relationships.

What is HyDE (Hypothetical Document Embeddings)?

HyDE improves semantic search by embedding a hypothetical answer to your query instead of the query itself. Since real documents are phrased as answers rather than questions, the hypothetical answer embedding lands much closer to actual relevant documents in the embedding space.

What are typed sub-queries in vault-search?

Typed sub-queries let you mix retrieval strategies within a single query. Use lex: for exact identifiers, vec: for conceptual search, and hyde: for question-based queries; terms without a prefix use the default hybrid mode. Results are fused together via RRF for a coherent merged ranking.

How does the knowledge graph layer improve search results?

The knowledge graph captures structure beyond similarity—showing how concepts relate (applies_to, builds_on, relates_to). When you search for 'reinforcement learning,' the graph reveals connected concepts like reward hacking, cybernetics, and RLHF, adding semantic structure that embeddings alone cannot capture.

Why did vault-search achieve a 20x speedup?

The speedup came from cleanup, not algorithm changes. The index had accumulated duplicate chunks, stale embeddings from renamed files, and records for deleted files. Dropping orphaned chunks, deduplicating by content hash, rebuilding the FTS5 index, and vacuuming the database reduced query time from 9.9s to 0.5s.

Why use SQLite for vault-search instead of a dedicated search engine?

SQLite provides FTS5 (Full-Text Search) with trigram tokenization for substring matching, plus a knowledge graph stored in simple tables. It requires no external database, runs entirely on local hardware with Ollama, keeps data local, and is easy to understand and modify.

vault-search: Building Hybrid Retrieval That Actually Works on Local Hardware

I've been building a personal knowledge vault for a couple of years — 800+ markdown notes, code files, project docs, research notes. The standard approach to searching it is grep. Grep is fast and reliable, but it only finds what you already know to search for. If you write a note about "temporal difference learning" and later search for "how reinforcement learning handles delayed feedback," you get nothing.

So I built vault-search: hybrid BM25 + semantic search with RRF fusion, HyDE query expansion, a knowledge graph for entity traversal, and typed sub-queries. It runs entirely on local hardware via Ollama. No API cost. No data leaving your machine.

Here's what's interesting about building it.

Why Hybrid Instead of Just Semantic

Pure semantic search (embedding similarity) is good at conceptual queries but terrible at exact identifiers. If you search for getTenantId, a semantic model might surface files about "authentication patterns" and "multi-tenant architecture" — relevant concept, wrong function. You wanted the file with that specific function name.

BM25 (keyword matching) is the opposite: precise for exact terms, blind to synonyms and concepts. "Reinforcement learning" and "RL" are the same thing to a human, different tokens to BM25.

Hybrid search gets the best of both. The question is how to combine them. vault-search uses Reciprocal Rank Fusion:

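from collections import defaultdict

def rrf_score(rank, k=60):
    # rank is 1-based: the top result contributes 1 / (k + 1)
    return 1.0 / (k + rank)

def fuse(semantic_results, bm25_results):
    # Each input is a best-first list of (file, raw_score) pairs.
    # Raw scores are ignored; only rank position matters.
    scores = defaultdict(float)
    for rank, (file, _) in enumerate(semantic_results, start=1):
        scores[file] += rrf_score(rank)
    for rank, (file, _) in enumerate(bm25_results, start=1):
        scores[file] += rrf_score(rank)
    return sorted(scores.items(), key=lambda x: -x[1])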

RRF doesn't require you to calibrate score scales between the two systems. A semantic similarity score of 0.73 and a BM25 score of 8.2 don't have a natural common unit. But rank 3 in one list and rank 3 in the other mean the same thing, and RRF treats them that way.
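To make the arithmetic concrete, here is a small worked sketch. The function name rrf_fuse and the file names are made up for illustration, and ranks are counted from 1. It accepts any number of best-first lists, so the same idea extends beyond two retrievers.

```python
from collections import defaultdict

def rrf_fuse(ranked_lists, k=60):
    """Fuse any number of best-first result lists with Reciprocal Rank Fusion."""
    scores = defaultdict(float)
    for results in ranked_lists:
        # Ranks are 1-based: the top hit in a list contributes 1 / (k + 1).
        for rank, doc in enumerate(results, start=1):
            scores[doc] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Illustrative inputs: tenant.ts appears in both lists, the others in one.
semantic = ["auth.md", "tenant.ts", "patterns.md"]
bm25 = ["tenant.ts", "utils.ts"]
fused = rrf_fuse([semantic, bm25])
```

Here tenant.ts scores 1/62 + 1/61, which beats auth.md at 1/61 even though auth.md was the top semantic hit. A file that both retrievers agree on outranks a file that only one of them likes.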

The BM25 implementation uses SQLite FTS5 with trigram tokenization, which gives you substring matching for free.
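A minimal sketch of what that looks like with Python's built-in sqlite3 module. The table name chunks_fts and its columns are assumptions for illustration, not vault-search's actual schema, and the trigram tokenizer needs SQLite 3.34 or newer.

```python
import sqlite3

def build_index(docs):
    """Create an in-memory FTS5 table using the trigram tokenizer."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE VIRTUAL TABLE chunks_fts "
        "USING fts5(path UNINDEXED, body, tokenize='trigram')"
    )
    conn.executemany("INSERT INTO chunks_fts(path, body) VALUES (?, ?)", docs)
    return conn

def bm25_search(conn, term, limit=10):
    """Return (path, score) rows, best match first."""
    # Quote the term so FTS5 treats it as a literal phrase, not query syntax.
    phrase = '"' + term.replace('"', '""') + '"'
    return conn.execute(
        "SELECT path, bm25(chunks_fts) AS score "
        "FROM chunks_fts WHERE chunks_fts MATCH ? "
        "ORDER BY score LIMIT ?",  # bm25() is lower-is-better in FTS5
        (phrase, limit),
    ).fetchall()

conn = build_index([
    ("tenant.ts", "export function getTenantId(ctx) { return ctx.tenant }"),
    ("auth.md", "notes on multi-tenant authentication patterns"),
])
hits = bm25_search(conn, "TenantId")  # substring of the identifier
```

With a word-based tokenizer, "TenantId" would not match "getTenantId" because it is not a whole token. Trigrams make any substring of three or more characters searchable.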

HyDE: Searching With a Fake Answer

One of the most counterintuitive retrieval tricks: if you embed the query, you get a vector that represents the question. If you embed a hypothetical answer to the query, you get a vector that lands much closer to where real answers live in the embedding space.

This is HyDE (Hypothetical Document Embeddings):

import ollama

EXPAND_MODEL = "qwen3:8b"  # local model used for query expansion

def hyde_expand(query: str) -> str:
    # Ask the local model for a plausible answer paragraph to embed.
    prompt = f"""Write a short paragraph that would answer this question.
    Be specific and technical. Use terminology that would appear in a document
    answering this question.

    Question: {query}

    Answer:"""
    response = ollama.generate(model=EXPAND_MODEL, prompt=prompt)
    return response["response"]

# Then embed the hypothetical answer instead of (or in addition to) the query.
# embed() stands for the embedding helper (not shown here).
expanded = hyde_expand(query)
query_vec = embed(expanded)

The local model (qwen3:8b) generates a plausible answer paragraph. That paragraph gets embedded. The resulting vector is semantically closer to real documents than the raw question embedding — because real documents are also phrased as answers, not questions.

HyDE is available as --expand and can be combined with --rerank for the full quality pipeline. On its own, --expand adds about 2-3 seconds (one LLM generation); with --rerank the total is closer to 6-8 seconds. For most queries the default hybrid mode at 0.5s is fast enough.

Typed Sub-queries: Mixing Strategies per Term

Sometimes you know exactly what retrieval strategy you want for different parts of a query. vault-search supports typed prefixes:

# Exact function name + semantic concept
python3 vault-search.py 'lex:"getTenantId" vec:"auth mutation security"'
 
# Question-style for one term, hybrid for the rest
python3 vault-search.py 'hyde:"how does caching work" convex reactive'

Prefix	Strategy	When to use
`lex:"term"`	BM25 keyword only	Exact identifiers, function names, error codes
`vec:"concept"`	Embedding only	Conceptual/semantic search
`hyde:"question"`	HyDE + embedding	When you have a question, not a topic
(none)	Hybrid BM25+semantic	Default, works for most queries

The sub-query results are fused together via RRF at the end, so a single query can mix all three strategies and get a coherent merged ranking.
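One way such a query string could be split into typed parts is sketched below. This is a hypothetical parser written for illustration, not vault-search's actual code; the function name parse_subqueries and the "hybrid" label for unprefixed terms are assumptions.

```python
import re
from typing import List, Tuple

# Matches prefix:"quoted text" for the three typed prefixes.
_TYPED = re.compile(r'\b(lex|vec|hyde):"([^"]*)"')

def parse_subqueries(query: str) -> List[Tuple[str, str]]:
    """Split a query into (strategy, text) pairs."""
    subqueries = [(m.group(1), m.group(2)) for m in _TYPED.finditer(query)]
    # Whatever is left after removing the typed parts goes to hybrid mode.
    remainder = " ".join(_TYPED.sub(" ", query).split())
    if remainder:
        subqueries.append(("hybrid", remainder))
    return subqueries

parts = parse_subqueries('hyde:"how does caching work" convex reactive')
```

Each pair would then be run through its own retriever, and the resulting ranked lists fused with RRF.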

The Knowledge Graph Layer

Embeddings tell you similarity. The knowledge graph tells you structure. When you search for "reinforcement learning," the graph can tell you that this concept:

builds on: cybernetics
applies to: reward hacking
relates to: RLHF
← relates to: markov decision process

That's not captured in any individual note. It's the accumulated structure extracted from all notes.

── Graph Context ──
  reinforcement learning (concept, 28 connections)
    → applies_to → reward hacking
    → builds_on → cybernetics
    ← markov decision process ← relates_to
  reinforcement learning from human feedback (technique, 10 connections)
    → relates_to → reward hacking
    ← direct preference optimization ← relates_to

The graph is extracted by sending each note to a local LLM with a structured prompt asking for entities and relationships. The results are stored in SQLite alongside the search index. No external graph database — just two tables (entities, relations).
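As a rough sketch of the two-table layout, assuming hypothetical column names (the real schema may differ), the graph context for one entity is a pair of joins:

```python
import sqlite3

# Hypothetical schema: column names are assumptions for illustration.
SCHEMA = """
CREATE TABLE entities (
    id   INTEGER PRIMARY KEY,
    name TEXT UNIQUE NOT NULL,
    kind TEXT NOT NULL
);
CREATE TABLE relations (
    source_id INTEGER NOT NULL REFERENCES entities(id),
    rel_type  TEXT NOT NULL,
    target_id INTEGER NOT NULL REFERENCES entities(id),
    UNIQUE (source_id, rel_type, target_id)
);
"""

def add_relation(conn, source, rel_type, target, kind="concept"):
    """Insert both entities if missing, then the edge between them."""
    for name in (source, target):
        conn.execute(
            "INSERT OR IGNORE INTO entities(name, kind) VALUES (?, ?)",
            (name, kind),
        )
    conn.execute(
        "INSERT OR IGNORE INTO relations "
        "SELECT s.id, ?, t.id FROM entities s, entities t "
        "WHERE s.name = ? AND t.name = ?",
        (rel_type, source, target),
    )

def graph_context(conn, name):
    """Return outgoing and incoming edges for one entity."""
    outgoing = conn.execute(
        "SELECT r.rel_type, t.name FROM relations r "
        "JOIN entities s ON s.id = r.source_id "
        "JOIN entities t ON t.id = r.target_id "
        "WHERE s.name = ? ORDER BY r.rel_type, t.name",
        (name,),
    ).fetchall()
    incoming = conn.execute(
        "SELECT r.rel_type, s.name FROM relations r "
        "JOIN entities s ON s.id = r.source_id "
        "JOIN entities t ON t.id = r.target_id "
        "WHERE t.name = ? ORDER BY r.rel_type, s.name",
        (name,),
    ).fetchall()
    return {"outgoing": outgoing, "incoming": incoming}
```

Plain joins are enough for one-hop context like the output shown earlier. Multi-hop traversal would need a recursive query or a loop in Python.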

The hard part is normalization. LLMs invent creative relationship types: "draws_from," "is_informed_by," "takes_inspiration_from." These are all basically builds_on. vault-graph normalizes 500+ LLM-invented relationship types down to 15 canonical ones, so the graph stays consistent across thousands of notes extracted by slightly different prompts.
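A minimal sketch of that kind of normalization, assuming a hand-maintained alias table. Only the three aliases named above are included, the canonical set shown is a subset, and falling back to relates_to for unknown types is an assumption rather than the tool's confirmed behavior.

```python
import re

# Subset of the canonical relationship types, for illustration only.
CANONICAL = {"builds_on", "applies_to", "relates_to"}

# LLM-invented types mapped to their canonical equivalent.
ALIASES = {
    "draws_from": "builds_on",
    "is_informed_by": "builds_on",
    "takes_inspiration_from": "builds_on",
}

def normalize_relation(raw: str, fallback: str = "relates_to") -> str:
    """Map a free-form relationship label to a canonical type."""
    # Lowercase and collapse spaces, hyphens, etc. into single underscores.
    key = re.sub(r"[^a-z0-9]+", "_", raw.strip().lower()).strip("_")
    if key in CANONICAL:
        return key
    # Unknown labels fall back to the loosest type (an assumption).
    return ALIASES.get(key, fallback)
```

Normalizing the surface form first means "Draws From", "draws-from", and "draws_from" all hit the same alias entry, which keeps the alias table much smaller than the raw list of invented types.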

The 20x Speedup

When I first deployed vault-search on the full vault (12,000+ files, 45,000+ chunks), queries took 9.9 seconds, which is unusable for interactive search.

The culprit: index bloat. The SQLite database had accumulated duplicate chunks, stale embeddings from renamed files, and chunk records for files that had been deleted. The BM25 index was scanning thousands of dead records. The embedding similarity was comparing against vectors that no longer corresponded to real files.

The fix was a cleanup pass: drop orphaned chunks (chunk references without a corresponding file record), deduplicate by content hash, rebuild the FTS5 index, and vacuum the database.
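The four steps can be sketched with the sqlite3 module as follows. The schema (files, chunks, and an external-content FTS5 table named chunks_fts) is hypothetical and only meant to show the shape of the pass.

```python
import sqlite3

# Hypothetical schema for illustration; the real tables may differ.
SCHEMA = """
CREATE TABLE files (id INTEGER PRIMARY KEY, path TEXT NOT NULL);
CREATE TABLE chunks (
    id           INTEGER PRIMARY KEY,
    file_id      INTEGER NOT NULL,
    content      TEXT NOT NULL,
    content_hash TEXT NOT NULL
);
CREATE VIRTUAL TABLE chunks_fts
    USING fts5(content, content='chunks', content_rowid='id');
"""

def cleanup_index(conn: sqlite3.Connection) -> dict:
    """Remove dead rows, rebuild the full-text index, and compact the file."""
    # 1. Drop orphaned chunks: their file record no longer exists.
    orphans = conn.execute(
        "DELETE FROM chunks WHERE file_id NOT IN (SELECT id FROM files)"
    ).rowcount
    # 2. Deduplicate: keep the lowest id for each content hash.
    duplicates = conn.execute(
        "DELETE FROM chunks WHERE id NOT IN "
        "(SELECT MIN(id) FROM chunks GROUP BY content_hash)"
    ).rowcount
    # 3. Rebuild the FTS5 index from the surviving chunk rows.
    conn.execute("INSERT INTO chunks_fts(chunks_fts) VALUES ('rebuild')")
    conn.commit()
    # 4. VACUUM must run outside a transaction, hence the commit above.
    conn.execute("VACUUM")
    return {"orphans": orphans, "duplicates": duplicates}
```

Returning the counts makes it easy to log how much dead weight each pass removed, which is a cheap way to notice the index degrading before queries get slow.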

After cleanup: 0.5 seconds for a hybrid search including graph context lookup. That's roughly a 20x improvement (9.9s down to 0.5s) from removing noise, not from changing any algorithms.

The lesson: retrieval performance degrades silently as the index ages. Build incremental cleanup into your indexing pipeline, not just incremental addition.

Why SQLite

Every piece of this system uses SQLite. The embeddings, the BM25 index (FTS5), the knowledge graph, the chunk records — all in one file per indexed directory.

This was a deliberate choice against the obvious alternatives (Chroma, Qdrant, Neo4j). Those are all fine tools, but they add infrastructure: a server to run, a port to expose, dependencies to manage. SQLite is a single file that travels with your notes. Back it up with cp. Move it with mv. Inspect it with any SQLite browser.

The downside is that SQLite has no built-in vector index such as HNSW, so similarity search is a full scan over every stored embedding. For 14,000 chunks, a full scan with NumPy batch cosine similarity takes about 180ms. That's acceptable. If your vault has 200,000 chunks, you'd want a proper vector index. For personal knowledge management, SQLite is the right call.
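A sketch of what a full-scan search can look like, assuming embeddings are stored as float32 BLOBs in a hypothetical embeddings table (the storage format and names are assumptions):

```python
import sqlite3
import numpy as np

def load_embeddings(conn):
    """Load every stored vector into one (n_chunks, dim) matrix."""
    rows = conn.execute(
        "SELECT chunk_id, vector FROM embeddings ORDER BY chunk_id"
    ).fetchall()
    ids = [row[0] for row in rows]
    # Each BLOB holds raw float32 values (assumed storage format).
    matrix = np.vstack([np.frombuffer(row[1], dtype=np.float32) for row in rows])
    return ids, matrix

def top_k(query_vec, ids, matrix, k=5):
    """Cosine similarity of the query against all rows in one batch."""
    q = query_vec / np.linalg.norm(query_vec)
    m = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    sims = m @ q  # one matrix-vector product covers the whole index
    order = np.argsort(-sims)[:k]
    return [(ids[i], float(sims[i])) for i in order]
```

The scan is linear in the number of chunks, which is why it stays comfortable at tens of thousands of vectors and becomes the bottleneck at hundreds of thousands. Normalizing the matrix once at load time, rather than per query, would trim the cost further.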

What's Next

vault-search is open source at github.com/vachsark/vault-search. The roadmap includes:

Chunk-level search results: Return the specific section that matched, not just the file, with section headings preserved in the result
Session context: Let the search system accumulate context across a session — "find me more files like the last three I looked at"
Reranker fine-tuning: Use the existing vault graph as training signal to fine-tune the reranker on your personal corpus

The core design — pure Python, no external services, Ollama for all inference — stays. The goal is a tool you can drop into any notes directory and have working in five minutes.

About the Author

Vache Sarkissian