My vault has 1,620 files. Every semantic search query was taking 148ms — not terrible, but the cosine similarity computation alone was eating 60ms of that. On a local AMD RX 9070 XT running ollama-rocm, there was room to do better.
Here's how I profiled the hot path, cut search latency by 43%, and confirmed that the current embedding model was already the better of the two I tested.
Profiling the Hot Path
Before optimizing anything, I needed to know where time was actually spent. I instrumented vault-search.py to measure each phase of a hybrid search query:
| Phase | Time | Share |
|---|---|---|
| DB load (SQLite) | 12.5ms | 11% |
| Embedding unpack | 40.5ms | 35% |
| Cosine similarity | 60.8ms | 53% |
| Sort + return | 0.2ms | 0% |
Over half the time was in pure-Python cosine similarity — iterating 1,620 vectors of 1,024 dimensions each, computing dot products with a list comprehension. The embedding API call (~50ms to ollama) wasn't even the bottleneck. The math was.
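A minimal sketch of this kind of per-phase instrumentation, assuming a small context-manager timer. The `phase` and `report` helpers and the phase name are hypothetical, not taken from vault-search.py:

```python
import time
from contextlib import contextmanager

timings = {}  # phase name -> accumulated milliseconds

@contextmanager
def phase(name):
    """Accumulate wall-clock time spent inside the with-block under `name`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        timings[name] = timings.get(name, 0.0) + elapsed_ms

def report(timings):
    """Return (name, ms, share-of-total) rows, slowest phase first."""
    total = sum(timings.values())
    rows = [(name, ms, ms / total if total else 0.0) for name, ms in timings.items()]
    return sorted(rows, key=lambda r: r[1], reverse=True)

# Usage: wrap each phase of a query, then print the breakdown.
with phase("cosine"):
    _ = sum(x * x for x in range(100_000))  # stand-in for the real work
for name, ms, share in report(timings):
    print(f"{name}: {ms:.1f}ms ({share:.0%})")
```

Using `time.perf_counter()` rather than `time.time()` matters at this scale, since the phases being compared are only tens of milliseconds long.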
NumPy Vectorized Search
The fix was straightforward: replace the pure-Python cosine computation with NumPy matrix multiplication.
```python
import numpy as np

def vectorized_search(rows, q_emb, query, path_filter):
    # Excerpt: query and path_filter are not used in this part of the function
    n = len(rows)
    dims = len(q_emb)
    emb_matrix = np.zeros((n, dims), dtype=np.float64)
    norms_arr = np.zeros(n, dtype=np.float64)
    for i, (path, blob, enorm, summary) in enumerate(rows):
        # Blob length distinguishes float64 (8 bytes/dim) from float32 (4 bytes/dim)
        if len(blob) == dims * 8:
            emb_matrix[i] = np.frombuffer(blob, dtype=np.float64)
        else:
            emb_matrix[i] = np.frombuffer(blob, dtype=np.float32)
        norms_arr[i] = enorm  # assumed to be the embedding norm stored with the row
    q_vec = np.array(q_emb, dtype=np.float64)
    q_norm = np.linalg.norm(q_vec)
    # Single matrix multiply: all 1,620 cosine similarities at once
    dots = emb_matrix @ q_vec
    cosines = dots / (norms_arr * q_norm)
```

Instead of 1,620 individual dot product loops, one @ operator handles everything via BLAS-optimized matrix multiplication. The cosine computation dropped from 60.8ms to 0.12ms, a 506x speedup.
The full pipeline (unpack + cosine) went from 101.3ms to 14.1ms, a 7.2x improvement. The remaining 14ms is mostly np.frombuffer deserialization, which is already fast.
| Metric | Before (pure Python) | After (numpy) | Improvement |
|---|---|---|---|
| Semantic search (20 queries) | 148ms mean | 85ms mean | 1.7x faster |
| Hybrid search (20 queries) | 152ms mean | 90ms mean | 1.7x faster |
| Cosine computation only | 60.8ms | 0.12ms | 506x faster |
| Embedding unpack + cosine | 101.3ms | 14.1ms | 7.2x faster |
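The two paths compute the same numbers. Only where the loop runs changes. A small sketch on random data that checks this (the function names and the 200-row stand-in index are illustrative, not from the real script):

```python
import numpy as np

def cosine_loop(matrix, q):
    """Reference: one Python-level dot product per row."""
    q_norm = sum(x * x for x in q) ** 0.5
    out = []
    for row in matrix:
        dot = sum(a * b for a, b in zip(row, q))
        norm = sum(a * a for a in row) ** 0.5
        out.append(dot / (norm * q_norm))
    return out

def cosine_vectorized(matrix, q):
    """Same result from a single matrix-vector product."""
    norms = np.linalg.norm(matrix, axis=1)
    return (matrix @ q) / (norms * np.linalg.norm(q))

rng = np.random.default_rng(0)
emb = rng.standard_normal((200, 1024))  # small stand-in for the 1,620-row index
query = rng.standard_normal(1024)

slow = cosine_loop(emb.tolist(), query.tolist())
fast = cosine_vectorized(emb, query)
top5 = np.argsort(-fast)[:5]  # indices of the five most similar rows
```

Wrapping each call in a timer on your own hardware is the easiest way to see how the gap scales with row count.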
The implementation includes a graceful fallback — if NumPy isn't installed, the pure-Python path runs automatically. I also added an LRU cache for query embeddings (64 entries, ~512KB) to avoid redundant ollama API calls for repeated searches.
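A sketch of how the fallback and the query cache might fit together. `embed_query` is a hypothetical stand-in for the real ollama call, and `cosine_scores` is an illustrative name. Only the cache size of 64 comes from the setup described above.

```python
from functools import lru_cache

try:
    import numpy as np
    HAVE_NUMPY = True
except ImportError:  # NumPy missing: use the pure-Python path
    np = None
    HAVE_NUMPY = False

def embed_query(text):
    """Hypothetical stand-in for the embedding API call; returns a fake 4-dim vector."""
    return [float(len(text)), 1.0, 0.0, 0.5]

@lru_cache(maxsize=64)
def cached_query_embedding(text):
    # Return a tuple so callers cannot mutate the cached value
    return tuple(embed_query(text))

def cosine_scores(rows, q):
    """rows: list of equal-length float sequences. Uses NumPy when available."""
    if HAVE_NUMPY:
        m = np.asarray(rows, dtype=np.float64)
        qv = np.asarray(q, dtype=np.float64)
        return ((m @ qv) / (np.linalg.norm(m, axis=1) * np.linalg.norm(qv))).tolist()
    q_norm = sum(x * x for x in q) ** 0.5
    return [
        sum(a * b for a, b in zip(r, q)) / ((sum(a * a for a in r) ** 0.5) * q_norm)
        for r in rows
    ]
```

Keying the cache on the raw query string means a repeated search skips the API call entirely, while any change to the query text counts as a miss.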
Float32 Storage
The embeddings were stored as float64 (8 bytes per dimension) but the model outputs float32 precision. Storing them as float32 cuts storage in half with zero accuracy loss:
| Metric | Float64 | Float32 |
|---|---|---|
| Storage per embedding | 8,192 bytes | 4,096 bytes |
| Total index (1,620 files) | ~12.8 MB | ~6.4 MB |
| Cosine accuracy loss | — | 0.0000000000 |
I changed pack_embedding() to use float32 format and added auto-detection in unpack_embedding() so existing float64 blobs are read correctly. The migration happens naturally as files get re-indexed.
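A sketch of what length-based auto-detection can look like. The function names follow the post, but the bodies, the `dims` parameter, and the little-endian byte order are assumptions rather than the actual implementation:

```python
import struct

def pack_embedding(vec):
    """Serialize as little-endian float32 (4 bytes per dimension)."""
    return struct.pack(f"<{len(vec)}f", *vec)

def unpack_embedding(blob, dims=1024):
    """Read a blob as float64 or float32, chosen by its byte length."""
    if len(blob) == dims * 8:  # legacy float64 blob
        return list(struct.unpack(f"<{dims}d", blob))
    if len(blob) == dims * 4:  # new float32 blob
        return list(struct.unpack(f"<{dims}f", blob))
    raise ValueError(f"unexpected blob size {len(blob)} for {dims} dims")
```

Detection by length only works because the dimension count is fixed: a 1,024-dim vector is 8,192 bytes as float64 and 4,096 bytes as float32, so the two cannot be confused.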
Model Comparison: Qwen3 vs EmbeddingGemma
I tested the current model (qwen3-embedding:0.6b) against Google's embeddinggemma (308M params, MTEB 69.67):
| Model | Dims | Speed (warm) | Avg Cosine | MTEB English |
|---|---|---|---|---|
| qwen3-embedding:0.6b | 1024 | 46.3ms | 0.5432 | ~65-66 |
| embeddinggemma | 768 | 60.4ms | 0.4783 | 69.67 |
Despite EmbeddingGemma's higher MTEB benchmark score, Qwen3 won on actual vault retrieval tasks — likely because Qwen's training data includes more code and markdown content. It was also 23% faster. No model change needed.
What I Didn't Change
Not everything that could be optimized should be:
Instruction prefix — Qwen3-Embedding supports instruction-aware queries. I tested prepending task-specific instructions to queries. It hurt 3 out of 4 test queries because the indexed documents weren't embedded with matching instructions. Not worth re-indexing 1,620 files for marginal-at-best improvement.
HyDE query expansion — Generating a hypothetical document before searching helps vague conceptual queries but hurts specific ones, and adds 580–1,586ms of LLM generation latency. It stays as an opt-in --expand flag, not the default.
HNSW/IVF indexing — Approximate nearest neighbor algorithms are designed for 10K+ vectors. At 1,620 files, brute-force cosine with NumPy is already sub-millisecond. The overhead of maintaining an ANN index isn't justified.
What's Next
Three potential optimizations that I'm tracking but haven't implemented:
- sqlite-vec extension: SIMD-accelerated distance functions as a SQLite extension. Would make search sub-10ms even without NumPy.
- Matryoshka dimension reduction: Qwen3-embedding supports truncating from 1024 to 512 dimensions at ~95% quality. Would halve storage and speed up computation 2x. Needs A/B testing to verify acceptable quality.
- Binary quantization pre-filter: Store 1-bit binary embeddings alongside float32. Use Hamming distance for fast candidate retrieval, then rescore top-N with full cosine. sqlite-vec supports this natively.

The embedding API call (~50ms) is now the dominant cost, not computation. For a 1,620-file vault with semantic + BM25 hybrid search, 90ms total is fast enough that search feels instant in practice. The real win was confirming that the current model and architecture are already well-suited to the task. Sometimes the best optimization is knowing when to stop.
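For reference, the binary quantization pre-filter from the list above can be sketched in plain NumPy. This is not the sqlite-vec implementation. The function names, the random stand-in data, and the candidate count of 50 are all assumptions for illustration.

```python
import numpy as np

def binarize(emb):
    """1 bit per dimension: the sign of each component, packed 8 dims per byte."""
    return np.packbits(emb > 0, axis=-1)

def hamming(codes, q_code):
    """Hamming distance from q_code to each row of codes."""
    xor = np.bitwise_xor(codes, q_code)
    return np.unpackbits(xor, axis=1).sum(axis=1)

def search(emb, codes, query, k=5, candidates=50):
    """Pre-filter by Hamming distance, then rescore the survivors with full cosine."""
    dist = hamming(codes, binarize(query))
    cand = np.argsort(dist)[:candidates]
    sub = emb[cand]
    cos = (sub @ query) / (np.linalg.norm(sub, axis=1) * np.linalg.norm(query))
    order = np.argsort(-cos)[:k]
    return cand[order], cos[order]

rng = np.random.default_rng(0)
emb = rng.standard_normal((1620, 1024)).astype(np.float32)  # random stand-in index
codes = binarize(emb)  # 128 bytes per embedding instead of 4,096
query = emb[42] + 0.1 * rng.standard_normal(1024).astype(np.float32)
ids, scores = search(emb, codes, query)
```

At this vault size the pre-filter is unlikely to beat brute-force cosine, which is part of why it stays on the tracking list.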