What was the performance improvement achieved?

The semantic search latency was reduced from 148ms to 85ms, a 43% improvement. The cosine similarity computation was the primary bottleneck, taking 60ms initially and dropping to 0.12ms after NumPy vectorization — a 506x speedup.

Why does embedding model size matter?

Larger embedding models (1-2B parameters) produce higher-quality semantic representations than small models (0.6B), improving search relevance without requiring changes to the retrieval algorithm. The bottleneck is model quality, not fusion method.

Can embeddings be stored in float16 instead of float64?

Yes. Float16 provides 50% storage compression with negligible ranking impact at this document scale (<5K documents). The current system defaults to float16 for new embeddings, saving approximately 2GB in database size.

How does hybrid BM25 + cosine retrieval compare to alternatives?

BM25 + cosine with Reciprocal Rank Fusion (k=60) is optimal for collections under 10K documents. At this scale, exhaustive scoring is cheap (<100ms), making sophisticated approximate nearest neighbor indexes like HNSW unnecessary.

Local AI Optimization: From 148ms to 85ms

Q: How does hybrid BM25 + cosine retrieval compare to alternatives?

BM25 + cosine with Reciprocal Rank Fusion (k=60) is optimal for collections under 10K documents. At this scale, exhaustive scoring is cheap (<100ms), making sophisticated approximate nearest neighbor indexes like HNSW unnecessary.

My vault has 1,620 files. Every semantic search query was taking 148ms — not terrible, but the cosine similarity computation alone was eating 60ms of that. On a local AMD RX 9070 XT running ollama-rocm, there was room to do better.

Here's how I profiled the hot path, cut search latency by 43%, and validated that the current embedding model was already optimal.

Profiling the Hot Path

Before optimizing anything, I needed to know where time was actually spent. I instrumented vault-search.py to measure each phase of a hybrid search query:

Phase	Time	Share
DB load (SQLite)	12.5ms	11%
Embedding unpack	40.5ms	35%
Cosine similarity	60.8ms	53%
Sort + return	0.2ms	0%

Over half the time was in pure-Python cosine similarity — iterating 1,620 vectors of 1,024 dimensions each, computing dot products with a list comprehension. The embedding API call (~50ms to ollama) wasn't even the bottleneck. The math was.

NumPy Vectorized Search

The fix was straightforward: replace the pure-Python cosine computation with NumPy matrix multiplication.

def vectorized_search(rows, q_emb, query, path_filter):
    n = len(rows)
    dims = len(q_emb)
    emb_matrix = np.zeros((n, dims), dtype=np.float64)
 
    for i, (path, blob, enorm, summary) in enumerate(rows):
        if len(blob) == dims * 8:
            emb_matrix[i] = np.frombuffer(blob, dtype=np.float64)
        else:
            emb_matrix[i] = np.frombuffer(blob, dtype=np.float32)
 
    q_vec = np.array(q_emb, dtype=np.float64)
    q_norm = np.linalg.norm(q_vec)
 
    # Single matrix multiply: all 1,620 cosine similarities at once
    dots = emb_matrix @ q_vec
    cosines = dots / (norms_arr * q_norm)

Instead of 1,620 individual dot product loops, one @ operator handles everything via BLAS-optimized matrix multiplication. The cosine computation dropped from 60.8ms to 0.12ms — a 506x speedup.

The full pipeline (unpack + cosine) went from 101.3ms to 14.1ms, a 7.2x improvement. The remaining 14ms is mostly np.frombuffer deserialization, which is already fast.

Metric	Before (pure Python)	After (numpy)	Improvement
Semantic search (20 queries)	148ms mean	85ms mean	1.7x faster
Hybrid search (20 queries)	152ms mean	90ms mean	1.7x faster
Cosine computation only	60.8ms	0.12ms	506x faster
Embedding unpack + cosine	101.3ms	14.1ms	7.2x faster

The implementation includes a graceful fallback — if NumPy isn't installed, the pure-Python path runs automatically. I also added an LRU cache for query embeddings (64 entries, ~512KB) to avoid redundant ollama API calls for repeated searches.

Float32 Storage

The embeddings were stored as float64 (8 bytes per dimension) but the model outputs float32 precision. Storing them as float32 cuts storage in half with zero accuracy loss:

Metric	Float64	Float32
Storage per embedding	8,192 bytes	4,096 bytes
Total index (1,620 files)	~12.8 MB	~6.4 MB
Cosine accuracy loss	—	0.0000000000

I changed pack_embedding() to use float32 format and added auto-detection in unpack_embedding() so existing float64 blobs are read correctly. The migration happens naturally as files get re-indexed.

Model Comparison: Qwen3 vs EmbeddingGemma

I tested the current model (qwen3-embedding:0.6b) against Google's embeddinggemma (308M params, MTEB 69.67):

Model	Dims	Speed (warm)	Avg Cosine	MTEB English
qwen3-embedding:0.6b	1024	46.3ms	0.5432	~65-66
embeddinggemma	768	60.4ms	0.4783	69.67

Despite EmbeddingGemma's higher MTEB benchmark score, Qwen3 won on actual vault retrieval tasks — likely because Qwen's training data includes more code and markdown content. It was also 23% faster. No model change needed.

What I Didn't Change

Not everything that could be optimized should be:

Instruction prefix — Qwen3-Embedding supports instruction-aware queries. I tested prepending task-specific instructions to queries. It hurt 3 out of 4 test queries because the indexed documents weren't embedded with matching instructions. Not worth re-indexing 1,620 files for marginal-at-best improvement.

HyDE query expansion — Generating a hypothetical document before searching helps vague conceptual queries but hurts specific ones, and adds 580–1,586ms of LLM generation latency. It stays as an opt-in --expand flag, not the default.

HNSW/IVF indexing — Approximate nearest neighbor algorithms are designed for 10K+ vectors. At 1,620 files, brute-force cosine with NumPy is already sub-millisecond. The overhead of maintaining an ANN index isn't justified.

What's Next

Three potential optimizations that I'm tracking but haven't implemented:

sqlite-vec extension — SIMD-accelerated distance functions as a SQLite extension. Would make search sub-10ms even without NumPy.
Matryoshka dimension reduction — Qwen3-embedding supports truncating from 1024 to 512 dimensions at ~95% quality. Would halve storage and speed up computation 2x. Needs A/B testing to verify acceptable quality.
Binary quantization pre-filter — Store 1-bit binary embeddings alongside float32. Use Hamming distance for fast candidate retrieval, then rescore top-N with full cosine. sqlite-vec supports this natively. The embedding API call (~50ms) is now the dominant cost, not computation. For a 1,620-file vault with semantic + BM25 hybrid search, 90ms total is fast enough that search feels instant in practice. The real win was confirming that the current model and architecture are already well-suited to the task — sometimes the best optimization is knowing when to stop.

Local AI Optimization: From 148ms to 85ms

Profiling the Hot Path

NumPy Vectorized Search

Float32 Storage

Model Comparison: Qwen3 vs EmbeddingGemma

What I Didn't Change

What's Next

Related Articles

Private AI for Legal Work

The Quantum Brain: How Dopamine Encodes Superposition

Building a Fashion Trend Intelligence Pipeline for $3/Month

Sources

About the Author

Vache Sarkissian