
Local AI Optimization: From 148ms to 85ms

5 min read · by Vache Sarkissian
Updated June 3, 2026 · Reviewed March 28, 2026
Tags: ai, optimization, embeddings, local-gpu

Written by Claude (Opus 4.6). Vache prompted, reviewed, and published. The data and benchmarks are real; the prose is AI-generated.

My vault has 1,620 files. Every semantic search query was taking 148ms — not terrible, but the cosine similarity computation alone was eating 60ms of that. On a local AMD RX 9070 XT running ollama-rocm, there was room to do better.

Here's how I profiled the hot path, cut search latency by 43%, and validated that the current embedding model was already optimal.

Profiling the Hot Path

Before optimizing anything, I needed to know where time was actually spent. I instrumented vault-search.py to measure each phase of a hybrid search query:

| Phase | Time | Share |
| --- | --- | --- |
| DB load (SQLite) | 12.5ms | 11% |
| Embedding unpack | 40.5ms | 35% |
| Cosine similarity | 60.8ms | 53% |
| Sort + return | 0.2ms | 0% |

Over half the time was in pure-Python cosine similarity — iterating 1,620 vectors of 1,024 dimensions each, computing dot products with a list comprehension. The embedding API call (~50ms to ollama) wasn't even the bottleneck. The math was.
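A minimal sketch of how per-phase timings like these could be collected. The `timed` context manager, `phase_totals`, and `report` are hypothetical helpers written for illustration, not the actual instrumentation in vault-search.py, and the workload is a small stand-in rather than the real vault.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Accumulated wall-clock seconds per phase name (hypothetical helper)
phase_totals = defaultdict(float)

@contextmanager
def timed(phase):
    """Add the elapsed time of the wrapped block to phase_totals[phase]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        phase_totals[phase] += time.perf_counter() - start

def report(totals):
    """Return (phase, milliseconds, percent share) rows in insertion order."""
    grand_total = sum(totals.values()) or 1.0
    return [(name, secs * 1000.0, 100.0 * secs / grand_total)
            for name, secs in totals.items()]

# Stand-in workload: pure-Python dot products over 200 small vectors
with timed("cosine"):
    vecs = [[float(i + j) for j in range(64)] for i in range(200)]
    q = [1.0] * 64
    scores = [sum(a * b for a, b in zip(v, q)) for v in vecs]
with timed("sort"):
    scores.sort(reverse=True)

for name, ms, share in report(phase_totals):
    print(f"{name:<10}{ms:8.3f} ms {share:5.1f}%")
```

Wrapping each phase in its own `with timed(...)` block keeps the measurement code out of the search logic, and `time.perf_counter()` is the standard-library clock intended for short interval timing.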

NumPy Vectorized Search

The fix was straightforward: replace the pure-Python cosine computation with NumPy matrix multiplication.

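import numpy as np

def vectorized_search(rows, q_emb, query, path_filter):
    n = len(rows)
    dims = len(q_emb)
    emb_matrix = np.zeros((n, dims), dtype=np.float64)
    norms_arr = np.zeros(n, dtype=np.float64)

    for i, (path, blob, enorm, summary) in enumerate(rows):
        # Blob length tells legacy float64 (8 bytes/dim) from float32 (4 bytes/dim)
        if len(blob) == dims * 8:
            emb_matrix[i] = np.frombuffer(blob, dtype=np.float64)
        else:
            emb_matrix[i] = np.frombuffer(blob, dtype=np.float32)
        # Assumption: enorm is the precomputed L2 norm stored with each row
        norms_arr[i] = enorm

    q_vec = np.array(q_emb, dtype=np.float64)
    q_norm = np.linalg.norm(q_vec)

    # Single matrix multiply: all 1,620 cosine similarities at once
    dots = emb_matrix @ q_vec
    cosines = dots / (norms_arr * q_norm)
    # (excerpt ends here; filtering on path_filter and ranking are not shown)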

Instead of 1,620 individual dot product loops, one @ operator handles everything via BLAS-optimized matrix multiplication. The cosine computation dropped from 60.8ms to 0.12ms — a 506x speedup.

The full pipeline (unpack + cosine) went from 101.3ms to 14.1ms, a 7.2x improvement. The remaining 14ms is mostly np.frombuffer deserialization, which is already fast.
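To make the before/after concrete, here is a small sketch that computes the same cosine similarities both ways and checks that they agree. The function names `cosine_pure` and `cosine_batch` and the synthetic data are illustrative assumptions, not code from vault-search.py.

```python
import math
import random
import numpy as np

def cosine_pure(a, b):
    """Cosine similarity with plain Python loops (the slow baseline)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def cosine_batch(matrix, query):
    """Cosine similarity of every row of `matrix` against `query` in one matmul."""
    norms = np.linalg.norm(matrix, axis=1)
    return (matrix @ query) / (norms * np.linalg.norm(query))

# Synthetic stand-in data: 50 vectors of 16 dims (the vault is 1,620 x 1,024)
rng = random.Random(0)
docs = [[rng.uniform(-1, 1) for _ in range(16)] for _ in range(50)]
query = [rng.uniform(-1, 1) for _ in range(16)]

slow = [cosine_pure(d, query) for d in docs]
fast = cosine_batch(np.array(docs), np.array(query))
print(np.allclose(slow, fast))
```

The two paths differ only in where the loop runs: the list comprehension iterates in the Python interpreter, while `@` hands the whole matrix to compiled code.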

| Metric | Before (pure Python) | After (NumPy) | Improvement |
| --- | --- | --- | --- |
| Semantic search (20 queries) | 148ms mean | 85ms mean | 1.7x faster |
| Hybrid search (20 queries) | 152ms mean | 90ms mean | 1.7x faster |
| Cosine computation only | 60.8ms | 0.12ms | 506x faster |
| Embedding unpack + cosine | 101.3ms | 14.1ms | 7.2x faster |

The implementation includes a graceful fallback — if NumPy isn't installed, the pure-Python path runs automatically. I also added an LRU cache for query embeddings (64 entries, ~512KB) to avoid redundant ollama API calls for repeated searches.
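The fallback and the cache could look roughly like the sketch below. It assumes a stubbed backend: `fetch_embedding` is a hypothetical stand-in for the ollama request, and `cached_embedding` and `dot_scores` are illustrative names. Only the 64-entry cache size comes from the article.

```python
from functools import lru_cache

try:
    import numpy as np
    HAVE_NUMPY = True
except ImportError:  # NumPy missing: fall back to the pure-Python path
    np = None
    HAVE_NUMPY = False

api_calls = 0  # counts how often the (stubbed) embedding backend is hit

def fetch_embedding(text):
    """Hypothetical stand-in for the ollama embedding request."""
    global api_calls
    api_calls += 1
    return [float(ord(c) % 7) for c in text[:8].ljust(8)]

@lru_cache(maxsize=64)
def cached_embedding(text):
    # Return a tuple so callers cannot mutate the cached value
    return tuple(fetch_embedding(text))

def dot_scores(rows, q):
    """Dot product of each row with q, using NumPy when it is available."""
    if HAVE_NUMPY:
        m = np.asarray(rows, dtype=np.float64)
        return (m @ np.asarray(q, dtype=np.float64)).tolist()
    return [sum(a * b for a, b in zip(row, q)) for row in rows]

q1 = cached_embedding("local gpu")
q2 = cached_embedding("local gpu")  # served from the cache, no backend call
print(api_calls, dot_scores([q1, q2], q1))
```

`functools.lru_cache` keys on the query string, so an identical repeated search skips the embedding request entirely.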

Float32 Storage

The embeddings were stored as float64 (8 bytes per dimension) but the model outputs float32 precision. Storing them as float32 cuts storage in half with zero accuracy loss:

| Metric | Float64 | Float32 |
| --- | --- | --- |
| Storage per embedding | 8,192 bytes | 4,096 bytes |
| Total index (1,620 files) | ~12.8 MB | ~6.4 MB |
| Cosine accuracy loss | | 0.0000000000 |
I changed pack_embedding() to use float32 format and added auto-detection in unpack_embedding() so existing float64 blobs are read correctly. The migration happens naturally as files get re-indexed.
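A sketch of what length-based auto-detection can look like. The names `pack_embedding` and `unpack_embedding` come from the article, but the bodies here are assumptions: they use the standard-library `struct` module and a little-endian layout, which may differ from the real implementation.

```python
import struct

DIMS = 1024  # embedding width of qwen3-embedding:0.6b, per the article

def pack_embedding(vec):
    """Serialize a vector as little-endian float32 (4 bytes per dimension)."""
    return struct.pack(f"<{len(vec)}f", *vec)

def unpack_embedding(blob, dims=DIMS):
    """Decode a blob, detecting float64 vs float32 from its byte length."""
    if len(blob) == dims * 8:
        return list(struct.unpack(f"<{dims}d", blob))  # legacy float64 blob
    if len(blob) == dims * 4:
        return list(struct.unpack(f"<{dims}f", blob))  # new float32 blob
    raise ValueError(f"unexpected blob size {len(blob)} for {dims} dims")

# Values exactly representable in float32 survive the round trip unchanged
vec = [i / 1024.0 for i in range(DIMS)]
legacy = struct.pack(f"<{DIMS}d", *vec)  # how an old float64 row would look
compact = pack_embedding(vec)
print(len(legacy), len(compact))
```

Because the dimension count is fixed, the byte length alone distinguishes the two formats, so old and new rows can coexist in the same table during the gradual migration.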

Model Comparison: Qwen3 vs EmbeddingGemma

I tested the current model (qwen3-embedding:0.6b) against Google's embeddinggemma (308M params, MTEB 69.67):

| Model | Dims | Speed (warm) | Avg Cosine | MTEB English |
| --- | --- | --- | --- | --- |
| qwen3-embedding:0.6b | 1024 | 46.3ms | 0.5432 | ~65-66 |
| embeddinggemma | 768 | 60.4ms | 0.4783 | 69.67 |

Despite EmbeddingGemma's higher MTEB benchmark score, Qwen3 won on actual vault retrieval tasks — likely because Qwen's training data includes more code and markdown content. It was also 23% faster. No model change needed.

What I Didn't Change

Not everything that could be optimized should be:

Instruction prefix — Qwen3-Embedding supports instruction-aware queries. I tested prepending task-specific instructions to queries. It hurt 3 out of 4 test queries because the indexed documents weren't embedded with matching instructions. Not worth re-indexing 1,620 files for marginal-at-best improvement.

HyDE query expansion — Generating a hypothetical document before searching helps vague conceptual queries but hurts specific ones, and adds 580–1,586ms of LLM generation latency. It stays as an opt-in --expand flag, not the default.

HNSW/IVF indexing — Approximate nearest neighbor algorithms are designed for 10K+ vectors. At 1,620 files, brute-force cosine with NumPy is already sub-millisecond. The overhead of maintaining an ANN index isn't justified.
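For the brute-force point above, an exact top-k scan at this scale can be sketched as follows. The function name `top_k_cosine` and the random stand-in index are assumptions for illustration; only the 1,620 x 1,024 shape comes from the article.

```python
import numpy as np

def top_k_cosine(emb_matrix, q_vec, k=10):
    """Exact top-k by cosine similarity using a full scan (no ANN index)."""
    norms = np.linalg.norm(emb_matrix, axis=1)
    cosines = (emb_matrix @ q_vec) / (norms * np.linalg.norm(q_vec))
    k = min(k, len(cosines))
    # argpartition selects the k best without sorting everything;
    # only those k are then put in descending order
    best = np.argpartition(-cosines, k - 1)[:k]
    best = best[np.argsort(-cosines[best])]
    return best, cosines[best]

# Random stand-in for the index: 1,620 vectors x 1,024 dims
rng = np.random.default_rng(0)
index = rng.standard_normal((1620, 1024)).astype(np.float32)
# A query that is a slightly perturbed copy of row 42
query = index[42] + 0.01 * rng.standard_normal(1024).astype(np.float32)

ids, scores = top_k_cosine(index, query, k=5)
print(ids[0], float(scores[0]))
```

A full scan like this returns exact results and needs no index maintenance, which is the trade-off that makes an ANN structure hard to justify for a collection this small.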

What's Next

Three potential optimizations that I'm tracking but haven't implemented:

  1. sqlite-vec extension — SIMD-accelerated distance functions as a SQLite extension. Would make search sub-10ms even without NumPy.

  2. Matryoshka dimension reduction — Qwen3-embedding supports truncating from 1024 to 512 dimensions at ~95% quality. Would halve storage and speed up computation 2x. Needs A/B testing to verify acceptable quality.

  3. Binary quantization pre-filter — Store 1-bit binary embeddings alongside float32. Use Hamming distance for fast candidate retrieval, then rescore top-N with full cosine. sqlite-vec supports this natively.

The embedding API call (~50ms) is now the dominant cost, not computation. For a 1,620-file vault with semantic + BM25 hybrid search, 90ms total is fast enough that search feels instant in practice. The real win was confirming that the current model and architecture are already well-suited to the task. Sometimes the best optimization is knowing when to stop.
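Two of the tracked ideas, Matryoshka truncation and the binary pre-filter, can be sketched in plain NumPy. Everything below is an illustrative assumption rather than the planned implementation: the function names are hypothetical, the data is random, and it does not use sqlite-vec. Random vectors also say nothing about retrieval quality after truncation, which depends on how the model was trained.

```python
import numpy as np

def truncate_matryoshka(emb, dims=512):
    """Keep the first `dims` components of each row, then re-normalize."""
    cut = emb[:, :dims]
    return cut / np.linalg.norm(cut, axis=1, keepdims=True)

def binarize(emb):
    """1 bit per dimension (component > 0), packed 8 bits per byte."""
    return np.packbits(emb > 0, axis=1)

def hamming(codes, q_code):
    """Hamming distance from each packed row to the packed query."""
    return np.unpackbits(codes ^ q_code, axis=1).sum(axis=1)

def search(emb, codes, q, n_candidates=50, k=5):
    """Binary pre-filter by Hamming distance, then exact cosine rescoring."""
    q_code = np.packbits(q > 0)
    cand = np.argsort(hamming(codes, q_code))[:n_candidates]
    sub = emb[cand]
    cos = (sub @ q) / (np.linalg.norm(sub, axis=1) * np.linalg.norm(q))
    order = np.argsort(-cos)[:k]
    return cand[order], cos[order]

# Random stand-in index; the query is a noisy copy of row 7
rng = np.random.default_rng(1)
emb = rng.standard_normal((1620, 1024)).astype(np.float32)
q = emb[7] + 0.05 * rng.standard_normal(1024).astype(np.float32)

codes = binarize(emb)             # 1,024 bits -> 128 bytes per vector
ids, cos = search(emb, codes, q)
small = truncate_matryoshka(emb)  # 1,024 -> 512 dims
print(ids[0], codes.shape, small.shape)
```

The pre-filter only narrows the candidate set; the final ranking still comes from full-precision cosine, so any accuracy cost is limited to true neighbors that the Hamming pass fails to shortlist.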


About the Author

Vache Sarkissian

Building research infrastructure and products at the intersection of knowledge systems and machine learning. Creator of Linesheet Pro, vault-search, and the vachsark learning engine.
