
Local Model Shootout: Finding the Right LLM for Every Task

2 min read · by Vache Sarkissian · Updated June 3, 2026 · Reviewed March 29, 2026
Tags: ai, local-llm, ollama, benchmarks, gemma3, qwen3

Written by Claude (Opus 4.6). Vache prompted, reviewed, and published. The data and benchmarks are real; the prose is AI-generated.

A single general-purpose model rarely excels at every task. With inference speed already optimized, the next question was: which models should run which tasks in a high-frequency automation pipeline?

The context: My vault system runs 22 daily automated tasks (dependency monitoring, commit reviews, knowledge analysis, trend scanning) on local models within 16GB of VRAM. Until now, every task used one of two models: qwen3:8b for general-purpose work and qwen2.5-coder:14b for code-specific work. This was simple but suboptimal, because different task types have different quality-speed tradeoffs.

The solution: Benchmark six models (qwen3 variants, coder, reasoning, MoE) across speed (token generation rate) and quality (reasoning, math, code analysis). Route tasks to specialized models: fastest for high-frequency health checks, code-trained for diffs and reviews, reasoning-optimized for deep analysis. Ollama handles automatic model swapping in and out of VRAM at zero cost.
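The routing idea and the speed metric can be sketched in a few lines of Python. This is a minimal illustration, not the vault system's actual code: the task category names are hypothetical, and only qwen3:8b and qwen2.5-coder:14b are model tags named in this post (the other two are placeholders). The speed calculation assumes the `eval_count` and `eval_duration` fields that Ollama returns from its generate endpoint, with the duration in nanoseconds.

```python
from typing import Mapping

# Task categories are hypothetical labels, not the vault system's real names.
# Only qwen3:8b and qwen2.5-coder:14b are named in this post; the other tags
# are placeholders for whichever models win the benchmark.
ROUTES: dict[str, str] = {
    "health_check": "placeholder-fast-model",
    "code_review": "qwen2.5-coder:14b",
    "deep_analysis": "placeholder-reasoning-model",
}
DEFAULT_MODEL = "qwen3:8b"


def pick_model(task_type: str) -> str:
    """Return the model tag for a task type, falling back to the general model."""
    return ROUTES.get(task_type, DEFAULT_MODEL)


def tokens_per_second(response: Mapping[str, int]) -> float:
    """Generation speed computed from an Ollama generate response body.

    eval_count is the number of generated tokens; eval_duration is the
    time spent generating them, in nanoseconds.
    """
    seconds = response["eval_duration"] / 1e9
    if seconds <= 0:
        raise ValueError("eval_duration must be positive")
    return response["eval_count"] / seconds


if __name__ == "__main__":
    # Made-up numbers for illustration, not benchmark results.
    sample = {"eval_count": 100, "eval_duration": 2_000_000_000}
    print(pick_model("code_review"), tokens_per_second(sample))
```

Keeping the routes in a plain dictionary means a benchmark rerun only changes data, not control flow, and unknown task types fall back to the general-purpose model.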

The Candidates

With 16GB of VRAM on the RX 9070 XT and Ollama handling automatic model swapping, I can run any model that fits in memory, one at a time, loaded on demand. The constraints: Q4_K_M quantization (best quality-to-size ratio), and output that is clean and concise without hand-holding.
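A rough way to pre-screen candidates against the 16GB limit is a back-of-the-envelope size estimate. The sketch below makes two assumptions that do not come from this post: Q4_K_M averages roughly 4.85 bits per weight, and a flat 2 GB is reserved for the KV cache and runtime buffers. Both are approximations, and the real footprint depends on context length and architecture, so treat the result as a filter, not a guarantee.

```python
# Assumed average bits per weight for Q4_K_M; an approximation, not an exact figure.
Q4_K_M_BITS_PER_WEIGHT = 4.85
# Hypothetical flat allowance for KV cache and runtime buffers, in GB.
OVERHEAD_GB = 2.0


def estimated_size_gb(params_billion: float,
                      bits_per_weight: float = Q4_K_M_BITS_PER_WEIGHT) -> float:
    """Approximate weight size in GB: parameters (billions) * bits / 8."""
    return params_billion * bits_per_weight / 8


def fits_in_vram(params_billion: float, vram_gb: float = 16.0,
                 overhead_gb: float = OVERHEAD_GB) -> bool:
    """True if the estimated weights plus overhead fit in the given VRAM."""
    return estimated_size_gb(params_billion) + overhead_gb <= vram_gb


if __name__ == "__main__":
    for size in (8, 14, 32):
        print(f"{size}B: ~{estimated_size_gb(size):.1f} GB, fits={fits_in_vram(size)}")
```

Under these assumptions a 14B model lands near 8.5 GB of weights and fits comfortably, while a dense 32B model does not.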

[... Rest of content ...]

About the Author

Vache Sarkissian

Building research infrastructure and products at the intersection of knowledge systems and machine learning. Creator of Linesheet Pro, vault-search, and the vachsark learning engine.
