A single general-purpose model rarely excels at every task. After optimizing inference speed, the question became: which models should run which tasks in a high-frequency automation pipeline?
The context: My vault system runs 22 daily automated tasks (dependency monitoring, commit reviews, knowledge analysis, trend scanning) on local models with 16GB VRAM. Until now, all tasks used one of two models: qwen3:8b for general purposes and qwen2.5-coder:14b for code-specific work. This was simple but suboptimal—different task types have different quality-speed tradeoffs.
The solution: Benchmark six models (qwen3 variants, coder, reasoning, MoE) on speed (token generation rate) and quality (reasoning, math, code analysis), then route each task to a specialized model: the fastest for high-frequency health checks, a code-trained model for diffs and reviews, and a reasoning-optimized model for deep analysis. Ollama swaps models in and out of VRAM automatically, so routing adds no manual model management.
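To make the two ideas above concrete, here is a minimal sketch of measuring generation speed and routing a task to a model. It assumes the speed figure comes from the `eval_count` and `eval_duration` fields that Ollama returns from `/api/generate` (duration in nanoseconds). The task names, the role names, and the `pick_model` helper are hypothetical and stand in for whatever the real pipeline uses.

```python
from typing import Mapping

def tokens_per_second(response: Mapping[str, int]) -> float:
    """Generation speed from an Ollama /api/generate response body.

    eval_count is the number of generated tokens;
    eval_duration is the generation time in nanoseconds.
    """
    return response["eval_count"] * 1e9 / response["eval_duration"]

# Hypothetical task-to-role table. The real pipeline has 22 tasks;
# these entries are illustrative only.
TASK_ROLES = {
    "health_check": "fast",
    "dependency_monitoring": "fast",
    "commit_review": "code",
    "trend_scanning": "general",
    "knowledge_analysis": "reasoning",
}

def pick_model(
    task: str,
    role_models: Mapping[str, str],
    default_role: str = "general",
) -> str:
    """Return the model tag for a task, falling back to the default role
    when the task is unknown or no model is assigned to its role."""
    role = TASK_ROLES.get(task, default_role)
    return role_models.get(role, role_models[default_role])
```

Which model fills each role is decided by the benchmark results, so `role_models` is passed in rather than hard-coded. For example, with `{"general": "qwen3:8b", "code": "qwen2.5-coder:14b"}`, a `commit_review` task goes to the coder model and anything without an assigned role falls back to the general one.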
The Candidates
With 16GB of VRAM on the RX 9070 XT and Ollama handling automatic model swapping, I can run any model that fits in memory — one at a time, loaded on demand. The constraints: Q4_K_M quantization (best quality-to-size ratio), and the model needs to produce clean, concise output without hand-holding.
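As a rough sanity check on the "fits in memory" constraint, the sketch below estimates the weight footprint of a Q4_K_M model. It assumes Q4_K_M averages about 4.8 bits per weight and reserves a flat 2 GB for KV cache and runtime overhead. Both numbers are approximations chosen for illustration, not measurements from this setup, and the function names are hypothetical.

```python
def q4_weights_gb(params_billion: float, bits_per_weight: float = 4.8) -> float:
    """Approximate size of the quantized weights in GB.

    bits_per_weight = 4.8 is an assumed average for Q4_K_M.
    """
    return params_billion * bits_per_weight / 8

def fits_in_vram(
    params_billion: float,
    vram_gb: float = 16.0,
    overhead_gb: float = 2.0,  # assumed allowance for KV cache and runtime
    bits_per_weight: float = 4.8,
) -> bool:
    """True if the estimated weights plus overhead fit in the VRAM budget."""
    return q4_weights_gb(params_billion, bits_per_weight) + overhead_gb <= vram_gb
```

Under these assumptions a 14B model needs roughly 8.4 GB for weights and fits comfortably, while a 32B dense model at about 19.2 GB does not. Actual usage depends on context length and the runtime, so treat this as a first filter rather than a guarantee.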
[... Rest of content ...]