Which model is fastest?

qwen3:8b is the fastest candidate at 91 tokens per second of generation, nearly 2x the speed of the candidates above 12B parameters. That makes it the natural choice for high-frequency automated tasks, such as health checks and dependency monitoring, that run multiple times per day.
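A tokens-per-second figure like the one above can be derived from the timing metadata Ollama attaches to each completed response. The sketch below assumes the `eval_count` (generated tokens) and `eval_duration` (nanoseconds) fields of the Ollama API; how the numbers in this article were actually measured is not stated, and the sample values are made up to illustrate the arithmetic.

```python
def tokens_per_second(result: dict) -> float:
    """Generation speed computed from an Ollama response's timing metadata."""
    # eval_count is the number of generated tokens,
    # eval_duration is the generation time in nanoseconds.
    return result["eval_count"] * 1e9 / result["eval_duration"]


# Hypothetical metadata, chosen only to illustrate the calculation.
sample = {"eval_count": 455, "eval_duration": 5_000_000_000}
print(f"{tokens_per_second(sample):.1f} tok/s")
```

Prompt processing is reported separately (`prompt_eval_count`, `prompt_eval_duration`), so this number reflects generation speed only.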

Why are reasoning and math tests important?

Reasoning and math tests expose quality differences between models that simple benchmarks miss. Multi-digit multiplication (347 × 23 = 7,981) requires multi-step carrying, and some models get it wrong; the farmer-and-sheep riddle tests whether a model parses the wording of a question correctly. These tests revealed that devstral, despite its high SWE-bench ranking, fails basic reasoning tasks.
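A check like the multiplication test is easy to score automatically. This is a minimal sketch, not the harness used for this article: it assumes the model states its final result as the last number in the reply, and the helper names are hypothetical.

```python
import re
from typing import Optional

EXPECTED = 347 * 23  # 7981


def last_integer(text: str) -> Optional[int]:
    """Return the last integer in a reply, tolerating thousands separators."""
    matches = re.findall(r"\d[\d,]*", text)
    if not matches:
        return None
    return int(matches[-1].replace(",", ""))


def passes_multiplication_test(reply: str) -> bool:
    """True when the final number in the reply is the correct product."""
    return last_integer(reply) == EXPECTED
```

Taking the last number rather than the first matters because a reply usually restates the operands (347 and 23) before giving the product.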

What is the thinking mode problem?

Both qwen3 and phi4-reasoning have a 'thinking mode' in which the model reasons before answering. The difference is where that reasoning ends up. qwen3 returns thinking and response as distinct fields, so an automated pipeline can read only the response. phi4-reasoning writes its <think>...</think> block into the response itself, so the reasoning consumes the token budget before the actual answer appears. That makes it unsuitable for fast automated tasks.
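In a pipeline, the difference shows up in how the answer is extracted. The sketch below assumes a result dictionary shaped like an Ollama `/api/generate` reply, with the answer in `response` and any separated reasoning in `thinking`, and it assumes inline reasoning is wrapped in `<think>…</think>` tags. The function name is hypothetical.

```python
import re

# Matches an inline reasoning block, including newlines inside it.
THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)


def extract_answer(result: dict) -> str:
    """Return only the final answer from a generate-style result."""
    # Reasoning delivered in a separate "thinking" field is never read here.
    text = result.get("response", "")
    # Fallback for models that inline their reasoning in the response text.
    return THINK_BLOCK.sub("", text).strip()


separated = {"thinking": "long reasoning", "response": "9"}
inline = {"response": "<think>\nlong reasoning\n</think>\n\n9"}
print(extract_answer(separated), extract_answer(inline))
```

Stripping the tags afterwards only cleans up the text. The tokens were still generated, and if the budget runs out before the closing tag, no answer is left to recover, so the speed cost remains.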

What is the three-model strategy?

The routing uses three models: qwen3:8b for speed (91 tok/s, daily high-frequency tasks), qwen2.5-coder:14b for code analysis (code-trained, good for diffs and reviews), and gemma3:12b for quality analysis (50 tok/s, weekly reasoning-heavy tasks). Ollama automatically swaps models in and out of VRAM, so the setup needs no additional hardware and adds no cost.
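The routing itself can be as small as a lookup table. The model tags below come from the article; the category names, task names, and the fallback to the fastest model are illustrative assumptions, not the actual vault configuration.

```python
# Model tags from the article, keyed by hypothetical category names.
ROUTES = {
    "speed": "qwen3:8b",
    "code": "qwen2.5-coder:14b",
    "quality": "gemma3:12b",
}

# Hypothetical task names mapped to a category.
TASK_CATEGORY = {
    "health-check": "speed",
    "dependency-monitor": "speed",
    "commit-review": "code",
    "weekly-analysis": "quality",
}


def model_for(task: str) -> str:
    """Pick the model for a task, falling back to the fastest model."""
    category = TASK_CATEGORY.get(task, "speed")
    return ROUTES[category]


print(model_for("commit-review"))
```

Because Ollama loads whichever model a request names, the pipeline only has to pass the returned tag along with each request.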

Local Model Shootout: Finding the Right LLM for Every Task

A single general-purpose model rarely excels at every task. After optimizing inference speed, the question became: which models should run which tasks in a high-frequency automation pipeline?

The context: My vault system runs 22 daily automated tasks (dependency monitoring, commit reviews, knowledge analysis, trend scanning) on local models within 16GB of VRAM. Until now, every task used one of two models: qwen3:8b for general purposes and qwen2.5-coder:14b for code-specific work. This was simple but suboptimal, because different task types have different quality-speed tradeoffs.

The solution: Benchmark six models (qwen3 variants, coder, reasoning, MoE) across speed (token generation rate) and quality (reasoning, math, code analysis). Route tasks to specialized models: fastest for high-frequency health checks, code-trained for diffs and reviews, reasoning-optimized for deep analysis. Ollama handles automatic model swapping in and out of VRAM at zero cost.

The Candidates

With 16GB of VRAM on the RX 9070 XT and Ollama handling automatic model swapping, I can run any model that fits in memory, one at a time, loaded on demand. The constraints: Q4_K_M quantization (best quality-to-size ratio), and clean, concise output without hand-holding.

[... Rest of content ...]

About the Author

Vache Sarkissian