I run a local AI pipeline on an AMD RX 9070 XT (RDNA 4, 16GB VRAM, 640 GB/s theoretical bandwidth). It handles heartbeat tasks, code review, dependency monitoring, and semantic search — all at zero API cost via Ollama.
The starting point: qwen3:8b at 82 tok/s and qwen2.5-coder:14b at 55 tok/s, both running through Ollama's ROCm HIP backend with HSA_OVERRIDE_GFX_VERSION=12.0.0, because ROCm doesn't yet natively support this card's gfx1201 target.
The question was simple: are we actually at the hardware ceiling, or leaving performance on the table?
The Bandwidth Math
LLM token generation is memory-bandwidth-bound. Each token requires reading all model weights from VRAM once (matrix-vector multiply). The theoretical calculation:
qwen3:8b (Q4_K_M): 4.92 GB model weights
At 82.9 tok/s: 4.92 GB × 82.9 = 407.9 GB/s effective bandwidth
Theoretical max: 640 GB/s
Efficiency: 63.7%
We're at 63-72% of theoretical bandwidth depending on the model. That's 28-37% headroom, so this is not the hardware ceiling. Something is leaving bandwidth on the table.
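The arithmetic above can be reproduced with a small helper. This is a minimal sketch, assuming the simplified model from the text (every generated token reads the full weight file once); the function name `bandwidth_report` is made up for illustration.

```shell
# Effective bandwidth implied by a decode rate, assuming each generated
# token reads all model weights from VRAM exactly once.
# Args: model size in GB, measured tok/s, peak bandwidth in GB/s.
bandwidth_report() {
    awk -v size_gb="$1" -v tok_s="$2" -v peak="$3" 'BEGIN {
        eff = size_gb * tok_s
        printf "%.1f GB/s effective, %.1f%% of peak, ceiling %.1f tok/s\n",
            eff, 100 * eff / peak, peak / size_gb
    }'
}

# qwen3:8b numbers from the text
bandwidth_report 4.92 82.9 640
```

The last figure is the decode rate the same model would reach if it could use 100% of the theoretical bandwidth, which no real workload does.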
Where the Bandwidth Goes
Context-length scaling benchmarks revealed the first clue. Token generation speed at different context lengths:
| Context | qwen3:8b tok/s | Slowdown |
|---|---|---|
| 11 tokens | 83.6 | baseline |
| 1,212 tokens | 82.9 | -0.8% |
| 4,096 tokens | 78.5 | -6.1% |
| 16,384 tokens | 62.8 | -24.9% |
As context grows, the KV cache grows. KV cache reads achieve only ~250-300 GB/s (strided per-head access across attention heads) vs ~460 GB/s for weight reads (sequential). The GPU's memory controller handles sequential reads well but struggles with the scattered access pattern of multi-head attention.
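One way to read the table is as added latency per token rather than lost throughput. The sketch below converts the measured rates into milliseconds per token and the extra time relative to the short-context baseline; `context_cost` is a hypothetical helper name, and the only inputs are the numbers from the table.

```shell
# Per-token latency added by context, relative to a short-context baseline.
# Args: baseline tok/s, then one "context_tokens:tok_s" pair per measurement.
context_cost() (
    base="$1"; shift
    for pair in "$@"; do
        echo "${pair%%:*} ${pair##*:}"
    done | awk -v base="$base" '{
        ms = 1000 / $2
        printf "%d tokens: %.2f ms/token (+%.2f ms, %.1f%% slower)\n",
            $1, ms, ms - 1000 / base, 100 * (1 - $2 / base)
    }'
)

context_cost 83.6 1212:82.9 4096:78.5 16384:62.8
```

At 16,384 tokens each generated token costs roughly 4 ms more than at baseline, which is the time spent on the growing KV cache rather than on the weights.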
The Vulkan Discovery
While researching alternatives to Ollama, I found mentions that RADV (Mesa's Vulkan driver for AMD) sometimes outperforms ROCm on consumer RDNA cards. The reasoning: RADV has native gfx1201 shader compilation while ROCm requires an override that targets the wrong microarchitecture variant.
I built llama.cpp from source with both backends and ran a three-way comparison:
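# Both configure commands take ".." as the source directory, so they
# assume the current directory sits one level below the llama.cpp root.
# HIP build (ROCm)
cmake -B build-hip -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1200 \
    -DGGML_HIP_ROCWMMA_FATTN=ON ..
cmake --build build-hip -j$(nproc)
# Vulkan build (RADV)
cmake -B build-vulkan -DGGML_VULKAN=ON ..
cmake --build build-vulkan -j$(nproc)
The results: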
| Backend | qwen3:8b tg | coder:14b tg | Implied Bandwidth |
|---|---|---|---|
| Ollama (ROCm) | 82.9 tok/s | 55.2 tok/s | 406-464 GB/s |
| llama.cpp HIP | 84.3 tok/s | 54.9 tok/s | 413-462 GB/s |
| llama.cpp Vulkan | 100.0 tok/s | 60.1 tok/s | 490-505 GB/s |
Vulkan (RADV) was 20% faster on the 8B model and 9% faster on the 14B. It achieved 77-79% of theoretical bandwidth vs ROCm's 63-72%.
Why RADV Wins on RDNA 4
Running vulkaninfo revealed the answer:
deviceName = AMD Radeon RX 9070 XT (RADV GFX1201)
RADV reports GFX1201 — the correct silicon variant. ROCm requires HSA_OVERRIDE_GFX_VERSION=12.0.0, telling the compiler to generate code for gfx1200. I tested building for gfx1201 explicitly with HIP — identical performance to gfx1200. The ISA is the same; the performance gap comes from RADV's shader compiler generating better code for the compute workload, not from the target variant.
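To check which target RADV reports on another machine, the device name line can be parsed directly. A small sketch, assuming the `deviceName = ... (RADV GFX...)` format shown above; `radv_gfx_target` is a hypothetical helper name.

```shell
# Extract the GFX target from vulkaninfo-style text on stdin, e.g.
#   deviceName = AMD Radeon RX 9070 XT (RADV GFX1201)
# Prints only the first match; prints nothing if no RADV device is listed.
radv_gfx_target() {
    sed -n 's/.*deviceName.*(RADV \(GFX[0-9A-Za-z]*\)).*/\1/p' | head -n 1
}

# Typical use (requires the vulkaninfo tool to be installed):
#   vulkaninfo --summary | radv_gfx_target
```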
Getting Ollama to Use Vulkan
Ollama bundles both backends. On Arch Linux, ollama-rocm provides libggml-hip.so and ollama-vulkan provides libggml-vulkan.so. Problem: when both are present, Ollama prefers ROCm.
Setting OLLAMA_LLM_LIBRARY=vulkan didn't change the selection in v0.15.5. The fix was surgical — disable the ROCm library:
sudo mv /usr/lib/ollama/libggml-hip.so /usr/lib/ollama/libggml-hip.so.disabled
sudo systemctl restart ollama
The result through Ollama (with its Go HTTP wrapper overhead):
| Model | ROCm | Vulkan | Improvement |
|---|---|---|---|
| qwen3:8b | 82.9 tok/s | 91.0 tok/s | +9.8% |
| coder:14b | 55.2 tok/s | 58.3 tok/s | +5.6% |
Not the full +20% from raw llama.cpp (Ollama's wrapper adds ~10% overhead on prompt processing), but a free speed boost with zero workflow changes.
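Because the switch is just a file rename, it is easy to wrap in a reversible toggle for A/B testing. This is a sketch only: `hip_backend` and `OLLAMA_LIB_DIR` are names invented here, and the default path is the Arch Linux location mentioned above.

```shell
# Hypothetical helper: toggle Ollama's HIP backend by renaming the library.
# OLLAMA_LIB_DIR is an assumed variable; it defaults to the Arch path.
OLLAMA_LIB_DIR="${OLLAMA_LIB_DIR:-/usr/lib/ollama}"

hip_backend() (
    lib="$OLLAMA_LIB_DIR/libggml-hip.so"
    case "$1" in
        off) [ -e "$lib" ] && mv "$lib" "$lib.disabled" ;;
        on)  [ -e "$lib.disabled" ] && mv "$lib.disabled" "$lib" ;;
        *)   echo "usage: hip_backend on|off" >&2; return 2 ;;
    esac
)
```

Shell functions are not visible to `sudo`, so this would need to be sourced in a root shell, followed by `systemctl restart ollama`. A package upgrade will most likely reinstall the original library, so the rename may need repeating afterwards.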
The Pipeline Optimization That Mattered More
While profiling the GPU, I also profiled the scripts that call the GPU. The heartbeat system runs 22 automated tasks daily — dependency monitoring, research radar, vault health checks, commit reviews.
The deps-monitor script was taking 90 seconds. Profiling showed 51 of those seconds were the collector script running npm audit and npm outdated sequentially across 3 projects — 6-9 network calls at 8-15 seconds each, all independent.
The fix was three lines of bash:
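# Before: 51 seconds (sequential)
for project in "${PROJECTS[@]}"; do
    audit=$(cd "$path" && npm audit --json)        # blocks 8-15s
    outdated=$(cd "$path" && npm outdated --json)  # blocks 8-15s
done
# After: 5 seconds (parallel)
# $path is the project's directory (lookup omitted, as above).
# Output files are keyed by project so background jobs never collide.
tmp_dir=$(mktemp -d)
for project in "${PROJECTS[@]}"; do
    (cd "$path" && npm audit --json > "$tmp_dir/$project.audit.json") &
    (cd "$path" && npm outdated --json > "$tmp_dir/$project.outdated.json") &
done
wait  # once, after the loop, so the calls for all projects overlap
51s → 5s. 10x improvement. More practical impact than any GPU tuning.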
I applied the same pattern to every script with sequential independent I/O:
| Script | Before | After | Speedup |
|---|---|---|---|
| research-radar (3 GitHub API calls) | 19.8s | 0.45s | 44x |
| industry-scan (30 HN fetches + 2 GitHub) | ~30s | ~8s | 3.8x |
| vault-health (3 Obsidian CLI calls) | ~16s | ~10s | 1.6x |
The pattern is always the same: find sequential independent operations, launch them as background processes, collect results from temp files.
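The pattern generalizes to a small reusable function. The following is a minimal sketch, not the scripts' actual code: `run_parallel` is a hypothetical name, and it assumes each command is independent and writes its result to stdout.

```shell
# Run each argument as a shell command in the background, then print the
# outputs in argument order once all of them have finished.
run_parallel() (
    tmpdir=$(mktemp -d) || exit 1
    trap 'rm -rf "$tmpdir"' EXIT
    i=0
    for cmd in "$@"; do
        i=$((i + 1))
        sh -c "$cmd" > "$tmpdir/$i.out" &
    done
    wait    # block until every background job has exited
    n=1
    while [ "$n" -le "$i" ]; do
        cat "$tmpdir/$n.out"
        n=$((n + 1))
    done
)

# Output order follows the argument order, not the completion order.
run_parallel 'sleep 1; echo first' 'echo second'
```

Total wall time is roughly that of the slowest command instead of the sum of all of them. Note that a bare `wait` discards the exit codes of the background jobs, so failures have to be detected from the collected output.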
The Three-Model Strategy
With the Vulkan backend stable, I benchmarked additional models to optimize the task routing:
| Model | Speed | Math (347×23) | Reasoning | Best For |
|---|---|---|---|---|
| qwen3:8b | 91 tok/s | thinks* | thinks* | Fast daily tasks |
| gemma3:12b | 50 tok/s | 7981 ✓ | 9 ✓ | Quality reasoning, vision |
| qwen2.5-coder:14b | 58 tok/s | 8021 ✗ | 9 ✓ | Code analysis |
| phi4-reasoning:14b | 48 tok/s | 7981 ✓ | 9 ✓ | Too verbose |
| devstral:24b | 27 tok/s | 8001 ✗ | 8 ✗ | Too slow |
*qwen3 models use chain-of-thought thinking that consumes the token budget on simple questions
The final lineup: qwen3:8b for speed-critical daily tasks, qwen2.5-coder:14b for commit reviews and rule analysis, and gemma3:12b for weekly quality-sensitive tasks and anything that benefits from vision capability. Ollama swaps them in and out of VRAM automatically.
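The routing itself can be as simple as a lookup. In this sketch the task category names are hypothetical labels; only the model tags and their assigned roles come from the lineup above.

```shell
# Map a task category to a model tag. Category names are made up for
# illustration; anything unrecognized falls through to the fast model.
model_for_task() {
    case "$1" in
        commit-review|rule-analysis) echo "qwen2.5-coder:14b" ;;
        weekly-quality|vision)       echo "gemma3:12b" ;;
        *)                           echo "qwen3:8b" ;;
    esac
}

# Example call (assumes the ollama CLI and a prompt in $prompt):
#   ollama run "$(model_for_task commit-review)" "$prompt"
```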
What I Learned
Profile before optimizing. The GPU was at 63% efficiency, not at the hardware ceiling. The pipeline scripts had 10-44x speedups hiding in plain sight. Neither was obvious without measuring.
The right driver matters more than the right flags. I spent 5 optimization loops tuning ROCm environment variables (ROCBLAS_USE_HIPBLASLT, HIP_FORCE_DEV_KERNARG, power profiles). None of them moved the needle. Switching to RADV Vulkan — a different driver entirely — gave +20% for free.
Pipeline optimization beats model optimization. The 44x speedup in research-radar.sh (19.8s → 0.45s) had more daily impact than the GPU tuning. Three lines of bash parallelization delivered more than weeks of kernel tuning plausibly could have.
Small models can't reliably multiply. Two of the four models that gave a direct answer got 347×23 wrong. Don't test your GPU backend with math questions. Test with the actual workload.