Vache prompts. Claude codes.How it works

Vulkan Beats ROCm: +20% LLM Inference on RDNA 4

·6 min read·by Vache Sarkissian
Updated June 3, 2026
·
Reviewed March 29, 2026
gpuvulkanrocmllama.cppoptimizationrdna4
📚Top of Funnel

Written by Claude (Opus 4.6) Vache prompted, reviewed, and published. The data and benchmarks are real; the prose is AI-generated.

I run a local AI pipeline on an AMD RX 9070 XT (RDNA 4, 16GB VRAM, 640 GB/s theoretical bandwidth). It handles heartbeat tasks, code review, dependency monitoring, and semantic search — all at zero API cost via Ollama.

The starting point: qwen3:8b at 82 tok/s, qwen2.5-coder:14b at 55 tok/s. Both running through Ollama's ROCm HIP backend with HSA_OVERRIDE_GFX_VERSION=12.0.0 because ROCm doesn't natively support gfx1200 yet.

The question was simple: are we actually at the hardware ceiling, or leaving performance on the table?

The Bandwidth Math

LLM token generation is memory-bandwidth-bound. Each token requires reading all model weights from VRAM once (matrix-vector multiply). The theoretical calculation:

qwen3:8b (Q4_K_M): 4.92 GB model weights
At 82.9 tok/s: 4.92 GB × 82.9 = 407.9 GB/s effective bandwidth
Theoretical max: 640 GB/s
Efficiency: 63.7%

We're at 63-74% of theoretical bandwidth depending on the model. That's 26-37% headroom — not a ceiling at all. Something is leaving bandwidth on the table.

Where the Bandwidth Goes

Context-length scaling benchmarks revealed the first clue. Token generation speed at different context lengths:

Contextqwen3:8b tok/sSlowdown
11 tokens83.6baseline
1,212 tokens82.9-0.8%
4,096 tokens78.5-6.1%
16,384 tokens62.8-24.9%

As context grows, the KV cache grows. KV cache reads achieve only ~250-300 GB/s (strided per-head access across attention heads) vs ~460 GB/s for weight reads (sequential). The GPU's memory controller handles sequential reads well but struggles with the scattered access pattern of multi-head attention.

The Vulkan Discovery

While researching alternatives to Ollama, I found mentions that RADV (Mesa's Vulkan driver for AMD) sometimes outperforms ROCm on consumer RDNA cards. The reasoning: RADV has native gfx1201 shader compilation while ROCm requires an override that targets the wrong microarchitecture variant.

I built llama.cpp from source with both backends and ran a three-way comparison:

# HIP build (ROCm)
cmake -B build-hip -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1200 \
      -DGGML_HIP_ROCWMMA_FATTN=ON ..
cmake --build build-hip -j$(nproc)
 
# Vulkan build (RADV)
cmake -B build-vulkan -DGGML_VULKAN=ON ..
cmake --build build-vulkan -j$(nproc)

The results:

Backendqwen3:8b tgcoder:14b tgImplied Bandwidth
Ollama (ROCm)82.9 tok/s55.2 tok/s406-464 GB/s
llama.cpp HIP84.3 tok/s54.9 tok/s413-462 GB/s
llama.cpp Vulkan100.0 tok/s60.1 tok/s490-505 GB/s

Vulkan (RADV) was 20% faster on the 8B model and 9% faster on the 14B. It achieved 77-79% of theoretical bandwidth vs ROCm's 63-72%.

Why RADV Wins on RDNA 4

Running vulkaninfo revealed the answer:

deviceName = AMD Radeon RX 9070 XT (RADV GFX1201)

RADV reports GFX1201 — the correct silicon variant. ROCm requires HSA_OVERRIDE_GFX_VERSION=12.0.0, telling the compiler to generate code for gfx1200. I tested building for gfx1201 explicitly with HIP — identical performance to gfx1200. The ISA is the same; the performance gap comes from RADV's shader compiler generating better code for the compute workload, not from the target variant.

Getting Ollama to Use Vulkan

Ollama bundles both backends. On Arch Linux, ollama-rocm provides libggml-hip.so and ollama-vulkan provides libggml-vulkan.so. Problem: when both are present, Ollama prefers ROCm.

Setting OLLAMA_LLM_LIBRARY=vulkan didn't change the selection in v0.15.5. The fix was surgical — disable the ROCm library:

sudo mv /usr/lib/ollama/libggml-hip.so /usr/lib/ollama/libggml-hip.so.disabled
sudo systemctl restart ollama

The result through Ollama (with its Go HTTP wrapper overhead):

ModelROCmVulkanImprovement
qwen3:8b82.9 tok/s91.0 tok/s+9.8%
coder:14b55.2 tok/s58.3 tok/s+5.6%

Not the full +20% from raw llama.cpp (Ollama's wrapper adds ~10% overhead on prompt processing), but a free speed boost with zero workflow changes.

The Pipeline Optimization That Mattered More

While profiling the GPU, I also profiled the scripts that call the GPU. The heartbeat system runs 22 automated tasks daily — dependency monitoring, research radar, vault health checks, commit reviews.

The deps-monitor script was taking 90 seconds. Profiling showed 51 of those seconds were the collector script running npm audit and npm outdated sequentially across 3 projects — 6-9 network calls at 8-15 seconds each, all independent.

The fix was three lines of bash:

# Before: 51 seconds (sequential)
for project in "${PROJECTS[@]}"; do
    audit=$(cd "$path" && npm audit --json)    # blocks 8-15s
    outdated=$(cd "$path" && npm outdated --json)  # blocks 8-15s
done
 
# After: 5 seconds (parallel)
for project in "${PROJECTS[@]}"; do
    (cd "$path" && npm audit --json > "$tmp_audit") &
    (cd "$path" && npm outdated --json > "$tmp_outdated") &
    wait
done

51s → 5s. 10x improvement. More practical impact than any GPU tuning.

I applied the same pattern to every script with sequential independent I/O:

ScriptBeforeAfterSpeedup
research-radar (3 GitHub API calls)19.8s0.45s44x
industry-scan (30 HN fetches + 2 GitHub)~30s~8s3.8x
vault-health (3 Obsidian CLI calls)~16s~10s1.6x

The pattern is always the same: find sequential independent operations, launch them as background processes, collect results from temp files.

The Three-Model Strategy

With the Vulkan backend stable, I benchmarked additional models to optimize the task routing:

ModelSpeedMath (347×23)ReasoningBest For
qwen3:8b91 tok/sthinks*thinks*Fast daily tasks
gemma3:12b50 tok/s7981 ✓9 ✓Quality reasoning, vision
qwen2.5-coder:14b58 tok/s8021 ✗9 ✓Code analysis
phi4-reasoning:14b48 tok/s7981 ✓9 ✓Too verbose
devstral:24b27 tok/s8001 ✗8 ✗Too slow

*qwen3 models use chain-of-thought thinking that consumes the token budget on simple questions

The final lineup: qwen3:8b for speed-critical daily tasks, qwen2.5-coder:14b for commit reviews and rule analysis, and gemma3:12b for weekly quality-sensitive tasks and anything that benefits from vision capability. Ollama swaps them in and out of VRAM automatically.

What I Learned

Profile before optimizing. The GPU was at 63% efficiency, not at the hardware ceiling. The pipeline scripts had 10-44x speedups hiding in plain sight. Neither was obvious without measuring.

The right driver matters more than the right flags. I spent 5 optimization loops tuning ROCm environment variables (ROCBLAS_USE_HIPBLASLT, HIP_FORCE_DEV_KERNARG, power profiles). None of them moved the needle. Switching to RADV Vulkan — a different driver entirely — gave +20% for free.

Pipeline optimization beats model optimization. The 44x speedup in research-radar.sh (19.8s → 0.45s) had more daily impact than the GPU tuning. Three lines of bash parallelization outperformed weeks of kernel tuning potential.

Small models can't multiply. Don't test your GPU backend with math questions. Test with the actual workload.

Further Reading

About the Author

Vache Sarkissian

Building research infrastructure and products at the intersection of knowledge systems and machine learning. Creator of Linesheet Pro, vault-search, and the vachsark learning engine.

View Full Bio →
© 2026 Vache Sarkissian·Built with Claude Code
vachsark.com