Does Vulkan outperform ROCm for LLM inference on AMD GPUs?

Yes. On RDNA 4 hardware (RX 9070 XT), RADV Vulkan outperforms ROCm HIP by 20% on an 8B model and 9% on a 14B model in llama.cpp benchmarks. RADV targets GFX1201 natively, while ROCm needs an HSA_OVERRIDE_GFX_VERSION workaround, but building HIP for gfx1201 explicitly made no difference, so the gap appears to come from RADV's shader compiler generating better code for this workload rather than from the target variant. The result is better effective memory bandwidth utilization: 77-79% of the theoretical 640 GB/s versus 63-72% with ROCm.

How do I switch Ollama from ROCm to Vulkan backend on Arch Linux?

Install both the ollama-rocm and ollama-vulkan packages, then disable the ROCm shared library so Ollama falls back to Vulkan. Run "sudo mv /usr/lib/ollama/libggml-hip.so /usr/lib/ollama/libggml-hip.so.disabled" then "sudo systemctl restart ollama". Setting OLLAMA_LLM_LIBRARY=vulkan alone does not work in Ollama v0.15.5; the library rename is required. After switching, benchmark with "ollama run qwen3:8b --verbose", which prints the eval rate, to confirm the speed improvement.

Why is LLM token generation memory-bandwidth-bound instead of compute-bound?

Each generated token requires reading all model weights from VRAM once to perform a matrix-vector multiply. For an 8B Q4_K_M model (4.92 GB), generating at 82.9 tok/s implies 408 GB/s of effective memory bandwidth. Because the compute units finish their calculations faster than the memory controller can supply data, the bottleneck is memory bandwidth rather than arithmetic throughput. This is why higher-bandwidth VRAM translates directly into faster inference, and why KV cache growth degrades performance at long contexts: strided KV reads achieve only 250-300 GB/s versus 460 GB/s for sequential weight reads.

What is the fastest way to optimize slow bash scripts with sequential I/O?

Launch independent I/O operations as background processes and collect results from temp files. Replace sequential calls (e.g., running npm audit then npm outdated one at a time) with parallel subshells using "&" and collect with "wait". This pattern converted a 51-second dependency monitor to 5 seconds and a 19.8-second research radar script to 0.45 seconds — a 44x speedup. The key requirement is that the operations be independent (no data dependency between them). Write outputs to temp files, then read after wait completes.

Which local LLM should I use for speed versus reasoning quality on RDNA 4?

For RDNA 4 with 16GB VRAM (RX 9070 XT), a three-model strategy works well. Use qwen3:8b at 91 tok/s for speed-critical daily tasks where latency matters more than deep reasoning. Use qwen2.5-coder:14b at 58 tok/s for code analysis and commit review. Use gemma3:12b at 50 tok/s for quality-sensitive tasks and anything needing vision capability — it produces correct math while the others sometimes fail. Avoid qwen3 variants for simple tasks since they burn tokens on chain-of-thought thinking.

Vulkan Beats ROCm: +20% LLM Inference on RDNA 4

I run a local AI pipeline on an AMD RX 9070 XT (RDNA 4, 16GB VRAM, 640 GB/s theoretical bandwidth). It handles heartbeat tasks, code review, dependency monitoring, and semantic search — all at zero API cost via Ollama.

The starting point: qwen3:8b at 82 tok/s, qwen2.5-coder:14b at 55 tok/s. Both ran through Ollama's ROCm HIP backend with HSA_OVERRIDE_GFX_VERSION=12.0.0, because ROCm doesn't natively support this card (gfx1201) yet and the override makes it run as gfx1200.

The question was simple: are we actually at the hardware ceiling, or leaving performance on the table?

The Bandwidth Math

LLM token generation is memory-bandwidth-bound. Each token requires reading all model weights from VRAM once (matrix-vector multiply). The theoretical calculation:

qwen3:8b (Q4_K_M): 4.92 GB model weights
At 82.9 tok/s: 4.92 GB × 82.9 = 407.9 GB/s effective bandwidth
Theoretical max: 640 GB/s
Efficiency: 63.7%

Across the two models, that works out to 63-72% of theoretical bandwidth, which leaves 28-37% headroom. That is not a hardware ceiling; something is leaving bandwidth on the table.
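The arithmetic above is easy to repeat for other models and speeds. Here is a minimal sketch in shell, assuming awk is available; the function names are illustrative and not part of any tool mentioned here.

```shell
#!/usr/bin/env bash
# Effective bandwidth implied by token generation speed.
# Assumption: every generated token reads all model weights exactly once.

PEAK_GBPS=640   # RX 9070 XT theoretical bandwidth quoted in the article

effective_bandwidth() {
    # usage: effective_bandwidth MODEL_SIZE_GB TOKENS_PER_SEC
    awk -v size="$1" -v tps="$2" 'BEGIN { printf "%.1f\n", size * tps }'
}

bandwidth_efficiency() {
    # usage: bandwidth_efficiency MODEL_SIZE_GB TOKENS_PER_SEC [PEAK_GBPS]
    awk -v size="$1" -v tps="$2" -v peak="${3:-$PEAK_GBPS}" \
        'BEGIN { printf "%.1f\n", 100 * size * tps / peak }'
}

effective_bandwidth 4.92 82.9     # prints 407.9 (GB/s)
bandwidth_efficiency 4.92 82.9    # prints 63.7 (% of peak)
```

Pass a third argument to bandwidth_efficiency to check a card with a different peak bandwidth.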

Where the Bandwidth Goes

Context-length scaling benchmarks revealed the first clue. Token generation speed at different context lengths:

Context	qwen3:8b tok/s	Slowdown
11 tokens	83.6	baseline
1,212 tokens	82.9	-0.8%
4,096 tokens	78.5	-6.1%
16,384 tokens	62.8	-24.9%

As context grows, the KV cache grows. KV cache reads achieve only ~250-300 GB/s (strided per-head access across attention heads) vs ~460 GB/s for weight reads (sequential). The GPU's memory controller handles sequential reads well but struggles with the scattered access pattern of multi-head attention.
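To build the same slowdown column from your own runs, a small helper is enough. This is a sketch that assumes you already have tok/s figures for each context length; slowdown_pct is a made-up name, and the numbers are the ones from the table above.

```shell
#!/usr/bin/env bash
# Relative change in tok/s versus a short-context baseline.

slowdown_pct() {
    # usage: slowdown_pct BASELINE_TPS MEASURED_TPS
    awk -v base="$1" -v cur="$2" \
        'BEGIN { printf "%.1f\n", 100 * (cur - base) / base }'
}

baseline=83.6   # tok/s at 11 tokens of context
for tps in 82.9 78.5 62.8; do
    printf '%s tok/s: %s%%\n' "$tps" "$(slowdown_pct "$baseline" "$tps")"
done
# prints:
#   82.9 tok/s: -0.8%
#   78.5 tok/s: -6.1%
#   62.8 tok/s: -24.9%
```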

The Vulkan Discovery

While researching alternatives to Ollama, I found mentions that RADV (Mesa's Vulkan driver for AMD) sometimes outperforms ROCm on consumer RDNA cards. The reasoning: RADV has native gfx1201 shader compilation while ROCm requires an override that targets the wrong microarchitecture variant.

I built llama.cpp from source with both backends and ran a three-way comparison:

# Run both from the llama.cpp source root.
# HIP build (ROCm)
cmake -S . -B build-hip -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1200 \
      -DGGML_HIP_ROCWMMA_FATTN=ON
cmake --build build-hip -j"$(nproc)"

# Vulkan build (RADV)
cmake -S . -B build-vulkan -DGGML_VULKAN=ON
cmake --build build-vulkan -j"$(nproc)"

The results:

Backend	qwen3:8b tg	coder:14b tg	Implied Bandwidth
Ollama (ROCm)	82.9 tok/s	55.2 tok/s	406-464 GB/s
llama.cpp HIP	84.3 tok/s	54.9 tok/s	413-462 GB/s
llama.cpp Vulkan	100.0 tok/s	60.1 tok/s	490-505 GB/s

Vulkan (RADV) was 20% faster on the 8B model and 9% faster on the 14B. It achieved 77-79% of theoretical bandwidth vs ROCm's 63-72%.
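A comparison like this can be scripted with llama-bench, the benchmark tool that ships with llama.cpp. The sketch below is not the exact invocation used for the table: the model path is a placeholder, and it assumes the CMake builds put the binary under bin/ in each build directory.

```shell
#!/usr/bin/env bash
# Run llama-bench from each build directory against the same model.
# MODEL is a hypothetical path; point it at your own GGUF file.
MODEL="${MODEL:-$HOME/models/qwen3-8b-q4_k_m.gguf}"

run_bench() {
    # usage: run_bench BUILD_DIR
    local bench="$1/bin/llama-bench"
    if [ ! -x "$bench" ]; then
        echo "skip: $bench not found"
        return 0
    fi
    # -p: prompt tokens, -n: generated tokens, -ngl: layers offloaded to the GPU
    "$bench" -m "$MODEL" -p 512 -n 128 -ngl 99
}

for dir in build-hip build-vulkan; do
    echo "== $dir =="
    run_bench "$dir"
done
```

The token generation row of each report is the figure to compare between backends.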

Why RADV Wins on RDNA 4

Running vulkaninfo revealed the answer:

deviceName = AMD Radeon RX 9070 XT (RADV GFX1201)

RADV reports GFX1201, the correct silicon variant. ROCm requires HSA_OVERRIDE_GFX_VERSION=12.0.0, which makes the runtime treat the card as gfx1200. I tested building for gfx1201 explicitly with HIP and got identical performance to gfx1200. The ISA is the same; the performance gap comes from RADV's shader compiler generating better code for the compute workload, not from the target variant.
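To check what RADV reports on your own card, filter the vulkaninfo output for the device name. This is a small sketch; radv_target is an illustrative helper name, and the pattern assumes the "(RADV ...)" suffix shown above.

```shell
#!/usr/bin/env bash
# Extract the RADV target from vulkaninfo's deviceName line(s).

radv_target() {
    # Reads vulkaninfo output on stdin, prints the token after "RADV " in parentheses.
    sed -n 's/.*deviceName[[:space:]]*=.*(RADV \([A-Za-z0-9]*\)).*/\1/p'
}

# Sample line from the article; on a real system run: vulkaninfo | radv_target
echo 'deviceName = AMD Radeon RX 9070 XT (RADV GFX1201)' | radv_target   # prints GFX1201
```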

Getting Ollama to Use Vulkan

Ollama bundles both backends. On Arch Linux, ollama-rocm provides libggml-hip.so and ollama-vulkan provides libggml-vulkan.so. Problem: when both are present, Ollama prefers ROCm.

Setting OLLAMA_LLM_LIBRARY=vulkan didn't change the selection in v0.15.5. The fix was surgical — disable the ROCm library:

sudo mv /usr/lib/ollama/libggml-hip.so /usr/lib/ollama/libggml-hip.so.disabled
sudo systemctl restart ollama
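Since the rename is a manual change to a packaged file, it helps to have a reversible version; a package upgrade will likely restore the original library, so the step may need repeating. This sketch wraps the same mv in a toggle. The function name and the OLLAMA_LIB_DIR variable are my own, not Ollama settings.

```shell
#!/usr/bin/env bash
# Reversible version of the library rename. Run as root for /usr/lib/ollama.
# OLLAMA_LIB_DIR is a variable of this sketch only, not read by Ollama itself.
OLLAMA_LIB_DIR="${OLLAMA_LIB_DIR:-/usr/lib/ollama}"

toggle_hip() {
    # usage: toggle_hip disable|enable
    local lib from to
    lib="$OLLAMA_LIB_DIR/libggml-hip.so"
    case "$1" in
        disable) from="$lib";          to="$lib.disabled" ;;
        enable)  from="$lib.disabled"; to="$lib" ;;
        *) echo "usage: toggle_hip disable|enable" >&2; return 2 ;;
    esac
    if [ ! -e "$from" ]; then
        echo "nothing to do: $from not found" >&2
        return 1
    fi
    mv "$from" "$to"
}
```

Saved as a script that ends with toggle_hip "$1", it can be run with sudo and followed by sudo systemctl restart ollama, exactly as with the manual commands.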

The result through Ollama (with its Go HTTP wrapper overhead):

Model	ROCm	Vulkan	Improvement
qwen3:8b	82.9 tok/s	91.0 tok/s	+9.8%
coder:14b	55.2 tok/s	58.3 tok/s	+5.6%

Not the full +20% from raw llama.cpp (Ollama's wrapper adds ~10% overhead on prompt processing), but a free speed boost with zero workflow changes.
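Figures like these can be read straight from Ollama's HTTP API: a non-streaming /api/generate response includes eval_count (tokens generated) and eval_duration (nanoseconds). A minimal sketch of the conversion follows; the helper name is illustrative, and the example numbers are made up to show the arithmetic.

```shell
#!/usr/bin/env bash
# Convert Ollama's eval_count and eval_duration (ns) into tokens per second.

tokens_per_sec() {
    # usage: tokens_per_sec EVAL_COUNT EVAL_DURATION_NS
    awk -v n="$1" -v ns="$2" 'BEGIN { printf "%.1f\n", n / (ns / 1e9) }'
}

# Made-up example: 455 tokens generated in 5 seconds
tokens_per_sec 455 5000000000    # prints 91.0

# Against a running server (needs curl and jq; not executed here):
#   resp=$(curl -s http://localhost:11434/api/generate \
#       -d '{"model": "qwen3:8b", "prompt": "Explain RAID 5.", "stream": false}')
#   tokens_per_sec "$(echo "$resp" | jq .eval_count)" "$(echo "$resp" | jq .eval_duration)"
```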

The Pipeline Optimization That Mattered More

While profiling the GPU, I also profiled the scripts that call the GPU. The heartbeat system runs 22 automated tasks daily — dependency monitoring, research radar, vault health checks, commit reviews.

The deps-monitor script was taking 90 seconds. Profiling showed 51 of those seconds were the collector script running npm audit and npm outdated sequentially across 3 projects — 6-9 network calls at 8-15 seconds each, all independent.

The fix was three lines of bash:

# Before: 51 seconds (sequential)
for path in "${PROJECTS[@]}"; do
    audit=$(cd "$path" && npm audit --json)        # blocks 8-15s
    outdated=$(cd "$path" && npm outdated --json)  # blocks 8-15s
done

# After: 5 seconds (parallel)
tmp=$(mktemp -d)   # one output file per project and command
for path in "${PROJECTS[@]}"; do
    name=$(basename "$path")
    (cd "$path" && npm audit --json > "$tmp/$name.audit.json") &
    (cd "$path" && npm outdated --json > "$tmp/$name.outdated.json") &
done
wait   # a single wait after the loop, so all projects overlap

51s → 5s. 10x improvement. More practical impact than any GPU tuning.

I applied the same pattern to every script with sequential independent I/O:

Script	Before	After	Speedup
research-radar (3 GitHub API calls)	19.8s	0.45s	44x
industry-scan (30 HN fetches + 2 GitHub)	~30s	~8s	3.8x
vault-health (3 Obsidian CLI calls)	~16s	~10s	1.6x

The pattern is always the same: find sequential independent operations, launch them as background processes, collect results from temp files.
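The pattern can be factored into a small reusable helper. This is a sketch, not the code from the scripts above: run_parallel is a hypothetical name, each command is passed as a string, and the sleep commands stand in for slow network calls.

```shell
#!/usr/bin/env bash
# Run independent commands in the background, one output file each,
# then wait for all of them and report how many failed.

run_parallel() {
    # usage: run_parallel OUT_DIR 'cmd1' 'cmd2' ...
    local out_dir pids i failed cmd pid
    out_dir="$1"; shift
    pids=""
    i=0
    for cmd in "$@"; do
        sh -c "$cmd" > "$out_dir/$i.out" 2> "$out_dir/$i.err" &
        pids="$pids $!"
        i=$((i + 1))
    done
    failed=0
    for pid in $pids; do
        wait "$pid" || failed=$((failed + 1))
    done
    return "$failed"   # 0 when every command succeeded
}

out=$(mktemp -d)
# Three 1-second jobs overlap instead of running back to back.
run_parallel "$out" 'sleep 1; echo first' 'sleep 1; echo second' 'sleep 1; echo third'
cat "$out"/*.out
```

Waiting on each PID separately, rather than calling a bare wait, is what lets the helper count failures. This matters for tools such as npm audit that exit non-zero when they find something.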

The Three-Model Strategy

With the Vulkan backend stable, I benchmarked additional models to optimize the task routing:

Model	Speed	Math (347×23)	Reasoning	Best For
qwen3:8b	91 tok/s	thinks*	thinks*	Fast daily tasks
gemma3:12b	50 tok/s	7981 ✓	9 ✓	Quality reasoning, vision
qwen2.5-coder:14b	58 tok/s	8021 ✗	9 ✓	Code analysis
phi4-reasoning:14b	48 tok/s	7981 ✓	9 ✓	Too verbose
devstral:24b	27 tok/s	8001 ✗	8 ✗	Too slow

*qwen3 models use chain-of-thought thinking that consumes the token budget on simple questions

The final lineup: qwen3:8b for speed-critical daily tasks, qwen2.5-coder:14b for commit reviews and rule analysis, and gemma3:12b for weekly quality-sensitive tasks and anything that benefits from vision capability. Ollama swaps them in and out of VRAM automatically.
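In script form, the routing can be as simple as a case statement. The task labels below are hypothetical stand-ins for however your pipeline names its jobs; only the model names come from the lineup above.

```shell
#!/usr/bin/env bash
# Map a task kind to one of the three models. Task labels are hypothetical.

model_for_task() {
    # usage: model_for_task TASK_KIND
    case "$1" in
        review|code)    echo "qwen2.5-coder:14b" ;;  # commit reviews, code analysis
        vision|quality) echo "gemma3:12b" ;;         # quality-sensitive or image tasks
        *)              echo "qwen3:8b" ;;           # default: fast daily tasks
    esac
}

model_for_task review      # prints qwen2.5-coder:14b
model_for_task heartbeat   # prints qwen3:8b
```

A caller would then run something like ollama run "$(model_for_task review)" with its prompt, and Ollama handles loading the chosen model.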

What I Learned

Profile before optimizing. The GPU was at 63% efficiency, not at the hardware ceiling. The pipeline scripts had 10-44x speedups hiding in plain sight. Neither was obvious without measuring.

The right driver matters more than the right flags. I spent 5 optimization loops tuning ROCm environment variables (ROCBLAS_USE_HIPBLASLT, HIP_FORCE_DEV_KERNARG, power profiles). None of them moved the needle. Switching to RADV Vulkan — a different driver entirely — gave +20% for free.

Pipeline optimization beats model optimization. The 44x speedup in research-radar.sh (19.8s → 0.45s) had more daily impact than the GPU tuning. Three lines of bash parallelization did more than weeks of kernel tuning could have.

Small models can't multiply. Don't test your GPU backend with math questions. Test with the actual workload.

About the Author

Vache Sarkissian