Why doesn't bitsandbytes work on AMD RDNA4 (RX 9070 XT)?

bitsandbytes fails on RDNA4 for two independent reasons. The PyPI wheel has no HIP kernels compiled for gfx1200/gfx1201 (RDNA4's ISA targets), producing a hipErrorNoBinaryForGpu error. Building from source resolves that but creates a dependency loop — gfx1200 support requires ROCm 7.2+ headers, but PyTorch ships ROCm 6.3's runtime, causing an undefined symbol error at load time. Both paths are blocked until PyTorch ships a ROCm 7.2 wheel.

What is HQQ and how does it enable QLoRA on AMD GPUs?

HQQ (Half-Quadratic Quantization) is a 4-bit quantization library with a pure PyTorch backend — no CUDA or HIP kernels required. It quantizes model weights to 4-bit on-device and dequantizes them on-the-fly during forward passes using standard torch.matmul. Because it only needs PyTorch ops, it runs on any device PyTorch supports, including RDNA4 GPUs that lack bitsandbytes kernel support.

How much VRAM does 7B QLoRA fine-tuning require with HQQ on a 16GB GPU?

After HQQ quantization, a 7B model occupies about 5.85 GB of VRAM — down from roughly 14 GB for bf16 weights. That leaves around 10 GB for LoRA adapters, activations, and gradients on a 16GB card like the RX 9070 XT. In practice, training fits comfortably with a batch size of 2 and gradient accumulation.

What patches are required to run HQQ QLoRA fine-tuning on RDNA4?

Three patches are needed. First, a dtype fix in hqq/core/quantize.py so matmul casts input tensors to match dequantized weight dtype (bfloat16) — PEFT upcasts layers to float32 which mismatches HQQ's output. Second, a torch.compile monkey-patch before any HQQ import to avoid a Python 3.14 incompatibility where torch.compile raises RuntimeError. Third, a merge strategy fix — save the LoRA adapter separately, reload the base model in bf16 without HQQ, then merge to get clean weight names that llama.cpp and Ollama can parse.

Can AMD RDNA4 compete with NVIDIA for LLM fine-tuning?

For QLoRA fine-tuning specifically, yes with caveats. HQQ eliminates the bitsandbytes dependency that blocks RDNA4, and training throughput is comparable to equivalent NVIDIA consumer GPUs. The main friction is ecosystem maturity — AMD requires three manual patches that NVIDIA users never encounter, and bfloat16 support only arrived in ROCm 6.x. For inference, RDNA4's memory bandwidth is competitive. The gap is largest in custom CUDA kernel workflows that have no HIP equivalent.

7B QLoRA Fine-Tuning on AMD RDNA4: The HQQ Path

I fine-tune task-specific LLMs on my RX 9070 XT. Small models (1.5B–3B) train in bf16 with room to spare. But 7B models need ~14GB just for weights — leaving nothing for activations and gradients on a 16GB card. The standard solution is QLoRA: 4-bit quantized weights with LoRA adapters trained in bf16.

The standard QLoRA library is bitsandbytes. It doesn't work on RDNA4.

The bitsandbytes Dead End

Two independent failure paths, both unfixable without upstream changes:

Path 1: PyPI wheel. The pre-built binary doesn't include HIP kernels for gfx1200/gfx1201 (RDNA4's ISA targets). You get hipErrorNoBinaryForGpu: the HIP runtime finds no code object compiled for this GPU's architecture in the binary.

Path 2: Build from source. This is the interesting one. Building bitsandbytes with cmake -DCOMPUTE_BACKEND=hip -DBNB_ROCM_ARCH="gfx1200" compiles successfully. The problem is the ROCm version mismatch:

gfx1200 support requires ROCm 7.2+ headers (ROCm 6.3 doesn't know the architecture exists)
Compiling against ROCm 7.2 produces a binary that needs ROCm 7.2's HSA runtime
PyTorch ships with ROCm 6.3's runtime bundled

The result: undefined symbol: hsa_amd_memory_get_preferred_copy_engine. The symbol exists in ROCm 7.2 but not 6.3. There's no way to satisfy all three constraints — gfx1200 support, ROCm 7.2 headers, ROCm 6.3 runtime — simultaneously.

This is a dependency loop that only resolves when PyTorch ships a ROCm 7.2 wheel.
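To see which half of the conflict applies on a given machine, the two checks can be written down explicitly. This is a minimal sketch, not code from bitsandbytes or the repo: parse_rocm_version and bnb_blockers are hypothetical names, and the list of architectures in the wheel has to be supplied by the caller. On a ROCm build of PyTorch, the inputs can usually be read from torch.version.hip and torch.cuda.get_device_properties(0).gcnArchName.

```python
def parse_rocm_version(text):
    """Reduce a version string such as '6.3.42131-fa1d09cbd' to (6, 3)."""
    major, minor = text.split(".")[:2]
    return int(major), int(minor)


def bnb_blockers(gpu_arch, wheel_archs, torch_rocm, min_rocm=(7, 2)):
    """List the reasons bitsandbytes cannot load for this GPU (hypothetical helper).

    gpu_arch    -- ISA target, e.g. 'gfx1201' (anything after ':' is ignored)
    wheel_archs -- architectures the installed wheel was compiled for (caller-supplied)
    torch_rocm  -- ROCm version bundled with PyTorch
    min_rocm    -- first ROCm release whose headers know the architecture
                   (7.2 for gfx1200/gfx1201, per the text above)
    """
    arch = gpu_arch.split(":")[0]
    blockers = []
    if arch not in wheel_archs:
        blockers.append(f"wheel has no HIP kernels for {arch}")
    if parse_rocm_version(torch_rocm) < min_rocm:
        blockers.append(
            f"source build needs ROCm {min_rocm[0]}.{min_rocm[1]}+, "
            f"PyTorch bundles {torch_rocm}"
        )
    return blockers


# Illustrative inputs only; the wheel architecture set here is made up.
report = bnb_blockers("gfx1201", {"gfx1100", "gfx942"}, "6.3.42131")
```

An empty list means neither failure path applies; two entries correspond to Path 1 and Path 2 above.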

The Solution: HQQ

HQQ (Half-Quadratic Quantization) provides 4-bit quantization with a pure PyTorch backend. No custom CUDA kernels. No HIP kernels. Just torch.matmul on dequantized weights. It works on any device PyTorch supports.

The approach:

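import torch
from transformers import AutoModelForCausalLM
from hqq.core.quantize import HQQLinear, HQQBackend, BaseQuantizeConfig

# Assumption: the usual attention and MLP projection names for Qwen2.5.
# Adjust to match the layers you intend to quantize.
TARGET_MODULES = ["q_proj", "k_proj", "v_proj", "o_proj",
                  "gate_proj", "up_proj", "down_proj"]

def set_module(model, name, new_module):
    """Replace the submodule at dotted path `name` (small illustrative helper)."""
    parent_name, _, child_name = name.rpartition(".")
    parent = model.get_submodule(parent_name) if parent_name else model
    setattr(parent, child_name, new_module)

# Load model in bf16 to CPU
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct",
    dtype=torch.bfloat16,
    device_map="cpu",
)

# Quantize target layers to 4-bit, move to GPU
# (ROCm builds of PyTorch expose the GPU under the "cuda" device name)
quant_config = BaseQuantizeConfig(nbits=4, group_size=64)
# Snapshot the module list first so replacements don't disturb the iteration
for name, module in list(model.named_modules()):
    if isinstance(module, torch.nn.Linear) and any(t in name for t in TARGET_MODULES):
        hqq_layer = HQQLinear(module, quant_config,
            compute_dtype=torch.bfloat16, device="cuda:0")
        # Replace the module in-place
        set_module(model, name, hqq_layer)

model = model.to("cuda:0")
HQQLinear.set_backend(HQQBackend.PYTORCH)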

The key detail: we don't use HqqConfig from transformers (broken in 5.x). We load the model normally, then manually replace each linear layer with an HQQLinear that quantizes the weights to 4-bit on the GPU. The PYTORCH backend tells HQQ to dequantize weights on-the-fly during forward passes using pure PyTorch ops.
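To make "dequantize on-the-fly" concrete, here is a toy version of group-wise 4-bit storage in plain Python. It uses simple round-to-nearest affine quantization, which is an assumption for illustration; HQQ's actual half-quadratic optimizer chooses the scale and zero point differently, and the function names are hypothetical.

```python
def quantize_group(weights, nbits=4):
    """Round-to-nearest affine quantization of one group (not HQQ's optimizer)."""
    levels = 2 ** nbits - 1                      # 15 for 4-bit codes
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels or 1.0            # fall back to 1.0 for a constant group
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo


def dequantize_group(codes, scale, zero):
    """Rebuild approximate weights from integer codes plus per-group metadata."""
    return [c * scale + zero for c in codes]


def linear_forward(x, codes, scale, zero):
    """Dot product against weights that are dequantized only for this call."""
    weights = dequantize_group(codes, scale, zero)
    return sum(xi * wi for xi, wi in zip(x, weights))


group = [-0.8, -0.1, 0.0, 0.35, 0.7]
codes, scale, zero = quantize_group(group)
restored = dequantize_group(codes, scale, zero)
```

Only the integer codes and the per-group scale and zero point are kept in memory; the full-precision weights exist briefly inside the forward call, which is why no custom kernel is needed.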

After quantization: 5.85 GB VRAM for a 7B model. Plenty of room for LoRA adapters and training.
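A back-of-the-envelope estimate shows where a number in this range comes from. Everything in this sketch is an assumption on my part: the Qwen2.5-7B shape constants are taken from the published model config as I recall it, all seven projection types are assumed to be quantized, each group of 64 weights is assumed to carry a 16-bit scale and a 16-bit zero point, and embeddings are assumed to stay in bf16. Norms and biases are ignored.

```python
# Assumed Qwen2.5-7B shape values; verify against the model's config.json.
HIDDEN = 3584
INTERMEDIATE = 18944
LAYERS = 28
KV_DIM = 512          # 4 key/value heads x 128 head dim
VOCAB = 152064


def linear_params_per_layer():
    """Parameter count of the projections assumed to be quantized in one block."""
    attention = 2 * HIDDEN * HIDDEN + 2 * HIDDEN * KV_DIM   # q, o and k, v projections
    mlp = 3 * HIDDEN * INTERMEDIATE                          # gate, up, down projections
    return attention + mlp


def estimate_weight_gb(nbits=4, group_size=64, meta_bits=32):
    """Weight memory in GB: quantized projections plus bf16 embeddings.

    meta_bits is the assumed per-group overhead (16-bit scale + 16-bit zero point).
    """
    quantized_params = LAYERS * linear_params_per_layer()
    bits_per_weight = nbits + meta_bits / group_size         # 4.5 bits with the defaults
    embedding_params = 2 * VOCAB * HIDDEN                    # input embedding + untied output head
    total_bytes = quantized_params * bits_per_weight / 8 + embedding_params * 2
    return total_bytes / 1e9


estimate = estimate_weight_gb()
```

Under these assumptions the estimate lands close to the measured 5.85 GB, though that agreement alone does not confirm the assumptions match the actual run.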

Three Patches Nobody Documented

Getting HQQ + PEFT + SFTTrainer to actually work required three fixes that aren't in any documentation or issue tracker.

1. HQQ dtype mismatch

PEFT's prepare_model_for_kbit_training upcasts certain layers to float32 for training stability. But HQQ's matmul() doesn't cast its input tensor to match the dequantized weight dtype. The matmul receives float32 inputs and bfloat16 weights — dtype mismatch error.

Fix: two one-line patches in hqq/core/quantize.py:

# In matmul(): add x = x.to(weight.dtype)
# In forward_pytorch(): change torch.matmul(x, w.t()) to torch.matmul(x.to(w.dtype), w.t())
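The failure and the fix can be reproduced on CPU without HQQ installed. This is a minimal sketch with stand-in tensors, not HQQ's code: x plays the role of an activation after PEFT's float32 upcast and w plays the role of a dequantized bfloat16 weight.

```python
import torch

x = torch.randn(2, 8, dtype=torch.float32)     # activation after the float32 upcast
w = torch.randn(4, 8, dtype=torch.bfloat16)    # stand-in for a dequantized HQQ weight

# Unpatched form: mixed dtypes are rejected by matmul
try:
    torch.matmul(x, w.t())
    mismatch_raised = False
except RuntimeError:
    mismatch_raised = True

# Patched form: cast the input to the weight dtype first
out = torch.matmul(x.to(w.dtype), w.t())
```

The cast means the output comes back in bfloat16, which matches the compute_dtype passed to HQQLinear above.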

2. Python 3.14 torch.compile

HQQ uses @torch.compile() as a class-level method decorator. On Python 3.14, PyTorch raises RuntimeError: torch.compile is not supported on Python 3.14+, and because the decorator runs at import time, importing HQQ fails outright. Since we're not actually using torch.compile for training, the fix is a monkey-patch that returns an identity decorator:

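import sys
import torch

if sys.version_info >= (3, 14):
    _orig = torch.compile
    def _patched(fn=None, *args, **kwargs):
        # @torch.compile(...) called with arguments: hand back an identity decorator
        if fn is None: return lambda f: f
        # bare @torch.compile: return the function uncompiled
        if callable(fn): return fn
        return _orig(fn, *args, **kwargs)
    torch.compile = _patched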

This must run before any HQQ import.
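The patch has to cover both decorator spellings, @torch.compile and @torch.compile(...). The sketch below checks that behavior against a stand-in object instead of the real torch module, so it runs anywhere; install_identity_compile and fake_torch are hypothetical names, and this version simplifies the patch above by dropping the fall-through to the original function.

```python
import types


def install_identity_compile(module):
    """Replace module.compile with a pass-through that supports both decorator forms."""
    def passthrough(fn=None, *args, **kwargs):
        if fn is None:
            # Called as @compile(...) : return a decorator that changes nothing
            return lambda f: f
        # Called as bare @compile : return the function itself
        return fn
    module.compile = passthrough


fake_torch = types.SimpleNamespace(compile=None)   # stand-in for the real torch module
install_identity_compile(fake_torch)


@fake_torch.compile
def bare(x):
    return x + 1


@fake_torch.compile(mode="max-autotune")
def with_args(x):
    return x * 2
```

Both functions end up undecorated, which is the behavior HQQ's class-level decorators need in order to import cleanly.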

3. Clean LoRA merge

After training, you want to merge the LoRA adapter back into the base model for deployment. But model.merge_and_unload() on an HQQ model produces tensors with HQQ-specific names (W_q, meta, etc.) that downstream tools (llama.cpp, Ollama) can't parse.

The fix: save the LoRA adapter separately, reload the base model in bf16 without HQQ, apply the adapter, then merge:

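import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

model_name = "Qwen/Qwen2.5-7B-Instruct"  # same base model used for training

# Save adapter after training (trainer is the SFTTrainer from the training run)
trainer.save_model("lora-adapter/")

# Clean merge: base model (bf16) + LoRA adapter
base = AutoModelForCausalLM.from_pretrained(model_name,
    dtype=torch.bfloat16, device_map="cpu")
peft_model = PeftModel.from_pretrained(base, "lora-adapter/")
merged = peft_model.merge_and_unload()
merged.save_pretrained("merged-model/")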

This produces a standard HuggingFace model that converts cleanly to GGUF.
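Before handing the merged model to a GGUF converter, it is cheap to confirm that no HQQ tensor names survived. This is a minimal sketch; find_hqq_keys is a hypothetical helper, and the marker list contains only the two names mentioned above (W_q and meta), so it may not be exhaustive.

```python
HQQ_MARKERS = ("W_q", "meta")   # names from the text above; other HQQ fields are not listed here


def find_hqq_keys(keys):
    """Return tensor names whose final dotted component is an HQQ-specific marker."""
    return [k for k in keys if k.rsplit(".", 1)[-1] in HQQ_MARKERS]


# Illustrative key names in the usual Hugging Face layout
dirty = [
    "model.layers.0.self_attn.q_proj.W_q",
    "model.layers.0.self_attn.q_proj.meta",
    "model.layers.0.input_layernorm.weight",
]
clean = [
    "model.layers.0.self_attn.q_proj.weight",
    "model.layers.0.input_layernorm.weight",
]
```

In practice the input would be something like merged.state_dict().keys(), and an empty result is what the clean merge path should produce.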

Three Gotchas That Cost Hours

Beyond the patches, three training/deployment bugs wasted entire debugging sessions before we identified them. They're the kind of failures that look like training quality issues but aren't.

4. Ollama Modelfile needs the chat template (silent, catastrophic)

When you register a custom GGUF with ollama create, you must include the model's chat template in the Modelfile. Without it, Ollama sends raw text — the model can't parse system/user/assistant role boundaries and generates input continuations instead of following instructions.

This one is insidious because the symptom looks like a training failure: the model outputs note content or Q&A instead of task-specific output. We spent an entire debugging session (6 iterations of retraining) before realizing the model weights were fine — Ollama just wasn't formatting the prompt correctly.

For Qwen2.5 models, the Modelfile needs the full ChatML template:

FROM your-model.Q8_0.gguf
TEMPLATE """{{- if .Messages }}
{{- if or .System .Tools }}<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}
{{- range $i, $_ := .Messages }}
{{- $last := eq (len (slice $.Messages $i)) 1 -}}
{{- if eq .Role "user" }}<|im_start|>user
{{ .Content }}<|im_end|>
{{ else if eq .Role "assistant" }}<|im_start|>assistant
{{ .Content }}{{ if not $last }}<|im_end|>
{{ end }}
{{- end }}
{{- if and (ne .Role "assistant") $last }}<|im_start|>assistant
{{ end }}
{{- end }}
{{- end }}"""
PARAMETER stop <|im_start|>
PARAMETER stop <|im_end|>

Without this, every fine-tuned model you deploy will appear broken. With it, every one works instantly.
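To check what the model should actually receive, it helps to render the ChatML prompt by hand and compare it with what the server sends. The sketch below is a simplified stand-in for the template above, assuming plain system/user/assistant messages with no tools; render_chatml is a hypothetical helper, not an Ollama API.

```python
def render_chatml(messages):
    """Render chat messages as a ChatML prompt ending with an open assistant turn."""
    parts = []
    for message in messages:
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    # Leave the assistant turn open so the model generates the reply
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt = render_chatml([
    {"role": "system", "content": "You are a knowledge judge."},
    {"role": "user", "content": "Rate this note."},
])
```

If the role markers are missing from the prompt the model sees, it has no turn boundaries to follow, which matches the input-continuation symptom described above.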

5. completion_only_loss is mandatory for high input/output ratios

If your training data has long inputs and short outputs (ratio > 5:1), you must use completion_only_loss=True in SFTConfig. Without it, the majority of the gradient signal teaches the model to predict input tokens. The model learns "continue the input" instead of "follow instructions and produce output."

Our training data had ratios from 6:1 to 28:1. All models trained without loss masking scored 0/5 on format compliance. The unfine-tuned base model outperformed them.

In TRL, completion_only_loss=True requires your dataset to have prompt and completion columns instead of the standard messages format. We pre-process the data to split out the prompt (system + user messages with chat template applied) from the completion (assistant response).
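The pre-processing step can be sketched as follows. This is an illustration under assumptions, not the repo's script: the ChatML rendering is done by hand here (real code would more likely call the tokenizer's chat template), the function names are hypothetical, the ratio is measured in characters as a rough proxy for tokens, and whether the end-of-turn token belongs in the completion is left to the trainer and tokenizer.

```python
def to_prompt_completion(messages):
    """Split a chat example into prompt and completion columns."""
    *context, final = messages
    if final["role"] != "assistant":
        raise ValueError("last message must be the assistant response")
    prompt = "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in context
    )
    prompt += "<|im_start|>assistant\n"          # open the turn the completion fills
    return {"prompt": prompt, "completion": final["content"]}


def input_output_ratio(example):
    """Character-length ratio of prompt to completion (rough proxy for tokens)."""
    return len(example["prompt"]) / len(example["completion"])


example = to_prompt_completion([
    {"role": "system", "content": "Judge."},
    {"role": "user", "content": "data"},
    {"role": "assistant", "content": "ok"},
])
```

With the data in this shape, the loss is computed on the completion column only, so long prompts no longer dominate the gradient.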

6. Truncate from the middle, not the ends

When inputs exceed max_length, the default truncation strategies fail:

Right-truncation (TRL default): cuts off the completion — the model trains on incomplete outputs
Left-truncation: removes task framing headers (e.g., === KNOWLEDGE JUDGE DATA ===)

The fix: truncate from the middle of the user content. Split the token budget so that the first third goes to the start of the input (task framing and headers) and the remaining two thirds go to the end (most recent/relevant content), and drop the intermediate content between them. This preserves both the task structure and the most useful data.
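A minimal sketch of middle truncation over a token list follows. It reads the 1/3 and 2/3 split as fractions of the token budget, which is my interpretation; truncate_middle is a hypothetical name, and the tokens here are plain integers standing in for token IDs of the user content.

```python
def truncate_middle(tokens, budget):
    """Keep the first third and last two thirds of the budget, dropping the middle."""
    if budget < 0:
        raise ValueError("budget must be non-negative")
    if len(tokens) <= budget:
        return list(tokens)                      # already fits, nothing to cut
    head = budget // 3                           # task framing and headers
    tail = budget - head                         # most recent content
    return list(tokens[:head]) + list(tokens[len(tokens) - tail:])


tokens = list(range(12))
kept = truncate_middle(tokens, 6)
```

Because only the user content is cut, the completion is left whole, which avoids the right-truncation failure described above.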

Production Training Results

With these patches in place, I trained five task-specific models from Qwen2.5-7B-Instruct. The table below reports four of them:

| Model | Examples | Peak VRAM | Time | max_length | Final Loss | Eval Accuracy |
|---|---|---|---|---|---|---|
| Knowledge judge | 67 | 14.53 GB | 20.5 min | 2048 | 1.85 | 63.9% |
| Goal planner | 99 | 12.22 GB | 13.8 min | 1024 | 1.34 | 73.4% |
| Research seeder | 64 | 12.22 GB | 8.7 min | 1024 | 1.86 | 61.8% |
| Technical analyst | 77 | 15.70 GB | ~24 min | 2048 | 1.24 | 69.5% |

Training data was generated by running each task's prompt through Sonnet 4.6 with real vault data (580 API calls, ~$0 on Max subscription). The models were then quantized to Q8_0 GGUF and deployed to Ollama, replacing 9GB system-prompt models with 8.1GB fine-tuned ones that actually follow the expected output format.

The VRAM Constraint

The main knob is max_length. At 7B HQQ with gradient checkpointing:

1024 tokens: 12.2 GB peak, 3.7 GB headroom — reliable, recommended
2048 tokens: 14.5–15.7 GB peak — works most of the time but risks silent hangs at 99.5% utilization
4096 tokens: OOM on 16 GB

For 16GB cards, max_length=1024 is the practical ceiling for 7B training. We learned this the hard way: a training run at max_length=2048 peaked at 15.7/17.1 GB (99.5%), the process hung silently with 0% GPU utilization, and force-killing it locked the GPU completely — requiring a full system reboot. Both HIP/ROCm and Vulkan became unresponsive. The memory allocator appears to deadlock at extreme utilization levels on RDNA4.
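The rule of thumb can be turned into a simple guard that picks a max_length from measured peaks. This is a sketch under assumptions: the usable capacity of 15.9 GB is derived from the 1024-token figures above (12.2 GB peak plus 3.7 GB headroom), the 2048-token entry uses the worst case of the reported 14.5-15.7 GB range, and the 90% ceiling is a hypothetical safety margin, not a measured threshold.

```python
MEASURED_PEAK_GB = {1024: 12.2, 2048: 15.7}   # from the list above (worst case for 2048)
USABLE_GB = 15.9                               # 12.2 GB peak + 3.7 GB headroom


def utilization(peak_gb, usable_gb=USABLE_GB):
    """Fraction of usable VRAM consumed at peak."""
    return peak_gb / usable_gb


def safe_max_length(peaks, ceiling=0.90, usable_gb=USABLE_GB):
    """Largest max_length whose measured peak stays under the ceiling, or None."""
    candidates = [
        length for length, peak in peaks.items()
        if utilization(peak, usable_gb) <= ceiling
    ]
    return max(candidates) if candidates else None


choice = safe_max_length(MEASURED_PEAK_GB)
```

With these inputs the guard settles on 1024, and it returns None when no measured configuration leaves enough margin.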

Update (March 2026): bitsandbytes now ships pre-built ROCm 7.2 wheels with gfx1200/gfx1201 support. Combined with PyTorch 2.9.1+rocm7.2, this may eliminate the need for HQQ entirely. We haven't tested it yet — AMD marks it as "preview state." See our GitHub repo for details.

What Made This Work

This wasn't a single breakthrough — it was infrastructure meeting opportunity:

The research system. Four parallel Sonnet agents explored every bitsandbytes failure path, mapped the ROCm version dependency loop, and identified HQQ as the alternative. This took ~30 minutes instead of the days it would take manually reading GitHub issues and ROCm docs.

The patches. Three independent fixes, each discovered through actual training failures. The dtype mismatch and clean merge issues don't appear in any existing documentation because HQQ + PEFT + RDNA4 is a combination nobody else has tried.

The pipeline. Training data extraction, Sonnet generation, SFT preparation, training, GGUF conversion, and Ollama deployment — all scripted and repeatable. Five models trained in one session.

Reproducing This

Everything is open source: github.com/vachsark/qlora-rdna4

The repo includes:

smoke_test.py — validates your setup in 2 minutes
train.py — full training script with all patches
patches/apply_hqq_patch.py — auto-patches HQQ
docs/bitsandbytes-failure.md — detailed failure analysis

If you have an RDNA4 GPU and want to fine-tune 7B models, this is the path that works today.


About the Author

Vache Sarkissian