
7B QLoRA Fine-Tuning on AMD RDNA4: The HQQ Path

8 min read · by Vache Sarkissian
Updated June 3, 2026 · Reviewed March 29, 2026
Tags: gpu, qlora, rdna4, amd, fine-tuning, hqq, machine-learning

Written by Claude (Opus 4.6). Vache prompted, reviewed, and published. The data and benchmarks are real; the prose is AI-generated.

I fine-tune task-specific LLMs on my RX 9070 XT. Small models (1.5B–3B) train in bf16 with room to spare. But 7B models need ~14GB just for weights — leaving nothing for activations and gradients on a 16GB card. The standard solution is QLoRA: 4-bit quantized weights with LoRA adapters trained in bf16.
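The ~14GB figure follows from two bytes per bf16 parameter. A minimal sketch of that arithmetic, assuming a round 7e9 parameters and counting only the weights themselves (no quantization metadata, activations, or optimizer state):

```python
def weight_gb(n_params: float, bits_per_weight: float) -> float:
    """Memory for the weights alone, in decimal GB."""
    return n_params * bits_per_weight / 8 / 1e9


# 7e9 is a round-number assumption, not the exact parameter count.
print(f"bf16:  {weight_gb(7e9, 16):.1f} GB")
print(f"4-bit: {weight_gb(7e9, 4):.1f} GB (before per-group scale/zero-point overhead)")
```

Real 4-bit footprints land higher than the raw 4-bit number because each quantization group also stores its own scale and zero-point, and any layers left unquantized stay in bf16.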

The standard QLoRA library is bitsandbytes. It doesn't work on RDNA4.

The bitsandbytes Dead End

Two independent failure paths, both unfixable without upstream changes:

Path 1: PyPI wheel. The pre-built binary doesn't include HIP kernels for gfx1200/gfx1201 (RDNA4's ISA targets). You get hipErrorNoBinaryForGpu — the GPU literally can't find executable code in the binary.

Path 2: Build from source. This is the interesting one. Building bitsandbytes with cmake -DCOMPUTE_BACKEND=hip -DBNB_ROCM_ARCH="gfx1200" compiles successfully. The problem is the ROCm version mismatch:

  • gfx1200 support requires ROCm 7.2+ headers (ROCm 6.3 doesn't know the architecture exists)
  • Compiling against ROCm 7.2 produces a binary that needs ROCm 7.2's HSA runtime
  • PyTorch ships with ROCm 6.3's runtime bundled

The result: undefined symbol: hsa_amd_memory_get_preferred_copy_engine. The symbol exists in ROCm 7.2 but not 6.3. There's no way to satisfy all three constraints — gfx1200 support, ROCm 7.2 headers, ROCm 6.3 runtime — simultaneously.

This is a dependency loop that only resolves when PyTorch ships a ROCm 7.2 wheel.
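To check which side of the mismatch an installed PyTorch build is on, a version comparison along these lines may help. This is a sketch: `rocm_at_least` is a hypothetical helper name, and it assumes the string in `torch.version.hip` (which is `None` on non-ROCm builds) starts with the ROCm major.minor version.

```python
import re


def rocm_at_least(hip_version, minimum=(7, 2)):
    """Return True if a ROCm/HIP version string is at least `minimum`."""
    if not hip_version:
        return False  # None or empty: not a ROCm build
    match = re.match(r"(\d+)\.(\d+)", hip_version)
    if match is None:
        return False
    return (int(match.group(1)), int(match.group(2))) >= minimum


# Typical use (requires PyTorch):
#   import torch
#   print(rocm_at_least(torch.version.hip))
print(rocm_at_least("6.3.42131-fa1d09cbd"))  # a 6.3-style version string
print(rocm_at_least("7.2.0"))
```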

The Solution: HQQ

HQQ (Half-Quadratic Quantization) provides 4-bit quantization with a pure PyTorch backend. No custom CUDA kernels. No HIP kernels. Just torch.matmul on dequantized weights. It works on any device PyTorch supports.

The approach:

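import torch
from transformers import AutoModelForCausalLM
from hqq.core.quantize import HQQLinear, HQQBackend, BaseQuantizeConfig

# Example target list (an assumption; use the projections you attach LoRA to)
TARGET_MODULES = ["q_proj", "k_proj", "v_proj", "o_proj"]

def set_module(model, name, new_module):
    # Minimal helper (assumed implementation): swap the submodule at dotted path `name`
    parent_name, _, child_name = name.rpartition(".")
    parent = model.get_submodule(parent_name) if parent_name else model
    setattr(parent, child_name, new_module)

# Load model in bf16 to CPU
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct",
    dtype=torch.bfloat16,
    device_map="cpu",
)

# Quantize target layers to 4-bit, move to GPU
quant_config = BaseQuantizeConfig(nbits=4, group_size=64)
# list(...) snapshots the module tree so replacing layers mid-loop is safe
for name, module in list(model.named_modules()):
    if isinstance(module, torch.nn.Linear):
        if any(t in name for t in TARGET_MODULES):
            hqq_layer = HQQLinear(module, quant_config,
                compute_dtype=torch.bfloat16, device="cuda:0")
            # Replace the module in-place
            set_module(model, name, hqq_layer)

model = model.to("cuda:0")
HQQLinear.set_backend(HQQBackend.PYTORCH)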

The key detail: we don't use HqqConfig from transformers (broken in 5.x). We load the model normally, then manually replace each linear layer with an HQQLinear that quantizes the weights to 4-bit on the GPU. The PYTORCH backend tells HQQ to dequantize weights on-the-fly during forward passes using pure PyTorch ops.
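To make "quantize to 4-bit, dequantize on the fly" concrete, here is a toy version of group-wise quantization in plain Python. It is not HQQ's algorithm: HQQ fits the scale and zero-point with a half-quadratic solver, while this sketch uses simple min-max rounding. The function names are hypothetical.

```python
def quantize_group(values, nbits=4):
    """Min-max affine quantization of one group of floats to integer codes."""
    levels = 2 ** nbits - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels or 1.0  # constant group: avoid dividing by zero
    codes = [min(levels, max(0, round((v - lo) / scale))) for v in values]
    return codes, scale, lo


def dequantize_group(codes, scale, zero):
    """Reconstruct approximate floats; this is what runs before each matmul."""
    return [c * scale + zero for c in codes]


group = [-1.0, -0.5, 0.0, 0.25, 1.0]
codes, scale, zero = quantize_group(group)
restored = dequantize_group(codes, scale, zero)
print(codes)
print([round(x, 3) for x in restored])
```

With group_size=64, each run of 64 weights shares one scale and one zero-point, so the per-weight reconstruction error is bounded by half of that group's scale.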

After quantization: 5.85 GB VRAM for a 7B model. Plenty of room for LoRA adapters and training.

Three Patches Nobody Documented

Getting HQQ + PEFT + SFTTrainer to actually work required three fixes that aren't in any documentation or issue tracker.

1. HQQ dtype mismatch

PEFT's prepare_model_for_kbit_training upcasts certain layers to float32 for training stability. But HQQ's matmul() doesn't cast its input tensor to match the dequantized weight dtype. The matmul receives float32 inputs and bfloat16 weights — dtype mismatch error.

Fix: two one-line patches in hqq/core/quantize.py:

# In matmul(): add x = x.to(weight.dtype)
# In forward_pytorch(): change torch.matmul(x, w.t()) to torch.matmul(x.to(w.dtype), w.t())

2. Python 3.14 torch.compile

HQQ uses @torch.compile() as a class-level method decorator. Python 3.14 raises RuntimeError: torch.compile is not supported on Python 3.14+. Since we're not actually using torch.compile for training, the fix is a monkey-patch that returns an identity decorator:

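import sys
import torch

if sys.version_info >= (3, 14):
    _orig = torch.compile
    def _patched(fn=None, *args, **kwargs):
        # Used as @torch.compile(...): return an identity decorator
        if fn is None: return lambda f: f
        # Used as @torch.compile or torch.compile(fn): return fn uncompiled
        if callable(fn): return fn
        return _orig(fn, *args, **kwargs)
    torch.compile = _patched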

This must run before any HQQ import.

3. Clean LoRA merge

After training, you want to merge the LoRA adapter back into the base model for deployment. But model.merge_and_unload() on an HQQ model produces tensors with HQQ-specific names (W_q, meta, etc.) that downstream tools (llama.cpp, Ollama) can't parse.

The fix: save the LoRA adapter separately, reload the base model in bf16 without HQQ, apply the adapter, then merge:

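import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

model_name = "Qwen/Qwen2.5-7B-Instruct"  # same base model used for training

# Save adapter after training (`trainer` is the SFTTrainer from the training run)
trainer.save_model("lora-adapter/")

# Clean merge: base model (bf16) + LoRA adapter
base = AutoModelForCausalLM.from_pretrained(model_name,
    dtype=torch.bfloat16, device_map="cpu")
peft_model = PeftModel.from_pretrained(base, "lora-adapter/")
merged = peft_model.merge_and_unload()
merged.save_pretrained("merged-model/")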

This produces a standard HuggingFace model that converts cleanly to GGUF.

Three Gotchas That Cost Hours

Beyond the patches, three training/deployment bugs wasted entire debugging sessions before we identified them. They're the kind of failures that look like training quality issues but aren't.

4. Ollama Modelfile needs the chat template (silent, catastrophic)

When you register a custom GGUF with ollama create, you must include the model's chat template in the Modelfile. Without it, Ollama sends raw text — the model can't parse system/user/assistant role boundaries and generates input continuations instead of following instructions.

This one is insidious because the symptom looks like a training failure: the model outputs note content or Q&A instead of task-specific output. We spent an entire debugging session (6 iterations of retraining) before realizing the model weights were fine — Ollama just wasn't formatting the prompt correctly.

For Qwen2.5 models, the Modelfile needs the full ChatML template:

FROM your-model.Q8_0.gguf
TEMPLATE """{{- if .Messages }}
{{- if or .System .Tools }}<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}
{{- range $i, $_ := .Messages }}
{{- $last := eq (len (slice $.Messages $i)) 1 -}}
{{- if eq .Role "user" }}<|im_start|>user
{{ .Content }}<|im_end|>
{{ else if eq .Role "assistant" }}<|im_start|>assistant
{{ .Content }}{{ if not $last }}<|im_end|>
{{ end }}
{{- end }}
{{- if and (ne .Role "assistant") $last }}<|im_start|>assistant
{{ end }}
{{- end }}
{{- end }}"""
PARAMETER stop <|im_start|>
PARAMETER stop <|im_end|>

Without this, every fine-tuned model you deploy will appear broken. With it, every one works instantly.
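When debugging this, it can help to write out what a correctly templated prompt should look like and compare it against what the model actually receives. A minimal sketch of ChatML rendering for plain system/user/assistant turns (no tool calls); `render_chatml` is a hypothetical helper, not part of Ollama or transformers:

```python
def render_chatml(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} dicts as a ChatML prompt string."""
    parts = []
    for message in messages:
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    if add_generation_prompt:
        # Open the assistant turn so the model answers instead of continuing the input
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


print(render_chatml([
    {"role": "system", "content": "You are a knowledge judge."},
    {"role": "user", "content": "Rate this note."},
]))
```

If the text reaching the model lacks these `<|im_start|>` and `<|im_end|>` markers, the model sees one undifferentiated block of text, which matches the "continues the input" symptom described above.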

5. completion_only_loss is mandatory for high input/output ratios

If your training data has long inputs and short outputs (ratio > 5:1), you must use completion_only_loss=True in SFTConfig. Without it, the majority of the gradient signal teaches the model to predict input tokens. The model learns "continue the input" instead of "follow instructions and produce output."

Our training data had ratios from 6:1 to 28:1. All models trained without loss masking scored 0/5 on format compliance. The unfine-tuned base model outperformed them.

In TRL, completion_only_loss=True requires your dataset to have prompt and completion columns instead of the standard messages format. We pre-process the data to split out the prompt (system + user messages with chat template applied) from the completion (assistant response).
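The pre-processing step could look roughly like this. It is a sketch under assumptions: `to_prompt_completion` is a hypothetical name, each example ends with exactly one assistant message, and `tokenizer` is a Hugging Face tokenizer exposing `apply_chat_template`. Whether the trainer appends an end-of-turn token to the completion depends on the TRL version, so check that before adding one yourself.

```python
def to_prompt_completion(example, tokenizer):
    """Split a chat example into the prompt/completion columns used for loss masking."""
    messages = example["messages"]
    if len(messages) < 2 or messages[-1]["role"] != "assistant":
        raise ValueError("expected the last message to be the assistant response")
    # System + user turns, rendered with the chat template and an open assistant turn
    prompt = tokenizer.apply_chat_template(
        messages[:-1], tokenize=False, add_generation_prompt=True
    )
    return {"prompt": prompt, "completion": messages[-1]["content"]}


# Typical use with a datasets.Dataset (assumption):
#   dataset = dataset.map(lambda ex: to_prompt_completion(ex, tokenizer),
#                         remove_columns=["messages"])
```

With the columns split this way, only completion tokens contribute to the loss, so a 28:1 input/output ratio no longer drowns out the part the model is supposed to learn.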

6. Truncate from the middle, not the ends

When inputs exceed max_length, the default truncation strategies fail:

  • Right-truncation (TRL default): cuts off the completion — the model trains on incomplete outputs
  • Left-truncation: removes task framing headers (e.g., === KNOWLEDGE JUDGE DATA ===)

The fix: truncate from the middle of the user content. Spend the first third of the token budget on the start of the content (task framing and headers) and the remaining two thirds on the end (most recent/relevant content), dropping the intermediate content between them. This preserves both the task structure and the most useful data.
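A minimal sketch of middle truncation over a token list. `truncate_middle` is a hypothetical helper, and `max_len` here means the budget left for the user content alone, after reserving room for the chat template, system prompt, and completion.

```python
def truncate_middle(tokens, max_len):
    """Keep the first third and last two thirds of the budget; drop the middle."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    if len(tokens) <= max_len:
        return list(tokens)
    head = max_len // 3      # task framing and headers
    tail = max_len - head    # most recent / most relevant content
    return list(tokens[:head]) + list(tokens[len(tokens) - tail:])


token_ids = list(range(30))
print(truncate_middle(token_ids, 9))
```

Truncating on token IDs rather than characters keeps the result exactly within budget; decode the kept tokens back to text before applying the chat template.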

Production Training Results

With these patches in place, I trained five task-specific models from Qwen2.5-7B-Instruct. The table lists results for four of them:

Model              Examples  Peak VRAM  Time      max_length  Final Loss  Eval Accuracy
Knowledge judge    67        14.53 GB   20.5 min  2048        1.85        63.9%
Goal planner       99        12.22 GB   13.8 min  1024        1.34        73.4%
Research seeder    64        12.22 GB   8.7 min   1024        1.86        61.8%
Technical analyst  77        15.70 GB   ~24 min   2048        1.24        69.5%

Training data was generated by running each task's prompt through Sonnet 4.6 with real vault data (580 API calls, ~$0 on Max subscription). The models were then quantized to Q8_0 GGUF and deployed to Ollama, replacing 9GB system-prompt models with 8.1GB fine-tuned ones that actually follow the expected output format.

The VRAM Constraint

The main knob is max_length. At 7B HQQ with gradient checkpointing:

  • 1024 tokens: 12.2 GB peak, 3.7 GB headroom — reliable, recommended
  • 2048 tokens: 14.5–15.7 GB peak — works most of the time but risks silent hangs at 99.5% utilization
  • 4096 tokens: OOM on 16 GB

For 16GB cards, max_length=1024 is the practical ceiling for 7B training. We learned this the hard way: a training run at max_length=2048 peaked at 15.7/17.1 GB (99.5%), the process hung silently with 0% GPU utilization, and force-killing it locked the GPU completely — requiring a full system reboot. Both HIP/ROCm and Vulkan became unresponsive. The memory allocator appears to deadlock at extreme utilization levels on RDNA4.
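One way to avoid reaching that state is to check free memory between steps and stop cleanly when headroom gets thin. This is a sketch only: `has_headroom` and the 5% threshold are hypothetical, and it has not been verified to prevent the hang described above.

```python
def has_headroom(free_bytes, total_bytes, min_free_fraction=0.05):
    """Return True if at least `min_free_fraction` of device memory is still free."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return free_bytes / total_bytes >= min_free_fraction


# Typical use (requires PyTorch; torch.cuda.mem_get_info returns (free, total) in bytes):
#   import torch
#   free, total = torch.cuda.mem_get_info()
#   if not has_headroom(free, total):
#       raise SystemExit("VRAM headroom too low; lower max_length and restart")
print(has_headroom(free_bytes=1 * 1024**3, total_bytes=16 * 1024**3))
```

Exiting on your own terms leaves the GPU in a recoverable state, which is preferable to force-killing a hung process.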

Update (March 2026): bitsandbytes now ships pre-built ROCm 7.2 wheels with gfx1200/gfx1201 support. Combined with PyTorch 2.9.1+rocm7.2, this may eliminate the need for HQQ entirely. We haven't tested it yet — AMD marks it as "preview state." See our GitHub repo for details.

What Made This Work

This wasn't a single breakthrough — it was infrastructure meeting opportunity:

The research system. Four parallel Sonnet agents explored every bitsandbytes failure path, mapped the ROCm version dependency loop, and identified HQQ as the alternative. This took ~30 minutes instead of the days it would take manually reading GitHub issues and ROCm docs.

The patches. Three independent fixes, each discovered through actual training failures. We found no existing documentation of the dtype mismatch or the clean merge issue, most likely because few people have run HQQ + PEFT on RDNA4.

The pipeline. Training data extraction, Sonnet generation, SFT preparation, training, GGUF conversion, and Ollama deployment — all scripted and repeatable. Five models trained in one session.

Reproducing This

Everything is open source: github.com/vachsark/qlora-rdna4

The repo includes:

  • smoke_test.py — validates your setup in 2 minutes
  • train.py — full training script with all patches
  • patches/apply_hqq_patch.py — auto-patches HQQ
  • docs/bitsandbytes-failure.md — detailed failure analysis

If you have an RDNA4 GPU and want to fine-tune 7B models, this is the path that works today.

About the Author

Vache Sarkissian

Building research infrastructure and products at the intersection of knowledge systems and machine learning. Creator of Linesheet Pro, vault-search, and the vachsark learning engine.
