Can a smaller model outperform a larger one after fine-tuning?

Yes. Qwen3-4B (4 billion parameters) scored 4.8/5 on format compliance after fine-tuning, beating the production Qwen2.5-7B (3.0/5). The key factor is VRAM headroom — a 4B model in full bf16 leaves 8GB free for gradients and optimizer state, enabling cleaner fine-tuning than a quantized 8B model with barely enough memory.

Why did the 8B model fail to fine-tune properly?

Qwen3-8B in HQQ 4-bit quantization occupied nearly all 16GB VRAM, leaving minimal headroom for training. The quantization artifacts compounded with the model's reasoning-focused architecture, causing format regression. The model kept generating internal chain-of-thought tokens instead of clean structured output.

What is HQQ quantization and why is it needed on AMD GPUs?

HQQ (Half-Quadratic Quantization) is a pure PyTorch quantization method that works on AMD GPUs. bitsandbytes, the more common QLoRA quantization library, is not compatible with AMD RDNA 4 hardware. HQQ achieves comparable quality to bitsandbytes 4-bit quantization while running on AMD Vulkan.

How much does it cost to run 13 fine-tuned models locally?

Zero API cost. All 13 models run on a single AMD RX 9070 XT (16GB VRAM) via Ollama. The only cost is electricity. This is the core motivation — replacing cloud API calls for repetitive structured tasks with local fine-tuned specialists.

How a 4B Model Beat a 7B: Fine-Tuning Bake-off on 16GB VRAM

I run 13 fine-tuned models on a single AMD RX 9070 XT (16GB VRAM). They handle knowledge scoring, goal planning, note creation, quiz generation, spaced repetition — all at zero API cost via Ollama. The models are trained with QLoRA via HQQ (because bitsandbytes doesn't work on RDNA 4) on 42-99 Sonnet-generated examples each.

The current fleet is all Qwen2.5-based: four 7B models (HQQ 4-bit QLoRA), one 3B, and three 1.5B models. When Qwen3 dropped with claims that the 4B model "matches Qwen2.5-72B benchmarks," the question was obvious: can we switch to smaller, better models?

The answer was not what I expected.

The Bake-off: 7 Models, 8 Tasks

Seven candidate models, tested across 8 production tasks with real golden test inputs (not synthetic benchmarks). Each model got the exact system prompt and input data our heartbeat system uses in production.

Candidates: Qwen3-4B, Qwen3-8B, Qwen3-1.7B, Qwen3.5-9B, Qwen2.5-7B, Qwen2.5-3B, Gemma3-12B

Scoring: Format compliance (regex checks against required output structure), content quality (task-appropriate keywords), consistency (same input run 3 times), and inference speed.

Base Model Results (No Fine-Tuning)

Task	Winner	Qwen3-4B	Qwen3-8B	Qwen2.5-7B
Judge	Qwen3-8B (84%)	48% (last)	84%	80%
Planner	Qwen3-1.7B (100%)	40% (last)	100%	100%
Seeder	Qwen3-8B (100%)	19% (last)	100%	100%
Spacer	Qwen3-8B (100%)	83%	100%	100%
Quizzer	Qwen2.5-7B (100%)	100%	100%	100%
Reflector	Qwen3-8B (100%)	100%	100%	75%

Qwen3-4B was dead last on 3 out of 6 scored tasks. The "matches 72B benchmarks" claim is about MMLU and reasoning — not structured output compliance. For format following (markdown tables, specific headers, SUMMARY lines), the 4B base model is terrible. Even with think: false in the Ollama API, it produces reasoning-style output instead of structured format.

Qwen3-8B won the bake-off: 4/7 tasks, strong format compliance, good content quality.

The obvious conclusion: fine-tune the 8B winner.

That's what I did. It didn't work.

The 8B Failure

I fine-tuned Qwen3-8B on all four core tasks using the same Sonnet-generated training data as the production Qwen2.5-7B models. Initial results:

Model	Format Score	What Happened
Judge-8B	0/5	Echoed the template instructions back
Seeder-8B	0/8	Produced 78 tokens total
Reflector-8B	0/4	Echoed the input data
Planner-8B	5/5	Perfect

Only planner worked. Everything else was garbage. I spent two days debugging this:

Bug #1: Completion Truncation

The training pipeline had a critical bug — it was capping completions at max_length // 2 tokens:

# THE BUG
max_completion_tokens = max_len // 2  # 512 for max_length=1024
if len(completion_ids) > max_completion_tokens:
    completion_ids = completion_ids[:max_completion_tokens]  # chop the output!

For the judge task, the median completion is 1,275 tokens. At max_length=1024, the cap was 512. 84% of judge completions were being cut in half. The model was learning from incomplete outputs — it literally never saw a complete evaluation with all five criteria scored and a SUMMARY line.

Task	Median Completion	% That Fit at max_length=1024	% at 2048
Seeder	429 tokens	65%	84%
Judge	1,275 tokens	16%	35%
Reflector	362 tokens	58%	100%

Fix: never truncate completions. Only truncate prompts. Skip examples where the completion alone exceeds max_length.

Bug #2: Ollama Template Missing Think Support

Qwen3 models use <think>...</think> tags for reasoning. The Ollama think: false API parameter only works if the model's template includes IsThinkSet logic. Our custom GGUF modelfile didn't have this — the model was burning all output tokens on thinking, returning empty content.

The fix required copying the official Qwen3 template from Ollama's qwen3:8b model, which injects /no_think into the user message and prepends an empty <think></think> block to the assistant turn.

After Fixes: Still Not Enough

With both bugs fixed, I retrained with more epochs (3→7), higher LoRA rank (16→32), format skeleton examples, and few-shot injection. Results improved but couldn't match production:

Model	Before Fixes	After Fixes	Production Qwen2.5-7B
Judge-8B	0/5	3/5	5/5
Seeder-8B	0/8	3/8	6/8
Reflector-8B	0/4	2/4	3/4

The fundamental problem: Qwen3-8B with HQQ 4-bit quantization on 16GB VRAM is too tight. At max_length=2048, it hits 15.95 GB (out of 16.0 GB). LoRA rank is capped at r=16. The quantization itself introduces noise. The model doesn't have enough room to learn.

At this point I had a conclusion ready to write: "Qwen2.5-7B wins for structured output, Qwen3 is better at reasoning but can't follow formats."

Then I tried the 4B model.

The Plot Twist: 4B Beats Everything

The insight came from staring at the VRAM numbers. Qwen3-4B in bf16 (no quantization) is about 8GB. That leaves 8GB for activations, gradients, and optimizer state. Compare:

	Qwen3-8B (HQQ 4-bit)	Qwen3-4B (full bf16)
Base model VRAM	6.4 GB (quantized)	8 GB (lossless)
Max LoRA rank at 2048	r=16 (r=32 OOMs)	r=32 easily
Peak VRAM at 2048	15.95 GB (razor thin)	13.37 GB (comfortable)
Weight precision	4-bit (lossy)	bf16 (lossless)
Trainable params	43.6M (0.92%)	66.1M (1.62%)
Training speed	~50s/step	~42s/step

The 4B model has lossless weights (no HQQ quantization noise), higher LoRA rank (more learning capacity), and comfortable VRAM margins (no risk of OOM). The 8B model is technically "bigger" but operates under such severe constraints that it can't actually learn as effectively.

Judge-4B Results

Same training data, same hyperparameters (except r=32 instead of r=16, and full LoRA instead of HQQ):

Test Input	Production 7B (Qwen2.5)	Judge-8B (Qwen3)	Judge-4B (Qwen3)
Empty input	0/5	3/5	5/5
Knowledge batch (3 notes)	5/5	3/5	5/5
Output judge format	3/5	3/5	5/5
Short note	4/5	—	5/5
Wrong format input	3/5	—	4/5
Average	3.0/5	3.0/5	4.8/5

The 4B model that scored dead last in the base bake-off is now the best fine-tuned judge I've ever trained.

It produces correct ### note-name — PASS (4.2/5) headers, proper 5-row evaluation tables, accurate SUMMARY: lines — all the format elements that the 8B model couldn't learn because it was too VRAM-constrained to train properly.

Why This Happened

The conventional wisdom is "bigger model = better fine-tuning." But that assumes unlimited resources. On a fixed VRAM budget, the relationship inverts:

Bigger model → more quantization → less VRAM for LoRA → lower rank → noisier gradients → worse fine-tuning

Smaller model → no quantization needed → more VRAM for LoRA → higher rank → cleaner gradients → better fine-tuning

There's a sweet spot where the model is large enough to have the capacity for your task, but small enough to train with lossless weights and high LoRA rank on your hardware. For structured output tasks on 16GB VRAM, that sweet spot appears to be 4B parameters.

The specific factors:

Lossless weights: Full bf16 LoRA preserves the pre-trained knowledge perfectly. HQQ 4-bit quantization introduces noise that the LoRA adapter has to work around.
Higher LoRA rank: r=32 at max_length=2048 gives the model 66M trainable parameters (1.6% of total). The 8B model at r=16 only gets 44M (0.9%). More parameters = more capacity to learn the output format.
No OOM anxiety: At 13.37 GB peak (83% of 16 GB), training is stable. The 8B model at 15.95 GB (99.7%) is one allocation away from a crash. Stale GPU memory from Ollama or a previous run would cause phantom OOMs.
Same training data: Both models used identical Sonnet-generated training examples with the same completion-preservation fix. The only difference was the training conditions.

Training Details

For anyone who wants to reproduce this:

Base model: Qwen/Qwen3-4B
Method: Full LoRA (bf16, no quantization)
LoRA: r=32, alpha=64, dropout=0.05
Target: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
Max length: 2048
Epochs: 5
Batch: 1, gradient accumulation: 8
LR: 5e-5, cosine schedule, 10 warmup steps
Loss: completion_only_loss=True (CRITICAL — only learn the output format)
NEFTune: alpha=5
Training examples: 63 (14 skipped for completion > max_length)
Peak VRAM: 13.37 GB
Wall clock: 28.4 minutes
GPU: AMD RX 9070 XT (16GB, RDNA 4)
Framework: TRL SFTTrainer + PEFT LoRA

Key implementation details:

Never truncate completions: The completion is what you're teaching the model to produce. If you truncate it, you teach truncated output. Skip examples where len(completion) + len(prompt_skeleton) > max_length.
Middle-truncate prompts: Keep the first 1/3 (task framing headers) and last 2/3 (recent content). Remove the middle.
Ollama template: Custom GGUF models need the official Qwen3 template with IsThinkSet logic for think: false to work. Without it, the model ignores the flag and burns tokens on <think> reasoning.
Clear VRAM before training: curl -s .../api/generate -d '{"model":"...","keep_alive":0}' then stop Ollama. Stale GPU memory causes phantom OOMs.

Conversion pipeline: HuggingFace → llama.cpp convert_hf_to_gguf.py (F16) → llama-quantize (Q8_0) → Ollama modelfile with Qwen3 ChatML template.

Current Fleet

Task	Old Model	New Model	Format Score	Size
Judge	vault-judge-7b-q8 (Qwen2.5, 8.1 GB)	vault-judge-4b-q8 (Qwen3, 4.4 GB)	3.0→4.8/5	-46%
Planner	vault-planner-7b-q8 (Qwen2.5, 8.1 GB)	vault-planner-8b-q8 (Qwen3, 8.7 GB)	5/5→5/5	Same
Seeder	vault-seeder-7b-q8 (Qwen2.5)	Testing 4B next	—	—
Reflector	vault-reflector-3b-q8 (Qwen2.5)	Testing 4B next	—	—

What I'd Like Help With

I'm sharing this because the result surprised me and I'm not sure I fully understand the dynamics yet. If you work with QLoRA/LoRA fine-tuning on consumer GPUs, I'd love to hear:

Is there a known relationship between quantization noise and LoRA learning? My hypothesis is that HQQ 4-bit introduces gradient noise that the LoRA adapter can't compensate for at low rank. But I haven't found papers that quantify this for HQQ specifically (most research uses bitsandbytes NF4).
Optimal model size for a given VRAM budget: Is there a formula or heuristic? Something like "for structured output SFT, use the largest model that fits in bf16 with r≥32 at your target max_length." I arrived at 4B by accident — is there a principled way to pick this?
Better approaches for thinking-mode models: Qwen3's <think> mechanism fights structured output even after fine-tuning. I tried few-shot injection, format skeleton examples, and extra epochs. What actually works to teach a thinking-mode model to produce structured output without reasoning preamble?
Training data efficiency: I'm working with 42-99 Sonnet-generated examples per task. The 4B model learned the format in 5 epochs at r=32. Is there a minimum viable dataset size for format-following SFT? Could I get away with 20 examples?

The full training pipeline, bake-off framework, configs, and golden test sets are in the qlora-rdna4 repo. Everything runs on a single AMD RX 9070 XT with zero cloud costs.

Lessons

Base model benchmarks don't predict fine-tuning quality. Qwen3-4B scored last in our bake-off (19-48% format compliance). After fine-tuning, it beat every other model including the bake-off winner (Qwen3-8B) and our production Qwen2.5-7B. The base model evaluation measures a completely different capability than what fine-tuning teaches.

VRAM headroom matters more than parameter count. On a fixed 16GB budget, a 4B model with lossless weights and r=32 LoRA outperforms an 8B model with 4-bit quantization and r=16. The constraint isn't the model's theoretical capacity — it's the quality of training you can afford within your hardware limits.

Never truncate completions. The completion is the exact output format you're teaching the model. If you cut it, you teach cut outputs. This single bug — capping completions at max_length // 2 — caused weeks of "why won't it learn the format?" debugging.

Clear your VRAM between runs. On AMD RDNA 4, stale GPU memory from Ollama model caching or crashed training processes isn't automatically freed. Always verify rocm-smi --showmemuse shows 0% before training. This alone prevented 3 phantom OOMs.

How a 4B Model Beat a 7B: Fine-Tuning Bake-off on 16GB VRAM

The Bake-off: 7 Models, 8 Tasks

Base Model Results (No Fine-Tuning)

The 8B Failure

Bug #1: Completion Truncation

Bug #2: Ollama Template Missing Think Support

After Fixes: Still Not Enough

The Plot Twist: 4B Beats Everything

Judge-4B Results

Why This Happened

Training Details

Current Fleet

What I'd Like Help With

Lessons

Further Reading

Related Articles

7B QLoRA Fine-Tuning on AMD RDNA4: The HQQ Path

Vault Autoresearch: A Personal AI Learns From Itself

Overnight Results: From 3.76 to 4.35

About the Author

Vache Sarkissian