How a 4B Model Beat a 7B: Fine-Tuning Bake-off on 16GB VRAM

10 min read · by Vache Sarkissian · Updated June 3, 2026 · Reviewed March 29, 2026
Tags: fine-tuning, qlora, qwen3, rdna4, hqq, local-models, bakeoff

Written by Claude (Opus 4.6). Vache prompted, reviewed, and published. The data and benchmarks are real; the prose is AI-generated.

I run 13 fine-tuned models on a single AMD RX 9070 XT (16GB VRAM). They handle knowledge scoring, goal planning, note creation, quiz generation, spaced repetition — all at zero API cost via Ollama. The models are trained with QLoRA via HQQ (because bitsandbytes doesn't work on RDNA 4) on 42-99 Sonnet-generated examples each.

The current fleet is all Qwen2.5-based: four 7B models (HQQ 4-bit QLoRA), one 3B, and three 1.5B models. When Qwen3 dropped with claims that the 4B model "matches Qwen2.5-72B benchmarks," the question was obvious: can we switch to smaller, better models?

The answer was not what I expected.

The Bake-off: 7 Models, 8 Tasks

Seven candidate models, tested across 8 production tasks with real golden test inputs (not synthetic benchmarks). Each model got the exact system prompt and input data our heartbeat system uses in production.

Candidates: Qwen3-4B, Qwen3-8B, Qwen3-1.7B, Qwen3.5-9B, Qwen2.5-7B, Qwen2.5-3B, Gemma3-12B

Scoring: Format compliance (regex checks against required output structure), content quality (task-appropriate keywords), consistency (same input run 3 times), and inference speed.
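A format-compliance check of this kind can be sketched with a few regular expressions. The patterns below are assumptions modeled on the judge output described later in this post (a `### note-name — PASS (4.2/5)` header, a 5-row evaluation table, and a `SUMMARY:` line). The function name and the equal weighting of the three checks are hypothetical, not the bake-off framework's actual code.

```python
import re

# Hypothetical patterns: each one is an assumption based on the format
# elements named in this post, not the real golden-test suite.
HEADER = re.compile(r"^### .+ [—-] (PASS|FAIL) \(\d(\.\d)?/5\)[ \t]*$", re.M)
SUMMARY = re.compile(r"^SUMMARY: .+", re.M)
TABLE_ROW = re.compile(r"^\|.*\|[ \t]*$", re.M)
SEPARATOR_ROW = re.compile(r"^\|[ \t:|-]+\|[ \t]*$", re.M)


def format_score(output: str, expected_rows: int = 5) -> float:
    """Return the fraction of format checks that pass (0.0 to 1.0)."""
    passed = 0
    passed += 1 if HEADER.search(output) else 0
    passed += 1 if SUMMARY.search(output) else 0

    # Data rows = all pipe rows minus the header row and the separator row.
    rows = len(TABLE_ROW.findall(output))
    separators = len(SEPARATOR_ROW.findall(output))
    data_rows = rows - 2 * separators
    passed += 1 if separators == 1 and data_rows == expected_rows else 0

    return passed / 3
```

Running the same input three times and comparing the scores gives a rough consistency measure on top of this.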

Base Model Results (No Fine-Tuning)

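| Task | Winner | Qwen3-4B | Qwen3-8B | Qwen2.5-7B |
|---|---|---|---|---|
| Judge | Qwen3-8B (84%) | 48% (last) | 84% | 80% |
| Planner | Qwen3-1.7B (100%) | 40% (last) | 100% | 100% |
| Seeder | Qwen3-8B (100%) | 19% (last) | 100% | 100% |
| Spacer | Qwen3-8B (100%) | 83% | 100% | 100% |
| Quizzer | Qwen2.5-7B (100%) | 100% | 100% | 100% |
| Reflector | Qwen3-8B (100%) | 100% | 100% | 75% |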

Qwen3-4B was dead last on 3 out of 6 scored tasks. The "matches 72B benchmarks" claim is about MMLU and reasoning — not structured output compliance. For format following (markdown tables, specific headers, SUMMARY lines), the 4B base model is terrible. Even with think: false in the Ollama API, it produces reasoning-style output instead of structured format.

Qwen3-8B won the bake-off: it took 4 of the 6 scored tasks, with strong format compliance and good content quality.

The obvious conclusion: fine-tune the 8B winner.

That's what I did. It didn't work.

The 8B Failure

I fine-tuned Qwen3-8B on all four core tasks using the same Sonnet-generated training data as the production Qwen2.5-7B models. Initial results:

| Model | Format Score | What Happened |
|---|---|---|
| Judge-8B | 0/5 | Echoed the template instructions back |
| Seeder-8B | 0/8 | Produced 78 tokens total |
| Reflector-8B | 0/4 | Echoed the input data |
| Planner-8B | 5/5 | Perfect |

Only planner worked. Everything else was garbage. I spent two days debugging this:

Bug #1: Completion Truncation

The training pipeline had a critical bug — it was capping completions at max_length // 2 tokens:

# THE BUG
max_completion_tokens = max_len // 2  # 512 for max_length=1024
if len(completion_ids) > max_completion_tokens:
    completion_ids = completion_ids[:max_completion_tokens]  # chop the output!

For the judge task, the median completion is 1,275 tokens. At max_length=1024, the cap was 512. 84% of judge completions were being truncated. The model was learning from incomplete outputs: for most examples it never saw a complete evaluation with all five criteria scored and a SUMMARY line.

| Task | Median Completion | % That Fit at max_length=1024 | % That Fit at max_length=2048 |
|---|---|---|---|
| Seeder | 429 tokens | 65% | 84% |
| Judge | 1,275 tokens | 16% | 35% |
| Reflector | 362 tokens | 58% | 100% |

Fix: never truncate completions. Only truncate prompts. Skip examples where the completion alone exceeds max_length.
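A minimal sketch of that fix, assuming token-id lists as inputs. The function names and the `min_prompt_tokens` parameter are hypothetical. The head/tail split follows the "first 1/3, last 2/3" rule from the implementation notes later in this post, and reading that rule as a split of the remaining token budget is an assumption.

```python
from typing import List, Optional, Tuple


def middle_truncate(ids: List[int], budget: int) -> List[int]:
    """Keep the head and tail of a token list, dropping the middle."""
    if len(ids) <= budget:
        return ids
    head = budget // 3    # first third of the budget: task framing headers
    tail = budget - head  # remaining two thirds: most recent content
    return ids[:head] + ids[len(ids) - tail:]


def fit_example(
    prompt_ids: List[int],
    completion_ids: List[int],
    max_len: int,
    min_prompt_tokens: int = 0,  # hypothetical floor reserved for the prompt
) -> Optional[Tuple[List[int], List[int]]]:
    """Return (prompt, completion) fitting max_len, or None to skip."""
    if len(completion_ids) + min_prompt_tokens > max_len:
        return None  # skip: never teach a truncated completion
    budget = max_len - len(completion_ids)
    # Only the prompt is ever shortened; the completion stays intact.
    return middle_truncate(prompt_ids, budget), completion_ids
```

Returning the prompt and completion separately keeps the boundary available for a completion-only loss mask.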

Bug #2: Ollama Template Missing Think Support

Qwen3 models use <think>...</think> tags for reasoning. The Ollama think: false API parameter only works if the model's template includes IsThinkSet logic. Our custom GGUF modelfile didn't have this — the model was burning all output tokens on thinking, returning empty content.

The fix required copying the official Qwen3 template from Ollama's qwen3:8b model, which injects /no_think into the user message and prepends an empty <think></think> block to the assistant turn.
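One way to catch this failure early is to inspect raw model output for leaked reasoning before scoring it. The helper below is a hypothetical sketch, not part of Ollama or the training pipeline: it separates a `<think>` block from the answer, and treats an unclosed `<think>` tag as the case where the token budget ran out during reasoning.

```python
import re
from typing import Tuple

THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.S)


def split_think(text: str) -> Tuple[str, str]:
    """Return (reasoning, answer) from raw model output."""
    match = THINK_BLOCK.search(text)
    if match:
        reasoning = match.group(1).strip()
        answer = (text[:match.start()] + text[match.end():]).strip()
        return reasoning, answer
    if "<think>" in text:
        # Unclosed tag: generation stopped mid-reasoning, so no answer exists.
        return text.split("<think>", 1)[1].strip(), ""
    return "", text.strip()
```

An empty answer paired with non-empty reasoning is the signature of the template problem described above.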

After Fixes: Still Not Enough

With both bugs fixed, I retrained with more epochs (3→7), higher LoRA rank (16→32), format skeleton examples, and few-shot injection. Results improved but couldn't match production:

| Model | Before Fixes | After Fixes | Production Qwen2.5-7B |
|---|---|---|---|
| Judge-8B | 0/5 | 3/5 | 5/5 |
| Seeder-8B | 0/8 | 3/8 | 6/8 |
| Reflector-8B | 0/4 | 2/4 | 3/4 |

The fundamental problem: Qwen3-8B with HQQ 4-bit quantization is too tight a fit on 16GB VRAM. At max_length=2048 it peaks at 15.95 GB of 16.0 GB, and LoRA rank is capped at r=16 because r=32 runs out of memory at that length. The quantization itself also introduces noise. The model doesn't have enough room to learn.

At this point I had a conclusion ready to write: "Qwen2.5-7B wins for structured output, Qwen3 is better at reasoning but can't follow formats."

Then I tried the 4B model.

The Plot Twist: 4B Beats Everything

The insight came from staring at the VRAM numbers. Qwen3-4B in bf16 (no quantization) is about 8GB. That leaves 8GB for activations, gradients, and optimizer state. Compare:

| | Qwen3-8B (HQQ 4-bit) | Qwen3-4B (full bf16) |
|---|---|---|
| Base model VRAM | 6.4 GB (quantized) | 8 GB (lossless) |
| Max LoRA rank at 2048 | r=16 (r=32 OOMs) | r=32 easily |
| Peak VRAM at 2048 | 15.95 GB (razor thin) | 13.37 GB (comfortable) |
| Weight precision | 4-bit (lossy) | bf16 (lossless) |
| Trainable params | 43.6M (0.92%) | 66.1M (1.62%) |
| Training speed | ~50s/step | ~42s/step |

The 4B model has lossless weights (no HQQ quantization noise), higher LoRA rank (more learning capacity), and comfortable VRAM margins (no risk of OOM). The 8B model is technically "bigger" but operates under such severe constraints that it can't actually learn as effectively.
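The weight-memory side of this comparison is simple arithmetic: parameter count times bits per weight. The sketch below uses nominal parameter counts (4e9 and 8e9) as round-number assumptions. It reproduces the roughly 8 GB bf16 figure for the 4B model. For the 8B model it gives only a 4 GB lower bound; the gap to the measured 6.4 GB is presumably quantization metadata and layers kept at higher precision, which this estimate does not model.

```python
def weight_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


def headroom_gb(peak_gb: float, total_gb: float = 16.0) -> float:
    """VRAM left over at the measured training peak."""
    return total_gb - peak_gb


print(weight_gb(4e9, 16))             # 4B in bf16: 8.0
print(weight_gb(8e9, 4))              # 8B at 4-bit, lower bound: 4.0
print(round(headroom_gb(13.37), 2))   # 4B run: 2.63
print(round(headroom_gb(15.95), 2))   # 8B run: 0.05
```

The headroom numbers are the practical difference: about 2.6 GB of slack for the 4B run against 0.05 GB for the 8B run.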

Judge-4B Results

Same training data, same hyperparameters, with two exceptions: r=32 instead of r=16, and bf16 LoRA on unquantized weights instead of HQQ 4-bit QLoRA:

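| Test Input | Production 7B (Qwen2.5) | Judge-8B (Qwen3) | Judge-4B (Qwen3) |
|---|---|---|---|
| Empty input | 0/5 | 3/5 | 5/5 |
| Knowledge batch (3 notes) | 5/5 | 3/5 | 5/5 |
| Output judge format | 3/5 | 3/5 | 5/5 |
| Short note | 4/5 | n/a | 5/5 |
| Wrong format input | 3/5 | n/a | 4/5 |
| Average | 3.0/5 | 3.0/5 | 4.8/5 |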

The 4B model that scored dead last in the base bake-off is now the best fine-tuned judge I've ever trained.

It produces correct ### note-name — PASS (4.2/5) headers, proper 5-row evaluation tables, accurate SUMMARY: lines — all the format elements that the 8B model couldn't learn because it was too VRAM-constrained to train properly.

Why This Happened

The conventional wisdom is "bigger model = better fine-tuning." But that assumes unlimited resources. On a fixed VRAM budget, the relationship inverts:

Bigger model → more quantization → less VRAM for LoRA → lower rank → noisier gradients → worse fine-tuning

Smaller model → no quantization needed → more VRAM for LoRA → higher rank → cleaner gradients → better fine-tuning

There's a sweet spot where the model is large enough to have the capacity for your task, but small enough to train with lossless weights and high LoRA rank on your hardware. For structured output tasks on 16GB VRAM, that sweet spot appears to be 4B parameters.

The specific factors:

  1. Lossless weights: bf16 LoRA trains against the base weights at full precision, so the adapter starts from the model's pre-trained behavior intact. HQQ 4-bit quantization introduces noise that the LoRA adapter has to work around.

  2. Higher LoRA rank: r=32 at max_length=2048 gives the model 66M trainable parameters (1.6% of total). The 8B model at r=16 only gets 44M (0.9%). More parameters = more capacity to learn the output format.

  3. No OOM anxiety: At 13.37 GB peak (83% of 16 GB), training is stable. The 8B model at 15.95 GB (99.7%) is one allocation away from a crash. Stale GPU memory from Ollama or a previous run would cause phantom OOMs.

  4. Same training data: Both models used identical Sonnet-generated training examples with the same completion-preservation fix. The only difference was the training conditions.
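The trainable-parameter figures in point 2 can be checked by hand. A LoRA adapter on a linear layer with input width d_in and output width d_out adds r × (d_in + d_out) parameters. The layer dimensions below are assumptions based on the published Qwen3 configs and are not stated in this post, so check them against each model's config.json. With these values the totals land on the 66.1M and 43.6M reported above.

```python
from dataclasses import dataclass


@dataclass
class ModelDims:
    hidden: int        # hidden_size
    intermediate: int  # intermediate_size (MLP width)
    layers: int        # num_hidden_layers
    q_out: int         # num_attention_heads * head_dim
    kv_out: int        # num_key_value_heads * head_dim


def lora_params(d: ModelDims, r: int) -> int:
    """Count LoRA parameters when all seven projections are targeted."""
    shapes = [
        (d.hidden, d.q_out),         # q_proj
        (d.hidden, d.kv_out),        # k_proj
        (d.hidden, d.kv_out),        # v_proj
        (d.q_out, d.hidden),         # o_proj
        (d.hidden, d.intermediate),  # gate_proj
        (d.hidden, d.intermediate),  # up_proj
        (d.intermediate, d.hidden),  # down_proj
    ]
    per_layer = sum(r * (d_in + d_out) for d_in, d_out in shapes)
    return per_layer * d.layers


# Assumed dimensions; verify against the model configs before relying on them.
QWEN3_4B = ModelDims(hidden=2560, intermediate=9728, layers=36, q_out=4096, kv_out=1024)
QWEN3_8B = ModelDims(hidden=4096, intermediate=12288, layers=36, q_out=4096, kv_out=1024)

print(lora_params(QWEN3_4B, r=32))  # 66060288
print(lora_params(QWEN3_8B, r=16))  # 43646976
```

The smaller model ends up with about 1.5x the adapter capacity because the rank doubles while the layer widths shrink by less than half.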

Training Details

For anyone who wants to reproduce this:

Base model: Qwen/Qwen3-4B
Method: Full LoRA (bf16, no quantization)
LoRA: r=32, alpha=64, dropout=0.05
Target: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
Max length: 2048
Epochs: 5
Batch: 1, gradient accumulation: 8
LR: 5e-5, cosine schedule, 10 warmup steps
Loss: completion_only_loss=True (CRITICAL — only learn the output format)
NEFTune: alpha=5
Training examples: 63 (14 skipped for completion > max_length)
Peak VRAM: 13.37 GB
Wall clock: 28.4 minutes
GPU: AMD RX 9070 XT (16GB, RDNA 4)
Framework: TRL SFTTrainer + PEFT LoRA
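As a sanity check, the step count and per-step time follow from the numbers above. This sketch assumes all 63 listed examples were used, that a trailing partial accumulation window counts as a full optimizer step (behavior varies by trainer version), and that the wall-clock figure is almost entirely training time.

```python
import math


def optimizer_steps(n_examples: int, batch_size: int, grad_accum: int, epochs: int) -> int:
    """Optimizer steps for a run, rounding partial windows up."""
    micro_batches = math.ceil(n_examples / batch_size)
    return math.ceil(micro_batches / grad_accum) * epochs


steps = optimizer_steps(n_examples=63, batch_size=1, grad_accum=8, epochs=5)
seconds_per_step = 28.4 * 60 / steps

print(steps, round(seconds_per_step, 1))  # 40 42.6
```

That is consistent with the ~42s/step in the comparison table, and it means the 10 warmup steps cover a quarter of the run.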

Key implementation details:

  • Never truncate completions: The completion is what you're teaching the model to produce. If you truncate it, you teach truncated output. Skip examples where len(completion) + len(prompt_skeleton) > max_length.
  • Middle-truncate prompts: Keep the first 1/3 (task framing headers) and last 2/3 (recent content). Remove the middle.
  • Ollama template: Custom GGUF models need the official Qwen3 template with IsThinkSet logic for think: false to work. Without it, the model ignores the flag and burns tokens on <think> reasoning.
  • Clear VRAM before training: curl -s .../api/generate -d '{"model":"...","keep_alive":0}' then stop Ollama. Stale GPU memory causes phantom OOMs.
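The unload call from the last bullet can also be issued from Python before a training run. This sketch builds the same request as the curl command using only the standard library. The base URL and model name are parameters because the post elides them, and both function names are hypothetical. Sending `keep_alive: 0` asks Ollama to unload the model immediately.

```python
import json
import urllib.request


def build_unload_request(base_url: str, model: str) -> urllib.request.Request:
    """Build a POST to /api/generate that unloads `model` from memory."""
    payload = json.dumps({"model": model, "keep_alive": 0}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def unload_model(base_url: str, model: str, timeout: float = 30.0) -> int:
    """Send the unload request and return the HTTP status code."""
    request = build_unload_request(base_url, model)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Unloading the model does not replace stopping Ollama or checking GPU memory afterwards; it only removes one source of stale allocations.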

Conversion pipeline: HuggingFace → llama.cpp convert_hf_to_gguf.py (F16) → llama-quantize (Q8_0) → Ollama modelfile with Qwen3 ChatML template.

Current Fleet

| Task | Old Model | New Model | Format Score | Size |
|---|---|---|---|---|
| Judge | vault-judge-7b-q8 (Qwen2.5, 8.1 GB) | vault-judge-4b-q8 (Qwen3, 4.4 GB) | 3.0→4.8/5 | -46% |
| Planner | vault-planner-7b-q8 (Qwen2.5, 8.1 GB) | vault-planner-8b-q8 (Qwen3, 8.7 GB) | 5/5→5/5 | Same |
| Seeder | vault-seeder-7b-q8 (Qwen2.5) | Testing 4B next | | |
| Reflector | vault-reflector-3b-q8 (Qwen2.5) | Testing 4B next | | |

What I'd Like Help With

I'm sharing this because the result surprised me and I'm not sure I fully understand the dynamics yet. If you work with QLoRA/LoRA fine-tuning on consumer GPUs, I'd love to hear:

  1. Is there a known relationship between quantization noise and LoRA learning? My hypothesis is that HQQ 4-bit introduces gradient noise that the LoRA adapter can't compensate for at low rank. But I haven't found papers that quantify this for HQQ specifically (most research uses bitsandbytes NF4).

  2. Optimal model size for a given VRAM budget: Is there a formula or heuristic? Something like "for structured output SFT, use the largest model that fits in bf16 with r≥32 at your target max_length." I arrived at 4B by accident — is there a principled way to pick this?

  3. Better approaches for thinking-mode models: Qwen3's <think> mechanism fights structured output even after fine-tuning. I tried few-shot injection, format skeleton examples, and extra epochs. What actually works to teach a thinking-mode model to produce structured output without reasoning preamble?

  4. Training data efficiency: I'm working with 42-99 Sonnet-generated examples per task. The 4B model learned the format in 5 epochs at r=32. Is there a minimum viable dataset size for format-following SFT? Could I get away with 20 examples?

The full training pipeline, bake-off framework, configs, and golden test sets are in the qlora-rdna4 repo. Everything runs on a single AMD RX 9070 XT with zero cloud costs.

Lessons

Base model benchmarks don't predict fine-tuning quality. Qwen3-4B finished last on three bake-off tasks (19-48% format compliance). After fine-tuning, its judge beat both the fine-tuned Qwen3-8B (the bake-off winner) and our production Qwen2.5-7B. The base model evaluation measures a completely different capability than what fine-tuning teaches.

VRAM headroom matters more than parameter count. On a fixed 16GB budget, a 4B model with lossless weights and r=32 LoRA outperforms an 8B model with 4-bit quantization and r=16. The constraint isn't the model's theoretical capacity — it's the quality of training you can afford within your hardware limits.

Never truncate completions. The completion is the exact output format you're teaching the model. If you cut it, you teach cut outputs. This single bug — capping completions at max_length // 2 — caused weeks of "why won't it learn the format?" debugging.

Clear your VRAM between runs. On AMD RDNA 4, stale GPU memory from Ollama model caching or crashed training processes isn't automatically freed. Always verify rocm-smi --showmemuse shows 0% before training. This alone prevented 3 phantom OOMs.

About the Author

Vache Sarkissian

Building research infrastructure and products at the intersection of knowledge systems and machine learning. Creator of Linesheet Pro, vault-search, and the vachsark learning engine.
