Why is fine-tuning better than prompt engineering for format compliance?

Reasoning models like Qwen3 8B prioritize explanation and thinking over exact output format. Fine-tuning a smaller instruction-following model (Qwen2.5 1.5B) on real examples achieves 100% format compliance in 5 minutes, versus 10% with prompt engineering on the larger model.

How much VRAM does LoRA fine-tuning require?

On AMD RX 9070 XT, 16-bit LoRA fine-tuning requires approximately 5.7 GB peak VRAM. This fits comfortably on consumer GPUs with 6-16GB VRAM, making it accessible without enterprise hardware.

How long does fine-tuning take?

Training a fine-tuned model on 515 examples takes 4-5 minutes on an AMD RX 9070 XT. The complete pipeline from data generation to Ollama-ready model takes about 1 hour.

What format compliance rates did you achieve?

Fine-tuned Qwen2.5 1.5B achieved 100% format compliance across three tasks (topic deepening, quiz generation, spaced repetition scheduling), compared to 10% compliance on the larger reasoning model.

Is QLoRA supported on AMD consumer GPUs?

No, QLoRA is broken on AMD consumer GPUs due to a bitsandbytes compatibility issue. Use 16-bit LoRA instead, which works reliably and requires ~5.7 GB VRAM.

Fine-Tuning Local Models: 10% to 100% Format Compliance in 5 Minutes

Fine-tuning smaller instruction-following models (1.5B parameters) on structured output achieves higher format compliance than prompt-engineering larger reasoning models (8B parameters), while using one-fifth the VRAM and costing $0 on consumer AMD GPUs.

The problem: My vault automation system runs 20+ daily tasks requiring structured output (YAML-like lists, spaced repetition schedules, quiz formats). A general-purpose 8-billion-parameter reasoning model (Qwen 3) hit only 10 percent format compliance because reasoning models optimize for explanation and reasoning transparency, not output format adherence. Nine out of ten outputs included unwanted preamble, markdown fences, or explanations that broke downstream parsers. Prompt engineering couldn't fix this: the model was "thinking too much."

The solution: Fine-tune a small instruction-following model (Qwen 2.5 1.5B) on 515 real-world examples from Claude, generated in the exact output format the parser expects. Training took under five minutes per model on an RX 9070 XT with 6GB VRAM. Result: 100 percent format compliance across three fine-tuned models (deepener, quiz generator, spaced repetition scheduler). The lesson: for format-constrained tasks, smaller models trained on exact outputs outperform larger models trained to reason.

The Problem

My topic-deepener task takes a knowledge note skeleton and generates subtopic suggestions in a specific structured format: parent slug, bullet list with exact formatting, no preamble, no explanation. The parser expects this exact shape or it throws the output away.

qwen3:8b is a reasoning model. When you give it a structured output instruction, it thinks about it. It considers alternatives. It adds helpful context. It wraps the output in markdown code fences. It explains why it chose certain subtopics. All of this is poison for a format-constrained pipeline.

I tried prompt engineering. I tried few-shot examples. I tried system prompts that said "OUTPUT ONLY THE FOLLOWING FORMAT." The best I got was about 10% compliance on a 52-example eval set. The model understood what I wanted — it just couldn't stop itself from editorializing.

The same pattern showed up on two other tasks: spaced repetition scheduling (0% compliance with qwen3:8b) and quiz generation (80% compliance). Reasoning models want to reason. That's their strength and, for format tasks, their fatal flaw.

The Approach

Instead of fighting the 8B reasoning model, I went smaller and dumber. Qwen2.5-1.5B-Instruct — not the reasoning variant, the plain instruction-following base. A model small enough that LoRA fine-tuning fits comfortably in 6GB of VRAM.

The training data came from Claude. I used Sonnet to generate 515 training examples from real vault notes for the deepener task. The process: extract note skeletons from the vault, feed each skeleton to Claude with the exact prompt the heartbeat task uses, collect the structured output. Six parallel agents, batches of ~86 notes each. The Claude Max subscription meant this was effectively free.

Each training example is a chat-format triplet:

{
  "messages": [
    {"role": "system", "content": "You are a topic-deepener..."},
    {"role": "user", "content": "## Note skeleton\ntitle: cs--attention-mechanisms\n..."},
    {"role": "assistant", "content": "parent: cs--attention-mechanisms\n- cs--cross-attention | Cross-Attention | ..."}
  ]
}

The assistant response is the exact output format the parser expects. No explanation, no preamble, no code fences. Just the structured data. Train on enough of these and the model learns the format as a reflex, not a reasoning task.

Training Setup

Hardware: AMD RX 9070 XT with 16GB VRAM, running PyTorch 2.9.1+rocm6.3. One important constraint — QLoRA is broken on AMD consumer GPUs due to a bitsandbytes issue. So it's 16-bit LoRA only, which uses more VRAM but works reliably.

The training config:

# LoRA configuration
lora_config = LoraConfig(
    r=16,                    # rank
    lora_alpha=32,           # scaling factor
    target_modules=[         # all linear layers
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj"
    ],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
 
# Training arguments
training_args = SFTConfig(
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,   # effective batch size: 16
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    max_length=1024,
    bf16=True,                       # RDNA 4 supports bfloat16
    logging_steps=10,
)

Training the deepener model: 87 steps, 4 minutes 44 seconds. Peak VRAM: 5.68 GB. Final train loss: 0.49, validation loss: 0.52 — close enough to rule out overfitting.

The pipeline after training: merge LoRA weights into the base model, convert to GGUF with llama.cpp, quantize to Q8_0, create an Ollama Modelfile, and register it. The whole flow from raw training data to a running Ollama model is about an hour including data generation.

One gotcha: Q4_K_M quantization destroyed format compliance on the 1.5B model. At 1.5B parameters, you don't have enough redundancy to survive aggressive quantization. Q8_0 (1.6GB on disk) works perfectly. For comparison, the full FP16 model is 3.1GB — so Q8_0 is roughly half the size with no measurable quality loss.

Results

Three models trained with the same pipeline, different training data:

Model	Task	Examples	Before	After
vault-deepener-q8	Topic deepening	515	10% format rate	100%
vault-spacer-q8	Spaced repetition	107	0% format rate	100%
vault-quizzer-q8	Quiz generation	93	80% format rate	100%

The deepener eval is the most thorough — 52 held-out examples with content quality metrics:

Metric	vault-deepener-q8	qwen3:8b
Format perfect	100%	10%
Correct parent slug	100%	21%
Bullets in range	100%	12%
ROUGE-L vs gold	0.353	0.088
Novelty rate	98%	--

The ROUGE-L score matters here. It's not just that the model follows the format — the content is 4x more aligned with what Claude would produce. And 98% novelty means it's not memorizing the training set.

Performance

This is where the economics get interesting:

Metric	vault-deepener-q8	qwen3:8b	Improvement
Tokens/sec	159.6	73.9	2.2x
Latency (typical task)	~0.8s	~7.4s	9.2x
VRAM	~1.2 GB	~5.5 GB	4.6x
Disk	1.6 GB	4.9 GB	3.1x

The 9.2x latency improvement comes from two factors: faster token generation (smaller model = less memory bandwidth needed per token) and shorter outputs (the fine-tuned model doesn't waste tokens on preamble and explanation). The 8B model would generate 200+ tokens of reasoning before getting to the actual output. The 1.5B model goes straight to the structured data.

VRAM reduction is the other win. At 1.2GB, I can run the fine-tuned model alongside Ollama's other loaded models without swapping. The 8B model would compete for the same VRAM pool, causing model loads and unloads that add latency to other tasks.

What Didn't Work

Before landing on 1.5B, I tried the obvious middle ground: 3B reasoning models. Qwen3:3b, various other small reasoning variants. All failed.

The pattern was consistent. Reasoning-capable models, regardless of size, want to reason. They produce chain-of-thought even when told not to. They wrap output in code blocks. They add "Here's my analysis" headers. The reasoning capability is deeply embedded in the model weights — you can't prompt it away, and fine-tuning a reasoning model to not reason is fighting the architecture.

Base instruction-following models don't have this problem. Qwen2.5-1.5B-Instruct doesn't try to reason about the task. It just follows the pattern it learned during fine-tuning. For format-constrained tasks, this is exactly what you want: a model that executes a learned pattern without editorial intervention.

The Insight

There are two fundamentally different categories of LLM tasks in my automation pipeline:

Reasoning tasks: code review, ecosystem analysis, morning briefings. These need a model that can think, synthesize, and make judgments. qwen3:8b or qwen2.5-coder:14b, running on the same local GPU. (See how different local models compare across task types.)

Format tasks: topic deepening, spaced repetition scheduling, quiz generation, structured data extraction. These need a model that produces exact output shapes reliably. Fine-tuned 1.5B models.

Using a reasoning model for format tasks is like hiring a philosopher to fill out forms. They'll do it, eventually, in their own way, with extensive annotations about the nature of forms. Using a fine-tuned small model for reasoning tasks would be equally wrong — they'd produce correctly formatted nonsense.

The three fine-tuned models now run across six daily heartbeat slots (three for the deepener alone) at zero marginal cost. The combined VRAM footprint of all three is less than what qwen3:8b uses alone. Training each new model takes under an hour end-to-end, including data generation.

The economics of this are hard to argue with. Frontier models for reasoning, fine-tuned small models for format-heavy automation. The 5x cost difference between Sonnet and Opus matters for API calls. The infinity-x cost difference between local inference and any API matters even more — especially when you're running tasks every few hours, indefinitely.

Fine-Tuning Local Models: 10% to 100% Format Compliance in 5 Minutes

The Problem

The Approach

Training Setup

Results

Performance

What Didn't Work

The Insight

Related Articles

7B QLoRA Fine-Tuning on AMD RDNA4: The HQQ Path

Building a Fashion Trend Intelligence Pipeline for $3/Month

Private AI for Legal Work

Sources

About the Author

Vache Sarkissian