Vache prompts. Claude codes.How it works
Open Source

Personal AI Agents on AMD RDNA4

8 fine-tuned models running 50+ autonomous tasks daily on a single $550 consumer GPU. No cloud dependencies. No API costs. No NVIDIA required.

8
Fine-tuned models
$0
Training cost
50+
Daily tasks
~90 min
Total train time
The Problem

The standard QLoRA library (bitsandbytes) doesn't work on AMD's newest GPU architecture. Two independent failure paths, both unfixable without upstream changes: PyPI wheels crash with hipErrorNoBinaryForGpu, and building from source creates a ROCm version mismatch that produces undefined symbol: hsa_amd_memory_get_preferred_copy_engine.

Without 4-bit quantization, a 7B model in bf16 needs ~14GB just for weights — leaving nothing for activations and gradients on a 16GB card.

The Solution: HQQ

HQQ (Half-Quadratic Quantization) provides 4-bit quantization with a pure PyTorch backend — no custom CUDA/HIP kernels. Works on any device PyTorch supports, including RDNA4.

5.85 GB
Base model VRAM
12.2 GB
Peak training VRAM
~21 s/step
Training speed
Trained Models
Model
Task
Examples
VRAM
Eval Loss
judge-7b
Quality evaluation
67
14.5 GB
1.84
planner-8b
Goal planning
99
12.2 GB
1.35
seeder-7b
Research synthesis
64
12.2 GB
1.96
analyst-7b
Technical analysis
77
15.7 GB
1.32
reflector-3b
Session reflection
42
11.5 GB
deepener-1.5b
Topic exploration
515
8.0 GB
spacer-1.5b
Spaced repetition
107
8.0 GB
quizzer-1.5b
Quiz generation
120
8.0 GB

Training data generated by Sonnet 4.6 distillation on real vault data. Models quantized to Q8_0 GGUF and deployed to Ollama.

On these metrics: Eval loss measures next-token prediction on held-out examples. Lower is better, but it only tells us the model learned the output patterns — not whether it performs the task well. A model with good loss could still produce poorly calibrated scores or miss edge cases. We're building proper task-level evaluation: running each model alongside Sonnet on identical inputs and comparing output quality. Until that data is in, treat these as training diagnostics, not performance benchmarks.

The Autonomous System

These models don't sit idle. They run in a heartbeat system — an automated task scheduler that executes 50+ tasks daily on a 15-minute timer. The system self-improves: analysts find opportunities, planners create goals, implementers execute them, and judges evaluate the results.

02:00
Knowledge seeder generates research notes
Sonnet + web search
04:00
Judge evaluates note quality (1-5 scoring)
vault-judge-7b-q8
05:00
Auto-implement executes pending goals
Claude Code
13:00
Deepener explores topics with follow-up questions
vault-deepener-q8
13:30
Spacer runs spaced repetition review
vault-spacer-q8
22:00
Blog writer drafts posts from research
gemma3:12b
23:15
Output judge reviews all task outputs
vault-judge-7b-q8
23:45
Goal planner prioritizes findings
vault-planner-8b-q8
Six Lessons (Three Patches, Three Gotchas)
1HQQ dtype mismatch

prepare_model_for_kbit_training upcasts layers — HQQ doesn't cast inputs to match

2Python 3.14 torch.compile

HQQ uses @torch.compile() as decorator — breaks on 3.14+, needs monkey-patch

3Clean LoRA merge

HQQ tensor names break llama.cpp — reload base bf16, apply adapter, merge clean

4Ollama chat template

Custom GGUF without ChatML template = model generates input continuations, not output

5completion_only_loss

Input/output ratio > 5:1 without loss masking = model learns to predict inputs

6Middle truncation

Keep start 1/3 (task framing) + end 2/3 (recent data), drop redundant middle

Pushing the Frontier
KTO Training

Kahneman-Tversky optimization for preference alignment with unpaired data. No topic-matching needed — just good and bad examples.

TIES Merging

Combining 4 specialist LoRAs into one unified model. Zero VRAM swapping between tasks. Route via system prompt.

Local Eval

Prometheus 2 as a fully local evaluation judge. Quality assurance without API costs. IRT-based test optimization.

Stack
AMD RX 9070 XTROCm 7.2PyTorch 2.9.1HQQPEFTTRLQwen2.5 / Qwen3Ollamallama.cppPython 3.14Arch Linux

Everything is open source. If you have an AMD GPU and want your own AI agents:

github.com/vachsark/qlora-rdna4