Personal AI Agents on AMD RDNA4
8 fine-tuned models running 50+ autonomous tasks daily on a single $550 consumer GPU. No cloud dependencies. No API costs. No NVIDIA required.
The standard QLoRA library (bitsandbytes) doesn't work on AMD's newest GPU architecture. Two independent failure paths, both unfixable without upstream changes: PyPI wheels crash with hipErrorNoBinaryForGpu, and building from source creates a ROCm version mismatch that produces undefined symbol: hsa_amd_memory_get_preferred_copy_engine.
Without 4-bit quantization, a 7B model in bf16 needs ~14GB just for weights — leaving nothing for activations and gradients on a 16GB card.
HQQ (Half-Quadratic Quantization) provides 4-bit quantization with a pure PyTorch backend — no custom CUDA/HIP kernels. Works on any device PyTorch supports, including RDNA4.
Training data generated by Sonnet 4.6 distillation on real vault data. Models quantized to Q8_0 GGUF and deployed to Ollama.
On these metrics: Eval loss measures next-token prediction on held-out examples. Lower is better, but it only tells us the model learned the output patterns — not whether it performs the task well. A model with good loss could still produce poorly calibrated scores or miss edge cases. We're building proper task-level evaluation: running each model alongside Sonnet on identical inputs and comparing output quality. Until that data is in, treat these as training diagnostics, not performance benchmarks.
These models don't sit idle. They run in a heartbeat system — an automated task scheduler that executes 50+ tasks daily on a 15-minute timer. The system self-improves: analysts find opportunities, planners create goals, implementers execute them, and judges evaluate the results.
prepare_model_for_kbit_training upcasts layers — HQQ doesn't cast inputs to match
HQQ uses @torch.compile() as decorator — breaks on 3.14+, needs monkey-patch
HQQ tensor names break llama.cpp — reload base bf16, apply adapter, merge clean
Custom GGUF without ChatML template = model generates input continuations, not output
Input/output ratio > 5:1 without loss masking = model learns to predict inputs
Keep start 1/3 (task framing) + end 2/3 (recent data), drop redundant middle
Kahneman-Tversky optimization for preference alignment with unpaired data. No topic-matching needed — just good and bad examples.
Combining 4 specialist LoRAs into one unified model. Zero VRAM swapping between tasks. Route via system prompt.
Prometheus 2 as a fully local evaluation judge. Quality assurance without API costs. IRT-based test optimization.
Everything is open source. If you have an AMD GPU and want your own AI agents:
github.com/vachsark/qlora-rdna4