DevSecOps-5090 — GPU Training Pipeline on Kubernetes
Challenge
Fine-tuning large language models typically requires expensive cloud GPU instances or a complex local setup. The goal is a production-ready, self-hosted training pipeline that:
- Runs on local RTX 5090 (32GB VRAM)
- Deploys via Kubernetes (k3s homelab)
- Uses pre-built images (no runtime pip installs)
- Supports QLoRA for memory efficiency
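As a back-of-the-envelope check on why 4-bit QLoRA makes a 32 GB card viable, the base-weight footprint can be estimated from parameter count and precision. This is an illustrative sketch (the 7B parameter figure is an assumption, and it ignores activations, LoRA adapters, and optimizer state, which add on top):

```python
def weight_gib(params_b: float, bits: int) -> float:
    """Approximate base-weight footprint in GiB for a model with
    `params_b` billion parameters stored at `bits` bits per weight."""
    return params_b * 1e9 * bits / 8 / 2**30

# Illustrative 7B model: fp16 weights vs 4-bit (NF4) quantized weights
fp16 = weight_gib(7, 16)  # ≈ 13.0 GiB before any optimizer/activation memory
nf4 = weight_gib(7, 4)    # ≈ 3.3 GiB, leaving VRAM headroom on a 32 GB card
print(f"fp16: {fp16:.1f} GiB, 4-bit: {nf4:.1f} GiB")
```

The 4× reduction on the frozen base weights is what frees room for gradients and optimizer state on the trainable LoRA parameters.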
Solution Architecture
Pipeline Overview
┌─────────────────────────────────────────────────────────────┐
│ K3s Cluster (Homelab) │
├─────────────────────────────────────────────────────────────┤
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Training Pod (GPU) │ │
│ │ ┌─────────────────┐ ┌─────────────────────────┐ │ │
│ │ │ Init Containers │ │ Main Container │ │ │
│ │ │ - Verify GPU │ │ - qwen_qlora_trainer.py │ │ │
│ │ │ - Check deps │ │ - HuggingFace ecosystem │ │ │
│ │ │ - Mount PVCs │ │ - bitsandbytes (4-bit) │ │ │
│ │ └─────────────────┘ └─────────────────────────┘ │ │
│ │ │ │
│ │ ┌─────────────────────────────────────────────┐ │ │
│ │ │ Sidecar │ │ │
│ │ │ metrics-exporter │ │ │
│ │ └─────────────────────────────────────────────┘ │ │
│ └─────────────────────────────────────────────────────┘ │
│ │
│ Volumes: │
│ ├── /mnt/models (RO) - Model cache │
│ ├── /mnt/data (RO) - Training data │
│ ├── /mnt/checkpoints (RW) - Output checkpoints │
│ └── /mnt/training-logs (RW) - Logs + TensorBoard │
└─────────────────────────────────────────────────────────────┘
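A minimal sketch of how an init container could validate this volume layout before the main container starts. The mount table mirrors the diagram above; the helper name and the `root` parameter are assumptions for illustration, not code from the actual init containers:

```python
import os
from pathlib import Path

# Expected mounts from the pod spec: path -> must be writable?
MOUNTS = {
    "/mnt/models": False,        # model cache (RO)
    "/mnt/data": False,          # training data (RO)
    "/mnt/checkpoints": True,    # output checkpoints (RW)
    "/mnt/training-logs": True,  # logs + TensorBoard (RW)
}

def check_mounts(mounts: dict, root: str = "/") -> list:
    """Return a list of problems; an empty list means the layout looks sane.
    `root` lets the check run against a test directory tree."""
    problems = []
    for rel, writable in mounts.items():
        p = Path(root) / rel.lstrip("/")
        if not p.is_dir():
            problems.append(f"missing: {p}")
        elif writable and not os.access(p, os.W_OK):
            problems.append(f"not writable: {p}")
    return problems
```

Failing fast in an init container keeps a misconfigured PVC from surfacing as a cryptic crash minutes into training.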
Pre-Built Training Image
FROM pytorch/pytorch:2.6.0-cuda12.4-cudnn9-runtime
# Pre-install all dependencies (no runtime downloads)
RUN pip install --no-cache-dir \
transformers==4.47.0 \
peft==0.14.0 \
trl==0.13.0 \
bitsandbytes==0.45.0 \
datasets==3.2.0 \
accelerate==1.2.1 \
safetensors \
sentencepiece \
protobuf
Result: an 11.4 GB image with ~2 min pod startup (vs. 15+ min with runtime pip installs)
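One way to guard those pins at container start is a smoke test that compares the installed versions against the Dockerfile layer. This is a sketch of that idea (the helper and the injectable `installed` mapping are assumptions, not part of the actual image):

```python
from importlib import metadata

# Pins mirrored from the Dockerfile layer above
EXPECTED = {
    "transformers": "4.47.0",
    "peft": "0.14.0",
    "trl": "0.13.0",
    "bitsandbytes": "0.45.0",
    "datasets": "3.2.0",
    "accelerate": "1.2.1",
}

def pin_mismatches(expected: dict, installed: dict = None) -> dict:
    """Return {package: found_version} for every pin that doesn't match.
    `installed` may be injected for testing; by default it is read from
    importlib.metadata in the running environment."""
    if installed is None:
        installed = {}
        for name in expected:
            try:
                installed[name] = metadata.version(name)
            except metadata.PackageNotFoundError:
                installed[name] = "<missing>"
    return {name: installed.get(name, "<missing>")
            for name, want in expected.items()
            if installed.get(name, "<missing>") != want}
```

An empty result means the image matches its build recipe; anything else can abort the pod before GPU time is wasted on a drifted environment.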
LLM Fine-tuning
QLoRA
GPU
Kubernetes
Self-Hosted
ML Infrastructure