DevSecOps-5090 — GPU Training Pipeline on Kubernetes

Challenge

Fine-tuning large language models typically requires expensive cloud GPU instances or a complex local setup. The goal is a production-ready, self-hosted training pipeline that:

  • Runs on local RTX 5090 (32GB VRAM)
  • Deploys via Kubernetes (k3s homelab)
  • Uses pre-built images (no runtime pip installs)
  • Supports QLoRA for memory efficiency
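Why QLoRA makes a 32GB card sufficient: the frozen base weights are stored in 4-bit NF4, and only the small LoRA adapters carry gradients and optimizer state. A back-of-the-envelope estimator (the function and its default adapter size are illustrative, not measured from this pipeline):

```python
def qlora_memory_gb(n_params_b: float, lora_params_m: float = 40,
                    base_bits: int = 4) -> float:
    """Rough VRAM estimate (GiB) for QLoRA fine-tuning, ignoring
    activations and KV cache. All numbers are ballpark, not measured."""
    GIB = 2**30
    base = n_params_b * 1e9 * base_bits / 8 / GIB   # frozen quantized weights
    adapters = lora_params_m * 1e6 * 2 / GIB        # trainable adapters (bf16)
    grads = lora_params_m * 1e6 * 2 / GIB           # adapter gradients (bf16)
    optim = lora_params_m * 1e6 * 8 / GIB           # Adam m + v states (fp32)
    return base + adapters + grads + optim

# A 7B model in 4-bit lands around ~3.7 GiB of static state,
# leaving most of the 32GB for activations and batch size.
print(round(qlora_memory_gb(7), 1))

# Full 16-bit fine-tuning with Adam would need weights + grads + optimizer
# states for all 7B parameters -- far beyond a single 32GB card.
print(round(qlora_memory_gb(7, lora_params_m=7000, base_bits=16), 1))
```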

Solution Architecture

Pipeline Overview

┌─────────────────────────────────────────────────────────────┐
│                    K3s Cluster (Homelab)                     │
├─────────────────────────────────────────────────────────────┤
│  ┌─────────────────────────────────────────────────────┐   │
│  │              Training Pod (GPU)                      │   │
│  │  ┌─────────────────┐  ┌─────────────────────────┐   │   │
│  │  │ Init Containers │  │    Main Container       │   │   │
│  │  │ - Verify GPU    │  │ - qwen_qlora_trainer.py │   │   │
│  │  │ - Check deps    │  │ - HuggingFace ecosystem │   │   │
│  │  │ - Mount PVCs    │  │ - bitsandbytes (4-bit)  │   │   │
│  │  └─────────────────┘  └─────────────────────────┘   │   │
│  │                                                      │   │
│  │  ┌─────────────────────────────────────────────┐    │   │
│  │  │              Sidecar                         │    │   │
│  │  │         metrics-exporter                     │    │   │
│  │  └─────────────────────────────────────────────┘    │   │
│  └─────────────────────────────────────────────────────┘   │
│                                                             │
│  Volumes:                                                   │
│  ├── /mnt/models      (RO) - Model cache                   │
│  ├── /mnt/data        (RO) - Training data                 │
│  ├── /mnt/checkpoints (RW) - Output checkpoints            │
│  └── /mnt/training-logs (RW) - Logs + TensorBoard          │
└─────────────────────────────────────────────────────────────┘
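The layout above maps onto a pod spec roughly like the following sketch (image tags, PVC names, and the metrics-exporter image are illustrative placeholders, not the actual manifest; it assumes the NVIDIA device plugin is installed on the k3s node):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: qwen-qlora-train
spec:
  runtimeClassName: nvidia                 # NVIDIA container runtime on k3s
  initContainers:
    - name: verify-gpu
      image: registry.local/llm-train:latest   # hypothetical image tag
      command: ["python", "-c", "import torch; assert torch.cuda.is_available()"]
  containers:
    - name: trainer
      image: registry.local/llm-train:latest
      command: ["python", "qwen_qlora_trainer.py"]
      resources:
        limits:
          nvidia.com/gpu: 1
      volumeMounts:
        - { name: models,        mountPath: /mnt/models,        readOnly: true }
        - { name: data,          mountPath: /mnt/data,          readOnly: true }
        - { name: checkpoints,   mountPath: /mnt/checkpoints }
        - { name: training-logs, mountPath: /mnt/training-logs }
    - name: metrics-exporter                 # sidecar from the diagram
      image: registry.local/metrics-exporter:latest  # hypothetical
      volumeMounts:
        - { name: training-logs, mountPath: /mnt/training-logs, readOnly: true }
  volumes:
    - name: models
      persistentVolumeClaim: { claimName: models-pvc }        # placeholder names
    - name: data
      persistentVolumeClaim: { claimName: data-pvc }
    - name: checkpoints
      persistentVolumeClaim: { claimName: checkpoints-pvc }
    - name: training-logs
      persistentVolumeClaim: { claimName: training-logs-pvc }
```

Mounting the model cache and training data read-only keeps the trainer from mutating shared inputs; only checkpoints and logs are writable.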

Pre-Built Training Image

FROM pytorch/pytorch:2.6.0-cuda12.4-cudnn9-runtime

# Pre-install all dependencies (no runtime downloads)
RUN pip install --no-cache-dir \
    transformers==4.47.0 \
    peft==0.14.0 \
    trl==0.13.0 \
    bitsandbytes==0.45.0 \
    datasets==3.2.0 \
    accelerate==1.2.1 \
    safetensors \
    sentencepiece \
    protobuf

Result: an 11.4 GB image with ~2 min pod startup (vs. 15+ min when dependencies are installed with pip at runtime)

LLM Fine-tuning QLoRA GPU Kubernetes Self-Hosted ML Infrastructure