DevSecOps-5090 — GPU Training Pipeline on Kubernetes
Challenge
Fine-tuning large language models typically requires expensive cloud GPU instances or a complex local setup. The goal is a production-ready, self-hosted training pipeline that:
- Runs on local RTX 5090 (32GB VRAM)
- Deploys via Kubernetes (k3s homelab)
- Uses pre-built images (no runtime pip installs)
- Supports QLoRA for memory efficiency
Solution Architecture
Pipeline Overview
┌─────────────────────────────────────────────────────────────┐
│ K3s Cluster (Homelab) │
├─────────────────────────────────────────────────────────────┤
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Training Pod (GPU) │ │
│ │ ┌─────────────────┐ ┌─────────────────────────┐ │ │
│ │ │ Init Containers │ │ Main Container │ │ │
│ │ │ - Verify GPU │ │ - qwen_qlora_trainer.py │ │ │
│ │ │ - Check deps │ │ - HuggingFace ecosystem │ │ │
│ │ │ - Mount PVCs │ │ - bitsandbytes (4-bit) │ │ │
│ │ └─────────────────┘ └─────────────────────────┘ │ │
│ │ │ │
│ │ ┌─────────────────────────────────────────────┐ │ │
│ │ │ Sidecar │ │ │
│ │ │ metrics-exporter │ │ │
│ │ └─────────────────────────────────────────────┘ │ │
│ └─────────────────────────────────────────────────────┘ │
│ │
│ Volumes: │
│ ├── /mnt/models (RO) - Model cache │
│ ├── /mnt/data (RO) - Training data │
│ ├── /mnt/checkpoints (RW) - Output checkpoints │
│ └── /mnt/training-logs (RW) - Logs + TensorBoard │
└─────────────────────────────────────────────────────────────┘
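The init containers gate the main container on a healthy GPU. A minimal sketch of that verification step is below; the container name and the inline check are illustrative, not the chart's exact manifest:

# Illustrative init container: fail fast if the GPU is not schedulable
initContainers:
  - name: verify-gpu
    image: 192.168.2.30:5000/devsecops-training:v0.1.0
    command: ["python", "-c"]
    args:
      - |
        import torch
        assert torch.cuda.is_available(), "CUDA not available"
        print("GPU:", torch.cuda.get_device_name(0))
    resources:
      limits:
        nvidia.com/gpu: 1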
Pre-Built Training Image
# Dockerfile.training
FROM pytorch/pytorch:2.6.0-cuda12.4-cudnn9-runtime
# Pre-install all dependencies (no runtime downloads)
RUN pip install --no-cache-dir \
transformers==4.47.0 \
peft==0.14.0 \
trl==0.13.0 \
bitsandbytes==0.45.0 \
datasets==3.2.0 \
accelerate==1.2.1 \
safetensors \
sentencepiece \
protobuf
Result: 11.4 GB image, ~2 min boot (vs. 15+ min with runtime pip)
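Before pushing, the baked-in stack can be sanity-checked with a one-liner (the import list mirrors the Dockerfile above; exact output will vary):

# Verify the pre-installed dependencies inside the image
docker run --rm 192.168.2.30:5000/devsecops-training:v0.1.0 \
  python -c "import torch, transformers, peft, trl, bitsandbytes; \
             print(torch.__version__, transformers.__version__)"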
Key Features
1. QLoRA Training Script
# qwen_qlora_trainer.py
import torch  # needed for bnb_4bit_compute_dtype below
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model
# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16,
bnb_4bit_use_double_quant=True
)
# LoRA config for efficient fine-tuning
lora_config = LoraConfig(
r=16,
lora_alpha=32,
target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
lora_dropout=0.05,
task_type="CAUSAL_LM"
)
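A sketch of how these two configs plug into the rest of the script, assuming a TRL SFTTrainer loop and a JSONL dataset under /mnt/data (the dataset filename and the exact trainer wiring are illustrative; qwen_qlora_trainer.py may differ):

# Sketch: load the quantized model and train with TRL's SFTTrainer,
# reusing bnb_config and lora_config from above
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer
from peft import prepare_model_for_kbit_training

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct",
    quantization_config=bnb_config,  # 4-bit NF4 quantization
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

trainer = SFTTrainer(
    model=model,
    peft_config=lora_config,  # LoRA adapters from above
    train_dataset=load_dataset("json", data_files="/mnt/data/train.jsonl")["train"],
    args=SFTConfig(
        output_dir="/mnt/checkpoints",
        num_train_epochs=3,
        per_device_train_batch_size=4,
        learning_rate=2e-4,
        max_seq_length=2048,
        bf16=True,
        save_steps=500,
    ),
)
trainer.train()
trainer.save_model("/mnt/checkpoints/final_model")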
2. Helm Deployment
# helm/values.yaml
training:
enabled: true
image:
repository: 192.168.2.30:5000/devsecops-training
tag: v0.1.0
resources:
limits:
nvidia.com/gpu: 1
memory: 14Gi
cpu: 4
requests:
memory: 12Gi
cpu: 2
securityContext:
runAsNonRoot: true
runAsUser: 1001
readOnlyRootFilesystem: true
3. Container Security
# Non-root execution
securityContext:
  runAsUser: 1001    # trainuser
  runAsGroup: 1001

# Read-only root filesystem, with writable exceptions:
#   /tmp              - Triton kernel cache
#   /dev/shm (4Gi)    - PyTorch DataLoader shared memory
#   /mnt/checkpoints  - model checkpoint output
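One way to express those exceptions in the pod spec, sketched with illustrative volume names (emptyDir for scratch space, a memory-backed mount for /dev/shm):

# Illustrative: writable mounts that satisfy readOnlyRootFilesystem
volumes:
  - name: tmp
    emptyDir: {}
  - name: shm
    emptyDir:
      medium: Memory
      sizeLimit: 4Gi
containers:
  - name: training
    volumeMounts:
      - name: tmp
        mountPath: /tmp
      - name: shm
        mountPath: /dev/shm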
4. GPU Resource Management
RTX 5090 (32GB VRAM):
├── Model (4-bit quantized): ~8GB
├── Gradients + Optimizer: ~4GB
├── Batch activations: ~2GB
└── Available for larger batches: ~18GB
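The budget can be sanity-checked at runtime with torch.cuda.mem_get_info, which reports free and total device memory in bytes:

# Quick VRAM sanity check from inside the training pod
import torch
free, total = torch.cuda.mem_get_info()
print(f"VRAM free: {free / 2**30:.1f} GiB of {total / 2**30:.1f} GiB")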
Training Configuration
Supported Models
| Model | Parameters | 4-bit VRAM | Use Case |
|---|---|---|---|
| GPT-2 | 124M | ~1GB | Testing/validation |
| Qwen-7B | 7B | ~4GB | General fine-tuning |
| Qwen-14B | 14B | ~8GB | Higher quality |
| DeepSeek-R1-32B | 32B | ~18GB | Maximum capability |
Example Config
# qlora-8b-devsecops.yaml
model:
name: "Qwen/Qwen2.5-7B-Instruct"
quantization: "4bit"
training:
epochs: 3
batch_size: 4
learning_rate: 2e-4
max_seq_length: 2048
lora:
r: 16
alpha: 32
dropout: 0.05
output:
checkpoint_dir: "/mnt/checkpoints"
save_steps: 500
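A sketch of hydrating trainer settings from this file; the keys follow the layout above, but the loader in the actual script may differ:

# Illustrative: build the LoRA settings from the YAML config
import yaml
from peft import LoraConfig

with open("qlora-8b-devsecops.yaml") as f:
    cfg = yaml.safe_load(f)

lora_config = LoraConfig(
    r=cfg["lora"]["r"],
    lora_alpha=cfg["lora"]["alpha"],
    lora_dropout=cfg["lora"]["dropout"],
    task_type="CAUSAL_LM",
)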
Deployment
Quick Start
# 1. Build and push image
docker buildx build -f Dockerfile.training \
-t 192.168.2.30:5000/devsecops-training:v0.1.0 .
docker push 192.168.2.30:5000/devsecops-training:v0.1.0
# 2. Deploy via Helm
helm upgrade --install devsecops ./helm \
--namespace devsecops \
--values helm/values.yaml
# 3. Monitor training
kubectl logs -f -n devsecops -l app=devsecops-training
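A useful follow-up check, assuming the workload is exposed as a Deployment named devsecops-training:

# 4. Verify the pod actually received the GPU
kubectl exec -n devsecops deploy/devsecops-training -- nvidia-smi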
Expected Output
=== Verifying environment ===
PyTorch: 2.6.0+cu124
CUDA: True
GPU: NVIDIA GeForce RTX 5090
=== Starting QLoRA Training ===
Loading model: Qwen/Qwen2.5-7B-Instruct
Applying 4-bit quantization...
Creating LoRA adapters...
Training: /mnt/checkpoints
[100/1000] loss: 2.45
[200/1000] loss: 1.87
...
Model saved to /mnt/checkpoints/final_model
Training complete!
Results & Benefits
Performance Gains
Image Optimization:
├── Boot time: 2 min (vs. 15+ min)
├── Image size: 11.4 GB (optimized)
├── Zero runtime downloads
└── No SSL/network errors
Resource Efficiency:
├── 4-bit quantization: 4x memory reduction
├── LoRA adapters: <1% trainable params
└── Full RTX 5090 VRAM utilization
Cost Comparison
| Approach | Cost per hour |
|---|---|
| AWS p4d.24xlarge | ~$32/hour |
| GCP A100 | ~$25/hour |
| Lambda Labs | ~$10/hour |
| Self-hosted RTX 5090 | ~$0.15/hour (electricity) |
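The self-hosted figure checks out under reasonable assumptions: at ~600 W total system draw and ~$0.25/kWh, 0.6 kW × $0.25/kWh = $0.15/hour; the actual cost scales with the local electricity rate.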
Architecture Decisions
- Pre-built images: Eliminates runtime dependency resolution failures
- QLoRA over full fine-tuning: 4x memory reduction, comparable quality
- k3s over Docker Compose: Production-grade scheduling, resource limits
- Non-root containers: Security hardening, CIS benchmark compliance
- Separate metrics sidecar: Clean separation, Prometheus-native