Universal Knowledge Extractor — LLM Training Data Pipeline
Challenge
Fine-tuning LLMs on domain-specific knowledge requires structured, high-quality instruction-response pairs. Manual curation doesn’t scale, and raw content from code repos, docs, and social media needs significant preprocessing before it’s usable for training.
Build a pipeline that:
- Extracts knowledge from diverse sources (code, docs, Telegram, LinkedIn)
- Automatically discovers content taxonomy
- Produces ChatML-formatted JSONL ready for Axolotl/LLaMA-Factory (sample record after this list)
- Runs fully offline with local Ollama
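Each line of the output file is one JSON object. A minimal sketch of what a single record could look like, using the common `messages`/`role`/`content` ChatML convention that Axolotl and LLaMA-Factory chat templates accept (the instruction and response text here is purely illustrative, not extracted data):

```json
{"messages": [{"role": "system", "content": "You are a domain expert assistant."}, {"role": "user", "content": "How does the ingestion worker back off when it hits a rate limit?"}, {"role": "assistant", "content": "It retries with exponential backoff plus jitter and gives up after five attempts, logging the dropped item for later replay."}]}
```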
Solution Architecture
Pipeline Flow
┌─────────────────────────────────────────────────────────────────┐
│                         Content Sources                          │
├────────────┬────────────┬────────────┬──────────────────────────┤
│ Filesystem │   GitHub   │  Telegram  │         LinkedIn         │
│ (.md, .py) │  (repos)   │ (channels) │        (exports)         │
└─────┬──────┴─────┬──────┴─────┬──────┴────────────┬─────────────┘
      │            │            │                   │
      └────────────┴─────────┬──┴───────────────────┘
                             │
                             ▼
                ┌──────────────────────────┐
                │    Taxonomy Discovery    │
                │    (Unsupervised ML)     │
                └────────────┬─────────────┘
                             │
                             ▼
                ┌──────────────────────────┐
                │      LLM Extraction      │
                │   (Ollama qwen3.5:35b)   │
                │  → Instruction-Response  │
                └────────────┬─────────────┘
                             │
                             ▼
                ┌──────────────────────────┐
                │    Quality Assurance     │
                │  - Schema validation     │
                │  - Token length check    │
                │  - Language detection    │
                │  - Credential scanning   │
                │  - Semantic dedup        │
                └────────────┬─────────────┘
                             │
                             ▼
                ┌──────────────────────────┐
                │    Data Augmentation     │
                │  - Paraphrase            │
                │  - Reformulation         │
                └────────────┬─────────────┘
                             │
                             ▼
            output/training-data.jsonl (ChatML)
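The quality-assurance stage mixes cheap rule-based checks (schema, token length, language, credentials) with a semantic deduplication pass. The snippet below is only a minimal sketch of that dedup idea, not the pipeline's actual code: it assumes records are dicts with an `instruction` field and uses a locally cached sentence-transformers model for embeddings (the real pipeline could just as well use Ollama embeddings).

```python
from sentence_transformers import SentenceTransformer
import numpy as np

def dedup_semantic(records, threshold=0.92):
    """Drop records whose instruction is a near-duplicate (cosine
    similarity >= threshold) of a record that has already been kept."""
    model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed, locally cached model
    texts = [r["instruction"] for r in records]      # assumed record shape
    vectors = model.encode(texts, normalize_embeddings=True)
    kept, kept_vecs = [], []
    for record, vec in zip(records, vectors):
        # On unit-normalized vectors, cosine similarity is a plain dot product.
        if kept_vecs and float(np.max(np.stack(kept_vecs) @ vec)) >= threshold:
            continue  # near-duplicate of an earlier record
        kept.append(record)
        kept_vecs.append(vec)
    return kept
```

The 0.92 threshold is a guess that needs tuning per corpus, and the linear scan here is O(n²); on large datasets an approximate-nearest-neighbour index would replace it.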
Multi-Source Support
| Source | Adapter | Content Types |
|---|---|---|
| Filesystem | fs | Markdown, Python, YAML, JSON |
| GitHub | github | Repos, READMEs, code, issues |
| Telegram | telegram | Channel messages, media captions |
| LinkedIn | linkedin | Profile exports, posts, articles |
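Adapters are selected per source in the pipeline configuration. The block below is only a hypothetical sketch of what a multi-source config could look like; the key names are illustrative, not necessarily the tool's real schema:

```yaml
sources:
  - adapter: fs
    path: ./docs                   # Markdown, Python, YAML, JSON
    include: ["**/*.md", "**/*.py"]
  - adapter: github
    repo: org/project              # READMEs, code, issues
  - adapter: telegram
    channel: "@my_channel"         # messages and media captions
  - adapter: linkedin
    export: ./linkedin-export.zip  # profile export archive
```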
Key Features
1. Automatic Taxonomy Discovery
# No manual topic lists required
taxonomy:
  method: unsupervised
  algorithm: kmeans     # or hdbscan
  num_clusters: auto    # discovers optimal count
Uses embedding-based clustering to automatically categorize content into topics.
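As a rough illustration of how `num_clusters: auto` can work (not necessarily the tool's exact method), the sketch below embeds documents with a locally cached sentence-transformers model and keeps the k-means cluster count with the best silhouette score:

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def discover_taxonomy(docs, k_candidates=range(2, 16)):
    """Return (best_k, labels) for the cluster count with the highest
    silhouette score. Assumes len(docs) exceeds the largest candidate k."""
    model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
    vectors = model.encode(docs, normalize_embeddings=True)
    best = (None, -1.0, None)  # (k, score, labels)
    for k in k_candidates:
        labels = KMeans(n_clusters=k, n_init="auto", random_state=0).fit_predict(vectors)
        score = silhouette_score(vectors, labels)
        if score > best[1]:
            best = (k, score, labels)
    return best[0], best[2]
```

The `hdbscan` option sidesteps the search over k entirely: cluster count emerges from density, at the cost of marking some documents as noise.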