Universal Knowledge Extractor — LLM Training Data Pipeline

Challenge

Fine-tuning LLMs on domain-specific knowledge requires structured, high-quality instruction-response pairs. Manual curation doesn’t scale, and raw content from code repos, docs, and social media needs significant preprocessing before it’s usable for training.

Build a pipeline that:

  • Extracts knowledge from diverse sources (code, docs, Telegram, LinkedIn)
  • Automatically discovers content taxonomy
  • Produces ChatML-formatted JSONL ready for Axolotl/LLaMA-Factory
  • Runs fully offline with local Ollama
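For reference, one training record in the ChatML-style JSONL that Axolotl and LLaMA-Factory consume looks roughly like this — a minimal sketch using the common `messages`/`role`/`content` convention; the exact schema keys depend on your trainer config:

```python
import json

# One instruction-response pair in ChatML-style message format.
# The system message is optional; the trainer maps these roles onto
# the model's chat template at fine-tuning time.
record = {
    "messages": [
        {"role": "system", "content": "You are a helpful domain expert."},
        {"role": "user", "content": "How is the content taxonomy discovered?"},
        {"role": "assistant", "content": "Unsupervised clustering over content embeddings."},
    ]
}

# JSONL = one compact JSON object per line, no pretty-printing.
line = json.dumps(record, ensure_ascii=False)
print(line)
```

Every record in `output/training-data.jsonl` is one such line, so the file can be streamed without loading it whole.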

Solution Architecture

Pipeline Flow

┌─────────────────────────────────────────────────────────────────┐
│                      Content Sources                             │
├────────────┬────────────┬────────────┬──────────────────────────┤
│ Filesystem │   GitHub   │  Telegram  │        LinkedIn          │
│  (.md, .py)│  (repos)   │ (channels) │       (exports)          │
└─────┬──────┴─────┬──────┴─────┬──────┴────────────┬─────────────┘
      │            │            │                   │
      └────────────┴────────────┴───────────────────┘
                            │
                            ▼
              ┌──────────────────────────┐
              │    Taxonomy Discovery    │
              │  (Unsupervised ML)       │
              └────────────┬─────────────┘
                           │
                           ▼
              ┌──────────────────────────┐
              │   LLM Extraction         │
              │  (Ollama qwen3.5:35b)    │
              │  → Instruction-Response  │
              └────────────┬─────────────┘
                           │
                           ▼
              ┌──────────────────────────┐
              │   Quality Assurance      │
              │  - Schema validation     │
              │  - Token length check    │
              │  - Language detection    │
              │  - Credential scanning   │
              │  - Semantic dedup        │
              └────────────┬─────────────┘
                           │
                           ▼
              ┌──────────────────────────┐
              │   Data Augmentation      │
              │  - Paraphrase            │
              │  - Reformulation         │
              └────────────┬─────────────┘
                           │
                           ▼
         output/training-data.jsonl (ChatML)
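The Quality Assurance stage above can be thought of as a chain of predicate filters over candidate pairs. A minimal sketch — with Jaccard token overlap standing in for embedding-based semantic dedup, a character limit standing in for token counting, and an illustrative regex for credential scanning (the pipeline's real checks are assumed to be more thorough):

```python
import re

def token_set(text: str) -> set:
    return set(re.findall(r"\w+", text.lower()))

def jaccard(a: str, b: str) -> float:
    """Cheap stand-in for embedding cosine similarity."""
    ta, tb = token_set(a), token_set(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

# Illustrative credential pattern; real scanners use many more rules.
CREDENTIAL_RE = re.compile(r"(api[_-]?key|secret|token)\s*[:=]\s*\S+", re.I)

def passes_qa(pair, kept, max_chars=4000, dedup_threshold=0.9):
    text = pair["instruction"] + " " + pair["response"]
    if not pair["instruction"] or not pair["response"]:
        return False                      # schema validation
    if len(text) > max_chars:
        return False                      # length check (chars as token proxy)
    if CREDENTIAL_RE.search(text):
        return False                      # credential scanning
    return all(jaccard(text, k["instruction"] + " " + k["response"])
               < dedup_threshold for k in kept)   # semantic dedup

pairs = [
    {"instruction": "What is ChatML?", "response": "A chat message format."},
    {"instruction": "What is ChatML?", "response": "A chat message format."},  # duplicate
    {"instruction": "Deploy key?", "response": "api_key = sk-12345"},          # leaked secret
]
kept = []
for p in pairs:
    if passes_qa(p, kept):
        kept.append(p)
print(len(kept))  # → 1: the duplicate and the credential leak are dropped
```

Ordering the filters from cheapest to most expensive keeps the dedup comparison (the only O(n) check per record) at the end of the chain.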

Multi-Source Support

| Source     | Adapter    | Content Types                     |
|------------|------------|-----------------------------------|
| Filesystem | `fs`       | Markdown, Python, YAML, JSON      |
| GitHub     | `github`   | Repos, READMEs, code, issues      |
| Telegram   | `telegram` | Channel messages, media captions  |
| LinkedIn   | `linkedin` | Profile exports, posts, articles  |
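Each source plugs in behind a common adapter interface so downstream stages stay source-agnostic. A minimal sketch, shown for the `fs` adapter — the `SourceAdapter`/`iter_documents` names are illustrative, not the pipeline's actual API:

```python
from pathlib import Path
from typing import Iterator, Protocol

class SourceAdapter(Protocol):
    """Common interface every content source implements (illustrative)."""
    name: str
    def iter_documents(self) -> Iterator[dict]: ...

class FilesystemAdapter:
    """The `fs` adapter: walks a directory for supported file types."""
    name = "fs"
    SUFFIXES = {".md", ".py", ".yaml", ".yml", ".json"}

    def __init__(self, root: str) -> None:
        self.root = Path(root)

    def iter_documents(self) -> Iterator[dict]:
        for path in sorted(self.root.rglob("*")):
            if path.is_file() and path.suffix in self.SUFFIXES:
                yield {"source": self.name,
                       "path": str(path),
                       "text": path.read_text(encoding="utf-8")}
```

A GitHub, Telegram, or LinkedIn adapter would yield the same dict shape, so taxonomy discovery and extraction never need to know where a document came from.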

Key Features

1. Automatic Taxonomy Discovery

# No manual topic lists required
taxonomy:
  method: unsupervised
  algorithm: kmeans  # or hdbscan
  num_clusters: auto  # discovers optimal count

Uses embedding-based clustering to automatically categorize content into topics.
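A toy version of that discovery step, using a pure-Python k-means over bag-of-words vectors — the real pipeline is assumed to cluster dense embeddings and to scan candidate k values when `num_clusters: auto` is set:

```python
import re
from collections import Counter

def vectorize(texts):
    """Bag-of-words count vectors; stand-in for dense embeddings."""
    tokens = [re.findall(r"\w+", t.lower()) for t in texts]
    vocab = sorted({w for ts in tokens for w in ts})
    return [[Counter(ts)[w] for w in vocab] for ts in tokens]

def kmeans(vecs, k, iters=20):
    centers = [list(v) for v in vecs[:k]]          # deterministic init for the demo
    labels = [0] * len(vecs)
    for _ in range(iters):
        # Assign each vector to its nearest center (squared Euclidean).
        labels = [min(range(k), key=lambda c: sum((a - b) ** 2
                  for a, b in zip(v, centers[c]))) for v in vecs]
        # Recompute each center as the mean of its members.
        for c in range(k):
            members = [v for v, l in zip(vecs, labels) if l == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

docs = ["python list comprehension syntax",
        "python dict comprehension syntax",
        "telegram channel export format",
        "telegram message media captions"]
labels = kmeans(vectorize(docs), k=2)
print(labels)  # → [1, 1, 0, 0]: python docs vs. telegram docs
```

With `num_clusters: auto`, the same loop would run for a range of k and keep the count that maximizes a cluster-quality score such as silhouette; with `algorithm: hdbscan`, density-based clustering infers the count directly.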
