Universal Knowledge Extractor — LLM Training Data Pipeline
Challenge
Fine-tuning LLMs on domain-specific knowledge requires structured, high-quality instruction-response pairs. Manual curation doesn’t scale, and raw content from code repos, docs, and social media needs significant preprocessing before it’s usable for training.
Build a pipeline that:
- Extracts knowledge from diverse sources (code, docs, Telegram, LinkedIn)
- Automatically discovers content taxonomy
- Produces ChatML-formatted JSONL ready for Axolotl/LLaMA-Factory (sample record after this list)
- Runs fully offline with local Ollama
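Each line of the output file is one JSON object. A minimal sketch of what a single record could look like, using the common `messages`/`role`/`content` ChatML convention that Axolotl and LLaMA-Factory chat templates accept (the instruction and response text here is purely illustrative, not extracted data):

```json
{"messages": [{"role": "system", "content": "You are a domain expert assistant."}, {"role": "user", "content": "How does the ingestion worker back off when it hits a rate limit?"}, {"role": "assistant", "content": "It retries with exponential backoff plus jitter and gives up after five attempts, logging the dropped item for later replay."}]}
```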
Solution Architecture
Pipeline Flow
┌─────────────────────────────────────────────────────────────────┐
│                         Content Sources                          │
├────────────┬────────────┬────────────┬──────────────────────────┤
│ Filesystem │   GitHub   │  Telegram  │         LinkedIn         │
│ (.md, .py) │  (repos)   │ (channels) │        (exports)         │
└─────┬──────┴─────┬──────┴─────┬──────┴────────────┬─────────────┘
      │            │            │                   │
      └────────────┴─────────┬──┴───────────────────┘
                             │
                             ▼
                ┌──────────────────────────┐
                │    Taxonomy Discovery    │
                │    (Unsupervised ML)     │
                └────────────┬─────────────┘
                             │
                             ▼
                ┌──────────────────────────┐
                │      LLM Extraction      │
                │   (Ollama qwen3.5:35b)   │
                │  → Instruction-Response  │
                └────────────┬─────────────┘
                             │
                             ▼
                ┌──────────────────────────┐
                │    Quality Assurance     │
                │  - Schema validation     │
                │  - Token length check    │
                │  - Language detection    │
                │  - Credential scanning   │
                │  - Semantic dedup        │
                └────────────┬─────────────┘
                             │
                             ▼
                ┌──────────────────────────┐
                │    Data Augmentation     │
                │  - Paraphrase            │
                │  - Reformulation         │
                └────────────┬─────────────┘
                             │
                             ▼
            output/training-data.jsonl (ChatML)
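The quality-assurance stage mixes cheap rule-based checks (schema, token length, language, credentials) with a semantic deduplication pass. The snippet below is only a minimal sketch of that dedup idea, not the pipeline's actual code: it assumes records are dicts with an `instruction` field and uses a locally cached sentence-transformers model for embeddings (the real pipeline could just as well use Ollama embeddings).

```python
from sentence_transformers import SentenceTransformer
import numpy as np

def dedup_semantic(records, threshold=0.92):
    """Drop records whose instruction is a near-duplicate (cosine
    similarity >= threshold) of a record that has already been kept."""
    model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed, locally cached model
    texts = [r["instruction"] for r in records]      # assumed record shape
    vectors = model.encode(texts, normalize_embeddings=True)
    kept, kept_vecs = [], []
    for record, vec in zip(records, vectors):
        # On unit-normalized vectors, cosine similarity is a plain dot product.
        if kept_vecs and float(np.max(np.stack(kept_vecs) @ vec)) >= threshold:
            continue  # near-duplicate of an earlier record
        kept.append(record)
        kept_vecs.append(vec)
    return kept
```

The 0.92 threshold is a guess that needs tuning per corpus, and the linear scan here is O(n²); on large datasets an approximate-nearest-neighbour index would replace it.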
Multi-Source Support
| Source | Adapter | Content Types |
|---|---|---|
| Filesystem | fs | Markdown, Python, YAML, JSON |
| GitHub | github | Repos, READMEs, code, issues |
| Telegram | telegram | Channel messages, media captions |
| LinkedIn | linkedin | Profile exports, posts, articles |
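Adapters are selected per source in the pipeline configuration. The block below is only a hypothetical sketch of what a multi-source config could look like; the key names are illustrative, not necessarily the tool's real schema:

```yaml
sources:
  - adapter: fs
    path: ./docs                   # Markdown, Python, YAML, JSON
    include: ["**/*.md", "**/*.py"]
  - adapter: github
    repo: org/project              # READMEs, code, issues
  - adapter: telegram
    channel: "@my_channel"         # messages and media captions
  - adapter: linkedin
    export: ./linkedin-export.zip  # profile export archive
```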
Key Features
1. Automatic Taxonomy Discovery
# No manual topic lists required
taxonomy:
  method: unsupervised
  algorithm: kmeans     # or hdbscan
  num_clusters: auto    # discovers optimal count
Uses embedding-based clustering to automatically categorize content into topics.
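As a rough illustration of how `num_clusters: auto` can work (not necessarily the tool's exact method), the sketch below embeds documents with a locally cached sentence-transformers model and keeps the k-means cluster count with the best silhouette score:

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def discover_taxonomy(docs, k_candidates=range(2, 16)):
    """Return (best_k, labels) for the cluster count with the highest
    silhouette score. Assumes len(docs) exceeds the largest candidate k."""
    model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
    vectors = model.encode(docs, normalize_embeddings=True)
    best = (None, -1.0, None)  # (k, score, labels)
    for k in k_candidates:
        labels = KMeans(n_clusters=k, n_init="auto", random_state=0).fit_predict(vectors)
        score = silhouette_score(vectors, labels)
        if score > best[1]:
            best = (k, score, labels)
    return best[0], best[2]
```

The `hdbscan` option sidesteps the search over k entirely: cluster count emerges from density, at the cost of marking some documents as noise.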