Ollama Decomposition Agent — Intelligent LLM Orchestration
Challenge
Large Language Models have limited context windows, and their latency grows with prompt size. When analyzing large codebases, documents, or complex multi-part questions, single-shot prompts either exceed the context limit or produce slow, unfocused responses.
Build an agent that:
- Intelligently splits large prompts into semantic sub-tasks
- Executes sub-tasks in parallel for speed
- Synthesizes results into coherent responses
- Runs entirely on local Ollama (zero API costs)
Solution Architecture
Workflow
User Prompt (large)
↓
[Analyze] - Count tokens, identify boundaries
↓
[Decide] - Decompose or single call?
├─→ Small (<6K tokens) → Single Ollama call → Return
└─→ Large (>6K tokens) → Decomposition
↓
[Split] - Semantic decomposition (headings, lists, paragraphs)
↓
[Execute] - Parallel sub-task execution (3 concurrent)
↓
[Aggregate] - Result synthesis (auto-strategy selection)
↓
[Return] - Final coherent response
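A minimal sketch of the analyze/decide step, assuming tiktoken's `cl100k_base` encoding for counting; `single_call` and `decompose_and_run` are hypothetical method names standing in for the agent's real entry points:

```python
import tiktoken

DECOMPOSE_THRESHOLD = 6_000  # tokens; the ~6K boundary shown in the workflow above

def count_tokens(text: str) -> int:
    """Count tokens with tiktoken (the counter the TokenManager component wraps)."""
    return len(tiktoken.get_encoding("cl100k_base").encode(text))

async def route(prompt: str, agent) -> str:
    """Small prompts go straight to Ollama; large ones enter decomposition."""
    if count_tokens(prompt) < DECOMPOSE_THRESHOLD:
        return await agent.single_call(prompt)       # hypothetical method name
    return await agent.decompose_and_run(prompt)     # hypothetical method name
```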
Components
| Component | Responsibility |
|---|---|
| OllamaDecompositionAgent | Main orchestrator, workflow management |
| TokenManager | Tiktoken counting, chunking with overlap |
| PromptSplitter | Semantic decomposition at natural boundaries |
| OllamaClient | Async HTTP with retry/backoff |
| ResultAggregator | Multi-strategy synthesis |
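The table lists OllamaClient as an async HTTP client with retry/backoff. The repository's implementation isn't reproduced here, but a minimal sketch against Ollama's public `/api/generate` endpoint, using `aiohttp` and exponential backoff, looks roughly like this:

```python
import asyncio

import aiohttp

async def generate(host: str, model: str, prompt: str,
                   retries: int = 3, backoff: float = 1.0) -> str:
    """POST to Ollama's /api/generate, retrying with exponential backoff on failure."""
    url = f"http://{host}/api/generate"
    payload = {"model": model, "prompt": prompt, "stream": False}
    async with aiohttp.ClientSession() as session:
        for attempt in range(retries):
            try:
                async with session.post(url, json=payload,
                                        timeout=aiohttp.ClientTimeout(total=300)) as resp:
                    resp.raise_for_status()
                    return (await resp.json())["response"]
            except (aiohttp.ClientError, asyncio.TimeoutError):
                if attempt == retries - 1:
                    raise
                await asyncio.sleep(backoff * 2 ** attempt)  # 1s, 2s, 4s, ...
```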
Key Features
1. Intelligent Prompt Decomposition
# Identifies natural breakpoints
sections = splitter.split(prompt)
# → [heading1_content, list_items, paragraph_block, ...]
# Preserves shared context across sub-tasks
for section in sections:
    subtask = f"{shared_context}\n\n{section}"
2. Parallel Execution with Controlled Concurrency
sem = asyncio.Semaphore(3)                 # at most 3 sub-task calls in flight
async def limited(call):
    async with sem: return await call      # each sub-task acquires the semaphore
results = await asyncio.gather(*map(limited, subtask_calls))
3. Multi-Strategy Aggregation
| Sub-task Count | Strategy | Method |
|---|---|---|
| 1-3 | Concatenate | Join all + final synthesis |
| 4-10 | Sequential | Summarize each, then synthesize |
| 10+ | Hierarchical | Tree-based summarization |
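With `aggregation_strategy="auto"`, the choice reduces to the sub-task count. A sketch of that selection using the thresholds from the table above (the real ResultAggregator may weigh other signals):

```python
def pick_strategy(n_subtasks: int) -> str:
    """Map sub-task count to an aggregation strategy (thresholds from the table above)."""
    if n_subtasks <= 3:
        return "concatenate"   # join all results, then one final synthesis pass
    if n_subtasks <= 10:
        return "sequential"    # summarize each result, then synthesize the summaries
    return "hierarchical"      # tree of partial summaries, then a final summary
```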
Performance Optimizations
| Optimization | Improvement |
|---|---|
| Expert identification toggle | 25-30% latency reduction |
| Token count caching | 40-60% cache hits, 60-180ms savings |
| Async file I/O | ~160ms improvement |
| Parallel token counting | 140ms+ savings on large prompts |
| Result caching (LRU) | 8-12s+ savings on repeated prompts |
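Token count caching is easy to reproduce because a count depends only on the text: an in-process LRU cache avoids re-encoding repeated sections such as the shared context. A sketch, again assuming tiktoken's `cl100k_base` encoding:

```python
from functools import lru_cache

import tiktoken

_ENCODING = tiktoken.get_encoding("cl100k_base")

@lru_cache(maxsize=4096)
def cached_token_count(text: str) -> int:
    """Memoized token count; repeated sections hit the cache instead of re-encoding."""
    return len(_ENCODING.encode(text))
```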
Execution Times (DeepSeek-R1:32b)
| Prompt Size | Strategy | Sub-tasks | Duration | Tokens |
|---|---|---|---|---|
| < 6K | Single call | 1 | 5-15s | ~6K |
| 6K-12K | Semantic | 2-3 | 15-25s | ~12K |
| 12K-24K | Semantic | 4-5 | 25-40s | ~24K |
| 24K+ | Hierarchical | 6-10 | 40-90s | 30K+ |
Cost Comparison
| Approach | Cost per 100K tokens |
|---|---|
| OpenAI GPT-4 | ~$3.00 |
| Anthropic Claude | ~$2.40 |
| Local Ollama | $0.00 |
Usage
Python API
from ml.agents import OllamaDecompositionAgent, AgentConfig
config = AgentConfig(
    ollama_host="192.168.2.2:11434",
    ollama_model="deepseek-r1:32b",
    max_parallel_tasks=3,
    aggregation_strategy="auto"
)
agent = OllamaDecompositionAgent(config)
result = await agent.process(large_prompt)  # run inside an async function or via asyncio.run(...)
print(f"Response: {result.final_response}")
print(f"Tokens: {result.total_tokens_used}")
print(f"Duration: {result.total_duration_seconds:.2f}s")
CLI
# Simple prompt
python ml/agents/examples/cli_tool.py "Your prompt here"
# Load from file
python ml/agents/examples/cli_tool.py @large-document.txt
# Custom configuration
python ml/agents/examples/cli_tool.py @prompt.txt \
--model deepseek-r1:32b \
--max-tokens 16384 \
--parallel 4 \
--aggregation-strategy hierarchical
Configuration
Core Settings
AgentConfig(
    # Ollama
    ollama_host="192.168.2.2:11434",
    ollama_model="deepseek-r1:32b",

    # Context Management
    max_context_tokens=8192,
    response_reserve_tokens=2048,
    chunk_overlap_tokens=200,

    # Execution
    max_parallel_tasks=3,
    timeout_seconds=300,

    # Performance (v2.0+)
    enable_token_count_cache=True,
    enable_async_file_io=True,
    enable_result_caching=False,
)
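These defaults line up with the ~6K decision boundary in the workflow: the usable prompt budget is the context window minus the tokens reserved for the response.

```python
max_context_tokens = 8192
response_reserve_tokens = 2048

prompt_budget = max_context_tokens - response_reserve_tokens
print(prompt_budget)  # 6144, i.e. the ~6K single-call threshold
```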
Results & Benefits
Technical Outcomes
Performance:
├── Latency reduction: 30-50% (with optimizations)
├── Cache hit rate: 40-60%
├── Backward compatibility: 100%
└── API costs: $0
Use Cases
- Security Audits: Analyze large codebases in parallel
- Document Analysis: Process long reports with coherent synthesis
- Code Review: Multi-file reviews with context preservation
- Research: Complex multi-part questions with structured responses
Architecture Decisions
- Tiktoken over custom counting: OpenAI-standard accuracy, battle-tested
- Semantic over fixed-size splitting: Preserves meaning, better coherence (see the sketch after this list)
- Async over threading: Better I/O performance, cleaner code
- LRU caching over persistent: Session-scoped, no stale data issues
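To illustrate the semantic-splitting choice, here is a minimal sketch that breaks at Markdown headings first and falls back to blank-line paragraph boundaries only for oversized sections; `max_chars` is a stand-in for the real token budget, and the actual PromptSplitter is more involved:

```python
import re

def semantic_split(prompt: str, max_chars: int = 12_000) -> list[str]:
    """Split at headings first; fall back to paragraph boundaries for oversized sections."""
    sections = re.split(r"\n(?=#{1,6} )", prompt)  # break before Markdown headings
    pieces: list[str] = []
    for section in sections:
        if len(section) <= max_chars:
            pieces.append(section.strip())
        else:
            # a section too large on its own: split at blank lines instead
            pieces.extend(p.strip() for p in section.split("\n\n") if p.strip())
    return [p for p in pieces if p]
```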