Pinky: A Compound AI System on Consumer Hardware
A conversational orchestrator that fans out to a roster of specialist agents over a single time-shared 20B brain — with a decorrelated grounding verifier riding co-resident on the same 16 GB GPU. Fully local, $0/month. Built because no single model is reliably metacognitive about its own ability to answer.
60% lenient · 5.4M tokens pushed in one benchmark run · 1.35× memory oversubscription · zero crashes, zero context overflows across 727 brain calls · fully offline. — Pinky-alpha1, GAIA Level 1, 31 May 2026.
TL;DR
Pinky started life on April 15 as a persona chatbot in Open WebUI talking to a single 20B model through Ollama. Six weeks later it is a fully-local compound AI system that scored 53% on GAIA Level 1 — one orchestrator, a roster of specialist agents, one brain time-shared across seven roles, and a decorrelated grounding verifier riding shotgun on the same GPU. Every architectural shift along the way was driven by the same lesson, learned harder each time: no single model is reliably metacognitive about its own ability to answer.
Six weeks, five architectures
Each one solved the previous version's stated problem and uncovered the next problem underneath it. This is the whole arc, in order.
-
Apr 15
One model in a costume
The earliest commits stood up a single 20B model behind a persona prompt, with SearXNG for web search and a vector lookup against an Obsidian vault. The "multi-agent" part was a regex-based router in a pipelines shim that pattern-matched the user message and string-templated a delegation to another persona-flavored prompt — running on the same model.
One model, one prompt, light tool use. The "agents" were costumes the same model wore; the router was a regex, not a classifier. It worked well enough to demo. It did not work well enough to be honest.
-
Phase 1
Real microservices
The persona-swap inside one model bled — Pinky would slip into another agent's voice mid-response — and the pipelines shim could not run real async loops; it was a streaming text proxy with extra steps. The move: tear out the shim and give every agent its own FastAPI + LangGraph service with its own ReAct loop, tool surface, and model assignment. Pinky became a real orchestrator with a planner LLM call instead of a regex.
This phase ended with fourteen microservice agents, a FalkorDB knowledge graph, a Qdrant vector store, and an Obsidian-as-memory pattern with the Archivist owning every write. More correct than the regex shim — and more agents than the planner could keep straight.
-
Phase 2
A designed compound system
This is where it stopped being "more agents" and started being a system designed around an orchestration pattern.
- Fourteen agents consolidated to nine. Duplicate fan-outs collapsed into mode-switched specialists: QA and Scrum Manager merged into Critic, Code Writer and Release Manager into Developer, Tron and Rinzler into Security. The Surrealist was promoted as a deliberate source of entropy — the engine behind the Mucha-swerve. Decomposition is not a virtue when the planner cannot tell two near-duplicates apart.
- The Mucha-line. The old concurrency model used an
is_freepre-dispatch check that livelocked under fan-out. Replaced with unconditional async dispatch, a per-agent outbox, and content-aware absorb. This async core survived every later rewrite — it is still the spine of the live system. - RAG as a loop, not a tool. Section-aware chunking, a Qdrant collection with hybrid BM25-plus-vector retrieval fused by Reciprocal Rank Fusion, a
bge-reranker-v2-m3cross-encoder, and a grounding validator that checks the answer against a different model family.
The shape of a compound system was visible. But the orchestrator was still a "chief-of-staff": Pinky's own LLM decided when to self-handle versus dispatch, and a
closed_reasonpath let it short-circuit lookups when it thought it already knew the answer. That last part is where the next failure lived. -
Phase 3
The chief-of-staff wound
The triggering failure was a GAIA L1 question: "What writer is quoted by Merriam-Webster on 2022-06-27?" Full-pipeline runs got it right — Annie Levin. The
closed_reasonshortcut, where Pinky's planner decided "I know this, no lookup needed," confidently fabricated Oscar Wilde. Same model, different routing decision, opposite outcome.That failure sat on top of weeks of failures with the same shape:
closed_reasonhallucinations on lookup-shaped questions.- Over-delegation: three parallel Surrealist calls for one prompt.
- Under-delegation: dropping the Developer dispatch on "make a hello-world C app."
- Four rebenches hunting for a "magic chief" model that would route correctly. None converged.
The diagnosis, codified in ADR-002: the chief role itself is the wound.
No single LLM is reliably right about its own ability to answer. Swapping the chief swaps the failure mode; it does not remove it.
— ADR-002
-
The fork
Staged-router: prototyped, then parked
ADR-002's proposed fix (18 May) was a staged-router: semantic cache → tiny classifier → dispatcher → specialist ensemble → decorrelated verifier, so no single model is ever asked to be metacognitive about its own answer. I built it — all five stages as runnable code, twenty-nine integration tests green — and then never wired it in as the default path.
Eight days later it was superseded. A working prototype lost to a better lever: a greenfield rewrite of the brain itself. The classifier alias is the only artifact that remains, sitting unused. Building it was not wasted — it proved the diagnosis. The fix that shipped kept the idea (a decorrelated verifier) and dropped the apparatus: instead of a five-stage router, one capable brain behind a hard grounding gate.
-
Shipped
The muchador-line
Pinky-alpha1, the live system, is a greenfield rewrite (the
muchador/package) built on the surviving Mucha-line async core. One brain —gpt-oss-20bat Q4_K_XL — is time-shared across seven roles: planner, researcher, OSINT, developer, archivist, studio, reasoner. Three decorrelated roles (critic, security, ideaman) run onGemma-4on the CPU; the Surrealist divergence engine runs on a Llama-3.2 fork.Every synthesized answer is gated through a
Granite-Guardiangrounding verifier riding co-resident on the same 16 GB GPU as the brain — about 0.08 seconds per verdict. The LiteLLM gateway that briefly sat in front of the models was ripped out; agents speak tollama-serverdirectly. Orchestration is the Mucha-line: fully async, first-result-wins, with content-aware stall-cancel replacing every elapsed-clock timeout. That is the system that scored 53%.
The live architecture
USER ── Matrix / Telegram ──▶ PINKY orchestrator — the only surface you talk to
│
│ plan ▸ dispatch ▸ synthesize (one voice)
│ mucha-line: async · first-result-wins · stall-cancel
▼
dispatch across the agent roster, grouped by backend
─────────────────────────────────────────────────────
the brain, time-shared gpt-oss-20b @ RTX 16 GB
pinky · researcher · osint · developer
· archivist · studio · reasoner
decorrelated checks Gemma-4 @ CPU
critic · security · ideaman
divergence / swerve Llama-3.2 fork @ CPU
surrealist
│
▼
synthesis
│
▼
GROUNDING GATE ── Granite-Guardian, decorrelated,
co-resident with the brain on the 16 GB GPU (~0.08s/verdict)
│
┌──────────┴──────────┐
▼ ▼
grounded ──▶ USER no source ──▶ escalate
tools & data ─ SearXNG search · sandboxed code-runner · Obsidian vault
(Archivist-gated) · FalkorDB knowledge graph · Qdrant vectors
Hardware
| Tier | Hardware | Role |
|---|---|---|
| GPU | RTX 5060 Ti, 16 GB | The brain (gpt-oss-20b, ~11.6 GB) and the Granite-Guardian grounding verifier (~2.8 GB), co-resident. ~14.4 / 16 GB in use. |
| CPU | host, 29 GB RAM | Gemma-4 (critic / security / ideaman), the Surrealist (load-on-demand), embeddings (nomic + BGE-large), reranker (bge-reranker-v2-m3), vision (Qwen2.5-VL-3B, as-needed). |
| iGPU | Radeon 780M | Retired. Sidelined after May 18 — wedges on vision load, GTT-only memory. The CPU tier proved more reliable. |
One box. 29 GB RAM, one 16 GB GPU, no cloud burst. That is the whole rig.
Why compound AI, distilled
Metacognition is the wound
A single model deciding whether it knows the answer is the failure surface every chief-of-staff pattern hits. Staging the decision across classifier ≠ answerer ≠ verifier removes the self-doubt loop.
Decorrelation buys real entropy
A checker that shares the answerer's weights is theater — same priors, same blind spots. Gemma checks gpt-oss; the Surrealist swerves on a third family; the gate is a purpose-trained Guardian. Chosen for decorrelated error, not individual benchmarks.
The grounding gate kills hallucination cliffs
Every synthesized answer traces through Granite-Guardian; a closed_reason answer with no source escalates instead of shipping. This is the structural fix for the Oscar Wilde class of failure — true fabrications are now 2 in 53.
One brain, time-shared
Cold-swapping specialist models dominated wall-clock per question. One warm brain serving seven roles via prompt and tool set removes the swap; differentiation lives at the prompt layer, not the model layer. Critical on a 16 GB GPU.
Stall-cancel beats the clock
Wallclock timeouts kill slow-but-progressing work and wait forever on wedged calls. Cancel on the absence of content progress instead; let real progress run to completion regardless of wall time.
Reasoning > planning > architecture
Understanding what is actually wanted beats clean delegation beats the delivery mechanism. A planning error re-routes; a reasoning error is confidently-executed wrongness. The architecture is scaffolding — it cannot outrank what it holds up.
Rejected: "just find a better chief"
Four weeks of benches proved that swapping the chief swaps the failure mode rather than removing it. ADR-002 says so explicitly, so future me does not waste a fifth week.
What's shipped, parked, and next
Shipped
- Pinky-alpha1 at 53% GAIA L1.
- Native tool-calling across all agents.
- A blocking grounding gate (Granite-Guardian, co-resident).
- Hybrid RAG: BM25 + vector, RRF fusion, cross-encoder rerank.
- The Mucha-line: async, first-result-wins, content-aware stall-cancel.
- A durable vision-language service.
- 5.4M tokens in one bench run — zero crashes, zero context overflows.
Parked
- The staged-router (5 stages built, 29 tests green — never the live path).
- The
is_freeconcurrency model. - The LiteLLM gateway (ripped out).
- The iGPU tier.
- Sequenced pipelines not yet ported to the Mucha-line.
Next
- Synthesis-side answer-format extraction (~5 questions of headroom).
- A planner guardrail for temp=0 routing non-determinism.
- Porting sequenced pipelines onto park-and-resume.
What this should tell you about how I work
Pinky began as a persona over a single 20B model and ended up — by repeatedly losing the same fight — as one capable brain behind a decorrelated grounding gate, fanning out across a roster of agents on a single consumer GPU. The compound shape is not a design preference. It is the structural answer to "no single LLM is reliably honest about what it does not know."
The interesting part is not the architecture diagram. It is that the diagram is the fifth one — and that one of the discarded four was a working, tested staged-router I chose not to ship. Each version solved the previous version's stated problem and revealed the next problem underneath it. A working offline benchmark beat every confident design I had. That is what good architecture work actually feels like.
Built by Hunter Cobbs. Last updated May 31, 2026 · Pinky-alpha1 (GAIA L1 53%).