Pinky: A Compound AI System on Consumer Hardware

A conversational orchestrator that fans out to a roster of specialist agents over a single time-shared 20B brain, with a decorrelated grounding verifier riding co-resident on the same 16 GB GPU. Fully local, $0/month. Built because no single model is reliably metacognitive about its own ability to answer.

53% GAIA Level 1, gold: 28/53 exact-match

81% Answered honestly: correct or abstained, no fabrication (43/53)

2 Invented facts in 53 hard questions: a 4% fabrication rate

$0 Marginal cost per month: no cloud, no API calls

16GB One consumer GPU: the whole rig, plus 29 GB RAM

1 Brain: time-shared across seven roles, beside its own verifier

60% lenient · 5.4M tokens pushed in one benchmark run · 1.35× memory oversubscription · zero crashes, zero context overflows across 727 brain calls · fully offline. Pinky-alpha1, GAIA Level 1, 31 May 2026.

TL;DR

Pinky started life on April 15 as a persona chatbot in Open WebUI talking to a single 20B model through Ollama. Six weeks later it is a fully-local compound AI system that scored 53% on GAIA Level 1: one orchestrator, a roster of specialist agents, one brain time-shared across seven roles, and a decorrelated grounding verifier riding shotgun on the same GPU. Every architectural shift along the way was driven by the same lesson, learned harder each time: no single model is reliably metacognitive about its own ability to answer.

Six weeks, five architectures

Each one solved the previous version's stated problem and uncovered the next problem underneath it. This is the whole arc, in order.

Apr 15
One model in a costume

The earliest commits stood up a single 20B model behind a persona prompt, with SearXNG for web search and a vector lookup against an Obsidian vault. The "multi-agent" part was a regex-based router in a pipelines shim that pattern-matched the user message and string-templated a delegation to another persona-flavored prompt, running on the same model.

One model, one prompt, light tool use. The "agents" were costumes the same model wore; the router was a regex, not a classifier. It worked well enough to demo. It did not work well enough to be honest.
Phase 1
Real microservices

The persona-swap inside one model bled (Pinky would slip into another agent's voice mid-response) and the pipelines shim could not run real async loops; it was a streaming text proxy with extra steps. The move: tear out the shim and give every agent its own FastAPI + LangGraph service with its own ReAct loop, tool surface, and model assignment. Pinky became a real orchestrator with a planner LLM call instead of a regex.

This phase ended with fourteen microservice agents, a FalkorDB knowledge graph, a Qdrant vector store, and an Obsidian-as-memory pattern with the Archivist owning every write. More correct than the regex shim, and more agents than the planner could keep straight.
Phase 2
A designed compound system

This is where it stopped being "more agents" and started being a system designed around an orchestration pattern.
- Fourteen agents consolidated to nine. Duplicate fan-outs collapsed into mode-switched specialists: QA and Scrum Manager merged into Critic, Code Writer and Release Manager into Developer, Tron and Rinzler into Security. The Surrealist was promoted as a deliberate source of entropy: the engine behind the Mucha-swerve. Decomposition is not a virtue when the planner cannot tell two near-duplicates apart.
- The Mucha-line. The old concurrency model used an is_free pre-dispatch check that livelocked under fan-out. Replaced with unconditional async dispatch, a per-agent outbox, and content-aware absorb. This async core survived every later rewrite; it is still the spine of the live system.
- RAG as a loop, not a tool. Section-aware chunking, a Qdrant collection with hybrid BM25-plus-vector retrieval fused by Reciprocal Rank Fusion, a bge-reranker-v2-m3 cross-encoder, and a grounding validator that checks the answer against a different model family.
The shape of a compound system was visible. But the orchestrator was still a "chief-of-staff": Pinky's own LLM decided when to self-handle versus dispatch, and a closed_reason path let it short-circuit lookups when it thought it already knew the answer. That last part is where the next failure lived.
Phase 3
The chief-of-staff wound

The triggering failure was a GAIA L1 question: "What writer is quoted by Merriam-Webster on 2022-06-27?" Full-pipeline runs got it right: Annie Levin. The closed_reason shortcut, where Pinky's planner decided "I know this, no lookup needed," confidently fabricated Oscar Wilde. Same model, different routing decision, opposite outcome.

That failure sat on top of weeks of failures with the same shape:
- closed_reason hallucinations on lookup-shaped questions.
- Over-delegation: three parallel Surrealist calls for one prompt.
- Under-delegation: dropping the Developer dispatch on "make a hello-world C app."
- Four rebenches hunting for a "magic chief" model that would route correctly. None converged.
The diagnosis, codified in ADR-002: the chief role itself is the wound.

No single LLM is reliably right about its own ability to answer. Swapping the chief swaps the failure mode; it does not remove it.
ADR-002

The fork
Staged-router: prototyped, then parked

ADR-002's proposed fix (18 May) was a staged-router: semantic cache → tiny classifier → dispatcher → specialist ensemble → decorrelated verifier, so no single model is ever asked to be metacognitive about its own answer. I built it (all five stages as runnable code, twenty-nine integration tests green) and then never wired it in as the default path.

Eight days later it was superseded. A working prototype lost to a better lever: a greenfield rewrite of the brain itself. The classifier alias is the only artifact that remains, sitting unused. Building it was not wasted; it proved the diagnosis. The fix that shipped kept the idea (a decorrelated verifier) and dropped the apparatus: instead of a five-stage router, one capable brain behind a hard grounding gate.
Shipped
The muchador-line

Pinky-alpha1, the live system, is a greenfield rewrite (the muchador/ package) built on the surviving Mucha-line async core. One brain (gpt-oss-20b at Q4_K_XL) is time-shared across seven roles: planner, researcher, OSINT, developer, archivist, studio, reasoner. Three decorrelated roles (critic, security, ideaman) run on Gemma-4 on the CPU; the Surrealist divergence engine runs on a Llama-3.2 fork.

Every synthesized answer is gated through a Granite-Guardian grounding verifier co-resident with the brain on the same 16 GB GPU, about 0.08 seconds per verdict. The LiteLLM gateway that briefly sat in front of the models was ripped out; agents speak to llama-server directly. Orchestration is the Mucha-line: fully async, first-result-wins, with content-aware stall-cancel replacing every elapsed-clock timeout. That is the system that scored 53%.

The live architecture

USER ── Matrix / Telegram ──▶ PINKY     orchestrator: the only surface you talk to
                                │
                                │   plan ▸ dispatch ▸ synthesize (one voice)
                                │   mucha-line: async · first-result-wins · stall-cancel
                                ▼
                        dispatch across the agent roster, grouped by backend
                        ─────────────────────────────────────────────────────
                        the brain, time-shared   gpt-oss-20b @ RTX 16 GB
                          pinky · researcher · osint · developer
                          · archivist · studio · reasoner
                        decorrelated checks      Gemma-4 @ CPU
                          critic · security · ideaman
                        divergence / swerve      Llama-3.2 fork @ CPU
                          surrealist
                                │
                                ▼
                          synthesis
                                │
                                ▼
                  GROUNDING GATE ── Granite-Guardian, decorrelated,
                  co-resident with the brain on the 16 GB GPU (~0.08s/verdict)
                                │
                     ┌──────────┴──────────┐
                     ▼                     ▼
               grounded ──▶ USER     no source ──▶ escalate

  tools & data ─ SearXNG search · sandboxed code-runner · Obsidian vault
                 (Archivist-gated) · FalkorDB knowledge graph · Qdrant vectors

Hardware

Tier	Hardware	Role
GPU	RTX 5060 Ti, 16 GB	The brain (`gpt-oss-20b`, ~11.6 GB) and the Granite-Guardian grounding verifier (~2.8 GB), co-resident. ~14.4 / 16 GB in use.
CPU	host, 29 GB RAM	Gemma-4 (critic / security / ideaman), the Surrealist (load-on-demand), embeddings (nomic + BGE-large), reranker (bge-reranker-v2-m3), vision (Qwen2.5-VL-3B, as-needed).
iGPU	Radeon 780M	Retired. Sidelined after May 18: wedges on vision load, GTT-only memory. The CPU tier proved more reliable.

One box. 29 GB RAM, one 16 GB GPU, no cloud burst. That is the whole rig.

Why compound AI, distilled

Metacognition is the wound

A single model deciding whether it knows the answer is the failure surface every chief-of-staff pattern hits. Staging the decision across classifier ≠ answerer ≠ verifier removes the self-doubt loop.

Decorrelation buys real entropy

A checker that shares the answerer's weights is theater: same priors, same blind spots. Gemma checks gpt-oss; the Surrealist swerves on a third family; the gate is a purpose-trained Guardian. Chosen for decorrelated error, not individual benchmarks.

The grounding gate kills hallucination cliffs

Every synthesized answer traces through Granite-Guardian; a closed_reason answer with no source escalates instead of shipping. This is the structural fix for the Oscar Wilde class of failure: true fabrications are now 2 in 53.

One brain, time-shared

Cold-swapping specialist models dominated wall-clock per question. One warm brain serving seven roles via prompt and tool set removes the swap; differentiation lives at the prompt layer, not the model layer. Critical on a 16 GB GPU.

Stall-cancel beats the clock

Wallclock timeouts kill slow-but-progressing work and wait forever on wedged calls. Cancel on the absence of content progress instead; let real progress run to completion regardless of wall time.

Reasoning > planning > architecture

Understanding what is actually wanted beats clean delegation beats the delivery mechanism. A planning error re-routes; a reasoning error is confidently-executed wrongness. The architecture is scaffolding: it cannot outrank what it holds up.

Rejected: "just find a better chief"

Four weeks of benches proved that swapping the chief swaps the failure mode rather than removing it. ADR-002 says so explicitly, so future me does not waste a fifth week.

What's shipped, parked, and next

Shipped

Pinky-alpha1 at 53% GAIA L1.
Native tool-calling across all agents.
A blocking grounding gate (Granite-Guardian, co-resident).
Hybrid RAG: BM25 + vector, RRF fusion, cross-encoder rerank.
The Mucha-line: async, first-result-wins, content-aware stall-cancel.
A durable vision-language service.
5.4M tokens in one bench run: zero crashes, zero context overflows.

Parked

The staged-router (5 stages built, 29 tests green, never the live path).
The is_free concurrency model.
The LiteLLM gateway (ripped out).
The iGPU tier.
Sequenced pipelines not yet ported to the Mucha-line.

Synthesis-side answer-format extraction (~5 questions of headroom).
A planner guardrail for temp=0 routing non-determinism.
Porting sequenced pipelines onto park-and-resume.

What this should tell you about how I work

Pinky began as a persona over a single 20B model and ended up, by repeatedly losing the same fight, as one capable brain behind a decorrelated grounding gate, fanning out across a roster of agents on a single consumer GPU. The compound shape is not a design preference. It is the structural answer to "no single LLM is reliably honest about what it does not know."

The interesting part is not the architecture diagram. It is that the diagram is the fifth one, and that one of the discarded four was a working, tested staged-router I chose not to ship. Every version closed the gap the last one left and opened the one beneath it. A working offline benchmark beat every confident design I had. That is what good architecture work actually feels like.

Source & full docs → Contribution breakdown →

Built by Hunter Cobbs. Last updated May 31, 2026 · Pinky-alpha1 (GAIA L1 53%).

TL;DR

Six weeks, five architectures

One model in a costume

Real microservices

A designed compound system

The chief-of-staff wound

Staged-router: prototyped, then parked

The muchador-line