A local-first desktop AI app. Start with simple chat, pick the API or local model that powers it, then summon full rites when the work needs memory, tools, debate, critique, and saved artifacts.
In April 2026, OpenAI published "Where the goblins came from", explaining how a reward signal trained for a "Nerdy" personality leaked across all of GPT-5.5's outputs and produced a noticeable surge in creature metaphors. Codex shipped with a hardcoded ban list: goblins, gremlins, raccoons, trolls, ogres, pigeons.
This project takes that ban list as a roster.
| Creature | Job |
|---|---|
| Goblin | Worker. Cheap, high-temperature, dispatched in packs. Each pack member gets a different personality; an optional debate round lets them revise after seeing each other's proposals. |
| Gremlin | Adversarial. Per-goblin chaos pass — finds edge cases, off-by-ones, prompt-injection vectors. |
| Raccoon | Scavenger. Returns only the facts a task actually needs. Also loads relevant prior Artifacts when memory is enabled. |
| Troll | Reviewer. Default-rejects. JSON verdict. May invoke verifier tools (json.parse, regex.match, http.head, and enabled add-on tools) before scoring. |
| Ogre | Heavyweight. Last resort — only summoned if the pack and the Specialists both fail. |
| Pigeon | Carrier and Scribe. Routes artifacts between Warrens (federation) and distills every completed rite into a typed Artifact (memory). |
| Specialist | Goblin variant. Spawned when the pack fails Troll review — each one targets a single dominant failure mode identified by clustering the gremlin's critiques. |
A unit test pins the roster to the OpenAI ban list, so it can't drift quietly. The Specialist is a Goblin variant — same kind, focused system prompt — so the ban-list invariant still holds.
▄█▄ ▄█▄
███ ███
▀████████████▀
█ ▀▄ ▄▀ █
█ ● ● █
█ ▾▾ █
█▄▄▄▄▄▄▄▄▄▄█
█▌ █ █ ▐█
▀▀ ▀ ▀ ▀▀
▀▄ ▄▀ ▀▄ ▄▀
▀█▄▄█▄▄█▀
█████████
█ ◉ ◉ █
█ ╳ █
█ ╲╱╲╱╲ █
▀█████▀
█ █
▀▀ ▀▀
▄█▄ ▄█▄
███ ███
▀████████████▀
█▌ ●▔ ▔● ▐█
█ ▾ █
█▄▄▄▄▄▄▄▄▄▄▄▄█
█▌█ █▐█
▀▀▀ ▀▀▀
▄ ▄ ▄ ▄
█ █ █ █
▄████████████▄
█ ● ● █
█ ▾▾▾▾ █
█ ────────── █
████████████████
█▌ ▐█
█▌ ▐█
████ ████
▄▄▄▄▄▄▄▄▄▄
████████████
██ ▀▀ ▀▀ ██
█ ● ● █
█ ▽ █
█▄ ▼▼▼▼▼▼▼▼ ▄█
████████████
██████████████
██ ██
██ ██
▄██▄
██ ●█
█▌ █▶▶▶
██████████
█▀▀▀▀▀▀▀▀█
████████
█ █
█ █
▀▀ ▀▀
The full pipeline. Every step writes a Loot drop to the Hoard with parent links to its inputs — a Rite is fully reconstructible from the Hoard alone. The Pigeon-Scribe also emits a typed Artifact (claims, evidence, open questions, next steps) that future rites can cite.
optional ─────────────────────────────────────────────────────
┌──────────┐ │
│ Planner │ DAG of sub-rites; recursive replan on failure │
└────┬─────┘ │
▼ │
┌──────────┐ facts + ┌────────────┐ N parallel ┌──────────┐
│ Raccoon │ prior ▶│ Goblin │═════════════▶│ Goblins │
│ + memory │ artifacts │ pack │ per-goblin │ output │
└──────────┘ │ │ personality └────┬─────┘
└────────────┘ │
optional debate │
(peers see peers, │
revise) ◀───────────────┘
│
▼
┌─────────────┐
│ Gremlin │ per-goblin
│ chaos pass │ attack
└──────┬──────┘
▼
┌─────────────┐ optional
│ Troll │ verifier tool-use
│ review │ (json/regex/http)
└──────┬──────┘
│
any pass ──┴── all fail
│ │
│ ▼
│ ┌───────────────┐
│ │ Cluster fails │
│ │ (1 LLM call) │
│ └───────┬───────┘
│ ▼
│ ┌───────────────┐
│ │ Specialists │ 1-3 focused
│ │ + re-judge │ recovery workers
│ └───────┬───────┘
│ │
│ passed/improved over seed
│ ▼
│ ┌────────────┐
│ │ Ogre │ last resort
│ │ fallback │ (heavyweight)
│ └─────┬──────┘
│ │
▼ ▼
winner ◀──────┘
│
▼
┌─────────────┐
│ Pigeon — │ distills the rite into
│ Scribe │ a typed Artifact (memory)
└─────────────┘
Loot: One agent invocation, content-addressed by sha256(model || prompt || output).
Rite (lightweight): Goblin pack + Troll arbitration.
Rite (full pipeline): Raccoon → pack → (debate?) → Gremlin → Troll → Specialists → Ogre fallback → Scribe.
Quest: A DAG of sub-rites the Planner emits for complex tasks. Executed in topological order; parent artifacts feed into dependent nodes, with recursive replan on node failure.
Artifact: Typed JSON summary of a completed Rite: claims, evidence, open questions, next steps, parent-artifact links. Future rites cite or auto-load relevant artifacts.
Failure cluster: A dominant failure mode (e.g. null-handling, off-by-one) identified by a clustering LLM call. Each cluster spawns one focused Specialist Goblin.
Hoard: File-backed store under .goblintown/hoard/. Holds Loot, rites, quests, artifacts, inbox, and outbox. Every drop is auditable.
Warren: Per-project root, found by walking up from cwd.
Reward: troll score − cross-creature drift penalty + pass bonus, clamped to 0..1. Override with .goblintown/reward.mjs.
Drift: Cross-creature word frequency. A Goblin output mentioning raccoons unprompted is the signal we measure.
Trace: Full run history, exportable to the academic LLM-MAS Orchestration Trace schema for compatibility with research tooling.
Federation: Pigeons route artifacts to other Warrens over file or HTTP, with signed bodies and optional HMAC.
The desktop app opens directly into chat, asks which API or local model should power Goblintown, and keeps setup to a few guided clicks. One-click installers, no build step.
macOS: open the DMG and drag Goblintown to Applications. Windows: run the installer. Linux: mark the AppImage executable and run it.
Every asset, along with SHA256SUMS.txt, lives on the v0.7.0-beta.1 release.
These beta builds aren't code-signed yet, so macOS may need right-click →
Open and Windows may show a SmartScreen "More info → Run anyway" prompt.
Prefer the command line? Install from npm:
npm install -g goblintown
goblintown serve # opens the GUI at http://localhost:7777/
The app opens into Goblin Mode: one prompt and a Single Goblin / Goblintown switch. Single Goblin is one worker and one answer — fast chat. Goblintown turns the prompt into a planner DAG with the full pack, memory, and self-correction, streaming progress as it runs.
Flip on the Tank for the live diorama: each creature has a home, tokens stream into per-creature thinking bubbles, the DAG panel lights up node-by-node, and the result panel slides up with the winning output.
Everything else lives behind Settings — API provider and per-creature model routing, voice, imported context, group chats and country collaboration, mail, add-ons, onchain lookup, sentiment sources, cloud sign-in, and reset. An interrupted run resumes from the Tank's recovery prompt after a restart.
The package still ships a CLI for development and automation (goblintown serve, rite, plan, thesis, context, …). It is no longer the primary surface; run goblintown --help for the full list.
Decompose complex tasks into a DAG of sub-rites with dynamic spawning (planner picks pack size and personality per node). Recursive replan on node failure, max depth 2.
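As a rough illustration of what "topologically executed" means here, the sketch below orders a small DAG with Kahn's algorithm and feeds each node the outputs of its parents. The node shape `{ id, deps }`, the function names, and the synchronous `run` callback are all assumptions made for the example; the planner's real schema and its replan logic are not shown in this document.

```javascript
// Hypothetical node shape: { id: string, deps: string[] }.
// Returns node ids so that every dependency precedes its dependents
// (Kahn's algorithm). Throws on unknown dependencies or cycles.
function topoOrder(nodes) {
  const byId = new Map(nodes.map((n) => [n.id, n]));
  const indegree = new Map(nodes.map((n) => [n.id, 0]));
  const dependents = new Map(nodes.map((n) => [n.id, []]));
  for (const n of nodes) {
    for (const dep of n.deps) {
      if (!byId.has(dep)) throw new Error(`unknown dependency: ${dep}`);
      indegree.set(n.id, indegree.get(n.id) + 1);
      dependents.get(dep).push(n.id);
    }
  }
  const ready = nodes.filter((n) => n.deps.length === 0).map((n) => n.id);
  const order = [];
  while (ready.length > 0) {
    const id = ready.shift();
    order.push(id);
    for (const next of dependents.get(id)) {
      indegree.set(next, indegree.get(next) - 1);
      if (indegree.get(next) === 0) ready.push(next);
    }
  }
  if (order.length !== nodes.length) throw new Error("cycle detected");
  return order;
}

// Runs each node after its parents and passes the parent outputs in.
function executeDag(nodes, run) {
  const byId = new Map(nodes.map((n) => [n.id, n]));
  const outputs = {};
  for (const id of topoOrder(nodes)) {
    const node = byId.get(id);
    outputs[id] = run(node, node.deps.map((d) => outputs[d]));
  }
  return outputs;
}
```

A real runner would presumably be asynchronous and would wrap `run` in the replan-on-failure loop (bounded at depth 2, per the description above); the sketch only covers ordering and data flow.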
Each goblin in a pack draws a different personality from {nerdy, cynical, chipper, stoic, feral, goblin_mode} — a real debate team, not three copies of the same worker.
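One way to give every pack member a distinct personality is to shuffle the list and take the first N entries. The sketch below does that with a Fisher-Yates shuffle. The function name, the injectable `random` parameter, and the choice to throw when the pack is larger than the personality list are assumptions for illustration; the document does not say how Goblintown handles that case.

```javascript
const PERSONALITIES = ["nerdy", "cynical", "chipper", "stoic", "feral", "goblin_mode"];

// Returns packSize distinct personalities in random order.
// `random` is injectable so the draw can be made deterministic in tests.
function drawPersonalities(packSize, random = Math.random) {
  if (packSize > PERSONALITIES.length) {
    // Assumption: refuse rather than repeat a personality.
    throw new RangeError(`pack size ${packSize} exceeds ${PERSONALITIES.length} personalities`);
  }
  const pool = [...PERSONALITIES];
  // Fisher-Yates shuffle, in place on the copy.
  for (let i = pool.length - 1; i > 0; i--) {
    const j = Math.floor(random() * (i + 1));
    [pool[i], pool[j]] = [pool[j], pool[i]];
  }
  return pool.slice(0, packSize);
}
```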
Optional round where each goblin sees its peers' outputs and may revise. Closes the O3 communication gap from the LLM-MAS-RL survey. Training-free.
Every output gets attacked by a gremlin before review. Edge cases, off-by-ones, prompt injection.
Optional troll tool-use round: json.parse, regex.match, http.head, and enabled add-on tools. Real ground-truth signals beyond LLM-as-judge.
When the pack all-fails, cluster failure modes (1 LLM call) and spawn focused Specialist Goblins that surgically fix one mode each. Wins on pass OR improvement over seed; only escalates to Ogre if specialists also fail.
Heavyweight last-resort. Synthesizes from the best parts of the failed pack. Only summoned when nothing else holds.
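The escalation order described above (pack first, then Specialists, then Ogre) can be sketched as a small decision function. The shapes `{ id, verdict: { passed, score } }` and the per-specialist `seedScore` field are hypothetical; in particular, how Goblintown defines the "seed" a Specialist must improve on is an assumption here.

```javascript
// Picks the entry with the highest verdict score from a non-empty array.
function bestByScore(entries) {
  return entries.reduce((a, b) => (b.verdict.score > a.verdict.score ? b : a));
}

// pack:        [{ id, verdict: { passed, score } }]
// specialists: [{ id, seedScore, verdict: { passed, score } }]
// seedScore is assumed to be the score of the failed output the Specialist started from.
function chooseStage(pack, specialists) {
  const passing = pack.filter((g) => g.verdict.passed);
  if (passing.length > 0) {
    return { stage: "pack", winner: bestByScore(passing).id };
  }
  // A Specialist counts as a win on pass OR on improvement over its seed.
  const recovered = specialists.filter(
    (s) => s.verdict.passed || s.verdict.score > s.seedScore
  );
  if (recovered.length > 0) {
    return { stage: "specialist", winner: bestByScore(recovered).id };
  }
  // Nothing held: escalate to the heavyweight fallback.
  return { stage: "ogre", winner: null };
}
```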
Pigeon-Scribe distills every rite into a typed JSON Artifact (claims, evidence, open questions, next steps). Future rites cite or auto-load relevant artifacts. Local context ingestion imports old conversations and project folders as file-backed Artifacts. Chat Hoard Import Mode scans Codex sessions and ChatGPT exports, stores root/chunk DAG artifacts, and keeps chat memory pre-vectorized when embeddings are configured. AI summaries are opt-in.
As the Hoard accumulates, goblintown fold merges related older artifacts into higher-level summaries. Recursive abstraction.
Ogre, every goblin, and every specialist stream tokens. Live "thinking" bubbles in the tank UI show what each creature is producing as it produces it.
A one-prompt surface at / with Single Goblin and Goblintown modes, slash commands, and a Tank checkbox for planner runs. The desktop app opens here by default.
A tamagotchi-style live diorama at /tank: default creature sprites, a centered wordmark, village tiers, DAG progress, resumable run views, Onchain lookup, the Tank Sentiment tool, and first-run API plus Local Only / Goblintown Cloud choices.
Country, Mail, Add-ons, Onchain, Settings Sentiment Sources, API Provider, Account cloud mode, and Reset are gathered into one scrollable menu. Reset buries Asteroid Mode behind an extra step before local or cloud data can be destroyed.
Enable read-only onchain investigator tools for address profiles, activity, parsed transactions, token mint/account data, SOL balances, account metadata, SPL token accounts, recent signatures, and RPC health. The Tank Onchain panel and goblintown addon solana <address> can turn a lookup into a Goblintown analysis rite. No keys, signing, swaps, or transaction submission.
Build project-quality memos about a team, product, protocol, repository, or decision. A memo weighs quality and advantages, evidence gaps, risks, and invalidation triggers; it can scan repo files, and it labels missing evidence Unknown / Unverified. It does not assess buyability or make a buy/sell recommendation.
Free/no-key Alternative.me and GDELT baselines ship by default. Project checks keep query-specific signals separate from broad market context. Optional CoinGecko, Dune, Neynar, Santiment, CryptoPanic, and LunarCrush keys can be saved locally in .goblintown/secrets.json or supplied with env vars like COINGECKO_API_KEY and NEYNAR_API_KEY.
Run local-only by default, or sign in with the bundled Firebase project for SSO, friend codes, discovery, mail, and country metadata. Forks can override Firebase env vars without making every user bring their own keys.
Choose OpenAI, OpenRouter, Ollama, LM Studio, Groq, Together, Mistral, DeepSeek, Anthropic, Gemini, or a custom OpenAI-compatible endpoint from Settings. Keys stay local.
Answer-producing calls can be constrained to Markdown or a single JSON object. JSON output is locally validated and repaired once if malformed.
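A "validate, then repair once" loop for JSON output might look like the sketch below. The `repair` callback is a stand-in: the document does not say whether the repair step is a local fix or a second model call, so `extractBraces` is just one illustrative local strategy, and all function names are assumptions.

```javascript
// Parses text and insists on a single JSON object (not an array or scalar).
function parseJsonObject(text) {
  const value = JSON.parse(text); // throws SyntaxError when malformed
  if (value === null || typeof value !== "object" || Array.isArray(value)) {
    throw new Error("expected a single JSON object");
  }
  return value;
}

// Tries to parse; on failure calls `repair` exactly once and parses again.
// A second failure propagates to the caller.
function parseWithOneRepair(text, repair) {
  try {
    return parseJsonObject(text);
  } catch (firstError) {
    return parseJsonObject(repair(text, firstError));
  }
}

// Illustrative repair: keep only the span from the first "{" to the last "}".
function extractBraces(text) {
  const start = text.indexOf("{");
  const end = text.lastIndexOf("}");
  return start >= 0 && end > start ? text.slice(start, end + 1) : text;
}
```

For example, `parseWithOneRepair('Sure! {"a": 1} Hope that helps.', extractBraces)` recovers the embedded object, while input with no object at all still throws after the single repair attempt.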
Export any run to the academic LLM-MAS Orchestration Trace schema (10 event types, 8 edge types, topology classification).
Every Loot drop is keyed by hash. Full causal graph of who-produced-what is recoverable. Audit + graph commands traverse artifact lineage across rites.
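For readers unfamiliar with content addressing, here is a minimal sketch of a key derived as sha256(model || prompt || output), using Node's built-in `crypto` module. The function name `lootKey` and the hex encoding are assumptions; the document only gives the formula.

```javascript
const { createHash } = require("node:crypto");

// Hashes the three fields in order; successive update() calls are
// equivalent to hashing their concatenation.
function lootKey(model, prompt, output) {
  return createHash("sha256")
    .update(model, "utf8")
    .update(prompt, "utf8")
    .update(output, "utf8")
    .digest("hex");
}
```

Identical inputs always yield the same 64-character key, so a repeated invocation maps to the same drop. Note that plain concatenation cannot tell ("ab", "c") from ("a", "bc"); whether Goblintown inserts a delimiter or length prefix is not stated here.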
Goblintown measures cross-creature word frequency. High drift = your reward signal is leaking.
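To make the drift idea concrete, the sketch below counts mentions of other creatures that the prompt did not introduce and divides by the output's word count. The roster comes from the ban list above; the tokenization, the plural handling, and this exact definition of `driftRate` are assumptions, not Goblintown's actual metric.

```javascript
const CREATURES = ["goblin", "gremlin", "raccoon", "troll", "ogre", "pigeon"];

// self:   the creature that produced the output, e.g. "goblin"
// Returns per-creature counts of unprompted mentions and a rate in [0, 1].
function measureDrift(self, prompt, output) {
  const words = (text) => text.toLowerCase().match(/[a-z]+/g) || [];
  const promptWords = new Set(words(prompt));
  const outputWords = words(output);
  const counts = {};
  let hits = 0;
  for (const w of outputWords) {
    // Match singular or simple plural forms ("raccoon", "raccoons").
    const creature = CREATURES.find((c) => w === c || w === c + "s");
    if (!creature || creature === self) continue;
    // A mention is "prompted" if the prompt already named that creature.
    if (promptWords.has(creature) || promptWords.has(creature + "s")) continue;
    counts[creature] = (counts[creature] || 0) + 1;
    hits += 1;
  }
  const driftRate = outputWords.length === 0 ? 0 : hits / outputWords.length;
  return { counts, driftRate };
}
```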
Per-rite token budgets, per-call max-output, in-process semaphore for in-flight API calls. SDK retries 429/5xx automatically.
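An in-process semaphore that caps in-flight calls can be written in a few lines. The class below is a generic sketch, not Goblintown's implementation; the class name and the `run` wrapper are assumptions, and token budgets and SDK retries are out of scope.

```javascript
// Allows at most `limit` tasks to run at once; extra callers wait in FIFO order.
class Semaphore {
  constructor(limit) {
    this.limit = limit;
    this.active = 0;
    this.waiters = [];
  }

  async acquire() {
    if (this.active < this.limit) {
      this.active += 1;
      return;
    }
    // Wait until release() hands this caller a slot directly.
    await new Promise((resolve) => this.waiters.push(resolve));
  }

  release() {
    const next = this.waiters.shift();
    if (next) next(); // slot passes to the next waiter; active count is unchanged
    else this.active -= 1;
  }

  // Runs an async task under the limit and always releases the slot.
  async run(task) {
    await this.acquire();
    try {
      return await task();
    } finally {
      this.release();
    }
  }
}
```

Handing the slot straight to the next waiter, instead of decrementing and letting waiters race, keeps the active count from ever exceeding the limit.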
Pigeons deliver compressed artifacts between Warrens, over filesystem or HTTP. Signed bodies, optional HMAC.
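As a sketch of what HMAC-signing a federated body involves, the example below signs a string with a shared secret and verifies it in constant time using Node's `crypto` module. SHA-256, hex encoding, and the function names are assumptions; the document does not specify the digest or how the signature travels with the artifact.

```javascript
const { createHmac, timingSafeEqual } = require("node:crypto");

// Returns a hex HMAC of the body under the shared secret (SHA-256 assumed).
function signBody(body, secret) {
  return createHmac("sha256", secret).update(body, "utf8").digest("hex");
}

// Recomputes the HMAC and compares it without leaking timing information.
function verifyBody(body, secret, signature) {
  const expected = Buffer.from(signBody(body, secret), "hex");
  const given = Buffer.from(String(signature), "hex");
  // timingSafeEqual throws on unequal lengths, so check length first.
  return given.length === expected.length && timingSafeEqual(given, expected);
}
```

A receiving Warren would call `verifyBody` before accepting an artifact; a tampered body or a wrong secret makes it return `false`.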
Drop a .goblintown/reward.mjs to override the default scoring.
Re-run a rite with same params; export to markdown; diff two rites; print artifact lineage.
Pure-function coverage on every protocol invariant: drift, reward, hoard hashing, federation HMAC, audit, graph, planner DAG validation, debate prompt construction, tool dispatch, add-on registry, Solana read-only RPC shaping, address profiles, activity, transactions, token summaries, thesis prompt/evidence construction, Goblin Mode shell wiring, Tank thesis wiring, sentiment sources, query-specific project sentiment, market context separation, source failure normalization, local context ingestion, previous chat import, pre-vectorized chat memory, Tank/Settings wiring, embeddings ranking math, provider routing, cloud mode, Settings reset flows, sprite assets, output formatting, fold clustering, and trace-export schema mapping. No OpenAI calls during testing.
Drop a .goblintown/reward.mjs in your Warren to override the default scoring.
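// .goblintown/reward.mjs
// Custom reward: receives the Loot drop and the Troll verdict, returns a score.
export default function (loot, verdict) {
  // Passing outputs start at 0.8 and earn up to 0.2 more for low drift;
  // failing outputs keep half of their Troll score.
  return verdict.passed
    ? 0.8 + (1 - loot.drift.driftRate) * 0.2
    : verdict.score * 0.5;
}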
Result is clamped to [0, 1].
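For comparison with the override, the default formula (troll score − drift penalty + pass bonus, clamped to 0..1) could be sketched as follows. The weights `driftWeight` and `passBonus` are hypothetical values chosen for illustration; the document does not give the real ones. The function is written as a plain function so it runs standalone; in a reward.mjs file it would be the default export.

```javascript
const clamp01 = (x) => Math.min(1, Math.max(0, x));

// loot.drift.driftRate and verdict.{passed, score} follow the override's shapes.
// driftWeight and passBonus are illustrative assumptions, not Goblintown's values.
function sketchReward(loot, verdict, { driftWeight = 0.5, passBonus = 0.1 } = {}) {
  const penalty = driftWeight * loot.drift.driftRate;
  const bonus = verdict.passed ? passBonus : 0;
  return clamp01(verdict.score - penalty + bonus);
}
```

With these weights, a passing output scored 0.95 with no drift clamps to 1, and a failing output scored 0.1 with a drift rate of 1 clamps to 0.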
Goblintown is an engineering project, not a research paper, but the design of its later phases is shaped by what is working in current LLM multi-agent-systems research. We deliberately stay in the prompted, training-free slice of the literature so everything runs with just an OpenAI-compatible API key: no GPUs, no fine-tuning, no RL infra.
export-trace output format.
The bulk of the LLM-MAS-RL literature uses post-training methods (MAGRPO, MARFT, MAPoRL, Dr. MAS, SHARP, DEPART, MarsRL, MALT, MARSHAL, SPIRAL, …). These produce stronger orchestrators in benchmark settings but require GPUs, datasets, and RL infra. Goblintown is built for engineers shipping multi-agent features on existing API endpoints.