A multi-model AI council running as a reproducible OpenFactory Linux image

Build Your Own LLM Council on OpenFactory

Karpathy's LLM Council pattern asks multiple models, anonymizes peer review, and synthesizes a final answer. Here's how to build your own council as a reproducible OpenFactory image.

May 30, 2026

← Back to Blog

The useful thing about an LLM council is not that it makes models sound more dramatic. It is that it turns a single answer into a structured process: independent answers, anonymous peer review, aggregate ranking, and final synthesis.

Andrej Karpathy's llm-council repo gave that pattern a clean, hackable form. Instead of asking one favorite model for an answer, the app sends the same prompt to several models through OpenRouter, has them review anonymized answers from the others, then asks a chairman model to produce the final response.

That is a good local app. But if you want to run an LLM council as a repeatable lab, a team tool, or an internal decision assistant, you eventually want more than a folder on your laptop. You want a pinned operating system, documented runtime secrets, services that come up after boot, validation checks, and a VM you can rebuild when the stack changes. That is exactly the shape OpenFactory is built for.

What Karpathy's LLM Council Actually Does

The reference implementation is intentionally small: a FastAPI backend, a React + Vite frontend, JSON conversation storage, and OpenRouter as the provider abstraction. The important part is the workflow.

Stage 1: first opinions. Every council model gets the same user question and produces its own answer in parallel.
Stage 2: anonymous peer review. The answers are renamed as Response A, Response B, and so on. Each model reviews the others without seeing brand labels, then ranks them for quality.
Stage 3: chairman synthesis. A designated model gets the first answers, the reviews, and the rankings, then writes one final answer.

The repo's technical notes call out two details that matter in practice: the system continues when one model fails, and the UI exposes the raw reviews and parsed rankings so the user can inspect how the final answer was produced.

A council turns one question into a structured process: parallel first opinions, blind peer review and ranking, then a single synthesized answer you can audit.

Why The Pattern Works

A single LLM can be brilliant and still be overconfident, biased toward its own style, or weak on a specific task. A council gives you diversity of failure modes. One model might be better at mathematical rigor, another at product judgment, another at finding missing assumptions, and another at writing the final answer clearly.

This is not only a hobbyist instinct. The Language Model Council paper frames the same problem from an evaluation angle: using a single LLM judge can introduce intra-model bias, especially on subjective tasks, while a panel of models can produce rankings that are more reliable and closer to human preferences in their case study.

The quality gains show up on hard benchmarks too. In the Mixture-of-Agents work, layering several open-source models so each one sees its peers' drafts reached 65.1% on AlpacaEval 2.0, ahead of GPT-4o's 57.5%, using only open weights. A council is a close cousin of that idea: independent drafts plus cross-model review beat any single member, even when no individual model is the strongest. The same authors are candid about the catch, though: the layered approach raises time-to-first-token, so you trade latency and tokens for quality.

Why Build It On OpenFactory

The first version of an LLM council should be a weekend experiment. The second version should be reproducible. OpenFactory turns the council from an app install into an image build:

Repeatable base system. Pick Ubuntu, Debian, Fedora, or another supported base and capture the full package set in the recipe.
Runtime secret boundary. API keys stay out of the ISO and get injected through an environment file after deployment.
Service supervision. Backend, frontend, and nginx come up through systemd instead of a forgotten terminal tab.
Validation after boot. OpenFactory can check that the UI responds, ports listen, and the backend fails clearly if secrets are missing.
Room for local inference. Start with OpenRouter, then add Ollama, vLLM, LiteLLM, or a private endpoint as your council matures.

The OpenFactory Build Prompt

Paste this into console.openfactory.tech to build a first LLM council appliance. It keeps the app close to Karpathy's reference implementation while adding the operational pieces you need for a real VM.

Build a single-node lab image named llm-council-workstation.

Goal: create a reproducible local LLM Council appliance inspired by karpathy/llm-council. The image should boot into a developer-ready environment that runs the council backend, the React frontend, and validation checks. Treat API keys as runtime secrets, not build-time secrets.

Base image: ubuntu-24.04
Architecture: x86_64
Features: desktop, ssh, docker, nodejs, python, git, firewall
Packages: git, curl, jq, ca-certificates, build-essential, python3, python3-venv, python3-pip, nodejs, npm, nginx, supervisor, ufw
User: council, sudo-enabled, passwordless sudo for lab use

Runtime layout:
- Package the pinned llm-council appliance payload into /opt/llm-council from upstream commit 92e1fccb1bdcf1bab7221aa9ed90f9dc72529131, and record that provenance in /usr/lib/llm-council-appliance/PINNED_COMMIT. Do not rely on shipping a live .git checkout in the image.
- Bootstrap uv without modifying the system Python: create /opt/llm-council/.uv-bootstrap with python3 -m venv, install uv with that venv's pip, then run uv sync --locked from the project root. The application environment must be /opt/llm-council/.venv.
- Create /etc/llm-council/env.example with OPENROUTER_API_KEY=replace-me.
- Create /etc/llm-council/models.json as an operator-facing copy of the council and chairman model choices. The runbook must state that this pinned upstream commit reads model choices from backend/config.py unless an operator adapts it to the JSON file.
- Create /var/lib/llm-council/conversations for persisted conversation JSON and symlink /opt/llm-council/data/conversations to it.
- Do not bake any real API keys into the image.

Backend:
- Run /opt/llm-council/.venv/bin/python -m backend.main as user council from /opt/llm-council. The pinned app listens on 0.0.0.0:8001.
- Add and enable llm-council-backend.service with EnvironmentFile=-/etc/llm-council/env so a missing optional file does not stop startup.
- If OPENROUTER_API_KEY is missing, the service must still start and its upstream GET / health endpoint must return {"status":"ok","service":"LLM Council API"}. The status helper must clearly report that model requests need the runtime secret.

Frontend:
- Run npm ci and npm run build under /opt/llm-council/frontend.
- Add and enable llm-council-frontend.service running npm run preview -- --host 127.0.0.1 --port 5173 from that directory.
- Configure nginx to expose the UI on http://0.0.0.0:8080 and proxy /api/ to 127.0.0.1:8001.

Security and operations:
- Enable SSH.
- Enable ufw with ports 22 and 8080 open.
- Add /root/llm-council-runbook.md explaining how to add OPENROUTER_API_KEY at /etc/llm-council/env, choose council models, restart services, and switch to local inference later.
- Add a shell helper /usr/local/bin/llm-council-status that prints service status, listening ports, and whether the API key is configured.

Validation:
- Confirm /usr/lib/llm-council-appliance/PINNED_COMMIT records commit 92e1fccb1bdcf1bab7221aa9ed90f9dc72529131.
- Confirm llm-council-backend.service and llm-council-frontend.service are enabled and active.
- Confirm ports 8001 and 8080 listen after boot.
- Confirm curl -fsS http://localhost:8080 returns HTML.
- Confirm curl -fsS http://localhost:8001/ returns JSON with status ok.
- Confirm /usr/local/bin/llm-council-status succeeds and reports whether OPENROUTER_API_KEY is configured.

Output:
- Produce one bootable ISO.
- Include the recipe, service files, nginx config, runbook, and validation results in the build artifacts.

Model Mix: Do Not Make A Choir Of Clones

The point of a council is not to ask five nearly identical models to agree with one another. Pick models that disagree productively:

One strong general reasoning model for the chairman role.
One model that tends to be terse and skeptical.
One model that writes clearly and catches user-facing ambiguity.
One model from a different provider family to reduce shared blind spots.
Optional: one local model, even if weaker, to test what your private inference path contributes.

Going Local: What It Costs To Self-Host The Council

OpenRouter is the fastest way to get a council talking, but the reason to bake the appliance into a reproducible image is that you can later swap the provider for hardware you own. That matters when the questions involve private data you would rather not send to a third party. Two serving stacks dominate in 2026: Ollama (a friendly wrapper over llama.cpp, ideal for a single workstation) and vLLM (a throughput-oriented server with PagedAttention for concurrent requests). A thin router like LiteLLM lets the council code keep speaking the same OpenAI-style API whether a member is local or hosted.

The hardware math is approachable. A useful rule of thumb is roughly 2 GB of VRAM per billion parameters at FP16, and 4-bit quantization (Q4_K_M) cuts that by about 75% with little quality loss. In practice an 8B-class model such as Llama 3.1 8B or Qwen3 8B fits comfortably in 8–12 GB and runs north of 40 tokens per second on a single consumer GPU, while a 70B model at Q4 needs around 40 GB. That makes a practical split easy: run one or two small, fast local members for privacy and cheap first opinions, and reserve a larger hosted or quantized 70B model for the chairman seat where final quality matters most.

Because the council is an OpenFactory image, “add local inference” becomes a recipe edit rather than a weekend of yak-shaving: add the GPU drivers and the serving runtime to the build, point models.json at the local endpoint, and rebuild. The validation step still proves the UI comes up and the backend reports a clear status if a key or endpoint is missing.

When An LLM Council Is Worth The Cost

Do not use a council for every autocomplete, summary, or low-stakes chat. Use it where the extra latency and token spend are buying you something: architectural decisions, incident reviews, legal or policy drafts that still get human review, product strategy, model evaluation, and high-value writing where the final answer benefits from adversarial feedback.

The nice thing about building the council as an OpenFactory image is that the cost boundary is explicit. You can ship one small council for experimentation, one larger council for final review, and one local-only council for private data.

Where To Go Next

If you want the fastest path, open the OpenFactory console and paste the prompt above. If you want to adapt the image first, start with the custom Linux ISO builder guide or the GitHub-to-ISO workflow. The council pattern is young, but it already has the shape of a real operator tool: multiple opinions, visible disagreement, and a final answer you can audit.

Frequently asked questions

What is an LLM council?

An LLM council is a multi-model workflow where several language models answer the same prompt independently, review or rank each other's anonymized answers, and then a final chair model synthesizes the result.

What did Karpathy's LLM Council popularize?

Karpathy's llm-council repo made the pattern concrete as a local web app: Stage 1 collects first opinions, Stage 2 asks models to review anonymized responses, and Stage 3 has a chairman model compile the final answer.

Why build an LLM council with OpenFactory?

OpenFactory turns the council into a reproducible bootable image with pinned packages, repeatable services, runtime secret placeholders, validation checks, and VM deployment instead of a one-off laptop setup.

Do I need to use paid hosted models?

No. Karpathy's reference app uses OpenRouter for easy access to multiple providers, but the same architecture can be adapted to local models through Ollama, vLLM, LiteLLM, or a private inference endpoint.

What hardware do I need to run the council locally?

A rough rule of thumb is about 2 GB of VRAM per billion parameters at FP16, and 4-bit quantization cuts that by roughly 75%. An 8B model fits in 8-12 GB on a single consumer GPU and runs at 40+ tokens/second; a 70B model at Q4 needs around 40 GB. A common split is small fast local members for cheap first opinions plus a larger model in the chairman seat.

Is a council always better than one strong model?

No. A council buys reliability and reduced single-model bias at the cost of extra latency and tokens, and research like Mixture-of-Agents notes the layered approach increases time-to-first-token. Use a council for high-stakes work like architecture decisions, incident reviews, and evaluation, not for autocomplete or low-stakes chat.

Ready to ship this in production?

OpenFactory's free flow is for browsing. Persistent VMs, SSH access, snapshots, your own ISO, and fleet deployment live on a paid plan.

See pricing →Book a demo