Build a 3-Node Proxmox HA Cluster on OpenFactory

Four VMs from one prompt: three PVE nodes + a QDevice witness, Corosync wiring baked in

May 9, 2026

A single-node Proxmox box is the entry point. The day you stop being willing to lose state when that one box goes down, you need three nodes: the minimum for reliable Corosync quorum and the first shape that survives a node loss with VMs still running. The Proxmox docs are blunt: “If you are interested in High Availability, you need to have at least three nodes for reliable quorum” (Cluster Manager). It is vote math, not marketing. Corosync needs a strict majority of votes to stay writable, and a two-node cluster has none after one failure, so the survivor freezes rather than risk split-brain. Three nodes dodge that problem.

This post walks through that exact shape as an OpenFactory build prompt: four buildable Debian Trixie VMs — three PVE nodes plus a QDevice witness — from a single prompt, with /etc/pve/corosync.conf already shaped, the cluster nodelist agreed across all three, and a mock PVE API reporting quorate: 1. Real pvecm create / pvecm add is the deploy-time step on top; you rehearse the quorum wiring before any hardware is racked.

What you'll build

  • pve-1, pve-2, pve-3 (10.81.0.11–13:8006): three PVE hosts, each with a per-node /etc/pve/corosync.conf listing all three ring0 addresses plus the QDevice tie-breaker, mock :5404 and :5405 Corosync listeners, and a mock cluster-status JSON reporting quorate: 1.
  • pbs-witness (10.81.0.20:5403): a QDevice / corosync-qnetd shape that casts the deciding vote when the nodes split. Comes with a runbook explaining the real pvecm qdevice setup deploy step.
  • Replicated-storage intent: /etc/pve/storage.cfg on each PVE node, pre-populated with zfspool: rpool-data as the replication target. Swap to Ceph via the next post in the series if you want full HA storage.
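The mock cluster-status endpoint in the list above could look roughly like the sketch below. It uses only the Python standard library and is an illustration, not the code OpenFactory generates: the names `CompatHandler`, `make_server`, and `cluster_status` are hypothetical, and the payload mirrors the shape spelled out in the prompt further down.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

NODES = ["pve-1", "pve-2", "pve-3"]  # node names used in this lab


def cluster_status() -> dict:
    """Build the mock payload: one cluster entry plus one entry per node."""
    entries = [{"type": "cluster", "name": "pve-cluster",
                "nodes": len(NODES), "quorate": 1}]
    for nodeid, name in enumerate(NODES, start=1):
        entries.append({"type": "node", "name": name, "online": 1,
                        "id": f"node/{name}", "nodeid": nodeid})
    return {"data": entries}


class CompatHandler(BaseHTTPRequestHandler):
    """Hypothetical handler: serves one route and returns 404 for the rest."""

    def do_GET(self):
        if self.path == "/api2/json/cluster/status":
            body = json.dumps(cluster_status()).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, fmt, *args):
        pass  # keep the mock quiet on stderr


def make_server(host: str = "0.0.0.0", port: int = 8006) -> HTTPServer:
    """Create the mock API server without starting it."""
    return HTTPServer((host, port), CompatHandler)
```

Calling `make_server().serve_forever()` would start it on :8006 over plain HTTP, which is what the scenario's `curl http://localhost:8006/...` checks expect. A real PVE API listens on the same port but speaks HTTPS.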

Why build it on OpenFactory

  • The ISO is the spec. The whole corosync.conf — nodelist, quorum-device config, transport — is baked into all three PVE node ISOs. No copy-paste between hosts at deploy.
  • Scenario assertions ride along. The validation scenario fails if any node stops reporting quorate: 1, if a corosync.conf nodelist is missing one of the three ring0 addresses, or if any PVE host cannot reach the QDevice. You don't deploy and discover quorum is broken at 3am.
  • QDevice in the recipe, not an afterthought. A cluster reduced to two voting nodes is a split-brain magnet, so the witness is wired from day one.
  • Mesh reachability checked by the scenario. pve-1 confirms it can reach both peers on :8006, and every PVE node confirms it can reach the QDevice on :5403.
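To make "the whole corosync.conf" concrete, here is a minimal sketch that renders the values from the prompt below into corosync's block syntax. The `render` helper and `CONFIG` dictionary are hypothetical names for illustration; a file managed by a real `pvecm` run carries additional keys (for example `config_version`) that this lab template leaves out.

```python
def render(section: dict, indent: int = 0) -> str:
    """Render nested dicts as corosync.conf blocks; a list value repeats the block."""
    pad = "  " * indent
    lines = []
    for key, value in section.items():
        blocks = value if isinstance(value, list) else [value]
        for item in blocks:
            if isinstance(item, dict):
                lines.append(f"{pad}{key} {{")
                lines.append(render(item, indent + 1))
                lines.append(f"{pad}}}")
            else:
                lines.append(f"{pad}{key}: {item}")
    return "\n".join(lines)


# Values copied from the lab prompt; nothing here is read from a live cluster.
CONFIG = {
    "totem": {"version": 2, "cluster_name": "pve-cluster", "transport": "knet",
              "interface": {"linknumber": 0}},
    "quorum": {"provider": "corosync_votequorum", "expected_votes": 4,
               "device": {"model": "net", "votes": 1,
                          "net": {"tls": "on", "host": "10.81.0.20",
                                  "algorithm": "ffsplit"}}},
    "nodelist": {"node": [
        {"name": f"pve-{i}", "nodeid": i, "ring0_addr": f"10.81.0.1{i}"}
        for i in (1, 2, 3)
    ]},
}

text = render(CONFIG)
print(text)
```

Because every node renders the same dictionary, the nodelist cannot drift between hosts, which is the property the scenario's `grep -c 'ring0_addr: 10.81.0'` check relies on.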

Topology

Three PVE nodes in a row, QDevice below. PVE↔PVE on :8006 (API) and :5404/:5405 (Corosync ring); all three to the QDevice on :5403 for tie-breaking. Lab subnet 10.81.0.0/24. Every vote and ring link in the diagram is a port the build group actually checks.

[Diagram: cluster pve-cluster on 10.81.0.0/24, quorum by majority of 4 votes. pve-1 (10.81.0.11, nodeid 1), pve-2 (10.81.0.12, nodeid 2), and pve-3 (10.81.0.13, nodeid 3) carry one vote each and are joined by Corosync ring0 on :5404/:5405 and the PVE API mesh on :8006. pbs-witness (10.81.0.20) runs corosync-qnetd on :5403 as a one-vote tie-breaker. /etc/pve/corosync.conf nodelist agreed on all 3 nodes; storage: zfspool rpool-data (intent: replicated ZFS).]
Three PVE nodes carrying one vote each plus a QDevice witness — a 4-vote quorum where any single node loss still leaves a majority. Solid links are the Corosync ring; dashed lines are the witness votes.
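The links in the diagram can also be probed by hand. The sketch below is a rough Python equivalent of the `nc -z -w 5` checks in the prompt; `port_open` and `expected_links` are hypothetical helper names. It only fits this lab, where the mock listeners are TCP. Real Corosync traffic is UDP, so a TCP connect would not prove a production ring.

```python
import socket

PVE_NODES = {"pve-1": "10.81.0.11", "pve-2": "10.81.0.12", "pve-3": "10.81.0.13"}
WITNESS = ("pbs-witness", "10.81.0.20")


def port_open(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection succeeds, similar to `nc -z -w 5 host port`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def expected_links(self_name: str) -> list[tuple[str, str, int]]:
    """List the (peer, address, port) targets that node `self_name` should reach."""
    links = [(name, addr, port)
             for name, addr in PVE_NODES.items() if name != self_name
             for port in (8006, 5404, 5405)]
    links.append((WITNESS[0], WITNESS[1], 5403))
    return links
```

Run from pve-1, a loop such as `for peer, addr, port in expected_links("pve-1"): print(peer, port, port_open(addr, port))` would walk all seven links that node takes part in.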

The quorum math, spelled out

Corosync assigns one vote per node and demands a strict majority to keep /etc/pve writable. Walk the cases:

  • 3 nodes, no QDevice: 3 votes, majority 2. Lose one node → 2 survive → still quorate. Lose two → read-only. This is the clean, recommended baseline.
  • 2 nodes: 2 votes, majority 2. Lose one → 1 left, not a majority → the survivor freezes. Two servers is not HA: it is two single points of failure that also take each other down.
  • 4 votes (3 nodes + QDevice), as this lab models: majority 3. Lose one node → 3 votes remain → still quorate. The witness is the safety net for any even node count.

One honest caveat: for an odd cluster that already has natural majority, the QDevice is generally unnecessary — it shines on 2-node setups where it supplies the third vote (quorum explained). We wire it in so you can test the witness path; with three healthy nodes you may simply drop it.
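The vote arithmetic from the cases above fits in a few lines. This is a simplified model that only counts votes; it ignores how a QDevice algorithm decides which partition receives its vote, and the function names are illustrative.

```python
def majority(total_votes: int) -> int:
    """Smallest vote count that is strictly more than half of the total."""
    return total_votes // 2 + 1


def quorate_after_loss(total_votes: int, lost_votes: int) -> bool:
    """True if the remaining votes still reach the majority threshold."""
    return total_votes - lost_votes >= majority(total_votes)


for label, total in [("3 nodes", 3), ("2 nodes", 2), ("3 nodes + QDevice", 4)]:
    print(f"{label}: majority {majority(total)}, "
          f"survives one lost vote: {quorate_after_loss(total, 1)}")
```

Only the 2-node case fails the single-loss test, which is the whole argument for starting at three.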

Why the cluster network is its own thing

Corosync is sensitive to latency and jitter, not bandwidth. The official requirement is latencies under 5 ms (LAN performance) between every node, and Proxmox recommends a dedicated physical NIC for cluster traffic — a plain 1 Gbit link is enough (Cluster Manager). Skip it and the failure mode is nasty: a backup or migration saturates the shared link, Corosync packets miss their timeout, nodes think a peer vanished, and a healthy cluster fences itself. So split link0 onto its own NIC plus a redundant link1 over knet, so one cable pull doesn't drop the ring.
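A quick way to get a feel for that 5 ms budget is to time connections between nodes. The sketch below is an approximation under stated assumptions: it times a TCP connect, which is only a rough stand-in for the UDP traffic Corosync actually sends, and the helper names and the choice to judge by the worst sample are mine, not a Proxmox-defined test.

```python
import socket
import statistics
import time


def connect_latency_ms(host: str, port: int, timeout: float = 1.0) -> float:
    """Time one TCP connect in milliseconds (raises OSError if unreachable)."""
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout):
        pass
    return (time.perf_counter() - start) * 1000.0


def within_budget(samples_ms: list[float], budget_ms: float = 5.0) -> bool:
    """True only if the worst sample stays under budget; jitter matters, so use max."""
    return bool(samples_ms) and max(samples_ms) < budget_ms


def summarize(samples_ms: list[float]) -> dict:
    """Report median and worst-case latency plus the budget verdict."""
    return {"median_ms": statistics.median(samples_ms),
            "max_ms": max(samples_ms),
            "ok": within_budget(samples_ms)}
```

In the lab you might collect samples with `[connect_latency_ms("10.81.0.12", 8006) for _ in range(20)]` and pass them to `summarize`. Running the same probe while a backup saturates a shared link is a cheap way to see the jitter problem described above.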

What actually happens when a node dies

Quorum keeps the cluster writable; it does not move your VMs on its own. For a guest to restart elsewhere it must be enrolled in HA (ha-manager add) and its disk has to exist on another node, via Ceph or ZFS replication. When a node drops, the HA stack waits out the fence timeout, then restarts the affected HA-managed VMs on a survivor from the replicated disk. As of Proxmox VE 9.2 (May 2026) the cluster resource scheduler also runs a dynamic load balancer that live-migrates HA-enrolled guests to even out CPU and memory pressure, closing the last big gap to vSphere DRS — though it only touches guests already in HA.
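The two preconditions for a restart can be written down as a small predicate. This is a toy model for reasoning about your own inventory, not how the HA stack is implemented; the `Guest` fields and the example VM IDs are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Guest:
    vmid: int
    node: str                                   # node currently running the guest
    ha_enrolled: bool = False                   # True once `ha-manager add` was run for it
    disk_on: set = field(default_factory=set)   # nodes holding a copy of its disk


def recoverable(guest: Guest, failed: str, survivors: set) -> bool:
    """A guest on the failed node restarts only if it is HA-managed
    and its disk exists on at least one surviving node."""
    return (guest.node == failed
            and guest.ha_enrolled
            and bool(guest.disk_on & survivors))


# Hypothetical inventory: pve-1 dies, pve-2 and pve-3 survive.
guests = [
    Guest(100, "pve-1", ha_enrolled=True, disk_on={"pve-1", "pve-2"}),
    Guest(101, "pve-1", ha_enrolled=False, disk_on={"pve-1", "pve-2"}),
    Guest(102, "pve-1", ha_enrolled=True, disk_on={"pve-1"}),
]
for g in guests:
    print(g.vmid, recoverable(g, "pve-1", {"pve-2", "pve-3"}))
```

Only the first guest passes both tests: the second was never enrolled in HA, and the third has no disk copy outside the failed node.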

The prompt

Paste this verbatim into the chat builder at console.openfactory.tech. Nothing above or below it — the builder expects the prompt body to start at the “Build a compact multi-node lab…” line.

Build a compact multi-node lab named `proxmox-3node-cluster`.

Output discipline: keep the plan small. Use one startup script per node, about 25 shell lines or less. Do not install `pve-manager`, `corosync`, `pve-cluster`, `pmxcfs`, or any Proxmox apt repos at build time. The cluster shape is mocked via deployment-time config templates and Python stdlib listeners — real `pvecm create` / `pvecm add` runs at provisioning on top of installed Proxmox VE hosts. Write deployment-time config examples and tiny Python stdlib or shell compatibility stubs only. The goal is a buildable preparation lab, not a production Proxmox install.

## Topology

Create 4 buildable `debian-trixie` nodes, all `x86_64`, SSH enabled, DHCP/default route intact with lab aliases, firewall disabled, DNS `1.1.1.1` and `8.8.8.8`, user `ops` password `pve-cluster-ops` in `sudo`. Every recipe must set top-level `test_config` to `{ "enabled": false, "tests": [] }`.

- `pve-1`: role `pve-host`, 4 GB RAM, 32 GB disk, alias `10.81.0.11/24`, x `110`, y `100`
- `pve-2`: role `pve-host`, 4 GB RAM, 32 GB disk, alias `10.81.0.12/24`, x `350`, y `100`
- `pve-3`: role `pve-host`, 4 GB RAM, 32 GB disk, alias `10.81.0.13/24`, x `590`, y `100`
- `pbs-witness`: role `qdevice-witness`, 2 GB RAM, 16 GB disk, alias `10.81.0.20/24`, x `350`, y `280`

Connections: `pve-1`, `pve-2`, `pve-3` to each other on `:8006` (PVE API) and `:5404/5405` (Corosync ring); all three to `pbs-witness:5403` (QDevice tie-breaker).

## Common Recipe Requirements

All nodes: features `headless`, `ssh`; packages `openssh-server`, `python3`, `curl`, `jq`, `iproute2`, `netcat-openbsd`, `ca-certificates`. Each startup script adds the alias with `IFACE=$(ip route show default | awk '{print $5; exit}')`, `ip link set "$IFACE" up || true`, and `ip addr add <alias> dev "$IFACE" || true`. If `os.startup_scripts[].after` is present, it must be the string `"network-online.target"`, not an array. Do not install `pve-manager`, `proxmox-backup-server`, `ceph`, `truenas-scale`, or any related apt packages — they are source-ISO deploys handled at provisioning time, not at build time.

## Node Requirements

All three `pve-1`, `pve-2`, `pve-3` share the same compatibility-service shape with different identity payloads. Each:

- Creates `/etc/pve/{nodes/<self>,storage,qemu-server,lxc,priv}` mode `0750 ops:ops`.
- Writes `/etc/pve/corosync.conf` with `totem { version: 2, cluster_name: pve-cluster, transport: knet, interface { linknumber: 0 } }`, `quorum { provider: corosync_votequorum, expected_votes: 4, device { model: net, votes: 1, net { tls: on, host: 10.81.0.20, algorithm: ffsplit } } }`, and `nodelist { node { name: pve-1, nodeid: 1, ring0_addr: 10.81.0.11 } node { name: pve-2, nodeid: 2, ring0_addr: 10.81.0.12 } node { name: pve-3, nodeid: 3, ring0_addr: 10.81.0.13 } }`.
- Writes `/etc/pve/storage.cfg` with `zfspool: rpool-data\n  pool rpool/data\n  content images,rootdir\n  sparse 1` (intent: replicated ZFS).
- Adds a Python stdlib HTTP service on `0.0.0.0:8006` exposing:
  - `GET /api2/json/version` -> `200 {"data":{"version":"compat-1.0","release":"pve-compat","repoid":"<node-id>"}}`
  - `GET /api2/json/cluster/status` -> `200 {"data":[{"type":"cluster","name":"pve-cluster","nodes":3,"quorate":1},{"type":"node","name":"pve-1","online":1,"id":"node/pve-1","nodeid":1},{"type":"node","name":"pve-2","online":1,"id":"node/pve-2","nodeid":2},{"type":"node","name":"pve-3","online":1,"id":"node/pve-3","nodeid":3}]}`
  - `GET /api2/json/nodes/<self>/status` -> `200 {"data":{"uptime":3600,"loadavg":["0.05","0.05","0.05"],"cpu":0.05}}`
- Adds Python stdlib TCP listeners on `0.0.0.0:5404` and `0.0.0.0:5405` accepting connections (no Corosync protocol needed; just proves the ports listen).
- Registers `pve-compat.service`.

`pbs-witness`: features `headless`, `ssh`. Add a Python stdlib service on `0.0.0.0:5403` accepting TCP connections (mock QDevice / corosync-qnetd) plus an HTTP `:9165/metrics` listener returning `qdevice_compat_up 1`. Register `qdevice-compat.service`. Write `/root/qdevice-runbook.md` documenting that real deployment installs `corosync-qnetd` from the Debian repos on the witness, installs `corosync-qdevice` on every PVE node, and then runs `pvecm qdevice setup 10.81.0.20` once from one cluster node.

## Scenario

Emit exactly one group scenario named `proxmox-3node-cluster-validation`. Put `custom_tests[].assertions[]` inside the scenario entry; leave `scenarios[].tests` empty. Every assertion needs `on_vm`. Use only `port_listening`, `command_output`, and `http_responds`; do not emit `vm_boots`, `network_reachable`, or `service_running`.

- `Cluster ports listen`: `port_listening` for `:8006`, `:5404`, `:5405` on each of `pve-1`, `pve-2`, `pve-3`; `port_listening` for `pbs-witness:5403`.
- `All three nodes report quorate`: on each pve-* node, `curl -fsS http://localhost:8006/api2/json/cluster/status | jq -e '.data[] | select(.type == "cluster") | .quorate == 1' >/dev/null && echo quorate`.
- `Per-node status`: on `pve-1`, `curl -fsS http://localhost:8006/api2/json/nodes/pve-1/status | jq -e '.data.uptime | type == "number"' >/dev/null && echo pve-1-status-ok` and similarly for `pve-2` and `pve-3`.
- `corosync.conf has all three nodes`: on each pve-* node, `grep -c 'ring0_addr: 10.81.0' /etc/pve/corosync.conf | awk '{exit ($1>=3)?0:1}' && echo corosync-nodelist`.
- `All nodes reach the QDevice`: on each pve-* node, `nc -z -w 5 10.81.0.20 5403 && echo qdevice-reachable`.
- `Mesh reachability`: on `pve-1`, `nc -z -w 5 10.81.0.12 8006 && nc -z -w 5 10.81.0.13 8006 && echo peers-reachable`.

Preserve warnings that real Proxmox VE installs on each node, `pvecm create pve-cluster` and `pvecm add 10.81.0.11` on the joining members, Corosync redundant ring (link0/link1) on a dedicated cluster network, QDevice TLS keys via `pvecm qdevice setup`, shared or replicated storage (Ceph or ZFS replication) so HA can fail VMs over, `ha-manager add` per VM, real fencing, dedicated NICs for cluster vs VM vs migration traffic, and `10.81.0.0/24` lab aliasing are deployment-time concerns.

Running it

  1. Open the chat builder at console.openfactory.tech and paste the prompt into a new conversation.
  2. Review the streamed build plan. You'll see the topology, per-node recipes, and the scenario assertions that will run after boot. Edit the prompt and re-run if anything is off.
  3. Click Build group. OpenFactory fans the plan out to per-node ISO builds. When every ISO reaches built, boot the group on the runner network from the same UI.
  4. Exercise the stack. The scenario assertions run automatically against the live VMs. From the host you can also hit the service ports directly to confirm end-to-end behavior.
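For step 4, a small client can repeat the scenario's quorate check from outside the VMs. This sketch assumes your host can route to the 10.81.0.0/24 lab aliases, which depends on how the runner network is set up; `is_quorate` and `fetch_status` are hypothetical helper names.

```python
import json
import urllib.request


def is_quorate(status: dict) -> bool:
    """Mirror the scenario's jq filter: find the cluster entry and check quorate == 1."""
    return any(entry.get("type") == "cluster" and entry.get("quorate") == 1
               for entry in status.get("data", []))


def fetch_status(base_url: str, timeout: float = 5.0) -> dict:
    """GET /api2/json/cluster/status from one node's mock API (plain HTTP in the lab)."""
    url = f"{base_url}/api2/json/cluster/status"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)
```

A loop over the three nodes, for example `is_quorate(fetch_status("http://10.81.0.11:8006"))`, should return True for each while the lab is healthy.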

Driving OpenFactory from an AI agent instead of the browser? The same flow is exposed through the OpenFactory MCP server — submit the prompt programmatically, get the build-plan preview back, and call create_build / start_vm on the resulting recipes. Single-image builds go straight through the openfactory CLI.

What's still your responsibility

The prompt produces a buildable preparation lab — the right topology, the right ports listening, deployment-time config templates dropped in the right places, and tiny compatibility services that prove the wiring works. A few things still sit outside the recipe and need operator attention before this carries real load:

  • Real Proxmox VE on each node. Boot the PVE installer; the corosync.conf shape, storage config, and ring layout are ready to drop onto a real /etc/pve/corosync.conf after pvecm create pve-cluster / pvecm add 10.81.0.11.
  • Dedicated cluster network. The lab puts everything on one /24; production should isolate link0 on its own NIC plus a redundant link1.
  • Replicated or shared storage. Corosync gives you quorum; HA fail-over needs the VM disks to exist on more than one node. ZFS replication (pvesr) is the cheapest; Ceph is the most resilient.
  • ha-manager add per VM. HA isn't automatic per-VM; you opt each guest in with ha-manager add vm:200 --group ha-default. Proxmox VE 9 replaces HA groups with HA rules, so check whether --group still applies to your release.
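  • Fencing. Proxmox HA fences through a watchdog (the softdog kernel module by default; a hardware watchdog is sturdier), and out-of-band power control (IPMI / iLO / iDRAC) lets you kill a stuck node cleanly. Out of scope of the lab; document yours.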
  • Real QDevice install. corosync-qnetd on the witness host, corosync-qdevice on every PVE node, then pvecm qdevice setup 10.81.0.20 run once from one cluster node. TLS keys are generated during that setup.

Where to go next

If the next thing you want is real HA storage — survive a node loss with zero data motion at fail-over — see the Proxmox + Ceph cluster post. If backup is the bigger gap, the Proxmox + PBS post wires deduplicated incremental backups to a dedicated PBS target with off-site sync. Coming back from the entry point? See the single-node Proxmox lab.

Quick questions

Why not two nodes plus the QDevice? That gives a third vote and keeps quorum through one failure, but it leaves only one place to land guests. Three real nodes give HA somewhere to actually restart the VMs: the first shape that is HA in practice, not just on paper.

When quorum does break, pvecm status shows total and expected votes and which nodes are visible; the last resort, pvecm expected 1, forces one survivor writable — dangerous if the “dead” nodes are alive on a split network, since you have just authorized two writers. Rehearsing the healthy nodelist here means you know what good looks like before you are staring at broken.

Rolling this out across a regulated or multi-site fleet? The Enterprise & GxP page covers fleet rollouts and audit trails, pricing lays out the tiers, and the prompt builds the cluster shape now at console.openfactory.tech.
