
A five-VM document archive: Paperless + Tika + Gotenberg + Postgres + Redis, from one prompt
May 20, 2026
Paperless-ngx is the document archive every paperless office wishes it had: scan-or-drop a PDF and Paperless OCRs it, tags it, files it by correspondent, and makes the full text searchable in seconds. It's the sixth-most-deployed app in the 2026 r/selfhosted survey.
This post walks through the full Paperless stack on OpenFactory: five buildable VMs — Paperless, Apache Tika for content extraction, Gotenberg for PDF generation, Postgres, and Redis — from one prompt, shipped as bootable ISOs.
- paperless (10.75.0.10:8000): the app server with consume / media / data directories already on disk.
- tika (10.75.0.20:9998): Apache Tika for extracting text from Office docs, emails, and everything that isn't already a PDF.
- gotenberg (10.75.0.21:3000): Chromium- and LibreOffice-backed PDF generation, so Paperless can render every input to a uniform archival format.
- postgres (10.75.0.30:5432): the document and tag metadata store.
- redis (10.75.0.31:6379): queue for the consume + OCR pipeline.
Five Debian Trixie VMs on 10.75.0.0/24. Paperless is the only VM that talks to all the others; Tika, Gotenberg, Postgres, and Redis are subnet-only.
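Since Paperless is the one node that has to reach every backend, a quick connectivity check from that VM is a useful sanity test once the lab is up. The following is a minimal sketch using only the Python standard library; the addresses and ports come from the list above, while the names `BACKENDS`, `is_reachable`, and `check_backends` are hypothetical helpers introduced for illustration and are not part of OpenFactory or Paperless.

```python
import socket

# Backend endpoints as listed in the topology above.
BACKENDS = {
    "postgres": ("10.75.0.30", 5432),
    "redis": ("10.75.0.31", 6379),
    "tika": ("10.75.0.20", 9998),
    "gotenberg": ("10.75.0.21", 3000),
}


def is_reachable(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unroutable hosts.
        return False


def check_backends(backends=BACKENDS, timeout: float = 5.0) -> dict:
    """Map each backend name to whether its port accepted a connection."""
    return {
        name: is_reachable(host, port, timeout)
        for name, (host, port) in backends.items()
    }
```

Calling `check_backends()` on the paperless VM returns a dict of backend name to `True` or `False`. It only proves that a port accepts TCP connections, not that the service behind it is healthy.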
Paste the prompt below verbatim into the chat builder at console.openfactory.tech, with nothing above or below it: the builder expects the prompt body to start at the “Build a compact multi-node lab…” line.
Build a compact multi-node lab named `paperless-document-lab`.
Output discipline: keep the plan small. Use one startup script per node, about 25 shell lines or less. Do not install paperless-ngx, Apache Tika, Gotenberg, Tesseract OCR data, or PDF/OCR binaries at build time. Do not pull large language model files. Write deployment-time config examples and tiny Python stdlib or shell compatibility stubs only. The goal is a buildable preparation lab, not a production deployment.
## Topology
Create 5 buildable `debian-trixie` nodes, all `x86_64`, SSH enabled, DHCP/default route intact with lab aliases, firewall disabled, DNS `1.1.1.1` and `8.8.8.8`, user `ops` password `paperless-ops` in `sudo`. Every recipe must set top-level `test_config` to `{ "enabled": false, "tests": [] }`.
- `paperless`: role `doc-app`, 3 GB RAM, 24 GB disk, alias `10.75.0.10/24`, x `230`, y `60`
- `tika`: role `parser`, 2 GB RAM, 12 GB disk, alias `10.75.0.20/24`, x `110`, y `220`
- `gotenberg`: role `pdf-gen`, 2 GB RAM, 12 GB disk, alias `10.75.0.21/24`, x `350`, y `220`
- `postgres`: role `database`, 2 GB RAM, 16 GB disk, alias `10.75.0.30/24`, x `110`, y `380`
- `redis`: role `queue`, 1 GB RAM, 8 GB disk, alias `10.75.0.31/24`, x `350`, y `380`
Connections: `paperless` to `postgres:5432`, `redis:6379`, `tika:9998`, `gotenberg:3000`.
## Common Recipe Requirements
All nodes: features `headless`, `ssh`; packages `openssh-server`, `python3`, `curl`, `jq`, `iproute2`, `netcat-openbsd`, `ca-certificates`. Each startup script adds the alias with `IFACE=$(ip route show default | awk '{print $5; exit}')`, `ip link set "$IFACE" up || true`, and `ip addr add <alias> dev "$IFACE" || true`. If `os.startup_scripts[].after` is present, it must be the string `"network-online.target"`, not an array.
## Node Requirements
`paperless`: features `headless`, `ssh`. Write `/etc/paperless/paperless.env` with `PAPERLESS_PORT=8000`, `PAPERLESS_DBHOST=10.75.0.30`, `PAPERLESS_REDIS=redis://10.75.0.31:6379`, `PAPERLESS_TIKA_ENABLED=1`, `PAPERLESS_TIKA_ENDPOINT=http://10.75.0.20:9998`, `PAPERLESS_TIKA_GOTENBERG_ENDPOINT=http://10.75.0.21:3000`. Create `/var/lib/paperless/{consume,media,data}` mode `0750 ops:ops`. Add a Python stdlib service on `0.0.0.0:8000` exposing:
- `GET /api/` -> `200 {"correspondents":"/api/correspondents/","documents":"/api/documents/"}`
- `GET /api/statistics/` -> `200 {"documents_total":0,"documents_inbox":0}`
- `GET /metrics` -> `paperless_compat_up 1`
Register `paperless-compat.service`.
`tika`: features `headless`, `ssh`. Add a Python stdlib service on `0.0.0.0:9998` exposing `GET /version` -> `200 "Apache Tika compat-1.0"` (`text/plain`), `GET /tika` -> `200 {"status":"ok"}`, `GET /metrics` with `tika_compat_up 1`. Register `tika-compat.service`.
`gotenberg`: features `headless`, `ssh`. Add a Python stdlib service on `0.0.0.0:3000` exposing `GET /health` -> `200 {"status":"up","modules":{"chromium":{"status":"compat"},"libreoffice":{"status":"compat"}}}`, `GET /metrics` with `gotenberg_compat_up 1`. Register `gotenberg-compat.service`.
`postgres`: features `headless`, `ssh`, `postgresql`; packages `postgresql`, `postgresql-client`. Listen on `0.0.0.0:5432`, best-effort create role/database `paperless` password `paperless`, allow `10.75.0.0/24` in `pg_hba.conf`. Expose `:9187/metrics` with `pg_compat_up 1`.
`redis`: features `headless`, `ssh`, `redis`; packages `redis-server`. Bind to localhost plus `10.75.0.31`. Expose `:9121/metrics` with `redis_compat_up 1`.
## Scenario
Emit exactly one group scenario named `paperless-document-lab-validation`. Put `custom_tests[].assertions[]` inside the scenario entry; leave `scenarios[].tests` empty. Every assertion needs `on_vm`. Use only `port_listening`, `command_output`, and `http_responds`; do not emit `vm_boots`, `network_reachable`, or `service_running`.
- `Stack ports listen`: `port_listening` for `paperless:8000`, `tika:9998`, `gotenberg:3000`, `postgres:5432`, `redis:6379`.
- `Paperless API`: on `paperless`, `curl -fsS http://localhost:8000/api/ | jq -e '.documents == "/api/documents/"' >/dev/null && echo paperless-ok`.
- `Tika version`: on `tika`, `curl -fsS http://localhost:9998/version | grep -qi 'Apache Tika' && echo tika-ok`.
- `Gotenberg health`: on `gotenberg`, `curl -fsS http://localhost:3000/health | jq -e '.status == "up"' >/dev/null && echo gotenberg-ok`.
- `Paperless reaches backends`: on `paperless`, `nc -z -w 5 10.75.0.30 5432 && nc -z -w 5 10.75.0.31 6379 && nc -z -w 5 10.75.0.20 9998 && nc -z -w 5 10.75.0.21 3000 && echo backends-reachable`.
Preserve warnings that real paperless-ngx app binary, Tesseract OCR data files for the chosen language set, real Apache Tika and Gotenberg Java/Chromium binaries, consume folder mount strategy, document encryption at rest, off-host backups, mail-rule ingestion, and `10.75.0.0/24` aliasing are deployment-time concerns.
Driving OpenFactory from an AI agent instead of the browser? The same flow is exposed through the OpenFactory MCP server: submit the prompt programmatically, get the build-plan preview back, and call create_build / start_vm on the resulting recipes. Single-image builds go straight through the openfactory CLI.
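To make the "tiny Python stdlib compatibility stubs" in the prompt concrete, here is a minimal sketch of what the stub on the `paperless` node could look like. It is an illustration under assumptions, not the code the builder emits: the three routes and their bodies come from the prompt, while `ROUTES`, `CompatHandler`, and `make_server` are hypothetical names.

```python
import json
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Responses taken from the prompt's `paperless` node requirements.
ROUTES = {
    "/api/": ("application/json", json.dumps(
        {"correspondents": "/api/correspondents/",
         "documents": "/api/documents/"})),
    "/api/statistics/": ("application/json", json.dumps(
        {"documents_total": 0, "documents_inbox": 0})),
    "/metrics": ("text/plain", "paperless_compat_up 1\n"),
}


class CompatHandler(BaseHTTPRequestHandler):
    """Serve the fixed responses above; anything else is a 404."""

    def do_GET(self):
        route = ROUTES.get(self.path)
        if route is None:
            self.send_error(404)
            return
        content_type, body = route
        payload = body.encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", content_type)
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, fmt, *args):
        pass  # keep the stub quiet instead of logging every request


def make_server(host: str = "0.0.0.0", port: int = 8000) -> ThreadingHTTPServer:
    """Build the stub server; the caller decides when to serve."""
    return ThreadingHTTPServer((host, port), CompatHandler)
```

Running `make_server().serve_forever()` would bind `0.0.0.0:8000` as the prompt specifies. Once it is up, the scenario's `curl ... | jq -e '.documents == "/api/documents/"'` check should pass against it. The stubs for `tika` and `gotenberg` would follow the same pattern with their own routes and ports.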
The prompt produces a buildable preparation lab — the right topology, the right ports listening, deployment-time config templates dropped in the right places, and tiny compatibility services that prove the wiring works. A few things still sit outside the recipe and need operator attention before this carries real load:
- /var/lib/paperless/consume is ready: mount it from your scanner share (SMB / NFS / IMAP folder) and Paperless picks up what lands.
Paper is one half of personal data; photos are the other. The Immich photo vault post builds the same shape for images. For the kernel integrity story under regulated archives, see the runtime attestation post. And the Enterprise & GxP page covers compliance-grade rollouts.
OpenFactory's free flow is for browsing. Persistent VMs, SSH access, snapshots, your own ISO, and fleet deployment live on a paid plan.