
A five-VM document archive: Paperless + Tika + Gotenberg + Postgres + Redis, from one prompt
March 24, 2026
Paperless-ngx is the document archive every paperless office wishes it had: scan-or-drop a PDF and Paperless OCRs it, tags it, files it by correspondent, and makes the full text searchable in seconds. It's the sixth-most-deployed app in the 2026 r/selfhosted survey, and once it's ingested a year of mail you stop filing paper by hand and start searching for it.
Under the hood Paperless is not one program but a pipeline. The 2026 reference stack pins Paperless-ngx 2.20.11 against PostgreSQL 18 and Redis 8, with Gotenberg 8.15 and Apache Tika 3.1.0 handling the formats Paperless can't parse on its own (AiCybr setup guide, 2026). A file lands in the consume folder, a Celery worker detects its type, office documents are routed through Gotenberg and Tika, image-only PDFs get Tesseract OCR through the ocrmypdf wrapper, and Paperless keeps both the original and a searchable PDF while indexing the full text in Whoosh (Paperless-ngx docs).
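The routing described above can be sketched in a few lines. This is a minimal illustration, not Paperless-ngx code: the `route` function, the stage names, and the extension tables are hypothetical, and matching on file extension is an assumption made for brevity rather than how the real consumer detects types.

```python
from pathlib import PurePath

# Assumption: extension matching stands in for real file-type detection.
# The stage names below are labels for this sketch, not Paperless-ngx identifiers.
OFFICE_SUFFIXES = {".doc", ".docx", ".odt", ".xls", ".xlsx", ".ods", ".pptx", ".eml"}
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff"}


def route(filename: str, has_text_layer: bool = False) -> list[str]:
    """Return the external processing stages a consumed file would pass through."""
    suffix = PurePath(filename).suffix.lower()
    if suffix in OFFICE_SUFFIXES:
        # Office documents and emails: text via Tika, PDF rendering via Gotenberg.
        return ["tika", "gotenberg"]
    if suffix in IMAGE_SUFFIXES:
        # Images carry no text, so they always need OCR.
        return ["ocr"]
    if suffix == ".pdf":
        # Only image-only PDFs need OCR; a PDF with a text layer needs no extra stage.
        return [] if has_text_layer else ["ocr"]
    raise ValueError(f"unsupported file type: {filename}")
```

For example, `route("invoice.docx")` yields the Tika and Gotenberg stages, while `route("scan.pdf")` yields only the OCR stage.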
This post walks through that full pipeline on OpenFactory: five buildable VMs — Paperless, Apache Tika for content extraction, Gotenberg for PDF generation, Postgres, and Redis — from one prompt, shipped as bootable ISOs. The lab gives you the topology and the wiring; this post is about what runs on top of it and what you owe the archive once it holds your real paper.
- paperless (10.75.0.10:8000): the app server, with consume / media / data directories already on disk.
- tika (10.75.0.20:9998): Apache Tika for extracting text from Office docs, emails, and everything that isn't already a PDF.
- gotenberg (10.75.0.21:3000): Chromium- and LibreOffice-backed PDF generation, so Paperless can render every input to a uniform archival format.
- postgres (10.75.0.30:5432): the document and tag metadata store.
- redis (10.75.0.31:6379): the message broker for the Celery task queue. Every consume job, OCR run, and scheduled workflow flows through here, which is why it earns its own node even though it is the smallest box in the lab.

Five Debian Trixie VMs on 10.75.0.0/24. Paperless is the only VM that talks to all the others; Tika, Gotenberg, Postgres, and Redis are subnet-only. Read the diagram as a pipeline rather than a star: a scan lands in the consume folder, Paperless enqueues a job on Redis, the worker pulls text out through Tika and renders to PDF through Gotenberg, and the resulting metadata lands in Postgres while the full text is indexed in Whoosh.
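Because Paperless is the only node that talks to every backend, a quick TCP check run from the paperless VM confirms the wiring. The sketch below is a Python equivalent of a plain port probe; the helper names `is_open` and `check_backends` are hypothetical, and the addresses are the lab aliases listed above.

```python
import socket

# Lab aliases and ports as listed in the topology above.
BACKENDS = {
    "postgres": ("10.75.0.30", 5432),
    "redis": ("10.75.0.31", 6379),
    "tika": ("10.75.0.20", 9998),
    "gotenberg": ("10.75.0.21", 3000),
}


def is_open(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_backends(backends=BACKENDS, timeout: float = 5.0) -> dict[str, bool]:
    """Probe every backend once and report which ones accepted a connection."""
    return {
        name: is_open(host, port, timeout)
        for name, (host, port) in backends.items()
    }
```

Calling `check_backends()` on the paperless VM returns a name-to-boolean map. A `False` entry points at the node or alias to inspect first; an open port only proves that something is listening, not that the service behind it is healthy.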
Why the parsers are separate boxes. Tika is a JVM service and Gotenberg bundles a headless Chromium plus LibreOffice — both are memory- and CPU-heavy, and both only get exercised when a non-PDF (a .docx, an .eml email, a spreadsheet) enters the pipeline. Pinning them to their own VMs means a burst of office documents can't evict the Paperless web worker or the Postgres page cache.
Paste this verbatim into the chat builder at console.openfactory.tech. Nothing above or below it — the builder expects the prompt body to start at the “Build a compact multi-node lab…” line.
Build a compact multi-node lab named `paperless-document-lab`.
Output discipline: keep the plan small. Use one startup script per node, about 25 shell lines or less. Do not install paperless-ngx, Apache Tika, Gotenberg, Tesseract OCR data, or PDF/OCR binaries at build time. Do not pull large language model files. Write deployment-time config examples and tiny Python stdlib or shell compatibility stubs only. The goal is a buildable preparation lab, not a production deployment.
## Topology
Create 5 buildable `debian-trixie` nodes, all `x86_64`, SSH enabled, DHCP/default route intact with lab aliases, firewall disabled, DNS `1.1.1.1` and `8.8.8.8`, user `ops` password `paperless-ops` in `sudo`. Every recipe must set top-level `test_config` to `{ "enabled": false, "tests": [] }`.
- `paperless`: role `doc-app`, 3 GB RAM, 24 GB disk, alias `10.75.0.10/24`, x `230`, y `60`
- `tika`: role `parser`, 2 GB RAM, 12 GB disk, alias `10.75.0.20/24`, x `110`, y `220`
- `gotenberg`: role `pdf-gen`, 2 GB RAM, 12 GB disk, alias `10.75.0.21/24`, x `350`, y `220`
- `postgres`: role `database`, 2 GB RAM, 16 GB disk, alias `10.75.0.30/24`, x `110`, y `380`
- `redis`: role `queue`, 1 GB RAM, 8 GB disk, alias `10.75.0.31/24`, x `350`, y `380`
Connections: `paperless` to `postgres:5432`, `redis:6379`, `tika:9998`, `gotenberg:3000`.
## Common Recipe Requirements
All nodes: features `headless`, `ssh`; packages `openssh-server`, `python3`, `curl`, `jq`, `iproute2`, `netcat-openbsd`, `ca-certificates`. Each startup script adds the alias with `IFACE=$(ip route show default | awk '{print $5; exit}')`, `ip link set "$IFACE" up || true`, and `ip addr add <alias> dev "$IFACE" || true`. If `os.startup_scripts[].after` is present, it must be the string `"network-online.target"`, not an array.
## Node Requirements
`paperless`: features `headless`, `ssh`. Write `/etc/paperless/paperless.env` with `PAPERLESS_PORT=8000`, `PAPERLESS_DBHOST=10.75.0.30`, `PAPERLESS_REDIS=redis://10.75.0.31:6379`, `PAPERLESS_TIKA_ENABLED=1`, `PAPERLESS_TIKA_ENDPOINT=http://10.75.0.20:9998`, `PAPERLESS_TIKA_GOTENBERG_ENDPOINT=http://10.75.0.21:3000`. Create `/var/lib/paperless/{consume,media,data}` mode `0750 ops:ops`. Add a Python stdlib service on `0.0.0.0:8000` exposing:
- `GET /api/` -> `200 {"correspondents":"/api/correspondents/","documents":"/api/documents/"}`
- `GET /api/statistics/` -> `200 {"documents_total":0,"documents_inbox":0}`
- `GET /metrics` -> `paperless_compat_up 1`
Register `paperless-compat.service`.
`tika`: features `headless`, `ssh`. Add a Python stdlib service on `0.0.0.0:9998` exposing `GET /version` -> `200 "Apache Tika compat-1.0"` (`text/plain`), `GET /tika` -> `200 {"status":"ok"}`, `GET /metrics` with `tika_compat_up 1`. Register `tika-compat.service`.
`gotenberg`: features `headless`, `ssh`. Add a Python stdlib service on `0.0.0.0:3000` exposing `GET /health` -> `200 {"status":"up","modules":{"chromium":{"status":"compat"},"libreoffice":{"status":"compat"}}}`, `GET /metrics` with `gotenberg_compat_up 1`. Register `gotenberg-compat.service`.
`postgres`: features `headless`, `ssh`, `postgresql`; packages `postgresql`, `postgresql-client`. Listen on `0.0.0.0:5432`, best-effort create role/database `paperless` password `paperless`, allow `10.75.0.0/24` in `pg_hba.conf`. Expose `:9187/metrics` with `pg_compat_up 1`.
`redis`: features `headless`, `ssh`, `redis`; packages `redis-server`. Bind to localhost plus `10.75.0.31`. Expose `:9121/metrics` with `redis_compat_up 1`.
## Scenario
Emit exactly one group scenario named `paperless-document-lab-validation`. Put `custom_tests[].assertions[]` inside the scenario entry; leave `scenarios[].tests` empty. Every assertion needs `on_vm`. Use only `port_listening`, `command_output`, and `http_responds`; do not emit `vm_boots`, `network_reachable`, or `service_running`.
- `Stack ports listen`: `port_listening` for `paperless:8000`, `tika:9998`, `gotenberg:3000`, `postgres:5432`, `redis:6379`.
- `Paperless API`: on `paperless`, `curl -fsS http://localhost:8000/api/ | jq -e '.documents == "/api/documents/"' >/dev/null && echo paperless-ok`.
- `Tika version`: on `tika`, `curl -fsS http://localhost:9998/version | grep -qi 'Apache Tika' && echo tika-ok`.
- `Gotenberg health`: on `gotenberg`, `curl -fsS http://localhost:3000/health | jq -e '.status == "up"' >/dev/null && echo gotenberg-ok`.
- `Paperless reaches backends`: on `paperless`, `nc -z -w 5 10.75.0.30 5432 && nc -z -w 5 10.75.0.31 6379 && nc -z -w 5 10.75.0.20 9998 && nc -z -w 5 10.75.0.21 3000 && echo backends-reachable`.
Preserve warnings that real paperless-ngx app binary, Tesseract OCR data files for the chosen language set, real Apache Tika and Gotenberg Java/Chromium binaries, consume folder mount strategy, document encryption at rest, off-host backups, mail-rule ingestion, and `10.75.0.0/24` aliasing are deployment-time concerns.
Driving OpenFactory from an AI agent instead of the browser? The same flow is exposed through the OpenFactory MCP server: submit the prompt programmatically, get the build-plan preview back, and call create_build / start_vm on the resulting recipes. Single-image builds go straight through the openfactory CLI.
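The prompt asks for tiny Python stdlib compatibility stubs rather than the real services. What the builder actually generates is not shown in this post, so the following is only a sketch of what a stub matching the `tika` node's spec could look like; `ROUTES`, `CompatHandler`, and `make_server` are hypothetical names.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Paths and bodies follow the tika node's spec in the prompt above.
ROUTES = {
    "/version": ("text/plain", b"Apache Tika compat-1.0"),
    "/tika": ("application/json", b'{"status":"ok"}'),
    "/metrics": ("text/plain", b"tika_compat_up 1\n"),
}


class CompatHandler(BaseHTTPRequestHandler):
    """Answers a fixed set of GET paths with canned bodies."""

    def do_GET(self):
        route = ROUTES.get(self.path)
        if route is None:
            self.send_error(404)
            return
        content_type, body = route
        self.send_response(200)
        self.send_header("Content-Type", content_type)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        # Suppress per-request logging so the stub stays quiet.
        pass


def make_server(host: str = "0.0.0.0", port: int = 9998) -> HTTPServer:
    """Build the stub server; the caller decides when to start serving."""
    return HTTPServer((host, port), CompatHandler)
```

A unit such as `tika-compat.service` would then run something like `make_server().serve_forever()`. The stub proves that the port and paths are wired correctly; it extracts no text, so it has to be replaced by the real Apache Tika at deployment time.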
The prompt produces a buildable preparation lab — the right topology, the right ports listening, deployment-time config templates dropped in the right places, and tiny compatibility services that prove the wiring works. A few things still sit outside the recipe and need operator attention before this carries real load:
/var/lib/paperless/consume is ready: mount it from your scanner share (SMB / NFS / IMAP folder) and Paperless picks up what lands. Two things bite people here. First, the consumer process must own (or be able to read and delete in) that directory. Second, a network share that writes files in chunks can trip the watcher before the upload finishes, so set PAPERLESS_CONSUMER_POLLING when the share doesn't deliver clean inotify events.
... media/. The supported path is Paperless's own document_exporter management command, which writes documents, thumbnails, metadata, and a database dump to one folder and can update an existing export, so incremental rsync backups just work (Paperless-ngx administration docs). Run it when the consumer is idle, then apply the 3-2-1 rule: three copies, two media, one off-site.
Do I really need both Tika and Gotenberg? Only if you ingest anything other than PDFs and images. PDFs and photos go straight through Tesseract. The moment you point an email account or a folder of Office documents at Paperless, Gotenberg renders them to PDF and Tika pulls the text, so for a real mail-and-invoices workflow, yes, keep both.
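SQLite or Postgres? Paperless can run on SQLite for a single user, but the consume pipeline is concurrent and the archive grows with every document. This lab puts Postgres on its own node from the start so the archive scales past one person without a migration later. The reference stack pins PostgreSQL 18; the lab's postgres node installs whatever version Debian Trixie's postgresql package provides.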
Where do the original scans live? On the paperless VM under /var/lib/paperless/media. Paperless never throws away the source file — it stores the original next to the OCR'd searchable copy, which is exactly why your backup has to cover the disk, not just the database.
Paper is one half of personal data; photos are the other. The Immich photo vault post builds the same shape for images, and the Nextcloud cloud-stack post gives you the files-and-sync layer those scanners can drop into. For the kernel integrity story under regulated archives, see the runtime attestation post. The Enterprise & GxP page covers compliance-grade rollouts, and pricing has the plan tiers if you want OpenFactory to build and host the fleet for you.
OpenFactory's free flow is for browsing. Persistent VMs, SSH access, snapshots, your own ISO, and fleet deployment live on a paid plan.