Build a Paperless-ngx Document Lab on OpenFactory

A five-VM document archive: Paperless + Tika + Gotenberg + Postgres + Redis, from one prompt

March 24, 2026

Paperless-ngx is the document archive every paperless office wishes it had: scan-or-drop a PDF and Paperless OCRs it, tags it, files it by correspondent, and makes the full text searchable in seconds. It's the sixth-most-deployed app in the 2026 r/selfhosted survey, and once it's ingested a year of mail you stop filing paper by hand and start searching for it.

Under the hood Paperless is not one program but a pipeline. The 2026 reference stack pins Paperless-ngx 2.20.11 against PostgreSQL 18 and Redis 8, with Gotenberg 8.15 and Apache Tika 3.1.0 handling the formats Paperless can't parse on its own (AiCybr setup guide, 2026). A file lands in the consume folder, a Celery worker detects its type, office documents are routed through Gotenberg and Tika, image-only PDFs get Tesseract OCR through the ocrmypdf wrapper, and Paperless keeps both the original and a searchable PDF while indexing the full text in Whoosh (Paperless-ngx docs).
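The routing step described above can be pictured as a small dispatch on file type. The sketch below is an illustration only, not Paperless-ngx's code: the function name `route_document`, the route labels, and the extension sets are assumptions made for the example, and a real consumer would more likely inspect file content than trust the extension.

```python
from pathlib import Path

# Hypothetical extension sets for illustration; not Paperless-ngx's actual lists.
OCR_EXTENSIONS = {".pdf", ".png", ".jpg", ".jpeg", ".tif", ".tiff"}
OFFICE_EXTENSIONS = {
    ".doc", ".docx", ".odt",
    ".xls", ".xlsx", ".ods",
    ".ppt", ".pptx", ".odp",
    ".eml",
}


def route_document(filename: str) -> str:
    """Return which branch of the pipeline a file would take in this sketch."""
    suffix = Path(filename).suffix.lower()
    if suffix in OCR_EXTENSIONS:
        # PDFs and images go to the Tesseract / ocrmypdf path.
        return "ocrmypdf"
    if suffix in OFFICE_EXTENSIONS:
        # Office files and emails need Tika (text) and Gotenberg (PDF rendering).
        return "tika+gotenberg"
    return "unsupported"
```

Calling `route_document("invoice.docx")` returns `"tika+gotenberg"`, while `route_document("scan.pdf")` returns `"ocrmypdf"`.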

This post walks through that full pipeline on OpenFactory: five buildable VMs — Paperless, Apache Tika for content extraction, Gotenberg for PDF generation, Postgres, and Redis — from one prompt, shipped as bootable ISOs. The lab gives you the topology and the wiring; this post is about what runs on top of it and what you owe the archive once it holds your real paper.

What you'll build

paperless (10.75.0.10:8000) — the app server with consume / media / data directories already on disk.
tika (10.75.0.20:9998) — Apache Tika for extracting text from Office docs, emails, and everything that isn't already a PDF.
gotenberg (10.75.0.21:3000) — Chromium- and LibreOffice-backed PDF generation, so Paperless can render every input to a uniform archival format.
postgres (10.75.0.30:5432) — the document and tag metadata store.
redis (10.75.0.31:6379) — the message broker for the Celery task queue. Every consume job, OCR run, and scheduled workflow flows through here, which is why it earns its own node even though it is the smallest box in the lab.
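For planning host capacity, the node list can be written down as data and sanity-checked. This is a minimal sketch: the addresses and ports come from the list above, the RAM and disk figures are the ones requested in the prompt further down, and the names `NODES`, `LAB_SUBNET`, and `validate` are hypothetical helpers invented for this example.

```python
import ipaddress

LAB_SUBNET = ipaddress.ip_network("10.75.0.0/24")

# One entry per VM: lab alias, service port, RAM (GB), disk (GB).
NODES = {
    "paperless": {"ip": "10.75.0.10", "port": 8000, "ram_gb": 3, "disk_gb": 24},
    "tika": {"ip": "10.75.0.20", "port": 9998, "ram_gb": 2, "disk_gb": 12},
    "gotenberg": {"ip": "10.75.0.21", "port": 3000, "ram_gb": 2, "disk_gb": 12},
    "postgres": {"ip": "10.75.0.30", "port": 5432, "ram_gb": 2, "disk_gb": 16},
    "redis": {"ip": "10.75.0.31", "port": 6379, "ram_gb": 1, "disk_gb": 8},
}


def validate(nodes, subnet):
    """Return a list of problems: addresses outside the subnet or used twice."""
    problems = []
    seen = set()
    for name, spec in nodes.items():
        addr = ipaddress.ip_address(spec["ip"])
        if addr not in subnet:
            problems.append(f"{name}: {addr} is outside {subnet}")
        if addr in seen:
            problems.append(f"{name}: {addr} is assigned twice")
        seen.add(addr)
    return problems


total_ram_gb = sum(spec["ram_gb"] for spec in NODES.values())
total_disk_gb = sum(spec["disk_gb"] for spec in NODES.values())
```

With the figures above, the whole lab asks the host for 10 GB of RAM and 72 GB of disk.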

Why build it on OpenFactory

The ISO is the spec. The consume, media, and data directories and the Postgres, Redis, Tika, and Gotenberg endpoints are all baked in. Boot, install the real services, point the scanner, and watch documents land.
Heavy parsers split out. Tika and Gotenberg get their own VMs so a burst of office-document conversions doesn't starve the web UI.
Scenario assertions ride along. The build fails closed if any of Paperless's four backends is unreachable.
Reproducible across machines. Same ISO, same archive shape, whether you're a solo desk user or running document intake for a team.
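The expensive part is isolated. Office-document conversion is CPU- and memory-hungry, and every job is dispatched through the Celery queue fed by Redis. Keeping that queue and the two parsers on their own VMs means a batch of several hundred office documents does not make the search UI stutter. Tesseract OCR still runs in the Celery worker on the paperless VM, so scan-heavy intake is what sizes that node.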

Topology

Five Debian Trixie VMs on 10.75.0.0/24. Paperless is the only VM that talks to all the others; Tika, Gotenberg, Postgres, and Redis are subnet-only. Read the diagram as a pipeline rather than a star: a scan lands in the consume folder, Paperless enqueues a job on Redis, the worker pulls text out through Tika and renders to PDF through Gotenberg, and the resulting metadata lands in Postgres while the Whoosh search index is written to the data directory on the paperless VM.

The Paperless app server is the only node with outbound links; the four backends accept connections only from the lab subnet.
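The four outbound links can be probed with a plain TCP connect, which is what the scenario's `nc -z` assertion does. Below is a minimal Python equivalent meant to be run on the paperless VM; the helper names `port_open` and `unreachable` are made up for this sketch, and a successful connect only proves that something is listening, not that it is the real service.

```python
import socket

# Backend addresses and ports as listed in this post.
BACKENDS = {
    "postgres": ("10.75.0.30", 5432),
    "redis": ("10.75.0.31", 6379),
    "tika": ("10.75.0.20", 9998),
    "gotenberg": ("10.75.0.21", 3000),
}


def port_open(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable networks.
        return False


def unreachable(backends, timeout: float = 5.0):
    """Return the names of backends that did not accept a connection."""
    return [
        name
        for name, (host, port) in backends.items()
        if not port_open(host, port, timeout)
    ]
```

An empty list from `unreachable(BACKENDS)` corresponds to the `backends-reachable` outcome in the scenario.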

Why the parsers are separate boxes. Tika is a JVM service and Gotenberg bundles a headless Chromium plus LibreOffice — both are memory- and CPU-heavy, and both only get exercised when a non-PDF (a .docx, an .eml email, a spreadsheet) enters the pipeline. Pinning them to their own VMs means a burst of office documents can't evict the Paperless web worker or the Postgres page cache.

The prompt

Paste this verbatim into the chat builder at console.openfactory.tech. Nothing above or below it — the builder expects the prompt body to start at the “Build a compact multi-node lab…” line.

Build a compact multi-node lab named `paperless-document-lab`.

Output discipline: keep the plan small. Use one startup script per node, about 25 shell lines or less. Do not install paperless-ngx, Apache Tika, Gotenberg, Tesseract OCR data, or PDF/OCR binaries at build time. Do not pull large language model files. Write deployment-time config examples and tiny Python stdlib or shell compatibility stubs only. The goal is a buildable preparation lab, not a production deployment.

## Topology

Create 5 buildable `debian-trixie` nodes, all `x86_64`, SSH enabled, DHCP/default route intact with lab aliases, firewall disabled, DNS `1.1.1.1` and `8.8.8.8`, user `ops` password `paperless-ops` in `sudo`. Every recipe must set top-level `test_config` to `{ "enabled": false, "tests": [] }`.

- `paperless`: role `doc-app`, 3 GB RAM, 24 GB disk, alias `10.75.0.10/24`, x `230`, y `60`
- `tika`: role `parser`, 2 GB RAM, 12 GB disk, alias `10.75.0.20/24`, x `110`, y `220`
- `gotenberg`: role `pdf-gen`, 2 GB RAM, 12 GB disk, alias `10.75.0.21/24`, x `350`, y `220`
- `postgres`: role `database`, 2 GB RAM, 16 GB disk, alias `10.75.0.30/24`, x `110`, y `380`
- `redis`: role `queue`, 1 GB RAM, 8 GB disk, alias `10.75.0.31/24`, x `350`, y `380`

Connections: `paperless` to `postgres:5432`, `redis:6379`, `tika:9998`, `gotenberg:3000`.

## Common Recipe Requirements

All nodes: features `headless`, `ssh`; packages `openssh-server`, `python3`, `curl`, `jq`, `iproute2`, `netcat-openbsd`, `ca-certificates`. Each startup script adds the alias with `IFACE=$(ip route show default | awk '{print $5; exit}')`, `ip link set "$IFACE" up || true`, and `ip addr add <alias> dev "$IFACE" || true`. If `os.startup_scripts[].after` is present, it must be the string `"network-online.target"`, not an array.

## Node Requirements

`paperless`: features `headless`, `ssh`. Write `/etc/paperless/paperless.env` with `PAPERLESS_PORT=8000`, `PAPERLESS_DBHOST=10.75.0.30`, `PAPERLESS_REDIS=redis://10.75.0.31:6379`, `PAPERLESS_TIKA_ENABLED=1`, `PAPERLESS_TIKA_ENDPOINT=http://10.75.0.20:9998`, `PAPERLESS_TIKA_GOTENBERG_ENDPOINT=http://10.75.0.21:3000`. Create `/var/lib/paperless/{consume,media,data}` mode `0750 ops:ops`. Add a Python stdlib service on `0.0.0.0:8000` exposing:
- `GET /api/` -> `200 {"correspondents":"/api/correspondents/","documents":"/api/documents/"}`
- `GET /api/statistics/` -> `200 {"documents_total":0,"documents_inbox":0}`
- `GET /metrics` -> `paperless_compat_up 1`
Register `paperless-compat.service`.

`tika`: features `headless`, `ssh`. Add a Python stdlib service on `0.0.0.0:9998` exposing `GET /version` -> `200 "Apache Tika compat-1.0"` (`text/plain`), `GET /tika` -> `200 {"status":"ok"}`, `GET /metrics` with `tika_compat_up 1`. Register `tika-compat.service`.

`gotenberg`: features `headless`, `ssh`. Add a Python stdlib service on `0.0.0.0:3000` exposing `GET /health` -> `200 {"status":"up","modules":{"chromium":{"status":"compat"},"libreoffice":{"status":"compat"}}}`, `GET /metrics` with `gotenberg_compat_up 1`. Register `gotenberg-compat.service`.

`postgres`: features `headless`, `ssh`, `postgresql`; packages `postgresql`, `postgresql-client`. Listen on `0.0.0.0:5432`, best-effort create role/database `paperless` password `paperless`, allow `10.75.0.0/24` in `pg_hba.conf`. Expose `:9187/metrics` with `pg_compat_up 1`.

`redis`: features `headless`, `ssh`, `redis`; packages `redis-server`. Bind to localhost plus `10.75.0.31`. Expose `:9121/metrics` with `redis_compat_up 1`.

## Scenario

Emit exactly one group scenario named `paperless-document-lab-validation`. Put `custom_tests[].assertions[]` inside the scenario entry; leave `scenarios[].tests` empty. Every assertion needs `on_vm`. Use only `port_listening`, `command_output`, and `http_responds`; do not emit `vm_boots`, `network_reachable`, or `service_running`.

- `Stack ports listen`: `port_listening` for `paperless:8000`, `tika:9998`, `gotenberg:3000`, `postgres:5432`, `redis:6379`.
- `Paperless API`: on `paperless`, `curl -fsS http://localhost:8000/api/ | jq -e '.documents == "/api/documents/"' >/dev/null && echo paperless-ok`.
- `Tika version`: on `tika`, `curl -fsS http://localhost:9998/version | grep -qi 'Apache Tika' && echo tika-ok`.
- `Gotenberg health`: on `gotenberg`, `curl -fsS http://localhost:3000/health | jq -e '.status == "up"' >/dev/null && echo gotenberg-ok`.
- `Paperless reaches backends`: on `paperless`, `nc -z -w 5 10.75.0.30 5432 && nc -z -w 5 10.75.0.31 6379 && nc -z -w 5 10.75.0.20 9998 && nc -z -w 5 10.75.0.21 3000 && echo backends-reachable`.

Preserve warnings that real paperless-ngx app binary, Tesseract OCR data files for the chosen language set, real Apache Tika and Gotenberg Java/Chromium binaries, consume folder mount strategy, document encryption at rest, off-host backups, mail-rule ingestion, and `10.75.0.0/24` aliasing are deployment-time concerns.
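
The prompt ends with the line above. For orientation only (this is not part of the prompt), here is a rough sketch of what a "Python stdlib compatibility stub" for the tika node could look like. The builder generates its own script, so the names `ROUTES`, `CompatHandler`, and `make_server` are assumptions, and the exact response bodies may differ from what the builder emits.

```python
import json
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Path -> (content type, body), mirroring the endpoints requested for the tika node.
ROUTES = {
    "/version": ("text/plain", b"Apache Tika compat-1.0"),
    "/tika": ("application/json", json.dumps({"status": "ok"}).encode()),
    "/metrics": ("text/plain", b"tika_compat_up 1\n"),
}


class CompatHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Ignore any query string when matching the route.
        route = ROUTES.get(self.path.split("?", 1)[0])
        if route is None:
            self.send_error(404)
            return
        content_type, body = route
        self.send_response(200)
        self.send_header("Content-Type", content_type)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        # Suppress per-request logging to keep the service log quiet.
        pass


def make_server(host: str = "0.0.0.0", port: int = 9998) -> ThreadingHTTPServer:
    """Build the stub server; call serve_forever() on the result to run it."""
    return ThreadingHTTPServer((host, port), CompatHandler)
```

A unit such as `tika-compat.service` would then run `make_server().serve_forever()`. The stub answers the scenario's `grep -qi 'Apache Tika'` check but extracts no text, which is why the real Tika has to replace it at deploy time.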

Running it

Open the chat builder at console.openfactory.tech and paste the prompt into a new conversation.
Review the streamed build plan. You'll see the topology, per-node recipes, and the scenario assertions that will run after boot. Edit the prompt and re-run if anything is off.
Click Build group. OpenFactory fans the plan out to per-node ISO builds. When every ISO reaches built, boot the group on the runner network from the same UI.
Exercise the stack. The scenario assertions run automatically against the live VMs. From the host you can also hit the service ports directly to confirm end-to-end behavior.
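
As one way to hit the service ports directly, the sketch below repeats the scenario's three HTTP assertions from Python. It assumes the lab aliases are routable from wherever it runs, which depends on your runner network, and it targets the compatibility stubs; a real Paperless install would additionally require authentication on `/api/`. The names `http_get` and `smoke_checks` are hypothetical.

```python
import json
import urllib.request


def http_get(url: str, timeout: float = 5.0):
    """Return (status, body bytes); raises on connection errors and HTTP 4xx/5xx."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.status, resp.read()


def smoke_checks(get=http_get):
    """Run the same checks as the scenario and return a name -> bool mapping."""
    results = {}

    status, body = get("http://10.75.0.10:8000/api/")
    results["paperless"] = (
        status == 200 and json.loads(body).get("documents") == "/api/documents/"
    )

    status, body = get("http://10.75.0.20:9998/version")
    results["tika"] = status == 200 and b"apache tika" in body.lower()

    status, body = get("http://10.75.0.21:3000/health")
    results["gotenberg"] = status == 200 and json.loads(body).get("status") == "up"

    return results
```

The `get` parameter exists so the checks can be exercised against canned responses before any VM is booted.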

Driving OpenFactory from an AI agent instead of the browser? The same flow is exposed through the OpenFactory MCP server — submit the prompt programmatically, get the build-plan preview back, and call create_build / start_vm on the resulting recipes. Single-image builds go straight through the openfactory CLI.

What's still your responsibility

The prompt produces a buildable preparation lab — the right topology, the right ports listening, deployment-time config templates dropped in the right places, and tiny compatibility services that prove the wiring works. A few things still sit outside the recipe and need operator attention before this carries real load:

Real Paperless-ngx app. Install from the upstream container image or a bare-metal release; the env file already points at the right backends.
Tesseract OCR data files. The tessdata for the languages you ingest — the stack runs without them but you won't get searchable text.
Real Tika and Gotenberg. Java Tika and the Chromium-bundled Gotenberg both ship as upstream containers; swap the compatibility services out at deploy.
Consume folder source. The directory at /var/lib/paperless/consume is ready: mount your scanner share (SMB or NFS) there and Paperless picks up what lands. Mail is not a mount; IMAP accounts are ingested through mail rules configured in Paperless. Two things bite people here: the consumer process must own (or be able to read and delete in) that directory, and a network share that writes files in chunks can trip the watcher before the upload finishes. Set PAPERLESS_CONSUMER_POLLING when the share doesn't deliver clean inotify events.
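Encryption at rest. Document volumes hold legal, medical, and financial paper, and Paperless-ngx stores them unencrypted on disk. Put the media and data directories on an encrypted volume (LUKS or an equivalent filesystem-level layer) rather than relying on the application.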
Off-host backups. A Postgres dump alone is not a backup — the original files live on disk under media/. The supported path is Paperless's own document_exporter management command, which writes documents, thumbnails, metadata, and a database dump to one folder and can update an existing export, so incremental rsync backups just work (Paperless-ngx administration docs). Run it when the consumer is idle, then apply the 3-2-1 rule: three copies, two media, one off-site.
Workflows and matching rules. The lab wires the pipeline; the filing logic is yours to define. Paperless-ngx 2.x ships a workflow engine with four trigger types: Consumption Started, Document Added, Document Updated, and Scheduled. Actions can assign tags or correspondents, fire a webhook, or send mail (Paperless workflow system). One gotcha worth knowing up front: a Consumption Started trigger fires before the file has been parsed, so it can only filter on source, path, filename, or mail rule. Rules that match on document content belong on a Document Added or Document Updated trigger, which run once the extracted text exists.
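
The chunked-upload problem from the consume folder item is easiest to see in code. The sketch below is a hypothetical helper for a staging script that copies scans into the consume folder; it is not Paperless's implementation, and the name `wait_until_stable` and its defaults are assumptions. It treats a file as complete once its size and modification time stop changing across several polls.

```python
import os
import time


def wait_until_stable(path, interval=2.0, stable_checks=3, timeout=300.0):
    """Return True once size and mtime are unchanged for `stable_checks` polls in a row.

    Returns False if the file disappears or the timeout expires first.
    """
    deadline = time.monotonic() + timeout
    last = None
    unchanged = 0
    while time.monotonic() < deadline:
        try:
            st = os.stat(path)
        except FileNotFoundError:
            # The file vanished: already consumed, or removed by the uploader.
            return False
        current = (st.st_size, st.st_mtime_ns)
        if current == last:
            unchanged += 1
            if unchanged >= stable_checks:
                return True
        else:
            # The file changed since the last poll; start counting again.
            unchanged = 0
            last = current
        time.sleep(interval)
    return False
```

A staging script would call this on each incoming file and only then move it into /var/lib/paperless/consume. A quiet period is a heuristic rather than a guarantee, so pick an interval longer than the pauses your scanner or share produces between chunks.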

Quick answers

Do I really need both Tika and Gotenberg? Only if you ingest anything other than PDFs and images. PDFs and photos go straight through Tesseract. The moment you point an email account or a folder of Office documents at Paperless, Gotenberg renders them to PDF and Tika pulls the text. For a real mail-and-invoices workflow, yes, keep both.

SQLite or Postgres? Paperless can run on SQLite for a single user, but the consume pipeline is concurrent and the metadata and stored text grow with every document. This lab uses Postgres 18 from the start so the archive scales past one person without a migration later.

Where do the original scans live? On the paperless VM under /var/lib/paperless/media. Paperless never throws away the source file: it stores the original next to the OCR'd searchable copy, which is exactly why your backup has to cover the disk, not just the database.

Where to go next

Paper is one half of personal data; photos are the other. The Immich photo vault post builds the same shape for images, and the Nextcloud cloud-stack post gives you the files-and-sync layer those scanners can drop into. For the kernel integrity story under regulated archives, see the runtime attestation post. The Enterprise & GxP page covers compliance-grade rollouts, and pricing has the plan tiers if you want OpenFactory to build and host the fleet for you.
