
Build a Paperless-ngx Document Lab on OpenFactory

A five-VM document archive: Paperless + Tika + Gotenberg + Postgres + Redis, from one prompt

May 20, 2026


Paperless-ngx is the document archive every paperless office wishes it had: scan a page or drop in a PDF, and Paperless OCRs it, tags it, files it by correspondent, and makes the full text searchable in seconds. It's the sixth-most-deployed app in the 2026 r/selfhosted survey.

This post walks through the full Paperless stack on OpenFactory: five buildable VMs — Paperless, Apache Tika for content extraction, Gotenberg for PDF generation, Postgres, and Redis — from one prompt, shipped as bootable ISOs.

What you'll build

  • paperless (10.75.0.10:8000) — the app server with consume / media / data directories already on disk.
  • tika (10.75.0.20:9998) — Apache Tika for extracting text from Office docs, emails, and everything that isn't already a PDF.
  • gotenberg (10.75.0.21:3000) — Chromium- and LibreOffice-backed PDF generation, so Paperless can render every input to a uniform archival format.
  • postgres (10.75.0.30:5432) — the document and tag metadata store.
  • redis (10.75.0.31:6379) — queue for the consume + OCR pipeline.

Why build it on OpenFactory

  • The ISO is the spec. Consume folder, OCR config, Tika and Gotenberg endpoints all baked in. Boot, point the scanner, watch documents land.
  • Heavy parsers split out. Tika and Gotenberg get their own VMs so OCR pressure doesn't starve the web UI.
  • Scenario assertions ride along. The build fails closed unless Paperless can reach all four of its backends.
  • Reproducible across machines. Same ISO, same archive shape, whether you're a solo desk user or running document intake for a team.

Topology

Five Debian Trixie VMs on 10.75.0.0/24. Paperless is the only VM that talks to all the others; Tika, Gotenberg, Postgres, and Redis are subnet-only.
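
For reference, the same layout can be written down as data and sanity-checked. This is only an illustrative sketch: the `NODES` table, `CONNECTIONS` map, and `check_topology` helper are names invented here, not an OpenFactory format, and the addresses and ports are copied from the node list in this post.

```python
import ipaddress

# Lab subnet and node table, copied from this post. The dict layout is an
# illustrative assumption, not something OpenFactory consumes.
SUBNET = ipaddress.ip_network("10.75.0.0/24")
NODES = {
    "paperless": ("10.75.0.10", 8000),
    "tika": ("10.75.0.20", 9998),
    "gotenberg": ("10.75.0.21", 3000),
    "postgres": ("10.75.0.30", 5432),
    "redis": ("10.75.0.31", 6379),
}
# Only paperless initiates connections; every other node is a backend.
CONNECTIONS = {"paperless": ["postgres", "redis", "tika", "gotenberg"]}


def check_topology(nodes=NODES, subnet=SUBNET):
    """Return a list of problems: addresses outside the subnet or reused."""
    problems = []
    seen = {}
    for name, (addr, _port) in nodes.items():
        if ipaddress.ip_address(addr) not in subnet:
            problems.append(f"{name}: {addr} is outside {subnet}")
        if addr in seen:
            problems.append(f"{name}: {addr} already used by {seen[addr]}")
        seen[addr] = name
    return problems


if __name__ == "__main__":
    print(check_topology() or "topology ok")
```

An empty list from `check_topology()` means every alias sits inside 10.75.0.0/24 and no address is assigned twice.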

The prompt

Paste this verbatim into the chat builder at console.openfactory.tech. Nothing above or below it — the builder expects the prompt body to start at the “Build a compact multi-node lab…” line.

Build a compact multi-node lab named `paperless-document-lab`.

Output discipline: keep the plan small. Use one startup script per node, about 25 shell lines or less. Do not install paperless-ngx, Apache Tika, Gotenberg, Tesseract OCR data, or PDF/OCR binaries at build time. Do not pull large language model files. Write deployment-time config examples and tiny Python stdlib or shell compatibility stubs only. The goal is a buildable preparation lab, not a production deployment.

## Topology

Create 5 buildable `debian-trixie` nodes, all `x86_64`, SSH enabled, DHCP/default route intact with lab aliases, firewall disabled, DNS `1.1.1.1` and `8.8.8.8`, user `ops` password `paperless-ops` in `sudo`. Every recipe must set top-level `test_config` to `{ "enabled": false, "tests": [] }`.

- `paperless`: role `doc-app`, 3 GB RAM, 24 GB disk, alias `10.75.0.10/24`, x `230`, y `60`
- `tika`: role `parser`, 2 GB RAM, 12 GB disk, alias `10.75.0.20/24`, x `110`, y `220`
- `gotenberg`: role `pdf-gen`, 2 GB RAM, 12 GB disk, alias `10.75.0.21/24`, x `350`, y `220`
- `postgres`: role `database`, 2 GB RAM, 16 GB disk, alias `10.75.0.30/24`, x `110`, y `380`
- `redis`: role `queue`, 1 GB RAM, 8 GB disk, alias `10.75.0.31/24`, x `350`, y `380`

Connections: `paperless` to `postgres:5432`, `redis:6379`, `tika:9998`, `gotenberg:3000`.

## Common Recipe Requirements

All nodes: features `headless`, `ssh`; packages `openssh-server`, `python3`, `curl`, `jq`, `iproute2`, `netcat-openbsd`, `ca-certificates`. Each startup script adds the alias with `IFACE=$(ip route show default | awk '{print $5; exit}')`, `ip link set "$IFACE" up || true`, and `ip addr add <alias> dev "$IFACE" || true`. If `os.startup_scripts[].after` is present, it must be the string `"network-online.target"`, not an array.

## Node Requirements

`paperless`: features `headless`, `ssh`. Write `/etc/paperless/paperless.env` with `PAPERLESS_PORT=8000`, `PAPERLESS_DBHOST=10.75.0.30`, `PAPERLESS_REDIS=redis://10.75.0.31:6379`, `PAPERLESS_TIKA_ENABLED=1`, `PAPERLESS_TIKA_ENDPOINT=http://10.75.0.20:9998`, `PAPERLESS_TIKA_GOTENBERG_ENDPOINT=http://10.75.0.21:3000`. Create `/var/lib/paperless/{consume,media,data}` mode `0750 ops:ops`. Add a Python stdlib service on `0.0.0.0:8000` exposing:
- `GET /api/` -> `200 {"correspondents":"/api/correspondents/","documents":"/api/documents/"}`
- `GET /api/statistics/` -> `200 {"documents_total":0,"documents_inbox":0}`
- `GET /metrics` -> `paperless_compat_up 1`
Register `paperless-compat.service`.

`tika`: features `headless`, `ssh`. Add a Python stdlib service on `0.0.0.0:9998` exposing `GET /version` -> `200 "Apache Tika compat-1.0"` (`text/plain`), `GET /tika` -> `200 {"status":"ok"}`, `GET /metrics` with `tika_compat_up 1`. Register `tika-compat.service`.

`gotenberg`: features `headless`, `ssh`. Add a Python stdlib service on `0.0.0.0:3000` exposing `GET /health` -> `200 {"status":"up","modules":{"chromium":{"status":"compat"},"libreoffice":{"status":"compat"}}}`, `GET /metrics` with `gotenberg_compat_up 1`. Register `gotenberg-compat.service`.

`postgres`: features `headless`, `ssh`, `postgresql`; packages `postgresql`, `postgresql-client`. Listen on `0.0.0.0:5432`, best-effort create role/database `paperless` password `paperless`, allow `10.75.0.0/24` in `pg_hba.conf`. Expose `:9187/metrics` with `pg_compat_up 1`.

`redis`: features `headless`, `ssh`, `redis`; packages `redis-server`. Bind to localhost plus `10.75.0.31`. Expose `:9121/metrics` with `redis_compat_up 1`.

## Scenario

Emit exactly one group scenario named `paperless-document-lab-validation`. Put `custom_tests[].assertions[]` inside the scenario entry; leave `scenarios[].tests` empty. Every assertion needs `on_vm`. Use only `port_listening`, `command_output`, and `http_responds`; do not emit `vm_boots`, `network_reachable`, or `service_running`.

- `Stack ports listen`: `port_listening` for `paperless:8000`, `tika:9998`, `gotenberg:3000`, `postgres:5432`, `redis:6379`.
- `Paperless API`: on `paperless`, `curl -fsS http://localhost:8000/api/ | jq -e '.documents == "/api/documents/"' >/dev/null && echo paperless-ok`.
- `Tika version`: on `tika`, `curl -fsS http://localhost:9998/version | grep -qi 'Apache Tika' && echo tika-ok`.
- `Gotenberg health`: on `gotenberg`, `curl -fsS http://localhost:3000/health | jq -e '.status == "up"' >/dev/null && echo gotenberg-ok`.
- `Paperless reaches backends`: on `paperless`, `nc -z -w 5 10.75.0.30 5432 && nc -z -w 5 10.75.0.31 6379 && nc -z -w 5 10.75.0.20 9998 && nc -z -w 5 10.75.0.21 3000 && echo backends-reachable`.

Preserve warnings that real paperless-ngx app binary, Tesseract OCR data files for the chosen language set, real Apache Tika and Gotenberg Java/Chromium binaries, consume folder mount strategy, document encryption at rest, off-host backups, mail-rule ingestion, and `10.75.0.0/24` aliasing are deployment-time concerns.
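
The prompt ends with the paragraph above; what follows is commentary, not something to paste. To give a feel for what a "Python stdlib compatibility stub" might look like on the `paperless` node, here is a minimal sketch. The routes and response bodies come from the prompt, but the code the builder actually generates may differ, and `ROUTES`, `CompatHandler`, and `make_server` are hypothetical names.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Paths and bodies are taken from the prompt; the structure is an assumption.
ROUTES = {
    "/api/": ("application/json", json.dumps(
        {"correspondents": "/api/correspondents/",
         "documents": "/api/documents/"})),
    "/api/statistics/": ("application/json", json.dumps(
        {"documents_total": 0, "documents_inbox": 0})),
    "/metrics": ("text/plain", "paperless_compat_up 1\n"),
}


class CompatHandler(BaseHTTPRequestHandler):
    """Serve the fixed responses above and 404 everything else."""

    def do_GET(self):
        route = ROUTES.get(self.path)
        if route is None:
            self.send_error(404)
            return
        content_type, body = route
        payload = body.encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", content_type)
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, fmt, *args):
        pass  # suppress per-request logging


def make_server(host="0.0.0.0", port=8000):
    """Bind the stub; the caller decides when to start serving."""
    return HTTPServer((host, port), CompatHandler)
```

Calling `make_server().serve_forever()` from a script would start the stub on port 8000, which is the kind of entry point a unit like `paperless-compat.service` could run. The Tika and Gotenberg stubs would follow the same pattern with their own routes and ports.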

Running it

  1. Open the chat builder at console.openfactory.tech and paste the prompt into a new conversation.
  2. Review the streamed build plan. You'll see the topology, per-node recipes, and the scenario assertions that will run after boot. Edit the prompt and re-run if anything is off.
  3. Click Build group. OpenFactory fans the plan out to per-node ISO builds. When every ISO reaches built, boot the group on the runner network from the same UI.
  4. Exercise the stack. The scenario assertions run automatically against the live VMs. From the host you can also hit the service ports directly to confirm end-to-end behavior.
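
If you want to repeat the backend reachability check by hand, the `nc -z` chain from the scenario translates to a few lines of Python. This is a sketch under the assumption that you run it somewhere with a route to 10.75.0.0/24 (the `paperless` VM is the obvious place); `BACKENDS`, `port_open`, and `probe_all` are names made up for this example.

```python
import socket

# Addresses copied from the "Paperless reaches backends" assertion in the prompt.
BACKENDS = {
    "postgres": ("10.75.0.30", 5432),
    "redis": ("10.75.0.31", 6379),
    "tika": ("10.75.0.20", 9998),
    "gotenberg": ("10.75.0.21", 3000),
}


def port_open(host, port, timeout=5.0):
    """Return True if a TCP connection succeeds, much like `nc -z -w 5`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def probe_all(backends=BACKENDS, timeout=5.0):
    """Map each backend name to whether its port accepted a connection."""
    return {
        name: port_open(host, port, timeout)
        for name, (host, port) in backends.items()
    }
```

`all(probe_all().values())` is true only when every backend accepts a TCP connection, which mirrors the `&&` chain in the assertion. Like `nc -z`, this only proves the port is open, not that the service behind it is healthy.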

Driving OpenFactory from an AI agent instead of the browser? The same flow is exposed through the OpenFactory MCP server — submit the prompt programmatically, get the build-plan preview back, and call create_build / start_vm on the resulting recipes. Single-image builds go straight through the openfactory CLI.

What's still your responsibility

The prompt produces a buildable preparation lab — the right topology, the right ports listening, deployment-time config templates dropped in the right places, and tiny compatibility services that prove the wiring works. A few things still sit outside the recipe and need operator attention before this carries real load:

  • Real Paperless-ngx app. Install from the upstream container image or a bare-metal release install; the env file already points at the right backends.
  • Tesseract OCR data files. Install the tessdata for the languages you ingest. The stack runs without them, but scanned documents won't get searchable text.
  • Real Tika and Gotenberg. Java Tika and the Chromium-bundled Gotenberg both ship as upstream containers; swap the compatibility services out at deploy.
  • Consume folder source. The directory at /var/lib/paperless/consume is ready. Mount your scanner share (SMB or NFS) there and Paperless picks up what lands; email is ingested through Paperless mail rules against an IMAP account, not through a mounted folder.
  • Encryption at rest. Document volumes hold legal, medical, and financial paper. Paperless-ngx stores documents unencrypted on disk, so put the media and data directories on an encrypted volume (for example LUKS).
  • Off-host backups. Postgres dump plus the media directory. The dump alone isn't enough — the originals live on disk.
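
The off-host backup item pairs a database dump with a copy of the media directory. A minimal sketch of that pairing follows, assuming `pg_dump` runs against the `postgres` node and the media lives under /var/lib/paperless/media as in the prompt. The function name, output file names, and destination directory are hypothetical, and actually shipping the files off the host is left out.

```python
from pathlib import Path


def backup_commands(dest_dir, stamp, db_host="10.75.0.30", db_name="paperless",
                    db_user="paperless", media_dir="/var/lib/paperless/media"):
    """Build (but do not run) the two commands a backup needs."""
    dest = Path(dest_dir)
    dump_file = dest / f"paperless-db-{stamp}.dump"
    media_file = dest / f"paperless-media-{stamp}.tar.gz"
    media = Path(media_dir)
    return [
        # Custom-format dump so pg_restore can be used later.
        ["pg_dump", "-h", db_host, "-U", db_user, "-d", db_name,
         "-Fc", "-f", str(dump_file)],
        # Archive the originals; -C keeps paths relative to the parent directory.
        ["tar", "-czf", str(media_file), "-C", str(media.parent), media.name],
    ]
```

Each returned list can be handed to `subprocess.run(cmd, check=True)`. `pg_dump` will still need credentials, for example through `PGPASSWORD` or a `~/.pgpass` file.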

Where to go next

Paper is one half of personal data; photos are the other. The Immich photo vault post builds the same shape for images. For the kernel integrity story under regulated archives, see the runtime attestation post. And the Enterprise & GxP page covers compliance-grade rollouts.

Ready to ship this in production?

OpenFactory's free flow is for browsing. Persistent VMs, SSH access, snapshots, your own ISO, and fleet deployment live on a paid plan.