
Tooling

Self-hosted AI for SMEs: 17 ways the homelab stack breaks

You stood Ollama and Open WebUI up in a weekend. Then ops asked for SSO, backups, and a per-user audit log. Here is where the homelab stack snaps.

Jacob Molkenboer · Founder · A Brand New Company · 16 Jun 2026 · 10 min

It is Thursday, 16:40. The inside-sales lead is two prompts deep into Open WebUI when the GPU starts swapping. The developer who built the stack is on a flight to Schiphol. Ops Slack lights up because the assistant that quotes shipping windows just stopped answering. The assistant is not down. A thirty-page PDF job kicked off by another team has pushed the quoting model out of VRAM, and Ollama is now serving requests at 0.4 tokens per second.

This is the moment a homelab AI dev platform stops being a personal tool and becomes infrastructure. The transition is rougher than most technical co-founders expect.

The pattern showing up on Hacker News most months is familiar. One workstation with a consumer GPU. Ollama or vLLM. Open WebUI as the chat front-end. LiteLLM as the gateway when more than one model is in play. A vector database, usually Qdrant or pgvector. Caddy or Traefik in front. A Cloudflare Tunnel so the founder can hit it from a hotel lobby. It is a thing of beauty when one person uses it. It is a thing of horror when twelve people, three of whom are not engineers, start to depend on it for daily revenue.

We have walked into this shape of stack inside Dutch and German SMEs in the €2M to €20M revenue band roughly twenty times in the last eighteen months. The break-points repeat. Below is the field guide, ranked the only way that matters when payroll runs in nine days: which failures you can fix tonight in docker-compose.yml, and which ones force a real rebuild on something like Coolify or Dokploy.

Failures you can patch in docker-compose.yml

These nine are the cheap save. A senior person who knows YAML can clear all of them in a long evening, and the stack survives the next sprint.

1. Concurrent requests evict each other from VRAM

Ollama loads a model on first request and unloads it when another model needs the VRAM. Two users on different models on a 24GB card means a model swap every minute, with each swap costing four to eight seconds of dead time. Set OLLAMA_KEEP_ALIVE to a value that covers a working session instead of the five-minute default, pin a single model per Ollama container, and run two Ollama instances behind LiteLLM if you genuinely need two models hot. Do not lean on the documented ability of one Ollama to hold several models concurrently. It can, until it cannot.
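A minimal LiteLLM proxy config for the two-instance setup might look like the sketch below. It assumes two Ollama containers on the same compose network with the hypothetical service names ollama-quoting and ollama-docs, each started with OLLAMA_MAX_LOADED_MODELS set to "1". The model tags are placeholders, and both models still have to fit in VRAM together, or each container needs its own GPU.

```yaml
# litellm-config.yaml (sketch; service names and model tags are hypothetical)
model_list:
  # Clients ask for "quoting"; LiteLLM forwards to the container that keeps that model hot
  - model_name: quoting
    litellm_params:
      model: ollama_chat/llama3.1:8b
      api_base: http://ollama-quoting:11434
  # A second, independent Ollama instance serves the document model
  - model_name: docs
    litellm_params:
      model: ollama_chat/qwen2.5:14b
      api_base: http://ollama-docs:11434
```

Because each container only ever serves one model, a long document job on docs can slow that container down, but it cannot make the quoting container unload its model.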

2. No restart policies

Containers without restart: unless-stopped die quietly on the first OOM and stay dead. Nobody notices until the morning standup. Cost: zero. Effort: one line per service.

3. No healthchecks, race conditions on cold start

Open WebUI boots before Postgres is ready, fails to migrate, and serves a friendly white screen. Add a healthcheck to the database and a depends_on with condition: service_healthy on the consumer. This is standard Compose Specification material (file format v3 dropped the condition, the current spec brought it back), and the original homelab tutorial skipped it.
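A sketch of the fix, assuming the official postgres image and Open WebUI pointed at it through its DATABASE_URL variable. The user, database name and password are placeholders.

```yaml
services:
  postgres:
    image: postgres:16
    restart: unless-stopped
    environment:
      POSTGRES_USER: openwebui
      POSTGRES_DB: openwebui
      POSTGRES_PASSWORD: change-me   # placeholder; keep real values out of git (item 12)
    volumes:
      - pg-data:/var/lib/postgresql/data
    healthcheck:
      # $$ stops Compose interpolating, so the shell inside the container expands the vars
      test: ["CMD-SHELL", "pg_isready -U $${POSTGRES_USER} -d $${POSTGRES_DB}"]
      interval: 10s
      timeout: 5s
      retries: 5
      start_period: 20s
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    restart: unless-stopped
    environment:
      DATABASE_URL: postgresql://openwebui:change-me@postgres:5432/openwebui
    depends_on:
      postgres:
        condition: service_healthy   # wait for pg_isready, not just container start
volumes:
  pg-data:
```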

4. Services bind to 0.0.0.0 and the admin UI has no auth

The default Ollama port is 11434. The default Open WebUI port is 8080. Both, on a typical homelab compose file, are reachable from anything on the LAN. The intern on the guest Wi-Fi can hit the model and read other people's chats. Bind to 127.0.0.1, put Caddy or Traefik in front with basic auth, then expose only the proxy.
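One way to wire this, sketched under assumptions: Caddy runs as a container on the same compose network, ai.example.internal is a placeholder hostname, and the Caddyfile (shown in the comments) uses Caddy's basic_auth directive with a bcrypt hash produced by caddy hash-password. Open WebUI publishes no host port at all. Only the proxy does.

```yaml
services:
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    restart: unless-stopped
    # No "ports:" entry: reachable only from containers on this compose network
  caddy:
    image: caddy:2
    restart: unless-stopped
    ports:
      - "443:443"   # the only port exposed to the LAN
    volumes:
      - ./Caddyfile:/etc/caddy/Caddyfile:ro
      - caddy-data:/data
# ./Caddyfile (placeholder hostname; spelled "basicauth" on Caddy releases before 2.8):
#   ai.example.internal {
#     tls internal
#     basic_auth {
#       alice <bcrypt hash from `caddy hash-password`>
#     }
#     reverse_proxy open-webui:8080
#   }
volumes:
  caddy-data:
```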

5. Volume UID and GID mismatch

The container runs as UID 1000. The host directory was created by root during install. Logs say permission denied, the chat history vanishes on next restart, and a junior wastes a day blaming Postgres. Add user: "1000:1000" and chown the mount on first boot.
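A sketch of both halves, assuming a bind mount at ./open-webui-data and an image that tolerates running as a non-root UID (check your image's docs; some expect root and drop privileges themselves). The one-shot busybox service is a hypothetical helper named fix-perms; the mount target is Open WebUI's data directory.

```yaml
services:
  fix-perms:
    image: busybox
    # One-shot: hand the host directory to UID/GID 1000, then exit
    command: ["chown", "-R", "1000:1000", "/data"]
    volumes:
      - ./open-webui-data:/data
    restart: "no"   # quoted: a bare no is parsed as boolean false
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    user: "1000:1000"
    volumes:
      - ./open-webui-data:/app/backend/data
    depends_on:
      fix-perms:
        condition: service_completed_successfully
```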

6. No log rotation

Default Docker logging keeps everything. A chat agent under modest load writes one to two GB of JSON logs per day. The host disk fills in roughly a week, Postgres goes read-only, and the whole stack looks broken in ways that take an hour to diagnose. Set max-size and max-file on every service.
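One way to apply it everywhere without copy-paste is a Compose extension field (x-...) plus a YAML anchor, sketched below with an illustrative service list. With these values each container keeps at most five 20 MB files, so roughly 100 MB of logs per service.

```yaml
# Shared logging policy, defined once and referenced by every service
x-logging: &default-logging
  driver: json-file
  options:
    max-size: "20m"   # rotate when the current file reaches 20 MB
    max-file: "5"     # keep at most five files per container

services:
  ollama:
    image: ollama/ollama:latest
    logging: *default-logging
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    logging: *default-logging
  postgres:
    image: postgres:16
    logging: *default-logging
```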

7. Timezone drift between containers

The host is on Europe/Amsterdam, Postgres ends up on UTC, the chat app on whatever the base image shipped with. Audit logs and reports go out of sync with what users saw on screen. Pin TZ on every container and mount /etc/localtime read-only.
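A sketch using a shared extension field for TZ and a read-only /etc/localtime mount on each service. The postgres command line sets the server's timezone parameter explicitly, because an existing data directory keeps whatever value initdb wrote when it was created. The service list is illustrative.

```yaml
x-tz-env: &tz-env
  TZ: "Europe/Amsterdam"

services:
  postgres:
    image: postgres:16
    # Override the timezone initdb recorded, instead of trusting the old data directory
    command: ["postgres", "-c", "timezone=Europe/Amsterdam"]
    environment:
      <<: *tz-env
    volumes:
      - /etc/localtime:/etc/localtime:ro
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    environment:
      <<: *tz-env
    volumes:
      - /etc/localtime:/etc/localtime:ro
```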

8. Default ports clashing with the rest of the network

Port 8080 is also the router admin, port 9000 is the office NAS, port 11434 is taken by a previous Ollama install someone forgot about. Remap ports explicitly in compose. Do not rely on the defaults being free.
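For services that still need a host port, a sketch with explicit, deliberately unusual host ports. The numbers 18080 and 21434 are arbitrary placeholders, so check what is already listening on the host (for example with ss -ltn) before picking yours. Containers on the compose network keep using the container ports, so Open WebUI still reaches Ollama at ollama:11434.

```yaml
services:
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    ports:
      - "127.0.0.1:18080:8080"    # host 18080 -> container 8080; 8080 is the router admin
    environment:
      # Inside the compose network the container port is unchanged
      OLLAMA_BASE_URL: http://ollama:11434
  ollama:
    image: ollama/ollama:latest
    ports:
      - "127.0.0.1:21434:11434"   # sidesteps the forgotten host install on 11434
```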

9. No CORS rules on the LLM gateway

LiteLLM out of the box accepts requests from any origin. The moment the marketing team builds a small internal page that hits it directly from a browser, you have an open key endpoint protected only by the fact that nobody knows the URL. Restrict origins in the gateway config, and put basic auth on anything that handles tokens.

A serviceable compose snippet that closes most of the above on the Ollama service looks like this.

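```yaml
services:
  ollama:
    image: ollama/ollama:latest   # pin a tested tag once you have one
    restart: unless-stopped
    ports:
      - "127.0.0.1:11434:11434"   # host-local only; expose through the proxy
    environment:
      OLLAMA_KEEP_ALIVE: "30m"        # keep the model hot between requests
      OLLAMA_NUM_PARALLEL: "1"        # one request at a time per loaded model
      OLLAMA_MAX_LOADED_MODELS: "1"   # one model per container (item 1)
      TZ: "Europe/Amsterdam"
    volumes:
      - ollama-data:/root/.ollama
      - /etc/localtime:/etc/localtime:ro
    deploy:
      resources:
        limits:
          memory: 24g   # caps host RAM, not VRAM
        reservations:
          devices:      # GPU access; needs the NVIDIA Container Toolkit on the host
            - driver: nvidia
              count: all
              capabilities: [gpu]
    healthcheck:
      # The ollama/ollama image may not ship curl; the CLI queries the local server instead
      test: ["CMD", "ollama", "list"]
      interval: 30s
      timeout: 5s
      retries: 3
    logging:
      driver: json-file
      options:
        max-size: "20m"
        max-file: "5"
volumes:
  ollama-data:
```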

Failures that force a rebuild before the next payroll cycle

These eight are the ones you cannot YAML your way out of. They are organisational and architectural, not configurational. The shortest path is to move the stack onto a managed self-hosted PaaS layer such as Coolify or Dokploy, both of which give you the missing pieces without renting a Vercel-sized bill.

10. No backup story for the vector DB and chat history

Compose can mount a volume. It cannot snapshot one. When the SSD in the workstation under a desk on the sales floor fails, and it will, because it is a consumer NVMe that has been hammered by embeddings, six months of customer conversations and tribal knowledge vanish. You need scheduled off-host backups with at least seven-day retention. Coolify ships this. Dokploy ships this. Compose does not.

11. One shared API key, zero per-user audit

Every employee uses the same LiteLLM key. When the model gives a customer the wrong delivery date and the customer escalates, you cannot tell who asked, what context they pasted, or whether they edited the answer. For a sub-€20M Dutch SME, this is a problem under Article 5(2) of the GDPR, the accountability principle, the moment any personal data is processed.
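LiteLLM's own answer is virtual keys: with a master key and a Postgres database configured, the proxy can mint one key per person and track spend against it. A minimal sketch of the proxy config follows, assuming both values come from environment variables (LITELLM_MASTER_KEY and DATABASE_URL are names chosen here). The curl call in the comments uses the proxy's /key/generate endpoint; alice and the budget of 25 are placeholders. Whether prompt text is stored alongside the spend logs depends on your LiteLLM logging settings, so check field names against the version you run.

```yaml
# litellm-config.yaml (sketch): enables per-user virtual keys
general_settings:
  # Admin key, used only to mint and revoke per-user keys
  master_key: os.environ/LITELLM_MASTER_KEY
  # Postgres database where LiteLLM stores keys and per-key spend logs
  database_url: os.environ/DATABASE_URL

# Minting a key for one employee (run by an admin, not by the employee):
#   curl -X POST http://localhost:4000/key/generate \
#     -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
#     -H "Content-Type: application/json" \
#     -d '{"user_id": "alice", "max_budget": 25}'
```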

12. Secrets in .env, often committed to git

The compose file references .env. The repo lives on a self-hosted Gitea or a private GitHub. On day fourteen someone fat-fingers and pushes the env file. The model provider key, the Postgres password, and the Cloudflare Tunnel token now live in git history forever. Compose has no native secret rotation. A platform layer does.
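Compose file-based secrets at least keep values out of .env and out of the compose file itself. A sketch for the Postgres password, assuming the official postgres image (which reads POSTGRES_PASSWORD_FILE) and a ./secrets directory listed in .gitignore. Rotation is still a manual job.

```yaml
services:
  postgres:
    image: postgres:16
    environment:
      # The image reads the password from this file instead of an env var
      POSTGRES_PASSWORD_FILE: /run/secrets/pg_password
    secrets:
      - pg_password   # mounted read-only at /run/secrets/pg_password
secrets:
  pg_password:
    # Keep ./secrets/ in .gitignore; the file never enters the repo
    file: ./secrets/pg_password.txt
```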

13. The Cloudflare Tunnel runs from a laptop the founder takes on holiday

This is funnier in writing than in production. The tunnel binary is started under tmux on the founder's MacBook. The founder flies to Crete, lid closed, no battery. The whole company loses the assistant for a week. The fix is not a better tmux session. It is a tunnel that runs on the same host as the stack, supervised, and a stack that does not depend on a person being awake.
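A sketch of the supervised version: cloudflared as one more service in the same compose project, restarted by Docker rather than by whoever has a laptop open. It assumes a remotely managed tunnel whose token sits in TUNNEL_TOKEN, an environment variable cloudflared reads for tunnel run. Ideally that value comes from a secret store rather than a committed .env.

```yaml
services:
  cloudflared:
    image: cloudflare/cloudflared:latest
    restart: unless-stopped
    # Runs the tunnel on the same host as the stack; the token is read from TUNNEL_TOKEN
    command: tunnel --no-autoupdate run
    environment:
      # Compose refuses to start the project if the variable is unset or empty
      TUNNEL_TOKEN: ${TUNNEL_TOKEN:?TUNNEL_TOKEN is not set}
```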

14. One workstation under a desk is a single point of failure

Consumer hardware, no ECC memory, no redundant power, cleaning crew with a vacuum. We have seen all three end an AI deployment. As soon as the platform sits on a real workflow, it needs at minimum a UPS, host-level monitoring, and a written answer to the question of what happens when the GPU dies. That answer is hard to write in docker-compose.yml.

15. No update or rollback strategy

Ollama ships fast. Open WebUI ships faster. Postgres major upgrades need a dump-and-restore or pg_upgrade. On a hand-rolled stack, the typical update cadence is "when something looks broken", which in practice means never, until the next CUDA driver upgrade silently breaks the model loader. A platform that does atomic deploys with one-click rollback is worth its weight here.
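Pinning image tags does not give you atomic deploys, but it does make a manual rollback possible: revert the tag and run docker compose up -d again. A sketch, with the tags read from environment variables whose names (OLLAMA_TAG, OPENWEBUI_TAG) are made up here; Compose refuses to start if either is unset or empty.

```yaml
services:
  ollama:
    # Fails fast instead of silently pulling :latest
    image: ollama/ollama:${OLLAMA_TAG:?pin OLLAMA_TAG to a tested release}
  open-webui:
    image: ghcr.io/open-webui/open-webui:${OPENWEBUI_TAG:?pin OPENWEBUI_TAG to a tested release}
  postgres:
    # Stay on one major version; a major upgrade needs dump/restore or pg_upgrade
    image: postgres:16
```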

16. No observability

You cannot tell, today, which of your fourteen agents is melting the GPU, which prompt template is leaking tokens, or whether the new RAG retriever made answers worse. Prometheus and Grafana on top of LiteLLM and Ollama metrics give you that. They also take a weekend to wire up correctly. Coolify and Dokploy have first-class hooks. Raw compose does not.
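For a sense of the wiring, a minimal prometheus.yml might scrape two targets. This assumes NVIDIA's dcgm-exporter running as a service named dcgm-exporter on its default port 9400, and LiteLLM's Prometheus integration enabled so the proxy serves /metrics on port 4000. Whether that integration is available depends on your LiteLLM version and edition. Grafana dashboards sit on top and are not shown.

```yaml
# prometheus.yml (sketch; service names are assumptions)
global:
  scrape_interval: 15s
scrape_configs:
  # GPU utilisation, memory and temperature per card
  - job_name: gpu
    static_configs:
      - targets: ["dcgm-exporter:9400"]
  # Request, token and spend counters from the LLM gateway
  - job_name: litellm
    metrics_path: /metrics
    static_configs:
      - targets: ["litellm:4000"]
```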

17. GDPR data residency and DPA gaps

The whole appeal of self-hosted is that the data stays here. Then the team adds a Whisper transcription that calls out to an API. Then a vector DB hosted in Frankfurt. Then a logging backend in Dublin. Without a written processing register and at least a data processing agreement (DPA, GDPR Article 28) with each third party in the chain, the SME is exposed. The Dutch Autoriteit Persoonsgegevens has been writing more letters about this in 2026, not fewer.

Warning

If three or more of items ten to seventeen are red on a Friday afternoon, plan the rebuild before the next payroll cycle. Every additional week of "we will harden it later" is another week of customer data running through a stack with no audit log and no backups.

The five-minute Friday audit

Open a terminal on the host. Run four commands.

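```shell
docker ps --format 'table {{.Names}}\t{{.Status}}\t{{.Ports}}'
# assumes the database container is named "postgres"; no -it, so it also works from a script
docker exec postgres pg_dump --version 2>/dev/null \
  && echo "has pg_dump" || echo "MISSING pg_dump"
df -h /var/lib/docker
# checks the current user's crontab only; also look at root's and /etc/cron.d
crontab -l 2>/dev/null | grep -E 'backup|snapshot' \
  || echo "NO backup cron"
```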

If any port is bound to 0.0.0.0, if disk is past 70%, if no backup cron exists, if pg_dump is missing on the database container, the stack is already on borrowed time. Count those, plus how many of failures ten to seventeen ring true. Three or more, the rebuild is overdue.

What the rebuild actually looks like

What surprises teams is usually item eleven, not item ten. Backups are a one-day job. Re-issuing per-user keys to fourteen employees, each with their own audit trail and a scoped budget on LiteLLM, is a two-day onboarding because nobody designed the role model on day one. The compose-to-Coolify port itself runs four hours on a good night. The audit layer runs four days.

When we ran this migration for a Rotterdam wholesaler last quarter, we ended up writing a small admin UI on top of LiteLLM so the operations manager could see, per person, what was being asked of the model. That kind of glue work is where most of the time on an AI agents rebuild actually goes.

Before you touch any of this, run the four-command audit above. The output of docker ps alone will tell you, in about ninety seconds, whether you are running infrastructure or running a science project that has people relying on it.

Key takeaway

A homelab AI stack stops being safe the day a non-engineer depends on it. Run the four-command audit; if three or more items are red, plan the rebuild before payroll.

FAQ

Can I keep the homelab stack and add things later?

For one or two users, yes. The moment customer data flows through it or a non-engineer depends on it daily, the rebuild becomes cheaper than the next incident.

Coolify or Dokploy?

Both work. Coolify has wider plugin support and a louder community. Dokploy is leaner and faster to learn. For a sub-twenty-employee Dutch SME, pick whichever your senior dev already has bookmarked.

Do I have to leave Ollama behind?

No. Ollama as the model runner is fine. It is the surrounding stack that needs the upgrade: secrets, backups, audit, observability, and a tunnel that does not depend on a laptop being open.

Will the rebuild force a model change?

Almost never. The rebuild is about the platform, not the weights. You keep the same model, the same vector DB, often the same prompts. What changes is who can see what, and what survives an outage.
