Over-engineered email agents: a field guide for ops leads
A 200-rpm internal email agent does not need three Kubernetes clusters and Argo CD. It needs a €4 VPS, Caddy, and a systemd timer. The rest is a tax.

The ops manager at a Utrecht logistics company had four browser tabs open when we sat down. Two Argo CD dashboards, one for staging, one for production. A Grafana board glowing yellow. And the Hacker News thread about a phone with a smashed screen mounted inside a toilet tank, serving a website over the public internet.
"We're modernising," she said. The thing she was modernising was an internal email-triage agent. It does 200 requests per minute on a quiet day. It runs across three Kubernetes clusters.
The CX22 from Hetzner that would have run the same workload costs €4.51 a month. The stack she had built was costing somewhere north of €600 a month and three hours a week of an engineer who could have been building product.
This is the field guide. Eighteen mistakes, ranked by how much it hurts to undo them.
What the toilet phone post actually said
The thread sat on the HN front page this week. Someone mounts a cracked-screen phone inside a porcelain cistern, runs a tiny webserver on it, points DNS at a home connection, and serves the front page for weeks. The comments split into two camps. The first camp said "cute, but you can't run a business on that." The second camp, which was correct, said the demo is the point: infrastructure is what survives, not what scales.
Most people in the second camp work on systems with a hundred million users. The ops manager in Utrecht, reading from outside the conversation, took the post as permission to keep building. Her workload is "real software," after all. The phone-in-toilet thing is for hobbyists. Kubernetes is for adults.
The arithmetic does not care. A 200-rpm internal email-triage agent is in fact closer to a phone in a cistern than it is to a hyperscaler. The job is: receive an email, classify it, route it, reply if confident, escalate if not. One binary. One queue, optional. One persistent store the size of a medium-format JPG.
The toilet phone post is not a hobbyist anecdote. It is an upper bound on how much infrastructure most internal workloads actually need.
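To make the size of the job concrete, here is a minimal sketch of the receive, classify, route, reply-or-escalate loop described above. Everything named here is an assumption for illustration: the Email fields, the 0.8 confidence threshold, and the keyword_classifier stand-in (a real agent would call a model instead).

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass
class Email:
    sender: str
    subject: str
    body: str


# Hypothetical threshold; the article does not state one.
CONFIDENCE_THRESHOLD = 0.8


def triage(email: Email, classify: Callable[[Email], Tuple[str, float]]) -> Dict[str, object]:
    """Classify the email, pick a route, then reply if confident or escalate if not."""
    label, confidence = classify(email)
    action = "reply" if confidence >= CONFIDENCE_THRESHOLD else "escalate"
    return {"route": label, "action": action, "confidence": confidence}


def keyword_classifier(email: Email) -> Tuple[str, float]:
    """Stand-in classifier for illustration only."""
    if "invoice" in email.subject.lower():
        return ("billing", 0.95)
    return ("general", 0.40)


if __name__ == "__main__":
    print(triage(Email("a@example.nl", "Invoice 1042", "Please find attached."), keyword_classifier))
```

The whole decision fits in one function, which is the point: there is nothing here that needs a scheduler, a mesh, or a second process.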
The reference stack worth measuring against
Before we name the eighteen, here is the thing they are over-engineering against. A Hetzner CX22 (2 vCPU, 4 GB RAM, 40 GB SSD) at €4.51 per month. Caddy as the reverse proxy, which provisions and renews TLS certificates automatically. The agent itself, as a single Python or Go binary. A systemd timer if you want a periodic sweep. SQLite for state. Backups via borg to a second box, also €4.51. A two-line deploy script that copies the binary over with scp and runs systemctl restart.
Total cost: around €10 a month. Total moving parts: five. Recovery from total loss: rebuild the box from a Hetzner snapshot, restore borg, restart. Twenty minutes if you have done it once.
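The two-line deploy mentioned above could be sketched roughly as follows. The host, the remote path /usr/local/bin/triage and the unit name triage.service are assumptions, not details from the Utrecht setup. One deliberate departure from a plain copy: the sketch uploads to a .new path and renames it, because overwriting a running executable in place can fail with "text file busy" on Linux.

```python
import subprocess
from typing import List


def deploy_commands(binary: str, host: str,
                    remote_path: str = "/usr/local/bin/triage",  # assumed path
                    unit: str = "triage.service") -> List[List[str]]:  # assumed unit name
    """Return the two commands: copy the binary, then swap it in and restart."""
    staged = remote_path + ".new"
    return [
        ["scp", binary, f"{host}:{staged}"],
        ["ssh", host, f"mv -f {staged} {remote_path} && systemctl restart {unit}"],
    ]


def deploy(binary: str, host: str, dry_run: bool = True) -> List[List[str]]:
    """Run the commands in order; with dry_run=True only return them."""
    commands = deploy_commands(binary, host)
    if not dry_run:
        for argv in commands:
            subprocess.run(argv, check=True)  # raises on the first failing step
    return commands


if __name__ == "__main__":
    for argv in deploy("./triage", "ops@box.example.nl"):
        print(" ".join(argv))
```

The dry-run default makes the script safe to read and test before it is pointed at a real box.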
Observability for this size of system is one node-exporter, a heartbeat counter the agent writes after every successful sweep, and a five-line shell script that emails the operator if the heartbeat is more than ten minutes old. The Utrecht dashboard tracked sixty metrics. The ops manager looked at three of them, and only during incidents. The other fifty-seven were keeping themselves company.
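# /etc/caddy/Caddyfile: the entire reverse-proxy story
# rate_limit is not part of stock Caddy. This assumes a build that includes
# the mholt/caddy-ratelimit plugin; plugin directives need an explicit order.
{
    order rate_limit before basic_auth
}

triage.intern.example.nl {
    basic_auth {
        ops $2a$14$Cn... # bcrypt hash
    }
    rate_limit {
        zone agent {
            key {remote_host}
            events 60
            window 1m
        }
    }
    header {
        Strict-Transport-Security "max-age=63072000"
        X-Frame-Options DENY
        Referrer-Policy no-referrer
    }
    reverse_proxy 127.0.0.1:8080
}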
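The heartbeat check from the observability paragraph can be sketched in a few lines. This version is Python rather than shell, and it assumes the agent touches a heartbeat file after each successful sweep; the file path, the sender address and the local SMTP relay on localhost are all assumptions.

```python
import os
import smtplib
import time
from email.message import EmailMessage
from typing import Optional

MAX_AGE_SECONDS = 10 * 60  # ten minutes, as in the text


def write_heartbeat(path: str) -> None:
    """Called by the agent after every successful sweep."""
    with open(path, "w") as f:
        f.write(str(int(time.time())))


def heartbeat_age(path: str, now: Optional[float] = None) -> float:
    """Seconds since the heartbeat file was last modified; infinite if it is missing."""
    now = time.time() if now is None else now
    try:
        return now - os.path.getmtime(path)
    except FileNotFoundError:
        return float("inf")


def is_stale(path: str, now: Optional[float] = None) -> bool:
    return heartbeat_age(path, now) > MAX_AGE_SECONDS


def alert(path: str, operator: str, relay: str = "localhost") -> None:
    """Email the operator through an assumed local SMTP relay."""
    msg = EmailMessage()
    msg["Subject"] = f"triage heartbeat stale: {path}"
    msg["From"] = "triage@localhost"  # assumed sender address
    msg["To"] = operator
    msg.set_content(f"No successful sweep in the last {MAX_AGE_SECONDS // 60} minutes.")
    with smtplib.SMTP(relay) as smtp:
        smtp.send_message(msg)
```

Run from a second systemd timer, the check reduces to `if is_stale(path): alert(path, operator)`. A missing file counts as stale, so a box that never started sweeping also raises the alarm.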
Now the eighteen, grouped by how reversible the mistake is.
Tier one: seven mistakes a Caddyfile undoes
These are the cheapest to fix. They live entirely in the request-handling layer. You delete the YAML, write fifteen lines of Caddy config, and the engineer who pushes the change is at lunch by 13:00.
1. The nginx-ingress-controller installed for TLS
Caddy ships automatic HTTPS. A single hostname line and TLS is on. No ingress controller. No custom resource definitions.
2. cert-manager fighting Let's Encrypt rate limits
cert-manager is fine. It is also unnecessary if Caddy is already handling the listener. Two systems racing each other on the same domain is how teams end up with twenty-four-hour outages that nobody can reproduce.
3. The external rate-limiter service
You do not need Envoy plus Redis to throttle an internal agent. Caddy has a rate_limit handler, available as a plugin (mholt/caddy-ratelimit) rather than in the stock build. One stanza, one number, done.
4. The OAuth2 proxy in front of an intranet service
If the agent is only ever called from the office VPN, Caddy's basic_auth directive and a strong password in a password manager is the right amount of security. The OAuth2 proxy made sense at the previous company. Here it is paperwork.
5. ConfigMap-driven security headers
HSTS, CSP, X-Frame-Options. Caddy ships a header directive. Six lines. You will not edit them again for two years.
6. The CDN for assets nobody outside the VPN can see
An internal agent serves a status page and a handful of CSS files. CloudFront is impressive. It is also not connected to anything that cares.
7. The bare-metal load balancer in front of one pod
If you cannot draw the second pod on a whiteboard, the load balancer is a single point of failure with a status page. Take it out. Point DNS at the box.
All seven are deletable in an afternoon because the request path is the most replaceable part of any stack. Traffic goes where you point it.
This is also the easiest tier to win the argument on. The engineer who installed the OAuth2 proxy can keep the YAML in a personal repo and learn the new tool next sprint without losing anything. Political cost is low because technical cost of switching is low. Start here, even if the real waste lives further down. Wins compound.
Tier two: six mistakes a single binary and a systemd unit undo
These are harder, because they are wired into the application itself. You will spend a week, not an afternoon. But the rip is local: no platform team needs to be in the meeting.
8. Celery + Redis + RabbitMQ to run a job every minute
A systemd timer with OnCalendar=*:0/1 and a unit file is fourteen lines. Celery is a fine choice for a workload with actual queue dynamics. Email triage at 200 rpm does not have queue dynamics.
# /etc/systemd/system/triage.timer
# With no Unit= set, the timer activates triage.service.
[Unit]
Description=Run the email-triage sweep every minute

[Timer]
OnCalendar=*:0/1
AccuracySec=1s
Persistent=true

[Install]
WantedBy=timers.target
9. Horizontal Pod Autoscaler thresholds at 200 rpm
200 rpm is roughly three requests per second. A Raspberry Pi handles it. Nothing is going to scale, because nothing needs to. The HPA is decorative.
10. PersistentVolumeClaims for files that fit in /var/lib
If the dataset is under a gigabyte and grows by megabytes per month, it lives in a directory on the VPS and gets backed up by borg. PVCs are how you end up debugging CSI drivers at 02:00 on a Sunday.
11. Postgres-in-a-cluster for an SQLite-shaped problem
One writer, one reader, no concurrent transactions worth speaking of, queries that return in microseconds. SQLite is the answer. A separate Postgres pod with a sidecar and a backup operator is theatre.
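A minimal sketch of what SQLite-shaped state could look like, using only the Python standard library. The decisions table and its columns are assumptions for illustration; the article does not describe the agent's actual schema.

```python
import sqlite3
from typing import Optional, Tuple

# Hypothetical schema: one row per triaged message.
SCHEMA = """
CREATE TABLE IF NOT EXISTS decisions (
    message_id TEXT PRIMARY KEY,
    route      TEXT NOT NULL,
    action     TEXT NOT NULL,
    decided_at TEXT NOT NULL DEFAULT (datetime('now'))
)
"""


def open_state(path: str) -> sqlite3.Connection:
    """Open (or create) the state file and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode=WAL")  # lets a reader run alongside the single writer
    conn.execute(SCHEMA)
    return conn


def record(conn: sqlite3.Connection, message_id: str, route: str, action: str) -> None:
    """Store a decision; the with-block commits or rolls back as one transaction."""
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO decisions (message_id, route, action) VALUES (?, ?, ?)",
            (message_id, route, action),
        )


def lookup(conn: sqlite3.Connection, message_id: str) -> Optional[Tuple[str, str]]:
    """Return (route, action) for a message, or None if it was never triaged."""
    return conn.execute(
        "SELECT route, action FROM decisions WHERE message_id = ?", (message_id,)
    ).fetchone()
```

The state is one file under /var/lib, so the backup story stays the same as for everything else on the box.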
12. A Helm chart with fourteen values.yaml files for one binary
Helm is templating. If the thing being templated is one Deployment, one Service, one Ingress and one ConfigMap, the template is longer than the application. Replace with a shell script.
13. The sidecar log shipper into a hosted ELK
journalctl plus a five-line bash script that emails the operator on FAIL covers ninety per cent of what the ops team will ever actually look at. Keep ELK for the workloads that earn it.
Tier three: five mistakes that mean ripping out Argo CD
These are the ones that hurt. They are wired into the platform layer. Undoing them means convincing the team that the platform itself was the wrong call, which is a different conversation from "let's clean up the Caddyfile." Engineers will defend this tier hardest, because they spent real time learning it. Defending the tool against the workload is the failure mode. The workload is the customer.
14. Argo CD reconciling a repo with one Deployment in it
GitOps is great when there are sixty workloads, eight teams and a compliance officer who needs an audit trail. For one binary, it is a continuous reconciliation loop that exists to give itself something to reconcile.
15. Multi-cluster federation "for HA"
You have three clusters because someone read about it. The agent runs in one of them. The other two are warm spares for a failover that has never been tested. Multi-cluster makes sense above a certain workload count. You are not above it.
16. Vault + External Secrets Operator for two API keys
The model provider key and the SMTP password. Two strings. Vault is correct at a certain scale. At two secrets, an EnvironmentFile= owned by the service user with mode 0400 is correct. Rotate quarterly. Document the path.
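The "mode 0400" rule is easy to let drift, so a small check can sit next to the deploy script. This is a sketch under assumptions: the function name is made up, and the ownership check is optional because the service user's uid is not given in the text.

```python
import os
import stat
from typing import List, Optional


def secret_file_problems(path: str, expected_mode: int = 0o400,
                         expected_uid: Optional[int] = None) -> List[str]:
    """Return a list of human-readable problems; an empty list means the file looks right."""
    st = os.stat(path)
    problems = []
    mode = stat.S_IMODE(st.st_mode)  # permission bits only, without the file-type bits
    if mode != expected_mode:
        problems.append(f"mode is {mode:04o}, expected {expected_mode:04o}")
    if expected_uid is not None and st.st_uid != expected_uid:
        problems.append(f"owner uid is {st.st_uid}, expected {expected_uid}")
    return problems
```

Pointed at whatever path the unit's EnvironmentFile= line names, it gives the quarterly rotation a concrete thing to verify.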
17. Istio for east-west traffic that has no east
The service mesh is between the agent and itself. The mTLS is between two halves of the same binary. The control plane uses more memory than the application. Remove.
18. Separate dev / staging / prod clusters when a branch deploy script works
A 200-rpm internal tool does not need a staging environment that mirrors production exactly. It needs a second CX22 that you push the staging branch to. €4.51 a month, and a deploy script that diffs by branch name.
What to ship on Monday
If this is your stack and you recognise more than three of the eighteen, do not try to migrate everything at once. The order that works:
- Stand up the reference box. Hetzner CX22, Caddy, systemd, your agent binary. Run it in parallel for a week against the production load you actually have.
- Cut over read-only traffic first. Triage decisions can be made twice and discarded once until you trust the new box.
- Once parity is established, kill the tier-three platform last. The Argo CD rip-out is the easiest decision after the reference box has been running for thirty days without an incident.
Two specific traps in the parallel-run period. The first: do not connect the new box to the production SMTP relay until read-only matches are stable for at least a week. A misclassified reply will go out twice, and somebody on the customer side will notice before you do. The second: write the comparison harness before either stack does any real work that day. The output of the parallel run is the diff, not the demo. If you cannot answer "how did the two stacks disagree last Tuesday" in thirty seconds, the harness is not finished, and you are flying blind in a fog you built yourself.
The temptation will be to skip the parallel run and "just migrate." Do not. The whole point of the reference stack is that it costs €4.51 to keep around as proof.
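The comparison harness described above does not need to be elaborate. A minimal sketch, assuming each stack can export its decisions for a given day as a mapping from message ID to decision label (the export format and the function name are assumptions):

```python
from typing import Dict, List, Tuple


def compare(old: Dict[str, str], new: Dict[str, str]) -> Dict[str, object]:
    """Diff two stacks' decisions over the messages both of them saw."""
    shared = sorted(old.keys() & new.keys())
    disagreements: List[Tuple[str, str, str]] = [
        (mid, old[mid], new[mid]) for mid in shared if old[mid] != new[mid]
    ]
    match_rate = 1.0 if not shared else (len(shared) - len(disagreements)) / len(shared)
    return {
        "match_rate": match_rate,
        "disagreements": disagreements,           # (message_id, old decision, new decision)
        "only_in_old": sorted(old.keys() - new.keys()),
        "only_in_new": sorted(new.keys() - old.keys()),
    }


if __name__ == "__main__":
    old = {"m1": "billing", "m2": "general", "m3": "ops", "m4": "ops"}
    new = {"m1": "billing", "m2": "billing", "m3": "ops", "m4": "ops"}
    report = compare(old, new)
    print(f"match rate: {report['match_rate']:.1%}")
    for mid, was, now in report["disagreements"]:
        print(f"{mid}: old={was} new={now}")
```

Messages seen by only one stack are reported separately instead of being counted as mismatches, since a dropped message is a different failure from a different decision.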
The bill at the bottom
The Utrecht logistics company ran the parallel reference box for forty-two days. Triage decisions matched on 99.4 per cent of mail. The 0.6 per cent that did not were mostly cases where the old stack had been silently dropping a header. They cut over in March. The cluster bill went from €630 a month to €11. The on-call burden went from one rotation a week to zero in seventy days.
When we built that email-triage AI agent, the problem we ran into was not technical. The engineering team had emotionally committed to the platform before they had measured the workload. We solved it by running both stacks in parallel for forty-two days and letting the numbers make the argument.
Open a terminal, run kubectl get pods -A, and count the rows that exist to serve one binary doing 200 rpm. That number is your over-engineering tax. The Caddyfile is two screens long. Write it tonight.
Key takeaway
If the workload is a single binary doing 200 rpm, the right amount of infrastructure is a €4 VPS, Caddy, and a systemd timer.
FAQ
When does Kubernetes actually make sense for an internal agent?
When you are running ten or more independent workloads, have a platform team to maintain it, and need scheduled bin-packing across nodes. None of that applies to a single 200-rpm agent.
What about high availability on a single VPS?
Two CX22s behind a small load balancer with health checks gives you very high uptime for under €12 a month. Real geo-distributed HA is a different conversation and rarely justified for internal tools.
How do we audit a single-VPS setup?
Weekly snapshots, a log shipper to a second machine, and the deploy script in git. Auditability comes from process, not from the platform you happen to run on.
What if traffic grows ten times?
2000 rpm is still well within a single CX22's headroom. When you cross 10k sustained rpm, revisit. Until then, vertical scaling on one box is cheaper than horizontal on three.
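For a sense of scale, the request rates mentioned in this guide convert to requests per second as follows. This is plain arithmetic, not a benchmark of any particular machine.

```python
def rpm_to_rps(rpm: float) -> float:
    """Convert requests per minute to requests per second."""
    return rpm / 60


if __name__ == "__main__":
    # Today's load, the ten-times case, and the suggested revisit threshold.
    for rpm in (200, 2000, 10_000):
        print(f"{rpm} rpm is about {rpm_to_rps(rpm):.1f} requests per second")
```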