Chat agent stack scoring: cost, AVG, and Friday CVEs

A reporting agent pulling 240 weekly client reports has three plausible homes. We score them on per-client cost, AVG-defensible logs (AVG is the Dutch name for the GDPR), and who answers the pager on a Friday.

Jacob Molkenboer · Founder · A Brand New Company · 6 Sept 2025 · 8 min

Tuesday morning at a 28-person Amsterdam agency. The client services lead wants a chat agent that answers "how did the May campaign do for client Y" by pulling GA4, Meta Ads, and a Mollie payments feed, summarising it, and dropping a Slack message before standup. The operations manager wants to know who owns the box when the model breaks. The founder wants the per-client cost on a single line of a Google Sheet.

Three stacks come up in the meeting. We have a scoring method for picking between them. It takes 15 minutes and it rules out the wrong ones before the build kicks off.

The workload, in numbers

Map the workload first or every cost estimate downstream is fiction. A typical Dutch retainer agency at the size we work with looks like this:

14 active retainer clients
~17 weekly pulls per client (scheduled reports plus ad-hoc questions in Slack), so roughly 240 chat-agent calls a week
Each call: ~1.2k input tokens of structured numbers plus a 600-token output
Audit trail must survive an AVG question from a client's DPO without a six-hour scramble
The engineer who patches the stack lives in Eindhoven and is asleep at 18:00 CET on Friday

That last constraint is the one most stack debates forget. It is also the one that decides the answer.
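For reference, the arithmetic behind the list above fits in a few lines. This is a sketch only: the variable names are ours for illustration, and the figures are the ones quoted in the list.

```python
# Workload arithmetic from the list above. Names are illustrative.
CLIENTS = 14
PULLS_PER_CLIENT_PER_WEEK = 17      # approximate: scheduled plus ad-hoc
INPUT_TOKENS_PER_CALL = 1_200
OUTPUT_TOKENS_PER_CALL = 600

exact_calls = CLIENTS * PULLS_PER_CLIENT_PER_WEEK   # 238
weekly_calls = 240                                  # the rounded figure used in this post

input_tokens = weekly_calls * INPUT_TOKENS_PER_CALL
output_tokens = weekly_calls * OUTPUT_TOKENS_PER_CALL
total_tokens = input_tokens + output_tokens

print(f"calls/week: {exact_calls} (rounded to {weekly_calls})")
print(f"tokens/week: {total_tokens:,} ({input_tokens:,} in, {output_tokens:,} out)")
```

Swap in your own client count and pull frequency before trusting any per-client cost further down.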

Cloudflare Workers AI plus D1

The thinnest stack. Workers AI runs Llama 3.x or Mistral inference at the edge, D1 is SQLite-on-the-edge, Vectorize handles embeddings. Total infrastructure footprint: a wrangler.toml.

Cost math at 240 weekly pulls:

~432k tokens processed per week (240 calls × 1.8k tokens, of which ~144k are generated), plus embedding calls
Single-digit euros per client per month at current Workers AI neuron pricing for a Llama 3.3 class model
D1 and Vectorize free tier covers it until you cross a few million rows of audit log

What we like: zero ops. Every request lands on a deployed Worker. EU data residency is a checkbox in Workers AI, which gives you a credible answer to the first question a Dutch DPO asks.

What hurts: model availability moves under you. We have shipped agents where the model behind a Workers AI alias quietly changed and the prompt regressed by Wednesday. If your reporting agent depends on Mistral 7B's specific way of rendering tables, you own that compatibility risk forever. The fix is to pin to a specific model version in the binding, not the alias, and to wire a smoke-test prompt that runs on deploy.
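A deploy-time smoke test of the kind described above could look roughly like this. It is a sketch under assumptions: `PINNED_MODEL`, `SMOKE_PROMPT`, and the `run_model` callable are hypothetical placeholders, and the real inference call depends on how your Worker exposes the model. The stand-in `fake_model` exists only so the example runs offline.

```python
from typing import Callable

# Hypothetical identifier: substitute the exact versioned model name you deploy.
PINNED_MODEL = "example-provider/example-model-v1"

SMOKE_PROMPT = (
    "Render this as a two-column Markdown table with header 'metric | value':\n"
    "sessions=1200, revenue=3400"
)


def smoke_test(run_model: Callable[[str, str], str]) -> list[str]:
    """Return a list of failures; an empty list means the deploy can proceed."""
    output = run_model(PINNED_MODEL, SMOKE_PROMPT)
    failures = []
    if "|" not in output:
        failures.append("no table delimiter in output")
    for expected in ("sessions", "1200", "revenue", "3400"):
        if expected not in output:
            failures.append(f"missing {expected!r} in output")
    return failures


def fake_model(model: str, prompt: str) -> str:
    """Stand-in for the real inference call, so the sketch runs offline."""
    return "| metric | value |\n|---|---|\n| sessions | 1200 |\n| revenue | 3400 |"


if __name__ == "__main__":
    problems = smoke_test(fake_model)
    print("OK" if not problems else problems)
```

The checks are deliberately dumb string tests. The goal is to notice that table rendering changed, not to grade the model.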

Friday CVE response: Cloudflare patches. You read a status page. Eindhoven sleeps.

Supabase plus pgvector

Postgres-shaped people pick this. You get auth, row-level security, pgvector, and Edge Functions in one project. The chat agent itself calls an external LLM (Anthropic, Mistral La Plateforme, OpenAI) because Supabase does not host models. That is a feature, not a bug. It means the day a new checkpoint ships, you swap one string in an Edge Function and run an eval suite against the audit log of the last 200 prompts.

Cost math at 240 weekly pulls:

Supabase Pro: roughly 25 euro per month, fixed regardless of client count up to the included quota
External LLM tokens: more per call than Workers AI, with bigger context and stronger reasoning
pgvector at 14 clients times a few thousand chunks is nowhere near a tuning problem

What we like: pgvector is real Postgres. The audit story writes itself. You own an audit_log table, you back it up nightly to a Hetzner box, you grant read access to a DPO. The agency's lawyer can read the schema, which is more than they can say about most SaaS audit trails.

Before you log every prompt into one flat table, plan partitioning. The operational reality bites in year two: a retention sweep over a few million rows of unpartitioned audit log can take the database offline for tens of minutes while it rewrites indexes. Partition by month from day one (PostgreSQL declarative partitioning docs) and a yearly rollover becomes a metadata operation.

Friday CVE response: Supabase patches Postgres. You patch your Edge Functions. The model provider patches itself. Three pagers, and in the common case the only one that can be yours is the Edge Function code.

Self-hosted Ollama plus Qwen3

The "under a partner's desk" stack. An RTX 4090 or two, Ollama serving Qwen3 14B or 32B, Postgres next to it for logs, and a WireGuard tunnel back to the agency's VPC.

Cost math at 240 weekly pulls:

Hardware: ~3,000 euro one-time for a 4090 box, ~80 euro per month electricity at Dutch grid prices
LLM tokens: zero marginal
One engineer's attention, which is the line item most decks leave out and the line item that decides whether this stack is cheap or expensive over 24 months
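To put the list above on a per-client basis, here is a small sketch that uses only figures from this post: the hardware and electricity numbers above, the 14 clients from the workload section, and the 350 to 600 euro maintenance range quoted further down as a proxy for the engineer's attention. The function name and parameters are ours.

```python
def per_client_monthly_eur(hardware_eur, months, electricity_eur, clients,
                           maintenance_eur=0.0):
    """Amortise hardware over `months`, add monthly running costs, split per client."""
    monthly_total = hardware_eur / months + electricity_eur + maintenance_eur
    return monthly_total / clients


# Hardware and electricity only.
base = per_client_monthly_eur(3000, 24, 80, clients=14)

# Same box with the low and high end of a maintenance contract.
with_low = per_client_monthly_eur(3000, 24, 80, clients=14, maintenance_eur=350)
with_high = per_client_monthly_eur(3000, 24, 80, clients=14, maintenance_eur=600)

print(f"hardware + electricity: {base:.2f} EUR per client per month")
print(f"with maintenance:       {with_low:.2f} to {with_high:.2f} EUR per client per month")
```

The maintenance term dominates once it is included, which is why it is the line item worth arguing about.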

What we like: data never leaves the building. For an agency with healthcare, legal, or municipality clients, that argument writes itself in the procurement form. Qwen3 32B is genuinely competent at the structured-numbers summarisation a reporting agent does. Inference at around 40 tokens per second on a single 4090 is fine for an asynchronous Slack-bot workload.

What hurts: model provenance. "Homegrown" open-weights deployments routinely turn out, on inspection, to be a merge of two existing checkpoints with a thin fine-tune on top. That is the same trap waiting for any agency that markets "our own AI" to a Dutch client. If you are going to claim self-hosted to a procurement officer, host weights you verified against the SHA on Hugging Face, write down which checkpoint you serve, and keep the merge config in the same git repo as the deployment.
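Verifying a weights file against a published SHA-256 takes only the standard library. A minimal sketch: the file path and the expected digest in the usage comment are placeholders, and you would copy the real digest from the model's file listing yourself.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks so multi-gigabyte weights do not need to fit in RAM."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_weights(path: Path, expected_sha256: str) -> bool:
    """True when the local file matches the published digest."""
    return sha256_of_file(path) == expected_sha256.strip().lower()


# Usage (placeholder path and digest):
# ok = verify_weights(Path("/models/your-checkpoint.safetensors"), "<published sha256>")
```

Record the digest and the checkpoint name next to the deployment config, so the answer to a procurement question is a file in git rather than a recollection.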

Friday CVE response: yours. A CVE on Ollama, on the container runtime, on the kernel, on cuDNN, on the NVIDIA driver. You patch, you restart, you verify the embedding endpoint still returns the right vectors. Eindhoven's weekend is gone. Unless you put a maintenance contract on it, which for our clients usually lands at 350 to 600 euro per month and is the line item that ends the "but it is free" conversation.

The scorecard we actually use

We score each stack on five lines, not three, because the hidden costs sit between the obvious ones.

Line	Cloudflare	Supabase	Ollama box
Per-client €/mo at 240 weekly pulls	3 to 6	10 to 25	~15 (hardware amortised over 24 months plus electricity, before maintenance)
AVG audit story	Good	Excellent	Excellent
Friday CVE pager	Cloudflare	Supabase plus you	You
Model lock-in	High	Low	Low
Time-to-first-working-agent	3 days	5 days	12 days

We weight these per agency. A 28-person agency with two Drupal sites and zero dedicated DevOps gets routed to Supabase 80% of the time. A boutique that handles tenders for municipalities gets routed to the Ollama box, with a paid maintenance contract attached to the SOW. A pure-marketing shop with consumer e-commerce clients gets Cloudflare.

We do mix stacks. A reasonable hybrid is Cloudflare for the public chat surface, Supabase for the audit and pgvector store, an external LLM for the model. The Ollama box is the only one that does not compose well with the others, because the whole point of it is that nothing leaves the LAN.
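The per-agency weighting can be sketched as a weighted sum over the five scorecard lines. Everything numeric here is a placeholder: the 1 to 5 scores are a loose, illustrative mapping of the table (5 is best for the agency), and the weights are one hypothetical agency without dedicated DevOps. Hard constraints such as "nothing leaves the LAN" are not a score and should be decided before any weighting.

```python
# Illustrative 1-5 scores (5 = best for the agency). Placeholders, not measurements.
SCORES = {
    "cloudflare": {"cost": 5, "audit": 3, "pager": 5, "lock_in": 1, "time_to_ship": 5},
    "supabase":   {"cost": 3, "audit": 5, "pager": 3, "lock_in": 5, "time_to_ship": 4},
    "ollama-box": {"cost": 3, "audit": 5, "pager": 1, "lock_in": 5, "time_to_ship": 2},
}


def rank_stacks(weights):
    """Return (stack, weighted total) pairs, best first."""
    totals = {
        stack: sum(weights[line] * score for line, score in lines.items())
        for stack, lines in SCORES.items()
    }
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical weights for an agency with no dedicated DevOps.
no_devops_agency = {"cost": 1, "audit": 2, "pager": 3, "lock_in": 2, "time_to_ship": 1}

for stack, total in rank_stacks(no_devops_agency):
    print(f"{stack:11s} {total}")
```

The value of writing it down is that the weights become something the operations manager can disagree with.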

The decision code

We literally run this in a notebook when scoping a build.

def pick_stack(agency):
    if agency.handles_health_or_municipality_data:
        return "ollama-on-prem"
    if agency.client_count < 8 and agency.values_speed_to_ship:
        return "cloudflare"
    if agency.has_dedicated_devops and agency.values_postgres:
        return "supabase"
    return "supabase"  # the safe default

It is not subtle. That is the point. Three ifs, checked in priority order, force you to name the constraints out loud, in front of the operations manager. If she pushes back on values_speed_to_ship, you have found the real argument before the contract is signed.
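To run the function as-is you need something that carries those attributes. A minimal sketch: the `Agency` dataclass and the three example agencies are our assumptions for illustration, and `pick_stack` is repeated unchanged so the snippet runs on its own.

```python
from dataclasses import dataclass


@dataclass
class Agency:
    """Hypothetical container for the constraints named during scoping."""
    client_count: int
    handles_health_or_municipality_data: bool = False
    values_speed_to_ship: bool = False
    has_dedicated_devops: bool = False
    values_postgres: bool = False


def pick_stack(agency):
    if agency.handles_health_or_municipality_data:
        return "ollama-on-prem"
    if agency.client_count < 8 and agency.values_speed_to_ship:
        return "cloudflare"
    if agency.has_dedicated_devops and agency.values_postgres:
        return "supabase"
    return "supabase"  # the safe default


# Example agencies (illustrative values).
opening_agency = Agency(client_count=14)
tender_boutique = Agency(client_count=6, handles_health_or_municipality_data=True)
small_fast_shop = Agency(client_count=5, values_speed_to_ship=True)

for agency in (opening_agency, tender_boutique, small_fast_shop):
    print(agency.client_count, "clients ->", pick_stack(agency))
```

Note that the sensitive-data check comes first, so a small boutique that also values speed still lands on the on-prem box.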

The logging schema we ship

Whichever stack wins, the audit table is the same. We have used this on every chat-agent build for the last 18 months.

create table chat_audit (
  id uuid not null default gen_random_uuid(),
  ts timestamptz not null default now(),
  client_id text not null,
  user_email text not null,
  prompt_hash text not null,        -- sha256 of input, not the input
  prompt_redacted text,             -- PII-scrubbed copy for audit
  model text not null,
  input_tokens int,
  output_tokens int,
  response_redacted text,
  data_source text,                 -- 'ga4', 'meta_ads', 'mollie'
  cost_eur numeric(8,4),
  primary key (id, ts)              -- on a partitioned table the key must include the partition column
) partition by range (ts);

create table chat_audit_2026_06 partition of chat_audit
  for values from ('2026-06-01') to ('2026-07-01');

Warning

Never store the raw prompt if it contains client PII. Hash it, redact a copy, and keep the redacted copy. The Autoriteit Persoonsgegevens (AP), the Dutch data protection authority, treats verbatim prompts as personal data the moment a name or email appears in them. Skip this and a DPO question turns into an incident, not a log query.
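The hash-and-redact step can be sketched with the standard library. This is a minimal illustration, not a complete PII scrubber: the regex only catches email addresses, the function names are ours, and names, phone numbers, or customer IDs would need their own rules or a dedicated redaction tool.

```python
import hashlib
import re

# Deliberately simple email pattern; real-world PII needs more than this.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def prompt_hash(prompt: str) -> str:
    """SHA-256 of the raw prompt, for the prompt_hash column."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()


def redact(prompt: str) -> str:
    """Replace email addresses with a fixed token, for the prompt_redacted column."""
    return EMAIL_RE.sub("[email]", prompt)


def audit_fields(prompt: str) -> dict:
    """Build the two prompt-related columns without keeping the raw text."""
    return {"prompt_hash": prompt_hash(prompt), "prompt_redacted": redact(prompt)}


fields = audit_fields("How did May do for jan@example.nl?")
print(fields["prompt_redacted"])
```

The hash lets you prove that two log rows came from the same input without being able to read that input back.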

The partition is the operational lesson. A flat audit table grows linearly with usage and never shrinks. Partitioning by month means a yearly retention rollover is a drop table chat_audit_2025_06 that runs in milliseconds, not a delete-where job that pages you on a Sunday afternoon.
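Creating next month's partition and dropping an expired one is easy to script. A sketch, assuming the `chat_audit_YYYY_MM` naming used above. The helper names are ours, and how you schedule and execute the statements (cron, a migration tool, a database job) is left open.

```python
from datetime import date


def month_bounds(year: int, month: int):
    """First day of the month and first day of the following month."""
    start = date(year, month, 1)
    end = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
    return start, end


def create_partition_sql(year: int, month: int, parent: str = "chat_audit") -> str:
    """DDL for one monthly partition; the upper bound is exclusive."""
    start, end = month_bounds(year, month)
    name = f"{parent}_{year}_{month:02d}"
    return (
        f"create table if not exists {name} partition of {parent}\n"
        f"  for values from ('{start.isoformat()}') to ('{end.isoformat()}');"
    )


def drop_partition_sql(year: int, month: int, parent: str = "chat_audit") -> str:
    """Retention: dropping the partition removes that month's rows in one statement."""
    return f"drop table if exists {parent}_{year}_{month:02d};"


print(create_partition_sql(2026, 6))
print(drop_partition_sql(2025, 6))
```

Create partitions a month or two ahead of time, because an insert with a timestamp that matches no partition fails.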

The AVG bit, plainly

The Autoriteit Persoonsgegevens cares about three things when a chat agent touches personal data: lawful basis, proportionality, and the audit trail. The stack you pick changes the third. On Cloudflare you write logs into Workers Analytics Engine or D1 and accept the retention defaults. On Supabase you own the schema and the backups. On the Ollama box you own the disk.

Defensibility ranking in our experience: Ollama, then Supabase, then Cloudflare. Cost ranking is roughly the inverse. The agency picks where it sits on that curve.

What we would actually build this month

For the Tuesday-morning agency in the opening, 14 retainer clients, no dedicated DevOps, mostly DTC e-commerce: Supabase. The audit story is clean, the cost is predictable, and when a new model checkpoint ships we swap one string. We built the same reporting agent for a Rotterdam media buyer earlier this year on exactly this stack. The thing we ran into was that pgvector hybrid search on raw GA4 dimensions returned junk until we pre-clustered the dimensions into a "campaign profile" embedding per report period. We solved it with a nightly job that builds the profile vectors before the agent runs, so the chat-time query is always a fast cosine lookup. That work lives in our AI agents practice.

Open a spreadsheet. Three columns: stack, per-client euro per month, who carries the pager on Friday. Fill in your real client count, your real DevOps capacity, your real risk tolerance. The right answer falls out in fifteen minutes.

Key takeaway

Pick the chat agent stack by who pays the Friday CVE pager bill, not by who has the prettiest demo at scoping time.

FAQ

Why not just use OpenAI's Assistants API and skip the stack question?

It collapses to Supabase plus an LLM provider in our scorecard, with worse audit ownership. You do not own the logs, you do not own the orchestration, and your AVG answer becomes a vendor questionnaire instead of a schema.

How does Qwen3 compare to Llama 3.3 70B for report summarisation?

On structured numeric summaries (GA4, ad spend tables), Qwen3 32B lands within a few percent of Llama 3.3 70B on our internal evals, at a fraction of the GPU footprint. For free-text client comms, Llama 3.3 still leads.

What pushes a build off Cloudflare onto Supabase?

When a single client crosses roughly 80 daily pulls or starts using long contexts (8k+ input tokens), the Workers AI bill catches up to a Supabase plus external-LLM equivalent. That is the decision point we have seen most often.

Can we run Supabase and the Ollama box together?

Yes, if you are careful. Use Supabase for auth and the audit trail, and route inference to the on-prem Ollama through a private endpoint. The risk is that the audit log now leaves the building, so the AVG argument for self-hosting weakens.
