
Pinecone vs Qdrant vs pgvector: what RAG actually costs

An Eindhoven engineering firm asked us to pick a vector store for their 2.1M-document RAG agent. We benchmarked three options on their actual traffic.

Jacob Molkenboer · Founder · A Brand New Company · 10 Jun 2026 · 8 min


The founding partner sat across from us in a meeting room above a bakery on the Stratumseind in Eindhoven. "I want our junior engineers to ask the archive a question and get a real answer, with the source PDF attached," he said. "I do not want to spend €40k a year on it." The firm has 26 people and 2.1 million documents going back to 2011.

So we built the RAG agent. The corpus is mostly PDFs, some DWG metadata, and a long tail of old Word files. After we pruned boilerplate, it chunked into roughly 9.4 million 768-dim vectors. The hard question was not the LLM. It was where to put the vectors.

This post is the back-of-napkin we showed them, with the actual numbers from their actual traffic.

The shortlist

Three candidates made the final round.

Pinecone serverless in eu-west-1 (Ireland). The "we will not own infrastructure" option.
Qdrant Cloud on a managed cluster in Frankfurt. The "managed but predictable" option.
pgvector on a single Hetzner CCX43 in Falkenstein. The "we already speak Postgres" option.

We ruled out Weaviate, Chroma Cloud and Vespa for reasons that mattered for this client but would bore you here. Short version: data residency required EU-hosted infrastructure the firm could audit, and the team had to be able to operate it without hiring a dedicated platform engineer.

The test rig

Same corpus, same embedding model (a 768-dim open model we run on a small GPU box), same chunking strategy (1,800-character windows with 200-character overlap), same hybrid retrieval (vector search plus BM25, followed by a reranker), same evaluation set of 412 real engineering questions written by the partners.
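For readers who want to reproduce the chunking step, here is a minimal sketch of fixed-size character windows with overlap. It assumes plain character offsets on already-extracted text; the function name and the handling of the final window are illustrative choices, not the pipeline used in this benchmark.

```python
def chunk_text(text: str, window: int = 1800, overlap: int = 200) -> list[str]:
    """Split text into character windows that share `overlap` characters."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must be non-negative and smaller than window")
    step = window - overlap  # with the defaults, each window starts 1,600 chars later
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + window])
        if start + window >= len(text):
            break  # this window already reached the end of the text
    return chunks


# Example: a 5,000-character string yields three windows.
sample = "".join(str(i % 10) for i in range(5000))
chunks = chunk_text(sample)
```

The last 200 characters of each window repeat as the first 200 of the next, so a sentence cut at a boundary still appears whole in one chunk.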

We loaded all 9.4M vectors into each system, ran the eval set five times against each, and measured three things:

Query latency p50 / p95 / p99 from a load test running 20 concurrent users for one hour.
Cost per 1,000 queries, including storage, reads, writes during reindex, and (for Hetzner) amortised server cost.
Time to reindex the full corpus, because the firm adds about 4,000 documents a week and re-embeds and reindexes everything quarterly when the embedding model changes.

Queries went through the agent's full pipeline, not just the vector lookup. So the latency numbers include the BM25 join and the reranker. The interesting comparison is the delta between systems with everything else held constant.
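If you want to compute the same p50 / p95 / p99 figures from your own load test, a nearest-rank percentile is enough. This is a sketch under that assumption; the post does not say which percentile method or load-testing tool produced the numbers here, and the function names are hypothetical.

```python
import math


def percentile(samples_ms: list[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% at or below it."""
    if not samples_ms:
        raise ValueError("need at least one latency sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples_ms)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank into the sorted list
    return ordered[rank - 1]


def latency_summary(samples_ms: list[float]) -> dict[str, float]:
    """Summarise end-to-end latencies the way the benchmark reports them."""
    return {f"p{p}": percentile(samples_ms, p) for p in (50, 95, 99)}


# Example with synthetic latencies of 1 ms through 100 ms.
latencies = [float(ms) for ms in range(1, 101)]
summary = latency_summary(latencies)
```

Feed it the end-to-end timings, not just the vector lookup, so the numbers stay comparable across stores.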

Pinecone serverless

Pinecone's serverless pricing is reads, writes and storage. For 9.4M vectors at 768 dims, our storage line came to roughly €38 per month. The reads, at the firm's projected 1,400 queries per day, were €11 per month. Initial write to load the corpus: €19 one-time. Reindex with a new embedding model every quarter: another €19.

Steady state, about €49 / month, or €0.0011 per query at their volume.

p95 latency over the test: 142 ms. p99: 287 ms.

The number that hurt was not the cost. It was the cold-read tail. Pinecone serverless pages indexes in and out of memory based on access patterns, and the firm's archive has long-tail queries (someone asks about a 2011 culvert calculation once a year). On those queries we saw the occasional 800 ms response. Not a deal-breaker, but the partners noticed.

One more thing to watch: Pinecone's per-query cost looks small until you have a chatty agent. Ours rewrites the user query into three sub-queries, hits the index for each, then does a follow-up retrieval after the first generation. That is four reads per question. Multiply accordingly before you sign off on the bill.
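The multiplication is simple enough to sketch. The version below assumes a 30-day month and the pattern described above (three sub-queries plus one follow-up retrieval); the function names are ours for illustration.

```python
def reads_per_question(sub_queries: int, follow_ups: int) -> int:
    """Index reads triggered by a single user question."""
    return sub_queries + follow_ups


def monthly_reads(questions_per_day: int, reads_each: int, days: int = 30) -> int:
    """Total index reads per month at a given question volume."""
    return questions_per_day * reads_each * days


reads_each = reads_per_question(sub_queries=3, follow_ups=1)  # four reads per question
naive = monthly_reads(1_400, 1)            # what a one-read-per-question projection assumes
actual = monthly_reads(1_400, reads_each)  # what the agent actually issues
```

At 1,400 questions per day, the naive projection is 42,000 reads per month and the amplified one is 168,000.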

Qdrant Cloud

We provisioned a single 4 vCPU / 16 GB RAM node in Frankfurt on Qdrant Cloud. The HNSW index for 9.4M vectors fit in memory with about 3 GB to spare. List price: €158 per month.
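Before you size a node, it helps to estimate the raw vector payload. The sketch below multiplies vector count, dimensions, and bytes per component, and ignores HNSW graph overhead and payload data. The post does not state which storage precision was used or whether vectors lived on disk, so both precisions shown are assumptions for illustration.

```python
GB = 10**9  # decimal gigabytes


def raw_vector_bytes(num_vectors: int, dims: int, bytes_per_component: int) -> int:
    """Bytes needed to hold the vectors alone, without any index structures."""
    return num_vectors * dims * bytes_per_component


# 9.4M vectors at 768 dimensions, as in this corpus.
float32_gb = raw_vector_bytes(9_400_000, 768, 4) / GB  # uncompressed float32
int8_gb = raw_vector_bytes(9_400_000, 768, 1) / GB     # 8-bit scalar quantization
```

Under these assumptions, float32 vectors alone come to about 28.9 GB and int8 vectors to about 7.2 GB, so whether a corpus this size fits in 16 GB of RAM depends heavily on quantization and on-disk storage settings.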

p95 latency: 47 ms. p99: 89 ms.

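At 1,400 queries per day with four reads each, that is roughly 168,000 reads per month, so about €0.00094 per index read. On paper that looks almost identical to Pinecone, but the denominators differ: this figure divides by index reads, while Pinecone's €0.0011 appears to divide by user questions (about 42,000 per month), which would put Qdrant at roughly €0.0038 per question.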

The qualitative difference was predictability. A fixed node, a fixed bill, no surprise in the month the firm runs an internal hackathon and the agent gets hammered. We also liked the filtered search performance. The engineering firm filters every query by project ID and document type, and Qdrant's payload index handled that without the HNSW degeneracy you sometimes see when you bolt filters onto a pure vector index.

Reindex time for the full corpus on this node: 38 minutes.

pgvector on Hetzner

A single Hetzner CCX43 (16 dedicated vCPU, 64 GB RAM, 360 GB NVMe) in Falkenstein costs €127.99 / month at the EU VAT-exempt B2B rate, IPv4 included. We ran Postgres 16 with pgvector 0.7 and the HNSW index type.

p95 latency: 34 ms. p99: 71 ms.

Per-query cost, amortised: €0.00091. Effectively tied with the two managed options.

Reindex time: 2 hours 41 minutes. Slower than Qdrant, but it runs in a weekend cron and nobody is waiting on it.

The index build looked like this:

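-- Build the HNSW index for cosine distance
CREATE INDEX docs_embedding_hnsw
  ON document_chunks
  USING hnsw (embedding vector_cosine_ops)
  WITH (m = 24, ef_construction = 128);

-- Query filtered by project and document type.
-- $1: array of project IDs, $2: array of document types, $3: query embedding.
-- Note: with an HNSW index, pgvector 0.7 applies the WHERE filters after the
-- index scan, so a very selective filter can return fewer than LIMIT rows
-- unless hnsw.ef_search is raised.
SELECT chunk_id, document_id, content
FROM document_chunks
WHERE project_id = ANY($1)
  AND doc_type = ANY($2)
ORDER BY embedding <=> $3  -- <=> is pgvector's cosine distance operator
LIMIT 40;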

The wins: the firm already runs Postgres. Their role-based access policies (engineers can see project X, contractors only see published deliverables) live in the same database as the vectors, so row-level security composes for free. Backups are pg_dump. Monitoring is the Postgres dashboard the IT lead already had open.

The losses: we own it. Patching, point-in-time recovery, the day a pgvector upgrade changes query plans under us. We added a Hetzner storage box for WAL archiving (€3.81 / month) and a second CCX23 as a warm replica (€31.90 / month), which brought the real bill to about €163 / month, slightly more than Qdrant Cloud.

What we picked

pgvector on Hetzner. Not because it was cheapest (it wasn't, by the time we added a replica), and not because it was fastest (the p95 win over Qdrant was 13 ms, which no human will notice through a chat UI). We picked it because the firm's IT lead can SSH into the box, the access control story uses the row-level security policies they already wrote in 2019, and the next time a junior asks "why did the agent retrieve that document?" he can run a SQL query and find out.

The honest summary:

Option	€ / month	€ / query	p95	Ops burden
Pinecone serverless	49	0.0011	142 ms	Near zero
Qdrant Cloud (Frankfurt, 4vCPU/16GB)	158	0.00094	47 ms	Low
pgvector on Hetzner CCX43 + replica	163	0.00091	34 ms	Medium
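The € / query column is sensitive to what counts as a query: a user question or an index read. The sketch below recomputes both from the monthly bills in the table, assuming a 30-day month and the four reads per question described earlier. Pinecone's bill is usage-based, so treating it as a fixed monthly amount is a simplification, and the helper name is ours.

```python
def cost_per_unit(monthly_cost_eur: float, units_per_month: int) -> float:
    """Monthly bill divided by a monthly volume (questions or reads)."""
    if units_per_month <= 0:
        raise ValueError("units_per_month must be positive")
    return monthly_cost_eur / units_per_month


QUESTIONS_PER_MONTH = 1_400 * 30            # 42,000 user questions
READS_PER_MONTH = QUESTIONS_PER_MONTH * 4   # 168,000 index reads

# Monthly bills in euros, taken from the table above.
monthly_bills = {
    "Pinecone serverless": 49,
    "Qdrant Cloud": 158,
    "pgvector on Hetzner + replica": 163,
}

for name, bill in monthly_bills.items():
    per_question = cost_per_unit(bill, QUESTIONS_PER_MONTH)
    per_read = cost_per_unit(bill, READS_PER_MONTH)
    print(f"{name}: €{per_question:.5f} per question, €{per_read:.5f} per read")
```

Whichever denominator you pick, use the same one for every option before comparing rows.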

The economics shift at scale. If the firm's volume tripled, Pinecone would still be the cheapest by a wide margin. If it grew 10x and the vector count doubled, Qdrant's fixed node would need to be a bigger node and the gap would close further. pgvector on Hetzner does not scale linearly past roughly 50M vectors on a single box. You can shard it, but at that point you are running a database and you should have hired a DBA.

Takeaway

For a single-tenant RAG agent under 20M vectors and under 100k queries per day, the vector store is rarely your cost bottleneck. Pick the one your team can operate at 3am when something breaks.

What the vendor calculators bury

Three things missing from every pricing page we read while doing this.

Agent-driven amplification. Modern RAG agents do not query once per user question. Ours does four. Some do twelve. If your projection is 1,000 questions per day, your read volume is probably 4,000 to 12,000.

Reindex cost. When you change embedding models (and you will, the field moves) you pay the full write cost again, and sometimes the storage cost twice for a few days. Pinecone bills writes. Qdrant and pgvector do not, but Qdrant's CPU goes to 100% during the reindex and you may need a bigger node for that window.

Egress. Pinecone and Qdrant return your query results over the public internet. If your agent runs on a different cloud, you pay that cloud's egress on the way to the LLM. With Hetzner and Postgres on the same private network, that line is zero.

We did not benchmark recall systematically because all three systems scored within 1.5 points of each other on the firm's eval set with default HNSW settings. The retriever was not the bottleneck for answer quality. The chunking strategy was, and that is a separate post.

The smallest useful next step

When we built the RAG agent for the Eindhoven firm, the thing we ran into was not the vector store at all. It was that 18% of the PDFs were scanned with no OCR layer, and the partners had been quietly searching by file name for years. We ended up running a one-time Tesseract pass over the archive and writing the OCR text back to a sidecar table, which doubled retrieval recall on documents older than 2017.

If you are sizing a vector store this quarter, the smallest useful thing you can do today is take 50 real questions from the people who will actually use the agent, run them by hand against your top three candidates with their default settings, and look at the answers and the latencies side by side. The pricing-page math will not survive contact with your corpus.
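A side-by-side run does not need much tooling. Here is a minimal harness sketch, assuming each candidate store is wrapped in a function that takes a question and returns a list of result strings. The stand-in retrievers and all names are hypothetical; swap in your own clients.

```python
import time
from typing import Callable, Optional

Retriever = Callable[[str], list[str]]


def compare(questions: list[str], retrievers: dict[str, Retriever]) -> list[dict]:
    """Run every question against every store and record latency and top result."""
    rows = []
    for question in questions:
        for store, retrieve in retrievers.items():
            start = time.perf_counter()
            results = retrieve(question)
            elapsed_ms = (time.perf_counter() - start) * 1000
            top: Optional[str] = results[0] if results else None
            rows.append({
                "question": question,
                "store": store,
                "latency_ms": elapsed_ms,
                "top_result": top,
            })
    return rows


# Stand-in retrievers so the sketch runs without any vector store.
def fake_store_a(question: str) -> list[str]:
    return [f"doc-a for {question}"]


def fake_store_b(question: str) -> list[str]:
    return []  # simulates a store that finds nothing


rows = compare(
    ["culvert calculation 2011"],
    {"store_a": fake_store_a, "store_b": fake_store_b},
)
```

Dump the rows into a spreadsheet and have the people who wrote the questions judge the top results.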

Key takeaway

For under 20M vectors and 100k queries a day, pick the vector store your team can operate at 3am, not the one with the cheapest pricing page.

FAQ

Which vector store should we use for a small RAG agent?

For under 20M vectors and 100k queries per day, pick the one your team can operate at 3am. Cost differences are small at that scale; operational differences are large.

How much does RAG retrieval actually cost per query?

In our 2.1M-document benchmark at 1,400 queries per day, all three options landed between €0.00091 and €0.0011 per query. Vendor calculator numbers were directionally right.

Is pgvector production-ready for RAG?

Yes, up to roughly 50M vectors on a single well-sized box with HNSW. Past that you are sharding Postgres, which is doable but a different commitment.

Why do RAG cost projections always come in low?

Because they assume one read per user question. Agents that rewrite queries or do follow-up retrievals can multiply reads by four to twelve in practice.

Does Pinecone have an EU region for data residency?

Yes, Pinecone serverless runs in eu-west-1 (Ireland) on AWS. Confirm the contract specifies that region before signing if residency matters to you.
