vLLM and Qdrant migration: off OpenAI Assistants in 4 weeks
Hosted AI got cheaper every quarter for two years, then it didn't. Here is the four-week playbook for moving a 38-person Rotterdam logistics SaaS off OpenAI Assistants onto vLLM and Qdrant.

A 17,000 euro invoice nobody budgeted for
The CTO of a 38-person Rotterdam logistics SaaS forwarded the OpenAI bill at 22:47 on a Tuesday in April. The line item for their Assistants-based shipment-classifier agent had jumped 41% month over month, and usage was flat. Their CFO had already started a thread.
The team had built on the Assistants API in late 2024 because the file_search tool meant nobody had to think about embeddings or chunking. It worked. For eighteen months the bills also dropped on schedule, two cuts a year, like clockwork. Then in Q1 2026 they stopped, and the curve started bending the other way. Hacker News had picked up the "AI is slowing down" framing the same week the invoice landed: capacity was tight, frontier-model prices had quietly reset upward, and the planned discount path had not materialized.
This is the playbook we used to move that company off OpenAI Assistants onto self-hosted vLLM plus Qdrant in four weeks. The cost numbers on both sides are at the bottom. So are the two outages that nearly killed the project in week two and week four.
What the agent actually did
The agent ingested unstructured shipment documents (bills of lading, customs forms, supplier invoices, mostly PDF and JPG) and emitted a structured JSON record for the operations team's dashboard. It also answered freeform questions from operators in the warehouse through a thin chat UI. Three things mattered:
- ~24,000 documents per month, peaking at ~1,800 per hour during the morning customs window.
- Retrieval against a 380k-chunk knowledge base of tariff codes, country rules, and the company's own SOPs.
- An average of 11 tool calls per agent run (PDF parse, code lookup, validate, write back).
The OpenAI bill at decision time was 11,400 euros in March 2026, projected to 14,200 in April. The board wanted that under 4,000.
Baseline before anything else
You cannot tune what you have not measured. Week one was not migration. It was instrumentation. We added a thin OpenTelemetry wrapper around every Assistants call and shipped traces to a local Tempo instance.
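from openai import OpenAI
from opentelemetry import trace

client = OpenAI()
tracer = trace.get_tracer("agent.shipment")

# doc_id, page_count, thread and ASSISTANT_ID come from the surrounding request handler
with tracer.start_as_current_span("assistant.run") as span:
    span.set_attribute("doc.id", doc_id)
    span.set_attribute("doc.pages", page_count)
    run = client.beta.threads.runs.create_and_poll(
        thread_id=thread.id,
        assistant_id=ASSISTANT_ID,
    )
    if run.usage is not None:  # usage is only populated once the run is terminal
        span.set_attribute("run.tokens.prompt", run.usage.prompt_tokens)
        span.set_attribute("run.tokens.completion", run.usage.completion_tokens)
    # Tool calls are recorded on the run's steps, not on the Run object itself
    steps = client.beta.threads.runs.steps.list(thread_id=thread.id, run_id=run.id)
    tool_calls = sum(
        len(step.step_details.tool_calls)
        for step in steps if step.type == "tool_calls"
    )
    span.set_attribute("run.tool_calls", tool_calls)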
After five days we knew the things the bill alone cannot tell you. The p50 run took 3.2 seconds and the p99 took 41 seconds. Eight percent of runs were repeats caused by a retry loop on a transient timeout. And 23% of all retrieved chunks were never cited by the model: the agent was paying for context it never used.
Those measurements set our three SLOs for the migration: median latency must not exceed 3.5s, p99 must not exceed 45s, and retrieval recall@10 must not fall below the baseline of 0.81 measured against a 240-question golden set.
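A recall@10 check against a golden set can be quite small. The sketch below assumes each golden question is labeled with the IDs of the chunks that should come back, and it uses the hit-rate reading of recall@10 (a question counts if at least one labeled chunk appears in the top ten). The post does not say which definition the team used, and `retrieve` is a hypothetical stand-in for the vector search.

```python
from typing import Callable, Iterable, Sequence


def recall_at_k(
    golden: Iterable[tuple[str, set[str]]],
    retrieve: Callable[[str, int], Sequence[str]],
    k: int = 10,
) -> float:
    """Fraction of golden questions with at least one relevant chunk in the top k.

    golden   -- (question, relevant_chunk_ids) pairs; the labeling scheme is an assumption
    retrieve -- hypothetical search function returning ranked chunk IDs
    """
    hits = 0
    total = 0
    for question, relevant_ids in golden:
        total += 1
        top_k = list(retrieve(question, k))[:k]
        if relevant_ids.intersection(top_k):
            hits += 1
    if total == 0:
        raise ValueError("golden set is empty")
    return hits / total


def meets_recall_slo(recall: float, baseline: float = 0.81) -> bool:
    # 0.81 is the baseline quoted in the text
    return recall >= baseline
```

Running this nightly against both the old and the new retrieval path gives a single number to compare, which is all the SLO needs.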
The new stack on paper
vLLM as the inference server, running Qwen2.5-72B-Instruct quantized to AWQ 4-bit on two H100s rented from Hetzner's GPU offering for 2,890 euros per month all in. Qdrant on a 16-vCPU bare-metal node for the vector store. BGE-M3 as the embedding model, served from the same vLLM cluster as a secondary model. LiteLLM as the OpenAI-compatible proxy in front of vLLM, so the application code did not need to know what was behind it.
We did not pick this stack because we love it. We picked it because vLLM speaks the OpenAI chat-completions schema, Qdrant has a stable Python client and good HNSW performance at our scale, and BGE-M3 was the embedding model that scored closest to text-embedding-3-large on our golden set (0.79 vs 0.81). Boring is fine.
Week one: shadow mode
Every production request to OpenAI was duplicated to the new stack asynchronously, with results compared in a side table. No user-visible change.
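import asyncio

# Hold references so pending shadow tasks are not garbage-collected mid-flight
_shadow_tasks = set()

async def run_agent(doc):
    primary = await openai_run(doc)
    task = asyncio.create_task(shadow_run(doc, primary))
    _shadow_tasks.add(task)
    task.add_done_callback(_shadow_tasks.discard)
    return primary  # the user only ever sees the OpenAI result

async def shadow_run(doc, primary):
    try:
        candidate = await vllm_run(doc)
        await shadow_log.insert({
            "doc_id": doc.id,
            "primary_json": primary.json,
            "candidate_json": candidate.json,
            "fields_match": diff_fields(primary.json, candidate.json),
            "latency_ms": candidate.latency_ms,
        })
    except Exception as e:
        # A shadow failure must never affect the production path
        await shadow_log.insert({"doc_id": doc.id, "error": str(e)})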
The shadow data after five days: 91.2% of structured fields matched the OpenAI output exactly, 6.1% were equivalent (different formatting, same value), 2.7% were genuine disagreements. The team triaged the 2.7% by hand. Two thirds of those were cases where the new stack was right and OpenAI had been wrong. We logged that and moved on.
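The exact / equivalent / disagreement split depends on what the field comparison treats as "same value". The original `diff_fields` is not shown, so the version below is a hypothetical sketch: it only normalizes whitespace and letter case for strings and compares numbers as floats. Real rules for dates, units, and number formats would be domain-specific.

```python
from collections import Counter
from typing import Any


def _normalize(value: Any) -> Any:
    """Strip formatting noise: collapse whitespace and case in strings, unify numbers."""
    if isinstance(value, str):
        return " ".join(value.split()).casefold()
    if isinstance(value, bool):
        return value
    if isinstance(value, (int, float)):
        return float(value)
    return value


def diff_fields(primary: dict, candidate: dict) -> dict[str, str]:
    """Label every field 'exact', 'equivalent', or 'disagree'."""
    labels = {}
    for key in sorted(primary.keys() | candidate.keys()):
        if key not in primary or key not in candidate:
            labels[key] = "disagree"  # a missing field counts as a disagreement
        elif primary[key] == candidate[key]:
            labels[key] = "exact"
        elif _normalize(primary[key]) == _normalize(candidate[key]):
            labels[key] = "equivalent"
        else:
            labels[key] = "disagree"
    return labels


def summarize(labels: dict[str, str]) -> Counter:
    """Count labels so match rates can be aggregated across documents."""
    return Counter(labels.values())
```

Only the "disagree" bucket needs human triage, which is what keeps a five-day shadow run reviewable by hand.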
Week two: the Qdrant cutover and the first outage
Cutting retrieval was supposed to be the easy half. Embeddings were re-computed on a Sunday, 380k chunks loaded into a single Qdrant collection with HNSW, m=16, ef_construct=200. Production reads were flipped at 04:00 on Monday behind a feature flag.
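For reference, those index settings map onto Qdrant's create-collection request roughly as follows. The sketch only builds the request body and sends nothing. The cosine distance metric is an assumption (the post does not name one), and the 1,024 vector size is BGE-M3's dense output dimension.

```python
import json


def collection_body(dim: int, m: int = 16, ef_construct: int = 200) -> dict:
    """Request body for Qdrant's create-collection endpoint (PUT /collections/<name>)."""
    return {
        # Distance metric is an assumption; the text only gives the HNSW parameters
        "vectors": {"size": dim, "distance": "Cosine"},
        "hnsw_config": {"m": m, "ef_construct": ef_construct},
    }


if __name__ == "__main__":
    print(json.dumps(collection_body(dim=1024), indent=2))
```

The `size` field is fixed at creation time, so a collection built for one embedding model cannot silently accept vectors of another length.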
By 09:14 the operations team noticed that roughly 40% of shipment classifications were returning "unknown tariff." Recall@10 had collapsed from 0.81 to 0.47.
The cause was not Qdrant. It was the embedding model swap. text-embedding-3-large produces 3,072-dim vectors. BGE-M3 produces 1,024-dim. We had re-embedded the corpus correctly, but the production query path was still calling OpenAI for the query embedding, which then got truncated to 1,024 dims by a misconfigured numpy cast in a helper nobody had touched in a year. Two thirds of every query vector was being silently discarded, and the third that survived came from a different model's embedding space than the corpus it was searched against.
When you swap embedding models, dual-write the new vectors but keep the old collection live until both sides of retrieval (corpus and queries) are on the new model. A mismatched embedding pipeline is the single most common way a RAG migration fails silently.
The fix took 90 minutes once we saw it. We rolled the flag back, pointed the query embedder at the BGE-M3 endpoint on vLLM, and re-flipped. Recall came back to 0.79. The outage cost roughly four hours of degraded retrieval and one all-hands apology in the customer Slack channel. It also bought us a permanent CI check that asserts query and corpus embedding dimensions match before a deploy is allowed to proceed.
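A pre-deploy check of that kind might look like the sketch below. All names are hypothetical. One design point worth noting: in the incident above the helper truncated the query vector to the right length, so a check on the final shape alone would have passed. The sketch therefore checks the raw embedder output and also compares the configured model names.

```python
from typing import Sequence


class EmbeddingMismatch(AssertionError):
    """Raised when the query path and the corpus disagree on the embedding setup."""


def assert_dims_match(raw_query_vector: Sequence[float], corpus_dim: int) -> None:
    # Must be called on the embedder's raw output, before any helper reshapes it
    query_dim = len(raw_query_vector)
    if query_dim != corpus_dim:
        raise EmbeddingMismatch(
            f"query embedder returns {query_dim}-dim vectors, "
            f"collection expects {corpus_dim}"
        )


def assert_same_model(query_model: str, corpus_model: str) -> None:
    # Equal dimensions do not imply the same vector space, so compare names too
    if query_model != corpus_model:
        raise EmbeddingMismatch(
            f"queries use {query_model!r}, corpus was embedded with {corpus_model!r}"
        )
```

In CI this would run against one probe query and the collection's stored configuration, and fail the pipeline on any `EmbeddingMismatch`.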
Week three: flipping inference
Inference was the part everyone was afraid of. We did three things to de-risk it.
First, we rented one extra H100 for the cutover week. Headroom is cheap when the alternative is a Tuesday morning incident.
Second, we ran vLLM with conservative concurrency limits and explicit swap space.
vllm serve Qwen/Qwen2.5-72B-Instruct-AWQ \
    --quantization awq_marlin \
    --tensor-parallel-size 2 \
    --max-model-len 32768 \
    --max-num-seqs 32 \
    --gpu-memory-utilization 0.90 \
    --swap-space 16 \
    --enable-prefix-caching \
    --served-model-name shipment-classifier
Third, the LiteLLM proxy sat in front with a fallback rule: any 5xx or any latency over 30s spilled back to OpenAI for that single request. The cutover happened at 03:00 Wednesday and the fallback fired on 1.4% of requests in the first hour. By Thursday it was 0.2%.
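The team configured this rule in the LiteLLM proxy. The sketch below is not LiteLLM configuration; it is an application-level illustration of the same per-request rule, with hypothetical handler callables and `RuntimeError` standing in for a 5xx response.

```python
import asyncio
from typing import Awaitable, Callable

Handler = Callable[[dict], Awaitable[dict]]


async def with_fallback(
    request: dict,
    primary: Handler,
    fallback: Handler,
    timeout_s: float = 30.0,
) -> tuple[dict, str]:
    """Try the self-hosted backend; spill only this request to the fallback on failure.

    Returns (response, backend_name) so the fallback rate can be tracked.
    """
    try:
        response = await asyncio.wait_for(primary(request), timeout=timeout_s)
        return response, "primary"
    except (asyncio.TimeoutError, ConnectionError, RuntimeError):
        # RuntimeError is a stand-in for "the backend answered with a 5xx"
        response = await fallback(request)
        return response, "fallback"
```

Returning the backend name alongside the response is what makes a figure like "1.4% of requests in the first hour" cheap to compute.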
Week four: the second outage
The agent had been on the new stack for nine days when vLLM started OOM-killing itself at 11:20 on a Friday. The customs window hit and the GPU process started returning 500s.
The cause was prefix caching. We had enabled it because it cut p50 latency by roughly 28% during testing. But the customs window mixed long unique prefixes (each shipment PDF is different) with our shared system prompt, and the KV cache grew faster than the eviction policy reclaimed it. With max-num-seqs at 32 and contexts up to 28k tokens, the cache hit the GPU memory ceiling.
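The arithmetic behind that ceiling is short enough to write down. The sketch assumes an fp16 KV cache and takes the Qwen2.5-72B architecture values (80 layers, 8 key-value heads, head dimension 128) from the published model config; treat all of them as assumptions to verify against the checkpoint you actually serve.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, dtype_bytes: int = 2) -> int:
    """Bytes of KV cache one token occupies: a K and a V vector per layer per KV head."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes


# Assumed Qwen2.5-72B values: 80 layers, 8 KV heads (grouped-query attention), head_dim 128
PER_TOKEN = kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128)


def worst_case_kv_gib(max_num_seqs: int, tokens_per_seq: int) -> float:
    """KV cache needed if every concurrent sequence sits at tokens_per_seq tokens."""
    return max_num_seqs * tokens_per_seq * PER_TOKEN / 2**30


if __name__ == "__main__":
    print(PER_TOKEN)  # 327680 bytes, i.e. 320 KiB per token
    for seqs in (32, 24, 16):
        print(seqs, round(worst_case_kv_gib(seqs, 28_000), 1))
```

On these assumptions, 32 sequences at 28k tokens would need about 273 GiB of KV cache, while two 80 GB H100s hold 160 GB in total before the model weights are loaded. The worst case never fit. The server was relying on most sequences being far shorter than the maximum at any given moment, and the customs window broke that assumption.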
The 11-minute fix was to drop max-num-seqs to 24 and bump swap-space to 32GB. The proper fix took the rest of the day: we split the workload. Classification requests with long PDF context went to a queue with concurrency 16 and prefix caching off. Chat requests went to a separate vLLM pool with concurrency 48 and caching on. We added a Prometheus alert on vllm:gpu_cache_usage_perc > 0.92 (the gauge reports a fraction between 0 and 1, so 0.92 means 92%) and a runbook entry that took ten minutes to write and would have saved an hour on the day.
What we kept from OpenAI
Two things, on purpose. First, the LiteLLM fallback to GPT-4.1-mini on 5xx or timeout, currently firing on roughly 0.18% of requests. The cost is rounding-error and the on-call sleeps better. Second, our golden-set evals still run against three models nightly: Qwen2.5-72B (the production model), GPT-4.1-mini (the fallback), and o4-mini (the reference). When Qwen2.5 drifts below the reference by more than two percentage points on the structured-extraction task, we get a Slack ping. That ping fired once in May and pointed us at a system-prompt edit that had broken JSON-mode adherence on long inputs.
You do not have to take a religious position on self-hosting. The OpenAI account stays open, the API key stays warm, the eval suite measures the gap. The cheaper stack carries the traffic. The expensive one keeps it honest.
What it cost, what it saved
Four weeks of work. Two engineers full time, one DevOps half time. The vendor bill for May 2026, the first full month on the new stack:
- Hetzner GPU server, 3x H100 (drops to 2x in July): 4,335 euros
- Qdrant node: 189 euros
- OpenAI fallback plus golden-set evals: 214 euros
- Total: 4,738 euros
Against 11,400 in March, the run-rate saving is roughly 6,600 euros a month, or about 80k annualized. The migration itself cost around 38k in engineering time, so payback was about six months. Not a headline number. A defensible one.
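The payback arithmetic, spelled out with the figures above. The July projection is an assumption: it uses the 2,890 euro price quoted earlier for two H100s and holds the other two line items flat.

```python
MARCH_OPENAI_EUR = 11_400
MIGRATION_COST_EUR = 38_000

MAY_STACK_EUR = {
    "gpu_3x_h100": 4_335,
    "qdrant_node": 189,
    "openai_fallback_and_evals": 214,
}

# Assumption: from July the GPU line drops to the two-H100 price, other lines unchanged
JULY_STACK_EUR = {**MAY_STACK_EUR, "gpu_3x_h100": 2_890}


def monthly_saving(before_eur: int, stack: dict[str, int]) -> int:
    return before_eur - sum(stack.values())


def payback_months(cost_eur: int, saving_eur: int) -> float:
    return cost_eur / saving_eur


if __name__ == "__main__":
    may = monthly_saving(MARCH_OPENAI_EUR, MAY_STACK_EUR)
    july = monthly_saving(MARCH_OPENAI_EUR, JULY_STACK_EUR)
    print(may, may * 12, round(payback_months(MIGRATION_COST_EUR, may), 1))
    print(july, round(payback_months(MIGRATION_COST_EUR, july), 1))
```

At the May run rate the saving is 6,662 euros a month, which pays back the migration in roughly 5.7 months.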
The thing nobody put on the spreadsheet: the team now knows exactly where their tokens go and what their retrieval looks like. When the next pricing email lands, they can read it without flinching.
What we would do differently
One thing. We would build the golden-set eval suite in week zero, before touching the migration. We built ours in week one alongside the OTel instrumentation, which was late enough that the first three days of shadow data were not directly comparable. If you take one thing from this post, take that.
When we built the document agent for that Rotterdam team, the inference cutover was the part everyone feared, and it turned out to be the easy half. The traps were on either side: embedding-dim mismatches that silently halved recall, and prefix caching that lit the GPU on fire when production traffic looked nothing like the benchmark. If you are weighing a similar migration of your own AI agents, the four weeks above is the honest version, not the slide-deck one.
The smallest thing you could do today: pull last month's OpenAI bill and split it by metadata.user. If you do not have metadata.user set on every Assistants run, that is your week-zero job.
Key takeaway
The hard part of leaving the Assistants API is not the inference move. It is the embedding-model swap and the KV-cache math nobody warns you about.
FAQ
Why vLLM and not Ollama or TGI?
vLLM gave us OpenAI-compatible chat completions, mature paged-attention scheduling, and tensor parallelism on multi-GPU. Ollama is great for a laptop. TGI was fine but slower at our concurrency target.
Can you run this on smaller models?
Yes. We tried Qwen2.5-32B first. It cost 40% less to serve and was 6 points behind on our golden set. For structured extraction over messy PDFs, those points mattered. Test against your own evals.
What if you cannot afford rented H100s?
Drop to a single L40S and a 14B-class model. You give up some accuracy and concurrency but the math still beats hosted at moderate volumes. Below ~5k runs per month, stay on the hosted API.
Does the Assistants API even still exist?
Yes, but OpenAI announced its sunset in favor of the Responses API. If you are starting a new project on Assistants in 2026, you are building on a deprecated surface. Plan accordingly.