Sandboxing AI agents: Daytona vs E2B vs Firecracker stack
A 19-person consultancy in Amsterdam runs 6,800 Python sessions a week behind one research agent. We priced Daytona, E2B, and a self-hosted Firecracker stack to find the honest answer.

It is 23:47 on a Sunday. A new Linux kernel CVE just landed on oss-security, scored 8.8. Your AI research agent runs ~970 Python sessions a day inside Linux microVMs that share that kernel. Somebody patches. The only question worth asking before you sign a sandbox contract is who.
We had this conversation last month with a 19-person data consultancy in Amsterdam. They built a research agent that ingests client CSVs, runs pandas transformations against a 14 GB reference dataset, and writes back a structured report. Roughly 6,800 sessions a week, almost all under 90 seconds of CPU. The agent works. The sandbox layer underneath it was still up for grabs.
We priced three. The honest comparison came down to three columns: per-session cost, cold-start against the 14 GB dataset, and who pages you when a kernel bug drops.
The workload, in one paragraph
Each agent run is a fresh Python 3.13 environment with pandas, polars, pyarrow, and a custom internal package mounted read-only. The 14 GB reference dataset (deduplicated property records since 2008) lives on a shared volume. Cold runs need it memory-mapped; warm runs reuse it. Median session: 38 seconds of CPU. p99: 4 minutes. The agent retries on transient failure. Concurrency peaks at 22 simultaneous sandboxes around 14:00 CET on Mondays.
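The retry behaviour mentioned above can be sketched as a small wrapper with exponential backoff. This is a minimal illustration, not the consultancy's code: `TransientSandboxError`, `run_with_retry`, and the attempt and delay defaults are all hypothetical names and values chosen for the example.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientSandboxError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, dropped connections)."""


def run_with_retry(
    run_session: Callable[[], T],
    max_attempts: int = 3,
    base_delay_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call run_session, retrying transient failures with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return run_session()
        except TransientSandboxError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure to the agent
            sleep(base_delay_s * 2 ** (attempt - 1))  # 0.5 s, then 1 s, then 2 s, ...
    raise ValueError("max_attempts must be at least 1")
```

Every retry is another sandbox start, so retries count toward the session volume you are billed for.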
Option one: Daytona
Daytona is built for developer environments, not throwaway agent sessions. You can run a research agent on it. They have an API, the sandboxes boot under 100 ms after warm pool initialization, and the developer experience is clean.
The math, at our session volume, was the problem. Daytona's pricing assumes longer-lived workspaces. At 6,800 weekly starts with a 38-second median, you pay for compute granularity you do not use. Per session worked out to roughly €0.018 once we modeled the warm pool overhead realistically. Across 354,000 yearly sessions, that is €6,372. Not catastrophic. Not cheap either.
The 14 GB dataset was the harder issue. Daytona will let you attach a volume, but the agent kept hot-mapping it on first run, costing 6 to 9 seconds before pandas had a usable handle. Their snapshot model is optimized for code state, not 14 GB of arrow files. We ranked them second.
Option two: E2B
E2B is purpose-built for this shape of workload. Code interpretation, agent sandboxes, sub-second cold starts. Under the hood, they run Firecracker microVMs (the same AWS-built microVM tech that powers Lambda). Their snapshot and restore is sharp. Their SDK is two lines.
from e2b_code_interpreter import Sandbox
# agent_generated_python is the code string the agent produced for this turn.
with Sandbox(template="pandas-arrow-14gb") as sbx:
    result = sbx.run_code(agent_generated_python)
    print(result.text)
At our volume, using their published per-second compute pricing for a 2 vCPU / 4 GB template, compute modeled to roughly €0.011 per session. Annualized: about €3,900. Adding the persistent template storage line item for the 14 GB dataset: about €4,800.
Cold start to a usable pandas process with the dataset memory-mapped: 340 ms median, 1.1 s p99 on our test template. That is what a managed Firecracker snapshot buys you. The dataset lives in the template; restore brings it back as a copy-on-write mapping.
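For readers who have not met copy-on-write mappings, here is a process-level analogy using only the Python standard library. It is a sketch of the idea, not of E2B's or Firecracker's internals: a tiny temp file stands in for the 14 GB dataset, and `mmap.ACCESS_COPY` stands in for the private mapping a snapshot restore gives each sandbox.

```python
import mmap
import os
import tempfile

# Stand-in for the reference dataset: a tiny temp file instead of 14 GB of Arrow data.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"reference-data")
    path = f.name

with open(path, "rb") as f:
    # ACCESS_COPY maps the file privately: pages are read on first touch,
    # and writes land in private copies instead of the underlying file.
    view = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_COPY)
    view[:9] = b"REFERENCE"      # this session's private modification
    session_bytes = bytes(view)
    view.close()

with open(path, "rb") as f:
    file_bytes = f.read()        # what the next session would see

os.remove(path)
print(session_bytes)  # b'REFERENCE-data'
print(file_bytes)     # b'reference-data'
```

The shared bytes are never copied up front, which is why a restore can hand back a usable mapping in milliseconds rather than the seconds a cold read would take.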
When CVE-2024-1086 (the Linux kernel netfilter use-after-free) was disclosed, managed Firecracker operators rolled patches fast because they had to: every customer was exposed. Their team patched. You did not have to.
Option three: Firecracker plus Kata, self-hosted
This was the option the consultancy's two engineers wanted. They had run KVM workloads before. The promise: own the stack, drive per-session cost toward marginal.
On paper, the math looks excellent. A single Hetzner AX102 (~€280/month) can run roughly 80 to 120 concurrent Firecracker microVMs at this workload. Two hosts cover the 22-sandbox peak with a wide margin. Then the arithmetic: €560/month divided by roughly 29,500 monthly sessions equals €0.019 per session, before any engineering time. That is worse than E2B's €0.011 at managed pricing.
The "worse" surprised them. It should not have. E2B's pricing reflects them running this workload at scale, with bin-packing you cannot replicate at 19 people.
The other column was the deciding one. The CVE Sunday.
Self-hosted Firecracker means your two backend engineers are on the kernel CVE rota. That includes weekends, holidays, and the week your senior engineer is in northern Thailand with bad Wi-Fi. Price that into the spreadsheet before you sign anything.
A reasonable on-call premium for two engineers, modeled at €400/month per person to cover the kernel-patching responsibility (not the rest of their job), adds €9,600 a year. That premium alone is twice E2B's all-in annual bill. Add the €6,720 of hardware and self-hosted comes to roughly €16,300 a year, more than three times what E2B costs.
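The annual figures can be reproduced in a few lines. Every input below is a number quoted in this article; the variable names are ours, and the E2B total is taken as quoted because the storage line item is not broken out per session.

```python
SESSIONS_PER_YEAR = 354_000   # 6,800 a week over 52 weeks is 353,600, rounded in the text

# Managed options, EUR per year.
daytona_annual = 0.018 * SESSIONS_PER_YEAR       # per-session estimate incl. warm pool
e2b_compute_annual = 0.011 * SESSIONS_PER_YEAR   # compute only
e2b_annual = 4_800                               # compute plus dataset storage, as quoted

# Self-hosted, EUR per year.
hardware_annual = 2 * 280 * 12                   # two Hetzner AX102 hosts
on_call_annual = 2 * 400 * 12                    # two engineers on the kernel CVE rota
self_hosted_annual = hardware_annual + on_call_annual

print(f"Daytona:      {daytona_annual:,.0f}")
print(f"E2B compute:  {e2b_compute_annual:,.0f}")
print(f"E2B total:    {e2b_annual:,}")
print(f"Self-hosted:  {self_hosted_annual:,}")                     # 16,320
print(f"Self-hosted / E2B: {self_hosted_annual / e2b_annual:.1f}x")  # 3.4x
```

Swap in your own session count and on-call rate. The ranking is most sensitive to the on-call line, not the hardware line.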
The non-cost case for self-hosting (data residency, regulatory ring-fencing, no third-party processor) was real for two of the consultancy's clients. We noted it. It did not apply to the agent's reference dataset, which was public Kadaster data.
What we did not shortlist
Worth a paragraph: we looked at the recent Pyodide release, which added the ability to publish WebAssembly wheels directly to PyPI. Genuinely interesting for browser-side agents and offline analysis. For a 14 GB pandas dataset that needs to be memory-mapped on a Linux kernel, WASM is the wrong shape of tool. Right answer, wrong question.
The numbers, side by side
Annualized cost, 354,000 sessions
- Daytona: ~€6,400
- E2B: ~€4,800 (compute plus dataset storage)
- Self-hosted Firecracker + Kata: ~€16,300 (~€6,700 hardware plus €9,600 on-call premium)
Cold start, 14 GB dataset memory-mapped
- Daytona: 6 to 9 s (volume hot-mapped on first run)
- E2B: 340 ms median, 1.1 s p99 (template snapshot)
- Self-hosted: 280 ms median, if you build the snapshot pipeline yourself; that costs about four engineering weeks up front
Sunday-night CVE patch
- Daytona: their team
- E2B: their team
- Self-hosted: your team
The choice
The consultancy picked E2B. Price was not the main reason, although E2B was the cheapest of the three (about €1,600 a year under Daytona). The cold-start math against the 14 GB dataset was the cleanest, and their two backend engineers did not want to be the kernel patch team. They wanted to be the agent team.
When we built the research-agent integration for them at ABN, the awkward bit was not the sandbox layer. It was the agent's habit of regenerating the same pandas filter four different ways across one research session, which inflated session count by about 18%. We solved it with a structured intermediate-result cache between agent turns. The sandbox bill dropped to where we had modeled it. If you are building an AI agent that hits a sandbox every turn, look at that loop before you look at the sandbox vendor.
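A minimal sketch of what such a cache can look like, under loud assumptions: `TurnCache`, `normalize`, and the `run_in_sandbox` callable are hypothetical names, and the key is simply a hash of the input fingerprint plus the parsed form of the code. This is not the implementation we shipped. It only catches snippets that differ in whitespace or comments; filters that are semantically equal but written differently need a more structured description of the operation.

```python
import ast
import hashlib
from typing import Callable, Dict


def normalize(code: str) -> str:
    """Canonical form of a snippet: parsing drops comments and whitespace differences."""
    return ast.dump(ast.parse(code))


class TurnCache:
    """Reuse results across agent turns instead of starting a new sandbox each time."""

    def __init__(self, run_in_sandbox: Callable[[str], str]) -> None:
        self._run = run_in_sandbox           # hypothetical: executes code, returns a result
        self._results: Dict[str, str] = {}
        self.sandbox_calls = 0

    def run(self, code: str, input_fingerprint: str) -> str:
        key = hashlib.sha256(
            (input_fingerprint + "\n" + normalize(code)).encode()
        ).hexdigest()
        if key not in self._results:         # cache miss: pay for one sandbox session
            self.sandbox_calls += 1
            self._results[key] = self._run(code)
        return self._results[key]
```

Keying on the input fingerprint as well as the code matters: the same filter against a different client CSV is a different result and must not be served from cache.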
If you want a five-minute audit today: open your agent's last 100 sessions and count how many ran a near-identical pandas operation against the same input. If it is more than fifteen, your sandbox math is lying to you, and the real fix is one cache layer up.
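The audit can be scripted roughly as follows. The log shape (a list of input fingerprint and code pairs), the function name, and the 0.9 similarity threshold are assumptions made for illustration; tune the threshold against a sample you have checked by hand.

```python
from difflib import SequenceMatcher
from typing import List, Tuple

Session = Tuple[str, str]  # (input fingerprint, code the agent ran); hypothetical log shape


def count_near_duplicates(sessions: List[Session], threshold: float = 0.9) -> int:
    """Count sessions whose code closely matches an earlier session on the same input."""
    seen: List[Session] = []
    duplicates = 0
    for input_id, code in sessions:
        squashed = " ".join(code.split())  # ignore whitespace-only differences
        if any(
            prev_input == input_id
            and SequenceMatcher(None, prev_code, squashed, autojunk=False).ratio() >= threshold
            for prev_input, prev_code in seen
        ):
            duplicates += 1
        else:
            seen.append((input_id, squashed))
    return duplicates
```

Run it over the last 100 sessions and compare the result to the fifteen-session rule of thumb above. The pairwise comparison is quadratic, which is fine at this sample size.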
Key takeaway
If your agent runs 6,800 sandboxes a week, the deciding cost is not the cents per session. It is who fields the Sunday-night kernel CVE.
FAQ
Is E2B always the right choice for AI agent sandboxes?
No. For minutes-long developer workspaces Daytona fits better. For strict data residency with a kernel team in-house, self-hosted Firecracker pays back. For short, high-volume agent sessions, E2B wins on cold-start.
Why not just run agent Python in plain Docker containers?
Containers share the host kernel. An agent that executes arbitrary generated code on a shared kernel is one CVE away from a bad day. MicroVMs give you VM-strength isolation with container-speed startup.
What did the per-session cost benchmark include?
Annual sandbox cost divided by 354,000 sessions (6,800 weekly across 52 weeks). It includes compute, storage for the 14 GB reference dataset, and on-call engineering premium for the self-hosted option.
Does self-hosting ever beat managed sandboxes on cost?
Above roughly 50,000 sessions a day with a dedicated platform team already in place, yes. Below that, the on-call premium and snapshot tooling investment swamp the hardware savings.