
Gemma 4 QAT on a MacBook: a fallback brain for healthcare RAG

It is 16:12 on a Tuesday in Utrecht. The cloud key returns 529. The triage nurse is waiting. Here is the fallback brain we built so the system keeps answering.

Jacob Molkenboer · Founder · A Brand New Company · 30 Jul 2024 · 9 min


It is 16:12 on a Tuesday in Utrecht. The duty nurse at a regional GP collective types a question into the triage assistant we built last winter: "Patiënt 64j, NOAC sinds 2 maanden, INR niet relevant, bloeding tandvleesrand, hoeveel uur stoppen?" (In English: "Patient aged 64, on a NOAC for 2 months, INR not relevant, bleeding at the gum line, stop for how many hours?") The cloud key returns 529. Again. A status page somewhere is red, the queue is climbing, and a year ago we would have shown a spinner until the nurse gave up and reached for a paper protocol.

This is the field guide to the fallback brain we wired into the system over a long weekend in April. The brain runs on the receptionist's MacBook Air. It is slower than the cloud model. It is dumber. It is also enough to keep the queue moving while the upstream incident counter ticks past two hours.

The case for a local fallback brain

Three reasons, in order of weight.

First, healthcare uptime is not a marketing number. A general practice running 12 consult rooms at peak does not want a clinical-decision assistant that fails open into a spinner. We needed an answer path that does not depend on a single provider's status page.

Second, the AVG (the Dutch name for the GDPR) treats medical data as a special category. Routing the same query through a local model when the cloud is down does not change the legal posture (both paths have a processing agreement on file), but it does mean that during the fallback window no PHI leaves the building. That is easier to explain to a privacy officer than "we proxied to a backup provider in Frankfurt that you've never heard of".

Third, cost. The fallback path is a sunk MacBook the practice already owned, plus electricity. The cloud path bills per call. When we estimated this for a 40-doctor collective, the local fallback paid for the engineering work in eight weeks just by absorbing rate-limit retries.

Picking Gemma 4 QAT over the alternatives

Quantization-aware training is not new. The trick is to train the model with quantization in the loop, so the 4-bit weights it ships with are weights it has actually learned to work with. The result is a small model that loses far less quality than one quantized after the fact. Google published the Gemma 4 QAT family in late May, and the reason it caught our eye is straightforward: the 8B QAT variant runs at usable speeds on a base-spec Apple silicon laptop, and it scores within shouting distance of the unquantized 8B on instruction-following benchmarks.

We considered three alternatives before settling.

A smaller hosted model as fallback. Sound on paper, but it shifts the dependency to a second provider's uptime and trades one privacy story for two. Skipped.

A bigger local model on a Mac Studio. Better answers, harder to deploy. Most of our healthcare clients have a fleet of MacBooks; nobody has a Mac Studio sitting unused at reception. Skipped.

A 4B QAT model. Faster, fits on a 16 GB machine with room to spare, but its Dutch medical reasoning fell off a cliff in our acceptance tests. The 8B QAT cleared the floor; the 4B did not.

The MacBook itself

The reference machine in the clinic is a 2024 MacBook Air, M3, 16 GB unified memory, 512 GB SSD. We picked it deliberately with the receptionist's old machine in mind: this is mid-tier, not a workstation. It runs the practice's appointment system, a Chrome window, and the fallback brain at the same time.

Expectations, measured on that exact machine with the model loaded and the laptop on AC power:

First-token latency: 700 to 1100 ms cold, 250 to 400 ms warm.
Steady-state throughput: 22 to 28 tokens per second.
Resident memory with the 8B QAT loaded and a 4k context: about 6.2 GB.
Fan noise during a long generation: noticeable, not embarrassing.

For a triage answer of 120 to 180 tokens, the user sees a complete response in roughly 6 to 9 seconds. The cloud path, on a good day, is 1.8 seconds. The nurses we shadowed during pilot did not complain about the gap; they complained about spinners.
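The arithmetic behind that estimate is worth having in a form you can rerun with your own numbers. A minimal sketch, using only the latency and throughput figures quoted above; the function name is ours, and the model is deliberately naive (first-token wait plus tokens divided by decode speed):

```python
def response_seconds(tokens: int, tokens_per_s: float, first_token_s: float) -> float:
    """Time until the full answer is on screen: first-token wait plus generation."""
    return first_token_s + tokens / tokens_per_s

# Best case: short answer, fastest measured decode, warm model.
best = response_seconds(120, 28, 0.25)
# Worst case: long answer, slowest measured decode, cold model.
worst = response_seconds(180, 22, 1.1)

print(f"{best:.1f} s to {worst:.1f} s")  # → 4.5 s to 9.3 s
```

The outer bounds are a little wider than the 6 to 9 seconds a nurse typically sees, which is what you would expect: the best case needs a warm model, a short answer and the fastest decode all at once.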

Wiring it up with Ollama

You can serve the model with Ollama, llama.cpp directly, or MLX. We chose Ollama for one reason: it is the package the clinic's IT vendor already supports. The runtime is faster on llama.cpp if you tune it, but the time you save running the model you spend explaining to someone else how to restart it.

brew install ollama
brew services start ollama

# Pull the 8B instruction-tuned QAT variant
ollama pull gemma4:8b-it-qat-q4_0

# Smoke test in Dutch: "What is the standard waiting time after stopping a NOAC
# before a dental procedure?"
ollama run gemma4:8b-it-qat-q4_0 \
  "Wat is de standaard wachttijd na een NOAC-stop voor een tandheelkundige ingreep?"

Two things to set before you put this in front of users.

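First, give the Ollama process scheduling priority so it stays responsive when the receptionist opens 40 Chrome tabs. A small launchd plist with a lower Nice value (lower means higher priority) is enough. Nice governs CPU scheduling rather than paging, so it does not strictly stop the OS from swapping the model out, and macOS has no cgroups to reach for; we have not needed anything heavier.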

Second, raise the context window. The default is generous for chat but tight for a RAG that injects six to eight retrieved chunks of guideline text:

cat > ~/Modelfile <<'EOF'
FROM gemma4:8b-it-qat-q4_0
PARAMETER num_ctx 8192
PARAMETER temperature 0.2
PARAMETER repeat_penalty 1.05
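# System prompt, in English: "You are a clinical triage assistant for Dutch GPs. Answer concisely in Dutch. Always cite the source from the given context."
SYSTEM "Je bent een klinische triage-assistent voor Nederlandse huisartsen. Antwoord beknopt in het Nederlands. Citeer altijd de bron uit de gegeven context."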
EOF

ollama create triage-fallback -f ~/Modelfile
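Before trusting the 8192-token window, it helps to check that a typical retrieval actually fits in it. A rough sketch, with loud assumptions: the four-characters-per-token ratio is a common rule of thumb, not a measurement of this model's tokenizer (Dutch medical text may tokenize less efficiently), and the 600-token reply budget simply mirrors the cloud path's max_tokens. The function and parameter names are hypothetical.

```python
def fits_context(chunks: list[str], question: str, num_ctx: int = 8192,
                 reply_budget: int = 600, chars_per_token: float = 4.0) -> bool:
    """Estimate whether the retrieved chunks, the question and the reply fit in num_ctx."""
    prompt_chars = sum(len(chunk) for chunk in chunks) + len(question)
    # Round up so the estimate errs on the cautious side.
    prompt_tokens = int(prompt_chars / chars_per_token) + 1
    return prompt_tokens + reply_budget <= num_ctx

# Eight chunks of about 3,000 characters each, roughly 6,000 tokens of guideline text.
chunks = ["x" * 3000] * 8
print(fits_context(chunks, "hoeveel uur stoppen?"))                # fits in 8192
print(fits_context(chunks, "hoeveel uur stoppen?", num_ctx=4096))  # does not fit in 4096
```

If the check fails, the cheaper fix is usually fewer chunks rather than a bigger window, since a larger context also costs memory on a 16 GB machine.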

The router that decides when to fail over

The interesting code is not the model call. It is the router that sits in front of both brains and decides which one answers this specific question.

Our rule is simple: try cloud, give it a hard 6-second budget on first token, and if it misses, stream from the local model instead. Do not retry the cloud once you have committed to local. The user-perceived stutter of restarting a stream is worse than just finishing on the slower brain.

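import asyncio
import json
from typing import AsyncIterator

import httpx

CLOUD_URL = "https://api.anthropic.com/v1/messages"
LOCAL_URL = "http://127.0.0.1:11434/api/chat"
FIRST_TOKEN_BUDGET = 6.0  # seconds to the first text token before failing over

async def stream_cloud(messages, key) -> AsyncIterator[str]:
    """Yield text deltas from the cloud model's server-sent event stream."""
    headers = {"x-api-key": key, "anthropic-version": "2023-06-01"}
    body = {"model": "claude-sonnet-4-7", "messages": messages,
            "max_tokens": 600, "stream": True}
    async with httpx.AsyncClient(timeout=30.0) as c:
        async with c.stream("POST", CLOUD_URL, headers=headers, json=body) as r:
            r.raise_for_status()  # a 529 raises httpx.HTTPStatusError here
            async for line in r.aiter_lines():
                if not line.startswith("data: "):
                    continue
                event = json.loads(line[6:])
                # Only text deltas count as tokens; skip message_start, ping, etc.
                if event.get("type") == "content_block_delta":
                    text = event.get("delta", {}).get("text", "")
                    if text:
                        yield text

async def stream_local(messages) -> AsyncIterator[str]:
    """Yield text chunks from the local Ollama chat endpoint (one JSON object per line)."""
    body = {"model": "triage-fallback", "messages": messages, "stream": True}
    async with httpx.AsyncClient(timeout=120.0) as c:
        async with c.stream("POST", LOCAL_URL, json=body) as r:
            r.raise_for_status()
            async for line in r.aiter_lines():
                if line:
                    text = json.loads(line).get("message", {}).get("content", "")
                    if text:
                        yield text

async def answer(messages, key) -> AsyncIterator[str]:
    """Stream from the cloud if its first token arrives in budget, otherwise from local."""
    cloud = stream_cloud(messages, key)
    try:
        first = await asyncio.wait_for(cloud.__anext__(), FIRST_TOKEN_BUDGET)
    except (asyncio.TimeoutError, httpx.HTTPError, StopAsyncIteration):
        await cloud.aclose()  # release the cloud connection; we are committed to local
    else:
        yield first
        # Errors after the first token propagate: we never switch brains mid-answer.
        async for chunk in cloud:
            yield chunk
        return
    async for chunk in stream_local(messages):
        yield chunk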

The 6-second budget is a tuned number, not a default. Below 4 seconds, you fail over too often on healthy days when the cloud is just slow. Above 8 seconds, the nurse notices the brain handing off and asks what just happened.
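Tuning that number is easier with your own first-token latencies in hand. A minimal sketch of the comparison, assuming you already log time-to-first-token per request; the sample values below are invented for illustration and are not our production data:

```python
def failover_rate(first_token_latencies: list[float], budget: float) -> float:
    """Share of requests whose first token would have missed the budget."""
    missed = sum(1 for t in first_token_latencies if t > budget)
    return missed / len(first_token_latencies)

# Hypothetical healthy-day sample, in seconds. Replace with your own logs.
sample = [1.2, 1.8, 1.9, 2.4, 3.1, 3.6, 4.4, 5.2, 1.7, 2.0]

for budget in (4.0, 6.0, 8.0):
    print(f"budget {budget:.0f} s: {failover_rate(sample, budget):.0%} of requests fail over")
```

On this made-up sample a 4-second budget fails over on 20% of healthy requests, while 6 and 8 seconds fail over on none; the upper bound is then set by how long a nurse will wait before noticing, which no log can tell you.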

Warning

Surface the failover to the user. We render a small italic line under the answer that says "Antwoord gegeven door lokaal model, cloud niet bereikbaar." ("Answer given by the local model, cloud unreachable.") Hiding the handoff teaches users to trust two different answer qualities as if they were one, and that is how mistakes get blamed on the wrong model.
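One way to make the handoff visible is to have the router tag every chunk with the brain that produced it, and let the rendering layer append the notice. This is a sketch under assumptions, not our production code: the tagged and render helpers and the fake stream are hypothetical names, and a real front-end would stream chunks to the screen rather than join them.

```python
import asyncio
from typing import AsyncIterator

FAILOVER_NOTE = "Antwoord gegeven door lokaal model, cloud niet bereikbaar."

async def tagged(source: str, chunks: AsyncIterator[str]) -> AsyncIterator[tuple[str, str]]:
    """Pair every text chunk with the brain that produced it ('cloud' or 'local')."""
    async for chunk in chunks:
        yield source, chunk

async def render(stream: AsyncIterator[tuple[str, str]]) -> str:
    """Collect the answer and append the notice if any chunk came from the local model."""
    parts, sources = [], set()
    async for source, text in stream:
        parts.append(text)
        sources.add(source)
    answer = "".join(parts)
    if "local" in sources:
        answer += "\n\n" + FAILOVER_NOTE
    return answer

async def fake_stream(parts: list[str]) -> AsyncIterator[str]:
    """Stand-in for a model stream, so the sketch runs without a network."""
    for part in parts:
        yield part

print(asyncio.run(render(tagged("local", fake_stream(["Stop ", "de NOAC."])))))
```

Keeping the source tag on the stream rather than in the UI also gives you a free field for logging, which is how you later answer the question of how often the failover fires.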

Acceptance tests for Dutch medical content

The fallback is allowed to be worse than the cloud. It is not allowed to be wrong in dangerous ways. We keep a fixed set of 86 evaluation questions, taken from anonymised real triage queries, each with a gold answer written by a GP and a list of facts the response must contain.

Two metrics gate any model swap.

Fact recall. Of the 86 questions, the local model must hit 90% of the must-contain facts. The cloud model sits around 96%. Below 90%, we do not ship the fallback path; we route to a human nurse instead.

Hazard rate. Of the 86 questions, zero responses are allowed to contain a clinically dangerous statement (wrong dosage, wrong contraindication, wrong stop window). This is graded by a second pass with a stronger model and a human spot-check. If hazard rate is non-zero, the model does not ship as a fallback for that question class, period.

The 8B QAT cleared 91% fact recall and zero hazards on the current eval. We re-run it weekly against the live retrieval index, because the index moves even when the model does not.
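The two gates reduce to a few lines once each response has been graded. A minimal sketch, assuming fact recall is pooled over all must-contain facts in the set (per-question averaging is the other reasonable reading) and that the hazard flag comes from the grading pass described above; the Graded record and passes_gates function are hypothetical names:

```python
from dataclasses import dataclass

@dataclass
class Graded:
    required_facts: int  # must-contain facts listed for this question
    facts_found: int     # how many of them the response contained
    hazardous: bool      # flagged as clinically dangerous by the grading pass

def passes_gates(results: list[Graded], min_recall: float = 0.90) -> tuple[bool, float, int]:
    """Return (ship?, pooled fact recall, hazard count) for one evaluation run."""
    total = sum(r.required_facts for r in results)
    found = sum(r.facts_found for r in results)
    recall = found / total
    hazards = sum(1 for r in results if r.hazardous)
    return recall >= min_recall and hazards == 0, recall, hazards

# Three invented results: 9 of 10 facts recalled, no hazards.
run = [Graded(4, 4, False), Graded(5, 4, False), Graded(1, 1, False)]
print(passes_gates(run))
```

Note that the hazard gate is not a threshold: a single flagged response fails the run regardless of how good the recall is.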

What you lose with a local fallback

The local model is worse at three things, in order of how much they matter in healthcare.

Long-context synthesis. Pull eight chunks of guideline text and ask for a single answer with sources cited correctly. The 8B QAT drops citation accuracy by about six percentage points compared to cloud. We compensate by asking the retriever to do harder work upfront: fewer, better chunks.

Multi-step reasoning. "Calculate the dose adjustment for a 72-year-old with eGFR 38 on rivaroxaban." Cloud nails it. The QAT model sometimes shows the right reasoning and the wrong final number. For arithmetic-bearing questions we route through a deterministic calculator and have the model assemble the prose around it.

Structured output. Less relevant in a clinical front-end, but if you reuse the same router for a developer-facing tool, the QAT model needs strict JSON schema validation on the way out.
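For that developer-facing case, the validation does not need to be elaborate to be strict. A standard-library sketch, assuming a hypothetical two-field response shape (answer and sources); a real tool with a larger schema would more likely reach for a dedicated JSON Schema validator:

```python
import json

# Hypothetical response shape: field name -> required Python type.
REQUIRED = {"answer": str, "sources": list}

def parse_strict(raw: str) -> dict:
    """Parse model output and reject anything that is not exactly the expected shape."""
    data = json.loads(raw)  # raises json.JSONDecodeError, a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for key, kind in REQUIRED.items():
        if not isinstance(data.get(key), kind):
            raise ValueError(f"{key!r} is missing or not a {kind.__name__}")
    extra = set(data) - set(REQUIRED)
    if extra:
        raise ValueError(f"unexpected fields: {sorted(extra)}")
    return data

print(parse_strict('{"answer": "ok", "sources": ["NHG"]}'))
```

Every failure surfaces as a ValueError, so the caller has one exception to catch before retrying or falling back to plain text.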

The smallest thing you can do today

If you run a RAG and you do not yet have a fallback path, the cheapest first step is a five-minute audit: open your last 30 days of error logs, count the cloud timeouts and 5xx responses, multiply by the number of users affected. If the number embarrasses you, the next step is brew install ollama on whichever laptop is closest. The model is free, the runtime is free, and the only thing standing between you and a working fallback is an hour of router code and a weekend of evaluation work.
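The audit itself fits in a dozen lines once the logs are parsed. A sketch, assuming each log record has already been reduced to a dict with a status field (an HTTP status code, or None for a timeout) and a user field; both field names are hypothetical and depend on your logging setup:

```python
def audit(records: list[dict]) -> tuple[int, int, int]:
    """Return (failed calls, distinct users affected, calls times users)."""
    failures = [r for r in records
                if r["status"] is None or 500 <= r["status"] <= 599]
    users = {r["user"] for r in failures}
    return len(failures), len(users), len(failures) * len(users)

# Invented sample: one timeout, one 529, one 503, one success.
logs = [
    {"status": None, "user": "nurse-1"},
    {"status": 529, "user": "nurse-1"},
    {"status": 503, "user": "nurse-2"},
    {"status": 200, "user": "nurse-3"},
]
print(audit(logs))  # → (3, 2, 6)
```

The product in the last position is a blunt number rather than a metric, but it is usually enough to decide whether the rest of the work is worth a weekend.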

When we wired this into the triage assistant for a GP collective in the Randstad, the hardest part was not the model or the router; it was writing the 86 evaluation questions with a doctor who had never thought of her work in that shape before. If you want help building AI agents that have to keep answering when the cloud is red, we have done it.

Key takeaway

A local fallback brain is not a smaller cloud model. It is a different product with a different quality contract. Ship it that way, or do not ship it.

FAQ

Can I run this on an 8 GB MacBook Air?

No. 8 GB is tight for an 8B QAT model with a 4k context. The OS will swap and throughput will drop from 25 tokens per second to single digits. 16 GB is the practical floor for the 8B variant.

Is a local model AVG-compliant by default?

Local processing is one piece of compliance, not all of it. You still need processing agreements, access logs, deletion policies, and a DPIA. The model running on-device does not replace the paperwork.

Why not just use a smaller hosted model as the fallback?

Two providers means two outages and two privacy stories. A local model collapses the failure modes onto one machine you control, and during the fallback window no PHI leaves the building.

How often does the failover actually fire in production?

About 0.4% of queries in our last six months on the GP collective deployment. Most are 60 to 90 second rate-limit windows rather than full outages, but they are the moments that erode user trust the fastest.
