Local coding models: 14 ways a SaaS founder gets it wrong

It is 22:14 on a Tuesday. Your technical co-founder pings you an HN thread claiming a 32B local model replaced Claude. Before you sign the GPU order, read this.

Jacob Molkenboer · Founder · A Brand New Company · 28 Nov 2025 · 8 min

It is 22:14 on a Tuesday. Your technical co-founder pings you a Hacker News link with the title "I replaced Claude with a local model and my team is faster." Six minutes later the message turns into a question: should we order an RTX 6000 Ada this week, or can we get away with the 4090 already sitting in the office?

You read the thread on the way to bed. By morning the question has hardened into a budget request, a Notion doc, and an opinion you do not have yet. We have built coding workflows on both hosted and local stacks for clients running between two and forty seats, and the same fourteen mistakes show up every time. Some you patch in a single ollama config block. Some force a hardware rebuild before the next sprint can start. The trick is knowing which is which before you spend the money.

The HN thread, briefly

The progress is real. Qwen2.5-Coder-32B closed most of the gap on coding-specific benchmarks in late 2024. DeepSeek-Coder-V2 and Llama 3.3 followed. The "I replaced X with local" posts are not lying. They are usually leaving out three things: which task they actually run on it, which context length they used, and what their team size is. The mistakes below are sorted by what it costs to fix them.

Tier 1: the ones a single ollama config patches

If your co-founder spent an afternoon installing ollama and reported back "it is fine but a bit dumb", these are the first five things to check. None of them require new hardware. Four of the five live in ollama's defaults, which are tuned for "runs on any laptop", not "is the best model my hardware can run."

1. The 2048-token context window

Ollama defaults num_ctx to 2048 tokens. That fits one short file. A real coding session needs the file you are editing, two files it imports, an interface definition, and a docstring. The model is not dumb. It is blind.

# Modelfile
FROM qwen2.5-coder:32b
PARAMETER num_ctx 32768
PARAMETER num_predict 4096

32k is the floor for serious work. 65k if your VRAM allows it. Rebuild with ollama create qwen-coder -f Modelfile and benchmark again.
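If you would rather not rebuild the model, the same settings can be passed per request. The sketch below only builds the JSON body for ollama's /api/generate endpoint; the helper name build_generate_request is hypothetical, and the values mirror the Modelfile above.

```python
import json


def build_generate_request(model, prompt, num_ctx=32768, num_predict=4096):
    """Build the body for a non-streaming /api/generate call.

    num_ctx and num_predict go under "options" and override the
    model's defaults for this request only.
    """
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx, "num_predict": num_predict},
    }


body = build_generate_request("qwen2.5-coder:32b", "def fib(n):")
payload = json.dumps(body)
```

POST the payload to http://localhost:11434/api/generate on a default install. A larger num_ctx costs VRAM, so check memory use after the first long request.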

2. KV-cache quantization left off

The K/V cache eats VRAM faster than the model weights once you push context past 16k. Setting OLLAMA_KV_CACHE_TYPE=q8_0 on the ollama server roughly halves the cache cost, with a quality hit small enough that no developer notices on autocomplete. It only takes effect when flash attention is enabled (OLLAMA_FLASH_ATTENTION=1). The hit on multi-turn agent loops is bigger. Test it on your task before you commit.
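To see why the cache dominates at long context, a back-of-envelope estimate helps. The sketch below assumes a Qwen2.5-32B-class architecture (64 layers, 8 KV heads, head dimension 128); those three numbers are assumptions, so take the real ones from your model card. The bytes-per-element values follow the GGML block layouts for f16, q8_0 and q4_0.

```python
# Assumed architecture for a Qwen2.5-32B-class model; check your model card.
N_LAYERS = 64
N_KV_HEADS = 8
HEAD_DIM = 128

# Storage cost per cached element. q8_0 packs 32 values into 34 bytes,
# q4_0 packs 32 values into 18 bytes.
BYTES_PER_ELEMENT = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}


def kv_cache_gib(context_tokens, cache_type="f16"):
    """Estimate K/V cache size in GiB for one sequence."""
    # One key and one value vector per layer and per KV head.
    elements_per_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM
    total_bytes = elements_per_token * BYTES_PER_ELEMENT[cache_type] * context_tokens
    return total_bytes / 2**30


for ctx in (2048, 16384, 32768):
    print(ctx, round(kv_cache_gib(ctx, "f16"), 2), round(kv_cache_gib(ctx, "q8_0"), 2))
```

Under these assumptions a 32k context costs about 8 GiB at f16 and about 4.25 GiB at q8_0, per sequence. Parallel request slots multiply that figure.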

3. Wrong `num_gpu` layer count

On a 24GB card running a 32B Q4 model, ollama's autodetect will sometimes leave six layers on the CPU and you will see 9 tok/s where you should see 35. Set PARAMETER num_gpu 65 (or whatever your model's layer count is) explicitly. Watch nvidia-smi during generation. If VRAM use sits below 90%, you have room and you are leaking layers to CPU.
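The nvidia-smi check can be scripted so it runs alongside a benchmark. This is a minimal sketch: the helper names are hypothetical, and the 90% threshold is the rule of thumb from the paragraph above, not a hard limit.

```python
import subprocess


def parse_vram(csv_line):
    """Parse one 'used, total' line (MiB) from nvidia-smi CSV output."""
    used, total = (int(field.strip()) for field in csv_line.split(","))
    return used, total


def vram_fraction(csv_line):
    """Return used VRAM as a fraction of total for one GPU."""
    used, total = parse_vram(csv_line)
    return used / total


def read_vram_lines():
    """Query every GPU; returns one CSV line per card."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=memory.used,memory.total",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip().splitlines()


def likely_leaking_layers(csv_line, threshold=0.90):
    """True if VRAM use during generation sits below the threshold."""
    return vram_fraction(csv_line) < threshold
```

Call read_vram_lines() while the model is generating, not while it is idle, and pass each line to likely_leaking_layers().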

4. Pulling the default quant

The tag qwen2.5-coder:32b usually maps to Q4_K_M. That is fine for chat. For code, Q5_K_M or Q6_K is the line where my own team stops noticing the difference from FP16. The download is one extra command. The quality delta on a 200-line refactor is not subtle.

5. No coding template

Generic chat templates wrap user messages in "You are a helpful assistant". Coding models ship with FIM (fill-in-the-middle) tokens and prefix/suffix structure. If your continue.dev or aider config is sending plain chat completions to a coder model, you are using maybe 60% of what it knows. Most editor plugins handle this if you tell them the model family. Check the config file. It is one line.
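For reference, this is roughly what a fill-in-the-middle prompt looks like once the plugin gets it right. The sketch uses the FIM special tokens published for the Qwen2.5-Coder family; other families use different tokens, so treat the constants as an assumption tied to that model.

```python
# FIM special tokens for the Qwen2.5-Coder family (other models differ).
FIM_PREFIX = "<|fim_prefix|>"
FIM_SUFFIX = "<|fim_suffix|>"
FIM_MIDDLE = "<|fim_middle|>"


def build_fim_prompt(prefix, suffix):
    """Wrap the code before and after the cursor in FIM tokens.

    The model is expected to generate the text that belongs
    between prefix and suffix.
    """
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"


prompt = build_fim_prompt(
    prefix="def total(items):\n    return ",
    suffix="\n\nprint(total([1, 2, 3]))\n",
)
```

A plain chat completion never produces this structure, which is why the same model feels weaker through a misconfigured plugin.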

Tier 2: the ones that take a workflow change before Friday

These need an afternoon, not a card. Reorder your sprint, do not order parts.

6. Wrong model family

Llama 3.3 70B is a good general model. It is not a coding model. If your co-founder pulled it because an HN comment said "70B beat Sonnet on MMLU", they picked the wrong axis. Qwen2.5-Coder, DeepSeek-Coder-V2 and Codestral are trained on code. The 32B coder will out-edit a 70B generalist on your repo and use roughly half the VRAM at the same quantization.

7. No draft model for speculative decoding

Pairing a 32B target with a 1.5B draft model from the same family gives you 1.6x to 2.2x throughput on most coding loads with no quality loss. Most teams do not switch it on because the default works. At those speedups, the default leaves roughly 40% to 55% of achievable inference speed on the table.

8. Single-seat assumption

The HN poster was working alone at midnight. Your team is five people writing at 14:00. One ollama server can batch requests, but only if you start it with the environment variables OLLAMA_NUM_PARALLEL=4 and OLLAMA_MAX_LOADED_MODELS=2 set. Without them, requests queue. Your senior asks for an autocomplete and waits three seconds while a junior is mid-refactor.
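Both settings are read from the server's environment at startup, so they have to be in place before ollama serve runs. A minimal sketch of a launcher, with hypothetical helper names and the values from the paragraph above:

```python
import os
import subprocess


def ollama_server_env(num_parallel=4, max_loaded_models=2, base_env=None):
    """Return a copy of the environment with the concurrency settings added."""
    env = dict(os.environ if base_env is None else base_env)
    env["OLLAMA_NUM_PARALLEL"] = str(num_parallel)
    env["OLLAMA_MAX_LOADED_MODELS"] = str(max_loaded_models)
    return env


def start_server(env):
    """Start 'ollama serve' with the given environment (not called here)."""
    return subprocess.Popen(["ollama", "serve"], env=env)


env = ollama_server_env(base_env={"PATH": "/usr/bin"})
```

Each parallel slot needs its own share of the context, so raising OLLAMA_NUM_PARALLEL raises VRAM use as well. If ollama runs as a system service, set the variables in the service definition instead.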

9. Agent loops on a non-tool-tuned model

Your coding workflow probably involves multi-step tool use: read file, edit, run tests, read output. Most coder models are tuned for single-turn completion, not function-calling chains. Send an agent loop at them and watch the JSON malform on turn three. Either pick a model with function-calling in the card (Qwen2.5-Coder-Instruct handles light loops) or keep agent work on the hosted model and use local for completion only.
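Whichever model runs the loop, it is worth rejecting malformed tool calls before they reach your file system. The sketch below assumes a simple call shape with "name" and "arguments" keys and a hypothetical set of three tools; your agent framework's schema will differ.

```python
import json

# Hypothetical tool names for illustration only.
ALLOWED_TOOLS = {"read_file", "edit_file", "run_tests"}


def parse_tool_call(raw):
    """Return (name, arguments) for a well-formed call, else None."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return None  # truncated or malformed JSON from the model
    if not isinstance(call, dict):
        return None
    name = call.get("name")
    arguments = call.get("arguments")
    if not isinstance(name, str) or name not in ALLOWED_TOOLS:
        return None
    if not isinstance(arguments, dict):
        return None
    return name, arguments


good = parse_tool_call('{"name": "read_file", "arguments": {"path": "app.py"}}')
bad = parse_tool_call('{"name": "read_file", "arguments": {"path": ')
```

Counting how often parse_tool_call returns None across a run gives you a concrete failure rate to compare between local and hosted models.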

10. No prompt-cache strategy

The hosted APIs give you steep discounts on cached prefixes. Local gives you the same speedup for free, but only if your editor reuses the prefix. If every keystroke sends a fresh "here is my whole repo, please continue", you are paying full prefill latency every time. Aider does this right. Some Continue.dev configs do not. Check before you blame the model.

Tier 3: the ones that need a new GPU before the next sprint

If your honest assessment of the codebase, team size and context length lands here, no config will save you. Order parts.

11. VRAM shortfall

A 24GB RTX 4090 will run a 32B coder at Q4 with 16k context if you are careful. Push to 32k context with KV cache at Q8 and you are at 22.4GB before the cache grows during generation. One long session and you OOM. The RTX 6000 Ada at 48GB is the next stable rung. The Pro 6000 Blackwell at 96GB is the rung after that, and unless you are training, it is overkill for a five-person team.

12. No multi-GPU plan

One 4090 handles one developer well, two acceptably, three poorly. If your team is five, you either need a card sized for batched inference (the 48GB or 96GB tier) or two cards with vLLM in front. Ollama does not parallelise across cards as cleanly as vLLM does. That is a stack change as much as a hardware change.

13. Power, cooling, noise in a Dutch office

A workstation pulling 450W under load in a 12 m² office in Amsterdam at €0.34/kWh costs roughly €1.20 to €1.30 per working day in electricity alone, and the room is 4°C warmer by 16:00. A datacenter card with a blower fan is loud enough that nobody sits next to it. If you do not have a server closet with its own air, you have just bought a heater that your team will resent by August.
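The electricity figure is plain arithmetic: power in kW, times hours under load, times the tariff. The length of the working day is an assumption here, and the card will not sit at full load all day, so read the result as an upper bound.

```python
def daily_cost_eur(watts, hours_under_load, eur_per_kwh):
    """Electricity cost for one day at a constant power draw."""
    kilowatt_hours = watts / 1000 * hours_under_load
    return kilowatt_hours * eur_per_kwh


eight_hours = daily_cost_eur(450, 8, 0.34)
eight_and_a_half = daily_cost_eur(450, 8.5, 0.34)
print(round(eight_hours, 2), round(eight_and_a_half, 2))
```

At 450W and €0.34/kWh, an eight-hour day comes to about €1.22 and an eight-and-a-half-hour day to about €1.30. Cooling for the extra heat is not included.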

14. Storage I/O for model swap

If your workflow needs a coder model for editing and a separate model for review, you will be swapping 20 to 40 GB of weights between disk and VRAM. A SATA SSD takes 90 seconds. A PCIe 4.0 NVMe takes 12. PCIe 5.0 takes 6. On a per-day basis this is small. On a "developer breaks flow waiting" basis it is the difference between local feeling fast and local feeling like compiling C++ in 2008.
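To adapt these numbers to your own drives, divide model size by sustained read throughput. The sketch below assumes the quoted times refer to the 40 GB end of the range, which the text does not state; the implied throughput figures are effective rates, below what drive datasheets advertise.

```python
def load_seconds(size_gb, throughput_gb_per_s):
    """Time to read a model of size_gb at a sustained throughput."""
    return size_gb / throughput_gb_per_s


def implied_throughput(size_gb, seconds):
    """Effective throughput implied by an observed load time."""
    return size_gb / seconds


# Assumption: the quoted times are for a 40 GB model.
for label, seconds in (("SATA SSD", 90), ("PCIe 4.0 NVMe", 12), ("PCIe 5.0 NVMe", 6)):
    print(label, round(implied_throughput(40, seconds), 2), "GB/s")
```

Measure your own drive with a real model file before relying on either number, since file system caching and the loader both affect the result.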

The benchmark nobody runs

The MMLU score is irrelevant to you. HumanEval is closer but still not your codebase. Before you decide, take twenty real tasks from last sprint's git log. Feed each to the candidate local model at the context length you actually use, with the editor integration you actually use, on the hardware you actually have. Score them yourself on a four-point scale: shipped, shipped after edits, wrong but useful direction, useless.

If 80% of tasks land in the top two buckets, you have your answer. If they do not, you have a Tier 1 or Tier 2 mistake left to fix first.
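The scoring itself fits in a few lines, which removes one excuse for not running it. A minimal sketch, assuming you record one bucket label per task; the function names are hypothetical.

```python
from collections import Counter

# The four-point scale, best first.
BUCKETS = (
    "shipped",
    "shipped after edits",
    "wrong but useful direction",
    "useless",
)
TOP_TWO = BUCKETS[:2]


def top_two_share(scores):
    """Fraction of tasks that landed in the top two buckets."""
    if not scores:
        raise ValueError("no scored tasks")
    unknown = set(scores) - set(BUCKETS)
    if unknown:
        raise ValueError(f"unknown bucket labels: {sorted(unknown)}")
    counts = Counter(scores)
    return sum(counts[bucket] for bucket in TOP_TWO) / len(scores)


def passes(scores, threshold=0.80):
    """True if the model clears the 80% bar."""
    return top_two_share(scores) >= threshold


scores = (
    ["shipped"] * 10
    + ["shipped after edits"] * 6
    + ["wrong but useful direction"] * 3
    + ["useless"]
)
print(top_two_share(scores), passes(scores))
```

Run it once per configuration (default, Tier 1 fixes applied, hosted baseline) so the comparison uses the same twenty tasks each time.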

When local actually wins

Not every local-vs-hosted argument is a status game. Local is the right call when your data cannot leave the building (medical, legal, defence subcontractors who actually read their MSAs), when tab-complete latency under 100 ms matters more than peak quality, or when half your team writes code from the train and wants the same experience offline. For most sub-€20M SaaS teams shipping a CRUD product to other businesses, none of those apply, and the honest answer is that hosted with a careful prompt-cache strategy still wins on cost per quality unit.

Warning

The most expensive mistake is not picking the wrong tier. It is buying the RTX 6000 first and then discovering the actual problem was the 2048-token default context window.

What to do tomorrow morning

Before the budget request goes to the bank, take an hour, set num_ctx to 32768, set OLLAMA_KV_CACHE_TYPE to q8_0, pull the Q5_K_M tag, and rerun the same twenty tasks. If your co-founder still wants the GPU after that, sign the order. When we rebuilt the assistant stack for a Rotterdam-based SaaS client last quarter, this exact sequence saved them a €7,200 card purchase and revealed that the real problem was a Continue.dev config sending plain chat completions to a coder model. The work we do on AI agents almost always starts with one of those five Tier 1 fixes.

Key takeaway

Four of the five quick fixes for a slow local coding model live in ollama's default config. Check those before you sign the GPU purchase order.

FAQ

Which local coding model should I try first in 2026?

Qwen2.5-Coder-32B-Instruct at Q5_K_M is the safest first try for general repo work. DeepSeek-Coder-V2 Lite is the lighter alternative when VRAM is tight.

Will a single RTX 4090 serve a five-developer team?

Comfortably for two seats, acceptably for three with batching enabled, poorly for five. Either move to a 48GB card or run vLLM across two GPUs before adding seats.

Is local cheaper than hosted at our team size?

Below ten seats, hosted with prompt caching is usually cheaper once you count electricity, cooling, and the engineer time spent maintaining the inference stack.

Can a local model handle agent loops with tool calls?

Only if it is tuned for function calling. Qwen2.5-Coder-Instruct handles light loops. For multi-step agents with branching tool use, keep a hosted model in the loop.
