Local coding models: 14 ways a SaaS founder gets it wrong
It is 22:14 on a Tuesday. Your technical co-founder pings you a Hacker News link with the title "I replaced Claude with a local model and my team is faster." Six minutes later the message turns into a question: should we order an RTX 6000 Ada this week, or can we get away with the 4090 already sitting in the office?
You read the thread on the way to bed. By morning the question has hardened into a budget request, a Notion doc, and an opinion you do not have yet. We have built coding workflows on both hosted and local stacks for clients running between two and forty seats, and the same fourteen mistakes show up every time. Some you patch in a single ollama config block. Some force a hardware rebuild before the next sprint can start. The trick is knowing which is which before you spend the money.
The HN thread, briefly
The progress is real. Qwen2.5-Coder-32B closed most of the gap to the hosted models on coding-specific benchmarks in late 2024. DeepSeek-Coder-V2 had shipped a few months earlier, and Llama 3.3 followed in December. The "I replaced X with local" posts are not lying. They are usually leaving out three things: which task they actually run on it, which context length they used, and what their team size is. The mistakes below are sorted by what it costs to fix them.
Tier 1: the ones a single ollama config patches
If your co-founder spent an afternoon installing ollama and reported back "it is fine but a bit dumb", these are the first five things to check. None of them require new hardware. Four of the five live in ollama's defaults, which are tuned for "runs on any laptop", not "is the best model my hardware can run."
1. The 2048-token context window
Ollama defaults num_ctx to 2048 tokens. That fits one short file. A real coding session needs the file you are editing, two files it imports, an interface definition, and a docstring. The model is not dumb. It is blind.
# Modelfile
FROM qwen2.5-coder:32b
PARAMETER num_ctx 32768
PARAMETER num_predict 4096
32k is the floor for serious work. Go to 64k (65536 tokens) if your VRAM allows it. Rebuild with ollama create qwen-coder -f Modelfile and benchmark again.
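If you would rather not rebuild the model, the same options can be sent per request. A minimal sketch, assuming an Ollama server on its default local port (11434) and its /api/generate endpoint; the helper names build_generate_request and generate are ours, not part of Ollama.

```python
import json
import urllib.request

# Assumption: an Ollama server is listening on its default local port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(model: str, prompt: str,
                           num_ctx: int = 32768,
                           num_predict: int = 4096) -> dict:
    """Build a non-streaming payload with explicit context options."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx, "num_predict": num_predict},
    }


def generate(payload: dict, url: str = OLLAMA_URL) -> str:
    """POST the payload and return the generated text (needs a running server)."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]


if __name__ == "__main__":
    payload = build_generate_request("qwen2.5-coder:32b",
                                     "Write a binary search in Go.")
    # Only prints the options; call generate(payload) against a live server.
    print(json.dumps(payload["options"]))
```

Per-request options are handy for A/B runs: same prompt, same model, two context sizes, and you see the difference without touching the Modelfile.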
2. KV-cache quantization left off
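The K/V cache grows linearly with context while the model weights stay fixed, so once you push past 16k it is the part of the VRAM budget you actually have to manage. Setting OLLAMA_KV_CACHE_TYPE=q8_0 (which also needs flash attention enabled via OLLAMA_FLASH_ATTENTION=1) roughly halves the cache cost compared with the f16 default, with a quality hit small enough that no developer notices on autocomplete. The hit on multi-turn agent loops is bigger. Test it on your task before you commit.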
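To see why the cache matters, it helps to do the arithmetic once. This is a rough sketch, not a measurement: the layer count, KV-head count and head dimension below are assumptions taken from the published config of a Qwen2.5 32B-class model, and real runtimes add compute buffers on top.

```python
GIB = 2**30

# Assumed architecture values (check your model's config before relying on them).
LAYERS, KV_HEADS, HEAD_DIM = 64, 8, 128

F16_BYTES = 2.0
# q8_0 stores blocks of 32 int8 values plus one fp16 scale: 34 bytes per 32 values.
Q8_0_BYTES = 34 / 32


def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_element):
    """Size of the K and V tensors across all layers for num_tokens of context."""
    # Factor 2: one K and one V entry per layer, per KV head, per token.
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * bytes_per_element


if __name__ == "__main__":
    for ctx in (16384, 32768, 65536):
        f16 = kv_cache_bytes(ctx, LAYERS, KV_HEADS, HEAD_DIM, F16_BYTES) / GIB
        q8 = kv_cache_bytes(ctx, LAYERS, KV_HEADS, HEAD_DIM, Q8_0_BYTES) / GIB
        print(f"{ctx:>6} tokens: f16 {f16:.2f} GiB, q8_0 {q8:.2f} GiB")
```

Under these assumptions a 32k context costs about 8 GiB at f16 and about 4.25 GiB at q8_0, which is the gap between fitting on a 24GB card and not.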
3. Wrong num_gpu layer count
On a 24GB card running a 32B Q4 model, ollama's autodetect will sometimes leave six layers on the CPU and you will see 9 tok/s where you should see 35. Set PARAMETER num_gpu 65 (or whatever your model's layer count is) explicitly. Watch nvidia-smi during generation. If VRAM use sits below 90%, you have room and you are leaking layers to CPU.
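The 90% check is easy to script. A minimal sketch that parses the CSV output of nvidia-smi's --query-gpu mode; the 0.90 threshold comes from the rule of thumb above, and the sample reading in the demo is hypothetical.

```python
import subprocess

QUERY = ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"]


def parse_vram(csv_text: str) -> list[tuple[int, int]]:
    """Parse 'used, total' MiB pairs, one line per GPU."""
    rows = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        rows.append((used, total))
    return rows


def underfilled(rows, threshold: float = 0.90) -> list[int]:
    """Return indices of GPUs whose VRAM use is below the threshold."""
    return [i for i, (used, total) in enumerate(rows) if used / total < threshold]


def query_vram() -> list[tuple[int, int]]:
    """Run nvidia-smi and parse its output (needs an NVIDIA driver installed)."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_vram(out)


if __name__ == "__main__":
    sample = "19800, 24564\n"  # hypothetical reading taken mid-generation
    print(underfilled(parse_vram(sample)))
```

Run query_vram() while the model is generating, not while it is idle. A GPU index in the result means there is headroom and layers may be sitting on the CPU.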
4. Pulling the default quant
The tag qwen2.5-coder:32b usually maps to Q4_K_M. That is fine for chat. For code, Q5_K_M or Q6_K is the line where our team stops noticing the difference from FP16, provided the larger file still fits in VRAM alongside the context you need. The download is one extra command. The quality delta on a 200-line refactor is not subtle.
5. No coding template
Generic chat templates wrap every request in a "You are a helpful assistant" system prompt. Coding models ship with FIM (fill-in-the-middle) tokens and a prefix/suffix structure for completing code at the cursor. If your Continue.dev or aider config is sending plain chat completions to a coder model, you are using maybe 60% of what it knows. Most editor plugins handle this if you tell them the model family. Check the config file. It is one line.
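For a sense of what the plugin should be sending, here is a sketch of a fill-in-the-middle prompt. It assumes the FIM special tokens documented for the Qwen2.5-Coder family; other model families use different tokens, so treat the constants as an example, not a universal format.

```python
# Assumption: FIM tokens as documented for Qwen2.5-Coder. Other families differ.
FIM_PREFIX = "<|fim_prefix|>"
FIM_SUFFIX = "<|fim_suffix|>"
FIM_MIDDLE = "<|fim_middle|>"


def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Wrap the code before and after the cursor so the model fills the gap."""
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"


if __name__ == "__main__":
    before_cursor = "def mean(values):\n    total = "
    after_cursor = "\n    return total / len(values)\n"
    print(build_fim_prompt(before_cursor, after_cursor))
```

The point of the structure is that the model sees the code after the cursor too. A plain chat completion only ever sees what comes before it.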
Tier 2: the ones that take a workflow change before Friday
These need an afternoon, not a card. Reorder your sprint, do not order parts.
6. Wrong model family
Llama 3.3 70B is a good general model. It is not a coding model. If your co-founder pulled it because an HN comment said "70B beat Sonnet on MMLU", they picked the wrong axis. Qwen2.5-Coder, DeepSeek-Coder-V2 and Codestral are trained on code. The 32B coder will out-edit a 70B generalist on your repo and use less than half the VRAM at the same quantization.
7. No draft model for speculative decoding
Pairing a 32B target with a 1.5B draft model from the same family gives you 1.6x to 2.2x throughput on most coding loads with no quality loss, because the target model still verifies every token the draft proposes. Most teams do not switch it on because the default works. The default is leaving roughly 40% to 55% of inference speed on the table.
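The link between a speedup factor and "speed left on the table" is one line of arithmetic. A small sketch using the 1.6x and 2.2x figures quoted above; the function name is ours.

```python
def fraction_left_on_table(speedup: float) -> float:
    """Share of achievable throughput you forgo by running without the speedup."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    # Without the draft model you run at 1/speedup of the achievable rate.
    return 1.0 - 1.0 / speedup


if __name__ == "__main__":
    for s in (1.6, 2.2):
        print(f"{s}x speedup -> {fraction_left_on_table(s):.1%} forgone")
```

At 1.6x that works out to 37.5% forgone, and at 2.2x to a little under 55%.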
8. Single-seat assumption
The HN poster was working alone at midnight. Your team is five people writing at 14:00. One ollama server can batch requests, but only if you start it with the environment variables OLLAMA_NUM_PARALLEL=4 and OLLAMA_MAX_LOADED_MODELS=2. Without them, requests queue. Your senior asks for an autocomplete and waits three seconds while a junior is mid-refactor.
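Both settings are environment variables read by the server process, so they have to be set where the server starts, not in the editor. A minimal sketch of a launcher, assuming the ollama binary is on the PATH; the helper names are ours.

```python
import os
import subprocess


def ollama_server_env(num_parallel: int = 4, max_loaded_models: int = 2) -> dict:
    """Copy the current environment and add the two concurrency settings."""
    env = dict(os.environ)
    env["OLLAMA_NUM_PARALLEL"] = str(num_parallel)
    env["OLLAMA_MAX_LOADED_MODELS"] = str(max_loaded_models)
    return env


def start_server(env: dict) -> subprocess.Popen:
    """Start 'ollama serve' with the given environment (not called in the demo)."""
    return subprocess.Popen(["ollama", "serve"], env=env)


if __name__ == "__main__":
    env = ollama_server_env()
    print(env["OLLAMA_NUM_PARALLEL"], env["OLLAMA_MAX_LOADED_MODELS"])
```

Parallel slots share the same VRAM budget, so re-check memory use after raising the value.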
9. Agent loops on a non-tool-tuned model
Your coding workflow probably involves multi-step tool use: read file, edit, run tests, read output. Most coder models are tuned for single-turn completion, not function-calling chains. Send an agent loop at them and watch the JSON malform on turn three. Either pick a model with function-calling in the card (Qwen2.5-Coder-Instruct handles light loops) or keep agent work on the hosted model and use local for completion only.
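A cheap way to find out where a model falls over is to validate every tool call before executing it and count the rejects per turn. A minimal sketch; the tool registry and the JSON shape ({"name": ..., "arguments": {...}}) are hypothetical and should be swapped for whatever your agent framework uses.

```python
import json

# Hypothetical tool registry: tool name -> required argument names.
TOOLS = {
    "read_file": {"path"},
    "edit_file": {"path", "patch"},
    "run_tests": set(),
}


def parse_tool_call(raw: str):
    """Return (name, arguments) for a well-formed call to a known tool, else None."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return None  # truncated or malformed JSON
    if not isinstance(call, dict):
        return None
    name = call.get("name")
    args = call.get("arguments", {})
    if not isinstance(name, str) or name not in TOOLS or not isinstance(args, dict):
        return None
    if not TOOLS[name].issubset(args):
        return None  # a required argument is missing
    return name, args


if __name__ == "__main__":
    good = '{"name": "read_file", "arguments": {"path": "app/models.py"}}'
    bad = '{"name": "read_file", "arguments": {"path": '
    print(parse_tool_call(good))
    print(parse_tool_call(bad))
```

If the reject rate climbs after the second or third turn on your own tasks, that is the signal to keep agent work on the hosted model.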
10. No prompt-cache strategy
The hosted APIs give you steep discounts on cached prefixes. Local gives you the same speedup for free, but only if your editor reuses the prefix. If every keystroke sends a fresh "here is my whole repo, please continue", you are paying full prefill latency every time. Aider does this right. Some Continue.dev configs do not. Check before you blame the model.
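You can estimate how cache-friendly an editor integration is by logging two consecutive prompts and measuring how much of the second one repeats the first. A rough sketch that counts characters as a stand-in for tokens, which is an approximation; the function names are ours.

```python
def shared_prefix_len(a: str, b: str) -> int:
    """Number of leading characters the two strings have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def reuse_ratio(previous: str, current: str) -> float:
    """Share of the current prompt that a prefix cache could skip."""
    if not current:
        return 0.0
    return shared_prefix_len(previous, current) / len(current)


if __name__ == "__main__":
    repo_context = "### file: models.py\n" + "x = 1\n" * 200
    first = repo_context + "### cursor\ndef save("
    second = repo_context + "### cursor\ndef save(self"
    print(f"{reuse_ratio(first, second):.1%} of the second prompt is reusable")
```

Prompts that put the stable repo context first and the changing cursor region last score close to 100%. Prompts that lead with a timestamp or a reshuffled file list score near zero, and every keystroke pays full prefill.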
Tier 3: the ones that need a new GPU before the next sprint
If your honest assessment of the codebase, team size and context length lands here, no config will save you. Order parts.
11. VRAM shortfall
A 24GB RTX 4090 will run a 32B coder at Q4 with 16k context if you are careful. Push to 32k context with KV cache at Q8 and you are at 22.4GB before the cache grows during generation. One long session and you OOM. The RTX 6000 Ada at 48GB is the next stable rung. The Pro 6000 Blackwell at 96GB is the rung after that, and unless you are training, it is overkill for a five-person team.
12. No multi-GPU plan
One 4090 handles one developer well, two acceptably, three poorly. If your team is five, you either need a card sized for batched inference (the 48GB or 96GB tier) or two cards with vLLM in front. Ollama does not parallelise across cards as cleanly as vLLM does. That is a stack change as much as a hardware change.
13. Power, cooling, noise in a Dutch office
A workstation pulling 450W under sustained load in a 12 m² office in Amsterdam at €0.34/kWh costs about €1.30 per 8.5-hour working day in electricity alone, and the room is 4°C warmer by 16:00. A datacenter card with a blower fan is loud enough that nobody sits next to it. If you do not have a server closet with its own air, you have just bought a heater that your team will resent by August.
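The electricity figure is simple enough to redo for your own tariff and hours. A small sketch using the 450W and €0.34/kWh numbers above; the hours per day are an assumption you should replace, and sustained full load is the worst case.

```python
def daily_cost_eur(watts: float, hours: float, eur_per_kwh: float) -> float:
    """Electricity cost of running a constant load for the given hours."""
    return watts / 1000 * hours * eur_per_kwh


if __name__ == "__main__":
    for hours in (8, 8.5):
        cost = daily_cost_eur(450, hours, 0.34)
        print(f"{hours} h at 450 W: EUR {cost:.2f} per day")
```

Eight hours comes to about €1.22 and eight and a half to about €1.30. Real draw is lower, because the card idles between requests.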
14. Storage I/O for model swap
If your workflow needs a coder model for editing and a separate model for review, you will be swapping 20 to 40 GB of weights between disk and VRAM. A SATA SSD takes 90 seconds. A PCIe 4.0 NVMe takes 12. PCIe 5.0 takes 6. On a per-day basis this is small. On a "developer breaks flow waiting" basis it is the difference between local feeling fast and local feeling like compiling C++ in 2008.
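Swap time is file size divided by sustained read throughput. The throughput values in this sketch are hypothetical, chosen to land near the figures above for a 40 GB file; measure your own disk before trusting them.

```python
# Hypothetical sustained read rates in GB/s (real drives vary widely).
ASSUMED_GB_PER_S = {
    "SATA SSD": 0.45,
    "PCIe 4.0 NVMe": 3.3,
    "PCIe 5.0 NVMe": 6.7,
}


def load_seconds(size_gb: float, gb_per_s: float) -> float:
    """Time to read a model file of size_gb at a sustained rate of gb_per_s."""
    if gb_per_s <= 0:
        raise ValueError("throughput must be positive")
    return size_gb / gb_per_s


if __name__ == "__main__":
    for disk, rate in ASSUMED_GB_PER_S.items():
        print(f"{disk}: {load_seconds(40, rate):.0f} s for a 40 GB model")
```

Multiply by the number of swaps per developer per day to see whether it is worth keeping both models resident instead.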
The benchmark nobody runs
The MMLU score is irrelevant to you. HumanEval is closer but still not your codebase. Before you decide, take twenty real tasks from last sprint's git log. Feed each to the candidate local model at the context length you actually use, with the editor integration you actually use, on the hardware you actually have. Score them yourself on a four-point scale: shipped, shipped after edits, wrong but useful direction, useless.
If 80% of tasks land in the top two buckets, you have your answer. If they do not, you have a Tier 1 or Tier 2 mistake left to fix first.
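Keeping the tally honest takes a dozen lines. A minimal sketch of the four-bucket scoring with the 80% threshold; the sample scores in the demo are made up for illustration.

```python
from collections import Counter

BUCKETS = ("shipped", "shipped after edits",
           "wrong but useful direction", "useless")
PASSING = set(BUCKETS[:2])  # the top two buckets count as a pass


def pass_rate(scores: list[str]) -> float:
    """Share of tasks that landed in the top two buckets."""
    unknown = set(scores) - set(BUCKETS)
    if unknown:
        raise ValueError(f"unknown bucket(s): {sorted(unknown)}")
    if not scores:
        return 0.0
    counts = Counter(scores)
    return sum(counts[b] for b in PASSING) / len(scores)


def ready_for_local(scores: list[str], threshold: float = 0.80) -> bool:
    """True when the pass rate meets the threshold."""
    return pass_rate(scores) >= threshold


if __name__ == "__main__":
    # Hypothetical results for twenty tasks.
    scores = (["shipped"] * 9 + ["shipped after edits"] * 6
              + ["wrong but useful direction"] * 3 + ["useless"] * 2)
    print(f"pass rate {pass_rate(scores):.0%}, ready: {ready_for_local(scores)}")
```

Score the same twenty tasks again after each Tier 1 or Tier 2 fix, so you can see which change actually moved the number.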
When local actually wins
Not every local-vs-hosted argument is a status game. Local is the right call when your data cannot leave the building (medical, legal, defence subcontractors who actually read their MSAs), when tab-complete latency under 100 ms matters more than peak quality, or when half your team writes code from the train and wants the same experience offline. For most sub-€20M SaaS teams shipping a CRUD product to other businesses, none of those apply, and the honest answer is that hosted with a careful prompt-cache strategy still wins on cost per quality unit.
The most expensive mistake is not picking the wrong tier. It is buying the RTX 6000 first and then discovering the actual problem was the 2048-token default context window.
What to do tomorrow morning
Before the budget request goes to the bank, take an hour, set num_ctx to 32768, set OLLAMA_KV_CACHE_TYPE to q8_0, pull the Q5_K_M tag, and rerun the same twenty tasks. If your co-founder still wants the GPU after that, sign the order. When we rebuilt the assistant stack for a Rotterdam-based SaaS client last quarter, this exact sequence saved them a €7,200 card purchase and revealed that the real problem was a Continue.dev config sending plain chat completions to a coder model. The work we do on AI agents almost always starts with one of those five Tier 1 fixes.
Key takeaway
Four of the five quick fixes for a slow local coding model live in ollama's default config. Check those before you sign the GPU purchase order.
FAQ
Which local coding model should I try first in 2026?
Qwen2.5-Coder-32B-Instruct at Q5_K_M is the safest first try for general repo work. DeepSeek-Coder-V2 Lite is the lighter alternative when VRAM is tight.
Will a single RTX 4090 serve a five-developer team?
Comfortably for one seat, acceptably for two with batching enabled, poorly for three or more. Either move to a 48GB card or run vLLM across two GPUs before adding seats.
Is local cheaper than hosted at our team size?
Below ten seats, hosted with prompt caching is usually cheaper once you count electricity, cooling, and the engineer time spent maintaining the inference stack.
Can a local model handle agent loops with tool calls?
Only if it is tuned for function calling. Qwen2.5-Coder-Instruct handles light loops. For multi-step agents with branching tool use, keep a hosted model in the loop.