Local coding agent on macOS: a 14-person studio playbook

Tuesday morning at 09:14 in Haarlem. A non-engineer founder opens the dashboard and reads twelve green, two amber. Here is the playbook behind that screen.

Jacob Molkenboer · Founder · A Brand New Company · 2 Jul 2025 · 9 min

Tuesday, 09:14. The Spaarne canal is still flat outside the Haarlem studio window, the espresso machine has stopped wheezing, and Marije (non-engineer, founder) opens the dashboard on her iPad. Fourteen rows. Twelve green. Two amber. One developer's nightly eval failed against the monorepo at 03:22 because a new Swift Package Manager pin broke the Xcode bridge. She forwards it to the lead engineer with one sentence and goes back to standup.

That dashboard, and the fleet behind it, took us six weeks to ship. Here is the playbook.

Going local, and the NDA that forced it

A studio of fourteen running paid cloud-agent seats with team-tier privacy assurances is a fine line item. What was not fine was the answer to "where does this code actually go." Client NDAs at this studio cover three banks, one airline, and a public-sector procurement project. The compliance officer at the airline asked, in writing, in March, for an assurance that no developer prompt or completion would touch a third-party LLM provider. We could not produce that on a cloud seat without enterprise legal work that would have run past the project deadline.

Recent Hacker News traffic on this is not subtle. "Open source AI must win" hit the front page with 903 points the week we were scoping the work, and a tutorial titled "How to setup a local coding agent on macOS" landed at 372 points the day after. Two-thirds of the questions in those threads were variants of "how do I get this running for a team, not just my laptop?" That is the gap this playbook closes.

The fleet, the stack, the constraint

Fleet: fourteen MacBooks. Eleven M3 Pro at 36GB unified memory, three M4 Max at 64GB.

Stack: Ollama 0.5.x as the local runtime, Qwen3-Coder 30B at Q4_K_M quant for inline completion, Qwen3-Coder 72B (Q4) on the three M4 Max boxes for chat-mode refactors, Continue.dev 1.x as the IDE bridge, a small Go binary we wrote called evalctl for the nightly harness, and a static page on the studio's intranet that reads the harness output.

The constraint: the founder must be able to read green-or-red without asking an engineer. That single rule shaped every downstream decision.

The Ollama install, fleet edition

You can fetch Ollama and click through it on one Mac in ninety seconds. Doing that fourteen times, with model pulls of 18 to 40GB each, on residential fibre, is the bit nobody writes about.

We did it in two passes. Pass one, a launchd job that installs Ollama via Homebrew and pulls only the inline-completion model, fired at 23:00 local so the office uplink was free. Pass two, the chat model, gated by a hardware check so it only landed on the M4 Max boxes.

The bootstrap script we deployed lives at /usr/local/sbin/abn-localagent-bootstrap.sh:

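#!/usr/bin/env bash
set -euo pipefail

# Homebrew lives under /opt/homebrew on Apple Silicon and is usually not on
# PATH for launchd or MDM-run scripts, so call it by absolute path throughout.
BREW=/opt/homebrew/bin/brew
export PATH="/opt/homebrew/bin:$PATH"

# 1. Ollama via Homebrew (idempotent)
if ! command -v ollama >/dev/null 2>&1; then
  "$BREW" install ollama
fi

# 2. Start the service
"$BREW" services start ollama

# Wait up to 30 seconds for the daemon to answer before pulling anything
for _ in {1..30}; do
  ollama list >/dev/null 2>&1 && break
  sleep 1
done

# 3. Pull the completion model on every machine
ollama pull qwen3-coder:30b-instruct-q4_K_M

# 4. Pull the chat model only on M4 Max
chip=$(system_profiler SPHardwareDataType | awk -F': ' '/Chip/ {print $2}')
if [[ "$chip" == *"M4 Max"* ]]; then
  ollama pull qwen3-coder:72b-instruct-q4_K_M
fi

# 5. Mark bootstrap done so the dashboard can read it
mkdir -p /var/abn/localagent
date -u +"%Y-%m-%dT%H:%M:%SZ" > /var/abn/localagent/bootstrapped_at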

We pushed it through Jamf, which the studio already used for MDM. If you do not have Jamf, a Munki repo or an Ansible playbook over ssh works. The point is one button, fourteen machines, no Slack ping.

One detail from the Ollama docs that we did not appreciate until later: the daemon binds to 127.0.0.1 by default. Keep it that way. If you need to reach it from another machine for the eval harness, route over Tailscale ACLs rather than opening port 11434 to the office wifi.
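A harness can refuse to run when an endpoint is not loopback-only. The following is a minimal sketch in Go, not part of the shipped evalctl; the function name isLoopbackEndpoint is hypothetical, and it only inspects the URL it is given rather than the daemon's actual bind address.

```go
package main

import (
	"fmt"
	"net"
	"net/url"
)

// isLoopbackEndpoint reports whether rawURL points at a loopback address
// (127.0.0.0/8, ::1, or the literal name "localhost").
// Hypothetical helper for illustration only.
func isLoopbackEndpoint(rawURL string) (bool, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return false, err
	}
	host := u.Hostname()
	if host == "" {
		return false, fmt.Errorf("no host in %q", rawURL)
	}
	if host == "localhost" {
		return true, nil
	}
	ip := net.ParseIP(host)
	return ip != nil && ip.IsLoopback(), nil
}

func main() {
	for _, endpoint := range []string{
		"http://127.0.0.1:11434", // the apiBase used in the studio config
		"http://0.0.0.0:11434",   // what an exposed daemon would look like
	} {
		ok, err := isLoopbackEndpoint(endpoint)
		fmt.Println(endpoint, "loopback:", ok, "err:", err)
	}
}
```

A check like this is cheap to run at the top of a nightly job, before any prompt leaves the process.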

Continue.dev as the IDE bridge

Continue.dev is the closest thing to a sane open standard for "agent talks to editor." It plugs into VS Code, JetBrains, and (the part that mattered for us) it has a real Xcode integration through Apple's editor extension API. Config lives in ~/.continue/config.yaml.

Here is the studio config we shipped, redacted of anything client-specific:

models:
  - name: Qwen3 Coder Local
    provider: ollama
    model: qwen3-coder:30b-instruct-q4_K_M
    roles:
      - autocomplete
      - chat
    apiBase: http://127.0.0.1:11434
    defaultCompletionOptions:
      temperature: 0.2
      maxTokens: 1024

  - name: Qwen3 Coder Local Heavy
    provider: ollama
    model: qwen3-coder:72b-instruct-q4_K_M
    roles:
      - chat
      - edit
    apiBase: http://127.0.0.1:11434
    defaultCompletionOptions:
      temperature: 0.1

context:
  - provider: code
  - provider: diff
  - provider: terminal
  - provider: docs
    params:
      sites:
        - https://docs.swift.org
        - https://developer.apple.com/documentation

systemMessage: |
  You write production code for a Dutch product studio.
  Match the conventions of the surrounding file. Do not invent APIs.
  When unsure, return one line of reasoning, then code.

We pushed this out the same way as the bootstrap: a launchd job on first login copies the template into the user's home directory only if no config.yaml exists. It never overwrites manual edits.

The Xcode trap

Continue.dev in VS Code is a fifteen-minute setup. Xcode is the trap.

Continue's Xcode integration is a separate macOS app that uses the Editor Extensions API. The first thing every developer has to do, by hand, is open System Settings, then Login Items & Extensions, then Xcode Source Editor, and tick the Continue extension. We could not script this. The TCC database is signed, and modifying it without user consent is exactly what Apple does not allow. So we baked the click-through into the onboarding doc with a screenshot.

Warning

Every macOS point release flips the Xcode source editor extension toggle back to off. If your dashboard does not detect that, the team will silently lose Xcode completion for a week before anyone notices.

The second Xcode-specific problem is latency. Inline completion in Xcode uses the SourceEditor protocol, which is not streaming. The local model has to return the full completion before Xcode renders it. With Qwen3-Coder 30B at Q4 on an M3 Pro, first-token latency sits around 180 to 260 milliseconds and a thirty-line completion lands in roughly 700 milliseconds. That is acceptable. With the 72B model it is not, which is why we kept the 72B on M4 Max chat-mode only.
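To get your own numbers for the non-streaming case, you can time one full completion against the local daemon. The sketch below uses Ollama's /api/generate endpoint with "stream": false; the helper names and the prompt are illustrative assumptions, and it measures wall-clock time for the whole response, not first-token latency.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// generateRequest is the subset of the /api/generate body used here.
type generateRequest struct {
	Model  string `json:"model"`
	Prompt string `json:"prompt"`
	Stream bool   `json:"stream"`
}

// generateResponse keeps only the completion text.
type generateResponse struct {
	Response string `json:"response"`
}

// buildBody encodes a non-streaming request, mirroring how a
// non-streaming editor integration has to wait for the full answer.
func buildBody(model, prompt string) ([]byte, error) {
	return json.Marshal(generateRequest{Model: model, Prompt: prompt, Stream: false})
}

// timeCompletion returns the wall-clock time for one full completion.
func timeCompletion(client *http.Client, baseURL, model, prompt string) (time.Duration, error) {
	body, err := buildBody(model, prompt)
	if err != nil {
		return 0, err
	}
	start := time.Now()
	resp, err := client.Post(baseURL+"/api/generate", "application/json", bytes.NewReader(body))
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return 0, fmt.Errorf("ollama returned %s", resp.Status)
	}
	var out generateResponse
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return 0, err
	}
	return time.Since(start), nil
}

func main() {
	client := &http.Client{Timeout: 60 * time.Second}
	d, err := timeCompletion(client, "http://127.0.0.1:11434",
		"qwen3-coder:30b-instruct-q4_K_M", "// Swift: decode a Flight from JSON\n")
	if err != nil {
		fmt.Println("could not reach the local daemon:", err)
		return
	}
	fmt.Printf("full completion in %d ms\n", d.Milliseconds())
}
```

Run it a few times and discard the first result, since the first call typically includes loading the model into memory.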

The nightly eval, against the studio's own monorepo

This is the part most local-coding-agent guides skip, and the part we are proudest of.

Every night at 03:00 local, on each developer's machine, a launchd job runs evalctl. It does three things.

First, it checks out a fixed reference commit of the studio monorepo. One SHA, frozen quarterly, so the comparison stays apples to apples across nights.

Second, it runs ten task prompts written by the senior engineers. Each is small and specific, with a known-good answer. Examples: "Given this protobuf, write the Swift decoder that matches the existing pattern in Sources/Wire/Decoders/." "This MySQL migration is missing a down step; write it." "This React Native screen leaks the subscription on unmount; fix it." The known-good answer for each task is a diff a senior engineer wrote by hand.

Third, it scores the model output against the known-good answer on a four-part rubric: does the code compile, does it match file conventions (a regex pass), does it match the structural shape of the expected diff (an AST diff via tree-sitter), and does it pass the relevant test in the monorepo's suite.

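// Result is the score for one task prompt on one machine.
type Result struct {
    TaskID     string  `json:"task_id"`
    Compiles   bool    `json:"compiles"`    // generated code builds
    StyleMatch float64 `json:"style_match"` // score from the regex convention pass
    ShapeMatch float64 `json:"shape_match"` // score from the tree-sitter AST diff
    TestsPass  bool    `json:"tests_pass"`  // relevant monorepo test passes
    DurationMs int64   `json:"duration_ms"`
}

// score maps a Result to a dashboard colour: hard failures are red,
// soft mismatches are amber, everything else is green.
func score(r Result) string {
    if !r.Compiles || !r.TestsPass {
        return "red"
    }
    if r.StyleMatch < 0.7 || r.ShapeMatch < 0.6 {
        return "amber"
    }
    return "green"
}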
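One plausible way to turn the regex pass into the StyleMatch number is the fraction of convention rules a file satisfies. The sketch below is an assumption about the shape of that pass, not the real rubric: the styleRule type and both example rules are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
)

// styleRule is one convention check. MustMatch false means the
// pattern is forbidden. Hypothetical type for illustration.
type styleRule struct {
	Name      string
	Pattern   *regexp.Regexp
	MustMatch bool
}

// styleMatch returns the fraction of rules that src satisfies, in [0, 1].
func styleMatch(src string, rules []styleRule) float64 {
	if len(rules) == 0 {
		return 1.0
	}
	passed := 0
	for _, r := range rules {
		if r.Pattern.MatchString(src) == r.MustMatch {
			passed++
		}
	}
	return float64(passed) / float64(len(rules))
}

// exampleRules are made-up conventions, not the studio's.
func exampleRules() []styleRule {
	return []styleRule{
		{"decoder types end in Decoder", regexp.MustCompile(`(?m)^struct \w+Decoder\b`), true},
		{"no force-try", regexp.MustCompile(`\btry!`), false},
	}
}

func main() {
	src := "struct FlightDecoder {\n    let payload = try! decode()\n}\n"
	fmt.Printf("style_match = %.2f\n", styleMatch(src, exampleRules()))
}
```

With the thresholds in score, a file that passes only one of these two rules would land below 0.7 and show as amber.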

The whole run writes one JSON file per developer per night to /var/abn/localagent/eval/YYYY-MM-DD/<hostname>.json. A second launchd job at 07:30 rsyncs those files to the intranet box. The dashboard reads them. Fourteen rows.
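Turning one night's file into a single row colour needs a roll-up rule. The sketch below assumes the file holds a JSON array of Result values and that a row takes the colour of its worst task; both are assumptions for illustration, as is treating an empty run as red.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Result and score repeat the definitions from the post so this compiles alone.
type Result struct {
	TaskID     string  `json:"task_id"`
	Compiles   bool    `json:"compiles"`
	StyleMatch float64 `json:"style_match"`
	ShapeMatch float64 `json:"shape_match"`
	TestsPass  bool    `json:"tests_pass"`
	DurationMs int64   `json:"duration_ms"`
}

func score(r Result) string {
	if !r.Compiles || !r.TestsPass {
		return "red"
	}
	if r.StyleMatch < 0.7 || r.ShapeMatch < 0.6 {
		return "amber"
	}
	return "green"
}

// severity orders colours so the worst one can win.
var severity = map[string]int{"green": 0, "amber": 1, "red": 2}

// rowColour parses one machine's nightly file and returns its worst colour.
func rowColour(raw []byte) (string, error) {
	var results []Result
	if err := json.Unmarshal(raw, &results); err != nil {
		return "", err
	}
	if len(results) == 0 {
		return "red", nil // assumption: a night with no results is a failure
	}
	worst := "green"
	for _, r := range results {
		if c := score(r); severity[c] > severity[worst] {
			worst = c
		}
	}
	return worst, nil
}

func main() {
	night := []byte(`[
	  {"task_id":"t1","compiles":true,"style_match":0.9,"shape_match":0.8,"tests_pass":true,"duration_ms":640},
	  {"task_id":"t2","compiles":true,"style_match":0.65,"shape_match":0.8,"tests_pass":true,"duration_ms":710}
	]`)
	colour, err := rowColour(night)
	fmt.Println(colour, err)
}
```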

The dashboard the founder reads

The brief from Marije was one sentence: "I want to know, before I drink my second coffee, whether the agent is making the team faster or slower."

So the dashboard does exactly that. Each developer is a row. Each row has today's colour, yesterday's colour, and a seven-day sparkline. There is one number at the top: average task duration across the fleet for the last seven nights, compared with the seven nights before. Green if it dropped, red if it climbed.
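The headline number reduces to comparing two seven-night means. Here is a minimal sketch, assuming the input is one fleet-wide average task duration per night, oldest first; the function name weekOverWeek and the "flat" result for an exact tie are assumptions, since the rule above only defines green and red.

```go
package main

import (
	"errors"
	"fmt"
)

// mean returns the arithmetic mean of a non-empty slice.
func mean(xs []float64) float64 {
	sum := 0.0
	for _, x := range xs {
		sum += x
	}
	return sum / float64(len(xs))
}

// weekOverWeek compares the last seven nights with the seven before them.
// nightlyMs holds one fleet-average duration per night, oldest first.
func weekOverWeek(nightlyMs []float64) (string, error) {
	if len(nightlyMs) < 14 {
		return "", errors.New("need at least 14 nights of data")
	}
	n := len(nightlyMs)
	recent := mean(nightlyMs[n-7:])
	previous := mean(nightlyMs[n-14 : n-7])
	switch {
	case recent < previous:
		return "green", nil // durations dropped
	case recent > previous:
		return "red", nil // durations climbed
	default:
		return "flat", nil // assumption: ties get their own label
	}
}

func main() {
	nights := []float64{
		800, 810, 790, 805, 795, 800, 800, // previous week
		700, 720, 690, 710, 705, 695, 700, // most recent week
	}
	colour, err := weekOverWeek(nights)
	fmt.Println(colour, err)
}
```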

There is no chat. There are no completion counts. There are no token bills. There is no model card. One question, one answer.

Takeaway

If a non-engineer founder cannot read your local coding agent's status in one glance, you do not have a fleet. You have fourteen science projects.

Two weeks in, the dashboard caught something we did not expect. The 72B model on one M4 Max started returning amber every night for one specific developer. His branch had been on the wrong base for three weeks, the protobuf decoders he was generating did not match the new wire format, and nobody had noticed. The eval caught it because the rubric ran against the monorepo's shared reference commit, not his branch. We rebased. The dashboard went green the next morning.

What we got wrong on the first pass

A short list, because honesty travels further than gloss.

Quantisation choice. We started on Q5_K_M for everything. Memory pressure on the 36GB machines pushed swap, swap blew first-token latency past three seconds, developers turned the extension off. We dropped to Q4_K_M. Latency fell back under 300ms. The extension stayed on.

Eval drift. The first version of the rubric scored style by exact-match regex. Twelve weeks in, the team agreed on a new naming convention for Swift result types. The rubric did not get the memo. Every machine went amber. Now the rubric lives in the monorepo and changes go through a PR like everything else.

The Xcode toggle. Every macOS update flips the Xcode extension toggle back to off. We caught it once by accident. Now the bootstrap script writes a checksum of the current macOS build to a file, and if it changes, the dashboard puts a yellow banner up that reads "macOS updated, please re-tick the Continue extension." That one banner has saved a full day of confused Slack messages.

The five-minute audit you can run today

If you are running a small team on cloud-hosted coding agents and the privacy bill is coming due, do this before lunch.

  1. Install Ollama on one developer's machine. Pull qwen3-coder:30b-instruct-q4_K_M.
  2. Install the Continue.dev extension in VS Code. Point it at the local model with the YAML above.
  3. Pick three tasks from your last sprint that already merged. Run them through the local agent. Compare the diffs to what actually shipped.

If two out of three are close enough that the developer would have shipped them, you have your answer. If they are not, you know what to evaluate against before you sign next year's seat contract.

When we built this for the Haarlem studio last spring, the part we kept tripping over was that Xcode toggle flipping after every macOS point release. We solved it with the build-checksum banner above, and folded the whole stack into our AI agents practice. The script names in this post match what we ship.

FAQ

What hardware do I need to run Qwen3-Coder locally on a Mac?

An Apple Silicon Mac with 36GB unified memory will run the 30B Q4 quant comfortably for inline completion. The 72B Q4 needs 64GB and is only worth running for chat-mode refactors, not autocomplete.

Does this make sense for a solo developer, or only a team?

Solo is the easier setup, but the playbook earns its keep at team scale where privacy, fleet rollout, and a nightly eval against your own code start to matter more than raw completion speed.

Why Continue.dev instead of writing a custom VS Code extension?

Continue.dev already handles autocomplete, chat, edit mode, and an Xcode bridge. Rebuilding that surface area to save a config file is the kind of yak-shave a small team should never start.

How long did the setup actually take?

Six weeks of calendar time, around fifteen engineer-days of real work. Most of that went into the eval harness and the dashboard, not the Ollama or Continue.dev installs.
