Claude as the design system: a Figma handoff replaced
A 22-person Eindhoven studio swapped Figma handoffs for a Claude agent that writes Tailwind React stubs the front-end actually merges. Here is how we built it.

It is a Thursday afternoon in a studio off the Vestdijk in Eindhoven. The lead designer, Famke, has just finished a checkout flow in Figma. Four breakpoints, twelve states, a sticky summary, a discount field that collapses on mobile. She exports the frames, drops a Loom into Slack, tags two front-end devs, and writes the sentence every designer in this country has typed at some point: let me know if anything is unclear.
Three days later the PR comes back. The padding is off by 4px in two places. The empty state is missing. The discount field expands instead of collapsing. Famke leaves nine comments. The devs sigh. Everybody is doing their job; the handoff is the problem.
This is the studio that asked us to fix it. Twenty-two people, mostly product designers, two in-house front-end engineers, the rest contractors. They ship for Dutch banks and a couple of energy retailers. Figma is the source of truth. React + Tailwind is the destination. The space between those two is where their margin lives, and where it leaks.
We spent six weeks building them a Claude-driven component-spec agent. It does not replace the designers. It does not replace the engineers. It replaces the handoff: the Loom, the Slack tag, the back-and-forth about whether a chip is a Badge or a Tag. Six months in, the front-end team merges roughly 70% of the stubs the agent produces with only cosmetic edits. This is the post about how it works and what we got wrong on the way.
The actual problem with Figma handoffs
The honest framing: Figma is a drawing tool that grew a component model. React is a component model that grew a styling system. Neither was designed to describe the other. Tokens help. Auto-layout helps. Dev Mode helps. None of them close the gap, because the gap is not visual; it is semantic.
When a designer draws a card with a header, a body, and a CTA, they have made a hundred implicit decisions. Is the CTA a Button or a Link styled as a Button? Does the card have a hover state on touch devices? Is the header a real h3, or a div styled to look like one? Figma cannot answer these questions. The designer can, but only if you ask, and asking takes a meeting.
The thread on Hacker News last week titled "I design with Claude more than Figma now" hit 106 points because that experience is now common. A growing chunk of design work is conversational: what should this be called, what states does it need, what happens on small screens. Figma is the canvas; the thinking happens elsewhere. We took that observation and built it into a pipeline.
What the agent actually does
The agent is not a code generator in the Vercel v0 sense. It does not look at a Figma frame and try to guess React. It does something narrower and more useful: it produces a component specification, then a Tailwind-ready stub that matches the studio's existing component library, naming conventions, and accessibility rules.
The input is three things:
- A Figma node ID, fetched through the Figma REST API.
- The studio's components.json: a manifest of every existing component, its props, and its allowed variants.
- A short freeform brief from the designer, written as if explaining to a junior dev.
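To make the three inputs concrete, here is a minimal sketch of how the Read step could address a single Figma node and bundle it with the other two inputs. The endpoint path and the X-Figma-Token header follow Figma's public REST API; the names buildFigmaNodeRequest, FigmaNodeRequest, and ContextBundle are hypothetical and not taken from the studio's code.

```typescript
// Sketch only. Helper and type names are illustrative assumptions.
interface FigmaNodeRequest {
  url: string;
  headers: Record<string, string>;
}

function buildFigmaNodeRequest(
  fileKey: string,
  nodeId: string,
  token: string,
): FigmaNodeRequest {
  // Node IDs look like "12:345"; the colon is percent-encoded for the query string.
  const ids = encodeURIComponent(nodeId);
  const key = encodeURIComponent(fileKey);
  return {
    url: `https://api.figma.com/v1/files/${key}/nodes?ids=${ids}`,
    headers: { "X-Figma-Token": token },
  };
}

// The three inputs, bundled for the later pipeline steps.
interface ContextBundle {
  figmaNode: unknown; // raw JSON returned by the Figma API
  manifest: unknown; // parsed components.json
  brief: string; // the designer's freeform note
}

function bundleContext(
  figmaNode: unknown,
  manifest: unknown,
  brief: string,
): ContextBundle {
  return { figmaNode, manifest, brief: brief.trim() };
}
```

The request object can be handed to fetch or any HTTP client; keeping the URL builder pure makes it easy to test without a network call.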
The output is a pull request. Not a draft, not a gist, a real PR against the studio's monorepo with three files: the component, a Storybook story, and a Vitest file that asserts the props contract. The PR description includes the spec the agent wrote first, so the reviewer can sanity-check the intent before reading the code.
The spec-first loop
The thing that made this work, after two failed attempts, was forcing the agent to write the spec before the code. Claude is good at React. Claude is excellent at writing a clear English description of a component and then writing code that matches that description. If you skip the description and go straight to code, you get plausible-looking React that drifts from the studio's conventions within twenty lines.
The spec is a Markdown document with a fixed structure:
## CheckoutSummary
**Purpose**: Sticky order summary on the right of the cart page (desktop), collapsing into a bottom drawer on mobile.
**Props**
- items: LineItem[]
- discount?: Discount
- onApplyDiscount(code: string): Promise<Result>
- variant: 'desktop' | 'mobile'
**States**
- empty (no items)
- with-items
- with-discount-applied
- discount-error
- loading (during applyDiscount)
**Accessibility**
- Sticky region uses role="complementary"
- Discount input has aria-describedby pointing at error text
- Drawer is a focus-trapped dialog on mobile
**Out of scope**
- Currency formatting (use existing <Money/>)
- Tax calculation (parent passes final amounts)
That document is what the designer reviews. Not the React. The React is generated from the spec in a second pass, with the spec attached as the system context. If the spec is wrong, the designer fixes the spec, and the code regenerates. Reviewing English is faster than reviewing JSX, and designers can actually do it.
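Because the structure is fixed, a spec can be checked mechanically before it ever reaches the designer. The following is a minimal sketch, assuming the five bold section labels from the example above are the required ones; missingSpecSections and specTitle are hypothetical helpers, not part of the studio's pipeline.

```typescript
// Sketch only: assumes each section starts a line with a bold label, as in the example spec.
const REQUIRED_SECTIONS = [
  "Purpose",
  "Props",
  "States",
  "Accessibility",
  "Out of scope",
];

// Returns the required section names that do not appear in the spec.
function missingSpecSections(specMarkdown: string): string[] {
  const lines = specMarkdown.split("\n").map((line) => line.trim());
  return REQUIRED_SECTIONS.filter(
    (name) => !lines.some((line) => line.startsWith(`**${name}**`)),
  );
}

// Returns the component name from the first "## " heading, or null if there is none.
function specTitle(specMarkdown: string): string | null {
  for (const line of specMarkdown.split("\n")) {
    const trimmed = line.trim();
    if (trimmed.startsWith("## ")) {
      return trimmed.slice(3).trim();
    }
  }
  return null;
}
```

A spec that fails this check can be bounced back to the agent before a human spends any attention on it.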
Generate the spec first, the code second. The spec is what the designer reviews; the code is what the agent writes. Conflate those two steps and the loop collapses.
How the components.json manifest keeps it honest
The single biggest risk with code-generating agents is drift. Round one, you get a clean component. Round fifty, you have three subtly different Button implementations and a Toast that nobody remembers writing. We solved this with a manifest the agent must consult on every run.
The manifest is generated nightly from the actual codebase, using a small AST walker. For every exported component it records the file path, the props with their types, the variants the component accepts, and a one-line description pulled from a JSDoc block above the export. It looks like this:
{
  "Button": {
    "path": "src/ui/Button.tsx",
    "props": {
      "variant": ["primary", "secondary", "ghost", "danger"],
      "size": ["sm", "md", "lg"],
      "loading": "boolean",
      "icon": "ReactNode"
    },
    "description": "Primary action element. Never use for navigation; use Link."
  },
  "Link": { "...": "..." }
}
Before the agent writes a single line, it receives the full manifest and is instructed to reuse existing components wherever the spec allows. If a CheckoutSummary needs a button, it imports Button. If the spec calls for a chip and Chip already exists, it uses Chip. If nothing fits, it must say so in the spec, in a section called New components required, and stop. A human decides whether to extend the library.
That last rule is the one that earned the front-end team's trust. The agent does not invent components. It cannot. The manifest is its world.
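The reuse-or-stop rule is simple enough to express as a lookup. This is a minimal sketch, assuming the manifest shape shown above; the type names and the resolveComponents and mayGenerateStub helpers are hypothetical.

```typescript
// Sketch only: types mirror the manifest example, names are illustrative.
type PropSpec = string | string[]; // a type name, or the list of allowed variants

interface ManifestEntry {
  path: string;
  props: Record<string, PropSpec>;
  description: string;
}

type Manifest = Record<string, ManifestEntry>;

interface Resolution {
  reuse: { name: string; path: string }[];
  newComponentsRequired: string[];
}

// Splits the components a spec asks for into "already exists" and "needs a human decision".
function resolveComponents(needed: string[], manifest: Manifest): Resolution {
  const result: Resolution = { reuse: [], newComponentsRequired: [] };
  for (const name of needed) {
    const known = Object.prototype.hasOwnProperty.call(manifest, name);
    const entry = known ? manifest[name] : undefined;
    if (entry) {
      result.reuse.push({ name, path: entry.path });
    } else {
      result.newComponentsRequired.push(name);
    }
  }
  return result;
}

// The stub step only runs when nothing new is required.
function mayGenerateStub(resolution: Resolution): boolean {
  return resolution.newComponentsRequired.length === 0;
}
```

Enforcing the stop in code, rather than relying on the prompt alone, gives the pipeline a hard check on top of the instruction.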
The Tailwind constraint that everybody underestimates
Tailwind is a thousand atomic decisions per component. A naive agent will pick text-gray-700 when the design system has text-ink-secondary. It will use p-4 when the spacing token is p-card. Each of those mistakes is small and each one needs a manual fix, and the manual fixes are what kill the time savings.
We solved it by stripping Tailwind's default theme out of the agent's context entirely and replacing it with the studio's tailwind.config.ts, condensed into a token reference table. The model has still seen text-gray-700 in training, but nothing in its context window mentions it, because Tailwind's defaults are not there. The only class names in front of it are text-ink-primary, text-ink-secondary, bg-surface-raised, and so on. This is the same trick you use to make any LLM speak a domain language: don't tell it to prefer the domain, remove the alternative.
Do not let your agent see vanilla Tailwind utilities if you have a design-token override. It will reach for the defaults under load, especially when the prompt gets long. Remove them from context, don't just discourage them.
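A token reference table can be derived from the theme and doubled as a check on generated class names. The sketch below is a simplification under stated assumptions: the StudioTheme shape, the prefix lists, and both function names are hypothetical, and a real tailwind.config.ts has many more keys than colors and spacing.

```typescript
// Sketch only: a reduced theme shape, not the full Tailwind config type.
interface StudioTheme {
  colors: Record<string, Record<string, string>>; // e.g. { ink: { primary: "#111" } }
  spacing: Record<string, string>; // e.g. { card: "1.5rem" }
}

// Assumed utility prefixes; extend these to match the utilities a project uses.
const COLOR_PREFIXES = ["text", "bg", "border"];
const SPACING_PREFIXES = ["p", "m", "gap"];

// Expands the theme into the flat list of class names the agent is allowed to see.
function allowedClasses(theme: StudioTheme): string[] {
  const out: string[] = [];
  for (const [group, shades] of Object.entries(theme.colors)) {
    for (const shade of Object.keys(shades)) {
      for (const prefix of COLOR_PREFIXES) {
        out.push(`${prefix}-${group}-${shade}`);
      }
    }
  }
  for (const key of Object.keys(theme.spacing)) {
    for (const prefix of SPACING_PREFIXES) {
      out.push(`${prefix}-${key}`);
    }
  }
  return out;
}

// Flags color and spacing utilities that are not in the token table.
// Limitation: this would also flag text-sm, because font sizes share the text- prefix.
function offTokenClasses(className: string, allowed: string[]): string[] {
  const watched = COLOR_PREFIXES.concat(SPACING_PREFIXES);
  return className.split(/\s+/).filter((cls) => {
    if (cls === "") return false;
    const prefix = cls.split("-")[0] ?? "";
    return watched.includes(prefix) && !allowed.includes(cls);
  });
}
```

The same list serves two purposes: it is what goes into the prompt, and it is what the gate can use to catch a default utility that slipped through anyway.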
What we got wrong the first two times
The first version was a single prompt: here is the Figma frame, here is the manifest, write the component. It worked for trivial components and failed for anything with state. Specifically, it kept inventing prop names that read well but did not match the studio's convention (isLoading vs loading, onClickApply vs onApplyDiscount). Pull requests turned into bikeshedding sessions about naming.
The second version added a linter pass: generate the code, then run a custom ESLint plugin that checked prop-name conventions and import order. That caught the bikeshed problems but missed structural ones. We were also burning tokens on multiple round-trips when the agent rewrote whole files to fix a single prop name.
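For a sense of what a prop-name check covers, here is a minimal sketch written as a plain function rather than an ESLint plugin. The two rules are inferred from the examples above (loading rather than isLoading, onApplyDiscount rather than onClickApply) and are assumptions about the convention, not the studio's actual rule set; checkPropNames is a hypothetical name.

```typescript
// Sketch only: two assumed naming rules, inferred from the examples in the text.
interface PropNameIssue {
  prop: string;
  reason: string;
}

function checkPropNames(propNames: string[]): PropNameIssue[] {
  const issues: PropNameIssue[] = [];
  for (const prop of propNames) {
    if (/^(is|has)[A-Z]/.test(prop)) {
      // Assumed rule: boolean props are bare adjectives, e.g. "loading".
      issues.push({ prop, reason: "boolean props drop the is/has prefix" });
    } else if (/^on(Click|Press|Tap)[A-Z]/.test(prop)) {
      // Assumed rule: handlers describe the action, e.g. "onApplyDiscount".
      issues.push({ prop, reason: "handlers name the action, not the DOM event" });
    }
  }
  return issues;
}
```

A check like this reports names but says nothing about structure, which matches the limitation described above.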
The third version is the one that stuck. The pipeline is now four distinct steps, each with a narrow job:
- Read: pull the Figma node, the manifest, and the designer's brief into a context bundle.
- Spec: produce the Markdown specification. The designer signs off here, in a Slack thread driven by a small bot. No code yet.
- Stub: generate the React + Tailwind + Storybook + Vitest files, with the manifest and the signed-off spec as system context.
- Gate: run ESLint, TypeScript, the test suite, and a Playwright visual diff against the Figma frame's exported PNG. If anything fails, the agent gets the error log and one retry. If the retry fails, it opens the PR as a draft with the failures listed and tags a human.
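The Gate step's one-retry rule can be sketched as a small control-flow function. This is an illustration under assumptions: the check runner and the regenerate call are injected as hypothetical interfaces (GateDeps), so nothing here reflects how the studio actually invokes ESLint, TypeScript, the tests, or the model.

```typescript
// Sketch only: the dependencies are injected so the control flow can be read on its own.
interface CheckResult {
  name: string; // e.g. "eslint", "tsc", "vitest", "visual-diff"
  ok: boolean;
  log: string;
}

type Stub = Record<string, string>; // file path -> file contents

interface GateDeps {
  runChecks(stub: Stub): Promise<CheckResult[]>;
  regenerate(stub: Stub, errorLog: string): Promise<Stub>;
}

interface GateOutcome {
  stub: Stub;
  draft: boolean; // true means: open the PR as a draft and tag a human
  failures: CheckResult[];
  attempts: number;
}

function failing(results: CheckResult[]): CheckResult[] {
  return results.filter((result) => !result.ok);
}

async function gate(stub: Stub, deps: GateDeps): Promise<GateOutcome> {
  const first = failing(await deps.runChecks(stub));
  if (first.length === 0) {
    return { stub, draft: false, failures: [], attempts: 1 };
  }
  // Exactly one retry: the agent sees the error log and regenerates once.
  const errorLog = first.map((f) => `${f.name}: ${f.log}`).join("\n");
  const retried = await deps.regenerate(stub, errorLog);
  const second = failing(await deps.runChecks(retried));
  return { stub: retried, draft: second.length > 0, failures: second, attempts: 2 };
}
```

Capping the loop at a single retry keeps token spend bounded and makes the failure mode explicit: a draft PR with the failures listed, not an agent that keeps rewriting files.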
The visual diff is approximate. We accept anything under a 4% pixel-diff threshold, which sounds loose but maps to what humans notice in practice. The Playwright + pixelmatch stack is well-documented in the Playwright snapshot guide, and writing the comparison was a couple of afternoons, not a project.
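To show what a 4% threshold means in terms of pixels, here is a minimal stand-in for the comparison. It is not pixelmatch and not the studio's code: it counts a pixel as different when any RGBA channel differs by more than a tolerance, which is a cruder measure than pixelmatch's perceptual colour distance. The function names and the tolerance parameter are assumptions.

```typescript
// Sketch only: a naive per-channel comparison over two RGBA buffers of equal size.
function pixelDiffRatio(a: Uint8Array, b: Uint8Array, tolerance = 0): number {
  if (a.length !== b.length || a.length % 4 !== 0) {
    throw new Error("buffers must be RGBA data of identical dimensions");
  }
  const pixels = a.length / 4;
  if (pixels === 0) return 0;
  let differing = 0;
  for (let i = 0; i < a.length; i += 4) {
    for (let c = 0; c < 4; c++) {
      if (Math.abs(a[i + c] - b[i + c]) > tolerance) {
        differing++;
        break; // count each pixel at most once
      }
    }
  }
  return differing / pixels;
}

const MAX_DIFF_RATIO = 0.04; // "anything under a 4% pixel-diff threshold"

function passesVisualGate(ratio: number): boolean {
  return ratio < MAX_DIFF_RATIO;
}
```

On a 100-pixel image, three changed pixels give a ratio of 0.03 and pass, while five give 0.05 and fail. If the comparison runs through Playwright's own screenshot assertions, the maxDiffPixelRatio option expresses a similar threshold.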
What the front-end team actually feels
The front-end leads were the most skeptical group, which is what you would expect and what you should hope for. They have been burned by tools that promised React from designs since 2017. Their condition for adoption was simple: the PR has to be one we would accept from a junior dev on a good day. That meant matching their imports, their prop conventions, their test patterns, their commit message format.
We did not get there in week one. We got there in week five, and only after we let them write the prompts for the spec-to-stub step themselves. Their version was meaner than ours. It said things like do not import Button from anywhere except @studio/ui and props must be alphabetised. Those rules made the agent's output indistinguishable from theirs.
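Rules phrased that strictly are also easy to verify after generation. The following is a minimal sketch of two such checks, assuming named imports on a single line and a case-insensitive alphabetical order; importsOutsideStudioUi and isAlphabetised are hypothetical helpers, and a production check would use a real parser instead of a regular expression.

```typescript
// Sketch only: regex-based, handles `import { A, B as C } from "x"` and nothing fancier.
function importsOutsideStudioUi(source: string, uiComponents: string[]): string[] {
  const offenders: string[] = [];
  const importRe = /import\s*\{([^}]*)\}\s*from\s*["']([^"']+)["']/g;
  let match: RegExpExecArray | null;
  while ((match = importRe.exec(source)) !== null) {
    const from = match[2];
    if (from === "@studio/ui") continue;
    const names = match[1]
      .split(",")
      .map((part) => part.trim().split(/\s+as\s+/)[0]);
    for (const name of names) {
      if (uiComponents.includes(name)) {
        offenders.push(`${name} from ${from}`);
      }
    }
  }
  return offenders;
}

// Assumed ordering rule: case-insensitive, ascending.
function isAlphabetised(propNames: string[]): boolean {
  for (let i = 1; i < propNames.length; i++) {
    if (propNames[i - 1].toLowerCase() > propNames[i].toLowerCase()) {
      return false;
    }
  }
  return true;
}
```

Pairing each prompt rule with a check like this means a violation shows up in the gate log instead of in a reviewer's comment.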
The number that matters to them: the average PR from the agent now takes 11 minutes to review and 23 minutes to merge, against 90+ minutes for a hand-built component of similar scope. They have not been asked to write a card-style component from a Figma frame in four months.
Where it does not work
Three places. First, anything with non-trivial animation. Tailwind plus a couple of variants the agent handles fine; Framer Motion sequences with shared layouts, no. We send those to a human and do not pretend otherwise.
Second, anything stateful that touches the studio's data layer. The agent does not have a model of the API. We tried giving it OpenAPI specs as context and the output got worse, not better; it started inventing endpoints that read sensibly but did not exist. We pulled it back to props-only and the quality returned.
Third, the first component in a new product. The manifest is empty or thin, the conventions are unsettled, and the agent has nothing to ground on. We use it from component three or four onwards, once the patterns exist.
The scaffolding, not the model
Anyone who has actually shipped agents knows the durable observation: the model matters less than the scaffolding around it. Our agent runs on a frontier model today and will run on a different one in six months. The spec template, the manifest, the gate step, the four-stage pipeline: those are the parts that compound. The model is interchangeable.
This is also why we did not build a Figma plugin. A plugin lives inside Figma's runtime and Figma's permission model. The pipeline lives in the studio's CI, which is where the conventions, the linters, and the tests already live. Meeting the work where it lands beats wrapping it in a new UI.
When we built the spec-and-stub pipeline for that Eindhoven studio, the thing we ran into was Tailwind drift; we ended up solving it by stripping the default theme from the agent's context and replacing it with the studio's tokens, which is the trick that finally made the merges quiet. If you want to see how we build AI agents like this one against real codebases, that link is the place to start.
The smallest thing you could do today: open your design system repo, write a thirty-line components.json manifest by hand for your ten most-used components, and read it back. If it does not describe how your team actually builds, your agent will not either.
Key takeaway
Generate the component spec first, the code second. The designer reviews the spec; the agent writes the stub. The handoff disappears when the artefact in the middle becomes English.
FAQ
Why generate a written spec before the code?
Designers can read English faster than JSX. The spec is the review surface; the code is generated from it in a second pass and regenerated whenever the spec changes. Catching mistakes in the spec is roughly ten times cheaper than catching them in a PR.
Does this replace the front-end team?
No. It replaces the handoff conversation. Front-end engineers still review every PR, write the harder components, own the data layer, and set the conventions the agent must follow.
What if the agent invents a component that does not exist?
It cannot. The manifest is generated from the real codebase nightly, and the agent is instructed to stop and request a new component if nothing in the manifest fits. A human decides whether to extend the library.
Why not use a Figma-to-code plugin instead?
Plugins live inside Figma and cannot run your linters, tests, or visual diffs. A CI pipeline meets the work where it already lands and enforces the same gates the rest of the team ships against.
How do you stop the agent from drifting toward vanilla Tailwind?
Strip the Tailwind default theme out of its context entirely. Give it only the studio's tokens. It cannot prefer what it cannot see, and that is more reliable than telling it to behave.