Product-description agent token costs: the 17-item field guide

You ran the numbers on the AI product-description agent before greenlighting it. The first invoice was nearly five times what your spreadsheet said. Here is why.

Jacob Molkenboer · Founder · A Brand New Company · 20 Jun 2026 · 9 min

The June invoice from the LLM vendor landed at 23:14 on a Tuesday. €18,400, on a budget of €4,000. You had run the math twice. You had even built a small spreadsheet — average prompt length, average output, vendor price per million tokens, multiply by 42,000 SKU-variants. Your number came out at €3,800. The actual number came out at almost five times that.

The agent works. The descriptions are decent. The catalogue is live. The cost is wrong, and the cost will be wrong every month from now until you figure out which of the seventeen places you mis-modelled.

This is that list, ranked by the worst possible question: if you notice the mistake tomorrow morning, what does the fix take? A single line in a prompt template, an evening re-embed of the catalogue, or a full rebuild of the retrieval layer.

The economics, before the list

A product-description agent has three cost lines that founders routinely treat as one. The input tokens you send (brief, brand voice, examples, product attributes). The output tokens it produces (the description itself, plus any reasoning you asked for). The embedding tokens you pay once when you index the catalogue for retrieval. Each line has a different price, a different model, and a different lever. Mis-attributing cost between them is the original sin of every overrun we have ever seen.

The math worth doing first: for one variant generation, how many input tokens are unique to that variant, and how many are shared with every other call in the batch? The shared portion is what Anthropic's prompt caching bills at roughly a tenth of the standard input rate once it has been written to the cache; the first write costs slightly more than a normal input token, and the entry expires after a few minutes without a hit. If 90% of your prompt is shared (brand voice, tone guide, format rules, few-shot examples) and you have not turned caching on, you are paying roughly ten times what you need to for that portion across 42,000 calls. That alone explains the gap on most overruns we are asked to look at.
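A rough sketch of that arithmetic in Python. Every number here is a placeholder assumption, not a quoted price: the per-million-token rates, the 8,000-token shared prefix, the 300 fresh tokens, the 600 output tokens, and the cache-read and cache-write multipliers should all be replaced with figures from your own logs and your vendor's current price list.

```python
# Hypothetical per-call token counts and prices; replace with your own.
CALLS = 42_000
SHARED_TOKENS = 8_000      # brand voice, format rules, examples (stable)
FRESH_TOKENS = 300         # SKU attributes (unique per call)
OUTPUT_TOKENS = 600        # the description itself

PRICE_IN = 3.00            # assumed price per million input tokens
PRICE_OUT = 15.00          # assumed price per million output tokens
CACHE_READ_FACTOR = 0.10   # assumed: cached reads bill at ~10% of the input rate
CACHE_WRITE_FACTOR = 1.25  # assumed: the first write bills at a premium


def batch_cost(cached: bool) -> dict[str, float]:
    """Return the cost of each line (shared input, fresh input, output) for the batch."""
    per_token_in = PRICE_IN / 1_000_000
    per_token_out = PRICE_OUT / 1_000_000
    if cached:
        # Simplification: one cache write, every later call is a cache read.
        shared = SHARED_TOKENS * per_token_in * (
            CACHE_WRITE_FACTOR + (CALLS - 1) * CACHE_READ_FACTOR
        )
    else:
        shared = SHARED_TOKENS * per_token_in * CALLS
    return {
        "shared_input": round(shared, 2),
        "fresh_input": round(FRESH_TOKENS * per_token_in * CALLS, 2),
        "output": round(OUTPUT_TOKENS * per_token_out * CALLS, 2),
    }


uncached = batch_cost(cached=False)
cached = batch_cost(cached=True)
print("uncached:", uncached, "total", round(sum(uncached.values()), 2))
print("cached:  ", cached, "total", round(sum(cached.values()), 2))
```

The cached branch assumes the cache stays warm for the whole batch. If calls are spaced far enough apart for the entry to expire, each gap triggers another write, so treat the result as a lower bound.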

Mistakes you can undo in one prompt template

Items one to ten. Change a line, redeploy the orchestrator, the next batch costs what it should. Order is rough — most impact first.

1. Model tier picked from the playground demo. You tested ten variants in the chat console with a frontier model, the prose was warm, you shipped that tier. You never re-ran the brief on a Haiku-class model. For 42k variants of a structured task with a fixed brand voice and three good few-shot examples, the cheaper tier almost always passes. Fix: change the model id in the orchestrator and run a 200-item blind A/B before committing.

2. Prompt caching off. On Anthropic's API it is off by default: you have to opt in with a cache_control breakpoint. Other vendors handle this differently, so check your vendor's equivalent. The brand voice, format rules, and example block are all eligible. With caching turned on, the cached portion of the same prompt costs about 10% of what it used to.

3. Volatile data above the stable block. The cache hashes from the top of the prompt down. If the SKU id, the timestamp, or the warehouse code sits in the first line of the system message, every call invalidates and you are paying full rate forever.

# Bad: a fresh timestamp at the top changes the prefix, so the cache misses on every call.
# (brand_voice is ~8k tokens and never changes, but it sits below the volatile lines.)
system = f"""
Generated at: {now}
SKU: {sku_id}
Brand voice: {brand_voice}
...examples, format rules...
"""

# Good: stable block first, marked for caching. Variant fields last.
# In Anthropic's Messages API the system prompt is a top-level `system`
# parameter, not a message with role "system".
system = [
    {
        "type": "text",
        "text": BRAND_VOICE + FORMAT_RULES + EXAMPLES,  # ~8k, stable
        "cache_control": {"type": "ephemeral"},
    },
]
messages = [
    {
        "role": "user",
        "content": f"SKU: {sku_id}\nAttributes: {attrs}",  # ~300, fresh
    },
]

4. Counting input only. Output tokens typically cost several times the input rate; at list prices from the major vendors, four to five times is common. A 600-token description multiplied by 42,000 variants is the line that actually blew the budget. Estimating from input alone is how the spreadsheet missed by almost five-fold.

5. Chain-of-thought you never read. You asked for "reasoning" and "description" in the JSON because a blog post said it improves output. You have not opened a single reasoning field. The model produced 400 output tokens of it per variant. Strip the field.

6. Verbose JSON schema. The model dutifully writes "meta_keywords_array_localized_nl_NL" because you named the field that way. Shorter field names cost fewer output tokens, and your parser does not care.

7. Translating with the expensive model. You generate EN first, then ask the same Sonnet-class model to render NL, DE, and FR. A cheaper second pass (even Haiku translating Haiku's own output) does this work for a fraction of the cost. Better still, write the brief once in Dutch and translate from there.

8. Re-sending the full taxonomy on every call. You paste the 12,000-token category tree into the system prompt every variant. Cache it (see #2). Better: pre-retrieve the three categories relevant to this variant and pass only those.

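9. One call per variant when variants share 90% of content. A black T-shirt in S/M/L/XL is one description with four size lines. Batch by parent product, ask the model for n variant blocks in one call, and post-process the result. The shared prompt is then sent once per parent instead of once per variant, so with around four variants per parent, input tokens drop 70–80%.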

10. Observability tool billing per ingested character. Some hosted tracing tools charge by log volume. At 42k runs × 8k tokens × repeat-attempts, that line becomes its own surprise. Sample 5% in production, log everything in dev.
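Item nine, batching by parent product, can be sketched roughly as follows. The row fields (`parent_id`, `sku`, `size`) and the prompt wording are hypothetical stand-ins for whatever your catalogue export actually contains.

```python
from collections import defaultdict

# Hypothetical variant rows; field names are illustrative, not a real schema.
variants = [
    {"parent_id": "tee-black", "sku": "TB-S", "size": "S"},
    {"parent_id": "tee-black", "sku": "TB-M", "size": "M"},
    {"parent_id": "tee-black", "sku": "TB-L", "size": "L"},
    {"parent_id": "tee-black", "sku": "TB-XL", "size": "XL"},
    {"parent_id": "mug-green", "sku": "MG-250", "size": "250 ml"},
    {"parent_id": "mug-green", "sku": "MG-400", "size": "400 ml"},
]


def group_by_parent(rows: list[dict]) -> dict[str, list[dict]]:
    """Collect variant rows under their parent product, preserving input order."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["parent_id"]].append(row)
    return dict(grouped)


def build_user_prompt(parent_id: str, rows: list[dict]) -> str:
    """Build one user message that asks for every variant of a parent at once."""
    lines = [
        f"Parent product: {parent_id}",
        f"Write one shared description, then {len(rows)} variant lines, one per SKU:",
    ]
    lines += [f"- {row['sku']}: size {row['size']}" for row in rows]
    return "\n".join(lines)


batches = group_by_parent(variants)
prompts = {pid: build_user_prompt(pid, rows) for pid, rows in batches.items()}
print(len(variants), "variants ->", len(prompts), "calls")
```

Asking the model to key each variant line by SKU keeps the post-processing step a simple lookup. It is worth checking that every requested SKU came back before writing anything to the catalogue.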

Warning

Prompt caching invalidates when the cached prefix changes by a single character. Pin the brand-voice block to a versioned file and only roll a new version when you are willing to pay for re-warming the cache across the next thousand calls.
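One way to make that pinning enforceable is to fingerprint the stable prefix and fail loudly when it drifts. This is a minimal sketch under assumptions: the block contents are placeholders, and `prefix_fingerprint` and `assert_prefix_unchanged` are hypothetical helper names, not part of any vendor SDK.

```python
import hashlib


def prefix_fingerprint(*blocks: str) -> str:
    """Short SHA-256 fingerprint of the concatenated stable prompt blocks."""
    joined = "".join(blocks).encode("utf-8")
    return hashlib.sha256(joined).hexdigest()[:12]


def assert_prefix_unchanged(expected: str, *blocks: str) -> None:
    """Raise if the stable prefix no longer matches the pinned fingerprint."""
    actual = prefix_fingerprint(*blocks)
    if actual != expected:
        raise RuntimeError(f"stable prefix changed: {expected} -> {actual}")


# Placeholder text; in practice these would be loaded from the versioned file.
BRAND_VOICE = "Warm, plain, no superlatives.\n"
FORMAT_RULES = "Return JSON with keys: title, body.\n"
EXAMPLES = "Example 1: a short sample description.\n"

# Record this value next to the version number when you roll a new prefix.
PINNED = prefix_fingerprint(BRAND_VOICE, FORMAT_RULES, EXAMPLES)

# Run this at orchestrator start-up, before the first batch call.
assert_prefix_unchanged(PINNED, BRAND_VOICE, FORMAT_RULES, EXAMPLES)
```

Because the check hashes the exact bytes that are sent, a stray trailing space or an editor that rewrites line endings is caught before it costs a cache re-warm.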

Mistakes that force a re-embed of the catalogue

Items eleven to fifteen. None of these are one-line fixes. Each costs an evening of catalogue re-processing and a careful eval before you flip the retriever over. The work is unglamorous and unavoidable.

11. Embedding model picked from a leaderboard, not from your text. Leaderboard scores such as MTEB, measured on general benchmark text, do not predict performance on Dutch product copy with brand jargon. Build a 200-item gold set from your own catalogue and test three candidates on recall@5. Dutch-tuned multilingual models often beat the headline name. Re-embed.

12. Embedding raw HTML. Someone dumped the live product page through the embedder. Sixty per cent of every vector is <div class="row"> and Tailwind class noise. Strip to clean attributes — title, brand, material, colour, key features — and embed those. Re-embed.

13. 3072-dim embeddings for a 42k catalogue. For this size, 1024-dim is fine, costs roughly a third in storage (about 172 MB instead of about 516 MB for 42,000 raw float32 vectors), and queries are noticeably faster. The largest dimension is rarely the right one for a small catalogue with structured attributes. Re-embed.

14. Per-product instead of per-variant chunks. The retriever returns the parent description when the agent needed the colour variant. Each variant becomes its own row, with the shared parent attributes attached. Re-chunk, re-embed.

15. No language column in the vector store. Multilingual catalogues end up retrieving German neighbours into a Dutch generation call because cosine similarity does not care about language. Add a language filter on retrieval, re-normalise with consistent stop-words, and re-embed.
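The recall@5 comparison from item eleven needs very little code. A minimal sketch, assuming you have already run each candidate model over the gold queries and collected its ranked SKU ids; the queries, SKU ids, and candidate names below are invented for illustration.

```python
def recall_at_k(gold: dict[str, str], retrieved: dict[str, list[str]], k: int = 5) -> float:
    """Fraction of gold queries whose expected item appears in the top-k results."""
    if not gold:
        raise ValueError("gold set is empty")
    hits = sum(
        1 for query, expected in gold.items()
        if expected in retrieved.get(query, [])[:k]
    )
    return hits / len(gold)


# Hypothetical gold set: query -> the SKU a correct retriever should return.
gold = {
    "linnen tafelkleed naturel": "sku-101",
    "emaille mok groen": "sku-202",
    "eiken snijplank groot": "sku-303",
}

# Hypothetical ranked results per candidate embedding model.
results_by_model = {
    "candidate-a": {
        "linnen tafelkleed naturel": ["sku-101", "sku-150"],
        "emaille mok groen": ["sku-999", "sku-202"],
        "eiken snijplank groot": ["sku-404"],
    },
    "candidate-b": {
        "linnen tafelkleed naturel": ["sku-101"],
        "emaille mok groen": ["sku-202", "sku-999"],
        "eiken snijplank groot": ["sku-303", "sku-404"],
    },
}

scores = {name: recall_at_k(gold, res) for name, res in results_by_model.items()}
for name, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{name}: recall@5 = {score:.2f}")
```

Three queries are only enough to show the shape of the comparison. The 200-item gold set is what makes a difference between candidates meaningful.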

Mistakes that force a catalogue rebuild

Items sixteen and seventeen. These touch the retrieval layer's schema or its index. Plan a calm week and a careful cut-over.

16. Brute-force similarity search. pgvector without an ivfflat or hnsw index does an O(n) scan on every query. At 42,000 vectors and any real traffic, your retrieval latency and your CPU bill both climb. The fix, building a tuned index, is mechanically simple. But both index types are approximate, so results can differ from the exact scan: you re-load the catalogue once you have settled on the index parameters, and you re-run your eval on the indexed store before flipping production.

17. Retrieval boundary tied to the wrong key. The agent retrieves by product_id, but your team thinks in style_id and a style groups variants the customer experiences together. The embedding was built against the wrong unit of meaning. To fix this, you change the export schema, re-embed, re-tune the prompts that reference retrieved chunks, and re-eval. Plan a week.
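For item sixteen, the index statement itself is short. The sketch below only builds the SQL strings so the parameters can be reviewed before anything touches the database. The table and column names are hypothetical, the parameter values are starting points rather than tuned recommendations, and cosine distance is an assumption about how the retriever queries.

```python
def hnsw_index_sql(table: str, column: str, m: int = 16, ef_construction: int = 64) -> str:
    """Build a pgvector HNSW index statement for cosine distance.

    Identifiers are interpolated directly, so only pass trusted, hard-coded names.
    """
    return (
        f"CREATE INDEX ON {table} USING hnsw ({column} vector_cosine_ops) "
        f"WITH (m = {m}, ef_construction = {ef_construction});"
    )


def ef_search_sql(ef_search: int) -> str:
    """Build the session setting that trades query speed against recall."""
    return f"SET hnsw.ef_search = {ef_search};"


# Hypothetical table and column names.
print(hnsw_index_sql("variant_embeddings", "embedding"))
print(ef_search_sql(100))
```

Raising `ef_search` generally improves recall at the cost of latency, which is why the eval has to be re-run against the indexed store rather than the exact scan.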

What to do tomorrow morning

Open one log entry from yesterday. Count the tokens and split them into three lines: cached input, fresh input, output. If you cannot tell which line is which, you cannot fix items one through ten. When we rebuilt the product-copy pipeline for a Dutch homeware client running roughly 38,000 SKU-variants, the thing we ran into was item three: a build timestamp sat at the top of the system prompt, so the cache never hit on a single call. Moving four lines down the file cut their monthly invoice by 71%. The longer playbook for that kind of work lives under AI agents.
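A small sketch of that three-way split, assuming your logs keep the usage counters in the shape Anthropic's Messages API reports them (`input_tokens`, `cache_creation_input_tokens`, `cache_read_input_tokens`, `output_tokens`). The sample values are made up, and other vendors name these fields differently.

```python
def split_usage(usage: dict) -> dict[str, int]:
    """Split one logged call into cached input, fresh input, and output tokens."""
    return {
        "cached_input": usage.get("cache_read_input_tokens", 0),
        # Cache writes are billed like (slightly pricier) fresh input, so count them here.
        "fresh_input": usage.get("input_tokens", 0)
        + usage.get("cache_creation_input_tokens", 0),
        "output": usage.get("output_tokens", 0),
    }


def cache_hit_ratio(lines: dict[str, int]) -> float:
    """Share of all input tokens that were served from the cache."""
    total_input = lines["cached_input"] + lines["fresh_input"]
    return lines["cached_input"] / total_input if total_input else 0.0


# Made-up log entries: one call with a warm cache, one where the cache missed.
warm_call = {
    "input_tokens": 312,
    "cache_creation_input_tokens": 0,
    "cache_read_input_tokens": 8040,
    "output_tokens": 590,
}
cold_call = {
    "input_tokens": 312,
    "cache_creation_input_tokens": 8040,
    "cache_read_input_tokens": 0,
    "output_tokens": 590,
}

for name, usage in (("warm", warm_call), ("cold", cold_call)):
    lines = split_usage(usage)
    print(name, lines, f"hit ratio {cache_hit_ratio(lines):.2f}")
```

A hit ratio near zero on a call in the middle of a batch is the signature of item three: something volatile is sitting above the stable block.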

Key takeaway

Most product-description-agent overruns come from prompt-cache misses and counting input tokens only. Both are one-line fixes if you find them before re-embedding the catalogue.

FAQ

How much can prompt caching actually save on a product-description agent?

On a typical setup where 80–90% of the prompt is shared brand-voice and format instructions, the cached portion bills at roughly 10% of the standard input rate after the first hit. Net invoices usually fall 40–70%.

Is a Haiku-class model really good enough for product descriptions?

For structured copy with a clear voice guide and three solid few-shot examples, almost always yes. Run a 200-item blind comparison against your current model on real SKUs before swapping the whole catalogue.

How do I know the embedding model is wrong without re-embedding everything?

Build a gold set of 200 query/expected-result pairs from your own catalogue. Re-embed only that subset with two or three alternatives and measure recall@5. Two hours of work, no production change.

What's the first thing to check when the invoice surprises you?

Open one log entry, count cached input, fresh input, and output tokens separately. Most overruns are visible in the first prompt you look at.
