Logistics chat agents: a playbook for the human handoff

A freight forwarder's chat agent answers 80% of tracking questions fine. The other 20% is where money and trust live. Here is the handoff playbook we use.

Jacob Molkenboer · Founder · A Brand New Company · 23 Feb 2024 · 10 min

It is 19:40 in Rotterdam. A small forwarder's chat queue has eleven open conversations. Eight are tracking pings the agent answered in two seconds. One is a customer asking why a pallet arrived with torn shrink-wrap. One is a German shipper who wants to know if the bot's promised Tuesday delivery is "a real promise or a robot promise." The last is from a number the agent has not seen before, asking if the company moves dangerous goods to Algeria.

The chat agent should answer some of these and escalate the rest. The whole game is figuring out which is which, and doing it before the customer asks twice.

This is the playbook we use at ABN for logistics SMEs: small freight forwarders, 3PLs, last-mile operators, and customs brokers. Most of the value of a chat agent in this space does not come from answering more questions. It comes from handing the right ones to a human at exactly the right moment.

The wrong frame: maximise containment

The first instinct of every operations lead is to track "containment rate", the share of conversations the bot closed without a human. Vendors love this metric because it is easy to measure and easy to inflate.

In logistics it is the wrong target. A bot that closes a damaged-pallet claim chat "successfully" has just created a legal exposure for your client. A bot that confirms a Tuesday delivery to a customer whose shipment is still sitting in Antwerp has shifted a delay problem into a trust problem. A bot that confidently tells a shipper you handle Class 3 dangerous goods when you do not has just promised something you cannot deliver.

The right target is harder to measure: handoff quality. Did the bot escalate the moments that needed a human, with enough context that the human picks up in under thirty seconds?

Five triggers that always escalate

These are non-negotiable in every logistics deployment we have shipped. The chat agent never tries to close them, never asks "is there anything else I can help with?", and never offers a workaround. It hands the chat to a human and stops talking.

Money on the line

Any message that contains a quote request, a refund, a credit note, a disputed invoice, or a damage claim. Bots are terrible at the boundary between "I am sorry" and "we accept liability." The legal cost of one wrong word is larger than a year of bot savings.

A regulated cargo class

Dangerous goods (ADR, IMDG), pharma cold chain, live animals, weapons, anything covered by export controls. The bot does not need to know the rules. It needs to recognise the keywords and step back.
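
A minimal sketch of that keyword recognition, assuming a hand-maintained pattern list. The patterns are illustrative and deliberately broad; they do not encode any regulation, and this `matchesRegulatedCargo` is a hypothetical implementation, not the one from our deployments.

```typescript
// Illustrative patterns only. A real list would be maintained with whoever
// owns compliance, and would cover more languages than English.
const REGULATED_PATTERNS: RegExp[] = [
  /\b(adr|imdg)\b/i,                      // regimes named in the text
  /\bdangerous goods\b/i,
  /\bhazmat\b/i,
  /\bclass\s*[1-9](\.\d)?\b/i,            // "Class 3", "class 2.1"
  /\bun\s?\d{4}\b/i,                      // UN numbers such as "UN 1203"
  /\b(cold chain|pharma\w*|vaccines?)\b/i,
  /\blive animals?\b/i,
  /\b(weapons?|firearms?|ammunition)\b/i,
  /\b(export controls?|dual[- ]use)\b/i,
];

// True if any pattern matches. The bot does not interpret the match,
// it only steps back.
function matchesRegulatedCargo(text: string): boolean {
  return REGULATED_PATTERNS.some((pattern) => pattern.test(text));
}
```

False positives are the cheap failure here: an unnecessary escalation costs a dispatcher a few seconds, while a missed one costs far more.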

A new account asking for terms

If the chat is from a contact the system has never billed and the conversation contains pricing, credit terms, or contract language, escalate. New-account onboarding belongs to sales, not to a chat agent.

Customer sentiment below a threshold

We score every customer message with a fast sentiment pass. Two consecutive negative messages, or one message containing strong frustration markers ("unacceptable", "complaint", "lawyer"), routes to a human immediately.
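
The rule can be sketched as below, assuming each customer message already carries a sentiment score between -1 and 1. The `ScoredMessage` shape and the -0.3 cut-off are assumptions for illustration; only the three marker words come from the text.

```typescript
interface ScoredMessage {
  text: string;
  sentiment: number; // assumed range: -1 (negative) to 1 (positive)
}

const NEGATIVE_BELOW = -0.3; // assumed cut-off, calibrate on your own chats
const STRONG_MARKERS = ["unacceptable", "complaint", "lawyer"];

// Expects customer messages only, oldest first.
function shouldEscalateOnSentiment(customerMessages: ScoredMessage[]): boolean {
  const latest = customerMessages[customerMessages.length - 1];
  if (latest === undefined) return false;

  // One strong frustration marker is enough on its own.
  const lower = latest.text.toLowerCase();
  if (STRONG_MARKERS.some((marker) => lower.includes(marker))) return true;

  // Otherwise require two consecutive negative messages.
  const previous = customerMessages[customerMessages.length - 2];
  return (
    previous !== undefined &&
    latest.sentiment < NEGATIVE_BELOW &&
    previous.sentiment < NEGATIVE_BELOW
  );
}
```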

Repeated bot failure

If the bot has answered the same customer twice and the customer is still asking the same thing, the bot is the problem. Hand it off before the third "no, I mean...".
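
One rough way to detect "still asking the same thing" is word overlap between the latest customer message and earlier ones. This is a sketch under that assumption: the Jaccard measure and the 0.5 threshold are stand-ins, and an embedding comparison would likely do better on paraphrases and on non-English text.

```typescript
// Lower-cased ASCII word set; accented characters are dropped in this sketch.
function tokens(text: string): Set<string> {
  return new Set<string>(text.toLowerCase().match(/[a-z0-9]+/g) ?? []);
}

// Jaccard similarity: shared words divided by all distinct words.
function jaccard(a: Set<string>, b: Set<string>): number {
  let shared = 0;
  a.forEach((token) => {
    if (b.has(token)) shared++;
  });
  const union = a.size + b.size - shared;
  return union === 0 ? 0 : shared / union;
}

// Counts how many earlier customer messages the latest one repeats.
function countRepeats(customerMessages: string[], threshold = 0.5): number {
  if (customerMessages.length === 0) return 0;
  const latest = tokens(customerMessages[customerMessages.length - 1]);
  return customerMessages
    .slice(0, -1)
    .filter((earlier) => jaccard(tokens(earlier), latest) >= threshold).length;
}
```

A count of 2 means the customer has now asked essentially the same question three times, which is the point at which the bot should already be out of the conversation.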

Three triggers based on confidence, not topic

The five above are content rules. They fire on what the customer said. The next three fire on what the bot wants to say.

Retrieval below a confidence floor

We instrument the RAG layer to return both an answer and a confidence score derived from retrieved-chunk similarity. If the top chunk scores below 0.62 cosine similarity (your number will differ, so calibrate it on a labelled set), the bot does not answer. It hands off. The customer never sees "I am not sure, but...", a phrase that trains them to distrust everything that follows.
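
Calibration can be as simple as the sketch below, assuming a labelled set where a person marked each bot answer right or wrong. The `LabelledAnswer` shape and the idea of choosing the floor by target precision are assumptions for illustration, not a description of a specific tool.

```typescript
interface LabelledAnswer {
  topSimilarity: number; // cosine similarity of the best retrieved chunk
  wasCorrect: boolean;   // human label: was the bot's answer right?
}

// Returns the lowest floor at which the answers the bot would still give
// meet the target precision, or null if no floor in the sample achieves it.
function calibrateFloor(
  samples: LabelledAnswer[],
  targetPrecision: number,
): number | null {
  const candidates = samples.map((s) => s.topSimilarity).sort((a, b) => a - b);
  for (const floor of candidates) {
    const answered = samples.filter((s) => s.topSimilarity >= floor);
    const correct = answered.filter((s) => s.wasCorrect).length;
    if (correct / answered.length >= targetPrecision) return floor;
  }
  return null;
}
```

With only a few dozen labels the result is noisy, so treat the returned value as a starting point and re-run it as labelled chats accumulate.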

Tool-call failure

If the bot tries to look up a shipment in the TMS and the lookup returns an error or stale data, escalate. Do not invent. Do not say "give me a moment" and try again. The dispatcher can call the carrier in five seconds; the bot cannot.
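
A sketch of the check, assuming the TMS lookup returns a status and a last-updated timestamp. The `ShipmentLookup` shape and the six-hour freshness window are assumptions; the right window depends on how often your TMS syncs with carriers.

```typescript
interface ShipmentLookup {
  ok: boolean;
  status?: string;
  lastUpdated?: Date; // when the TMS last refreshed this record
}

const MAX_AGE_MS = 6 * 60 * 60 * 1000; // assumed freshness window: six hours

// True when the bot should hand off instead of answering from this lookup.
function lookupNeedsHuman(result: ShipmentLookup, now: Date): boolean {
  if (!result.ok || result.status === undefined) return true;
  if (result.lastUpdated === undefined) return true; // unknown age counts as stale
  return now.getTime() - result.lastUpdated.getTime() > MAX_AGE_MS;
}
```

There is no retry branch on purpose: a failed or stale lookup goes straight to the dispatcher.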

Hallucination self-check

Before sending any answer that contains a date, a price, or a regulatory claim, we run a cheap second pass that asks the model "is this grounded in the retrieved context, yes or no?" with a short, structured prompt. A "no" routes to a human. This catches the rare case where the model confidently invents a delivery date from nothing.
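
The second pass might look roughly like this. The prompt wording and both function names are illustrative assumptions, and the model call itself is left out; the sketch only shows building the prompt and reading the reply strictly.

```typescript
// Builds the short yes/no prompt from the draft and the retrieved chunks.
function buildGroundingPrompt(draftAnswer: string, retrievedChunks: string[]): string {
  const context = retrievedChunks
    .map((chunk, i) => `[${i + 1}] ${chunk}`)
    .join("\n");
  return [
    "Context:",
    context,
    "",
    "Draft answer:",
    draftAnswer,
    "",
    "Is every date, price, and regulatory claim in the draft answer",
    "supported by the context? Reply with exactly one word: yes or no.",
  ].join("\n");
}

// Anything other than an unambiguous "yes" is treated as ungrounded,
// so a hedged or malformed reply routes to a human.
function isGroundedReply(modelReply: string): boolean {
  return modelReply.trim().toLowerCase().replace(/[.!]$/, "") === "yes";
}
```

Treating every non-"yes" as a "no" keeps the failure mode on the safe side: a confused checker produces an extra escalation, not an invented delivery date.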

The handoff itself: under thirty seconds or it failed

A handoff is not "I will connect you to a colleague." That is a stall. A real handoff has three parts:

A human is notified in a channel they actually watch, usually a Slack channel, sometimes a phone ring, almost never an email.
The human picks up with full context: customer name, account history, the bot's summary of the conversation so far, and the specific reason for escalation.
The customer sees a single message: "Marieke from dispatch is picking this up now." Not five paragraphs of apology.

Routing code, simplified from a recent deployment:

type HandoffReason =
  | "money"
  | "regulated_cargo"
  | "new_account_terms"
  | "negative_sentiment"
  | "repeat_failure"
  | "low_confidence"
  | "tool_error"
  | "ungrounded";

interface HandoffDecision {
  escalate: boolean;
  reason?: HandoffReason;
  summary: string;
}

async function decide(
  msg: CustomerMessage,
  ctx: ConversationContext,
  draft: AgentDraft,
): Promise<HandoffDecision> {
  // Content rules: cheap checks on what the customer said.
  if (containsClaimLanguage(msg.text)) {
    return { escalate: true, reason: "money", summary: summarise(ctx) };
  }
  if (matchesRegulatedCargo(msg.text)) {
    return { escalate: true, reason: "regulated_cargo", summary: summarise(ctx) };
  }
  if (ctx.account.isNew && mentionsTerms(msg.text)) {
    return { escalate: true, reason: "new_account_terms", summary: summarise(ctx) };
  }
  // Two consecutive negative messages, or one strong frustration marker.
  // `hasStrongMarker` is an assumed field, added so the code matches the rule above.
  if (ctx.recentSentiment.consecutiveNegative >= 2 || ctx.recentSentiment.hasStrongMarker) {
    return { escalate: true, reason: "negative_sentiment", summary: summarise(ctx) };
  }
  if (ctx.recentBotAnswers.unresolvedRepeats >= 2) {
    return { escalate: true, reason: "repeat_failure", summary: summarise(ctx) };
  }
  // Confidence rules: checks on what the bot wants to say.
  if (draft.retrievalConfidence < 0.62) {
    return { escalate: true, reason: "low_confidence", summary: summarise(ctx) };
  }
  if (draft.toolErrors.length > 0) {
    return { escalate: true, reason: "tool_error", summary: summarise(ctx) };
  }
  // The grounding check costs a model call, so it runs last and only for factual claims.
  if (draft.containsFactualClaim && !(await isGrounded(draft, ctx.retrieved))) {
    return { escalate: true, reason: "ungrounded", summary: summarise(ctx) };
  }
  return { escalate: false, summary: "" };
}

Order matters. Cheap content rules first, expensive model passes last. The grounding check costs a model call; we only do it if everything else passed.

What the dispatcher sees

This is where most deployments fail. The bot escalates, but the human gets a Slack ping that says "new chat" and has to scroll through twenty turns to figure out what is happening. By then the customer has waited four minutes and asked the same thing again.

The dispatcher view we ship has four blocks, in this order:

Why this is on your desk. One sentence. "Damage claim, shrink-wrap torn, pallet 4 of 6." Or "ADR enquiry from a new shipper." That is the whole header.
Who. Customer name, company, account status (existing, new, suspended), last three shipments with status, and outstanding balance if any.
What the bot already said. A bulleted summary of the bot's turns, not the raw transcript. Five lines maximum. The raw transcript is one click away.
Suggested next action. Pre-filled draft response the dispatcher can edit and send, or discard. Never sent automatically; this is the bot's last useful contribution.
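
The four blocks map naturally onto a small data structure. The sketch below is one possible shape, with hypothetical field names; it leaves out recent shipments and outstanding balance for brevity and renders plain text where a real card would use whatever the dispatcher's tool supports.

```typescript
interface HandoffCard {
  why: string; // one sentence
  who: {
    name: string;
    company: string;
    accountStatus: "existing" | "new" | "suspended";
  };
  botSaid: string[];      // summary bullets, not the raw transcript
  suggestedReply: string; // a draft only, never sent automatically
}

const MAX_SUMMARY_LINES = 5;

// Renders the blocks in the fixed order: why, who, what the bot said, next action.
function renderCard(card: HandoffCard): string {
  const bullets = card.botSaid
    .slice(0, MAX_SUMMARY_LINES)
    .map((line) => `- ${line}`);
  return [
    `WHY: ${card.why}`,
    `WHO: ${card.who.name}, ${card.who.company} (${card.who.accountStatus})`,
    "BOT SAID:",
    ...bullets,
    `SUGGESTED REPLY (draft): ${card.suggestedReply}`,
  ].join("\n");
}
```

Capping the summary in code, not just in the prompt, means a long-winded summariser cannot push the reason for escalation off the dispatcher's screen.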

The number to watch is not containment. It is time-to-human on the chats that needed a human. 60% containment with sub-30-second handoffs beats 85% containment with messy escalations, every quarter.
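
Time-to-human is straightforward to compute once escalations are timestamped. A minimal sketch, assuming you log when the bot escalated and when a human first replied; the `EscalatedChat` shape and the choice of median, 90th percentile, and share under thirty seconds are illustrative.

```typescript
interface EscalatedChat {
  escalatedAt: Date;
  humanFirstReplyAt: Date;
}

// Nearest-rank percentile over an ascending, non-empty array.
function percentile(sortedValues: number[], p: number): number {
  const rank = Math.ceil((p / 100) * sortedValues.length);
  return sortedValues[Math.max(0, rank - 1)];
}

function summariseTimeToHuman(chats: EscalatedChat[]): {
  medianSeconds: number;
  p90Seconds: number;
  shareUnder30s: number;
} {
  if (chats.length === 0) throw new Error("no escalated chats to measure");
  const waits = chats
    .map((c) => (c.humanFirstReplyAt.getTime() - c.escalatedAt.getTime()) / 1000)
    .sort((a, b) => a - b);
  return {
    medianSeconds: percentile(waits, 50),
    p90Seconds: percentile(waits, 90),
    shareUnder30s: waits.filter((s) => s < 30).length / waits.length,
  };
}
```

The 90th percentile matters more than the median here: it is the slow handoffs that produce the customer's second, angrier message.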

What we got wrong the first three times

Three mistakes worth naming, because every team makes them.

We tried to score sentiment with a generic classifier. Out-of-the-box sentiment models trained on product reviews are useless on freight chat. "The pallet is f***ed" and "wait, where is it" both score neutral. We ended up fine-tuning a small classifier on a few thousand labelled freight messages, which took an afternoon and worked.

We let the bot apologise. Early versions of the agent ended every escalation with three sentences of "I am truly sorry for the inconvenience." Customers hated it. Now the bot says one short sentence and gets out of the way.

We let context windows grow unbounded. One client's chat history per customer hit twelve thousand turns over a year. Loading all of it into every prompt was expensive and made the bot worse, not better. We now summarise anything older than the current shipment cycle and load the summary plus the last forty turns. The Mattermost team recently hit a similar wall with a ten-thousand-message UI cap; chat history is a budget, not a free resource.
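
The budget rule is small enough to sketch directly. The `Turn` shape and function name are hypothetical, and the summary of older history is assumed to be produced elsewhere; only the forty-turn window comes from the text.

```typescript
interface Turn {
  role: "customer" | "bot" | "human";
  text: string;
}

const RECENT_TURNS = 40;

// Prompt context = one summary line for older history + the last forty turns.
function buildContext(olderSummary: string, turns: Turn[]): string {
  const recent = turns.slice(-RECENT_TURNS);
  return [
    `Summary of earlier history: ${olderSummary}`,
    ...recent.map((turn) => `${turn.role}: ${turn.text}`),
  ].join("\n");
}
```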

Multimodal handoffs: the photo of damaged goods

The single highest-value handoff trigger we shipped last year was image-based. A customer sends a photo of damaged goods. The bot does not try to assess fault. It does try to extract: number of pallets in frame, visible damage type (crush, tear, wet, contamination), and whether the SSCC label is readable. That extraction goes into the dispatcher's handoff card so the human picks up already half-informed.

Warning

Never let the bot quantify damage or assign blame on a damage photo. Extract observable facts, hand off, full stop. "It looks like minor damage" is a sentence that has cost insurers seven figures.
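
One way to enforce this is to constrain what the extraction can express. The sketch below assumes a fixed schema of observable facts plus a word filter on any free-text note; the type names and the word list are illustrative and far from exhaustive.

```typescript
type DamageType = "crush" | "tear" | "wet" | "contamination";

interface PhotoFacts {
  palletsInFrame: number;
  damageTypes: DamageType[];
  ssccLabelReadable: boolean;
}

// Words that signal assessment, not observation. Illustrative only.
const JUDGEMENT_WORDS =
  /\b(minor|major|severe|slight|fault|blame|liable|liability|negligen\w*)\b/i;

// Returns the note unchanged, or null if it strays into assessment.
function safeNote(note: string): string | null {
  return JUDGEMENT_WORDS.test(note) ? null : note;
}

// Renders only the observable facts for the dispatcher's handoff card.
function describeFacts(facts: PhotoFacts): string {
  const damage =
    facts.damageTypes.length > 0 ? facts.damageTypes.join(", ") : "none visible";
  const label = facts.ssccLabelReadable ? "readable" : "not readable";
  return `${facts.palletsInFrame} pallet(s) in frame; visible damage: ${damage}; SSCC label ${label}`;
}
```

The schema does most of the work: with no field for severity or cause, there is nowhere for an opinion to land.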

One audit you can run today

Pull the last hundred chats your team handled and tag each one with the outcome it should have had: bot-only, bot-then-human, or human-only from message one. Then check what your current system actually did. The gap between those two columns is your roadmap. We ran this audit for a 3PL in Tilburg last month and found their agent was escalating 14% of conversations that should have stayed with the bot, and quietly closing 9% that should have gone to a human within two turns. The 9% is the dangerous number. The 14% is fixable with prompt tuning. The 9% will eventually cost the company a claim or a customer.
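
Once the two columns exist, the gap is a few lines of arithmetic. A sketch, assuming each chat is tagged with the outcome it should have had and the outcome it got; the type names are hypothetical, and mismatches between bot-then-human and human-only are not counted in this simplified version.

```typescript
type Outcome = "bot_only" | "bot_then_human" | "human_only";

interface AuditedChat {
  shouldHaveBeen: Outcome; // tagged by a person after the fact
  actuallyWas: Outcome;    // what the system did
}

function auditGap(chats: AuditedChat[]): {
  overEscalated: number;    // share escalated that should have stayed with the bot
  missedEscalation: number; // share closed by the bot that needed a human
} {
  let over = 0;
  let missed = 0;
  for (const chat of chats) {
    if (chat.shouldHaveBeen === "bot_only" && chat.actuallyWas !== "bot_only") over++;
    if (chat.shouldHaveBeen !== "bot_only" && chat.actuallyWas === "bot_only") missed++;
  }
  const total = chats.length || 1; // avoid dividing by zero on an empty audit
  return { overEscalated: over / total, missedEscalation: missed / total };
}
```

Keep the two shares separate when reporting. Added together they hide which of the two problems you have.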

When we built the chat agent for a Dutch freight forwarder earlier this year, the thing we ran into was the regulated-cargo edge case: shippers asking obliquely about goods they did not want to name in writing. We solved it by treating any ambiguity in the cargo description as an automatic handoff, not a retrieval problem. If you are designing or rebuilding a chat layer for a logistics business, our AI agents work starts with exactly this kind of trigger map, before a single line of prompt is written.

The smallest thing you can do today: open your last fifty chat transcripts and underline every sentence where a human should have taken over. Count them. That number is the floor of what your handoff logic needs to catch.

Key takeaway

Containment rate is a vanity metric in logistics chat agents. Time-to-human on the chats that needed a human is the real measure of the system.

FAQ

What is containment rate and why is it the wrong target for logistics chat agents?

Containment rate is the share of chats the bot closes without a human. In logistics it rewards the bot for closing chats it should have escalated, such as damage claims, regulated cargo, and new-account pricing. Optimise for handoff quality instead.

When should a logistics chat agent always hand off to a human?

On money (claims, refunds, credit notes), regulated cargo, new accounts asking about terms, two consecutive negative messages, repeated bot failures, low retrieval confidence, tool-call errors, and ungrounded factual claims.

What confidence threshold should we use to escalate from RAG to a human?

Start around 0.62 cosine similarity on the top retrieved chunk, then calibrate against a labelled set of real chats from your operation. The exact number is less important than measuring it and acting on it.

Should a chat agent assess damage from a customer photo?

No. Extract observable facts (pallet count, visible damage type, label readability) and hand off to a human. Letting the bot quantify damage or assign blame creates legal exposure.

How fast should a human pick up after a chat agent escalates?

Under thirty seconds. Beyond that, the customer asks the same question again and the handoff feels like a stall. The dispatcher card must contain the reason, customer history, and a bot summary so context is instant.
