WhatsApp chat agent on legacy Planon: the full playbook
It is 22:14 on a Thursday and a tenant in Tilburg has just sent a WhatsApp message about a leaking kitchen tap. The work order must exist in Planon before midnight.

It is 22:14 on a Thursday and a tenant in a Tilburg housing block has just sent a WhatsApp message: “kraan in keuken lekt, water staat al op de vloer” (“kitchen tap is leaking, water is already on the floor”). The on-call planner is asleep. The night-shift contractor has a phone but no laptop. The maintenance team’s facility management system, a Planon instance first deployed in 2014, will not accept a work order unless the tenant ID, location code, asset, priority and SLA window are all correctly filled. The SLA clock starts the moment the incident object is created. If nobody creates that object before midnight, the contractual response window slips into the next calendar day and the housing association gets a complaint they did not earn.
This is the post about how we built the chat agent that closes that gap for a 37-person facility-services group, processes about 1,180 tenant maintenance requests a week, and writes every one of them straight into the twelve-year-old Planon FMIS without ever asking the planner to copy anything.
The constraint we could not negotiate
The brief was unambiguous. Planon stays. The group’s customer contracts, KPI reports, sub-contractor SLAs and finance exports all run off the same FMIS. A replacement project had been costed at over half a million euro and three quarters of staff time. Nobody had the appetite. So the AI agent had to write into the legacy system, not around it.
The version we were dealing with predated the modern Planon REST APIs. There was a SOAP web service, a partial Planon Connect endpoint, a database read replica we were allowed to touch, and the formal Planon AppSuite XML interface for incident creation. The internal team had a Planon administrator who knew the schema cold and one half-time Java developer who had been keeping the customisations alive since 2018. That was the whole runway.
Two non-negotiables came from the operations director. Every WhatsApp message had to land as an incident object inside Planon (no shadow database), and the SLA clock had to be the one Planon already calculated. Not a parallel timer in the agent. The auditor would never accept two clocks.
Architecture in one paragraph
The pipeline runs in four hops. A tenant sends a WhatsApp message to the facility group’s Business number. The WhatsApp Cloud API delivers it to Twilio Conversations, which gives us a single threaded inbox across WhatsApp, SMS and the existing webchat widget. Our agent middleware (a TypeScript service on the group’s own VPC) reads the thread, runs the classifier and slot-filler, and decides whether to auto-create the Planon incident, ask one clarifying question, or hand the conversation off to a human. The middleware POSTs the incident into Planon via the AppSuite XML interface. Planon returns the incident number. The middleware writes that number back into the Twilio conversation as a system message, so every human who later joins the thread sees the case ID at the top.
Why Twilio Conversations and not the WhatsApp Cloud API directly? Because the group already ran Twilio for outbound SMS reminders and the night-shift dispatcher used Twilio Flex for voice. Putting WhatsApp into the same Conversations resource meant the planner could pick up a thread on a tablet, the night supervisor could see it in Flex, and we did not have to invent a new inbox UI. The unified Conversations object is documented at twilio.com/docs/conversations and is one of the few cases where the vendor abstraction is worth the per-message fee.
Identifying tenants without making them log in
The single hardest UX decision was authentication. Tenants will not click links. They will not type tenant IDs. They will not log in to a portal. They will photograph a leaking pipe and send it with the caption “huis nummer 14 het lekt weer” (“house number 14, it is leaking again”). The agent has to figure out who they are without making the conversation feel like a form.
Three signals do most of the work: the WhatsApp phone number, matched against the contact record in Planon’s tenant table; free-text address parsing against the Adres, Huisnummer and Postcode fields already on the tenant master; and a fallback question, asked only when the first two fail, for the four-digit Planon tenant code printed on every invoice. About 92% of inbound messages match on phone number on the first try. The address parser catches another 6% or so (housemates using a different phone). The remaining 2% trigger the four-digit fallback. We never ask for full names. Privacy and reading-from-a-screen friction are the same problem in this product.
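The cascade can be sketched roughly as follows. This is a minimal illustration, not the production matcher: the `Tenant` shape, the two regexes and the `identifyTenant` name are assumptions, and the address step is simplified to "postcode plus house number, accepted only if exactly one tenant matches".

```typescript
// Hypothetical shape of a tenant row; the real Planon tenant master will differ.
interface Tenant {
  id: string;
  phone: string;      // E.164, e.g. "+31600000001"
  huisnummer: string;
  postcode: string;   // e.g. "5038 AB"
  tenantCode: string; // four-digit code printed on invoices
}

type Identification =
  | { kind: "matched"; tenant: Tenant; via: "phone" | "address" | "code" }
  | { kind: "ask_for_code" };

const normalisePostcode = (s: string): string => s.replace(/\s+/g, "").toUpperCase();

function identifyTenant(
  tenants: Tenant[],
  fromPhone: string,
  text: string,
  suppliedCode?: string,
): Identification {
  // Signal 1: the sender's WhatsApp number.
  const byPhone = tenants.find((t) => t.phone === fromPhone);
  if (byPhone) return { kind: "matched", tenant: byPhone, via: "phone" };

  // Signal 2: postcode plus house number found in the free text (simplified).
  const postcode = text.match(/\b\d{4}\s?[A-Za-z]{2}\b/);
  const houseNo = text.match(/\b(?:nr|nummer)\.?\s*(\d+)\b/i);
  if (postcode && houseNo) {
    const wantPostcode = normalisePostcode(postcode[0] ?? "");
    const wantHouseNo = houseNo[1] ?? "";
    const hits = tenants.filter(
      (t) => normalisePostcode(t.postcode) === wantPostcode && t.huisnummer === wantHouseNo,
    );
    // Accept only an unambiguous match; anything else falls through.
    const [only, ...others] = hits;
    if (only && others.length === 0) return { kind: "matched", tenant: only, via: "address" };
  }

  // Signal 3: the four-digit code, once the tenant has been asked for it.
  if (suppliedCode) {
    const byCode = tenants.find((t) => t.tenantCode === suppliedCode);
    if (byCode) return { kind: "matched", tenant: byCode, via: "code" };
  }
  return { kind: "ask_for_code" };
}
```

Returning `ask_for_code` as an explicit outcome, rather than `null`, keeps the fallback question a deliberate step in the conversation instead of an error path.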
The intent layer
The classifier is small. There are only seven intent classes that matter to a facility-services group: new maintenance request, update or chase on an existing case, access or key request, complaint or quality issue, billing or rent question (handed straight to a different team), emergency (gas, water, electricity, lift trapped), and everything else (small-talk, wrong number, scam).
We use a single LLM call per inbound message with a structured-output schema. The schema returns intent, confidence, extracted slots (location, asset type, severity, language) and a refusal flag. If confidence drops below 0.78 the agent asks one clarifying question and re-classifies. Two clarifications and the conversation routes to a human. We capped it at two because in production logs we saw that any agent asking a third question lost the tenant. They would put the phone down and call the front desk, which is the failure mode we were paid to remove.
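As a rough sketch, the structured output and the clarify-or-hand-off rule might look like this. The intent identifiers, field names and `nextStep` function are illustrative assumptions; only the 0.78 threshold and the cap of two clarifications come from the text. Routing a refusal straight to a human is also an assumption.

```typescript
// The seven intent classes; the identifiers are hypothetical.
type Intent =
  | "new_request" | "update" | "access" | "complaint"
  | "billing" | "emergency" | "other";

// Assumed shape of the structured output returned by the single LLM call.
interface Classification {
  intent: Intent;
  confidence: number; // 0..1
  slots: { location?: string; assetType?: string; severity?: string; language?: string };
  refusal: boolean;
}

type NextStep = "proceed" | "ask_clarification" | "handoff_to_human";

const CONFIDENCE_THRESHOLD = 0.78;
const MAX_CLARIFICATIONS = 2;

function nextStep(c: Classification, clarificationsAsked: number): NextStep {
  // Assumption: a model refusal is never retried and goes to a person.
  if (c.refusal) return "handoff_to_human";
  if (c.confidence >= CONFIDENCE_THRESHOLD) return "proceed";
  // Below threshold: ask again only while under the cap of two questions.
  return clarificationsAsked < MAX_CLARIFICATIONS ? "ask_clarification" : "handoff_to_human";
}
```

Keeping the decision in a pure function of the classification and a counter means the cap cannot be talked around by the model: the third question simply has no code path.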
For the emergency class the LLM call still runs but is wrapped in a regex guardrail. Any message that matches a list of trigger phrases (“gas”, “rook” (smoke), “spuit” (spraying), “geen stroom” (no power), “vast in de lift” (stuck in the lift), and so on) jumps straight to the on-call dispatcher in Flex without waiting for the model. The model is right roughly 99% of the time on these, but 99% is not good enough when the failure mode is a gas leak. Guardrails on the fast path are not a question of taste. The recent run of front-page stories about agents running amok inside production systems is a reasonable reminder: an agent with the right tool calls and a sloppy classifier can do real damage. We treat the Planon write as a privileged action and gate it the same way you would gate a sudo.
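A minimal sketch of such a fast-path check, using only the five trigger phrases quoted above (the full production list is not reproduced here). The constant and function names are illustrative.

```typescript
// Only the phrases quoted in the text; the real list is longer.
const EMERGENCY_TRIGGERS: RegExp[] = [
  /\bgas\b/i,
  /\brook\b/i,
  /\bspuit\b/i,
  /\bgeen\s+stroom\b/i,
  /\bvast\s+in\s+de\s+lift\b/i,
];

// True when the message should skip the model and go straight to dispatch.
// The patterns carry no `g` flag, so `test` keeps no state between calls.
function isEmergencyFastPath(text: string): boolean {
  return EMERGENCY_TRIGGERS.some((re) => re.test(text));
}
```

The word boundaries are a trade-off. They stop “gasfornuis” (gas cooker) from paging the dispatcher, but they also miss compounds such as “gaslucht” (smell of gas), so a real list would need those spelled out explicitly.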
Writing the incident object the right way
This is the section where we earned our fee.
Planon’s incident object has roughly forty fields. About fifteen of them are mandatory if you want the SLA engine to start the clock automatically. Get any one of them wrong and Planon will accept the record but skip the SLA evaluation, which means the report at the end of the month will show every chat-agent-created incident as having a blank response time. The Planon administrator would notice this on day three and the whole project would die.
We mapped every mandatory field to a deterministic source. The tenant ID came from the phone-number match. The location came from the tenant’s primary address record. The asset code was nullable, set only when the LLM extracted one with confidence above 0.9. The priority was derived from the severity slot, mapped through the customer-specific priority matrix (the housing association uses 1 through 4, the school portfolio uses 1 through 5; the matrix is per-contract). The SLA was looked up from the contract object, never inferred. The report date was server time at write. The description was the LLM’s clean Dutch summary plus the verbatim text of the tenant’s first message and any image filenames.
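// priorityMatrix is the per-contract severity-to-priority table, loaded elsewhere in the middleware.
function composeIncident(msg, tenant, contract) {
  // Look the priority up in the customer's matrix; fail loudly rather than write a blank.
  const priority = priorityMatrix[contract.mandantCode]?.[msg.severity];
  if (priority === undefined) {
    throw new Error(`No priority mapping for ${contract.mandantCode}/${msg.severity}`);
  }
  return {
    BO_TENANT_ID: tenant.id,
    BO_LOCATION: tenant.primaryLocationId,
    BO_ASSET: msg.assetCode ?? null, // nullable asset hint from the LLM
    BO_PRIORITY: priority, // computed, never from LLM
    BO_SLA: contract.slaId, // computed, never from LLM
    BO_REPORT_DATE: new Date().toISOString(), // server time at write
    BO_DESCRIPTION: `${msg.cleanSummary}\n\nVerbatim: ${msg.raw}`,
    BO_MANDANT: contract.mandantCode,
  };
}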
We deliberately do not let the LLM choose priority or SLA. The model writes the description and the asset hint. Everything that touches the contractual clock is computed from joins inside the middleware. If we ever have to defend a missed SLA in front of the customer, the trail has to be explainable in one screen. “The LLM thought it was a P2” is not a defensible answer.
Do not let the LLM pick the contractual priority. Map it through the customer’s existing priority matrix from data already in the FMIS. The audit trail is the product.
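To make the per-contract lookup concrete, here is a sketch of what such a matrix could look like. Everything in it is hypothetical: the mandant codes, the severity labels and the specific numbers are placeholders chosen only to show a 1 to 4 scale next to a 1 to 5 scale, and the assumption that 1 is the most urgent is ours.

```typescript
// Hypothetical severity labels extracted by the model as a hint.
type Severity = "low" | "medium" | "high" | "critical";

// One row per customer mandant; values are placeholders, not real contract data.
type PriorityMatrix = Record<string, Partial<Record<Severity, number>>>;

const priorityMatrix: PriorityMatrix = {
  HOUSING: { low: 4, medium: 3, high: 2, critical: 1 }, // 1..4 scale
  SCHOOLS: { low: 5, medium: 4, high: 2, critical: 1 }, // 1..5 scale
};

// Deterministic lookup: a missing mapping is an error, never a guess.
function resolvePriority(mandantCode: string, severity: Severity): number {
  const priority = priorityMatrix[mandantCode]?.[severity];
  if (priority === undefined) {
    throw new Error(`No priority mapping for ${mandantCode}/${severity}`);
  }
  return priority;
}
```

Because the table is plain data keyed by mandant, a reviewer can read the whole mapping on one screen, which is the property the audit argument depends on.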
Twilio Conversations as the escalation lane
Escalations happen for three reasons: the classifier has been uncertain twice, the intent is “complaint”, or less than 25% of the SLA window remains and the case is still in “new” status.
In all three cases the agent posts a Conversations system message naming the reason, pings the right Twilio queue (planners by day, on-call dispatcher by night), and leaves the WhatsApp thread open. The human picks up inside the same thread. The tenant sees no break. From the tenant’s side, “the company” is one continuous chat window. From our side, we know exactly which messages were model-generated and which were human, because every model message is tagged with a hidden attribute, x-abn-author: agent-v3.1, that surfaces in the audit export.
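The three triggers can be expressed as one small function. This is a sketch under assumptions: the `CaseSnapshot` fields are hypothetical, and the SLA start and deadline are taken to be values read back from Planon rather than computed by the agent.

```typescript
type EscalationReason = "classifier_uncertain_twice" | "complaint" | "sla_at_risk";

// Hypothetical view of a case, assembled by the middleware.
interface CaseSnapshot {
  intent: string;
  lowConfidenceRounds: number; // classifications that fell below the threshold
  status: string;              // Planon status, e.g. "new"
  slaStartMs: number;          // epoch ms, as calculated by Planon
  slaDeadlineMs: number;       // epoch ms, as calculated by Planon
}

const SLA_REMAINING_THRESHOLD = 0.25;

// Fraction of the SLA window still left, clamped at zero once overdue.
function slaRemainingFraction(c: CaseSnapshot, nowMs: number): number {
  const total = c.slaDeadlineMs - c.slaStartMs;
  if (total <= 0) return 0;
  return Math.max(0, (c.slaDeadlineMs - nowMs) / total);
}

// Returns the reason to name in the system message, or null if none applies.
function escalationReason(c: CaseSnapshot, nowMs: number): EscalationReason | null {
  if (c.lowConfidenceRounds >= 2) return "classifier_uncertain_twice";
  if (c.intent === "complaint") return "complaint";
  if (c.status === "new" && slaRemainingFraction(c, nowMs) < SLA_REMAINING_THRESHOLD) {
    return "sla_at_risk";
  }
  return null;
}
```

Returning the reason as a value, rather than a boolean, is what lets the system message tell the human why the thread landed in their queue.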
A subtle thing about Twilio Conversations: the WhatsApp 24-hour customer-care window still applies even when a human takes over. If the planner replies more than 24 hours after the last inbound message, the message gets dropped unless it goes out as an approved template. The middleware tracks the window per conversation and quietly switches outbound to a template after it expires. Meta’s WhatsApp Cloud API messaging guide is the source of truth here and is worth a read before you ship anything serious.
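The window check itself is small. The sketch below assumes the middleware stores the timestamp of the last inbound tenant message per conversation; the `Outbound` type and the `chooseOutbound` name are illustrative, and the template name is supplied by the caller and must be one already approved.

```typescript
const CARE_WINDOW_MS = 24 * 60 * 60 * 1000;

type Outbound =
  | { kind: "freeform"; body: string }
  | { kind: "template"; templateName: string };

// Decide how a reply may be sent, given the last inbound message time.
function chooseOutbound(
  lastInboundAt: Date,
  now: Date,
  body: string,
  fallbackTemplate: string,
): Outbound {
  const windowOpen = now.getTime() - lastInboundAt.getTime() < CARE_WINDOW_MS;
  return windowOpen
    ? { kind: "freeform", body }
    : { kind: "template", templateName: fallbackTemplate };
}
```

Passing `now` in as a parameter, instead of reading the clock inside the function, keeps the rule easy to test at the boundary.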
Guardrails, refusals, and the audit trail
Three guardrails worth describing.
First, every Planon write is preceded by a dry-run validation against the same XML schema Planon uses internally. If the dry-run fails, the message escalates to a human and the agent never retries with mutated input. We never have the model “fix” the XML. The write either passes validation as composed by deterministic code, or it does not run.
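The control flow of that first guardrail can be sketched as below. The validator, the Planon call and the escalation hook are injected as hypothetical dependencies, and they are synchronous here only to keep the sketch short; real calls would be asynchronous.

```typescript
type Validation = { kind: "valid" } | { kind: "invalid"; errors: string[] };

type WriteOutcome =
  | { kind: "written"; incidentNumber: string }
  | { kind: "escalated"; errors: string[] };

// Hypothetical dependencies; none of these are real Planon or Twilio APIs.
interface WriteDeps {
  validate: (xml: string) => Validation; // dry run against the XML schema
  post: (xml: string) => string;         // returns the Planon incident number
  escalate: (errors: string[]) => void;  // hands the message to a human
}

function guardedWrite(xml: string, deps: WriteDeps): WriteOutcome {
  const result = deps.validate(xml);
  if (result.kind === "invalid") {
    // No retry and no model-assisted repair: a failed dry run ends here.
    deps.escalate(result.errors);
    return { kind: "escalated", errors: result.errors };
  }
  return { kind: "written", incidentNumber: deps.post(xml) };
}
```

There is deliberately no loop in this function. The absence of a retry branch is the guardrail.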
Second, the model has no tool that modifies or closes an incident once written. The chat agent can only create new incidents or post comments. Status changes (accepted, in progress, closed) are reserved for the planner UI. A chat agent that can close its own tickets is a chat agent that will optimise its own metrics.
Third, every model call is logged with the full prompt, model version, raw response, latency and cost into a Postgres table the customer owns and reads. The retention is set by their DPO, not by us. Given the current public discussion around vendor-side data retention minimums, designing for “we control the logs” from day one is the only sane default. We turn on zero-day retention at the model vendor and write our own logs into the customer’s database.
For the orchestration layer itself we have looked at Burr from DAGWorks for the state-machine view. We are not running it in this project (the agent loop is small enough that a hand-written state machine is more legible), but for a more conversational agent with branching, a graph framework pays for itself the first time you have to debug a stuck conversation.
What changed in the first eight weeks
Numbers from the customer’s own dashboard, not ours.
- Median time from tenant message to Planon incident created: 14 minutes (planner by hand) down to 38 seconds (agent).
- Share of out-of-hours tenant requests that hit the next-morning queue with a correct SLA window already attached: from 41% to 96%.
- Planner hours per week previously spent on triage typing: roughly 22 hours, redirected into the cases that actually need a human.
- Complaint rate (“nobody got back to me”) on out-of-hours messages: down by about half, measured in the housing association’s monthly KPI report.
We do not claim the agent makes planners faster. It removes typing they should never have been doing. That is a different and more honest claim.
The five-minute audit you can run tomorrow
If you are looking at a similar setup, the cheapest first step is a field-level audit of your own FMIS or CRM. Print the mandatory fields on the incident or ticket object, sit next to a planner for an hour, and watch where they paste from. Anywhere they paste, you have a candidate for the agent to write deterministically. You will probably find the SLA clock is the field that scares everyone the most. Map it back to a contract row, not a model guess, and the project gets a lot less terrifying.
When we built the chat agent for the Tilburg group, the thing we did not see coming was that Planon’s XML interface accepts a slightly different priority code per customer mandant, and one enum was off by one between the housing-association mandant and the schools mandant. We solved it by reading the mandant code from the tenant record before composing the XML. That cost us a week, and discoveries like it are most of the work in building AI agents against systems built before there was an API to write to.
Key takeaway
Let the chat agent create work, never close it. Map SLA priority through the customer's existing matrix, not the model. The audit trail is the product.
FAQ
Does the agent ever close a ticket on its own?
No. The chat agent can create incidents and add comments. Status changes (accepted, in progress, closed) stay with the planner UI. An agent that can close its own work optimises its own metrics.
How do you handle the WhatsApp 24-hour customer-care window?
The middleware tracks the window per conversation. Once it expires, outbound human replies are automatically switched to an approved template message until the tenant writes again.
What happens if Planon is unreachable?
The agent acknowledges the tenant, queues the incident in a local outbox, and retries with backoff. For incidents replayed from the outbox, the report date written into Planon is the original message timestamp rather than the time of the retry, so the SLA clock still anchors correctly once Planon comes back.
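A sketch of the outbox bookkeeping, assuming exponential backoff with a cap. The base delay, the cap and all names are illustrative; the source only says "retries with backoff".

```typescript
// Hypothetical outbox row for an incident that could not be written yet.
interface OutboxEntry<T> {
  payload: T;
  messageReceivedAt: string; // ISO timestamp of the tenant's original message
  attempts: number;          // failed attempts so far
  nextAttemptAtMs: number;   // epoch ms of the next scheduled retry
}

const BASE_DELAY_MS = 2_000;     // assumed starting delay
const MAX_DELAY_MS = 5 * 60_000; // assumed cap of five minutes

// Exponential backoff: 2s, 4s, 8s, ... capped at MAX_DELAY_MS.
function nextDelayMs(attempts: number): number {
  return Math.min(MAX_DELAY_MS, BASE_DELAY_MS * 2 ** attempts);
}

// Record a failed attempt without touching the original message timestamp.
function recordFailure<T>(entry: OutboxEntry<T>, nowMs: number): OutboxEntry<T> {
  return {
    ...entry,
    attempts: entry.attempts + 1,
    nextAttemptAtMs: nowMs + nextDelayMs(entry.attempts),
  };
}
```

Keeping `messageReceivedAt` immutable on the entry is what allows the eventual write to carry the original timestamp as its report date, however many retries it took.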
Why Twilio Conversations rather than the WhatsApp Cloud API directly?
The planners, the night dispatcher and the existing SMS reminders already lived inside Twilio. One inbox per tenant across WhatsApp, SMS and webchat is worth the per-message fee.
Does the LLM ever choose the SLA priority?
No. The model extracts a severity hint. Priority and SLA are computed inside the middleware from the customer-specific matrix and the contract object. Anything that touches the contractual clock has to be defensible without the model.