Process automation in industrial IoT: a SCADA anomaly queue
On a Tuesday afternoon in May, a Deventer reactor pressure ticked from 3.8 bar to 4.3 bar in eight seconds. The queue caught it. No setpoint moved until an engineer approved it.

On a Tuesday afternoon in May, around 14:42 local time, the headline pressure sensor on a reactor in Deventer ticked from 3.8 bar to 4.3 bar in eight seconds. Twelve years ago, that reading would have walked straight through the Wonderware Historian and into the control loop. This time it stopped in a queue. A process engineer looked at it for ninety seconds, decided it was a real swing and not a transient, and approved a manual setpoint adjustment from her laptop in the office in Apeldoorn. The OPC-UA bridge wrote the new value at 14:46. No alarm. No call to the night shift. No paperwork on Monday.
The customer and the two histories
The customer is an 18-person industrial-IoT SaaS spun out of a process-control consultancy in 2019. They sell predictive-maintenance dashboards to mid-sized chemical and food plants across the Benelux and Germany. Their stack has the shape that anyone who has worked OT-side will recognise. A 12-year-old Wonderware Historian (now branded AVEVA Historian) sits on the plant side. A homegrown InfluxDB cluster sits on the IT side. A nightly ETL stitches them together and nobody fully trusts it.
The Wonderware instance is the source of truth for what the plant physically did. InfluxDB is what the dashboards and the alerting engine read from. When the two disagree, which they do roughly 4,200 times per week across the customer base, somebody has to decide which version of reality is correct. Until last winter, that somebody was a junior process engineer (procesingenieur, in Dutch) with a Notion page and a lot of patience.
The brief we got was short: make the reconciliation queue something a senior engineer can drain in two coffee breaks, not something a junior engineer drowns in for five days.
What the agent actually does
The agent runs every 30 seconds. It does three things. It pulls the same 10-minute trailing window from both stores. It diffs each tagged sensor against a per-tag tolerance. It decides whether to auto-correct the IT-side cluster or park the anomaly for a human.
def reconcile_window(window):
    hist = pull_historian(window)   # AVEVA / Wonderware
    influx = pull_influx(window)    # IT-side dashboard source
    for tag in TAGGED_SENSORS:
        a = hist.series(tag)
        b = influx.series(tag)
        delta = drift(a, b)
        if delta["mae"] > tag.tolerance:
            yield Anomaly(tag, a, b, delta)
That is the entire top loop. The interesting part is what happens between yielding an anomaly and writing anything anywhere. Across the customer base, the loop currently surfaces about 4,200 anomalies per week. After the gate described below, around 230 land in front of a human. The rest are auto-corrected on the IT side, with a log line, and never touch the plant.
The 4-bar rule and the queue
The first lesson the customer's lead engineer gave us was this: any pressure deviation over 4 bar absolute is not the agent's call. Ever. Swings of that size can shift a relief valve, void a CE marking on the vessel, and ruin somebody's quarter. The rule is hard-coded.
def must_human(anomaly):
    # Pressure deviations over 4 bar can shift a relief valve.
    # Never auto-corrected, even when the data is unambiguous.
    # anomaly.delta is the dict returned by drift(); "max" is the largest
    # absolute deviation in the window (assumed to be the figure the rule means).
    if anomaly.tag.physics == "pressure" and anomaly.delta["max"] > 4.0:
        return True
    # Any tag mapped to a redundant safety loop.
    if anomaly.tag.safety_class >= 2:
        return True
    return False
The gate produces, on average, 38 items per day across the customer base. Each one becomes a single Slack message in a private channel called #procesingenieur-queue, with the time-aligned chart attached, a short natural-language summary, and two buttons: approve write and reject. The agent does nothing until a button is pressed. If neither is pressed within two hours, the anomaly is escalated to the on-call number through Opsgenie.
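The life of one queued item can be sketched as a small state machine. This is an illustration only: the class name `QueueItem`, its fields, and the state labels are assumptions, and the Slack and Opsgenie calls are left out entirely. The one property it is meant to show is that a timeout escalates but never unlocks a write.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# From the text: two hours without a button press triggers escalation.
ESCALATE_AFTER = timedelta(hours=2)


@dataclass
class QueueItem:
    """Hypothetical record for one parked anomaly."""
    tag_id: str
    posted_at: datetime
    decision: Optional[str] = None     # "approve" or "reject", set by a button press
    decided_by: Optional[str] = None   # the named human who pressed it

    def decide(self, decision: str, user: str) -> None:
        if decision not in ("approve", "reject"):
            raise ValueError(f"unknown decision: {decision!r}")
        self.decision, self.decided_by = decision, user

    def state(self, now: datetime) -> str:
        """Return 'approved', 'rejected', 'escalated' or 'pending'."""
        if self.decision == "approve":
            return "approved"
        if self.decision == "reject":
            return "rejected"
        if now - self.posted_at >= ESCALATE_AFTER:
            return "escalated"  # page on-call; the write stays parked
        return "pending"

    def may_write(self, now: datetime) -> bool:
        # Only an explicit approval unlocks a write; a timeout never does.
        return self.state(now) == "approved"
```

An escalated item is still an undecided item, so `may_write` stays false for it however long it sits.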
Why the agent never closes the loop
There is an engineering reason for the queue and there is a political reason. Both are real and neither is going away.
The engineering reason is that any model, local or hosted, can hallucinate the wrong tag id at a rate too low to catch in QA and too high to bet a relief valve on. The reconciliation diff is plain numerical Python. The summary the engineer reads in Slack passes through a language model, and that model is allowed to be wrong about the prose. It is never allowed to choose the tag.
The political reason is that the customer's end clients sign contracts that say a named human approves every setpoint change on the process side. NIST SP 800-82r3 and IEC 62443 both lean hard on that boundary. An auditor walking into a plant will ask to see the approval trail before they ask about anything else.
So the write boundary at the OPC-UA bridge is enforced at the credential level, not the application level. The agent's service account can enqueue. It cannot write. A separate process, owned by the customer's ops team and running on a different host, drains the approved queue and writes to the bridge. The two processes do not share a secret store. If the agent is fully compromised tomorrow, the worst it can do is fill a Slack channel.
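The credential split itself cannot be shown in code, because it lives in how the two service accounts are provisioned. What can be sketched is the check the drain process might apply on its side before anything reaches the bridge. Everything here is hypothetical: `ApprovedWrite`, `drain`, and the `write_to_bridge` callable stand in for whatever the ops team actually runs.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass(frozen=True)
class ApprovedWrite:
    """Hypothetical entry in the approved queue."""
    tag_id: str
    value: float
    approved_by: Optional[str]  # named human taken from the approval log


def drain(queue: Iterable[ApprovedWrite],
          write_to_bridge: Callable[[str, float], None]) -> List[ApprovedWrite]:
    """Write entries that carry a named approver; return the ones refused."""
    refused = []
    for entry in queue:
        if not entry.approved_by:
            # No named human on record: refuse and keep it for review.
            refused.append(entry)
            continue
        write_to_bridge(entry.tag_id, entry.value)
    return refused
```

Passing the bridge writer in as a callable keeps the sketch testable without an OPC-UA client. In the setup described above, only the drain process would hold credentials for the real one.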
Why not a local model on the plant
One question comes up every time we explain this stack to a new prospect: why not run a local model on the plant network and skip the VPN to a hosted summariser? The HN front page this week has a thousand-comment Ask HN thread on exactly that, with people benchmarking local Qwen and Llama variants against hosted models for daily coding. The answer for this customer is mixed.
The reconciliation logic itself runs on a small box inside the OT segment. It has no model at all. It is pandas, numpy, and an httpx client. The model only enters when we generate the Slack summary, and that one runs off-plant behind a VPN with strict egress rules. A local model on the plant side was on the table for six weeks. The customer's IT security lead vetoed it for a reason we agreed with: a model on the OT segment is one more thing to patch, one more thing to monitor, and one more thing an auditor will ask about. The summary is not on the safety path. We were happy to move it off-network.
The grid-alignment trap
The single longest debugging session on this project was about clocks. Wonderware logs on change with millisecond precision. InfluxDB logs on a fixed one-second cadence. The two clocks drift by a few hundred milliseconds depending on which PLC is talking and how busy the OPC server is. Comparing raw points produces phantom anomalies that look like data corruption and are in fact one source sampling a curve more aggressively than the other.
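import pandas as pd

def _on_grid(s, grid):
    # Interpolate in time over the union of raw timestamps and grid points,
    # then keep only the grid. A bare reindex(grid) would drop every raw
    # sample that does not land exactly on a grid point and leave only NaN.
    return s.reindex(s.index.union(grid)).interpolate(method="time").reindex(grid)

def aligned(a, b, freq="1s"):
    """Resample both series to a shared 1-second grid before comparing."""
    # Snap the overlap to whole grid steps so both series cover every point.
    start = max(a.index.min(), b.index.min()).ceil(freq)
    end = min(a.index.max(), b.index.max()).floor(freq)
    grid = pd.date_range(start, end, freq=freq)
    return _on_grid(a, grid), _on_grid(b, grid)

def drift(a, b):
    a, b = aligned(a, b)
    delta = (a - b).abs()
    return {
        "mae": float(delta.mean()),
        "p95": float(delta.quantile(0.95)),
        "max": float(delta.max()),
        "max_at": delta.idxmax().isoformat(),
    }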
The first version of this agent flagged 11,000 anomalies in its first 24 hours. We were diffing the raw series before aligning them. Almost all of it was clock skew. Fix the alignment before you tune the tolerance.
Three failure modes we hit early
The first was the clock-skew flood above. The second was schema drift on the Wonderware side. An engineer at one plant added six new tags to a vessel and did not tell anyone. The agent saw new keys in the historian, no matches in InfluxDB, and panicked. We now treat an unmatched-tag rate above 2% in a single window as a configuration event, not an anomaly, and the agent pings the customer's ops channel instead of #procesingenieur-queue.
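The unmatched-tag check can be sketched in a few lines. The function name and return labels are hypothetical, and the denominator is an assumption: the text gives the 2% threshold but does not say what the rate is measured against, so this version uses the number of historian tags in the window.

```python
def classify_window(historian_tags, influx_tags, threshold=0.02):
    """Decide whether a window looks like schema drift or normal reconciliation.

    Returns (kind, unmatched), where kind is "configuration_event" or "reconcile".
    """
    hist = set(historian_tags)
    if not hist:
        raise ValueError("historian returned no tags for this window")
    # Tags the historian knows about that the IT-side store does not.
    unmatched = hist - set(influx_tags)
    rate = len(unmatched) / len(hist)  # assumed denominator: historian tags
    kind = "configuration_event" if rate > threshold else "reconcile"
    return kind, sorted(unmatched)
```

With 100 historian tags, three unmatched tags trip the check and two do not, because the comparison is strictly greater than 2%.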
The third was retention. The InfluxDB cluster has a 7-day hot retention policy on raw points. One morning the agent asked for an 8-day window after a long bank-holiday weekend, got empty InfluxDB series, computed drift as zero, and quietly approved a batch of writes that should never have been approved. Nobody got hurt because the 4-bar gate caught the only one that mattered. We now treat an empty series as an error condition and bail out of the loop entirely. If the data is not there, the agent is not allowed to guess.
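A guard of this kind might look like the sketch below. `MissingDataError` and `require_data` are hypothetical names. Treating an all-NaN series the same as an empty one is an extra assumption that goes slightly beyond what the text describes.

```python
import pandas as pd


class MissingDataError(RuntimeError):
    """Raised when a store returns nothing usable for a tag."""


def require_data(series: pd.Series, source: str, tag_id: str) -> pd.Series:
    # An empty (or all-NaN) series must stop the loop. It must never be
    # read as "no drift", which is how the retention bug slipped through.
    if series.empty or series.isna().all():
        raise MissingDataError(f"{source} returned no data for {tag_id}")
    return series
```

Calling this on both series before computing drift turns a silent zero into a loud failure, so the loop bails out instead of guessing.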
What changed for the ops team
Before the agent, the junior process engineer spent about 11 hours a week eyeballing dashboards and tagging Notion entries. The senior engineer spent about 4 hours a week reviewing the junior's tags. The night shift was called twice a week on average to confirm or override a setpoint write the dashboard had recommended. None of that was high-leverage work.
Six months after the agent went live, the junior is no longer in this loop at all. Total ops time spent on reconciliation is about 90 minutes a week in two scheduled blocks, plus the 38-items-a-day Slack queue, which is mostly approve-in-three-seconds. The night shift has been called once in the last six weeks. The senior engineer has used the freed hours to start a control-loop tuning project the customer has wanted for three years and never had the bandwidth for.
The 4,200 number is anomalies per week, before the gate. After auto-correction, around 230 reach a human. Of those, the engineer approves about 92% as the agent suggested. The 8% rejected is the entire point of the queue.
An anomaly queue earns its keep only when humans reject a real share of what the agent surfaces. If the rejection rate sits near zero, you have built a notifier, not a queue.
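The figures above work out as follows. This is plain arithmetic on the numbers quoted in the text; the function and key names are made up for the illustration.

```python
def funnel(surfaced, reached_human, approval_rate):
    """Break a week's anomalies down into the stages described in the text."""
    return {
        "auto_corrected": surfaced - reached_human,
        "share_reaching_human": reached_human / surfaced,
        "rejected_per_week": reached_human * (1 - approval_rate),
    }


weekly = funnel(surfaced=4200, reached_human=230, approval_rate=0.92)
for name, value in weekly.items():
    print(f"{name}: {value:.3f}")
```

About 3,970 anomalies a week are auto-corrected, roughly 5.5% reach a human, and around 18 a week are rejected. Those 18 or so are the cases where the agent's suggestion would have been wrong had it been allowed to act alone.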
When we built this agent for the Deventer customer, the thing we ran into was not the model. It was the eight months of work explaining to the customer's external auditor why a credential-level write boundary at the OPC-UA bridge is materially safer than the manual workflow it replaced. We ended up solving it by giving the auditor a read-only view of the queue and the approval log, scoped per plant. The audit closed in two visits. If you are running a similar two-historian split and quietly hoping the nightly ETL will hold for another year, this is the kind of process automation work we do.
The smallest thing you could do today: pull the same 10-minute window from your primary historian and from whatever your dashboard reads, snap both to a 1-second grid, and look at the mean absolute error per tag. The numbers will tell you whether you have a reconciliation problem or whether you only think you do.
Key takeaway
An anomaly queue earns its keep only when humans reject a real share of what the agent surfaces. Otherwise you have built a notifier, not a queue.
FAQ
Why not let the agent write directly to OPC-UA?
The end clients' contracts require a named human to approve every setpoint change on the process side, and auditors working from NIST SP 800-82 and IEC 62443 ask to see that approval trail first. A model can also hallucinate a wrong tag id at a rate too low to catch in QA and too high to bet a relief valve on.
How does the agent handle clock skew between Wonderware and InfluxDB?
It resamples both series onto a shared 1-second grid before diffing. Raw-point comparison produces phantom anomalies because the two stores sample at different cadences and the OPC server adds variable latency.
What model runs inside the agent?
The reconciliation loop runs no model at all, only pandas and numpy. A hosted language model generates the Slack summary, but it never selects the tag id or decides whether a write is allowed.
What happens if #procesingenieur-queue is not drained in time?
An anomaly unactioned for two hours escalates to the on-call number through Opsgenie. If that ping is missed, the agent still does not auto-correct. It holds the proposed write until a human signs off.