AI agents
S3 egress incident: when an AI refactor doubled our bill
The Slack ping came at 14:02 on a Tuesday: an AWS budget alarm tripped, and the rsync wrapper that should have been quiet had quietly doubled our S3 egress.

The Slack ping came at 14:02 on a Tuesday in May: an AWS budget alarm tripped on the eu-west-1 account that holds the media archive for a public broadcaster in Groningen. Spend was tracking 2.1x the rolling fourteen-day average, almost all of it S3 egress out of a single bucket. The nightly archive sync had finished without errors. Nothing in CloudWatch looked wrong. The logs said "complete".
Four hours later we knew what happened. Someone, on the Friday before, had asked an AI assistant to clean up the rsync wrapper that drove the nightly sync. The diff was small, well-commented, and reviewed in eleven minutes. It also re-introduced a bug pattern an earlier engineer had specifically designed out two years before. This is the walkthrough.
The wrapper, before
The script lived at /usr/local/bin/archive-sync on a single Hetzner box in the broadcaster's Groningen office. It pushed daily media drops, roughly 800 GB of MXF and ProRes a night, to s3://archive-prod/YYYY/MM/DD/. The ingest box wrote files into /mnt/ingest/$DAY/ throughout the day. At 02:00 Amsterdam time, archive-sync ran.
The original script, stripped of comments:
#!/usr/bin/env bash
set -euo pipefail
DAY="$(date -u -d 'yesterday' +%Y/%m/%d)"
LOCAL="/mnt/ingest/${DAY}"
BUCKET="archive-prod"
MANIFEST_LOCAL="${LOCAL}/.manifest.sha256"
MANIFEST_S3="s3://${BUCKET}/${DAY}/.manifest.sha256"
# 1. Build local manifest of today's files
( cd "$LOCAL" && find . -type f ! -name '.manifest.*' \
-print0 | xargs -0 sha256sum > "$MANIFEST_LOCAL" )
# 2. Pull previous manifest from S3 (one GET, a few KB)
aws s3 cp "$MANIFEST_S3" /tmp/old-manifest 2>/dev/null || : > /tmp/old-manifest
# 3. Upload only the files whose hash changed
comm -23 <(sort "$MANIFEST_LOCAL") <(sort /tmp/old-manifest) \
| awk '{ print $2 }' \
| while read -r f; do
aws s3 cp "${LOCAL}/${f}" "s3://${BUCKET}/${DAY}/${f}"
done
# 4. Publish the new manifest
aws s3 cp "$MANIFEST_LOCAL" "$MANIFEST_S3"
Four moving parts. The manifest is the load-bearing one. The script computes SHA-256 locally, pulls the previous manifest from S3 (one GET, a few kilobytes), diffs them, and only copies the files whose hash actually changed. On a typical night, about 5% of files changed. Egress on the verification step was the size of one small JSON file, never the data.
There was a comment at the top, in Dutch, that the original author wrote in 2024: # NIET vervangen door 'aws s3 sync'. Geprobeerd. Doet HEAD op elk object. Roughly: "Do not replace with aws s3 sync. We tried. It does HEAD on every object."
That comment got deleted in the refactor.
The refactor
The on-prem engineer asked an AI assistant to "clean this up, add error handling, make the steps clearer". The assistant produced a 110-line version. It was, by every readability metric, better. Functions had names. The manifest logic got pulled into build_manifest and fetch_previous_manifest. There was a verify_upload step at the end that the original lacked.
Here is what verify_upload looked like:
verify_upload() {
local day="$1"
local bucket="$2"
local tmp
tmp="$(mktemp -d)"
trap 'rm -rf "$tmp"' RETURN
# Pull each object back, recompute the hash, compare to the local manifest.
aws s3 sync "s3://${bucket}/${day}/" "${tmp}/" --only-show-errors
( cd "$tmp" && sha256sum -c "${LOCAL}/.manifest.sha256" )
}
Read it twice. The intent is honest: pull what we just uploaded, recompute the hash, confirm it matches. Belt and braces.
What it actually does: at the end of every nightly run, download the entire day's archive (800 GB) from S3 to a temp directory, verify checksums locally, then delete it. Every night. Every byte we had just pushed up came right back down again.
At AWS's standard internet egress rate for eu-west-1, 800 GB a night runs into several thousand euros a month in transfer alone. The original was billing roughly forty euros a month for the same workload, because the manifest comparison happened locally and S3 only ever served one small JSON file. The full S3 pricing page spells it out: bytes leaving the bucket to the internet are billed; bytes that stay inside the region are not.
Inside the diff
The PR description was lifted verbatim from the assistant's reply:
Cleaned up archive-sync.sh: extracted manifest logic into named functions, added structured error handling, added post-upload verification. No behaviour change.
"No behaviour change" is the phrase you watch for. The function bodies were faithful translations. The error handling was real. The verification step was new, and the reviewer (an in-house ops lead, not an engineer by trade) approved it because verification is good.
The bug was not in any single line. The bug was that verification had been deliberately omitted from the original because the manifest comparison made it unnecessary, and re-introducing it with aws s3 sync in the wrong direction quietly inverted the script's data flow. That intent lived in a one-line Dutch comment and in nobody's head. The assistant cannot read what is not in the file. The reviewer did not catch it because the diff added code, and added code feels safer than removed code.
This pattern, an AI-generated refactor that re-introduces a bug a previous engineer explicitly designed out, is the shape behind the wider conversation that has been running on Hacker News this past month about rsync regressions. Well-intentioned changes that quietly violate invariants the original code held implicit.
The four hours
13:48. AWS Budget Actions tripped, posting to #alerts.
14:02. First human eyes (ops on-call).
14:09. Cost Explorer showed the spike isolated to one bucket. Egress, not storage, not requests.
14:15. VPC flow logs ruled out (no EC2 in eu-west-1 for this bucket).
14:38. CloudTrail data events showed GetObject calls from the Groningen office IP, in long bursts that aligned with 02:00 to 04:30 Amsterdam time, every night for a week.
15:02. Pulled archive-sync.sh from the box. Read it. Spotted verify_upload.
15:14. Confirmed by running the wrapper in dry-run mode and watching the planned downloads.
15:30. Hotfix: removed verify_upload, redeployed.
17:50. AWS Support case opened, not for refund, for clarity on the billing window.
Four hours. The reason it took four hours was that the symptom (egress) and the cause (a verification step calling sync in the wrong direction) were two files apart in our heads. We looked at network configs, CloudFront behaviour, and IAM policies before we looked at the script the change had actually landed in. Always read the recently-changed script first. We knew this. We did not do it.
The diff we should have caught
In hindsight the giveaway was a single direction reversal. The original moved bytes one way, local to S3. The refactor added a function that moved bytes the other way, S3 to local. In a script whose entire purpose is one-directional, any line that moves data the opposite way is suspicious.
Three review heuristics we now apply to any refactor of an I/O script:
- Find every bytes-moving call.
rsync,scp,aws s3 cp,aws s3 sync,curl,wget. Note source and destination for each. - Compare the source/dest table to the original. Any new row, any swapped source and destination, gets a written justification in the PR body.
- Look for deleted comments. Comments that say "DO NOT" or "we tried this" describe constraints the code does not. If an AI refactor removed one, the diff is not done.
These are unglamorous. They take maybe ninety seconds per script. They would have caught this one inside the eleven-minute window the PR actually had.
Our review fixes
We did three things, none of them about banning AI refactors. We use AI assistants for refactors every day, and we will keep doing it. We changed how we review them.
First, every refactor PR that touches a script in a cron or systemd timer path now requires a "data-flow diff": a side-by-side table of every external call, with arrows. We generate it with a small wrapper around git diff and the assistant itself, and it lands directly in the PR description. Thirty seconds, every time.
Second, we added a CI step that diffs the previous run's CloudWatch egress metric against the new one for the affected bucket. If a refactor lands, the next night's run is watched. If egress jumps more than 1.5x, the wrapper is rolled back automatically. This is the brace; the data-flow review is the belt. AWS already publishes the server-side integrity primitives you actually want for verification in the object integrity checksums docs: PutObject with --checksum-algorithm SHA256 means S3 computes the hash on receipt and you compare against your local one without ever pulling the bytes back.
Third, we lifted every "DO NOT" comment in the repo into a top-level INVARIANTS.md referenced from CODEOWNERS. Refactors that touch a line near one of those comments now require a second reviewer. The comments live in two places, which is annoying, and is the price of having an AI in the loop that does not know what we tried before.
The dangerous AI refactor is not the one that breaks the build. It is the one that passes review, runs cleanly, and quietly violates an invariant the original code held implicit. Catch it with data-flow diffs, not with vibes.
The smaller lesson
S3 egress is one of those AWS costs that does not show up in the architecture review and shows up loudly in the bill. Every GB out of an eu-west-1 bucket to the public internet is billed; every GB that stays inside the region is not. Verification scripts that pull data back across the internet are a cost trap. Verification scripts that ask S3 to compute the hash server-side are not.
The broader pattern is the one the open-source crowd has been arguing about for the last fortnight. AI-driven refactors land at the speed of typing, and the review process for them is still mostly the review process we built for human-authored diffs. That older process assumes a human wrote the change and held the implicit constraints in their head. The assistant does not. The constraints have to be written down, and the reviews have to look for the constraints that were deleted, not just the lines that were added.
When we build AI agents and automation for clients running cost-sensitive infrastructure across the Netherlands and Thailand, this is the failure mode we plan for first. The thing we ran into on this Groningen archive was that the assistant cannot see what is not in the file. We ended up solving it by writing the invariants down twice, once as comments and once as CI checks that fire on the right metric.
Five-minute audit you can run right now: open the cron jobs on your most critical box and find the script you have not touched in six months. Read every command that moves bytes. Note the direction. If a recent refactor reversed any of them without a written reason, you already know what to do.
Key takeaway
The dangerous AI refactor is the one that passes review, runs cleanly, and quietly violates an invariant the original code held implicit.
FAQ
Was this the AI assistant's fault?
No. The assistant produced clean, well-named code. The bug was in our review process. We approved a diff that added a function whose data direction contradicted the script's whole purpose.
How do we audit our own cron scripts for this pattern?
Open every cron job on your most critical box. Read each line that moves bytes (rsync, aws s3 cp, curl). Note source and destination. Any recent refactor that reversed a direction without a written reason is the same shape.
Why didn't CloudWatch catch it on night one?
It did. The budget alarm fired the morning after the first inflated night. Nobody noticed until the third compounding day. Route AWS Budget Actions alerts to a Slack channel humans actually read.
What is the right way to verify an S3 upload?
Use S3's server-side checksums. Send PutObject with --checksum-algorithm SHA256, store the returned checksum, and compare against your local hash. No bytes ever leave the bucket to confirm integrity.