AI System Design Newsletter

AI System Design Newsletter

Updated : How a Date Format Change Broke a Production AI Model Without Any Errors

Jun 03, 2026
∙ Paid

The model that got worse in January and nobody noticed until March

A financial services company ran a credit risk assessment model in production. In January, one of their data vendors quietly changed the format of a field in their feed — a date field switched from YYYY-MM-DD to MM/DD/YYYY. The parsing code didn’t crash. It silently produced wrong values. The model consumed those wrong values and produced wrong risk scores. The wrong scores were used to make lending decisions for eight weeks before a data analyst noticed that default rates on loans approved in January were running 2.3 standard deviations above the historical baseline.

Eight weeks. Hundreds of lending decisions. Real money lent to real people at the wrong risk assessment.

The system had uptime monitoring. It had latency monitoring. It had error rate monitoring. None of those metrics moved when the date format changed, because the model never crashed and the pipeline never errored. The thing that changed was the quality of the output — and nobody was measuring that.

This is the core problem of AI observability. The system can be fully operational and completely wrong at the same time. Traditional infrastructure monitoring can’t see this. You need something different.


Why uptime and latency tell you almost nothing about whether your AI is working

Infrastructure monitoring was built for a world where “working” meant “responding.” A web server that returns 200 status codes is working. A database that completes queries is working. The system’s correctness is implicit in its operation — if it runs without errors, it’s doing the right thing.

AI systems break this assumption completely. A language model or ML pipeline can respond instantly, return a valid JSON object, complete without any errors, and be completely wrong. The output is syntactically valid but semantically broken. Traditional monitoring has no category for this.

Think of a weather forecast service. Uptime monitoring tells you the forecast API is responding. Latency monitoring tells you it’s responding in 80 milliseconds. Error rate monitoring tells you it’s never returning a 500. None of those metrics tell you the forecast is predicting 75°F on days when it snows. The system is operational. The output is wrong.

Your AI pipeline is that forecast service. The question “is it working?” has two completely separate answers: “is it running?” (infrastructure question) and “is it producing correct outputs?” (quality question). Infrastructure monitoring answers the first. AI observability answers the second. You need both, and most teams only have the first.

The tricky part: AI output quality is not binary. A response isn’t simply correct or incorrect. It’s on a spectrum from excellent to subtly wrong to dangerously misleading. A risk model that’s 15% too optimistic doesn’t fail — it succeeds incorrectly, consistently, at scale. The January incident produced eight weeks of subtly wrong outputs before the pattern was visible in outcome data.

Here’s the key thing: the gap between “the system is running” and “the system is working correctly” is exactly the space that AI observability fills — and filling it requires measuring the quality of outputs, not just the availability of the system.


The three layers of AI observability — what each one catches


User's avatar

Continue reading this post for free, courtesy of AI Engineering.

Or purchase a paid subscription.
© 2026 AI Engineering · Privacy ∙ Terms ∙ Collection notice
Start your SubstackGet the app
Substack is the home for great culture