A traditional software incident has a certain shape. Something breaks, there is an error, the error points to a line of code, someone fixes it. The causality is usually traceable, the fix is usually deterministic, and once the issue is resolved it tends to stay resolved. AI production incidents do not have this shape.
Consider what an incident looks like when a language model starts giving bad answers. Maybe the answers were always bad for a subset of inputs and you just noticed. Maybe a recent prompt change shifted behavior in a way nobody anticipated. Maybe the model was updated by the API provider and the update changed something subtle about how it handles a particular pattern of inputs. Maybe the retrieval system upstream degraded and the model is now generating from incomplete context. Maybe all of these are happening simultaneously and each is contributing a fraction of the observed degradation.
None of these root causes produce a stack trace. You have outputs that are wrong or degraded, and you have to work backwards from samples of bad outputs to figure out what changed and where. This is less like debugging software and more like doing forensics on a system that has no logs of its internal state, only logs of its inputs and outputs.
The operational practices that help are not particularly exotic, but they require advance planning. First, you need baselines. If you do not know what normal looks like across the distribution of queries your system handles, you cannot detect when normal changes. This means sampling production outputs regularly, evaluating them against quality criteria, and storing those evaluations over time so you can compare. Second, you need attribution capability: the ability to trace a specific bad output back to its components. Which retrieved documents were included? What was the exact prompt? What model version was used? Storing this information for a sample of production calls is essential for incident investigation. Third, you need reproducibility: the ability to replay a production call in a test environment to verify that a proposed fix actually changes the behavior.
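To make the attribution idea concrete, here is a minimal sketch of what a stored record for a sampled production call might look like. Everything in it is an assumption for illustration: the `CallRecord` fields, the `sample_record` helper, and the JSON-lines format are hypothetical choices, not a standard schema.

```python
import hashlib
import json
import random
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass
class CallRecord:
    """Hypothetical record of one production call, kept for later forensics."""
    request_id: str
    model_version: str            # the exact version string sent to the provider
    prompt: str                   # the fully rendered prompt, not just the template
    retrieved_doc_ids: List[str]  # which documents ended up in the context
    output: str

    def prompt_sha256(self) -> str:
        # A hash makes it cheap to group records that used an identical prompt.
        return hashlib.sha256(self.prompt.encode("utf-8")).hexdigest()


def sample_record(record: CallRecord, sample_rate: float, rng=random) -> Optional[str]:
    """Return a JSON line for roughly `sample_rate` of calls, otherwise None."""
    if rng.random() >= sample_rate:
        return None
    row = asdict(record)
    row["prompt_sha256"] = record.prompt_sha256()
    return json.dumps(row)


record = CallRecord(
    request_id="req-001",
    model_version="example-model-2024-01-01",  # placeholder, not a real model name
    prompt="Answer using the context.\n\nContext: ...\n\nQuestion: ...",
    retrieved_doc_ids=["doc-17", "doc-42"],
    output="(model output here)",
)
line = sample_record(record, sample_rate=1.0)
```

Because the record holds the rendered prompt, the model version, and the retrieved document IDs together, the same record can also serve as the input for a replay in a test environment.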
The hardest part of AI operations is that the failures are probabilistic. A model does not always give a bad answer on a bad input: it might give a bad answer 20% of the time and a good answer 80% of the time. A fix that cuts the bad-answer rate from 20% to 8% is a genuine improvement, but it does not produce a clean "before / after" test case, because any single replay can still pass or fail by chance. You need sample sizes large enough to tell the two rates apart, which means evaluation cycles are longer than you would want them to be.
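For a rough sense of how large "large enough" is, the sketch below uses the textbook normal-approximation formula for comparing two proportions. The function name `samples_per_arm` and the defaults (two-sided 5% significance, 80% power) are assumptions chosen for illustration; a real evaluation may call for different thresholds or a different test.

```python
import math
from statistics import NormalDist


def samples_per_arm(p_before: float, p_after: float,
                    alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate samples needed in EACH arm (before and after the fix)
    to detect a change in bad-answer rate with a two-sided z-test."""
    if p_before == p_after:
        raise ValueError("rates must differ to compute a sample size")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for significance
    z_beta = NormalDist().inv_cdf(power)           # critical value for power
    p_bar = (p_before + p_after) / 2               # pooled rate under "no change"
    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(p_before * (1 - p_before) + p_after * (1 - p_after))
    ) ** 2
    return math.ceil(numerator / (p_before - p_after) ** 2)


n_large_effect = samples_per_arm(0.20, 0.08)
n_small_effect = samples_per_arm(0.20, 0.18)
```

Under these assumptions, the 20% to 8% change needs on the order of 130 evaluated samples per arm, while a smaller change such as 20% to 18% needs several thousand. The required sample grows roughly with the inverse square of the difference, which is why subtle regressions take so long to confirm.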
Teams that have been running AI products for a couple of years have developed practices that work. The common threads: invest heavily in evaluation infrastructure before you need it for incident response, keep model versions and prompt versions pinned and change them deliberately rather than letting providers roll updates silently, and build a feedback loop from user complaints to production samples to evaluation. None of this is glamorous. It is the boring operational work that makes AI products reliable rather than just impressive.
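One way to make pinning tangible is to treat the model version and prompt as a single immutable release whose fingerprint is logged with every call. The sketch below is only an illustration under that assumption: `ReleaseConfig`, its fields, and the 12-character fingerprint are hypothetical, and the version strings are placeholders.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)  # frozen: a release cannot be mutated after creation
class ReleaseConfig:
    model_version: str    # an explicit dated version, not a floating alias
    prompt_version: str
    prompt_template: str

    def fingerprint(self) -> str:
        # Any change to model or prompt yields a different fingerprint,
        # so a behavior shift can be lined up against a specific release.
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()[:12]


current = ReleaseConfig(
    model_version="example-model-2024-01-01",
    prompt_version="v7",
    prompt_template="Answer using only the context.\n\n{context}\n\n{question}",
)
candidate = ReleaseConfig(
    model_version="example-model-2024-01-01",
    prompt_version="v8",
    prompt_template="Answer concisely using only the context.\n\n{context}\n\n{question}",
)
changed = current.fingerprint() != candidate.fingerprint()
```

With a fingerprint attached to each sampled call, a user complaint can be traced to the exact release that produced it, and evaluation results can be grouped by release rather than by date.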