The Alert Flood
The SRE team at this major European telco was receiving 112 alerts per day. Not 112 unique problems: 112 notifications, the overwhelming majority of which were duplicates, low-severity noise, or events that had already auto-resolved by the time an engineer looked at them. The team had developed a kind of learned helplessness: alerts were paged, acknowledged, and ignored.
The real danger wasn't the volume; it was what was buried inside it. Genuine incidents were arriving in the same feed as disk usage warnings and pod restarts, with no consistent way to tell them apart at a glance. MTTR had climbed to an average of 4.2 hours because engineers had to manually triage the queue before they could begin investigating anything. Three high-severity incidents in a single quarter had been missed entirely until downstream systems began to fail.
"We weren't ignoring alerts because we were lazy. We were ignoring them because acting on every one would have taken our entire shift. The signal was completely lost in the noise."
SRE Lead (anonymised)
Our Approach: Context-Aware Event Correlation
The standard answer to alert fatigue is to raise thresholds and add more suppression rules. We took a different position: the problem wasn't the alerting rules but the absence of context. Individual alerts fired in isolation, with no awareness of whether a related alert had fired 30 seconds earlier, or whether this same pattern had been a false positive in 47 of the last 50 occurrences.
Our AIOps implementation introduced ML-based event correlation: instead of treating each alert as independent, the model grouped related alerts into a single incident based on topology, timing, and historical co-occurrence patterns. A single disk warning and three pod restarts on the same node became one correlated incident, not three separate pages.
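The grouping idea can be sketched in a few lines. This is an illustrative simplification, not the client's model: it correlates only on shared topology (here, a node name) and a fixed time window, and omits the historical co-occurrence signal; all names (`Alert`, `correlate`, the 120-second window) are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Alert:
    name: str
    node: str         # topology key: alerts on the same node may be related
    timestamp: float  # seconds since epoch

def correlate(alerts: list[Alert], window: float = 120.0) -> list[list[Alert]]:
    """Group alerts that share a node and arrive within `window` seconds
    of the group's most recent alert."""
    incidents: list[list[Alert]] = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        for group in incidents:
            if (group[0].node == alert.node
                    and alert.timestamp - group[-1].timestamp <= window):
                group.append(alert)
                break
        else:
            incidents.append([alert])  # no related group: start a new incident
    return incidents

# A disk warning plus three pod restarts on node-7 collapse into one incident.
alerts = [
    Alert("DiskUsageHigh", "node-7", 0.0),
    Alert("PodRestart", "node-7", 15.0),
    Alert("PodRestart", "node-7", 30.0),
    Alert("PodRestart", "node-7", 45.0),
    Alert("CertExpiry", "node-2", 50.0),
]
groups = correlate(alerts)
```

With this input the four node-7 alerts become one correlated incident and the unrelated node-2 alert stays separate, mirroring the "one incident, not three pages" behaviour described above.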
Training on Your Own History
The model was trained on 12 months of historical incident data, not a generic telecom dataset, but this team's specific history of what they'd investigated, how they'd resolved it, and what had turned out to be noise. This made the suppression decisions interpretable: when the model suppressed an alert, it could cite the specific historical pattern it was matching against.
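A hedged sketch of what an interpretable suppression decision can look like: the `Pattern` record, thresholds, and signature format below are assumptions for illustration, not the production schema, but they show how a decision can cite the specific historical pattern it matched.

```python
from dataclasses import dataclass

@dataclass
class Pattern:
    """One historical alert pattern and how often it turned out to be noise."""
    signature: str
    occurrences: int
    false_positives: int

def suppress(alert_signature: str, history: list[Pattern],
             fp_threshold: float = 0.9, min_samples: int = 20) -> tuple[bool, str]:
    """Suppress only when a well-sampled pattern was almost always noise,
    and return the cited pattern so the decision stays interpretable."""
    for p in history:
        if p.signature != alert_signature or p.occurrences < min_samples:
            continue
        if p.false_positives / p.occurrences >= fp_threshold:
            return True, (f"matched '{p.signature}': noise in "
                          f"{p.false_positives} of last {p.occurrences} occurrences")
    return False, "no sufficiently strong historical pattern"

# 47 of the last 50 occurrences were noise, so this alert is suppressed
# with an explicit citation of the pattern behind the decision.
history = [Pattern("DiskUsageHigh/node-7", 50, 47)]
decision, reason = suppress("DiskUsageHigh/node-7", history)
```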
The alert suppression rules were complemented by a runbook library: for the incidents that did surface, automated runbooks handled the first-response steps. Certificate expiry reminders triggered automatic renewal workflows. Pod restart loops triggered drain-and-reschedule. The human SRE was notified only when the automated response either failed or the incident exceeded a predefined severity threshold.
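That dispatch policy, in which automation handles first response and a human is paged only on high severity, a missing runbook, or a failed runbook, can be sketched as follows. All names are hypothetical, and severity is encoded here as a P-level where lower means more severe.

```python
def handle_incident(incident: dict, runbooks: dict, page_human) -> str:
    """Run the matching runbook; escalate to a human on high severity,
    a missing runbook, or runbook failure."""
    if incident["severity"] <= 2:                # P1/P2: page a human immediately
        page_human(incident, "high severity")
        return "escalated"
    runbook = runbooks.get(incident["type"])
    if runbook is None:
        page_human(incident, "no runbook for this alert type")
        return "escalated"
    try:
        runbook(incident)                        # automated first response
        return "auto-resolved"
    except Exception as exc:                     # failed automation reaches a human
        page_human(incident, f"runbook failed: {exc}")
        return "escalated"

pages = []
runbooks = {"cert-expiry": lambda inc: None}     # stands in for a renewal workflow
outcome = handle_incident(
    {"type": "cert-expiry", "severity": 4},
    runbooks,
    lambda inc, why: pages.append(why),
)
```

In this toy run the certificate-expiry incident is auto-resolved and nobody is paged; a P1 incident would bypass the runbooks and page immediately.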
Building Trust Before Cutting Over
We ran the AIOps layer in shadow mode for six weeks before it touched any live alerting. The model produced suppression decisions in parallel with the existing system, and the SRE team could see what it would have suppressed and verify whether those decisions were correct. Disagreements were fed back as training examples.
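Shadow mode reduces to a comparison loop that never pages anyone: record what the model would have done, compare it with the engineers' verdicts, and collect disagreements for retraining. The function names and toy labels below are illustrative only.

```python
def shadow_evaluate(alerts, model_would_suppress, was_noise):
    """Compare the model's would-be suppressions against engineer verdicts
    without touching live alerting; disagreements become training examples."""
    agree, disagreements = 0, []
    for alert in alerts:
        predicted = model_would_suppress(alert)  # what the model would do
        actual = was_noise(alert)                # what the engineers verified
        if predicted == actual:
            agree += 1
        else:
            disagreements.append((alert, predicted, actual))
    return agree / len(alerts), disagreements

# Toy run: a naive model suppresses anything labelled "restart".
alerts = ["restart-1", "restart-2", "disk-full", "cert-expiry"]
accuracy, disagreements = shadow_evaluate(
    alerts,
    model_would_suppress=lambda a: a.startswith("restart"),
    was_noise=lambda a: a != "disk-full",
)
```

Here the model agrees with the engineers on three of four alerts; the missed suppression of the noisy certificate alert is captured as a disagreement to feed back into training.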
This shadow period served two purposes. First, it let us tune the model against live traffic rather than historical data. Second, and more importantly, it built enough operational trust that the team was willing to hand over the suppression decisions entirely. SRE teams tend to be sceptical of automated systems that might suppress a real incident; the shadow period gave them the evidence they needed.
The runbook library was built incrementally alongside the shadow period. We started with the ten most common auto-resolvable alert types and expanded from there. By the time we went live, 34 automated runbooks covered 68% of the alert types that had previously required manual first response.
Results After 90 Days
Three months after the system went fully live:
- Daily actionable alerts: 112 → 8
- MTTR: 4.2 hours → 38 minutes
- Security incidents: 79% reduction
- Alert noise suppressed: 90%
- SRE on-call load: halved
- Missed P1 incidents: 0 in the following quarter
"For the first time in years, being on-call doesn't feel like a punishment. When an alert comes in now, we know it's real. That changes everything about how the team operates."
Head of SRE (anonymised)
The Wider Lesson
Alert fatigue is not an operations problem. It's a platform design problem. If your alerting system treats every event as equally worthy of human attention, you've designed a system that will eventually be ignored. The fix isn't to page people louder; it's to build a platform that does the first-level triage automatically and only escalates what genuinely requires human judgment.
The teams who benefit most from AIOps are not the ones with the most sophisticated tooling; they're the ones willing to invest in the historical data and the shadow-mode validation period that lets the model earn trust before it acts.