PagerDuty opened 312 alerts for a single API latency incident. Alertmanager could not group them effectively, because every distinct combination of pod and path labels produced its own alert instance.
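To make the label problem concrete, here is a minimal sketch of how grouping keys determine the number of notifications. It is a plain-Python model of the idea behind Alertmanager's `group_by` setting, not Alertmanager's implementation. The alert name, service name, and the split into 26 pods and 12 paths are assumptions chosen only so the product is 312; the source does not give that breakdown.

```python
from collections import defaultdict
from itertools import product


def group_alerts(alerts, group_by):
    """Bucket alerts by the values of the labels listed in group_by."""
    groups = defaultdict(list)
    for labels in alerts:
        key = tuple(labels.get(name, "") for name in group_by)
        groups[key].append(labels)
    return dict(groups)


# Hypothetical incident: 26 pods x 12 paths = 312 firing alert instances.
pods = [f"api-{i}" for i in range(26)]
paths = [f"/v1/resource{j}" for j in range(12)]
alerts = [
    {"alertname": "APILatencyHigh", "service": "api", "pod": pod, "path": path}
    for pod, path in product(pods, paths)
]

# Grouping on high-cardinality labels: one group per pod/path combination.
per_instance = group_alerts(alerts, ["alertname", "service", "pod", "path"])

# Grouping on the labels that identify the incident: a single group.
per_service = group_alerts(alerts, ["alertname", "service"])

print(len(per_instance), len(per_service))  # prints "312 1"
```

In this model, dropping `pod` and `path` from the grouping key collapses the incident into one notification, while the per-pod and per-path detail remains available inside the group.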
Teams often page on CPU, memory, restarts, and queue depth because those metrics are easy to query. The result is alert fatigue: many pages describe possible causes without confirming user impact, so during a real incident the on-call engineer must first separate noise from the customer-visible failure. Teams that do not tie pages to user impact typically discover the problem only during a production incident, when fixing it reactively costs far more than preventing it would have (figures of 10-100x are commonly quoted). The risk grows with scale: when multiple teams depend on the same infrastructure, one team's shortcut becomes another team's outage, and the organization ends up firefighting instead of building. The pattern is predictable. The alerting works fine in development, survives staging, and fails in production under real traffic and real failure conditions.
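One way to express the distinction between cause signals and user impact is a routing rule that pages only when a symptom signal is breached. The sketch below is an illustration under stated assumptions, not a prescribed policy: the signal names, the `Signal` type, the thresholds, and the `route` helper are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical signal names, chosen for illustration only.
SYMPTOM_SIGNALS = {"error_rate", "p99_latency"}               # user-visible impact
CAUSE_SIGNALS = {"cpu", "memory", "restarts", "queue_depth"}  # possible causes


@dataclass(frozen=True)
class Signal:
    name: str
    value: float
    threshold: float

    @property
    def breached(self) -> bool:
        return self.value > self.threshold


def route(signals):
    """Return 'page', 'ticket', or 'none' for a collection of signals."""
    breached = [s for s in signals if s.breached]
    # Page a human only when user impact is confirmed.
    if any(s.name in SYMPTOM_SIGNALS for s in breached):
        return "page"
    # A cause-only breach is worth recording, but not worth waking someone.
    if any(s.name in CAUSE_SIGNALS for s in breached):
        return "ticket"
    return "none"


cpu_hot = Signal("cpu", value=0.95, threshold=0.90)
errors_high = Signal("error_rate", value=0.05, threshold=0.01)

print(route([cpu_hot]))               # prints "ticket"
print(route([cpu_hot, errors_high]))  # prints "page"
```

With this rule, high CPU alone produces a ticket, and the same CPU reading becomes useful context once an error-rate breach confirms that users are affected.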