Overview
Implemented an observability stack to monitor services, track SLOs, and send alerts. The work covered instrumentation, dashboards, alert routing, and runbooks.
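To make the SLO tracking concrete, here is a minimal sketch of an error-budget calculation. It assumes a request-based availability SLO. The function name and the example numbers are hypothetical and are not taken from the actual stack.

```python
def error_budget_remaining(slo_target: float, total: int, failed: int) -> float:
    """Return the fraction of the error budget still unspent.

    Hypothetical helper for illustration: 1.0 means no budget used,
    0.0 means fully spent, and a negative value means the SLO is violated.
    """
    if total == 0:
        return 1.0  # no traffic, so nothing has been spent
    allowed_failures = (1.0 - slo_target) * total
    if allowed_failures == 0:
        # A 100% target leaves no budget at all.
        return 1.0 if failed == 0 else float("-inf")
    return 1.0 - failed / allowed_failures


if __name__ == "__main__":
    # Assumed example: 99% target, 1000 requests, 5 failures.
    print(error_budget_remaining(0.99, 1000, 5))
```

With a 99% target over 1000 requests, about 10 failures are allowed, so 5 failures leave roughly half of the budget.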
Features
- Custom dashboards for latency and error rates
- Alert routing with escalation policies
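Alert routing with escalation policies can be sketched as a lookup over time-ordered steps. This is an illustration under assumptions, not the routing logic actually deployed: the `EscalationStep` type, the target names, and the minute thresholds are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class EscalationStep:
    """Hypothetical policy step: notify `target` once an alert has been
    unacknowledged for at least `after_minutes`."""
    after_minutes: int
    target: str


def current_target(policy: Sequence[EscalationStep],
                   minutes_unacked: int) -> Optional[str]:
    """Return the target of the latest step whose threshold has passed."""
    target = None
    for step in sorted(policy, key=lambda s: s.after_minutes):
        if minutes_unacked >= step.after_minutes:
            target = step.target
    return target


# Assumed example policy; names and thresholds are placeholders.
POLICY = [
    EscalationStep(0, "primary-oncall"),
    EscalationStep(15, "secondary-oncall"),
    EscalationStep(30, "engineering-manager"),
]

if __name__ == "__main__":
    print(current_target(POLICY, 20))
```

Keeping the policy as plain data makes it easy to review and to test each threshold without sending a real page.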
Challenges
- Managing cardinality in metrics
- Tuning alerts to reduce alert fatigue
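One common way to keep metric cardinality in check is to normalize unbounded label values before they are recorded. The sketch below assumes the problem label is a URL path containing numeric or UUID identifiers. That assumption, and the helper names, are illustrative and not taken from the original work.

```python
import re
from typing import Iterable

# Matches a path segment that is all digits or a UUID (assumed ID formats).
_ID_SEGMENT = re.compile(
    r"/(\d+|[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}"
    r"-[0-9a-fA-F]{4}-[0-9a-fA-F]{12})(?=/|$)"
)


def normalize_path(path: str) -> str:
    """Replace ID-like segments with a fixed placeholder so that
    /users/1 and /users/2 share one label value."""
    return _ID_SEGMENT.sub("/:id", path)


def label_cardinality(paths: Iterable[str]) -> int:
    """Count distinct label values after normalization."""
    return len({normalize_path(p) for p in paths})


if __name__ == "__main__":
    raw = ["/users/1/orders/10", "/users/2/orders/11", "/health"]
    print(label_cardinality(raw))
```

Here three raw paths collapse to two label values, so the number of time series grows with the number of routes and not with the number of users.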