Four Golden Signals of Monitoring
What are the four golden signals of monitoring and why are they important?
The Four Golden Signals, from Google's Site Reliability Engineering (SRE) book, are latency (how long requests take to serve), traffic (demand on the system, such as requests per second), errors (the rate of failed requests), and saturation (how full the service is, that is, how close its most constrained resource is to capacity). Together they indicate service health at a glance and help narrow down the cause of an incident. They are a reasonable minimum set of metrics for any user-facing service.
Each signal answers a different question: latency reflects user experience, traffic reflects demand, errors reflect reliability, and saturation reflects remaining capacity. Start with these four before adding more specific metrics.
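As a rough illustration of how the four signals relate to raw request data, the sketch below summarizes one observation window in Python. The `Request` record, the `golden_signals` function, the rule that a 5xx status counts as an error, and the use of in-flight requests over a concurrency limit as a saturation proxy are all assumptions made for this example; a real service would normally export these values through its metrics library instead.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """One served request (hypothetical record shape, for illustration)."""
    duration_s: float  # time taken to serve the request, in seconds
    status: int        # HTTP status code returned


def golden_signals(requests, window_s, in_flight, max_in_flight):
    """Summarize the four golden signals over one observation window.

    Assumptions: a 5xx status counts as an error, and saturation is
    approximated as in-flight requests over a fixed concurrency limit.
    """
    total = len(requests)
    ok = [r.duration_s for r in requests if r.status < 500]
    failed = [r.duration_s for r in requests if r.status >= 500]

    def mean(values):
        return sum(values) / len(values) if values else None

    return {
        # Latency, split by outcome so fast failures do not mask slow successes.
        "latency_ok_mean_s": mean(ok),
        "latency_failed_mean_s": mean(failed),
        # Traffic: demand, expressed as requests per second.
        "traffic_rps": total / window_s,
        # Errors: fraction of requests that failed.
        "error_rate": len(failed) / total if total else 0.0,
        # Saturation: how full the service is, from 0.0 to 1.0.
        "saturation": in_flight / max_in_flight,
    }


# Hypothetical two-second window: three successes and one fast failure.
window = [Request(0.120, 200), Request(0.080, 200),
          Request(0.100, 200), Request(0.004, 503)]
signals = golden_signals(window, window_s=2.0, in_flight=30, max_in_flight=40)
```

Latency is reported separately for successful and failed requests because an error that returns in a few milliseconds would otherwise pull the combined average down and make the service look faster than it is.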
Prometheus alerting rules
Common mistakes
- Monitoring only one signal (for example, just CPU usage)
- Setting alert thresholds too tight (causing alert fatigue) or too loose (missing real issues)
- Reporting only average latency instead of measuring it at several percentiles (p50, p95, p99)
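To make the percentile point concrete, here is a minimal Python sketch that compares the mean with p50, p95, and p99 on made-up data. The `percentile` helper uses the nearest-rank definition, which is only one of several percentile conventions, and the latency values are hypothetical. Production systems typically derive percentiles from histograms in the monitoring backend.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


# Hypothetical latencies in milliseconds: 95 fast requests and 5 slow ones.
latencies_ms = [100] * 95 + [2000] * 5

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)
```

With this data the mean is 195 ms, p50 and p95 are both 100 ms, and p99 is 2000 ms. The slow tail that affects one request in twenty shows up only at the highest percentile.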
Follow-up questions
- How do you distinguish between latency for successful vs failed requests?
- What's the difference between metrics, logs, and traces?
- How do you set appropriate alert thresholds?