Production Monitoring: Metrics You Cannot Ignore
Do you learn about your site going down from an angry customer tweet? Congratulations: either you have no monitoring, or it's configured so poorly that it's useless. Let's fix that.
Monitoring Dashboard: What Should Be on Screen
📊 Production Dashboard
Metric #1: Latency (Response Delay)
What we measure: Time from request to response. Track percentiles (p95, p99), not the average.
Why it matters: An average can look healthy while the tail is not. If 95% of requests take 50 ms and 5% take 5 seconds, the average is still only about 300 ms.
Alert threshold: p95 > 500ms, p99 > 2s.
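To make the average-versus-percentile point concrete, here is a minimal sketch that computes p95 and p99 with the nearest-rank method and checks them against the thresholds above. The sample data and the names `percentile`, `P95_LIMIT_MS` and `P99_LIMIT_MS` are illustrative assumptions, not part of any particular monitoring tool.

```python
import math
from statistics import mean


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample that has at least
    pct percent of all samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


# Hypothetical data: 98 fast requests at 50 ms and 2 slow ones at 5000 ms
latencies_ms = [50] * 98 + [5000] * 2

avg = mean(latencies_ms)
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)

# Thresholds from the text: p95 > 500 ms, p99 > 2 s
P95_LIMIT_MS = 500
P99_LIMIT_MS = 2000
alert = p95 > P95_LIMIT_MS or p99 > P99_LIMIT_MS

print(f"avg={avg} ms, p95={p95} ms, p99={p99} ms, alert={alert}")
```

In this made-up sample the average and p95 both look fine, and only p99 exposes the two requests that took 5 seconds. That is one reason to alert on more than one percentile.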
Metric #2: Error Rate
What we measure: Ratio of HTTP 5xx errors to total requests.
Why it matters: A 1% error rate at 10,000 RPS means 100 failed requests every second (6,000 per minute), each one showing a user "Internal Server Error".
Alert threshold: > 0.5% over 5 minutes.
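A rate "over 5 minutes" implies a sliding window. The sketch below is one simple way to keep such a window in memory, assuming each request is reported with a timestamp in seconds and an HTTP status code. The class name `ErrorRateWindow` and its methods are hypothetical; in practice your metrics system would normally do this aggregation for you.

```python
from collections import deque


class ErrorRateWindow:
    """Tracks the share of 5xx responses over a sliding time window."""

    def __init__(self, window_seconds=300):
        self.window_seconds = window_seconds
        self.events = deque()  # (timestamp, is_error) pairs, oldest first
        self.errors = 0

    def record(self, timestamp, status_code):
        is_error = 500 <= status_code <= 599
        self.events.append((timestamp, is_error))
        self.errors += int(is_error)
        self._evict(timestamp)

    def _evict(self, now):
        # Drop events that are older than the window
        while self.events and self.events[0][0] <= now - self.window_seconds:
            _, was_error = self.events.popleft()
            self.errors -= int(was_error)

    def rate(self):
        return self.errors / len(self.events) if self.events else 0.0


ERROR_RATE_LIMIT = 0.005  # 0.5%, the threshold from the text

window = ErrorRateWindow(window_seconds=300)
# Hypothetical traffic: one request per second, every 50th one fails
for second in range(200):
    window.record(second, 500 if second % 50 == 0 else 200)

should_alert = window.rate() > ERROR_RATE_LIMIT
print(f"error rate={window.rate():.2%}, alert={should_alert}")
```

Here 4 of 200 requests fail, so the rate is 2% and the alert fires. Once the failing requests age out of the window, the rate falls back on its own.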
Metric #3: Saturation (Resource Utilization)
What we measure: CPU, RAM, Disk I/O, network connections.
Why it matters: By the time CPU hits 95%, the system is already failing. You need to know at 70%, while there is still headroom to react.
Alert threshold: CPU > 70%, RAM > 85%, Disk I/O > 80%.
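As a rough sketch, the thresholds above can be expressed as a small lookup table and checked against whatever utilization readings your agent or exporter reports. The function name `saturation_breaches`, the resource keys, and the sample readings are assumptions made for illustration. Network connections are left out because the text gives no threshold for them.

```python
# Thresholds from the text, as percentages of capacity
THRESHOLDS = {"cpu": 70.0, "ram": 85.0, "disk_io": 80.0}


def saturation_breaches(readings, thresholds=THRESHOLDS):
    """Return {resource: (value, limit)} for every reading above its limit.
    Resources without a configured threshold are ignored."""
    return {
        name: (value, thresholds[name])
        for name, value in readings.items()
        if name in thresholds and value > thresholds[name]
    }


# Hypothetical readings in percent
sample = {"cpu": 74.5, "ram": 62.0, "disk_io": 81.0}
breaches = saturation_breaches(sample)

for name, (value, limit) in breaches.items():
    print(f"{name}: {value}% is above the {limit}% threshold")
```

With these readings, CPU and disk I/O are flagged while RAM stays quiet.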
Metric #4: Traffic
What we measure: RPS (requests per second), active users.
Why it matters: A sudden traffic spike could be a DDoS attack or a viral post about you.
Alert threshold: Deviation > 200% from baseline over 10 minutes.
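The deviation rule can be read in more than one way. The sketch below assumes it means the absolute difference between the 10-minute mean RPS and the baseline, expressed as a percentage of the baseline. The function names and the sample numbers are hypothetical.

```python
from statistics import mean


def deviation_percent(current_rps, baseline_rps):
    """Absolute deviation from the baseline, as a percentage of the baseline."""
    if baseline_rps <= 0:
        raise ValueError("baseline must be positive")
    return abs(current_rps - baseline_rps) / baseline_rps * 100


def traffic_alert(window_rps, baseline_rps, limit_percent=200):
    """window_rps: RPS samples covering the last 10 minutes (assumed one per minute)."""
    return deviation_percent(mean(window_rps), baseline_rps) > limit_percent


baseline = 1000  # hypothetical normal RPS for this time of day

spike = traffic_alert([3500] * 10, baseline)     # 250% above baseline
busy_day = traffic_alert([2500] * 10, baseline)  # 150% above baseline

print(f"spike alert={spike}, busy day alert={busy_day}")
```

Under this reading, a drop can never deviate by more than 100%, so a 200% limit only catches spikes. If a sudden loss of traffic matters to you as well, it needs a separate, lower threshold.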
Tools: What to Use
- Prometheus + Grafana: Gold standard for metrics. Open-source, flexible.
- Datadog: SaaS "all-in-one" solution. Expensive but convenient.
- Zabbix: For those who love enterprise and aren't afraid of complexity.
NineLab Advice: Monitoring isn't "set and forget". Alerts need regular review. If false positives keep waking you up at night, you'll stop reacting to them.
Conclusion: Good monitoring means you learn about problems before they become disasters. Set up these 4 metrics, and you'll sleep soundly.