Production Monitoring: Metrics You Cannot Ignore
Do you learn about your site going down from an angry customer tweet? Congratulations, you have no monitoring. Or it's configured so poorly it's useless. Let's fix that.
Monitoring Dashboard: What Should Be on Screen
📊 Production Dashboard
Metric #1: Latency (Response Delay)
What we measure: Time from request to response. Not average, but percentiles — p95, p99.
Why it matters: An average latency of 100 ms can hide the fact that 1% of users wait 5 seconds. If the other 99% are served in about 50 ms, the mean still comes out near 100 ms.
Alert threshold: p95 > 500ms, p99 > 2s.
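To make the difference between the average and the percentiles concrete, here is a minimal Python sketch. The sample data is invented for illustration, the helper names (`percentile`, `latency_alerts`) are hypothetical, and the nearest-rank method is only one of several common percentile definitions. Only the 500 ms and 2 s limits come from the thresholds above.

```python
import math
from statistics import mean

P95_LIMIT_MS = 500    # alert threshold for p95, from the text
P99_LIMIT_MS = 2000   # alert threshold for p99, from the text


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample that has at least
    pct percent of all samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def latency_alerts(samples_ms):
    """Return the names of the percentile thresholds that are exceeded."""
    alerts = []
    if percentile(samples_ms, 95) > P95_LIMIT_MS:
        alerts.append("p95")
    if percentile(samples_ms, 99) > P99_LIMIT_MS:
        alerts.append("p99")
    return alerts


# Hypothetical sample: 980 requests at 50 ms and 20 requests at 5 seconds.
samples = [50.0] * 980 + [5000.0] * 20

print("mean:", mean(samples))
print("p95:", percentile(samples, 95))
print("p99:", percentile(samples, 99))
print("alerts:", latency_alerts(samples))
```

In this made-up sample the mean stays under 150 ms and p95 looks perfectly healthy, yet p99 sits at 5 seconds and trips the alert. In a real setup the percentiles would normally come from your metrics backend rather than from raw samples held in memory.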
Metric #2: Error Rate
What we measure: Ratio of HTTP 5xx errors to total requests.
Why it matters: 1% errors at 10,000 RPS means 100 failed requests every second, or 6,000 per minute, each one showing a user "Internal Server Error".
Alert threshold: > 0.5% over 5 minutes.
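The rule above can be sketched as a sliding window over recent requests. This is an illustrative Python sketch, not a production design: the class name `ErrorRateWindow` and its methods are hypothetical, and it stores one entry per request, which would be too expensive at 10,000 RPS. Real systems typically keep counters and let the metrics backend compute the rate.

```python
from collections import deque


class ErrorRateWindow:
    """Tracks the share of HTTP 5xx responses over a sliding time window."""

    def __init__(self, window_seconds=300, threshold=0.005):
        self.window_seconds = window_seconds  # 5 minutes, from the text
        self.threshold = threshold            # 0.5%, from the text
        self._events = deque()                # (timestamp, is_server_error)

    def record(self, timestamp, status_code):
        """Store one finished request and drop entries older than the window."""
        self._events.append((timestamp, 500 <= status_code <= 599))
        self._evict(timestamp)

    def _evict(self, now):
        cutoff = now - self.window_seconds
        while self._events and self._events[0][0] <= cutoff:
            self._events.popleft()

    def error_rate(self, now):
        """Ratio of 5xx responses to all requests still inside the window."""
        self._evict(now)
        if not self._events:
            return 0.0
        errors = sum(1 for _, is_error in self._events if is_error)
        return errors / len(self._events)

    def should_alert(self, now):
        return self.error_rate(now) > self.threshold


# Hypothetical traffic: 1,000 requests over 100 seconds, every 100th one fails.
window = ErrorRateWindow()
for i in range(1000):
    window.record(i * 0.1, 500 if i % 100 == 0 else 200)

print("error rate:", window.error_rate(100.0))
print("alert:", window.should_alert(100.0))
```

Only 5xx responses count as errors here, matching the definition above. Timestamps are passed in explicitly so the behaviour is easy to test without waiting for real time to pass.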
Metric #3: Saturation (Resource Utilization)
What we measure: CPU, RAM, Disk I/O, network connections.
Why it matters: By the time CPU hits 95%, requests are usually already queuing and latency is climbing. You need to know at 70%, while there is still headroom to react.
Alert threshold: CPU > 70%, RAM > 85%, Disk I/O > 80%.
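A threshold check like this is simple enough to sketch in a few lines of Python. The function name `saturated_resources` and the dictionary keys are hypothetical, and the utilization figures are assumed to arrive as percentages from whatever collector you already run. Network connections are left out because the text gives no threshold for them.

```python
# Alert thresholds in percent, taken from the text above.
THRESHOLDS = {"cpu": 70.0, "ram": 85.0, "disk_io": 80.0}


def saturated_resources(utilization):
    """Return the names of resources whose utilization (in percent)
    is above its alert threshold. Missing readings are skipped."""
    return sorted(
        name
        for name, limit in THRESHOLDS.items()
        if name in utilization and utilization[name] > limit
    )


# Hypothetical reading from one host.
reading = {"cpu": 72.0, "ram": 60.0, "disk_io": 81.5}
print(saturated_resources(reading))
```

Skipping missing readings keeps the sketch short, but in practice a missing metric is often worth its own alert, because a silent collector looks the same as a healthy host.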
Metric #4: Traffic
What we measure: RPS (requests per second), active users.
Why it matters: A sudden traffic spike could be a DDoS attack or a viral post about you.
Alert threshold: Deviation > 200% from baseline over 10 minutes.
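The threshold above can be read in more than one way. The Python sketch below assumes one interpretation: average the RPS over the last 10 minutes, then measure how far that average is from the baseline, as a fraction of the baseline. Under that reading, "more than 200%" means traffic above three times the baseline. The function names and the per-minute sampling are assumptions made for illustration.

```python
def traffic_deviation(current_rps, baseline_rps):
    """Relative distance from the baseline: 2.0 means 200%."""
    if baseline_rps <= 0:
        raise ValueError("baseline must be positive")
    return abs(current_rps - baseline_rps) / baseline_rps


def is_traffic_anomaly(window_rps, baseline_rps, limit=2.0):
    """window_rps: RPS samples covering the last 10 minutes,
    for example one averaged value per minute."""
    if not window_rps:
        raise ValueError("need at least one sample")
    current = sum(window_rps) / len(window_rps)
    return traffic_deviation(current, baseline_rps) > limit


# Hypothetical numbers: baseline of 1,000 RPS.
print(is_traffic_anomaly([3500] * 10, 1000))  # sustained spike
print(is_traffic_anomaly([1200] * 10, 1000))  # normal fluctuation
print(is_traffic_anomaly([0] * 10, 1000))     # traffic dropped to zero
```

Note the last case: a complete loss of traffic is only a 100% deviation, so a 200% rule never fires on it. If a sudden drop matters to you, it needs a separate lower bound.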
Tools: What to Use
- Prometheus + Grafana: Gold standard for metrics. Open-source, flexible.
- Datadog: SaaS "all-in-one" solution. Expensive but convenient.
- Zabbix: For those who love enterprise and aren't afraid of complexity.
NineLab Advice: Monitoring isn't "set and forget". Alerts need regular review. If false positives keep waking you up at night, you'll stop reacting to alerts altogether.
Conclusion: Good monitoring means you learn about problems before they become disasters. Set up these 4 metrics, and you'll sleep soundly.