Kubernetes in Production: A CTO Checklist Before Launching a Cluster
Kubernetes promises "autoscaling out of the box." In practice, a cluster run without discipline is expensive chaos: pods stuck in CrashLoopBackOff, containers getting OOMKilled, secrets in plain text, and Friday-night deploys via `kubectl apply -f`.
Minimum Production Checklist
- RBAC and namespaces — separate prod/stage, least privilege for CI.
- Requests/limits — on every Deployment; without limits, one noisy workload can starve its neighbors on the same node or get them evicted.
- Ingress + TLS — cert-manager, HSTS, rate limiting at the edge.
- GitOps — Argo CD / Flux, rollbacks with one click.
- Monitoring — Prometheus + alerts on pod restarts, saturation, error rate.
- etcd and PV backups — a DR plan on paper, not in the DevOps engineer's head.
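The requests/limits item in the checklist above is easy to enforce mechanically. The sketch below is one possible pre-merge check, not a definitive policy: it assumes the Deployment manifest has already been parsed from YAML into a plain Python dict, and the function name `missing_resources` and the sample manifest are hypothetical. It also assumes your team requires both CPU and memory under both `requests` and `limits`, which is a policy choice rather than a Kubernetes rule.

```python
from typing import Any

# Assumption: team policy requires both sections and both resource types.
REQUIRED_SECTIONS = ("requests", "limits")
REQUIRED_RESOURCES = ("cpu", "memory")


def missing_resources(deployment: dict[str, Any]) -> list[str]:
    """Return one message per missing requests/limits entry in a Deployment."""
    problems = []
    name = deployment.get("metadata", {}).get("name", "<unnamed>")
    pod_spec = deployment.get("spec", {}).get("template", {}).get("spec", {})
    for container in pod_spec.get("containers", []):
        resources = container.get("resources") or {}
        for section in REQUIRED_SECTIONS:
            declared = resources.get(section) or {}
            for resource in REQUIRED_RESOURCES:
                if resource not in declared:
                    cname = container.get("name", "?")
                    problems.append(f"{name}/{cname}: missing {section}.{resource}")
    return problems


# Hypothetical manifest, already loaded into a dict.
example = {
    "kind": "Deployment",
    "metadata": {"name": "api"},
    "spec": {"template": {"spec": {"containers": [
        {"name": "web", "resources": {
            "requests": {"cpu": "100m", "memory": "128Mi"},
            "limits": {"memory": "256Mi"},
        }},
        {"name": "sidecar"},  # no resources block at all
    ]}}},
}

for problem in missing_resources(example):
    print(problem)
```

Run in CI against every rendered manifest, a check like this fails the pipeline before an unbounded container reaches the cluster. An admission policy inside the cluster would be a stricter alternative.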
Common Mistakes
- One cluster for everything — prod and experiments in the same namespace.
- Stateful workloads without an operator — PostgreSQL "in a Pod" without Patroni/Crunchy.
- No staging environment that mirrors the production topology.
We build and operate clusters in high-load and IoT projects. Services: turnkey Kubernetes, DevOps and CI/CD. Audit of an existing cluster — from ₽35,000, see pricing.
FAQ for this topic
With a pilot: one non-critical service, baseline policies, observability, and a clear release path—otherwise complexity eats velocity.
No: canaries, DB migrations, rollbacks, and windows for stateful parts still matter.
In a vault with rotation, audit, and least privilege—not in git or plain env everywhere.
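As a rough illustration of the "not in plain env" point, the sketch below scans a pod spec (assumed to be already parsed into a Python dict) for environment variables whose names look sensitive and whose values are written inline. The name pattern, the function `plaintext_secrets`, and the sample spec are all hypothetical. A real policy would likely need a broader pattern and a scan of ConfigMaps as well.

```python
import re
from typing import Any

# Assumption: env var names matching this pattern are treated as sensitive.
SENSITIVE = re.compile(r"(PASSWORD|SECRET|TOKEN|API_KEY|PRIVATE_KEY)", re.IGNORECASE)


def plaintext_secrets(pod_spec: dict[str, Any]) -> list[str]:
    """List sensitive-looking env vars whose value is inlined in the manifest."""
    findings = []
    for container in pod_spec.get("containers", []):
        for env in container.get("env", []):
            # A literal "value" lives in the manifest (and so in git);
            # "valueFrom" points at a Secret or another external source.
            if "value" in env and SENSITIVE.search(env.get("name", "")):
                findings.append(f"{container.get('name', '?')}: {env['name']}")
    return findings


# Hypothetical pod spec for illustration.
pod_spec = {"containers": [{"name": "api", "env": [
    {"name": "DB_PASSWORD", "value": "hunter2"},
    {"name": "API_TOKEN",
     "valueFrom": {"secretKeyRef": {"name": "api-creds", "key": "token"}}},
    {"name": "LOG_LEVEL", "value": "info"},
]}]}

print(plaintext_secrets(pod_spec))  # ['api: DB_PASSWORD']
```

The check only catches the most obvious leak. It does not replace rotation or audit, which still have to come from the vault itself.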
Per-service SLOs, queue lag, replication lag, deploy failures, cluster headroom—tied to user journeys.
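Cluster headroom in particular reduces to simple arithmetic: the share of allocatable capacity not yet claimed by pod requests. The sketch below shows that calculation for CPU only. The function names, the sample quantities, and the 20% alert threshold are assumptions for illustration. In practice the inputs would come from node and pod data exported by your monitoring stack.

```python
def parse_cpu(quantity: str) -> float:
    """Convert a Kubernetes CPU quantity such as "500m" or "2" into cores."""
    if quantity.endswith("m"):
        return int(quantity[:-1]) / 1000
    return float(quantity)


def cpu_headroom(allocatable: list[str], requests: list[str]) -> float:
    """Fraction of allocatable CPU not yet claimed by pod requests."""
    total = sum(parse_cpu(q) for q in allocatable)
    used = sum(parse_cpu(q) for q in requests)
    if total == 0:
        raise ValueError("no allocatable CPU reported")
    return 1 - used / total


# Hypothetical cluster: three nodes and five pods.
nodes = ["4", "4", "3900m"]
pods = ["500m", "1", "250m", "2", "1500m"]

headroom = cpu_headroom(nodes, pods)
print(f"CPU headroom: {headroom:.0%}")  # CPU headroom: 56%

# Assumption: alert when less than 20% of CPU is left unrequested.
if headroom < 0.20:
    print("Headroom below threshold, consider adding nodes")
```

The same ratio works for memory once the quantities are normalized to bytes. Tracking it over time shows whether the cluster can absorb a node failure or a traffic spike.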
Want to apply this in practice?
Tell us about your system — we’ll propose a work plan and the metrics worth fixing in an SLA/SLO.