HighLoad Infrastructure for Judo Battle Portal
Next.js SSR + Strapi at 15,000 concurrent connections: three-node architecture, Varnish cache, PM2 cluster, and autonomous CI/CD.
About the Project
Sports portal with a heavy frontend: Next.js SSR, 24 JS chunks, dynamic content, and Strapi CMS. The goal was to handle 15,000 concurrent connections during tournament peaks without user-facing degradation. We split the monolith into isolated nodes, added Varnish caching with separate rules per content type, and validated the KPI with a stress test.
From Monolith to Isolated Architecture
The original setup ran Nginx, the frontend, the backend, and the database on one server. Next.js SSR handled every request, so at 15,000 connections the CPU hit 100%. Strapi ran single-threaded, admin sessions broke, and SSH was open to brute-force attempts.
The new setup uses three independent nodes on the private network 172.16.0.0/28: Proxy (Nginx + Varnish, 12 GB cache), Frontend (PM2 cluster, 8 Next.js instances), and Backend (4 Strapi instances). Only port 443 is exposed to the outside world. Deploys, SSL renewal, and log rotation are automated.
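As a quick check of the addressing, a /28 leaves 14 usable host addresses, enough for the three nodes with room to grow. This is a minimal sketch using Python's standard ipaddress module; the per-node address assignment is hypothetical, since the source does not list the actual addresses.

```python
import ipaddress

# Private network from the description above.
net = ipaddress.ip_network("172.16.0.0/28")

usable = list(net.hosts())     # excludes the network and broadcast addresses
print(net.is_private)          # True: inside the RFC 1918 range 172.16.0.0/12
print(net.num_addresses)       # 16
print(len(usable))             # 14

# Hypothetical node assignment (not taken from the source).
nodes = dict(zip(["proxy", "frontend", "backend"], usable))
for name, addr in nodes.items():
    print(f"{name:9s} {addr}")
```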
Solution Architecture

Proxy Server
Nginx + Varnish with custom VCL. Static assets are cached for 1 year; SEO pages for 10 minutes, with the language cookie taken into account. RSC, prefetch, and API requests bypass the cache. Grace mode serves stale pages for up to 1 hour when the backend fails.
Frontend (Next.js)
PM2 Cluster Mode across all CPU cores (8 instances). Automated cron CI/CD: backup, git pull, build, safe restart via pm2safe. max_memory_restart: 1G, log rotation at 5 MB.
Backend (Strapi)
Cluster of 4 instances on ports 1337–1340 (STRAPI_WEB_CONCURRENCY: 2). The admin panel is pinned to port 1337 for stable sessions; the public API is load-balanced. Processes are restored through systemd after a server reboot.
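The routing rule can be modelled in a few lines. In the project this decision is presumably made in the Nginx upstream configuration, which the source does not show; the Python sketch below only illustrates the logic. It assumes the admin panel lives under /admin (Strapi's default path) and that the public API is spread round-robin over all four ports.

```python
from itertools import cycle

ADMIN_PORT = 1337                      # admin panel is pinned here (from the text)
API_PORTS = [1337, 1338, 1339, 1340]   # assumption: the API uses all four instances

_api_ports = cycle(API_PORTS)          # simple round-robin iterator


def pick_backend_port(path: str) -> int:
    """Return the Strapi port that should serve this request path."""
    # Assumption: the admin panel is served under /admin (Strapi's default).
    if path == "/admin" or path.startswith("/admin/"):
        return ADMIN_PORT              # same instance every time, so sessions stay stable
    return next(_api_ports)            # everything else rotates across instances


if __name__ == "__main__":
    for p in ["/admin", "/api/news", "/api/athletes", "/admin/content-manager"]:
        print(p, "->", pick_backend_port(p))
```

Pinning the admin panel to one instance avoids the session breakage seen in the monolith, while stateless API reads can go to any instance.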
Key Solutions
Smart Varnish Caching
The 24 JS chunks and other static assets are cached as immutable for 1 year. SEO content (news, athletes, clubs) is cached for 10 minutes. Next.js dynamic requests and the admin panel bypass the cache to preserve interactivity and authentication.
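The caching rules can be read as a small decision function. The project implements them in VCL, which is not shown in the source, so this Python sketch models only the decision. The URL prefixes are illustrative assumptions. The `RSC` and `Next-Router-Prefetch` header names are the ones the Next.js App Router sends; whether the project's VCL keys on them is also an assumption.

```python
from typing import Mapping, Optional, Tuple

ONE_YEAR = 365 * 24 * 3600   # static assets
TEN_MINUTES = 10 * 60        # SEO pages

# Assumption: these URL prefixes are illustrative, not taken from the project.
SEO_PREFIXES = ("/news", "/athletes", "/clubs")
BYPASS_PREFIXES = ("/api/", "/admin")


def cache_policy(path: str, headers: Mapping[str, str]) -> Tuple[str, Optional[int]]:
    """Return ("cache", ttl_seconds) or ("pass", None) for a request."""
    h = {k.lower(): v for k, v in headers.items()}
    # RSC payloads and router prefetches are per-navigation; never cache them.
    if "rsc" in h or "next-router-prefetch" in h:
        return ("pass", None)
    if path.startswith(BYPASS_PREFIXES):
        return ("pass", None)
    # Next.js serves content-hashed build assets under /_next/static/.
    if path.startswith("/_next/static/"):
        return ("cache", ONE_YEAR)
    if path.startswith(SEO_PREFIXES):
        return ("cache", TEN_MINUTES)
    return ("pass", None)    # assumption: anything unlisted is not cached


if __name__ == "__main__":
    print(cache_policy("/_next/static/chunks/main.js", {}))
    print(cache_policy("/news/final-results", {}))
    print(cache_policy("/news/final-results", {"RSC": "1"}))
```

Checking the bypass conditions first keeps authenticated and interactive traffic out of the cache even when its URL matches a cacheable prefix.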
Autonomous CI/CD
A cron script polls GitHub every 5 minutes. On a new commit it runs: tar.gz backup → git pull → npm install → build → safe PM2 restart. Updates are deployed without user-facing downtime.
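One polling cycle of such a script might look like the sketch below. It is not the project's script: the branch name, app name, and backup path are hypothetical, and `pm2 reload` stands in for the pm2safe wrapper, whose contents the source does not show. The command runner is injectable so the flow can be exercised without touching a real repository.

```python
import subprocess
from typing import Callable, List

Runner = Callable[[List[str]], str]


def sh(cmd: List[str]) -> str:
    """Run a command, raise on a non-zero exit code, return trimmed stdout."""
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout.strip()


def deploy_if_changed(run: Runner = sh, branch: str = "main", app: str = "frontend") -> bool:
    """One polling cycle: deploy only when the remote branch has a new commit."""
    run(["git", "fetch", "origin", branch])
    local = run(["git", "rev-parse", "HEAD"])
    remote = run(["git", "rev-parse", f"origin/{branch}"])
    if local == remote:
        return False                       # nothing new, skip this cycle

    # Back up the current tree first so a bad build can be rolled back by hand.
    # The backup location is hypothetical.
    run(["tar", "-czf", f"/tmp/backup-{local[:7]}.tar.gz", "--exclude=node_modules", "."])
    run(["git", "pull", "origin", branch])
    run(["npm", "install"])
    run(["npm", "run", "build"])
    # Stand-in for the project's pm2safe wrapper (not shown in the source).
    run(["pm2", "reload", app])
    return True
```

Because `sh` raises on any failed step, a broken install or build stops the cycle before the reload, and the running instances keep serving the previous version. A crontab schedule of `*/5 * * * *` would give the 5-minute polling interval.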
PM2 Clustering
8 Next.js and 4 Strapi instances, restarted automatically when memory use exceeds the limit (1G and 2G respectively). If one instance fails, the service keeps running because the remaining instances take over its traffic.
Closed Perimeter
SSH (port 22) closed by default on all servers. Access only via provider console or temporary scripts. Internal network isolated — only Proxy is public.
Automated SSL Renewal
Certbot runs from a systemd timer on the Proxy (every Sunday). On the Backend, a bash script stops Nginx cleanly so that Certbot can run in standalone mode. RandomizedDelaySec spreads the start time to avoid hitting Let's Encrypt during request spikes.
Grace Mode & Self-Healing
When the backend fails or stalls, Varnish keeps serving cached pages for up to one more hour, so users do not see 502/504 errors. PM2 and systemd restart failed processes automatically.
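The serve-stale behaviour can be illustrated with a toy cache. This is a simplified model, not Varnish itself: it covers only the failure path, whereas Varnish can also serve a stale object while refreshing it in the background. The TTL and grace values are taken from the text; the class and its interface are hypothetical.

```python
import time
from typing import Callable, Dict, Optional, Tuple

TTL = 10 * 60      # a page is fresh for 10 minutes (SEO pages)
GRACE = 60 * 60    # it may be served stale for 1 more hour if the backend is down


class GraceCache:
    def __init__(self, fetch: Callable[[str], str], now: Callable[[], float] = time.time):
        self._fetch = fetch      # calls the backend; raises on failure
        self._now = now          # injectable clock, handy for tests
        self._store: Dict[str, Tuple[str, float]] = {}   # url -> (body, stored_at)

    def get(self, url: str) -> Optional[str]:
        entry = self._store.get(url)
        age = self._now() - entry[1] if entry else None
        if entry and age < TTL:
            return entry[0]                      # fresh hit, backend not contacted
        try:
            body = self._fetch(url)              # miss or expired: ask the backend
            self._store[url] = (body, self._now())
            return body
        except Exception:
            if entry and age < TTL + GRACE:
                return entry[0]                  # backend down: serve the stale copy
            return None                          # nothing usable; caller answers 5xx
```

With these values a page cached just before an outage stays available for up to 70 minutes (10 fresh plus 60 in grace), which gives PM2 and systemd time to bring the processes back.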
Stress Test Results
Apache Bench stress test: 15,000 concurrent connections, 100,000 requests on a heavy SSR site. Project KPI achieved with significant headroom.
Concurrent connections without failure: 15,000+
Requests in a single run: 100,000
CPU on Proxy server at peak: ≤ 25%
CPU on Frontend/Backend (cache working): 1–3%
RAM per machine in steady state: ≤ 6 GB
502/504 errors for users (grace mode): 0
Bottleneck: the limit was not software but the physical hosting bandwidth (~1 Gbps). Once the channel was fully saturated, TLS errors and timeouts began, while CPU and RAM remained well within capacity.
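A back-of-the-envelope calculation shows why the link saturates before the servers do. The link speed and connection count come from the test; the page weight is a hypothetical figure used only for illustration, and protocol overhead (TCP, TLS) is ignored.

```python
LINK_BPS = 1_000_000_000     # ~1 Gbps uplink, from the stress test notes
CONNECTIONS = 15_000         # concurrent connections in the test

per_conn_bps = LINK_BPS / CONNECTIONS
per_conn_kB_s = per_conn_bps / 8 / 1000

print(f"{per_conn_bps / 1000:.1f} kbit/s per connection")   # 66.7 kbit/s
print(f"{per_conn_kB_s:.1f} kB/s per connection")           # 8.3 kB/s

# Hypothetical page weight, only to illustrate the effect (not measured).
PAGE_BYTES = 100_000
seconds_per_page = PAGE_BYTES / (per_conn_kB_s * 1000)
print(f"{seconds_per_page:.0f} s to deliver one page")      # 12 s
```

At an even share of the link, each connection gets roughly 8 kB/s, so responses of any real size take seconds to transfer. Handshakes and transfers that slow are a plausible cause of client-side timeouts even while CPU and RAM stay idle.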