← All Case Studies

Scaling a Social Platform to 80,000 Concurrent Users

When I joined vVoosh as the sole infrastructure lead in late 2017, the social entertainment platform was growing faster than its infrastructure could handle. The engineering team was small and focused on product, and there was nobody dedicated to the underlying platform. Outages during peak usage had become a regular occurrence, and the team’s approach to scaling was reactive: when things fell over, someone would SSH in and try to fix it.

I was brought in to own the entire infrastructure — from provisioning and deployment to monitoring and scaling. What followed was a focused, methodical effort that culminated in a two-hour performance test sustaining 80,000 concurrent users with stability throughout.

Starting point

The infrastructure I inherited was a collection of manually provisioned EC2 instances sitting in a single AWS account. There was no separation between environments, no infrastructure as code, and no meaningful monitoring beyond basic CloudWatch metrics. Deployments were semi-automated at best — a mix of scripts and manual steps that varied depending on which engineer had last touched them.

As a social entertainment platform, vVoosh’s traffic pattern was extremely spiky. Marketing campaigns and viral content could double or triple concurrent users within minutes. The infrastructure needed to handle these surges without manual intervention, and it wasn’t remotely equipped to do so.

The core issues I identified in my first week:

  • No load testing baseline. The team didn’t know their capacity limits. They discovered them in production, during real user traffic, when things broke.
  • Single AWS account for everything. Production, staging, and development workloads all shared one account with no resource isolation or access boundaries.
  • No infrastructure as code. Environments couldn’t be reliably reproduced, and changes were made through the AWS console.
  • No real-time observability. When problems occurred, diagnosing root causes meant guessing and manually tailing logs on individual instances.

Building the foundation

Before I could optimise anything, I needed two things: reproducible infrastructure and the ability to observe what was happening across the platform in real time.

Multi-account AWS with Terraform. I restructured the entire AWS presence into a multi-account setup — separate accounts for production, staging, and development — managed entirely through Terraform. This gave us proper blast radius isolation (a misconfiguration in staging couldn’t affect production), clear cost attribution, and the ability to spin up or tear down complete environments from version-controlled definitions. Every VPC, subnet, security group, auto-scaling group, RDS instance, and load balancer was codified.
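The shape of that layout can be sketched in Terraform. This is an illustrative fragment only — the account IDs, role names, module paths, and CIDR ranges are placeholders, not vVoosh's actual configuration:

```hcl
# Hypothetical sketch of the per-account provider wiring. Each environment
# gets its own AWS account, reached via an assumed role; all values here
# are placeholders.
provider "aws" {
  alias  = "production"
  region = "eu-west-1"

  assume_role {
    role_arn = "arn:aws:iam::111111111111:role/terraform" # placeholder account ID
  }
}

# The same module definitions are reused per environment, so staging and
# production differ only in the variables passed in.
module "vpc_production" {
  source      = "./modules/vpc" # hypothetical module path
  providers   = { aws = aws.production }
  cidr_block  = "10.0.0.0/16"
  environment = "production"
}
```

Because each environment is just a different set of inputs to the same modules, tearing down and recreating a complete environment becomes a plan/apply cycle rather than a manual rebuild.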

The multi-account approach also improved security posture. IAM policies could be scoped tightly per account, and cross-account access was explicit and auditable.

Dual monitoring with Outlyer and ELK. I deployed Outlyer for real-time infrastructure and application metrics — CPU, memory, network, request rates, response times, error rates — with dashboards that gave the team instant visibility across the fleet. Alongside Outlyer, I stood up an ELK stack (Elasticsearch, Logstash, Kibana) to centralise application logs from every instance. The combination gave us both the high-level view (“is the platform healthy?”) and the ability to drill down into specific requests, errors, and performance anomalies.

Jenkins CI/CD pipelines. I implemented Jenkins-based CI/CD pipelines for automated build, test, and deployment. Previously, deployments were a manual process that varied by service and by the person executing them. The Jenkins pipelines standardised deployments across all services, making them repeatable, auditable, and fast. This was essential for the iteration speed I needed during the scaling work — I was frequently deploying infrastructure and configuration changes, and I needed those deployments to be reliable.

Systematic load testing

With the infrastructure codified and observable, I designed a structured load testing approach. The methodology was iterative: establish a baseline, identify the bottleneck, fix it, re-test, repeat.
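The loop above can be sketched as a small harness. This is a simplified stand-in — the real tests drove simulated user traffic against staging, whereas the request function here is a no-op so the example runs anywhere:

```python
# Minimal sketch of one step in the measure -> fix -> re-test loop:
# run a load step, compute p99 latency, compare against a target.
import time

def p99(latencies_ms):
    """99th-percentile latency from a list of per-request timings (ms)."""
    ordered = sorted(latencies_ms)
    index = max(0, int(len(ordered) * 0.99) - 1)
    return ordered[index]

def run_step(make_request, concurrent_users, requests_per_user=10):
    """Run one load step and report the observed tail latency."""
    latencies = []
    for _ in range(concurrent_users * requests_per_user):
        start = time.perf_counter()
        make_request()  # stand-in for a real HTTP/WebSocket request
        latencies.append((time.perf_counter() - start) * 1000)
    return {"users": concurrent_users, "p99_ms": p99(latencies)}

# Stand-in request so the sketch is self-contained and runnable.
result = run_step(lambda: None, concurrent_users=100)
```

In practice each iteration increased the user count until a metric breached its target, at which point the dashboards and logs identified the failing component.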

The initial baseline test revealed the platform started degrading at approximately 15,000 concurrent users — well below the traffic levels that marketing campaigns were already driving. The Outlyer dashboards and ELK logs made it possible to pinpoint exactly which components failed first:

  • WebSocket connection handlers running out of file descriptors at scale
  • Database connection pools exhausting under concurrent load, causing cascading query timeouts
  • In-memory session storage that broke during auto-scaling events (users lost their sessions when routed to new instances)
  • Media processing running synchronously on the main application thread, blocking API responses

Each of these was addressed in turn:

Auto-scaling with custom metrics. I configured auto-scaling groups that responded to WebSocket connection count and request queue depth rather than CPU utilisation alone. CPU is a lagging indicator for this type of workload; connection count gave us much earlier warning of load increases.
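The scaling decision can be illustrated as a pure function. The thresholds below are made up for the sketch — the production values were tuned from load-test data:

```python
# Hedged sketch: desired instance count driven by WebSocket connections
# and queue depth instead of CPU. All thresholds are illustrative.
def desired_capacity(current_instances, ws_connections, queue_depth,
                     conns_per_instance=2000, max_queue_per_instance=50):
    """Return the instance count needed to keep both metrics in range."""
    needed_for_conns = -(-ws_connections // conns_per_instance)   # ceil division
    needed_for_queue = -(-queue_depth // max_queue_per_instance)  # ceil division
    target = max(needed_for_conns, needed_for_queue, 1)
    # Scale up eagerly; scale down one step at a time to avoid flapping.
    if target < current_instances:
        return current_instances - 1
    return target
```

With 30,000 open connections and 2,000 connections per instance, the function asks for 15 instances regardless of how idle the CPUs look — which is exactly the early warning CPU-based scaling misses.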

Session externalisation to Redis. Moving session storage to ElastiCache (Redis) meant users could be routed to any instance without losing state — a prerequisite for auto-scaling to work transparently.
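The pattern looks roughly like this. A dict stands in for Redis so the sketch runs anywhere; the production version used ElastiCache through a Redis client with the same get/set-with-TTL shape:

```python
# Sketch of externalised sessions. The dict backend is a stand-in for a
# Redis connection; keys and TTL handling mirror the SETEX pattern.
import json
import time

class SessionStore:
    def __init__(self, ttl_seconds=3600):
        self._backend = {}   # stand-in for the Redis connection
        self._ttl = ttl_seconds

    def save(self, session_id, data):
        # Redis equivalent: SETEX session:<id> <ttl> <json payload>
        expiry = time.time() + self._ttl
        self._backend[f"session:{session_id}"] = (json.dumps(data), expiry)

    def load(self, session_id):
        entry = self._backend.get(f"session:{session_id}")
        if entry is None or entry[1] < time.time():
            return None      # unknown or expired session
        return json.loads(entry[0])

# Any instance talking to the same store can serve the user mid-session.
store = SessionStore()
store.save("u42", {"user_id": 42, "logged_in": True})
```

Because every instance reads and writes the same external store, an auto-scaling event that routes a user to a freshly launched instance is invisible to them.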

Asynchronous media processing. Heavy media operations were offloaded to an SQS queue processed by a dedicated worker fleet, removing them from the request path entirely.
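The request-path change can be sketched with a local queue. Here `queue.Queue` stands in for SQS and a thread stands in for the worker fleet, purely so the example is self-contained:

```python
# Sketch of the async handoff: the API handler enqueues a job and returns
# immediately; a worker drains the queue off the request path.
import queue
import threading

media_jobs = queue.Queue()   # stand-in for the SQS queue
processed = []

def api_upload_handler(media_id):
    """Fast path: enqueue and acknowledge, no processing inline."""
    media_jobs.put(media_id)
    return {"status": "accepted", "media_id": media_id}

def worker():
    while True:
        media_id = media_jobs.get()
        if media_id is None:                        # shutdown sentinel
            break
        processed.append(f"transcoded:{media_id}")  # stand-in for real work
        media_jobs.task_done()

t = threading.Thread(target=worker)
t.start()
responses = [api_upload_handler(i) for i in range(3)]
media_jobs.put(None)
t.join()
```

The API returns an acknowledgement in constant time; the worker fleet scales independently of the web tier based on queue depth.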

Database read replicas. Read-heavy queries were directed to Aurora read replicas, relieving the lock contention on the primary that appeared under concurrent load.
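Read/write splitting can be sketched as a simple router. The endpoint names are placeholders, and the naive SELECT-prefix check is for illustration only — a real router must also account for things like `SELECT ... FOR UPDATE` and read-your-writes consistency:

```python
# Hedged sketch of read/write splitting: reads go round-robin to replicas,
# everything else to the writer. Endpoint names are placeholders.
WRITER = "primary.cluster.example.internal"
READERS = [
    "replica-1.cluster.example.internal",
    "replica-2.cluster.example.internal",
]

_next_reader = 0

def endpoint_for(sql):
    """Choose a database endpoint for the given statement."""
    global _next_reader
    if sql.lstrip().lower().startswith("select"):
        chosen = READERS[_next_reader % len(READERS)]
        _next_reader += 1
        return chosen
    return WRITER
```

In practice Aurora's own reader endpoint can do the replica-side balancing; the essential change is that read traffic stops competing with writes on the primary.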

The 80,000-user test

After several weeks of iterative testing and fixing, we ran the definitive performance test: a sustained two-hour load test simulating 80,000 concurrent users with realistic usage patterns — browsing, posting, messaging, media uploads.

The results:

  • 80,000 concurrent users sustained for two hours with p99 response times under 200ms
  • Auto-scaling responded within 90 seconds of threshold breaches, with no manual intervention required
  • Zero errors during the test — the platform handled the load cleanly throughout
  • Infrastructure costs decreased by roughly 25% compared to the pre-Terraform setup, because auto-scaling meant we weren’t paying for peak capacity around the clock

Reflections

The vVoosh engagement was unique in my career because of the breadth of ownership. As the sole infrastructure lead, every decision — from AWS account structure to monitoring tool selection to CI/CD pipeline design — was mine to make and mine to deliver. That concentration of responsibility forced a disciplined approach: get the foundations right (IaC, monitoring, CI/CD), then build on them systematically.

The sequence matters. You cannot scale what you cannot measure, and you cannot measure what you cannot reproduce. Terraform first, monitoring second, then systematic load testing, then targeted optimisation. Skipping steps in that sequence means you’re optimising blind.