Cutting CI Costs by 85% with AWS Spot and GitHub App Autoscaling

When I joined Systematica Investments as an embedded platform engineer, the CI infrastructure was typical of what you find at mid-sized firms that have grown quickly: a pool of always-on, self-hosted GitHub Actions runners sitting on reserved EC2 instances. The machines ran 24/7 regardless of whether anyone was pushing code. At peak times, developers queued behind each other waiting for a free runner. At night and on weekends, the fleet sat idle, burning money.

The monthly bill for CI compute alone was well into five figures. Leadership asked whether there was a better way. There was.

The problem in detail

The existing setup used a fixed pool of eight m5.2xlarge instances across two availability zones. Each instance ran a single GitHub Actions runner process. The runners were provisioned with Terraform and managed via a systemd service, which was straightforward to operate but fundamentally inflexible.

The core issues were threefold. First, cost: reserved instances meant paying for capacity whether it was used or not. Second, throughput: eight runners was not enough during the morning burst when most of the London-based team pushed code between 09:00 and 11:00. Third, feedback speed: developers waiting ten or fifteen minutes for a runner to become available is ten or fifteen minutes of broken flow.

Beyond the runners themselves, CI workflows also needed to deploy to EKS clusters for integration testing. The existing approach used long-lived IAM access keys shared across runners — a security concern that had been flagged repeatedly but never addressed.

Designing the solution

I proposed an event-driven architecture that would spin up runners on demand and tear them down when idle. The key components were:

A custom GitHub App registered against the organisation. GitHub Apps can subscribe to workflow_job webhook events, which fire on the queued, in_progress, and completed lifecycle transitions. This gave us a real-time signal of demand.

AWS API Gateway as the webhook receiver. The GitHub App’s webhook URL pointed at an API Gateway endpoint configured with a Lambda integration. This gave us a serverless, highly available ingress point with no infrastructure to manage.
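
Before acting on any payload, the receiver has to confirm the delivery really came from GitHub. GitHub signs each delivery with the app's webhook secret in the X-Hub-Signature-256 header, so the check reduces to a small helper like this (a sketch; the handler wiring around it is assumed):

```python
import hashlib
import hmac

def verify_signature(secret: str, body: bytes, signature_header: str) -> bool:
    """Check GitHub's X-Hub-Signature-256 header: an HMAC-SHA256 of the
    raw request body, keyed with the app's webhook secret."""
    expected = "sha256=" + hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking the expected value through timing differences
    return hmac.compare_digest(expected, signature_header)
```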

An AWS Lambda function that processed incoming webhook payloads. When a workflow_job event arrived with action queued, the function called the EC2 API to launch a new Spot Instance from a pre-baked AMI containing the runner agent, Docker, and all build tooling. When a completed event arrived, it scheduled the instance for termination after a brief grace period.
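
In outline, the routing core of that function can be sketched as below. Field names are simplified, "ci-runner" is a placeholder launch template, and the real function also published CloudWatch metrics and handled errors:

```python
import json

LAUNCH_TEMPLATE = "ci-runner"  # placeholder: launch template built on the pre-baked AMI

def decide(payload: dict):
    """Map a workflow_job webhook payload to a fleet action."""
    action = payload.get("action")
    if action == "queued":
        return "launch"
    if action == "completed":
        return "terminate"
    return None  # in_progress and anything else need no capacity change

def launch_spot_runner(job: dict) -> None:
    """Launch one Spot Instance from the pre-baked AMI, tagged so later
    events can find it without the Lambda holding any state."""
    import boto3  # imported lazily so the routing logic is testable off-AWS
    boto3.client("ec2").run_instances(
        MinCount=1,
        MaxCount=1,
        LaunchTemplate={"LaunchTemplateName": LAUNCH_TEMPLATE},
        InstanceMarketOptions={"MarketType": "spot"},
        TagSpecifications=[{
            "ResourceType": "instance",
            "Tags": [
                {"Key": "run_id", "Value": str(job.get("run_id", ""))},
                {"Key": "runner", "Value": job.get("runner_name", "")},
            ],
        }],
    )

def handler(event, context):
    payload = json.loads(event["body"])  # API Gateway proxy integration
    if decide(payload) == "launch":
        launch_spot_runner(payload.get("workflow_job", {}))
    # the "terminate" path tagged the instance for teardown after the grace period
    return {"statusCode": 200}
```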

EC2 Spot Instances as the compute layer. Our CI workloads were inherently interruption-tolerant — if a Spot Instance was reclaimed, the workflow simply retried on a fresh runner. I configured the launch template to request capacity from a pool of six compatible instance types across three availability zones, which kept the Spot interruption rate below 3%.
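
One way to express that diversification is an instant-mode EC2 Fleet request whose overrides enumerate every instance-type/subnet pairing — a sketch, with placeholder subnet IDs and an assumed launch template name:

```python
def build_fleet_request(launch_template: str, instance_types: list, subnets: list) -> dict:
    """Build an instant-mode CreateFleet request that lets EC2 pick Spot
    capacity from any (instance type x availability zone) pool."""
    overrides = [
        {"InstanceType": itype, "SubnetId": subnet}
        for itype in instance_types
        for subnet in subnets
    ]
    return {
        "Type": "instant",
        "SpotOptions": {"AllocationStrategy": "capacity-optimized"},
        "TargetCapacitySpecification": {
            "TotalTargetCapacity": 1,  # one runner per queued job
            "DefaultTargetCapacityType": "spot",
        },
        "LaunchTemplateConfigs": [{
            "LaunchTemplateSpecification": {
                "LaunchTemplateName": launch_template,
                "Version": "$Latest",
            },
            "Overrides": overrides,
        }],
    }
```

With six instance types across three availability zones this yields eighteen candidate pools, and the capacity-optimized allocation strategy steers each request toward the pools least likely to be interrupted.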

A pre-baked AMI built with Packer. The AMI included the GitHub Actions runner binary, Docker, language runtimes, and cached dependencies. Boot-to-ready time was under ninety seconds, meaning a developer’s workflow would pick up a fresh runner almost immediately after being queued.

IAM-based EKS deployments

Alongside the autoscaling work, I replaced the shared long-lived IAM access keys with IAM roles for EC2. Each runner instance assumed a scoped IAM role that granted only the permissions required for CI operations — pushing container images to ECR, running Helm upgrades against specific EKS namespaces, and reading configuration from Parameter Store.
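
In policy terms, the runner role reduced to a handful of scoped statements. An abbreviated sketch — account ID, region, and resource names are placeholders, and auxiliary actions such as ecr:GetAuthorizationToken are omitted:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "PushImages",
      "Effect": "Allow",
      "Action": ["ecr:BatchCheckLayerAvailability", "ecr:InitiateLayerUpload",
                 "ecr:UploadLayerPart", "ecr:CompleteLayerUpload", "ecr:PutImage"],
      "Resource": "arn:aws:ecr:eu-west-2:111122223333:repository/ci-*"
    },
    {
      "Sid": "ReadCiConfig",
      "Effect": "Allow",
      "Action": ["ssm:GetParameter", "ssm:GetParametersByPath"],
      "Resource": "arn:aws:ssm:eu-west-2:111122223333:parameter/ci/*"
    },
    {
      "Sid": "DescribeCluster",
      "Effect": "Allow",
      "Action": "eks:DescribeCluster",
      "Resource": "arn:aws:eks:eu-west-2:111122223333:cluster/staging"
    }
  ]
}
```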

For EKS access specifically, I configured the clusters’ aws-auth ConfigMap to map the runner IAM role to a Kubernetes RBAC role with permissions limited to the CI namespaces. This meant that even a compromised runner could not access production workloads. The combination of ephemeral Spot Instances and tightly scoped IAM roles eliminated the static credential risk entirely — each runner existed for minutes, not months, and could only touch what it needed.
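
The aws-auth entry itself is a few lines. The role ARN and group name below are placeholders; the RBAC Role and RoleBinding granting that group its CI-namespace permissions are defined separately:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: aws-auth
  namespace: kube-system
data:
  mapRoles: |
    - rolearn: arn:aws:iam::111122223333:role/ci-runner
      username: ci-runner
      groups:
        - ci-deployers  # bound to Roles in the CI namespaces only
```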

Implementation

I built the Lambda function in Python, keeping it deliberately simple — under 300 lines including error handling and CloudWatch metric publishing. The function maintained no state; instance lifecycle was tracked via EC2 tags that recorded the repository, workflow run ID, and runner name.
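
Because all state lived in tags, a later event could recover its instance with a single filtered query — sketched here with the client injected so the logic can be exercised without AWS:

```python
def find_runner_instances(ec2, runner_name: str) -> list:
    """Return the IDs of live instances tagged for a given runner name.
    All lifecycle state lives in EC2 tags, so the Lambda stays stateless."""
    resp = ec2.describe_instances(
        Filters=[
            {"Name": "tag:runner", "Values": [runner_name]},
            {"Name": "instance-state-name", "Values": ["pending", "running"]},
        ]
    )
    return [
        inst["InstanceId"]
        for reservation in resp["Reservations"]
        for inst in reservation["Instances"]
    ]
```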

The GitHub App was scoped with minimal permissions: actions:read and administration:read on the organisation’s repositories. Registration, webhook secret rotation, and JWT authentication were all handled through Terraform using the GitHub provider.

I used Terragrunt to manage the deployment across our staging and production AWS accounts, with environment-specific variables controlling instance types, maximum fleet size, and idle timeout durations.

One subtlety worth noting: I implemented a small buffer of “warm” instances during business hours. Rather than scaling to zero between jobs, the Lambda function maintained two idle runners between 08:00 and 18:00 UTC on weekdays. This shaved the ninety-second cold start off the most latency-sensitive workflows without meaningfully affecting cost.
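
The warm-buffer rule itself is a few lines of clock arithmetic (the hours and count mirror those above):

```python
from datetime import datetime

WARM_RUNNERS = 2  # idle buffer held during the working day

def warm_target(now: datetime) -> int:
    """Idle runners to keep alive: two on weekdays between 08:00 and
    18:00 UTC, scale-to-zero the rest of the time."""
    if now.weekday() < 5 and 8 <= now.hour < 18:
        return WARM_RUNNERS
    return 0
```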

Handling Spot interruptions

Spot reclamation was the risk most people asked about. In practice, it was a non-issue. The diversified instance pool kept interruption rates low, and I configured the runner agent to deregister itself gracefully on receiving a termination notice via the EC2 metadata endpoint. GitHub Actions would automatically requeue the job, and the Lambda would launch a replacement.
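
The termination notice surfaces as an instance-metadata path that returns 404 until EC2 schedules a reclaim, so the agent-side check reduces to a probe like the one below. This is a sketch: IMDSv2 token handling is omitted for brevity, the URL is parameterised purely for testability, and the service unit name is a placeholder (real runner services are named per org/repo/runner):

```python
import subprocess
import urllib.error
import urllib.request

SPOT_ACTION_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"

def interruption_pending(url: str = SPOT_ACTION_URL) -> bool:
    """True once EC2 has scheduled this Spot Instance for reclaim.
    The path returns 404 (an HTTPError here) until a notice is issued."""
    try:
        with urllib.request.urlopen(url, timeout=2):
            return True
    except (urllib.error.URLError, OSError):
        return False

def drain_runner() -> None:
    """Stop the runner service so it deregisters gracefully and GitHub
    requeues the job on a fresh runner."""
    subprocess.run(["systemctl", "stop", "actions.runner.example.service"],
                   check=False)
```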

Over six months of production use, we observed a total of eleven Spot interruptions across thousands of job executions. Every interrupted job completed successfully on retry.

Results

The numbers spoke for themselves:

  • 85% reduction in monthly CI compute costs. We went from a fixed reserved-instance bill to a usage-based Spot model. Weekend and overnight spend dropped to near zero.
  • 3x peak throughput. The fleet could scale to twenty-four concurrent runners during the morning burst, up from eight.
  • 90-second median queue time. Down from eight to fifteen minutes during peak hours.
  • Zero static credentials. IAM roles and ephemeral instances eliminated the long-lived access key risk entirely.
  • Zero maintenance overhead. No patching, no capacity planning, no on-call for stuck runners. The Lambda and API Gateway required no operational attention.

The project took approximately six weeks from proposal to production, including the Packer pipeline, Terraform modules, Lambda function, GitHub App registration, IAM role configuration, and rollout across all repositories.

Reflections

The biggest lesson was that CI infrastructure does not need to be complicated. The entire system comprised a Lambda function, an API Gateway, a launch template, and a Packer build — all stateless, all idempotent, all managed through infrastructure as code. There was no Kubernetes, no queue service, no database.

Sometimes the best platform engineering is choosing not to add complexity.