← All Case Studies

Eliminating a 500-Job Queue Bottleneck in CI/CD

When I joined the Open Banking programme at Lloyds Bank through Publicis Sapient in early 2019, the engineering organisation was facing a problem that was quietly undermining delivery velocity. The Jenkins CI cluster had a persistent queue of 500 to 700 pending jobs during core hours. Developers were waiting hours for feedback. Squads were scheduling builds around each other. And the programme’s ambitious PSD2 regulatory deadlines were starting to look precarious.

The root cause wasn’t a single failure — it was an accumulation of infrastructure decisions that hadn’t scaled with the organisation’s growth, compounded by a source control migration that had introduced its own complications.

The scale of the problem

The Open Banking initiative required dozens of squads to deliver API-driven banking services under strict regulatory timelines set by the FCA. The Jenkins infrastructure was serving all of them from a shared cluster, and the numbers told the story clearly:

  • A persistent backlog of 500-700 pending jobs during peak hours
  • Average wait times exceeding two hours, with outliers hitting four
  • Selenium-based end-to-end test suites monopolising executors for 40-60 minutes per run
  • An ongoing migration from Gerrit to GitHub Enterprise that had left webhook configurations in an inconsistent state

The organisation had recently moved from Gerrit to GitHub Enterprise for source control. This migration was necessary — Gerrit’s review model was adding friction to the delivery workflow — but the webhook configurations carried over from the old system were triggering builds indiscriminately. Every push event, including draft pull requests, work-in-progress commits, and branch housekeeping, spawned a Jenkins job. During active development hours, this generated hundreds of unnecessary builds that consumed executor capacity for work that would never be merged in that state.

Underneath the Jenkins layer sat an Apache Mesos cluster that was handling workload scheduling. The Mesos setup had been provisioned for a substantially smaller workload and hadn’t been re-evaluated as the programme grew. Resource allocation was static, and there was no dynamic scaling to absorb peak demand.

Diagnosing the bottleneck

Before proposing changes, I spent the first week instrumenting the pipeline to understand exactly where time and resources were being consumed. I collected queue metrics from the Jenkins API, mapped executor utilisation by job type, and traced the webhook event flow from GitHub Enterprise through to job execution.
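To give a flavour of the instrumentation, here is a minimal sketch of the kind of queue summary that can be pulled from Jenkins. The `/queue/api/json` endpoint and the `inQueueSince` millisecond timestamp are part of Jenkins’ standard remote-access API; the function below is illustrative, not the production script, and the fetching step (e.g. `requests.get(f"{JENKINS_URL}/queue/api/json", auth=...)`) is left out so the sketch stays self-contained:

```python
import time

def summarise_queue(queue_json, now_ms=None):
    """Summarise a parsed Jenkins /queue/api/json payload.

    Each queue item carries an 'inQueueSince' field: the epoch
    millisecond timestamp at which the job entered the queue.
    Returns (queue_depth, average_wait_minutes).
    """
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    items = queue_json.get("items", [])
    if not items:
        return 0, 0.0
    waits = [(now_ms - item["inQueueSince"]) / 60000.0 for item in items]
    return len(items), sum(waits) / len(waits)
```

Sampling this every few minutes and bucketing by job name is enough to chart queue depth and wait times over a working day.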

Three findings shaped the approach:

Selenium suites were starving the cluster. End-to-end tests ran on the same executor pool as compilation and unit test jobs. A single Selenium run would claim an executor for 45 minutes, and during peak hours, dozens of these would stack up simultaneously. This accounted for roughly 60% of the queue depth.

The Gerrit-to-GitHub migration left webhook hygiene issues. The migrated webhook configurations were triggering builds on events that should never have produced jobs. GitHub Enterprise’s richer event model compounded the problem — it fires more webhook event types than Gerrit did — so build volume had roughly doubled post-migration without a corresponding increase in meaningful work.

Mesos resource allocation was static and undersized. The cluster couldn’t dynamically respond to demand, so capacity was effectively fixed at whatever had been provisioned months earlier.

The approach

Rather than simply requesting more hardware — which would have addressed the symptom but not the architecture — I proposed a series of targeted interventions.

Zalenium on Kubernetes for Selenium testing. This was the most impactful change. I deployed Zalenium, a Docker-based Selenium Grid implementation, on a dedicated Kubernetes cluster managed with Helm. Zalenium dynamically provisions browser containers on demand — spinning up anywhere from 50 to 100 concurrent browser instances depending on queue depth — and tears them down when tests complete. This completely separated the Selenium workload from the general-purpose Jenkins executors. Tests that previously queued for an executor now had their own elastic infrastructure that scaled with demand.

The Kubernetes deployment also gave us capabilities the old static grid lacked: video recording of test runs for debugging, live preview of running tests, and automatic cleanup of stale sessions. The Helm charts made the deployment reproducible and version-controlled.
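For squads, switching to the grid mostly meant pointing `webdriver.Remote` at the Zalenium hub URL and passing a couple of extra capabilities. The sketch below is illustrative: `recordVideo` and `name` follow Zalenium’s documented capability conventions for per-test video recording and dashboard labelling, but the helper function and any hostname you substitute in are hypothetical, not our actual configuration:

```python
def zalenium_capabilities(test_name, record_video=True):
    """Build desired capabilities for a Chrome session on a Zalenium grid.

    'recordVideo' and 'name' are Zalenium-specific capabilities:
    the first toggles video capture of the session, the second
    labels the run on the Zalenium dashboard.
    """
    return {
        "browserName": "chrome",
        "recordVideo": record_video,
        "name": test_name,
    }

# Hypothetical usage against a grid hub (requires a running grid):
#   from selenium import webdriver
#   driver = webdriver.Remote(
#       command_executor="http://<zalenium-host>:4444/wd/hub",
#       desired_capabilities=zalenium_capabilities("login-smoke-test"),
#   )
```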

Automated GitHub webhook filtering. I reworked the webhook integration between GitHub Enterprise and Jenkins to apply intelligent filtering. The automation inspected incoming webhook payloads and applied rules: skip builds for draft pull requests, debounce rapid pushes to the same branch (only triggering a build on the latest commit after a quiet period), and deprioritise builds for branches without open, review-ready pull requests. This filtering reduced the daily build volume by approximately 35%, directly reclaiming executor capacity for meaningful work.
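A stripped-down sketch of that filtering logic follows. The payload fields (`draft`, `deleted`, `repository.full_name`, `ref`, `after`) are standard GitHub webhook fields; everything else — the function names, the in-memory debounce table, the two-minute quiet period — is illustrative of the approach rather than the production automation, which handled more event types and persisted its state:

```python
import time

QUIET_PERIOD_S = 120  # illustrative debounce window, not the production value

# (repo_full_name, ref) -> (timestamp_of_last_push, head_sha)
_last_push = {}

def should_build(event, payload, now=None):
    """Decide whether a GitHub webhook event warrants an immediate build."""
    if now is None:
        now = time.time()
    if event == "pull_request":
        # Skip draft PRs entirely; build only review-ready work.
        return not payload["pull_request"].get("draft", False)
    if event == "push":
        if payload.get("deleted"):
            return False  # branch deletion / housekeeping: never build
        # Debounce: record the latest SHA, build it later via due_builds().
        key = (payload["repository"]["full_name"], payload["ref"])
        _last_push[key] = (now, payload["after"])
        return False
    return False  # ignore all other event types

def due_builds(now=None):
    """Return (repo, ref, sha) tuples whose quiet period has elapsed."""
    if now is None:
        now = time.time()
    due = []
    for key, (ts, sha) in list(_last_push.items()):
        if now - ts >= QUIET_PERIOD_S:
            due.append((key[0], key[1], sha))
            del _last_push[key]
    return due
```

The effect of the debounce is that a burst of five pushes to the same branch produces one build of the final commit, not five builds of intermediate states.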

Mesos cluster right-sizing and tuning. With the Selenium workload removed and webhook noise reduced, I worked with the infrastructure team to re-evaluate the Mesos resource allocation. We adjusted CPU and memory reservations to match actual workload profiles, improved task scheduling to reduce resource fragmentation, and established monitoring dashboards so the team could track utilisation trends going forward.
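The right-sizing exercise reduced to simple arithmetic over utilisation samples: reserve for observed peak demand plus headroom, rather than a months-old guess. A hedged sketch of the idea — the 95th percentile and the 20% headroom factor here are illustrative choices, not the figures we actually used:

```python
def recommend_reservation(samples, headroom=1.2):
    """Suggest a resource reservation from observed utilisation samples.

    Takes the 95th-percentile sample (so one-off spikes do not dominate)
    and multiplies by a headroom factor. 'samples' must be non-empty,
    e.g. per-minute CPU readings for one job class over a week.
    """
    ordered = sorted(samples)
    idx = min(len(ordered) - 1, int(0.95 * len(ordered)))
    return ordered[idx] * headroom
```

Running this per job class against a week of metrics gives a defensible starting point for CPU and memory reservations, which the monitoring dashboards then keep honest over time.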

Measured results

We tracked metrics for four weeks after the changes were fully deployed:

  • Queue depth dropped from 500-700 pending jobs to under 20 during peak hours
  • Average wait time fell from over two hours to under five minutes
  • Selenium test throughput increased by approximately 40% thanks to Zalenium’s parallel browser provisioning
  • Daily build count decreased by 35% through webhook filtering, with no reduction in meaningful test coverage

The most significant outcome was harder to quantify: the return of developer flow. Teams stopped scheduling builds around each other. Engineers could push a change, get a result, and iterate — the way continuous integration is supposed to work. The feedback loop from commit to test result tightened from hours to minutes.

Lessons learned

This engagement reinforced a pattern I’ve encountered repeatedly: CI/CD infrastructure is provisioned once for a certain scale and then neglected as teams and codebases grow. The bottleneck at Lloyds didn’t appear overnight. It accumulated gradually, and each squad experienced it as “Jenkins is slow” without visibility into the systemic causes — the webhook noise, the Selenium contention, the static Mesos allocation.

The Gerrit-to-GitHub migration is worth highlighting specifically. Source control migrations are often treated as straightforward, but the downstream effects on CI/CD pipelines can be substantial. Webhook semantics differ between platforms, event volumes change, and configurations that worked adequately on the old system may behave quite differently on the new one. Post-migration CI hygiene deserves its own attention.

Regulatory programmes like Open Banking have hard deadlines. You cannot negotiate an extension with the FCA because your build queue is too long. Making the delivery pipeline fast and reliable was as critical to the programme’s success as any feature the teams were building.