← All Case Studies

Automating Infrastructure for the 100,000 Genomes Project

In 2016, I joined Genomics England to work on the infrastructure underpinning the 100,000 Genomes Project — a landmark initiative by NHS England to sequence 100,000 whole genomes from patients with rare diseases and cancer. The project’s ambition was extraordinary: build a genomic medicine service for the NHS that could transform diagnosis and treatment for patients across the country.

The infrastructure challenge matched the ambition. We were managing over 300 virtual machines, including high-performance computing clusters, spanning both on-premises VMware and AWS environments. My role was to bring modern automation practices to an environment where the stakes — both computational and human — were exceptionally high.

The infrastructure landscape

Genomics England’s environment was unlike anything I’d encountered in typical enterprise or startup settings. The data volumes were staggering: a single whole genome sequence produces roughly 100 GB of raw data, and the analytical pipelines that process it generate several times that in intermediate files. Multiply that by tens of thousands of genomes, and you’re working with petabytes of data flowing through computational pipelines that could run for hours per sample.
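The back-of-the-envelope arithmetic is worth making explicit. The multiplier for intermediate files below is an assumption standing in for "several times" the raw size; the other figures come from the text:

```python
# Storage estimate from the figures above. INTERMEDIATE_MULTIPLIER is an
# assumed stand-in for "several times" the raw data volume.
RAW_GB_PER_GENOME = 100          # ~100 GB of raw data per whole genome
INTERMEDIATE_MULTIPLIER = 4      # assumed pipeline intermediate expansion
GENOMES = 100_000

raw_pb = GENOMES * RAW_GB_PER_GENOME / 1_000_000       # decimal petabytes
total_pb = raw_pb * (1 + INTERMEDIATE_MULTIPLIER)
# raw data alone is 10 PB; with intermediates, tens of petabytes
```

Even the raw data, before any processing, lands in double-digit petabytes — which is why storage and I/O throughput dominated so many infrastructure decisions.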

The 300+ VM estate comprised:

  • High-performance compute clusters for bioinformatics pipelines — alignment, variant calling, annotation — running workloads that demanded significant CPU, memory, and I/O throughput
  • Application and middleware servers supporting the research portal, data access systems, and internal tooling
  • Infrastructure services — DNS, LDAP, monitoring, logging, build systems
  • Development and staging environments mirroring production for pipeline validation

The hybrid deployment model added complexity. Some workloads had to run on-premises on VMware vSphere due to data sovereignty and NHS governance requirements — patient genomic data couldn’t simply be moved to the cloud. Other workloads benefited from AWS’s elastic capacity for burst computing. Data movement between environments was governed by strict security policies with comprehensive audit trails.

When I arrived, much of this was manually provisioned. New VMs were built by hand, configuration was applied through a mix of scripts and ad-hoc steps, and spinning up a new analytical environment took days. For a project on a national timeline, that bottleneck demanded urgent attention.

Automated image building and configuration management

Packer for cross-platform images. I introduced HashiCorp Packer to build machine images for both VMware (generating templates) and AWS (generating AMIs) from a single set of definitions. Every image was version-controlled, automatically tested post-build, and identical across environments. This was critical for the bioinformatics toolchain — genomic analysis tools are notoriously sensitive to library versions and configuration parameters, and a subtle difference between two compute nodes could produce different analytical results. In a clinical context, that’s unacceptable.
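A minimal sketch of such a dual-target template, in the JSON format Packer used at the time. Everything here — the AMI, ISO, script paths, and names — is illustrative, not the actual project configuration:

```json
{
  "builders": [
    {
      "type": "amazon-ebs",
      "region": "eu-west-2",
      "instance_type": "m4.large",
      "source_ami": "ami-xxxxxxxx",
      "ssh_username": "centos",
      "ami_name": "compute-base-{{timestamp}}"
    },
    {
      "type": "vmware-iso",
      "iso_url": "iso/CentOS-7-x86_64-Minimal.iso",
      "iso_checksum_type": "sha256",
      "iso_checksum": "0000000000000000000000000000000000000000000000000000000000000000",
      "ssh_username": "root",
      "output_directory": "output-vmware"
    }
  ],
  "provisioners": [
    {
      "type": "shell",
      "scripts": ["scripts/base-packages.sh", "scripts/harden.sh"]
    }
  ]
}
```

The key property is that both builders share the same provisioners, so the same scripts configure the AMI and the VMware template from one definition.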

SaltStack for ongoing configuration management. On top of the base images, I deployed SaltStack across the entire estate. Salt’s master-minion architecture coordinated configuration across both VMware VMs and AWS instances from a single control plane. The Salt states managed everything from bioinformatics tool installation and version pinning (BWA, GATK, Samtools, and dozens of others) to security hardening, CIS benchmark compliance, monitoring agent deployment, and user access management.
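A Salt state for tool version pinning might look roughly like this — the tools are those named above, but the versions and file paths are illustrative:

```yaml
# /srv/salt/bioinformatics/init.sls -- illustrative state file
bwa:
  pkg.installed:
    - version: '0.7.15'     # pinned: analyses must be reproducible

samtools:
  pkg.installed:
    - version: '1.3.1'

gatk_jar:
  file.managed:
    - name: /opt/gatk/GenomeAnalysisTK.jar
    - source: salt://bioinformatics/files/GenomeAnalysisTK.jar
    - mode: '0644'
```

Pinning at the state level means a node that drifts — say, a manual `yum update` — is pulled back to the declared version on the next highstate run.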

Salt’s pillar system allowed environment-specific configuration — VMware vs. AWS, development vs. production, different analytical pipeline requirements — without duplicating state definitions. A single source of truth for configuration, with environment-specific parameters injected at apply time.
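In practice that looks like a pillar top file targeting minions by grains — the grain names and pillar files here are hypothetical:

```yaml
# /srv/pillar/top.sls -- illustrative; grain names are assumptions
base:
  'G@env:production and G@platform:vmware':
    - match: compound
    - production.vmware
  'G@env:production and G@platform:aws':
    - match: compound
    - production.aws
  'G@env:development':
    - match: compound
    - development
```

The state files then reference `pillar['...']` values, so the same state applies everywhere while the parameters differ per environment.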

Jenkins for build automation. I set up Jenkins pipelines to automate the image build process, configuration testing, and deployment workflows. Packer builds were triggered by Git commits, validated automatically, and promoted through staging before reaching production. This CI/CD approach to infrastructure — treating VM images and configurations with the same rigour as application code — was relatively novel in this context and dramatically reduced the risk of infrastructure changes.
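A sketch of such a pipeline in Jenkins declarative syntax — stage names, template paths, and helper scripts are all assumptions for illustration:

```groovy
// Illustrative Jenkinsfile; paths and scripts are hypothetical.
pipeline {
    agent any
    stages {
        stage('Validate') {
            steps { sh 'packer validate templates/compute-base.json' }
        }
        stage('Build images') {
            steps { sh 'packer build templates/compute-base.json' }
        }
        stage('Test image') {
            steps { sh 'tests/verify-image.sh' }  // boot image, check tool versions
        }
        stage('Promote to staging') {
            when { branch 'master' }
            steps { sh 'scripts/promote.sh staging' }
        }
    }
}
```

Gating promotion on the branch means feature branches get build-and-test feedback, but only merged changes reach staging.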

Cloud cost optimisation

One of the first things I did upon arriving was audit the AWS spend. The findings were eye-opening. I identified instances running around the clock that were only needed during business hours, instance types that had been sized months earlier and were now far larger than their workloads required, and orphaned resources from previous experiments that nobody had cleaned up.

Within the first week, I implemented a set of straightforward cost optimisations:

  • Right-sizing instances based on actual utilisation metrics rather than original estimates
  • Scheduling non-production environments to shut down outside business hours
  • Cleaning up orphaned resources — unused EBS volumes, old snapshots, unattached Elastic IPs
  • Reserved Instances for the baseline HPC workload that ran continuously
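The scheduling rule reduces to a small decision function. This is a sketch, not the actual implementation — the tag name and business hours are assumed values:

```python
from datetime import datetime, time

# Hypothetical policy: production always runs; non-production runs
# only during business hours on weekdays. Hours are illustrative.
BUSINESS_START = time(8, 0)
BUSINESS_END = time(19, 0)

def should_be_running(env_tag: str, now: datetime) -> bool:
    """Decide whether an instance tagged env_tag should be up at `now`."""
    if env_tag == "production":
        return True
    if now.weekday() >= 5:   # Saturday or Sunday
        return False
    return BUSINESS_START <= now.time() < BUSINESS_END
```

A scheduled job can then walk the instance list, compare each instance's actual state to this function's verdict, and start or stop accordingly.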

These changes delivered a cloud-cost reduction of over 10% in the first week, and the savings compounded as we expanded the practices. Terraform codified the scheduling and cleanup policies, making them persistent rather than one-off fixes.
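As an illustration, a rule like "stop non-production instances at 19:00 on weekdays" can be expressed in Terraform as a CloudWatch Events rule driving a stop function. The resource names are hypothetical and the Lambda itself is assumed to be defined elsewhere:

```hcl
# Illustrative sketch only; aws_lambda_function.stop_instances is assumed.
resource "aws_cloudwatch_event_rule" "stop_nonprod" {
  name                = "stop-nonprod-out-of-hours"
  schedule_expression = "cron(0 19 ? * MON-FRI *)" # 19:00 UTC, weekdays
}

resource "aws_cloudwatch_event_target" "stop_nonprod" {
  rule = aws_cloudwatch_event_rule.stop_nonprod.name
  arn  = aws_lambda_function.stop_instances.arn # hypothetical function
}
```

Because the schedule lives in version control rather than in someone's head, it survives team changes and is reviewed like any other infrastructure change.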

Atlassian suite migration and ELK deployment

Beyond the core infrastructure automation, I led two additional initiatives that improved the team’s operational capabilities.

Atlassian suite migration. The team had been using a fragmented collection of tools for project management, documentation, and issue tracking. I migrated them to a standardised Atlassian stack — Jira for project management, Confluence for documentation, and Bitbucket for source control — giving the team consistent tooling and better visibility into work in progress.

ELK stack for centralised logging. I deployed Elasticsearch, Logstash, and Kibana to aggregate logs from across the 300+ VM estate. Before this, troubleshooting meant SSH-ing into individual machines and tailing log files — a process that was slow at best and impossible at scale. The ELK deployment gave the team searchable, centralised logs with dashboards for error rates, pipeline execution times, and system health metrics.
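The ingest side of such a deployment can be sketched as a Logstash pipeline; the port, hostname, and grok pattern below are illustrative, not the actual configuration:

```conf
input {
  beats {
    port => 5044                               # logs shipped from each VM
  }
}
filter {
  grok {
    match => { "message" => "%{SYSLOGLINE}" }  # parse standard syslog lines
  }
}
output {
  elasticsearch {
    hosts => ["elasticsearch.internal:9200"]   # hypothetical hostname
    index => "logs-%{+YYYY.MM.dd}"             # daily indices for retention
  }
}
```

Daily indices make retention and deletion cheap — old days are dropped as whole indices rather than by querying and deleting documents.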

Results and impact

Over the course of my engagement, the infrastructure automation work delivered measurable improvements:

  • New compute environment provisioning dropped from days to under an hour — fully configured, validated environments built from version-controlled definitions
  • Configuration drift was effectively eliminated across 300+ VMs through SaltStack enforcement
  • Cloud costs reduced by over 10% in the first week, with ongoing savings from automated scheduling and right-sizing
  • Centralised logging and monitoring replaced manual troubleshooting across the entire estate
  • Standardised tooling improved team collaboration and project visibility

Beyond the metrics, the work contributed to a programme with genuine human impact. The 100,000 Genomes Project has enabled diagnoses for patients with rare diseases who had spent years without answers. It has informed cancer treatment decisions. When a misconfigured compute node doesn’t just mean a failed build but could affect the accuracy of a patient’s genomic analysis, you approach infrastructure automation with a different level of rigour. That context shaped how I think about infrastructure work to this day.