Zero-Downtime Releases for a £4 Billion Grocery Platform
Tesco’s online grocery platform generates approximately £4 billion in annual revenue, serving millions of customers across mobile and web applications. When I joined the platform engineering team in 2017, the release process was a source of persistent anxiety. Deployments were infrequent, manually coordinated, and carried a non-trivial risk of customer-facing disruption. For a platform of this scale, where even brief outages translate directly into lost revenue and degraded customer trust, the status quo was untenable.
My objective was to make weekly deployments routine and safe — zero customer impact, no maintenance windows, no late-night heroics.
The deployment landscape
The grocery platform was a large-scale architecture running on AWS, with a mix of services deployed across EC2 instances and ECS containers. The platform served everything from product browsing and basket management to payment processing and delivery slot allocation. Connectivity back to Tesco’s on-premises systems ran through AWS Direct Connect, adding another layer of complexity to the deployment picture.
The services spanned both traditional EC2-based deployments (long-running services with persistent state) and containerised ECS workloads (stateless API services). Lambda functions handled event-driven operations — webhook processing, notification dispatch, and various integration tasks. With targets this heterogeneous, no single deployment strategy could work across the board.
The specific challenges were:
Inconsistent deployment approaches. Different teams had evolved different deployment methods depending on their service’s hosting model. EC2 services used one approach, ECS another, Lambda a third. There was no unified process, so each deployment carried its own risks.
Health check gaps. Services were registered with load balancers as soon as they started, but many needed 30-45 seconds to fully initialise — warming caches, establishing database connections, loading configuration from Consul. During that window, requests routed to the new instances would fail or respond slowly.
No automated rollback. When something went wrong, the rollback process was manual and varied by service type. Someone had to identify the problem, make the call, and execute the rollback — all while customers experienced degraded service.
Tight coupling with on-premises systems via Direct Connect. Deployments that affected services communicating over Direct Connect required additional care. A badly timed deployment could disrupt the link between AWS-hosted services and on-premises systems, with knock-on effects across the platform.
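The health check gap above is worth making concrete. A minimal sketch of a "deep" readiness check — component names and the check functions are illustrative, not the platform's actual code — looks like this: the instance only reports healthy once every dependency check passes, so the load balancer keeps traffic on already-warm instances during the 30-45 second initialisation window.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

@dataclass
class CheckResult:
    name: str
    healthy: bool

def deep_health(checks: Dict[str, Callable[[], bool]]) -> Tuple[bool, List[CheckResult]]:
    # Run every component check; the instance reports ready only when all pass.
    # Until then the health endpoint would return 503, so the load balancer
    # does not route requests to an instance that is still warming up.
    results = [CheckResult(name, fn()) for name, fn in checks.items()]
    return all(r.healthy for r in results), results

# Simulated instance mid-startup: database and downstream are up,
# but the cache is still warming, so the instance is not yet ready.
ready, results = deep_health({
    "database": lambda: True,
    "cache_warm": lambda: False,      # simulated: warm-up not finished
    "downstream_api": lambda: True,
})
```

Registering the instance only after this composite check passes is what closes the failure window described above.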
Designing the deployment strategy
I built a unified deployment framework that accommodated all three hosting models while providing consistent zero-downtime guarantees.
Blue-green for ECS services. Each containerised service was configured with two ECS services behind separate Application Load Balancer target groups. Deployments went to the inactive environment first. Once all tasks passed deep health checks — not just HTTP 200 responses, but verification of database connectivity, cache readiness, and downstream dependency availability — traffic was switched atomically via an ALB listener rule update. The old environment remained running as an instant rollback target.
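The cutover decision can be sketched as a single pure function — target group names are placeholders, and in production the returned value would feed one `elbv2` `modify_listener` call, which is what makes the switch atomic:

```python
def blue_green_cutover(active_tg: str, inactive_tg: str, inactive_healthy: bool) -> str:
    # Decide which target group the ALB listener's default action should
    # point at. Traffic moves only when every task in the inactive environment
    # has passed the deep health checks; otherwise it stays where it is and
    # the old environment remains the instant rollback target.
    return inactive_tg if inactive_healthy else active_tg

# Green environment fully healthy -> promote it; otherwise blue keeps traffic.
new_target = blue_green_cutover("tg-blue", "tg-green", inactive_healthy=True)
held_target = blue_green_cutover("tg-blue", "tg-green", inactive_healthy=False)
```

Keeping the decision this small is deliberate: the only mutable step is the listener update, so rollback is the same operation with the arguments reversed.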
Rolling deployments with connection draining for EC2. For EC2-based services where blue-green wasn’t practical (due to stateful components or licensing constraints), I implemented rolling deployments with proper connection draining. New instances were added to the auto-scaling group, health-checked thoroughly, and registered with the load balancer. Only after the new instances were fully serving traffic did the old instances begin draining connections and terminating. The sequencing ensured no in-flight requests were dropped.
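The sequencing guarantee is the important part, and it can be expressed as an ordered plan — instance IDs here are invented for illustration. Every new instance is launched, health-checked, and registered before any old instance starts draining:

```python
def rolling_plan(old_instances, new_instances):
    # Build the ordered step list for a rolling deployment with connection
    # draining: no "drain" step may appear before the last "register" step,
    # so in-flight requests on old instances are never dropped.
    steps = []
    for inst in new_instances:
        steps += [("launch", inst), ("health_check", inst), ("register", inst)]
    for inst in old_instances:
        steps += [("drain", inst), ("terminate", inst)]
    return steps

plan = rolling_plan(["i-old1", "i-old2"], ["i-new1", "i-new2"])
```

An orchestrator executing this plan in order preserves the invariant by construction, rather than relying on operators to remember the sequence.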
Lambda versioning with aliases. Lambda functions were deployed as new versions, with a production alias pointing to the current version. Cutover was an alias update — atomic and instantly reversible. For functions involved in Direct Connect-dependent workflows, I added a canary deployment step that routed a small percentage of invocations to the new version before full cutover.
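A sketch of the canary step, under the assumption that the real calls were Lambda's `update_alias` with a weighted `RoutingConfig` — version numbers and the 5% weight below are illustrative:

```python
def canary_config(stable_version: str, candidate_version: str, weight: float) -> dict:
    # Weighted-alias payload for a Lambda canary: the alias keeps pointing at
    # the stable version while `weight` (a fraction, e.g. 0.05 = 5%) of
    # invocations go to the candidate version.
    if not 0.0 <= weight < 1.0:
        raise ValueError("canary weight must be in [0, 1)")
    return {
        "FunctionVersion": stable_version,
        "RoutingConfig": {"AdditionalVersionWeights": {candidate_version: weight}},
    }

def full_cutover(candidate_version: str) -> dict:
    # Atomic cutover: point the alias at the new version and clear the weights.
    # Rollback is the same call with the previous version number.
    return {"FunctionVersion": candidate_version,
            "RoutingConfig": {"AdditionalVersionWeights": {}}}

canary = canary_config("41", "42", 0.05)   # 5% of traffic to version 42
promote = full_cutover("42")
```

Because both steps are single alias updates, the canary and the final cutover share the same instant-rollback property.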
Automated rollback with CloudWatch triggers. Across all three models, I implemented automated rollback logic. A Lambda-based orchestrator monitored error rates and response times from CloudWatch for a five-minute observation window after each deployment. If metrics exceeded defined thresholds, the rollback executed automatically — reverting the ALB listener for ECS services, reattaching old instances for EC2 services, or flipping the Lambda alias back. No human decision required.
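The rollback decision itself reduces to a threshold check over the observation window. A minimal sketch — the thresholds below are illustrative, not the tuned production values, and the inputs stand in for one-minute CloudWatch samples:

```python
def should_roll_back(error_rates, p99_latencies_ms,
                     max_error_rate=0.01, max_p99_ms=800):
    # Evaluate per-minute samples from the five-minute post-deployment window.
    # Any breach of either threshold triggers the automated rollback path
    # (ALB listener revert, instance reattach, or alias flip).
    return (any(rate > max_error_rate for rate in error_rates)
            or any(p99 > max_p99_ms for p99 in p99_latencies_ms))

# A clean window vs. one with an error-rate spike in minute three.
clean = should_roll_back([0.001] * 5, [240, 250, 245, 260, 255])
spiked = should_roll_back([0.001, 0.002, 0.08, 0.001, 0.001], [240] * 5)
```

Encoding the decision as a pure function also made it trivially testable, which matters when the function's output is an unattended production rollback.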
Handling Direct Connect dependencies
The Direct Connect link between AWS and Tesco’s on-premises infrastructure was effectively a shared resource that multiple services depended on. Deployments that affected services communicating over this link required coordination to avoid disrupting critical data flows.
I established deployment guardrails for Direct Connect-dependent services: deployments were staged so that only one service communicating over the link was updated at a time, with health validation of the Direct Connect-facing endpoints between each stage. This added a few minutes to the deployment sequence for these specific services, but eliminated the risk of a cascading failure across the on-premises integration layer.
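The staging guardrail can be sketched as a serialised loop with validation between stages — the service names and the simulated link fault below are hypothetical:

```python
def staged_deploy(services, deploy, validate_link):
    # Serialise Direct Connect-dependent deployments: update one service at a
    # time and validate the link-facing endpoints before moving to the next.
    # On a failed validation the sequence halts, limiting any disruption to
    # a single service rather than cascading across the integration layer.
    deployed = []
    for svc in services:
        deploy(svc)
        if not validate_link(svc):
            return deployed, svc   # halted: this service failed validation
        deployed.append(svc)
    return deployed, None

deploy_log = []
done, failed_at = staged_deploy(
    ["orders-sync", "stock-feed", "pricing-bridge"],   # hypothetical services
    deploy=deploy_log.append,
    validate_link=lambda svc: svc != "stock-feed",     # simulated link fault
)
```

Halting at the first failed validation is what turns a potential platform-wide Direct Connect incident into a single-service rollback.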
Results in production
We rolled the framework out progressively, starting with the highest-traffic ECS services and extending to EC2 and Lambda workloads over several weeks.
- Weekly deployments became routine across mobile and web applications, up from roughly monthly
- Zero customer-facing downtime during deployments — confirmed across hundreds of releases during the observation period
- Mean time to rollback dropped from 30+ minutes to under 15 seconds for ECS services, under a minute for EC2
- Deployment-related incidents dropped to zero in the quarter following full rollout
- Developer confidence increased measurably — in retrospective surveys, deployment anxiety went from a top frustration to a non-issue
The most telling indicator was cultural. Engineers started describing deployments as “boring.” For a platform processing £4 billion in annual revenue, boring deployments are the highest compliment infrastructure can receive.
Takeaways
Tesco taught me that zero-downtime deployment at scale isn’t a single technique — it’s a framework that adapts to the realities of a heterogeneous environment. The platform ran EC2, ECS, and Lambda workloads with Direct Connect dependencies. A one-size-fits-all approach would have been inadequate. The key was establishing consistent principles — deep health checks, automated rollback, atomic cutover — while adapting the implementation to each hosting model.
For a platform serving millions of customers across mobile and web, making deployments unremarkable was one of the highest-value infrastructure investments the team made. When releases are safe, teams release more frequently. When they release more frequently, each release is smaller. Smaller releases are easier to reason about, easier to test, and easier to roll back. That virtuous cycle transformed the team’s relationship with shipping software.