← All Case Studies

Modernising 2,500 Nomad Jobs into 300 IaC Files

When I joined the Apple Pay platform team, the scale of the Nomad deployment was impressive. Thousands of microservices orchestrated by HashiCorp Nomad, running across a substantial fleet of bare-metal servers spread across multiple data centres. What was less impressive was how those services were managed: 2,500 individual Nomad job specification files, maintained by hand, deployed through unvalidated shell scripts and manual nomad job run commands.

The result was what you would expect. Configuration drift between the five environments — development, integration, staging, pre-production, and production. Deployments that worked in staging but failed in production because someone forgot to update an environment variable. No audit trail beyond “who ran this command last.” A single deployment taking up to twenty-five minutes because the scripts ran sequentially, environment by environment, with manual verification steps in between.

The team knew it was a problem. What they needed was someone to design a path forward and execute on it without disrupting a live, revenue-critical payment processing platform.

Understanding the scale

Before proposing a solution, I spent two weeks cataloguing the existing Nomad job definitions. The 2,500 figure was the total across all five environments and multiple data centres — but many were variations of the same service with different parameters. After analysis, I identified approximately 300 distinct service definitions, each deployed to multiple environments with environment-specific configuration.

This was the key insight: the problem was not 2,500 unique services. It was 300 services with poor configuration management, multiplied across environments and data centres through copy-paste.

I also catalogued the common patterns. Most jobs fell into a handful of archetypes: long-running HTTP services, batch processors, periodic jobs (cron-style), and system services (one per node). Each archetype shared structural similarities — resource allocations, health check patterns, logging configuration, network modes — with service-specific differences limited to the container image, environment variables, and scaling parameters.

The Terraform and Terragrunt approach

I chose Terraform with the HashiCorp Nomad provider as the declaration layer, and Terragrunt as the orchestration and DRY (Don’t Repeat Yourself) layer. The rationale was pragmatic: the team already used Terraform for other infrastructure, so extending it to Nomad management kept the tooling surface area small.

The architecture was structured as follows:

Terraform modules for each job archetype. I wrote four base modules — http-service, batch-processor, periodic-job, and system-service — each encapsulating the Nomad job specification as a Terraform resource with parameterised inputs for image, resources, scaling, environment variables, and health checks.
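To make the archetype idea concrete, here is a sketch of what such a module's interface could look like. The variable names, the template filename, and the jobspec template itself are illustrative assumptions, not the production module; the nomad_job resource and templatefile function are the real provider and Terraform primitives involved.

```hcl
# modules/http-service/main.tf -- illustrative sketch of an archetype interface
variable "service_name" { type = string }
variable "datacenters"  { type = list(string) }
variable "image"        { type = string }

variable "instance_count" {
  type    = number
  default = 2
}

variable "env_vars" {
  type    = map(string)
  default = {}
}

resource "nomad_job" "this" {
  # Render a templated Nomad jobspec with the parameterised inputs.
  # Shared structure (health checks, logging, network mode) lives in the
  # template; callers only supply the service-specific values.
  jobspec = templatefile("${path.module}/jobspec.nomad.tpl", {
    service_name   = var.service_name
    datacenters    = var.datacenters
    image          = var.image
    instance_count = var.instance_count
    env_vars       = var.env_vars
  })
}
```

Keeping the jobspec inside the module is what enforced the shared structure: a service could not opt out of health checks or logging configuration, because those were not exposed as inputs.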

Terragrunt configurations for each service-environment-datacentre combination. A single terragrunt.hcl file defined a service’s deployment in a given environment and data centre, referencing the appropriate Terraform module and providing the environment-specific inputs. Shared values (service name, team ownership, default resource allocations) were defined in parent Terragrunt configurations and inherited through the directory hierarchy.

The directory structure looked like this:

infrastructure/
  _envcommon/
    http-service.hcl       # Shared defaults for HTTP services
    batch-processor.hcl    # Shared defaults for batch processors
  production/
    dc-us-east/
      env.hcl
      payment-gateway/
        terragrunt.hcl     # Minimal, service-specific overrides
    dc-eu-west/
      env.hcl
      payment-gateway/
        terragrunt.hcl
  staging/
    env.hcl
    payment-gateway/
      terragrunt.hcl

A typical service’s terragrunt.hcl was often fewer than 30 lines — just the image tag, any service-specific environment variables, and scaling parameters. Everything else was inherited.
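A leaf configuration along these lines illustrates the inheritance; the paths, registry, and input names here are hypothetical, but the include mechanism is standard Terragrunt:

```hcl
# production/dc-us-east/payment-gateway/terragrunt.hcl -- hypothetical example
include "root" {
  path = find_in_parent_folders()
}

include "envcommon" {
  # Pull in the shared defaults for the http-service archetype
  path = "${dirname(find_in_parent_folders())}/_envcommon/http-service.hcl"
}

inputs = {
  image          = "registry.internal/payment-gateway:v1.42.0"
  instance_count = 6
  env_vars = {
    PAYMENT_REGION = "us-east"
  }
}
```

Everything not listed here — the module source, resource allocations, health check settings, team ownership — came from the parent configurations in the hierarchy.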

The migration process

Migrating 2,500 live job definitions to Terraform without causing outages required care. I developed a three-phase migration process:

Phase 1: Import. I wrote a Python script that parsed existing Nomad job specification files and generated corresponding Terragrunt configurations and terraform import commands. This was not a perfect translation — many jobs had accumulated bespoke configuration that did not map cleanly to the module interfaces — but it produced a working first draft for each service that could be refined manually.

Phase 2: Validate. For each migrated service, I ran terraform plan and compared the planned Nomad job specification against the currently running job using nomad job inspect. The goal was a zero-diff plan: Terraform should describe exactly what was already running, with no unintended changes. This was the most time-consuming phase, as it surfaced all the subtle inconsistencies and undocumented configuration that had accumulated over time.

Phase 3: Apply. Once a service’s Terraform configuration produced a clean plan, I applied it in a maintenance window. The Nomad provider’s apply is effectively a nomad job run, so the running allocations were replaced with new ones matching the Terraform-managed specification. Services with health checks experienced zero-downtime rolling deployments.

I migrated services in batches, starting with low-risk internal tools, progressing to lower environments, and finally tackling production payment processing services across all data centres. The entire migration took approximately four months.

Automated PR validation with GitHub checks

With all services managed through Terraform, I introduced automated validation, surfaced as GitHub checks, that had been impossible with the shell script approach:

Automated plan on every pull request. The CI pipeline ran terragrunt plan on every PR and reported the results directly as a GitHub check. Reviewers could see exactly what would change across all affected environments before approving.

Policy enforcement via Sentinel. I wrote Sentinel policies that prevented common mistakes: deploying without health checks, requesting more resources than the node class could provide, or removing required environment variables. These ran as part of the GitHub check suite, blocking merges that violated policy.

Drift detection. A scheduled CI job ran terragrunt plan nightly and alerted if any service had drifted from its declared state — catching manual interventions that bypassed the Terraform workflow.

Parallelised deployment. Terragrunt’s dependency graph allowed deployments to run in parallel across independent services and data centres. This was the single biggest factor in the deployment time improvement.
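For services that did depend on one another, a Terragrunt dependency block expressed the ordering; everything else ran concurrently under run-all. A hypothetical fragment (the service names and output are invented for illustration):

```hcl
# payment-gateway/terragrunt.hcl fragment -- hypothetical dependency wiring
dependency "session_store" {
  config_path = "../session-store"
}

inputs = {
  env_vars = {
    # Deploys after session-store, consuming its declared output
    SESSION_STORE_ADDR = dependency.session_store.outputs.service_address
  }
}
```

With the graph declared, terragrunt run-all apply walks dependencies in order and fans independent stacks out in parallel, which is where the deployment time went.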

Results

  • 2,500 job files consolidated into 300 Terragrunt configurations referencing 4 shared Terraform modules, spanning 5 environments and multiple data centres.
  • Deployment time reduced by 80% — from approximately 25 minutes down to 4 minutes. Parallelised Terragrunt execution and elimination of manual verification steps accounted for most of the improvement.
  • Configuration drift eliminated. Nightly drift detection ensured that what was declared in code was what was running.
  • Onboarding time for new engineers halved. New team members no longer needed to understand 2,500 bespoke job files. They learned four module interfaces and the Terragrunt hierarchy.
  • Zero outages caused by the migration itself.

Lessons learnt

The technical work — writing modules, scripting imports, validating plans — was the straightforward part. The harder challenge was building consensus within the team that the migration was worth the disruption. Engineers are rightly cautious about changing the deployment mechanism for production payment processing systems.

What worked was demonstrating value incrementally. I migrated internal services first, showed the team the cleaner workflow and the 80% deployment time reduction, and let the benefits speak for themselves. By the time I proposed migrating production services, the team was actively requesting it.