Multi-Datacentre Disaster Recovery with HashiStack
Trading firms have a straightforward relationship with downtime: it costs money. Not in the abstract, brand-damage sense that applies to most businesses, but in the direct, quantifiable sense that if your systems are offline during market hours, you cannot execute your strategies. When Systematica Investments asked me to design a disaster recovery architecture for the firm’s core trading infrastructure, the requirement was unambiguous: if an entire data centre goes dark, trading continues.
The existing landscape
Systematica’s trading infrastructure ran on HashiCorp’s stack — Nomad for workload orchestration, Consul for service discovery and configuration, and Vault for secrets management. The firm operated out of a primary colocated data centre with low-latency connectivity to execution venues. The setup had served the firm well, but it represented a single point of failure.
The infrastructure was mature and well-operated. Nomad managed several hundred jobs spanning market data ingestion, signal generation, order management, and post-trade processing. Consul provided service mesh connectivity between components. Vault handled credentials for databases, message brokers, and external API integrations.
What did not exist was a coherent plan for what would happen if the primary site became unavailable. There were backups, certainly, but backups and disaster recovery are fundamentally different things. A backup tells you that your data is safe. Disaster recovery tells you that your business keeps running.
Design principles
I established four principles to guide the architecture:
Active-passive with one-click promotion. A full active-active setup across sites would have introduced consistency challenges that are particularly dangerous in trading systems. Instead, I designed an active-passive model where the secondary site — spanning a second colocated data centre and AWS — maintained a warm standby that could be promoted to active with a single command.
Infrastructure parity. The secondary site would run identical configurations, identical Nomad job definitions, and identical Consul and Vault setups. No snowflakes, no “DR-specific” workarounds. If it runs in primary, it runs identically in secondary.
Automated failover for stateless workloads. Services that did not maintain local state — the majority of the trading pipeline — would fail over automatically via Consul DNS and Nomad’s multi-datacentre federation. Stateful components would require a controlled promotion process.
Regular, tested failover drills. An untested DR plan is not a plan; it is a hypothesis. I built the architecture with the explicit goal of running quarterly failover drills during non-market hours.
Implementation
Nomad federation across two data centres and AWS
Nomad supports multi-datacentre federation natively. I deployed Nomad clusters in the secondary collocated data centre and in AWS, federating them with the primary cluster using WAN gossip. This gave us a single control plane spanning all three sites, with the ability to target job placements to specific datacentres using constraint stanzas.
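As a sketch, a job pinned to the primary site might look like the following; the job, image, and datacentre names are all illustrative rather than the firm’s actual definitions:

```hcl
# Illustrative Nomad job pinned to the primary datacentre.
# All names here are placeholders.
job "market-data-ingest" {
  # The federated control plane spans all three sites, but this
  # job is only eligible for placement in the primary.
  datacenters = ["dc1-primary"]

  group "ingest" {
    count = 3

    task "ingest" {
      driver = "docker"
      config {
        image = "registry.internal/market-data-ingest:latest"
      }
    }
  }
}
```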
For the warm standby, I created a parallel set of Nomad job definitions registered with count = 0 in the secondary sites. The one-click failover script raised the secondary jobs to their target counts and set the primary jobs to zero; the Nomad scheduler handled the rest. Critically, the same script also handled failback — restoring the primary site to active once the issue was resolved, with the same single-command simplicity.
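The secondary-site counterpart of the same job can be sketched like this; note the zero count. Names remain placeholders:

```hcl
# Warm-standby copy of the job for the secondary sites (illustrative).
# The definition is always registered and validated, but allocates
# nothing until the failover script raises the count.
job "market-data-ingest-dr" {
  datacenters = ["dc2-secondary", "aws-eu-west"]

  group "ingest" {
    count = 0  # raised to the production count during failover

    task "ingest" {
      driver = "docker"
      config {
        image = "registry.internal/market-data-ingest:latest"
      }
    }
  }
}
```

The counts can be flipped with `nomad job scale` or by re-registering the jobs with updated counts; either way, the scheduler converges on the new state.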
Consul multi-datacentre
Consul’s WAN federation provided cross-datacentre service discovery. Services registered in the primary datacentre were resolvable from the secondary using Consul’s prepared queries with failover configuration. I configured critical services — the order management gateway, market data distributors, and the risk engine — with prepared queries that would automatically resolve to the secondary datacentre if the primary became unhealthy.
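A prepared query with failover is created through Consul’s `/v1/query` HTTP endpoint; a minimal sketch for the order management gateway (service and datacentre names assumed) might be:

```json
{
  "Name": "order-gateway",
  "Service": {
    "Service": "order-gateway",
    "OnlyPassing": true,
    "Failover": {
      "Datacenters": ["dc2-secondary"]
    }
  },
  "DNS": {
    "TTL": "5s"
  }
}
```

Clients then resolve `order-gateway.query.consul` via Consul DNS: while healthy local instances exist the query returns them, and once they all fail it returns instances from the listed failover datacentre.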
Health checks were the critical detail. I tuned Consul’s health check intervals and deregistration thresholds to balance false positives (triggering unnecessary failovers) against detection speed (catching genuine failures quickly). After extensive testing, I settled on a configuration that would detect a complete site failure within 30 seconds and begin redirecting traffic within 45.
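In a Consul agent service definition, the knobs in question look roughly like this; the values shown are illustrative stand-ins for the production settings tuned against the 30-second and 45-second targets above:

```hcl
# Illustrative Consul service registration; port, path, and
# thresholds are placeholders.
service {
  name = "order-gateway"
  port = 8443

  check {
    name     = "order-gateway health endpoint"
    http     = "http://localhost:8443/health"
    interval = "10s"
    timeout  = "2s"

    # Drop the instance from discovery shortly after it goes
    # critical, so prepared-query failover takes over.
    deregister_critical_service_after = "45s"
  }
}
```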
Vault replication
Vault’s performance replication feature synchronised secrets, policies, and mount configuration from the primary cluster to the secondary. This was the most operationally sensitive component — a failover that leaves services unable to authenticate is not a failover at all.
I configured Vault with auto-unseal using AWS KMS, with the secondary cluster using a KMS key in a different AWS region. This ensured that even an AWS regional outage affecting the primary’s KMS key would not prevent the secondary Vault from operating.
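The relevant part of the secondary cluster’s server configuration is the seal stanza; the region and key alias below are placeholders:

```hcl
# Illustrative auto-unseal configuration for the secondary Vault.
# The KMS key lives in a different AWS region from the primary's,
# so a regional KMS outage cannot take both clusters down.
seal "awskms" {
  region     = "eu-west-2"
  kms_key_id = "alias/vault-dr-unseal"
}
```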
Data replication
Stateful components — primarily PostgreSQL databases and a message broker — used asynchronous replication to the secondary site. I configured PostgreSQL streaming replication with a target lag of under five seconds, monitored via a custom Consul health check that would flag if replication lag exceeded the threshold.
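The lag monitor reduces to a small script check; the SQL, thresholds, and invocation below are an illustrative sketch rather than the production check. Exit codes follow Consul’s script-check convention (0 passing, 1 warning, 2 critical):

```python
import subprocess

# Thresholds: the five-second target from the replication design,
# plus a hypothetical hard limit for the critical state.
WARN_SECONDS = 5.0
CRIT_SECONDS = 30.0

# On a streaming replica, lag is the gap between now() and the
# timestamp of the last replayed transaction.
LAG_SQL = (
    "SELECT COALESCE(EXTRACT(EPOCH FROM "
    "now() - pg_last_xact_replay_timestamp()), 0);"
)

def classify(lag_seconds: float) -> int:
    """Map a lag measurement onto a Consul script-check exit code."""
    if lag_seconds >= CRIT_SECONDS:
        return 2
    if lag_seconds >= WARN_SECONDS:
        return 1
    return 0

def measure_lag() -> float:
    """Query the replica via psql and return lag in seconds."""
    result = subprocess.run(
        ["psql", "-At", "-c", LAG_SQL],
        capture_output=True, text=True, check=True,
    )
    return float(result.stdout.strip())

# Consul runs this as a script check, e.g.:
#   import sys; sys.exit(classify(measure_lag()))
```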
The failover runbook
I wrote a detailed failover runbook and implemented it as a single idempotent script. The full failover sequence was:
- Confirm primary site failure (automated detection plus human verification).
- Promote Vault secondary to primary.
- Verify Consul prepared queries are resolving to secondary.
- Scale up Nomad jobs in secondary datacentre.
- Verify database replication state and promote replicas if needed.
- Run automated smoke tests against all critical services.
- Confirm trading readiness with the operations team.
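The shape of the script behind that runbook can be sketched as an ordered list of idempotent steps. The step bodies here are stubs standing in for the real Vault, Consul, Nomad, and PostgreSQL calls; everything about this skeleton is illustrative:

```python
from typing import Callable

def promote_vault(state: dict) -> None:
    # Idempotent: promoting an already-primary cluster is a no-op.
    if state.get("vault_role") != "primary":
        state["vault_role"] = "primary"

def verify_consul_failover(state: dict) -> None:
    # In production: resolve each prepared query and confirm the
    # answers point at secondary-site instances.
    state["consul_verified"] = True

def scale_up_secondary(state: dict) -> None:
    # In production: raise secondary job counts, zero the primary's.
    state["secondary_scaled"] = True

def promote_db_replicas(state: dict) -> None:
    if state.get("db_role") != "primary":
        state["db_role"] = "primary"

def run_smoke_tests(state: dict) -> None:
    state["smoke_tests_passed"] = True

STEPS: list[Callable[[dict], None]] = [
    promote_vault,
    verify_consul_failover,
    scale_up_secondary,
    promote_db_replicas,
    run_smoke_tests,
]

def run_failover(state: dict) -> dict:
    """Execute every step in order.

    Because each step checks current state before acting, the whole
    sequence is safe to re-run after a partial failure.
    """
    for step in STEPS:
        step(state)
    return state
```

The design choice worth noting is that idempotency lives inside each step, not in the runner: re-running the whole script after an interrupted failover converges on the same end state.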
In drills, this sequence completed in under eight minutes. Failback followed the same process in reverse.
Testing and validation
We conducted the first full failover drill six weeks after completing the implementation. It surfaced three issues: a Consul health check that was too aggressive, a Nomad job that had a hardcoded IP address instead of a Consul service reference, and a Vault policy that was missing from the replication filter. All were fixed within a day.
Subsequent quarterly drills completed cleanly. The fastest recorded failover was five minutes and forty seconds from initiation to trading readiness confirmation.
Results
- Recovery time objective (RTO) of under 10 minutes, validated through quarterly drills.
- Recovery point objective (RPO) of under 10 seconds for all stateful components under normal replication conditions.
- One-click failover and failback — a single command to switch active sites in either direction, reducing operator error during high-stress scenarios.
- Zero unplanned failover events in the eighteen months since deployment — but the confidence that the system works when needed is the actual deliverable.
- Compliance with the business continuity requirements arising from the firm’s regulatory obligations.
Reflections
The most valuable outcome of this project was not the DR architecture itself but the operational discipline it imposed. Building for failover forced us to eliminate hardcoded assumptions, document dependencies, and treat infrastructure configuration as genuinely reproducible code. The production environment became more robust as a direct side effect of designing for failure.