Building Firm-Wide Observability with Datadog

Observability at a quantitative trading firm is not optional. When a model misbehaves or a data pipeline stalls, the difference between catching it in seconds versus minutes can be measured in real money. When I took on this project at Systematica Investments, the firm had monitoring — but it was fragmented, inconsistent, and riddled with blind spots.

The starting point

The infrastructure landscape was heterogeneous, as it tends to be at firms that have grown organically over a decade. We had EKS clusters running research workloads on Fargate, a fleet of bare-metal servers in a co-located data centre running trading systems under HashiCorp Nomad, a handful of legacy VM-based services, and various SaaS integrations.

Each team had gravitated towards its own monitoring approach. The research team used ad-hoc CloudWatch dashboards. The trading infrastructure team relied on a self-hosted Prometheus instance that no one had upgraded in two years. Some services logged to stdout, others to files on disk, and a few wrote directly to Elasticsearch via a Logstash pipeline that was held together with hope.

There was no unified view. When an incident occurred, the first thirty minutes were typically spent working out where to look.

Choosing Datadog

The decision to adopt Datadog as the central platform was driven by practical considerations rather than brand preference. We needed a platform that could ingest logs, metrics, and traces from highly diverse sources — Kubernetes pods, Nomad jobs, bare-metal daemons, AWS services — without requiring us to build and maintain a bespoke collection pipeline for each. Datadog’s agent model and broad integration catalogue made it the pragmatic choice.

Crucially, we also needed a solution that would not itself become an operational burden. Running a self-hosted observability stack at the scale and reliability level required for a trading firm would have meant dedicating significant engineering time to a system that is not our core business. Datadog’s managed model freed us to focus on instrumentation rather than infrastructure.

The architecture

I designed the observability architecture in three layers:

Collection. I deployed the Datadog Agent and Fluent Bit across all infrastructure tiers. On EKS Fargate, the Datadog Agent and Fluent Bit ran as sidecar containers — Fargate does not support DaemonSets, so each pod carried its own lightweight collectors. On bare-metal Nomad nodes, both ran as system jobs, tailing container logs from the Docker socket and system logs from journald. For legacy VMs, I deployed them via Ansible with file-input plugins configured per service.
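As a rough illustration of the Nomad tier, a system job schedules one agent instance per client node. This is a minimal sketch, not the firm's actual job file — the datacentre name, image tag, and secret handling are placeholders:

```hcl
job "datadog-agent" {
  datacenters = ["ldn-1"]   # placeholder datacentre name
  type        = "system"    # "system" = one instance on every client node

  group "agent" {
    task "agent" {
      driver = "docker"
      config {
        image        = "gcr.io/datadoghq/agent:7"
        network_mode = "host"
        volumes = [
          # Docker socket and host paths so the agent can discover
          # containers and collect host-level metrics
          "/var/run/docker.sock:/var/run/docker.sock:ro",
          "/proc:/host/proc:ro",
          "/sys/fs/cgroup:/host/sys/fs/cgroup:ro",
        ]
      }
      env {
        DD_API_KEY = "REDACTED"  # sourced from a secrets store in practice
      }
    }
  }
}
```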

Processing and routing. Fluent Bit served as the universal log router. I configured a dual-output pipeline: logs were forwarded simultaneously to Datadog for real-time search, alerting, and dashboarding, and to AWS S3 for long-term archival. The S3 tier used lifecycle policies to transition logs to Glacier after 90 days, giving us cost-effective retention for compliance and forensic purposes without paying Datadog’s per-GB ingestion costs on historical data.
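The dual-output pattern falls out of Fluent Bit's routing model: two `[OUTPUT]` sections matching the same tag each receive a copy of the stream. A minimal sketch, with the tag, bucket, and region as illustrative placeholders:

```ini
# Copy 1: Datadog, for real-time search and alerting
[OUTPUT]
    Name        datadog
    Match       app.*
    apikey      ${DD_API_KEY}

# Copy 2: S3, for long-term archival (lifecycle policy on the
# bucket transitions objects to Glacier after 90 days)
[OUTPUT]
    Name            s3
    Match           app.*
    bucket          log-archive
    region          eu-west-1
    total_file_size 50M
    use_put_object  On
```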

Fluent Bit’s filter plugins handled log enrichment at the edge. I configured filters to inject metadata — Kubernetes namespace, Nomad job name, data centre identifier, environment tag — before forwarding. This meant logs arrived pre-enriched at both destinations, making them immediately searchable and filterable without relying on server-side processing pipelines.
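Edge enrichment of this kind can be expressed with Fluent Bit's filter plugins. A hedged sketch — tag patterns and tag values here are illustrative, not the production configuration:

```ini
# On EKS: the kubernetes filter resolves namespace, pod, and labels
[FILTER]
    Name       kubernetes
    Match      kube.*
    Merge_Log  On

# On all tiers: static site/environment tags stamped onto every record
[FILTER]
    Name    record_modifier
    Match   *
    Record  dc   ldn-1
    Record  env  production
```

Because the records arrive at both Datadog and S3 already carrying these fields, the same facets work in live search and in archived data pulled back for forensics.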

Visualisation and alerting. I built a hierarchy of Datadog dashboards: a top-level firm overview, per-team dashboards, and per-service detail views. The firm overview displayed key health indicators — deployment frequency, error rates, p99 latencies, and infrastructure utilisation — giving leadership a single screen to assess operational health.

Alerting was configured through Terraform using the Datadog provider, which meant alert definitions lived in version control alongside the infrastructure they monitored. I established sensible defaults — error rate thresholds, latency anomaly detection, log volume spike detection — and gave each team the ability to extend with their own service-specific monitors.
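A monitor defined through the Terraform provider looks roughly like the following. The query, thresholds, and notification handle are illustrative defaults, not the firm's actual values:

```hcl
resource "datadog_monitor" "service_error_rate" {
  name    = "High error rate on {{service.name}}"
  type    = "query alert"
  query   = "sum(last_5m):sum:trace.http.request.errors{env:production} by {service}.as_count() > 50"
  message = "Error rate elevated for {{service.name}}. @slack-oncall"

  monitor_thresholds {
    warning  = 25
    critical = 50
  }

  tags = ["managed-by:terraform", "team:platform"]
}
```

Because these definitions live next to the service's own Terraform, a code review of an infrastructure change naturally includes a review of its alerting.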

The Fargate challenge

EKS Fargate presented the most significant technical challenge. Because Fargate nodes are managed by AWS and inaccessible to DaemonSets, the standard Datadog Agent deployment pattern does not work. Metrics collection required the Datadog Agent to run as a sidecar, and log collection required Fluent Bit as a separate sidecar — meaning each pod specification needed two additional containers.

I automated this through a Kubernetes mutating admission webhook that injected the sidecar containers automatically. Teams did not need to modify their pod specifications; the webhook added the Fluent Bit and Datadog sidecars at admission time, configured with the correct API keys, tags, and log routing rules.
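The core of such a webhook is small: it returns an `AdmissionReview` response carrying a base64-encoded JSONPatch that appends the sidecars to the pod spec. The sketch below shows that mechanism in Python; the container images, env vars, and function names are illustrative, not the firm's actual webhook:

```python
import base64
import json

# Illustrative sidecar definitions -- images and env vars are placeholders
FLUENT_BIT_SIDECAR = {
    "name": "fluent-bit",
    "image": "fluent/fluent-bit:2.2",
    "env": [{"name": "DD_TAGS", "value": "env:production"}],
}
DATADOG_SIDECAR = {
    "name": "datadog-agent",
    "image": "gcr.io/datadoghq/agent:7",
    "env": [{"name": "DD_EKS_FARGATE", "value": "true"}],
}

def build_patch() -> list[dict]:
    """JSONPatch ops appending both sidecars to pod.spec.containers."""
    # The "-" path element appends to the end of the containers array
    return [
        {"op": "add", "path": "/spec/containers/-", "value": sidecar}
        for sidecar in (FLUENT_BIT_SIDECAR, DATADOG_SIDECAR)
    ]

def admission_response(request_uid: str) -> dict:
    """Wrap the patch in an AdmissionReview response; the API server
    requires the patch to be base64-encoded JSONPatch."""
    patch = json.dumps(build_patch()).encode()
    return {
        "apiVersion": "admission.k8s.io/v1",
        "kind": "AdmissionReview",
        "response": {
            "uid": request_uid,
            "allowed": True,
            "patchType": "JSONPatch",
            "patch": base64.b64encode(patch).decode(),
        },
    }
```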

This approach kept the developer experience clean while ensuring consistent instrumentation across all Fargate workloads.

Structured logging standard

For structured logging, I established a firm-wide standard: JSON logs with mandatory fields for service name, environment, severity, and correlation ID. I wrote a shared logging library wrapper that teams could adopt incrementally, and provided migration guides for each major language runtime used internally (Python, Go, and Java).
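In Python, the wrapper amounts to a formatter that emits one JSON object per line with the mandatory fields present. A minimal sketch under assumed names — the class and field spellings here are illustrative, not the shared library's actual API:

```python
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line with the mandatory fields:
    service, env, severity, correlation_id."""

    def __init__(self, service: str, env: str):
        super().__init__()
        self.service = service
        self.env = env

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "service": self.service,
            "env": self.env,
            "severity": record.levelname,
            # Callers pass the correlation ID via extra={...}; None if absent
            "correlation_id": getattr(record, "correlation_id", None),
            "message": record.getMessage(),
            "timestamp": self.formatTime(record),
        }
        return json.dumps(entry)

def get_logger(service: str, env: str) -> logging.Logger:
    """Return a logger writing JSON lines to stdout."""
    logger = logging.getLogger(service)
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(JsonFormatter(service, env))
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger
```

Usage is a one-liner per service: `log = get_logger("pricing-api", "production")`, then `log.info("order filled", extra={"correlation_id": cid})`, which keeps the standard easy to adopt incrementally.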

The correlation ID field proved especially valuable during incident response. By propagating a single identifier through the entire request chain, engineers could trace a transaction from the API gateway through message brokers to downstream services with a single Datadog query.

Rollout and adoption

I rolled out the platform team-by-team over three months, starting with the platform team’s own services (where I could iterate quickly) and expanding to research, trading, and data engineering. Each onboarding involved a short workshop covering the logging standard, dashboard navigation, and how to create custom monitors.

Results

Within six months of full rollout:

  • Mean time to detection for production incidents dropped from approximately 12 minutes to under 90 seconds.
  • Mean time to resolution improved by roughly 40%, largely because engineers no longer spent the first phase of incident response hunting for relevant data.
  • Log search latency went from “check three different systems” to sub-second queries across the entire firm.
  • Long-term log costs reduced by 70% compared to retaining all logs in Datadog, thanks to the S3 archival tier.
  • Dashboard adoption exceeded expectations — teams created over 60 custom dashboards beyond the ones I provided, indicating genuine engagement rather than grudging compliance.

Lessons learnt

The hardest part of building an observability platform is not the technology. It is persuading teams to adopt consistent practices — structured logging, meaningful tagging, correlation IDs — when they already have something that “works well enough.” The approach that succeeded was making the new path the easiest path: automate instrumentation, provide sensible defaults, and make the dashboards genuinely useful from day one.