We build products people genuinely love. Our features are impactful, our business is growing, and … it’s pretty great! We love Datadog too, but let’s be honest: we have been operating on “ship it first, check the bill later”. And the bill has grown to the point that we're actually looking for someone who can help us think through how to manage this in a more scalable way.
We need a hero. A detective. Someone with a deep-seated love for logs, metrics, and most importantly, savings. You are not just an Infra Engineer; you are an economic covert ops specialist. Your glorious mission is to make our Datadog spend go down, dramatically and sustainably. We're talking down down. The bill should look like it's been body-slammed by a professional wrestler.
You will be embedded within the Infrastructure team, and will have the autonomy to look across every service to streamline and purge that which needs streamlining and purging. As you rack up wins, you'll increasingly become the person we introduce at company meetings as, 'The reason we could spend $$ on that nice company offsite.'
Job Responsibilities:
Mitigation of myriad metrics: Hunt down and decommission all high-cardinality custom metrics that no one actually uses, replacing them with sane, aggregated alternatives, or build a system that insulates us from this risk area entirely
Liberation from legions of logs: Audit the log ingestion for every service. You'll work with engineering teams to tune logging levels, apply intelligent sampling and exclusion filters at the source (i.e., the agent), and implement better categorization and archiving strategies
Analysis of Performance Monitoring (APM): Analyze our APM and trace ingestion and ensure both are used smartly. You'll champion distributed tracing strategies that are both informative and economical
Standardization: Use automation to enforce cost-saving policies across our entire fleet, ensuring developers can't accidentally check in a new, expensive monitoring configuration
Evangelization: Be the champion for cost-aware engineering. Create internal documentation, run 'Datadog Dojo' workshops, and embed the mindset of 'monitor what matters' across the entire engineering organization
Requirements:
3+ years as an Infrastructure, DevOps, or Site Reliability Engineer
Expert-level, obsessive knowledge of Datadog's pricing model and platform architecture
Deep proficiency with AWS and Kubernetes
Strong programming skills for infrastructure automation
The courage to tell a founder or principal engineer that their favorite metric is financially irresponsible
Nice to have:
Experience with other monitoring/observability tools (Prometheus, Grafana, Honeycomb, Splunk) and a view on whether we should be using any of them to displace some Datadog functionality
Experience implementing OpenTelemetry standards and agents for cost-effective vendor neutrality
A proven track record of actually reducing cloud costs, not just talking about it