Observability Lead

United States, Chicago 175000.00 - 250000.00 USD / Year · Job Posted March 13, 2026

Job Description

We are seeking an Observability Lead to own the strategy, execution, and technical direction of CTC's observability platform. In this role, you will lead a small, high-impact team responsible for the tools and systems that give our engineers, quants, and traders visibility into the health and performance of critical infrastructure and applications.

Job Responsibility

  • Define and drive the observability roadmap
  • Lead the design, implementation, and continuous improvement of monitoring, alerting, logging, tracing, and metrics infrastructure at scale
  • Own the end-to-end developer experience of observability tooling
  • Manage and grow a small team of engineers
  • Partner with infrastructure, platform, and application teams
  • Establish and enforce best practices for instrumentation, SLOs, alert quality, and operational readiness
  • Evaluate emerging tools, frameworks, and approaches in the observability space

Requirements

  • 8+ years of technical engineering experience
  • At least 3 years focused on observability, monitoring, or site reliability engineering
  • Demonstrated expertise designing, building, and operating observability platforms at scale
  • Deep experience with Datadog and OpenTelemetry strongly preferred
  • Proven experience leading or managing a small engineering team
  • Strong understanding of distributed systems and microservices architectures
  • Hands-on experience with Kubernetes and bare-metal infrastructure
  • Advanced programming proficiency in at least one of Python, Go, or Java
  • Familiarity with C++ or low-latency systems is a strong plus
  • A product-oriented mindset
  • Exceptional communication skills
  • Financial sector experience (trading, prop trading, hedge funds) and familiarity with low-latency, high-reliability systems are strongly preferred
  • Advanced degree (MS, PhD) in Computer Science, Engineering, or related field is a plus

What we offer

  • Generous medical coverage
  • Paid parental leave
  • Free breakfast and lunch
  • Healthy snacks
  • Wellness reimbursement
  • Quarterly recharge days

Similar Jobs for Observability Lead

8 matching positions

Lead Observability Engineer

We are seeking a Lead Observability Engineer to join the team, and be able to wo...
Salary: Not provided
N-iX
Expiration Date: Until further notice
Requirements
  • 5+ years of engineering experience in cloud observability platforms, infrastructure, and telemetry systems
  • Deep experience in alerting, notifications, and monitoring at scale
  • Advanced expertise with ClickHouse, or similar high-performance analytical databases, for telemetry storage and querying
  • Hands-on experience migrating telemetry/storage solutions (preferably from Cosmos DB to ClickHouse or equivalent)
  • Solid understanding of telemetry pipelines, cloud-native monitoring, and best practices
  • Experience with dashboarding and visualization tools (Grafana, Kibana, or similar)
  • Strong scripting and automation skills (Python, Bash, Terraform or equivalent)
  • Proven collaboration and communication skills across cross-functional teams.
Job Responsibility
  • Lead the migration and transformation of telemetry storage from custom Cosmos DB solutions to ClickHouse, building a scalable and reliable end-to-end observability platform
  • Architect, implement, and maintain alerting and notification systems integrated with ClickHouse for critical services and applications
  • Develop, deploy, and operate high-throughput telemetry pipelines, ensuring accurate and actionable monitoring across cloud environments
  • Collaborate with engineering and product teams to define and champion observability best practices
  • Work with DevOps and development teams to automate collection, ingestion, and retention policies for logs, metrics, and traces
  • Drive continuous improvement in system performance, stability, and reliability through effective observability
  • Participate in on-call rotations, incident response, and root cause analysis to enhance monitoring and alerting capabilities.
What we offer
  • Flexible working format - remote, office-based or flexible
  • A competitive salary and good compensation package
  • Personalized career growth
  • Professional development tools (mentorship program, tech talks and trainings, centers of excellence, and more)
  • Active tech communities with regular knowledge sharing
  • Education reimbursement
  • Memorable anniversary presents
  • Corporate events and team buildings
  • Other location-specific benefits
Employment Type: Full-time

Lead Observability Platform Engineer

Capital One is looking for an Observability Platform Engineer to join our Associ...
Location: United States, Plano; McLean; Richmond
Salary: 149800.00 - 188100.00 USD / Year
Capital One
Expiration Date: Until further notice
Requirements
  • High School Diploma, GED, or equivalent certification
  • At least 3 years of experience creating reports and building alert monitors
  • At least 3 years working with macOS and Windows platforms
  • Strong analytical and technical skills
  • Ability to foster collaborative, open, working relationships with technology groups and other stakeholders, including vendor relationships
  • Demonstrated clear communication skills and ability to interact effectively at all levels of an organization, and to influence senior management and executives
  • Strong knowledge of syntax structures for reporting languages, such as SQL or Opal, and good familiarity with parsing data.
Job Responsibility
  • Work with partner teams to update configurations for our log collectors on our Windows and Mac endpoints
  • Work with stakeholders to identify, discuss and prioritize log ingestion strategies
  • Build complex dashboards that tell stories about the health of our endpoints, and identify opportunities for improvements
  • Create monitors that alert platform teams when changes to the environment may be impacting the health of devices and user experiences
  • Create reports that detail the performance of applications on our endpoints, and applications being considered for future deployment
  • Assist platform teams with issue triage by providing complex data and log analysis where needed
  • Use data to tell stories to our senior leaders, help to drive vendor and product roadmaps
  • Help create processes and strategies that can validate changes in performance across operating system and product version updates
What we offer
  • Performance based incentive compensation, which may include cash bonus(es) and/or long term incentives (LTI)
  • A comprehensive, competitive, and inclusive set of health, financial and other benefits that support your total well-being
Employment Type: Full-time

Lead Observability Engineer

Lead Observability Engineer role focusing on the Elastic Observability Platform,...
Location: India, Hyderabad
Salary: Not provided
Blue Yonder
Expiration Date: Until further notice
Requirements
  • Bachelor’s degree in Computer Science, Engineering, MIS, or equivalent experience
  • 7–10+ years of experience in observability engineering, SRE, monitoring platform ownership, or infrastructure operations
  • Deep, hands-on expertise with Elastic Stack (Elasticsearch, Kibana, Logstash, Beats/Elastic Agent, APM)
  • Strong architectural knowledge of cloud (Azure/AWS) and hybrid observability patterns
  • Experience leading observability for infrastructure, cloud platforms, network systems, Kubernetes, and Microsoft 365
  • Proven experience designing monitoring for SaaS platforms (Workday, Salesforce, ServiceNow)
  • Advanced scripting/automation experience (Python, PowerShell, Bash)
  • Strong knowledge of API integrations, data pipelines, and log-flow engineering
  • Experience leading incident diagnostics and delivering visibility for RCA and operational improvement
  • Strong analytical, architectural, and troubleshooting skills with a platform-owner mindset
Job Responsibility
  • Receives work assignments through the ticketing system or from senior leadership
  • Provides Tier-4 engineering expertise, platform ownership, and technical leadership for all observability capabilities across hybrid cloud, on-premises, and SaaS environments
  • Leads the design, architecture, and maturity of the enterprise observability ecosystem with a primary focus on the Elastic Observability Platform
  • Drives the enterprise strategy for logging, metrics, traces, synthetics, and alerting—including governance, standardization, and performance optimization
  • Partners closely with Cloud, Infrastructure, Security, Enterprise Applications, and SRE leadership to define observability frameworks
  • Ensures observability platforms meet enterprise requirements for security, performance, availability, compliance, and scalability
  • Oversees monitoring implementations for key SaaS applications including Workday, Salesforce, ServiceNow, and Microsoft 365
  • Provides guidance, mentorship, and direction to observability engineers, SREs, and operational teams
  • Acts as a strategic advisor during major incidents by providing real-time diagnostics, correlation insights, and driving RCA improvements
  • Required to provide on-call support during off-hours on weekdays, weekends, and holidays on a rotating basis
Employment Type: Full-time

Observability Lead – Elastic (ELK) Stack

We are seeking a highly experienced and visionary Observability Lead to spearhea...
Location: India, Mumbai
Salary: Not provided
Integra Micro Software Services
Expiration Date: Until further notice
Requirements
  • Bachelor’s or Master’s degree in Computer Science, Information Technology (IT), or a closely related technical field
  • Minimum of 8+ years of professional experience dedicated to observability, system monitoring, or infrastructure management practices
  • 3+ years of direct, hands-on experience specifically managing and engineering solutions using the full Elastic Stack (Elasticsearch, Kibana, Logstash/Beats, Elastic APM, and Fleet/Elastic Agent)
  • Strong, practical understanding of fundamental observability concepts, including the collection and analysis of logs, metrics, traces, and synthetic monitoring
  • Expertise in implementing OpenTelemetry, configuring distributed tracing, and carrying out telemetry instrumentation within complex microservice environments
  • Proven experience working with complementary modern monitoring and containerization tools such as Kubernetes, Docker, Prometheus, and Grafana
  • Demonstrated proficiency in managing system configurations using YAML-based configurations
  • Extensive experience in performance optimization, advanced data visualization, and sophisticated dashboarding using Kibana
Job Responsibility
  • Spearhead our monitoring and infrastructure management initiatives
  • Drive the strategy and implementation of robust observability solutions
  • Ensure system reliability, performance, and insightful data visualization
What we offer
  • Innovation Focused culture
  • Collaborative Environment
  • Professional Development through continuous learning programs, certifications, and mentorship opportunities
  • Work-Life Integration with competitive benefits and policies

Lead Integration & Observability Specialist

The Lead Integration & Observability Specialist will design and implement observ...
Location: India, Coimbatore
Salary: Not provided
NTT DATA
Expiration Date: Until further notice
Requirements
  • 7+ years of overall IT experience
  • 5+ years of relevant experience in Observability / Monitoring / Reliability Engineering
  • Strong hands-on experience with enterprise observability tools, such as: IBM Instana, Dynatrace, AppDynamics, Prometheus, Grafana
  • Expertise in: monitoring and alerting design, log management and analysis, metrics and distributed tracing, health checks and SLO/SLI concepts
  • Experience monitoring AWS/Azure workloads
  • Strong troubleshooting and incident analysis skills
  • Experience defining operational and non-functional requirements
Job Responsibility
  • Lead the implementation of enterprise observability for applications, APIs, services, batch jobs, and data pipelines
  • Design and standardize monitoring, alerting, logging, metrics, and health checks across distributed systems
  • Integrate observability platforms with incident management and automation tools to support proactive issue detection and remediation
  • Support reliability and availability of integration platforms built on AWS/Azure
  • Perform advanced troubleshooting using logs, metrics, and traces to resolve production issues
  • Define operational readiness standards and non-functional requirements
  • Mentor engineers on observability best practices and platform usage
  • Collaborate with product, support, and operations teams to improve service stability and delivery

Site Reliability Engineering (SRE) / Observability Technical Lead

Join a dynamic team as a Site Reliability Engineer, leading observability and re...
Location: United Kingdom, London
Salary: Not provided
NTT DATA
Expiration Date: Until further notice
Requirements
  • 5+ years of experience in SRE, Observability, or DevOps roles, with leadership responsibilities
  • Proven expertise with Application Performance Monitoring (APM) tools such as New Relic, Datadog, AppDynamics, or Dynatrace
  • Hands-on experience with OpenTelemetry (OTel) for distributed tracing and observability instrumentation
  • Strong proficiency in Infrastructure as Code (IaC) using Terraform
  • Solid understanding of cloud platforms including AWS, GCP, or Azure
  • Experience with automation/configuration management tools like Ansible, Chef, or Puppet
  • Deep knowledge of CI/CD pipelines and tools such as GitHub Actions, Jenkins, or Azure DevOps
  • Experience managing Kubernetes and containerized environments (Docker, Helm)
  • Familiarity with log aggregation and analysis platforms like ELK Stack or Splunk
  • Excellent leadership, communication, and collaboration skills
Job Responsibility
  • Lead the strategic development and management of observability and reliability frameworks across the organization, ensuring alignment with business goals and technical requirements
  • Design and implement monitoring and observability solutions, collaborating with engineering teams to define standards and best practices
  • Manage Infrastructure as Code (IaC) initiatives using Terraform, coordinating with cloud and infrastructure teams to ensure scalable and secure deployments
  • Drive automation strategies for monitoring, alerting, and logging pipelines, focusing on process improvements and operational efficiency
  • Develop and maintain comprehensive observability roadmaps, including distributed tracing, logging, and metrics collection strategies
  • Collaborate with product management, sales, and pre-sales teams to provide technical expertise and support during solution design and customer engagements
  • Lead cross-functional teams to enhance CI/CD pipelines and deployment reliability, ensuring smooth integration of observability tools and practices
  • Engage with vendors and strategic partners to evaluate, select, and integrate observability and monitoring solutions, ensuring alignment with organizational needs and fostering strong collaborative relationships
  • Mentor and develop junior engineers and analysts, fostering a culture of reliability, observability, and operational excellence
What we offer
  • Tailored benefits that support your physical, emotional, and financial wellbeing
  • Continuous growth and development opportunities
  • Flexible work options
Employment Type: Full-time

Program Lead: Product Operations - AI Observability

The AI Observability Program Leader will own the end-to-end strategy, design, an...
Location: United States, Sunnyvale
Salary: 162000.00 - 180000.00 USD / Year
Uber
Expiration Date: Until further notice
Requirements
  • 5+ years of experience in Technical Program Management, Product Operations, AI Quality, or Observability
  • Bachelor’s degree in Engineering, Computer Science, Data Science, or a related technical field.
Job Responsibility
  • Architect Observability Frameworks: Own the strategy for understanding AI agentic reasoning, enabling deep analysis of step-by-step agent decision-making
  • Drive Autoeval Strategy: Design and roll out automated evaluation systems (LLM-as-a-judge) to provide a scalable, high-confidence "pulse" on AI performance across conversational and voice interfaces
  • Define Micrometrics: Develop granular signals within agentic activity—identifying latent failures, reasoning loops, or tool-calling inefficiencies—to drive product improvements
  • Lead Pre-Launch Simulation: Partner with Product & Engineering to build and maintain simulation environments that test AI agents against edge cases before deployment, and democratise these tools with Operations teams
  • Cross-Functional Technical Partnership: Act as the primary liaison between Product, Engineering, and Data Science to ensure observability tooling is integrated into the development lifecycle and directly informs release "Go/No-Go" decisions
  • Insight Synthesis: Package complex technical observability data into clear, actionable narratives for leadership, highlighting specific failure patterns and opportunities for CX improvement
  • Operational Excellence: Establish the standards and tooling for how AI performance is reported globally, ensuring consistency across different regions and support modalities.
What we offer
  • Eligible to participate in Uber's bonus program
  • May be offered an equity award & other types of comp
  • All full-time employees are eligible to participate in a 401(k) plan
  • Eligible for various benefits (details at link).
Employment Type: Full-time

Technical Architect

Lead the design, modernization, and implementation of scalable, secure, and resi...
Location: United States, Armonk
Salary: 247319.00 - 250000.00 USD / Year
The New York Times
Expiration Date: Until further notice
Requirements
  • Bachelor's degree or equivalent in Computer Science, Information Technology, Engineering or related and five (5) years of experience as a Consultant Architect, Virtualization Architect, Senior Cloud Architect or related
  • Five (5) years of experience must include utilizing Hybrid Cloud, AWS, Azure, Red Hat Linux, Terraform, Ansible, Python, VMware Cloud Foundation (VCF) Stack
Job Responsibility
  • Lead the design, modernization, and implementation of scalable, secure, and resilient hybrid cloud and containerized infrastructure platforms
  • Define and lead the technical architecture strategy for hybrid cloud, container orchestration (Kubernetes, RedHat OpenShift, VMware Tanzu), and virtualized environments (VMware, Nutanix, RedHat)
  • Architect secure and scalable infrastructure across private, public, and hybrid cloud ecosystems
  • Evaluate, design, and implement solutions for computing, storage, networking, identity, and availability zones across global regions
  • Design and implement Kubernetes, RedHat OpenShift clusters across multi-cloud and on-prem environments, including CI/CD integration, policy enforcement, and workload orchestration
  • Define governance, observability, and security patterns for containerized workloads
  • Lead Infrastructure-as-Code (IaC) initiatives using Terraform, Ansible, GitOps, GitHub, PowerShell, and Python
  • Enable self-service infrastructure capabilities through automation frameworks and developer platforms
  • Partner with DevSecOps, SRE, Infrastructure Operations, Security, and Datacenter Operation teams to scope, define, size, and execute application onboarding, modernization, and consolidation initiatives
  • Mentor engineering teams and influence enterprise architecture (EA) roadmaps
Employment Type: Full-time