Senior Software Engineer - Observability and Reliability

United States, New York City · 150000.00 - 220000.00 USD / Year · Job Posted February 18, 2026

Job Description

We are growing the engineering team and looking for engineers who have the chops to build and deliver world-class technology. You will be part of a talented team of engineers with a shared mission to make data easily accessible.

Job Responsibility

  • Build observability tools and platforms, including: metrics, logging, distributed tracing, dashboarding, alerting, application performance management
  • Build with modern tools and languages like Go, OpenTelemetry, and Kubernetes
  • Participate in on-call rotation and ensure uptime of services
  • Create runtime tools/processes that optimize cloud triaging and limit downtime
  • Define best practices around making our systems and services measurable
  • Collaborate with peers and stakeholders through design and code reviews to ensure best practices amongst available technologies. We expect successful candidates to spend the majority of their time coding.

Requirements

  • Strong Computer Science fundamentals
  • 5+ years industry experience building and maintaining high-quality software, especially software other engineers use
  • You apply a product mindset to infrastructure systems and feel accomplished enabling others
  • Desire to be a great teammate and have fun at work
  • Strong sense of craftsmanship, and a healthy academic curiosity

Nice to have

  • Experience building systems for data analytics
  • Distributed systems monitoring and profiling skills
  • Knowledge of cloud application security models
  • Administered cloud service infrastructure (GCP, AWS, Azure)
  • Startup experience

What we offer

  • Equity
  • Generous health benefits
  • Flexible time off policy
  • Paid bonding time for all new parents
  • Traditional and Roth 401k
  • Commuter and FSA benefits
  • Lunch Program
  • Dog friendly office

Similar Jobs for Senior Software Engineer - Observability and Reliability

8 matching positions

Senior Software Engineer - Observability and Reliability

We are growing the engineering team and looking for engineers who have the chops...
Location: United States, San Francisco
Salary: 150000.00 - 220000.00 USD / Year
Sigma Computing
Expiration Date: Until further notice
Requirements
  • Strong Computer Science fundamentals
  • 5+ years industry experience building and maintaining high-quality software, especially software other engineers use
  • You apply a product mindset to infrastructure systems and feel accomplished enabling others
  • Desire to be a great teammate and have fun at work
  • Strong sense of craftsmanship, and a healthy academic curiosity
Job Responsibility
  • Build observability tools and platforms, including: metrics, logging, distributed tracing, dashboarding, alerting, application performance management
  • Build with modern tools and languages like Go, OpenTelemetry, and Kubernetes
  • Participate in on-call rotation and ensure uptime of services
  • Create runtime tools/processes that optimize cloud triaging and limit downtime
  • Define best practices around making our systems and services measurable
  • Collaborate with peers and stakeholders through design and code reviews to ensure best practices amongst available technologies. We expect successful candidates to spend the majority of their time coding.
What we offer
  • Equity
  • Generous health benefits
  • Flexible time off policy
  • Paid bonding time for all new parents
  • Traditional and Roth 401k
  • Commuter and FSA benefits
  • Lunch Program
  • Dog friendly office
  • Fulltime

Senior Software Engineer - Observability and Reliability

We are growing the engineering team and looking for engineers who have the chops...
Location: United States, San Francisco
Salary: 170000.00 - 215000.00 USD / Year
Sigma Computing
Expiration Date: Until further notice
Requirements
  • Strong Computer Science fundamentals
  • 5+ years industry experience building and maintaining high-quality software, especially software other engineers use
  • You apply a product mindset to infrastructure systems and feel accomplished enabling others
  • Desire to be a great teammate and have fun at work
  • Strong sense of craftsmanship, and a healthy academic curiosity
Job Responsibility
  • Build observability tools and platforms, including: metrics, logging, distributed tracing, dashboarding, alerting, application performance management
  • Build with modern tools and languages like Go, OpenTelemetry, and Kubernetes
  • Participate in on-call rotation and ensure uptime of services
  • Create runtime tools/processes that optimize cloud triaging and limit downtime
  • Define best practices around making our systems and services measurable
  • Collaborate with peers and stakeholders through design and code reviews to ensure best practices amongst available technologies. We expect successful candidates to spend the majority of their time coding.
What we offer
  • Equity
  • Generous health benefits
  • Flexible time off policy
  • Paid bonding time for all new parents
  • Traditional and Roth 401k
  • Commuter and FSA benefits
  • Lunch Program
  • Dog friendly office
  • Fulltime

Senior Software Engineer - Observability and Reliability

We are growing the engineering team and looking for engineers who have the chops...
Location: United States, San Francisco
Salary: 150000.00 - 220000.00 USD / Year
Sigma Computing
Expiration Date: Until further notice
Requirements
  • Strong Computer Science fundamentals
  • 5+ years industry experience building and maintaining high-quality software, especially software other engineers use
  • You apply a product mindset to infrastructure systems and feel accomplished enabling others
  • Desire to be a great teammate and have fun at work
  • Strong sense of craftsmanship, and a healthy academic curiosity
Job Responsibility
  • Build observability tools and platforms, including: metrics, logging, distributed tracing, dashboarding, alerting, application performance management
  • Build with modern tools and languages like Go, OpenTelemetry, and Kubernetes
  • Participate in on-call rotation and ensure uptime of services
  • Create runtime tools/processes that optimize cloud triaging and limit downtime
  • Define best practices around making our systems and services measurable
  • Collaborate with peers and stakeholders through design and code reviews to ensure best practices amongst available technologies. We expect successful candidates to spend the majority of their time coding.
What we offer
  • Equity
  • Generous health benefits
  • Flexible time off policy
  • Paid bonding time for all new parents
  • Traditional and Roth 401k
  • Commuter and FSA benefits
  • Lunch Program
  • Dog friendly office
  • Stock options
  • Fulltime

Senior Software Engineer and Software Engineer II

OneDrive and SharePoint are rapidly growing services at the center of Microsoft'...
Location: United States, Redmond
Salary: 100600.00 - 199000.00 USD / Year
Microsoft Corporation
Expiration Date: Until further notice
Requirements
  • Bachelor's Degree in Computer Science or related technical field AND 2+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
  • Master's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR Bachelor's Degree in Computer Science or related technical field AND 8+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
  • Experience related to cloud-scale distributed design and patterns
  • The ability to deliver informed designs and plans ahead of production and execution
  • Knowledge of others' expertise and the ability to involve multiple players (within and outside the organization) in the creation or development of novel products, processes, or research streams
Job Responsibility
  • Design and deliver systems that enable partners and ISVs to migrate from other cloud providers, improve core systems performance and efficiencies, and ensure zero customer impact throughout the change management cycle
  • Deliver systems to meet our business continuity planning goals, provide telemetry for optimizing the service, and drive down our response time for detecting and resolving service issues
  • Create, implement, optimize, debug, refactor, and reuse code to establish and improve performance and maintainability, effectiveness, and return on investment (ROI)
  • Contribute to the identification of dependencies and the development of design documents for a product area with little oversight
  • Help identify other teams and technologies that will be leveraged, how they will interact, and when one's system may provide support to others
  • Contribute to determining back-end dependencies associated with product, application, service, or platform functionality for product features
  • Understand downstream effects of solutions and work provided
  • Help identify areas of dependency and overlap with other teams or team members and drive coordination
  • Remain current in skills by investing time and effort into staying abreast of current developments that will improve the availability, reliability, efficiency, observability, and performance of products while also driving consistency in monitoring and operations at scale
  • Review work items to deepen knowledge of product features in partnership with appropriate stakeholders (e.g., project managers) and execute project plans, release plans, and work items
  • Fulltime

Senior Software Engineer, Observability

You will work on core observability systems (metrics, logs, traces) while also d...
Location: India, Bengaluru
Salary: Not provided
Roku
Expiration Date: Until further notice
Requirements
  • 8+ years in software engineering, building distributed, high-throughput systems or observability platforms
  • 4+ years of Go/Golang experience; our observability ecosystem is built on Go, making it the most effective language for this role
  • Experience with, or strong interest in, observability tools (Prometheus, Grafana, Loki, Tempo, ELK/OpenSearch, ClickHouse) and standards (OpenTelemetry, OpenTracing, OpenMetrics)
  • Deep understanding of distributed systems and data models
  • Hands-on experience with Kubernetes and cloud platforms (AWS, GCP, Azure)
Job Responsibility
  • Extend and integrate open-source observability systems, and when necessary, structurally overhaul core components, such as storage layers and query paths, to enhance the performance, reliability, and usability of these tools at scale
  • Build services to improve performance, usability, reliability, and cost efficiency
  • Implement features like pre-aggregation, downsampling, and sampling to reduce load and accelerate queries
  • Create developer-facing capabilities for metrics, logs, and traces usage, data quality, and cost management
  • Automate onboarding, dashboards, alerting, and tracing
  • Collaborate across platform and infrastructure teams to integrate observability into Roku’s cloud-native stack
What we offer
  • Global access to mental health and financial wellness support and resources
  • Healthcare (medical, dental, and vision)
  • Life, accident, disability, commuter, and retirement options (401(k)/pension)
  • Fulltime

Senior Software Engineer - Observability

As a Senior Software Engineer, you will be directly responsible for Palantir’s o...
Location: United States, New York
Salary: 135000.00 - 200000.00 USD / Year
Palantir Technologies
Expiration Date: Until further notice
Requirements
  • 5+ years of professional software development experience
  • 2+ years of experience contributing to the system design or architecture (architecture, design patterns, reliability and scaling) of new and existing systems
  • 1+ years of experience as a mentor or tech lead, or leading an engineering team
  • Strong coding skills in Go, Java, or equivalent
  • Experience designing, building, and operating high-scale observability or infrastructure systems
  • Bachelor's degree in Computer Science or equivalent
  • Active US Security clearance, or eligibility and willingness to obtain a US Security clearance
Job Responsibility
  • Partner with our extended leadership team to set and define a technical strategy for your team aligned with the wider team strategy
  • Build and champion a long-term tech roadmap to reduce operational burden, ensure scalability, reduce risk, and guide your team towards step-changes whenever possible
  • Be technically involved and engage in substantive discussion when reviewing technical roadmaps and project implementation with the team
  • Work closely with teammates and stakeholders to enable sustainable and timely delivery of technical solutions to address business needs
  • Facilitate partnerships between engineering teams and operators to build innovative products that help Palantir scale
  • Act as a multiplier for other engineers on the team. Define where the technical bar should be, and help engineers achieve it. Lead engineers and accelerate their growth by providing thoughtful feedback and technical mentorship and by effectively managing performance
  • Foster a non-hierarchical exchange of ideas, valuing the idea rather than the individual who communicates it
What we offer
  • Employees (and their eligible dependents) can enroll in medical, dental, and vision insurance as well as voluntary life insurance
  • Employees are automatically covered by Palantir’s basic life, AD&D and disability insurance
  • Commuter benefits
  • Relocation assistance
  • Take-what-you-need paid time off, not accrual-based
  • 2 weeks paid time off built into the end of each year (subject to team and business needs)
  • 10 paid holidays throughout the calendar year
  • Supportive leave of absence program including time off for military service and medical events
  • Paid leave for new parents and subsidized back-up care for all parents
  • Fertility and family building benefits including but not limited to adoption, surrogacy, and preservation
  • Fulltime

Senior Software Engineer - Infrastructure Reliability

We are seeking a Senior Software Engineer to join our Security Product team, foc...
Location: India, Bangalore
Salary: Not provided
JFrog
Expiration Date: Until further notice
Requirements
  • 7+ years of experience in software engineering, with at least 3 years focused on debugging and solving infrastructure-level problems in distributed systems
  • Strong proficiency in Go; familiarity with Python and Helm is a plus
  • Deep hands-on experience with RabbitMQ or similar message brokers (Kafka, ActiveMQ) - including queue management, clustering, monitoring, and production troubleshooting
  • Solid working knowledge of Kubernetes (pod lifecycle, resource management, networking, debugging CrashLoopBackOff / OOMKilled scenarios) and Docker
  • Experience investigating production incidents and conducting post-incident reviews with clear root cause analysis and follow-through
  • Strong understanding of Linux systems, networking fundamentals, and cloud infrastructure (AWS, Azure, or GCP)
  • Ability to read and interpret logs, thread dumps, heap dumps, and system metrics to isolate root causes under time pressure
  • Excellent analytical and problem-solving skills with a methodical approach to debugging
  • Strong written and verbal communication skills - ability to produce clear incident reports, root cause analyses, and playbooks, and to communicate effectively across engineering, SRE, and customer-facing teams
Job Responsibility
  • Investigate system outages and production failures across customer environments (SaaS and self-hosted), spanning RabbitMQ, Kubernetes, Docker, Postgres, and cloud infrastructure (AWS, Azure, GCP)
  • Identify recurring failure patterns and systemic weaknesses from incident data, and drive them to resolution - whether by writing Go code yourself (resilience features, infrastructure fixes, observability) or by collaborating with service owners to prioritize and address reliability gaps
  • Lead and participate in post-incident reviews - document root causes, corrective actions, and follow through to ensure issues are properly resolved
  • Collaborate with production engineering and SRE teams to develop and maintain operational playbooks and runbooks that reduce time-to-resolution
  • Diagnose root causes across the full stack - message queue failures, container lifecycle issues, cloud networking, disk and memory pressure, and deployment topology mismatches
  • Design and implement data migrations and lifecycle management for infrastructure components such as queue management and vhost operations
  • Emit and monitor operational metrics to proactively detect infrastructure degradation and measure service reliability

Senior and Principal Software Engineer - Core AI

Core AI is at the forefront of Microsoft’s mission to redefine how software is b...
Location: United States, Redmond
Salary: 139900.00 - 274800.00 USD / Year
Microsoft Corporation
Expiration Date: Until further notice
Requirements
  • Bachelor's Degree in Computer Science or related technical field AND 4+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
  • Ability to meet Microsoft, customer and/or government security screening requirements is required for this role
  • Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud background check upon hire/transfer and every two years thereafter
  • Bachelor's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C++, C#, Go, Java, or Python OR equivalent experience
  • 6+ years technical engineering experience designing and delivering highly available, large-scale cloud services and distributed systems
  • Experience building AI or ML related applications
  • 1+ years of technical engineering experience with machine learning or Artificial Intelligence (AI) systems
Job Responsibility
  • Design, implement and deliver AI services to support product offerings for large-scale agent observability
  • Collaborate closely with product management and partner teams to align technical direction with business goals
  • Take end-to-end responsibility for the development lifecycle and production readiness of the services you build and drive the team’s DevOps culture
  • Engage with customers to gather feedback and resolve complex issues
  • Understand Microsoft businesses and collaborate with stakeholders towards cohesive, end-to-end experiences for Microsoft customers
  • Innovate on technical solutions, and patterns that will improve the availability, reliability, efficiency, observability, and performance of products
  • Embody our culture and values
  • Fulltime