Senior Software Engineer - Observability and Reliability Job at Sigma Computing (New York City)

Senior Software Engineer - Observability and Reliability

We are growing the engineering team and looking for engineers who have the chops...

Location

United States, San Francisco

Salary:

150000.00 - 220000.00 USD / Year

Sigma Computing

Expiration Date

Until further notice

Requirements

Strong Computer Science fundamentals
5+ years industry experience building and maintaining high-quality software, especially software other engineers use
You apply a product mindset to infrastructure systems and feel accomplished enabling others
Desire to be a great teammate and have fun at work
Strong sense of craftsmanship, and a healthy academic curiosity

Job Responsibility

Build observability tools and platforms, including: metrics, logging, distributed tracing, dashboarding, alerting, application performance management
Build with modern tools and languages like Go, OpenTelemetry, and Kubernetes
Participate in on-call rotation and ensure uptime of services
Create runtime tools/processes that optimize cloud triaging and limit downtime
Define best practices around making our systems and services measurable
Collaborate with peers and stakeholders through design and code reviews to ensure best practices across available technologies. We expect successful candidates to spend the majority of their time coding

What we offer

Equity
Generous health benefits
Flexible time off policy
Paid bonding time for all new parents
Traditional and Roth 401k
Commuter and FSA benefits
Lunch Program
Dog friendly office

Fulltime

Senior Software Engineer - Observability and Reliability

We are growing the engineering team and looking for engineers who have the chops...

Location

United States, San Francisco

Salary:

170000.00 - 215000.00 USD / Year

Sigma Computing

Expiration Date

Until further notice

Requirements

Strong Computer Science fundamentals
5+ years industry experience building and maintaining high-quality software, especially software other engineers use
You apply a product mindset to infrastructure systems and feel accomplished enabling others
Desire to be a great teammate and have fun at work
Strong sense of craftsmanship, and a healthy academic curiosity

Job Responsibility

Build observability tools and platforms, including: metrics, logging, distributed tracing, dashboarding, alerting, application performance management
Build with modern tools and languages like Go, OpenTelemetry, and Kubernetes
Participate in on-call rotation and ensure uptime of services
Create runtime tools/processes that optimize cloud triaging and limit downtime
Define best practices around making our systems and services measurable
Collaborate with peers and stakeholders through design and code reviews to ensure best practices across available technologies. We expect successful candidates to spend the majority of their time coding

What we offer

Equity
Generous health benefits
Flexible time off policy
Paid bonding time for all new parents
Traditional and Roth 401k
Commuter and FSA benefits
Lunch Program
Dog friendly office

Fulltime

Senior Software Engineer - Observability and Reliability

We are growing the engineering team and looking for engineers who have the chops...

Location

United States, San Francisco

Salary:

150000.00 - 220000.00 USD / Year

Sigma Computing

Expiration Date

Until further notice

Requirements

Strong Computer Science fundamentals
5+ years industry experience building and maintaining high-quality software, especially software other engineers use
You apply a product mindset to infrastructure systems and feel accomplished enabling others
Desire to be a great teammate and have fun at work
Strong sense of craftsmanship, and a healthy academic curiosity

Job Responsibility

Build observability tools and platforms, including: metrics, logging, distributed tracing, dashboarding, alerting, application performance management
Build with modern tools and languages like Go, OpenTelemetry, and Kubernetes
Participate in on-call rotation and ensure uptime of services
Create runtime tools/processes that optimize cloud triaging and limit downtime
Define best practices around making our systems and services measurable
Collaborate with peers and stakeholders through design and code reviews to ensure best practices across available technologies. We expect successful candidates to spend the majority of their time coding

What we offer

Equity
Generous health benefits
Flexible time off policy
Paid bonding time for all new parents
Traditional and Roth 401k
Commuter and FSA benefits
Lunch Program
Dog friendly office
Stock options

Fulltime

Senior Software Engineer and Software Engineer II

OneDrive and SharePoint are rapidly growing services at the center of Microsoft'...

Location

United States, Redmond

Salary:

100600.00 - 199000.00 USD / Year

Microsoft Corporation

Expiration Date

Until further notice

Requirements

Bachelor's Degree in Computer Science or related technical field AND 2+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
Master's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR Bachelor's Degree in Computer Science or related technical field AND 8+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
Experience related to cloud-scale distributed design and patterns
The ability to deliver informed designs and plans ahead of production and execution
Knowledge of others' expertise and the ability to involve multiple players (within and outside the organization) in the creation or development of novel products, processes, or research streams

Job Responsibility

Design and deliver systems that enable partners and ISVs to migrate from other cloud providers, improve core systems performance and efficiencies, and ensure zero customer impact throughout the change management cycle
Deliver systems to meet our business continuity planning goals, provide telemetry for optimizing the service, and drive down our response time for detecting and resolving service issues
Create, implement, optimize, debug, refactor, and reuse code to establish and improve performance and maintainability, effectiveness, and return on investment (ROI)
Contribute to the identification of dependencies and the development of design documents for a product area with little oversight
Help identify other teams and technologies that will be leveraged, how they will interact, and when one's system may provide support to others
Contribute to determining back-end dependencies associated with product, application, service, or platform functionality for product features
Understand downstream effects of solutions and work provided
Help identify areas of dependency and overlap with other teams or team members and drive coordination
Remain current in skills by investing time and effort into staying abreast of current developments that will improve the availability, reliability, efficiency, observability, and performance of products while also driving consistency in monitoring and operations at scale
Review work items to deepen knowledge of product features in partnership with appropriate stakeholders (e.g., project managers) and execute project plans, release plans, and work items

Fulltime

Senior Software Engineer, Observability

You will work on core observability systems (metrics, logs, traces) while also d...

Location

India, Bengaluru

Salary:

Not provided

Roku

Expiration Date

Until further notice

Requirements

8+ years in software engineering, building distributed, high-throughput systems or observability platforms
4+ years of Go/Golang experience; our observability ecosystem is built on Go, making it the most effective language for this role
Experience with, or strong interest in, observability tools (Prometheus, Grafana, Loki, Tempo, ELK/OpenSearch, Clickhouse) and standards (OpenTelemetry, OpenTracing, OpenMetrics)
Deep understanding of distributed systems and data models
Hands-on experience with Kubernetes and cloud platforms (AWS, GCP, Azure)

Job Responsibility

Extend and integrate open-source observability systems, and when necessary, structurally overhaul core components, such as storage layers and query paths, to enhance the performance, reliability, and usability of these tools at scale
Build services to improve performance, usability, reliability, and cost efficiency
Implement features like pre-aggregation, downsampling, and sampling to reduce load and accelerate queries
Create developer-facing capabilities for metrics, logs, and traces usage, data quality, and cost management
Automate onboarding, dashboards, alerting, and tracing
Collaborate across platform and infrastructure teams to integrate observability into Roku’s cloud-native stack

What we offer

global access to mental health and financial wellness support and resources
healthcare (medical, dental, and vision)
life, accident, disability, commuter, and retirement options (401(k)/pension)

Fulltime

Senior Software Engineer - Observability

As a Senior Software Engineer, you will be directly responsible for Palantir’s o...

Location

United States, New York

Salary:

135000.00 - 200000.00 USD / Year

Palantir Technologies

Expiration Date

Until further notice

Requirements

5+ years of professional software development experience
2+ years of experience contributing to the system design or architecture (architecture, design patterns, reliability and scaling) of new and existing systems
1+ years of experience as a mentor, tech lead, or leader of an engineering team
Strong coding skills in Go, Java, or equivalent
Experience designing, building, and operating high-scale observability or infrastructure systems
Bachelor's degree in Computer Science or equivalent
Active US Security clearance, or eligibility and willingness to obtain a US Security clearance

Job Responsibility

Partner with our extended leadership team to set and define a technical strategy for your team aligned with the wider team strategy
Build and champion a long-term tech roadmap to reduce operational burden, ensure scalability, reduce risk, and guide your team towards step-changes whenever possible
Be technically involved and engage in substantive discussion when reviewing technical roadmaps and project implementation with the team
Work closely with teammates and stakeholders to enable sustainable and timely delivery of technical solutions to address business needs
Facilitate partnerships between engineering teams and operators to build innovative products that help Palantir scale
Act as a multiplier for other engineers on the team. Define where the technical bar should be, and help engineers achieve it. Lead engineers and accelerate their growth by providing thoughtful feedback and technical mentorship and by effectively managing performance
Foster a non-hierarchical exchange of ideas, valuing the idea rather than the individual who communicates it

What we offer

Employees (and their eligible dependents) can enroll in medical, dental, and vision insurance as well as voluntary life insurance
Employees are automatically covered by Palantir’s basic life, AD&D and disability insurance
Commuter benefits
Relocation assistance
Take-what-you-need paid time off, not accrual-based
2 weeks paid time off built into the end of each year (subject to team and business needs)
10 paid holidays throughout the calendar year
Supportive leave of absence program including time off for military service and medical events
Paid leave for new parents and subsidized back-up care for all parents
Fertility and family building benefits including but not limited to adoption, surrogacy, and preservation

Fulltime

Senior Software Engineer - Infrastructure Reliability

We are seeking a Senior Software Engineer to join our Security Product team, foc...

Location

India, Bangalore

Salary:

Not provided

JFrog

Expiration Date

Until further notice

Requirements

7+ years of experience in software engineering, with at least 3 years focused on debugging and solving infrastructure-level problems in distributed systems
Strong proficiency in Go; familiarity with Python and Helm is a plus
Deep hands-on experience with RabbitMQ or similar message brokers (Kafka, ActiveMQ) - including queue management, clustering, monitoring, and production troubleshooting
Solid working knowledge of Kubernetes (pod lifecycle, resource management, networking, debugging CrashLoopBackOff / OOMKilled scenarios) and Docker
Experience investigating production incidents and conducting post-incident reviews with clear root cause analysis and follow-through
Strong understanding of Linux systems, networking fundamentals, and cloud infrastructure (AWS, Azure, or GCP)
Ability to read and interpret logs, thread dumps, heap dumps, and system metrics to isolate root causes under time pressure
Excellent analytical and problem-solving skills with a methodical approach to debugging
Strong written and verbal communication skills - ability to produce clear incident reports, root cause analyses, and playbooks, and to communicate effectively across engineering, SRE, and customer-facing teams

Job Responsibility

Investigate system outages and production failures across customer environments (SaaS and self-hosted), spanning RabbitMQ, Kubernetes, Docker, Postgres, and cloud infrastructure (AWS, Azure, GCP)
Identify recurring failure patterns and systemic weaknesses from incident data, and drive them to resolution - whether by writing Go code yourself (resilience features, infrastructure fixes, observability) or by collaborating with service owners to prioritize and address reliability gaps
Lead and participate in post-incident reviews - document root causes, corrective actions, and follow through to ensure issues are properly resolved
Collaborate with production engineering and SRE teams to develop and maintain operational playbooks and runbooks that reduce time-to-resolution
Diagnose root causes across the full stack - message queue failures, container lifecycle issues, cloud networking, disk and memory pressure, and deployment topology mismatches
Design and implement data migrations and lifecycle management for infrastructure components such as queue management and vhost operations
Emit and monitor operational metrics to proactively detect infrastructure degradation and measure service reliability

Senior and Principal Software Engineer - Core AI

Core AI is at the forefront of Microsoft’s mission to redefine how software is b...

Location

United States, Redmond

Salary:

139900.00 - 274800.00 USD / Year

Microsoft Corporation

Expiration Date

Until further notice

Requirements

Bachelor's Degree in Computer Science or related technical field AND 4+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
Ability to meet Microsoft, customer and/or government security screening requirements is required for this role
Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud background check upon hire/transfer and every two years thereafter
Bachelor's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C++, C#, Go, Java, or Python OR equivalent experience
6+ years technical engineering experience designing and delivering highly available, large-scale cloud services and distributed systems
Experience building AI or ML related applications
1+ years of technical engineering experience with machine learning or Artificial Intelligence (AI) systems

Job Responsibility

Design, implement and deliver AI services to support product offerings for large-scale agent observability
Collaborate closely with product management and partner teams to align technical direction with business goals
Take end-to-end responsibility for the development lifecycle and production readiness of the services you build and drive the team’s DevOps culture
Engage with customers to gather feedback and resolve complex issues
Understand Microsoft businesses and collaborate with stakeholders towards cohesive, end-to-end experiences for Microsoft customers
Innovate on technical solutions, and patterns that will improve the availability, reliability, efficiency, observability, and performance of products
Embody our culture and values

Fulltime
