Senior Software Engineer, Site Reliability Job at Babylist

Senior Software Engineer - Site Reliability Engineer/SRE

Software-defined vehicles represent a new paradigm for automakers and consumers,...

Location

United States, Sunnyvale

Salary:

152100.00 - 232900.00 USD / Year

General Motors

Expiration Date

Until further notice

Requirements

Azure experience is a must
5+ years of hands-on DevOps experience with at least one of the public cloud providers - Azure (preferred), AWS, GCP
Excellent skills with Terraform
Experience with monitoring and log aggregation tools such as Logstash, Splunk, Datadog, Elasticsearch, and Kibana
Strong CS fundamentals, including OO concepts, data structures, algorithms, and distributed systems
Experience with Bash, PowerShell, Python, Go, or Groovy
On-call and fire-fighting experience
Experience with modern site reliability practices, including but not limited to postmortems, SLOs/SLIs, tracing, and synthetic monitoring

What we offer

This position may be eligible for relocation benefits

Fulltime

Software engineer 2 / Senior Software engineer - Azure Data

Microsoft's Azure Data engineering team is leading the transformation of analyti...

Location

India, Bangalore

Salary:

Not provided

Microsoft Corporation

Expiration Date

Until further notice

Requirements

Bachelor's Degree in Computer Science or related technical field AND 3+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python, OR equivalent experience
Experience with the Azure stack including Storage, Compute, Networking, Fabric, Purview, Synapse, AKS, DevOps, Data Factory, or Power BI
Experience with big data technologies such as Spark, Kafka, Hadoop, or HBase
Experience building data lake or data engineering products, tools, or pipelines
Familiarity with container-based architectures (Docker, Kubernetes)
Ability to debug complex distributed systems on Linux and/or Windows platforms

Job Responsibility

Write extensible, maintainable code in C#, Java, Scala, or Python for Fabric Materialized Lake View services and HDInsight components
Use AI tools and coding best practices across the development lifecycle
Design data refresh, scheduling, and query optimisation features with minimal supervision
Review code from teammates for correctness, test coverage, security risks, and adherence to team standards
Coach junior engineers through code reviews
Debug complex issues in distributed systems running on Azure, Linux, and Windows
Run live site operations on a rotational, on-call basis
Integrate logging and instrumentation to gather telemetry on system health, performance, reliability, and security
Work with product managers, technical leads, and partners across geographies to define customer requirements for Materialized Lake View features

Fulltime

Software Engineer II & Senior Software Engineer

Security represents the most critical priorities for our customers in a world aw...

Location

United States, Redmond

Salary:

100600.00 - 199000.00 USD / Year

Microsoft Corporation

Expiration Date

Until further notice

Requirements

Bachelor's Degree in Computer Science or related technical field AND 2+ years technical engineering experience with coding in languages including C, C++, C#, or Python, OR equivalent experience
Ability to meet Microsoft, customer and/or government security screening requirements
Microsoft Cloud Background Check
Master's Degree in Computer Science or related technical field AND 3+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR Bachelor's Degree in Computer Science or related technical field AND 5+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
Experience troubleshooting and optimizing automation, reliability, and monitoring for live site operations as part of an on-call rotation owned by the engineering team
Experience with distributed systems, messaging systems such as Kafka, and large-scale system design

Job Responsibility

Lead the architecture, design and implementation of services for extremely high scale, throughput, durability, and low latency
Innovate to make service deployment and maintenance an efficient, well-oiled machine that provides excellent reliability with minimal manual engineering intervention
Conduct in-depth triage, troubleshooting, and forensics across all facets of the cloud stack while executing processes for corrective action and continual service improvement
Drive infrastructure security improvements for mission-critical, high-scale workloads
Lead the definition of requirements, KPIs, priorities and planning of engineering deliverables
Mentor and grow an energetic, diverse, and driven team with a good mix of senior and mid-level engineers

Fulltime

Senior Software Engineer and Principal Software Engineer - PowerPoint AI Team

The PowerPoint team is embarking on an exciting new chapter - evolving a product...

Location

United States, Redmond

Salary:

119800.00 - 234700.00 USD / Year

Microsoft Corporation

Expiration Date

Until further notice

Requirements

Bachelor's Degree in Computer Science or related technical field AND 4+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
Ability to meet Microsoft, customer and/or government security screening requirements is required for this role
Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud background check upon hire/transfer and every two years thereafter
Bachelor's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
8+ years of experience in backend service engineering, including work on high-scale infrastructures
Proficiency in one or more systems programming languages such as C#, C++
1+ years of experience in software engineering, designing and developing systems (and APIs) that deploy and integrate with AI models
2+ years of experience working with rich telemetry, making data driven decisions, and carrying out rapid experimentation
2+ years of experience building software for scale, performance, and reliability
Academic or industry experience building, fine-tuning, or deploying models (any category), or building eval-driven systems that use them

Job Responsibility

Lead design and delivery of complex, scalable AI features ensuring resilience and exceptional user experience
Drive technical strategy and architecture decisions across multiple services, influencing partner teams and aligning with compliance and security requirements
Champion modern engineering practices, including AI-driven approaches, automation, and cloud-native patterns, across the full development lifecycle
Mentor and guide engineers, fostering technical excellence and continuous improvement in security, reliability, and performance
Collaborate cross-org to solve challenging technical problems, streamline processes, and reduce operational costs while improving live-site health
Design and implement scalable backend services optimized for machine learning workflows and large language model integration
Develop and maintain evaluation-driven systems that leverage text and multimodal inputs (e.g., images) to power visual-creation experiences
Build and optimize APIs and infrastructure to support high-performance model inference and experimentation at scale
Collaborate with product, ML, and design teams to integrate models into user-facing features, ensuring seamless functionality and performance
Conduct model evaluations and experiments, analyze results, and iterate on improvements to enhance accuracy and user experience

Fulltime

Senior Software Engineer and Principal Software Engineer

We are building a planet-scale multi-modal database and infrastructure for execu...

Location

United States, Redmond

Salary:

119800.00 - 234700.00 USD / Year

Microsoft Corporation

Expiration Date

Until further notice

Requirements

Bachelor's Degree in Computer Science or related technical field AND 4+ years technical engineering experience with coding in languages including, but not limited to, C++, C#, or Java, OR equivalent experience
Bachelor's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C++, C#, or Java, OR equivalent experience
Experience in shipping products and scalable, reliable services
Currently programming/coding in your current or most recent role
Hands-on experience with asynchronous programming and concurrency (threads, tasks, futures, async/await)
Experience with Azure Kubernetes Service (AKS), Amazon Elastic Kubernetes Service (EKS), and/or Google Kubernetes Engine (GKE)
Experience building database engines, query engines, and indexing solutions (columnar, full-text, vector) at scale
Experience programming CUDA and AI systems at scale

Job Responsibility

Independently execute in the face of ambiguity
Leads identification of dependencies and the development of design documents for a product, application, service, or platform
Writes efficient systems code and debugs distributed systems
Holds accountability as a Designated Responsible Individual (DRI), mentoring engineers across products/solutions, working on-call to monitor system/product/service for degradation, downtime, or interruptions

Fulltime

Senior Site Reliability Engineer

The Senior Site Reliability Engineer establishes and maintains the infrastructur...

Location

United Kingdom; United States; Canada

Salary:

Not provided

Mozilla

Expiration Date

Until further notice

Requirements

7+ years of experience in infrastructure, platform engineering, or site reliability roles, including hands-on production Kubernetes experience in workload operations, troubleshooting, and cluster management
Hands-on experience with infrastructure-as-code on AWS using Terraform, OpenTofu, or Pulumi
Security awareness in day-to-day infrastructure work: identity, least privilege, secrets hygiene, and network controls
Demonstrated ownership mindset with the ability to proactively identify issues, drive work to completion, and communicate risks early
Excellent async written communication skills; comfortable working with a geographically distributed team
Ability to collaborate effectively with software engineers and non-engineering stakeholders to improve platform reliability and operational efficiency
Ability to learn, evaluate, and responsibly use emerging technologies, including AI-enabled tools, to improve work processes

Job Responsibility

Operate and evolve our EKS-based Kubernetes platform, supporting service migrations, platform improvements, and reliability initiatives
Design and develop CI/CD systems supporting websites, services, and Thunderbird desktop releases, contributing to pipeline reliability and OIDC-based authentication across GitHub Actions workflows
Write and maintain infrastructure in Pulumi and/or Terraform/OpenTofu across multiple AWS accounts
Operate and evolve our observability stack (VictoriaMetrics, VictoriaLogs, Grafana, Vector) and partner with engineering teams to incorporate instrumentation and monitoring into service design
Apply security-conscious infrastructure practices, including least-privilege IAM, secrets management via AWS Secrets Manager and External Secrets Operator, and network segmentation
Diagnose and debug production incidents; drive root-cause analysis and post-incident improvements to prevent recurring problems
Participate in on-call rotation and collaborate with SDEs and fellow SREs to ship, maintain, and monitor new builds and support service onboarding
Contribute to runbooks, architecture documentation, and team processes

What we offer

Fully remote work & schedule flexibility
Company-provided laptop
Annual bonus program
Monthly remote work stipend
Annual professional development stipend
Industry conferences
Company all-hands and team gatherings
24 days PTO per year (prorated)
Birthday
Year-end company shutdown

Fulltime

Senior Site Reliability Engineer

Are you interested in working on cutting-edge cloud security products? Would you ...

Location

United States, Redmond

Salary:

119800.00 - 234700.00 USD / Year

Microsoft Corporation

Expiration Date

Until further notice

Requirements

Master's Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in software engineering, network engineering, or systems administration OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 4+ years technical experience in software engineering, network engineering, or systems administration OR equivalent experience
Candidates must have an active TS and be willing and eligible to upgrade to TS/SCI (with polygraph) or have an active TS/SCI and be willing and eligible to upgrade to TS/SCI (with polygraph). This role will require candidates to maintain the TS/SCI (with polygraph) clearance
Ability to meet Microsoft, customer and/or government security screening requirements is required pre-offer and post-hire for this role
Failure to maintain or obtain the appropriate clearance and/or customer screening requirements may result in employment action up to and including termination
This position requires successful verification of the stated security clearance to meet federal government customer requirements
This position will be required to pass the Microsoft Cloud background check upon hire/transfer and every two years thereafter

Job Responsibility

Ensure 24x7 Service Reliability: Act as a Designated Responsible Individual (DRI) in an on-call rotation, leading incident response and resolution to maintain uptime and performance for Microsoft's most critical services
Support and Automate Deployments: Execute and improve manual operations and deployments for our products, while designing automation to scale and streamline those processes across environments
Build Scalable Systems: Develop automation for monitoring, alerting, debugging, and deployment to reduce manual effort and accelerate safe, reliable delivery
Drive Compliance and Security: Ensure systems meet Microsoft's standards for security, privacy, and accessibility, especially when onboarding new technologies
Lead Post-Incident Learning: Conduct postmortems, share insights, and implement solutions that prevent recurrence—fostering a culture of learning and continuous improvement
Collaborate Across Teams: Partner with engineering and product teams to align reliability goals with customer needs and deliver seamless user experiences
Stay Ahead Technically: Continuously invest in your technical growth to improve system availability, observability, and performance at scale
Embody our company's Culture and Values

Fulltime

Senior Site Reliability Engineer

Do you want to be at the heart of cloud computing? The Compute team is at the co...

Location

United States, Reston

Salary:

119800.00 - 234700.00 USD / Year

Microsoft Corporation

Expiration Date

Until further notice

Requirements

Master's Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in software engineering, network engineering, or systems administration OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 4+ years technical experience in software engineering, network engineering, or systems administration OR equivalent experience
Active U.S. Government Top Secret Clearance with access to Sensitive Compartmented Information (SCI) based on a Single Scope Background Investigation (SSBI) with Polygraph
Ability to meet Microsoft, customer and/or government security screening requirements
Must pass Microsoft Cloud background check upon hire/transfer and every two years thereafter
Verification of U.S. citizenship due to citizenship-based legal restrictions

Job Responsibility

Acts as a Designated Responsible Individual (DRI) and guides other engineers by developing and following the playbook, working on call to monitor the system/product/service for degradation, downtime, or interruptions, alerting stakeholders about status, and initiating actions to restore the system/product/service for simple and complex problems when appropriate
Proactively seeks new knowledge and adapts to new trends, technical solutions, and patterns that will improve the availability, reliability, efficiency, observability, and performance of service fabric services while also driving consistency in monitoring and operations at scale
Drives development of design documents for a product, application, service, or platform
Creates, implements, optimizes, debugs, refactors, and reuses code to establish and improve performance and maintainability, effectiveness, and return on investment (ROI)
Leverages subject-matter expertise of product features and partners with appropriate stakeholders to drive a workgroup's project plans, release plans, and work items
Take full ownership of assigned services, actively contributing to their enhancement across all cloud environments
Identify opportunities for automation and optimization within the cloud to better support customers

What we offer

Certain roles may be eligible for benefits and other compensation

Fulltime
