Senior Software Engineer - Chaos Engineering

Microsoft Corporation

Location:
United States, Redmond

Contract Type:
Not provided

Salary:

119800.00 - 234700.00 USD / Year

Job Description:

The High Availability (HA) team, part of M365 Core, is seeking a Senior Software Engineer - Chaos Engineering. This role is crucial because HA has been a cornerstone of the Substrate backend solution. We continue to explore opportunities for improving and optimizing service reliability. Our continuous drive to provide the best service to our customers goes beyond optimizing the storage stack. We work relentlessly on reducing Microsoft's capital and operational expenses as we explore more paths for optimization while maintaining reliable 4.5 9s availability. To achieve that, HA has extended its charter beyond traditional database availability and redundancy solutions toward optimizing power efficiency, platform costs, and networking costs. The latter will be the major focus of a talented engineer who decides to join our team.

Chaos Engineering is the discipline of experimenting on a system to build confidence in the system’s capability to withstand turbulent conditions in production. As part of the Chaos team in HA, you will work closely with partners (Azure, Exchange Online (EXO), and Microsoft Research (MSR)) to build the next generation of the Chaos platform for Substrate. The platform will validate the resilience, architecture choices, predictability, and even the monitoring and incident response processes of critical components in M365 distributed systems.

Microsoft’s mission is to empower every person and every organization on the planet to achieve more. As employees we come together with a growth mindset, innovate to empower others, and collaborate to realize our shared goals. Each day we build on our values of respect, integrity, and accountability to create a culture of inclusion where everyone can thrive at work and beyond.

Job Responsibility:

  • Own feature projects that directly impact the behavior of the High Availability component of Exchange Online (EXO), which reliably provides 4.5 9s of availability
  • Write production, monitoring, and test code, create reports, and conduct performance analysis of the storage engine, database replication, and networking layer
  • Research Chaos experiments, identifying opportunities for testing and operational readiness of critical service components
  • Engage with EXO, Azure, and MSR partners to build interfaces for a modern Chaos experience, improve service resilience, improve predictability and observability of M365 distributed systems
  • Embody our Culture and Values

Requirements:

  • Bachelor's Degree in Computer Science or related technical field AND 4+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python - OR equivalent experience
  • 3+ years of software design and development experience with backend services
  • Ability to meet Microsoft, customer and/or government security screening requirements is required for this role. These requirements include but are not limited to the following specialized security screenings: Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud background check upon hire/transfer and every two years thereafter.

Nice to have:

  • Cloud and services experience
  • Azure cloud experience is a plus
  • Experience writing services and micro-services on middle- or back-end tier
  • Experience with networking layer optimization and tuning, deploying and maintaining large scale cluster products, defining and testing performance characteristics of backend solutions
  • Analytical skills with systematic and structured approach to software design
  • Experience building reliable and well-tested code.

Additional Information:

Job Posted:
February 10, 2026

Employment Type:
Fulltime
Work Type:
Hybrid work

Similar Jobs for Senior Software Engineer - Chaos Engineering

Senior Software Quality Engineer (SDET)

We are looking for a highly skilled Senior Software Quality Engineer (SDET) to l...
Location:
United States, Mountain View
Salary:
210000.00 - 257000.00 USD / Year
EarnIn
Expiration Date
Until further notice
Requirements:
  • 4+ years of experience in web and mobile testing, with a strong emphasis on test automation
  • Proven expertise in designing and maintaining scalable test automation frameworks
  • Hands-on experience with mobile testing frameworks such as XCUITest (iOS) and Espresso (Android), and web frameworks like Playwright
  • Strong understanding of testing across microservices, APIs, and distributed systems
  • Ability to analyze and debug complex test failures, automation issues, and defects efficiently
  • Familiarity with generative AI applications in quality engineering (test case generation, API contract validation, log intelligence, etc.)
  • Passion for leveraging AI to reduce manual effort, increase coverage, and accelerate release cycles
  • Proven experience supporting weekly release cycles with a mix of manual and automated regression testing
  • Strong analytical, debugging, and problem-solving skills
  • Experience collaborating with global teams across multiple time zones
Job Responsibility:
  • Lead the design, development, and execution of comprehensive test plans and test cases across frontend (web & mobile), backend services, APIs, and databases
  • Implement industry best practices in manual and automated testing to ensure exceptional product quality, reliability, scalability, and performance
  • Identify, document, and track software defects and inconsistencies with a data-driven, proactive approach to prevention and continuous improvement
  • Introduce and operationalize AI-based testing techniques
  • Integrate AI code analysis, anomaly detection, and observability insights into quality workflows to improve speed, coverage, and accuracy
  • Evaluate and implement emerging AI-driven QA tools to evolve the quality engineering ecosystem
  • Champion an AI-first quality culture by promoting experimentation, learning, and collaboration across engineering teams
  • Design, build, optimize, and maintain scalable automation frameworks using Playwright, Appium, Espresso, XCUITest, REST Assured, and other relevant tools
  • Integrate automated tests into CI/CD pipelines (Jenkins, GitHub Actions, etc.) to ensure fast, reliable, and safe deployments and releases on both apps (iOS and Android) and services
  • Build tooling that empowers developers with self-service test execution, reporting, and analysis
What we offer:
  • equity
  • healthcare
  • internet/cell phone reimbursement
  • learning and development stipend
  • Fulltime

Senior Software Engineer, SRE

Abridge’s services and engineering team are in hyperscale mode. We are looking f...
Location:
United States, SF Office, NYC Office
Salary:
210800.00 - 248000.00 USD / Year
Abridge
Expiration Date
Until further notice
Requirements:
  • 8+ years of software engineering experience focused on distributed systems or tooling, with an interest in engineering enablement and software scaling
  • At least 2 years experience as a back-end engineer focused on system performance and scalability
  • Experience reducing latency in software by multiples through leveraging observability and profiling tools
  • Experience building on Kubernetes and scaling compute services on Kubernetes
  • Experience with related cloud native technologies including ArgoCD, Argo Rollouts, Istio, etc.
  • Comfortable implementing and securing services in Google Cloud Platform with Infrastructure as Code, including GCP Projects, VPC Networks, Google Kubernetes Engine, and IAM Roles, Groups and policies
  • Experience building software with backend languages (e.g. Python, GoLang, Node, and Rust)
  • Experience monitoring distributed systems with Prometheus, OpenTelemetry Collector, and Grafana (or something similar), including metrics collection, visualization, alerting, and using observability data to drive performance optimizations
  • Passion for engineering enablement and solving software and distributed systems scaling challenges under pressure
  • Must be willing to travel up to 10%
Job Responsibility:
  • Leverage load testing, chaos engineering, and other test practices to identify performance and latency bottlenecks across all of our systems, and make changes to application code to resolve them
  • Drive software changes that can rehome applications at the code level onto new infrastructure (run times, event driven infrastructure, databases, and more) in order to dramatically improve scalability as well as enable multi-tenant deployments
  • Identify and implement software configuration changes and performance tuning parameters that will dramatically improve performance and scalability
  • Build developer tools and software modules that help engineers build code faster and more effectively with more enablements to the entire engineering organization
  • Work with the Platform team to develop, and application teams to adopt, emerging elements of our internal developer platform, such as service templates and self-serve infrastructure
  • Work with application teams to establish and adopt SLOs and error budgets, and drive better metrics for application health that can drive automated canary releases, improved health monitoring, and better engineering practices
  • Uplevel our ability to respond to incidents by improving observability, runbooks, and incident response muscle across the organization
  • Evangelize, document, and train the engineering team on the solutions being built and uplevel them on cloud native design strategies and tools
  • Be a public evangelist for Abridge in the global platform engineering community, including conferences, open source, and research as we pioneer new AI-first cloud-native-first security-first implementations at scale
What we offer:
  • Generous Time Off: 14 paid holidays, flexible PTO for salaried employees, and accrued time off for hourly employees
  • Comprehensive Health Plans: Medical, Dental, and Vision coverage for all full-time employees and their families
  • Generous HSA Contribution: If you choose a High Deductible Health Plan, Abridge makes monthly contributions to your HSA
  • Paid Parental Leave: Generous paid parental leave for all full-time employees
  • Family Forming Benefits: Resources and financial support to help you build your family
  • 401(k) Matching: Contribution matching to help invest in your future
  • Personal Device Allowance: Tax free funds for personal device usage
  • Pre-tax Benefits: Access to Flexible Spending Accounts (FSA) and Commuter Benefits
  • Lifestyle Wallet: Monthly contributions for fitness, professional development, coworking, and more
  • Mental Health Support: Dedicated access to therapy and coaching to help you reach your goals
  • Fulltime

Senior Software Engineer

Wells Fargo is seeking a Senior Software Engineer.
Location:
India, Hyderabad
Salary:
Not provided
Wells Fargo
Expiration Date
February 12, 2026
Requirements:
  • 4+ years of Software Engineering experience, or equivalent demonstrated through one or a combination of the following: work experience, training, military experience, education
  • Strong ability to independently develop automation scripts using tools such as Selenium and RPA frameworks.
  • Proficiency in creating and executing performance test scripts using LoadRunner, Performance Center, JMeter, BlazeMeter, or similar tools.
  • Hands‑on experience designing and analyzing load, stress, soak, break, and chaos testing scenarios.
  • Working knowledge of CI/CD practices with experience implementing Jenkins pipelines.
  • Familiarity with service virtualization tools and techniques to simulate system dependencies.
  • Strong capability to analyze performance bottlenecks using monitoring and diagnostic tools and provide detailed root cause analysis.
  • Proficiency with APM and observability tools such as AppDynamics, Splunk, Elastic, Dynatrace, JFR, JMC, or MAT.
  • Solid Linux skills, including log analysis, file operations, process monitoring, and system resource evaluation.
  • General understanding of AI concepts and their relevance to performance engineering.
Job Responsibility:
  • Lead moderately complex initiatives and deliverables within technical domain environments
  • Contribute to large scale planning of strategies
  • Design, code, test, debug, and document for projects and programs associated with technology domain, including upgrades and deployments
  • Review moderately complex technical challenges that require an in-depth evaluation of technologies and procedures
  • Resolve moderately complex issues and lead a team to meet existing clients' needs or potential new clients' needs while leveraging solid understanding of the function, policies, procedures, or compliance requirements
  • Collaborate and consult with peers, colleagues, and mid-level managers to resolve technical challenges and achieve goals
  • Lead projects and act as an escalation point, provide guidance and direction to less experienced staff
  • Contribute to developing engineering standards and companywide best practices for building large‑scale, complex technology solutions.
  • Design, develop, test, debug, and document automation and performance engineering components across high‑volume, distributed applications.
  • Influence engineering direction by evaluating emerging technologies and applying industry best practices to drive new initiatives.
  • Fulltime

Senior Engineering Manager, SRE

Abridge’s services and engineering teams are in hyperscale mode, and multiplying...
Location:
United States, San Francisco; New York; Pittsburgh
Salary:
250000.00 - 290000.00 USD / Year
Abridge
Expiration Date
Until further notice
Requirements:
  • 6+ years as a manager in rapidly growing organizations including at least 1 year as a manager of managers
  • Seeking an extremely challenging role that will push you beyond your limits, where failures are inevitable and not to be feared
  • Seeking a senior leadership role to develop people, environments, and impact - not ego, accolades, or ladder climbing
  • Able to ask for help, fail fast and admit defeat
  • Get yourself and others out of their comfort zone
  • Track record of leading performance engineering including load test and chaos engineering, large scale distributed telemetry implementation, major architectural and software refactors, engineering velocity, and full stack development
  • Experience running production workloads in more than one cloud provider (at a time, or across your experience)
  • Experience managing workloads across containerized solutions, Kubernetes, and CNCF-approved tooling such as Argo, Istio, OTel, and more
  • Thought leader in platform building, with a strong desire to represent Abridge as a reliability engineering leader in the tech industry
  • Genuine passion for Abridge’s mission to improve healthcare in America and across the world
Job Responsibility:
  • Visionary leadership: Scope, resource, evangelize, and execute a company-wide reliability and engineering velocity roadmap across environments and clouds, real-time streaming infrastructure under immense scale, compute as well as AI-at-edge infrastructure, and the most ambitious cloud security roadmap in the entire tech industry
  • Collaborate with department heads across product engineering, security, product management, commercial, and more to develop, align, and execute an extremely ambitious strategic roadmap
  • Gifted tactician: Work at the level of small tiger teams to unblock, enable, and drive execution and solutioning
  • Juggle several ambiguous and tricky problems at a time
  • Recruiter extraordinaire: Scale out your team to meet this roadmap - both ICs and managers
  • Attract top talent and hire quickly while maintaining a consistently high bar
  • Iterate on the hiring process along with other leaders, improve diversity and equity, retain and maximize the effectiveness of an extremely senior team, and make strategic bets on the people that will take us to the next level
  • Mentor to the mentors: Develop their careers, create top-of-ladder development opportunities, and continuously raise the bar for your staff as well as your peers and leaders in their abilities and awareness
  • Earn their trust, lead by example, be a doctor rather than a judge for organizational and people challenges, and help establish and maintain a hivemind, de-siloed culture across all engineering pods
What we offer:
  • Generous Time Off: 14 paid holidays, flexible PTO for salaried employees, and accrued time off for hourly employees
  • Comprehensive Health Plans: Medical, Dental, and Vision coverage for all full-time employees and their families
  • Generous HSA Contribution: If you choose a High Deductible Health Plan, Abridge makes monthly contributions to your HSA
  • Paid Parental Leave: Generous paid parental leave for all full-time employees
  • Family Forming Benefits: Resources and financial support to help you build your family
  • 401(k) Matching: Contribution matching to help invest in your future
  • Personal Device Allowance: Tax free funds for personal device usage
  • Pre-tax Benefits: Access to Flexible Spending Accounts (FSA) and Commuter Benefits
  • Lifestyle Wallet: Monthly contributions for fitness, professional development, coworking, and more
  • Mental Health Support: Dedicated access to therapy and coaching to help you reach your goals
  • Fulltime

Principal SRE

The Principal Site Reliability Engineer will be a senior technical expert respon...
Location:
India, Pune
Salary:
Not provided
Barclays
Expiration Date
Until further notice
Requirements:
  • 12+ years in software engineering or infrastructure roles
  • At least 5 years focused on reliability engineering or SRE
  • Proven experience building and operating fault-tolerant, highly available systems at scale
  • Strong knowledge of distributed systems, resiliency patterns (circuit breakers, retries, failover), and disaster recovery strategies
  • Expertise across infrastructure (compute, storage, networking), application architecture, databases, and integration patterns
  • Ability to troubleshoot complex technical issues across distributed systems and perform deep root cause analysis
  • Skilled at working with development, operations, and architecture teams to embed reliability into design and delivery
Job Responsibility:
  • Drive strategies to improve reliability, maintainability, and scalability across payment flows and platform components
  • Conduct deep technical assessments of system architectures, identifying risks and recommending improvements for fault tolerance and disaster recovery
  • Act as a senior escalation point for production incidents, lead RCA, and implement permanent fixes to prevent recurrence
  • Define and enforce reliability patterns, frameworks, and best practices
  • Advocate and implement chaos engineering principles to validate system resilience under real-world failure scenarios
  • Design and implement full-stack observability solutions, including metrics, logging, distributed tracing, and alerting
  • Develop automation for failover, capacity management, and self-healing mechanisms to reduce operational risk
  • Partner with development, infrastructure, and production support teams to embed reliability into the SDLC
  • Analyze service risk assessments and production incidents to identify systemic issues and drive long-term improvements
  • Promote operational excellence and a mindset of designing for failure across all engineering teams
What we offer:
  • Competitive holiday allowance
  • Life assurance
  • Private medical care
  • Pension contribution
  • Fulltime

Engineering Manager, SRE

Abridge’s services and engineering teams are in hyperscale mode, and multiplying...
Location:
United States, San Francisco
Salary:
220000.00 - 260000.00 USD / Year
Abridge
Expiration Date
Until further notice
Requirements:
  • 3 - 6+ years as a manager in rapidly growing organizations including at least 1 year as a manager of managers
  • Seeking an extremely challenging role that will push you beyond your limits, where failures are inevitable and not to be feared
  • Seeking a senior leadership role to develop people, environments, and impact - not ego, accolades, or ladder climbing
  • Able to ask for help, fail fast and admit defeat
  • Get yourself and others out of their comfort zone
  • Track record of leading performance engineering including load test and chaos engineering, large scale distributed telemetry implementation, major architectural and software refactors, engineering velocity, and full stack development
  • Experience running production workloads in more than one cloud provider (at a time, or across your experience)
  • Experience managing workloads across containerized solutions, Kubernetes, and CNCF-approved tooling such as Argo, Istio, OTel, and more
  • Thought leader in platform building, with a strong desire to represent Abridge as a reliability engineering leader in the tech industry
  • Genuine passion for Abridge’s mission to improve healthcare in America and across the world
Job Responsibility:
  • Visionary leadership: Scope, resource, evangelize, and execute a company-wide reliability and engineering velocity roadmap across environments and clouds, real-time streaming infrastructure under immense scale, compute as well as AI-at-edge infrastructure, and the most ambitious cloud security roadmap in the entire tech industry. Collaborate with department heads across product engineering, security, product management, commercial, and more to develop, align, and execute an extremely ambitious strategic roadmap
  • Gifted tactician: Work at the level of small tiger teams to unblock, enable, and drive execution and solutioning. Juggle several ambiguous and tricky problems at a time
  • Recruiter extraordinaire: Scale out your team to meet this roadmap - both ICs and managers. Attract top talent and hire quickly while maintaining a consistently high bar. Iterate on the hiring process, improve diversity and equity, retain and maximize the effectiveness of an extremely senior team
  • Mentor to the mentors: Develop their careers, create top-of-ladder development opportunities, and continuously raise the bar for your staff as well as your peers and leaders in their abilities and awareness. Earn their trust, lead by example, be a doctor rather than a judge for organizational and people challenges, and help establish and maintain a hivemind, de-siloed culture across all engineering pods
What we offer:
  • Generous Time Off: 14 paid holidays, flexible PTO for salaried employees, and accrued time off for hourly employees
  • Comprehensive Health Plans: Medical, Dental, and Vision coverage for all full-time employees and their families
  • Generous HSA Contribution: If you choose a High Deductible Health Plan, Abridge makes monthly contributions to your HSA
  • Paid Parental Leave: Generous paid parental leave for all full-time employees
  • Family Forming Benefits: Resources and financial support to help you build your family
  • 401(k) Matching: Contribution matching to help invest in your future
  • Personal Device Allowance: Tax free funds for personal device usage
  • Pre-tax Benefits: Access to Flexible Spending Accounts (FSA) and Commuter Benefits
  • Lifestyle Wallet: Monthly contributions for fitness, professional development, coworking, and more
  • Mental Health Support: Dedicated access to therapy and coaching to help you reach your goals
  • Fulltime

Senior Site Reliability Engineer

As Padran Information Technologies, we are looking for teammates who are focused...
Location:
Turkey, İstanbul
Salary:
Not provided
Padran Information Technologies Inc.
Expiration Date
Until further notice
Requirements:
  • A minimum of Bachelor’s degree in Computer Science, Engineering, or a related field
  • 5+ years of experience in SRE, Reliability Engineering, or large-scale systems operations
  • Strong expertise in designing and maintaining highly available, fault-tolerant, and distributed systems
  • Deep understanding of SLIs, SLOs, and SLAs
  • Proven track record of driving reliability metrics
  • Hands-on experience with performance tuning, capacity planning, and incident response strategies
  • Proficiency in monitoring, logging, and tracing tools such as New Relic, Datadog, Prometheus, Grafana, OpenTelemetry, ELK
  • Strong programming or scripting experience (Go, Python, Bash, or similar) for building automation and internal tools
  • Experience with Kubernetes, container orchestration, and hybrid/multi-cloud infrastructure
  • Solid networking fundamentals, troubleshooting, and production-level debugging expertise
Job Responsibility:
  • Defining and driving reliability goals (SLIs/SLOs/SLAs) for services and leading efforts to achieve them
  • Designing scalable, fault-tolerant systems, and leading disaster recovery, backup, and failover planning
  • Owning incident management processes: leading major incident response, root cause analysis, and postmortems
  • Implementing chaos engineering practices to proactively identify weaknesses and strengthen system resilience
  • Building and maintaining observability stacks (metrics, logging, tracing) to enable proactive detection and troubleshooting
  • Partnering with development teams to embed reliability-focused design patterns into software architecture
  • Developing automation tools and self-healing systems to reduce toil and improve operational efficiency
  • Documenting runbooks, playbooks, and operational best practices to standardize processes across the organization
What we offer:
  • Opportunity to work with leading companies in Turkey
  • Opportunity to use industry-leading technologies with our business partners Microsoft, IBM, AWS and Open Text
  • Career development and certification opportunities as an ISTQB accredited training center
  • Fulltime

Senior Software Engineer - Cloud Data Storage

Cloud Data Store (CDS) owns the storage, retrieval, and lifecycle of all workflo...
Location:
United States
Salary:
180000.00 - 225000.00 USD / Year
Temporal
Expiration Date
Until further notice
Requirements:
  • 5 or more years of experience as an 'Arranger' and/or 'Builder/Enhancer' of highly scalable distributed systems
  • Solid computer science fundamentals in distributed systems concepts including multi-threading and concurrency
  • Experience writing concurrent code in production with languages like Go or Java or other applicable languages with skill level as 'high end of Intermediate' and/or 'Advanced' or 'Expert' levels
  • Experience building and running services on AWS
Job Responsibility:
  • Design & build distributed data systems – craft APIs, schemas, and replication paths that keep petabytes of workflow history durable and query-able
  • Drive reliability & performance – own SLOs, create chaos-test plans, profile hot paths, and lead incident reviews
  • Technical leadership – break down roadmap epics, mentor mid-level engineers, steward design docs through RFC
  • Cross-team collaboration – partner with the Server, Cloud, and DX teams to land features end-to-end
What we offer:
  • Unlimited PTO, 12 Holidays + 2 Floating Holidays
  • 100% Premiums Coverage for Medical, Dental, and Vision
  • AD&D, LT & ST Disability, and Life Insurance (Standard & Supplemental Available)
  • Empower 401K Plan
  • Additional Perks for Learning & Development, Lifestyle Spending, In-Home Office Setup, Professional Memberships, WFH Meals, Internet Stipend and more
  • $3,600 / Year Work from Home Meals
  • $1,500 / Year Career Development & Learning
  • $1,200 / Year Lifestyle Spending Account
  • $1,000 / Year In-Home Office Setup
  • $500 / Year Professional Memberships
  • Fulltime