Senior Software Engineer - Chaos Engineering Job at Microsoft Corporation (Redmond)

Senior Software Engineer - SRE

Roku is changing how the world watches TV. Roku is the #1 TV streaming platform ...

Location

India , Bengaluru

Salary:

Not provided

Roku

Expiration Date

Until further notice

Requirements

Preferably 8+ years of experience in DevOps/SRE roles, with demonstrated expertise in implementing SRE principles, SLO/SLI frameworks, and error budget policies in production environments
Deep experience with observability and monitoring platforms such as Prometheus, Grafana, Datadog, New Relic, or equivalent, including experience building custom dashboards, alerts, and SLO-based monitoring
Strong background in incident management, including experience as an Incident Commander, conducting blameless postmortems, and implementing systematic reliability improvements based on incident learnings
Strong understanding of distributed systems and reliability engineering, including failure modes, fault tolerance patterns, circuit breakers, bulkheads, rate limiting, and graceful degradation strategies
Experience with a number of the following: Kubernetes, Docker, Service Mesh such as Istio, Envoy, Linkerd, Solo & ECS
Experience in cloud-focused software development, preferably in Go, Python, or other object-oriented programming languages
Experience with Infrastructure as Code (IaC) tools such as Terraform, Ansible, or CloudFormation
Experience with CI/CD automation, including GitLab pipelines and other related tools
Strong hands-on experience with cloud platforms such as AWS, GCP or Azure
Proven track record of implementing scalable, high-performance infrastructure solutions in fast-paced, dynamic environments

Job Responsibility

Design & Infrastructure
Contribute to postmortem culture by facilitating comprehensive, blameless post-incident reviews that identify root causes, contributing factors, and actionable remediation items. Track incident trends to identify systemic issues and prioritize reliability improvements
Implement chaos engineering practices to proactively identify failure modes, validate system resilience, and build confidence in recovery procedures. Conduct game days and disaster recovery exercises
SRE Process & Principles Implementation
Deploy and evolve SRE practices across the organization by establishing core SRE principles, frameworks, and methodologies. Define and implement service reliability practices, including Service Level Objectives (SLOs), Service Level Indicators (SLIs), and Error Budgets, to balance innovation velocity with system reliability
Manage Error Budgets as a mechanism for making data-driven decisions about feature velocity vs. reliability. Track, report, and enforce error budget policies, facilitating conversations between engineering and product teams about risk tolerance and release decisions
Reliability Engineering & Infrastructure
Reduce toil through automation by identifying repetitive operational work and systematically eliminating it through infrastructure-as-code, automation frameworks, and intelligent tooling. Measure and track toil reduction efforts, aiming to keep toil below 50% of team time
Implement capacity planning processes that ensure systems have adequate headroom to meet SLOs during peak traffic, unexpected load spikes, and degraded states. Develop predictive models and automated scaling mechanisms
Observability, Monitoring & Reporting

What we offer

global access to mental health and financial wellness support and resources
healthcare (medical, dental, and vision)
life, accident, disability, commuter, and retirement options (401(k)/pension)
time off in accordance with local leave policies

Fulltime

Senior Software Engineer, SRE

Abridge’s services and engineering team are in hyperscale mode. We are looking f...

Location

United States , SF Office, NYC Office

Salary:

210800.00 - 248000.00 USD / Year

Abridge

Expiration Date

Until further notice

Requirements

8+ years of software engineering experience focused on distributed systems or tooling, with an interest in engineering enablement and software scaling
At least 2 years experience as a back-end engineer focused on system performance and scalability
Experience reducing latency in software by multiples through leveraging observability and profiling tools
Experience building on Kubernetes and scaling compute services on Kubernetes
Experience with related cloud-native technologies including ArgoCD, Argo Rollouts, Istio, etc.
Comfortable implementing and securing services in Google Cloud Platform with Infrastructure as Code, including GCP Projects, VPC Networks, Google Kubernetes Engine, and IAM Roles, Groups and policies
Experience building software with backend languages (e.g. Python, GoLang, Node, and Rust)
Experience monitoring distributed systems with Prometheus, OpenTelemetry Collector, and Grafana (or something similar), including metrics collection, visualization, alerting, and using observability data to drive performance optimizations
Passion for engineering enablement and solving software and distributed systems scaling challenges under pressure
Must be willing to travel up to 10%

Job Responsibility

Leverage load testing, chaos engineering, and other test practices to identify performance and latency bottlenecks across all of our systems, and make changes to application code to resolve them
Drive software changes that can rehome applications at the code level onto new infrastructure (run times, event driven infrastructure, databases, and more) in order to dramatically improve scalability as well as enable multi-tenant deployments
Identify and implement software configuration changes and performance tuning parameters that will dramatically improve performance and scalability
Build developer tools and software modules that help engineers build code faster and more effectively, extending enablement to the entire engineering organization
Work with the Platform team to develop, and application teams to adopt, emerging elements of our internal developer platform, such as service templates and self-serve infrastructure
Work with application teams to establish and adopt SLOs and error budgets, and drive better metrics for application health that can drive automated canary releases, improved health monitoring, and better engineering practices
Uplevel our ability to respond to incidents by improving observability, runbooks, and incident response muscle across the organization
Evangelize, document, and train the engineering team on the solutions being built and uplevel them on cloud native design strategies and tools
Be a public evangelist for Abridge in the global platform engineering community, including conferences, open source, and research, as we pioneer new AI-first, cloud-native-first, security-first implementations at scale

What we offer

Generous Time Off: 14 paid holidays, flexible PTO for salaried employees, and accrued time off for hourly employees
Comprehensive Health Plans: Medical, Dental, and Vision coverage for all full-time employees and their families
Generous HSA Contribution: If you choose a High Deductible Health Plan, Abridge makes monthly contributions to your HSA
Paid Parental Leave: Generous paid parental leave for all full-time employees
Family Forming Benefits: Resources and financial support to help you build your family
401(k) Matching: Contribution matching to help invest in your future
Personal Device Allowance: Tax free funds for personal device usage
Pre-tax Benefits: Access to Flexible Spending Accounts (FSA) and Commuter Benefits
Lifestyle Wallet: Monthly contributions for fitness, professional development, coworking, and more
Mental Health Support: Dedicated access to therapy and coaching to help you reach your goals

Fulltime

Lead Software Engineer - Java Full Stack + GENAI

About this role: Wells Fargo is seeking a Lead Software Engineer In this role...

Location

India , Hyderabad

Salary:

Not provided

Wells Fargo

Expiration Date

July 19, 2026

Requirements

5+ years of Software Engineering experience, or equivalent demonstrated through one or a combination of the following: work experience, training, military experience, education
5+ years of Software Engineering experience as a Java full stack developer (Spring, Spring Boot, Oracle, and UI)
Bachelor's degree in engineering or equivalent, with the above-mentioned years of experience
Experience with GenAI tools – Copilot
Strong analytical, verbal, written communication, and interpersonal skills
Strong knowledge of Agile product development methodologies and experience collaborating with multiple stakeholders to deliver quality products in a timely manner
Hands-on experience building microservices using Spring Boot, Kafka, REST APIs, ORM, and SQL/NoSQL databases
Strong knowledge of and hands-on experience in designing highly secure, scalable, resilient, and performant applications using Java/J2EE design patterns, 12-factor app principles, and cloud-native patterns and practices
Deep understanding of application performance management, memory management, multi-threading patterns and practices
Strong knowledge of foundational skills: Data Structures, Design Patterns, OOPs, SOLID principles, and secure coding practices

Job Responsibility

Lead complex technology initiatives including those that are companywide with broad impact
Act as a key participant in developing standards and companywide best practices for engineering complex and large scale technology solutions for technology engineering disciplines
Design, code, test, debug, and document for projects and programs
Review and analyze complex, large-scale technology solutions for tactical and strategic business objectives, enterprise technological environment, and technical challenges that require in-depth evaluation of multiple factors, including intangibles or unprecedented technical factors
Make decisions in developing standard and companywide best practices for engineering and technology solutions requiring understanding of industry best practices and new technologies, influencing and leading technology team to meet deliverables and drive new initiatives
Collaborate and consult with key technical experts, senior technology team, and external industry groups to resolve complex technical issues and achieve goals
Lead projects, teams, or serve as a peer mentor

Fulltime

Senior Engineering Manager, SRE

Abridge’s services and engineering teams are in hyperscale mode, and multiplying...

Location

United States , San Francisco; New York; Pittsburgh

Salary:

250000.00 - 290000.00 USD / Year

Abridge

Expiration Date

Until further notice

Requirements

6+ years as a manager in rapidly growing organizations including at least 1 year as a manager of managers
Seeking an extremely challenging role that will push you beyond your limits, where failures are inevitable and not to be feared
Seeking a senior leadership role to develop people, environments, and impact - not ego, accolades, or ladder climbing
Able to ask for help, fail fast, admit defeat, and get yourself and others out of their comfort zone
Track record of leading performance engineering including load test and chaos engineering, large scale distributed telemetry implementation, major architectural and software refactors, engineering velocity, and full stack development
Experience running production workloads in more than one cloud provider (at a time, or across your experience)
Experience managing workloads across containerized solutions, Kubernetes, and CNCF-approved tooling such as Argo, Istio, OTel, and more
Thought leader in platform building, with a strong desire to represent Abridge as a reliability engineering leader in the tech industry
Genuine passion for Abridge’s mission to improve healthcare in America and across the world

Job Responsibility

Visionary leadership: Scope, resource, evangelize, and execute a company-wide reliability and engineering velocity roadmap across environments and clouds, real-time streaming infrastructure under immense scale, compute as well as AI-at-edge infrastructure, and the most ambitious cloud security roadmap in the entire tech industry
Collaborate with department heads across product engineering, security, product management, commercial, and more to develop, align, and execute an extremely ambitious strategic roadmap
Gifted tactician: Work at the level of small tiger teams to unblock, enable, and drive execution and solutioning
Juggle several ambiguous and tricky problems at a time
Recruiter extraordinaire: Scale out your team to meet this roadmap - both ICs and managers
Attract top talent and hire quickly while maintaining a consistently high bar
Iterate on the hiring process along with other leaders, improve diversity and equity, retain and maximize the effectiveness of an extremely senior team, and make strategic bets on the people that will take us to the next level
Mentor to the mentors: Develop their careers, create top-of-ladder development opportunities, and continuously raise the bar for your staff as well as your peers and leaders in their abilities and awareness
Earn their trust, lead by example, be a doctor rather than a judge for organizational and people challenges, and help establish and maintain a hivemind, de-siloed culture across all engineering pods

What we offer

Generous Time Off: 14 paid holidays, flexible PTO for salaried employees, and accrued time off for hourly employees
Comprehensive Health Plans: Medical, Dental, and Vision coverage for all full-time employees and their families
Generous HSA Contribution: If you choose a High Deductible Health Plan, Abridge makes monthly contributions to your HSA
Paid Parental Leave: Generous paid parental leave for all full-time employees
Family Forming Benefits: Resources and financial support to help you build your family
401(k) Matching: Contribution matching to help invest in your future
Personal Device Allowance: Tax free funds for personal device usage
Pre-tax Benefits: Access to Flexible Spending Accounts (FSA) and Commuter Benefits
Lifestyle Wallet: Monthly contributions for fitness, professional development, coworking, and more
Mental Health Support: Dedicated access to therapy and coaching to help you reach your goals

Fulltime

Senior Reliability Engineer

Barbaricum is seeking an experienced Senior Site Reliability Engineer to support...

Location

United States , Washington

Salary:

Not provided

Barbaricum

Expiration Date

Until further notice

Requirements

Expert knowledge of site reliability engineering practices, system monitoring, incident management, automation, performance tuning, and operational resilience
Strong understanding of Windows and Linux administration, infrastructure operations, system configuration, service management, and troubleshooting practices
Experience with automation platforms and configuration management tools such as Ansible, Puppet, Chef, or similar technologies
Proficiency with scripting languages such as Python, Shell, PowerShell, or similar tools used to automate operational and infrastructure tasks
Knowledge of cloud services and infrastructure across AWS, Microsoft Azure, Google Cloud, or comparable cloud environments
Strong understanding of network troubleshooting, configuration, connectivity analysis, system dependencies, and performance bottleneck identification
Ability to design, interpret, and maintain dashboards, alerts, metrics, logs, and operational reporting that support service health and decision-making
Ability to conduct root cause analysis, post-incident reviews, and corrective action planning in complex technical environments
Strong problem-solving skills and the ability to work under pressure during outages, impairments, and time-sensitive operational issues
Excellent written and verbal communication skills, with the ability to explain technical findings, incident impacts, and reliability recommendations to technical and non-technical stakeholders

Job Responsibility

Monitor and maintain system reliability, availability, and performance across on-premises, cloud, and hybrid IT environments supporting MC&FP mission requirements
Implement proactive performance monitoring, automated alerting, incident response workflows, and resilience engineering practices to reduce downtime and improve operational visibility
Develop, maintain, and improve scalable automated infrastructure solutions that support reliable system operations and repeatable service delivery
Implement rollback strategies, recovery approaches, and chaos engineering practices to validate resilience, reduce operational risk, and improve system stability
Analyze usage patterns, capacity trends, and performance indicators to support dynamic scaling, resource optimization, and system improvement decisions
Develop and maintain real-time operational dashboards, reports, and metrics that enable rapid decision-making, leadership awareness, and system optimization
Respond to and resolve system outages, impairments, and service disruptions while coordinating with technical teams to minimize mission impact
Conduct post-incident reviews to identify root causes, document lessons learned, and implement preventative measures that reduce recurrence
Collaborate with software developers, cloud engineers, cybersecurity personnel, and operations teams to improve services, reliability patterns, deployment practices, and operational standards
Create and maintain system documentation, configuration standards, operational runbooks, monitoring procedures, and service reliability guidance

Fulltime

Senior Site Reliability Engineer

We are looking for a Senior Site Reliability Engineer to join our growing engine...

Location

Salary:

Not provided

Airalo

Expiration Date

Until further notice

Requirements

Bachelor’s degree in Computer Engineering or a similar discipline
5+ years of experience as a Site Reliability Engineer or in a similar role
3+ years of experience with AWS services including strong knowledge of container orchestration
2+ years of Kubernetes experience
Deep understanding of observability principles and tools such as Prometheus, Datadog, and OpenTelemetry
Experience with leading incident management and complex postmortem analysis
Experience and interest in managing infrastructure as code (Terraform)
Experience with chaos engineering and other techniques for testing system resilience
Experience with CI/CD tools such as GitHub Actions for automated delivery
Proficiency in at least one programming language (Python, Go, Java, etc.) for building automation and internal tooling

Job Responsibility

Lead the design of scalable, fault-tolerant and self-healing systems in a multi-region AWS environment
Define and track Service Level Objectives (SLOs) and Service Level Indicators (SLIs) to drive architectural decisions and error budget policies
Conduct blameless post-incident reviews to uncover systemic root causes and implement long-term preventive measures
Identify patterns of manual work and lead the development of internal tools/automation to permanently eliminate them
Develop and maintain automated runbooks and playbooks for common operational tasks and complex incident response
Shift from simple monitoring to deep observability, ensuring high cardinality data leads to proactive actionable insights
Proactively identify and mitigate operational risks through chaos engineering and architecture reviews
Work with software engineers to design systems for reliability, scalability, and maintainability from the early stages of the SDLC
Continuously evaluate and optimize system performance, capacity, and cost efficiency
Beyond just participating, you will refine the on-call experience to reduce alert fatigue, improve MTTR, and ensure sustainable rotation health

What we offer

Health Insurance
work-from-anywhere stipend
annual wellness & learning credits
annual all-expenses-paid company retreat in a gorgeous destination & other benefits
Paid Rotation: We offer standby fees + overtime pay
Delayed Start: No on-call duties for your first 6 months
Rest & Recovery: Guaranteed rest periods and flexible hours following night incidents
Shared Load: Rotations are split (Weekdays vs. Weekends) to minimize fatigue

Fulltime

Senior Site Reliability Engineer

As Padran Information Technologies, we are looking for teammates who are focused...

Location

Turkey , İstanbul

Salary:

Not provided

Padran Information Technologies Inc.

Expiration Date

Until further notice

Requirements

A minimum of Bachelor’s degree in Computer Science, Engineering, or a related field
5+ years of experience in SRE, Reliability Engineering, or large-scale systems operations
Strong expertise in designing and maintaining highly available, fault-tolerant, and distributed systems
Deep understanding of SLIs, SLOs, and SLAs, with a proven track record of driving reliability metrics
Hands-on experience with performance tuning, capacity planning, and incident response strategies
Proficiency in monitoring, logging, and tracing tools such as New Relic, Datadog, Prometheus, Grafana, OpenTelemetry, and ELK
Strong programming or scripting experience (Go, Python, Bash, or similar) for building automation and internal tools
Experience with Kubernetes, container orchestration, and hybrid/multi-cloud infrastructure
Solid networking fundamentals, troubleshooting, and production-level debugging expertise

Job Responsibility

Defining and driving reliability goals (SLIs/SLOs/SLAs) for services and leading efforts to achieve them
Designing scalable, fault-tolerant systems, and leading disaster recovery, backup, and failover planning
Owning incident management processes: leading major incident response, root cause analysis, and postmortems
Implementing chaos engineering practices to proactively identify weaknesses and strengthen system resilience
Building and maintaining observability stacks (metrics, logging, tracing) to enable proactive detection and troubleshooting
Partnering with development teams to embed reliability-focused design patterns into software architecture
Developing automation tools and self-healing systems to reduce toil and improve operational efficiency
Documenting runbooks, playbooks, and operational best practices to standardize processes across the organization

What we offer

Opportunity to work with leading companies in Turkey
Opportunity to use industry-leading technologies with our business partners Microsoft, IBM, AWS and Open Text
Career development and certification opportunities as an ISTQB accredited training center

Fulltime

Senior Robotics Manual QA Engineer

The Axon SkyHero Firmware team in the Robotics New Ventures Pillar supports crit...

Location

Belgium , Brussels

Salary:

Not provided

Axon

Expiration Date

Until further notice

Requirements

Bachelor’s Degree from an accredited University
7+ years of industry experience in Quality Assurance on complex products
Strong technical writing skills
Familiarity with using software tools and the command line to load firmware binaries, pull log files, etc.
Strong English speaking fluency
Basic functional coding knowledge in a leading scripting language (e.g., Python)
Comfortable on Embedded Linux and command-line interfaces (e.g., flash builds, collect logs, basic networking)
Familiarity with Git or similar source code control system
Familiarity with CI/CD systems
Alignment to Axon’s Mission Statement and willingness to work near Conductive Energy Weapons (CEW)

Job Responsibility

Be part of a high-performing team that designs and develops game-changing Robotics products to Protect Life
Derive and write comprehensive Test Plans and Test Cases while thinking critically about customer workflows and edge cases
Leverage Jira and Zephyr to keep work organized and team aligned on bug triage
Work with engineers to understand “Flight Tuning” outcomes and physically fly and drive drones to verify expected behaviors
Turn subjective “Touchy Feely” flight and driving drone behavior and unstable edge cases into evidence-driven, repeatable tests
Own flight/ground test execution in Brussels, including test setup, runbooks, evidence capture (e.g., logs, video), and repeatability
Be an integral part of the team and help drive the Release Calendar and Go/No-Go decisions from a quality perspective
Execute test cases for major releases, identify regressions, and clearly communicate repro steps to engineers
Use AI Tooling to assist in writing test scripts and rudimentary automation workflows
Be a Team Player, Mentor, Strong Communicator and be ready and willing to support the Team when needed

Fulltime
