Software Engineer, Reliability

OpenAI

Location:
United States, San Francisco

Contract Type:
Not provided

Salary:

230000.00 - 490000.00 USD / Year

Job Description:

Join the engineering teams that bring OpenAI’s ideas safely to the world. The Applied Engineering team works across research, engineering, product, and design to bring OpenAI’s technology to consumers and businesses. We seek to learn from deployment and distribute the benefits of AI, while ensuring that this powerful tool is used responsibly and safely. Safety is more important to us than unfettered growth.

As OpenAI continues to grow, we are looking for experienced, problem-solving engineers to ensure our systems scale. Our success depends on our ability to quickly iterate on products while also ensuring that they are performant and reliable. You will work in a deeply iterative, collaborative, fast-paced environment to bring our technology to millions of users around the world, and ensure it’s delivered with safety and reliability in mind.

Successful candidates will play a crucial role in ensuring the reliability, scalability, and performance of our systems as we continue to expand. As a reliability expert, you will be at the forefront of maintaining and enhancing the stability, scalability, and performance of our rapidly evolving infrastructure. You will work closely with cross-functional teams, including software engineers, product managers, and data scientists, to build and maintain resilient systems that can handle our growing user base and workload.

Job Responsibility:

  • Design and implement solutions to ensure the scalability of our infrastructure to meet rapidly increasing demands
  • Build and maintain the load, chaos and synthetic testing software leveraged by development teams to make the systems they design and operate more reliable
  • Build and maintain automation tools to streamline repetitive tasks and improve system reliability
  • Build and maintain the platform for CPU/storage, GPU, and network lifecycle management to drive efficiency, accountability and support dynamic optimization of our resources
  • Implement fault-tolerant and resilient design patterns to minimize service disruptions
  • Develop and maintain service level objectives (SLOs) and service level indicators (SLIs) to measure and ensure system reliability
  • Partner with researchers, engineers, product managers, and designers to bring new features and research capabilities to the world
  • Participate in an on-call rotation to respond to critical incidents and ensure 24/7 system availability

Requirements:

  • Bachelor's degree in Computer Science, Information Technology, or a related field (or equivalent work experience)
  • Proven experience as an SWE focused on reliability or a similar role in a fast-paced, rapidly scaling company
  • Strong proficiency in cloud infrastructure
  • Proficiency in programming languages
  • Experience with containerization technologies and container orchestration platforms like Kubernetes
  • Knowledge of IaC tools such as Terraform or CloudFormation
  • Excellent problem-solving and troubleshooting skills
  • Strong communication and collaboration skills
  • Experience with observability tools such as DataDog, Prometheus, Grafana and Splunk
  • Experience with microservices architecture and service mesh technologies
  • Knowledge of security best practices in cloud environments

What we offer:

  • Medical, dental, and vision insurance for you and your family, with employer contributions to Health Savings Accounts
  • Pre-tax accounts for Health FSA, Dependent Care FSA, and commuter expenses (parking and transit)
  • 401(k) retirement plan with employer match
  • Paid parental leave (up to 24 weeks for birth parents and 20 weeks for non-birthing parents), plus paid medical and caregiver leave (up to 8 weeks)
  • Paid time off: flexible PTO for exempt employees and up to 15 days annually for non-exempt employees
  • 13+ paid company holidays, and multiple paid coordinated company office closures throughout the year for focus and recharge, plus paid sick or safe time (1 hour per 30 hours worked, or more, as required by applicable state or local law)
  • Mental health and wellness support
  • Employer-paid basic life and disability coverage
  • Annual learning and development stipend to fuel your professional growth
  • Daily meals in our offices, and meal delivery credits as eligible
  • Relocation support for eligible employees
  • Additional taxable fringe benefits, such as charitable donation matching and wellness stipends, may also be provided
  • Equity
  • Performance-related bonus(es) for eligible employees

Additional Information:

Job Posted:
February 21, 2026

Employment Type:
Full-time

Work Type:
On-site work
Similar Jobs for Software Engineer, Reliability

Senior Software Engineer, Site Reliability

Babylist is looking for a Senior Software Engineer, Site Reliability to join our...
Location:
United States; Canada
Salary:
186818.00 - 224183.00 USD; CAD / Year
Babylist
Expiration Date:
Until further notice
Requirements:
  • 8+ years of experience as a Site Reliability Engineer or similar role
  • Experience supporting high-traffic consumer-facing websites
  • Proficiency with Terraform
  • Strong experience working with AWS cloud-based infrastructure and services
  • Proficiency with Docker and Kubernetes
  • Solid understanding of cloud-native systems design
  • Troubleshooting and debugging skills
  • Experience designing and supporting CI systems
  • Familiar with monitoring and alerting best practices
  • Proven experience in on-call management best practices
Job Responsibility:
  • Manage and build our AWS infrastructure using Infrastructure as Code (IaC) tools like Terraform
  • Improve the speed and reliability of our Continuous Integration (CI) systems
  • Provide support to developers in troubleshooting issues
  • Establish, communicate, and support best practices for monitoring and alerting
What we offer:
  • Company-paid medical, dental, and vision insurance
  • Retirement savings plan with company matching and flexible spending accounts
  • Generous paid parental leave and PTO
  • Remote work stipend
  • Perks for physical, mental, and emotional health, parenting, childcare, and financial planning
Employment Type:
Full-time

Software Engineer, Site Reliability

As a Site Reliability Engineer (SRE) at Fireworks AI, you will play a critical r...
Location:
United States, San Mateo
Salary:
Not provided
Fireworks AI
Expiration Date:
Until further notice
Requirements:
  • Bachelor's degree in Computer Science, related technical field, or equivalent practical experience
  • 5+ years of experience in Site Reliability Engineering, DevOps, or a similar role focused on large-scale production systems
  • Deep expertise in SRE principles and practices, including SLOs, SLIs, operational automation, incident management, and post-mortems
  • Extensive hands-on experience with public cloud platforms (AWS, GCP, Azure), including compute, networking, storage, and database services
  • Strong experience with containerization technologies (Docker) and orchestration platforms (Kubernetes)
  • Proficiency in designing and implementing robust monitoring, logging, and alerting systems using tools like Prometheus, Grafana, ELK stack, and distributed tracing
  • Solid programming/scripting skills in at least one language (e.g., Python, Go) for automation and tool development
  • In-depth knowledge of Linux operating systems, networking fundamentals, and system debugging
  • Proven ability to troubleshoot complex issues across the entire stack
  • Excellent communication, collaboration, and problem-solving skills
Job Responsibility:
  • Ensuring System Reliability: Ensure systems are designed and implemented with high availability, scalability, and performance. Focus on fault tolerance, disaster recovery, identifying and removing scaling bottlenecks, and performance optimization across our multi-cloud infrastructure
  • Incident Management & Response: Lead efforts in incident detection, response, and resolution for critical production issues. Drive post-mortems to identify root causes and implement preventative measures to improve system reliability
  • Observability & Monitoring: Develop, implement, and maintain comprehensive monitoring, alerting, logging, and tracing solutions to provide deep insights into system health and performance
  • Automation & Toil Reduction: Identify and automate repetitive operational tasks to reduce toil and improve operational efficiency. Develop tools and scripts to streamline deployments, scaling, and system management
  • Capacity Planning & Performance Tuning: Work proactively on capacity planning to ensure our infrastructure can gracefully handle growth and peak loads. Optimize system performance and resource utilization
  • Reliability Best Practices: Collaborate with software engineers to embed reliability principles (e.g., SLOs, SLIs, error budgets) into the development lifecycle, promoting a culture of operational excellence
  • On-call Rotation: Participate in a periodic on-call rotation to support our production environment and respond to critical alerts
Employment Type:
Full-time

MTS Software Architecture - Reliability Engineering

Our team is searching for a Full Stack Member of Technical Staff to collaborate ...
Location:
United States, Frisco; Atlanta; Overland Park
Salary:
145400.00 - 262300.00 USD / Year
T-Mobile
Expiration Date:
Until further notice
Requirements:
  • Bachelor's degree in computer science, engineering, or a related field of study
  • 9+ years technical engineering experience, including full-stack web development (front-end and back-end)
  • 7+ years of experience in database schema design and writing SQL
  • 3+ years DevOps experience, including infrastructure as code
  • 4+ years hands-on experience with cloud services (AWS, Azure, GCP)
  • 3+ years experience mentoring and coaching team members
  • Expertise in multiple technologies and software stacks
  • Strong understanding of cloud capabilities and how to optimize them for team success
  • Ability to set up a completely new full stack environment from scratch, including build steps and backend infrastructure
  • Proficiency in HTML, CSS, webpack, JavaScript, at least one front end framework and one backend framework
Job Responsibility:
  • Imagines, designs and builds full stack web solutions including both the back end and front end
  • Code Review and mentoring of other team members
  • Imagines, designs and builds advanced scheduled jobs and micro-services defining new patterns and orchestrations
  • Imagines, designs and implements advanced data storage mechanisms using relational and non-relational data stores
  • Explores, builds and configures cloud services using infrastructure as code. Recommends new cloud services and patterns
  • Presents ideas which improve an existing system/process/service. Presents new ideas which utilize new frameworks to improve an existing system/process/service
  • Collaborates with team to break down features into user stories and estimate them
  • Maintains awareness of the technology roadmap and updates job knowledge by tracking and understanding emerging engineering practices
  • Continuously learns, creates content, and teaches others specific subject areas
  • Informally coaches and contributes to the development of others through mentoring or in-house workshops and learning sessions
  • Coaches and develops engineers across functional teams on technology decisions
  • Influences technology and policy decisions made at Director+ level across the organization
  • Understands financial decisions, including NPV and ROI, based on customer experience/business drivers
  • Presents highly technical concepts to both technical and non-technical decision-makers
  • Provides direction on creation of reliability practices, metrics and tooling based on industry best practices and incident data
What we offer:
  • Competitive base salary and compensation package
  • Annual stock grant
  • Employee stock purchase plan
  • 401(k)
  • Access to free, year-round money coaches
  • Medical, dental and vision insurance
  • Flexible spending account
  • Employee stock grants
  • Paid time off
Employment Type:
Full-time

Software Test Engineer - Reliability, Availability, Serviceability

At IBM, work is more than a job - it's using your creativity and imagination: To...
Location:
Mexico, Guadalajara
Salary:
Not provided
IBM Deutschland GmbH
Expiration Date:
Until further notice
Requirements:
  • Bachelor's Degree
  • Experience working in testing projects, including creation and execution of test plans and test cases
  • Experience managing and maintaining (HW and SW) Servers and Switches
  • Experience using Linux
  • Experience with common scripting languages (bash, python, JavaScript, etc.)
  • Intermediate to advanced written and spoken English
Job Responsibility:
  • Create and execute test plans and test cases
  • Review test plans and test cases created by other teams
  • Support and test for customer support cases
  • Ensure that overall testing is performed within the committed timeframe
  • Test code fixes for issues found during testing and customer reported problems
  • Create and track code defects found during testing, providing extra support for development if needed
  • Expand test automation using the test automation framework
  • Set up and maintain (HW and SW) equipment used for testing, including servers, switches and tape libraries
Employment Type:
Full-time

Sr. Engineer II, Software Engineering FE

At CVS Health, we’re building a world of health around every consumer and surrou...
Location:
United States, Chicago
Salary:
148949.00 - 180000.00 USD / Year
CVS Health
Expiration Date:
Until further notice
Requirements:
  • Bachelor's degree in Computer Science, Computer Engineering, or related field
  • Six (6) years of progressively responsible, post-baccalaureate experience in a related occupation
  • Experience in building consumer-facing products using any SPA frameworks (React/Vue)
  • Experience in design first approach to software development
  • Experience in writing Jest / Vitest Unit Tests and achieving close to 100% code coverage
  • Experience working in an Agile/DevOps environment
Job Responsibility:
  • Contribute to all aspects of SDLC process (SCRUM, Design, Code, Test, Deploy & Maintain)
  • Collaborate with Product, UX and other Engineering teams
  • Collaborate with Platform team following Architecture best practices for scalability and reliability
  • Contribute to code review process to improve code quality
  • Mentor Engineers
  • Implement SecDevOps best practices
  • Other duties as assigned
What we offer:
  • Full range of medical, dental, and vision benefits
  • 401(k) retirement savings plan
  • Employee Stock Purchase Plan
  • Fully-paid term life insurance plan
  • Short-term and long-term disability benefits
  • Well-being programs
  • Education assistance
  • Free development courses
  • CVS store discount
  • Discount programs with participating partners
Employment Type:
Full-time

Software Engineer II, Android Engineering

As a Software Engineer on Axon’s Robotics team, you’ll be at the forefront of tr...
Location:
United States, Boston
Salary:
120750.00 - 193200.00 USD / Year
Axon
Expiration Date:
Until further notice
Requirements:
  • 3+ years of industry experience shipping Android applications to the Google Play Store
  • Understanding of the ins and outs of mobile phones
  • Ability to lead mobile design reviews as well as the implementation of designs through release and post-release monitoring
  • Experience with modern architecture (MVVM, MVI, etc.) including unit testing
  • Android experience with Retrofit, Coroutines, OkHttp, Hilt, Jetpack Compose
  • Experience working with remote data via REST and JSON
  • Understanding of and experience with networking protocols such as TCP, UDP, DHCP, DNS, Server-Sent Events, WebSockets (debugging with Wireshark or Charles a plus)
Job Responsibility:
  • Lead engineering architecture and design reviews to ensure high standards in software quality
  • Collaborate with the Axon product design team to turn mobile UI designs into functional, engaging solutions
  • Drive the entire mobile software lifecycle, from prototyping to commercialization and post-launch support
  • Interface with cloud services for seamless integration across platforms
  • Set a high technical standard for the team through code and design reviews
  • Partner with Product, Design, and Engineering teams to deliver integrated solutions that meet customer needs
  • Enhance engineering processes, including sprint planning, stand-ups, and long-term planning
  • Build robust and reliable software that meets high standards for stability in mission-critical applications
  • Collaborate closely with other groups to align on goals, ensuring we deliver impactful and innovative solutions
What we offer:
  • Competitive salary and 401k with employer match
  • Discretionary time off
  • Paid parental leave for all
  • Medical, Dental, Vision plans
  • Fitness Programs
  • Emotional & Development Programs
  • Snacks in our offices
Employment Type:
Full-time

Principal Site Reliability Engineer

Location:
United States, Ft. Meade
Salary:
Not provided
CipherLogix
Expiration Date:
Until further notice
Requirements:
  • Fourteen (14) years experience in software development/engineering, including requirements analysis, software development, installation, integration, evaluation, enhancement, maintenance, testing, and problem diagnosis/resolution
  • Ten (10) years experience in system engineering/architecture
  • Ten (10) years experience working with products that support highly distributed, massively parallel computation needs such as HBase, Hadoop, Cloudbase/Accumulo, Bigtable, Cassandra, Scality, etc.
  • At least ten (10) years experience writing software scripts using scripting languages such as Perl, Python, or Ruby for software automation
  • At least four (4) years experience managing and monitoring large Cloud System (>200 nodes). Cloud Systems Administrator or Developer Certification
  • Experience in performing and providing technical direction for the development, engineering, interfacing, integration, and testing of complete hardware/software systems to include monitoring technical health of a system, improving organizational processes, implementation of postmortem (failure) analysis and incident management
  • Ten (10) years experience in the cleared environment
  • Ten (10) years demonstrated experience developing software for one of the following: Windows, UNIX, or Linux OS
  • Knowledge and experience with developing distributed storage routing and querying algorithms
  • Experience in developing documentation required to support a program’s technical issues and training situations
Employment Type:
Full-time

Staff Software Engineer, Compute

Play a key role in building our platform from zero to one. Partner across teams ...
Location:
United States
Salary:
200000.00 - 275000.00 USD / Year
dbt Labs
Expiration Date:
Until further notice
Requirements:
  • 10+ years of experience in software engineering, with expertise in database systems, query engines, or storage systems
  • Strong systems-level coding skills in C++, Rust, Go, Python, or Java
  • Experience designing and scaling distributed systems or SaaS platforms
  • Expertise with cloud infrastructure (AWS, GCP, Azure, Kubernetes, Terraform)
  • Proven ability to lead complex projects and collaborate across functions
  • Excellent problem-solving skills, clear communication, and a strong sense of ownership
Job Responsibility:
  • Design, build, and maintain the Compute layer that powers dbt’s ability to optimize queries across ingestion, transformation, and consumption
  • Lead technical architecture discussions with a focus on query engines, storage systems, and distributed database design
  • Collaborate with Product, Design, Operations, and Security to deliver well-architected, scalable compute solutions
  • Build services, APIs, and experiences that support user delight, quality, high availability, and performance
  • Tackle ambiguous, open-ended technical challenges with strategic thinking, balancing technical constraints with user needs and product goals
  • Define and drive best practices in testing, observability, and system reliability
  • Mentor engineers across the company, fostering technical growth and collaboration
  • Champion a culture of technical excellence and innovation, influencing engineering direction across multiple teams or domains
What we offer:
  • Unlimited vacation
  • 401k
  • Pension Plan
  • 16 weeks Paid Parental Leave
  • Wellness stipend
  • Home office stipend
  • Equity Stake
Employment Type:
Full-time