Software Engineer, Infrastructure Reliability

OpenAI

Location:
United States, San Francisco

Contract Type:
Not provided

Salary:
255000.00 - 385000.00 USD / Year

Job Description:

You’ll be at the heart of scaling and hardening the infrastructure that powers some of the most widely used AI systems in the world. You’ll help ensure our systems are highly reliable, observable, performant, and secure—so researchers can iterate quickly, and products like ChatGPT and the OpenAI API can serve millions of users safely and effectively. This is a hands-on, high-leverage role for engineers who thrive on ownership, love solving deep technical problems across the stack, and want to work on systems that support cutting-edge research and deploy at global scale. You’ll play a key part in shaping technical direction, proactively improving system resilience, and collaborating closely with infra, product, and research teams to turn complex infrastructure into reliable platforms.

Job Responsibility:

  • Design, build, and operate reliable and performant systems used across engineering
  • Identify and fix performance bottlenecks and inefficiencies, ensuring our infrastructure can scale to the next order of magnitude
  • Dig deep to resolve complex issues
  • Continuously improve automation to reduce manual work
  • Improve internal tooling and our developer experience
  • Contribute to incident response, postmortems, and the development of best practices around system reliability and scalability

Requirements:

  • 4+ years of relevant industry experience
  • 2+ years leading large-scale, complex projects or teams as an engineer or tech lead
  • A passion for distributed systems at scale with a focus on reliability, scalability, security, and continuous improvement
  • Proven experience as a reliability engineer, production engineer, or a similar role in a fast-paced, rapidly scaling company
  • Strong proficiency in cloud infrastructure (like AWS, GCP, Azure) and IaC tools such as Terraform
  • Proficiency in programming / scripting languages
  • Experience with containerization technologies and container orchestration platforms like Kubernetes
  • Experience with observability tools such as Datadog, Prometheus, Grafana, Splunk, and the ELK stack
  • Experience with microservices architecture and service mesh technologies
  • Knowledge of security best practices in cloud environments
  • Strong understanding of distributed systems, networking, and database technologies
  • Excellent problem-solving skills and ability to work in a fast-paced environment

Nice to have:

  • Have a deep understanding of distributed systems principles and a proven track record in building and operating scalable and reliable systems
  • Have a keen eye for performance and optimization
  • Are comfortable working in Linux environments, and with tools like Kubernetes, Terraform, CI/CD pipelines, and modern observability stacks
  • Are experienced in collaborating with cross-functional teams
  • Have a humble attitude, an eagerness to help your colleagues, and a desire to do whatever it takes to make the team succeed
  • Own problems end-to-end, and are willing to pick up whatever knowledge you're missing to get the job done
  • Are comfortable with ambiguity and rapid change

What we offer:

  • Medical, dental, and vision insurance for you and your family, with employer contributions to Health Savings Accounts
  • Pre-tax accounts for Health FSA, Dependent Care FSA, and commuter expenses (parking and transit)
  • 401(k) retirement plan with employer match
  • Paid parental leave (up to 24 weeks for birth parents and 20 weeks for non-birthing parents), plus paid medical and caregiver leave (up to 8 weeks)
  • Paid time off: flexible PTO for exempt employees and up to 15 days annually for non-exempt employees
  • 13+ paid company holidays, and multiple paid coordinated company office closures throughout the year for focus and recharge, plus paid sick or safe time (1 hour per 30 hours worked, or more, as required by applicable state or local law)
  • Mental health and wellness support
  • Employer-paid basic life and disability coverage
  • Annual learning and development stipend to fuel your professional growth
  • Daily meals in our offices, and meal delivery credits as eligible
  • Relocation support for eligible employees
  • Additional taxable fringe benefits, such as charitable donation matching and wellness stipends, may also be provided

Additional Information:

Job Posted:
February 21, 2026

Employment Type:
Fulltime
Work Type:
On-site work

Similar Jobs for Software Engineer, Infrastructure Reliability

Software Engineer, Data Infrastructure

The Data Infrastructure team at Figma builds and operates the foundational platf...
Location
United States, San Francisco; New York
Salary:
149000.00 - 350000.00 USD / Year
Figma
Expiration Date
Until further notice
Requirements
  • 5+ years of Software Engineering experience, specifically in backend or infrastructure engineering
  • Experience designing and building distributed data infrastructure at scale
  • Strong expertise in batch and streaming data processing technologies such as Spark, Flink, Kafka, or Airflow/Dagster
  • A proven track record of impact-driven problem-solving in a fast-paced environment
  • A strong sense of engineering excellence, with a focus on high-quality, reliable, and performant systems
  • Excellent technical communication skills, with experience working across both technical and non-technical counterparts
  • Experience mentoring and supporting engineers, fostering a culture of learning and technical excellence
Job Responsibility
  • Design and build large-scale distributed data systems that power analytics, AI/ML, and business intelligence
  • Develop batch and streaming solutions to ensure data is reliable, efficient, and scalable across the company
  • Manage data ingestion, movement, and processing through core platforms like Snowflake, our ML Datalake, and real-time streaming systems
  • Improve data reliability, consistency, and performance, ensuring high-quality data for engineering, research, and business stakeholders
  • Collaborate with AI researchers, data scientists, product engineers, and business teams to understand data needs and build scalable solutions
  • Drive technical decisions and best practices for data ingestion, orchestration, processing, and storage
What we offer
  • equity
  • health, dental & vision
  • retirement with company contribution
  • parental leave & reproductive or family planning support
  • mental health & wellness benefits
  • generous PTO
  • company recharge days
  • a learning & development stipend
  • a work from home stipend
  • cell phone reimbursement
Employment Type:
Fulltime

Senior Software Engineer - ML Infrastructure

We build simple yet innovative consumer products and developer APIs that shape h...
Location
United States, San Francisco
Salary:
180000.00 - 270000.00 USD / Year
Plaid
Expiration Date
Until further notice
Requirements
  • 5+ years of industry experience as a software engineer, with strong focus on ML/AI infrastructure or large-scale distributed systems
  • Hands-on expertise in building and operating ML platforms (e.g., feature stores, data pipelines, training/inference frameworks)
  • Proven experience delivering reliable and scalable infrastructure in production
  • Solid understanding of ML Ops concepts and tooling, as well as best practices for observability, security, and reliability
  • Strong communication skills and ability to collaborate across teams
Job Responsibility
  • Design and implement large-scale ML infrastructure, including feature stores, pipelines, deployment tooling, and inference systems
  • Drive the rollout of Plaid’s next-generation feature store to improve reliability and velocity of model development
  • Help define and evangelize an ML Ops “golden path” for secure, scalable model training, deployment, and monitoring
  • Ensure operational excellence of ML pipelines and services, including reliability, scalability, performance, and cost efficiency
  • Collaborate with ML product teams to understand requirements and deliver solutions that accelerate experimentation and iteration
  • Contribute to technical strategy and architecture discussions within the team
  • Mentor and support other engineers through code reviews, design discussions, and technical guidance
What we offer
  • medical, dental, vision, and 401(k)
Employment Type:
Fulltime

Software Engineer, Infrastructure

The Infrastructure team builds foundational systems at scale. We're hundreds o b...
Location
United States, New York City; San Francisco Bay Area
Salary:
171200.00 - 246000.00 USD / Year
Metronome
Expiration Date
Until further notice
Requirements
  • 5+ years building infrastructure systems: Hands-on experience with distributed systems, cloud infrastructure, container orchestration, data pipelines, observability, CI/CD, or other foundational platforms
  • Ownership of production systems: Track record of operating mission-critical infrastructure with strong focus on reliability, scalability, and performance
  • Force multiplier mindset: You build platforms that enable others. You create abstractions that make complex systems approachable. You think about developer experience as a first-class concern
  • Cross-functional collaboration: You partner effectively with product teams, communicate technical decisions clearly, and mentor engineers across experience levels
Job Responsibility
  • Build platforms that scale: Design and operate foundational infrastructure—Kubernetes clusters, Kafka streaming platforms, Spark batch processing, observability systems—that handle billions of events and enable Metronome to grow with minimal friction
  • Enable product velocity: Create golden paths, abstractions, and tooling that let engineers ship faster and more reliably without becoming infrastructure experts themselves
  • Enable reliability as the product: Take accountability for system uptime, performance, and correctness. Build monitoring, alerting, and incident response systems that enable the entire team to catch problems before customers notice
  • Drive technical direction: Shape Metronome's infrastructure strategy, make platform-level architectural decisions, and mentor engineers across the organization
What we offer
  • Excellent medical, dental, vision, and life insurance coverage, including a One Medical membership
  • Paid parental leave
  • FSA (Flexible spending account)
  • Retirement planning - Traditional and ROTH 401(k)
  • Flexible time off
  • Employee assistance program (mental health benefits)
  • Culture where personal growth is highly valued
  • market-benched equity
  • sales incentive pay (for eligible roles)
  • comprehensive health benefits
Employment Type:
Fulltime

Software Engineer, Infrastructure

Airtable is looking for backend engineers to join the team to help improve criti...
Location
United States, San Francisco; New York; Seattle
Salary:
196000.00 - 339900.00 USD / Year
Airtable
Expiration Date
Until further notice
Requirements
  • At least 8 years of industry experience
  • Experience in areas such as databases, distributed systems, service-oriented architectures, and data infrastructure
  • Strong background in computer science with a degree in CS or a related field
  • Excited about learning new technologies and applying them in a fast-changing environment
  • Based in or willing to relocate to the San Francisco Bay Area or New York City for this role
Job Responsibility
  • Proactively identify and lead significant improvements to Airtable’s infrastructure
  • Work on systems-level problems in a complex design space focusing on scalability, efficiency, reliability, and security
  • Build clean, reusable, and maintainable abstractions for engineers
  • Take full ownership of components of Airtable’s infrastructure, including reliability, performance, efficiency, and observability of the production environment
What we offer
  • Opportunity to receive benefits and restricted stock units; may include incentive compensation
Employment Type:
Fulltime

Software Engineer, Infrastructure

As a Software Engineer on our Infrastructure team, you will help design and buil...
Location
United States, New York; San Mateo; Redwood City
Salary:
140000.00 - 150000.00 USD / Year
Fireworks AI
Expiration Date
Until further notice
Requirements
  • Bachelor’s degree in Computer Science, Engineering, or a related technical field (or equivalent practical experience)
  • Strong programming skills in Python, C++, or a similar language
  • Solid understanding of computer systems concepts such as networking, storage, and distributed computing
  • Familiarity with cloud platforms like AWS, GCP, or Azure, and containerization tools like Docker or Kubernetes
  • Knowledge and interest in cloud infrastructure, distributed systems, and machine learning
Job Responsibility
  • Contribute to the design and development of scalable backend infrastructure that supports distributed training, inference, and data pipelines
  • Build and maintain core backend services such as job schedulers, autoscalers, resource managers, and model serving systems
  • Support performance optimization, cost efficiency, and reliability improvements across compute, storage, and networking layers
  • Collaborate with ML, DevOps, and product teams to translate research and product needs into infrastructure solutions
  • Learn and apply modern cloud technologies including Kubernetes, Ray, Kubeflow, and MLFlow
  • Participate in code reviews, technical discussions, and continuous integration and deployment processes
What we offer
  • Meaningful equity in a fast-growing startup
  • Competitive salary and comprehensive benefits package
Employment Type:
Fulltime

Software Engineer, Infrastructure

Airtable is looking for backend engineers to join our team to help improve criti...
Location
United States, San Francisco; New York; Seattle
Salary:
148100.00 - 250000.00 USD / Year
Airtable
Expiration Date
Until further notice
Requirements
  • 2-8 years of industry experience
  • Experience in areas such as databases, distributed systems, service-oriented architectures, and data infrastructure
  • Strong background in computer science with a degree in CS or a related field
  • Currently based in or willing to relocate to the San Francisco Bay Area
  • Excited about learning new technologies and applying them in a fast-changing environment
Job Responsibility
  • Proactively identify and lead significant improvements to Airtable’s infrastructure
  • Work on systems-level problems in a complex design space where scalability, efficiency, reliability, and security matter
  • Build clean, reusable, and maintainable abstractions
  • Take full ownership of components of Airtable’s infrastructure, including responsibility for reliability, performance, efficiency, and observability of our production environment
What we offer
  • Benefits
  • Restricted stock units
  • Incentive compensation
Employment Type:
Fulltime

Senior Software Engineer (Infrastructure) - HyperDX

Join us in revolutionizing Observability for Developers! We’re on a mission to r...
Location
Netherlands
Salary:
Not provided
ClickHouse
Expiration Date
Until further notice
Requirements
  • 5+ years of backend engineering experience
  • Strong TypeScript and Node.js skills (bonus for additional languages)
  • Deep understanding of APIs, event-driven systems, and high-throughput data pipelines
  • Proficiency in SQL and experience working with analytical databases (ClickHouse experience a plus)
  • Experience with Docker and Kubernetes, plus Helm for managing production deployments
  • Experience with infrastructure-as-code (Terraform, Pulumi, or similar)
  • Familiarity with CI/CD pipelines, monitoring systems, and production-grade alerting practices
  • A passion for building reliable, maintainable, cloud-native systems
Job Responsibility
  • Build the core platform: Design and implement backend systems and APIs that power HyperDX, enabling engineers to ingest, query, and analyze observability data at massive scale
  • Scale deployments and infrastructure: Architect, deploy, and maintain cloud-native systems that ensure reliability, scalability, and performance. You’ll use Kubernetes, Helm, and infrastructure-as-code to make deployments simple and resilient
  • Ensure maintainability and operational excellence: Define best practices for CI/CD, monitoring, logging, and alerting. Drive automation across testing, scaling, and incident response to keep our platform healthy and developer-friendly
  • Engineer for scale: Design and operate ingestion and data processing pipelines that remain performant, resilient, and observable—even as we grow to petabyte-level workloads
  • Engage with the community: Collaborate with open-source contributors and customers, solve their challenges, and incorporate their feedback into our roadmap
What we offer
  • Flexible work environment - ClickHouse is a globally distributed company and remote-friendly. We currently operate in 20 countries
  • Healthcare - Employer contributions towards your healthcare
  • Equity in the company - Every new team member who joins our company receives stock options
  • Time off - Flexible time off in the US, generous entitlement in other countries
  • A $500 Home office setup if you’re a remote employee
  • Global Gatherings – We believe in the power of in-person connection and offer opportunities to engage with colleagues at company-wide offsites
Employment Type:
Fulltime

Senior Software Engineer (Infrastructure) - HyperDX

Join us in revolutionizing Observability for Developers! We’re on a mission to r...
Location
Germany
Salary:
Not provided
ClickHouse
Expiration Date
Until further notice
Requirements
  • 5+ years of backend engineering experience
  • Strong TypeScript and Node.js skills (bonus for additional languages)
  • Deep understanding of APIs, event-driven systems, and high-throughput data pipelines
  • Proficiency in SQL and experience working with analytical databases (ClickHouse experience a plus)
  • Experience with Docker and Kubernetes, plus Helm for managing production deployments
  • Experience with infrastructure-as-code (Terraform, Pulumi, or similar)
  • Familiarity with CI/CD pipelines, monitoring systems, and production-grade alerting practices
  • A passion for building reliable, maintainable, cloud-native systems
Job Responsibility
  • Build the core platform: Design and implement backend systems and APIs that power HyperDX, enabling engineers to ingest, query, and analyze observability data at massive scale
  • Scale deployments and infrastructure: Architect, deploy, and maintain cloud-native systems that ensure reliability, scalability, and performance. You’ll use Kubernetes, Helm, and infrastructure-as-code to make deployments simple and resilient
  • Ensure maintainability and operational excellence: Define best practices for CI/CD, monitoring, logging, and alerting. Drive automation across testing, scaling, and incident response to keep our platform healthy and developer-friendly
  • Engineer for scale: Design and operate ingestion and data processing pipelines that remain performant, resilient, and observable—even as we grow to petabyte-level workloads
  • Engage with the community: Collaborate with open-source contributors and customers, solve their challenges, and incorporate their feedback into our roadmap
What we offer
  • Flexible work environment - ClickHouse is a globally distributed company and remote-friendly. We currently operate in 20 countries
  • Healthcare - Employer contributions towards your healthcare
  • Equity in the company - Every new team member who joins our company receives stock options
  • Time off - Flexible time off in the US, generous entitlement in other countries
  • A $500 Home office setup if you’re a remote employee
  • Global Gatherings – We believe in the power of in-person connection and offer opportunities to engage with colleagues at company-wide offsites
Employment Type:
Fulltime