Software Engineer, Site Reliability

United States, San Mateo · Job Posted December 08, 2025

Job Description

As a Site Reliability Engineer (SRE) at Fireworks AI, you will play a critical role in making our world-scale virtual AI cloud reliable, performant, and efficient. You will apply your expertise in large-scale distributed systems, cloud infrastructure, and operational excellence, and partner closely with world-class software engineers and AI experts to scale cutting-edge AI platforms to meet fast-growing demand and ever-evolving application paradigms. This role is for someone passionate about operating highly robust, observable, and automated systems and enabling customer success.

Job Responsibility

  • Ensuring System Reliability: Ensure systems are designed and implemented with high availability, scalability, and performance. Focus on fault tolerance, disaster recovery, identifying and removing scaling bottlenecks, and performance optimization across our multi-cloud infrastructure
  • Incident Management & Response: Lead efforts in incident detection, response, and resolution for critical production issues. Drive post-mortems to identify root causes and implement preventative measures to improve system reliability
  • Observability & Monitoring: Develop, implement, and maintain comprehensive monitoring, alerting, logging, and tracing solutions to provide deep insights into system health and performance
  • Automation & Toil Reduction: Identify and automate repetitive operational tasks to reduce toil and improve operational efficiency. Develop tools and scripts to streamline deployments, scaling, and system management
  • Capacity Planning & Performance Tuning: Work proactively on capacity planning to ensure our infrastructure can gracefully handle growth and peak loads. Optimize system performance and resource utilization
  • Reliability Best Practices: Collaborate with software engineers to embed reliability principles (e.g., SLOs, SLIs, error budgets) into the development lifecycle, promoting a culture of operational excellence
  • On-call Rotation: Participate in a periodic on-call rotation to support our production environment and respond to critical alerts

Requirements

  • Bachelor's degree in Computer Science, related technical field, or equivalent practical experience
  • 5+ years of experience in Site Reliability Engineering, DevOps, or a similar role focused on large-scale production systems
  • Deep expertise in SRE principles and practices, including SLOs, SLIs, operational automation, incident management, and post-mortems
  • Extensive hands-on experience with public cloud platforms (AWS, GCP, Azure), including compute, networking, storage, and database services
  • Strong experience with containerization technologies (Docker) and orchestration platforms (Kubernetes)
  • Proficiency in designing and implementing robust monitoring, logging, and alerting systems using tools like Prometheus, Grafana, ELK stack, and distributed tracing
  • Solid programming/scripting skills in at least one language (e.g., Python, Go) for automation and tool development
  • In-depth knowledge of Linux operating systems, networking fundamentals, and system debugging
  • Proven ability to troubleshoot complex issues across the entire stack
  • Excellent communication, collaboration, and problem-solving skills
  • Willingness to participate in on-call rotations

Nice to have

  • Experience managing data-center-grade GPU clusters, including monitoring, troubleshooting, and repairing GPUs and peripherals such as HBM and RDMA-enabled networking
  • Experience with machine learning infrastructure, model serving, or distributed AI frameworks
  • Hands-on experience in security and data protection

Similar Jobs for Software Engineer, Site Reliability

8 matching positions

Senior Software Engineer - Site Reliability Engineer/SRE

Software-defined vehicles represent a new paradigm for automakers and consumers,...
Location: United States, Sunnyvale
Salary: 152100.00 - 232900.00 USD / Year
Company: General Motors
Expiration Date: Until further notice
Requirements
  • Azure experience is a must
  • 5+ years of hands-on DevOps experience with at least one of the public cloud providers - Azure (preferred), AWS, GCP
  • Excellent skills with Terraform
  • Experience with monitoring and log aggregation frameworks such as Logstash, Splunk, Datadog, Elasticsearch, and Kibana
  • Strong CS fundamentals, including OO concepts, data structures, algorithms, and distributed systems
  • Experience with Bash, PowerShell, Python, Go, or Groovy
  • On-call and fire-fighting experience
  • Experience with modern site reliability practices, including but not limited to post-mortems, SLOs/SLIs, tracing, and synthetic monitoring
What we offer
  • This position may be eligible for relocation benefits
Employment type: Full-time

Staff Software Engineer - Site Reliability

Ironclad is the leading AI contracting platform that transforms agreements into ...
Location: United States, San Francisco; New York City
Salary: 210000.00 - 235000.00 USD / Year
Company: Ironclad
Expiration Date: Until further notice
Requirements
  • Minimum of 5 years of experience in a Site Reliability Engineering / DevOps role
  • Expert knowledge of Docker and Kubernetes; Crossplane experience is a plus
  • Strong knowledge of cloud platforms such as AWS and Google Cloud
  • Proficiency in scripting and programming languages like Python, TypeScript, or Bash
  • Experience with infrastructure-as-code tools like Terraform or Pulumi
  • Strong troubleshooting and analytical skills, drive to help customers, and the ability to dive deep and learn a new product
  • Experience with CI/CD pipelines and deployment automation tools such as CircleCI and ArgoCD
  • Strong understanding of networking and security principles
Job Responsibility
  • Be part of the Cloud Platform SRE Team, focused on building our Cloud Platform using modern tools and best practices
  • Champion SRE best practices within the team and throughout the organization
  • Ensure the reliability, availability, and performance of services and infrastructure
  • Solve the whole problem. Design, implement, and maintain scalable systems
  • Automate repetitive operational tasks to streamline processes
  • Monitor system performance and troubleshoot issues proactively
  • Develop and document best practices for system operations
  • Collaborate with development teams to enhance system design
  • Manage incident responses and perform root cause analysis
  • Participate in on-call rotations to handle critical issues as they arise
What we offer
  • 100% health coverage for employees (medical, dental, and vision), and 75% coverage for dependents with buy-up plan options available
  • Market-leading leave policies, including gender-neutral parental leave and compassionate leave
  • Family forming support through Maven for you and your partner
  • Paid time off - take the time you need, when you need it
  • Monthly stipends for wellbeing, hybrid work, and (if applicable) cell phone use
  • Mental health support through Modern Health, including therapy, coaching, and digital tools
  • Pre-tax commuter benefits (US Employees)
  • 401(k) plan with Fidelity with employer match (US Employees)
  • Regular team events to connect, recharge, and have fun
  • And most importantly: the opportunity to help build the company you want to work at
Employment type: Full-time

Senior Software Engineer, Site Reliability

Babylist is looking for a Senior Software Engineer, Site Reliability to join our...
Location: United States; Canada
Salary: 186818.00 - 224183.00 USD; CAD / Year
Company: Babylist
Expiration Date: Until further notice
Requirements
  • 8+ years of experience as a Site Reliability Engineer or similar role
  • Experience supporting high-traffic consumer-facing websites
  • Proficiency with Terraform
  • Strong experience working with AWS cloud-based infrastructure and services
  • Proficiency with Docker and Kubernetes
  • Solid understanding of cloud-native systems design
  • Troubleshooting and debugging skills
  • Experience designing and supporting CI systems
  • Familiar with monitoring and alerting best practices
  • Proven experience in on-call management best practices
Job Responsibility
  • Manage and build our AWS infrastructure using Infrastructure as Code (IaC) tools like Terraform
  • Improve the speed and reliability of our Continuous Integration (CI) systems
  • Provide support to developers in troubleshooting issues
  • Establish, communicate, and support best practices for monitoring and alerting
What we offer
  • Company-paid medical, dental, and vision insurance
  • Retirement savings plan with company matching and flexible spending accounts
  • Generous paid parental leave and PTO
  • Remote work stipend
  • Perks for physical, mental, and emotional health, parenting, childcare, and financial planning
Employment type: Full-time

Site Reliability Engineer / Observability Engineer

Rackspace is building up its Professional Services Center of Excellence on Appli...
Location: Egypt, Giza
Salary: Not provided
Company: Rackspace
Expiration Date: Until further notice
Requirements
  • Bachelor’s degree in engineering/computer science or equivalent
  • Senior-level experience with Site Reliability Engineering; DevOps; code-level application support and troubleshooting; AWS infrastructure design, implementation, and optimization; and automation for deployment, scaling, and reliability
  • Experience with observability tools such as Splunk, Datadog, and SignalFx
  • Experience deploying, maintaining and supporting software applications/services in the AWS ecosystem
  • Proactive approach to identifying problems and solutions
  • Experience writing code with one or more interpreted languages such as Python, PHP, Perl, Ruby, Linux Shell
  • Experience with Terraform or CloudFormation scripting
  • Experience with configuration management tools like Ansible, Chef or Puppet
  • Experience with standard software development best practices and tools such as code repositories (Git preferred)
  • Experience executing in an agile software development environment
Job Responsibility
  • Work with customers and implement Observability solutions
  • Build and maintain scalable systems and robust automation that supports engineering goals
  • Develop and maintain monitoring tools, alerts, and dashboards to provide visibility into system health and performance
  • Proactively gather and analyze both metric and log data from systems and applications to perform anomaly detection, performance tuning, capacity planning and fault isolation
  • Collaborate with development teams to implement and deploy new features and enhancements, ensuring they meet reliability, security and performance standards
  • Collaborate with team members to document and share solutions
  • Maintain a deep understanding of the customer’s business as well as their technical environment
  • Identify performance bottlenecks and anomalous system behavior, and resolve the root causes of service issues
Employment type: Full-time

Intermediate Site Reliability Engineer SRE – AI Reliability & Automation

At PointClickCare our mission is simple: to help providers deliver exceptional c...
Location: Canada, Mississauga
Salary: 115000.00 - 128000.00 CAD / Year
Company: PointClickCare
Expiration Date: Until further notice
Requirements
  • 5+ years' experience in software engineering
  • Experience with SRE principles
  • Experience with AI/ML in production environments
  • A passion for automation, intelligent systems, and operational excellence
  • Strong debugging, problem-solving, and system design skills
  • Languages: Python, Java, Bash, Terraform
  • Platforms: Azure, Kubernetes, Docker
  • Tools: Datadog, Prometheus, AppDynamics, ELK, GitHub Actions
  • ML/AI: MCP framework, AI agents, Vector store, Agent orchestration (LangChain), RAG
  • CI/CD: Jenkins, ArgoCD, Spinnaker
Job Responsibility
  • Build ML-based anomaly detection and pattern recognition systems
  • Enhance telemetry with smart tagging and metadata for better AI insights
  • Develop event-driven workflows and self-healing systems using AI triggers
  • Automate incident response with generative AI and custom AI agent orchestration
  • Use time-series forecasting and predictive modelling to anticipate failures
  • Optimise infrastructure with AI-powered autoscaling and cost-aware resource allocation
  • Build scalable, fault-tolerant systems in a cloud-native environment
  • Participate in on-call rotations and lead incident response for critical systems
  • Skilled in API integration for streamlined data exchange and system connectivity
  • Run internal AIOps workshops and help teams adopt AI maturity models
What we offer
  • Benefits starting from Day 1!
  • Retirement Plan Matching
  • Flexible Paid Time Off
  • Wellness Support Programs and Resources
  • Parental & Caregiver Leaves
  • Fertility & Adoption Support
  • Continuous Development Support Program
  • Employee Assistance Program
  • Allyship and Inclusion Communities
  • Employee Recognition … and more!
Employment type: Full-time

Software Engineer 2 / Senior Software Engineer - Azure Data

Microsoft's Azure Data engineering team is leading the transformation of analyti...
Location: India, Bangalore
Salary: Not provided
Company: Microsoft Corporation
Expiration Date: Until further notice
Requirements
  • Bachelor's Degree in Computer Science or related technical field AND 3+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
  • Experience with the Azure stack including Storage, Compute, Networking, Fabric, Purview, Synapse, AKS, DevOps, Data Factory, or Power BI
  • Experience with big data technologies such as Spark, Kafka, Hadoop, or HBase
  • Experience building data lake or data engineering products, tools, or pipelines
  • Familiarity with container-based architectures (Docker, Kubernetes)
  • Ability to debug complex distributed systems on Linux and/or Windows platforms
Job Responsibility
  • Write extensible, maintainable code in C#, Java, Scala, or Python for Fabric Materialized Lake View services and HDInsight components
  • Use AI tools and coding best practices across the development lifecycle
  • Design data refresh, scheduling, and query optimisation features with minimal supervision
  • Review code from teammates for correctness, test coverage, security risks, and adherence to team standards
  • Coach junior engineers through code reviews
  • Debug complex issues in distributed systems running on Azure, Linux, and Windows
  • Run live site operations on a rotational, on-call basis
  • Integrate logging and instrumentation to gather telemetry on system health, performance, reliability, and security
  • Work with product managers, technical leads, and partners across geographies to define customer requirements for Materialized Lake View features
Employment type: Full-time

Software Engineer II & Senior Software Engineer

Security represents the most critical priorities for our customers in a world aw...
Location: United States, Redmond
Salary: 100600.00 - 199000.00 USD / Year
Company: Microsoft Corporation
Expiration Date: Until further notice
Requirements
  • Bachelor's Degree in Computer Science or related technical field AND 2+ years technical engineering experience with coding in languages including, C, C++, C#, or Python OR equivalent experience
  • Ability to meet Microsoft, customer and/or government security screening requirements
  • Microsoft Cloud Background Check
  • Master's Degree in Computer Science or related technical field AND 3+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR Bachelor's Degree in Computer Science or related technical field AND 5+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
  • Experience troubleshooting and optimizing automation, reliability, and monitoring for live-site operations as part of an on-call rotation owned by the engineering team
  • Experience with distributed systems, messaging systems such as Kafka, and large-scale system design
Job Responsibility
  • Lead the architecture, design and implementation of services for extremely high scale, throughput, durability, and low latency
  • Innovate and make service deployment and maintenance an efficient well-oiled machine that provides excellent reliability with minimal manual engineer intervention
  • Conduct in-depth triage, troubleshooting, and forensics across all facets of the cloud stack while executing processes for corrective action and continual service improvement
  • Drive Infrastructure security improvements for mission critical high scale workloads
  • Lead the definition of requirements, KPIs, priorities and planning of engineering deliverables
  • Mentor and grow an energetic, diverse, and driven team with a good mix of senior and mid-level engineers
Employment type: Full-time

Senior Software Engineer and Principal Software Engineer - PowerPoint AI Team

The PowerPoint team is embarking on an exciting new chapter - evolving a product...
Location: United States, Redmond
Salary: 119800.00 - 234700.00 USD / Year
Company: Microsoft Corporation
Expiration Date: Until further notice
Requirements
  • Bachelor's Degree in Computer Science or related technical field AND 4+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
  • The ability to meet Microsoft, customer, and/or government security screening requirements is required for this role
  • Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud background check upon hire/transfer and every two years thereafter
  • Bachelor's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
  • 8+ years of experience in backend service engineering, including work on high-scale infrastructures
  • Proficiency in one or more systems programming languages such as C#, C++
  • 1+ years of experience in software engineering, designing and developing systems (and APIs) that deploy and integrate with AI models
  • 2+ years of experience working with rich telemetry, making data driven decisions, and carrying out rapid experimentation
  • 2+ years of experience building software for scale, performance, and reliability
  • Academic or industry experience building, fine-tuning, or deploying models (any category), or building eval-driven systems that use them
Job Responsibility
  • Lead design and delivery of complex, scalable AI features ensuring resilience and exceptional user experience
  • Drive technical strategy and architecture decisions across multiple services, influencing partner teams and aligning with compliance and security requirements
  • Champion modern engineering practices, including AI-driven approaches, automation, and cloud-native patterns, across the full development lifecycle
  • Mentor and guide engineers, fostering technical excellence and continuous improvement in security, reliability, and performance
  • Collaborate cross-org to solve challenging technical problems, streamline processes, and reduce operational costs while improving live-site health
  • Design and implement scalable backend services optimized for machine learning workflows and large language model integration
  • Develop and maintain evaluation-driven systems that leverage text and multimodal inputs (e.g., images) to power visual-creation experiences
  • Build and optimize APIs and infrastructure to support high-performance model inference and experimentation at scale
  • Collaborate with product, ML, and design teams to integrate models into user-facing features, ensuring seamless functionality and performance
  • Conduct model evaluations and experiments, analyze results, and iterate on improvements to enhance accuracy and user experience
Employment type: Full-time