Staff Software Engineer - Site Reliability Job at Ironclad (San Francisco)

Staff Site Reliability Engineer, Managed AI

At Crusoe, our Site Reliability Engineering team ensures the reliability and sca...

Location

United States, San Francisco; Sunnyvale

Salary:

204000.00 - 247000.00 USD / Year

Crusoe

Expiration Date

Until further notice

Requirements

Strong software engineering background — experience building production-grade systems beyond scripting or Bash
Demonstrated experience in distributed systems design and implementation
Hands-on work with large language models (LLMs) or AI/ML infrastructure
SRE mindset and experience (whether or not under the SRE title), including defining and measuring SLIs/SLOs, building monitoring and observability systems, driving performance and reliability improvements, and designing fault-tolerant systems and automated testing strategies
Proficiency in at least one modern programming language (Python, Go, Java, C++)
Familiarity with Kubernetes or container orchestration platforms
Strong collaboration and communication skills

Job Responsibility

Design and operate reliable managed AI services with a focus on serving and scaling LLM workloads
Build automation and reliability tooling to support distributed AI pipelines and inference services
Define, measure, and improve SLIs/SLOs across AI workloads to ensure performance and reliability targets are met
Collaborate with AI, platform, and infrastructure teams to optimize large-scale training and inference clusters
Automate observability by building telemetry and performance tuning strategies for latency-sensitive AI services
Investigate and resolve reliability issues in distributed AI systems using telemetry, logs, and profiling
Contribute to the architecture of next-generation distributed systems purpose-built for AI-first environments

What we offer

Restricted Stock Units in a fast-growing, well-funded technology company
Health insurance package options that include HDHP and PPO, vision, and dental for you and your dependents
Employer contributions to HSA accounts
Paid Parental Leave
Paid life insurance, short-term and long-term disability
Teladoc
401(k) with a 100% match up to 4% of salary
Generous paid time off and holiday schedule
Cell phone reimbursement
Tuition reimbursement

Fulltime

Staff Site Reliability Engineer

Ever since we started in 2007, Sunrun has been at the forefront of connecting pe...

Location

United States, Lehi

Salary:

242050.00 USD / Year

Sunrun

Expiration Date

Until further notice

Requirements

Bachelor’s in Computer Information Systems, Software Engineering, or a closely related field
5 years of experience as a Software Developer using Microservices hosted in Azure
5 years of experience with Virtualization and cloud computing
5 years of experience with Object-Oriented Design (OOD) and Object-Oriented Programming (OOP)
5 years of experience building software solutions in an engineering environment using Python & Shell scripting
5 years of experience with Network analysis, debugging and troubleshooting with Wireshark & Git

Job Responsibility

Provide strategic leadership in designing, implementing, and managing the overall infrastructure strategy for our organization
Leverage cloud platforms (e.g., AWS, Azure) to design, deploy, and manage scalable infrastructure solutions
Spearhead the definition of advanced monitoring requirements and elevate SLAs
Collaborate with the engineering team and TPM to implement and enhance monitoring practices
Expertly convey intricate technical information to diverse stakeholders with clarity and precision
Provide leadership in integrating advanced SRE principles into applications and services
Lead the implementation of sophisticated system design measures for heightened security, performance, and resiliency
Develop strategic notification strategies for production outages
Leverage SLOs and SLIs to measure and optimize availability, latency, and response time
Lead and strategize emergency response efforts, conduct retrospectives with RCA, and manage on-call workloads effectively

What we offer

Medical/Dental/Vision Insurance
Life Insurance
Disability Insurance
401k Plan + Company Match
Stock Purchase Plan
Paid Vacations/Holidays
Paid Baby Bonding Leave
Employee Discounts
PowerU - 100% Funded Education Programs
Employee Donation Matching

Fulltime

Staff Site Reliability Engineer

Join our Site Reliability Engineering (SRE) team and help ensure the reliability...

Location

United States

Salary:

220000.00 - 325000.00 USD / Year

Replit

Expiration Date

Until further notice

Requirements

8-10 years of experience in Site Reliability Engineering or similar roles (e.g., DevOps, Systems Engineering, Infrastructure Engineering)
Strong programming skills in languages like Python or Go
Deep understanding of distributed systems
Deep experience with container orchestration platforms, specifically Kubernetes, and cloud-native technologies
Proven track record of designing, implementing, and maintaining sophisticated monitoring and observability solutions
Strong incident management skills with extensive experience leading incident response for complex systems
Experience with infrastructure as code (e.g., Terraform, Pulumi) and configuration management tools
Excellent written and verbal communication skills
Strong interpersonal skills, with experience working with and mentoring engineers
A willingness to dive into understanding, debugging, and improving any layer of the stack

Job Responsibility

Architect and Implement Observability
Define and Drive Reliability Standards
Lead Incident Management and Response
Drive Automation and Infrastructure as Code
Optimize Performance on Kubernetes
Debug and Harden Distributed Systems
Provide Staff-Level Guidance
Educate and Mentor
Build and Integrate

What we offer

Competitive Salary & Equity
401(k) Program with a 4% match
Health, Dental, Vision and Life Insurance
Short Term and Long Term Disability
Paid Parental, Medical, Caregiver Leave
Commuter Benefits
Monthly Wellness Stipend
Autonomous Work Environment
In Office Set-Up Reimbursement
Flexible Time Off (FTO) + Holidays

Fulltime

Sr Staff / Principal Site Reliability Engineer - Network & Security Operations

As a Site Reliability Engineer, you will be responsible for Palo Alto Networks’ ...

Location

United States, Santa Clara

Salary:

154000.00 - 249500.00 USD / Year

Palo Alto Networks

Expiration Date

Until further notice

Requirements

8+ years of experience with IaC and infrastructure automation tools, using Terraform & Ansible, and CI/CD tools
Expert knowledge of cloud orchestration via GKE, EKS, etc., preferably on GCP
Experienced in designing and implementing Business Continuity Plans and Disaster Recovery Plans
Expert knowledge of firewall technologies (PANW preferred), including VPNs and routing
Advanced knowledge of shell scripting and programming languages such as Perl, Ruby, PHP, or Python
Advanced knowledge of DNS and DHCP, and Microsoft AD infrastructure
Strong analytical skills for interpreting business requirements and translating them into technical specifications
Strong project management, time management, and organizational skills
Excellent communication skills, including the ability to write network and security documentation, policies, and guidelines
Ability to work nights and weekends and provide 24/7 on-call support

Job Responsibility

Design, implement and provide support for IT infrastructure compute components
Install, support and maintain software infrastructure according to best practices, including routers, load balancers, switches, Wi-Fi controllers, and firewalls via Terraform/Ansible automation
Perform network security design and integration
Diagnose problems and solve issues, often under time constraints
Implement the necessary controls and procedures to protect information systems assets from intentional or inadvertent modification, disclosure, or destruction
Ensure system uptime and backup for all IT infrastructure
Provide security incident triage and response, including working with firewall and device logs, investigating security events, protecting the forensic value of data, and establishing monitoring and incident reporting and response procedures
Work closely with engineering to help report issues and manage project deliverables and provide status and progress reports
Provide on-call support for Incident Management

What we offer

restricted stock units
bonus

Fulltime

Senior Staff Engineer Software (Cloud Platform, Production & Reliability – Machine Identity Security)

The Production Engineering team is responsible for building, scaling, and operat...

Location

United States, Santa Clara

Salary:

126000.00 - 203500.00 USD / Year

Palo Alto Networks

Expiration Date

Until further notice

Requirements

5+ years of experience in DevOps, Platform Engineering, or Site Reliability Engineering (SRE)
Strong experience designing and operating cloud infrastructure on AWS, Azure, or GCP
Deep expertise managing and scaling Kubernetes environments (EKS, AKS, or GKE)
Strong experience with Infrastructure as Code tools (Terraform, Ansible, or Pulumi)
Proven experience designing and maintaining complex CI/CD systems (Jenkins, GitLab CI, ArgoCD, GitHub Actions)
Strong programming/scripting skills (Python, Go, or similar) for automation and tooling
Experience operating in high-scale, 24/7 production environments with ownership of incident response and reliability
Solid understanding of Linux systems and networking fundamentals (DNS, TCP/IP, load balancing, VPC, mTLS)
Strong problem-solving skills and ability to work across teams

Job Responsibility

Design, build, and evolve highly available cloud infrastructure platforms with a focus on scalability, resilience, and reliability
Lead improvements across production systems, including performance, availability, and incident response
Drive and standardize Infrastructure as Code (IaC) practices to improve consistency and reduce operational overhead
Design and optimize CI/CD pipelines to support fast, secure, and reliable software delivery at scale
Partner with development teams to improve system reliability, observability, and cloud-native design patterns
Define and implement monitoring, alerting, and observability strategies across distributed systems
Lead incident response efforts, including root cause analysis and long-term remediation strategies
Identify and eliminate operational toil through automation and system improvements
Mentor engineers and contribute to raising the bar for production engineering practices

What we offer

restricted stock units
bonus

Fulltime

Staff Software Engineer - Digital eCommerce

Work Arrangement: This role is categorized as hybrid. This means the successful...

Location

United States, Austin, Texas; Warren, Michigan

Salary:

Not provided

General Motors

Expiration Date

Until further notice

Requirements

Bachelor's degree in Computer Science or a related field, or equivalent work experience.
7+ years of rigorous software engineering experience, with a heavy concentration on enterprise-grade eCommerce solutions (shop, search, cart, checkout, order management) with full stack experience.
Extensive background in Site Reliability Engineering (SRE) principles, telemetry, and observability tools (Datadog) to ensure zero-downtime deployments and rapid incident resolution.
Extensive experience designing and building event-driven microservices, high-throughput RESTful APIs, and backend systems (e.g., Java, Python, Node.js).
Proven track record of mastering new languages, frameworks, and architectural paradigms quickly to solve complex business problems.
Strong hands-on experience with at least one major cloud platform (AWS, Azure, GCP) and containerization/orchestration.
Experience leveraging modern AI coding assistants and automation tools to enhance personal and team output.

Job Responsibility

Champion Operational Excellence: Help in platform management by establishing gold-standard monitoring, alerting, and incident response protocols, specifically for our Shopify infrastructure.
Shopify & eCommerce Leadership: Lead the deployment, configuration, and scaling of Shopify and custom eCommerce platforms, ensuring they can handle peak seasonal traffic with absolute reliability and low latency.
Architect for Scale: Design and implement highly available microservices and APIs that seamlessly integrate Shopify with GM's internal systems, IBM OMS, and external partners.
Drive AI & Engineering Excellence: Champion the integration of AI-assisted development tools (e.g., GitHub Copilot, AI-driven observability) to accelerate code delivery, automate repetitive tasks, and elevate team productivity.
Technical Mentorship: Provide strategic technical direction to a team of mid-level and senior software engineers, aligning architectural decisions with Digital Commerce business goals.
System Resilience: Partner tightly with DevOps and SRE teams to embed advanced observability frameworks (e.g., Datadog) and robust CI/CD pipelines directly into the engineering culture.
Innovate the Platform: Architect systems that support personalized recommendations, dynamic pricing, and real-time inventory visibility to continuously optimize the B2C eCommerce flow.

What we offer

Company Vehicle: Upon successful completion of a motor vehicle report review, you will be eligible to participate in a company vehicle evaluation program, through which you will be assigned a General Motors vehicle to drive and evaluate. Note: program participants are required to purchase/lease a qualifying GM vehicle every four years unless one of a limited number of exceptions applies.
This job may be eligible for relocation benefits.

Fulltime

Site Reliability Engineer

As a Site Reliability Engineer (SRE) you will play a pivotal role in the design,...

Location

India, Noida

Salary:

Not provided

Taazaa Inc

Expiration Date

Until further notice

Requirements

5+ years of professional experience in SRE, DevOps, or software engineering with a strong focus on production systems
Deep hands-on experience operating distributed cloud systems (AWS / GCP / Azure — at least one in depth, preferably AWS)
Proficiency in at least one modern programming language used for tooling & automation (Go, Python, TypeScript/JavaScript, Rust)
Strong observability expertise: building dashboards and alerts (Grafana, Groundcover, Datadog, New Relic, Prometheus, etc.); distributed tracing (OpenTelemetry, Jaeger, Zipkin); structured logging and metrics at scale
Proven track record of incident management, postmortems, and driving reliability improvements
Experience defining and working with SLOs, SLIs, and error budgets
Comfort with infrastructure as code and modern DevOps practices (CI/CD, GitOps, containers/Kubernetes)
Excellent collaboration skills — you enjoy partnering with product engineers and teaching reliability concepts
Bias toward automation and reducing manual toil
Effective Communication

Job Responsibility

Partner with product engineering squads to design, build, and operate highly reliable services
Own and improve production reliability end-to-end: define and measure SLOs/SLIs, error budgets, and reliability goals; lead incident response, postmortems, and follow-up action items; participate in the on-call rotation and drive rapid, effective resolution of production issues
Build and maintain world-class observability: create comprehensive dashboards, alerts, metrics, structured logging, and distributed tracing; enable squads to understand system behavior and debug effectively
Develop automation, tooling, and infrastructure as code to reduce toil and increase developer velocity
Collaborate closely with Staff Engineers / Team Leads to: embed reliability best practices into the development lifecycle; review architectural decisions with a production lens; mentor engineers on operational excellence, observability, and on-call mindset
Champion modern engineering and DevOps practices: CI/CD pipelines; progressive delivery (feature flags, canaries, blue-green); infrastructure as code (Terraform, Pulumi, CDK); effective use of AI-assisted tools to accelerate scripting, debugging, and documentation
Proactively identify and eliminate classes of failure through chaos engineering, capacity planning, and performance tuning
Help evolve our technical strategy for reliability, scalability, and cost-efficiency

What we offer

Competitive compensation and performance-based incentives
Opportunities for professional growth through workshops and certifications
Flexible work-life balance with remote options
Collaborative culture
Exposure to diverse projects across various industries
Clear career advancement pathways
Comprehensive health benefits
Perks like team-building activities

Fulltime

Staff Software Engineer – Forward Deployed

We are seeking a skilled Software Engineer who will design, build, and maintain ...

Location

China, Shanghai; Dalian; Wuhan

Salary:

Not provided

Pfizer

Expiration Date

Until further notice

Requirements

Bachelor's degree in Computer Science, Engineering, or related field with 8-12 years of relevant experience
AI-Augmented Development: optimize AI tool usage, train engineers on AI-augmented workflows, evaluate new AI development tools, establish practices that balance AI speed with verification rigor
Business Immersion: rapidly acquire domain expertise, translate between business and engineering, mentor engineers on immersion
Data Integration: navigate complex enterprise data landscapes, build relationships to gain data access, handle undocumented schemas, build robust integration solutions, mentor engineers on data integration
Full-Stack Development: build complete applications rapidly across any technology stack, select the right tools, balance technical debt with delivery speed, mentor engineers on full-stack development
Multi-Audience Communication: influence through communication at all levels, handle difficult conversations skillfully, train engineers on effective communication, represent teams across the function
Problem Discovery: seek out undefined problems, embed with users to discover latent needs, coach engineers on problem discovery techniques, turn ambiguity into clear problem statements
Rapid Prototyping & Validation: lead rapid delivery initiatives, coach on prototype-first approaches, establish trust through consistent fast delivery, define clear criteria for prototype-to-production transitions
Site Reliability Engineering: define reliability standards, drive post-incident improvements systematically, design capacity planning processes, mentor engineers on SRE practices
Stakeholder Management: influence senior stakeholders, manage complex stakeholder landscapes with competing agendas, build trust rapidly with new stakeholders, shield teams from organizational friction

Job Responsibility

Delivery: Lead technical delivery of complex projects across multiple teams, unblock others through hands-on contributions, ensure engineering quality
AI: Design AI-augmented engineering workflows for your area, evaluate new AI tools, train engineers on effective AI usage, balance speed with verification
People: Coach multiple engineers on career growth, lead hiring for technical roles across your area, shape team technical culture
Business: Drive business outcomes through technical solutions across your area, influence product roadmaps, partner effectively with business stakeholders
Process: Drive process efficiency within your team, coordinate cross-functional technical work, lead retrospectives
Documentation: Design documentation strategies for your projects, ensure knowledge persists beyond individuals, write specifications that enable effective collaboration

Fulltime
