Software Engineer, Site Reliability Job at Fireworks AI (San Mateo)

Senior Software Engineer - Site Reliability Engineer/SRE

Software-defined vehicles represent a new paradigm for automakers and consumers,...

Location

United States, Sunnyvale

Salary:

152100.00 - 232900.00 USD / Year

General Motors

Expiration Date

Until further notice

Requirements

Azure experience is a must
5+ years of hands-on DevOps experience with at least one of the public cloud providers - Azure (preferred), AWS, GCP
Excellent skills with Terraform
Experience with monitoring and log aggregation tools such as Logstash, Splunk, Datadog, Elasticsearch, and Kibana
Strong CS fundamentals, including OO concepts, data structures, algorithms, and distributed systems
Experience with Bash, PowerShell, Python, Go, or Groovy
On-call and fire-fighting experience
Experience with modern site reliability practices, including but not limited to postmortems, SLOs/SLIs, tracing, and synthetic monitoring

What we offer

This position may be eligible for relocation benefits

Fulltime

Staff Software Engineer - Site Reliability

Ironclad is the leading AI contracting platform that transforms agreements into ...

Location

United States, San Francisco; New York City

Salary:

210000.00 - 235000.00 USD / Year

Ironclad

Expiration Date

Until further notice

Requirements

Minimum of 5 years of experience in a Site Reliability Engineering / DevOps role
Expert knowledge of Docker and Kubernetes; Crossplane experience is a plus
Strong knowledge of cloud platforms such as AWS and Google Cloud
Proficiency in scripting and programming languages like Python, TypeScript, or Bash
Experience with infrastructure-as-code tools like Terraform or Pulumi
Strong troubleshooting and analytical skills, drive to help customers, and the ability to dive deep and learn a new product
Experience with CI/CD pipelines and deployment automation tools such as CircleCI and ArgoCD
Strong understanding of networking and security principles

Job Responsibility

Be part of the Cloud Platform SRE Team, focused on building our Cloud Platform using modern tools and best practices
Champion SRE best practices within the team and throughout the organization
Ensure the reliability, availability, and performance of services and infrastructure
Solve the whole problem. Design, implement, and maintain scalable systems
Automate repetitive operational tasks to streamline processes
Monitor system performance and troubleshoot issues proactively
Develop and document best practices for system operations
Collaborate with development teams to enhance system design
Manage incident responses and perform root cause analysis
Participate in on-call rotations to handle critical issues as they arise

What we offer

100% health coverage for employees (medical, dental, and vision), and 75% coverage for dependents with buy-up plan options available
Market-leading leave policies, including gender-neutral parental leave and compassionate leave
Family forming support through Maven for you and your partner
Paid time off - take the time you need, when you need it
Monthly stipends for wellbeing, hybrid work, and (if applicable) cell phone use
Mental health support through Modern Health, including therapy, coaching, and digital tools
Pre-tax commuter benefits (US Employees)
401(k) plan with Fidelity with employer match (US Employees)
Regular team events to connect, recharge, and have fun
And most importantly: the opportunity to help build the company you want to work at

Fulltime

Senior Software Engineer, Site Reliability

Babylist is looking for a Senior Software Engineer, Site Reliability to join our...

Location

United States; Canada

Salary:

186818.00 - 224183.00 USD; CAD / Year

Babylist

Expiration Date

Until further notice

Requirements

8+ years of experience as a Site Reliability Engineer or similar role
Experience supporting high-traffic consumer-facing websites
Proficiency with Terraform
Strong experience working with AWS cloud-based infrastructure and services
Proficiency with Docker and Kubernetes
Solid understanding of cloud-native systems design
Troubleshooting and debugging skills
Experience designing and supporting CI systems
Familiarity with monitoring and alerting best practices
Proven experience in on-call management best practices

Job Responsibility

Manage and build our AWS infrastructure using Infrastructure as Code (IaC) tools like Terraform
Improve the speed and reliability of our Continuous Integration (CI) systems
Provide support to developers in troubleshooting issues
Establish, communicate, and support best practices for monitoring and alerting

What we offer

Company-paid medical, dental, and vision insurance
Retirement savings plan with company matching and flexible spending accounts
Generous paid parental leave and PTO
Remote work stipend
Perks for physical, mental, and emotional health, parenting, childcare, and financial planning

Fulltime

Site Reliability Engineer / Observability Engineer

Rackspace is building up its Professional Services Center of Excellence on Appli...

Location

Egypt, Giza

Salary:

Not provided

Rackspace

Expiration Date

Until further notice

Requirements

Bachelor’s degree in engineering/computer science or equivalent
Senior-level experience with Site Reliability Engineering; DevOps; code-level application support and troubleshooting; AWS infrastructure design, implementation, and optimization; and automation for deployment, scaling, and reliability
Experience with observability tools such as Splunk, Datadog, SignalFx, etc.
Experience deploying, maintaining and supporting software applications/services in the AWS ecosystem
Proactive approach to identifying problems and solutions
Experience writing code with one or more interpreted languages such as Python, PHP, Perl, Ruby, Linux Shell
Experience with Terraform or CloudFormation scripting
Experience with configuration management tools like Ansible, Chef or Puppet
Experience with standard software development best practices and tools such as code repositories (Git preferred)
Experience executing in an agile software development environment

Job Responsibility

Work with customers and implement Observability solutions
Build and maintain scalable systems and robust automation that supports engineering goals
Develop and maintain monitoring tools, alerts, and dashboards to provide visibility into system health and performance
Proactively gather and analyze both metric and log data from systems and applications to perform anomaly detection, performance tuning, capacity planning and fault isolation
Collaborate with development teams to implement and deploy new features and enhancements, ensuring they meet reliability, security and performance standards
Collaborate with team members to document and share solutions
Maintain a deep understanding of the customer’s business as well as their technical environment
Identify performance bottlenecks and anomalous system behavior, and resolve the root causes of service issues

Fulltime

Intermediate Site Reliability Engineer SRE – AI Reliability & Automation

At PointClickCare our mission is simple: to help providers deliver exceptional c...

Location

Canada, Mississauga

Salary:

115000.00 - 128000.00 CAD / Year

PointClickCare

Expiration Date

Until further notice

Requirements

5+ years' experience in software engineering
Experience with SRE principles
Experience with AI/ML in production environments
A passion for automation, intelligent systems, and operational excellence
Strong debugging, problem-solving, and system design skills
Languages: Python, Java, Bash, Terraform
Platforms: Azure, Kubernetes, Docker
Tools: Datadog, Prometheus, AppDynamics, ELK, GitHub Actions
ML/AI: MCP framework, AI agents, Vector store, Agent orchestration (LangChain), RAG
CI/CD: Jenkins, ArgoCD, Spinnaker

Job Responsibility

Build ML-based anomaly detection and pattern recognition systems
Enhance telemetry with smart tagging and metadata for better AI insights
Develop event-driven workflows and self-healing systems using AI triggers
Automate incident response with generative AI and custom AI agent orchestration
Use time-series forecasting and predictive modelling to anticipate failures
Optimise infrastructure with AI-powered autoscaling and cost-aware resource allocation
Build scalable, fault-tolerant systems in a cloud-native environment
Participate in on-call rotations and lead incident response for critical systems
Integrate APIs for streamlined data exchange and system connectivity
Run internal AIOps workshops and help teams adopt AI maturity models

What we offer

Benefits starting from Day 1!
Retirement Plan Matching
Flexible Paid Time Off
Wellness Support Programs and Resources
Parental & Caregiver Leaves
Fertility & Adoption Support
Continuous Development Support Program
Employee Assistance Program
Allyship and Inclusion Communities
Employee Recognition … and more!

Fulltime

Software engineer 2 / Senior Software engineer - Azure Data

Microsoft's Azure Data engineering team is leading the transformation of analyti...

Location

India, Bangalore

Salary:

Not provided

Microsoft Corporation

Expiration Date

Until further notice

Requirements

Bachelor's Degree in Computer Science or related technical field AND 3+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
Experience with the Azure stack including Storage, Compute, Networking, Fabric, Purview, Synapse, AKS, DevOps, Data Factory, or Power BI
Experience with big data technologies such as Spark, Kafka, Hadoop, or HBase
Experience building data lake or data engineering products, tools, or pipelines
Familiarity with container-based architectures (Docker, Kubernetes)
Ability to debug complex distributed systems on Linux and/or Windows platforms

Job Responsibility

Write extensible, maintainable code in C#, Java, Scala, or Python for Fabric Materialized Lake View services and HDInsight components
Use AI tools and coding best practices across the development lifecycle
Design data refresh, scheduling, and query optimisation features with minimal supervision
Review code from teammates for correctness, test coverage, security risks, and adherence to team standards
Coach junior engineers through code reviews
Debug complex issues in distributed systems running on Azure, Linux, and Windows
Run live site operations on a rotational, on-call basis
Integrate logging and instrumentation to gather telemetry on system health, performance, reliability, and security
Work with product managers, technical leads, and partners across geographies to define customer requirements for Materialized Lake View features

Fulltime

Software Engineer II & Senior Software Engineer

Security represents the most critical priorities for our customers in a world aw...

Location

United States, Redmond

Salary:

100600.00 - 199000.00 USD / Year

Microsoft Corporation

Expiration Date

Until further notice

Requirements

Bachelor's Degree in Computer Science or related technical field AND 2+ years technical engineering experience with coding in languages including C, C++, C#, or Python OR equivalent experience
Ability to meet Microsoft, customer and/or government security screening requirements
Microsoft Cloud Background Check
Master's Degree in Computer Science or related technical field AND 3+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR Bachelor's Degree in Computer Science or related technical field AND 5+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
Experience troubleshooting and optimizing automation, reliability, and monitoring for live-site operations as part of an on-call rotation owned by the engineering team
Experience with distributed systems, messaging systems such as Kafka, and large-scale system design

Job Responsibility

Lead the architecture, design and implementation of services for extremely high scale, throughput, durability, and low latency
Innovate and make service deployment and maintenance an efficient well-oiled machine that provides excellent reliability with minimal manual engineer intervention
Conduct in-depth triage, troubleshooting, and forensics across all facets of the cloud stack while executing processes for corrective action and continual service improvement
Drive Infrastructure security improvements for mission critical high scale workloads
Lead the definition of requirements, KPIs, priorities and planning of engineering deliverables
Mentor and grow an energetic, diverse, and driven team with a good mix of senior and mid-level engineers

Fulltime

Senior Software Engineer and Principal Software Engineer - PowerPoint AI Team

The PowerPoint team is embarking on an exciting new chapter - evolving a product...

Location

United States, Redmond

Salary:

119800.00 - 234700.00 USD / Year

Microsoft Corporation

Expiration Date

Until further notice

Requirements

Bachelor's Degree in Computer Science or related technical field AND 4+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
The ability to meet Microsoft, customer and/or government security screening requirements is required for this role
Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud background check upon hire/transfer and every two years thereafter
Bachelor's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
8+ years of experience in backend service engineering, including work on high-scale infrastructures
Proficiency in one or more systems programming languages such as C#, C++
1+ years of experience in software engineering, designing and developing systems (and APIs) that deploy and integrate with AI models
2+ years of experience working with rich telemetry, making data driven decisions, and carrying out rapid experimentation
2+ years of experience building software for scale, performance, and reliability
Academic or industry experience building, fine-tuning, or deploying models (any category), or building eval-driven systems that use them

Job Responsibility

Lead design and delivery of complex, scalable AI features ensuring resilience and exceptional user experience
Drive technical strategy and architecture decisions across multiple services, influencing partner teams and aligning with compliance and security requirements
Champion modern engineering practices, including AI-driven approaches, automation, and cloud-native patterns, across the full development lifecycle
Mentor and guide engineers, fostering technical excellence and continuous improvement in security, reliability, and performance
Collaborate cross-org to solve challenging technical problems, streamline processes, and reduce operational costs while improving live-site health
Design and implement scalable backend services optimized for machine learning workflows and large language model integration
Develop and maintain evaluation-driven systems that leverage text and multimodal inputs (e.g., images) to power visual-creation experiences
Build and optimize APIs and infrastructure to support high-performance model inference and experimentation at scale
Collaborate with product, ML, and design teams to integrate models into user-facing features, ensuring seamless functionality and performance
Conduct model evaluations and experiments, analyze results, and iterate on improvements to enhance accuracy and user experience

Fulltime
