Site Reliability Engineer

Site Reliability Engineer - FedRAMP

We’re not just building better tech. We’re rewriting how data moves and what the...

Location

Canada, Toronto

Salary:

113200.00 - 130200.00 CAD / Year

Confluent

Expiration Date

Until further notice

Requirements

0-2 years of relevant SRE experience
Experience with cloud-native technologies, including operating production services in the cloud
Fundamentals of Distributed Systems and their design
Knowledge of Kubernetes and containerization
Proficiency in infrastructure as code (Terraform preferred)
Experience with telemetry tooling to monitor production systems (Datadog, Grafana, Prometheus)
Exposure to and understanding of BCP/DR and high-availability exercises
Ability to quickly problem-solve and troubleshoot critical services
Proficiency with scripting and automation (e.g., Go, Java, Python, Bash)
Exceptional teamwork, collaboration skills, and the ability to act critically with minimal supervision at times in a remote-first environment

Job Responsibility

Understand and participate in the changing FedRAMP space by quickly ramping up with the 20x controls and building upon these to maintain federal compliance
Own and champion high operational standards of Confluent Cloud systems leveraged by federal agencies
Deploy production changes to Confluent Cloud systems and infrastructure through established change management processes
Assist with process improvements and adoption of change management
Own monitoring and incident handling of complex distributed systems, engaging engineering teams when needed through an escort model system
Act as a core member of Confluent's Business Continuity Plan and Disaster Recovery team with efforts across 3 large verticals
Innovate and design solutions to reduce toil, bolster operational maturity, and make day-to-day work life easier
Participate in a 24/7 on-call rotation to maintain the integrity of Confluent Cloud for Government systems

What we offer

Remote-First Work
Robust Insurance Benefits
Flexible Time Away
The Best Teammates
Experience Ambassadors
Open and Honest Culture
Well-Being and Growth
Offers Equity

Fulltime

Site Reliability Engineer (FedRAMP / Security) - CA

Coralogix is a modern, full-stack observability platform transforming how busine...

Location

United States, Los Angeles

Salary:

170000.00 - 220000.00 USD / Year

Coralogix

Expiration Date

Until further notice

Requirements

At least 5 years of experience as a DevOps Engineer/SRE in production environments
In-depth experience with Kubernetes - operating & monitoring are key parts
At least 2 years of experience with FedRAMP compliance (High/Moderate levels), vulnerability management, and continuous monitoring, including scanning, patching, and reporting - advantage
High familiarity with monitoring tools such as Coralogix, Grafana, Prometheus
Experience in AWS or other cloud providers
Experience with infrastructure as code (Terraform, Crossplane, etc.)
Understanding of networking - from networking layers to different networking protocols (HTTP, gRPC, SSL)
Some software engineering experience, preferably in Golang
An advantage - operating data pipelines
An advantage - familiarity with Apache Kafka

Job Responsibility

Work in high-scale environments - Coralogix data pipeline processes 55Tb of data each day
Adopt cutting-edge technologies with end-to-end responsibility
Build internal tools to expand our platform capabilities
Collaborate with R&D to improve stability & reliability of the system
Lead the product roadmap - our product is designed for engineers. Therefore, our engineers promote, enhance, and take a crucial part in influencing the product roadmap
Perform operational duties for FedRAMP cloud products, including deployments, on-call support, and incident management

What we offer

Healthcare
Dental
Mental health benefits
401(k) plan and match
Paid sick time
Paid time off

Fulltime

Site Reliability Engineer (FedRAMP / Security) - NY

Coralogix is a modern, full-stack observability platform transforming how busine...

Location

United States, New York

Salary:

170000.00 - 220000.00 USD / Year

Coralogix

Expiration Date

Until further notice

Requirements

At least 5 years of experience as a DevOps Engineer/SRE in production environments
In-depth experience with Kubernetes - operating & monitoring are key parts
At least 2 years of experience with FedRAMP compliance (High/Moderate levels), vulnerability management, and continuous monitoring, including scanning, patching, and reporting - advantage
High familiarity with monitoring tools such as Coralogix, Grafana, Prometheus
Experience in AWS or other cloud providers
Experience with infrastructure as code (Terraform, Crossplane, etc.)
Understanding of networking - from networking layers to different networking protocols (HTTP, gRPC, SSL)
Some software engineering experience, preferably in Golang
An advantage - operating data pipelines
An advantage - familiarity with Apache Kafka

Job Responsibility

Work in high-scale environments - Coralogix data pipeline processes 55Tb of data each day
Adopt cutting-edge technologies with end-to-end responsibility
Build internal tools to expand our platform capabilities
Collaborate with R&D to improve stability & reliability of the system
Lead the product roadmap - our product is designed for engineers
Perform operational duties for FedRAMP cloud products, including deployments, on-call support, and incident management

What we offer

Comprehensive and inclusive employee benefits for healthcare, dental, and mental health benefits
401(k) plan and match
Paid sick time and paid time off

Fulltime

Site Reliability Engineer II - FedRAMP

Trimble is seeking a Site Reliability Engineer to join their world class and glo...

Location

India, Chennai

Salary:

Not provided

Trimble Inc.

Expiration Date

Until further notice

Requirements

Bachelor’s Degree or equivalent in Computer Science, Engineering or related field or equivalent experience
Recent college graduate or one year of experience in IT operations, including knowledge of networking, computing and storage
Experience with AWS and/or Azure public cloud
Windows system administration familiarity and scripting skills, such as Python, PowerShell
Linux system administration familiarity and scripting skills, including Bash and Perl
Familiarity with application operations, including Incident Management, Change Management, and Capacity Management
Excellent written and verbal communication
Troubleshooting and problem-solving skills
Strong desire to learn new things

Job Responsibility

Responsible for configuration, optimization, documentation and support of the infrastructure components of software products which are hosted primarily in cloud services (AWS and Azure)
Perform day-to-day server application management, monitoring, and incident response/resolution, and work with the customer application development and technical support teams to establish effective application monitoring and to identify application changes to improve operations
Develop new and enhance current shared public cloud services with consideration for Availability, Operations, Performance, Capacity, Security, and User Experience
Responsible for management of security posture and adherence to corporate security best practices
Develop and maintain documentation including but not limited to architecture diagrams, service descriptions, build and deploy documentation and operations run book documentation
Provide design and deployment assistance for divisions needing help on a project basis
Manage AWS & FedRAMP best practice expectations (incorporating Trimble Cloud Core Platform standards)
Work with a global team and be able to occasionally meet or perform tasks off-hours

Principal Site Reliability Engineer (DNS Security)

We are seeking development-heavy Site Reliability Engineers (SREs) who are passi...

Location

United States, Santa Clara

Salary:

151600.00 - 245300.00 USD / Year

Palo Alto Networks

Expiration Date

Until further notice

Requirements

Bachelor's or higher degree in Computer Science, Engineering, or related field or equivalent military experience required
6+ years of experience in DevOps, SRE, or related roles
Cloud experience: GCP, AWS, OCI, or Azure
Operational experience with containers (Docker, Kubernetes)
Knowledge of TCP/IP, DNS, HTTP, gRPC
Proven experience in designing, implementing, and maintaining scalable and reliable infrastructure
Strong proficiency in automation scripting and infrastructure as code (IaC)
Excellent problem-solving skills and the ability to troubleshoot complex issues
Effective communication skills, both written and verbal
Experience working in collaborative, cross-functional environments

Job Responsibility

Write Terraform to deploy infrastructure and services to multiple cloud platforms
Build automation for provisioning and operating infrastructure at a massive scale using Python or Go code
Work with Dev/QA teams to build pipelines and automation for delivering and deploying applications to production
Build observability (logging, metrics, alerting) systems to make sure the system works well
Design and implement the infrastructure to ensure applications align with infrastructure requirements, focusing on scalability and reliability
Collaborate with PMs to deliver compliance (SOC2, FedRAMP, IL5) and establish a vision for continuous improvement
On-call Support and Incident Resolution
Participate in occasional on-call rotations to support the infrastructure
Investigate incidents, formulate hypotheses, and identify root causes to solve issues promptly
Write postmortem reviews and provide remediation recommendations

Fulltime

Senior Site Reliability Engineer

Are you ready to start a new journey with a team of energized professionals adva...

Location

Australia, North Sydney; Perth; Brisbane

Salary:

Not provided

Bentley Systems

Expiration Date

Until further notice

Requirements

Degree in computer science, software engineering or relevant training and/or experience
8+ years of experience with Cloud Services development, deployment and/or IT Cloud infrastructure setup and maintenance (Azure Cloud or AWS or GCP)
Expertise in containerization and orchestration technologies (Docker, Kubernetes)
Scripting and automation skills using languages and tools like PowerShell, Bash, Ansible, JavaScript, or similar
Programming experience, preferably in a high-level language like C#, Python, Golang, Ruby, or equivalent
Knowledge of AD and DNS, IIS, and networking
Experience with FedRAMP background screening
Experience with Azure DevOps (Pipelines, YAML) or GitHub Enterprise (Git, Actions)
Good knowledge of Microsoft SQL Server/Azure SQL setup, SQL statements/scripts and troubleshooting
Ability to document architectural designs along with operational processes and procedures to support ongoing administration of cloud systems

Job Responsibility

Manage, implement, and improve automation (CI/CD Infrastructure) and tooling through Azure DevOps, scripting, developing tools and proprietary systems
Automate Azure cloud-based deployments, resource provisioning and other Azure infrastructure related tasks
Troubleshoot and resolve issues related to application development, deployment, and operations
Dive deep into availability, performance and outages for infrastructure and systems, and provide technical leadership for proactive resolutions
Ensure compliance with industry best practices and organizational policies
Continuously improve processes and tools to enhance efficiency and productivity
Maintain monitoring and alerting and participate as a member of a rotating on-call schedule
Share on-call responsibilities, including collaborating with other engineers to triage and fix issues that come up in production for our users

What we offer

A great team and culture
An exciting career as an integral part of a world-leading software company providing solutions for architecture, engineering, and construction
An attractive salary and benefits package
A commitment to inclusion, belonging and colleague wellbeing through global initiatives and resource groups
A company committed to making a real difference by advancing the world’s infrastructure for better quality of life, where your contributions help build a more sustainable, connected, and resilient world

Principal Site Reliability Engineer

As a Principal Site Reliability Engineer for the ADEM (Autonomous Digital Experi...

Location

United States, Santa Clara

Salary:

Not provided

Palo Alto Networks

Expiration Date

Until further notice

Requirements

7+ years as an engineer in Infrastructure, Operations, DevOps, or System Engineering
The candidate must be familiar with and demonstrate proficiency in using code assist and AI productivity tools such as Claude Code, Cursor, Windsurf, or GitHub Copilot to accelerate development and troubleshooting
Expertise in building high-availability, scalable cloud-native applications on GCP (preferred) or AWS
Expertise in configuration management and IaC (Terraform, Helm, Ansible)
Strong proficiency in programming languages like Python, Go, or Java
Deep experience in Kubernetes (GKE/EKS), container networking, and Linux internals
Experience with GitOps principles and tools like GitLab CI and ArgoCD
Familiarity with compliance and security frameworks (FedRAMP, SOC2) and automating policy-as-code
Excellent communication skills, with a "rally support" mindset to collaborate across multi-functional teams
BS or MS in Computer Science, a related field, or equivalent professional/military experience

Job Responsibility

Drive the success of SRE and DevOps through expert contributions in CI/CD and AIOps initiatives, moving the organization toward self-healing infrastructure
Architect "Golden Paths" for service delivery, ensuring that SLOs, error budgets, and automated canary analysis are integrated by default
Design, build, and operate reliable, secure Cloud infrastructure that supports high-scale synthetic monitoring and Real User Monitoring (RUM)
Ensure applications are production-ready, scalable, and resilient, collaborating closely with developers, researchers, and data scientists
Develop tools and automation frameworks that champion Infrastructure as Code (IaC) and Monitoring as Code (MaC)
Lead root cause analysis (RCA) of critical business and production issues, driving improvements that prevent recurrence

Fulltime

Site Reliability Engineer

Coralogix is a modern, full-stack observability platform transforming how busine...

Location

United States, New York

Salary:

Not provided

Coralogix

Expiration Date

Until further notice

Requirements

At least 5 years of experience as a DevOps Engineer/SRE in production environments
In-depth experience with Kubernetes - operating & monitoring are key parts
At least 2 years of experience with FedRAMP compliance (High/Moderate levels), vulnerability management, and continuous monitoring, including scanning, patching, and reporting - advantage
High familiarity with monitoring tools such as Coralogix, Grafana, Prometheus
Experience in AWS or other cloud providers
Experience with infrastructure as code (Terraform, Crossplane, etc.)
Understanding of networking - from networking layers to different networking protocols (HTTP, gRPC, SSL)
Some software engineering experience, preferably in Golang

Job Responsibility

Work in high-scale environments - Coralogix data pipeline processes 55Tb of data each day
Adopt cutting-edge technologies with end-to-end responsibility
Build internal tools to expand our platform capabilities
Collaborate with R&D to improve stability & reliability of the system
Lead the product roadmap - our product is designed for engineers. Therefore, our engineers promote, enhance, and take a crucial part in influencing the product roadmap.
Perform operational duties for FedRAMP cloud products, including deployments, on-call support, and incident management.

Fulltime
