Principal Site Reliability Engineer Job at Atlassian (San Francisco)

Principal Network Operations Site Reliability Systems Engineer

This role entails incorporating Site Reliability Engineering (SRE) concepts into...

Location

United States

Salary:

115500.00 - 266000.00 USD / Year

Hewlett Packard Enterprise

Expiration Date

Until further notice

Requirements

Bachelor’s or master’s degree in Computer Science, Computer Engineering, Information Systems, or equivalent
Typically, 10+ years’ experience
Experience with cloud platforms
Experience with software development languages for console and web-based applications
Experience in User Interface (UI/UX) design
Understanding of and experience with common network infrastructure devices such as switches, routers, access points, authentication, authorization, and accounting systems and protocols, and network management utilities
Experience with network monitoring protocols
Ability to design and implement relational database solutions, time-series databases, and NoSQL database solutions
Excellent analytical and problem-solving skills
Experience in the overall architecture of software systems for products and solutions

Job Responsibility

Develops strategies and implements plans to incorporate SRE concepts into network, tool, and process designs, and leads execution of those strategies and plans
Evaluates LAN, WLAN, SD-WAN, AAA, Private 5G, and other network designs for fit-for-use criteria, and designs prototype analysis tools to facilitate rapid iteration of network delivery service enhancements
Identifies and engineers new ways to leverage data from multiple platforms to identify network performance trends and detect anomalies
Prototypes machine learning anomaly detection, event signature identification, and trend identification
Automates common incident management and problem management procedures
Develops organization-wide architectures, methodologies, and prototypes for software systems design and development across multiple platforms and organizations within the Global Business Unit
Identifies and evaluates new technologies and innovations for alignment with technology roadmap and business value
Creates plans for prototyping and prototype iteration
Reviews and evaluates designs and project activities for compliance with development guidelines and standards
Provides tangible feedback to improve product quality and mitigate failure risk

What we offer

Comprehensive suite of benefits supporting physical, financial and emotional wellbeing
Career development programs
Inclusive environment celebrating individual uniqueness

Fulltime

Principal Site Reliability Engineer

Location

United States, Ft. Meade

Salary:

Not provided

CipherLogix

Expiration Date

Until further notice

Requirements

Fourteen (14) years experience in software development/engineering, including requirements analysis, software development, installation, integration, evaluation, enhancement, maintenance, testing, and problem diagnosis/resolution
Ten (10) years experience in system engineering/architecture
Ten (10) years experience working with products that support highly distributed, massively parallel computation needs such as HBase, Hadoop, CloudBase/Accumulo, Bigtable, Cassandra, Scality, etc.
At least ten (10) years experience writing software scripts using scripting languages such as Perl, Python, or Ruby for software automation
At least four (4) years experience managing and monitoring large cloud systems (>200 nodes). Cloud Systems Administrator or Developer Certification
Experience in performing and providing technical direction for the development, engineering, interfacing, integration, and testing of complete hardware/software systems to include monitoring technical health of a system, improving organizational processes, implementation of postmortem (failure) analysis and incident management
Ten (10) years experience in the cleared environment
Ten (10) years demonstrated experience developing software for one of the following: Windows, UNIX, or Linux OS
Knowledge and experience with developing distributed storage routing and querying algorithms
Experience in developing documentation required to support a program’s technical issues and training situations

Fulltime

Principal Site Reliability Engineer (AI-first SRE)

Groupon is modernizing its global platform — and reliability is at the center of...

Location

Peru

Salary:

Not provided

Groupon

Expiration Date

Until further notice

Requirements

10+ years in software/systems engineering, including 5+ years in SRE or platform reliability
Strong experience with GCP (preferred) or AWS, Kubernetes, and Terraform
Proficiency in Python or Go for automation and tooling
Deep understanding of observability stacks (Prometheus, Grafana, OpenTelemetry) and service meshes (Istio, Envoy)
Hands-on AIOps experience: anomaly detection, predictive analytics, ML-assisted operations
Strong communication and influencing skills — data over hierarchy

Job Responsibility

Architect and maintain self-healing systems with 99.9%+ availability targets
Use AI/ML to automate infrastructure governance and detect configuration or IaC anti-patterns
Implement adaptive SLIs/SLOs that evolve automatically from real-time data
Build AIOps-based observability and auto-remediation pipelines
Apply predictive modeling to forecast failures before they impact users
Lead chaos, performance, and resilience testing programs
Map platform and service behavior to revenue impact and drive improved revenue resilience through better infrastructure performance
Mentor engineers and drive reliability standards across teams
Partner with platform, data, and product teams to ensure stability aligns with business goals
Support major incident response, incident review, and participate in on-call rotations

What we offer

The opportunity to work with cutting-edge technologies in a transformative environment
Professional growth and leadership development pathways tailored to your aspirations
A chance to leave a lasting impact by shaping the future of reliable and scalable systems

Principal Site Reliability Engineer

Groupon is modernizing its global platform — and reliability is at the center of...

Location

Colombia

Salary:

Not provided

Groupon

Expiration Date

Until further notice

Requirements

10+ years in software/systems engineering
5+ years in SRE or platform reliability
Strong experience with GCP (preferred) or AWS, Kubernetes, and Terraform
Proficiency in Python or Go for automation and tooling
Deep understanding of observability stacks (Prometheus, Grafana, OpenTelemetry) and service meshes (Istio, Envoy)
Hands-on AIOps experience: anomaly detection, predictive analytics, ML-assisted operations
Strong communication and influencing skills — data over hierarchy

Job Responsibility

Architect and maintain self-healing systems with 99.9%+ availability targets
Use AI/ML to automate infrastructure governance and detect configuration or IaC anti-patterns
Implement adaptive SLIs/SLOs that evolve automatically from real-time data
Build AIOps-based observability and auto-remediation pipelines
Apply predictive modeling to forecast failures before they impact users
Lead chaos, performance, and resilience testing programs
Map platform and service behavior to revenue impact and drive improved revenue resilience through better infrastructure performance
Mentor engineers and drive reliability standards across teams
Partner with platform, data, and product teams to ensure stability aligns with business goals
Support major incident response, incident review, and participate in on-call rotations

What we offer

The opportunity to work with cutting-edge technologies in a transformative environment
Professional growth and leadership development pathways tailored to your aspirations
A chance to leave a lasting impact by shaping the future of reliable and scalable systems

Principal Site Reliability Engineer

Groupon is modernizing its global platform — and reliability is at the center of...

Location

Ecuador

Salary:

Not provided

Groupon

Expiration Date

Until further notice

Requirements

10+ years in software/systems engineering, including 5+ years in SRE or platform reliability
Strong experience with GCP (preferred) or AWS, Kubernetes, and Terraform
Proficiency in Python or Go for automation and tooling
Deep understanding of observability stacks (Prometheus, Grafana, OpenTelemetry) and service meshes (Istio, Envoy)
Hands-on AIOps experience: anomaly detection, predictive analytics, ML-assisted operations
Strong communication and influencing skills — data over hierarchy

Job Responsibility

Architect and maintain self-healing systems with 99.9%+ availability targets
Use AI/ML to automate infrastructure governance and detect configuration or IaC anti-patterns
Implement adaptive SLIs/SLOs that evolve automatically from real-time data
Build AIOps-based observability and auto-remediation pipelines
Apply predictive modeling to forecast failures before they impact users
Lead chaos, performance, and resilience testing programs
Map platform and service behavior to revenue impact and drive improved revenue resilience through better infrastructure performance
Mentor engineers and drive reliability standards across teams
Partner with platform, data, and product teams to ensure stability aligns with business goals
Support major incident response, incident review, and participate in on-call rotations

What we offer

The opportunity to work with cutting-edge technologies in a transformative environment
Professional growth and leadership development pathways tailored to your aspirations
A chance to leave a lasting impact by shaping the future of reliable and scalable systems

Principal Site Reliability Engineer

We are looking for a Principal Site Reliability Engineer to join the CVML Platfo...

Location

United States

Salary:

166000.00 - 293000.00 USD / Year

Blue River Technology

Expiration Date

Until further notice

Requirements

8+ years of experience building infrastructure with K8S, AWS, and bare metal
8+ years of experience working with Python and Go (with production experience)
8+ years of experience working with infra automation tools: Terraform / Terragrunt (or Pulumi / CDK)
8+ years of experience with Linux-based systems and networks, and a deep understanding of internal components, networking, and security aspects
Has a track record of building and maintaining scalable systems in production environments
Experience in building CI/CD pipelines using GitHub Actions (or GitLab / Jenkins) for application release and deployment
Experience in using AWS ECS, EKS, IAM, EC2, and RDS at production scale
Deep understanding of Kubernetes and its internals (kubelet, CRDs, etc) and experience with building and extending clusters from scratch
Strong problem-solving skills and ability to troubleshoot complex infrastructure and networking issues
Excellent communication skills to collaborate effectively with technical and non-technical stakeholders

Job Responsibility

System Design: Architect and implement various cloud and on-premise applications, systems, and infrastructure
Hybrid system integration: Integrate extremely diverse systems, configure stable integration, uptime, and monitoring
Edge device integration: work with edge devices of various formats and integrate them with on-prem and cloud workflows, including networking, low-level OS, and electrical/control integration
Low-level performance optimization: optimize the performance and throughput of the system at the filesystem, networking, and software levels
High-level optimization of cost and stability: optimize cost, operational stability, and supportability of highly diverse platforms and tech stacks
Product Mindset: Collaborate with cross-functional teams to design, develop, and maintain robust, scalable, and user-friendly web and mobile data-intensive applications
System Integration: Build tools that enable users to easily move between different applications and platforms to utilize the strengths of each in a coherent ecosystem
Collaboration: Work closely with cross-functional teams, including data scientists, analysts, software engineers, and product managers, to understand data requirements and deliver data solutions that align with business goals
Documentation: Create and maintain technical documentation, including data flow diagrams, architecture designs, and standard operating procedures
Technology Evaluation: Stay up-to-date with industry trends and emerging technologies related to data engineering, recommending and implementing new tools and frameworks as appropriate

What we offer

Eligibility for Blue River’s bonus and benefit programs

Fulltime

Principal Site Reliability Engineer

We are looking for a reliability expert who is passionate about scaling Cloud se...

Location

Salary:

Not provided

Atlassian

Expiration Date

Until further notice

Requirements

Expert-level proficiency with 10+ years experience in one or more prominent languages such as Java, Go or Python
Expert-level proficiency with 7+ years experience in public cloud offerings (with at least 2+ years specifically on GCP)
Expert-level proficiency with 7+ years experience in operating high-availability, fault-tolerant, scalable, distributed software in production: building monitoring into your code, tweaking dashboards, defining alerts, writing runbooks, etc.
Excellent communication skills in written and verbal forms, and an ability to communicate complex technical issues to a range of technical and non-technical audiences (management, peers, clients)
An ability and desire to mentor and coach engineers

Job Responsibility

Analyse and help improve our services and processes to get us to an even higher level of reliability, performance, scalability, and cost efficiency
Cross team and functional boundaries to advocate for reliability methodologies
Work with a variety of platform, product and SRE teams to both build reliability into our platform and drive adoption of those practices into our products
Be the driving force for change

Principal Platform Engineer

Principal Platform Engineer role at Endor Labs building the Application Security...

Location

India, Bengaluru

Salary:

Not provided

Endor Labs

Expiration Date

Until further notice

Requirements

12+ years of Site Reliability Engineering or Platform Engineering experience
Deep hands-on expertise with Kubernetes and CNCF ecosystem in production environments
Significant experience with at least one major cloud provider (Azure, Google Cloud, or AWS)
Strong experience managing large infrastructure deployments using Terraform, OpenTofu, or Terragrunt
Hands-on experience with open source observability tools (Prometheus, Grafana, Mimir, Pyroscope)
Self-driven problem solver with initiative
Customer-focused engineering mindset
Clear communication skills across technical and non-technical audiences

Job Responsibility

Build Cloud Infrastructure at Scale on Azure, Google Cloud, and AWS
Master Kubernetes & CNCF Ecosystem with multi-tenant clusters
Scale Observability Platform with Prometheus, Grafana, Mimir, and Pyroscope
Transform Developer Experience with self-service tools and automation
Drive Infrastructure as Code with Terraform/OpenTofu
Solve Complex Technical Challenges like zero-downtime migrations and cost optimization
Collaborate Across Teams with Security, Backend, and Product Engineering
Iterate and Innovate in a fast-paced environment

Fulltime

Principal Site Reliability Engineer

Atlassian

Location:
United States, San Francisco
Mountain View

Category:
IT - Software Development

Contract Type:
Employment contract

Salary:

Job Description:

Job Responsibility:

Requirements:

Nice to have:

Additional Information:

Job Posted:
April 23, 2025

