Site Reliability Engineer, Cloud Infrastructure Job at EdTech Jobs (San Francisco)

Senior Site Reliability Engineer

Baxter International is seeking a skilled Senior Principal Site Reliability Engi...

Location

United States, Deerfield

Salary:

96000.00 - 132000.00 USD / Year

Baxter

Expiration Date

Until further notice

Requirements

Bachelor's degree in computer science, IT, or related field (or equivalent experience)
Prior experience in Site Reliability Engineering and cloud-based infrastructure management
Experience in enterprise engineering, including 24x7 uptime, regulated environments, and planning/operations
Azure administration and operations experience, with certifications a plus
Knowledge of related technologies, including cloud, encryption, and security protocols
Systems administration experience in Windows and Linux environments
Proven problem-solving skills and experience with scripting and automation tools
Ability to create accurate documentation and reports, with excellent communication skills

Job Responsibility

Drive strategies to ensure 24x7 availability of services and business continuity for customer-facing healthcare software applications and platforms hosted on the Microsoft Azure cloud
Manage and administer Azure resources, including virtual machines, databases, and networking components
Define and document operating procedures to ensure that required security, privacy, and other compliance standards are maintained for digital solutions deployed in the cloud
Manage process, planning, and execution for Disaster Recovery (DR) and Business Continuity Planning (BCP)
Define and refine Operations SLAs to maintain a high level of customer satisfaction
Establish non-functional requirements to meet SLAs
Establish infrastructure and application monitoring dashboards and workflow for automatic routing of notifications
Define key performance indicators that can be monitored, measured, and used to derive opportunities
Standardize site metrics for stakeholders, reporting on various KPIs including SLAs, availability, capacity utilization, service metrics and cost utilization
Work closely with DevOps Engineers to automate infrastructure provisioning and deployment processes

What we offer

Healthcare benefits
Employee Stock Purchase Plan (ESPP)
401(k) Retirement Savings Plan
Flexible Spending Accounts
Educational assistance programs
Paid holidays
Paid time off
Paid parental leave
Commuting benefits
Employee Discount Program

Fulltime

Senior Site Reliability Engineer

This is a role at Baxter where your work impacts saving and sustaining lives thr...

Location

United States, Deerfield

Salary:

96000.00 - 132000.00 USD / Year

Baxter

Expiration Date

Until further notice

Requirements

Bachelor's degree in computer science, IT, or related field (or equivalent experience)
Prior experience in Site Reliability Engineering and cloud-based infrastructure management
Experience in enterprise engineering, including 24x7 uptime, regulated environments, and planning/operations
Azure administration and operations experience, with certifications a plus
Knowledge of related technologies, including cloud, encryption, and security protocols
Systems administration experience in Windows and Linux environments
Proven problem-solving skills and experience with scripting and automation tools
Ability to create accurate documentation and reports, with excellent communication skills
Applicants must be authorized to work for any employer in the U.S.
Unable to sponsor or take over sponsorship of an employment visa at this time.

Job Responsibility

Drive strategies to ensure 24x7 availability of services and business continuity for customer-facing healthcare software applications and platforms hosted on the Microsoft Azure cloud
Manage and administer Azure resources, including virtual machines, databases, and networking components
Define and document operating procedures to ensure that required security, privacy, and other compliance standards are maintained for digital solutions deployed in the cloud
Manage process, planning, and execution for Disaster Recovery (DR) and Business Continuity Planning (BCP)
Define and refine Operations SLAs to maintain a high level of customer satisfaction
Establish non-functional requirements to meet SLAs
Establish infrastructure and application monitoring dashboards and workflow for automatic routing of notifications
Define key performance indicators that can be monitored, measured, and used to derive opportunities
Standardize site metrics for stakeholders, reporting on various KPIs including SLAs, availability, capacity utilization, service metrics and cost utilization
Work closely with DevOps Engineers to automate infrastructure provisioning and deployment processes.

What we offer

Support for Parents
Continuing Education/Professional Development
Employee Health & Well-Being Benefits
Paid Time Off
2 Days a Year to Volunteer
Medical and dental coverage starting day one
Insurance coverage for basic life, accident, short-term and long-term disability
Business travel accident insurance
Employee Stock Purchase Plan (ESPP)
401(k) Retirement Savings Plan

Fulltime

Senior Site Reliability Engineer

Architect, develop, and troubleshoot large-scale infrastructure, maintain and im...

Location

United States, San Francisco

Salary:

180960.00 - 230900.00 USD / Year

Atlassian

Expiration Date

Until further notice

Requirements

Bachelor’s degree in Computer Science, Software Engineering, Information Technology or a closely related field
Four years of experience as a Site Reliability Engineer architecting, developing, and troubleshooting large-scale infrastructure utilizing programming languages such as PowerShell, Python, or Bash and networking technologies such as TCP/IP or security
Four years of experience in automation development and infrastructure-as-code implementation using tools such as Terraform, AWS CloudFormation, Ansible, or Salt
Knowledge of Linux and Windows systems
Cloud technologies within AWS, GCP, Azure
Continuous integration and continuous delivery/deployment (CI/CD) practices, and monitoring and observability practices
Must pass a technical interview

Job Responsibility

Architect, develop, and troubleshoot large-scale infrastructure utilizing programming languages such as PowerShell, Python, or Bash and networking technologies such as TCP/IP or security
Provide real-time feedback on production systems
Work with product family and platform developers to maintain and improve services and performance with a strong customer focus
Utilize a variety of data collection, enrichment, analytics, and visualizations to support our complex systems
Responsible for automation development and infrastructure-as-code implementation using tools such as Terraform, AWS CloudFormation, Ansible, and/or Salt
Build solutions to enhance availability, performance, and stability for hundreds of Atlassian enterprise customers in the cloud, and automate repetitive work
Help secure the cloud architecture with penetration testing, vulnerability resolution, and compliance audit responses
Responsible for continuous integration and continuous delivery/deployment (CI/CD) practices, and monitoring and observability practices

What we offer

Health and wellbeing resources
Paid volunteer days

Fulltime

Staff Site Reliability Engineer

At Ledger, we are looking for an experienced Reliability Engineer to join our SR...

Location

France, Paris

Salary:

Not provided

Ledger

Expiration Date

Until further notice

Requirements

8+ years of cloud engineering at scale in organizations operating SaaS solutions
Proficiency in working with Unix/Linux environments, Git, Python, Terraform, Kubernetes, AWS cloud solutions and architectures, CI/CD tools, Argo CD, Ansible, configuration management, etc.
Strong knowledge of observability practices, with experience implementing and managing logging, monitoring, and alerting frameworks with solutions such as Datadog or Prometheus/Grafana/Loki
Experience with cross-functional work and a collaborative approach to building key relationships across the organization and defining project scope, goals, plans, and deliverables
Customer focused, with the ability to identify and understand the needs of both internal and external customers
Creative problem-solving and analysis skills with an ability to identify, develop, and implement solutions to meet the needs of the business
Excellent presentation and written communication
Ability to deal with ambiguity, high pressure, and rapidly changing environments
Engineering degree.

Job Responsibility

Participate in building a DevOps / SRE culture and enable the transition to modern infrastructure management and deployment practices
Participate in building the SRE team roadmap (vision and delivery accountability). Anticipate stakeholder needs and the emergence of game-changing technologies, and challenge scope and deadlines
Perform integration of platform software components
Participate in designing and delivering solutions to improve the availability, scalability, latency, and efficiency of systems
Influence and create standards & best practices in support of service level objectives
Automate key SRE metrics including SLOs/SLAs and error budgets
Provide expert support to our level-2/application support team, to troubleshoot priority incidents, and conduct post-mortems
Apply analytics on past incidents and usage patterns to predict issues and take proactive actions
Ensure control of technical debt and promote quality practices
Follow SRE and chaos engineering approaches across all strategic systems, in coordination with Service Design, to predict and prevent outages and improve solution availability

What we offer

Equity: Employees are the foundation of our success, and we award stock options so you can share in that success as we grow
Flexibility: A hybrid work policy
Social: Annual company outing for Ledgerdary Days, plus frequent social events, snacks and drinks
Medical: Comprehensive health insurance policy offering extensive medical, dental and vision care coverage
Well-being: Personal development, coaching & fitness with our dedicated partners
Vacation: Five weeks of paid leave per year, in addition to national holidays and rest & relaxation (RTT) days
High tech: Access to high performance office equipment and gadgets, including Apple products
Transport: Ledger reimburses part of your preferred means of transportation
Discounts: Employee discount on all our products.

Fulltime

Site Reliability Engineering Manager

Hewlett Packard Enterprise (HPE) is looking for a Site Reliability Engineering M...

Location

India, Bangalore

Salary:

Not provided

Hewlett Packard Enterprise

Expiration Date

Until further notice

Requirements

7–10 years of experience in Site Reliability Engineering, DevOps, or Cloud Infrastructure roles
Minimum 2 years of experience managing or leading cloud operations teams
Deep understanding of cloud platforms (AWS, GCP, or Azure) and cloud-native architectures
Hands-on experience with Kubernetes, containers, infrastructure as code (e.g., Terraform), and configuration management tools
Strong foundation in observability (monitoring, logging, tracing), automation using Python, and incident response
Familiarity with modern CI/CD automation and tools
Excellent communication, stakeholder management, and team-building skills
Experience scaling SRE practices in high-growth or large-scale environments
Ability to balance long-term reliability initiatives with short-term delivery needs.

Job Responsibility

Lead and mentor a team of Site Reliability Engineers, supporting their growth, performance, and well-being
Own the reliability strategy for SASE cloud infrastructure systems, including incident management, SLIs/SLOs, and capacity planning
Partner with Engineering, Product, and Security teams to design and deliver highly available, scalable, and resilient cloud-native services
Guide the team in building automation, improving observability, and improving the operational efficiency of our cloud infrastructure
Drive adoption of best practices in monitoring, alerting, on-call operations, and runbook development
Build and maintain a strong engineering culture based on ownership, collaboration, and continuous learning
Define and track key reliability metrics, and report on team performance and system health to leadership
Contribute to hiring, onboarding, and career development for SREs.

What we offer

Health & Wellbeing benefits for physical, financial, and emotional wellbeing
Personal & Professional Development programs
Unconditional inclusion in the workplace.

Fulltime

Senior Software Engineer, Site Reliability

Babylist is looking for a Senior Software Engineer, Site Reliability to join our...

Location

United States; Canada

Salary:

186818.00 - 224183.00 USD; CAD / Year

Babylist

Expiration Date

Until further notice

Requirements

8+ years of experience as a Site Reliability Engineer or similar role
Experience supporting high-traffic consumer-facing websites
Proficiency with Terraform
Strong experience working with AWS cloud-based infrastructure and services
Proficiency with Docker and Kubernetes
Solid understanding of cloud-native systems design
Troubleshooting and debugging skills
Experience designing and supporting CI systems
Familiarity with monitoring and alerting best practices
Proven experience in on-call management best practices

Job Responsibility

Manage and build our AWS infrastructure using Infrastructure as Code (IaC) tools like Terraform
Improve the speed and reliability of our Continuous Integration (CI) systems
Provide support to developers in troubleshooting issues
Establish, communicate, and support best practices for monitoring and alerting

What we offer

Company-paid medical, dental, and vision insurance
Retirement savings plan with company matching and flexible spending accounts
Generous paid parental leave and PTO
Remote work stipend
Perks for physical, mental, and emotional health, parenting, childcare, and financial planning

Fulltime

Site Reliability Engineer - Core

We are looking for a Site Reliability Engineer to join our Core team to encourag...

Location

United Kingdom, London

Salary:

Not provided

Blockchain

Expiration Date

Until further notice

Requirements

Experience with containerization and service orchestration, including best practices and security
Strong knowledge of at least one programming language
Linux experience, including an understanding of resource allocation, networking, and/or internals
Experience working with cloud solutions (GCP or AWS)
Deep understanding and demonstrable experience with modern monitoring tools such as Prometheus, Datadog, Grafana, Telegraf
Experience with infrastructure as code tools
Solid background with configuration management tools
Experience using GitOps and CI to make changes, preferably GitHub Actions
Experience with messaging systems such as Kafka
Experience with database management

Job Responsibility

Play a critical role in evolving our infrastructure as we develop solutions to complex technical problems involving reliability, latency, bandwidth and most importantly security
Be an integral part of improving observability, monitoring and alerting throughout the platform
Help co-ordinate work across different areas of the company to ensure the most efficient path of execution
Centralize wherever possible common streams of work that are currently duplicated across developer teams
Focus heavily on writing tooling to replace manual, repetitive work in a scalable way
Work in a fast-paced, dynamic environment, complementing our existing high-calibre team

What we offer

Full-time salary based on experience and meaningful equity in an industry-leading company
Hybrid model working from home & awesome office location in the heart of London
Unlimited vacation policy: work hard and take time when you need it
Work from Anywhere Policy: You can work remotely from anywhere in the world for up to 20 days per year
Apple equipment
The opportunity to be a key player and build your career at a rapidly expanding, global technology company in an emerging field
Flexible work culture

Fulltime