Site Reliability Engineer Sr. Staff

Hewlett Packard Enterprise

Location:
Puerto Rico, San Juan

Contract Type:
Not provided

Salary:

Not provided

Job Description:

Designs, develops, troubleshoots and debugs software programs for software enhancements and new products. Develops software including operating systems, compilers, routers, networks, utilities, databases and Internet-related tools. Determines hardware compatibility and/or influences hardware design.

Job Responsibility:

  • Enhance Infrastructure as Code (IaC) and enforce best practices
  • Optimize cloud infrastructure for scalability, security, and cost-effectiveness
  • Develop internal tools to support and streamline cloud platform operations
  • Improve CI/CD pipelines and deployment workflows using FluxCD and Jenkins
  • Address container image vulnerabilities and standardize remediation processes
  • Build Amazon Machine Images (AMIs) aligned with CIS and STIG benchmarks
  • Strengthen monitoring, alerting, and observability using Prometheus, Grafana, and logging tools
  • Troubleshoot complex production issues to ensure system reliability and customer satisfaction
  • Fine-tune distributed systems such as Apache Kafka and Cassandra
  • Collaborate with development, security, and operations teams to align infrastructure with application needs.

Requirements:

  • Minimum of 10 years of hands-on experience in infrastructure operations, DevOps, or Site Reliability Engineering (SRE)
  • Proficiency with Linux systems, especially Debian-based distributions
  • Strong experience with cloud platforms such as AWS and GCP
  • Expertise in Infrastructure as Code tools like Terraform, Packer, and Ansible
  • Solid programming skills in Python and/or Golang
  • Deep understanding of containerization (Docker, containerd) and orchestration tools (AWS EKS, GCP GKE)
  • Experience with GitOps workflows
  • Proven track record in implementing and maintaining CI/CD pipelines
  • Strong background in security and familiarity with security programs
  • Experience with monitoring and logging tools (Prometheus, Grafana, ELK)
  • Knowledge of both relational (SQL) and non-relational databases
  • Excellent problem-solving and debugging skills with a strong sense of ownership
  • Experience managing distributed systems like Apache Kafka and Cassandra
  • Effective communicator and collaborative team player
  • Attendance at the San Juan office twice a week is mandatory.

Nice to have:

  • Experience contributing to open-source projects
  • Background in security engineering or related disciplines.

What we offer:

  • Health & Wellbeing
  • Personal & Professional Development
  • Unconditional Inclusion

Additional Information:

Job Posted:
April 11, 2026

Employment Type:
Full-time

Work Type:
Hybrid work

Similar Jobs for Site Reliability Engineer Sr. Staff

Sr Staff / Principal Site Reliability Engineer- Network & Security Operations

As a Site Reliability Engineer, you will be responsible for Palo Alto Networks’ ...
Location:
United States, Santa Clara
Salary:
154000.00 - 249500.00 USD / Year
Palo Alto Networks
Expiration Date:
Until further notice
Requirements:
  • 8+ years of experience with IaC and infrastructure automation tools (Terraform, Ansible) and CI/CD tools
  • Expert knowledge on cloud orchestration via GKE / EKS, etc, preferably on GCP
  • Experienced in designing and implementing Business Continuity Plans and Disaster Recovery Plans
  • Expert knowledge of firewall technologies (PANW preferred), including VPNs and routing
  • Advanced knowledge of shell scripting and programming languages such as Perl, Ruby, PHP, or Python
  • Advanced knowledge of DNS and DHCP, and Microsoft AD infrastructure
  • Strong analytical skills for interpreting business requirements and translating them into technical specifications
  • Strong project management, time management, and organizational skills
  • Excellent communication skills, including the ability to write network and security documentation, policies, and guidelines
  • Ability to work nights and weekends and provide 24/7 on-call support
Job Responsibility:
  • Design, implement and provide support for IT infrastructure compute components
  • Install, support and maintain software infrastructure according to best practices, including routers, load balancers, switches, Wi-Fi controllers, and firewalls via Terraform/Ansible automation
  • Perform network security design and integration
  • Diagnose problems and solve issues, often under time constraints
  • Implement the necessary controls and procedures to protect information systems assets from intentional or inadvertent modification, disclosure, or destruction
  • Ensure system uptime and backup for all IT infrastructure
  • Provide security incident triage and response, including working with firewall and device logs, investigating security events, protecting forensic value of data and establishing monitoring and incident reporting and response procedures
  • Work closely with engineering to help report issues and manage project deliverables and provide status and progress reports
  • Provide on-call support for Incident Management
What we offer:
  • Restricted stock units
  • Bonus
Employment Type:
Full-time

Unix - Senior Cloud - Digital Engineering Sr. Staff Engineer

Location:
India, Noida
Salary:
Not provided
NTT DATA
Expiration Date:
Until further notice
Requirements:
  • Should have a minimum of 8 to 10 years of experience as a Linux/Unix System Administrator. Should have expertise on at least 2 flavors of Unix. Linux is a must!
  • Should have a deep level of understanding of Linux OS & should be able to handle day to day admin tasks.
  • Should be well versed in shell scripting.
  • Expert in Unix-Linux, AWS Cloud Administration, OS/server administration, patching, maintenance, and troubleshooting.
  • Proficient in operating and troubleshooting AWS services like EC2, networking, RDS, backups, storage (EBS, EFS, S3, Glacier), and security (Well-Architected framework).
  • Possesses a strong understanding of networking concepts for configuring secure VPCs, subnets, landing zones, ACLs, and security groups.
  • Experience in end-to-end cloud migrations, including strategy, assessment, design, architecture, and execution on AWS.
  • Skilled in identifying and migrating suitable applications and workloads, gathering migration requirements, and collaborating with stakeholders.
  • Good knowledge of various AWS services like Lambda, SNS, SQS, DynamoDB, OpenSearch, Transfer Family, CloudWatch, EC2, EFS, EKS, Step Functions, ELB, ACM, Directory Services, and networking.
  • Hands-on expertise in designing, architecting, deploying, and supporting hybrid cloud environments.
Job Responsibility:
  • Perform installation, customization and maintenance of the UNIX/Linux server operating system and system software products in support of business processing requirements for both on-premises and cloud environments
  • Evaluate and integrate new operating system versions, drivers and hardware.
  • Provides in-depth diagnosis for operating systems software/hardware failures and develops solutions.
  • Monitors and tunes the system to achieve optimum performance levels in standalone and multi-tiered environments.
  • Conducts system analysis, configuration management and develops improvements for system software performance, availability and reliability.
  • Implements appropriate levels of system security. Maintains security patching, remediates vulnerabilities, and proposes solutions for them.
  • Performs incident resolution, problem determination and root cause analysis in accordance with service levels. Knowledge of ITIL.
  • Recommends and implements modifications, innovations, and ideas to improve the server environment
  • Prepares standard documents and periodically reviews them for modifications
  • Identifies opportunities for process and procedure enhancements to drive efficiency and customer service levels.
Employment Type:
Full-time

Sr. Manager, Engineering - Process & Reliability

Complete oversight of BME operations for large, complex sites and/or multiple si...
Location:
United States, Clifton
Salary:
143000.00 - 163000.00 USD / Year
Quest Diagnostics
Expiration Date:
Until further notice
Requirements:
  • Minimum of three (3) years experience in a managerial role overseeing a service program (or similar)
  • Demonstrated understanding, experience, and leadership in Maintenance & Reliability, Computerized Maintenance Management Systems (CMMS), and Total Productive Maintenance (TPM) (6+ years)
  • Demonstrated understanding, experience, and leadership in continuous improvement, process management, project management and change management, including leading large or complex projects with multiple workstreams (6+ years)
  • Ability to navigate the facility and individual labs/sites
  • Ability to travel
  • Ability to sit or stand for extended periods of time
  • Ability to lift light to moderately heavy objects (1-10 lbs frequently, 11-25 lbs occasionally, 26-50 lbs seldom)
  • Must be able to work in a biohazard environment and comply with safety policies and procedures outlined in the Environmental Health & Safety Manual
  • Daily automation & high complexity operations in a regulated industry
  • BME technical expertise
Job Responsibility:
  • Lead and optimize the regional implementation of the CMMS / EAM System across Instrument Platforms to track and trend equipment uptime and downtime, automate KPI measurement and metrics, and provide end-user training
  • Strategic guidance and collaboration with enterprise operations matrix leadership teams for implementation of Automation platforms, Operations excellence, Reliability, Vendor management, and key projects
  • Lead, develop, and manage overall operations and distribution of resources (staffing, budgets, and outside vendor services) of the BME program in collaboration & consultation with cross-functional stakeholders and business partners
  • Review, audit, and participate in decision support activities related to problem diagnosis, repair, preventive maintenance, and quality assurance of equipment
  • Participate in the development of annual goals and objectives related to supporting the growth and development of the equipment support services program (both locally and enterprise-wide)
  • Implement and manage large/complex projects (enterprise wide) utilizing operational excellence and project/program management skills
  • Develop and implement technical training for staff (i.e., onboarding materials, maintenance procedures)
  • Serve as a technical resource for the BME team and lab. Provides “best practices” to other enterprise-wide Quest sites and aids in their development
  • Oversees evaluation of equipment service needs and communicates with clinical equipment users on proper device use and safety
  • Evaluates maintenance and cost data related to laboratory equipment, to deliver expected service productivity and quality
What we offer:
  • Day 1 Medical, supplemental health, dental & vision for FT employees who work 30+ hours
  • Best-in-class well-being programs
  • Annual, no-cost health assessment program Blueprint for Wellness
  • healthyMINDS mental health program
  • Vacation and Health/Flex Time
  • 6 Holidays plus 1 "MyDay" off
  • FinFit financial coaching and services
  • 401(k) pre-tax and/or Roth IRA with company match up to 5% after 12 months of service
  • Employee stock purchase plan
  • Life and disability insurance, plus buy-up option
Employment Type:
Full-time

Sr Director, Maintenance & Reliability

Provides leadership, direction and strategies for maintenance function consistin...
Location:
United States, El Dorado
Salary:
Not provided
Delek US
Expiration Date:
Until further notice
Requirements:
  • 4 year / Bachelor's Degree
  • Ten (10) or more years Management experience
  • Fifteen (15) or more years experience in maintenance for large production operations
  • General Equipment Maintenance & Repair
  • Preventative Maintenance
  • Inspection & Maintenance Procedures
  • Inspections & Audits
  • Materials Engineering
  • Materials Selection
  • Mechanical Properties
Job Responsibility:
  • Provides leadership, direction and strategies for maintenance function consisting primarily of maintenance planning, routine maintenance, turnaround planning/execution and capital/expense projects
  • Actively participates in labor-management committees (where appropriate) and in developing strategic/operational plans and budgets
  • Leadership accountability for safe, environmentally sound and reliable operations of Maintenance across all Delek sites
  • Actively participates, as member of refinery leadership team, in development of refinery’s strategic and operational plans
  • Establishes Maintenance-specific objectives aligning with refinery’s targets for safety, regulatory compliance, reliability, and efficiency
  • Ensures risks associated with Maintenance activities are appropriately managed
  • Directs efforts to improve effectiveness and efficiency while ensuring departmental activities are conducted in safe, environmentally sound and regulatory compliant manner
  • Manages development and execution of department’s policies, programs and procedures to maximize operating efficiency
  • Ensures adoption of and adherence to engineering guidelines, industry standards and best practices
  • Manages budget and exercises financial stewardship to control expenditures
What we offer:
  • Up to a 10% match on 401K on hire start with a vesting timeline of only one year
  • Medical benefits that start on day one with a 30% premium rebate annually
  • Access to the Calm app for FREE
  • Additional annual incentives through performance management program
Employment Type:
Full-time

Sr Director, Maintenance & Reliability

Provides leadership, direction and strategies for maintenance function consistin...
Location:
United States, Big Spring
Salary:
Not provided
Delek US
Expiration Date:
Until further notice
Requirements:
  • 4 year / Bachelor's Degree (Required)
  • Ten (10) or more years Management experience (Required)
  • Fifteen (15) or more years experience in maintenance for large production operations (Required)
  • General Equipment Maintenance & Repair
  • Preventative Maintenance
  • Inspection & Maintenance Procedures
  • Inspections & Audits
  • Materials Engineering
  • Materials Selection
  • Mechanical Properties
Job Responsibility:
  • Actively participates, as member of refinery leadership team, in development of refinery's strategic and operational plans
  • Establishes Maintenance-specific objectives aligning with refinery's targets for safety, regulatory compliance, reliability, and efficiency
  • Ensures risks associated with Maintenance activities are appropriately managed
  • Directs efforts to improve effectiveness and efficiency while ensuring departmental activities are conducted in safe, environmentally sound and regulatory compliant manner
  • Manages development and execution of department's policies, programs and procedures to maximize operating efficiency
  • Ensures adoption of and adherence to engineering guidelines, industry standards and best practices
  • Manages budget and exercises financial stewardship to control expenditures
  • Promotes culture of continuous improvement
  • Accountable for the fiscal responsibility of the Maintenance department
  • Participates with Corporate on initiatives to improve reliability of the facility
What we offer:
  • Up to a 10% match on 401K on your hire start, with a vesting timeline of only one year
  • Medical benefits that start on day one with a 30% premium rebate annually
  • Access to the Calm app for FREE
  • Additional annual incentives through performance management program
Employment Type:
Full-time

Sr. System Administrator (Data Centers)

As a Sr. System Administrator on the Data Centers team, you'll own both our hard...
Location:
Canada, Vancouver
Salary:
118500.00 - 150000.00 CAD / Year
Dialpad
Expiration Date:
Until further notice
Requirements:
  • Background in Systems and/or Software Engineering, with a strong focus on infrastructure and operations
  • Extensive experience with Linux, both on-premise and in the cloud, including performance tuning, troubleshooting, and automation at scale
  • Familiarity with networking technologies: TCP/IP, DHCP, DNS, routing, firewalls, and load balancing concepts
  • Data center setup/deployment experience, including racking/stacking, cabling standards, and remote management
  • Exposure to cloud platforms such as GCP or AWS, and experience working in hybrid environments
  • Demonstrated ability to keep abreast of industry standards and trends, and to translate them into practical improvements in a production environment
  • Proven experience in a senior or lead capacity (typically 5+ years in systems administration or similar roles), including driving cross-team initiatives and mentoring others
  • Strong communication skills and the ability to collaborate effectively with distributed teams
Job Responsibility:
  • Scout, evaluate, and compare hardware options and colocation facilities, partnering with Engineering to align decisions with performance and cost objectives
  • Design and deploy a cloud expansion strategy that balances reliability, performance, and efficiency across providers and regions
  • Steer capacity planning and our expansion/upgrade strategy, using data to anticipate growth and proactively mitigate bottlenecks
  • Design and deploy servers at scale into data centers around the globe, ensuring consistent standards and automation from day one
  • Develop and maintain automation for a large fleet of servers, VMs, and containers, reducing toil and improving consistency across environments
  • Work with vendors to obtain quotes, make purchases, and schedule services, including coordinating logistics for data center installations and maintenance
  • Set up and evolve monitoring for server, network, and data center health, including alerting, dashboards, and SLO-oriented metrics
  • Develop and maintain proper documentation for engineering staff, including runbooks, standards, and architectural diagrams
  • Participate in a rotating on-call schedule within the larger Infrastructure Engineering division, helping drive rapid incident response and robust post-incident reviews
  • Lead complex systems and network troubleshooting, fault analysis, and resolution, acting as an escalation point for the broader team
What we offer:
  • Competitive salary
  • Comprehensive benefits
  • Real opportunities for growth
Employment Type:
Full-time

Sr Platformization/Cloud Automation Engineer

Palo Alto Networks CDSS group is looking for a seasoned platformization and clou...
Location:
United States, Santa Clara
Salary:
104600.00 - 169225.00 USD / Year
Palo Alto Networks Italia
Expiration Date:
Until further notice
Requirements:
  • Bachelor's/Master's degree in Computer Science or a related field
  • 5+ years of industry experience in engineering
  • Fluent scripting skills (preferably Python or Bash) with deep experience in Unix/Linux systems from kernel to shell and beyond
  • 4+ years of working with Microservices architectures on Kubernetes
  • Hands-on experience with container-native tools like Docker and Helm for managing workloads running in Kubernetes
  • Experience managing AWS and GCP at scale, with knowledge of cloud-neutral connectivity between platforms
  • Experience designing and maintaining API specifications using Swagger/OpenAPI, and working with API frameworks such as Apigee to enable secure, scalable integrations
  • Hands-on experience with infrastructure-as-code and automation tools such as Terraform, Ansible, etc.
  • Proficient in CI/CD platforms like GitLab CI, Jenkins, ArgoCD, CircleCI, etc.
  • In-depth knowledge of operating systems (processes, threads, concurrency, etc)
Job Responsibility:
  • Work with development teams to ensure that applications have scalability and reliability built-in from day one
  • Design, review and enhance software architecture to improve scalability, service reliability, cost, and performance
  • Drive platformization by building standardized, self-service infrastructure platforms that improve developer productivity, scalability, and operational efficiency
  • Deploy automation for provisioning and operating infrastructure at large scale
  • Partner with teams to improve CI/CD processes and technology
  • Mentor members of the staff on large scale cloud deployments
  • Drive the adoption of observability practices and a data-driven mindset
  • Set up processes such as on-call rotations, postmortems, and runbooks to continue supporting the infrastructure owned by the SRE team while finding ways to reduce the time to resolution and improve the reliability of services
  • Support, optimize and deploy mission critical, front-end and back-end production
  • Improve site performance, monitoring, and overall stability of our infrastructure
Employment Type:
Full-time