Principal Network Operations Site Reliability Systems Engineer

Hewlett Packard Enterprise

Location:
United States

Contract Type:
Employment contract

Salary:

115500.00 - 266000.00 USD / Year

Job Description:

This role entails incorporating Site Reliability Engineering (SRE) concepts into network, tool, and process designs, improving service performance, and addressing unmet customer needs. The role also involves leveraging cloud platforms, evaluating network designs, and engaging in advanced software and database architecture. An excellent opportunity for innovation and technical leadership at Hewlett Packard Enterprise.

Job Responsibility:

  • Develops strategies and implementation plans to incorporate SRE concepts into network, tool, and process designs, and leads their execution
  • Evaluates LAN, WLAN, SD-WAN, AAA, Private 5G, and other network designs for fit-for-use criteria, and designs prototype analysis tools to facilitate rapid iteration of network delivery service enhancements
  • Identifies and engineers new ways to leverage data from multiple platforms to identify network performance trends and detect anomalies
  • Prototypes machine learning anomaly detection, event signature identification, and trend identification
  • Automates common incident management and problem management procedures
  • Develops organization-wide architectures, methodologies, and prototypes for software systems design and development across multiple platforms and organizations within the Global Business Unit
  • Identifies and evaluates new technologies and innovations for alignment with technology roadmap and business value
  • Creates plans for prototyping and prototype iteration
  • Reviews and evaluates designs and project activities for compliance with development guidelines and standards
  • Provides tangible feedback to improve product quality and mitigate failure risk

Requirements:

  • Bachelor’s or master’s degree in Computer Science, Computer Engineering, Information Systems, or equivalent
  • Typically 10+ years of experience
  • Experience with cloud platforms
  • Experience with software development languages for console and web-based applications
  • Experience in User Interface (UI/UX) design
  • Understanding of and experience with common network infrastructure devices such as switches, routers, access points, authentication, authorization, and accounting systems and protocols, and network management utilities
  • Experience with network monitoring protocols
  • Ability to design and implement relational database solutions, time-series databases, and NoSQL database solutions
  • Excellent analytical and problem-solving skills
  • Experience in the overall architecture of software systems for products and solutions
  • Experience designing and integrating software systems running on multiple platform types into an overall architecture
  • Experience evaluating and selecting forms and processes for software systems testing and methodology, including writing and executing test plans, debugging, and testing scripts and tools
  • History of innovation with multiple patents or deployed solutions in the field of software design
  • Excellent written and verbal communication skills
  • Mastery of the English language
  • Ability to effectively communicate product architectures and design proposals and negotiate options at business unit and executive levels

Nice to have:

  • Cloud Architectures
  • Cross Domain Knowledge
  • Design Thinking
  • Development Fundamentals
  • DevOps
  • Distributed Computing
  • Microservices Fluency
  • Full Stack Development
  • Security-First Mindset
  • User Experience (UX)
What we offer:
  • Comprehensive suite of benefits supporting physical, financial and emotional wellbeing
  • Career development programs
  • Inclusive environment celebrating individual uniqueness

Additional Information:

Job Posted:
May 20, 2025

Employment Type:
Full-time
Work Type:
Hybrid work

Similar Jobs for Principal Network Operations Site Reliability Systems Engineer

Principal Site Reliability Engineer

We are looking for a Principal Site Reliability Engineer to join the CVML Platfo...
Location:
United States
Salary:
166000.00 - 293000.00 USD / Year
Blue River Technology
Expiration Date:
Until further notice
Requirements:
  • 8+ years of experience building infrastructure with K8S, AWS, and bare metal
  • 8+ years of experience working with Python and Go (with production experience)
  • 8+ years of experience working with infra automation tools: Terraform / Terragrunt (or Pulumi / CDK)
  • 8+ years of experience with Linux-based systems and networks, and a deep understanding of internal components, networking, and security aspects
  • Has a track record of building and maintaining scalable systems in production environments
  • Experience in building CI/CD pipelines using GitHub Actions (or GitLab / Jenkins) for application release and deployment
  • Experience in using AWS ECS, EKS, IAM, EC2, and RDS at production scale
  • Deep understanding of Kubernetes and its internals (kubelet, CRDs, etc) and experience with building and extending clusters from scratch
  • Strong problem-solving skills and ability to troubleshoot complex infrastructure and networking issues
  • Excellent communication skills to collaborate effectively with technical and non-technical stakeholders
Job Responsibility:
  • System Design: Architect and implement various cloud and on-premise applications, systems, and infrastructure
  • Hybrid system integration: Integrate extremely diverse systems, configure stable integration, uptime, and monitoring
  • Edge device integration: work with edge devices of various formats and integrate them with on-prem and cloud workflows, including networking, low-level OS, and electrical/control integration
  • Low-level performance optimization: optimize the performance and throughput of the system at the filesystem, networking, and software levels
  • High-level optimisation of cost and stability: optimize cost, operational stability, and supportability of highly diverse platforms and tech stack
  • Product Mindset: Collaborate with cross-functional teams to design, develop, and maintain robust, scalable, and user-friendly web and mobile data-intensive applications
  • System Integration: Build tools that enable users to easily move between different applications and platforms to utilize the strengths of each in a coherent ecosystem
  • Collaboration: Work closely with cross-functional teams, including data scientists, analysts, software engineers, and product managers, to understand data requirements and deliver data solutions that align with business goals
  • Documentation: Create and maintain technical documentation, including data flow diagrams, architecture designs, and standard operating procedures
  • Technology Evaluation: Stay up-to-date with industry trends and emerging technologies related to data engineering, recommending and implementing new tools and frameworks as appropriate
What we offer:
  • Eligibility for Blue River’s bonus and benefit programs
Employment Type:
Full-time

Principal Site Reliability Engineering Manager

Are you a Principal Site Reliability Engineering Manager interested in improving...
Location:
United States, Redmond
Salary:
139900.00 - 274800.00 USD / Year
Microsoft Corporation
Expiration Date:
Until further notice
Requirements:
  • Doctorate Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in software engineering, network engineering, or systems administration
  • OR Master's Degree in Computer Science, Information Technology, or related field AND 3+ years technical experience in software engineering, network engineering, or systems administration
  • OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 5+ years technical experience in software engineering, network engineering, or systems administration
  • OR equivalent experience
  • 3+ years of people management experience
  • 5+ years of experience planning, designing, implementing, and delivering large initiatives spanning multiple engineers as the primary owner, including operating and improving production services at scale
  • Experience leading reliability engineering for developer-facing or platform services, including incident response, automation/toil reduction, and observability (metrics/logs/tracing) built on top of mature observability platforms and practices
  • Experience working across disciplines, groups, and teams to align reliability priorities and delivery plans
  • Experience architecting, deploying, and operating enterprise scale distributed cloud services (Azure preferred), including containerization and orchestration
  • Experience operating engineering systems outer loop processes (CI/CD, build, and release platforms) with reliability, safety, and governance practices
Job Responsibility:
  • Partner with engineers, product managers, and partner teams to design, operate, and maintain reliable and resilient services, with clear operational requirements (monitoring, alerting, runbooks, capacity, and failure modes)
  • Drive cross-org alignment through partnerships and co-development following the “One Microsoft” philosophy, including shared reliability standards and operational tooling
  • Build, grow, and retain a team of Site Reliability Engineers
  • Provide mentorship and coaching on reliability engineering, incident response, and pragmatic automation—within and beyond your team
  • Define, implement, and operate SLOs/SLIs and error budgets for critical engineering systems services, and use them to guide prioritization and continuous improvement
  • Lead incident management for your services, including on-call health, escalation paths, and blameless post-incident reviews, and model follow-through on corrective and preventive actions
  • Drive automation to reduce toil and improve operational efficiency across build, validation, and deployment systems (e.g., self-healing, safe rollouts, and automated remediation)
  • Establish observability (metrics, logs, traces), capacity planning, and performance management to meet reliability and latency goals at scale
  • Foster a diverse and inclusive culture where everyone can bring their full and authentic self, while holding a high bar for customer impact and reliability
Employment Type:
Full-time

Principal Group Engineering Manager

Microsoft Specialized Clouds combines the power of edge platforms, devices, and ...
Location:
India, Bangalore
Salary:
Not provided
Microsoft Corporation
Expiration Date:
Until further notice
Requirements:
  • 15+ years of professional software engineering experience, including designing, building, and operating distributed, cloud-scale services
  • 5+ years of engineering leadership experience, including managing managers and leading multi-team engineering organizations (M2+)
  • Deep experience with network device platforms — specifically Arista (EOS, eAPI, CloudVision) and/or Cisco (NX-OS, DCNM/NDFC) — including device programming, configuration management, and automation
  • Strong background in device programming and network automation — building systems that programmatically configure, validate, and manage network device state at scale
  • Experience with Azure Resource Provider (RP) engineering — ARM resource modeling, deployment pipelines, control-plane architecture, and resource lifecycle management
  • Solid understanding of L2/L3 networking fundamentals: spine-leaf architecture, VXLAN, overlay/underlay networking, BGP, and data center network design
  • Proven ability to set technical direction and architectural strategy for complex platforms spanning multiple components and partner teams
  • Demonstrated success owning end-to-end delivery of customer-critical services, including design, development, release, and live-site operations
  • Strong experience driving operational excellence, including reliability, incident management, automation, and cost optimization for production services
  • Proven track record of leading organizational transformation — such as quality resets, reliability turnarounds, code yellow resolution, or engineering culture change across an engineering org
Job Responsibility:
  • Lead engineering teams through the design, architecture, development, testing, and operations of the Network Fabric platform — the cloud-managed networking layer for Azure Operator Nexus and Azure Local
  • Drive execution excellence across the full software lifecycle: semester planning, feature delivery, release management, and live-site operations
  • Own engineering commitments across multiple workstreams including network device programming, Azure Resource Provider development, fabric orchestration, and network configuration management
  • Ensure services meet Microsoft standards for quality, reliability, security, and operational readiness
  • Establish and enforce engineering best practices — including test-driven development, automated validation, secure development lifecycle (SDL/SFI), and continuous integration
  • Continue and accelerate the ongoing engineering transformation: driving quality resets, improving release predictability, and reducing customer-impacting incidents
  • Own the resolution of code yellow and equivalent quality escalations, driving root cause analysis and systemic remediation across the engineering organization
  • Champion a culture of engineering fundamentals — ensuring that quality, security, and operational maturity are embedded into every sprint, not treated as afterthoughts
  • Drive measurable reduction in support costs through automation, improved test coverage, and process optimization
  • Provide technical leadership across device programming (Arista EOS, Cisco NX-OS), network fabric orchestration, and Azure Resource Provider engineering
Employment Type:
Full-time

Principal Site Reliability Engineer

As a Principal Site Reliability Engineer you will lead crucial initiatives in the...
Location:
United States, Multiple Locations
Salary:
139900.00 - 274800.00 USD / Year
Microsoft Corporation
Expiration Date:
Until further notice
Requirements:
  • Doctorate Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in software engineering, network engineering, or systems administration
  • OR Master's Degree in Computer Science, Information Technology, or related field AND 3+ years technical experience in software engineering, network engineering, or systems administration
  • OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 5+ years technical experience in software engineering, network engineering, or systems administration
  • OR equivalent experience
  • Ability to meet Microsoft, customer and/or government security screening requirements
  • Must pass Microsoft Cloud background check upon hire/transfer and every two years thereafter
  • 7+ years technical experience working with large-scale cloud or distributed systems
  • Experience building or scaling incident response programs at organizational or enterprise scope
  • Background in SRE, production engineering, or platform reliability roles
  • Track record of reducing customer impact through improved incident handling, tooling, or prevention
Job Responsibility:
  • Own execution quality for Substrate high severity incidents, ensuring clear command, decisive leadership, and forward momentum during high impact events
  • Act as the senior incident leader or sponsor for long running, high stakes, or cross service incidents, ensuring alignment on impact, risk, and recovery priorities
  • Partner closely with Incident Managers, Subject Matter Experts, and service leaders to ensure effective diagnosis, escalation, and mitigation when ownership is unclear or action is blocked
  • Ensure high quality post incident reviews and drive accountability for repair items that reduce recurrence and systemic risk
  • Ensure consistent application of severity and priority models, outage declaration criteria, and executive escalation paths
  • Coach and help develop a team of Site Reliability Engineers serving as incident responders
  • Build a culture of calm execution, accountability, psychological safety, and continuous learning during and after incidents
  • Help hire and grow senior talent capable of operating as trusted leaders in high pressure, executive visible situations
  • Serve as a trusted advisor to engineering leaders and executives on live site risk, readiness, and incident response maturity
  • Communicate clearly and credibly with senior leadership during customer impacting events
Employment Type:
Full-time

Principal Site Reliability Engineer

The Principal SRE leads crucial initiatives in the team responsible for durable, ...
Location:
Australia, Perth
Salary:
Not provided
Microsoft Corporation
Expiration Date:
Until further notice
Requirements:
  • Doctorate Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in software engineering, network engineering, or systems administration
  • OR Master's Degree in Computer Science, Information Technology, or related field AND 3+ years technical experience in software engineering, network engineering, or systems administration
  • OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 5+ years technical experience in software engineering, network engineering, or systems administration
  • OR equivalent experience
  • Proven experience leading teams through high‑severity production incidents in large, distributed systems
  • Strong understanding of incident management, reliability engineering, and live‑site operations at scale
  • Ability to drive clarity, accountability, and results in ambiguous, time‑critical situations
Job Responsibility:
  • Own execution quality for Substrate high severity incidents, ensuring clear command, decisive leadership, and forward momentum during high‑impact events
  • Act as the senior incident leader or sponsor for long‑running, high‑stakes, or cross‑service incidents, ensuring alignment on impact, risk, and recovery priorities
  • Partner closely with Incident Managers, Subject Matter Experts, and service leaders to ensure effective diagnosis, escalation, and mitigation when ownership is unclear or action is blocked
  • Ensure high‑quality post‑incident reviews and drive accountability for repair items that reduce recurrence and systemic risk
  • Ensure consistent application of severity and priority models, outage declaration criteria, and executive escalation paths
  • Coach and help develop a team of Site Reliability Engineers serving as incident responders
  • Build a culture of calm execution, accountability, psychological safety, and continuous learning during and after incidents
  • Help hire and grow senior talent capable of operating as trusted leaders in high‑pressure, executive‑visible situations
  • Serve as a trusted advisor to engineering leaders and executives on live‑site risk, readiness, and incident response maturity
  • Communicate clearly and credibly with senior leadership during customer‑impacting events
Employment Type:
Full-time

Principal Site Reliability Engineering Manager

Microsoft Substrate is the foundational cloud platform that powers many of Micro...
Location:
United States, Redmond
Salary:
139900.00 - 274800.00 USD / Year
Microsoft Corporation
Expiration Date:
Until further notice
Requirements:
  • Doctorate Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in software engineering, network engineering, or systems administration
  • OR Master's Degree in Computer Science, Information Technology, or related field AND 3+ years technical experience in software engineering, network engineering, or systems administration
  • OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 5+ years technical experience in software engineering, network engineering, or systems administration
  • OR equivalent experience
  • Candidates must be able to meet Microsoft, customer and/or government security screening requirements required for this role
  • This role requires access to Microsoft Government cloud environments, including GCC Moderate (GCCM), GCC High (GCCH), and Department of Defense (DoD) environments
  • For access to GCCH and DoD environments, this role requires the ability to obtain and maintain a favorably adjudicated Tier 3 (T3) background investigation
  • For access to GCCM environments, this role requires the ability to meet Criminal Justice Information Services (CJIS) eligibility requirements
  • For manager-level roles, a Tier 5 (T5) background investigation is preferred
  • Candidates may be considered without currently holding these background investigations, provided they are eligible for and able to successfully obtain them
Job Responsibility:
  • Lead and develop a team of Site Reliability Engineer ICs, providing clear expectations, regular coaching, and career guidance across senior and principal levels
  • Own the operational health and reliability posture of Substrate services running in regulated environments
  • Drive change and influence across the org as you establish and drive SLOs, SLIs, and operational metrics
  • Lead effective incident management and post-incident reviews
  • Serve as an actively engaged on-call engineer (OCE) and participate in an on-call rotation
  • Own reliability, resilience, and disaster recovery, including driving and coordinating DR and game day exercises
  • Drive engineering led operational excellence at scale
  • Partner with engineering and product teams to embed reliability, security, and compliance considerations early in service design
  • Influence technical and operational strategy beyond your immediate team
  • Represent your team’s work clearly to leadership and partners
Employment Type:
Full-time

Principal Engineer-Site Reliability Engineering and AIOps

We are looking for a Principal Engineer to set the enterprise technical directio...
Location:
India, Hyderabad
Salary:
Not provided
Wells Fargo
Expiration Date:
May 10, 2026
Requirements:
  • 7+ years of Engineering experience, or equivalent demonstrated through one or a combination of the following: work experience, training, military experience, education
  • 7+ years of engineering experience, including principal-level technical leadership on large-scale reliability, production operations, or platform programs across complex environments
  • 7+ years of software engineering experience (e.g., Java, C#, Python) with demonstrated expertise in system design and distributed systems
  • Track record of delivering reusable automation and platform capabilities adopted by multiple teams
  • 5+ years operating Linux/Unix and Windows platforms in production, including performance tuning, capacity planning, and reliability hardening for mission-critical services
  • 5+ years designing and operating cloud solutions (public and/or private cloud), including reliability and security architecture, infrastructure-as-code, and cost-aware engineering at scale
  • 5+ years leading reliability and operations practices for enterprise-scale, highly available services, including major incident leadership, problem management, and establishing operational readiness mechanisms
  • 5+ years architecting and scaling full-stack observability solutions, including instrumentation standards, alert strategy, service dashboards, and governance that improves signal quality and reduces noise
  • 5+ years with automation and observability toolsets (e.g., Ansible, Grafana, Elastic, Splunk, Prometheus) and experience building reusable components, templates, and paved paths integrated with CI/CD
  • Exceptional communication and influence skills, including the ability to align senior stakeholders, drive technical decisions across organizations, and clearly articulate risk, tradeoffs, and recommended paths forward
Job Responsibility:
  • Act as an advisor to leadership to develop or influence applications, network, information security, database, operating systems, or web technologies for highly complex business and technical needs across multiple groups
  • Lead the strategy and resolution of highly complex and unique challenges requiring in-depth evaluation across multiple areas or the enterprise, delivering solutions that are long-term, large-scale and require vision, creativity, innovation, advanced analytical and inductive thinking
  • Translate advanced technology experience and in-depth knowledge of the organization's tactical and strategic business objectives, the enterprise technological environment, the organization structure, and strategic technological opportunities and requirements into technical engineering solutions
  • Provide vision, direction and expertise to leadership on implementing innovative and significant business solutions
  • Maintain knowledge of industry best practices and new technologies, and recommend innovations that enhance operations or provide a competitive advantage to the organization
  • Strategically engage with all levels of professionals and managers across the enterprise and serve as an expert advisor to leadership
  • Set and evangelize the SRE and AIOps technical strategy for EFT, establishing reference architectures, standards, and guardrails (service tiering, onboarding criteria, SLO/error budget governance) and holding teams accountable through transparent executive-level reporting
  • Act as a principal-level technical advisor and multiplier: mentor senior engineers, contribute to hiring and technical bar-raising, and define reliability patterns and guardrails across applications, networks, databases, operating systems, and web technologies
  • Own the reliability and observability architecture across hybrid/multi-cloud, driving standardization of monitoring, logging, tracing, synthetics, and resilience/chaos testing, and define platform patterns that teams can adopt with minimal friction
Employment Type:
Full-time

Principal Site Reliability Engineering Manager

The Principal SRE Manager leads the team responsible for durable, high quality h...
Location:
Australia, Perth
Salary:
Not provided
Microsoft Corporation
Expiration Date:
Until further notice
Requirements:
  • Doctorate Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in software engineering, network engineering, or systems administration
  • OR Master's Degree in Computer Science, Information Technology, or related field AND 3+ years technical experience in software engineering, network engineering, or systems administration
  • OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 5+ years technical experience in software engineering, network engineering, or systems administration
  • OR equivalent experience
  • Proven experience leading teams through high severity production incidents in large, distributed systems
  • Demonstrated people leadership experience managing senior engineers or technical incident leaders
  • Strong understanding of incident management, reliability engineering, and live site operations at scale
  • Ability to drive clarity, accountability, and results in ambiguous, time critical situations
  • Ability to meet Microsoft, customer and/or government security screening requirements
  • Microsoft Cloud Background Check
Job Responsibility:
  • Own execution quality for Substrate high severity incidents, ensuring clear command, decisive leadership, and forward momentum during high impact events
  • Act as the senior incident leader or sponsor for long running, high stakes, or cross service incidents, ensuring alignment on impact, risk, and recovery priorities
  • Partner closely with Incident Managers, Subject Matter Experts, and service leaders to ensure effective diagnosis, escalation, and mitigation when ownership is unclear or action is blocked
  • Ensure high quality post incident reviews and drive accountability for repair items that reduce recurrence and systemic risk
  • Ensure consistent application of severity and priority models, outage declaration criteria, and executive escalation paths
  • Lead, coach, and develop a team of Site Reliability Engineers serving as incident responders
  • Build a culture of calm execution, accountability, psychological safety, and continuous learning during and after incidents
  • Hire and grow senior talent capable of operating as trusted leaders in high pressure, executive visible situations
  • Serve as a trusted advisor to engineering leaders and executives on live site risk, readiness, and incident response maturity
  • Communicate clearly and credibly with senior leadership during customer impacting events
Employment Type:
Full-time