Principal Network Operations Site Reliability Systems Engineer

Hewlett Packard Enterprise

Location:
United States

Contract Type:
Employment contract

Salary:

115500.00 - 266000.00 USD / Year

Job Description:

This role entails incorporating Site Reliability Engineering (SRE) concepts into network, tool, and process designs, improving service performance, and addressing unmet customer needs. The role also involves leveraging cloud platforms, evaluating network designs, and engaging in advanced software and database architecture. It is an excellent opportunity for innovation and technical leadership at Hewlett Packard Enterprise.

Job Responsibility:

  • Develops strategies and implements plans to incorporate SRE concepts into network, tool, and process designs, and leads execution of those strategies and plans
  • Evaluates LAN, WLAN, SD-WAN, AAA, Private 5G, and other network designs for fit-for-use criteria, and designs prototype analysis tools to facilitate rapid iteration of network delivery service enhancements
  • Identifies and engineers new ways to leverage data from multiple platforms to identify network performance trends and detect anomalies
  • Prototypes machine learning anomaly detection, event signature identification, and trend identification
  • Automates common incident management and problem management procedures
  • Develops organization-wide architectures, methodologies, and prototypes for software systems design and development across multiple platforms and organizations within the Global Business Unit
  • Identifies and evaluates new technologies and innovations for alignment with technology roadmap and business value
  • Creates plans for prototyping and prototype iteration
  • Reviews and evaluates designs and project activities for compliance with development guidelines and standards
  • Provides tangible feedback to improve product quality and mitigate failure risk

Requirements:

  • Bachelor’s or master’s degree in Computer Science, Computer Engineering, Information Systems, or equivalent
  • Typically, 10+ years’ experience
  • Experience with cloud platforms
  • Experience with software development languages for console and web-based applications
  • Experience in User Interface (UI/UX) design
  • Understanding of and experience with common network infrastructure devices such as switches, routers, access points, authentication, authorization, and accounting systems and protocols, and network management utilities
  • Experience with network monitoring protocols
  • Ability to design and implement relational database solutions, time-series databases, and NoSQL database solutions
  • Excellent analytical and problem-solving skills
  • Experience in the overall architecture of software systems for products and solutions
  • Experience designing and integrating software systems running on multiple platform types into an overall architecture
  • Experience evaluating and selecting forms and processes for software systems testing and methodology, including writing and executing test plans, debugging, and testing scripts and tools
  • History of innovation with multiple patents or deployed solutions in the field of software design
  • Excellent written and verbal communication skills
  • Mastery of the English language
  • Ability to effectively communicate product architectures and design proposals and negotiate options at business unit and executive levels

Nice to have:

  • Cloud Architectures
  • Cross Domain Knowledge
  • Design Thinking
  • Development Fundamentals
  • DevOps
  • Distributed Computing
  • Microservices Fluency
  • Full Stack Development
  • Security-First Mindset
  • User Experience (UX)

What we offer:

  • Comprehensive suite of benefits supporting physical, financial and emotional wellbeing
  • Career development programs
  • Inclusive environment celebrating individual uniqueness

Additional Information:

Job Posted:
May 20, 2025

Employment Type:
Full-time
Work Type:
Hybrid work

Similar Jobs for Principal Network Operations Site Reliability Systems Engineer

Principal Site Reliability Engineer

We are looking for a Principal Site Reliability Engineer to join the CVML Platfo...
Location:
United States
Salary:
166000.00 - 293000.00 USD / Year
Blue River Technology
Expiration Date:
Until further notice
Requirements:
  • 8+ years of experience building infrastructure with K8S, AWS, and bare metal
  • 8+ years of experience working with Python and Go (with production experience)
  • 8+ years of experience working with infra automation tools: Terraform / Terragrunt (or Pulumi / CDK)
  • 8+ years of experience with Linux-based systems and networks, and a deep understanding of internal components, networking, and security aspects
  • Has a track record of building and maintaining scalable systems in production environments
  • Experience in building CI/CD pipelines using GitHub Actions (or GitLab / Jenkins) for application release and deployment
  • Experience in using AWS ECS, EKS, IAM, EC2, and RDS at production scale
  • Deep understanding of Kubernetes and its internals (kubelet, CRDs, etc) and experience with building and extending clusters from scratch
  • Strong problem-solving skills and ability to troubleshoot complex infrastructure and networking issues
  • Excellent communication skills to collaborate effectively with technical and non-technical stakeholders
Job Responsibility:
  • System Design: Architect and implement various cloud and on-premise applications, systems, and infrastructure
  • Hybrid system integration: Integrate extremely diverse systems, configure stable integration, uptime, and monitoring
  • Edge device integration: work with edge devices of various formats and integrate them with on-prem and cloud workflows, including networking, low-level OS, and electrical/control integration
  • Low-level performance optimization: optimize the performance and throughput of the system at the filesystem, networking, and software levels
  • High-level optimization of cost and stability: optimize cost, operational stability, and supportability of highly diverse platforms and tech stack
  • Product Mindset: Collaborate with cross-functional teams to design, develop, and maintain robust, scalable, and user-friendly web and mobile data-intensive applications
  • System Integration: Build tools that enable users to easily move between different applications and platforms to utilize the strengths of each in a coherent ecosystem
  • Collaboration: Work closely with cross-functional teams, including data scientists, analysts, software engineers, and product managers, to understand data requirements and deliver data solutions that align with business goals
  • Documentation: Create and maintain technical documentation, including data flow diagrams, architecture designs, and standard operating procedures
  • Technology Evaluation: Stay up-to-date with industry trends and emerging technologies related to data engineering, recommending and implementing new tools and frameworks as appropriate
What we offer:
  • eligibility for Blue River’s bonus and benefit programs
Employment Type:
Full-time

Principal Site Reliability Engineer

The Principal SRE leads crucial initiatives in the team responsible for durable, ...
Location:
Australia, Perth
Salary:
Not provided
Microsoft Corporation
Expiration Date:
Until further notice
Requirements:
  • Doctorate Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in software engineering, network engineering, or systems administration
  • OR Master's Degree in Computer Science, Information Technology, or related field AND 3+ years technical experience in software engineering, network engineering, or systems administration
  • OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 5+ years technical experience in software engineering, network engineering, or systems administration
  • OR equivalent experience
  • Proven experience leading teams through high‑severity production incidents in large, distributed systems
  • Strong understanding of incident management, reliability engineering, and live‑site operations at scale
  • Ability to drive clarity, accountability, and results in ambiguous, time‑critical situations
Job Responsibility:
  • Own execution quality for Substrate high severity incidents, ensuring clear command, decisive leadership, and forward momentum during high‑impact events
  • Act as the senior incident leader or sponsor for long‑running, high‑stakes, or cross‑service incidents, ensuring alignment on impact, risk, and recovery priorities
  • Partner closely with Incident Managers, Subject Matter Experts, and service leaders to ensure effective diagnosis, escalation, and mitigation when ownership is unclear or action is blocked
  • Ensure high‑quality post‑incident reviews and drive accountability for repair items that reduce recurrence and systemic risk
  • Ensure consistent application of severity and priority models, outage declaration criteria, and executive escalation paths
  • Coach and help develop a team of Site Reliability Engineers serving as incident responders
  • Build a culture of calm execution, accountability, psychological safety, and continuous learning during and after incidents
  • Help hire and grow senior talent capable of operating as trusted leaders in high‑pressure, executive‑visible situations
  • Serve as a trusted advisor to engineering leaders and executives on live‑site risk, readiness, and incident response maturity
  • Communicate clearly and credibly with senior leadership during customer‑impacting events
Employment Type:
Full-time

Principal Site Reliability Engineering Manager

The Principal SRE Manager leads the team responsible for durable, high quality h...
Location:
Australia, Perth
Salary:
Not provided
Microsoft Corporation
Expiration Date:
Until further notice
Requirements:
  • Doctorate Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in software engineering, network engineering, or systems administration
  • OR Master's Degree in Computer Science, Information Technology, or related field AND 3+ years technical experience in software engineering, network engineering, or systems administration
  • OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 5+ years technical experience in software engineering, network engineering, or systems administration
  • OR equivalent experience
  • Proven experience leading teams through high-severity production incidents in large, distributed systems
  • Demonstrated people leadership experience managing senior engineers or technical incident leaders
  • Strong understanding of incident management, reliability engineering, and live-site operations at scale
  • Ability to drive clarity, accountability, and results in ambiguous, time-critical situations
  • Ability to meet Microsoft, customer and/or government security screening requirements
  • Microsoft Cloud Background Check
Job Responsibility:
  • Own execution quality for Substrate high-severity incidents, ensuring clear command, decisive leadership, and forward momentum during high-impact events
  • Act as the senior incident leader or sponsor for long-running, high-stakes, or cross-service incidents, ensuring alignment on impact, risk, and recovery priorities
  • Partner closely with Incident Managers, Subject Matter Experts, and service leaders to ensure effective diagnosis, escalation, and mitigation when ownership is unclear or action is blocked
  • Ensure high-quality post-incident reviews and drive accountability for repair items that reduce recurrence and systemic risk
  • Ensure consistent application of severity and priority models, outage declaration criteria, and executive escalation paths
  • Lead, coach, and develop a team of Site Reliability Engineers serving as incident responders
  • Build a culture of calm execution, accountability, psychological safety, and continuous learning during and after incidents
  • Hire and grow senior talent capable of operating as trusted leaders in high-pressure, executive-visible situations
  • Serve as a trusted advisor to engineering leaders and executives on live-site risk, readiness, and incident response maturity
  • Communicate clearly and credibly with senior leadership during customer-impacting events
Employment Type:
Full-time

Principal Systems Reliability Engineer, Secure Federal Operations

The Principal Systems Reliability Engineer is responsible for designing and impl...
Location:
United States, Herndon
Salary:
114800.00 - 207200.00 USD / Year
T-Mobile
Expiration Date:
Until further notice
Requirements:
  • 7+ years of progressive experience in systems architecture, platform engineering, or site reliability engineering, with a strong focus on security and operational excellence
  • Experience designing and implementing secure, scalable, and highly available systems across hybrid and cloud environments (Azure, AWS, or GCP)
  • Experience in automation and scripting using Python, Go, PowerShell, or Bash
  • Knowledge of imaging processes and asset lifecycle management, including provisioning, patching, and compliance tracking preferred
  • Strong background in network architecture and security, including segmentation, VPNs, firewalls, and Zero Trust principles preferred
  • Experience with DevOps tools such as Ansible, Chef, and Puppet; experience with Docker and Kubernetes is preferable
  • Experience with Application Performance Monitoring (APM) tools such as AppDynamics, and logging/observability tools like Splunk for troubleshooting and performance analysis
  • Experience working in a cloud environment (public/private)
  • Ability to influence technology direction, lead architecture reviews, and collaborate across multiple teams preferred
  • Experience in incident and problem management, root cause analysis, and disaster recovery planning preferred
Job Responsibility:
  • Develop and implement system designs and architectures to improve software delivery speed and operational efficiency
  • Lead architecture for cross-domain programs, ensuring alignment with enterprise standards
  • Build and operate cloud-native platforms (Kubernetes, service mesh, ingress, policy engines)
  • Implement network segmentation, firewalls, VPNs, and Zero Trust principles
  • Contribute to advancing software delivery processes including cloud enablement and microservices containerization
  • Deliver software solutions that enhance service availability, scalability, latency, and efficiency
  • Manage environment provisioning and pipeline configurations to support automated server deployment
  • Also responsible for other duties/projects as assigned by business management as needed
What we offer:
  • Medical, dental and vision insurance
  • Flexible spending account
  • 401(k)
  • Employee stock grants
  • Employee stock purchase plan
  • Paid time off and up to 12 paid holidays
  • Paid parental and family leave
  • Family building benefits
  • Back-up care
  • Enhanced family support
Employment Type:
Full-time

Principal Systems Reliability Engineer

This role is responsible for designing and implementing secure, scalable, and hi...
Location:
United States, Herndon
Salary:
114800.00 - 207200.00 USD / Year
T-Mobile
Expiration Date:
Until further notice
Requirements:
  • Bachelor’s degree in Computer Science, Engineering, Information Technology, or related field OR equivalent experience
  • Advanced degree with 5+ years of related experience preferred
  • 7+ years of progressive experience in systems architecture, platform engineering, or site reliability engineering
  • Hands-on experience with Azure and AWS cloud platforms
  • Expertise in Active Directory, DNS, 802.1X, and certificate lifecycle management
  • Strong background in Windows and Linux operating systems
  • Proficiency in TCP/IP networking and network security principles
  • Administration of Microsoft 365 (M365) services (Exchange Online, SharePoint, Teams)
  • Automation and scripting using PowerShell, Python, or Bash preferred
  • Experience working in a cloud environment (public/private)
Job Responsibility:
  • Develop and implement system designs to improve software delivery speed and operational efficiency
  • Lead architecture for cross-domain programs ensuring alignment with enterprise standards
  • Deliver solutions that enhance service availability, scalability, latency, and efficiency
  • Design and deploy solutions on Azure and AWS
  • Build and operate cloud-native platforms (Kubernetes, service mesh, ingress, policy engines)
  • Implement Infrastructure as Code (IaC) for automated deployments
  • Administer Active Directory and integrate with cloud identity solutions
  • Configure 802.1X authentication for secure network access
  • Manage the digital certificate lifecycle (issuance, renewal, revocation)
  • Manage DNS, TCP/IP networks, and network segmentation
What we offer:
  • Competitive base salary and compensation package
  • Annual stock grant
  • Employee stock purchase plan
  • 401(k)
  • Access to free, year-round money coaches
  • Medical, dental and vision insurance
  • Flexible spending account
  • Employee stock grants
  • Paid time off
Employment Type:
Full-time

Principal Software Engineer

Microsoft Specialized Clouds combines the power of edge platforms, devices, and ...
Location:
India, Bangalore
Salary:
Not provided
Microsoft Corporation
Expiration Date:
Until further notice
Requirements:
  • Bachelor's Degree in Computer Science or related technical field AND 15+ years technical engineering experience with coding in languages including, but not limited to, C#, .NET, Golang, Java, or Python
  • 10+ years of experience in commercial software development
  • Experience in building, shipping and operating reliable, distributed systems software
  • Ability to engage in site-reliability engineering practices
  • Experience in building large distributed systems software
  • Experience in Microservices architecture
  • Experience working with Kubernetes and Helm charts
  • Understanding of basic computer networking
  • Self-starter, who proactively identifies problems and drives for resolution
  • Great problem-solving skills and outstanding drive for results
Job Responsibility:
  • Incubate new ideas, carry out PoCs, and convert them into feature requirements
  • Take features from ideation to successful global roll out
  • Act as an expert code and design reviewer and mentor other engineers
  • Accelerate development velocity for all engineers and deliver continuous improvements to the team’s process and codebase
  • Establish best coding practices and automation process
  • Build strong cross-group collaboration with other cross-geo teams to remove blockers and drive comprehensive end-to-end service delivery
  • Leverage other data platforms (open source and within Microsoft) for building solutions
  • Work on customer support cases, collaborating with various internal teams and customers to resolve them within the SLA timelines
  • Collaborate with the Product Management team to conduct customer workshops that educate customers on features, and translate the feedback into engineering requirements
  • Serve as a torchbearer for AI-driven development
Employment Type:
Full-time

Principal Site Reliability Engineer

Site Reliability Engineering (SRE) is a discipline that incorporates aspects of ...
Location:
Portugal
Salary:
Not provided
OutSystems
Expiration Date:
Until further notice
Requirements:
  • STEM degree (BSc, MSc, in Software Engineering/Computer Science or related fields)
  • 8+ years of experience in Software Engineering or SRE, ideally within high-growth, cloud-native environments
  • Expertise in Observability: Proven ability to implement SLIs/SLOs and telemetry systems that provide actionable insights into complex distributed systems
  • Cloud Mastery: Deep architectural knowledge of AWS/GCP/Azure, specifically regarding networking, security, and cost-optimization
  • Strategic Impact: Demonstrated success in leading cross-functional initiatives that improved system reliability or developer velocity at an organizational scale
  • System Design & Architecture: Expertise in designing highly available, fault-tolerant distributed systems (Microservices, Event-driven architecture)
  • Development: Professional-level proficiency in Go, Python, or Rust, with the ability to contribute to core product codebases and build custom internal tooling
  • Cloud Ecosystems: Deep-tier mastery of AWS, GCP, or Azure (specifically IAM, VPC networking, Transit Gateways, and Cross-region redundancy)
  • Orchestration at Scale: Extensive experience managing Kubernetes (K8s) in production, including Custom Resource Definitions (CRDs), Service Mesh (Istio/Linkerd), and Admission Controllers
  • Infrastructure as Code (IaC): Advanced usage of Terraform, CloudFormation, or Spacelift, focusing on modularity, state management, and CI/CD integration for infrastructure.
Job Responsibility:
  • Help define and execute the strategic vision and roadmap for the Site Reliability Engineering function
  • Provide leadership and mentorship to more junior SREs, fostering a culture of innovation, collaboration, and operational excellence
  • Collaborate with leadership and other stakeholders to ensure cross-functional alignment
  • Participate actively and collaborate effectively with development teams to influence the design of a highly reliable and scalable infrastructure, leveraging cloud technologies and industry best practices
  • Collaborate with development teams at all stages of the product development lifecycle to ensure systems are resilient (observable, fault-tolerant, recoverable, scalable) and performant
  • Drive the adoption, definition, and improvement of Service Level Objectives (SLOs)
  • Implement monitoring, alerting, logging, and tracing solutions to detect and respond to incidents
  • Oversee incident response efforts, ensuring quick resolution and minimal downtime, and effective RCA/post-mortems
  • Automate every operational task, with a special focus on fast incident detection & recovery
  • Foster a culture of continuous improvement and knowledge sharing
What we offer:
  • A company that is always growing, changing, and innovating
  • Real career opportunities
  • Work colleagues that are as smart, hard-working, and driven as you
  • Disrupting the status quo is in our DNA
  • We ask “why” a lot
  • OutSystems nurtures an inclusive culture of diversity, where everyone feels empowered to be their authentic self and perform at their best.
Employment Type:
Full-time

Principal Software Engineering Manager

The HPC/AI (High-Performance Computing and Artificial Intelligence) organization...
Location:
United States, Multiple Locations
Salary:
139900.00 - 274800.00 USD / Year
Microsoft Corporation
Expiration Date:
Until further notice
Requirements:
  • Bachelor's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
  • Ability to meet Microsoft, customer and/or government security screening requirements is required for this role
  • Master's Degree in Computer Science or related technical field AND 8+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR Bachelor's Degree in Computer Science or related technical field AND 12+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
  • 4+ years people management experience
  • 10+ years of professional software design and development experience in large-scale distributed systems
  • Experience building and operating networking infrastructure for hyperscale datacenters or AI clusters
  • Hands-on experience with networking technologies in AI-specific hardware (e.g., InfiniBand, RoCE, MRC, NVLink)
  • In-depth understanding of networking protocols (e.g., Ethernet, TCP/IP, RDMA, gRPC) and distributed systems
  • Familiarity with network virtualization, software-defined networking (SDN), or network performance tuning
  • Familiarity with AI accelerators such as GPUs (NVIDIA, AMD) or TPUs, and how they interact with networking infrastructure
Job Responsibility:
  • Hire, manage, and grow a high-performing team of software engineers, fostering a culture of excellence, inclusion, and innovation
  • Lead the design and development of large-scale distributed systems and services that power Azure’s AI infrastructure
  • Drive engineering planning and execution while ensuring alignment with organizational OKRs and long-term strategy
  • Establish lean, scalable, and efficient processes that promote innovation and engineering rigor
  • Deliver best-in-class engineering by ensuring services and components are modular, secure, reliable, diagnosable, observable, and reusable
  • Improve test coverage, automation, and integration testing to proactively identify and resolve reliability gaps
  • Ensure live-site reliability and service health through robust monitoring, telemetry, and automation
  • Collaborate across Microsoft and partner organizations to deliver cohesive, end-to-end infrastructure solutions
  • Apply data-driven insights to optimize performance, scalability, and customer satisfaction
  • Champion Microsoft’s culture by modeling, coaching, and caring—nurturing diversity, inclusion, and continuous growth for your team and peers
Employment Type:
Full-time