Amgen

Principal Engineer - Global Reliability Network Lead

United States · 140334.15 - 189863.85 USD / Year · Job Posted June 04, 2026

Job Description

Lead Amgen's Global Reliability Network as part of the Engineering Operations Services (EOS) department, helping sites turn reliability data, maintenance history, failure trends, and equipment performance insights into actions that improve asset reliability, equipment availability, maintenance effectiveness, and operational capacity. Serve as the business lead for the CMMS Asset Performance Monitoring module and reliability-focused equipment insight tools across Amgen's manufacturing network.

Job Responsibility

  • Lead the Global Reliability Network
  • Lead a network of site reliability engineers, maintenance leaders, system owners, and technical SMEs
  • Develop and execute the Reliability Network strategy, goals, milestones, and maturity plans
  • Share reliability KPIs, bad actor trends, equipment improvements, lessons learned, and reusable solutions
  • Represent reliability priorities and recommendations in senior engineering governance forums
  • Use CMMS data, Asset Performance Monitoring, Plant Pulse, PdM findings, and site input to identify recurring failures, high-risk assets, downtime drivers, and bad actors
  • Convert reliability insights into corrective actions, equipment improvements, PM/PdM optimization, spare parts strategies, and capital recommendations
  • Track improvement actions, verify benefits, and scale successful practices across sites
  • Define and maintain a standard risk-based asset criticality approach across sites
  • Use criticality to prioritize maintenance strategies, work execution, spare parts, lifecycle planning, and performance monitoring
  • Standardize maintenance strategies based on criticality, failure modes, operating context, and performance history
  • Support RCM, FMEA/FMECA, PM optimization, predictive maintenance, and condition-based maintenance approaches
  • Define business requirements for asset health, criticality, failure modes, risk ranking, performance monitoring, and continuous improvement
  • Partner with SAP EAM, Technology/IS, site engineering teams, the Reliability Network, and SMEs to support deployment, training, adoption, and data quality
  • Serve as PlantPulse business owner to ensure tools address reliability pain points, support troubleshooting, and generate actionable equipment insights
  • Partner with OPEX teams to identify reliability improvements that unlock capacity and reduce downtime
  • Translate recurring failures and reliability constraints into business impact
  • Support lifecycle risk visibility for critical assets, including deterioration, obsolescence, maintenance cost, reliability concerns, and end-of-life planning
  • Integrate reliability principles into capital projects from conceptual design through turnover

Requirements

  • Doctorate degree & 2 years of engineering industry experience
  • OR Master's degree & 4 years of engineering industry experience
  • OR Bachelor's degree & 6 years of engineering industry experience
  • OR Associate's degree & 10 years of engineering industry experience
  • OR High school diploma/GED & 12 years of engineering industry experience

Nice to have

  • Bachelor's degree in Engineering
  • Experience in maintenance, reliability, engineering, asset management, manufacturing operations, and/or engineering digital transformation
  • Experience leading cross-functional initiatives in site, multi-site, or global environments
  • Strong knowledge of CMMS/EAM and asset performance monitoring systems, such as SAP, Maximo, or equivalent
  • Working knowledge of asset criticality, RCM, FMEA/FMECA, RCA, PM optimization, PdM, condition-based maintenance, and reliability KPIs
  • Ability to translate site needs into scalable business processes, digital product requirements, and improvement actions
  • Strong stakeholder management, communication, network leadership, and change leadership skills
  • Experience with analytics or dashboarding tools such as Power BI, Spotfire, SAP Analytics Cloud, or equivalent
  • Experience in GMP, biopharmaceutical manufacturing, GxP research, laboratory, or other regulated operations
  • Familiarity with ISO 55000, ISO 14224, SMRP, Uptime Elements, or similar reliability and asset management frameworks
  • Certification such as CMRP, CRL, PMP, Lean Six Sigma, or asset management certification
  • Experience deploying predictive maintenance, condition monitoring, smart sensors, AI-enabled analytics, or advanced asset health technologies
  • Experience developing business cases for reliability, maintenance, lifecycle asset management, or capacity improvement projects
  • Fluent English proficiency, written and verbal
  • Availability to commute to the nearest Amgen operation site as business needs arise, including for meetings, training, operational support, or other business-critical activities

What we offer

  • Retirement and Savings Plan with generous company contributions
  • Group medical, dental & vision coverage
  • Life & disability insurance
  • Discretionary annual bonus program
  • Stock-based long-term incentives
  • Award-winning time-off plans
  • Flexible work models where possible

Similar Jobs for Principal Engineer - Global Reliability Network Lead

8 matching positions

Principal Service Reliability Engineer

We are seeking a Principal Service Reliability Engineer (SRE) to lead the reliab...
Location: United States, Redmond
Salary: 142800.00 - 304200.00 USD / Year
Microsoft Corporation (https://www.microsoft.com/)
Expiration Date: Until further notice
Requirements
  • 8+ years technical experience in software engineering, network engineering, or systems administration
  • OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 5+ years technical experience in software engineering, network engineering, or systems administration
  • OR Master's Degree in Computer Science, Information Technology, or related field AND 3+ years technical experience in software engineering, network engineering, or systems administration
  • OR Doctorate Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in software engineering, network engineering, or systems administration
  • Proven track record of defining and operationalizing SLOs, SLIs, and error budgets across multiple services or organizations
  • Experience leading reliability efforts for enterprise-scale or globally distributed systems
  • Advanced debugging and troubleshooting skills across application, platform, and infrastructure layers
  • Demonstrated ability to mentor senior engineers and influence engineering culture at scale
  • Experience driving platform-level improvements (e.g., standardized observability, shared reliability tooling, automated remediation frameworks)
  • Extensive experience operating large-scale, distributed production systems, including cloud-native platforms (Azure preferred)
  • Demonstrated ability to drive cross-team technical initiatives and influence architecture and engineering standards
  • Deep experience in observability, incident management, and production operations at scale
  • Strong understanding of Azure networking, distributed systems performance, and reliability engineering principles
Job Responsibility
  • Define and drive reliability strategy across services, including measurable targets for availability, latency, and performance aligned to business priorities
  • Establish and enforce SLO/SLI frameworks and error budgets across multiple teams, ensuring consistent adoption and accountability
  • Lead complex incident management and systemic RCA efforts, identifying cross-service failure patterns and driving durable, long-term fixes
  • Influence architecture and platform design to improve operability, scalability, fault isolation, and disaster recovery at enterprise scale
  • Drive reliability engineering standards for observability (metrics, logs, traces), capacity planning, and production readiness across the organization
  • Eliminate operational toil through automation, enabling self-healing systems and reducing manual intervention
  • Embed security, compliance, and resiliency practices into design and operational processes, ensuring alignment with enterprise requirements
  • Partner with engineering leadership to prioritize reliability investments and balance feature velocity with system stability
  • Lead and mentor engineers while shaping a strong reliability culture across teams and org boundaries
Full-time

Principal Site Reliability Engineer

As a Principal Site Reliability Engineer you will lead crucial initiatives in the...
Location: United States, Multiple Locations
Salary: 139900.00 - 274800.00 USD / Year
Microsoft Corporation (https://www.microsoft.com/)
Expiration Date: Until further notice
Requirements
  • Doctorate Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in software engineering, network engineering, or systems administration
  • OR Master's Degree in Computer Science, Information Technology, or related field AND 3+ years technical experience in software engineering, network engineering, or systems administration
  • OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 5+ years technical experience in software engineering, network engineering, or systems administration
  • OR equivalent experience
  • Ability to meet Microsoft, customer and/or government security screening requirements
  • Must pass Microsoft Cloud background check upon hire/transfer and every two years thereafter
  • 7+ years technical experience working with large-scale cloud or distributed systems
  • Experience building or scaling incident response programs at organizational or enterprise scope
  • Background in SRE, production engineering, or platform reliability roles
  • Track record of reducing customer impact through improved incident handling, tooling, or prevention
Job Responsibility
  • Own execution quality for Substrate high severity incidents, ensuring clear command, decisive leadership, and forward momentum during high impact events
  • Act as the senior incident leader or sponsor for long running, high stakes, or cross service incidents, ensuring alignment on impact, risk, and recovery priorities
  • Partner closely with Incident Managers, Subject Matter Experts, and service leaders to ensure effective diagnosis, escalation, and mitigation when ownership is unclear or action is blocked
  • Ensure high quality post incident reviews and drive accountability for repair items that reduce recurrence and systemic risk
  • Ensure consistent application of severity and priority models, outage declaration criteria, and executive escalation paths
  • Coach and help develop a team of Site Reliability Engineers serving as incident responders
  • Build a culture of calm execution, accountability, psychological safety, and continuous learning during and after incidents
  • Help hire and grow senior talent capable of operating as trusted leaders in high pressure, executive visible situations
  • Serve as a trusted advisor to engineering leaders and executives on live site risk, readiness, and incident response maturity
  • Communicate clearly and credibly with senior leadership during customer impacting events
Full-time

Principal Software Engineer - C++

We are part of the Windows Servicing and Delivery (WSD) organization in the Wind...
Location: India, Hyderabad
Salary: Not provided
Microsoft Corporation (https://www.microsoft.com/)
Expiration Date: Until further notice
Requirements
  • Bachelor's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python
  • OR equivalent experience
  • Strong expertise in networking fundamentals and protocols, such as TCP/IP, UDP, DNS, DHCP, routing, VPNs, or network security
  • Proven experience designing and debugging low-level, performance-critical, and concurrent systems
  • Demonstrated ability to lead technical design discussions and influence architecture across teams
  • Experience with security engineering, including threat modeling, invariants, and regression risk analysis
  • Strong problem-solving skills and the ability to reason about ambiguous, high-impact technical challenges
Job Responsibility
  • Provide technical leadership for complex networking components across Windows Client and Windows Server (e.g., TCP/IP stack, DNS, DHCP, VPN, NDIS, filtering platforms, or distributed networking services)
  • Drive architecture, design reviews, and invariant-based engineering to ensure security, reliability, and performance at global scale
  • Lead the design and implementation of high-impact features, security fixes, and platform hardening, including variant enumeration and regression prevention
  • Own end-to-end engineering quality—from design and code to validation, deployment, and long-term maintainability
  • Partner closely with security teams, Azure, and Redmond counterparts to align on roadmap, risk mitigation, and cross-platform dependencies
  • Diagnose and resolve complex customer issues and live-site incidents, balancing short-term mitigations with long-term fixes
  • Mentor senior engineers and raise the technical bar through code reviews, design discussions, and engineering best practices
  • Influence engineering processes, tooling, and testing strategies to improve efficiency, correctness, and confidence at scale
Full-time

Principal Software Engineer

Azure Front Door (AFD) is the global edge for Microsoft and many of our customer...
Location: Australia, Sydney
Salary: Not provided
Microsoft Corporation (https://www.microsoft.com/)
Expiration Date: Until further notice
Requirements
  • Bachelor's degree in computer science, or related technical discipline AND 10+ years building and operating distributed systems or networking platforms in production; track record of delivering high‑throughput, low‑latency services
  • Strong systems programming proficiency in C/C++ and/or Rust (Go acceptable) with deep understanding of memory, concurrency, async I/O, and performance profiling (perf/eBPF/flamegraphs)
  • Expertise in networking & protocols: TCP/UDP, DNS, TLS, HTTP/1.1–3, QUIC; load balancing algorithms; congestion control; connection pooling; keep‑alive; retry/backoff
  • Linux fundamentals (kernel & networking stack), containerization/orchestration (Kubernetes), CI/CD, safe releases, and observability (metrics/traces/logs)
Job Responsibility
  • Architect and build internet-scale, low-latency edge services (proxies, load balancers, TLS offload, routing pipelines, caching layers) across hundreds of global sites and thousands of nodes
  • Design and build services that provide L4/L7 DDoS protection, HTTP-level CDN, global load balancing, and WAF capabilities
  • Lead reliability by design: champion SLOs, error budgets, and graceful degradation patterns; instrument systems end-to-end (metrics/traces/logs), drive telemetry-driven engineering and automated mitigations
  • Lead identification of dependencies and development of design documents for products, applications, services, or platforms
  • Mentor engineers and lead by example to produce extensible and maintainable code used across products
  • Own live-site for AFD services: participate in DRI/on-call, guide incident response, lead post-incident reviews, and convert findings into systemic fixes and automation
  • Proactively seek new knowledge and adapt to trends, technical solutions, and patterns that improve availability, reliability, efficiency, observability, and performance at scale
Full-time

Principal Software Engineer

We’re looking for a seasoned Principal Engineer to take full ownership of Airwal...
Location: Singapore, Singapore
Salary: Not provided
Airwallex (airwallex.com)
Expiration Date: Until further notice
Requirements
  • Deep experience in cloud-native edge networking (API Gateway, DNS, CDN, GA, firewalls)
  • Proficiency with SDN concepts and tools (e.g., OpenDaylight, Envoy, NGINX/OpenResty, Kong, Apisix)
  • Familiarity with Cloudflare, AWS, or GCP cloud networking techniques
  • Knowledge of hybrid/multi-cloud patterns and traffic engineering at scale
  • Hands-on with cloud firewall systems, WAF, rate limiting, and bot detection
  • A security-aware mindset with ability to balance protection and developer experience
  • Experience defining cross-team processes and governance frameworks
  • Strong communication skills and ability to lead across engineering and security teams
Job Responsibility
  • Own the Edge Network Stack
  • Design and evolve the architecture for Airwallex's external traffic stack including: API Gateways (routing, filtering, throttling), DNS services (global resolution & routing), CDNs (caching strategies and invalidation), Global Accelerators (latency and route optimization)
  • Define and Enforce Border Security
  • Partner with InfoSec to design and operationalize: DDoS protection, bot mitigation, and anomaly detection (e.g., Cloud Armor, WAF); rate limiting and QoS policy enforcement for prioritized customer/partner APIs; firewall rule governance and bad actor prevention mechanisms; intrusion prevention and auth mechanisms at the border
  • Policy-Driven API Route Management
  • Build end-to-end processes and tooling for how engineers expose public APIs: define policy and controls for route registration, approval, and change management; work with platform teams to enforce compliance across microservices and gateways; contribute to internal tools for observability, access review, and lifecycle auditing
  • Enable Global-Scale, Secure Performance
  • Establish reliability and quality of service (QoS) goals for critical paths (e.g., payments, onboarding, auth); design for hybrid/multi-cloud edge strategy and backbone traffic replication; tune latency, failover, and availability posture across regions
Full-time

Principal Software Engineer, AI Cloud

At Docker, we make app development easier so developers can focus on what matter...
Location: United States, Seattle
Salary: 232000.00 - 319000.00 USD / Year
Docker (docker.com)
Expiration Date: Until further notice
Requirements
  • 10+ years of software engineering experience, including 3+ years in technical leadership roles (Staff or Principal level)
  • Proven experience designing and building highly scalable distributed systems in production environments
  • Deep understanding of cloud infrastructure (AWS, Azure, GCP, or OCI), including compute, networking, and storage primitives
  • Proficiency in Go, Rust, or Java
  • Expertise in Kubernetes, microservices, and service mesh architectures
  • Strong foundation in observability, CI/CD, and infrastructure-as-code (Terraform, Pulumi, or CloudFormation)
  • Experience operating high-availability (99.99%+) production systems
  • Exceptional communication skills and ability to influence across technical and business domains
  • Bachelor’s degree in Computer Science, Engineering, or a related field, or equivalent practical experience
Job Responsibility
  • Define and drive the long-term technical strategy for Docker AI Cloud’s control and data plane services
  • Architect highly available, multi-region systems capable of operating seamlessly across multiple cloud providers
  • Design APIs and service abstractions that integrate Docker Desktop, Hub, and enterprise cloud services
  • Establish standards for reliability, scalability, and observability across the Docker AI Cloud platform
  • Lead cross-functional technical discussions and influence architectural decisions company-wide
  • Design and implement distributed systems for workload orchestration, service discovery, and lifecycle management
  • Build and operate control plane components that manage multi-tenant workloads and cloud networking
  • Develop infrastructure that delivers predictable performance, intelligent scaling, and automated failover
  • Ensure security, data integrity, and compliance across Docker’s global infrastructure footprint
  • Partner with platform and product teams to deliver developer-friendly APIs and cloud experiences
What we offer
  • Freedom & flexibility; fit your work around your life
  • Designated quarterly Whaleness Days plus end of year Whaleness break
  • Home office setup; we want you comfortable while you work
  • 16 weeks of paid Parental leave
  • Technology stipend equivalent to $100 net/month
  • PTO plan that encourages you to take time to do the things you enjoy
  • Training stipend for conferences, courses and classes
  • Equity
Full-time

Principal Architects, Systems

Principal Architect, Systems located in Frisco, TX will design and implement rob...
Location: United States, Frisco
Salary: 154856.00 - 175000.00 USD / Year
T-Mobile (https://www.t-mobile.com)
Expiration Date: Until further notice
Requirements
  • Master’s degree in Computer Science, Information Technology or related, and 7 years of relevant work experience
  • Bachelor’s degree in Computer Science, Information Technology or related, and 9 years of relevant work experience
  • Experience designing and developing cloud architecture and modernization frameworks across AWS and Azure
  • Experience leading and executing enterprise-scale cloud infrastructure initiatives leveraging Kubernetes platform engineering using EKS, AI, and Network Services
  • Experience implementing global IaC and automation solutions using Terraform, Ansible, and Python
  • Experience building and executing strategies for cloud governance, financial optimization, and operational excellence
  • Experience providing Cloud leadership, people development, and strategic direction
  • Experience leading initiatives in observability and site reliability engineering using Prometheus, Grafana, and CloudWatch
  • At least 18 years of age
  • Legally authorized to work in the United States
Job Responsibility
  • Ensure system coherence and integrate cutting-edge solutions to enhance operational efficiency and service delivery
  • Collaborate with various stakeholders to lead the adoption of new technologies and methodologies
  • Drive innovation across the organization through expert knowledge and leadership in system architecture
  • Maintain T-Mobile’s leadership in the telecommunications industry by leveraging advanced architectural practices
  • Evaluate current technological trends and integrate them into strategic planning for system development
  • Participate in other duties or projects as assigned by business management as needed
What we offer
  • Competitive base salary and compensation package
  • Annual stock grant
  • Employee stock purchase plan
  • 401(k)
  • Free, year-round money coaches
  • Annual bonus or periodic sales incentive or bonus
  • Medical, dental and vision insurance
  • Flexible spending account
  • Paid time off
  • Up to 12 paid holidays
Full-time

Director, Architect Enterprise Resilience & Recoverability

Location: USA, Bethesda
Salary: Not provided
Marriott Bonvoy (https://www.marriott.com)
Expiration Date: June 19, 2026
Requirements
  • Bachelor’s degree in Computer Science, Engineering, Information Systems, or a related discipline - or equivalent professional experience and certifications
  • 8+ years of progressive experience in systems, infrastructure, cloud, or platform engineering within a large enterprise environment, including 5+ years specifically in resiliency engineering, disaster recovery, or reliability engineering at scale
  • Demonstrated experience as a senior technical authority - architect, principal engineer, or technical director - for enterprise resiliency and/or disaster recovery programs and for live recovery events
  • Proven experience designing and validating end-to-end DR and high-availability architectures for enterprise-scale workloads across cloud (AWS, Azure, GCP, or Alibaba), hybrid, and on-premises environments
  • Experience aligning technical recovery designs to business recovery objectives (RTO, RPO, business criticality) and translating between business impact and technical implementation
  • Deep working knowledge of cloud-native resiliency patterns: multi-AZ and multi-region designs, redundancy and fault tolerance, automated failover, dynamic traffic management, and adaptive connectivity
  • Strong recoverability foundation: backup and restore integrity, immutable and versioned backup, ransomware recovery frameworks, isolated recovery environments, and cross-region recovery patterns
  • Familiarity with infrastructure-as-code and automation tooling (e.g., Terraform, Ansible, CloudFormation) applied to DR orchestration, validation, and drift detection
  • Experience with containerized and distributed systems, including Kubernetes, service mesh, and platform-level resiliency patterns
  • Demonstrated ability to influence and drive accountability across a highly matrixed organization without direct authority - across application, infrastructure, cloud, network, SRE, security, and vendor teams
Job Responsibility
  • Accountable for the technical strategy, architecture, and engineering execution of resiliency and recoverability across Marriott’s global technology estate - spanning AWS, Azure, Alibaba, hybrid cloud, on-premises, and partner-hosted workloads supporting hundreds of properties worldwide
  • Own the architectural roadmap for engineered, continuously tested resilience across the most critical revenue-supporting platforms
  • Serve as the single technical leader unifying resiliency (preventative, design-time) and recoverability (operational, response-time) under a single coherent strategy
  • Partner with major modernization and consolidation programs to ensure new and migrating platforms are recoverable by design, with repeatable failover and verified transaction success for prioritized critical workloads
  • Establish and chair architectural standards, production readiness criteria, and resiliency review gates that govern how new and changed systems enter production
  • Breaks down complex technical problems and drives to the best technical decision based on a high level of communication, debate, and discussion within the team and with other subject matter experts
  • Performs research in technologies that are emerging in the industry as a competitive advantage and reports on that research in terms of business opportunities
  • Advises on viability of emerging technologies for the business; articulates the risks, costs, and ROI
  • Provides guidance to improve operational processes and procedures to improve service, reduce costs, and leverage technologies
What we offer
  • 401(k) plan
  • stock purchase plan
  • discounts at Marriott properties
  • commuter benefits
  • employee assistance plan
  • childcare discounts
  • medical
  • dental
  • vision
  • health care flexible spending account
Full-time