Principal Systems Reliability Engineer

T-Mobile

Location:
United States, Herndon

Contract Type:
Not provided

Salary:

114800.00 - 207200.00 USD / Year

Job Description:

This role is responsible for designing and implementing secure, scalable, and highly reliable technology solutions across cloud platforms, networking, and cybersecurity domains. This role combines advanced expertise in system architecture, cloud engineering (Azure, AWS), and DevSecOps practices to ensure optimal performance, availability, and security of enterprise systems. Key responsibilities include managing identity and access controls, certificate lifecycle, patch management using SCCM, and automation through scripting. Additionally, the role oversees Microsoft 365 (M365) services to ensure seamless collaboration, compliance, and security across productivity platforms. Success is measured by improved security posture, operational efficiency, and enhanced service reliability, directly impacting organizational performance and customer experience.

Job Responsibility:

  • Develop and implement system designs to improve software delivery speed and operational efficiency
  • Lead architecture for cross-domain programs ensuring alignment with enterprise standards
  • Deliver solutions that enhance service availability, scalability, latency, and efficiency
  • Design and deploy solutions on Azure and AWS
  • Build and operate cloud-native platforms (Kubernetes, service mesh, ingress, policy engines)
  • Implement Infrastructure as Code (IaC) for automated deployments
  • Administer Active Directory and integrate with cloud identity solutions
  • Configure 802.1X authentication for secure network access
  • Manage the digital certificate lifecycle (issuance, renewal, revocation)
  • Manage DNS, TCP/IP networks, and network segmentation
  • Implement firewalls, VPNs, and Zero Trust principles
  • Apply cybersecurity frameworks and monitor vulnerabilities
  • Maintain and secure Windows and Linux servers
  • Utilize SCCM for patch management and compliance reporting
  • Perform OS hardening and lifecycle management
  • Manage M365 services including Exchange Online, SharePoint, Teams, and OneDrive
  • Ensure compliance, security, and availability of collaboration platforms
  • Implement policies for data governance and identity integration with M365
  • Develop scripts in PowerShell, Python, or Bash to automate tasks
  • Support CI/CD pipelines and cloud enablement
  • Use APM tools (AppDynamics) and observability platforms (Splunk) for troubleshooting
  • Participate in incident/problem management and disaster recovery planning

Requirements:

  • Bachelor’s degree in Computer Science, Engineering, Information Technology, or related field OR equivalent experience
  • Advanced degree with 5+ years of related experience preferred
  • 7+ years of progressive experience in systems architecture, platform engineering, or site reliability engineering
  • Hands-on experience with Azure and AWS cloud platforms
  • Expertise in Active Directory, DNS, 802.1X, and certificate lifecycle management
  • Strong background in Windows and Linux operating systems
  • Proficiency in TCP/IP networking and network security principles
  • Administration of Microsoft 365 (M365) services (Exchange Online, SharePoint, Teams)
  • Automation and scripting using PowerShell, Python, or Bash preferred
  • Experience working in a cloud environment (public/private)
  • Knowledge of containerization (Docker, Kubernetes) preferred
  • Experience in incident and problem management, root cause analysis, and disaster recovery planning preferred
  • US citizenship (without dual citizenship)
  • At least 18 years of age and legally authorized to work in the United States
  • Active security clearance or ability to obtain one
What we offer:
  • Competitive base salary and compensation package
  • Annual stock grant
  • Employee stock purchase plan
  • 401(k)
  • Access to free, year-round money coaches
  • Medical, dental and vision insurance
  • Flexible spending account
  • Employee stock grants
  • Paid time off
  • Up to 12 paid holidays
  • Paid parental and family leave
  • Family building benefits
  • Back-up care
  • Enhanced family support
  • Childcare subsidy
  • Tuition assistance
  • College coaching
  • Short- and long-term disability
  • Voluntary AD&D coverage
  • Voluntary accident coverage
  • Voluntary life insurance
  • Voluntary disability insurance
  • Voluntary long-term care insurance
  • Mobile service & home internet discounts
  • Pet insurance
  • Access to commuter and transit programs

Additional Information:

Job Posted:
February 16, 2026

Employment Type:
Full-time
Work Type:
Hybrid work

Similar Jobs for Principal Systems Reliability Engineer

Principal Network Operations Site Reliability Systems Engineer

This role entails incorporating Site Reliability Engineering (SRE) concepts into...
Location
United States
Salary:
115500.00 - 266000.00 USD / Year
Hewlett Packard Enterprise
Expiration Date
Until further notice
Requirements
  • Bachelor’s or master’s degree in Computer Science, Computer Engineering, Information Systems, or equivalent
  • Typically, 10+ years’ experience
  • Experience with cloud platforms
  • Experience with software development languages for console and web-based applications
  • Experience in User Interface (UI/UX) design
  • Understanding of and experience with common network infrastructure devices such as switches, routers, access points, authentication, authorization, and accounting systems and protocols, and network management utilities
  • Experience with network monitoring protocols
  • Ability to design and implement relational database solutions, time-series databases, and NoSQL database solutions
  • Excellent analytical and problem-solving skills
  • Experience in the overall architecture of software systems for products and solutions
Job Responsibility
  • Develops strategies and implements plans to incorporate SRE concepts into network, tool, and process designs, and leads execution of those strategies and plans
  • Evaluates LAN, WLAN, SD-WAN, AAA, Private 5G, and other network designs for fit-for-use criteria, and designs prototype analysis tools to facilitate rapid iteration of network delivery service enhancements
  • Identifies and engineers new ways to leverage data from multiple platforms to identify network performance trends and detect anomalies
  • Prototypes machine learning anomaly detection, event signature identification, and trend identification
  • Automates common incident management and problem management procedures
  • Develops organization-wide architectures, methodologies, and prototypes for software systems design and development across multiple platforms and organizations within the Global Business Unit
  • Identifies and evaluates new technologies and innovations for alignment with technology roadmap and business value
  • Creates plans for prototyping and prototype iteration
  • Reviews and evaluates designs and project activities for compliance with development guidelines and standards
  • Provides tangible feedback to improve product quality and mitigate failure risk
What we offer
  • Comprehensive suite of benefits supporting physical, financial and emotional wellbeing
  • Career development programs
  • Inclusive environment celebrating individual uniqueness
Employment Type: Full-time

Principal Site Reliability Engineer

Location
United States, Ft. Meade
Salary:
Not provided
CipherLogix
Expiration Date
Until further notice
Requirements
  • Fourteen (14) years experience in software development/engineering, including requirements analysis, software development, installation, integration, evaluation, enhancement, maintenance, testing, and problem diagnosis/resolution
  • Ten (10) years experience in system engineering/architecture
  • Ten (10) years experience working with products that support highly distributed, massively parallel computation needs such as HBase, Hadoop, Cloudbase/Accumulo, Bigtable, Cassandra, Scality, etc.
  • At least ten (10) years experience writing software scripts using scripting languages such as Perl, Python, or Ruby for software automation
  • At least four (4) years experience managing and monitoring large Cloud System (>200 nodes). Cloud Systems Administrator or Developer Certification
  • Experience in performing and providing technical direction for the development, engineering, interfacing, integration, and testing of complete hardware/software systems to include monitoring technical health of a system, improving organizational processes, implementation of postmortem (failure) analysis and incident management
  • Ten (10) years experience in the cleared environment
  • Ten (10) years demonstrated experience developing software for one of the following: Windows, UNIX, or Linux OS
  • Knowledge and experience with developing distributed storage routing and querying algorithms
  • Experience in developing documentation required to support a program’s technical issues and training situations
Employment Type: Full-time

Principal Site Reliability Engineer

We are looking for a Principal Site Reliability Engineer to join the CVML Platfo...
Location
United States
Salary:
166000.00 - 293000.00 USD / Year
Blue River Technology
Expiration Date
Until further notice
Requirements
  • 8+ years of experience building infrastructure with K8S, AWS, and bare metal
  • 8+ years of experience working with Python and Go (with production experience)
  • 8+ years of experience working with infra automation tools: Terraform / Terragrunt (or Pulumi / CDK)
  • 8+ years of experience with Linux-based systems and networks, and a deep understanding of internal components, networking, and security aspects
  • Has a track record of building and maintaining scalable systems in production environments
  • Experience in building CI/CD pipelines using GitHub Actions (or GitLab / Jenkins) for application release and deployment
  • Experience in using AWS ECS, EKS, IAM, EC2, and RDS at production scale
  • Deep understanding of Kubernetes and its internals (kubelet, CRDs, etc) and experience with building and extending clusters from scratch
  • Strong problem-solving skills and ability to troubleshoot complex infrastructure and networking issues
  • Excellent communication skills to collaborate effectively with technical and non-technical stakeholders
Job Responsibility
  • System Design: Architect and implement various cloud and on-premise applications, systems, and infrastructure
  • Hybrid system integration: Integrate extremely diverse systems, configure stable integration, uptime, and monitoring
  • Edge device integration: work with edge devices of various formats and integrate them with on-prem and cloud workflows, including networking, low-level OS, and electrical/control integration
  • Low-level performance optimization: optimize the performance and throughput of the system at the filesystem, networking, and software levels
  • High-level optimisation of cost and stability: optimize cost, operational stability, and supportability of highly diverse platforms and tech stack
  • Product Mindset: Collaborate with cross-functional teams to design, develop, and maintain robust, scalable, and user-friendly web and mobile data-intensive applications
  • System Integration: Build tools that enable users to easily move between different applications and platforms to utilize the strengths of each in a coherent ecosystem
  • Collaboration: Work closely with cross-functional teams, including data scientists, analysts, software engineers, and product managers, to understand data requirements and deliver data solutions that align with business goals
  • Documentation: Create and maintain technical documentation, including data flow diagrams, architecture designs, and standard operating procedures
  • Technology Evaluation: Stay up-to-date with industry trends and emerging technologies related to data engineering, recommending and implementing new tools and frameworks as appropriate
What we offer
  • Eligibility for Blue River’s bonus and benefit programs
Employment Type: Full-time

Principal Site Reliability Engineer

Groupon is modernizing its global platform — and reliability is at the center of...
Location
Colombia
Salary:
Not provided
Groupon
Expiration Date
Until further notice
Requirements
  • 10+ years in software/systems engineering
  • 5+ years in SRE or platform reliability
  • Strong experience with GCP (preferred) or AWS, Kubernetes, and Terraform
  • Proficiency in Python or Go for automation and tooling
  • Deep understanding of observability stacks (Prometheus, Grafana, OpenTelemetry) and service meshes (Istio, Envoy)
  • Hands-on AIOps experience: anomaly detection, predictive analytics, ML-assisted operations
  • Strong communication and influencing skills — data over hierarchy
Job Responsibility
  • Architect and maintain self-healing systems with 99.9%+ availability targets
  • Use AI/ML to automate infrastructure governance and detect configuration or IaC anti-patterns
  • Implement adaptive SLIs/SLOs that evolve automatically from real-time data
  • Build AIOps-based observability and auto-remediation pipelines
  • Apply predictive modeling to forecast failures before they impact users
  • Lead chaos, performance, and resilience testing programs
  • Map platform and service behavior to revenue impact and drive improved revenue resilience through better infrastructure performance
  • Mentor engineers and drive reliability standards across teams
  • Partner with platform, data, and product teams to ensure stability aligns with business goals
  • Support major incident response, incident review, and participate in on-call rotations
What we offer
  • The opportunity to work with cutting-edge technologies in a transformative environment
  • Professional growth and leadership development pathways tailored to your aspirations
  • A chance to leave a lasting impact by shaping the future of reliable and scalable systems

Principal Site Reliability Engineer

Groupon is modernizing its global platform — and reliability is at the center of...
Location
Ecuador
Salary:
Not provided
Groupon
Expiration Date
Until further notice
Requirements
  • 10+ years in software/systems engineering, including 5+ years in SRE or platform reliability
  • Strong experience with GCP (preferred) or AWS, Kubernetes, and Terraform
  • Proficiency in Python or Go for automation and tooling
  • Deep understanding of observability stacks (Prometheus, Grafana, OpenTelemetry) and service meshes (Istio, Envoy)
  • Hands-on AIOps experience: anomaly detection, predictive analytics, ML-assisted operations
  • Strong communication and influencing skills — data over hierarchy
Job Responsibility
  • Architect and maintain self-healing systems with 99.9%+ availability targets
  • Use AI/ML to automate infrastructure governance and detect configuration or IaC anti-patterns
  • Implement adaptive SLIs/SLOs that evolve automatically from real-time data
  • Build AIOps-based observability and auto-remediation pipelines
  • Apply predictive modeling to forecast failures before they impact users
  • Lead chaos, performance, and resilience testing programs
  • Map platform and service behavior to revenue impact and drive improved revenue resilience through better infrastructure performance
  • Mentor engineers and drive reliability standards across teams
  • Partner with platform, data, and product teams to ensure stability aligns with business goals
  • Support major incident response, incident review, and participate in on-call rotations
What we offer
  • The opportunity to work with cutting-edge technologies in a transformative environment
  • Professional growth and leadership development pathways tailored to your aspirations
  • A chance to leave a lasting impact by shaping the future of reliable and scalable systems

Principal Site Reliability Engineer (AI-first SRE)

Groupon is modernizing its global platform — and reliability is at the center of...
Location
Peru
Salary:
Not provided
Groupon
Expiration Date
Until further notice
Requirements
  • 10+ years in software/systems engineering, including 5+ years in SRE or platform reliability
  • Strong experience with GCP (preferred) or AWS, Kubernetes, and Terraform
  • Proficiency in Python or Go for automation and tooling
  • Deep understanding of observability stacks (Prometheus, Grafana, OpenTelemetry) and service meshes (Istio, Envoy)
  • Hands-on AIOps experience: anomaly detection, predictive analytics, ML-assisted operations
  • Strong communication and influencing skills — data over hierarchy
Job Responsibility
  • Architect and maintain self-healing systems with 99.9%+ availability targets
  • Use AI/ML to automate infrastructure governance and detect configuration or IaC anti-patterns
  • Implement adaptive SLIs/SLOs that evolve automatically from real-time data
  • Build AIOps-based observability and auto-remediation pipelines
  • Apply predictive modeling to forecast failures before they impact users
  • Lead chaos, performance, and resilience testing programs
  • Map platform and service behavior to revenue impact and drive improved revenue resilience through better infrastructure performance
  • Mentor engineers and drive reliability standards across teams
  • Partner with platform, data, and product teams to ensure stability aligns with business goals
  • Support major incident response, incident review, and participate in on-call rotations
What we offer
  • The opportunity to work with cutting-edge technologies in a transformative environment
  • Professional growth and leadership development pathways tailored to your aspirations
  • A chance to leave a lasting impact by shaping the future of reliable and scalable systems

Principal Machine Learning System Engineer

As a Principal Machine Learning Systems Engineer, you will lead the design, deve...
Location
United States, Seattle; San Francisco
Salary:
190300.00 - 305600.00 USD / Year
Atlassian
Expiration Date
Until further notice
Requirements
  • Lead the design, development, and deployment of scalable machine learning (ML) systems and infrastructure
  • Collaborate closely with data scientists, software engineers, and product teams
  • Optimize model performance
  • Ensure system reliability
  • Implement efficient data pipelines
  • Drive architectural decisions for high-performance computing and cloud-based ML platforms
  • Mentor junior engineers
  • Promote best practices in ML operations (MLOps)
  • Stay updated on emerging technologies
Job Responsibility
  • Translate complex ML models into production-ready solutions
  • Ensure scalability and security
  • Deliver robust, scalable, and efficient machine learning solutions that support business growth and innovation
What we offer
  • Health coverage
  • Paid volunteer days
  • Wellness resources
Employment Type: Full-time

Principal Frontend Software Engineer - Design Systems & AI

We’re looking for a passionate Principal Engineer (P60) to join the Design Syste...
Location
Australia
Salary:
Not provided
Atlassian
Expiration Date
Until further notice
Requirements
  • A strong interest in AI, especially in generative approaches for frontend code that adheres to design systems and frontend standards
  • Systems thinking and experience architecting and maintaining large-scale systems (100+ packages, content, standards, etc.)
  • Proven Tech Lead experience: You’ve led complex technical initiatives and mentored other engineers
  • Experience with JavaScript (ES6), HTML5, CSS and experience with modern JavaScript frameworks (e.g., React, AngularJS, Vue)
  • Bachelor's or Master's degree (preferably a Computer Science degree or equivalent experience)
  • Extensive experience with modern testing frameworks (e.g., Jest, Cypress, Mocha, Chai)
  • Strong command of the JavaScript language and ecosystem
  • Experience in design system best practices
Job Responsibility
  • Lead the technical vision and architecture for AI-driven design system solutions, ensuring scalability, reliability, and compliance with Atlassian’s frontend standards
  • Drive the development of generative AI tools that produce frontend code aligned with our design system and accessibility requirements
  • Tackle the challenges of maintaining and evolving a system of 100+ packages, including content, standards, and tooling
  • Mentor and guide engineers across the team, fostering a culture of technical excellence and innovation
  • Collaborate with cross-functional partners to deliver impactful solutions that elevate the user experience for millions of Atlassian customers
What we offer
  • Health coverage
  • Paid volunteer days
  • Wellness resources