The System Reliability Engineer (SRE) guides and mentors other SREs, and improves and protects the software and systems behind T-Mobile's IT services, covering scalability, availability, latency, performance, security, and capacity, while enabling faster, higher-quality software delivery.
In this role, you will support T-Mobile’s Legal & Emergency Response platforms: mission-critical systems that enable response to urgent law enforcement and emergency situations. These systems operate in high-stakes environments where reliability, speed, and accuracy matter. When they are up and running, they help enable real-world outcomes for people in critical moments.
This is not a maintenance role. You’ll step into a modern, cloud-native platform that is still evolving, giving you the opportunity to shape how it scales, improves, and becomes resilient. You’ll work on meaningful problems, own production systems end-to-end, and directly influence how reliability is built into the platform. You’ll also work in an environment that embraces modern engineering practices, including strong adoption of AI-assisted tools, and a culture that values ownership, collaboration, and continuous improvement.
Job Responsibilities:
Apply DevOps automation for CI/CD, configuration management, and environment management (non-prod and prod)
Provision and manage environments, and configure pipelines and infrastructure (VMs/containers)
Improve availability, scalability, latency, and efficiency of services, with emphasis on Legal Technology platforms
Own reliability and performance of critical applications (LRS, E-Core, LEEP)
Participate in on-call rotation (~1 week every 2 months) and respond to alerts/incidents
Lead incident response, root cause analysis, and post-incident improvements
Build and enhance observability (dashboards, alerts), runbooks, and automation
Partner with engineering to design for reliability and eliminate recurring issues in distributed systems
Drive improvements in delivery and operations (cloud enablement, microservices, containerization, zero-downtime deployments)
Mentor SREs and guide reliability practices across the team
Requirements:
Bachelor’s Degree plus 5 years of related work experience OR Advanced degree with 3 years of related experience
4–7+ years of relevant experience (Required)
Experience in Agile/DevOps environments (Required)
Proficiency in one or more: Java, Python, Go, C/C#, or scripting (Shell/Perl) (Required)
Experience with DBMS (Postgres or Oracle) (Required)
Experience with CI/CD tools (e.g., Jenkins) and DevOps tools (GitHub/GitLab, Chef/Puppet) (Required)
Experience with Docker, Kubernetes (Required)
Experience with APM/observability tools (e.g., Splunk, Grafana, AppDynamics) (Required)
Experience troubleshooting distributed systems using logs/metrics/traces (Required)
DevOps (Required)
Integration (Required)
Strong troubleshooting skills in distributed systems
Ability to operate in production environments and respond to incidents
Ownership mindset with focus on reliability and continuous improvement
At least 18 years of age
Legally authorized to work in the United States
U.S. citizenship (Required)
Nice to have:
Experience in cloud environments (Preferred)
Experience with high-availability or regulated environments (Preferred)