Senior AI Infrastructure Engineer

T-Mobile

Location:
United States, Bothell

Contract Type:
Employment contract

Salary:

113600.00 - 205000.00 USD / Year

Job Description:

This role is responsible for designing, deploying, and maintaining high-performance computing environments optimized for AI and machine learning workloads. It involves building scalable infrastructure, ensuring efficient workload management, providing self-service and on-demand tooling, and collaborating with teams to support AI-driven applications. The role also drives operational excellence and works with diverse hardware and software solutions to enhance the performance and reliability of our on-premises AI/ML infrastructure.

Job Responsibility:

  • Technical System Expertise: Understands system protocols, how systems operate, and how data flows
  • Technical Engineering Services: Drives engineering projects by active contribution to the application of engineering techniques
  • Innovation: Contributes to designs that implement new ideas to improve existing and new systems, processes, and services
  • Technical Writing: Writes basic documentation on how technology works
  • Technical Leadership: Collaborates with technical teams and utilizes system expertise to deliver technical solutions
  • Technology Strategy: Contributes to new and existing technology options that support business goals

Requirements:

  • 5+ years technical engineering experience, preferably in multiple technology focus areas
  • Expert understanding of AI/ML infrastructure components, or GPU-based systems – preferably in a high-availability, large scale environment
  • Hands-on experience with NVIDIA DGX servers, BasePOD architectures, and advanced GPU technologies
  • Proficient in Linux/UNIX environments, including scripting/automation tools (Bash, Python, Ansible, Terraform)
  • Understanding of AI infrastructure security best practices
  • Experience with container orchestration (Kubernetes, Docker) and GPU workload management tools
  • Strong knowledge of networking (InfiniBand/Ethernet) and storage solutions in AI/ML contexts

Nice to have:

  • Understanding of CI/CD pipelines using tools such as Git, Artifactory, Jenkins, etc.
  • Experience with AI/ML pipelines (PyTorch, TensorFlow, RAPIDS AI, or other deep learning frameworks)
  • Experience with configuring and using monitoring tools (e.g., Prometheus, Grafana, NVIDIA DCGM)

What we offer:

  • Competitive base salary and compensation package
  • Annual stock grant
  • Employee stock purchase plan
  • 401(k)
  • Access to free, year-round money coaches
  • Medical, dental and vision insurance
  • Flexible spending account
  • Paid time off
  • Paid holidays
  • Paid parental and family leave
  • Family building benefits
  • Back-up care
  • Enhanced family support
  • Childcare subsidy
  • Tuition assistance
  • College coaching
  • Short- and long-term disability
  • Voluntary AD&D coverage
  • Voluntary accident coverage
  • Voluntary life insurance
  • Voluntary disability insurance
  • Voluntary long-term care insurance
  • Mobile service & home internet discounts
  • Pet insurance
  • Access to commuter and transit programs

Additional Information:

Job Posted:
April 05, 2025

Employment Type:
Full-time
Work Type:
On-site work

Similar Jobs for Senior AI Infrastructure Engineer

Senior AI Engineer

As a Senior AI Engineer on our AI Engineering team, you will be responsible for ...
Location:
Canada; United States
Salary:
160000.00 - 260000.00 USD / Year
Apollo.io
Expiration Date:
Until further notice
Requirements:
  • 8+ years of software engineering experience with a focus on production systems
  • 1.5+ years of hands-on LLM experience (2023-present) building real applications with GPT, Claude, Llama, or other modern LLMs
  • Production LLM Applications: Demonstrated experience building customer-facing, scalable LLM-powered products with real user usage (not just POCs or internal tools)
  • Agent Development: Experience building multi-step AI agents, LLM chaining, and complex workflow automation
  • Prompt Engineering Expertise: Deep understanding of prompting strategies, few-shot learning, chain-of-thought reasoning, and prompt optimization techniques
  • Python Proficiency: Expert-level Python skills for production AI systems
  • Backend Engineering: Strong experience building scalable backend systems, APIs, and distributed architectures
  • LangChain or Similar Frameworks: Experience with LangChain, LlamaIndex, or other LLM application frameworks
  • API Integration: Proven ability to integrate multiple APIs and services to create advanced AI capabilities
  • Production Deployment: Experience deploying and managing AI models in cloud environments (AWS, GCP, Azure)
Job Responsibility:
  • Design and Deploy Production LLM Systems: Build scalable, reliable AI systems that serve millions of users with high availability and performance requirements
  • Agent Development: Create sophisticated AI agents that can chain multiple LLM calls, integrate with external APIs, and maintain state across complex workflows
  • Prompt Engineering Excellence: Develop and optimize prompting strategies, understand trade-offs between prompt engineering vs fine-tuning, and implement advanced prompting techniques
  • System Integration: Build robust APIs and integrate AI capabilities with existing Apollo infrastructure and external services
  • Evaluation & Quality Assurance: Implement comprehensive evaluation frameworks, A/B testing, and monitoring systems to ensure AI systems meet accuracy, safety, and reliability standards
  • Performance Optimization: Optimize for cost, latency, and scalability across different LLM providers and deployment scenarios
  • Cross-functional Collaboration: Work closely with product teams, backend engineers, and stakeholders to translate business requirements into technical AI solutions
What we offer:
  • equity
  • company bonus or sales commissions/bonuses
  • 401(k) plan
  • at least 10 paid holidays per year, flex PTO, and parental leave
  • employee assistance program and wellbeing benefits
  • global travel coverage
  • life/AD&D/STD/LTD insurance
  • FSA/HSA and medical, dental, and vision benefits
  • Fulltime

Senior Platform Engineer - CI/CD & AI Automation (AI-first)

Groupon is undergoing a critical platform transformation, modernizing its core d...
Location:
Czechia, Prague
Salary:
Not provided
Groupon
Expiration Date:
Until further notice
Requirements:
  • 5+ years of dedicated experience in Platform Engineering, DevOps, or Infrastructure roles
  • Deep expertise building, scaling, and migrating CI/CD systems, with strong practical experience in Jenkins and/or GitHub Actions
  • Expertise in scripting and automation (Python, Go, or Bash)
  • Solid understanding of container technologies, Kubernetes, and cloud build systems
  • Proven experience leveraging AI tooling (e.g., Claude Code, code analysis) to meaningfully increase developer output and optimize platform work
  • Excellent communication and ability to drive technical decisions across multiple platform and product teams
Job Responsibility:
  • Platform Transformation: Lead the design, planning, and execution of the Jenkins-to-GitHub Actions migration across a large portfolio of microservices
  • Pipeline Engineering: Design and optimize high-performance, secure, and observable CI/CD workflows across GitHub Actions, Jenkins, and Kubernetes environments
  • AI-First Automation: Drive an AI-First workflow by leveraging tools (e.g., Copilot, code generation) to eliminate infrastructure toil, accelerate development, and analyze pipeline failures
  • Core Automation: Develop robust platform automation (e.g., Python, Go, Bash) to improve build efficiency, artifact caching, reliability, and repository hygiene
  • Security & Compliance: Harden CI/CD infrastructure with robust controls for secrets management, RBAC, audit logging, and secure runner design
  • Observability: Implement and enhance CI/CD observability using tools like Prometheus, Grafana, and OpenTelemetry to provide deep insights into performance and reliability
  • Technical Leadership: Mentor engineers and partner across Cloud, Security, and Developer Experience teams to define and evolve our end-to-end delivery platform architecture

Senior Software Engineer - ML Infrastructure

We build simple yet innovative consumer products and developer APIs that shape h...
Location:
United States, San Francisco
Salary:
180000.00 - 270000.00 USD / Year
Plaid
Expiration Date:
Until further notice
Requirements:
  • 5+ years of industry experience as a software engineer, with strong focus on ML/AI infrastructure or large-scale distributed systems
  • Hands-on expertise in building and operating ML platforms (e.g., feature stores, data pipelines, training/inference frameworks)
  • Proven experience delivering reliable and scalable infrastructure in production
  • Solid understanding of ML Ops concepts and tooling, as well as best practices for observability, security, and reliability
  • Strong communication skills and ability to collaborate across teams
Job Responsibility:
  • Design and implement large-scale ML infrastructure, including feature stores, pipelines, deployment tooling, and inference systems
  • Drive the rollout of Plaid’s next-generation feature store to improve reliability and velocity of model development
  • Help define and evangelize an ML Ops “golden path” for secure, scalable model training, deployment, and monitoring
  • Ensure operational excellence of ML pipelines and services, including reliability, scalability, performance, and cost efficiency
  • Collaborate with ML product teams to understand requirements and deliver solutions that accelerate experimentation and iteration
  • Contribute to technical strategy and architecture discussions within the team
  • Mentor and support other engineers through code reviews, design discussions, and technical guidance
What we offer:
  • medical, dental, vision, and 401(k)
  • Fulltime

Senior Engineering Manager - AI Core Platform

We’re hiring a Senior Engineering Manager (or high-potential EM2) for the Core P...
Location:
Ireland, Dublin
Salary:
Not provided
Intercom
Expiration Date:
Until further notice
Requirements:
  • Experience leading engineering teams, ideally across infrastructure or platform domains
  • Recent hands-on coding experience — you’ve shipped production code in the last couple of years
  • Strong technical judgment and the ability to coach senior engineers through complex architectural trade-offs
  • Adaptable leadership style suited to a group that will grow quickly, and change shape over time
  • Curiosity and enthusiasm for AI, with a desire to learn how ML systems are developed and operated in production
Job Responsibility:
  • Lead a high-performing team building the platform and infrastructure that power Intercom’s AI capabilities
  • Contribute directly to production code, staying close to the work and building knowledge & context through first-hand experience
  • Support teams of ML Scientists and Engineers building AI powered capabilities
  • Plan, prioritize, and deliver high-impact roadmaps in partnership with the team’s most senior engineers, balancing delivery, quality, and innovation
  • Improve developer experience across the AI infrastructure stack, ensuring that systems are observable, scalable, and easy to build upon
  • Empower the engineers on the team to act with agency and maximize their impact
  • Expand your scope over time, potentially taking ownership of additional platform domains as the team and AI initiatives grow
What we offer:
  • Competitive salary and equity in a fast-growing start-up
  • We serve lunch every weekday, plus a variety of snack foods and a fully stocked kitchen
  • Regular compensation reviews - we reward great work
  • Pension scheme & match up to 4%
  • Peace of mind with life assurance, as well as comprehensive health and dental insurance for you and your dependents
  • Flexible paid time off policy
  • Paid maternity leave, as well as 6 weeks paternity leave for fathers, to let you spend valuable time with your loved ones
  • If you’re cycling, we’ve got you covered on the Cycle-to-Work Scheme. With secure bike storage too
  • MacBooks are our standard, but we also offer Windows for certain roles when needed
  • Fulltime

Senior AI Engineer

As a Senior AI Engineer on our AI Engineering team, you will be responsible for ...
Location:
India
Salary:
Not provided
Apollo.io
Expiration Date:
Until further notice
Requirements:
  • 7+ years of software engineering experience with a focus on production systems
  • 1.5+ years of hands-on LLM experience (2023-present) building real applications with GPT, Claude, Llama, or other modern LLMs
  • Demonstrated experience building customer-facing, scalable LLM-powered products with real user usage (not just POCs or internal tools)
  • Experience building multi-step AI agents, LLM chaining, and complex workflow automation
  • Deep understanding of prompting strategies, few-shot learning, chain-of-thought reasoning, and prompt optimization techniques
  • Expert-level Python skills for production AI systems
  • Strong experience building scalable backend systems, APIs, and distributed architectures
  • Experience with LangChain, LlamaIndex, or other LLM application frameworks
  • Proven ability to integrate multiple APIs and services to create advanced AI capabilities
  • Experience deploying and managing AI models in cloud environments (AWS, GCP, Azure)
Job Responsibility:
  • Design and Deploy Production LLM Systems: Build scalable, reliable AI systems that serve millions of users with high availability and performance requirements
  • Agent Development: Create sophisticated AI agents that can chain multiple LLM calls, integrate with external APIs, and maintain state across complex workflows
  • Prompt Engineering Excellence: Develop and optimize prompting strategies, understand trade-offs between prompt engineering vs fine-tuning, and implement advanced prompting techniques
  • System Integration: Build robust APIs and integrate AI capabilities with existing Apollo infrastructure and external services
  • Evaluation & Quality Assurance: Implement comprehensive evaluation frameworks, A/B testing, and monitoring systems to ensure AI systems meet accuracy, safety, and reliability standards
  • Performance Optimization: Optimize for cost, latency, and scalability across different LLM providers and deployment scenarios
  • Cross-functional Collaboration: Work closely with product teams, backend engineers, and stakeholders to translate business requirements into technical AI solutions
What we offer:
  • Invest deeply in your growth, ensuring you have the resources, support, and autonomy to own your role and make a real impact
  • Collaboration is at our core—we’re all for one, meaning you’ll have a team across departments ready to help you succeed
  • We encourage bold ideas and courageous action, giving you the freedom to experiment, take smart risks, and drive big wins

Senior Engineering Manager - AI

We are seeking a Senior Engineering Manager (Level 5) to lead a high-performing ...
Location:
India, Chennai
Salary:
Not provided
Arcadia
Expiration Date:
Until further notice
Requirements:
  • 15+ years of professional experience in software engineering
  • At least 4 years in engineering leadership roles
  • Strong technical background in AI/ML systems, large-scale data pipelines, and cloud-native platforms
  • Hands-on experience with Python (preferred), modern ML frameworks (PyTorch/TensorFlow), and cloud services (AWS)
  • Proven success in managing teams of 4–6 engineers, scaling processes, and building diverse, high-performance teams
  • Strong architectural design and system-thinking abilities
  • Excellent communication skills with ability to influence cross-functional stakeholders
  • Passion for sustainability, decarbonization, and using technology to create positive climate impact
  • Experienced with building agentic pipelines with the latest models from Anthropic, Google, OpenAI, and more
Job Responsibility:
  • Lead and grow a team of engineers focused on building AI-driven and data-intensive systems for the Arcadia platform
  • Design and train ML/AI models (forecasting, NLP, graph learning, generative AI) to improve data quality, cost effectiveness, and system scalability
  • Build true agentic workflows with multi-step processing incorporating RAG pipelines and MCPs
  • Balance management responsibilities (hiring, coaching, performance reviews, career growth) with technical leadership (architecture, system design, technical strategy)
  • Drive end-to-end delivery of complex projects in partnership with Product, Data, and Infrastructure teams
  • Guide the adoption of modern AI/ML technologies, ensuring practical, scalable use in production
  • Foster a culture of high performance, ownership, and technical excellence
  • Establish engineering best practices in testing, observability, reliability, and CI/CD
  • Partner with leadership to define roadmaps, set priorities, and align execution with Arcadia’s strategic goals
  • Represent AI across the company, articulating technical trade-offs and championing innovation
What we offer:
  • Competitive compensation and employee stock options
  • Hybrid/remote-first working model (India-based role, with global collaboration)
  • Flexible leave policy
  • Comprehensive medical insurance (self + family members)
  • Annual performance cycle + quarterly recognition awards
  • A supportive, diverse engineering culture grounded in empathy, teamwork, and innovation
  • Fulltime

Senior AI Engineer

We are seeking an experienced Senior Python Software Engineer (Senior AI Develop...
Location:
Poland, Warsaw
Salary:
Not provided
Inetum
Expiration Date:
Until further notice
Requirements:
  • Degree in Computer Science, Data Science, Artificial Intelligence, or a related field, or equivalent practical experience
  • Several years of experience in AI and Machine Learning development, ideally within Customer Care solutions
  • Strong proficiency in Python and NLP frameworks
  • Hands-on experience with Azure AI services (e.g., Azure Machine Learning, Cognitive Services, Bot Services)
  • Solid understanding of cloud architectures and microservices on Azure
  • Experience with CI/CD pipelines and MLOps
  • Analytical mindset and strong problem-solving capabilities
  • Polish & English speaker
Job Responsibility:
  • Design, develop, and integrate AI/ML solutions, with a particular focus on Generative AI (GenAI), LLMs, and multi-modal (chat, voice) interfaces
  • Architect and deliver customer-facing AI agents that provide real-time, intelligent automation for support, marketing, or transactional use cases
  • Build and maintain multi-model pipelines for inference, fine-tuning, chunking, and embedding-based retrieval (RAG) systems
  • Deploy, monitor, and optimize AI models in production-grade environments using Kubernetes and Azure-native services
  • Integrate GenAI agents with cross-company APIs, backend services, and partner systems through MCP for dynamic tool use and data enrichment
  • Collaborate closely with DevOps engineers to implement scalable CI/CD pipelines, infrastructure-as-code, and secure AI workload automation
  • Evaluate and integrate open-source and proprietary LLMs, embeddings, and vector databases
  • Optimize prompt engineering strategies and implement orchestration tools (e.g., LangChain, MCP) to enable complex task execution
  • Build robust model evaluation frameworks, A/B testing environments, and experiment tracking for iterative development
  • Design privacy-first AI workflows that comply with GDPR, anonymization, and auditability (e.g., PII scrubbing, user consent)
What we offer:
  • Flexible working hours
  • Hybrid work model, allowing employees to divide their time between home and modern offices in key Polish cities
  • A cafeteria system that allows employees to personalize benefits by choosing from a variety of options
  • Generous referral bonuses, offering up to PLN 6,000 for referring specialists
  • Additional revenue sharing opportunities for initiating partnerships with new clients
  • Ongoing guidance from a dedicated Team Manager for each employee
  • Tailored technical mentoring from an assigned technical leader, depending on individual expertise and project needs
  • Dedicated team-building budget for online and on-site team events
  • Opportunities to participate in charitable initiatives and local sports programs
  • A supportive and inclusive work culture with an emphasis on diversity and mutual respect
  • Fulltime

Senior Cloud Infrastructure Engineer

We are looking for a Senior Cloud Infrastructure Engineer to join our Infrastruc...
Location:
Germany, Berlin
Salary:
Not provided
Babbel
Expiration Date:
Until further notice
Requirements:
  • 5+ years of experience in infrastructure engineering or a related role
  • Extensive knowledge of AWS services
  • Proficiency in Terraform for infrastructure as code
  • Experience with Okta for identity and access management
  • Expertise with AI tooling and infrastructure, including LLM-powered development tools (e.g., Cursor) and agentic AI systems, with the ability to apply them effectively to accelerate engineering workflows and improve productivity
  • Comprehensive understanding of networking protocols and technologies
  • Proficiency in programming languages (Ruby, Node.js, TypeScript, Go, Bash)
  • Security-first mindset: IAM least-privilege, secrets management, network hardening
  • Excellent troubleshooting and problem-solving abilities
  • Strong communication and collaboration skills
Job Responsibility:
  • Design, implement, and manage cloud-based infrastructure on AWS, GCP and other cloud providers
  • Develop and maintain infrastructure as code using Terraform
  • Integrate and manage identity and access management solutions with Okta
  • Frequent operational work including Terraform pull request reviews and support for other teams
  • Ensure high availability, reliability, and security of systems
  • Troubleshoot and resolve complex infrastructure issues
  • Automate infrastructure deployment and management tasks
  • Develop and maintain documentation for infrastructure processes and procedures
  • Collaborate with product-engineering teams to support application deployments
  • Participate in on-call rotation for after-hours support
What we offer:
  • 30 vacation days
  • 3-month Sabbatical
  • family and life situation counseling
  • flexible working hours and remote-friendly options
  • Jobbatical (up to 3 months inside the EU and UK)
  • fully equipped office with nap, faith and family rooms
  • internal learning opportunities
  • yearly learning & development budget for external training
  • full access to Babbel & Babbel Live classes
  • mobility benefits options
  • Fulltime