Senior AI Infrastructure Engineer

Together AI

Location:
Amsterdam, Netherlands

Contract Type:
Not provided

Salary:

Not provided

Job Description:

Together AI is building the AI Acceleration Cloud, an end-to-end platform for the full generative AI lifecycle that combines the fastest LLM inference engine with state-of-the-art AI cloud infrastructure. As a Senior AI Infrastructure Engineer, you will play a key role in building the next-generation AI cloud platform: a highly available, global, blazing-fast cloud infrastructure that virtualizes cutting-edge ML hardware (GB200s/GB300s, BlueField DPUs) and gives ML practitioners self-serve AI cloud services, such as on-demand and managed Kubernetes and Slurm clusters. This platform serves both our internal SaaS products (inference, fine-tuning) and our external cloud customers, and spans dozens of data centers across the world.

Job Responsibilities:

  • Design, build, and maintain performant, secure, and highly available backend services and operators that run in our data centers and automate hardware management, such as InfiniBand partitioning, in-DC parallel storage provisioning, and VM provisioning
  • Design and build out the IaaS software layer for a new GB200 data center with thousands of GPUs
  • Work on a global multi-exabyte high-performance object store, serving massive datasets for pretraining
  • Build advanced observability stacks for our customers with automated node lifecycle management for fault-tolerant distributed pretraining
  • Perform architecture and research work for decentralized AI workloads
  • Work on the core, open-source Together AI platform
  • Create services, tools, and developer documentation
  • Create testing frameworks for robustness and fault-tolerance

Requirements:

  • 5+ years of professional software development experience and proficiency in at least one backend programming language (Golang desired)
  • 5+ years of experience writing high-performance, well-tested, production-quality code
  • Demonstrated experience building and operating high-performance and/or globally distributed microservice architectures across one or more cloud providers (AWS, Azure, GCP)
  • Excellent communication skills – able to write clear design docs and work effectively with both technical and non-technical team members
  • Strong systems knowledge across compute, networking, and storage, including concurrency, memory management, performant I/O, and scale
  • Experience with infrastructure automation tools (Terraform, Ansible), monitoring/observability stacks (Prometheus, Grafana), and CI/CD pipelines (GitHub Actions, ArgoCD)

Nice to have:

  • Deep experience with Kubernetes internals a big plus, such as implementing non-trivial Kubernetes operators, device/storage/network plugins, or custom schedulers, or contributing patches to these or to Kubernetes itself
  • Deep experience with VMs/hypervisors a big plus, such as QEMU/KVM, cloud-hypervisor, VFIO, virtio, PCIe passthrough, KubeVirt, SR-IOV
  • Deep experience with data center networking technologies and solutions a big plus, such as VLAN, VXLAN, VPN, VPC, OVS/OVN
  • Experience with Cluster API or similar a big plus
  • Experience working on high-performance compute, networking, and/or storage a big plus
  • Experience virtualizing GPUs and/or InfiniBand a big plus
  • Experience building IaaS or PaaS systems at scale a plus
  • Experience with DPUs/SmartNICs a plus
  • GPU programming, NCCL, CUDA knowledge a plus

Additional Information:

Job Posted:
February 18, 2026

Employment Type:
Full-time
Work Type:
Hybrid work
