
Manager, Software Development (Hands-On Technical), ML Network Stack

United States, Seattle · Employment contract · Job Posted June 09, 2026

Job Description

We are hiring a hands-on Software Development Manager for the team that owns the network stack for EC2 distributed AI/ML systems. The team develops support for a variety of frameworks and communication libraries including NCCL, NVSHMEM, NIXL, NCCL GIN, Perplexity kernels and others. We are seeking an experienced engineering manager for a mid-sized team, with multiple years of hands-on experience in systems programming, HW/SW co-design, and familiarity with networking (HPC networking preferred). Experience with the NVIDIA stack, ML applications, and frameworks will be highly regarded. You'll be leading senior, mid-level, and junior SDEs and directing work to ensure the team delivers functions and features required for the latest and largest ML workloads.

Requirements

  • 3+ years of engineering team management experience
  • Knowledge of engineering practices and patterns for the full software/hardware/networks development life cycle, including coding standards, code reviews, source control management, build processes, testing, certification, and livesite operations
  • Experience partnering with product or program management teams
  • 3+ years of C or C++ or Rust development experience
  • 5+ years of hands-on engineering experience, maintaining active programming proficiency

Nice to have

  • Experience in communicating with users, other technical teams, and senior leadership to collect requirements, describe software product features, technical designs, and product strategy
  • Experience in recruiting, hiring, mentoring/coaching, and managing teams of Software Engineers to improve their skills and make them more effective product software engineers

What we offer

  • Health insurance (medical, dental, vision, prescription, Basic Life & AD&D insurance)
  • Option for Supplemental life plans
  • EAP
  • Mental Health Support
  • Medical Advice Line
  • Flexible Spending Accounts
  • Adoption and Surrogacy Reimbursement coverage
  • 401(k) matching
  • Paid time off
  • Parental leave
  • Sign-on payments
  • Restricted stock units (RSUs)


Similar Jobs for Manager, Software Development (Hands-On Technical), ML Network Stack

8 matching positions

Manager, Software Development (Hands-On Technical), ML Network Stack

We are hiring a hands-on Software Development Manager for the team that owns the...
Location
United States, Cupertino; Seattle
Salary
184900.00 - 287700.00 USD / Year
Amazon Pforzheim GmbH
Expiration Date
Until further notice
Requirements
  • 3+ years of engineering team management experience
  • Knowledge of engineering practices and patterns for the full software/hardware/networks development life cycle, including coding standards, code reviews, source control management, build processes, testing, certification, and livesite operations
  • Experience partnering with product or program management teams
  • 3+ years of C or C++ or Rust development experience
  • 5+ years of hands-on engineering experience, maintaining active programming proficiency
Job Responsibility
  • Leading senior, mid-level, and junior SDEs and directing work to ensure the team delivers functions and features required for the latest and largest ML workloads
What we offer
  • Health insurance (medical, dental, vision, prescription, Basic Life & AD&D insurance and option for Supplemental life plans, EAP, Mental Health Support, Medical Advice Line, Flexible Spending Accounts, Adoption and Surrogacy Reimbursement coverage)
  • 401(k) matching
  • paid time off
  • parental leave
  • sign-on payments
  • restricted stock units (RSUs)
  • Fulltime

Manager, Software Development (Hands-On Technical), ML Network Stack - Annapurna Labs

We are hiring a hands-on Software Development Manager for the team that owns the...
Location
Israel, Tel Aviv
Salary
Not provided
Amazon Pforzheim GmbH
Expiration Date
Until further notice
Requirements
  • 5+ years of engineering team management experience
  • Knowledge of engineering practices and patterns for the full software/hardware/networks development life cycle, including coding standards, code reviews, source control management, build processes, testing, certification, and livesite operations
  • Experience partnering with product or program management teams
  • 3+ years of C or C++ or Rust development experience
  • 5+ years of hands-on engineering experience, maintaining active programming proficiency
Job Responsibility
  • We are hiring a hands-on Software Development Manager for the team that owns the network stack for EC2 distributed AI/ML systems
  • The team develops support for a variety of frameworks and communication libraries including NCCL, NVSHMEM, NIXL, NCCL GIN, Perplexity kernels and others
  • You'll be leading senior, mid-level, and junior SDEs and directing work to ensure the team delivers functions and features required for the latest and largest ML workloads
What we offer
  • Work/Life Balance
  • Mentorship & Career Growth
  • Fulltime

Software Engineering IC5

The CoreAI Infrastructure team builds the foundational accelerated compute platf...
Location
United States, Redmond
Salary
142800.00 - 274800.00 USD / Year
Microsoft Corporation
Expiration Date
Until further notice
Requirements
  • Bachelor's Degree in Computer Science or related technical field and 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, Python or equivalent experience
  • Proven ability to design and operate large-scale production infrastructure with high reliability and performance requirements using Azure Kubernetes Service (AKS)
  • Strong problem-solving skills and the ability to debug complex, cross-layer systems issues
  • Demonstrated technical leadership, including mentoring engineers and driving cross-team architectural alignment
  • Hands-on experience with virtualization and/or container platforms (e.g., VMs, Kubernetes, container runtimes)
  • Strong collaboration and communication skills, with the ability to work across organizational boundaries
  • Expertise with distributed observability technologies (e.g., Prometheus, OpenTelemetry, Grafana) and experience designing or scaling telemetry pipelines for high-throughput production systems
  • Advanced, hands-on experience with production ML systems, large-scale training infrastructure, NCCL, CUDA libraries and tools
Job Responsibility
  • Design and build GPU- and CPU-accelerated infrastructure for training and inference workloads, spanning bare metal, virtual machines, and containerized environments, with a focus on key observability metrics at scale
  • Develop end-to-end observability and operational excellence systems for GPU/CPU device management, scheduling, isolation, and sharing (e.g., partial GPU allocation, multi-tenant usage)
  • Build and operate advanced orchestration and resource governance and management scenarios using platforms such as AKS, Dynamic Resource Allocation (DRA), and related Kubernetes ecosystem capabilities to enable fair sharing, isolation, and efficient utilization of accelerated resources
  • Build and evolve virtualization and container stacks to support modern AI workloads, including secure and confidential compute scenarios
  • Optimize performance, reliability, and utilization across large GPU/CPU fleets, including scale-up and scale-out configurations
  • Partner with networking and storage teams to enable high-performance interconnects (e.g., RDMA/InfiniBand-class networking) for distributed workloads
  • Drive end-to-end platform features from design through production, including observability, diagnostics, and operational excellence
  • Influence platform architecture and technical direction across teams through design reviews and technical leadership
  • Fulltime

Lead Information Security Engineer - Python Full Stack Developer

Wells Fargo is seeking a Lead Information Security Engineer.
Location
India, Hyderabad
Salary
Not provided
Wells Fargo
Expiration Date
June 29, 2026
Requirements
  • 5+ years of Information Security Engineering experience, or equivalent demonstrated through one or a combination of the following: work experience, training, military experience, education
  • 5+ years of experience in Software Engineering, Data Engineering, or backend Python development and backend architecture
  • Expert-level knowledge of Python internals, concurrency (asyncio/multiprocessing), and building high-performance, memory-efficient applications
  • Proven expertise in designing and governing enterprise-grade CI/CD pipelines and managing complex code promotions across multi-region environments using GitHub Actions, GitLab, or Azure DevOps
  • Extensive hands-on experience with Apache Kafka (or Confluent), including cluster tuning, schema registry management and designing event driven architectures
  • Deep experience with Grafana and Prometheus for full stack observability – defining SLIs/SLOs, custom exporters and complex alerting logic
  • Strong understanding of the end-to-end ML life cycle, specifically the deployment and scaling of models using frameworks like BentoML, Ray, or KServe
  • Experience in SQL, data modelling, ETL/ELT pipelines, and large-scale data processing
  • Good to have: knowledge of Terraform, Pulumi, and container orchestration (Kubernetes, EKS)
Job Responsibility
  • Lead computer security incident response activities for highly complex events
  • Conduct technical investigation of security related incidents and post incident digital forensics to identify causes and recommend future mitigation strategies
  • Provide security consulting on large projects for internal clients to ensure conformity with corporate information, security policy, and standards
  • Design, document, test, maintain, and provide issue resolution recommendations for highly complex security solutions related to networking, cryptography, cloud, authentication and directory services, email, internet, applications, and endpoint security
  • Review and correlate security logs
  • Utilize subject matter knowledge in industry leading security solutions and best practices to implement one or more components of information security such as availability, integrity, confidentiality, risk management, threat identification, modeling, monitoring, incident response, access management, and business continuity
  • Identify security vulnerabilities and issues, perform risk assessments, and evaluate remediation alternatives
  • Collaborate and influence all levels of professionals including managers
  • Lead a team to achieve objectives
  • Lead the development of mission-critical Python services, ensuring high availability and low-latency performance
  • Fulltime

Staff Software Engineer - AI/ML Infra

GEICO AI platform and Infrastructure team is seeking an exceptional Senior ML Pl...
Location
United States, Chevy Chase; New York City; Palo Alto
Salary
115000.00 USD / Year
Geico
Expiration Date
Until further notice
Requirements
  • Bachelor’s degree in Computer Science, Engineering, or a related technical field (or equivalent experience)
  • 8+ years of software engineering experience with focus on infrastructure, platform engineering, or MLOps
  • 3+ years of hands-on experience with machine learning infrastructure and deployment at scale
  • 2+ years of experience working with Large Language Models and transformer architectures
  • Proficient in Python; strong skills in Go, Rust, or Java preferred
  • Proven experience working with open source LLMs (Llama 2/3, Qwen, Mistral, Gemma, Code Llama, etc.)
  • Proficient in Kubernetes including custom operators, helm charts, and GPU scheduling
  • Deep expertise in Azure services (AKS, Azure ML, Container Registry, Storage, Networking)
  • Experience implementing and operating feature stores (Chronon, Feast, Tecton, Azure ML Feature Store, or custom solutions)
Job Responsibility
  • Design and implement scalable infrastructure for training, fine-tuning, and serving open source LLMs (Llama, Mistral, Gemma, etc.)
  • Architect and manage Kubernetes clusters for ML workloads, including GPU scheduling, autoscaling, and resource optimization
  • Design, implement, and maintain feature stores for ML model training and inference pipelines
  • Build and optimize LLM inference systems using frameworks like vLLM, TensorRT-LLM, and custom serving solutions
  • Ensure 99.9%+ uptime for ML platforms through robust monitoring, alerting, and incident response procedures
  • Design and implement ML platforms using DataRobot, Azure Machine Learning, Azure Kubernetes Service (AKS), and Azure Container Instances
  • Develop and maintain infrastructure using Terraform, ARM templates, and Azure DevOps
  • Implement cost-effective solutions for GPU compute, storage, and networking across Azure regions
  • Ensure ML platforms meet enterprise security standards and regulatory compliance requirements
  • Evaluate and potentially implement hybrid cloud solutions with AWS/GCP as backup or specialized use cases
What we offer
  • Comprehensive Total Rewards program that offers personalized coverage tailor-made for you and your family’s overall well-being
  • Financial benefits including market-competitive compensation, a 401(k) savings plan vested from day one that offers a 6% match, performance and recognition-based incentives, and tuition assistance
  • Access to additional benefits like mental healthcare as well as fertility and adoption assistance
  • Supports flexibility: we provide workplace flexibility as well as our GEICO Flex program, which offers the ability to work from anywhere in the US for up to four weeks per year
  • Fulltime

Staff Software Engineer - AI/ML Platform

GEICO AI platform and Infrastructure team is seeking an exceptional Senior ML Pl...
Location
United States, Chevy Chase; New York City; Palo Alto
Salary
115000.00 USD / Year
Geico
Expiration Date
Until further notice
Requirements
  • Bachelor’s degree in Computer Science, Engineering, or a related technical field (or equivalent experience)
  • 8+ years of software engineering experience with focus on infrastructure, platform engineering, or MLOps
  • 3+ years of hands-on experience with machine learning infrastructure and deployment at scale
  • 2+ years of experience working with Large Language Models and transformer architectures
  • Proficient in Python; strong skills in Go, Rust, or Java preferred
  • Proven experience working with open source LLMs (Llama 2/3, Qwen, Mistral, Gemma, Code Llama, etc.)
  • Proficient in Kubernetes including custom operators, helm charts, and GPU scheduling
  • Deep expertise in Azure services (AKS, Azure ML, Container Registry, Storage, Networking)
  • Experience implementing and operating feature stores (Chronon, Feast, Tecton, Azure ML Feature Store, or custom solutions)
Job Responsibility
  • Design and implement scalable infrastructure for training, fine-tuning, and serving open source LLMs (Llama, Mistral, Gemma, etc.)
  • Architect and manage Kubernetes clusters for ML workloads, including GPU scheduling, autoscaling, and resource optimization
  • Design, implement, and maintain feature stores for ML model training and inference pipelines
  • Build and optimize LLM inference systems using frameworks like vLLM, TensorRT-LLM, and custom serving solutions
  • Ensure 99.9%+ uptime for ML platforms through robust monitoring, alerting, and incident response procedures
  • Design and implement ML platforms using DataRobot, Azure Machine Learning, Azure Kubernetes Service (AKS), and Azure Container Instances
  • Develop and maintain infrastructure using Terraform, ARM templates, and Azure DevOps
  • Implement cost-effective solutions for GPU compute, storage, and networking across Azure regions
  • Ensure ML platforms meet enterprise security standards and regulatory compliance requirements
  • Evaluate and potentially implement hybrid cloud solutions with AWS/GCP as backup or specialized use cases
What we offer
  • Comprehensive Total Rewards program that offers personalized coverage tailor-made for you and your family’s overall well-being
  • Financial benefits including market-competitive compensation, a 401(k) savings plan vested from day one that offers a 6% match, performance and recognition-based incentives, and tuition assistance
  • Access to additional benefits like mental healthcare as well as fertility and adoption assistance
  • Supports flexibility: we provide workplace flexibility as well as our GEICO Flex program, which offers the ability to work from anywhere in the US for up to four weeks per year
  • Fulltime

Staff Software Engineer - AI/ML Infra

GEICO AI platform and Infrastructure team is seeking an exceptional Senior ML Pl...
Location
United States, Palo Alto
Salary
90000.00 USD / Year
Geico
Expiration Date
Until further notice
Requirements
  • Bachelor’s degree in Computer Science, Engineering, or a related technical field (or equivalent experience)
  • 8+ years of software engineering experience with focus on infrastructure, platform engineering, or MLOps
  • 3+ years of hands-on experience with machine learning infrastructure and deployment at scale
  • 2+ years of experience working with Large Language Models and transformer architectures
  • Proficient in Python; strong skills in Go, Rust, or Java preferred
  • Proven experience working with open source LLMs (Llama 2/3, Qwen, Mistral, Gemma, Code Llama, etc.)
  • Proficient in Kubernetes including custom operators, helm charts, and GPU scheduling
  • Deep expertise in Azure services (AKS, Azure ML, Container Registry, Storage, Networking)
  • Experience implementing and operating feature stores (Chronon, Feast, Tecton, Azure ML Feature Store, or custom solutions)
Job Responsibility
  • Design and implement scalable infrastructure for training, fine-tuning, and serving open source LLMs (Llama, Mistral, Gemma, etc.)
  • Architect and manage Kubernetes clusters for ML workloads, including GPU scheduling, autoscaling, and resource optimization
  • Design, implement, and maintain feature stores for ML model training and inference pipelines
  • Build and optimize LLM inference systems using frameworks like vLLM, TensorRT-LLM, and custom serving solutions
  • Ensure 99.9%+ uptime for ML platforms through robust monitoring, alerting, and incident response procedures
  • Design and implement ML platforms using DataRobot, Azure Machine Learning, Azure Kubernetes Service (AKS), and Azure Container Instances
  • Develop and maintain infrastructure using Terraform, ARM templates, and Azure DevOps
  • Implement cost-effective solutions for GPU compute, storage, and networking across Azure regions
  • Ensure ML platforms meet enterprise security standards and regulatory compliance requirements
  • Evaluate and potentially implement hybrid cloud solutions with AWS/GCP as backup or specialized use cases
What we offer
  • Comprehensive Total Rewards program that offers personalized coverage tailor-made for you and your family’s overall well-being
  • Financial benefits including market-competitive compensation, a 401(k) savings plan vested from day one that offers a 6% match, performance and recognition-based incentives, and tuition assistance
  • Access to additional benefits like mental healthcare as well as fertility and adoption assistance
  • Supports flexibility: we provide workplace flexibility as well as our GEICO Flex program, which offers the ability to work from anywhere in the US for up to four weeks per year
  • Fulltime

Head of Technology, Intelligence Ventures

The Head of Technology serves as the chief technology officer of a new behaviora...
Location
United States, New York
Salary
263200.00 - 393800.00 USD / Year
Spectrum
Expiration Date
Until further notice
Requirements
  • Deep expertise across the full intelligence platform stack — including distributed data pipelines, ML platform architecture, embedding systems, feature stores, agent-to-agent API design, and LLM-powered application layers — with demonstrated ability to architect and ship all layers as a coherent, production-grade product
  • Demonstrated experience architecting and operating consumer data intelligence or data product platforms underpinned by complex machine learning systems and built on modern, cloud-native data infrastructure
  • Hands-on proficiency with Snowflake (including Cortex, Native Apps, and data sharing frameworks), cloud data platforms (AWS, Azure, or GCP), and production ML/AI systems at scale
  • Experience building agentic AI systems and LLM-powered product interfaces — including agent-to-agent APIs, retrieval-augmented generation architectures, and natural language UIs grounded in proprietary data — with strong product instincts around accuracy, trust, and user experience for non-technical enterprise audiences
  • Proven ability to translate complex technical architecture into clear executive and partner-facing communications; comfortable engaging at the C-suite level and in strategic partner negotiations with hyperscalers and technology platforms
  • Strong understanding of privacy-preserving data architecture, including differential privacy, de-identification techniques, zero-copy and clean room frameworks, and the regulatory landscape governing consumer behavioral data
  • Track record of recruiting and developing exceptional engineering talent in competitive markets; experience building high-performance teams from early-stage through scaled operations
  • Experience managing external development partners and outsourced engineering resources alongside an internal team in a fast-moving, build-from-scratch environment
Job Responsibility
  • Own the end-to-end technical architecture of the platform — from large-scale network signal ingestion and processing through behavioral embedding generation, feature store construction, and zero-copy intelligence delivery to enterprise partners — ensuring the platform is production-grade, built for household-scale throughput, and designed for long-term extensibility across new signal sources and use cases
  • Lead the build of the platform’s cloud-native data and application infrastructure — including ingestion and transformation pipelines, ML/AI compute environments, zero-copy partner access frameworks, and a real-time agent-to-agent API layer that enables external AI systems (marketing agents, commerce agents, customer service agents) to query household-level intelligence and receive contextually grounded responses
  • Serve as the technical lead in strategic partner integrations with cloud, data, and AI platform providers, ensuring each integration is architecturally differentiated and aligned with the platform’s cloud-agnostic, API-first design principles
  • Architect and deliver a business-facing agentic intelligence interface — a natural language UI that allows non-technical marketers, planners, and business users to query household behavioral intelligence, surface demand signals, and take action without requiring data or engineering support
  • Build and manage a world-class engineering organization capable of competing for talent with the leading technology and data infrastructure companies
  • Fulltime