
Manager, Software Development (Hands-On Technical), ML Network Stack


Amazon Pforzheim GmbH

Location:
United States, Cupertino


Contract Type:
Employment contract


Salary:

184900.00 - 287700.00 USD / Year

Job Description:

We are hiring a hands-on Software Development Manager for the team that owns the network stack for EC2 distributed AI/ML systems. The team develops support for a variety of frameworks and communication libraries including NCCL, NVSHMEM, NIXL, NCCL GIN, Perplexity kernels, and others. We are seeking an experienced engineering manager for a mid-sized team, with multiple years of hands-on experience in systems programming and HW/SW co-design, and familiarity with networking (HPC networking preferred). Experience with the NVIDIA stack, ML applications, and frameworks will be highly regarded. You'll be leading senior, mid-level, and junior SDEs and directing work to ensure the team delivers the functions and features required for the latest and largest ML workloads.

Job Responsibility:

Leading senior, mid-level, and junior SDEs and directing work to ensure the team delivers functions and features required for the latest and largest ML workloads

Requirements:

  • 3+ years of engineering team management experience
  • Knowledge of engineering practices and patterns for the full software/hardware/networks development life cycle, including coding standards, code reviews, source control management, build processes, testing, certification, and livesite operations
  • Experience partnering with product or program management teams
  • 3+ years of C or C++ or Rust development experience
  • 5+ years of hands-on engineering experience, maintaining active programming proficiency

Nice to have:

  • Experience in communicating with users, other technical teams, and senior leadership to collect requirements, describe software product features, technical designs, and product strategy
  • Experience in recruiting, hiring, mentoring/coaching, and managing teams of Software Engineers to improve their skills and make them more effective product software engineers

What we offer:
  • Health insurance (medical, dental, vision, prescription, Basic Life & AD&D insurance and option for Supplemental life plans, EAP, Mental Health Support, Medical Advice Line, Flexible Spending Accounts, Adoption and Surrogacy Reimbursement coverage)
  • 401(k) matching
  • Paid time off
  • Parental leave
  • Sign-on payments
  • Restricted stock units (RSUs)

Additional Information:

Job Posted:
May 03, 2026

Employment Type:
Full-time
Work Type:
On-site work



Similar Jobs for Manager, Software Development (Hands-On Technical), ML Network Stack

Senior Machine Learning Engineer (Infrastructure)

We are looking for an experienced MLOps Engineer to join our team as a Senior Ma...
Location:
United States, Boston
Salary:
152800.00 - 224100.00 USD / Year
SimpliSafe
Expiration Date
Until further notice
Requirements
  • 5+ years of experience in software engineering, data engineering, or a related field, with at least 3 years focused on MLOps or ML infrastructure
  • Deep hands-on experience with AWS or similar public clouds, including compute, networking, container orchestration, and observability stacks
  • Hands-on experience with CI/CD pipelines, Docker, Kubernetes, and infrastructure-as-code tools (e.g., Terraform, CloudFormation)
  • Proficiency in programming languages like Python, and familiarity with machine learning frameworks (e.g., TensorFlow, PyTorch)
  • Solid understanding of ML lifecycle management, including experiment tracking, versioning, and monitoring
  • LLM application development, including prompt engineering and evaluation
  • Strong communication skills for partnering with cross-functional technical and non-technical teams
Job Responsibility
  • Lead the architecture, deployment, and optimization of scalable ML model serving systems for real-time and batch use cases
  • Collaborate with data scientists, engineers, and stakeholders to operationalize ML models
  • Develop CI/CD pipelines for ML models enabling rapid, safe, and consistent model releases
  • Design, implement, and own comprehensive production monitoring for ML models/systems
  • Manage cloud infrastructure, primarily in AWS or other major public clouds, to support ML workloads
  • Drive best practices in model versioning, observability, reproducibility, and deployment reliability
  • Serve in an on-call rotation as a first responder for software owned by your team
What we offer
  • A mission- and values-driven culture and a safe, inclusive environment where you can build, grow and thrive
  • A comprehensive total rewards package that supports your wellness and provides security for SimpliSafers and their families
  • Free SimpliSafe system and professional monitoring for your home
  • Employee Resource Groups (ERGs) that bring people together, give opportunities to network, mentor and develop, and advocate for change
  • Participation in our annual bonus program, equity, and other forms of compensation
  • A full range of medical, retirement, and lifestyle benefits
Employment Type: Full-time

Staff Software Engineer - AI/ML Infra

GEICO AI platform and Infrastructure team is seeking an exceptional Senior ML Pl...
Location:
United States, Palo Alto
Salary:
90000.00 - 300000.00 USD / Year
Geico
Expiration Date
Until further notice
Requirements
  • Bachelor’s degree in computer science, Engineering, or related technical field (or equivalent experience)
  • 8+ years of software engineering experience with focus on infrastructure, platform engineering, or MLOps
  • 3+ years of hands-on experience with machine learning infrastructure and deployment at scale
  • 2+ years of experience working with Large Language Models and transformer architectures
  • Proficient in Python; strong skills in Go, Rust, or Java preferred
  • Proven experience working with open source LLMs (Llama 2/3, Qwen, Mistral, Gemma, Code Llama, etc.)
  • Proficient in Kubernetes including custom operators, helm charts, and GPU scheduling
  • Deep expertise in Azure services (AKS, Azure ML, Container Registry, Storage, Networking)
  • Experience implementing and operating feature stores (Chronon, Feast, Tecton, Azure ML Feature Store, or custom solutions)
Job Responsibility
  • Design and implement scalable infrastructure for training, fine-tuning, and serving open source LLMs (Llama, Mistral, Gemma, etc.)
  • Architect and manage Kubernetes clusters for ML workloads, including GPU scheduling, autoscaling, and resource optimization
  • Design, implement, and maintain feature stores for ML model training and inference pipelines
  • Build and optimize LLM inference systems using frameworks like vLLM, TensorRT-LLM, and custom serving solutions
  • Ensure 99.9%+ uptime for ML platforms through robust monitoring, alerting, and incident response procedures
  • Design and implement ML platforms using DataRobot, Azure Machine Learning, Azure Kubernetes Service (AKS), and Azure Container Instances
  • Develop and maintain infrastructure using Terraform, ARM templates, and Azure DevOps
  • Implement cost-effective solutions for GPU compute, storage, and networking across Azure regions
  • Ensure ML platforms meet enterprise security standards and regulatory compliance requirements
  • Evaluate and potentially implement hybrid cloud solutions with AWS/GCP as backup or specialized use cases
What we offer
  • Comprehensive Total Rewards program that offers personalized coverage tailor-made for you and your family’s overall well-being
  • Financial benefits including market-competitive compensation, a 401(k) savings plan vested from day one that offers a 6% match, performance and recognition-based incentives, and tuition assistance
  • Access to additional benefits like mental healthcare as well as fertility and adoption assistance
  • Workplace flexibility, including our GEICO Flex program, which offers the ability to work from anywhere in the US for up to four weeks per year
Employment Type: Full-time

Staff Software Engineer - AI/ML Platform

GEICO AI platform and Infrastructure team is seeking an exceptional Senior ML Pl...
Location:
United States, Chevy Chase; New York City; Palo Alto
Salary:
115000.00 - 300000.00 USD / Year
Geico
Expiration Date
Until further notice
Requirements
  • Bachelor’s degree in computer science, Engineering, or related technical field (or equivalent experience)
  • 8+ years of software engineering experience with focus on infrastructure, platform engineering, or MLOps
  • 3+ years of hands-on experience with machine learning infrastructure and deployment at scale
  • 2+ years of experience working with Large Language Models and transformer architectures
  • Proficient in Python; strong skills in Go, Rust, or Java preferred
  • Proven experience working with open source LLMs (Llama 2/3, Qwen, Mistral, Gemma, Code Llama, etc.)
  • Proficient in Kubernetes including custom operators, helm charts, and GPU scheduling
  • Deep expertise in Azure services (AKS, Azure ML, Container Registry, Storage, Networking)
  • Experience implementing and operating feature stores (Chronon, Feast, Tecton, Azure ML Feature Store, or custom solutions)
Job Responsibility
  • Design and implement scalable infrastructure for training, fine-tuning, and serving open source LLMs (Llama, Mistral, Gemma, etc.)
  • Architect and manage Kubernetes clusters for ML workloads, including GPU scheduling, autoscaling, and resource optimization
  • Design, implement, and maintain feature stores for ML model training and inference pipelines
  • Build and optimize LLM inference systems using frameworks like vLLM, TensorRT-LLM, and custom serving solutions
  • Ensure 99.9%+ uptime for ML platforms through robust monitoring, alerting, and incident response procedures
  • Design and implement ML platforms using DataRobot, Azure Machine Learning, Azure Kubernetes Service (AKS), and Azure Container Instances
  • Develop and maintain infrastructure using Terraform, ARM templates, and Azure DevOps
  • Implement cost-effective solutions for GPU compute, storage, and networking across Azure regions
  • Ensure ML platforms meet enterprise security standards and regulatory compliance requirements
  • Evaluate and potentially implement hybrid cloud solutions with AWS/GCP as backup or specialized use cases
What we offer
  • Comprehensive Total Rewards program that offers personalized coverage tailor-made for you and your family’s overall well-being
  • Financial benefits including market-competitive compensation, a 401(k) savings plan vested from day one that offers a 6% match, performance and recognition-based incentives, and tuition assistance
  • Access to additional benefits like mental healthcare as well as fertility and adoption assistance
  • Workplace flexibility, including our GEICO Flex program, which offers the ability to work from anywhere in the US for up to four weeks per year
Employment Type: Full-time

Staff Software Engineer - AI/ML Infra

GEICO AI platform and Infrastructure team is seeking an exceptional Senior ML Pl...
Location:
United States, Chevy Chase; New York City; Palo Alto
Salary:
115000.00 - 300000.00 USD / Year
Geico
Expiration Date
Until further notice
Requirements
  • Bachelor’s degree in computer science, Engineering, or related technical field (or equivalent experience)
  • 8+ years of software engineering experience with focus on infrastructure, platform engineering, or MLOps
  • 3+ years of hands-on experience with machine learning infrastructure and deployment at scale
  • 2+ years of experience working with Large Language Models and transformer architectures
  • Proficient in Python; strong skills in Go, Rust, or Java preferred
  • Proven experience working with open source LLMs (Llama 2/3, Qwen, Mistral, Gemma, Code Llama, etc.)
  • Proficient in Kubernetes including custom operators, helm charts, and GPU scheduling
  • Deep expertise in Azure services (AKS, Azure ML, Container Registry, Storage, Networking)
  • Experience implementing and operating feature stores (Chronon, Feast, Tecton, Azure ML Feature Store, or custom solutions)
Job Responsibility
  • Design and implement scalable infrastructure for training, fine-tuning, and serving open source LLMs (Llama, Mistral, Gemma, etc.)
  • Architect and manage Kubernetes clusters for ML workloads, including GPU scheduling, autoscaling, and resource optimization
  • Design, implement, and maintain feature stores for ML model training and inference pipelines
  • Build and optimize LLM inference systems using frameworks like vLLM, TensorRT-LLM, and custom serving solutions
  • Ensure 99.9%+ uptime for ML platforms through robust monitoring, alerting, and incident response procedures
  • Design and implement ML platforms using DataRobot, Azure Machine Learning, Azure Kubernetes Service (AKS), and Azure Container Instances
  • Develop and maintain infrastructure using Terraform, ARM templates, and Azure DevOps
  • Implement cost-effective solutions for GPU compute, storage, and networking across Azure regions
  • Ensure ML platforms meet enterprise security standards and regulatory compliance requirements
  • Evaluate and potentially implement hybrid cloud solutions with AWS/GCP as backup or specialized use cases
What we offer
  • Comprehensive Total Rewards program that offers personalized coverage tailor-made for you and your family’s overall well-being
  • Financial benefits including market-competitive compensation, a 401(k) savings plan vested from day one that offers a 6% match, performance and recognition-based incentives, and tuition assistance
  • Access to additional benefits like mental healthcare as well as fertility and adoption assistance
  • Workplace flexibility, including our GEICO Flex program, which offers the ability to work from anywhere in the US for up to four weeks per year
Employment Type: Full-time

Principal Data Infrastructure Engineer

As Microsoft continues to push the boundaries of AI, we are on the lookout for p...
Location:
United States, Redmond
Salary:
139900.00 - 274800.00 USD / Year
Microsoft Corporation
Expiration Date
Until further notice
Requirements
  • Master's Degree in Computer Science, Math, Software Engineering, Computer Engineering, or related field AND 4+ years experience in business analytics, data science, software development, data modeling, or data engineering
  • OR Bachelor's Degree in Computer Science, Math, Software Engineering, Computer Engineering, or related field AND 6+ years experience in business analytics, data science, software development, data modeling, or data engineering
  • OR equivalent experience
  • 4+ years in Big Data Infrastructure, DevOps, SRE, or Platform Engineering
  • 3+ years of hands-on experience managing and scaling distributed systems—from bare-metal to cloud-native environments
  • 2+ years deploying containerized applications using Kubernetes and Helm/Kustomize
  • Solid scripting and automation skills using Python, Bash, or PowerShell
  • Proven success in CI/CD pipeline management, release automation, and production troubleshooting
  • Experience working with Databricks for scalable data processing and analytics
  • Familiarity with security practices in infrastructure environments, including IAM, OAuth, and Kerberos administration
Job Responsibility
  • Architect and maintain scalable, reliable, and observable Big Data Infrastructure for mission-critical AI applications
  • Champion DevOps and SRE best practices—automated deployments, service monitoring, and incident response
  • Build a self-service big data platform that empowers data and platform engineers and researchers
  • Develop robust CI/CD pipelines and automate infrastructure provisioning using Infrastructure as Code tools (Bicep, Terraform, ARM)
  • Collaborate with Data Engineers, Data Scientists, AI Researchers, and Developers to deliver secure, seamless big data workflows
  • Lead technical design reviews and uphold a clean, secure, and well-documented codebase
  • Proactively identify and resolve bottlenecks in data pipelines and infrastructure
  • Optimize system performance across storage, compute, and analytics layers
  • Partner with Security teams to enhance system security (IAM, OAuth, Kerberos)
  • Embody and promote Microsoft’s values: Respect, Integrity, Accountability, and Inclusion
Employment Type: Full-time

Engineering Manager, Inference Platform

Cerebras Systems builds the world's largest AI chip, 56 times larger than GPUs. ...
Location:
United States, Sunnyvale; Canada, Toronto
Salary:
Not provided
Cerebras Systems
Expiration Date
Until further notice
Requirements
  • 6+ years in high-scale software engineering
  • 3+ years leading distributed systems or ML infra teams
  • Strong coding and review skills
  • Proven track record scaling LLM inference: optimizing latency (<100ms P99), throughput, batching, memory/IO efficiency, and resource utilization
  • Expertise in distributed inference/training for modern LLMs; understanding of AI/ML ecosystems, including public clouds (AWS/GCP/Azure)
  • Hands-on with model-serving frameworks (e.g. vLLM, TensorRT-LLM, Triton or similar) and ML stacks (PyTorch, Hugging Face, SageMaker)
  • Deep experience with orchestration (Kubernetes/EKS, Slurm), large clusters, and low-latency networking
  • Strong background in monitoring and reliability engineering (Prometheus/Grafana, incident response, post-mortems)
  • Demonstrated ability to recruit and retain high-performing teams, mentor engineers, and partner cross-functionally to deliver customer-facing products
Job Responsibility
  • Provide hands-on technical leadership, owning the technical vision and roadmap for the Cerebras Inference Platform, from internal scaling to on-prem customer solutions
  • Lead the end-to-end development of distributed inference systems, including request routing, autoscaling, and resource orchestration on Cerebras' unique hardware
  • Drive a culture of operational excellence, guaranteeing platform reliability (>99.9% uptime), performance, and efficiency
  • Lead, mentor, and grow a high-caliber team of engineers, fostering a culture of technical excellence and rapid execution
  • Productize the platform into an enterprise-ready, on-prem solution, collaborating closely with product, ops, and customer teams to ensure successful deployments
What we offer
  • Build a breakthrough AI platform beyond the constraints of the GPU
  • Publish and open-source cutting-edge AI research
  • Work on one of the fastest AI supercomputers in the world
  • Enjoy job stability with startup vitality
  • A simple, non-corporate work culture that respects individual beliefs
Employment Type: Full-time

Head of Technology, Intelligence Ventures

The Head of Technology serves as the chief technology officer of a new behaviora...
Location:
United States, New York
Salary:
263200.00 - 393800.00 USD / Year
Spectrum
Expiration Date
Until further notice
Requirements
  • Deep expertise across the full intelligence platform stack — including distributed data pipelines, ML platform architecture, embedding systems, feature stores, agent-to-agent API design, and LLM-powered application layers — with demonstrated ability to architect and ship all layers as a coherent, production-grade product
  • Demonstrated experience architecting and operating consumer data intelligence or data product platforms underpinned by complex machine learning systems and built on modern, cloud-native data infrastructure
  • Hands-on proficiency with Snowflake (including Cortex, Native Apps, and data sharing frameworks), cloud data platforms (AWS, Azure, or GCP), and production ML/AI systems at scale
  • Experience building agentic AI systems and LLM-powered product interfaces — including agent-to-agent APIs, retrieval-augmented generation architectures, and natural language UIs grounded in proprietary data — with strong product instincts around accuracy, trust, and user experience for non-technical enterprise audiences
  • Proven ability to translate complex technical architecture into clear executive and partner-facing communications; comfortable engaging at the C-suite level and in strategic partner negotiations with hyperscalers and technology platforms
  • Strong understanding of privacy-preserving data architecture, including differential privacy, de-identification techniques, zero-copy and clean room frameworks, and the regulatory landscape governing consumer behavioral data
  • Track record of recruiting and developing exceptional engineering talent in competitive markets; experience building high-performance teams from early-stage through scaled operations
  • Experience managing external development partners and outsourced engineering resources alongside an internal team in a fast-moving, build-from-scratch environment
Job Responsibility
  • Own the end-to-end technical architecture of the platform — from large-scale network signal ingestion and processing through behavioral embedding generation, feature store construction, and zero-copy intelligence delivery to enterprise partners — ensuring the platform is production-grade, built for household-scale throughput, and designed for long-term extensibility across new signal sources and use cases
  • Lead the build of the platform’s cloud-native data and application infrastructure — including ingestion and transformation pipelines, ML/AI compute environments, zero-copy partner access frameworks, and a real-time agent-to-agent API layer that enables external AI systems (marketing agents, commerce agents, customer service agents) to query household-level intelligence and receive contextually grounded responses
  • Serve as the technical lead in strategic partner integrations with cloud, data, and AI platform providers, ensuring each integration is architecturally differentiated and aligned with the platform’s cloud-agnostic, API-first design principles
  • Architect and deliver a business-facing agentic intelligence interface — a natural language UI that allows non-technical marketers, planners, and business users to query household behavioral intelligence, surface demand signals, and take action without requiring data or engineering support
  • Build and manage a world-class engineering organization capable of competing for talent with the leading technology and data infrastructure companies
Employment Type: Full-time

Senior Information Security Engineer

Wells Fargo is seeking a Senior Information Security Engineer.
Location:
India, Bengaluru
Salary:
Not provided
Wells Fargo
Expiration Date
May 30, 2026
Requirements
  • 4+ years of Information Security Engineering experience, or equivalent demonstrated through one or a combination of the following: work experience, training, military experience, education
  • 4+ years of experience in software engineering, data engineering, or backend development, including Python development and backend architecture
  • Expert-level knowledge of Python internals, concurrency (asyncio/multiprocessing), and building high-performance, memory-efficient applications
  • Proven expertise in designing and governing enterprise-grade CI/CD pipelines, managing complex code promotions across multi-region environments using GitHub Actions, GitLab, or Azure DevOps
  • Extensive hands-on experience with Apache Kafka (or Confluent), including cluster tuning, schema registry management and designing event driven architectures
  • Deep experience with Grafana and Prometheus for full-stack observability: defining SLIs/SLOs, custom exporters, and complex alerting logic
  • Strong understanding of the end-to-end ML lifecycle, specifically the deployment and scaling of models using frameworks like BentoML, Ray, or KServe
  • Experience in SQL, data modelling, ETL/ELT pipelines, and large-scale data processing
  • Good-to-have knowledge of Terraform, Pulumi, and container orchestration (Kubernetes, EKS)
Job Responsibility
  • Lead or participate in computer security incident response activities for moderately complex events
  • Conduct technical investigation of security related incidents and post incident digital forensics to identify causes and recommend future mitigation strategies
  • Provide security consulting on medium projects for internal clients to ensure conformity with corporate information, security policy, and standards
  • Design, document, test, maintain, and provide issue resolution recommendations for moderately complex security solutions related to networking, cryptography, cloud, authentication and directory services, email, internet, applications, and endpoint security
  • Review and correlate security logs
  • Utilize subject matter knowledge in industry leading security solutions and best practices to implement one or more components of information security such as availability, integrity, confidentiality, risk management, threat identification, modeling, monitoring, incident response, access management, and business continuity
  • Identify security vulnerabilities and issues, perform risk assessments, and evaluate remediation alternatives
  • Collaborate and consult with peers, colleagues and managers to resolve issues and achieve goals
  • Lead the development of mission critical python services, ensuring high availability and low latency performance
  • Standardize how code moves through the organization, implementing sophisticated deployment patterns like Blue-Green, Canary, or Ring deployments
Employment Type: Full-time