Manager, Software Development (Hands-On Technical), ML Network Stack Job at Amazon (Seattle)

Manager, Software Development (Hands-On Technical), ML Network Stack

We are hiring a hands-on Software Development Manager for the team that owns the...

Location

United States, Cupertino; Seattle

Salary:

184900.00 - 287700.00 USD / Year

Amazon Pforzheim GmbH

Expiration Date

Until further notice

Requirements

3+ years of engineering team management experience
Knowledge of engineering practices and patterns for the full software/hardware/networks development life cycle, including coding standards, code reviews, source control management, build processes, testing, certification, and livesite operations
Experience partnering with product or program management teams
3+ years of C or C++ or Rust development experience
5+ years of hands-on engineering experience, maintaining active programming proficiency

Job Responsibility

Leading senior, mid-level, and junior SDEs and directing work to ensure the team delivers functions and features required for the latest and largest ML workloads

What we offer

Health insurance (medical, dental, vision, prescription, Basic Life & AD&D insurance and option for Supplemental life plans, EAP, Mental Health Support, Medical Advice Line, Flexible Spending Accounts, Adoption and Surrogacy Reimbursement coverage)
401(k) matching
paid time off
parental leave
sign-on payments
restricted stock units (RSUs)

Fulltime

Manager, Software Development (Hands-On Technical), ML Network Stack - Annapurna Labs

We are hiring a hands-on Software Development Manager for the team that owns the...

Location

Israel, Tel Aviv

Salary:

Not provided

Amazon Pforzheim GmbH

Expiration Date

Until further notice

Requirements

5+ years of engineering team management experience
Knowledge of engineering practices and patterns for the full software/hardware/networks development life cycle, including coding standards, code reviews, source control management, build processes, testing, certification, and livesite operations
Experience partnering with product or program management teams
3+ years of C or C++ or Rust development experience
5+ years of hands-on engineering experience, maintaining active programming proficiency

Job Responsibility

We are hiring a hands-on Software Development Manager for the team that owns the network stack for EC2 distributed AI/ML systems
The team develops support for a variety of frameworks and communication libraries including NCCL, NVSHMEM, NIXL, NCCL GIN, Perplexity kernels and others
You'll be leading senior, mid-level, and junior SDEs and directing work to ensure the team delivers functions and features required for the latest and largest ML workloads

What we offer

Work/Life Balance
Mentorship & Career Growth

Fulltime

Software Engineering IC5

The CoreAI Infrastructure team builds the foundational accelerated compute platf...

Location

United States, Redmond

Salary:

142800.00 - 274800.00 USD / Year

Microsoft Corporation

Expiration Date

Until further notice

Requirements

Bachelor's Degree in Computer Science or related technical field and 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, Python or equivalent experience
Proven ability to design and operate large-scale production infrastructure with high reliability and performance requirements using Azure Kubernetes Service (AKS)
Strong problem-solving skills and the ability to debug complex, cross-layer systems issues
Demonstrated technical leadership, including mentoring engineers and driving cross-team architectural alignment
Hands-on experience with virtualization and/or container platforms (e.g., VMs, Kubernetes, container runtimes)
Strong collaboration and communication skills, with the ability to work across organizational boundaries
Expertise with distributed observability technologies (e.g., Prometheus, OpenTelemetry, Grafana) and experience designing or scaling telemetry pipelines for high-throughput production systems
Advanced, hands-on experience with production ML systems, large-scale training infrastructure, NCCL, CUDA libraries and tools

Job Responsibility

Design and build GPU- and CPU-accelerated infrastructure for training and inference workloads, spanning bare metal, virtual machines, and containerized environments, with a focus on key observability metrics at scale
Develop end-to-end observability and operational excellence systems for GPU/CPU device management, scheduling, isolation, and sharing (e.g., partial GPU allocation, multitenant usage)
Build and operate advanced orchestration and resource governance and management scenarios using platforms such as AKS, Dynamic Resource Allocation (DRA), and related Kubernetes ecosystem capabilities to enable fair sharing, isolation, and efficient utilization of accelerated resources
Build and evolve virtualization and container stacks to support modern AI workloads, including secure and confidential compute scenarios
Optimize performance, reliability, and utilization across large GPU/CPU fleets, including scale-up and scale-out configurations
Partner with networking and storage teams to enable high performance interconnects (e.g., RDMA/InfiniBand class networking) for distributed workloads
Drive end-to-end platform features from design through production, including observability, diagnostics, and operational excellence
Influence platform architecture and technical direction across teams through design reviews and technical leadership

Fulltime

Lead Information Security Engineer - Python Full Stack Developer

Wells Fargo is seeking a Lead Information Security Engineer.

Location

India, Hyderabad

Salary:

Not provided

Wells Fargo

Expiration Date

June 29, 2026

Requirements

5+ years of Information Security Engineering experience, or equivalent demonstrated through one or a combination of the following: work experience, training, military experience, education
5+ years of experience in Software Engineering, Data Engineering, or backend Python development and backend architecture
Expert-level knowledge of Python internals, concurrency (asyncio/multiprocessing), and building high-performance, memory-efficient applications
Proven expertise in designing and governing enterprise-grade CI/CD pipelines, including managing complex code promotions across multi-region environments using GitHub Actions, GitLab, or Azure DevOps
Extensive hands-on experience with Apache Kafka (or Confluent), including cluster tuning, schema registry management and designing event driven architectures
Deep experience with Grafana and Prometheus for full stack observability – defining SLIs/SLOs, custom exporters and complex alerting logic
Strong understanding of the end-to-end ML life cycle, specifically in the deployment and scaling of models using frameworks like BentoML, Ray, or KServe
Experience in SQL, data modelling, ETL/ELT pipelines, and large-scale data processing
Good to have: knowledge of Terraform, Pulumi, and container orchestration (Kubernetes, EKS)

Job Responsibility

Lead computer security incident response activities for highly complex events
Conduct technical investigation of security related incidents and post incident digital forensics to identify causes and recommend future mitigation strategies
Provide security consulting on large projects for internal clients to ensure conformity with corporate information, security policy, and standards
Design, document, test, maintain, and provide issue resolution recommendations for highly complex security solutions related to networking, cryptography, cloud, authentication and directory services, email, internet, applications, and endpoint security
Review and correlate security logs
Utilize subject matter knowledge in industry leading security solutions and best practices to implement one or more components of information security such as availability, integrity, confidentiality, risk management, threat identification, modeling, monitoring, incident response, access management, and business continuity
Identify security vulnerabilities and issues, perform risk assessments, and evaluate remediation alternatives
Collaborate and influence all levels of professionals including managers
Lead a team to achieve objectives
Lead the development of mission-critical Python services, ensuring high availability and low-latency performance

Fulltime

Staff Software Engineer - AI/ML Infra

GEICO AI platform and Infrastructure team is seeking an exceptional Senior ML Pl...

Location

United States, Chevy Chase; New York City; Palo Alto

Salary:

115000.00 USD / Year

Geico

Expiration Date

Until further notice

Requirements

Bachelor’s degree in Computer Science, Engineering, or related technical field (or equivalent experience)
8+ years of software engineering experience with focus on infrastructure, platform engineering, or MLOps
3+ years of hands-on experience with machine learning infrastructure and deployment at scale
2+ years of experience working with Large Language Models and transformer architectures
Proficient in Python; strong skills in Go, Rust, or Java preferred
Proven experience working with open source LLMs (Llama 2/3, Qwen, Mistral, Gemma, Code Llama, etc.)
Proficient in Kubernetes including custom operators, helm charts, and GPU scheduling
Deep expertise in Azure services (AKS, Azure ML, Container Registry, Storage, Networking)
Experience implementing and operating feature stores (Chronon, Feast, Tecton, Azure ML Feature Store, or custom solutions)

Job Responsibility

Design and implement scalable infrastructure for training, fine-tuning, and serving open source LLMs (Llama, Mistral, Gemma, etc.)
Architect and manage Kubernetes clusters for ML workloads, including GPU scheduling, autoscaling, and resource optimization
Design, implement, and maintain feature stores for ML model training and inference pipelines
Build and optimize LLM inference systems using frameworks like vLLM, TensorRT-LLM, and custom serving solutions
Ensure 99.9%+ uptime for ML platforms through robust monitoring, alerting, and incident response procedures
Design and implement ML platforms using DataRobot, Azure Machine Learning, Azure Kubernetes Service (AKS), and Azure Container Instances
Develop and maintain infrastructure using Terraform, ARM templates, and Azure DevOps
Implement cost-effective solutions for GPU compute, storage, and networking across Azure regions
Ensure ML platforms meet enterprise security standards and regulatory compliance requirements
Evaluate and potentially implement hybrid cloud solutions with AWS/GCP as backup or specialized use cases

What we offer

Comprehensive Total Rewards program that offers personalized coverage tailor-made for you and your family’s overall well-being
Financial benefits including market-competitive compensation; a 401K savings plan vested from day one that offers a 6% match; performance and recognition-based incentives; and tuition assistance
Access to additional benefits like mental healthcare as well as fertility and adoption assistance
Support for flexibility: we provide workplace flexibility as well as our GEICO Flex program, which offers the ability to work from anywhere in the US for up to four weeks per year

Fulltime

Staff Software Engineer - AI/ML Platform

GEICO AI platform and Infrastructure team is seeking an exceptional Senior ML Pl...

Location

United States, Chevy Chase; New York City; Palo Alto

Salary:

115000.00 USD / Year

Geico

Expiration Date

Until further notice

Requirements

Bachelor’s degree in Computer Science, Engineering, or related technical field (or equivalent experience)
8+ years of software engineering experience with focus on infrastructure, platform engineering, or MLOps
3+ years of hands-on experience with machine learning infrastructure and deployment at scale
2+ years of experience working with Large Language Models and transformer architectures
Proficient in Python; strong skills in Go, Rust, or Java preferred
Proven experience working with open source LLMs (Llama 2/3, Qwen, Mistral, Gemma, Code Llama, etc.)
Proficient in Kubernetes including custom operators, helm charts, and GPU scheduling
Deep expertise in Azure services (AKS, Azure ML, Container Registry, Storage, Networking)
Experience implementing and operating feature stores (Chronon, Feast, Tecton, Azure ML Feature Store, or custom solutions)

Job Responsibility

Design and implement scalable infrastructure for training, fine-tuning, and serving open source LLMs (Llama, Mistral, Gemma, etc.)
Architect and manage Kubernetes clusters for ML workloads, including GPU scheduling, autoscaling, and resource optimization
Design, implement, and maintain feature stores for ML model training and inference pipelines
Build and optimize LLM inference systems using frameworks like vLLM, TensorRT-LLM, and custom serving solutions
Ensure 99.9%+ uptime for ML platforms through robust monitoring, alerting, and incident response procedures
Design and implement ML platforms using DataRobot, Azure Machine Learning, Azure Kubernetes Service (AKS), and Azure Container Instances
Develop and maintain infrastructure using Terraform, ARM templates, and Azure DevOps
Implement cost-effective solutions for GPU compute, storage, and networking across Azure regions
Ensure ML platforms meet enterprise security standards and regulatory compliance requirements
Evaluate and potentially implement hybrid cloud solutions with AWS/GCP as backup or specialized use cases

What we offer

Comprehensive Total Rewards program that offers personalized coverage tailor-made for you and your family’s overall well-being
Financial benefits including market-competitive compensation; a 401K savings plan vested from day one that offers a 6% match; performance and recognition-based incentives; and tuition assistance
Access to additional benefits like mental healthcare as well as fertility and adoption assistance
Support for flexibility: we provide workplace flexibility as well as our GEICO Flex program, which offers the ability to work from anywhere in the US for up to four weeks per year

Fulltime

Staff Software Engineer - AI/ML Infra

GEICO AI platform and Infrastructure team is seeking an exceptional Senior ML Pl...

Location

United States, Palo Alto

Salary:

90000.00 USD / Year

Geico

Expiration Date

Until further notice

Requirements

Bachelor’s degree in Computer Science, Engineering, or related technical field (or equivalent experience)
8+ years of software engineering experience with focus on infrastructure, platform engineering, or MLOps
3+ years of hands-on experience with machine learning infrastructure and deployment at scale
2+ years of experience working with Large Language Models and transformer architectures
Proficient in Python; strong skills in Go, Rust, or Java preferred
Proven experience working with open source LLMs (Llama 2/3, Qwen, Mistral, Gemma, Code Llama, etc.)
Proficient in Kubernetes including custom operators, helm charts, and GPU scheduling
Deep expertise in Azure services (AKS, Azure ML, Container Registry, Storage, Networking)
Experience implementing and operating feature stores (Chronon, Feast, Tecton, Azure ML Feature Store, or custom solutions)

Job Responsibility

Design and implement scalable infrastructure for training, fine-tuning, and serving open source LLMs (Llama, Mistral, Gemma, etc.)
Architect and manage Kubernetes clusters for ML workloads, including GPU scheduling, autoscaling, and resource optimization
Design, implement, and maintain feature stores for ML model training and inference pipelines
Build and optimize LLM inference systems using frameworks like vLLM, TensorRT-LLM, and custom serving solutions
Ensure 99.9%+ uptime for ML platforms through robust monitoring, alerting, and incident response procedures
Design and implement ML platforms using DataRobot, Azure Machine Learning, Azure Kubernetes Service (AKS), and Azure Container Instances
Develop and maintain infrastructure using Terraform, ARM templates, and Azure DevOps
Implement cost-effective solutions for GPU compute, storage, and networking across Azure regions
Ensure ML platforms meet enterprise security standards and regulatory compliance requirements
Evaluate and potentially implement hybrid cloud solutions with AWS/GCP as backup or specialized use cases

What we offer

Comprehensive Total Rewards program that offers personalized coverage tailor-made for you and your family’s overall well-being
Financial benefits including market-competitive compensation; a 401K savings plan vested from day one that offers a 6% match; performance and recognition-based incentives; and tuition assistance
Access to additional benefits like mental healthcare as well as fertility and adoption assistance
Support for flexibility: we provide workplace flexibility as well as our GEICO Flex program, which offers the ability to work from anywhere in the US for up to four weeks per year

Fulltime

Head of Technology, Intelligence Ventures

The Head of Technology serves as the chief technology officer of a new behaviora...

Location

United States, New York

Salary:

263200.00 - 393800.00 USD / Year

Spectrum

Expiration Date

Until further notice

Requirements

Deep expertise across the full intelligence platform stack — including distributed data pipelines, ML platform architecture, embedding systems, feature stores, agent-to-agent API design, and LLM-powered application layers — with demonstrated ability to architect and ship all layers as a coherent, production-grade product
Demonstrated experience architecting and operating consumer data intelligence or data product platforms underpinned by complex machine learning systems and built on modern, cloud-native data infrastructure
Hands-on proficiency with Snowflake (including Cortex, Native Apps, and data sharing frameworks), cloud data platforms (AWS, Azure, or GCP), and production ML/AI systems at scale
Experience building agentic AI systems and LLM-powered product interfaces — including agent-to-agent APIs, retrieval-augmented generation architectures, and natural language UIs grounded in proprietary data — with strong product instincts around accuracy, trust, and user experience for non-technical enterprise audiences
Proven ability to translate complex technical architecture into clear executive and partner-facing communications; comfortable engaging at the C-suite level and in strategic partner negotiations with hyperscalers and technology platforms
Strong understanding of privacy-preserving data architecture, including differential privacy, de-identification techniques, zero-copy and clean room frameworks, and the regulatory landscape governing consumer behavioral data
Track record of recruiting and developing exceptional engineering talent in competitive markets; experience building high-performance teams from early-stage through scaled operations
Experience managing external development partners and outsourced engineering resources alongside an internal team in a fast-moving, build-from-scratch environment

Job Responsibility

Own the end-to-end technical architecture of the platform — from large-scale network signal ingestion and processing through behavioral embedding generation, feature store construction, and zero-copy intelligence delivery to enterprise partners — ensuring the platform is production-grade, built for household-scale throughput, and designed for long-term extensibility across new signal sources and use cases
Lead the build of the platform’s cloud-native data and application infrastructure — including ingestion and transformation pipelines, ML/AI compute environments, zero-copy partner access frameworks, and a real-time agent-to-agent API layer that enables external AI systems (marketing agents, commerce agents, customer service agents) to query household-level intelligence and receive contextually grounded responses
Serve as the technical lead in strategic partner integrations with cloud, data, and AI platform providers, ensuring each integration is architecturally differentiated and aligned with the platform’s cloud-agnostic, API-first design principles
Architect and deliver a business-facing agentic intelligence interface — a natural language UI that allows non-technical marketers, planners, and business users to query household behavioral intelligence, surface demand signals, and take action without requiring data or engineering support
Build and manage a world-class engineering organization capable of competing for talent with the leading technology and data infrastructure companies

Fulltime
