
Manager, Software Development (Hands-On Technical), ML Network Stack


Amazon Pforzheim GmbH

Location:
United States, Cupertino


Contract Type:
Employment contract


Salary:

184900.00 - 287700.00 USD / Year

Job Description:

We are hiring a hands-on Software Development Manager for the team that owns the network stack for EC2 distributed AI/ML systems. The team develops support for a variety of frameworks and communication libraries including NCCL, NVSHMEM, NIXL, NCCL GIN, Perplexity kernels, and others. We are seeking an experienced engineering manager for a mid-sized team, with multiple years of hands-on experience in systems programming and HW/SW co-design, and familiarity with networking (HPC networking preferred). Experience with the NVIDIA stack, ML applications, and frameworks will be highly regarded. You'll be leading senior, mid-level, and junior SDEs and directing work to ensure the team delivers the functions and features required for the latest and largest ML workloads.

Job Responsibility:

Leading senior, mid-level, and junior SDEs and directing work to ensure the team delivers functions and features required for the latest and largest ML workloads

Requirements:

  • 3+ years of engineering team management experience
  • Knowledge of engineering practices and patterns for the full software/hardware/networks development life cycle, including coding standards, code reviews, source control management, build processes, testing, certification, and livesite operations
  • Experience partnering with product or program management teams
  • 3+ years of C or C++ or Rust development experience
  • 5+ years of hands-on engineering experience, maintaining active programming proficiency

Nice to have:

  • Experience in communicating with users, other technical teams, and senior leadership to collect requirements, describe software product features, technical designs, and product strategy
  • Experience in recruiting, hiring, mentoring/coaching, and managing teams of Software Engineers to improve their skills and make them more effective product software engineers

What we offer:
  • Health insurance (medical, dental, vision, prescription, Basic Life & AD&D insurance and option for Supplemental life plans, EAP, Mental Health Support, Medical Advice Line, Flexible Spending Accounts, Adoption and Surrogacy Reimbursement coverage)
  • 401(k) matching
  • Paid time off
  • Parental leave
  • Sign-on payments
  • Restricted stock units (RSUs)

Additional Information:

Job Posted:
May 03, 2026

Employment Type:
Full-time
Work Type:
On-site work



Similar Jobs for Manager, Software Development (Hands-On Technical), ML Network Stack

Senior Machine Learning Engineer (Infrastructure)

We are looking for an experienced MLOps Engineer to join our team as a Senior Ma...
Location:
United States, Boston
Salary:
152800.00 - 224100.00 USD / Year
SimpliSafe
Expiration Date
Until further notice
Requirements
  • 5+ years of experience in software engineering, data engineering, or a related field, with at least 3 years focused on MLOps or ML infrastructure
  • Deep hands-on experience with AWS or similar public clouds, including compute, networking, container orchestration, and observability stacks
  • Hands-on experience with CI/CD pipelines, Docker, Kubernetes, and infrastructure-as-code tools (e.g., Terraform, CloudFormation)
  • Proficiency in programming languages like Python, and familiarity with machine learning frameworks (e.g., TensorFlow, PyTorch)
  • Solid understanding of ML lifecycle management, including experiment tracking, versioning, and monitoring
  • LLM application development, including prompt engineering and evaluation
  • Strong communication skills for partnering with cross-functional technical and non-technical teams
Job Responsibility
  • Lead the architecture, deployment, and optimization of scalable ML model serving systems for real-time and batch use cases
  • Collaborate with data scientists, engineers, and stakeholders to operationalize ML models
  • Develop CI/CD pipelines for ML models enabling rapid, safe, and consistent model releases
  • Design, implement, and own comprehensive production monitoring for ML models/systems
  • Manage cloud infrastructure, primarily in AWS or other major public clouds, to support ML workloads
  • Drive best practices in model versioning, observability, reproducibility, and deployment reliability
  • Serve in an on-call rotation as a first responder for software owned by your team
What we offer
  • A mission- and values-driven culture and a safe, inclusive environment where you can build, grow and thrive
  • A comprehensive total rewards package that supports your wellness and provides security for SimpliSafers and their families
  • Free SimpliSafe system and professional monitoring for your home
  • Employee Resource Groups (ERGs) that bring people together, give opportunities to network, mentor and develop, and advocate for change
  • Participation in our annual bonus program, equity, and other forms of compensation
  • A full range of medical, retirement, and lifestyle benefits
Employment Type: Full-time

Staff Software Engineer - AI/ML Infra

GEICO AI platform and Infrastructure team is seeking an exceptional Senior ML Pl...
Location:
United States, Palo Alto
Salary:
90000.00 - 300000.00 USD / Year
Geico
Expiration Date
Until further notice
Requirements
  • Bachelor’s degree in computer science, Engineering, or related technical field (or equivalent experience)
  • 8+ years of software engineering experience with focus on infrastructure, platform engineering, or MLOps
  • 3+ years of hands-on experience with machine learning infrastructure and deployment at scale
  • 2+ years of experience working with Large Language Models and transformer architectures
  • Proficient in Python; strong skills in Go, Rust, or Java preferred
  • Proven experience working with open source LLMs (Llama 2/3, Qwen, Mistral, Gemma, Code Llama, etc.)
  • Proficient in Kubernetes including custom operators, helm charts, and GPU scheduling
  • Deep expertise in Azure services (AKS, Azure ML, Container Registry, Storage, Networking)
  • Experience implementing and operating feature stores (Chronon, Feast, Tecton, Azure ML Feature Store, or custom solutions)
Job Responsibility
  • Design and implement scalable infrastructure for training, fine-tuning, and serving open source LLMs (Llama, Mistral, Gemma, etc.)
  • Architect and manage Kubernetes clusters for ML workloads, including GPU scheduling, autoscaling, and resource optimization
  • Design, implement, and maintain feature stores for ML model training and inference pipelines
  • Build and optimize LLM inference systems using frameworks like vLLM, TensorRT-LLM, and custom serving solutions
  • Ensure 99.9%+ uptime for ML platforms through robust monitoring, alerting, and incident response procedures
  • Design and implement ML platforms using DataRobot, Azure Machine Learning, Azure Kubernetes Service (AKS), and Azure Container Instances
  • Develop and maintain infrastructure using Terraform, ARM templates, and Azure DevOps
  • Implement cost-effective solutions for GPU compute, storage, and networking across Azure regions
  • Ensure ML platforms meet enterprise security standards and regulatory compliance requirements
  • Evaluate and potentially implement hybrid cloud solutions with AWS/GCP as backup or specialized use cases
What we offer
  • Comprehensive Total Rewards program that offers personalized coverage tailor-made for you and your family’s overall well-being
  • Financial benefits including market-competitive compensation, a 401(k) savings plan vested from day one that offers a 6% match, performance and recognition-based incentives, and tuition assistance
  • Access to additional benefits like mental healthcare as well as fertility and adoption assistance
  • Workplace flexibility, including our GEICO Flex program, which offers the ability to work from anywhere in the US for up to four weeks per year
Employment Type: Full-time

Staff Software Engineer - AI/ML Platform

GEICO AI platform and Infrastructure team is seeking an exceptional Senior ML Pl...
Location:
United States, Chevy Chase; New York City; Palo Alto
Salary:
115000.00 - 300000.00 USD / Year
Geico
Expiration Date
Until further notice
Requirements
  • Bachelor’s degree in computer science, Engineering, or related technical field (or equivalent experience)
  • 8+ years of software engineering experience with focus on infrastructure, platform engineering, or MLOps
  • 3+ years of hands-on experience with machine learning infrastructure and deployment at scale
  • 2+ years of experience working with Large Language Models and transformer architectures
  • Proficient in Python; strong skills in Go, Rust, or Java preferred
  • Proven experience working with open source LLMs (Llama 2/3, Qwen, Mistral, Gemma, Code Llama, etc.)
  • Proficient in Kubernetes including custom operators, helm charts, and GPU scheduling
  • Deep expertise in Azure services (AKS, Azure ML, Container Registry, Storage, Networking)
  • Experience implementing and operating feature stores (Chronon, Feast, Tecton, Azure ML Feature Store, or custom solutions)
Job Responsibility
  • Design and implement scalable infrastructure for training, fine-tuning, and serving open source LLMs (Llama, Mistral, Gemma, etc.)
  • Architect and manage Kubernetes clusters for ML workloads, including GPU scheduling, autoscaling, and resource optimization
  • Design, implement, and maintain feature stores for ML model training and inference pipelines
  • Build and optimize LLM inference systems using frameworks like vLLM, TensorRT-LLM, and custom serving solutions
  • Ensure 99.9%+ uptime for ML platforms through robust monitoring, alerting, and incident response procedures
  • Design and implement ML platforms using DataRobot, Azure Machine Learning, Azure Kubernetes Service (AKS), and Azure Container Instances
  • Develop and maintain infrastructure using Terraform, ARM templates, and Azure DevOps
  • Implement cost-effective solutions for GPU compute, storage, and networking across Azure regions
  • Ensure ML platforms meet enterprise security standards and regulatory compliance requirements
  • Evaluate and potentially implement hybrid cloud solutions with AWS/GCP as backup or specialized use cases
What we offer
  • Comprehensive Total Rewards program that offers personalized coverage tailor-made for you and your family’s overall well-being
  • Financial benefits including market-competitive compensation, a 401(k) savings plan vested from day one that offers a 6% match, performance and recognition-based incentives, and tuition assistance
  • Access to additional benefits like mental healthcare as well as fertility and adoption assistance
  • Workplace flexibility, including our GEICO Flex program, which offers the ability to work from anywhere in the US for up to four weeks per year
Employment Type: Full-time

Staff Software Engineer - AI/ML Infra

GEICO AI platform and Infrastructure team is seeking an exceptional Senior ML Pl...
Location:
United States, Chevy Chase; New York City; Palo Alto
Salary:
115000.00 - 300000.00 USD / Year
Geico
Expiration Date
Until further notice
Requirements
  • Bachelor’s degree in computer science, Engineering, or related technical field (or equivalent experience)
  • 8+ years of software engineering experience with focus on infrastructure, platform engineering, or MLOps
  • 3+ years of hands-on experience with machine learning infrastructure and deployment at scale
  • 2+ years of experience working with Large Language Models and transformer architectures
  • Proficient in Python; strong skills in Go, Rust, or Java preferred
  • Proven experience working with open source LLMs (Llama 2/3, Qwen, Mistral, Gemma, Code Llama, etc.)
  • Proficient in Kubernetes including custom operators, helm charts, and GPU scheduling
  • Deep expertise in Azure services (AKS, Azure ML, Container Registry, Storage, Networking)
  • Experience implementing and operating feature stores (Chronon, Feast, Tecton, Azure ML Feature Store, or custom solutions)
Job Responsibility
  • Design and implement scalable infrastructure for training, fine-tuning, and serving open source LLMs (Llama, Mistral, Gemma, etc.)
  • Architect and manage Kubernetes clusters for ML workloads, including GPU scheduling, autoscaling, and resource optimization
  • Design, implement, and maintain feature stores for ML model training and inference pipelines
  • Build and optimize LLM inference systems using frameworks like vLLM, TensorRT-LLM, and custom serving solutions
  • Ensure 99.9%+ uptime for ML platforms through robust monitoring, alerting, and incident response procedures
  • Design and implement ML platforms using DataRobot, Azure Machine Learning, Azure Kubernetes Service (AKS), and Azure Container Instances
  • Develop and maintain infrastructure using Terraform, ARM templates, and Azure DevOps
  • Implement cost-effective solutions for GPU compute, storage, and networking across Azure regions
  • Ensure ML platforms meet enterprise security standards and regulatory compliance requirements
  • Evaluate and potentially implement hybrid cloud solutions with AWS/GCP as backup or specialized use cases
What we offer
  • Comprehensive Total Rewards program that offers personalized coverage tailor-made for you and your family’s overall well-being
  • Financial benefits including market-competitive compensation, a 401(k) savings plan vested from day one that offers a 6% match, performance and recognition-based incentives, and tuition assistance
  • Access to additional benefits like mental healthcare as well as fertility and adoption assistance
  • Workplace flexibility, including our GEICO Flex program, which offers the ability to work from anywhere in the US for up to four weeks per year
Employment Type: Full-time

Principal Data Infrastructure Engineer

As Microsoft continues to push the boundaries of AI, we are on the lookout for p...
Location:
United States, Redmond
Salary:
139900.00 - 274800.00 USD / Year
Microsoft Corporation
Expiration Date
Until further notice
Requirements
  • Master's Degree in Computer Science, Math, Software Engineering, Computer Engineering, or related field AND 4+ years experience in business analytics, data science, software development, data modeling, or data engineering
  • OR Bachelor's Degree in Computer Science, Math, Software Engineering, Computer Engineering, or related field AND 6+ years experience in business analytics, data science, software development, data modeling, or data engineering
  • OR equivalent experience
  • 4+ years in Big Data Infrastructure, DevOps, SRE, or Platform Engineering
  • 3+ years of hands-on experience managing and scaling distributed systems—from bare-metal to cloud-native environments
  • 2+ years deploying containerized applications using Kubernetes and Helm/Kustomize
  • Solid scripting and automation skills using Python, Bash, or PowerShell
  • Proven success in CI/CD pipeline management, release automation, and production troubleshooting
  • Experience working with Databricks for scalable data processing and analytics
  • Familiarity with security practices in infrastructure environments, including IAM, OAuth, and Kerberos administration
Job Responsibility
  • Architect and maintain scalable, reliable, and observable Big Data Infrastructure for mission-critical AI applications
  • Champion DevOps and SRE best practices—automated deployments, service monitoring, and incident response
  • Build a self-service big data platform that empowers data and platform engineers and researchers
  • Develop robust CI/CD pipelines and automate infrastructure provisioning using Infrastructure as Code tools (Bicep, Terraform, ARM)
  • Collaborate with Data Engineers, Data Scientists, AI Researchers, and Developers to deliver secure, seamless big data workflows
  • Lead technical design reviews and uphold a clean, secure, and well-documented codebase
  • Proactively identify and resolve bottlenecks in data pipelines and infrastructure
  • Optimize system performance across storage, compute, and analytics layers
  • Partner with Security teams to enhance system security (IAM, OAuth, Kerberos)
  • Embody and promote Microsoft’s values: Respect, Integrity, Accountability, and Inclusion
Employment Type: Full-time

Engineering Manager, Inference Platform

Cerebras Systems builds the world's largest AI chip, 56 times larger than GPUs. ...
Location:
United States, Sunnyvale; Canada, Toronto
Salary:
Not provided
Cerebras Systems
Expiration Date
Until further notice
Requirements
  • 6+ years in high-scale software engineering
  • 3+ years leading distributed systems or ML infra teams
  • Strong coding and review skills
  • Proven track record scaling LLM inference: optimizing latency (<100ms P99), throughput, batching, memory/IO efficiency, and resource utilization
  • Expertise in distributed inference/training for modern LLMs; understanding of AI/ML ecosystems, including public clouds (AWS/GCP/Azure)
  • Hands-on with model-serving frameworks (e.g. vLLM, TensorRT-LLM, Triton or similar) and ML stacks (PyTorch, Hugging Face, SageMaker)
  • Deep experience with orchestration (Kubernetes/EKS, Slurm), large clusters, and low-latency networking
  • Strong background in monitoring and reliability engineering (Prometheus/Grafana, incident response, post-mortems)
  • Demonstrated ability to recruit and retain high-performing teams, mentor engineers, and partner cross-functionally to deliver customer-facing products
Job Responsibility
  • Provide hands-on technical leadership, owning the technical vision and roadmap for the Cerebras Inference Platform, from internal scaling to on-prem customer solutions
  • Lead the end-to-end development of distributed inference systems, including request routing, autoscaling, and resource orchestration on Cerebras' unique hardware
  • Drive a culture of operational excellence, guaranteeing platform reliability (>99.9% uptime), performance, and efficiency
  • Lead, mentor, and grow a high-caliber team of engineers, fostering a culture of technical excellence and rapid execution
  • Productize the platform into an enterprise-ready, on-prem solution, collaborating closely with product, ops, and customer teams to ensure successful deployments
What we offer
  • Build a breakthrough AI platform beyond the constraints of the GPU
  • Publish and open-source cutting-edge AI research
  • Work on one of the fastest AI supercomputers in the world
  • Enjoy job stability with startup vitality
  • A simple, non-corporate work culture that respects individual beliefs
Employment Type: Full-time

Head of Technology, Intelligence Ventures

The Head of Technology serves as the chief technology officer of a new behaviora...
Location:
United States, New York
Salary:
263200.00 - 393800.00 USD / Year
Spectrum
Expiration Date
Until further notice
Requirements
  • Deep expertise across the full intelligence platform stack — including distributed data pipelines, ML platform architecture, embedding systems, feature stores, agent-to-agent API design, and LLM-powered application layers — with demonstrated ability to architect and ship all layers as a coherent, production-grade product
  • Demonstrated experience architecting and operating consumer data intelligence or data product platforms underpinned by complex machine learning systems and built on modern, cloud-native data infrastructure
  • Hands-on proficiency with Snowflake (including Cortex, Native Apps, and data sharing frameworks), cloud data platforms (AWS, Azure, or GCP), and production ML/AI systems at scale
  • Experience building agentic AI systems and LLM-powered product interfaces — including agent-to-agent APIs, retrieval-augmented generation architectures, and natural language UIs grounded in proprietary data — with strong product instincts around accuracy, trust, and user experience for non-technical enterprise audiences
  • Proven ability to translate complex technical architecture into clear executive and partner-facing communications; comfortable engaging at the C-suite level and in strategic partner negotiations with hyperscalers and technology platforms
  • Strong understanding of privacy-preserving data architecture, including differential privacy, de-identification techniques, zero-copy and clean room frameworks, and the regulatory landscape governing consumer behavioral data
  • Track record of recruiting and developing exceptional engineering talent in competitive markets; experience building high-performance teams from early-stage through scaled operations
  • Experience managing external development partners and outsourced engineering resources alongside an internal team in a fast-moving, build-from-scratch environment
Job Responsibility
  • Own the end-to-end technical architecture of the platform — from large-scale network signal ingestion and processing through behavioral embedding generation, feature store construction, and zero-copy intelligence delivery to enterprise partners — ensuring the platform is production-grade, built for household-scale throughput, and designed for long-term extensibility across new signal sources and use cases
  • Lead the build of the platform’s cloud-native data and application infrastructure — including ingestion and transformation pipelines, ML/AI compute environments, zero-copy partner access frameworks, and a real-time agent-to-agent API layer that enables external AI systems (marketing agents, commerce agents, customer service agents) to query household-level intelligence and receive contextually grounded responses
  • Serve as the technical lead in strategic partner integrations with cloud, data, and AI platform providers, ensuring each integration is architecturally differentiated and aligned with the platform’s cloud-agnostic, API-first design principles
  • Architect and deliver a business-facing agentic intelligence interface — a natural language UI that allows non-technical marketers, planners, and business users to query household behavioral intelligence, surface demand signals, and take action without requiring data or engineering support
  • Build and manage a world-class engineering organization capable of competing for talent with the leading technology and data infrastructure companies
Employment Type: Full-time

Senior Information Security Engineer

Wells Fargo is seeking a Senior Information Security Engineer.
Location:
India, Bengaluru
Salary:
Not provided
Wells Fargo
Expiration Date
May 30, 2026
Requirements
  • 4+ years of Information Security Engineering experience, or equivalent demonstrated through one or a combination of the following: work experience, training, military experience, education
  • 4+ years of experience in software engineering, data engineering, or backend development, including Python development and backend architecture
  • Expert-level knowledge of Python internals, concurrency (asyncio/multiprocessing), and building high-performance, memory-efficient applications
  • Proven expertise in designing and governing enterprise-grade CI/CD pipelines, managing complex code promotions across multi-region environments using GitHub Actions, GitLab, or Azure DevOps
  • Extensive hands-on experience with Apache Kafka (or Confluent), including cluster tuning, schema registry management and designing event driven architectures
  • Deep experience with Grafana and Prometheus for full-stack observability: defining SLIs/SLOs, custom exporters, and complex alerting logic
  • Strong understanding of the end-to-end ML lifecycle, specifically the deployment and scaling of models using frameworks like BentoML, Ray, or KServe
  • Experience in SQL, data modelling, ETL/ELT pipelines, and large-scale data processing
  • Good-to-have knowledge of Terraform, Pulumi, and container orchestration (Kubernetes, EKS)
Job Responsibility
  • Lead or participate in computer security incident response activities for moderately complex events
  • Conduct technical investigation of security related incidents and post incident digital forensics to identify causes and recommend future mitigation strategies
  • Provide security consulting on medium projects for internal clients to ensure conformity with corporate information, security policy, and standards
  • Design, document, test, maintain, and provide issue resolution recommendations for moderately complex security solutions related to networking, cryptography, cloud, authentication and directory services, email, internet, applications, and endpoint security
  • Review and correlate security logs
  • Utilize subject matter knowledge in industry leading security solutions and best practices to implement one or more components of information security such as availability, integrity, confidentiality, risk management, threat identification, modeling, monitoring, incident response, access management, and business continuity
  • Identify security vulnerabilities and issues, perform risk assessments, and evaluate remediation alternatives
  • Collaborate and consult with peers, colleagues and managers to resolve issues and achieve goals
  • Lead the development of mission critical python services, ensuring high availability and low latency performance
  • Standardize how code moves through the organization, implementing sophisticated deployment patterns like Blue-Green, Canary, or Ring deployments
Employment Type: Full-time