Senior Software Engineer - ML Network Stack

Israel, Tel Aviv · Job Posted June 15, 2026

Job Description

We are seeking an experienced engineer to join our team that owns the network stack for EC2 distributed AI/ML systems. The team develops support for a variety of frameworks and communication libraries including NCCL, NVSHMEM, NIXL, NCCL GIN, and Perplexity kernels. Solid knowledge of Linux, networking, and performant coding is important. Experience with embedded systems is valued, and experience with high-speed networking or HPC/RDMA interconnects is highly valued. If you like solving hard problems, want to work with HPC and ML customers, iterate fast, and deliver meaningful solutions at scale, then come join us! This truly is a role at the forefront of AI/ML: you'll be working on features for the largest clusters, with the largest customers, for the largest AI models.

The organization you would be joining is Annapurna Labs, an integral part of AWS that develops hardware and software components that are critical building blocks for EC2 infrastructure. Every instance in EC2 is running some type of hardware designed by Annapurna Labs. We specialize in designing software, systems, and chips that optimize the AWS customer experience.

Job Responsibility

  • Be a senior engineer on a team that builds and maintains the infrastructure that monitors and reports on functionality and performance of massive testing workloads run at scale
  • Use internal Amazon CI/CD tools, Linux, and public AWS products to automate the delivery of our software to customers, saving developer time
  • Write Python code that effortlessly spools up large clusters and runs benchmarks and applications for ML and HPC workloads
  • Use AWS Managed Grafana and Athena to digest the massive amount of performance data generated by these workloads and create dashboards for developers and stakeholders
  • Invent automatic mechanisms to alert developers to functional and performance regressions so they never reach customers
  • Manage the complexity of infrastructure that covers many instance types, software stacks, Linux operating systems, and cutting-edge releases, and make it easy to evolve

Requirements

  • 5+ years of non-internship professional software development experience
  • 5+ years of experience leading design or architecture (design patterns, reliability, and scaling) of new and existing systems
  • 5+ years of experience with the full software development life cycle, including coding standards, code reviews, source control management, build processes, testing, and operations
  • 3+ years as a mentor or tech lead, or leading engineering teams
  • 3+ years of experience in SW/HW co-design

Nice to have

  • Bachelor's degree in computer science or equivalent
  • Experience creating automated dashboards and visualization (such as Grafana)

Similar Jobs for Senior Software Engineer - ML Network Stack

8 matching positions

Senior Software Engineer - Planning ML Integration

We're building the next generation of planning capabilities by integrating learn...
Location: United States, Mountain View
Salary: 160000.00 USD / Year
Kodiak Robotics
Expiration Date: Until further notice
Requirements
  • Strong software engineering skills with proficiency in C++
  • Python proficiency is a plus
  • Experience integrating ML models or learned components into a real-time system
  • A strong background in robotics, planning, optimization, and mathematics (MS, PhD, or equivalent experience)
  • Industry experience in robotics or autonomous driving
  • Experience working in large-scale or safety-critical systems with strict performance requirements
  • Experience evaluating or interpreting ML model outputs
  • Strong analytical skills, including the ability to reason about algorithmic trade-offs and system behavior
  • Excellent communication skills and comfort working across teams
  • A desire to collaborate with other teams outside of planning
Job Responsibility
  • Incorporate neural networks into the planning stack, working closely with ML, perception, and systems teams
  • Evaluate how learned inputs influence planner performance, in simulation and on-road
  • Architect fallback, hybrid, or arbitration strategies that maintain safety and reliability when learned models are uncertain or degraded
  • Contribute to the broader planning system by designing and implementing new planning behaviors, search strategies, optimizations, and structural improvements
  • Write high-quality C++ code that meets real-time constraints and supports safety-critical deployment
  • Participate in code reviews, design discussions, and cross-team planning to ensure alignment and technical excellence
What we offer
  • Competitive compensation package including equity and annual bonuses
  • Excellent Medical, Dental, and Vision plans through Kaiser Permanente, Cigna, and MetLife (including a medical plan with infertility benefits)
  • MetLife Legal Services, Identity & Fraud Protection, Hospital Indemnity Insurance, Accident Insurance, & Critical Illness Insurance
  • Flexible PTO, 10 paid holidays, and generous parental leave policies
  • Office perks: dog-friendly, free catered lunch, a fully stocked kitchen, and free EV charging
  • Long Term Disability, Short Term Disability, Life Insurance
  • Wellbeing Benefits - Headspace through Cigna, Calm through Kaiser, One Medical, Gympass, Spring Health through Cigna, Rula (mental health navigation)
  • Fidelity 401(k)
  • Commuter, FSA, Dependent Care FSA, HSA
  • Various incentive programs (referral bonuses, patent bonuses, etc.)
  • Fulltime

Senior Software Engineer - Real-Time Workflows & ML Serving

Modern ads platforms run on always-on, real-time data: streaming events, feature...
Location: India, Bangalore
Salary: Not provided
Microsoft Corporation
Expiration Date: Until further notice
Requirements
  • Bachelor’s or Master’s degree in Computer Science, Electrical/Computer Engineering, or a related field, with 6+ years of related experience
  • Strong programming skills in C++, C#, or Python (at least one required)
  • Hands-on experience in one or more: Building and operating streaming data pipelines in production (Flink or Spark Structured Streaming), Distributed systems engineering with strong reliability and operational rigor, Messaging systems such as Kafka/Pulsar
  • Experience operating services with Kubernetes/containers and production readiness practices (deployments, scaling, rollbacks)
  • Experience with observability stacks such as OpenTelemetry, Prometheus, Grafana
Job Responsibility
  • Design and implement real-time streaming ETL / feature pipelines (e.g., Flink or Spark Structured Streaming) that meet strict freshness and correctness constraints
  • Build and operate reliable messaging and ingestion with Kafka/Pulsar (partitioning strategy, retries, ordering guarantees, DLQs, backpressure handling)
  • Own data contracts between producers, pipelines, and consumers: schema evolution, versioning, compatibility, validation, and safe rollout
  • Implement production-grade backfill/replay workflows
  • Define and meet SLOs using OpenTelemetry/Prometheus/Grafana for metrics, tracing, dashboards, alerting, and incident response readiness
  • Integrate pipelines with online stores/caches and ML consumers (feature stores, embedding pipelines, LLM API calls, online/offline consistency patterns)
  • Partner with applied scientists on feature/embedding definitions, validation, and end-to-end quality measurement
  • Optimize end-to-end performance and efficiency: CPU/memory/I/O, serialization, caching, network overhead, concurrency, and pipeline compute cost
  • Contribute to serving/inference integrations where needed (e.g., Triton/ONNX Runtime/TensorRT) including batching and latency/cost tradeoffs
  • Ship safely with CI/CD, automated testing (unit/integration/data quality), and operational playbooks/runbooks
  • Fulltime

Manager, Software Development (Hands-On Technical), ML Network Stack - Annapurna Labs

We are hiring a hands-on Software Development Manager for the team that owns the...
Location: Israel, Tel Aviv
Salary: Not provided
Amazon Pforzheim GmbH
Expiration Date: Until further notice
Requirements
  • 5+ years of engineering team management experience
  • Knowledge of engineering practices and patterns for the full software/hardware/networks development life cycle, including coding standards, code reviews, source control management, build processes, testing, certification, and livesite operations
  • Experience partnering with product or program management teams
  • 3+ years of C or C++ or Rust development experience
  • 5+ years of hands-on engineering experience, maintaining active programming proficiency
Job Responsibility
  • We are hiring a hands-on Software Development Manager for the team that owns the network stack for EC2 distributed AI/ML systems
  • The team develops support for a variety of frameworks and communication libraries including NCCL, NVSHMEM, NIXL, NCCL GIN, Perplexity kernels and others
  • You'll be leading senior, mid-level, and junior SDEs and directing work to ensure the team delivers functions and features required for the latest and largest ML workloads
What we offer
  • Work/Life Balance
  • Mentorship & Career Growth
  • Fulltime

Manager, Software Development (Hands-On Technical), ML Network Stack

We are hiring a hands-on Software Development Manager for the team that owns the...
Location: United States, Cupertino; Seattle
Salary: 184900.00 - 287700.00 USD / Year
Amazon Pforzheim GmbH
Expiration Date: Until further notice
Requirements
  • 3+ years of engineering team management experience
  • Knowledge of engineering practices and patterns for the full software/hardware/networks development life cycle, including coding standards, code reviews, source control management, build processes, testing, certification, and livesite operations
  • Experience partnering with product or program management teams
  • 3+ years of C or C++ or Rust development experience
  • 5+ years of hands-on engineering experience, maintaining active programming proficiency
Job Responsibility
  • Leading senior, mid-level, and junior SDEs and directing work to ensure the team delivers functions and features required for the latest and largest ML workloads
What we offer
  • Health insurance (medical, dental, vision, prescription, Basic Life & AD&D insurance and option for Supplemental life plans, EAP, Mental Health Support, Medical Advice Line, Flexible Spending Accounts, Adoption and Surrogacy Reimbursement coverage)
  • 401(k) matching
  • paid time off
  • parental leave
  • sign-on payments
  • restricted stock units (RSUs)
  • Fulltime

Lead Software Engineer

Location: India, Hyderabad
Salary: Not provided
Wells Fargo
Expiration Date: Until further notice
Requirements
  • 5+ years of Software Engineering experience, or equivalent demonstrated through one or a combination of the following: work experience, training, military experience, education
  • Experience in Software Engineering, SRE, DevOps, or Platform Engineering
  • Strong proficiency in Python for automation and tooling
  • Hands‑on experience with Grafana, Prometheus, and Splunk in production environments
  • Solid understanding of SLIs, SLOs, dashboards, alerting, and observability best practices
  • Experience applying AI/ML concepts to monitoring, alerting, or operational analytics
  • Strong knowledge of Linux, networking, and distributed systems
  • Experience with Cloud platforms and Kubernetes/OpenShift
  • Proven experience leading incidents, RCAs, and reliability initiatives
  • Experience building custom Prometheus exporters or advanced Grafana dashboards
Job Responsibility
  • Lead complex technology initiatives including those that are companywide with broad impact
  • Act as a key participant in developing standards and companywide best practices for engineering complex and large scale technology solutions for technology engineering disciplines
  • Design, code, test, debug, and document for projects and programs
  • Review and analyze complex, large-scale technology solutions for tactical and strategic business objectives, enterprise technological environment, and technical challenges that require in-depth evaluation of multiple factors, including intangibles or unprecedented technical factors
  • Make decisions in developing standard and companywide best practices for engineering and technology solutions requiring understanding of industry best practices and new technologies, influencing and leading technology team to meet deliverables and drive new initiatives
  • Collaborate and consult with key technical experts, senior technology team, and external industry groups to resolve complex technical issues and achieve goals
  • Lead projects, teams, or serve as a peer mentor
  • Own and improve availability, performance, scalability, and resilience of production systems
  • Define, monitor, and manage SLIs/SLOs and error budgets to guide reliability investments
  • Lead capacity planning, performance testing, failover readiness, and disaster‑recovery design
  • Fulltime

Staff II Software Engineer AI/ML Ops

We're looking for a Lead Data Engineer to design, build, and optimize data pipel...
Location: United States, Pleasanton
Salary: 245000.00 - 307000.00 USD / Year
BlackLine
Expiration Date: Until further notice
Requirements
  • Strong programming skills in languages such as Python, Java, or Scala
  • Expertise in ML frameworks (TensorFlow, PyTorch, scikit-learn) and orchestration tools (Airflow, Kubeflow, Vertex AI, MLflow)
  • Proven experience operating production pipelines for ML and LLM-based systems across cloud ecosystems (GCP, AWS, Azure)
  • Deep familiarity with LangChain, LangGraph, ADK or similar agentic system runtime management
  • Strong competencies in CI/CD, IaC, and DevSecOps pipelines integrating testing, compliance, and deployment automation
  • Hands-on experience with observability stacks (Prometheus, Grafana, New Relic) for model and agent performance tracking
  • Understanding of governance frameworks for Responsible AI, auditability, and cost metering across training and inference workloads
  • Proficiency in containerization technologies (e.g., Docker, Kubernetes)
  • Proficiency in scripting languages (e.g., Bash, Python) for automation
  • Experience with workflow orchestration tools (e.g., Apache Airflow)
Job Responsibility
  • Lead data pipeline development: Build and maintain PySpark ETL pipelines with high data quality and performance
  • Manage integrations: Establish robust connections to client data sources via APIs and tools like FiveTran, Plaid, and BlackLine's own internal connector ecosystem
  • Ensure reliability: Monitor pipeline performance, automate testing, and validate data accuracy
  • Optimize for scale: Implement performance improvements (e.g., CDC mechanisms, indexing strategies) for large-scale datasets
  • Collaborate & innovate: Work with business stakeholders to refine data requirements and integrate cutting-edge AI and big data technologies
  • Partner with data science, security, and product teams to set evaluation and governance standards (Guardrails, Bias, Drift, Latency SLAs)
  • Mentor senior engineers and drive design reviews for ML pipelines, model registries, and agentic runtime environments
  • Lead incident response and reliability strategies for ML/AI systems
  • Collaborate with development teams to integrate AI solutions into existing workflows and applications
  • Ensure seamless integration with different platforms and technologies
What we offer
  • Short-term and long-term incentive programs
  • Robust offering of benefit and wellness plans
  • Fulltime

Staff Software Engineer, Risk

The Risk Platform team at Airwallex is responsible for managing the risk for all...
Location: Singapore, Singapore
Salary: Not provided
Airwallex
Expiration Date: Until further notice
Requirements
  • 10+ years of back-end engineering experience, including substantial time owning and operating complex, distributed systems in production
  • Deep experience in risk, fraud, or fintech systems (e.g., payments, banking, trading, or similar high-stakes domains where correctness and reliability are critical)
  • Strong proficiency in Java, including multi-threading, high-concurrency patterns, performance tuning, and networked service design (I/O/NIO, HTTP/TCP, REST)
  • Hands-on experience with distributed system design, including event-driven architectures, partitioning/sharding, consistency models, caching strategies, and resiliency patterns
  • Proficiency with Spring / Spring Boot and build tools such as Gradle or Maven
  • Practical experience with containerization and orchestration, particularly Docker and Kubernetes, in production environments
  • Solid understanding of observability and operations: logging, metrics, tracing, dashboards, and incident management for large-scale systems
  • Ability to lead complex technical initiatives end-to-end, influencing engineers and stakeholders without relying on formal management authority
  • Strong communication skills, with the ability to explain complex technical topics clearly to engineers, product managers, and non-technical stakeholders
  • Bachelor’s degree in Computer Science or a related field, or equivalent practical experience
Job Responsibility
  • Lead the technical direction of core risk services, including real-time detection, decision engines, and risk tooling, ensuring they are reliable, scalable, and cost-efficient under high throughput and low-latency constraints
  • Design and evolve system architecture for distributed, event-driven risk systems (microservices, streaming pipelines, feature stores, model serving layers) that support global products and regulatory requirements
  • Own end-to-end delivery of complex initiatives: from high-level design and technical specifications, to implementation, rollout strategies, and continuous improvement
  • Be deeply hands-on in code and design, writing high-quality production code (primarily in Java/Spring Boot) and driving high-impact design reviews, RFCs, and architecture discussions
  • Partner closely with the Engineering Manager and Product Manager to shape the roadmap, define technical milestones, and translate business and risk objectives into robust engineering solutions
  • Champion engineering excellence by defining and enforcing standards for code quality, observability, reliability, security, and performance across the Risk Platform stack
  • Improve detection effectiveness by working with data and ML teams to integrate models, rules, and features into production systems, and by designing experimentation and evaluation capabilities
  • Drive operational excellence by improving on-call readiness, incident response, postmortems, SLIs/SLOs, capacity planning, and resilience patterns (graceful degradation, fallbacks, retries, timeouts, backpressure)
  • Mentor and uplevel other engineers, including senior ICs: provide technical coaching, pair programming, design guidance, and feedback that helps them do the best work of their careers
  • Collaborate across teams (platform, data, SRE, security, and other product engineering teams) to define clear interfaces, SLAs, and ownership boundaries for shared services and infrastructure
  • Fulltime

Senior Data Engineer

Microsoft Cloud Operations + Innovation (CO+I) is the engine that powers Microso...
Location: United States, Redmond
Salary: 119800.00 - 234700.00 USD / Year
Microsoft Corporation
Expiration Date: Until further notice
Requirements
  • Bachelor’s degree in computer science, Math, Software Engineering, Computer Engineering, or related field AND 4+ years’ experience in business analytics, data science, data modeling, or data engineering work
  • OR master’s degree in computer science, Math, Software Engineering, Computer Engineering, or related field and 3+ years’ experience in business analytics, data science, data modeling, or data engineering work
  • Ability to meet Microsoft, customer and/or government security screening requirements
  • 8+ years of experience in data engineering with coding and debugging skills in C#, Python, and/or SQL
  • Deploying solutions in Azure Services & Managing Azure Subscriptions
  • Understanding and knowledge about big data and writing queries with Kusto/KQL
  • Understanding and knowledge about extracting data via REST APIs
  • Strong analytical skills with a systematic and structured approach to software design
  • 5+ years of experience in data science, analytics, or machine learning
  • 4+ years of experience in developing solutions with Microsoft Power Platform, including Power BI, Fabric, Power Automate & M365 Dataverse
Job Responsibility
  • Apply modification techniques to transform raw data into compatible formats for downstream systems
  • Utilize software and computing tools to ensure data quality and completeness
  • Implement code to extract and validate raw data from upstream sources, ensuring accuracy and reliability
  • Write efficient, readable, extensible code from scratch that spans multiple features/solutions
  • Develop technical expertise in proper modeling, coding, and/or debugging techniques such as locating, isolating, and resolving errors and/or defects
  • Leverages technical proficiency of big-data software engineering concepts, such as Hadoop Ecosystem, Apache Spark, continuous integration and continuous delivery (CI/CD), Docker, Delta Lake, MLflow, AML, and representational state transfer (REST) application programming interface (API) consumption/development
  • Acquires data necessary for successful completion of the project plan
  • Proactively detects changes and communicates to senior leaders
  • Develops usable data sets for modeling purposes
  • Contributes to ethics and privacy policies related to collecting and preparing data by providing updates and suggestions around internal best practices
  • Fulltime