Systems Design Engineer - AI Cluster Software Job at AMD (Austin)

AI Cluster & Data Center Design Engineer

We are seeking a highly skilled systems engineer to architect and design scalabl...

Location

United States, Austin

Salary:

139440.00 - 209160.00 USD / Year

AMD

Expiration Date

Until further notice

Requirements

Experience in HPC, AI infrastructure, or data center systems engineering
Strong understanding of rack and data center power delivery
Knowledge of GPU/CPU architectures, PCIe, UALink, InfiniBand, and Ethernet networking
Familiarity with AI/ML frameworks and workload characteristics
Excellent problem-solving, communication, and documentation skills
Bachelor's or Master's degree in Electrical Engineering, Computer Engineering, Computer Science or related field

Job Responsibility

Design scalable AI/HPC clusters including compute, storage, and networking with specific focus on power delivery
Evaluate and select CPUs, GPUs, accelerators, interconnects, and memory configurations for optimal cluster performance
Design leading-edge power delivery solutions for high-density AI/GPU deployments
Define power budgets, redundancy schemes, and fault tolerance mechanisms
Design network topologies to maximize overall cluster performance
Understand the network performance needs of different types of workloads
Understand advantages and performance trade-offs of network topologies for AI/HPC clusters
Design and optimize storage solutions to maximize AI/HPC cluster performance
Understand advantages and performance trade-offs of cluster storage solutions, e.g. Lustre, Ceph, etc.
Work across multiple organizations with subject matter experts from hardware, software, network, data center, and operations teams to deliver scalable, efficient, and reliable compute infrastructure

AI Systems Engineer – AI Model (Training & Inference)

The AMD AI Group is looking for a Senior Software Development Engineer to own th...

Location

Canada, Markham

Salary:

106400.00 - 159600.00 CAD / Year

AMD

Expiration Date

Until further notice

Requirements

Industry experience shipping production AI/ML infrastructure, with hands-on work spanning both training and inference.
Bachelor's, Master's, or Ph.D. degree in Computer/Software Engineering, Computer Science, or a related technical discipline

Job Responsibility

Enable and optimize large-scale model training (LLMs, VLMs, MoE architectures) on AMD Instinct GPU clusters, ensuring correctness, reproducibility, and competitive throughput.
Build and maintain training infrastructure: job orchestration, distributed checkpointing, data loading pipelines, and storage optimization for multi-thousand GPU clusters on Kubernetes.
Debug and resolve training-specific issues including gradient norm explosions, non-deterministic behavior across GPU generations, and compute-communication overlap in distributed training (FSDP, DeepSpeed, Megatron-LM).
Optimize RCCL collective communication patterns for training workloads, including all-reduce, all-gather, and reduce-scatter across multi-node topologies.
Develop monitoring, alerting, and compliance infrastructure to ensure training cluster health, data security, and SLA adherence at scale.
Design and build end-to-end validation and testing infrastructure using proxy workloads, synthetic benchmarks, and configurable workload generators to systematically validate platform readiness across AMD Instinct GPU generations.
Write and optimize high-performance GPU kernels (GEMM, attention, quantized matmul, GPTQ/AWQ) in HIP, Triton, and MLIR targeting AMD Instinct architectures, with demonstrated ability to outperform open-source baselines.
Drive end-to-end inference enablement on new AMD GPU silicon - be among the first to get frontier models running on each new Instinct generation, creating reproducible guides and reference implementations.
Optimize inference serving frameworks (vLLM, SGLang, TorchServe) for AMD GPUs: batching strategies, KV-cache management, speculative decoding, and continuous batching for production throughput/latency targets.
Develop novel approaches to inference acceleration, including bio-inspired algorithms, SLM-assisted batching, and custom scheduling strategies that exploit AMD hardware characteristics.

Fulltime

Senior Software Engineer - AI Infrastructure (Scheduler) - CoreAI

The AI Platform organization builds the end-to-end Azure AI stack, from the infr...

Location

United States, Redmond

Salary:

119800.00 - 234700.00 USD / Year

Microsoft Corporation

Expiration Date

Until further notice

Requirements

Bachelor's Degree in Computer Science or related technical field AND 4+ years technical engineering experience with coding in languages including, but not limited to, C++, C#, Java, Scala, Rust, Go, TypeScript, OR equivalent experience
Ability to meet Microsoft, customer and/or government security screening requirements is required for this role
These requirements include, but are not limited to, the following specialized security screening: Microsoft Cloud Background Check. This position requires passing the Microsoft Cloud background check upon hire/transfer and every two years thereafter

Job Responsibility

Work on the design and development of the core AI Infrastructure distributed and in-cluster services that support large scale AI training and inferencing
Develop, test, and maintain control plane services written in C#, hosted on Service Fabric or Kubernetes (AKS) clusters
Enhance systems and applications to ensure high stability, efficiency, and maintainability, low latency, and tight cloud security
Provide operational support and DRI (on-call) responsibilities for the service
Develop and foster a deep understanding of the machine learning concepts, use cases, and relevant services used by our customers
Collaborate closely with service engineers, product managers, and internal applied research and data science teams within Microsoft to build better solutions together
Provide vision, expertise, and technical leadership to other team members
Help to grow talent in these areas
Embody our culture and values

Fulltime

AIML Software Engineer, AI for Science

At GSK, we are actively working on building a future in which state-of-the-art s...

Location

United Kingdom; Germany; Switzerland; United States, London; Heidelberg; Zug; Cambridge, Massachusetts

Salary:

136125.00 - 226875.00 USD / Year

GSK

Expiration Date

Until further notice

Requirements

A degree in a quantitative or engineering discipline (e.g., computer science, computational biology, bioinformatics, or engineering), or equivalent work experience as a professional software engineer
Demonstrated advanced programming expertise in Python and in developing and delivering robust, scalable software solutions
Experience with cloud platforms (AWS, GCP, Azure) and cloud-native architectures
Passion for software design and commitment to the development of reusable, scalable, and testable software components
Basic understanding of at least one major deep learning framework (PyTorch, JAX, TensorFlow)
Knowledge of command-line tools and shell scripting
Knowledge of software engineering best practices, including continuous integration (CI) and continuous deployment (CD), containerization, and infrastructure as code
Strong problem-solving and debugging skills, and experience working in cluster settings or cloud-based environments
Fluency in English

Job Responsibility

Design and implement scalable infrastructure and software solutions to support large-scale AI models and agentic systems across the entire software development life cycle
Design and implement sophisticated machine learning and deep learning pipelines that can handle massive amounts of data with optimal resource utilization
Develop and maintain cloud-native architectures that enable seamless deployment and scaling of AI/ML workloads
Deliver robust, tested and high-performance code in an agile environment
Liaise with AI/ML engineers, data scientists, and domain experts to ensure fit-for-purpose infrastructure and data pipelines for cutting-edge scientific projects

What we offer

Competitive base salary
Annual bonus based on company performance
Flexible working options available for most roles
Learning and career development
Access to healthcare & wellbeing programmes
Employee recognition programmes
Health care and other insurance benefits (for employee and family)
Retirement benefits
Paid holidays
Vacation

Fulltime

Senior Software Engineer - AI and Data Governance

At GEICO, we offer a rewarding career where your ambitions are met with endless ...

Location

United States, Palo Alto

Salary:

100000.00 - 215000.00 USD / Year

GEICO

Expiration Date

Until further notice

Requirements

Advanced knowledge of at least one modern OOP language such as Go, Python, or Java
Advanced knowledge of web technologies such as HTML, CSS, and JavaScript is preferred
Understanding of open-source databases such as MySQL and PostgreSQL, and familiarity with NoSQL databases such as Cassandra, MongoDB, and Elasticsearch
Experience architecting, designing, and building automation, workflows, custom objects/apps, declarative functionality, triggers, and migration tools on the BMC Helix platform, and transitioning such a platform to open source, is a big plus
Experience building and configuring flows and process builders
Strong understanding of web service integration (gRPC/REST) and enterprise middleware integration tiers
Ability to articulate channel dataflow and process flow, including email, messaging, chat, mobile push, and SDKs
Excellent communication skills, including the ability to lead projects from the front and interact with clients and sponsors on a regular basis
Experience partnering with engineering teams and transferring research to production
Experience with continuous integration and continuous delivery (CI/CD) and Infrastructure as Code

Job Responsibility

Collaborate with product managers, team members, customers, and other engineering teams to solve our toughest problems
Develop and execute technical software development strategy for the Platform Engineering domain including Service Management, Business Continuity, Recovery, Incident Response and Paging platforms
Accountable for the quality, usability, and performance of the solutions
Deep hands-on experience in complex system design, data pipelines and architectures, and scale and performance tuning, with good knowledge of Docker and Kubernetes
Consistently share best practices and improve processes within and across teams
Willingness to take on on-call and operational support duties
Experience designing recommendation systems, ranking, personalization, similarity search and embeddings
Experience with NLP, LLMs and RAG, as well as translating natural language into graph or data queries
Experience designing scalable AI systems and data pipelines

What we offer

Comprehensive Total Rewards program that offers personalized coverage tailor-made for you and your family’s overall well-being
Financial benefits including market-competitive compensation; a 401(k) savings plan vested from day one that offers a 6% match; performance and recognition-based incentives; and tuition assistance
Access to additional benefits like mental healthcare as well as fertility and adoption assistance
Workplace flexibility, including our GEICO Flex program, which offers the ability to work from anywhere in the US for up to four weeks per year

Fulltime

Staff Software Engineer, Search & Distributed Systems

We are looking for a Staff Software Engineer who would thrive on being accountab...

Location

USA, Buffalo

Salary:

165000.00 - 260000.00 USD / Year

ACV Auctions

Expiration Date

Until further notice

Requirements

8+ years of software engineering experience, with at least 3 years operating at a Senior or Staff level focusing on distributed systems and high-throughput platforms.
Deep, authoritative knowledge of Elasticsearch internals. You have managed large-scale clusters and deeply understand mapping, analysis, query optimization, cluster state management, and split-brain mitigation.
Proficiency in the systems upstream and downstream of Search. You have hands-on experience with Kubernetes (EKS/GKE), API Gateway/BFF architectures, and event streams (Kafka).
A proven track record of implementing fault-tolerant patterns (retries, rate limiting, circuit breaking, dead letter queues) in microservice architectures.
Expert-level ability to instrument systems and diagnose complex performance issues using modern observability stacks (Datadog, Prometheus, Grafana, OpenTelemetry).
Strong communication skills with a proven ability to influence cross-functional teams, build consensus around architectural decisions (the Knoster model!), and mentor mid-level and senior engineers.

Job Responsibility

Architect for Scale: Design, configure, and scale our Elasticsearch clusters. You will define our global strategies for shard routing, Index Lifecycle Management (ILM), heap tuning, and data tiering to support massive auction throughput.
Master the Failure Modes: Anticipate and engineer away points of failure. You will design circuit breakers, implement backpressure mechanisms, and tune asymmetric timeouts to prevent retry storms between our BFFs, K8s services, and the Search layer.
Expert Troubleshooting & IR: Act as the ultimate technical escalation point for complex, cross-system performance degradation. You will dive deep into JVM metrics, Garbage Collection pauses, K8s network bottlenecks, and slow logs to uncover and remediate root causes.
Holistic System Ownership: Manage the entire data lifecycle. You will optimize the ingestion pipelines syncing our event datastreams driven by producers and consumers (Kafka) to Elasticsearch, ensuring eventual consistency and data integrity at scale.
Drive Engineering Excellence: Draft authoritative architectural Blueprints, SOPs, and Runbooks. You will elevate the surrounding engineering culture by coaching teams on distributed systems design, observability best practices, and incident management.
Modernize & Innovate: Scan the horizon for emerging technologies. You will help evaluate and integrate next-generation search capabilities (e.g., Vector Search, RAG architectures) to support our broader AI and machine learning initiatives.

What we offer

Multiple medical plans, including a high-deductible, low-cost health plan
Company-sponsored (paid) Short-Term Disability, Long-Term Disability, and Life Insurance
Comprehensive optional benefits such as Dental, Vision, Supplemental Life/AD&D, Legal/ID Protection, and Accident and Critical Illness Insurance
Generous paid time off options, including uncapped vacation days, the greater of 3 paid sick days or in accordance with the applicable state or local paid sick leave law, 6 paid company holidays, 2 floating holidays, parental leave, bereavement leave, jury duty leave, voting leave, and other forms of paid leave as required by applicable law or regulation
Employee Stock Purchase Program with additional opportunities to earn stock in the Company
Retirement planning through the Company's 401(k)

Fulltime

Senior Software Engineer II - Backend - AI Search

AI is one of the fastest growing product areas in Seismic. We believe that AI, p...

Location

India, Hyderabad

Salary:

Not provided

Seismic

Expiration Date

Until further notice

Requirements

7+ years of experience in software engineering and a proven track record of building and scaling microservices and working with data retrieval systems
5+ years of experience with C# and .NET, unit testing, object-oriented programming, and web services
3+ years of experience with Python, with the ability to work concurrently on Python and .NET repositories
3+ years of experience with Redis, including expertise in managing large-scale Redis clusters
2+ years of experience with PostgreSQL, including maintenance and performance tuning
Proficient in Test Driven Development (TDD) with hands-on experience using xUnit and Postman to develop automation test scripts
Experience with Infrastructure as Code (Terraform, Pulumi, etc.)
Experience with event-driven architectures using tools like Kafka
Experienced in container technologies such as Docker and proficient in container orchestration platforms like Kubernetes (K8s)
Experienced in Continuous Integration and Continuous Deployment (CI/CD) with expertise in developing Jenkins pipelines using Scala

Job Responsibility

Design, develop, and maintain backend systems and services for search functionality, ensuring high performance, scalability, and reliability
Implement and optimize search and AI-driven semantic algorithms, indexing, and information retrieval techniques to enhance search accuracy and efficiency
Collaborate with data scientists, AI engineers, and product teams to integrate AI-driven search capabilities across the Seismic platform
Monitor and optimize search performance, addressing bottlenecks and ensuring low-latency query responses
Provide technical guidance and mentorship to junior engineers, promoting best practices in search backend development
Work closely with cross-functional and geographically distributed teams, including product managers, frontend engineers, and UX designers, to deliver seamless and intuitive search experiences
Stay updated with the latest trends and advancements in search technologies, conducting research and experimentation to drive innovation

Fulltime

Network Engineer, AI Infrastructure Repair

Meta is building the next generation of AI infrastructure to power large-scale m...

Location

United States, Sarpy County, NE

Salary:

193000.00 - 271000.00 USD / Year
