
Software Engineer - Reliability GPU Infrastructure


Luma AI


Location:
United States; United Kingdom, Palo Alto


Contract Type:
Not provided


Salary:

170000.00 - 360000.00 USD / Year

Job Description:

Luma AI is a capital-intensive lab building the future of creative intelligence. This unique position offers the leverage to build systems of immense scale while retaining individual ownership over the architecture and strategy of our infrastructure.

Job Responsibility:

  • Define the technical strategy for the compute substrate
  • Determine how to provision, manage, and scale a multi-cloud and on-premise GPU footprint
  • Bridge the gap between hardware vendors and the software stack
  • Architect a seamless infrastructure mesh that spans multiple cloud providers and bare-metal environments
  • Design the logic that allocates massive compute resources across competing priorities
  • Lead the effort to define the entire stack as code
  • Build rigorous CI/CD and GitOps workflows

Requirements:

  • History of designing complex distributed systems
  • Deep expertise across various infrastructure providers
  • Ability to mentor the team and drive consensus on technical decisions

Additional Information:

Job Posted:
January 13, 2026

Employment Type:
Fulltime
Work Type:
Remote work



Similar Jobs for Software Engineer - Reliability GPU Infrastructure

Senior Software Engineer (Infrastructure)

We're hiring a Senior Software Engineer (Infrastructure) to be a technical drive...
Location:
United States, San Francisco
Salary:
160000.00 - 250000.00 USD / Year
Helpcare AI
Expiration Date
Until further notice
Requirements:
  • You're a scrappy infrastructure generalist who has seen it all
  • You've worked with GPU cloud providers and understand what's needed to build reliable systems on top of them
  • You consider AWS your second home. You're comfortable spinning up new services and building simple repeatable processes for others to leverage
  • You thrive in an ambiguous and fast changing space
  • You bring a senior mindset: you set direction, own decisions, and get things over the finish line
  • You have incredible communication skills and can convey complex technical ideas clearly to both technical and non-technical team members
Job Responsibility:
  • Work across teams to own and extend our GPU infra as well as our traditional cloud infra (AWS)
  • Work closely with our external infrastructure partners to ensure stability and reliability for GPU deployments and GPU availability
  • Empower other engineers to move fast by building amazing developer experiences for setting up new systems
What we offer:
  • flexible work schedule
  • unlimited PTO
  • competitive healthcare
  • gear stipends
Employment Type: Fulltime

Software Engineer - Cloud FinOps & Reliability

This is a foundational engineering position for a technical, data-driven expert ...
Location:
United States, Palo Alto
Salary:
120000.00 - 255000.00 USD / Year
Luma AI
Expiration Date
Until further notice
Requirements:
  • 5+ years of experience in a technical role such as Site Reliability Engineer, DevOps Engineer, Infrastructure Engineer, or a dedicated Cloud Cost Engineer
  • Deep, hands-on expertise with the cost models and optimization levers of at least one major cloud provider (AWS, GCP), and a willingness to learn others
  • Proficient in Python for the purpose of scripting, data analysis, and building automation tooling
  • Strong, foundational understanding of cloud infrastructure, including containerization (Docker, Kubernetes), networking, and storage
  • Not an accountant: you are a systems thinker who is passionate about applying engineering principles to solve financial challenges at scale
  • A tenacious troubleshooter and a data-driven decision-maker who thrives on finding the 'why' behind the numbers
Job Responsibility:
  • Analyze & Optimize: Actively monitor and analyze costs across our entire technical ecosystem—including multi-cloud infrastructure (AWS, GCP, OCI), on-premise clusters, and third-party services—to identify and execute on opportunities for cost optimization. Develop forecasting models to predict future spend and inform our capacity planning
  • Manage & Commit: Develop and actively manage a multi-million dollar portfolio of Reserved Instances (RIs) and Savings Plans to maximize commitment-based discounts across our global GPU and CPU fleets
  • Automate & Build: Apply a software engineering approach to design, build, and maintain custom tools and automation in Python and SQL. Your systems will track, analyze, and report on costs across our entire fleet of providers and services, with a focus on detecting anomalies immediately
  • Partner & Advise: Working closely as an embedded member of the SRE team, you will partner with fellow SREs and research teams to model the cost implications of new models and infrastructure designs, providing expert guidance on cost-performance trade-offs
  • Visualize & Report: Create and manage a centralized observability stack for cloud costs, building dashboards in tools like Grafana to give a real-time, granular view of our financial posture to all stakeholders
Employment Type: Fulltime

Machine Learning Platform / Backend Engineer

We are seeking a Machine Learning Platform/Backend Engineer to design, build, an...
Location:
Serbia; Romania, Belgrade; Timișoara
Salary:
Not provided
Everseen
Expiration Date
Until further notice
Requirements:
  • 4-5+ years of work experience in either ML infrastructure, MLOps, or Platform Engineering
  • Bachelor's degree or equivalent, preferably with a computer science focus
  • Excellent communication and collaboration skills
  • Expert knowledge of Python
  • Experience with CI/CD tools (e.g., GitLab, Jenkins)
  • Hands-on experience with Kubernetes, Docker, and cloud services
  • Understanding of ML training pipelines, data lifecycle, and model serving concepts
  • Familiarity with workflow orchestration tools (e.g., Airflow, Kubeflow, Ray, Vertex AI, Azure ML)
  • A demonstrated understanding of the ML lifecycle, model versioning, and monitoring
  • Experience with ML frameworks (e.g., TensorFlow, PyTorch)
Job Responsibility:
  • Design, build, and maintain scalable infrastructure that empowers data scientists and machine learning engineers
  • Own the design and implementation of the internal ML platform, enabling end-to-end workflow orchestration, resource management, and automation using cloud-native technologies (GCP/Azure)
  • Design and manage Kubernetes-based infrastructure for multi-tenant GPU and CPU workloads with strong isolation, quota control, and monitoring
  • Integrate and extend orchestration tools (Airflow, Kubeflow, Ray, Vertex AI, Azure ML or custom schedulers) to automate data processing, training, and deployment pipelines
  • Develop shared services for model behavior/performance tracking, data/datasets versioning, and artifact management (MLflow, DVC, or custom registries)
  • Build out documentation covering architecture, policies, and operations runbooks
  • Share skills, knowledge, and expertise with members of the data engineering team
  • Foster a culture of collaboration and continuous learning by organizing training sessions, workshops, and knowledge-sharing sessions
  • Collaborate and drive progress with cross-functional teams to design and develop new features and functionalities
  • Ensure that the developed solutions meet project objectives and enhance user experience
Employment Type: Fulltime

Principal Software Engineer, CoreAI

The CoreAI GPU Infrastructure team builds the foundational accelerated compute p...
Location:
United States, Redmond
Salary:
139900.00 - 274800.00 USD / Year
Microsoft Corporation
Expiration Date
Until further notice
Requirements:
  • Bachelor's Degree in Computer Science or related technical field and 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, Python or equivalent experience
  • Proven ability to design and operate large-scale, production infrastructure with high reliability and performance requirements
  • Strong problem-solving skills and the ability to debug complex, cross-layer systems issues
  • Demonstrated technical leadership, including mentoring engineers and driving cross-team architectural alignment
  • Hands-on experience with virtualization and/or container platforms (e.g., VMs, Kubernetes, container runtimes)
  • Strong collaboration and communication skills, with the ability to work across organizational boundaries
Job Responsibility:
  • Design and build GPU accelerated infrastructure for training and inference workloads, spanning bare metal, virtual machines, and containerized environments
  • Develop systems for GPU device management, scheduling, isolation, and sharing (e.g., partial GPU allocation, multi-tenant usage)
  • Build and operate advanced orchestration and resource governance scenarios using platforms such as AKS, Dynamic Resource Allocation (DRA), and related Kubernetes ecosystem capabilities to enable fair sharing, isolation, and efficient utilization of accelerated resources
  • Build and evolve virtualization and container stacks to support modern AI workloads, including secure and confidential compute scenarios
  • Optimize performance, reliability, and utilization across large GPU fleets, including scale-up and scale-out configurations
  • Partner with networking and storage teams to enable high-performance interconnects (e.g., RDMA/InfiniBand class networking) for distributed workloads
  • Drive end-to-end platform features from design through production, including observability, diagnostics, and operational excellence
  • Influence platform architecture and technical direction across teams through design reviews and technical leadership
Employment Type: Fulltime

Staff Software Engineer, GPU Infrastructure (HPC)

The internal infrastructure team is responsible for building world-class infrast...
Location:
Salary:
Not provided
Cohere
Expiration Date
Until further notice
Requirements:
  • Deep expertise in ML/HPC infrastructure: Experience with GPU/TPU clusters, distributed training frameworks (JAX, PyTorch, TensorFlow), and high-performance computing (HPC) environments
  • Kubernetes at scale: Proven ability to deploy, manage, and troubleshoot cloud-native Kubernetes clusters for AI workloads
  • Strong programming skills: Proficiency in Python (for ML tooling) and Go (for systems engineering), with a preference for open-source contributions over reinventing solutions
  • Low-level systems knowledge: Familiarity with Linux internals, RDMA networking, and performance optimization for ML workloads
  • Research collaboration experience: A track record of working closely with AI researchers or ML engineers to solve infrastructure challenges
  • Self-directed problem-solving: The ability to identify bottlenecks, propose solutions, and drive impact in a fast-paced environment
Job Responsibility:
  • Build and scale ML-optimized HPC infrastructure: Deploy and manage Kubernetes-based GPU/TPU superclusters across multiple clouds, ensuring high throughput and low-latency performance for AI workloads
  • Optimize for AI/ML training: Collaborate with cloud providers to fine-tune infrastructure for cost efficiency, reliability, and performance, leveraging technologies like RDMA, NCCL, and high-speed interconnects
  • Troubleshoot and resolve complex issues: Proactively identify and resolve infrastructure bottlenecks, performance degradation, and system failures to ensure minimal disruption to AI/ML workflows
  • Enable researchers with self-service tools: Design intuitive interfaces and workflows that allow researchers to monitor, debug, and optimize their training jobs independently
  • Drive innovation in ML infrastructure: Work closely with AI researchers to understand emerging needs (e.g., JAX, PyTorch, distributed training) and translate them into robust, scalable infrastructure solutions
  • Champion best practices: Advocate for observability, automation, and infrastructure-as-code (IaC) across the organization, ensuring systems are maintainable and resilient
  • Mentorship and collaboration: Share expertise through code reviews, documentation, and cross-team collaboration, fostering a culture of knowledge transfer and engineering excellence
What we offer:
  • An open and inclusive culture and work environment
  • Work closely with a team on the cutting edge of AI research
  • Weekly lunch stipend, in-office lunches & snacks
  • Full health and dental benefits, including a separate budget to take care of your mental health
  • 100% Parental Leave top-up for up to 6 months
  • Personal enrichment benefits towards arts and culture, fitness and well-being, quality time, and workspace improvement
  • Remote-flexible, offices in Toronto, New York, San Francisco, London and Paris, as well as a co-working stipend
  • 6 weeks of vacation (30 working days!)
Employment Type: Fulltime

Principal Software Engineer

In Azure Specialized we are collaboratively working to bring the next generation...
Location:
United States, Multiple Locations
Salary:
139900.00 - 274800.00 USD / Year
Microsoft Corporation
Expiration Date
Until further notice
Requirements:
  • Bachelor's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python, OR equivalent experience
  • Ability to meet Microsoft, customer and/or government security screening requirements is required for this role
  • This position will be required to pass the Microsoft Cloud Background Check upon hire/transfer and every two years thereafter
Job Responsibility:
  • Willing to dive deeply into any level or layer of a problem.
  • Willing to learn emerging technologies, from hardware to software. Evaluate and make recommendations that advance Azure infrastructure for AI and other GPU-based workloads.
  • Leads by example within the team by producing extensible and maintainable code. Optimizes, debugs, refactors, and reuses code to improve performance, maintainability, effectiveness, and return on investment (ROI). Applies metrics to drive the quality and stability of code, as well as appropriate coding patterns and best practices.
  • Maintains communication with key partners across the Microsoft ecosystem of engineers. Acts as a key contact for leadership to ensure alignment with partners' expectations.
  • Considers partner teams across organizations and their end goals for products to drive and achieve desirable user experiences and fitting dynamic needs of partners/customers through product development.
  • Drives identification of dependencies and the development of design documents for a product, application, service, or platform.
  • Creates, implements, optimizes, debugs, refactors, and reuses code to establish and improve performance and maintainability, effectiveness, and return on investment (ROI).
  • Acts as a Designated Responsible Individual (DRI) and guides other engineers by developing and following the playbook; works on call to monitor systems, products, and services for degradation, downtime, or interruptions; alerts stakeholders about status; and initiates actions to restore service for simple and complex problems when appropriate
  • Help to establish innovative infrastructure to manage data at scale, work closely with lead customers and build reliable monitoring pipelines
Employment Type: Fulltime

Member of Technical Staff, Software Engineer

Help build the infrastructure that powers training, evaluation, and data platfor...
Location:
Switzerland, Zürich
Salary:
Not provided
Microsoft Corporation
Expiration Date
Until further notice
Requirements:
  • Strong software engineering background building reliable, scalable production systems (Python preferred)
  • Hands‑on experience supporting large‑scale ML / LLM training, evaluation, or experimentation infrastructure
  • Operating GPU‑heavy workloads in cloud environments using Docker and Kubernetes (scheduling, utilization, isolation)
  • Designing and running data / compute pipelines and orchestration (e.g., Airflow, Argo) with object storage (Azure Blob / S3)
  • Platform reliability and operability: observability, metrics, logging, tracing, alerting (Prometheus, Grafana, OpenTelemetry)
Job Responsibility:
  • Design and build core platform services for scalable training and evaluation, including cluster orchestration, job scheduling, data and compute pipelines, and artifact management
  • Standardize containerized workflows by maintaining Docker images, CI/CD, and runtime configurations, and advocate for best practices in security, reproducibility, and cost efficiency
  • Implement end-to-end observability and operations through metrics, tracing, logging, dashboard development, monitoring, and automated alerts for model training and platform health (using Prometheus, Grafana, OpenTelemetry)
  • Architect and operate services on Azure cloud platforms, managing infrastructure-as-code (Terraform/Helm), secrets, networking, and storage
  • Enhance developer experience by creating tools, CLIs, and portals that simplify job submission, metrics analysis, and experiment management for generalist software engineering and research teams
  • Enforce security and compliance policies for data access, container hardening, and supply-chain integrity, and partner with security and privacy teams to maintain robust practices in multi-tenant environments and secret management
  • Collaborate cross-functionally with data, model, and product teams to align infrastructure roadmaps with training needs, evaluation protocols, and Copilot product goals
Employment Type: Fulltime

Senior Manager, Performance AI/ML Network Deployment Engineering

The Senior Manager, DC GPU Advanced Forward Deployment and Systems Engineering i...
Location:
United States, Santa Clara
Salary:
210400.00 - 315600.00 USD / Year
AMD
Expiration Date
Until further notice
Requirements:
  • Expertise in networking and performance optimization for large-scale AI/ML networks, including network, compute, storage cluster design, modelling, analytics, performance tuning, convergence, scalability improvements
  • Solid, hands-on expertise in at least one of three domains (compute, network, storage) is preferred
  • Experience in working with large customers such as Cloud Service Providers and global enterprise customers
  • Proven leadership in engaging customers with diverse technical disciplines in avenues such as Proof of Concept, Competitive evaluations, Early Field Trials etc
  • Direct experience working with large customers; able to operate with a sense of urgency, own problems, and resolve them
  • Demonstrated leadership in network architecture, hands on experience in RoCEv2 Design, VXLAN-EVPN, BGP, and Lossless Fabrics
  • Proven ability to influence design and technology roadmaps, leveraging a deep understanding of datacenter products and market trends
  • Extensive hands-on Network deployment expertise and proven track record of delivering large projects on time. Cisco, Juniper or Arista experience is preferred
  • Direct, co-development/deployment experience in working with strategic customers/partners in bringing solutions to market
  • Excellent communication skills across engineer, mid-management, and C-level audiences
Job Responsibility:
  • Collaborate with strategic customers on scalable designs involving compute, networking, and storage environments; work with industry partners and internal teams to accelerate the deployment and adoption of various AI/ML models
  • Engage in system-level triage and at-scale debugging of complex issues across hardware, firmware, and software, ensuring rapid resolution and system reliability
  • Drive the ramp of Instinct-based large scale AI datacenter infrastructure based on NPI base platform hardware with ROCm, scaling up to pod and cluster level, leveraging the best in network architecture for AI/ML workloads
  • Enhance tools and methodologies for large-scale deployments to meet customer uptime goals and exceed performance expectations
  • Engage with clients to deeply understand their technical needs, ensuring their satisfaction with tailored solutions that leverage your past experience in strategic customer engagements and architectural wins
  • Provide domain specific knowledge to other groups at AMD, share the lessons learnt to drive continuous improvement
  • Engage with AMD product groups to drive resolution of application and customer issues
  • Develop and present training materials to internal audiences, at customer venues, and at industry conferences