Senior Researcher - GPU Performance

Microsoft Corporation

Location:
United States, Redmond

Contract Type:
Not provided

Salary:

119800.00 - 234700.00 USD / Year

Job Description:

Generative AI is transforming how people create, collaborate, and communicate, redefining productivity across Microsoft 365 and for our customers globally. At Microsoft, we run the biggest platform for collaboration and productivity in the world, with hundreds of millions of consumer and enterprise users. Tackling AI efficiency challenges is crucial for delivering these experiences at scale. Within our Microsoft-wide Systems Innovation initiative, we are working to advance efficiency across AI systems, looking at novel designs and optimizations across the AI stack: models, AI frameworks, cloud infrastructure, and hardware.

We are an Applied Research team driving mid- and long-term product innovations. We collaborate closely with multiple research teams and product groups across the globe, who bring a breadth of technical knowledge in cloud systems, machine learning, and software engineering. We communicate our research both internally and externally through academic publications, open-source releases, blog posts, patents, and industry conferences. We also collaborate with academic and industry partners to advance the state of the art and target material product impact that will affect hundreds of millions of customers.

We are looking for a Senior Researcher in GPU Performance (Hardware/Software Codesign) to explore hardware- and kernel-level optimizations that deliver significant efficiency gains for Large Language Models and Generative AI experiences.

Job Responsibility:

  • Design, implement, and optimize GPU kernels for complex computational workloads such as AI inferencing
  • Research and develop novel optimization techniques for generation of GPU kernels
  • Profile and analyze kernel performance using advanced diagnostic tools
  • Generate automated solutions for kernel optimization and tuning
  • Collaborate with other researchers to improve model performance
  • Document optimization strategies and maintain performance benchmarks
  • Contribute to the development of internal GPU computing frameworks

Requirements:

  • Doctorate in relevant field OR equivalent experience
  • 2+ years of experience in GPU architecture, memory hierarchies, parallel computing and algorithm optimization
  • 2+ years of experience in GPU programming, including performance profiling and optimization tools
  • Reliable C++ programming skills
  • Ability to meet Microsoft, customer and/or government security screening requirements

Nice to have:

  • 5+ years of experience in GPU programming and optimization, expert knowledge of CUDA, ROCm, Triton, PTX, CUTLASS, or similar GPU programming frameworks
  • Experience with machine learning frameworks (PyTorch, TensorFlow)
  • Familiarity with compiler optimization techniques and background in auto-tuning and automated code generation
  • Publication record in relevant conferences or journals (MLSys, NeurIPS, ICML, ICLR, AISTATS, ACL, EMNLP, NAACL, ISCA, MICRO, ASPLOS, HPCA, SOSP, OSDI, NSDI, etc.)

Additional Information:

Job Posted:
January 29, 2026

Employment Type:
Full-time
Work Type:
Hybrid work

Similar Jobs for Senior Researcher - GPU Performance

Senior Research Engineer

We are seeking a highly skilled Senior Research Engineer to collaborate closely ...
Location:
United States
Salary:
210000.00 - 309000.00 USD / Year
Assembly
Expiration Date:
Until further notice
Requirements:
  • Strong expertise in the Python ecosystem and major ML frameworks (PyTorch, JAX)
  • Experience with lower-level programming (C++ or Rust preferred)
  • Deep understanding of GPU acceleration (CUDA, profiling, kernel-level optimization)
  • TPU experience is a strong plus
  • Proven ability to accelerate deep learning workloads using compiler frameworks, graph optimizations, and parallelization strategies
  • Solid understanding of the deep learning lifecycle: model design, large-scale training, data processing pipelines, and inference deployment
  • Strong debugging, profiling, and optimization skills in large-scale distributed environments
  • Excellent communication and collaboration skills, with the ability to clearly prioritize and articulate impact-driven technical solutions
Job Responsibility:
  • Investigate and mitigate performance bottlenecks in large-scale distributed training and inference systems
  • Develop and implement both low-level (operator/kernel) and high-level (system/architecture) optimization strategies
  • Translate research models and prototypes into highly optimized, production-ready inference systems
  • Explore and integrate inference compilers such as TensorRT, ONNX Runtime, AWS Neuron and Inferentia, or similar technologies
  • Design, test, and deploy scalable solutions for parallel and distributed workloads on heterogeneous hardware
  • Facilitate knowledge transfer and bidirectional support between Research and Engineering teams, ensuring alignment of priorities and solutions
What we offer:
  • competitive equity grants
  • 100% employer-paid benefits
  • flexibility of being fully remote
  • Fulltime

Senior MLOps Engineer

If you’re passionate about scalability, automated deployment, and well-optimized...
Location:
Romania, Bucharest
Salary:
Not provided
IT Genetics Romania
Expiration Date:
Until further notice
Requirements:
  • University degree, preferably in engineering (software, industrial, mechanical, process) or a related field
  • Over 5 years of experience in MLOps or machine learning engineering, with a focus on deploying and managing deep learning models at scale
  • Strong skills in Python, CI/CD pipelines, and ML frameworks (e.g., PyTorch, TensorFlow, OpenCV) for automating and scaling ML workflows
  • Expertise in monitoring and alert automation for ML workflows, including data pipelines, training processes, and model performance (e.g., Prometheus, Grafana)
  • Familiarity with distributed training techniques, multi-GPU strategies, and hardware optimization for deep learning
  • Strong communication and interpersonal skills
Job Responsibility:
  • Design end-to-end architecture for the automated training of ML models
  • Create data pipelines to build relevant datasets and data annotation flows
  • Monitor ML model performance and data drift
  • Handle versioning, deployment, and integration with the software team
  • Develop and manage CI/CD pipelines for building, testing, and deploying models
  • Apply best practices for model versioning, rollback, and A/B testing to ensure reliable and accurate production releases
  • Set up a robust monitoring system and develop automated alerting solutions to proactively identify issues in data pipelines, model training, validation, and data variation
  • Promote MLOps best practices (Infrastructure as Code, reproducibility, security) and continuously improve internal processes to increase reliability and efficiency
  • Research and implement cutting-edge technologies to improve training efficiency (e.g., distributed training, HPC, multi-GPU strategies) for the research team
  • Explore future MLOps frameworks and GPU-based cloud solutions as part of the scalability roadmap
What we offer:
  • Meal tickets
  • A place where your voice truly matters
  • Performance bonuses
  • A day off on your birthday
  • Private medical subscription
  • Trainings and learning resources
  • Hybrid work model
  • Bookster subscription
  • A friendly, passionate, and solution-oriented team
  • Opportunities to grow or change your role within the company

Customer Support Engineer

As a Customer Support Engineer at a pioneering AI company, you'll be the first l...
Location:
United States, San Francisco
Salary:
180000.00 - 260000.00 USD / Year
Together AI
Expiration Date:
Until further notice
Requirements:
  • 5+ years of experience in a customer-facing technical role with at least 1 year in a support function in AI
  • Strong technical background, with knowledge of AI, ML, GPU technologies and their integration into high-performance computing (HPC) environments
  • Familiarity with infrastructure services (e.g., Kubernetes, SLURM), infrastructure as code solutions (e.g., Ansible), high-performance network fabrics, NFS-based storage management, container infrastructure, and scripting and programming languages
  • Familiarity with operating storage systems in HPC environments such as Vast and Weka
  • Familiarity with inspecting and resolving network-related errors
  • Strong knowledge of Python, TypeScript, and/or JavaScript with testing/debugging experience using curl and Postman-like tools
  • Foundational understanding in the installation, configuration, administration, troubleshooting, and securing of compute clusters
  • Complex technical problem solving and troubleshooting, with a proactive approach to issue resolution
  • Ability to work cross-functionally with teams such as Sales, Engineering, Support, Product and Research to drive customer success
  • Strong sense of ownership and willingness to learn new skills to ensure both team and customer success
Job Responsibility:
  • Engage directly with customers to tackle and resolve complex technical challenges involving our cutting-edge GPU clusters and our inference and fine-tuning services; ensure swift and effective solutions every time
  • Become a product expert in all of our Gen AI solutions, serving as the last line of technical defense before issues are escalated to Engineering and Product teams
  • Collaborate seamlessly across Engineering, Research, and Product teams to address customer concerns; collaborate with senior leaders both internally and externally to ensure the highest levels of customer satisfaction
  • Transform customer insights into action by identifying patterns in support cases and working with Engineering and Go-To-Market teams to drive Together’s roadmap (e.g., future models to support)
  • Maintain detailed documentation of system configurations, procedures, troubleshooting guides, and FAQs to facilitate knowledge sharing with team and customers
  • Be flexible in providing support coverage during holidays, nights and weekends as required by business needs to ensure consistent and reliable service for our customers
What we offer:
  • competitive compensation
  • startup equity
  • health insurance
  • flexibility in terms of remote work
  • Fulltime

Customer Support Engineer

As a Customer Support Engineer at a pioneering AI company, you'll be the first l...
Location:
India
Salary:
Not provided
Together AI
Expiration Date:
Until further notice
Requirements:
  • 5+ years of experience in a customer-facing technical role with at least 1 year in a support function in AI
  • Strong technical background, with knowledge of AI, ML, GPU technologies and their integration into high-performance computing (HPC) environments
  • Familiarity with infrastructure services (e.g., Kubernetes, SLURM), infrastructure as code solutions (e.g., Ansible), high-performance network fabrics, NFS-based storage management, container infrastructure, and scripting and programming languages
  • Familiarity with operating storage systems in HPC environments such as Vast and Weka
  • Familiarity with inspecting and resolving network-related errors
  • Strong knowledge of Python, TypeScript, and/or JavaScript with testing/debugging experience using curl and Postman-like tools
  • Foundational understanding in the installation, configuration, administration, troubleshooting, and securing of compute clusters
  • Complex technical problem solving and troubleshooting, with a proactive approach to issue resolution
  • Ability to work cross-functionally with teams such as Sales, Engineering, Support, Product and Research to drive customer success
  • Strong sense of ownership and willingness to learn new skills to ensure both team and customer success
Job Responsibility:
  • Engage directly with customers to tackle and resolve complex technical challenges involving our cutting-edge GPU clusters and our inference and fine-tuning services; ensure swift and effective solutions every time
  • Become a product expert in all of our Gen AI solutions, serving as the last line of technical defense before issues are escalated to Engineering and Product teams
  • Collaborate seamlessly across Engineering, Research, and Product teams to address customer concerns; collaborate with senior leaders both internally and externally to ensure the highest levels of customer satisfaction
  • Transform customer insights into action by identifying patterns in support cases and working with Engineering and Go-To-Market teams to drive Together’s roadmap (e.g., future models to support)
  • Maintain detailed documentation of system configurations, procedures, troubleshooting guides, and FAQs to facilitate knowledge sharing with team and customers
  • Be flexible in providing support coverage during holidays, nights and weekends as required by business needs to ensure consistent and reliable service for our customers
What we offer:
  • competitive compensation
  • startup equity
  • health insurance
  • flexibility in terms of remote work for the respective hiring region

HPC Principal Federal Technical Consultant

Principal Consultant to join our High-Performance Computing (HPC) team. In this ...
Location:
United States
Salary:
115500.00 - 266000.00 USD / Year
Hewlett Packard Enterprise
Expiration Date:
Until further notice
Requirements:
  • 8+ years of professional experience, with at least 3+ in HPC architecture, systems engineering, or large-scale infrastructure design
  • Advanced degree in Computer Science, Engineering, Physics, or related technical field (or equivalent experience)
  • Proven ability to design and deliver complex, multi-vendor HPC solutions at scale
  • Demonstrated ability to independently complete solution implementations and application design deliverables
  • Must be United States Citizen due to the responsibilities and requirements of the role as this will be supporting a Federal site
  • Top Secret Clearance, TS/SCI with Full Scope Polygraph (FSP)
  • Must be willing to travel as the business dictates
  • Expertise in one or more of the following: parallel computing, MPI/OpenMP, GPU acceleration, workload schedulers (Slurm, Altair PBS Pro, Torque/MOAB, etc.), or large-scale data storage systems (Lustre, GPFS, Ceph)
  • Experience with network boot technologies (PXE or gPXE/Etherboot, etc.)
  • Storage specific knowledge: LVM, RAID, iSCSI, Disk partitioning (GPT, MBR)
Job Responsibility:
  • Lead the technical implementation design and delivery of world class scale HPC solutions, from requirements gathering to implementation
  • Provide architectural guidance on compute, storage, networking, and workload management tailored to customer use cases
  • Configure, deploy, and maintain Linux-based HPC clusters, associated storage, and network infrastructure
  • Work in close collaboration with customers on finalizing and deploying HPC software applications, hosting platforms, and management systems that enable customer research and production workloads
  • Provide technical support and troubleshooting for HPC implementation in secure locations
  • Work on both operational support and strategic HPC projects; actively participate in customer user group environments
  • Evaluate and implement new tools, middleware, and methodologies to improve operations and service delivery
  • Ensure compliance with enterprise IT security and technology controls
  • Act as principal consultant in customer engagements, often leading cross-functional project teams (including customer staff)
What we offer:
  • Health & Wellbeing benefits
  • Personal & Professional Development programs
  • Unconditional Inclusion environment
  • Comprehensive suite of benefits supporting physical, financial and emotional wellbeing
  • Fulltime

Senior Machine Learning Systems Engineer

As a Senior Machine Learning Systems Engineer at Abridge, you’ll play a pivotal ...
Location:
United States, San Francisco
Salary:
221000.00 - 260000.00 USD / Year
Abridge
Expiration Date:
Until further notice
Requirements:
  • Strong experience in building and deploying machine learning models in production environments
  • Deep understanding of container orchestration and distributed systems architecture
  • Expertise in Kubernetes administration, including custom resource definitions, operators, and cluster management
  • Experience developing APIs and managing distributed systems for both batch and real-time workloads
  • Excellent communication skills, with the ability to interface between research and product engineering
Job Responsibility:
  • Design, deploy and maintain scalable Kubernetes clusters for AI model inference and training
  • Develop, optimize, and maintain ML model serving and training infrastructure, ensuring high-performance and low-latency
  • Collaborate with ML and product teams to scale backend infrastructure for AI-driven products, focusing on model deployment, throughput optimization, and compute efficiency
  • Optimize compute-heavy workflows and enhance GPU utilization for ML workloads
  • Build a robust model API orchestration system
  • Collaborate with leadership to define and implement strategies for scaling infrastructure as the company grows, ensuring long-term efficiency and performance
What we offer:
  • Generous Time Off: 14 paid holidays, flexible PTO for salaried employees, and accrued time off for hourly employees
  • Comprehensive Health Plans: Medical, Dental, and Vision coverage for all full-time employees and their families
  • Generous HSA Contribution: If you choose a High Deductible Health Plan, Abridge makes monthly contributions to your HSA
  • Paid Parental Leave: Generous paid parental leave for all full-time employees
  • Family Forming Benefits: Resources and financial support to help you build your family
  • 401(k) Matching: Contribution matching to help invest in your future
  • Personal Device Allowance: Tax free funds for personal device usage
  • Pre-tax Benefits: Access to Flexible Spending Accounts (FSA) and Commuter Benefits
  • Lifestyle Wallet: Monthly contributions for fitness, professional development, coworking, and more
  • Mental Health Support: Dedicated access to therapy and coaching to help you reach your goals
  • Fulltime

Senior ML Infrastructure / ML DevOps Engineer

We are looking for a Senior ML Infrastructure / DevOps Engineer who loves Linux,...
Salary:
Not provided
Pathway
Expiration Date:
Until further notice
Requirements:
  • Former or current Linux / systems / network administrator comfortable living in the shell and debugging at OS and network layers (systemd, filesystems, iptables/security groups, DNS, TLS, routing)
  • 5+ years of experience in DevOps/SRE/Platform/Infrastructure roles running production systems, ideally with high‑performance or ML workloads
  • Deep familiarity with Linux as a daily driver, including shell scripting and configuration of clusters and services
  • Strong experience with workload management, containerization, and orchestration (Slurm, Docker, Kubernetes) in production environments
  • Solid understanding of CI/CD tools and workflows (GitHub Actions, GitLab CI, Jenkins, etc.), including building pipelines from scratch
  • Hands-on cloud infrastructure experience (AWS, GCP, Azure), especially around GPU instances, VPC/networking, storage, and managed ML services (e.g., SageMaker HyperPod, Vertex AI)
  • Proficiency with infrastructure as code (Terraform, CloudFormation, or similar) and a bias toward automation over manual operations
  • Experience with monitoring and logging stacks (Grafana, Prometheus, Loki, CloudWatch, or equivalents)
  • Familiarity with ML pipeline and experiment orchestration tools (MLflow, Kubeflow, Airflow, Metaflow, etc.) and with model/version management
  • Solid programming skills in Python, plus the ability to read and debug code that uses common ML libraries (PyTorch, TensorFlow) even if you are not a full‑time model developer
Job Responsibility:
  • Design, operate, and scale GPU and CPU clusters for ML training and inference (Slurm, Kubernetes, autoscaling, queueing, quota management)
  • Automate infrastructure provisioning and configuration using infrastructure‑as‑code (Terraform, CloudFormation, cluster‑tooling) and configuration management
  • Build and maintain robust ML pipelines (data ingestion, training, evaluation, deployment) with strong guarantees around reproducibility, traceability, and rollback
  • Implement and evolve ML‑centric CI/CD: testing, packaging, deployment of models and services
  • Own monitoring, logging, and alerting across training and serving: GPU/CPU utilization, latency, throughput, failures, and data/model drift (Grafana, Prometheus, Loki, CloudWatch)
  • Work with terabyte‑scale datasets and the associated storage, networking, and performance challenges
  • Partner closely with ML engineers and researchers to productionize their work, translating experimental setups into robust, scalable systems
  • Participate in on‑call rotation for critical ML infrastructure and lead incident response and post‑mortems when things break
What we offer:
  • Intellectually stimulating work environment
  • Be a pioneer: you get to work with realtime data processing & AI
  • Work in one of the hottest AI startups, with exciting career prospects
  • Team members are distributed across the world
  • Responsibilities and ability to make significant contribution to the company’s success
  • Inclusive workplace culture
  • Fulltime

Director, Payment Integrity

The Director, Payment Integrity provides strategic leadership and oversight for ...
Location:
United States, Phoenix
Salary:
Not provided
Blue Cross Blue Shield of Arizona
Expiration Date:
Until further notice
Requirements:
  • 10 years of experience in Payment Integrity, Special Investigation Unit, or healthcare anti-fraud
  • 7+ years of experience in management role
  • 2+ years of experience in Vendor Partner Oversight; contracting and vendor management of external payment vendors and/or audit firms
  • Demonstrated success in developing and executing enterprise-wide strategies
  • Bachelor's Degree in business, statistics, healthcare administration or related field of study
  • Intermediate PC proficiency
  • Intermediate skill in use of office equipment
  • Basic skill in word processing and presentation software
  • Intermediate proficiency in spreadsheet, statistical analysis, query / data mining, and business intelligence software
Job Responsibility:
  • Develop and execute enterprise-wide Payment Integrity strategy aligned with organizational objectives and cost of care targets
  • Lead the design, implementation, and continuous improvement of payment integrity programs, including pre- and post-payment audits, recovery, and investigative functions
  • Establish and maintain governance structures, including cross-functional committees, to oversee payment integrity controls and policy development
  • Represent the organization in internal and external forums, collaborating with industry peers, regulatory bodies, and vendor partners to share best practices and drive innovation
  • Stay abreast of emerging trends, technologies, and regulatory shifts in the healthcare payment integrity space
  • Oversee day-to-day operations, ensuring timely, accurate, and compliant claims payment and recovery activities
  • Direct vendor management, including contracting, performance oversight, and strategic partnerships for payment integrity solutions
  • Drive automation and technology adoption to enhance payment integrity processes and reporting capabilities
  • Lead cost-benefit analyses to determine optimal resource allocation (internal vs. outsourced functions)
  • Set and monitor KPIs, SMART goals, and financial targets for payment integrity initiatives
  • Fulltime