Member of Technical Staff, AI Networking

Microsoft Corporation

Location:
United States, Mountain View

Contract Type:
Not provided

Salary:

139900.00 - 274800.00 USD / Year

Job Description:

Microsoft AI is hiring a Member of Technical Staff, AI Networking to design and scale the world’s most advanced high-performance networks powering Copilot and next-generation AI systems. Join the team building the fabric that connects frontier-class datacenters, enables multi-gigawatt AI supercomputers, and supports the training of the most sophisticated AI models on the planet.

Job Responsibility:

  • Advanced RoCE transport design, congestion control, ECN/WRED/DCTCP tuning
  • Fabric architecture, topology planning, network modeling, and scaling strategy
  • Telemetry, observability, reliability engineering, and automated troubleshooting
  • Develop and tune the deployment of novel routing techniques to achieve reliability in large networks
  • Work with world-class network designers such as NVIDIA, Broadcom, and in-house silicon/network co-design teams
  • AI training + inference cluster bring-up, performance benchmarking, and root-cause analysis
  • Gather data and insights to develop the pretraining compute roadmap
  • Find a path to get things done despite roadblocks to get your work into the hands of users quickly and iteratively
  • Enjoy working in a fast-paced, design-driven, product development cycle
  • Embody our Culture and Values

Requirements:

  • Bachelor's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
  • Master's Degree in Computer Science or related technical field AND 8+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR Bachelor's Degree in Computer Science or related technical field AND 12+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience

Additional Information:

Job Posted:
April 01, 2026

Employment Type:
Full-time
Work Type:
On-site work

Similar Jobs for Member of Technical Staff, AI Networking

Member of Technical Staff, Cloud Infrastructure

As a Software Engineer on our Cloud Infrastructure team, you'll be at the forefr...
Location:
United States, New York, NY; San Mateo, CA; Redwood City, CA
Salary:
175000.00 - 220000.00 USD / Year
Fireworks AI
Expiration Date:
Until further notice
Requirements:
  • Bachelor’s degree in Computer Science, Engineering, or a related technical field (or equivalent practical experience)
  • 5+ years of experience designing and building backend infrastructure in cloud environments (e.g., AWS, GCP, Azure)
  • Proven experience in ML infrastructure and tooling (e.g., PyTorch, TensorFlow, Vertex AI, SageMaker, Kubernetes, etc.)
  • Strong software development skills in languages like Python or C++
  • Deep understanding of distributed systems fundamentals: scheduling, orchestration, storage, networking, and compute optimization
Job Responsibility:
  • Architect and build scalable, resilient, and high-performance backend infrastructure to support distributed training, inference, and data processing pipelines
  • Lead technical design discussions, mentor other engineers, and establish best practices for building and operating large-scale ML infrastructure
  • Design and implement core backend services (e.g., job schedulers, resource managers, autoscalers, model serving layers) with a focus on efficiency and low latency
  • Drive infrastructure optimization initiatives, including compute cost reduction, storage lifecycle management, and network performance tuning
  • Collaborate cross-functionally with ML, DevOps, and product teams to translate research and product needs into robust infrastructure solutions
  • Continuously evaluate and integrate cloud-native and open-source technologies (e.g., Kubernetes, Ray, Kubeflow, MLFlow) to enhance our platform’s capabilities and reliability
  • Own end-to-end systems from design to deployment and observability, with a strong emphasis on reliability, fault tolerance, and operational excellence
What we offer:
  • Meaningful equity in a fast-growing startup
  • Competitive salary
  • Comprehensive benefits package
Employment Type:
Full-time

Member of Technical Staff, Capacity & Efficiency Infrastructure

Microsoft AI is looking for a Member of Technical Staff – Capacity & Efficiency ...
Location:
United States, Mountain View
Salary:
119800.00 - 234700.00 USD / Year
Microsoft Corporation
Expiration Date:
Until further notice
Requirements:
  • Bachelor’s Degree in Computer Science, or related technical discipline AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
  • Deep understanding of the fundamentals of GPU architectures and DL/LLM architectures
  • Deep experience in profiling and analyzing performance in large-scale distributed computing systems
  • Deep experience in profiling and analyzing performance in ML models, especially GenAI models
  • Experience with low-level GPU programming (CUDA, Triton, NCCL) and frameworks such as PyTorch or JAX
  • Experience in leading technical projects and supporting architectural decisions with data
  • Experience building infrastructure for large-scale machine learning or generative AI workloads
  • Experience in networking (InfiniBand, NVLink), storage systems, or distributed training parallelisms
  • Track record of contributing to high-performance computing or large-scale AI infrastructure projects
Job Responsibility:
  • Design, implement, test, and optimize distributed training infrastructure in Python and C++ for large-scale GPU clusters
  • Build and evolve telemetry systems to provide visibility into infrastructure & ML model performance, utilization, and cost related metrics
  • Profile, benchmark, and debug performance bottlenecks across compute, memory, networking, and storage subsystems
  • Drive architectural improvements across various ML services which deliver measurable efficiency improvements
  • Build and evolve tools to automatically provide insights and recommendations to improve fleet-wide efficiency
  • Optimize collective communication libraries (e.g., NCCL) for emerging NVLink and InfiniBand topologies
  • Partner with ML researchers and infrastructure engineers to understand their plans and future needs and develop plans to balance growth with efficiency
  • Collaborate with hardware teams to optimize for next-generation accelerators (NVIDIA, MAIA, and beyond)
  • Embody our Culture and Values
Employment Type:
Full-time

Member of Technical Staff, Pre-Training Infrastructure

Microsoft AI is looking for a Member of Technical Staff, Pre-Training Infrastruc...
Location:
United States, Mountain View
Salary:
139900.00 - 274800.00 USD / Year
Microsoft Corporation
Expiration Date:
Until further notice
Requirements:
  • Bachelor's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python
  • OR equivalent experience
  • Experience in distributed computing and large-scale systems
  • Experience with GPU programming (CUDA, NCCL) and frameworks such as PyTorch
  • Proven ability to profile, benchmark, and optimize performance-critical systems
  • Experience in leading technical projects and supporting architectural decisions with data
  • Experience building infrastructure for large-scale machine learning or generative AI workloads
  • Experience in networking (InfiniBand, NVLink), storage systems, or distributed training parallelisms
  • Track record of contributing to high-performance computing or large-scale AI infrastructure projects
Job Responsibility:
  • Design, implement, test, and optimize distributed training infrastructure in Python and C++ for large-scale GPU clusters
  • Profile, benchmark, and debug performance bottlenecks across compute, memory, networking, and storage subsystems
  • Optimize collective communication libraries (e.g., NCCL) for emerging NVLink and InfiniBand topologies
  • Collaborate with hardware teams to optimize for next-generation accelerators (NVIDIA, AMD, and beyond)
  • Gather data and insights to develop the pretraining compute roadmap
  • Care deeply about conversational AI and its deployment
  • Actively contribute to the development of AI models powering our innovative products
  • Find solutions to overcome roadblocks and deliver your work to users quickly and iteratively
  • Enjoy working in a fast-paced, design-driven product development cycle
  • Embody our Culture and Values
Employment Type:
Full-time

Member of Technical Staff - GPU Infrastructure

Prime Intellect is building the open superintelligence stack - from frontier age...
Location:
United States, San Francisco
Salary:
Not provided
Prime Intellect
Expiration Date:
Until further notice
Requirements:
  • 3+ years hands-on experience with GPU clusters and HPC environments
  • Deep expertise with SLURM and Kubernetes in production GPU settings
  • Proven experience with InfiniBand configuration and troubleshooting
  • Strong understanding of NVIDIA GPU architecture, CUDA ecosystem, and driver stack
  • Experience with infrastructure automation tools (Ansible, Terraform)
  • Proficiency in Python, Bash, and systems programming
  • Track record of customer-facing technical leadership
  • NVIDIA driver installation and troubleshooting (CUDA, Fabric Manager, DCGM)
  • Container runtime configuration for GPUs (Docker, Containerd, Enroot)
  • Linux kernel tuning and performance optimization
Job Responsibility:
  • Partner with clients to understand workload requirements and design optimal GPU cluster architectures
  • Create technical proposals and capacity planning for clusters ranging from 100 to 10,000+ GPUs
  • Develop deployment strategies for LLM training, inference, and HPC workloads
  • Present architectural recommendations to technical and executive stakeholders
  • Deploy and configure orchestration systems including SLURM and Kubernetes for distributed workloads
  • Implement high-performance networking with InfiniBand, RoCE, and NVLink interconnects
  • Optimize GPU utilization, memory management, and inter-node communication
  • Configure parallel filesystems (Lustre, BeeGFS, GPFS) for optimal I/O performance
  • Tune system performance from kernel parameters to CUDA configurations
  • Serve as primary technical escalation point for customer infrastructure issues
Employment Type:
Full-time

Technical Specialist (Teaching Technologist)

Recognised as a technical specialist in Immersive technology, this post will con...
Location:
India, Mumbai
Salary:
Not provided
Emeritus
Expiration Date:
Until further notice
Requirements:
  • Current specialist knowledge and refined skills in immersive technologies, interactive and immersive media production
  • Significant experience in tools for creating virtual reality and augmented reality, building virtual environments and interactions
  • Knowledge of using game engines (Unity or Unreal) to develop for immersive projects
  • Knowledge of 360 video and 3D graphics for VR
  • 3D Animation/Modelling and design (Maya, Blender)
  • Thorough knowledge of digital/interactive media software (e.g. DaVinci Resolve, FCP, Premiere, Touch Designer, Stornoway), and AI tools for XR creation
  • Knowledge of analogue and digital sound and spatial audio for VR and Immersive projects
  • Knowledge of Motion Capture, 3D scanning and photogrammetry
  • Detailed, current knowledge, skills and experience of XR hardware, such as Meta Quest VR headsets, mobile handsets, including system setup, troubleshooting, and maintenance, platforms like SteamVR and Oculus SDK for managing VR hardware
  • Expertise with specialist equipment like 360 cameras and ambisonic microphones
Job Responsibility:
  • Apply professional expertise in the design of stimulating learning solutions
  • Provide specialist guidance and support to students, staff or other technical staff in the utilisation of virtual reality, augmented reality, mixed reality, and motion capture environments, development tools, and related technologies, to enable excellent teaching and learning for all types of students
  • Use specialist experience to provide services, guidance, troubleshooting, support and training to students conducting creative immersive projects or research studies
  • Provide workshops and project support across a range of units, with particular guidance as students develop blended or virtual performance and immersive pieces or need to submit sketches and show the process of design work
  • Proactively engage in academic and pastoral student support procedures, providing direct support and/or making referrals to the wider University support network as appropriate
  • Collaborate with Unit and Programme Directors in the design and delivery of teaching within labs or other teaching space(s)
  • Work with key stakeholders (e.g. Unit and Programme Directors) to provide specialist solutions in the planning, design and delivery of practical teaching
  • Deliver teaching and develop physical and digital teaching resources
  • Respond independently using initiative and judgement to find solutions and take action
  • Maintain awareness of developments and changes in immersive technologies and media
Employment Type:
Full-time

Member of Technical Staff - Data Engineer

As Microsoft continues to push the boundaries of AI, we are on the lookout for i...
Location:
United States, New York
Salary:
139900.00 - 274800.00 USD / Year
Microsoft Corporation
Expiration Date:
Until further notice
Requirements:
  • Bachelor's Degree in Computer Science, Math, Software Engineering, Computer Engineering, or related field AND 6+ years experience in business analytics, data science, software development, data modeling or data engineering work
  • OR Master's Degree in Computer Science, Math, Software Engineering, Computer Engineering, or related field AND 4+ years experience in business analytics, data science, software development, or data engineering work
  • OR equivalent experience
  • 4+ years technical engineering experience building data processing applications (batch and streaming) with coding in languages including, but not limited to, Python, Java, Spark, SQL
  • Experience working with the Apache Hadoop ecosystem, Kafka, NoSQL, etc.
  • 3+ years experience with data governance, data compliance and/or data security
  • 2+ years' experience building scalable services on top of public cloud infrastructure like Azure, AWS, or GCP. Extensive use of datastores like RDBMS, key-value stores, etc.
  • 2+ years' experience building distributed systems at scale and extensive systems knowledge that spans bare-metal hosts to containers to networking
  • Ability to identify, analyze, and resolve complex technical issues, ensuring optimal performance, scalability, and user experience
  • Dedication to writing clean, maintainable, and well-documented code with a focus on application quality, performance, and security
Job Responsibility:
  • Build scalable data pipelines for sourcing, transforming and publishing data assets for AI use cases
  • Work collaboratively with other platform, infrastructure, and application engineers, as well as AI researchers, to build next-generation data platform products and services
  • Ship high-quality, well-tested, secure, and maintainable code
  • Find a path to get things done despite roadblocks to get your work into the hands of users quickly and iteratively
  • Enjoy working in a fast-paced, design-driven, product development cycle
  • Embody our Culture and Values
Employment Type:
Full-time

Member of Technical Staff, Hardware Health

Microsoft AI operates one of the world’s most advanced AI training infrastructur...
Location:
United States, Mountain View
Salary:
139900.00 - 274800.00 USD / Year
Microsoft Corporation
Expiration Date:
Until further notice
Requirements:
  • Bachelor's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience.
  • Master's Degree in Computer Science or related technical field AND 8+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR Bachelor's Degree in Computer Science or related technical field AND 12+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience.
  • Experience working with large-scale HPC or GPU systems (NVIDIA H100/GB200 or equivalent).
  • Deep understanding of GPU architecture, high-speed interconnects (NVLink, InfiniBand, RoCE), and large datacenter topologies.
  • Proficiency in hardware telemetry, diagnostics, or failure analysis tools.
  • Experience with exascale-class systems or cloud-scale AI clusters.
  • Familiarity with reliability modeling, machine learning-based anomaly detection, or predictive maintenance.
  • Contributions to large-scale infrastructure operations, supercomputing centers, or AI hardware design.
Job Responsibility:
  • Design and develop next-generation hardware health monitoring and diagnostic frameworks for large GPU clusters (NVL16/NVL72/GB200+ scale).
  • Build predictive analytics pipelines leveraging telemetry, power, and thermal data to anticipate hardware degradation and systemic issues.
  • Collaborate with silicon, firmware, and datacenter engineers to identify root causes and remediate large-scale hardware anomalies.
  • Define system health KPIs (e.g., NIS/RIS, MTBF, failure domain analysis) and integrate them into real-time observability platforms.
  • Lead incident triage for high-impact GPU, network, and cooling issues across distributed clusters.
  • Drive automation in health management to reduce manual intervention to the top 5% of anomalies.
  • Partner with cross-functional teams to influence hardware design for reliability, thermal efficiency, and serviceability.
Employment Type:
Full-time

Member of Technical Staff - Backend Engineer

Microsoft AI is looking for a talented Backend engineer to help build the next w...
Location:
United States, Redmond
Salary:
119800.00 - 234700.00 USD / Year
Microsoft Corporation
Expiration Date:
Until further notice
Requirements:
  • Bachelor's Degree in Computer Science or related technical field AND 4+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
  • 4+ years' experience building backend APIs for mobile apps (e.g., GraphQL, REST, Protobuf, Thrift) and streaming protocols (e.g., WebSocket, SSE, WebRTC), with familiarity with backend and mobile data schema code generation or consistency, version control for mobile releases, analytics, feature flags, and A/B testing frameworks
  • 4+ years' experience building scalable services on top of public cloud infrastructure like Azure, AWS, or GCP. Extensive use of datastores like RDBMS, key-value stores, etc.
  • 4+ years' experience building distributed systems at scale and extensive systems knowledge that spans bare-metal hosts to containers to networking
Job Responsibility:
  • Build secure and performant APIs that power Copilot apps
  • Work collaboratively with other product engineers, Product Managers, and platform engineers to take ambiguous projects and mold them into amazing experiences
  • Ship high-quality, well-tested, secure, and maintainable code
  • Find a path to get things done despite roadblocks to get your work into the hands of users quickly and iteratively
  • Enjoy working in a fast-paced, design-driven, product development cycle
  • Embody our Culture and Values
Employment Type:
Full-time