Member of Technical Staff, AI Networking Job at Microsoft Corporation (Mountain View)

Member of Technical Staff, Microsoft Robotics (Field Robotics)

Microsoft’s Discovery and Quantum (MDQ) division develops and delivers advanced ...

Location

United States, Redmond

Salary:

102100.00 - 202200.00 USD / Year

Microsoft Corporation

Expiration Date

Until further notice

Requirements

Bachelor's Degree in Electrical Engineering, Computer Engineering, Mechanical Engineering, or related field AND 2+ years technical engineering experience
OR Master's Degree in Electrical Engineering, Computer Engineering, Mechanical Engineering, or related field
OR equivalent experience
Ability to meet Microsoft, customer and/or government security screening requirements
Microsoft Cloud Background Check

Job Responsibility

Plan, coordinate, and execute field test campaigns for robotic systems across diverse real-world environments, including customer sites, partner facilities, and first-party scenarios, ranging from indoor semi-structured settings to unstructured outdoor terrains, that replicate operational conditions
Develop and maintain structured field test protocols, acceptance criteria, data collection procedures, and reporting templates that ensure repeatable, measurable evaluation of robot hardware, software, and AI model performance
Deploy, configure, and operate robotic platforms in the field, including hardware setup, software provisioning, network configuration, sensor calibration, and integration with site-specific infrastructure
Collect, organize, and analyze field test data including sensor logs, telemetry streams, performance metrics, video recordings, and environmental measurements, producing clear and actionable test reports for engineering and program leadership
Diagnose and troubleshoot field issues across the full robot stack (mechanical, electrical, software, networking, AI behavior), performing initial root cause analysis and resolving complex issues alongside engineering teams with well-documented reproduction steps
Collaborate with hardware engineers, software engineers, AI researchers, and program managers to communicate and integrate field findings, prioritize bug fixes and design improvements, and validate engineering changes through follow-up field testing
Identify and provide feedback on process gaps in field deployment and test procedures, sharing best practices broadly across the team and contributing to continuous improvement of field operations readiness
Maintain detailed documentation around field deployment procedures, site-specific configurations, inventory management, and service support workflows
Participate in test triage and review meetings to share knowledge, actively contribute to faster issue resolution, and identify readiness needs for upcoming field campaigns
Assist in the development of end-to-end readiness programs, including mentoring, knowledge sharing, technical document creation, and development of field operations training materials

What we offer

Certain roles may be eligible for benefits and other compensation

Fulltime

Member of Technical Staff, Microsoft Robotics (Robot Security & Safety)

Microsoft's Discovery and Quantum (MDQ) division develops and delivers advanced ...

Location

United States, Redmond

Salary:

119800.00 - 234700.00 USD / Year

Microsoft Corporation

Expiration Date

Until further notice

Requirements

Bachelor's Degree in Computer Science or related technical field AND 4+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience.
Ability to meet Microsoft, customer, and/or government security screening requirements is required for this role. These requirements include, but are not limited to, the following specialized security screening: Microsoft Cloud Background Check. This position will be required to pass the Microsoft Cloud Background Check upon hire/transfer and every two years thereafter.

Job Responsibility

Design and implement end-to-end security and safety architectures for robotic systems, spanning the full hardware-software robot stack, including device-level security (secure boot, firmware integrity, encrypted communications), platform-level security (identity, access control, certificate management), and cloud-to-edge security for robot fleet management.
Develop and maintain safety analyses and threat models (FMEA, FTA, HAZOP, STPA) specific to robotic systems operating in human-populated environments, identifying risks from cyber-attack vectors, AI behavior failures, hardware faults, and environmental uncertainties.
Define and enforce safety envelopes and runtime safety monitors for autonomous robot behaviors, including collision avoidance, force limiting, workspace boundaries, and graceful degradation under sensor or actuator failure conditions.
Analyze key security metrics, KPIs, and telemetry data to identify trends in security posture and safety incidents, implementing mitigation strategies and driving continuous improvement of the platform's security and safety posture.
Design and scale solutions to address identified security control issues (network, identity, applications) and current threats specific to robotics deployments, anticipating and articulating risks to leadership.
Develop and implement incident response processes for robotics-specific security and safety events, including physical safety incidents, AI behavior anomalies, fleet compromise scenarios, and supply chain integrity concerns.
Collaborate with robotics engineers, AI researchers, and platform engineers to embed security and safety requirements into the software development lifecycle, including secure coding standards, security testing in CI/CD pipelines, and safety validation in simulation.
Contribute to the development and implementation of security policies and standards for robotic systems, aligning with industry frameworks (NIST, MITRE ATT&CK for ICS, IEC 61508, ISO 13482, ISO 10218, RIA standards) and Microsoft's security requirements.
Engage with regulatory bodies, industry standards organizations, and Microsoft internal AI safety and security communities (e.g., Microsoft's Office of Responsible AI), to stay current on evolving safety and security requirements for autonomous and AI-enabled physical systems.
Leverage automation and AI to improve effectiveness of security operations, including automated vulnerability scanning, anomaly detection in robot telemetry, and AI-assisted threat hunting across the robotics fleet.

Fulltime

Member of Technical Staff, Microsoft Robotics (Hardware Systems)

Microsoft’s Discovery and Quantum (MDQ) division develops and delivers advanced ...

Location

United States, Redmond

Salary:

102100.00 - 202200.00 USD / Year

Microsoft Corporation

Expiration Date

Until further notice

Requirements

Bachelor's Degree in Electrical Engineering, Computer Engineering, Mechanical Engineering, or related field AND 2+ years technical engineering experience
OR Master's Degree in Electrical Engineering, Computer Engineering, Mechanical Engineering, or related field
OR equivalent experience
Ability to meet Microsoft, customer, and/or government security screening requirements is required for this role
Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud Background Check upon hire/transfer and every two years thereafter.

Job Responsibility

Support the design, assembly, integration, and maintenance of robotic hardware platforms, including manipulators, mobile bases, sensor arrays, compute enclosures, power systems, and end-effector tooling
Ingest device specification sheets, electrical schematics, and mechanical drawings into AI systems to support and accelerate configuration, calibration, and troubleshooting of robotic hardware components and subsystems
Perform hardware-software integration for robotic platforms, including sensor bring-up (cameras, LiDAR, force/torque sensors, tactile arrays), compute module configuration, network provisioning, and firmware updates
Develop and execute hardware verification and validation test plans, including functional testing, environmental stress testing, endurance testing, and safety compliance verification for robotic subsystems
Create and maintain design documentation including assembly drawings, wiring diagrams, bills of materials, specifications, and calibration procedures for robotic hardware configurations
Set up and maintain robotics lab environments, including workstations, test fixtures, safety infrastructure, tool inventories, and environmental controls, following established safety guidelines
Identify common project risks (e.g., supplier delays, incomplete specifications, component obsolescence) and develop mitigation plans to keep hardware integration timelines on track
Gather information and participate in make-versus-buy decisions based on complexity, cost, quality, reliability, and schedule impact for robotic hardware components and subsystems
Develop prototype components and assemblies for validation of new robotic capabilities, working closely with software and AI teams to ensure hardware meets requirements for AI model training and evaluation
Communicate project progress and technical status within the project team, including hardware readiness, integration milestones, and issue escalations, providing clear and timely updates to engineering and program leadership.

What we offer

Certain roles may be eligible for benefits and other compensation.

Fulltime

Member of Technical Staff, Software Engineer

Help build the infrastructure that powers training, evaluation, and data platfor...

Location

Switzerland, Zürich

Salary:

Not provided

Microsoft Corporation

Expiration Date

Until further notice

Requirements

Strong software engineering background building reliable, scalable production systems (Python preferred)
Hands‑on experience supporting large‑scale ML / LLM training, evaluation, or experimentation infrastructure
Operating GPU‑heavy workloads in cloud environments using Docker and Kubernetes (scheduling, utilization, isolation)
Designing and running data / compute pipelines and orchestration (e.g., Airflow, Argo) with object storage (Azure Blob / S3)
Platform reliability and operability: observability, metrics, logging, tracing, alerting (Prometheus, Grafana, OpenTelemetry)

Job Responsibility

Design and build core platform services for scalable training and evaluation, including cluster orchestration, job scheduling, data and compute pipelines, and artifact management
Standardize containerized workflows by maintaining Docker images, CI/CD, and runtime configurations, and advocate for best practices in security, reproducibility, and cost efficiency
Implement end-to-end observability and operations through metrics, tracing, logging, dashboard development, monitoring, and automated alerts for model training and platform health (using Prometheus, Grafana, OpenTelemetry)
Architect and operate services on Azure cloud platforms, managing infrastructure-as-code (Terraform/Helm), secrets, networking, and storage
Enhance developer experience by creating tools, CLIs, and portals that simplify job submission, metrics analysis, and experiment management for generalist software engineering and research teams
Enforce security and compliance policies for data access, container hardening, and supply-chain integrity, and partner with security and privacy teams to maintain robust practices in multi-tenant environments and secret management
Collaborate cross-functionally with data, model, and product teams to align infrastructure roadmaps with training needs, evaluation protocols, and Copilot product goals

Fulltime

Member of Technical Staff - Backend Engineer

Microsoft AI is looking for a talented Backend engineer to help build the next w...

Location

United States, Redmond

Salary:

119800.00 - 234700.00 USD / Year

Microsoft Corporation

Expiration Date

Until further notice

Requirements

Bachelor's Degree in Computer Science or related technical field AND 4+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
4+ years' experience building backend APIs for mobile apps (e.g., GraphQL, REST, Protobuf, Thrift) and streaming protocols (e.g., WebSocket, SSE, WebRTC), with familiarity with backend and mobile data schema code generation or consistency, version control for mobile releases, analytics, feature flags, and A/B testing frameworks
4+ years' experience building scalable services on top of public cloud infrastructure such as Azure, AWS, or GCP, with extensive use of datastores such as RDBMS and key-value stores
4+ years' experience building distributed systems at scale and extensive systems knowledge that spans bare-metal hosts to containers to networking

Job Responsibility

Build secure and performant APIs that power Copilot apps
Work collaboratively with other product engineers, Product Managers, and platform engineers to take ambiguous projects and mold them into amazing experiences
Ship high-quality, well-tested, secure, and maintainable code
Find a path to get things done despite roadblocks to get your work into the hands of users quickly and iteratively
Enjoy working in a fast-paced, design-driven, product development cycle
Embody our Culture and Values

Fulltime

Member of Technical Staff, Hardware Health

Microsoft AI operates one of the world’s most advanced AI training infrastructur...

Location

United States, Mountain View

Salary:

139900.00 - 274800.00 USD / Year

Microsoft Corporation

Expiration Date

Until further notice

Requirements

Bachelor's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience.
Master's Degree in Computer Science or related technical field AND 8+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR Bachelor's Degree in Computer Science or related technical field AND 12+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience.
Experience working with large-scale HPC or GPU systems (NVIDIA H100/GB200 or equivalent).
Deep understanding of GPU architecture, high-speed interconnects (NVLink, InfiniBand, RoCE), and large datacenter topologies.
Proficiency in hardware telemetry, diagnostics, or failure analysis tools.
Experience with exascale-class systems or cloud-scale AI clusters.
Familiarity with reliability modeling, machine learning-based anomaly detection, or predictive maintenance.
Contributions to large-scale infrastructure operations, supercomputing centers, or AI hardware design.

Job Responsibility

Design and develop next-generation hardware health monitoring and diagnostic frameworks for large GPU clusters (NVL16/NVL72/GB200+ scale).
Build predictive analytics pipelines leveraging telemetry, power, and thermal data to anticipate hardware degradation and systemic issues.
Collaborate with silicon, firmware, and datacenter engineers to identify root causes and remediate large-scale hardware anomalies.
Define system health KPIs (e.g., NIS/RIS, MTBF, failure domain analysis) and integrate them into real-time observability platforms.
Lead incident triage for high-impact GPU, network, and cooling issues across distributed clusters.
Drive automation in health management to reduce manual intervention to the top 5% of anomalies.
Partner with cross-functional teams to influence hardware design for reliability, thermal efficiency, and serviceability.

Fulltime

Member of Technical Staff, Pre-Training Infrastructure

Microsoft AI is looking for a Member of Technical Staff, Pre-Training Infrastruc...

Location

United States, Mountain View

Salary:

139900.00 - 274800.00 USD / Year

Microsoft Corporation

Expiration Date

Until further notice

Requirements

Bachelor's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python
OR equivalent experience
Experience in distributed computing and large-scale systems
Experience with GPU programming (CUDA, NCCL) and frameworks such as PyTorch
Proven ability to profile, benchmark, and optimize performance-critical systems
Experience in leading technical projects and supporting architectural decisions with data
Experience building infrastructure for large-scale machine learning or generative AI workloads
Experience in networking (InfiniBand, NVLink), storage systems, or distributed training parallelisms
Track record of contributing to high-performance computing or large-scale AI infrastructure projects

Job Responsibility

Design, implement, test, and optimize distributed training infrastructure in Python and C++ for large-scale GPU clusters
Profile, benchmark, and debug performance bottlenecks across compute, memory, networking, and storage subsystems
Optimize collective communication libraries (e.g., NCCL) for emerging NVLink and InfiniBand topologies
Collaborate with hardware teams to optimize for next-generation accelerators (NVIDIA, AMD, and beyond)
Gather data and insights to develop the pretraining compute roadmap
Care deeply about conversational AI and its deployment
Actively contribute to the development of AI models powering our innovative products
Find solutions to overcome roadblocks and deliver your work to users quickly and iteratively
Enjoy working in a fast-paced, design-driven product development cycle
Embody our Culture and Values

Fulltime

Member of Technical Staff, Capacity & Efficiency Infrastructure

Microsoft AI is looking for a Member of Technical Staff – Capacity & Efficiency ...

Location

United States, Mountain View

Salary:

119800.00 - 234700.00 USD / Year

Microsoft Corporation

Expiration Date

Until further notice

Requirements

Bachelor’s Degree in Computer Science or related technical discipline AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
Deep understanding of the fundamentals of GPU architectures and DL/LLM architectures
Deep experience in profiling and analyzing performance in large-scale distributed computing systems
Deep experience in profiling and analyzing the performance of ML models, especially GenAI models
Experience with low-level GPU programming (CUDA, Triton, NCCL) and frameworks such as PyTorch or JAX
Experience in leading technical projects and supporting architectural decisions with data
Experience building infrastructure for large-scale machine learning or generative AI workloads
Experience in networking (InfiniBand, NVLink), storage systems, or distributed training parallelisms
Track record of contributing to high-performance computing or large-scale AI infrastructure projects

Job Responsibility

Design, implement, test, and optimize distributed training infrastructure in Python and C++ for large-scale GPU clusters
Build and evolve telemetry systems to provide visibility into infrastructure and ML model performance, utilization, and cost-related metrics
Profile, benchmark, and debug performance bottlenecks across compute, memory, networking, and storage subsystems
Drive architectural improvements across ML services that deliver measurable efficiency gains
Build and evolve tools to automatically provide insights and recommendations to improve fleet-wide efficiency
Optimize collective communication libraries (e.g., NCCL) for emerging NVLink and InfiniBand topologies
Partner with ML researchers and infrastructure engineers to understand their plans and future needs and develop plans to balance growth with efficiency
Collaborate with hardware teams to optimize for next-generation accelerators (NVIDIA, MAIA, and beyond)
Embody our Culture and Values

Fulltime
