Fleet Engineering Debug

United States, Redmond · 119800.00 - 234700.00 USD / Year · Job Posted April 11, 2026

Job Description

Microsoft Silicon, Cloud Hardware, and Infrastructure Engineering (SCHIE) is the team behind Microsoft’s expanding cloud infrastructure and is responsible for powering Microsoft’s “Intelligent Cloud” mission. SCHIE delivers the core infrastructure and foundational technologies for more than 200 Microsoft online businesses, including Bing, MSN, Office 365, Xbox Live, Teams, OneDrive, and the Microsoft Azure platform, globally through our server and data center infrastructure, security and compliance, operations, globalization, and manageability solutions. Our focus is on smart growth, high efficiency, and delivering a trusted experience to customers and partners worldwide, and we are looking for a Fleet Engineering Debug to help achieve that mission.

As Microsoft's cloud business continues to grow, the ability to deploy new offerings and hardware infrastructure on time, in high volume, with high quality, and at the lowest cost is of paramount importance. To achieve this goal, the Hardware, Infrastructure Management, and Fundamentals Engineering (HIFE) team is instrumental in defining and delivering operational measures of success for hardware manufacturing and in improving the planning process, quality, delivery, scale, and sustainability of Microsoft cloud hardware. We are looking for a Fleet Engineering Debug with a commitment to customer-focused solutions and the insight and industry knowledge to envision and implement future technical solutions that will manage and optimize the cloud infrastructure.

Job Responsibility

  • Execute system-level, end-to-end debug solutions for at-scale datacenter systems
  • Lead collaboration projects with hardware, firmware, and software teams that drive root cause analysis
  • Be accountable for the successful execution of targeted system-level root cause analysis and defect reduction projects
  • Provide technical recommendations on diagnostics or debug deployment technologies
  • Lead debug of complex problems based on technical and business understanding
  • Develop innovative, scalable debug methodologies, test strategies, and test routines for data center solutions
  • Solve problems relating to essential services and build automation to drive debug efficiency
  • Communicate effectively with partners and stakeholders, using data, on the planning and progress of initiatives

Requirements

  • Master's Degree in Electrical Engineering, Computer Engineering, Mechanical Engineering, or related field AND 3+ years technical engineering experience OR Bachelor's Degree in Electrical Engineering, Computer Engineering, Mechanical Engineering, or related field AND 5+ years technical engineering experience OR equivalent experience
  • Ability to meet Microsoft, customer, and/or government security screening requirements is required for this role. These requirements include, but are not limited to, the following specialized security screening: Microsoft Cloud Background Check. This position will be required to pass the Microsoft Cloud Background Check upon hire/transfer and every two years thereafter.

Nice to have

  • 7+ years of technical leadership experience as a platform, software, or validation architect, a lead debug engineer, or in an equivalent industry leadership position
  • In-depth understanding of modern computer architectures or System on Chip features such as reliability, availability, and serviceability (RAS) features, virtualization technologies, or major architectural blocks such as Memory Controllers, Central Processing Units, Storage, or Networking solutions for cloud or datacenter infrastructures
  • Ability to lead in-depth technical reviews of software solutions used in at-scale environments or datacenter infrastructure, cloud native operating systems, or virtualization technologies
  • Platform or software level debug and validation experience
  • Software and data analysis skills

Similar Jobs for Fleet Engineering Debug

8 matching positions

Debug Program Manager

This role serves as the debug execution backbone of AMD’s Customer Program Manag...
Location: United States, Austin
Salary: 162640.00 - 243960.00 USD / Year
Company: AMD
Expiration Date: Until further notice
Requirements
  • 12+ years of experience in the semiconductor industry
  • Deep hands-on experience with silicon debug (pre‑silicon and post‑silicon)
  • Strong background in product engineering, validation, failure analysis, or customer engineering
  • Proven experience managing complex debug programs across multiple customer segments
  • Strong program management skills with ability to drive execution across global, cross-functional teams
  • Excellent written and verbal communication skills, including executive-level engagement
  • Bachelor’s degree in Electrical Engineering, Computer Engineering, Computer Science, or related field required
Job Responsibility
  • Debug Program Leadership - Lead debug execution across hyperscale, OEM, HPC, and enterprise customer programs. Own high‑impact, cross‑customer and systemic issues and maintain visibility into top risks and trends.
  • Customer Program Integration - Partner with Customer Program Managers to align debug execution with customer deliverables, platform readiness, and deployment schedules. Support escalations and executive‑level customer engagements.
  • Technical Debug Coordination - Drive cross‑functional debug efforts across design, validation, product engineering, and failure analysis. Align pre‑ and post‑silicon debug strategies and connect lab debug to real‑world customer environments.
  • Field Failure & Fleet Quality Management - Lead resolution of field failures, fleet anomalies, and data center reliability issues. Aggregate fleet, RMA, and production signals and feed learnings back into design, validation, and manufacturing.
  • Governance & Process Improvement - Own debug tracking, prioritization, risk management, and executive reporting. Apply structured methodologies (8D, CAPA, FMEA) and drive continuous improvement in execution speed and consistency.
Employment type: Full-time

Senior C++ Engineer - Satellite Real-Time Control Systems

The Mission of the Senior C++ Engineer - Satellite Real-Time Control Systems ICE...
Location: Finland, Helsinki
Salary: Not provided
Company: ICEYE
Expiration Date: Until further notice
Requirements
  • You love writing modern C++ and know what production-quality code looks like
  • Proven track record of shipping real-time control software for autonomous or safety-critical systems—satellites, drones, robotics, automotive…
  • Understand hard real-time constraints, latency budgeting and deterministic behaviour
  • Comfortable interfacing with sensors, actuators and embedded Linux environments
  • Champion of good engineering practice: rigorous testing at all levels, CI/CD, clear documentation
  • Ownership through full software lifecycle—from whiteboard concepts to on-orbit maintenance
  • Clear communicator who enjoys solving problems with colleagues across disciplines
Job Responsibility
  • Write and optimize real-time C++ code that meets strict determinism and latency budgets needed for safe and precise on-orbit execution
  • Build & own the software layer that bridges sensors, actuators and control algorithms - deterministic loops, telemetry pipelines and on-orbit autonomy
  • Drive quality through full development lifecycle: requirements → design → code → HIL/MIL testing → launch → on-orbit support
  • Collaborate with GNC, electronics, ground-segment and mission-ops engineers to debug, iterate and improve performance
  • Lead architecture evolution as our fleet and use-cases grow—refactor, optimise and introduce new technologies where they add value
  • Investigate anomalies: deep-dive into flight telemetry, reproduce issues on ground and roll out fixes that keep the constellation healthy
What we offer
  • Occupational healthcare, occupational and private insurance
  • Yearly benefit budget to spend on sport, transport, wellness, lunch, etc.
  • Phone subscription with iPhone of choice
  • Relocation support (flight tickets, accommodation, relocation agency support)
  • Time and resources for self-development, research, training, conferences, and certification schemes
  • Inspiring office environment with collaborative spaces and silent workspaces
  • Access to state-of-the-art labs and testing facilities
  • Opportunities to attend international space conferences
Employment type: Full-time

Staff Software Engineer, Technical Lead

Provide hands-on technical leadership for the core software platform that powers...
Location: United States, San Francisco
Salary: 140000.00 - 240000.00 USD / Year
Company: Chef Robotics
Expiration Date: Until further notice
Requirements
  • 8+ years of software engineering with 3+ years in technical leadership roles
  • Track record delivering production robotic systems, IoT devices, or autonomous systems at scale
  • Experience designing reliable systems for B2B/enterprise deployments
  • On-device platform expertise: OS configuration, device drivers, system services, networking stack configuration
  • Robotics middleware: ROS/ROS2, real-time systems, sensor integration
  • Infrastructure: Containerization (Docker/K8s), CI/CD pipelines, monitoring/observability
Job Responsibility
  • Define and evolve the architecture for on-robot software, including OS configuration, hardware abstraction, middleware, and system services
  • Lead middleware architecture decisions for real-time robot control, sensor integration, and inter-process communication
  • Establish patterns for full-stack development, connecting on-robot systems to cloud services and web interfaces
  • Write production code for high-impact features across the stack: robotics middleware, backend services, and cloud infrastructure
  • Lead critical technical initiatives, including robotic platform software, cloud data pipelines, and fleet management platform
  • Build robust deployment, monitoring, and OTA update systems for production robot fleets
  • Debug the most challenging issues from kernel/driver level through the application layer
  • Establish engineering standards and processes that balance rigor with startup agility
  • Champion reliability, observability, and testing practices across embedded and cloud systems
  • Mentor engineers through code reviews, design discussions, and pairing sessions
What we offer
  • Medical, dental, and vision insurance
  • Commuter benefits
  • Flexible paid time off (PTO)
  • Catered lunch
  • 401(k) matching
Employment type: Full-time

Systems Engineer, Diagnostics

As a Systems Engineer on the Diagnostics Engineering team, you will lead efforts...
Location: United States, Palo Alto
Salary: 110000.00 - 240000.00 USD / Year
Company: 1X Technologies
Expiration Date: Until further notice
Requirements
  • Degree in engineering or a related field, or equivalent experience
  • Hands-on experience debugging complex subsystems involving microprocessors and software-controlled electrical or electromechanical devices
  • Ability to read and interpret C++, Python, and similar embedded systems languages
  • Proficient in data visualization techniques and tools
  • Experience with Linux, Git, command line tools, and standard diagnostic equipment (e.g., oscilloscopes, multimeters, log analyzers)
  • Experience designing and building mechanical test fixtures, diagnostic tools, or custom hardware
  • Solid fundamentals in electrical and embedded systems troubleshooting
  • Experience supporting hardware bring-up, calibration, or production validation workflows
  • Familiarity with ROS2, behavior trees, or motion-planning stacks
Job Responsibility
  • Diagnose mechanical, electrical, software, and controls failures, and document root causes
  • Debug complex mechanical failures using engineering fundamentals such as drawings, mechanisms, tolerances, fits, inspection, and measurement techniques
  • Use electrical and system-level instrumentation to investigate faults in robots and sub-components
  • Develop test tools and scripts in C++ and Python to support diagnostics, integration, and data analysis
  • Create clear reporting from diagnostic and fleet data to drive decision-making and track improvements
  • Collaborate with prototyping and design engineers to iterate on hardware changes based on diagnostic findings
  • Communicate design recommendations effectively to hardware, electrical, and controls teams
What we offer
  • Health, dental, and vision insurance
  • 401(k) with company match
  • Paid time off and holidays
Employment type: Full-time

Senior Software Engineer, CoreAI Workload Engines

The CoreAI Workloads team builds the foundational inference engines and APIs tha...
Location: United States, Redmond
Salary: 119800.00 USD / Year
Company: Microsoft Corporation
Expiration Date: Until further notice
Requirements
  • Bachelor's Degree in Computer Science or related technical field and 4+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, Python, or equivalent experience.
  • Proven ability to design and operate large-scale, production inference services with high reliability and performance requirements, and to ship performance improvements safely via disciplined experimentation.
  • Strong skills in performance analysis: benchmarking, profiling, diagnosing regressions, and turning results into concrete engine/runtime changes.
  • Strong problem-solving skills and the ability to debug complex, cross-layer system issues.
  • Demonstrated technical leadership, including mentoring engineers, driving cross-team architectural alignment, and leveraging AI tools and AI-assisted workflows to accelerate engineering velocity and quality.
  • Hands-on experience with Kubernetes (building and operating services on k8s), including debugging production issues and designing platform abstractions (e.g., custom resources/controllers) and scheduling-aware deployments (e.g., node affinity, taints/tolerations, resource requests/limits).
  • Strong collaboration and communication skills, with the ability to work across organizational boundaries.
Job Responsibility
  • Optimize inference engines for OpenAI and open-source models by implementing and shipping performance/efficiency improvements across runtime, scheduling, and serving paths (latency, throughput, utilization, availability, and cost).
  • Run experiments end-to-end: formulate hypotheses, implement engine changes (including Python/PyTorch integration points where relevant), analyze results, and ship improvements behind guardrails.
  • Build and use experimentation capabilities for large-scale AI inference (experiment lifecycle, tracking, metric modeling, comparability standards, automated analysis) so the team can iterate quickly and safely.
  • Own serving availability and efficiency for Azure OpenAI Service workloads through tiered experimentation, lean segmentation, and multi-modal utilization across heterogeneous fleets—turning findings into shipped engine improvements.
  • Design and evolve inference serving architectures to improve utilization and latency using techniques such as disaggregated serving, multi-token prediction, KV offload/retrieval, and quantization—validated via staged rollouts and production guardrails.
  • Extend AI infrastructure abstractions to support elastic, heterogeneous inference engines reliably at scale (e.g., dynamic scaling across model families, modalities, and workload classes while maintaining isolation and SLOs).
  • Tune and scale inference engines across NVIDIA GPU generations (A100, H100, H200) for state-of-the-art OpenAI models, focusing on serving efficiency, utilization, and reliability (not hardware bring-up).
  • Partner with networking and storage teams to leverage high-performance interconnects (e.g., RDMA/InfiniBand-class fabrics such as RoCE over IB) for distributed inference, without owning low-level kernel/driver enablement.
  • Drive end-to-end features from design through production: observability, diagnostics, performance regression detection, and operational excellence for inference serving.
  • Influence platform architecture and technical direction across teams through design reviews, clear metrics, and technical leadership focused on experimentation velocity and production reliability.
What we offer
  • Benefits and other compensation
Employment type: Full-time

Principal Software Engineer, CoreAI Workload Engines

The CoreAI Workloads team builds the foundational inference engines and APIs tha...
Location: United States, Redmond
Salary: 139900.00 - 331200.00 USD / Year
Company: Microsoft Corporation
Expiration Date: Until further notice
Requirements
  • Bachelor's Degree in Computer Science or related technical field and 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, Python, or equivalent experience
  • Proven ability to design and operate large-scale, production inference services with high reliability and performance requirements, and to ship performance improvements safely via disciplined experimentation
  • Strong skills in performance analysis: benchmarking, profiling, diagnosing regressions, and turning results into concrete engine/runtime changes
  • Strong problem-solving skills and the ability to debug complex, cross-layer system issues
  • Demonstrated technical leadership, including mentoring engineers, driving cross-team architectural alignment, and leveraging AI tools and AI-assisted workflows to accelerate engineering velocity and quality
  • Hands-on experience with Kubernetes (building and operating services on k8s), including debugging production issues and designing platform abstractions (e.g., custom resources/controllers) and scheduling-aware deployments (e.g., node affinity, taints/tolerations, resource requests/limits)
  • Strong collaboration and communication skills, with the ability to work across organizational boundaries
Job Responsibility
  • Optimize inference engines for OpenAI and open-source models by implementing and shipping performance/efficiency improvements across runtime, scheduling, and serving paths (latency, throughput, utilization, availability, and cost)
  • Run experiments end-to-end: formulate hypotheses, implement engine changes (including Python/PyTorch integration points where relevant), analyze results, and ship improvements behind guardrails
  • Build and use experimentation capabilities for large-scale AI inference (experiment lifecycle, tracking, metric modeling, comparability standards, automated analysis) so the team can iterate quickly and safely
  • Own serving availability and efficiency for Azure OpenAI Service workloads through tiered experimentation, lean segmentation, and multi-modal utilization across heterogeneous fleets—turning findings into shipped engine improvements
  • Design and evolve inference serving architectures to improve utilization and latency using techniques such as disaggregated serving, multi-token prediction, KV offload/retrieval, and quantization—validated via staged rollouts and production guardrails
  • Extend AI infrastructure abstractions to support elastic, heterogeneous inference engines reliably at scale (e.g., dynamic scaling across model families, modalities, and workload classes while maintaining isolation and SLOs)
  • Tune and scale inference engines across NVIDIA GPU generations (A100, H100, H200) for state-of-the-art OpenAI models, focusing on serving efficiency, utilization, and reliability (not hardware bring-up)
  • Partner with networking and storage teams to leverage high-performance interconnects (e.g., RDMA/InfiniBand-class fabrics such as RoCE over IB) for distributed inference, without owning low-level kernel/driver enablement
  • Drive end-to-end features from design through production: observability, diagnostics, performance regression detection, and operational excellence for inference serving
  • Influence platform architecture and technical direction across teams through design reviews, clear metrics, and technical leadership focused on experimentation velocity and production reliability
Employment type: Full-time

ASIC Engineer Intern - Infra Silicon Enablement

Meta is seeking an ASIC Engineering Intern to join our Release to Production Eng...
Location: India, Bangalore
Salary: Not provided
Company: Meta
Expiration Date: Until further notice
Requirements
  • Currently has, or is in the process of obtaining, a Bachelor's degree in Electrical Engineering, Computer Engineering or related engineering fields
  • Completed Coursework in Computer architecture and/or Electrical engineering
  • Experience with troubleshooting, debug and analytics for Silicon products
  • Experience in Linux, Python, C/C++ and/or similar languages (data structures, algorithms, and OOP)
  • Must obtain work authorization in the country of employment at the time of hire and maintain ongoing work authorization during employment
  • Intent to return to a degree-program after the completion of the internship/co-op
Job Responsibility
  • Work across all aspects of silicon lifecycle to deliver reliable and performant silicon solutions - from early architecture and design inputs, pre-silicon validation, bring-up and post-silicon characterization and deployment in fleet
  • Create/develop validation and automation tool sets targeted at silicon validation and productization - inclusive of, but not limited to silicon diagnostics, performance analysis, debug tools, bare metal and full stack systems, from early labs to data center deployments
  • Understand production system use cases to improve validation
  • Provide feedback into next generation architecture and design with insights from the production fleet
  • Root-cause, resolve and remediate issues with silicon across the product lifecycle

Software Systems Engineer

Meta is seeking a Software Systems Engineer to join our Production Systems Engin...
Location: United States, Bellevue, WA +2 locations
Salary: 144000.00 - 204000.00 USD / Year
Company: Meta
Expiration Date: Until further notice
Requirements
  • Bachelor's degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience
  • 6+ years of experience in systems software engineering, including development in C, C++, or Python for Linux-based production environments
  • 6+ years of experience with large-scale infrastructure systems, including hardware lifecycle management, fleet automation, or data center operations software
  • Experience developing or integrating with low-level systems components such as kernel interfaces, BMC/IPMI/Redfish management stacks, or hardware telemetry frameworks
  • Experience designing and operating distributed systems software at scale, including monitoring, alerting, and automated remediation pipelines
  • Experience communicating technical decisions and system designs through written documentation and cross-functional stakeholder alignment
Job Responsibility
  • Design and develop systems software for managing, provisioning, and monitoring large-scale production hardware infrastructure including compute, storage, and networking components
  • Build and maintain tooling for hardware lifecycle management, fleet health monitoring, and automated remediation of production system failures
  • Collaborate with hardware engineering teams to define software interfaces and firmware integration requirements for new server and accelerator platforms
  • Develop and optimize low-level systems software including kernel modules, device drivers, and platform management agents to improve hardware utilization and reliability
  • Architect scalable infrastructure automation frameworks that reduce manual operational toil and accelerate hardware deployment across Meta's data center fleet
  • Identify and resolve systemic reliability and performance issues across production hardware by analyzing telemetry, failure patterns, and system-level diagnostics
  • Define technical direction for production systems software components, driving alignment across infrastructure engineering and data center operations stakeholders
  • Mentor other engineers on systems software design patterns, debugging methodologies, and production infrastructure best practices
  • Lead cross-functional efforts to evaluate and integrate new hardware technologies into the production environment, including bring-up, validation, and qualification workflows
What we offer
  • Bonus
  • Equity
  • Benefits
Employment type: Full-time