Member of Technical Staff, Compute Orchestration & Scheduling

Microsoft Corporation

Location:
United States, Mountain View

Contract Type:
Not provided

Salary:

139900.00 - 274800.00 USD / Year

Job Description:

Microsoft AI is looking for a Member of Technical Staff, Compute Orchestration & Scheduling to help build the next wave of capabilities of our personalized AI assistant, Copilot. We’re looking for someone who will bring an abundance of positive energy, empathy, and kindness to the team every day, in addition to being highly effective. The right candidate enjoys building world-class consumer experiences and products in a fast-paced environment.

You will actively contribute to the development of the AI models that power our innovative products. You will wear multiple hats and work on engineering, research, and everything in between. Your contributions will span model architecture, data curation, training and inference infrastructure, evaluation protocols, alignment and reinforcement learning from human feedback (RLHF), and many other topics at the cutting edge of AI.

Microsoft AI is building foundational models to develop novel, responsible, and efficient artificial general intelligence. These models require large compute capacity. As a Member of Technical Staff, Compute Orchestration & Scheduling, you would be responsible for designing and building our compute orchestration and scheduling layer on top of Kubernetes and Ray, working on everything from workload placement and scaling to reliability and developer experience. You’ll work closely with research and framework teams to turn their requirements into scalable abstractions, improve cluster efficiency, and ensure our compute platform is observable and easy to operate in production. As a contributing member of the core group of engineers, you would also bring best practices to the table, drive architectural changes, and influence the roadmap of relevant software and hardware components. Your work will directly impact the business goals of a wide range of users and facilitate the next wave of growth and innovation in AI.

Job Responsibility:

  • Develop and tune scalable pretraining software for Nvidia GB200 NVL72 CX8 and AMD MIxxx architectures
  • Benchmark GB200 and AMD MIxxx GPU clusters
  • Gather data and insights to develop the pretraining compute roadmap
  • Care deeply about conversational AI and its deployment
  • Actively contribute to the development of AI models that are powering our innovative products
  • Find a path to get things done despite roadblocks to get your work into the hands of users quickly and iteratively
  • Enjoy working in a fast-paced, design-driven, product development cycle
  • Embody our Culture and Values

Requirements:

  • Bachelor's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python, OR equivalent experience
  • Master's Degree in Computer Science or related technical field AND 8+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python, OR Bachelor's Degree in Computer Science or related technical field AND 12+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python, OR equivalent experience

Additional Information:

Job Posted:
April 01, 2026

Employment Type:
Full-time
Work Type:
On-site work

Similar Jobs for Member of Technical Staff, Compute Orchestration & Scheduling

Member of Technical Staff, Cloud Infrastructure

As a Software Engineer on our Cloud Infrastructure team, you'll be at the forefr...
Location:
United States, New York, NY; San Mateo, CA; Redwood City, CA
Salary:
175000.00 - 220000.00 USD / Year
Fireworks AI
Expiration Date:
Until further notice
Requirements:
  • Bachelor’s degree in Computer Science, Engineering, or a related technical field (or equivalent practical experience)
  • 5+ years of experience designing and building backend infrastructure in cloud environments (e.g., AWS, GCP, Azure)
  • Proven experience in ML infrastructure and tooling (e.g., PyTorch, TensorFlow, Vertex AI, SageMaker, Kubernetes, etc.)
  • Strong software development skills in languages like Python, or C++
  • Deep understanding of distributed systems fundamentals: scheduling, orchestration, storage, networking, and compute optimization
Job Responsibility:
  • Architect and build scalable, resilient, and high-performance backend infrastructure to support distributed training, inference, and data processing pipelines
  • Lead technical design discussions, mentor other engineers, and establish best practices for building and operating large-scale ML infrastructure
  • Design and implement core backend services (e.g., job schedulers, resource managers, autoscalers, model serving layers) with a focus on efficiency and low latency
  • Drive infrastructure optimization initiatives, including compute cost reduction, storage lifecycle management, and network performance tuning
  • Collaborate cross-functionally with ML, DevOps, and product teams to translate research and product needs into robust infrastructure solutions
  • Continuously evaluate and integrate cloud-native and open-source technologies (e.g., Kubernetes, Ray, Kubeflow, MLFlow) to enhance our platform’s capabilities and reliability
  • Own end-to-end systems from design to deployment and observability, with a strong emphasis on reliability, fault tolerance, and operational excellence
What we offer:
  • Meaningful equity in a fast-growing startup
  • Competitive salary
  • Comprehensive benefits package
Employment Type:
Full-time

Senior Principal Engineering Manager

Microsoft Research (MSR) is working to transform the future of artificial intell...
Location:
United States, Redmond
Salary:
163000.00 - 296400.00 USD / Year
Microsoft Corporation
Expiration Date:
Until further notice
Requirements:
  • Bachelor's Degree in Computer Science or related technical field AND 8+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
  • 5+ years of people management experience leading software engineering teams, including managing principal engineers
  • Experience building or operating infrastructure for large-scale distributed systems, cloud platforms, or artificial intelligence (AI) / machine learning (ML) workloads
  • Track record of driving execution on complex, multi-workstream infrastructure projects with clear milestones and accountability
  • Technical fluency in one or more of: large-scale compute clusters, GPU infrastructure, scheduling and orchestration (Kubernetes, Volcano), or High-Performance Compute (HPC) environments
  • Experience with GPU programming (CUDA, NCCL) and frameworks such as PyTorch
  • Expertise in networking (InfiniBand, NVLink), storage systems, or distributed training parallelisms
  • A track record of strong cross-functional partnerships, including the ability to align on strategic direction, deliver joint accountabilities, and develop relationships with staff members with widely varied expertise
  • Experience scaling engineering teams through significant growth phases (hiring, onboarding, and integrating new engineers into a high-performing team)
  • Master's Degree in Computer Science or related technical field AND 12+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR Bachelor's Degree in Computer Science or related technical field AND 15+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
Job Responsibility:
  • Lead, mentor, and grow the engineering team that builds MSR’s AI research infrastructure
  • Recruit and develop exceptional engineering talent, building a diverse team - including hiring, onboarding, career development, and performance management
  • Drive execution across the team by setting clear goals, tracking milestones, managing dependencies, and ensuring accountability for delivering complex infrastructure projects on time and at high quality
  • Lead team culture and process changes, cultivating an AI-first mentality that accelerates our progress through agentic coding, automation, and skills development
  • Provide technical vision and judgment on the team's architecture, strategy, and roadmap — spanning supercomputer GPU clusters, high performance networking, workload optimization, researcher tools, and agentic workflows — while empowering engineers to own deep technical details
  • Collaborate closely cross-discipline with engineers, program managers, and research and science teams to align priorities, resolve dependencies, and build better solutions together
  • Foster a team culture of operational excellence, continuous improvement, and high psychological safety where engineers are empowered to take ownership and innovate
Employment Type:
Full-time

Member of Technical Staff, Infrastructure Data & Analytics

We are seeking experienced Infrastructure Data & Analytics Engineers to join our...
Location:
United States, Multiple Locations; Mountain View; San Francisco Bay area; New York City metropolitan area
Salary:
139900.00 - 274800.00 USD / Year
Microsoft Corporation
Expiration Date:
Until further notice
Requirements:
  • Bachelor’s degree in computer science, or related technical field AND 8+ years technical engineering experience with data engineering, analytics, or data science, with increasing technical ownership in startup environment AND 6+ years experience with distributed data processing frameworks and large-scale data systems
  • OR equivalent experience
  • Master's Degree in Computer Science or related technical field AND 12+ years technical engineering experience with data engineering, analytics, or data science, with increasing technical ownership in startup environment AND 10+ years experience with distributed data processing frameworks and large-scale data systems
  • OR equivalent experience
  • Proven technical leadership in data engineering, analytics platforms, or large-scale telemetry systems
  • Hands-on experience with ETL orchestration frameworks such as Airflow, Dagster, or similar
  • Strong communication skills; can explain complex systems clearly to senior leaders
Job Responsibility:
  • Act as the technical lead and owner for infrastructure analytics across compute, storage, and networking
  • Design and build durable, scalable data pipelines that ingest telemetry from clusters, schedulers, health systems, and capacity trackers into Data Warehouse
  • Define and standardize core metrics and semantics (e.g., utilization, occupancy, MFU, goodput, capacity readiness, delivery-to-production)
  • Architect and maintain self-service dashboards and APIs for fleet, cluster, and squad-level visibility
  • Partner closely with stakeholders across Supercomputing Infra, Researchers, Strategy and Executives to ensure metrics reflect operational and business reality
  • Implement robust and fault-tolerant systems for data ingestion and processing
  • Lead data architecture and engineering decisions, applying strong technical judgment to proactively shape executive-level discussions and decisions
  • Identify data gaps and instrumentation issues; drive fixes by influencing upstream engineering teams
  • Establish data quality, validation, documentation, and governance so metrics are trusted and repeatable
Employment Type:
Full-time

Member of Technical Staff, Software Engineer

Help build the infrastructure that powers training, evaluation, and data platfor...
Location:
Switzerland, Zürich
Salary:
Not provided
Microsoft Corporation
Expiration Date:
Until further notice
Requirements:
  • Strong software engineering background building reliable, scalable production systems (Python preferred)
  • Hands‑on experience supporting large‑scale ML / LLM training, evaluation, or experimentation infrastructure
  • Operating GPU‑heavy workloads in cloud environments using Docker and Kubernetes (scheduling, utilization, isolation)
  • Designing and running data / compute pipelines and orchestration (e.g., Airflow, Argo) with object storage (Azure Blob / S3)
  • Platform reliability and operability: observability, metrics, logging, tracing, alerting (Prometheus, Grafana, OpenTelemetry)
Job Responsibility:
  • Design and build core platform services for scalable training and evaluation, including cluster orchestration, job scheduling, data and compute pipelines, and artifact management
  • Standardize containerized workflows by maintaining Docker images, CI/CD, and runtime configurations; advocate for best practices in security, reproducibility, and cost efficiency
  • Implement end-to-end observability and operations through metrics, tracing, logging, dashboard development, monitoring, and automated alerts for model training and platform health (using Prometheus, Grafana, OpenTelemetry)
  • Architect and operate services on Azure cloud platforms, managing infrastructure-as-code (Terraform/Helm), secrets, networking, and storage
  • Enhance developer experience by creating tools, CLIs, and portals that simplify job submission, metrics analysis, and experiment management for generalist software engineering and research teams
  • Enforce security and compliance policies for data access, container hardening, and supply-chain integrity, and partner with security and privacy teams to maintain robust practices in multi-tenant environments and secret management
  • Collaborate cross-functionally with data, model, and product teams to align infrastructure roadmaps with training needs, evaluation protocols, and Copilot product goals
Employment Type:
Full-time

Member of Technical Staff, Site Reliability Engineer (HPC)

As Microsoft continues to push the boundaries of AI, we are on the lookout for p...
Location:
United States, Mountain View
Salary:
139900.00 - 274800.00 USD / Year
Microsoft Corporation
Expiration Date:
Until further notice
Requirements:
  • Master's Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in Site Reliability Engineering, DevOps, or Infrastructure Engineering
  • OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 4+ years technical experience in Site Reliability Engineering, DevOps, or Infrastructure Engineering
  • OR equivalent experience
  • Strong proficiency in Kubernetes, Docker, and container orchestration
  • Knowledge of CI/CD pipelines for Inference and ML model deployment
  • Hands-on experience with public cloud platforms like Azure/AWS/GCP and infrastructure-as-code
  • Expertise in monitoring & observability tools (Grafana, Datadog, OpenTelemetry, etc.)
  • Strong programming/scripting skills in Python, Go, or Bash
  • Solid knowledge of distributed systems, networking, and storage
  • Experience running large-scale GPU clusters for ML/AI workloads (preferred)
Job Responsibility:
  • Reliability & Availability: Ensure uptime, resiliency, and fault tolerance of HPC clusters powering MAI model training and inference
  • Observability: Design and maintain monitoring, alerting, and logging systems to provide real-time visibility into all aspects of HPC systems including GPU, clusters, storage and networking
  • Automation & Tooling: Build automation for deployments, incident response, scaling, and failover in CPU+GPU environments
  • Incident Management: Lead on-call rotations, troubleshoot production issues, conduct blameless postmortems, and drive continuous improvements
  • Security & Compliance: Ensure data privacy, compliance, and secure operations across model training and serving environments
  • Collaboration: Partner with ML engineers and platform teams to improve developer experience and accelerate research-to-production workflows
What we offer:
  • Competitive compensation, equity options, and comprehensive benefits
Employment Type:
Full-time

Staff Systems Software Engineer, Infrastructure Platform

The Infrastructure Engineering organisation at GM is building a cloud-native pla...
Location:
United States, Austin; Mountain View; Warren
Salary:
Not provided
General Motors
Expiration Date:
Until further notice
Requirements:
  • Bachelor's degree in Computer Science or related field, or equivalent work experience
  • 8+ years of software engineering experience with a strong track record of building and operating production distributed systems
  • Deep platform or infrastructure engineering experience, with hands-on work building APIs, schedulers, orchestrators, or similar systems at scale
  • Strong proficiency in Go, with ability to write clean, maintainable, and performant production code for backend services
  • Solid understanding of distributed systems fundamentals including consistency models, failure handling, idempotency, retry patterns, and circuit breakers
  • Experience with cloud-native technologies such as Kubernetes, Nomad, Consul, or similar orchestration and service discovery platforms
  • Strong API design skills with understanding of RESTful patterns, authentication and authorisation models (OIDC, RBAC), versioning strategies, and error handling
  • Deep experience with relational databases, particularly PostgreSQL, including schema design, indexing strategies, query optimisation, and migration management
  • Architectural thinking with ability to evaluate trade-offs, balance simplicity with flexibility, design for current requirements and future growth, and document decisions effectively
  • Strong communication skills with ability to explain complex technical concepts to both engineering and business stakeholders
Job Responsibility:
  • Design and implement core platform services including the API gateway, scheduler, lifecycle orchestrator, and synchronisation services using Go and cloud-native patterns
  • Build RESTful APIs with authentication (OIDC, RBAC), authorisation, versioning, and observability, architecting the inventory database system using PostgreSQL for resource metadata, capabilities, and state management
  • Develop intelligent scheduling and orchestration logic that matches workload requirements to resource capabilities with support for automated pooling, reservation modes, and hybrid allocation strategies
  • Build developer CLI tooling and integrate with the control plane, enabling developers to discover, allocate, and manage infrastructure resources through intuitive commands
  • Implement provisioning workflows that coordinate firmware flashing, health checks, power cycling, and resource validation across diverse automotive hardware configurations
  • Collaborate with stakeholders across Infrastructure Engineering, Quality Engineering, and Hardware Infrastructure to understand workflows and integrate with existing systems
  • Lead architectural discussions, conduct code reviews, document technical decisions, and mentor team members on distributed systems patterns and Go development
  • Work with tools and technologies including Go, PostgreSQL, Kubernetes, Nomad, Consul, RESTful APIs with OIDC authentication and RBAC authorisation, Datadog, S3-compatible object storage (MinIO), CI/CD pipelines, and Git/GitHub
What we offer:
  • From day one, we're looking out for your well-being–at work and at home–so you can focus on realizing your ambitions
Employment Type:
Full-time

Digital Electronic Integrated Circuit (EIC) Engineer

We are seeking an experienced Digital EIC Engineer to join our team in Ílhavo. Y...
Location:
Portugal, Ílhavo
Salary:
Not provided
Darwin Recruitment GmbH
Expiration Date:
Until further notice
Requirements:
  • Minimum 3 years of experience in digital circuit design for SoC/ASIC
  • Hands-on experience in RTL architecture, micro-architecture, block-level implementation, and top-level design
  • Familiarity with static timing analysis, constraints, Lint, LEC, and low-power estimation
  • Experience with SystemVerilog, Verilog, VHDL, Python, TCL, Perl, Linux, GIT
  • Exposure to front-end flows of Cadence, Synopsys, or Mentor Graphics.
Job Responsibility:
  • Design and implement RTL for datapath, control logic, interfaces, and peripherals (Verilog / SystemVerilog)
  • Perform gate-level synthesis, static timing analysis, and constraint validation
  • Conduct Lint analysis and logic equivalence checking (LEC)
  • Collaborate closely with Functional Verification and Physical Implementation teams
  • Apply low-power design techniques and assist with Design for Test considerations.
What we offer:
  • Opportunity to work on cutting-edge semiconductor projects
  • Collaborative, innovation-driven environment
  • Competitive compensation and benefits.
Employment Type:
Full-time

Lawn Operative

Due to growth and expansion, we have exciting opportunities for Lawn Care Operat...
Location:
United Kingdom, Portsmouth
Salary:
27248.00 GBP / Year
360 Resourcing Solutions
Expiration Date:
Until further notice
Requirements:
  • Have full, UK driving licence for manual vehicles
  • Have excellent communication and customer-facing skills
  • Have a passion for lawn care
  • Want a job where they can work independently, outdoors and keep active
  • Be able to deliver top quality customer service
  • Have excellent organisation and timekeeping skills
  • Have the right to work in the UK
Job Responsibility:
  • Maintaining contact with the customer before a treatment is carried out to inform them when you will be attending. (call ahead).
  • To visit a number of designated customers on a daily basis to apply fertiliser and herbicide.
  • Where possible, inform the customer before any work commences, that you are there.
  • At the conclusion of the work, notify the customer that the treatment has been completed and supply the invoice explaining what work has been carried out.
  • Inform the customer of the next treatment date.
  • Identify any lawn issues and offer any necessary advice on lawn and mowing practice and any additional treatments which may be required.
  • Carrying out essential Spring/Autumn machine work such as Aerators, Scarifiers and lawn top-dressers using a variety of professional lawncare machinery.
  • Aeration and scarification reduce moss, thatch and soil compaction, and need to be carried out in a safe and professional manner. Regular garden tools such as rakes, leaf sweepers and brooms are used to clear scarification waste.
  • Working in a team or alone on machine work which can be physically demanding so good general fitness is required.
  • Maintain standards of all health and safety practices, as supplied by Green Thumb Limited.
What we offer:
  • Company van and mobile phone
  • New uniform annually
  • Paid training and qualification in the application of pesticides and chemicals
  • Ongoing training and development
  • Medical cash plan
  • Christmas Shutdown
  • Free Lawn Treatments
  • Enhanced Paternity & Maternity pay
  • Company Sick pay
  • 24 hour Employee Assistance Helpline
Employment Type:
Full-time