Member of Technical Staff, Compute Orchestration & Scheduling Job at Microsoft Corporation (Mountain View)

Member of Technical Staff, Software Engineer

Help build the infrastructure that powers training, evaluation, and data platfor...

Location

Switzerland, Zürich

Salary:

Not provided

Microsoft Corporation

Expiration Date

Until further notice

Requirements

Strong software engineering background building reliable, scalable production systems (Python preferred)
Hands‑on experience supporting large‑scale ML / LLM training, evaluation, or experimentation infrastructure
Operating GPU‑heavy workloads in cloud environments using Docker and Kubernetes (scheduling, utilization, isolation)
Designing and running data / compute pipelines and orchestration (e.g., Airflow, Argo) with object storage (Azure Blob / S3)
Platform reliability and operability: observability, metrics, logging, tracing, alerting (Prometheus, Grafana, OpenTelemetry)

Job Responsibility

Design and build core platform services for scalable training and evaluation, including cluster orchestration, job scheduling, data and compute pipelines, and artifact management
Standardize containerized workflows by maintaining Docker images, CI/CD, and runtime configurations; advocate for best practices in security, reproducibility, and cost efficiency
Implement end-to-end observability and operations through metrics, tracing, logging, dashboard development, monitoring, and automated alerts for model training and platform health (using Prometheus, Grafana, OpenTelemetry)
Architect and operate services on Azure cloud platforms, managing infrastructure-as-code (Terraform/Helm), secrets, networking, and storage
Enhance developer experience by creating tools, CLIs, and portals that simplify job submission, metrics analysis, and experiment management for generalist software engineering and research teams
Enforce security and compliance policies for data access, container hardening, and supply-chain integrity, and partner with security and privacy teams to maintain robust practices in multi-tenant environments and secret management
Collaborate cross-functionally with data, model, and product teams to align infrastructure roadmaps with training needs, evaluation protocols, and Copilot product goals

Fulltime

Member of Technical Staff, Site Reliability Engineer (HPC)

As Microsoft continues to push the boundaries of AI, we are on the lookout for p...

Location

United States, Mountain View

Salary:

139900.00 - 274800.00 USD / Year

Microsoft Corporation

Expiration Date

Until further notice

Requirements

Master's Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in Site Reliability Engineering, DevOps, or Infrastructure Engineering OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 4+ years technical experience in Site Reliability Engineering, DevOps, or Infrastructure Engineering OR equivalent experience
Strong proficiency in Kubernetes, Docker, and container orchestration
Knowledge of CI/CD pipelines for Inference and ML model deployment
Hands-on experience with public cloud platforms like Azure/AWS/GCP and infrastructure-as-code
Expertise in monitoring & observability tools (Grafana, Datadog, OpenTelemetry, etc.)
Strong programming/scripting skills in Python, Go, or Bash
Solid knowledge of distributed systems, networking, and storage
Experience running large-scale GPU clusters for ML/AI workloads (preferred)

Job Responsibility

Reliability & Availability: Ensure uptime, resiliency, and fault tolerance of HPC clusters powering MAI model training and inference
Observability: Design and maintain monitoring, alerting, and logging systems that provide real-time visibility into all aspects of HPC systems, including GPUs, clusters, storage, and networking
Automation & Tooling: Build automation for deployments, incident response, scaling, and failover in CPU+GPU environments
Incident Management: Lead on-call rotations, troubleshoot production issues, conduct blameless postmortems, and drive continuous improvements
Security & Compliance: Ensure data privacy, compliance, and secure operations across model training and serving environments
Collaboration: Partner with ML engineers and platform teams to improve developer experience and accelerate research-to-production workflows

What we offer

Competitive compensation, equity options, and comprehensive benefits

Fulltime

Member of Technical Staff, Infrastructure Data & Analytics

We are seeking experienced Infrastructure Data & Analytics Engineers to join our...

Location

United States, Multiple Locations; Mountain View; San Francisco Bay area; New York City metropolitan area

Salary:

139900.00 - 274800.00 USD / Year

Microsoft Corporation

Expiration Date

Until further notice

Requirements

Bachelor’s degree in computer science or a related technical field AND 8+ years technical engineering experience with data engineering, analytics, or data science, with increasing technical ownership in a startup environment AND 6+ years experience with distributed data processing frameworks and large-scale data systems OR equivalent experience
Master's Degree in Computer Science or a related technical field AND 12+ years technical engineering experience with data engineering, analytics, or data science, with increasing technical ownership in a startup environment AND 10+ years experience with distributed data processing frameworks and large-scale data systems OR equivalent experience
Proven technical leadership in data engineering, analytics platforms, or large-scale telemetry systems
Hands-on experience with ETL orchestration frameworks such as Airflow, Dagster, or similar
Strong communication skills; can explain complex systems clearly to senior leaders

Job Responsibility

Act as the technical lead and owner for infrastructure analytics across compute, storage, and networking
Design and build durable, scalable data pipelines that ingest telemetry from clusters, schedulers, health systems, and capacity trackers into a data warehouse
Define and standardize core metrics and semantics (e.g., utilization, occupancy, MFU, goodput, capacity readiness, delivery-to-production)
Architect and maintain self-service dashboards and APIs for fleet, cluster, and squad-level visibility
Partner closely with stakeholders across Supercomputing Infra, Researchers, Strategy and Executives to ensure metrics reflect operational and business reality
Implement robust and fault-tolerant systems for data ingestion and processing
Lead data architecture and engineering decisions, applying strong technical judgment to proactively shape executive-level discussions and decisions
Identify data gaps and instrumentation issues; drive fixes by influencing upstream engineering teams
Establish data quality, validation, documentation, and governance so metrics are trusted and repeatable

Fulltime

Senior Principal Engineering Manager

Microsoft Research (MSR) is working to transform the future of artificial intell...

Location

United States, Redmond

Salary:

163000.00 - 296400.00 USD / Year

Microsoft Corporation

Expiration Date

Until further notice

Requirements

Bachelor's Degree in Computer Science or related technical field AND 8+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
5+ years of people management experience leading software engineering teams, including managing principal engineers
Experience building or operating infrastructure for large-scale distributed systems, cloud platforms, or artificial intelligence (AI)/machine learning (ML) workloads
Track record of driving execution on complex, multi-workstream infrastructure projects with clear milestones and accountability
Technical fluency in one or more of: large-scale compute clusters, GPU infrastructure, scheduling and orchestration (Kubernetes, Volcano), or High-Performance Compute (HPC) environments
Experience with GPU programming (CUDA, NCCL) and frameworks such as PyTorch
Expertise in networking (InfiniBand, NVLink), storage systems, or distributed training parallelisms
A track record of strong cross-functional partnerships, including the ability to align on strategic direction, deliver joint accountabilities, and develop relationships with staff members with widely varied expertise
Experience scaling engineering teams through significant growth phases (hiring, onboarding, and integrating new engineers into a high-performing team)
Master's Degree in Computer Science or related technical field AND 12+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR Bachelor's Degree in Computer Science or related technical field AND 15+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience

Job Responsibility

Lead, mentor, and grow the engineering team that builds MSR’s AI research infrastructure
Recruit and develop exceptional engineering talent, building a diverse team - including hiring, onboarding, career development, and performance management
Drive execution across the team by setting clear goals, tracking milestones, managing dependencies, and ensuring accountability for delivering complex infrastructure projects on time and at high quality
Lead team culture and process changes, cultivating an AI-first mentality that accelerates our progress through agentic coding, automation, and skills development
Provide technical vision and judgment on the team's architecture, strategy, and roadmap — spanning supercomputer GPU clusters, high performance networking, workload optimization, researcher tools, and agentic workflows — while empowering engineers to own deep technical details
Collaborate closely cross-discipline with engineers, program managers, and research and science teams to align priorities, resolve dependencies, and build better solutions together
Foster a team culture of operational excellence, continuous improvement, and high psychological safety where engineers are empowered to take ownership and innovate

Fulltime

Staff Systems Software Engineer, Infrastructure Platform

The Infrastructure Engineering organisation at GM is building a cloud-native pla...

Location

United States, Austin; Mountain View; Warren

Salary:

Not provided

General Motors

Expiration Date

Until further notice

Requirements

Bachelor's degree in Computer Science or related field, or equivalent work experience
8+ years of software engineering experience with a strong track record of building and operating production distributed systems
Deep platform or infrastructure engineering experience, with hands-on work building APIs, schedulers, orchestrators, or similar systems at scale
Strong proficiency in Go, with ability to write clean, maintainable, and performant production code for backend services
Solid understanding of distributed systems fundamentals including consistency models, failure handling, idempotency, retry patterns, and circuit breakers
Experience with cloud-native technologies such as Kubernetes, Nomad, Consul, or similar orchestration and service discovery platforms
Strong API design skills with understanding of RESTful patterns, authentication and authorisation models (OIDC, RBAC), versioning strategies, and error handling
Deep experience with relational databases, particularly PostgreSQL, including schema design, indexing strategies, query optimisation, and migration management
Architectural thinking with ability to evaluate trade-offs, balance simplicity with flexibility, design for current requirements and future growth, and document decisions effectively
Strong communication skills with ability to explain complex technical concepts to both engineering and business stakeholders

Job Responsibility

Design and implement core platform services including the API gateway, scheduler, lifecycle orchestrator, and synchronisation services using Go and cloud-native patterns
Build RESTful APIs with authentication (OIDC, RBAC), authorisation, versioning, and observability, architecting the inventory database system using PostgreSQL for resource metadata, capabilities, and state management
Develop intelligent scheduling and orchestration logic that matches workload requirements to resource capabilities with support for automated pooling, reservation modes, and hybrid allocation strategies
Build developer CLI tooling and integrate with the control plane, enabling developers to discover, allocate, and manage infrastructure resources through intuitive commands
Implement provisioning workflows that coordinate firmware flashing, health checks, power cycling, and resource validation across diverse automotive hardware configurations
Collaborate with stakeholders across Infrastructure Engineering, Quality Engineering, and Hardware Infrastructure to understand workflows and integrate with existing systems
Lead architectural discussions, conduct code reviews, document technical decisions, and mentor team members on distributed systems patterns and Go development
Work with tools and technologies including Go, PostgreSQL, Kubernetes, Nomad, Consul, RESTful APIs with OIDC authentication and RBAC authorisation, Datadog, S3-compatible object storage (MinIO), CI/CD pipelines, and Git/GitHub

What we offer

From day one, we're looking out for your well-being–at work and at home–so you can focus on realizing your ambitions

Fulltime

Domestic Abuse Night Support Worker

The Nelson Trust is a charity dedicated to empowering people affected by trauma ...

Location

United Kingdom, Swindon

Salary:

25000.00 GBP / Year

360 Resourcing Solutions

Expiration Date

Until further notice

Requirements

Experience supporting people with complex needs, including domestic abuse, mental health, substance misuse, homelessness, or criminal justice involvement
A good understanding of safeguarding and risk management
Ability to build professional, supportive relationships and communicate clearly
Confidence working independently as a lone worker and as part of a team
Desirable: experience in supported housing, the voluntary or social care sector and/or a relevant qualification or First Aid (training provided)

Job Responsibility

Ensure the safety and security of the building through regular checks
Provide emotional reassurance, active listening and practical support to residents
Respond calmly and professionally to incidents, conflict, or emergencies
Work in line with individual support, safety and risk plans
Complete accurate records, incident reports and handovers for day staff
Follow safeguarding, health and safety and organisational policies at all times
Work confidently alone while remaining connected to the wider team

What we offer

Supportive and dynamic work environment
Opportunities for professional development and training
Chance to make a real impact in the lives of vulnerable women
Auto enrolment pension (6% employer contribution)
25 days holiday per annum plus statutory pro rata
Comprehensive training and development programme
Positive working environment

Service Engineer

At Anasia, we offer more than a job. We provide a long-term career and a chance ...

Location

Egypt, Cairo

Salary:

Not provided

Anasia Egypt for Trading

Expiration Date

Until further notice

Requirements

Bachelor's degree in Engineering or a related technical field. A Master's degree is a plus.
2 to 4 years of proven experience as a Service Engineer or in a similar technical support role, preferably in the industry or equipment sector.
Strong technical aptitude and problem-solving skills, with the ability to diagnose and troubleshoot technical issues effectively.
Excellent communication and interpersonal skills, with the ability to build rapport and effectively interact with customers.
Proficiency in reading and interpreting technical manuals, schematics, and drawings.
Strong organizational skills and the ability to manage multiple tasks and priorities in a fast-paced environment.

Job Responsibility

Perform installation, commissioning, and testing of products or equipment at customer sites, ensuring proper functionality and adherence to specifications.
Conduct routine maintenance, inspections, and repairs of products or equipment to ensure optimal performance and minimize downtime.
Respond promptly to customer service requests, troubleshooting and resolving technical issues in a timely and effective manner.
Provide on-site training to customers on the operation and maintenance of products or equipment, ensuring their proper and safe usage.
Collaborate with cross-functional teams, including sales, engineering, and customer support, to address customer needs and provide comprehensive solutions.
Communicate with the customer daily to ensure satisfaction and implement any necessary corrective actions.
Proactively identify opportunities for service improvement, including developing and implementing preventive maintenance programs and recommending product enhancements.
Maintain accurate records of service activities, including service reports, maintenance schedules, and equipment documentation.
Create and maintain service information in the company enterprise database

Fulltime

Temporary Reception & Administration Opportunities

We are currently recruiting experienced temporary receptionists and administrato...

Location

United Kingdom, Stirling

Salary:

13.00 - 13.50 GBP / Hour

Office Angels

Expiration Date

Until further notice

Requirements

Previous experience in reception and/or administrative roles
A professional, friendly, and confident manner
Strong organisational skills and attention to detail
Good IT skills, including Microsoft Word, Excel, and Outlook
Reliability and flexibility for temporary assignments

Job Responsibility

Welcoming visitors and managing front-of-house reception
Handling incoming calls, emails, and correspondence
Diary management and meeting coordination
Data entry, filing, and document preparation
Providing general administrative support to wider teams

What we offer

Competitive pay of £13.50 per hour
A variety of temporary assignments with reputable organisations
Flexible opportunities to suit your availability
Exposure to a range of office environments
Access to discount vouchers with many high street brands
Eye care vouchers and money towards glasses should you require them for VDU purposes
We can search for permanent work whilst you're in assignments and offer expert interview support and advice
Weekly pay
Pension scheme option (with employer contributions)
28 days paid annual leave (based on a weekly accrual)
