Critical Infrastructure Platform Engineer Job at Microsoft Corporation (Hyderabad)

Platform Engineer - Infrastructure Runtime

Location

Spain, Barcelona

Salary:

Not provided

Delivery Hero

Expiration Date

Until further notice

Requirements

2+ years of relevant infrastructure experience, ideally in a platform team
Experience working with Kubernetes internals, observability, and cloud-native architecture (EKS a plus)
Familiarity with Golang and/or Python programming
Hands-on experience with AWS services (EC2, S3, IAM, VPC, etc.) and Terraform
Understanding of GitOps principles and tooling (Argo CD, Argo Rollouts, or similar)
Experience with CI/CD systems (GitHub Actions or similar)
Basic networking and security knowledge (VPCs, DNS, ingress, TLS, etc.)
Comfortable working on technical projects and collaborating across teams
Analytical mindset for troubleshooting and improving performance of distributed systems
Strong written and verbal communication skills in English

Job Responsibility

Support technical projects that evolve our Kubernetes-based compute platform on AWS, with a focus on reliability, scalability, and developer productivity
Help iterate on our GitOps workflows using Argo CD and Argo Rollouts for safe, automated, and progressive deployments
Maintain and improve CI/CD best practices by building and maintaining scalable GitHub Actions pipelines
Work with secure, multi-tenant infrastructure using Infrastructure as Code (Terraform)
Troubleshoot and help resolve challenging problems in distributed systems, service discovery, container orchestration, and platform observability
Take ownership of critical compute and networking infrastructure to ensure high performance, availability, and cost efficiency
Build and maintain internal tooling and automation scripts using Go and Python
Ensure our systems remain robust and reliable, and support smooth business operations
Collaborate with platform and product engineers to propagate best practices and platform knowledge
Share learnings with the team through documentation, pairing, and technical discussions

What we offer

An enticing equity plan that lets you own a piece of the action
Top-notch private health insurance to keep you at your peak
Monthly Glovo credit to satisfy your cravings
Cobee discounts on transportation, food, and even kindergarten expenses or office-based nursery
Discounted gym memberships to keep you energized
The freedom to work from home two days a week, the opportunity to work from anywhere for up to three weeks a year, and personal days off
Enhanced parental leave
Online therapy and wellbeing benefits to ensure your mental well-being

Fulltime

Staff Software Engineer, Platform Infrastructure

We are seeking an experienced and highly motivated Staff Software Engineer to le...

Location

United States, Pittsburgh

Salary:

171000.00 - 273000.00 USD / Year

Aurora Innovation

Expiration Date

Until further notice

Requirements

Senior or Staff-level experience (P7 equivalent) as a Software Engineer, ideally in infrastructure, developer tooling, or critical shared services
Proven experience leading technical projects and mentoring/directing other engineers
Familiarity with distributed compute technologies, cloud services (e.g., AWS), and large-scale workflow management systems
Demonstrated ability to triage, debug, and perform on-call and incident management for complex, cross-cutting infrastructure issues
Strong communication skills to manage stakeholder alignment and drive cross-team standardization efforts

Job Responsibility

Lead the OTI Team: Serve as the technical lead (TL) for the OTI team within PIE-Compute, driving the strategic vision, execution, and long-term stability of the core infrastructure
Help Define and Optimize the Testing Ecosystem: Lead the design of the next-generation offline testing architecture to meet diverse team needs, reducing redundancy and siloing across the organization
Partner with Test Creation and Test Drive teams to standardize end-to-end test execution and reporting (Creation -> Execution -> Reporting)
Refine the full test lifecycle to ensure performance and scalability, and maintain clear attribution of failures to enhance reliability and efficient debugging
Own Critical OTI Components and Migrations: Take ownership of the shared OTI components, including maintenance and on-call support
Own various offline test modalities, including step code, workflow code, and general health
Lead the maintenance and development of common OTI tooling, including launching test evaluations, polling APIs, communicating results, and providing recommended pipeline templates
Establish Architecture and Best Practices: Define and enforce data management policies for the testing ecosystem (storage, lifecycling, write strategies, data integrity, and lineage)
Define use cases and feature design for new test modalities, including single versus cross-modality testing strategies
Manage incidents related to offline tests and maintain Standard Operating Procedures (SOPs) for PRs, local workflows, V&V, and releases

What we offer

annual bonus
equity compensation
benefits

Fulltime

Staff Software Engineer, Platform Infrastructure

We are seeking an experienced and highly motivated Staff Software Engineer to le...

Location

United States, Mountain View

Salary:

189000.00 - 303000.00 USD / Year

Aurora Innovation

Expiration Date

Until further notice

Requirements

Senior or Staff-level experience (P7 equivalent) as a Software Engineer, ideally in infrastructure, developer tooling, or critical shared services
Proven experience leading technical projects and mentoring/directing other engineers
Familiarity with distributed compute technologies, cloud services (e.g., AWS), and large-scale workflow management systems
Demonstrated ability to triage, debug, and perform on-call and incident management for complex, cross-cutting infrastructure issues
Strong communication skills to manage stakeholder alignment and drive cross-team standardization efforts

Job Responsibility

Lead the OTI Team: Serve as the technical lead (TL) for the OTI team within PIE-Compute, driving the strategic vision, execution, and long-term stability of the core infrastructure
Help Define and Optimize the Testing Ecosystem: Lead the design of the next-generation offline testing architecture to meet diverse team needs, reducing redundancy and siloing across the organization
Partner with Test Creation and Test Drive teams to standardize end-to-end test execution and reporting (Creation -> Execution -> Reporting)
Refine the full test lifecycle to ensure performance and scalability, and maintain clear attribution of failures to enhance reliability and efficient debugging
Own Critical OTI Components and Migrations: Take ownership of the shared OTI components, including maintenance and on-call support
Own various offline test modalities, including step code, workflow code, and general health
Lead the maintenance and development of common OTI tooling, including launching test evaluations, polling APIs, communicating results, and providing recommended pipeline templates
Establish Architecture and Best Practices: Define and enforce data management policies for the testing ecosystem (storage, lifecycling, write strategies, data integrity, and lineage)
Define use cases and feature design for new test modalities, including single versus cross-modality testing strategies
Manage incidents related to offline tests and maintain Standard Operating Procedures (SOPs) for PRs, local workflows, V&V, and releases

What we offer

annual bonus
equity compensation
benefits

Fulltime

Senior Systems Engineer - Infrastructure & Platform Reliability

Lambda, The Superintelligence Cloud, is a leader in AI cloud infrastructure serv...

Location

United States, San Francisco; San Jose

Salary:

206000.00 - 310000.00 USD / Year

Lambda

Expiration Date

Until further notice

Requirements

Have a keen interest in system design and in architecting for performance and scalability, plus experience with multiple cloud infrastructure platforms (AWS, GCP, Azure, etc.)
Think carefully about systems: edge cases, failure modes, behaviors, and specific implementations
Know and prefer configuration management systems and toolchains (Chef, Ansible, Terraform, GitHub Actions, etc.)
Have solid programming skills: Python, Go, etc.
Have an urge to collaborate and communicate asynchronously, combined with a desire to record and document issues and solutions
Have an enthusiastic, go-for-it attitude. When you see something broken, you can’t help but fix it
Have an urge for delivering quickly and effectively, and iterating fast

Job Responsibility

Design, write, and deliver software and services to improve the availability, scalability, reliability, and efficiency of Lambda’s internal IT systems and platforms
Solve problems relating to mission critical services and build automation to prevent problem recurrence with the goal of automating response to all non-exceptional events
Work with Lambda Engineering and internal teams to influence and create new designs, architectures, standards, and methods for large-scale distributed systems
Engage in service capacity planning and demand forecasting, software performance analysis, and system tuning
Be an excellent communicator, producing documentation and related artifacts for the systems you are responsible for

What we offer

Generous cash & equity compensation
Health, dental, and vision coverage for you and your dependents
Wellness and commuter stipends for select roles
401k Plan with 2% company match (USA employees)
Flexible paid time off plan that we all actually use

Fulltime

Senior Software Engineer - Platform Infrastructure

We are seeking a Senior Software Engineer II to architect, build, and operate se...

Location

United States

Salary:

192200.00 - 225810.00 USD / Year

Confluent

Expiration Date

Until further notice

Requirements

6+ years of experience in software engineering, SRE, or security engineering roles, with significant experience operating security platform services
Strong backend software development experience (Go, Java, Rust, Python)
Expertise with distributed systems, cloud infrastructure (AWS, GCP, Azure), Kubernetes, service mesh, and container orchestration
Strong understanding of security domains: IAM, OAuth2, OIDC, PKI, secrets management, policy engines, audit pipelines, zero trust architecture
Experience building highly reliable, observable, and resilient production systems
Operational expertise: SLOs, SLIs, error budgets, on-call leadership, incident management
Strong collaboration skills to drive alignment across engineering, security, and compliance stakeholders
Excellent communication skills with ability to influence technical and business leaders
BS, MS, or PhD in computer science or a related field, or equivalent work experience

Job Responsibility

Architect, design, and develop platform services with a strong focus on scalability, security, and developer experience
Lead operational design for reliability: build comprehensive observability, monitoring, and incident response automation into security-critical services
Build automation and tooling to drive self-healing systems, proactive risk detection, failure recovery, and continuous resilience testing
Collaborate with compliance, governance, and risk teams to translate regulatory and policy requirements into scalable technical controls
Lead technical design reviews, security architecture reviews, and incident postmortems for platform-level incidents
Mentor engineers across multiple disciplines on both security and operational best practices
Own end-to-end delivery of services: from initial design and development through deployment, production hardening, and lifecycle maintenance

What we offer

Remote-First Work
Robust Insurance Benefits
Flexible Time Away
The Best Teammates
Experience Ambassadors
Open and Honest Culture
Well-Being and Growth
Offers Equity

Fulltime

Senior-Staff Software Engineer, Platform Infrastructure

As a Senior Software Engineer on this team, you will help architect, design and ...

Location

United States, San Mateo

Salary:

130000.00 - 280000.00 USD / Year

Verkada

Expiration Date

Until further notice

Requirements

Must have a BS, MS, or PhD in Computer Science, or similar technical field of study
Experience and enthusiasm for learning about new infrastructure products, features, and strategies
Comfortable with working at the frontier of infrastructure and software development
Experience in Python and/or Go
Experience with one of the major cloud platforms (preferably AWS)
Strong written and verbal communication skills

Job Responsibility

Identify and lead critical efforts related to scalability, reliability and efficiency
Influence the features and direction of our platform with your own ideas
Provide technical support for engineers on the team
Align with product and org objectives, and coordinate with cross-functional teams on delivering key results

What we offer

Healthcare programs that can be tailored to meet personal health and financial well-being needs; premiums are 100% covered for the employee under at least one plan and 80% covered for family under all plans
Nationwide medical, vision and dental coverage
Health Saving Account (HSA) with annual employer contributions and Flexible Spending Account (FSA) with tax saving options
Expanded mental health support
Paid parental leave policy & fertility benefits
Time off to relax and recharge through our paid holidays, firmwide extended holidays, flexible PTO and personal sick time
Professional development stipend
Fertility stipend
Wellness/fitness benefits
Healthy lunches provided daily

Fulltime

Senior ML Infrastructure Engineer, Inference Platform

About the Team: The ML Inference Platform is part of the AV ML Infrastructure or...

Location

United States, Austin, Texas; Mountain View, California; Sunnyvale, California

Salary:

155420.00 USD / Year

General Motors

Expiration Date

Until further notice

Requirements

5+ years of industry experience, with focus on machine learning systems or high performance backend services
Expertise in either Python, C++ or other relevant coding languages
Expertise in ML inference and model serving frameworks (Triton, Ray Serve, vLLM, etc.)
Strong communication skills and a proven ability to drive cross-functional initiatives
Ability to thrive in a dynamic, multi-tasking environment with ever-evolving priorities

Job Responsibility

Design and implement core platform backend software components
Collaborate with ML engineers and researchers to understand critical workflows, parse them to platform requirements, and deliver incremental value
Lead technical decision-making on model serving strategies, orchestration, caching, model versioning, and auto-scaling mechanisms for highly optimized use of accelerators
Drive the development of monitoring, observability, and metrics to ensure reliability, performance, and resource optimization of inference services
Proactively research and integrate state-of-the-art model serving frameworks, hardware accelerators, and distributed computing techniques
Lead technical initiatives across GM’s ML ecosystem
Raise the engineering bar through technical leadership, establishing best practices
Contribute to open source projects and represent GM in relevant communities

What we offer

medical
dental
vision
Health Savings Account
Flexible Spending Accounts
retirement savings plan
sickness and accident benefits
life insurance
paid vacation & holidays
tuition assistance programs

Fulltime

Critical Environment Platform Engineer

As a Critical Environment Platform Engineer, you will perform a key role in deli...

Location

United States, Redmond

Salary:

84200.00 - 165200.00 USD / Year

Microsoft Corporation

Expiration Date

Until further notice

Requirements

Bachelor's Degree or Trade Certification in Computer Science, Information Technology, Mechanical Engineering, Electrical Engineering, Aerospace Engineering, Data Science, Cybersecurity, or related field OR equivalent experience
Ability to meet Microsoft, customer and/or government security screening requirements
Microsoft Cloud Background Check
Bachelor's Degree in Computer Science, Information Technology, or related field AND an internship in software engineering, network engineering, service engineering, or systems engineering OR equivalent experience
Familiarity with enterprise large-scale cloud or distributed systems
Demonstrated basic understanding of configuration management with PowerShell DSC, Puppet, Chef, or similar
Project or lab work with server & tooling platforms for multiple infrastructure services such as Windows Server 2022/2019, Linux Server distributions, Hyper-V, Azure Arc, MIT Kerberos, Active Directory, DNS, PowerShell/Python infrastructure-as-code scripting, data protection technologies, service monitoring, incident alarming, compliance, and other services
Basic understanding of Windows Server-based operating system distributions, including automated system installation and configuration, file system concepts, resource monitoring, user administration, package management, and process control and management

Job Responsibility

Support a hybrid Continuous Integration (CI)/Continuous Delivery/Deployment (CD) DevOps virtualized infrastructure consisting of Windows and Linux Server operating systems, Hyper-V, Active Directory, Domain Name System, and PowerShell scripting, with a focus on data protection technologies, service metrics and Key Performance Indicator (KPI) reporting, and documentation
Perform, maintain, and continuously improve automated operating system installation and configuration
Configure and maintain hands-on bare metal enterprise-class server systems, including automated periodic firmware and driver updates at scale, Redundant Array of Independent Disks (RAID) and Intelligent Platform Management Interface (IPMI) configurations and hardware troubleshooting
Monitor servers, ensure service levels and KPIs are met and maintained, provide security-conscious outcomes that maintain compliance alignment, and ensure 24/7/365 service and infrastructure operations support with continuous optimizations
Transition manual operational processes to automation while leveraging CI/CD DevOps principles
Partner with a global team while helping bring projects to successful outcomes and delivering rigorous documentation artifacts
Share responsibility for automating, securing, configuring, and delivering support of the Critical Environments infrastructure & related programs and projects in existing and future datacenters
Embody our culture and values

What we offer

Career Rotation Programs
Diversity & Inclusion trainings and events
professional certifications

Fulltime
