Manager, Compute Infrastructure – Linux Operations

India, Hyderabad · Job Posted March 25, 2026

Job Description

In this vital role, you will be responsible for managing complex Linux server environments. The role includes planning, implementation, performance tuning, and maintenance of enterprise server platforms, with a focus on reliability, security, and automation.

Job Responsibility

  • Build, mentor, and scale a high-performing Linux and infrastructure operations team
  • Develop and maintain server security and compliance standards
  • Lead troubleshooting of complex, high-impact production issues
  • Contribute to infrastructure design and architecture planning
  • Implement automation using Bash, Python, or Ansible
  • Monitor systems and proactively address performance bottlenecks
  • Drive adoption of agentic workflow automation (e.g., self-healing systems, event-driven runbooks)
  • Collaborate with cross-functional teams on infrastructure needs
  • Document system configurations and operational procedures
  • Provide technical guidance and mentorship to junior team members

Requirements

  • Master's degree with 8+ years of experience, or
  • Bachelor's degree with 10+ years of experience, or
  • Diploma with 14+ years of relevant experience
  • Advanced knowledge of Linux server operating systems (RHEL, Ubuntu, and Amazon Linux)
  • Experience with virtualization (VMware, Nutanix)
  • Strong understanding of networking and storage integration
  • Hands-on expertise in scripting (Python, Bash) and in automation tools and frameworks
  • Experience leading teams

Nice to have

  • Experience with cloud services (AWS, Azure, GCP)
  • Experience with ITIL processes and frameworks
  • Experience with CI/CD and DevOps practices
  • Experience with workflow automation platforms (ServiceNow, n8n, or similar)
  • Understanding of configuration management and automation tools (Red Hat Satellite Server, Ansible)
  • Experience with DevOps, software development, and scripting
  • Good knowledge of industry tools
  • Red Hat Certified Engineer (RHCE) (preferred)
  • AWS Solutions Architect (Associate/Professional)
  • Certified Kubernetes Administrator (CKA)
  • ITIL Foundation (preferred)
  • Strong systems thinking and ability to operate across complex, interconnected environments
  • Ability to communicate complex technical concepts to both technical and non-technical stakeholders
  • Ability to work in a fast-paced environment
  • Decisive and calm under pressure, especially during critical incidents
  • Strong organizational and time management skills
  • Team collaboration and knowledge sharing
  • High curiosity and continuous learning mindset, especially in AI and automation domains
  • Strong collaboration and influencing skills in a global, matrixed organization
  • Ownership mindset with accountability for outcomes, not just tasks

What we offer

Competitive and comprehensive Total Rewards Plans that are aligned with local industry standards

Similar Jobs for Manager, Compute Infrastructure – Linux Operations

8 matching positions

Manager – AI Infrastructure Operations

As a senior leader on our team, you will be responsible for the overall health, ...
Location
United States, Sunnyvale
Salary
Not provided
Cerebras Systems
Expiration Date
Until further notice
Requirements
  • Technical Leadership: 15+ years of experience in managing and operating complex compute infrastructure, with a minimum of 5 years in a senior or leadership role
  • SRE and Operations Expertise: A strong background as a Site Reliability Engineer or in a similar role, with a proven track record of managing large-scale, mission-critical systems
  • Deep Systems Knowledge: Expert-level proficiency in Linux-based systems, Python scripting, and command-line tools for system administration and automation
  • Troubleshooting Acumen: Exceptional ability to lead and resolve complex technical challenges under pressure, especially during customer or engineering escalations
  • On-Call Leadership: Proven experience managing an on-call rotation and responding to 24/7 technical incidents
  • Communication: Excellent communication and leadership skills, with the ability to effectively mentor junior team members and communicate complex technical concepts to a diverse audience
Job Responsibility
  • Lead and Manage Infrastructure: Oversee the operation and reliability of our advanced AI compute infrastructure, defining strategy and setting a high bar for operational excellence
  • Drive Technical Ownership: Act as the primary owner for critical infrastructure systems, ensuring uptime, performance, and capacity are consistently optimized
  • Handle High-Stakes Escalations: Serve as the final point of contact for complex customer and engineering escalations, providing expert-level, hands-on support and driving issues to a rapid and complete resolution
  • Champion Reliability and Automation: Leverage your SRE experience to develop and implement robust monitoring, alerting, and automation solutions, reducing manual toil and preventing future issues
  • Collaborate and Strategize: Partner with cross-functional teams, including engineering and product, to align on long-term infrastructure strategy and support future AI initiatives
  • Innovate and Improve: Continuously evaluate and improve existing processes, tools, and technologies to enhance system reliability and operational efficiency
What we offer
  • Build a breakthrough AI platform beyond the constraints of the GPU
  • Publish and open source their cutting-edge AI research
  • Work on one of the fastest AI supercomputers in the world
  • Enjoy job stability with startup vitality
  • Our simple, non-corporate work culture that respects individual beliefs

Fleet Operations Manager, Data Center Infrastructure

Meta is seeking a forward-thinking, experienced individual to join the Data Cent...
Location
United States, Hillsboro
Salary
163000.00 - 238000.00 USD / Year
Meta
Expiration Date
Until further notice
Requirements
  • BS, BA, or BEng in a technical field or commensurate experience
  • Ability to travel up to 30% is required
  • Experience participating in or leading technical projects related to areas such as process improvement, technology, and/or automation, including bringing in additional expertise as needed
  • 5+ years of experience managing teams of technical resources, including people and performance management responsibilities
  • Understanding of data center infrastructure and/or operations, including power, cooling, and/or network systems; structured cabling; and management of projects, incidents, and vendors
  • Experience using data and metrics to drive decision-making
  • Ability to influence effectively, working on cross-functional teams to advance the needs of the company and adapting teams to meet these needs
  • 10+ years of engineering or operations experience, preferably in a mature engineering or operations environment, working with cross-functional teams
Job Responsibility
  • Build and lead a geographically dispersed, high-performing data center operations team, developing both the technical capabilities and leadership qualities of engineers
  • Establish and manage a Data Center Operations Team accountable for the maintenance and operation of server hardware and supporting infrastructure at scale
  • Become a technical expert in Meta's infrastructure, including platforms, tools, systems, architecture, workflows, and performance
  • Provide strategic direction, guidance, and support for site and fleet-level operations
  • Analyze and drive continuous improvement in the engineering and operational performance of our data centers
  • Employ data analytics to identify inefficiencies, opportunities, exceptions, and correlations in a complex, highly interconnected, technical environment
  • Collaborate with cross-functional partner teams to ensure fleet health and maintain targeted capacity levels, resulting in optimized operations, minimized downtime, and seamless scalability
  • Evolve and optimize processes in a globally consistent way to allow Meta to scale and grow effectively
  • Support and mentor engineers in their day-to-day work, as well as in finding opportunities to develop and grow based on their areas of strength and interest
  • Create and drive a culture of ownership, innovation, collaboration, accountability, continuous improvement, and safety
What we offer
  • Bonus
  • Equity
  • Benefits
  • Full-time

Infrastructure Manager

We are looking for an experienced IT Infrastructure Manager to lead the performa...
Location
United States, Cleveland
Salary
Not provided
Robert Half
Expiration Date
Until further notice
Requirements
  • 7+ years of experience managing enterprise IT infrastructure in a lead or management capacity
  • Hands-on expertise with Microsoft Azure, Amazon Web Services, and virtualized environments built on VMware technologies
  • Strong knowledge of Windows Server, Linux systems, Active Directory, and Microsoft 365 Enterprise administration
  • Experience supporting storage platforms, backup technologies, and disaster recovery processes in production environments
  • Familiarity with Cisco technologies, computer hardware, and configuration management practices
  • Demonstrated ability to manage third-party vendors and coordinate infrastructure-related support services
  • Strong analytical and problem-solving skills with a track record of diagnosing complex technical issues and delivering effective solutions
Job Responsibility
  • Direct the planning, deployment, upkeep, and retirement of infrastructure platforms that support daily IT operations
  • Oversee the stability and availability of physical servers, virtual machines, host environments, and storage systems across the organization
  • Manage cloud environments in platforms such as Microsoft Azure and Amazon Web Services to maintain secure, efficient, and scalable operations
  • Coordinate response activities for physical security events, escalating issues when appropriate and ensuring required reporting is completed
  • Administer enterprise storage, backup, and recovery solutions to protect data integrity and support restoration needs
  • Partner with external vendors and service providers to deliver ongoing infrastructure support and resolve operational issues effectively
  • Strengthen business continuity by developing, testing, and refining disaster recovery strategies and infrastructure safeguards
  • Lead initiatives that improve infrastructure standards, operational metrics, and service performance through established IT best practices
  • Work closely with internal stakeholders to address security concerns, operational risks, and long-term infrastructure priorities
  • Investigate complex technical issues, identify root causes, and implement sustainable corrective actions to prevent recurrence
What we offer
  • Benefits are available to contract/temporary professionals, including medical, vision, dental, and life and disability insurance
  • Eligible to enroll in our company 401(k) plan
  • Free online training

Senior Infrastructure Operations Engineer

This role is responsible for architecting, optimizing, and scaling the infrastru...
Location
United States, Atlanta
Salary
Not provided
Zelis
Expiration Date
Until further notice
Requirements
  • Bachelor's degree in Computer Science, Information Technology, or a related field
  • 5-10 years of experience in Infrastructure Operations, Systems Engineering, or a related field
  • High proficiency in one or more of the following: VMware, Dell compute, Windows operating systems, Linux distributions (RHEL), IaC/CaC tooling
  • Strong understanding of networking concepts
  • Excellent analytical and troubleshooting skills
  • Strong understanding of infrastructure best practices along with industry & governance standards
  • Excellent verbal and written communication skills
  • Ability to work in a fast-paced environment and manage multiple priorities simultaneously
  • Experience coaching and training team members
  • Please note that at this time we are unable to proceed with candidates who require visa sponsorship now or in the future
Job Responsibility
  • VMware Management: Design, optimize, and champion VMware environments
  • Dell Compute Equipment: Design, optimize, and champion Dell server environments including installation, configuration, maintenance, and life-cycle management
  • Tooling and Optimization: Enhance tooling and practices to proactively identify and resolve infrastructure issues with automation
  • Documentation: Write and maintain comprehensive, accurate, and up-to-date documentation of work instructions, configurations, and standards
  • Collaboration: Work closely with cross-functional teams: IT, security, application development, and business partners
  • Incident Response: As Level 3 support, respond to and resolve escalated infrastructure-related incidents and outages
  • Continuous Improvement: Identify opportunities for process improvements and implement best practices
  • Full-time

AI Infrastructure Operations Engineer

Cerebras Systems builds the world's largest AI chip, 56 times larger than GPUs. ...
Salary
Not provided
Cerebras Systems
Expiration Date
Until further notice
Requirements
  • 6-8 years of relevant experience in managing and operating complex compute infrastructure, preferably in the context of machine learning or high-performance computing
  • Strong proficiency in Python scripting for automation and system administration
  • Deep understanding of Linux-based compute systems and command-line tools
  • Extensive knowledge of Docker containers and of orchestration and scheduling platforms such as Kubernetes (k8s) and Slurm
  • Proven ability to troubleshoot and resolve complex technical issues in a timely and efficient manner
  • Experience with monitoring and alerting systems
  • Proven track record of owning challenges and driving them to completion
  • Excellent communication and collaboration skills
  • Ability to work effectively in a fast-paced environment
  • Willingness to participate in a 24/7 on-call rotation
Job Responsibility
  • Manage and operate multiple advanced AI compute infrastructure clusters
  • Monitor and oversee cluster health, proactively identifying and resolving potential issues
  • Maximize compute capacity through optimization and efficient resource allocation
  • Deploy, configure, and debug container-based services using Docker
  • Provide 24/7 monitoring and support, leveraging automated tools and performing hands-on troubleshooting as needed
  • Handle engineering escalations and collaborate with other teams to resolve complex technical challenges
  • Contribute to the development and improvement of our monitoring and support processes
  • Stay up-to-date with the latest advancements in AI compute infrastructure and related technologies
What we offer
  • Build a breakthrough AI platform beyond the constraints of the GPU
  • Publish and open source their cutting-edge AI research
  • Work on one of the fastest AI supercomputers in the world
  • Enjoy job stability with startup vitality
  • Our simple, non-corporate work culture that respects individual beliefs

Data Center Production Operations Manager

Meta is seeking a forward thinking experienced individual to join the Data Cente...
Location
United States, Houston
Salary
135000.00 - 191000.00 USD / Year
Meta
Expiration Date
Until further notice
Requirements
  • BS or BA in technical field or commensurate experience
  • 10+ years of experience in high-availability technology environments working with cross-functional teams
  • 4+ years of experience managing teams of technical resources, including people and performance management responsibilities
  • Knowledge of Linux and hardware systems support in an Internet operations environment
  • Familiarity with Python, SQL, and/or shell scripting
  • Solid knowledge of enterprise level infrastructure
  • Understanding of out-of-band/lights-out server communication methods, such as IPMI and serial console
  • Proven time and project management skills
  • Depth and breadth of knowledge in managing servers in a large-scale distributed environment (a core competency for this role)
Job Responsibility
  • Manage a Data Center Operations Team accountable for the maintenance and operation of server hardware and supporting infrastructure at scale
  • Be accountable for the health of server capacity delivering Meta's products and services from the data center site, and ensure operational delivery through collaboration and partnership with peer organizations
  • Work with peer organizations and regional teams that affect and deliver services to data center operations such as network operations, project management, facilities/maintenance management, logistics, hardware design, automated tooling and supply chain operations in order to successfully maintain data center uptime to enable ongoing business growth
  • Mentor and develop engineers and technicians so that they can run daily operations with minimal supervision
  • Lead a high-quality data center operations team, with a broad range of experiences, perspectives, and backgrounds, developing both the technical and leadership qualities of engineers and technicians
  • Collaborate with other Production Operations Managers in data center sites around the globe to evolve and optimize processes and approaches in a globally consistent way, allowing Meta to scale and grow effectively
  • Create and drive a work environment of ownership, innovation, collaboration, accountability, and safety. Support and contribute thought leadership to the development and implementation of business practices, processes, and automated tooling that enable the growth and ongoing management of our global data center IT footprint
  • Manage server upgrades, integration, automated OS provisioning process, rebuilds and other projects as required. Understand and debug network, hardware, and Linux OS related issues
  • Identify and participate in the creation of documentation for the global DC knowledge base. Implement process improvements and inform best practices in data center operations
  • Predict data center growth and scaling issues before they occur, and implement solutions
What we offer
  • Bonus
  • Equity
  • Benefits
  • Full-time

Compute Linux Engineer

The candidate provides Operate and Admin support on Compute infrastructure and t...
Location
India, Bangalore
Salary
Not provided
Hewlett Packard Enterprise
Expiration Date
Until further notice
Requirements
  • Broad technical knowledge of HPE ISS solutions: installing, configuring, and troubleshooting C7000 enclosures, HPE Synergy, Virtual Connect, blade switches (SAS, Ethernet, and FC), ProLiant blades, and storage blades
  • Operating systems knowledge: install, configure, administer, and troubleshoot RHEL/SUSE (as bare-metal OS and as VMs on hypervisors) and VMware
  • Working knowledge of Red Hat/SUSE Linux
  • Troubleshooting hardware issues from OS logs (vm-support, HPSreport, sosreport, supportconfig, etc.)
  • Knowledge of SAN and NAS technologies (Ethernet/iSCSI, FC, FCoE)
  • Knowledge of DAS storage and HBAs: Smart Array/RAID, SSDs, SAS, SATA, etc.
  • Disaster Recovery planning and conducting DR tests
  • Experience producing routine performance, capacity, and security audit analysis reports for customers to support planned changes
  • Linux vulnerability assessment and mitigation
  • Serviceguard cluster configuration and management on Linux, and integration with database and ERP solutions
Job Responsibility
  • Resolve customer issues via telephone, email, or remote sessions
  • Reproduce issues in-house and respond in a timely manner
  • Follow up regularly with customers with recommendations, updates, and action plans
  • Identify and escalate issues to the vendor in a timely manner, according to Standard Operating Procedures
  • Leverage internal technical expertise, including peers, mentors, knowledge base, community forums and other internal tools, to provide the most effective solutions to customer issues
  • Collaborate with other CoE/HW teams in diagnosing and isolating the cause of complex issues
  • Maintain quality on case documentation, SLA timeframes and operational metrics
  • Perform within the team's productivity measures (scorecard)
  • Incident Management: Resolve single- and cross-technology incidents independently
  • Lead team members in resolving complex or cross-technology incidents
What we offer
  • Health & Wellbeing
  • Personal & Professional Development
  • Comprehensive suite of benefits that support physical, financial and emotional wellbeing
  • Unconditional Inclusion
  • Full-time

Compute Linux Engineer

HPE Operations is our innovative IT services organization. It provides the exper...
Location
India, Bangalore
Salary
Not provided
Hewlett Packard Enterprise
Expiration Date
Until further notice
Requirements
  • Broad technical knowledge of HPE ISS solutions: installing, configuring, and troubleshooting C7000 enclosures, HPE Synergy, Virtual Connect, blade switches (SAS, Ethernet, and FC), ProLiant blades, and storage blades
  • Operating systems knowledge: install, configure, administer, and troubleshoot RHEL/SUSE (as bare-metal OS and as VMs on hypervisors) and VMware
  • Working knowledge of Red Hat/SUSE Linux
  • Troubleshooting hardware issues from OS logs (vm-support, HPSreport, sosreport, supportconfig, etc.)
  • Knowledge of SAN and NAS technologies (Ethernet/iSCSI, FC, FCoE)
  • Knowledge of DAS storage and HBAs: Smart Array/RAID, SSDs, SAS, SATA, etc.
  • Disaster Recovery planning and conducting DR tests
  • Experience producing routine performance, capacity, and security audit analysis reports for customers to support planned changes
  • Linux vulnerability assessment and mitigation
  • Serviceguard cluster configuration and management on Linux, and integration with database and ERP solutions
Job Responsibility
  • Resolve customer issues via telephone, email, or remote sessions
  • Reproduce issues in-house and respond in a timely manner
  • Follow up regularly with customers with recommendations, updates, and action plans
  • Identify and escalate issues to the vendor in a timely manner, according to Standard Operating Procedures
  • Leverage internal technical expertise, including peers, mentors, knowledge base, community forums and other internal tools, to provide the most effective solutions to customer issues
  • Collaborate with other CoE/HW teams in diagnosing and isolating the cause of complex issues
  • Maintain quality on case documentation, SLA timeframes and operational metrics
  • Perform within the team's productivity measures (scorecard)
  • Incident Management: Resolve single- and cross-technology incidents independently. Lead team members in resolving complex or cross-technology incidents
  • Escalation Management: Identify, manage, and lead technical escalations. Participate in formal escalations when required, especially during a crisis
What we offer
  • Health & Wellbeing
  • Personal & Professional Development
  • Unconditional Inclusion
  • Full-time