Manager, Compute Infrastructure – Linux Operations Job at Amgen (Hyderabad)

Manager – AI Infrastructure Operations

As a senior leader on our team, you will be responsible for the overall health, ...

Location

United States, Sunnyvale

Salary:

Not provided

Cerebras Systems

Expiration Date

Until further notice

Requirements

Technical Leadership: 15+ years of experience in managing and operating complex compute infrastructure, with a minimum of 5 years in a senior or leadership role
SRE and Operations Expertise: A strong background as a Site Reliability Engineer or in a similar role, with a proven track record of managing large-scale, mission-critical systems
Deep Systems Knowledge: Expert-level proficiency in Linux-based systems, Python scripting, and command-line tools for system administration and automation
Troubleshooting Acumen: Exceptional ability to lead and resolve complex technical challenges under pressure, especially during customer or engineering escalations
On-Call Leadership: Proven experience managing an on-call rotation and responding to 24/7 technical incidents
Communication: Excellent communication and leadership skills, with the ability to effectively mentor junior team members and communicate complex technical concepts to a diverse audience

Job Responsibility

Lead and Manage Infrastructure: Oversee the operation and reliability of our advanced AI compute infrastructure, defining strategy and setting a high bar for operational excellence
Drive Technical Ownership: Act as the primary owner for critical infrastructure systems, ensuring uptime, performance, and capacity are consistently optimized
Handle High-Stakes Escalations: Serve as the final point of contact for complex customer and engineering escalations, providing expert-level, hands-on support and driving issues to a rapid and complete resolution
Champion Reliability and Automation: Leverage your SRE experience to develop and implement robust monitoring, alerting, and automation solutions, reducing manual toil and preventing future issues
Collaborate and Strategize: Partner with cross-functional teams, including engineering and product, to align on long-term infrastructure strategy and support future AI initiatives
Innovate and Improve: Continuously evaluate and improve existing processes, tools, and technologies to enhance system reliability and operational efficiency

What we offer

Build a breakthrough AI platform beyond the constraints of the GPU
Publish and open source your cutting-edge AI research
Work on one of the fastest AI supercomputers in the world
Enjoy job stability with startup vitality
Enjoy our simple, non-corporate work culture that respects individual beliefs

Fleet Operations Manager, Data Center Infrastructure

Meta is seeking a forward-thinking, experienced individual to join the Data Cent...

Location

United States, Hillsboro

Salary:

163000.00 - 238000.00 USD / Year

Infrastructure Manager

We are looking for an experienced IT Infrastructure Manager to lead the performa...

Location

United States, Cleveland

Salary:

Not provided

Robert Half

Expiration Date

Until further notice

Requirements

7+ years of experience managing enterprise IT infrastructure in a lead or management capacity
Hands-on expertise with Microsoft Azure, Amazon Web Services, and virtualized environments built on VMware technologies
Strong knowledge of Windows Server, Linux systems, Active Directory, and Microsoft 365 Enterprise administration
Experience supporting storage platforms, backup technologies, and disaster recovery processes in production environments
Familiarity with Cisco technologies, computer hardware, and configuration management practices
Demonstrated ability to manage third-party vendors and coordinate infrastructure-related support services
Strong analytical and problem-solving skills with a track record of diagnosing complex technical issues and delivering effective solutions

Job Responsibility

Direct the planning, deployment, upkeep, and retirement of infrastructure platforms that support daily IT operations
Oversee the stability and availability of physical servers, virtual machines, host environments, and storage systems across the organization
Manage cloud environments in platforms such as Microsoft Azure and Amazon Web Services to maintain secure, efficient, and scalable operations
Coordinate response activities for physical security events, escalating issues when appropriate and ensuring required reporting is completed
Administer enterprise storage, backup, and recovery solutions to protect data integrity and support restoration needs
Partner with external vendors and service providers to deliver ongoing infrastructure support and resolve operational issues effectively
Strengthen business continuity by developing, testing, and refining disaster recovery strategies and infrastructure safeguards
Lead initiatives that improve infrastructure standards, operational metrics, and service performance through established IT best practices
Work closely with internal stakeholders to address security concerns, operational risks, and long-term infrastructure priorities
Investigate complex technical issues, identify root causes, and implement sustainable corrective actions to prevent recurrence

What we offer

Benefits are available to contract/temporary professionals, including medical, vision, dental, and life and disability insurance
Eligible to enroll in our company 401(k) plan
Free online training

Senior Infrastructure Operations Engineer

This role is responsible for architecting, optimizing, and scaling the infrastru...

Location

United States, Atlanta

Salary:

Not provided

Zelis

Expiration Date

Until further notice

Requirements

Bachelor's degree in Computer Science, Information Technology, or a related field
5-10 years of experience in Infrastructure Operations, Systems Engineering, or a related field
High proficiency in one or more of the following: VMware, Dell compute, Windows operating systems, Linux distributions (RHEL), IaC/CaC tooling
Strong understanding of networking concepts
Excellent analytical and troubleshooting skills
Strong understanding of infrastructure best practices along with industry & governance standards
Excellent verbal and written communication skills
Ability to work in a fast-paced environment and manage multiple priorities simultaneously
Experience coaching and training team members
Please note at this time we are unable to proceed with candidates who require visa sponsorship now or in the future

Job Responsibility

VMware Management: Design, optimize, and champion VMware environments
Dell Compute Equipment: Design, optimize, and champion Dell server environments including installation, configuration, maintenance, and life-cycle management
Tooling and Optimization: Enhance tooling and practices to proactively identify and resolve infrastructure issues with automation
Documentation: Write and maintain comprehensive, accurate, and up-to-date documentation of work instructions, configurations, and standards
Collaboration: Work closely with cross-functional teams: IT, security, application development, and business partners
Incident Response: As Level 3 support, respond to and resolve escalated infrastructure-related incidents and outages
Continuous Improvement: Identify opportunities for process improvements and implement best practices

Fulltime

AI Infrastructure Operations Engineer

Cerebras Systems builds the world's largest AI chip, 56 times larger than GPUs. ...

Location

Salary:

Not provided

Cerebras Systems

Expiration Date

Until further notice

Requirements

6-8 years of relevant experience in managing and operating complex compute infrastructure, preferably in the context of machine learning or high-performance computing
Strong proficiency in Python scripting for automation and system administration
Deep understanding of Linux-based compute systems and command-line tools
Extensive knowledge of Docker containers and of orchestration and scheduling platforms such as Kubernetes (k8s) and Slurm
Proven ability to troubleshoot and resolve complex technical issues in a timely and efficient manner
Experience with monitoring and alerting systems
Proven track record of owning challenges and driving them to completion
Excellent communication and collaboration skills
Ability to work effectively in a fast-paced environment
Willingness to participate in a 24/7 on-call rotation

Job Responsibility

Manage and operate multiple advanced AI compute infrastructure clusters
Monitor and oversee cluster health, proactively identifying and resolving potential issues
Maximize compute capacity through optimization and efficient resource allocation
Deploy, configure, and debug container-based services using Docker
Provide 24/7 monitoring and support, leveraging automated tools and performing hands-on troubleshooting as needed
Handle engineering escalations and collaborate with other teams to resolve complex technical challenges
Contribute to the development and improvement of our monitoring and support processes
Stay up-to-date with the latest advancements in AI compute infrastructure and related technologies

What we offer

Build a breakthrough AI platform beyond the constraints of the GPU
Publish and open source your cutting-edge AI research
Work on one of the fastest AI supercomputers in the world
Enjoy job stability with startup vitality
Enjoy our simple, non-corporate work culture that respects individual beliefs

Data Center Production Operations Manager

Meta is seeking a forward-thinking, experienced individual to join the Data Cente...

Location

United States, Houston

Salary:

135000.00 - 191000.00 USD / Year

Compute Linux Engineer

The candidate provides Operate and Admin support on Compute infrastructure and t...

Location

India, Bangalore

Salary:

Not provided

Hewlett Packard Enterprise

Expiration Date

Until further notice

Requirements

Broad technical knowledge of HPE ISS solutions – installing, configuring, and troubleshooting C7000 enclosures, HPE Synergy, Virtual Connects, Blade Switches (SAS, Ethernet & FC), ProLiant Blades & Storage Blades
Operating systems knowledge – install, configure, administer, and troubleshoot RHEL/SUSE (as bare-metal OS and as VMs on hypervisors) and VMware
Working knowledge of Red Hat/SUSE Linux
Troubleshooting hardware issues from OS logs collected with VM-support, HPSreport, SOSreport, Support-Config, etc.
Knowledge of SAN and NAS technologies (Ethernet/iSCSI, FC, FCoE)
Knowledge of DAS storage and HBAs – Smart Array/RAID, SSDs, SAS, SATA, etc.
Disaster recovery planning and conducting DR tests
Experience delivering routine performance, capacity, and security audit analysis reports to customers to support planned changes
Linux vulnerability assessment and mitigation
Serviceguard cluster configuration and management on Linux, including integration with database and ERP solutions

Job Responsibility

Resolve customers' issues via telephone, email, or remote sessions
Reproduce issues in-house and respond in a timely manner
Follow up regularly with customers with recommendations, updates, and action plans
Identify and escalate issues to the vendor in a timely manner according to Standard Operating Procedures
Leverage internal technical expertise, including peers, mentors, knowledge base, community forums, and other internal tools, to provide the most effective solutions to customer issues
Collaborate with other CoE/HW teams in diagnosing and isolating the cause of complex issues
Maintain quality on case documentation, SLA timeframes, and operational metrics
Perform within the team's productivity measure (scorecard)
Incident Management: Resolve single and cross technology incidents independently
Lead the team members to resolve complex or cross technology incidents

What we offer

Health & Wellbeing
Personal & Professional Development
Comprehensive suite of benefits that support physical, financial and emotional wellbeing
Unconditional Inclusion

Fulltime

Compute Linux Engineer

HPE Operations is our innovative IT services organization. It provides the exper...

Location

India, Bangalore

Salary:

Not provided

Hewlett Packard Enterprise

Expiration Date

Until further notice

Requirements

Broad technical knowledge of HPE ISS solutions – installing, configuring, and troubleshooting C7000 enclosures, HPE Synergy, Virtual Connects, Blade Switches (SAS, Ethernet & FC), ProLiant Blades & Storage Blades
Operating systems knowledge – install, configure, administer, and troubleshoot RHEL/SUSE (as bare-metal OS and as VMs on hypervisors) and VMware
Working knowledge of Red Hat/SUSE Linux
Troubleshooting hardware issues from OS logs collected with VM-support, HPSreport, SOSreport, Support-Config, etc.
Knowledge of SAN and NAS technologies (Ethernet/iSCSI, FC, FCoE)
Knowledge of DAS storage and HBAs – Smart Array/RAID, SSDs, SAS, SATA, etc.
Disaster recovery planning and conducting DR tests
Experience delivering routine performance, capacity, and security audit analysis reports to customers to support planned changes
Linux vulnerability assessment and mitigation
Serviceguard cluster configuration and management on Linux, including integration with database and ERP solutions

Job Responsibility

Resolve customers' issues via telephone, email, or remote sessions
Reproduce issues in-house and respond in a timely manner
Follow up regularly with customers with recommendations, updates, and action plans
Identify and escalate issues to the vendor in a timely manner according to Standard Operating Procedures
Leverage internal technical expertise, including peers, mentors, knowledge base, community forums, and other internal tools, to provide the most effective solutions to customer issues
Collaborate with other CoE/HW teams in diagnosing and isolating the cause of complex issues
Maintain quality on case documentation, SLA timeframes, and operational metrics
Perform within the team's productivity measure (scorecard)
Incident Management: Resolve single and cross-technology incidents independently. Lead team members in resolving complex or cross-technology incidents
Escalation Management: Identify, manage, and lead technical escalations. Participate in formal escalations when required, especially during a crisis

What we offer

Health & Wellbeing
Personal & Professional Development
Unconditional Inclusion

Fulltime
