Platform Engineer – AIOps & Infrastructure

Job Posted May 31, 2026

Job Description

The Platform Engineer – AIOps & Infrastructure will be responsible for designing, automating, and maintaining scalable infrastructure and platform services for AI/ML operations. This role combines Platform Engineering, DevOps, Cloud Infrastructure, and MLOps, ensuring high availability, observability, security, and operational excellence across production environments.

Job Responsibility

  • Design and maintain scalable cloud-native infrastructure for AI/ML workloads
  • Manage Kubernetes environments, container orchestration, and platform services
  • Build and optimize CI/CD pipelines and Infrastructure-as-Code frameworks
  • Support MLOps and LLMOps workflows, including deployment, monitoring, and lifecycle management
  • Implement monitoring, logging, alerting, and observability solutions
  • Drive DevSecOps, automation, security, and reliability best practices
  • Collaborate with AI Engineers, Data Scientists, and Infrastructure teams to support production AI systems
  • Participate in troubleshooting, incident response, and platform optimization initiatives

Requirements

  • Bachelor’s degree in Computer Science, Engineering, Information Systems, or equivalent experience
  • 5+ years of experience in Platform Engineering, DevOps, Cloud Infrastructure, SRE, MLOps, or related fields
  • Strong experience with AWS, Azure, or GCP
  • Hands-on expertise with Kubernetes, Docker, and Infrastructure-as-Code tools (Terraform, CloudFormation, or similar)
  • Experience building CI/CD pipelines and automation workflows
  • Strong scripting skills using Python, Bash, or similar languages
  • Experience with monitoring and observability platforms such as Grafana, Prometheus, Datadog, or ELK
  • Advanced English proficiency (B2–C1)
  • Comfortable working remotely with minimal supervision
  • Proactive, detail-oriented, and collaborative
  • Ability to thrive in a fast-paced, startup-like environment

Nice to have

  • Experience supporting enterprise-scale AI/ML or Generative AI platforms in production
  • Strong knowledge of MLOps and LLMOps ecosystems
  • Experience with MLflow, Kubeflow, Airflow, SageMaker, or similar tools
  • Familiarity with GPU workloads, distributed systems, and AI inference infrastructure
  • Experience implementing DevSecOps, governance, compliance, and security frameworks
  • Background in high-availability and scalable cloud environments

Similar Jobs for Platform Engineer – AIOps & Infrastructure

8 matching positions

Platform Engineering Manager

As Platform Engineering Manager at Power Design, you'll lead the buildout of our...
Location: United States, St Petersburg
Salary: Not provided
Power Design
Expiration Date: Until further notice
Requirements
  • Education: Bachelor's degree in Computer Science, Computer Engineering, Information Systems, or a related field; equivalent professional experience considered
  • Experience: 7–10 years of progressive experience in infrastructure engineering, platform engineering, DevOps, or SRE — with meaningful time in both hands-on implementation and technical leadership
  • Preferred certifications: HashiCorp Terraform Associate, AWS/Azure Solutions Architect, CKA/CKAD, or equivalent cloud or platform engineering certifications
  • Hands-on production experience with at least one major cloud platform (Azure, AWS, GCP, or OCI); breadth across multiple platforms strongly preferred
  • Demonstrated history of evaluating infrastructure decisions through a cloud-first lens, identifying when to leverage cloud services rather than defaulting to on-premises solutions
  • Hands-on expertise with Terraform or a comparable IaC framework
  • GitOps pipeline experience (GitHub Actions, Azure DevOps, GitLab CI, or similar)
  • Production experience implementing enterprise observability and AIOps tooling (Datadog, Dynatrace, New Relic, Prometheus/Grafana, or equivalent), including anomaly detection, event correlation, and automated remediation workflows
Job Responsibility
  • Design, build, and maintain automation for infrastructure provisioning, configuration, and lifecycle management — with security controls built in from the start
  • Lead the evaluation, selection, and implementation of Power Design's first enterprise observability and AIOps platform, owning the decision end-to-end from vendor assessment through production rollout
  • Develop and maintain observability tooling, dashboards, and automated remediation workflows covering metrics, logging, tracing, and alerting across cloud and on-premises environments
  • Build and enforce CI/CD pipelines for infrastructure and platform services using GitOps best practices
  • Continuously evaluate the infrastructure footprint and identify workloads where cloud migration would improve resilience, reduce complexity, or lower cost — and build the business case to act on it
  • Apply a security-first lens to every platform decision, including IAM/RBAC design, secrets management, Zero Trust implementation (Zscaler), and policy-as-code
  • Create self-service infrastructure workflows — provisioning automation, access workflows, and internal developer tooling — to reduce ticket volume and enable engineering teams to move faster
  • Leverage AI-assisted tooling for anomaly detection, event correlation, and operational insights to drive a proactive operations model
  • Establish and own design standards, architectural consistency, and IaC strategy across the Platform Engineering function
  • Provide technical leadership and mentorship to platform engineers
Employment type: Full-time

Senior AIOps Engineer (Platform & Infrastructure)

Groupon is moving beyond "experimenting" with AI to running it at massive scale....
Location: Prague; Warsaw; Valencia; Madrid
Salary: Not provided
Groupon
Expiration Date: Until further notice
Requirements
  • 5+ years in Platform Engineering, SRE, or DevOps within a cloud-native environment
  • Deep experience managing stateful and stateless workloads (Helm, Istio, Docker)
  • Hands-on experience deploying and operating AI/ML tools or data-intensive systems in production
  • Strong skills in Python or Go to build custom API wrappers and automate operational tasks
  • Expertise in Prometheus, Grafana, and ELK stack to ensure end-to-end observability of complex AI requests
Job Responsibility
  • Architect the AI Stack: Design and operate core infrastructure on Kubernetes, including Vector Databases, LLM Gateways (LiteLLM), and workflow automation tools (n8n)
  • Enable at Scale: Drive AI adoption by creating self-service "Golden Paths" using Terraform and Helm, allowing engineering teams to deploy RAG pipelines with one click
  • Operational Excellence: Implement centralized observability, tracing (Langfuse), and governance to ensure our AI systems are reliable, auditable, and secure
  • Fiscal Discipline: Own the "AI Bill"—monitoring token usage and latency to optimize spend while maintaining high performance
What we offer
  • End-to-end Ownership: Real authority to standardize how a global company builds with AI
  • Career Growth: This is a high-visibility role within a new, strategic team with potential for leadership progression

Lead Engineer – Platform Engineering

We are looking for a Lead DevOps Engineer to join the Platform Engineering team ...
Location: United States, St Petersburg, Florida
Salary: Not provided
Raymond James
Expiration Date: Until further notice
Requirements
  • Deep experience with virtualization platforms (e.g., VMware vSphere/ESXi, Hyper‑V, KVM/Nutanix)
  • Hands‑on experience with configuration management tools such as Ansible
  • Implement and support enterprise load balancer solutions (e.g., F5 BIG-IP, NGINX, Azure/AWS load balancers), including configuration, automation, and traffic‑routing policies
  • Familiarity with AI‑assisted operations tools (AIOps), or how they can fit into the workflow
  • Solid understanding of CI/CD systems (GitHub Actions, Azure DevOps, Jenkins, GitLab CI)
  • Advanced scripting skills in Python, PowerShell, and/or Bash
  • Experience with provisioned workflow development in ServiceNow
  • Strong knowledge of monitoring and logging platforms (Prometheus/Grafana, Splunk, Elastic, Datadog, etc.)
  • Understanding of security best practices, IAM/RBAC, secrets management, and compliance frameworks
  • Strong networking and systems fundamentals (TCP/IP, DNS, load balancing, storage)
Job Responsibility
  • Design, build, and maintain automation for VM provisioning, configuration, and lifecycle management
  • Enhance and support CI/CD pipelines for infrastructure and platform services
  • Provide technical leadership and mentorship to engineers across the platform engineering team
  • Use AI‑assisted tooling when beneficial for anomaly detection, event correlation, and operational insights
  • Work on standardized VM images, templates, and OS baselines to ensure consistency and security
  • Improve platform reliability through monitoring, alerting, and SRE‑aligned practices
  • Develop and maintain observability tooling, dashboards, and automated remediation workflows
  • Ensure security best practices across VM platforms, including RBAC, secrets management, and patching
  • Optimize VM capacity, performance, and resource utilization across environments
  • Collaborate with development, cloud, and security teams to deliver stable, self‑service platform capabilities
Employment type: Full-time

Account Manager, Global System Integrator

Account Manager for OpsRamp business focusing on Global System Integrators. This...
Location: United States, New Jersey or Texas
Salary: 194500.00 - 456500.00 USD / Year
Hewlett Packard Enterprise
Expiration Date: Until further notice
Requirements
  • High passion for learning new technologies in the market
  • Excellent communication skills (oral, written, and presentation) with the ability to articulate and sell on value propositions to GSIs
  • Deep understanding of GSIs' business ecosystems and a knack for positioning product / platform as integrated solutions with partners
  • A track record for being detail-oriented with a demonstrated ability to self-motivate and follow-through on projects
  • Strong problem-solving skills with an ability to analyze problems and develop actionable and appropriate tactical plans quickly
  • Strong sales acumen with an understanding across IT Operations Management and AI Ops Platform solutions (ITOM & AIOps)
  • Strong understanding of Strategic Consulting, Systems Integration, Global Delivery Models, Managed / IT Outsourcing Services, Infrastructure Management Services, IT / Data Center Transformation, ITOM & AIOps, Cloud Computing, Platform Service, etc.
  • Exceptional interpersonal and relationship management skills
  • Proven ability to build and maintain executive-level relationships
  • Bachelor of Engineering, MBA (Preferred)
Job Responsibility
  • Develop and maintain executive relations within Global System Integrators (GSIs) to broaden awareness and acceptance of OpsRamp AIOps Solutions to power their Managed Services Platforms
  • Recruit new GSIs in line with the company's direction to drive growth for the OpsRamp business
  • Develop and execute a strategic business plan that meets and exceeds revenue targets
  • Align with cross-functional stakeholders including Product Management, Engineering, Marketing, Sales, and Operations
  • Create incremental revenue opportunities with GSIs via new joint solution offerings, new markets, and joint customer pursuits
  • Develop and maintain a robust deal pipeline with targeted solutions to continuously grow the business and generate incremental revenue
  • Provide timely, concise, accurate information of account and opportunity status, plans, and events
  • Manage and report business through accurate forecasting, stakeholder updates, and quarterly business reviews
  • Exceed revenue growth expectations
  • Achieve quarterly and annual bookings targets by growing joint partner business across the globe
What we offer
  • Health & Wellbeing benefits
  • Personal & Professional Development programs
  • Unconditional Inclusion environment
  • Comprehensive benefits suite supporting physical, financial and emotional wellbeing
  • Career development programs
Employment type: Full-time

GSI Sales

As part of the OpsRamp presales team, revolutionize cloud computing by deliverin...
Location: India, Bangalore
Salary: Not provided
Hewlett Packard Enterprise
Expiration Date: Until further notice
Requirements
  • High passion for learning new technologies in the market
  • Excellent communication skills (oral, written, and presentation) with the ability to articulate and sell on value propositions to GSIs
  • Deep understanding of GSIs' business ecosystems and a knack for positioning product/platform as integrated solutions with partners
  • A track record for being detail-oriented with a demonstrated ability to self-motivate and follow-through on projects
  • Strong problem-solving skills with an ability to analyze problems and develop actionable and appropriate tactical plans quickly
  • Adaptability and flexibility to work in a startup environment
  • Strong sales acumen with an understanding across ITOM & AIOps solutions
  • Strong understanding of Strategic Consulting, Systems Integration, Global Delivery Models, Managed/IT Outsourcing Services, Infrastructure Management Services, IT/Data Center Transformation, ITOM & AIOps, Cloud Computing, Platform Service, etc.
  • Exceptional interpersonal and relationship management skills
  • Proven ability to build and maintain executive-level relationships
Job Responsibility
  • Develop and maintain executive relations within Global System Integrators (GSIs) to broaden awareness and acceptance of OpsRamp AIOps Solutions to power their Managed Services Platforms
  • Recruit new GSIs in line with the company’s direction to drive growth for the OpsRamp business
  • Develop and execute a strategic business plan that meets and exceeds revenue targets
  • Align with cross-functional stakeholders including Product Management, Engineering, Marketing, Sales, and Operations
  • Create incremental revenue opportunities with GSIs via new joint solution offerings, new markets, and joint customer pursuits
  • Develop and maintain a robust deal pipeline with targeted solutions to continuously grow the business and generate incremental revenue
  • Provide timely, concise, accurate information of account & opportunity status, plans, and events
  • Manage and report business through accurate forecasting, stakeholder updates, and quarterly business reviews
  • Exceed revenue growth expectations
  • Achieve quarterly and annual bookings targets by growing joint partner business across the globe
What we offer
  • Comprehensive suite of benefits that supports physical, financial and emotional wellbeing
  • Specific programs catered to helping reach career goals
  • Inclusive environment celebrating individual uniqueness
Employment type: Full-time

Senior SRE – Data & Middleware Observability & Incident Reduction Vice President

The Senior Incident Operations & Optimization Specialist for Data & Middleware i...
Location: United States, Irving
Salary: 125760.00 - 188640.00 USD / Year
Citi
Expiration Date: Until further notice
Requirements
  • 8+ years of hands-on experience in database administration, middleware engineering, or enterprise data platform operations
  • Proven experience in event management, alert tuning, and incident reduction for data and middleware services, with measurable results
  • Direct, hands-on experience with modern AIOps and event management platforms
  • Deep knowledge of both relational (e.g., Oracle, SQL Server) and NoSQL (e.g., MongoDB) database technologies, including clustering, replication, and performance tuning
  • Expertise in middleware platforms, including messaging technologies (e.g., MQ, Kafka) and application servers (e.g., WebSphere, Tomcat)
  • Hands-on experience developing robust automation solutions using relevant scripting languages (e.g., Python, Shell) and modern automation frameworks
  • Proficiency in log analysis, pattern recognition, and using query languages for data analysis on log aggregation platforms
  • Excellent analytical abilities with a systematic approach to troubleshooting complex data platform architectures and correlating infrastructure issues with application impact
  • Exceptional communication skills with the ability to collaborate effectively with DBAs, middleware engineers, and application teams, and to present technical concepts to diverse audiences
  • Bachelor's degree in Computer Science, Information Technology, Computer Engineering, or a related technical field
Job Responsibility
  • Analyze and optimize monitoring across all database and middleware platforms to address high-volume, low-value alerts, identify patterns in incident generation, and determine root causes
  • Develop and implement domain-specific correlation, de-duplication, and suppression rules on AIOps and event management platforms
  • Create logic that understands database cluster relationships, messaging dependencies, and application-to-database connections
  • Architect and develop automation playbooks for incident data enrichment and automated remediation of common database and middleware issues, such as connection pool resets or service restarts
  • Identify monitoring gaps across the data and middleware landscape, proposing enhancements to ensure comprehensive health monitoring and address blind spots in transactional flows
  • Partner closely with Database Administration (DBA), middleware engineering, and application teams to validate correlation logic, build consensus on threshold changes, and provide expert guidance on event management best practices
  • Continuously validate the effectiveness of implemented rules and automation, ensuring critical health indicators remain highly visible
  • Lead post-implementation reviews and drive iterative improvements
What we offer
  • medical, dental & vision coverage
  • 401(k)
  • life, accident, and disability insurance
  • wellness programs
  • paid time off packages, including planned time off (vacation), unplanned time off (sick leave), and paid holidays
Employment type: Full-time

Senior Product Marketing Manager

Are you passionate about cloud computing and the future of intelligent cloud ope...
Location: United States, Redmond
Salary: 106400.00 - 203600.00 USD / Year
Microsoft Corporation
Expiration Date: Until further notice
Requirements
  • Master's Degree in Marketing, Computer Science, Business or related field AND 3+ years experience in business OR Bachelor's Degree in Marketing, Computer Science, Business or related field AND 5+ years experience in business OR equivalent experience
  • Strong background in B2B audience marketing, cloud infrastructure, AIOps platform, or adjacent technical domains
  • Proven experience launching complex, technical products and shaping new or emerging categories
  • Deep comfort with technical concepts (cloud architecture, AI systems, automation, APIs)
  • Exceptional positioning, messaging, and storytelling skills
  • Strategic thinker who can also execute with speed and precision
  • Customer-obsessed and insight-driven
Job Responsibility
  • Develop and lead the outbound marketing strategy for agentic cloud operations, from early-category definition to scale
  • Develop differentiated positioning, messaging frameworks, and value propositions for technical and business audiences
  • Define customer personas and use cases across platform, infrastructure, and AI-driven operations teams
  • Partner closely with Integrated Marketing and Audience Marketing to execute outbound marketing campaigns, track results and optimize campaigns or programs
  • Lead go-to-market planning and execution for major product launches and feature releases for the agentic cloud ops portfolio
  • Craft the core narrative around agentic systems across key cloud operations domains and the lifecycle, i.e., deployment/configuration, observability, resiliency, optimization, and security
  • Translate complex technical concepts into clear, compelling stories without oversimplifying
  • Partner with Go-To-Market managers to build enablement assets (pitch decks, demos, battlecards, case studies)
  • Equip Go-To-Market and field teams to sell a new category with confidence and consistency
  • Support enterprise, mid-market, and developer-led motions as needed
Employment type: Full-time

Principal Site Reliability Engineer

Groupon is modernizing its global platform — and reliability is at the center of...
Location: Colombia
Salary: Not provided
Groupon
Expiration Date: Until further notice
Requirements
  • 10+ years in software/systems engineering
  • 5+ years in SRE or platform reliability
  • Strong experience with GCP (preferred) or AWS, Kubernetes, and Terraform
  • Proficiency in Python or Go for automation and tooling
  • Deep understanding of observability stacks (Prometheus, Grafana, OpenTelemetry) and service meshes (Istio, Envoy)
  • Hands-on AIOps experience: anomaly detection, predictive analytics, ML-assisted operations
  • Strong communication and influencing skills — data over hierarchy
Job Responsibility
  • Architect and maintain self-healing systems with 99.9%+ availability targets
  • Use AI/ML to automate infrastructure governance and detect configuration or IaC anti-patterns
  • Implement adaptive SLIs/SLOs that evolve automatically from real-time data
  • Build AIOps-based observability and auto-remediation pipelines
  • Apply predictive modeling to forecast failures before they impact users
  • Lead chaos, performance, and resilience testing programs
  • Map platform and service behavior to revenue impact and drive improved revenue resilience through better infrastructure performance
  • Mentor engineers and drive reliability standards across teams
  • Partner with platform, data, and product teams to ensure stability aligns with business goals
  • Support major incident response, incident review, and participate in on-call rotations
What we offer
  • The opportunity to work with cutting-edge technologies in a transformative environment
  • Professional growth and leadership development pathways tailored to your aspirations
  • A chance to leave a lasting impact by shaping the future of reliable and scalable systems