Site Reliability Engineer SRE – ML platform

Thirdeye Data

Location:
United States, Sunnyvale

Category:
IT - Software Development

Contract Type:
Not provided

Salary:
Not provided

Job Responsibility:

  • Continuous Deployment using GitHub Actions, Flux, Kustomize
  • Design and implement cloud solutions, build MLOps on AWS cloud
  • Data science model containerization and deployment using Docker, vLLM, and Kubernetes
  • Communicate with a team of data scientists, data engineers, and architects, and document the processes
  • Develop and deploy scalable tools and services for our clients to handle machine learning training and inference
  • Knowledge of ML models and LLMs

Requirements:

  • 6+ years of experience in MLOps with strong knowledge of Kubernetes, Python, MongoDB, and AWS
  • Good understanding of Apache Solr
  • Proficient with Linux administration
  • Knowledge of ML models and LLMs
  • Ability to understand tools used by data scientists and experience with software development and test automation
  • Ability to design and implement cloud solutions and to build MLOps pipelines on AWS
  • Experience working with cloud computing and database systems
  • Experience building custom integrations between cloud-based systems using APIs
  • Experience developing and maintaining ML systems built with open-source tools
  • Experience with MLOps frameworks such as Kubeflow, MLflow, DataRobot, and Airflow, and with Docker and Kubernetes
  • Experience developing containers and working with Kubernetes in cloud computing environments
  • Familiarity with one or more data-oriented workflow orchestration frameworks (Kubeflow, Airflow, Argo, etc.)
  • Ability to translate business needs to technical requirements
  • Strong understanding of software testing, benchmarking, and continuous integration
  • Exposure to machine learning methodology and best practices
  • Good communication skills and ability to work in a team

Additional Information:

Job Posted:
December 26, 2025

Employment Type:
Full-time
Work Type:
On-site work

Similar Jobs for Site Reliability Engineer SRE – ML platform

Principal Site Reliability Engineer

Groupon is modernizing its global platform — and reliability is at the center of...
Location:
Ecuador
Salary:
Not provided
Groupon
Expiration Date:
Until further notice
Requirements:
  • 10+ years in software/systems engineering, including 5+ years in SRE or platform reliability
  • Strong experience with GCP (preferred) or AWS, Kubernetes, and Terraform
  • Proficiency in Python or Go for automation and tooling
  • Deep understanding of observability stacks (Prometheus, Grafana, OpenTelemetry) and service meshes (Istio, Envoy)
  • Hands-on AIOps experience: anomaly detection, predictive analytics, ML-assisted operations
  • Strong communication and influencing skills — data over hierarchy
Job Responsibility:
  • Architect and maintain self-healing systems with 99.9%+ availability targets
  • Use AI/ML to automate infrastructure governance and detect configuration or IaC anti-patterns
  • Implement adaptive SLIs/SLOs that evolve automatically from real-time data
  • Build AIOps-based observability and auto-remediation pipelines
  • Apply predictive modeling to forecast failures before they impact users
  • Lead chaos, performance, and resilience testing programs
  • Map platform and service behavior to revenue impact and drive improved revenue resilience through better infrastructure performance
  • Mentor engineers and drive reliability standards across teams
  • Partner with platform, data, and product teams to ensure stability aligns with business goals
  • Support major incident response, incident review, and participate in on-call rotations
What we offer:
  • The opportunity to work with cutting-edge technologies in a transformative environment
  • Professional growth and leadership development pathways tailored to your aspirations
  • A chance to leave a lasting impact by shaping the future of reliable and scalable systems

Principal Site Reliability Engineer (AI-first SRE)

Groupon is modernizing its global platform — and reliability is at the center of...
Location:
Peru
Salary:
Not provided
Groupon
Expiration Date:
Until further notice
Requirements:
  • 10+ years in software/systems engineering, including 5+ years in SRE or platform reliability
  • Strong experience with GCP (preferred) or AWS, Kubernetes, and Terraform
  • Proficiency in Python or Go for automation and tooling
  • Deep understanding of observability stacks (Prometheus, Grafana, OpenTelemetry) and service meshes (Istio, Envoy)
  • Hands-on AIOps experience: anomaly detection, predictive analytics, ML-assisted operations
  • Strong communication and influencing skills — data over hierarchy
Job Responsibility:
  • Architect and maintain self-healing systems with 99.9%+ availability targets
  • Use AI/ML to automate infrastructure governance and detect configuration or IaC anti-patterns
  • Implement adaptive SLIs/SLOs that evolve automatically from real-time data
  • Build AIOps-based observability and auto-remediation pipelines
  • Apply predictive modeling to forecast failures before they impact users
  • Lead chaos, performance, and resilience testing programs
  • Map platform and service behavior to revenue impact and drive improved revenue resilience through better infrastructure performance
  • Mentor engineers and drive reliability standards across teams
  • Partner with platform, data, and product teams to ensure stability aligns with business goals
  • Support major incident response, incident review, and participate in on-call rotations
What we offer:
  • The opportunity to work with cutting-edge technologies in a transformative environment
  • Professional growth and leadership development pathways tailored to your aspirations
  • A chance to leave a lasting impact by shaping the future of reliable and scalable systems

Principal Site Reliability Engineer

Groupon is modernizing its global platform — and reliability is at the center of...
Location:
Colombia
Salary:
Not provided
Groupon
Expiration Date:
Until further notice
Requirements:
  • 10+ years in software/systems engineering
  • 5+ years in SRE or platform reliability
  • Strong experience with GCP (preferred) or AWS, Kubernetes, and Terraform
  • Proficiency in Python or Go for automation and tooling
  • Deep understanding of observability stacks (Prometheus, Grafana, OpenTelemetry) and service meshes (Istio, Envoy)
  • Hands-on AIOps experience: anomaly detection, predictive analytics, ML-assisted operations
  • Strong communication and influencing skills — data over hierarchy
Job Responsibility:
  • Architect and maintain self-healing systems with 99.9%+ availability targets
  • Use AI/ML to automate infrastructure governance and detect configuration or IaC anti-patterns
  • Implement adaptive SLIs/SLOs that evolve automatically from real-time data
  • Build AIOps-based observability and auto-remediation pipelines
  • Apply predictive modeling to forecast failures before they impact users
  • Lead chaos, performance, and resilience testing programs
  • Map platform and service behavior to revenue impact and drive improved revenue resilience through better infrastructure performance
  • Mentor engineers and drive reliability standards across teams
  • Partner with platform, data, and product teams to ensure stability aligns with business goals
  • Support major incident response, incident review, and participate in on-call rotations
What we offer:
  • The opportunity to work with cutting-edge technologies in a transformative environment
  • Professional growth and leadership development pathways tailored to your aspirations
  • A chance to leave a lasting impact by shaping the future of reliable and scalable systems

Executive Director – AI and Machine Learning

At CVS Health, we’re building a world of health around every consumer and surrou...
Location:
United States, Work At Home, New Jersey
Salary:
175100.00 - 334750.00 USD / Year
CVS Health
Expiration Date:
December 31, 2025
Requirements:
  • PhD or Master's degree in AI/ML, Computer Science, Statistics, Engineering, or equivalent experience
  • 15+ years leading Enterprise Machine Learning, Infrastructure, Data Science, and/or SRE practices
  • 5+ years applying Machine Learning to optimize technology operations (AIOps)
  • 10+ years at a leadership level or above, within a Fortune 500 company with significant scale
  • Proven experience leading AI governance, establishing and maintaining robust ML Ops environments, leading development of large-scale AI and ML platforms and solutions, and developing strategic partnerships with internal clients, industry experts, and vendors
  • Ability to develop and implement a comprehensive AI/ML strategy that aligns with the organization's business goals
  • Deep understanding of AI/ML technologies, including model development, deployment, MLOps, GenAIOps, and LLMOps practices
  • Demonstrated knowledge of and significant experience building and operating on-premise AI processor (e.g., GPU clusters) and platform architectures for the deployment and management of enterprise AI workloads
  • Experience with and commitment to ensuring AI/ML solutions are developed and deployed ethically, with a focus on fairness, transparency, and accountability
  • Familiarity with industry standards and regulations related to AI and Machine Learning
Job Responsibility:
  • Develop, implement, and enhance governance frameworks and policies to ensure effective oversight of operational and security-focused AI and ML solutions
  • Establish and enforce standards for the build, management, governance, and utilization of AI models and model execution platforms
  • Establish and socialize a framework for the documentation, proposal, evaluation, build, delivery, and ongoing value assessment of scalable operations and security-focused AI/ML solutions
  • Evaluate and certify foundational models for use within CVS Health, ensuring alignment with organizational goals and security requirements
  • Regularly assess and enhance the governance model and associated standards to address emerging challenges and opportunities
  • Establish and maintain robust MLOps, GenAIOps, and LLMOps practices
  • Build and manage pipelines to enable teams to design AI-powered applications, develop and experiment with models, and deploy, monitor, and maintain them in production
  • Drive delivery of AI and ML solutions that provide deep insights and reporting on operations and security data
  • Develop proactive AI-driven solutions to measurably reduce time to detect security and operational issues, provide adaptive recommendations, and automate remediation
  • Deliver solutions to enable users to interact with operational data driving measurable improvements in productivity, performance, and innovation
What we offer:
  • Affordable medical plan options
  • 401(k) plan (including matching company contributions)
  • Employee stock purchase plan
  • No-cost programs for all colleagues including wellness screenings, tobacco cessation and weight management programs, confidential counseling and financial coaching
  • Paid time off
  • Flexible work schedules
  • Family leave
  • Dependent care resources
  • Colleague assistance programs
  • Tuition assistance
Employment Type:
Full-time

Kitchen Lead

Oversee meal prep, planning, and cleaning for up to 2000 guests per meal. Assist...
Location:
United States, Glorieta
Salary:
15.00 USD / Hour
Christian Career
Expiration Date:
Until further notice
Requirements:
  • Ability to manage peers, full time staff, summer, seasonal and hourly staff in a positive and respectful way
  • A personal relationship with Christ
  • Strong work ethic
  • Highly organized with an eye for detail
  • Excellent personal hygiene
  • Strong sense of urgency and timeliness
  • A servant’s heart, with a strong calling towards hospitality
  • Problem solving skills, ability to think quickly to adjust plans and still achieve desired outcomes
Job Responsibility:
  • Oversee meal prep, planning, and cleaning for up to 2000 guests per meal
  • Assist in the supervision of all hourly kitchen staff, seasonal staff, & volunteers
  • Assist and participate in cleaning projects and managerial tasks as necessary to meet policy and procedural compliance
  • Assist in training and ongoing development of full-time, seasonal, hourly staff and volunteers
  • Maintain leadership in modeling cleanliness, readiness, and food quality standards
  • Responsible for documentation and execution of food safety plans, daily checklists, food waste management, and stock rotation on a shift-by-shift basis
  • Support food service manager in the ordering and receiving of all stock items
  • Monitor food preparation and storage to ensure adherence to ServSafe training
  • Clean and maintain all kitchen appliances, equipment, and areas where food is served
  • Communicate professionally and effectively with guests about all areas of the dining experience which includes allergen information
What we offer:
  • On-site housing with utilities and appliances included
  • Medical coverage through CHM, employer-funded HRA, and pharmacy benefit plan - 100% paid by employer for employee and spouse
  • PTO - base amount of 20 days annually, sick leave, and volunteer time off
  • Retirement - After 1 year of employment, employer contributes 4% NEC and up to 4% matched
  • Camp program and retail discounts, including a free session of camp annually, friends and family lodging options and use of camp facilities for personal gatherings
  • On-site meals for employee and family during the summer and special occasions
  • Overtime pay
Employment Type:
Full-time

Data Architect

We are seeking a highly experienced Data Architect with 12+ years of experience ...
Location:
United Arab Emirates, Dubai
Salary:
Not provided
NorthBay
Expiration Date:
Until further notice
Requirements:
  • 12+ years of experience in Data Engineering and Data Architecture
  • Proven experience working as a Data Architect on large-scale AWS platforms
  • Strong experience designing enterprise data lakes and data warehouses
  • Hands-on experience with batch data processing and orchestration frameworks
  • Excellent communication and stakeholder management skills
  • Ability to work onsite in Dubai, UAE
  • AWS Glue (ETL, Data Catalog)
  • Amazon EMR (Batch Processing)
  • AWS Lambda (Serverless Data Processing)
  • Amazon MWAA (Apache Airflow)
Job Responsibility:
  • Design and own end-to-end AWS data architecture for enterprise platforms
  • Define data architecture standards, best practices, and reference models
  • Architect batch and event-driven data pipelines using AWS native services
  • Lead data ingestion, transformation, and orchestration workflows
  • Design and implement solutions using AWS Glue, EMR, Lambda, and MWAA (Airflow)
  • Architect data lakes and data warehouses using Amazon S3 and Amazon Redshift
  • Design NoSQL data solutions using Amazon DynamoDB
  • Implement data governance, metadata management, and access control using AWS DataZone
  • Ensure monitoring, logging, and observability using Amazon CloudWatch
  • Partner with engineering, analytics, and business teams to translate requirements into scalable data solutions
Employment Type:
Full-time

Prenatal Account Executive

Ready to redefine what's possible in molecular diagnostics? Join a team of brill...
Location:
United States, Louisville
Salary:
184569.00 - 248269.00 USD / Year
BillionToOne
Expiration Date:
Until further notice
Requirements:
  • Minimum three (3) years of outside field sales experience within the healthcare sector, directly calling upon providers in specified geographic territory
  • Demonstrated successful sales track record; understanding of buyer and decision-maker types; effective selling, listening, and presentation skills; and ability to assess and respond to customer needs (national awards a plus)
  • Excellent organizational and communication skills (written and verbal) with demonstrated ability to effectively present to both internal and external customers
  • Effective time management skills, with a demonstrated ability to assess and prioritize opportunities
  • Exceptionally bright, flexible, self-motivated and results oriented with strong interpersonal and analytical skills and the ability to think strategically as well as execute tactically
  • Must act with a sense of urgency, with a focus on closing business
  • Ability to assess the needs of medical professionals and staff members with a focus on consultative sales, coordination of logistics, and problem solving
  • Strong desire to work in a startup environment and must work independently with an internal drive to be successful
  • Working knowledge and application of HIPAA laws, privacy, and ethics surrounding patient privacy and information
  • Demonstrated values and ethics that support BillionToOne's mission, goals, and professional code of conduct
Job Responsibility:
  • Increasing utilization of UNITY Fetal Risk Screen and driving market development through direct sales to individual OBGYNs, MFMs, and Genetic Counselors
  • Identifying, developing, and managing commercial relationships with key opinion leaders in medicine and other key healthcare professionals
  • Effectively prospecting and cultivating new business and maintaining key relationships
  • Identifying and capitalizing on commercial opportunities for growth within a specific region or geography – predominately in OBGYN, MFM, and GC clinics, as well as hospital systems and Federally Qualified Health Centers
  • Creating and implementing a strategic business plan to grow utilization quickly in your geography
  • Managing the full lifecycle of the product sales process, including new business development and lead generation
  • Attending local tradeshows, industry conferences and networking events
What we offer:
  • Working alongside brilliant, kind, passionate and dedicated colleagues, in an empowering environment, toward a global vision, striving for a future in which transformative molecular diagnostics can help millions of patients
  • Open, transparent culture that includes weekly Town Hall meetings
  • The ability to indirectly or directly change the lives of hundreds of thousands of patients
  • Multiple medical benefit options; employee premiums paid 100% for select plans, dependents covered up to 80%
  • Extremely generous Family Bonding Leave for new parents (16 weeks, paid at 100%)
  • Supplemental fertility benefits coverage
  • Retirement savings program including a 4% Company match
  • Increase paid time off with increased tenure
  • Latest and greatest hardware (laptop, lab equipment, facilities)
Employment Type:
Full-time

Senior Frontend Engineer

NorthBay is seeking a Senior Front-End Engineer with deep expertise in JavaScrip...
Location:
Salary:
Not provided
NorthBay
Expiration Date:
Until further notice
Requirements:
  • 5+ years of front-end development experience with strong command over JavaScript and TypeScript
  • Deep expertise in React, Vue.js (Nuxt3), Next.js, and Svelte
  • Strong knowledge of responsive design, accessibility (WCAG), and front-end performance optimization
  • Experience with Lit.js for building modular and reusable UI components
  • Familiarity with Git workflows, modern CI/CD pipelines, and build tools like Vite, Webpack, or Rollup
Job Responsibility:
  • Lead the development of scalable and dynamic front-end applications using React, Vue.js (Nuxt3), Next.js, Svelte, and TypeScript
  • Build reusable components and design systems, including lightweight Lit.js web components
  • Optimize front-end performance through SSR, code splitting, lazy loading, hydration, and runtime optimization
  • Collaborate closely with ML engineers, backend developers, and designers to deliver intuitive, AI-powered user experiences
  • Work with state management libraries like Redux, Vuex, Pinia, or Svelte Stores
  • Build accessible and responsive applications that are cross-browser and cross-device compatible
  • Contribute to testing, documentation, and CI/CD automation to maintain high code quality
  • Participate in architectural discussions, code reviews, and mentor junior front-end engineers
Employment Type:
Full-time