Site Reliability Engineer SRE – ML platform

Thirdeye Data

Location:
United States, Sunnyvale

Contract Type:
Not provided

Salary:
Not provided

Job Responsibility:

  • Continuous deployment using GitHub Actions, Flux, and Kustomize
  • Design and implement cloud solutions and build MLOps pipelines on AWS
  • Containerize and deploy data science models using Docker, vLLM, and Kubernetes
  • Communicate with a team of data scientists, data engineers, and architects, and document the processes
  • Develop and deploy scalable tools and services for our clients to handle machine learning training and inference
  • Knowledge of ML models and LLMs

Requirements:

  • 6+ years of experience in MLOps with strong knowledge of Kubernetes, Python, MongoDB, and AWS
  • Good understanding of Apache Solr
  • Proficient with Linux administration
  • Knowledge of ML models and LLMs
  • Ability to understand the tools data scientists use, plus experience with software development and test automation
  • Ability to design and implement cloud solutions and build MLOps pipelines on AWS
  • Experience working with cloud computing and database systems
  • Experience building custom integrations between cloud-based systems using APIs
  • Experience developing and maintaining ML systems built with open-source tools
  • Experience with MLOps frameworks such as Kubeflow, MLflow, DataRobot, and Airflow, and with Docker and Kubernetes
  • Experience developing containers and working with Kubernetes in cloud computing environments
  • Familiarity with one or more data-oriented workflow orchestration frameworks (Kubeflow, Airflow, Argo, etc.)
  • Ability to translate business needs to technical requirements
  • Strong understanding of software testing, benchmarking, and continuous integration
  • Exposure to machine learning methodology and best practices
  • Good communication skills and ability to work in a team

Additional Information:

Job Posted:
December 26, 2025

Employment Type:
Full-time
Work Type:
On-site work

Similar Jobs for Site Reliability Engineer SRE – ML platform

Principal Site Reliability Engineer (AI-first SRE)

Groupon is modernizing its global platform — and reliability is at the center of...
Location:
Peru
Salary:
Not provided
Groupon
Expiration Date:
Until further notice
Requirements:
  • 10+ years in software/systems engineering, including 5+ years in SRE or platform reliability
  • Strong experience with GCP (preferred) or AWS, Kubernetes, and Terraform
  • Proficiency in Python or Go for automation and tooling
  • Deep understanding of observability stacks (Prometheus, Grafana, OpenTelemetry) and service meshes (Istio, Envoy)
  • Hands-on AIOps experience: anomaly detection, predictive analytics, ML-assisted operations
  • Strong communication and influencing skills — data over hierarchy
Job Responsibility:
  • Architect and maintain self-healing systems with 99.9%+ availability targets
  • Use AI/ML to automate infrastructure governance and detect configuration or IaC anti-patterns
  • Implement adaptive SLIs/SLOs that evolve automatically from real-time data
  • Build AIOps-based observability and auto-remediation pipelines
  • Apply predictive modeling to forecast failures before they impact users
  • Lead chaos, performance, and resilience testing programs
  • Map platform and service behavior to revenue impact and drive improved revenue resilience through better infrastructure performance
  • Mentor engineers and drive reliability standards across teams
  • Partner with platform, data, and product teams to ensure stability aligns with business goals
  • Support major incident response, incident review, and participate in on-call rotations
What we offer:
  • The opportunity to work with cutting-edge technologies in a transformative environment
  • Professional growth and leadership development pathways tailored to your aspirations
  • A chance to leave a lasting impact by shaping the future of reliable and scalable systems

Principal Site Reliability Engineer

Groupon is modernizing its global platform — and reliability is at the center of...
Location:
Colombia
Salary:
Not provided
Groupon
Expiration Date:
Until further notice
Requirements:
  • 10+ years in software/systems engineering
  • 5+ years in SRE or platform reliability
  • Strong experience with GCP (preferred) or AWS, Kubernetes, and Terraform
  • Proficiency in Python or Go for automation and tooling
  • Deep understanding of observability stacks (Prometheus, Grafana, OpenTelemetry) and service meshes (Istio, Envoy)
  • Hands-on AIOps experience: anomaly detection, predictive analytics, ML-assisted operations
  • Strong communication and influencing skills — data over hierarchy
Job Responsibility:
  • Architect and maintain self-healing systems with 99.9%+ availability targets
  • Use AI/ML to automate infrastructure governance and detect configuration or IaC anti-patterns
  • Implement adaptive SLIs/SLOs that evolve automatically from real-time data
  • Build AIOps-based observability and auto-remediation pipelines
  • Apply predictive modeling to forecast failures before they impact users
  • Lead chaos, performance, and resilience testing programs
  • Map platform and service behavior to revenue impact and drive improved revenue resilience through better infrastructure performance
  • Mentor engineers and drive reliability standards across teams
  • Partner with platform, data, and product teams to ensure stability aligns with business goals
  • Support major incident response, incident review, and participate in on-call rotations
What we offer:
  • The opportunity to work with cutting-edge technologies in a transformative environment
  • Professional growth and leadership development pathways tailored to your aspirations
  • A chance to leave a lasting impact by shaping the future of reliable and scalable systems

Principal Site Reliability Engineer

Groupon is modernizing its global platform — and reliability is at the center of...
Location:
Ecuador
Salary:
Not provided
Groupon
Expiration Date:
Until further notice
Requirements:
  • 10+ years in software/systems engineering, including 5+ years in SRE or platform reliability
  • Strong experience with GCP (preferred) or AWS, Kubernetes, and Terraform
  • Proficiency in Python or Go for automation and tooling
  • Deep understanding of observability stacks (Prometheus, Grafana, OpenTelemetry) and service meshes (Istio, Envoy)
  • Hands-on AIOps experience: anomaly detection, predictive analytics, ML-assisted operations
  • Strong communication and influencing skills — data over hierarchy
Job Responsibility:
  • Architect and maintain self-healing systems with 99.9%+ availability targets
  • Use AI/ML to automate infrastructure governance and detect configuration or IaC anti-patterns
  • Implement adaptive SLIs/SLOs that evolve automatically from real-time data
  • Build AIOps-based observability and auto-remediation pipelines
  • Apply predictive modeling to forecast failures before they impact users
  • Lead chaos, performance, and resilience testing programs
  • Map platform and service behavior to revenue impact and drive improved revenue resilience through better infrastructure performance
  • Mentor engineers and drive reliability standards across teams
  • Partner with platform, data, and product teams to ensure stability aligns with business goals
  • Support major incident response, incident review, and participate in on-call rotations
What we offer:
  • The opportunity to work with cutting-edge technologies in a transformative environment
  • Professional growth and leadership development pathways tailored to your aspirations
  • A chance to leave a lasting impact by shaping the future of reliable and scalable systems

Principal Site Reliability Engineer (AI-first SRE)

Groupon is modernizing its global platform — and reliability is at the center of...
Salary:
Not provided
Groupon
Expiration Date:
Until further notice
Requirements:
  • 10+ years in software/systems engineering, including 5+ years in SRE or platform reliability
  • Strong experience with GCP (preferred) or AWS, Kubernetes, and Terraform
  • Proficiency in Python or Go for automation and tooling
  • Deep understanding of observability stacks (Prometheus, Grafana, OpenTelemetry) and service meshes (Istio, Envoy)
  • Hands-on AIOps experience: anomaly detection, predictive analytics, ML-assisted operations
  • Strong communication and influencing skills — data over hierarchy
Job Responsibility:
  • Architect and maintain self-healing systems with 99.9%+ availability targets
  • Use AI/ML to automate infrastructure governance and detect configuration or IaC anti-patterns
  • Implement adaptive SLIs/SLOs that evolve automatically from real-time data
  • Build AIOps-based observability and auto-remediation pipelines
  • Apply predictive modeling to forecast failures before they impact users
  • Lead chaos, performance, and resilience testing programs
  • Map platform and service behavior to revenue impact and drive improved revenue resilience through better infrastructure performance
  • Mentor engineers and drive reliability standards across teams
  • Partner with platform, data, and product teams to ensure stability aligns with business goals
  • Support major incident response, incident review, and participate in on-call rotations
What we offer:
  • The opportunity to work with cutting-edge technologies in a transformative environment
  • Professional growth and leadership development pathways tailored to your aspirations
  • A chance to leave a lasting impact by shaping the future of reliable and scalable systems

Staff Accountant

Our client, a successful manufacturing company, seeks a detail-oriented Staff Ac...
Location:
United States, Colorado Springs
Salary:
Not provided
Robert Half
Expiration Date:
Until further notice
Requirements:
  • Bachelor’s degree in Accounting, Finance, or related field
  • 2+ years of relevant accounting experience, preferably within manufacturing
  • Strong knowledge of journal entries, month-end close, payroll, and financial reporting
  • Experience with bank reconciliations and governmental filings preferred
  • Exceptional organizational, analytical, and communication skills
  • Proficiency in Microsoft Excel and accounting software (industry-specific systems a plus)
  • Demonstrated ability to work independently and collaborate within cross-functional teams
Job Responsibility:
  • Lead and execute month-end closing activities, ensuring all financial data is accurately recorded and reported
  • Compile, prepare, and input a wide variety of journal entries (including accruals, prepaids, depreciation, reversals, and more)
  • Manage payroll processing for 85-90 employees, including wage calculations, payroll accruals, and reversal of prior month accruals
  • Calculate and record depreciation and write off prepaid expenses as required
  • Record, review, and reconcile accrued interest and related journal entries
  • Monitor the sales value of finished goods, including reversal of prior month figures to ensure accuracy
  • Analyze cost of goods sold and provide periodic reports to management
  • Maintain accuracy and integrity of balance sheet accounts through ongoing reconciliation and review
  • Prepare comprehensive monthly and yearly financial statements for internal and external stakeholders
  • Perform bank reconciliations for two primary accounts and resolve discrepancies
What we offer:
  • Medical, vision, dental, life, and disability insurance
  • Eligibility to enroll in the company 401(k) plan

Yard Assistant

We’re looking for a proactive and hard-working Yard Assistant to join our friend...
Location:
United Kingdom, Port Talbot
Salary:
25872.99 GBP / Year
360 Resourcing Solutions
Expiration Date:
Until further notice
Requirements:
  • Previous experience in a similar yard or warehouse role (preferred)
  • A helpful, positive attitude and strong customer service skills
  • Physically fit and comfortable working outdoors in all weather
  • A team player who can also use initiative when needed
  • Forklift licence (desirable, or willingness to train)
  • Good attention to detail and a commitment to safety
Job Responsibility:
  • Serving customers in the yard and assisting with loading their vehicles
  • Picking and preparing orders for delivery or collection
  • Loading/unloading company and supplier vehicles safely and accurately
  • Checking materials for quality, quantity, and damage
  • Monitoring and maintaining stock levels and layout
  • Supporting with regular stock takes and housekeeping duties
  • Reporting any accidents, damages, or security issues
  • Operating forklift trucks (following appropriate training/certification)
  • Ensuring high health & safety standards are maintained across the site
What we offer:
  • Profit Share Bonus Scheme
  • Online discount portal including money off retail brands and holidays
  • Employee Care Helpline and access to a digital GP
  • Staff discount scheme
  • Death in Service benefit
  • Formal training and career progression opportunities
  • Full-time

Head of Growth Marketing

Our client is one of the pioneers of the AI industry. They are not only a techno...
Location:
United States
Salary:
120000.00 - 240000.00 USD / Year
80Twenty
Expiration Date:
Until further notice
Requirements:
  • You've made things go viral repeatedly and you understand why
  • Experience in a growth, marketing, or founder role at a consumer startup (B2C)
  • Deep fluency in internet culture
  • Hands-on and resourceful – you'll write posts, engage directly, and respond to DMs yourself
  • Excellent taste and communication style
Job Responsibility:
  • Own social presence and make it genuinely compelling
  • Build and nurture relationships with creators and influencers who align with the brand
  • Create viral moments around product launches, features, and company milestones
  • Drive organic buzz that gets people talking (X, TikTok, Reddit, Discord, and other platforms)
  • Work closely with leadership to shape brand voice and positioning
  • Experiment constantly – test hooks, formats, and channels until things go viral!

Product Design Director

At AKQA, we believe in the imaginative application of art and science to design ...
Location:
United Kingdom, London
Salary:
Not provided
AKQA
Expiration Date:
Until further notice
Requirements:
  • 10+ years’ experience in product design or interaction design at leading studios or in-house teams
  • Proven leadership in delivering large-scale, multi-platform digital products from concept to launch
  • Mastery of Figma and contemporary design-system workflows
  • Strong understanding of human-centred design, accessibility, motion, and micro-interaction principles
  • Familiarity with AI design tools, prototyping in Framer, Principle, ProtoPie, or After Effects, and collaboration with machine-learning or data-science teams
  • Demonstrable success in uniting brand and product through interface design
  • Experience mentoring mid- and senior-level designers
  • Comfortable presenting to senior clients and executives
  • Systems thinker with a deep sensitivity to brand and narrative
  • Calm, credible communicator
Job Responsibility:
  • Lead concept, design, and delivery of world-class digital products and connected experiences
  • Partner with UX, Creative Technology, and AI Engineering teams to shape AI-native interfaces, generative UI systems, and adaptive design components
  • Establish and evolve brand-led design systems that perform across screen, voice, gesture, and spatial interfaces
  • Embed accessibility and inclusivity from first principles
  • Drive experimentation across new modalities: voice and audio UX, wearable ecosystems, computer vision, and mixed reality
  • Mentor, grow, and inspire a diverse team of product designers
  • Collaborate closely with strategists and client partners to translate brand purpose into tangible, useful digital products
  • Represent AKQA in new business opportunities
  • Contribute to AKQA’s global culture of learning