Reliability Engineer Intern Job at Amazon Pforzheim GmbH (Shenzhen)

Artificial intelligence-assisted Reliability Engineer Intern

Amazon development center develops innovative consumer-centric safety total solu...

Location

Taiwan, Taipei

Salary:

Not provided

Amazon Pforzheim GmbH

Expiration Date

Until further notice

Requirements

Enrolled in or have completed a Master's degree or above in engineering or equivalent
Speak, write, and read fluently in Mandarin
Currently enrolled in a master’s program in Computer Science, Electrical Engineering, Chemical Engineering, Mechanical Engineering, or related field with focus on deep learning and computer vision
Programming experience in C++ and Python
Experience in implementing computer vision algorithms using multiple toolkits
Proficient in both English and Chinese

Job Responsibility

Develop and implement novel machine learning algorithms, focusing on computer vision and generative AI applications
Design and optimize scalable AI systems for large-scale datasets, ensuring high performance and accuracy
Apply state-of-the-art Machine Learning and AI research to solve complex reliability challenges and product lifespan projection
Create clear technical documentation and reports to effectively communicate concepts and results
Work closely with senior reliability engineers on failure analysis and on reporting reliability execution progress
Participate in evaluating and developing reliability test methodologies to reduce test time and increase test coverage

Part-time

Staff Engineer, Site Reliability Engineer

OnStar is a cornerstone of General Motors' connected services—bringing safety, s...

Location

Ireland, Dublin

Salary:

Not provided

General Motors

Expiration Date

Until further notice

Requirements

8+ years in SRE, DevOps, or systems engineering, including experience managing or mentoring high-impact teams
Track record of building and maintaining high-scale, cloud-native systems (preferably AWS, GCP, or Azure)
Expertise in container orchestration and deployment strategies using Kubernetes and CI/CD pipelines
Proficiency in Python, Go, or Java, with strong code review and readability standards
Experience leading cross-functional infrastructure projects, configuration strategy, or organizational tooling initiatives
Ability to think and act under pressure
Strong communication skills

Job Responsibility

Lead the design and implementation of scalable, fault-tolerant, and observable infrastructure supporting OnStar mobile and web experiences, in-vehicle services, and the backend platforms and integrations that power them
Champion configuration management, infrastructure refactoring, and testing frameworks to strengthen system resilience
Partner across SRE, development, and product teams to improve service reliability, deployment safety, and incident response practices
Drive internal consultation and strategic planning on reliability standards for new OnStar capabilities, customer-facing releases, and platform initiatives
Define and evolve observability strategy using tools such as Prometheus, Grafana, and Datadog, with automated alerting and actionable SLO dashboards
Own and improve on-call practices, manage blameless postmortems, and guide root cause analysis to eliminate recurring failures
Mentor engineers and help shape a high-performance culture rooted in extreme ownership and operational excellence
Support compliance and privacy-driven engineering initiatives across connected services, with potential crossover into areas like data retention and safety certification tooling

Full-time

Senior Site Reliability Engineer - Fleet Reliability

Lambda, The Superintelligence Cloud, is a leader in AI cloud infrastructure serv...

Location

United States, San Francisco

Salary:

230000.00 - 345000.00 USD / Year

Lambda

Expiration Date

Until further notice

Requirements

7+ years of experience in Site Reliability Engineering, DevOps, or a similar role
Strong understanding of modern AI infrastructure, from GPU architectures to hardware performance optimization
Strong understanding of Linux-based systems in a distributed environment
Solid understanding of Python and Go, with experience working with SWE teams to improve internal tooling
Experience with monitoring and alerting tools (e.g., Prometheus, Grafana, SumoLogic)
Proficiency in automation and configuration management tools (e.g., Ansible, Terraform)
Familiarity with cloud platforms (e.g., OCI, AWS, GCP, Azure)
Excellent problem-solving and troubleshooting skills
Strong communication and collaboration skills
Passion for continuous improvement and innovation

Job Responsibility

Define Fleet Health metrics and indicators to objectively measure and improve system availability
Collaborate with the observability team on comprehensive monitoring and alerting systems to proactively predict, detect and respond to issues or anomalies
Create runbooks and automated remediations for common failure scenarios
Build in automation and auditing to ensure compliance and improve efficiency and productivity
Participate in on-call rotations and provide support for incident response and resolution
Implement and integrate logging and metrics across platforms such as Datadog, Prometheus, OpenTelemetry, Grafana, and SumoLogic

What we offer

Generous cash & equity compensation
Health, dental, and vision coverage for you and your dependents
Wellness and commuter stipends for select roles
401(k) plan with 2% company match (USA employees)
Flexible paid time off plan

Full-time

Intermediate Site Reliability Engineer (SRE) – AI Reliability & Automation

At PointClickCare our mission is simple: to help providers deliver exceptional c...

Location

Canada, Mississauga

Salary:

115000.00 - 128000.00 CAD / Year

PointClickCare

Expiration Date

Until further notice

Requirements

5+ years' experience in software engineering
Experience with SRE principles
Experience with AI/ML in production environments
A passion for automation, intelligent systems, and operational excellence
Strong debugging, problem-solving, and system design skills
Languages: Python, Java, Bash, Terraform
Platforms: Azure, Kubernetes, Docker
Tools: Datadog, Prometheus, AppDynamics, ELK, GitHub Actions
ML/AI: MCP framework, AI agents, Vector store, Agent orchestration (LangChain), RAG
CI/CD: Jenkins, ArgoCD, Spinnaker

Job Responsibility

Build ML-based anomaly detection and pattern recognition systems
Enhance telemetry with smart tagging and metadata for better AI insights
Develop event-driven workflows and self-healing systems using AI triggers
Automate incident response with generative AI and custom AI agent orchestration
Use time-series forecasting and predictive modelling to anticipate failures
Optimise infrastructure with AI-powered autoscaling and cost-aware resource allocation
Build scalable, fault-tolerant systems in a cloud-native environment
Participate in on-call rotations and lead incident response for critical systems
Integrate APIs for streamlined data exchange and system connectivity
Run internal AIOps workshops and help teams adopt AI maturity models

What we offer

Benefits starting from Day 1!
Retirement Plan Matching
Flexible Paid Time Off
Wellness Support Programs and Resources
Parental & Caregiver Leaves
Fertility & Adoption Support
Continuous Development Support Program
Employee Assistance Program
Allyship and Inclusion Communities
Employee Recognition … and more!

Full-time

Site Reliability Engineer

We are looking for an experienced Site Reliability Engineer to strengthen the re...

Location

United States, San Francisco

Salary:

Not provided

Robert Half

Expiration Date

Until further notice

Requirements

7+ years of experience in Site Reliability Engineering, infrastructure engineering, or a closely related production environment role
Strong background operating, supporting, and troubleshooting distributed systems at scale
Hands-on experience with observability platforms such as Datadog, Grafana, OpenTelemetry, CloudWatch, or similar tools
Proven involvement in on-call operations, incident management, and reliability-focused problem resolution
Practical experience defining and using SLIs, SLOs, and error budgets to guide service reliability decisions
Familiarity with AWS environments, including serverless and container-based architectures
Experience working with relational databases such as Postgres and performance analysis in production systems
Ability to write automation scripts or lightweight tooling in languages such as Python or Bash, with strong judgment around failure modes and system design

Job Responsibility

Establish measurable reliability standards for critical services by creating and maintaining service indicators, objectives, and error budget practices
Take ownership of production stability by monitoring uptime, latency, and availability, and driving improvements that reduce operational risk
Lead live incident response efforts, coordinate troubleshooting during outages, and ensure issues are resolved efficiently and thoroughly
Run blameless post-incident reviews, document findings clearly, and track corrective actions through completion
Design and enhance observability across logs, metrics, and distributed tracing using tools such as Datadog, CloudWatch, Grafana, OpenTelemetry, and Sentry
Improve alert quality and dashboard design so engineering teams can quickly identify meaningful system issues without unnecessary noise
Evaluate system behavior under load, uncover performance constraints, and recommend changes that improve scalability and resource efficiency
Build automation and internal tooling that streamline operational work, strengthen deployment safety, and support incident management, debugging, and capacity planning
Contribute to infrastructure and delivery workflows across AWS, Terraform, Ansible, Linux, and GitHub Actions with a focus on dependable releases and resilient systems
Partner with security and compliance stakeholders to support operational standards, audit readiness, and the integration of monitoring into broader engineering practices

What we offer

Medical, vision, dental, and life and disability insurance
Enrollment in company 401(k) plan

Site Reliability Engineer

We’re hiring an SRE to help improve the availability, performance, scalability, ...

Location

Israel, Netanya/Tel Aviv

Salary:

Not provided

JFrog

Expiration Date

Until further notice

Requirements

2-4 years of experience in SRE, Production Engineering, DevOps, or a similar role with hands-on production exposure
Strong troubleshooting and analytical skills, with the ability to investigate production issues in a structured and methodical way
Hands-on experience with Kubernetes-based containerized workloads
Experience with at least one public cloud provider: AWS, GCP, or Azure
Experience developing backend services, internal platforms, automation, or production engineering tools using Python, Go, or another programming language
Practical understanding of Linux fundamentals, networking concepts, HTTP, DNS, service connectivity, and production troubleshooting
Familiarity with CI/CD tools such as Jenkins, ArgoCD, GitHub Actions, or similar
Exposure to observability tools covering metrics, logs, and traces, such as Prometheus, Grafana, Coralogix, New Relic, or similar platforms
Understanding of incident management processes, alerting systems, and production support workflows
Ability to learn quickly, take ownership, communicate clearly, and work well in a collaborative production environment

Job Responsibility

Support the reliability, availability, performance, and scalability of JFrog’s large-scale, multi-cloud, Kubernetes-based SaaS environments
Investigate and troubleshoot production issues across distributed systems, infrastructure, Kubernetes, and cloud environments in close collaboration with Engineering teams
Design and develop backend services, internal platforms, and production engineering tools using Python, Go, or similar technologies
Improve reliability, observability, and operational readiness through SRE practices, monitoring and alerting, capacity awareness, postmortems, and safer CI/CD and production change processes
Evaluate and contribute to AI-assisted and agentic automation solutions that improve operational efficiency, troubleshooting, and production workflows
Support resilience initiatives, including disaster recovery validation, service readiness, health checks, and production readiness reviews
Participate in on-call rotations, lead incident response when needed, and drive follow-up actions to prevent recurrence
Continuously learn and evaluate new technologies that can improve reliability, automation, and operational excellence

Full-time

Principal Site Reliability Engineer

Palo Alto Networks runs a large hybrid infrastructure and is one of the largest ...

Location

United States, Santa Clara

Salary:

151600.00 - 245300.00 USD / Year

Palo Alto Networks

Expiration Date

Until further notice

Requirements

BS or MS in Computer Science or a related field, or equivalent professional or military experience
Expertise in configuration management with a framework such as Ansible, Terraform, Helm, Kubernetes
Proficient in Python and/or Go
Expertise in managing applications in the Kubernetes cluster with autoscaling enabled
Experience in Production Engineering, DevOps, or Site Reliability
Expertise in the public cloud (GCP or AWS), especially in GCP
Strong Linux administration, internals, and network troubleshooting
Proficiency with programming languages like Python, Golang, and shell scripting to automate tasks
Experience with CI/CD pipelines, GitLab, and GitHub preferred
Ability to diagnose and troubleshoot complex distributed systems handling high-volume transactions

Job Responsibility

Contribute to the success of SRE and DevOps
Develop expertise in new technologies
Work with developers, researchers, data scientists, and security experts
Design, build, and operate reliable, secure Cloud infrastructure
Ensure that applications are production-ready, scalable, and reliable
Develop tools and automation frameworks
Automate the deployment of robust services
Orchestrate end-to-end monitoring and alerting
Participate with SRE and Dev teams in the on-call rotation
Lead root cause analysis of critical business and production issues

Full-time

Staff Site Reliability Engineer

Trimble is seeking a Staff Site Reliability Engineer (P4) to join our Corporate ...

Location

India, Chennai

Salary:

Not provided

Trimble Inc.

Expiration Date

Until further notice

Requirements

Bachelor’s degree or equivalent in Computer Science, Engineering, Information Systems, or a related field, or equivalent practical experience
Minimum of 10 years of experience in IT operations, including deep knowledge of networking, computing, and storage
Minimum of 5 years of experience with AWS and/or Azure cloud computing environments, with at least 2 years in an architect/design role
Windows and Linux deployment experience, including common services for each platform
Proficiency in at least one scripting language (preferably Python or PowerShell/.NET) and in using Git for source control
Strong background in application operations, including Incident Management, Change Management, and Capacity Management
Excellent troubleshooting and problem-solving skills, knowledge of security best practices, a strong desire to learn independently, and exceptional written/verbal communication skills with a customer-service mindset

Job Responsibility

Cloud Architecture & Enhancement: Develop new and enhance current shared public cloud services with a strict focus on Availability, Operations, Performance, Capacity, Security, and User Experience
Technical Leadership: Provide input and expertise relating to cloud hosting solutions (full infrastructure design and management). Transform business requirements into scalable operational designs
Collaboration & Planning: Attend and provide input on product planning sessions with internal development teams. Act as an expert on Business System services to communicate the value of our platform
Automation & Documentation: Identify and implement automation solutions. Develop and maintain critical documentation, including architecture diagrams, service descriptions, build/deploy processes, and operations runbooks
Mentorship & Support: Provide technical escalation and mentoring to other team members. Train operations teams to provide Level 1/2 support for shared public cloud services, acting as the ultimate Level 3 escalation point
Standards & Governance: Manage AWS/Azure best practice expectations and ensure alignment with corporate standards
Global Collaboration: Work effectively within a global team framework. Strike a balance between Indian and US time zones to attend business stakeholder meetings, address production issues, and serve as a reliable escalation point (including off-hours tasks when necessary)

Full-time
