Reliability Engineer Intern

China, Shenzhen · Job Posted December 25, 2025

Job Description

Amazon develops innovative consumer-centric product solutions. As a reliability engineer intern, you will be part of an exciting team and assist in developing, testing, and delivering products. You will assist a senior reliability engineer in developing and implementing methodologies and techniques to enhance product reliability. You will work closely with cross-functional teams located in both the US and China throughout the design cycle to help drive and execute product qualification. The selected candidate will be responsive, flexible, and able to succeed within an open, collaborative peer environment.

Job Responsibility

  • Participate in creating reliability test plans including resource allocations, validation schedule assumptions, and validation items scope.
  • Participate in implementing specific validation items in reliability test plans according to schedule.
  • Work closely with a senior reliability engineer in reporting reliability execution progress and issues.
  • Participate in evaluating and developing reliability test methodologies to reduce test time and increase test coverage.
  • Travel domestically to supplier sites as projects require.

Requirements

  • Master's degree or above in mechanical engineering, electrical engineering, materials science, physics, or equivalent
  • Speak, write, and read fluently in Mandarin
  • Able to work 5 days per week during the summer holiday for at least 3 months
  • Willing to work in Shenzhen

Nice to have

  • Understanding of the principles and basic structures of measuring instruments such as HALT, chambers, oscilloscopes, and multimeters
  • Clear oral and written communication skills (Chinese and English)
  • Strong organizational and problem-solving skills
  • Demonstrated critical thinking capability
  • Self-motivated and proactive
  • Past experience or coursework in reliability and statistics, and the ability to use reliability and statistical methods for data analysis
  • Past experience with failure analysis techniques to isolate failures found during testing

Similar Jobs for Reliability Engineer Intern

8 matching positions

Artificial intelligence-assisted Reliability Engineer Intern

Amazon development center develops innovative consumer-centric safety total solu...
Location: Taiwan, Taipei
Salary: Not provided
Company: Amazon Pforzheim GmbH
Expiration Date: Until further notice
Requirements
  • Enrolled in or have completed a Master's degree or above in engineering or equivalent
  • Speak, write, and read fluently in Mandarin
  • Currently enrolled in a master’s program in Computer Science, Electrical Engineering, Chemical Engineering, Mechanical Engineering, or related field with focus on deep learning and computer vision
  • Programming experience in C++ and Python
  • Experience in implementing computer vision algorithms using multiple toolkits
  • Proficient in both English and Chinese
Job Responsibility
  • Develop and implement novel machine learning algorithms, focusing on computer vision and generative AI applications
  • Design and optimize scalable AI systems for large-scale datasets, ensuring high performance and accuracy
  • Apply state-of-the-art Machine Learning and AI research to solve complex reliability challenges and product lifespan projection
  • Create clear technical documentation and reports to effectively communicate concepts and results
  • Work closely with senior reliability engineers in reporting reliability execution progress and failure analysis
  • Participate in evaluating and developing reliability test methodologies to reduce test time and increase test coverage
Employment Type: Part-time

Staff Engineer, Site Reliability Engineer

OnStar is a cornerstone of General Motors' connected services—bringing safety, s...
Location: Ireland, Dublin
Salary: Not provided
Company: General Motors
Expiration Date: Until further notice
Requirements
  • 8+ years in SRE, DevOps, or systems engineering, including experience managing or mentoring high-impact teams
  • Track record of building and maintaining high-scale, cloud-native systems (preferably AWS, GCP, or Azure)
  • Expertise in container orchestration and deployment strategies using Kubernetes and CI/CD pipelines
  • Proficiency in Python, Go, or Java, with strong code review and readability standards
  • Experience leading cross-functional infrastructure projects, configuration strategy, or organizational tooling initiatives
  • Ability to think and act under pressure
  • Strong communication skills
Job Responsibility
  • Lead the design and implementation of scalable, fault-tolerant, and observable infrastructure supporting OnStar mobile and web experiences, in-vehicle services, and the backend platforms and integrations that power them
  • Champion configuration management, infrastructure refactoring, and testing frameworks to strengthen system resilience
  • Partner across SRE, development, and product teams to improve service reliability, deployment safety, and incident response practices
  • Drive internal consultation and strategic planning on reliability standards for new OnStar capabilities, customer-facing releases, and platform initiatives
  • Define and evolve observability strategy using tools such as Prometheus, Grafana, and Datadog, with automated alerting and actionable SLO dashboards
  • Own and improve on-call practices, manage blameless postmortems, and guide root cause analysis to eliminate recurring failures
  • Mentor engineers and help shape a high-performance culture rooted in extreme ownership and operational excellence
  • Support compliance and privacy-driven engineering initiatives across connected services, with potential crossover into areas like data retention and safety certification tooling
Employment Type: Full-time

Senior Site Reliability Engineer - Fleet Reliability

Lambda, The Superintelligence Cloud, is a leader in AI cloud infrastructure serv...
Location: United States, San Francisco
Salary: 230000.00 - 345000.00 USD / Year
Company: Lambda
Expiration Date: Until further notice
Requirements
  • 7+ years of experience in Site Reliability Engineering, DevOps, or a similar role
  • Strong understanding of modern AI infrastructure, from GPU architectures to hardware performance optimization
  • Strong understanding of Linux-based systems in a distributed environment
  • Solid understanding of Python and Go, with experience working with SWE teams to improve internal tooling
  • Experience with monitoring and alerting tools (e.g., Prometheus, Grafana, SumoLogic)
  • Proficiency in automation and configuration management tools (e.g., Ansible, Terraform)
  • Familiarity with cloud platforms (e.g., OCI, AWS, GCP, Azure)
  • Excellent problem-solving and troubleshooting skills
  • Strong communication and collaboration skills
  • Passion for continuous improvement and innovation
Job Responsibility
  • Define Fleet Health metrics and indicators to objectively measure and improve system availability
  • Collaborate with the observability team on comprehensive monitoring and alerting systems to proactively predict, detect and respond to issues or anomalies
  • Create runbooks and automated remediations for common failure scenarios
  • Build in automation and auditing to ensure compliance and improve efficiency and productivity
  • Participate in on-call rotations and provide support for incident response and resolution
  • Implement and integrate logging and metrics across platforms such as Datadog, Prometheus, OpenTelemetry, Grafana, and SumoLogic
What we offer
  • Generous cash & equity compensation
  • Health, dental, and vision coverage for you and your dependents
  • Wellness and commuter stipends for select roles
  • 401k Plan with 2% company match (USA employees)
  • Flexible paid time off plan
Employment Type: Full-time

Intermediate Site Reliability Engineer SRE – AI Reliability & Automation

At PointClickCare our mission is simple: to help providers deliver exceptional c...
Location: Canada, Mississauga
Salary: 115000.00 - 128000.00 CAD / Year
Company: PointClickCare
Expiration Date: Until further notice
Requirements
  • 5+ years' experience in software engineering
  • Experience with SRE principles
  • Experience with AI/ML in production environments
  • A passion for automation, intelligent systems, and operational excellence
  • Strong debugging, problem-solving, and system design skills
  • Languages: Python, Java, Bash, Terraform
  • Platforms: Azure, Kubernetes, Docker
  • Tools: Datadog, Prometheus, AppDynamics, ELK, GitHub Actions
  • ML/AI: MCP framework, AI agents, Vector store, Agent orchestration (LangChain), RAG
  • CI/CD: Jenkins, ArgoCD, Spinnaker
Job Responsibility
  • Build ML-based anomaly detection and pattern recognition systems
  • Enhance telemetry with smart tagging and metadata for better AI insights
  • Develop event-driven workflows and self-healing systems using AI triggers
  • Automate incident response with generative AI and custom AI agent orchestration
  • Use time-series forecasting and predictive modelling to anticipate failures
  • Optimise infrastructure with AI-powered autoscaling and cost-aware resource allocation
  • Build scalable, fault-tolerant systems in a cloud-native environment
  • Participate in on-call rotations and lead incident response for critical systems
  • Skilled in API integration for streamlined data exchange and system connectivity
  • Run internal AIOps workshops and help teams adopt AI maturity models
What we offer
  • Benefits starting from Day 1!
  • Retirement Plan Matching
  • Flexible Paid Time Off
  • Wellness Support Programs and Resources
  • Parental & Caregiver Leaves
  • Fertility & Adoption Support
  • Continuous Development Support Program
  • Employee Assistance Program
  • Allyship and Inclusion Communities
  • Employee Recognition … and more!
Employment Type: Full-time

Site Reliability Engineer

We are looking for an experienced Site Reliability Engineer to strengthen the re...
Location: United States, San Francisco
Salary: Not provided
Company: Robert Half
Expiration Date: Until further notice
Requirements
  • 7+ years of experience in Site Reliability Engineering, infrastructure engineering, or a closely related production environment role
  • Strong background operating, supporting, and troubleshooting distributed systems at scale
  • Hands-on experience with observability platforms such as Datadog, Grafana, OpenTelemetry, CloudWatch, or similar tools
  • Proven involvement in on-call operations, incident management, and reliability-focused problem resolution
  • Practical experience defining and using SLIs, SLOs, and error budgets to guide service reliability decisions
  • Familiarity with AWS environments, including serverless and container-based architectures
  • Experience working with relational databases such as Postgres and performance analysis in production systems
  • Ability to write automation scripts or lightweight tooling in languages such as Python or Bash, with strong judgment around failure modes and system design
Job Responsibility
  • Establish measurable reliability standards for critical services by creating and maintaining service indicators, objectives, and error budget practices
  • Take ownership of production stability by monitoring uptime, latency, and availability, and driving improvements that reduce operational risk
  • Lead live incident response efforts, coordinate troubleshooting during outages, and ensure issues are resolved efficiently and thoroughly
  • Run blameless post-incident reviews, document findings clearly, and track corrective actions through completion
  • Design and enhance observability across logs, metrics, and distributed tracing using tools such as Datadog, CloudWatch, Grafana, OpenTelemetry, and Sentry
  • Improve alert quality and dashboard design so engineering teams can quickly identify meaningful system issues without unnecessary noise
  • Evaluate system behavior under load, uncover performance constraints, and recommend changes that improve scalability and resource efficiency
  • Build automation and internal tooling that streamline operational work, strengthen deployment safety, and support incident management, debugging, and capacity planning
  • Contribute to infrastructure and delivery workflows across AWS, Terraform, Ansible, Linux, and GitHub Actions with a focus on dependable releases and resilient systems
  • Partner with security and compliance stakeholders to support operational standards, audit readiness, and the integration of monitoring into broader engineering practices
What we offer
  • Medical, vision, dental, and life and disability insurance
  • Enrollment in company 401(k) plan

Site Reliability Engineer

We’re hiring an SRE to help improve the availability, performance, scalability, ...
Location: Israel, Netanya/Tel Aviv
Salary: Not provided
Company: JFrog
Expiration Date: Until further notice
Requirements
  • 2-4 years of experience in SRE, Production Engineering, DevOps, or a similar role with hands-on production exposure
  • Strong troubleshooting and analytical skills, with the ability to investigate production issues in a structured and methodical way
  • Hands-on experience with Kubernetes-based containerized workloads
  • Experience with at least one public cloud provider: AWS, GCP, or Azure
  • Experience developing backend services, internal platforms, automation, or production engineering tools using Python, Go, or another programming language
  • Practical understanding of Linux fundamentals, networking concepts, HTTP, DNS, service connectivity, and production troubleshooting
  • Familiarity with CI/CD tools such as Jenkins, ArgoCD, GitHub Actions, or similar
  • Exposure to observability tools covering metrics, logs, and traces, such as Prometheus, Grafana, Coralogix, New Relic, or similar platforms
  • Understanding of incident management processes, alerting systems, and production support workflows
  • Ability to learn quickly, take ownership, communicate clearly, and work well in a collaborative production environment
Job Responsibility
  • Support the reliability, availability, performance, and scalability of JFrog’s large-scale, multi-cloud, Kubernetes-based SaaS environments
  • Investigate and troubleshoot production issues across distributed systems, infrastructure, Kubernetes, and cloud environments in close collaboration with Engineering teams
  • Design and develop backend services, internal platforms, and production engineering tools using Python, Go, or similar technologies
  • Improve reliability, observability, and operational readiness through SRE practices, monitoring and alerting, capacity awareness, postmortems, and safer CI/CD and production change processes
  • Evaluate and contribute to AI-assisted and agentic automation solutions that improve operational efficiency, troubleshooting, and production workflows
  • Support resilience initiatives, including disaster recovery validation, service readiness, health checks, and production readiness reviews
  • Participate in on-call rotations, lead incident response when needed, and drive follow-up actions to prevent recurrence
  • Continuously learn and evaluate new technologies that can improve reliability, automation, and operational excellence
Employment Type: Full-time

Principal Site Reliability Engineer

Palo Alto Networks runs a large hybrid infrastructure and is one of the largest ...
Location: United States, Santa Clara
Salary: 151600.00 - 245300.00 USD / Year
Company: Palo Alto Networks
Expiration Date: Until further notice
Requirements
  • BS or MS in Computer Science or a related field, or equivalent professional or military experience
  • Expertise in configuration management with a framework such as Ansible, Terraform, Helm, Kubernetes
  • Proficient in Python and/or Go
  • Expertise in managing applications in the Kubernetes cluster with autoscaling enabled
  • Experience in Production Engineering, DevOps, or Site Reliability
  • Expertise in the public cloud (GCP or AWS), especially in GCP
  • Strong Linux administration, internals, and network troubleshooting
  • Proficiency with programming languages like Python, Golang, and shell scripting to automate tasks
  • Experience with CI/CD pipelines, GitLab, and GitHub preferred
  • Ability to diagnose and troubleshoot complex distributed systems handling high-volume transactions
Job Responsibility
  • Contribute to the success of SRE and DevOps
  • Develop expertise in new technologies
  • Work with developers, researchers, data scientists, and security experts
  • Design, build, and operate reliable, secure Cloud infrastructure
  • Ensure that applications are production-ready, scalable, and reliable
  • Develop tools and automation frameworks
  • Automate the reliable deployment of robust services
  • Orchestrate end-to-end monitoring and alerting
  • Participate with SRE and Dev teams in the on-call rotation
  • Lead root cause analysis of critical business and production issues
Employment Type: Full-time

Staff Site Reliability Engineer

Trimble is seeking a Staff Site Reliability Engineer (P4) to join our Corporate ...
Location: India, Chennai
Salary: Not provided
Company: Trimble Inc.
Expiration Date: Until further notice
Requirements
  • Bachelor’s Degree or equivalent in Computer Science, Engineering, Information Systems, or a related field, or equivalent practical experience
  • Minimum of 10 years of experience in IT operations, including deep knowledge of networking, computing, and storage
  • Minimum of 5 years of experience with AWS and/or Azure cloud computing environments, with at least 2 years in an architect/design role
  • Windows and Linux deployment experience, including common services for each platform
  • Proficiency in at least one scripting language (preferably Python or PowerShell/.NET) and proficiency using Git as a source control system
  • Strong background in application operations, including Incident Management, Change Management, and Capacity Management
  • Excellent troubleshooting and problem-solving skills, knowledge of security best practices, a strong desire to learn independently, and exceptional written/verbal communication skills with a customer-service mindset
Job Responsibility
  • Cloud Architecture & Enhancement: Develop new and enhance current shared public cloud services with a strict focus on Availability, Operations, Performance, Capacity, Security, and User Experience
  • Technical Leadership: Provide input and expertise relating to cloud hosting solutions (full infrastructure design and management). Transform business requirements into scalable operational designs
  • Collaboration & Planning: Attend and provide input on product planning sessions with internal development teams. Act as an expert on Business System services to communicate the value of our platform
  • Automation & Documentation: Identify and implement automation solutions. Develop and maintain critical documentation, including architecture diagrams, service descriptions, build/deploy processes, and operations runbooks
  • Mentorship & Support: Provide technical escalation and mentoring to other team members. Train operations teams to provide Level 1/2 support for shared public cloud services, acting as the ultimate Level 3 escalation point
  • Standards & Governance: Manage AWS/Azure best practice expectations and ensure alignment with corporate standards
  • Global Collaboration: Work effectively within a global team framework. Strike a balance between Indian and US time zones to attend business stakeholder meetings, address production issues, and serve as a reliable escalation point (including off-hours tasks when necessary)
Employment Type: Full-time