Associate Site Reliability Engineer

United States, Santa Monica · Employment contract · 30.53 - 56.48 USD / Hour · Job posted June 29, 2026

Job Description

The Associate Site Reliability Engineer helps keep Marketing Technology services reliable, observable, and supportable in production. This is an entry-level to early-career SRE position grounded in the operational work the team does today: incident triage, troubleshooting infrastructure and application behavior, improving observability, and reducing toil through automation.

Job Responsibility

  • Monitor service health, respond to alerts, and participate in incident response for cloud and application environments
  • Investigate reliability issues across Kubernetes, networking, DNS, application runtime behavior, and dependent services, escalating when appropriate
  • Build and maintain dashboards, alerting, runbooks, and operational documentation that improve detection and recovery speed
  • Contribute small automations and scripts that reduce repetitive operational work and improve environment hygiene
  • Support release and deployment reliability by validating changes, helping with rollback readiness, and improving change safety
  • Participate in post-incident follow-up and help close corrective actions that prevent recurrence

Requirements

  • 1–3 years of experience in SRE, DevOps, cloud infrastructure, or software engineering internships / early-career roles
  • Familiarity with Linux, HTTP, DNS, containers, Kubernetes, Git-based workflows, and scripting in Bash, Python, or similar languages
  • Exposure to monitoring, logs, metrics, dashboards, and incident management practices
  • Strong troubleshooting mindset, clear communication, and willingness to learn in a production-support environment

What we offer

  • Medical, dental, vision, health savings account or health reimbursement account, healthcare spending accounts, dependent care spending accounts, life and AD&D insurance, disability insurance
  • 401(k) with Company match, tuition reimbursement, charitable donation matching
  • Paid holidays and vacation, paid sick time, floating holidays, compassion and bereavement leaves, parental leave
  • Mental health & wellbeing programs, fitness programs, free and discounted games, and a variety of other voluntary benefit programs like supplemental life & disability, legal service, ID protection, rental insurance, and others
  • If the Company requires that you move geographic locations for the job, then you may also be eligible for relocation assistance

Similar Jobs for Associate Site Reliability Engineer

8 matching positions

Senior Site Reliability Engineer

Our client, a leader in the HCM space is in need of a Senior Site Reliability En...
Location: United States, Reston
Salary: 67.50 - 97.50 USD / Hour
Company: ClearBridge Technology Group
Expiration Date: Until further notice
Requirements
  • 5+ years of experience supporting large-scale cloud infrastructure, automation, and DevOps, preferably in an AWS environment
  • Ability to build, maintain, and consume CI/CD pipelines and tools
  • Proficient with Terraform to automate critical infrastructure
  • Experience supporting Kubernetes-based platforms to ensure high availability
  • Active TS/SCI clearance with CI polygraph
Job Responsibility
  • Ensuring the Kubernetes-based platform is maintained and healthy, with high availability, scalability, and security
  • Automating infrastructure provisioning, configuration management, and application deployments using Terraform and Argo CD
  • Handling troubleshooting and documentation associated with the platform
  • Collaborating with multiple cross-functional teams
  • Building, maintaining, and consuming CI/CD pipelines
Employment type: Full-time

Sr Associate Site Reliability Engineering

Workday is looking for a highly skilled SRE with a focus on Open-Source database...
Location: Australia, North Sydney
Salary: Not provided
Company: Workday
Expiration Date: Until further notice
Requirements
  • 4+ years of experience managing databases for enterprise cloud applications at scale
  • 3+ years of working in MySQL and/or PostgreSQL database environments in Private and Public Cloud (AWS and/or GCP)
  • Expertise with Python/Go
  • Experience with infrastructure automation (Terraform, Ansible, etc.), CI/CD pipelines (Git, Jenkins, Argo, etc.), and configuration management tools (Ansible, Chef, etc.)
  • Experience working with private and public clouds (IaaS, AWS, etc.) and capacity management principles
  • Working knowledge of technologies like Kubernetes and Docker
  • Great teammate with excellent interpersonal skills as well as the ability to prioritize multiple tasks in a fast-paced environment
  • Available for on-call support on a rotating basis
  • BS/MS or equivalent experience in Computer Science or a related technical field
Job Responsibility
  • Ensuring all of Workday's data-related needs are met with high performance and scale, while providing the high availability that our customers expect from Workday
  • Ensuring seamless operation of thousands of production and non-production databases across multiple data centers, public clouds, and geographies
Employment type: Full-time

Senior Site Reliability Engineer

We are seeking a highly skilled and passionate Senior Site Reliability Engineer ...
Location: Spain; Portugal; United Kingdom
Salary: Not provided
Company: Parser Limited
Expiration Date: Until further notice
Requirements
  • Deep SRE Expertise: Proven experience as a Senior Site Reliability Engineer or a similar role, with a strong understanding of SRE principles (error budgets, SLOs/SLIs, toil reduction)
  • Azure Cloud Proficiency: Extensive hands-on experience designing, deploying, and operating highly available and scalable applications on Microsoft Azure
  • Azure Kubernetes Service (AKS) Expertise: Mandatory extensive hands-on experience with AKS for container orchestration, including deployment, scaling, monitoring, and troubleshooting
  • Java Ecosystem Mastery: Expert-level proficiency with Java, including experience with modern frameworks (ideally Micronaut, Spring Boot, or similar) and JVM performance tuning
  • Distributed Systems Knowledge: Solid understanding and practical experience with distributed systems, microservices architecture, and associated challenges (e.g., consistency, fault tolerance)
  • Messaging & Database Expertise: Hands-on experience with an event streaming platform (ideally Kafka) and NoSQL data storage (ideally Couchbase), including operational best practices
  • Automation First Mindset: Strong scripting skills (e.g., Python, Bash) and experience with Infrastructure as Code tools (e.g., Terraform, ARM templates) and CI/CD pipelines (e.g., Azure DevOps, Jenkins)
  • Observability Tools: Experience with monitoring, logging, and alerting tools (e.g., Azure Monitor, Prometheus, Grafana, ELK Stack, Splunk)
  • Problem-Solving Acumen: Exceptional analytical and troubleshooting skills, with a methodical approach to diagnosing and resolving complex production issues
  • Communication & Collaboration: Excellent communication skills, with the ability to articulate complex technical concepts to diverse audiences and collaborate effectively with cross-functional teams
Job Responsibility
  • Architect and Implement Reliability: Design, build, and maintain highly scalable, resilient, and performant systems on Azure, focusing on our Java, Kafka, and Couchbase stack
  • Drive Modernisation: Work hands-on as part of the team spearheading the adoption of Micronaut, standardising application templates, and transitioning to managed cloud services
  • Enhance Operational Excellence: Develop and implement strategies for improving system observability (standardised logging, metrics, tracing), alerting, and on-call practices
  • Automate Everything: Champion automation across the software development lifecycle (SDLC), from CI/CD pipelines to infrastructure provisioning, focusing on accelerating delivery and de-risking deployments
  • Incident Management & Learning: Contribute to our mature, blameless post-incident review process, identifying root causes and implementing preventative measures to reduce incident hours
  • Tooling & Standards: Develop, maintain, and drive the adoption of shared, standardised SRE tooling and best practices across engineering teams, including containerisation (e.g., Docker, Kubernetes on Azure), infrastructure as code (e.g., Terraform), and configuration management
  • Mentorship & Collaboration: Provide technical leadership and mentorship to junior engineers, fostering a culture of SRE principles and operational excellence across the wider engineering organisation
  • Strategic Input: Contribute to the overall technical strategy and roadmap for our SRE and platform initiatives, ensuring alignment with business objectives
What we offer
  • The chance to join an organization with triple-digit growth that is changing the paradigm on how software products are built
  • The opportunity to form part of an amazing, multicultural community of tech experts
  • A highly competitive compensation package
  • Medical insurance
  • English lessons
Employment type: Full-time

Site Reliability Engineer II

Under general supervision, the Site Reliability Systems Administrator II is resp...
Location: United States, Birmingham
Salary: Not provided
Company: Alliance Automotive UK LV Ltd
Expiration Date: Until further notice
Requirements
  • Typically requires a bachelor's degree and three (3) to five (5) years of related experience or an equivalent combination
  • Intermediate knowledge of appropriate networks, products, and protocols
  • Knowledge of Unix, Windows NT/2000/98, Internet Security, Oracle ERP, Distributed computing systems
  • Knowledge of job associated database/software/documentation/programming languages/monitoring and version control tools
  • Troubleshooting skills
  • Problem solving skills
  • Demonstrated knowledge and adherence to Change Management processes
  • Ability to interface well with customers, end users, partners, and associates
Job Responsibility
  • Defines, designs, and administers network systems used for data communications and recommends improvements to problems of moderate scope
  • Responsible for making sure that the company network works
  • Manages the load configuration of a central data communication processor under limited guidance and makes some recommendations for the purchase or upgrade of data networks
  • Exercises some discretion in proposing and implementing network system enhancements (software and hardware updates)
  • Serves as a point of contact for performance analysis, scalability, and service architecture/database administration issues
  • Coordinates equipment orders including terminals and cable installation, as well as upgrading, monitoring, testing, and servicing the database/systems
  • Helps to negotiate and place orders with common carriers
  • Performs other duties as assigned
What we offer
  • Options for healthcare coverage, 401(k), tuition reimbursement, vacation, sick, and holiday pay
Employment type: Full-time

Lead Site Reliability engineer

Solution, Reliability and Monitoring Entity main objective is to define, provide...
Location: India, Bangalore
Salary: Not provided
Company: Airbus
Expiration Date: Until further notice
Requirements
  • Bachelor’s or Master’s degree in Computer Science, information technology or other related discipline with 7+ years of experience
  • Solid experience designing and building secure solutions in AWS (Amazon Web Services)
  • Extensive experience in systems administration or a combination of software/systems experience
  • Some experience in scripting and automation of assets
  • Solid knowledge of Operating Systems & ability to perform troubleshooting required
  • Extensive knowledge of Cloud Technology concepts & ability to perform complex troubleshooting required
  • Solid knowledge of networking for enterprise environments required
  • Solid knowledge of Virtual Machine concepts and management of infrastructure
  • Demonstrated ability to identify root cause of issues and to recommend permanent, long-term fixes
  • Demonstrated ability to perform complex troubleshooting in an AWS environment and provide guidance to other teams
Job Responsibility
  • Define, implement, and manage cloud-based infrastructure
  • Work closely with the Software Factory’s (SWF) Solution Architects to facilitate the transition from Development to In-Support phase
  • Creating/animating a hosting network with SWF
  • Representing the Hosting Group in the different trains
  • Coordinating with Solution Architects (SAs) to support the technical architecture decisions related to Hosting
  • Supporting SWF for new components onboarding
  • Coordinate with SWF Systems & Architecture team for future planning
  • Contribute to Prioritization Reviews for the different trains
  • Guide products in Service Level Objectives (SLO) definitions & monitoring based on Hosting Operations feedback
  • Define, share, and broadcast Guidelines and Non-Functional Requirements (NFR) related to hosting, deployment, and monitoring
Employment type: Full-time

Site Reliability Engineer

The Site Reliability Engineer (SRE) for Azure xDPU Storage Team – Hardware Enabl...
Location: United States, Redmond
Salary: 84200.00 - 165200.00 USD / Year
Company: Microsoft Corporation
Expiration Date: Until further notice
Requirements
  • Associate's Degree in Computer Science, Information Technology, or related field OR Bachelor's Degree in Computer Science, Information Technology, or related field OR equivalent experience
  • Ability to meet Microsoft, customer and/or government security screening requirements
  • Microsoft Cloud Background Check
  • Bachelor's Degree in Computer Science, Electrical Engineering, Computer Engineering, or related field AND 2+ years technical experience in software engineering, network engineering, or systems administration OR equivalent experience
  • Experience operating large-scale, distributed systems in a lab/validation environment
  • Experience working close to hardware, including networking, storage, or accelerator technologies such as SmartNICs, DPUs, or offload engines
  • Proficiency in one or more programming or scripting languages (C++, C#, Python, Go, or PowerShell), with experience reading lower-level system code
  • Hands-on experience with Microsoft and Azure lab infrastructure and live-site operations
  • Demonstrated understanding of networking, operating systems, and performance characteristics of I/O-intensive distributed systems
Job Responsibility
  • Own end-to-end reliability for Azure Storage hardware running in on-prem lab environments
  • Partner with silicon, firmware, BIOS, networking, and OS teams to enable and validate DPU hardware for specific storage use cases
  • Define, measure, and improve Service Level Objectives (SLOs) and Service Level Indicators (SLIs) for DPU-accelerated storage scenarios within our lab and pre-prod environments
  • Lead live-site incident response and mitigation for hardware-, firmware-, or DPU-related issues, including deep root-cause analysis across hardware/software boundaries within our lab and pre-prod environments
  • Build automation for provisioning, configuration, validation, canarying, rollback, patching, and recovery of DPU-enabled Azure Storage systems within our lab and pre-prod environments
  • Develop reliability validation strategies, including stress, fault-injection, and chaos testing for DPU hardware enablement and management
  • Create and maintain operational runbooks, diagnostics, telemetry, and health models specific to Fungible DPU platforms within our lab and pre-prod environments
  • Drive improvements in observability and alerting by extending Azure Monitor and internal systems with DPU- and hardware-level signals
Employment type: Full-time

Site Reliability Engineer II

Site Reliability Engineer II - (Microsoft 365 Enterprise + Cloud). We are lookin...
Location: Ireland, Dublin
Salary: Not provided
Company: Microsoft Corporation
Expiration Date: Until further notice
Requirements
  • Bachelor's Degree in Computer Science or related technical field AND technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
  • Mid-level software development experience; automation-related experience is most valued
  • Scripting languages such as Bash, Python, and PowerShell, or compiled languages such as C and C#, are most relevant, but others are acceptable
  • Awareness of, and ability to reason about, modern software & systems architectures, including load-balancing, queueing, caching, distributed systems failure modes, microservices, and so on
  • Associated troubleshooting skills, including the ability to follow RPC (Remote Procedure Call) call-chains across arbitrary network steps
  • Consequent understanding of monitoring in distributed systems
  • Deep understanding of operating system level concepts such as processes, memory allocation, and the network stack; understanding of how applications are affected by the above, and ability to debug same
  • Experience with working in a team, including coordinating large projects, communicating well, and exercising initiative when presented with problems
  • Practical experience running large scale online systems is always an advantage
Job Responsibility
  • Researches and maintains deep knowledge of industry trends as well as advances in large-scale distributed systems and cloud technologies; identifies opportunities to create, implement, and/or optimally utilize new tools, technologies, and/or processes to solve ambiguous problems and improve product availability, reliability, efficiency, observability, and/or performance
  • Drives the adoption of innovative solutions across engineering teams working with related products within an organization
  • Apply advanced statistical and machine learning techniques to analyze large datasets and extract meaningful insights
  • Experience working with all service aspects of high throughput and multi-tenant services, ability to understand and design workflows carefully, properly handle errors, write clean and well-factored code with good tests and good maintainability
  • Engages with product engineering teams by partaking in code/design reviews, participating in on-call rotations and incident responses throughout product development and operations cycles; leverages end-to-end technical expertise on underlying systems/platforms and insights from engagements with product engineering teams and telemetry analyses to propose scalable improvements in code and designs with attention to customer/business objectives and incident prevention
  • Develops code, scripts, systems, or platforms that automate moderately complex but repetitive operations processes (e.g., monitoring, alerting, deploying products and updates, debugging) at scale; reviews existing automation code and scripts to evaluate reusability, extendibility, and scalability within an organization
  • Analyzes data from telemetry pipelines and monitoring tools that detail operations metrics (e.g., availability, reliability, performance, efficiency) of systems, platforms, or products operating at scale
Employment type: Full-time

Site Reliability Engineer

The Site Reliability Engineer (SRE) at NTT DATA is a critical role focused on en...
Location: Belgium, Diegem
Salary: Not provided
Company: NTT DATA
Expiration Date: Until further notice
Requirements
  • Bachelor's degree or equivalent in Computer Science, Information Technology, or a related field
  • Seasoned hands-on experience in a Site Reliability Engineering role or related roles, including experience in designing and maintaining highly available and scalable systems
  • Seasoned hands-on experience with Linux/Unix systems, networking, and system administration
  • In-depth knowledge of cloud platforms (such as AWS, Azure, or Google Cloud) and associated services
  • Seasoned proficiency in multiple programming languages like Python, Java, Go, or Ruby
  • Seasoned understanding of complex infrastructure architectures, including scalable and fault-tolerant designs
  • Experience with infrastructure-as-code tools (such as Terraform or CloudFormation) and containerization technologies (such as Docker or Kubernetes)
  • Seasoned experience in designing and implementing robust automation frameworks, CI/CD pipelines, and deployment strategies
  • Seasoned experience in incident management, troubleshooting complex system issues, and conducting post-incident analysis
  • Seasoned understanding of DevOps principles, Agile methodologies, and a strong commitment to continuous improvement and learning
Job Responsibility
  • Monitors system health, performance metrics, and alerts to identify and respond to incidents promptly; diagnoses issues, troubleshoots problems, and restores services in a timely manner
  • Implements incident response processes to minimize downtime and improve system availability
  • Designs, develops, and maintains automation tools, scripts, and processes to streamline system management tasks, deployments, and configuration changes
  • Implements infrastructure-as-code principles to ensure consistency and repeatability
  • Optimizes system resources, configurations, and processes to enhance performance, scalability, and efficiency
  • Uses monitoring tools and performance testing to identify bottlenecks and implement optimizations
  • Collaborates with teams to forecast system resource needs, plan for capacity growth, and ensure adequate scalability
  • Leads incident response efforts, coordinates with cross-functional teams, and drives the resolution of system issues
  • Performs thorough post-incident analysis to identify root causes and implements preventive measures to minimize future incidents
  • Identifies opportunities for automation and drives the implementation of self-healing, monitoring, and deployment automation tools and frameworks
Employment type: Full-time