Associate Site Reliability Engineer Job at Activision (Santa Monica)

Senior Site Reliability Engineer

Our client, a leader in the HCM space, is in need of a Senior Site Reliability En...

Location

United States, Reston

Salary:

67.50 - 97.50 USD / Hour

ClearBridge Technology Group

Expiration Date

Until further notice

Requirements

5+ years of experience supporting large-scale cloud infrastructure, automation, and DevOps, preferably in an AWS environment
Ability to build, maintain, and consume CI/CD pipelines and tools
Proficient with Terraform to automate critical infrastructure
Experience supporting Kubernetes-based platforms to ensure high availability
Active TS/SCI clearance with CI polygraph

Job Responsibility

Ensuring the Kubernetes-based platform is maintained and healthy, and delivers high availability, scalability, and security
Automating infrastructure provisioning, configuration management, and application deployments using Terraform and Argo CD
Handling troubleshooting and documentation associated with the platform
Collaborating with multiple cross-functional teams
Building, maintaining, and consuming CI/CD pipelines

Fulltime

Sr Associate Site Reliability Engineering

Workday is looking for a highly skilled SRE with a focus on Open-Source database...

Location

Australia, North Sydney

Salary:

Not provided

Workday

Expiration Date

Until further notice

Requirements

4+ years of experience managing databases for enterprise cloud applications at scale
3+ years of working in MySQL and/or PostgreSQL database environments in Private and Public Cloud (AWS and/or GCP)
Expertise with Python/Go
Experience with infrastructure automation (Terraform, Ansible, etc.), CI/CD pipelines (Git, Jenkins, Argo, etc.), and configuration management tools (Ansible, Chef, etc.)
Experience working with private and public clouds (IaaS, AWS, etc.) and capacity management principles
Working knowledge of technologies like Kubernetes/Docker
Great teammate with excellent interpersonal skills as well as the ability to prioritize multiple tasks in a fast-paced environment
Available for on-call support on a rotating basis
BS/MS or equivalent experience in Computer Science or a related technical field

Job Responsibility

Ensuring all of Workday's data-related needs are met with high performance and scale, while providing the high availability that our customers expect from Workday
Ensuring seamless operation of thousands of production and non-production databases across multiple data centers, public clouds, and geographies

Fulltime

Senior Site Reliability Engineer

We are seeking a highly skilled and passionate Senior Site Reliability Engineer ...

Location

Spain; Portugal; United Kingdom

Salary:

Not provided

Parser Limited

Expiration Date

Until further notice

Requirements

Deep SRE Expertise: Proven experience as a Senior Site Reliability Engineer or a similar role, with a strong understanding of SRE principles (error budgets, SLOs/SLIs, toil reduction)
Azure Cloud Proficiency: Extensive hands-on experience designing, deploying, and operating highly available and scalable applications on Microsoft Azure
Azure Kubernetes Service (AKS) Expertise: Mandatory extensive hands-on experience with AKS for container orchestration, including deployment, scaling, monitoring, and troubleshooting
Java Ecosystem Mastery: Expert-level proficiency with Java, including experience with modern frameworks (ideally Micronaut, Spring Boot, or similar) and JVM performance tuning
Distributed Systems Knowledge: Solid understanding and practical experience with distributed systems, microservices architecture, and associated challenges (e.g., consistency, fault tolerance)
Messaging & Database Expertise: Hands-on experience with an event streaming platform (ideally Kafka) and NoSQL data storage (ideally Couchbase), including operational best practices
Automation First Mindset: Strong scripting skills (e.g., Python, Bash) and experience with Infrastructure as Code tools (e.g., Terraform, ARM templates) and CI/CD pipelines (e.g., Azure DevOps, Jenkins)
Observability Tools: Experience with monitoring, logging, and alerting tools (e.g., Azure Monitor, Prometheus, Grafana, ELK Stack, Splunk)
Problem-Solving Acumen: Exceptional analytical and troubleshooting skills, with a methodical approach to diagnosing and resolving complex production issues
Communication & Collaboration: Excellent communication skills, with the ability to articulate complex technical concepts to diverse audiences and collaborate effectively with cross-functional teams

Job Responsibility

Architect and Implement Reliability: Design, build, and maintain highly scalable, resilient, and performant systems on Azure, focusing on our Java, Kafka, and Couchbase stack
Drive Modernisation: Work hands-on as part of the team spearheading the adoption of Micronaut, standardising application templates, and transitioning to managed cloud services
Enhance Operational Excellence: Develop and implement strategies for improving system observability (standardised logging, metrics, tracing), alerting, and on-call practices
Automate Everything: Champion automation across the software development lifecycle (SDLC), from CI/CD pipelines to infrastructure provisioning, focusing on accelerating delivery and de-risking deployments
Incident Management & Learning: Contribute to our mature, blameless post-incident review process, identifying root causes and implementing preventative measures to reduce incident hours
Tooling & Standards: Develop, maintain, and drive the adoption of shared, standardised SRE tooling and best practices across engineering teams, including containerisation (e.g., Docker, Kubernetes on Azure), infrastructure as code (e.g., Terraform), and configuration management
Mentorship & Collaboration: Provide technical leadership and mentorship to junior engineers, fostering a culture of SRE principles and operational excellence across the wider engineering organisation
Strategic Input: Contribute to the overall technical strategy and roadmap for our SRE and platform initiatives, ensuring alignment with business objectives

What we offer

The chance to join an organization with triple-digit growth that is changing the paradigm on how software products are built
The opportunity to form part of an amazing, multicultural community of tech experts
A highly competitive compensation package
Medical insurance
English lessons

Fulltime

Site Reliability Engineer II

Under general supervision, the Site Reliability Systems Administrator II is resp...

Location

United States, Birmingham

Salary:

Not provided

Alliance Automotive UK LV Ltd

Expiration Date

Until further notice

Requirements

Typically requires a bachelor's degree and three (3) to five (5) years of related experience or an equivalent combination
Intermediate knowledge of appropriate networks, products, and protocols
Knowledge of Unix, Windows NT/2000/98, internet security, Oracle ERP, and distributed computing systems
Knowledge of job associated database/software/documentation/programming languages/monitoring and version control tools
Troubleshooting skills
Problem solving skills
Demonstrated knowledge and adherence to Change Management processes
Ability to interface well with customers, end users, partners, and associates

Job Responsibility

Defines, designs, and administers network systems used for data communications and recommends improvements to problems of moderate scope
Responsible for making sure that the company network works
Manages the load configuration of a central data communication processor under limited guidance and makes some recommendations for the purchase or upgrade of data networks
Exercises some discretion in proposing and implementing network system enhancements (software and hardware updates)
Serves as a point of contact for performance analysis, scalability, and service architecture/database administration issues
Coordinates equipment orders including terminals and cable installation, as well as upgrading, monitoring, testing, and servicing the database/systems
Helps to negotiate and place orders with common carriers
Performs other duties as assigned

What we offer

Options for healthcare coverage, 401(k), tuition reimbursement, and vacation, sick, and holiday pay

Fulltime

Lead Site Reliability Engineer

Solution, Reliability and Monitoring Entity main objective is to define, provide...

Location

India, Bangalore

Salary:

Not provided

Airbus

Expiration Date

Until further notice

Requirements

Bachelor’s or Master’s degree in Computer Science, information technology or other related discipline with 7+ years of experience
Solid experience designing and building secure solutions in AWS (Amazon Web Services)
Extensive experience in systems administration or a combination of software/systems experience
Some experience in scripting and automation of assets
Solid knowledge of Operating Systems & ability to perform troubleshooting required
Extensive knowledge of Cloud Technology concepts & ability to perform complex troubleshooting required
Solid knowledge of networking for enterprise environments required
Solid knowledge of Virtual Machine concepts and management of infrastructure
Demonstrated ability to identify the root cause of issues and to recommend permanent, long-term fixes
Demonstrated ability to perform complex troubleshooting in an AWS environment and provide guidance to other teams

Job Responsibility

Define, implement, and manage cloud-based infrastructure
Work closely with the Software Factory’s (SWF) Solution Architects to facilitate the transition from Development to In-Support phase
Creating/animating a hosting network with SWF
Representing Hosting Group in the different Trains
Coordinating with Solution Architects (SAs) to support the technical architecture decisions related to Hosting
Supporting SWF for new components onboarding
Coordinate with SWF Systems & Architecture team for future planning
Contribute to Prioritization Reviews for the different trains
Guide products in Service Level Objective (SLO) definitions and monitoring based on Hosting Operations feedback
Define, share, and broadcast guidelines and Non-Functional Requirements (NFRs) related to hosting, deployment, and monitoring

Fulltime

Site Reliability Engineer

The Site Reliability Engineer (SRE) for Azure xDPU Storage Team – Hardware Enabl...

Location

United States, Redmond

Salary:

84200.00 - 165200.00 USD / Year

Microsoft Corporation

Expiration Date

Until further notice

Requirements

Associate's Degree in Computer Science, Information Technology, or related field OR Bachelor's Degree in Computer Science, Information Technology, or related field OR equivalent experience
Ability to meet Microsoft, customer and/or government security screening requirements
Microsoft Cloud Background Check
Bachelor's Degree in Computer Science, Electrical Engineering, Computer Engineering, or related field AND 2+ years technical experience in software engineering, network engineering, or systems administration OR equivalent experience
Experience operating large-scale, distributed systems in a lab/validation
Experience working close to hardware, including networking, storage, or accelerator technologies such as SmartNICs, DPUs, or offload engines
Proficiency in one or more programming or scripting languages (C++, C#, Python, Go, or PowerShell), with experience reading lower-level system code
Hands-on experience with Microsoft and Azure lab infrastructure and live-site operations
Demonstrated understanding of networking, operating systems, and performance characteristics of I/O-intensive distributed systems

Job Responsibility

Own end-to-end reliability for Azure Storage hardware running in on-prem lab environments
Partner with silicon, firmware, BIOS, networking, and OS teams to enable and validate DPU hardware for specific storage use cases
Define, measure, and improve Service Level Objectives (SLOs) and Service Level Indicators (SLIs) for DPU-accelerated storage scenarios within our lab and pre-prod environments
Lead live-site incident response and mitigation for hardware-, firmware-, or DPU-related issues, including deep root-cause analysis across hardware/software boundaries within our lab and pre-prod environments
Build automation for provisioning, configuration, validation, canarying, rollback, patching, and recovery of DPU-enabled Azure Storage systems within our lab and pre-prod environments
Develop reliability validation strategies, including stress, fault-injection, and chaos testing for DPU hardware enablement and management
Create and maintain operational runbooks, diagnostics, telemetry, and health models specific to Fungible DPU platforms within our lab and pre-prod environments
Drive improvements in observability and alerting by extending Azure Monitor and internal systems with DPU- and hardware-level signals

Fulltime

Site Reliability Engineer II

Site Reliability Engineer II - (Microsoft 365 Enterprise + Cloud). We are lookin...

Location

Ireland, Dublin

Salary:

Not provided

Microsoft Corporation

Expiration Date

Until further notice

Requirements

Bachelor's Degree in Computer Science or related technical field AND technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
Mid-level years of software development: automation-related experience is most valued
Scripting languages such as Bash, Python, and PowerShell, or compiled languages such as C or C#, are most relevant, but others are acceptable
Awareness of, and ability to reason about, modern software & systems architectures, including load-balancing, queueing, caching, distributed systems failure modes, microservices, and so on
Associated troubleshooting skills, including the ability to follow RPC (Remote Procedure Call) call-chains across arbitrary network steps
Consequent understanding of monitoring in distributed systems
Deep understanding of operating-system-level concepts such as processes, memory allocation, and the network stack; understanding of how applications are affected by the above, and the ability to debug them
Experience with working in a team, including coordinating large projects, communicating well, and exercising initiative when presented with problems
Practical experience running large scale online systems is always an advantage

Job Responsibility

Researches and maintains deep knowledge of industry trends as well as advances in large-scale distributed systems and cloud technologies; identifies opportunities to create, implement, and/or optimally utilize new tools, technologies, and/or processes to solve ambiguous problems and improve product availability, reliability, efficiency, observability, and/or performance
Drives the adoption of innovative solutions across engineering teams working with related products within an organization
Apply advanced statistical and machine learning techniques to analyze large datasets and extract meaningful insights
Experience working with all service aspects of high throughput and multi-tenant services, ability to understand and design workflows carefully, properly handle errors, write clean and well-factored code with good tests and good maintainability
Engages with product engineering teams by partaking in code/design reviews and participating in on-call rotations and incident responses throughout product development and operations cycles; leverages end-to-end technical expertise on underlying systems/platforms and insights from engagements with product engineering teams and telemetry analyses to propose scalable improvements in code and designs with attention to customer/business objectives and incident prevention
Develops code, scripts, systems, or platforms that automate moderately complex but repetitive operations processes (e.g., monitoring, alerting, deploying products and updates, debugging) at scale; reviews existing automation code and scripts to evaluate reusability, extensibility, and scalability within an organization
Analyzes data from telemetry pipelines and monitoring tools that detail operations metrics (e.g., availability, reliability, performance, efficiency) of systems, platforms, or products operating at scale

Fulltime

Site Reliability Engineer

The Site Reliability Engineer (SRE) at NTT DATA is a critical role focused on en...

Location

Belgium, Diegem

Salary:

Not provided

NTT DATA

Expiration Date

Until further notice

Requirements

Bachelor's degree or equivalent in Computer Science, Information Technology, or a related field
Seasoned hands-on experience in a Site Reliability Engineering role or related roles, including experience in designing and maintaining highly available and scalable systems
Seasoned hands-on experience with Linux/Unix systems, networking, and system administration
In-depth knowledge of cloud platforms (such as AWS, Azure, or Google Cloud) and associated services
Seasoned proficiency in multiple programming languages like Python, Java, Go, or Ruby
Seasoned understanding of complex infrastructure architectures, including scalable and fault-tolerant designs
Experience with infrastructure-as-code tools (such as Terraform or CloudFormation) and containerization technologies (such as Docker or Kubernetes)
Seasoned experience in designing and implementing robust automation frameworks, CI/CD pipelines, and deployment strategies
Seasoned experience in incident management, troubleshooting complex system issues, and conducting post-incident analysis
Seasoned understanding of DevOps principles, Agile methodologies, and a strong commitment to continuous improvement and learning

Job Responsibility

Monitors system health, performance metrics, and alerts to identify and respond to incidents promptly; diagnoses issues, troubleshoots problems, and restores services in a timely manner
Implements incident response processes to minimize downtime and improve system availability
Designs, develops, and maintains automation tools, scripts, and processes to streamline system management tasks, deployments, and configuration changes
Implements infrastructure-as-code principles to ensure consistency and repeatability
Optimizes system resources, configurations, and processes to enhance performance, scalability, and efficiency
Uses monitoring tools and performance testing to identify bottlenecks and implement optimizations
Collaborates with teams to forecast system resource needs, plans for capacity growth, and ensures adequate scalability
Leads incident response efforts, coordinates with cross-functional teams, and drives the resolution of system issues
Performs thorough post-incident analysis to identify root causes and implements preventive measures to minimize future incidents
Identifies opportunities for automation and drives the implementation of self-healing, monitoring, and deployment automation tools and frameworks

Fulltime
