Junior Site Reliability Engineer Job at accesso

Junior Site Reliability Engineer

We are looking for an early-career Site Reliability Engineer to join our global ...

Location

Finland, Helsinki

Salary:

Not provided

Aiven Deutschland GmbH

Expiration Date

Until further notice

Requirements

Ability to Code: basic programming skills, with a preference for Python
Linux Fundamentals: comfortable working in a terminal, with a grasp of Linux systems administration and networking
Analytical Problem Solving: enjoy the detective work of debugging
AI Curiosity: interested in how AI is changing the infrastructure landscape
Operational Mindset: ready to contribute to an on-call rotation

Job Responsibility

Handle essential operational duties, including stakeholder-driven tasks like managing account lifecycles and service adjustments
Improve our observability framework and automate manual toil to create a self-healing, highly visible production environment
Participate in our on-call rotation to maintain platform health

What we offer

Participate in Aiven’s equity plan
Balance work and life with our hybrid work policy
Choose the equipment you need to set yourself up for success
Use your Professional Development Plan budget for learning opportunities
Receive holistic wellbeing support through our global Employee Assistance Program
Inquire about our Global Time Off Commitment (Parental and Sick Leave, as well as Personal Time)
Enjoy country-specific benefits for our global cast

Full-time

Site Reliability Engineer

We are looking for a Lead Site Reliability Engineer (SRE) with strong experience...

Location

India, Bangalore

Salary:

Not provided

Karix

Expiration Date

Until further notice

Requirements

5+ years of experience in SRE / DevOps / Production Engineering roles
Strong expertise in troubleshooting distributed systems and microservices architecture
Hands-on experience with Kafka, RabbitMQ, and Redis
Strong knowledge of Kubernetes and container orchestration
Experience with CI/CD pipelines and deployment automation
Solid understanding of Linux, networking, and cloud platforms (AWS / Azure / GCP)
Experience with Infrastructure as Code (Terraform, Ansible)
Strong scripting skills (Python, Bash, or similar)
Database experience: MySQL / Oracle / MongoDB
Strong problem-solving, ownership mindset, and ability to lead initiatives

Job Responsibility

Lead troubleshooting and resolution of complex production issues in distributed systems
Drive reliability engineering practices, ensuring high availability and performance of systems
Manage and optimize messaging systems like Apache Kafka, RabbitMQ, and Redis
Architect, manage, and optimize Kubernetes clusters for scalability and resilience
Manage CI/CD pipelines and drive deployment automation
Implement and maintain monitoring, alerting, and observability using Prometheus, Grafana, and ELK stack
Lead incident management, root cause analysis (RCA), and post-mortem reviews
Mentor junior engineers and collaborate with cross-functional teams to improve system design and reliability

What we offer

Impactful Work: Play a key role in ensuring reliability and scalability of platforms that handle large-scale, real-time communication systems
Tremendous Growth Opportunities: Accelerate your career by leading critical reliability initiatives and working on high-scale distributed systems
Innovative Environment: Work in a fast-paced ecosystem that embraces automation, cloud-native technologies, and continuous improvement

Full-time

Site Reliability Engineer

Engineering to make a system more resilient and efficient frees up time and mone...

Location

United States, Annapolis Junction

Salary:

86900.00 - 198000.00 USD / Year

Booz Allen Hamilton

Expiration Date

Until further notice

Requirements

5+ years of experience creating and maintaining highly reliable and scalable systems to reduce issues and downtime, including the design and implementation of physical servers, storage systems, and network infrastructure
5+ years of experience providing technical support for system upgrades, rollouts, and enhancements
3+ years of experience developing and deploying infrastructure solutions
3+ years of experience using and maintaining VMware v6.x or later, including the design and implementation of virtual data centers
3+ years of experience designing and deploying highly available storage solutions, including SAN and high-capacity storage
Experience with data center design and buildout
Experience transforming large-scale software, data center, or on-premises infrastructure programs to a virtualized architecture
Ability to interact with clients and lead, train, and mentor junior system administrators
Top Secret clearance
Bachelor's degree

Job Responsibility

Lead the development of more robust systems for Booz Allen by building resilient infrastructure
Build in redundancy, implement monitoring tools, and automate wherever possible
Reduce toil by scripting routine tasks and automating self-repair
Support your team of engineers and act as a subject matter expert for our clients

What we offer

Health benefits
Life benefits
Disability benefits
Financial benefits
Retirement benefits
Paid leave
Professional development
Tuition assistance
Work-life programs
Dependent care

Full-time

Site Reliability Engineer III

We're looking for a senior Site Reliability Engineer to join our small, high-own...

Location

United States

Salary:

148320.00 - 185400.00 USD / Year

AbsenceSoft

Expiration Date

Until further notice

Requirements

5+ years of experience in SRE, DevOps, or a related engineering role
Advanced hands-on expertise in AWS production environments and core services including Lambda, ECS, S3, ALB, and GuardDuty
Strong proficiency in infrastructure-as-code tooling such as Terraform, CloudFormation, or CDK
Experience building and operating CI/CD pipelines using Jenkins and GitHub
Proficiency in Python, Go, or Bash for automation
Hands-on experience with Datadog or a comparable observability platform for monitoring, alerting, and log management
Demonstrated experience leading incident response in complex, distributed systems
Working knowledge of SLO/SLI frameworks, error budgets, and disaster recovery planning against defined RTO/RPO objectives
Familiarity with SOC 2 compliance frameworks and experience contributing to audit readiness, access controls, and security control evidence collection
A collaborative, ownership-driven mindset with strong communication skills

Job Responsibility

Architect, implement, and operate scalable, resilient, and secure AWS infrastructure
Lead infrastructure-as-code initiatives to ensure all environments are reproducible, auditable, and consistently configured
Design, maintain, and improve CI/CD pipelines using Jenkins and GitHub
Own the Datadog observability platform, including dashboards, monitors, alerting thresholds, and log management
Define and maintain SLOs, SLIs, and error budgets
Serve as a senior technical responder across the full incident lifecycle within a shared on-call rotation
Lead blameless postmortems
Refine, implement, and test disaster recovery plans to meet RTO/RPO objectives
Contribute to SOC 2 audit readiness with a focus on access controls, incident response, and risk mitigation
Mentor junior SREs through code reviews, incident pairing, and documentation

What we offer

Impact that matters
Flexibility and trust
Remote-first and results-driven
Growth and development
Access to learning resources, leadership programs, and real opportunities to take on new challenges
Competitive rewards
Comprehensive benefits
Performance-based bonus program
Equity opportunities
Time for life

Full-time

Senior Site Reliability Engineer

We are seeking a highly skilled and passionate Senior Site Reliability Engineer ...

Location

Spain; Portugal; United Kingdom

Salary:

Not provided

Parser Limited

Expiration Date

Until further notice

Requirements

Deep SRE Expertise: Proven experience as a Senior Site Reliability Engineer or a similar role, with a strong understanding of SRE principles (error budgets, SLOs/SLIs, toil reduction)
Azure Cloud Proficiency: Extensive hands-on experience designing, deploying, and operating highly available and scalable applications on Microsoft Azure
Azure Kubernetes Service (AKS) Expertise: Mandatory extensive hands-on experience with AKS for container orchestration, including deployment, scaling, monitoring, and troubleshooting
Java Ecosystem Mastery: Expert-level proficiency with Java, including experience with modern frameworks (ideally Micronaut, Spring Boot, or similar) and JVM performance tuning
Distributed Systems Knowledge: Solid understanding and practical experience with distributed systems, microservices architecture, and associated challenges (e.g., consistency, fault tolerance)
Messaging & Database Expertise: Hands-on experience with an event streaming platform (ideally Kafka) and NoSQL data storage (ideally Couchbase), including operational best practices
Automation First Mindset: Strong scripting skills (e.g., Python, Bash) and experience with Infrastructure as Code tools (e.g., Terraform, ARM templates) and CI/CD pipelines (e.g., Azure DevOps, Jenkins)
Observability Tools: Experience with monitoring, logging, and alerting tools (e.g., Azure Monitor, Prometheus, Grafana, ELK Stack, Splunk)
Problem-Solving Acumen: Exceptional analytical and troubleshooting skills, with a methodical approach to diagnosing and resolving complex production issues
Communication & Collaboration: Excellent communication skills, with the ability to articulate complex technical concepts to diverse audiences and collaborate effectively with cross-functional teams

Job Responsibility

Architect and Implement Reliability: Design, build, and maintain highly scalable, resilient, and performant systems on Azure, focusing on our Java, Kafka, and Couchbase stack
Drive Modernisation: Work hands-on as part of the team spearheading the adoption of Micronaut, standardising application templates, and transitioning to managed cloud services
Enhance Operational Excellence: Develop and implement strategies for improving system observability (standardised logging, metrics, tracing), alerting, and on-call practices
Automate Everything: Champion automation across the software development lifecycle (SDLC), from CI/CD pipelines to infrastructure provisioning, focusing on accelerating delivery and de-risking deployments
Incident Management & Learning: Contribute to our mature, blameless post-incident review process, identifying root causes and implementing preventative measures to reduce incident hours
Tooling & Standards: Develop, maintain, and drive the adoption of shared, standardised SRE tooling and best practices across engineering teams, including containerisation (e.g., Docker, Kubernetes on Azure), infrastructure as code (e.g., Terraform), and configuration management
Mentorship & Collaboration: Provide technical leadership and mentorship to junior engineers, fostering a culture of SRE principles and operational excellence across the wider engineering organisation
Strategic Input: Contribute to the overall technical strategy and roadmap for our SRE and platform initiatives, ensuring alignment with business objectives

What we offer

The chance to join an organization with triple-digit growth that is changing the paradigm on how software products are built
The opportunity to form part of an amazing, multicultural community of tech experts
A highly competitive compensation package
Medical insurance
English lessons

Full-time

Senior Site Reliability Engineer

We are looking for a Senior Site Reliability Engineer (SRE) with deep experience...

Location

United States, Chicago

Salary:

131000.00 USD / Year

Realign

Expiration Date

Until further notice

Requirements

5+ years of experience in SRE, DevOps, or Cloud Engineering
Expertise in AWS core services (EC2, ECS/EKS, Lambda, S3, VPC, RDS, IAM, CloudFront, etc.)
Hands-on experience with Terraform, Ansible, or other IaC tools
Strong scripting/coding skills (Python, Go, Shell, etc.)
Experience with Kubernetes, containerization, and orchestration
Deep knowledge of Linux systems and networking
Experience with service meshes (e.g., Istio, App Mesh)
Familiarity with the AWS Well-Architected Framework
Experience building self-healing systems and automated remediation
Background in security, compliance, or multi-account/multi-region AWS architectures

Job Responsibility

Design, implement, and maintain scalable, secure, and highly available infrastructure on AWS
Develop and improve CI/CD pipelines and Infrastructure as Code (IaC) using Terraform and Harness
Own and implement monitoring, alerting, logging, and distributed tracing with tools like Dynatrace or Datadog
Troubleshoot production incidents, conduct blameless postmortems, and improve incident response processes
Optimize systems for cost, performance, and reliability
Drive chaos engineering and resilience testing
Collaborate with development teams to embed SRE practices like SLAs, SLOs, and error budgets
Mentor junior SREs and promote DevOps/SRE culture across the org

Full-time

Site Reliability Engineer

As a member of Kalshi’s engineering team, you’ll help build the next-generation ...

Location

United States, New York

Salary:

100000.00 - 250000.00 USD / Year

Kalshi

Expiration Date

Until further notice

Requirements

4+ years of software engineering experience
Experience designing, building, scaling, and maintaining production services and service-oriented architectures
Strong system design, coding, debugging, performance-tuning, and observability skills
High-quality coding practices with strong testing discipline
Excellent written and verbal communication, and comfort working transparently across teams
Strong interpersonal skills across junior-to-principal engineering levels
Ability to think clearly under pressure and dive into any layer of the stack
Passion for building an open financial system that connects the world
Willingness to participate in on-call rotations and swiftly resolve issues

Job Responsibility

Improve observability, reliability, and service availability by defining and measuring key metrics
Build automation and systems that eliminate toil and reduce operational burden
Collaborate with core infrastructure engineers to performance-tune and optimize cloud deployments (Docker, Terraform, Kubernetes, EC2, etc.)
Partner with product teams to minimize service disruptions and automate incident response
Identify and analyze reliability problems across the stack, designing and implementing software for significant, long-term improvements
Mentor engineers and drive a culture where reliability is a core engineering value
Write high-quality, well-tested code that supports internal and external customer needs
Debug complex technical issues and improve system usability, operability, and diagnosability
Review feature designs across the company and ensure security, safety, scalability, and architectural clarity
Build and maintain integrations with third-party vendors

What we offer

Equity and benefits

Full-time

Principal Site Reliability Engineer

Site Reliability Engineering (SRE) is a discipline that incorporates aspects of ...

Location

Portugal

Salary:

Not provided

OutSystems

Expiration Date

Until further notice

Requirements

STEM degree (BSc or MSc in Software Engineering, Computer Science, or a related field)
8+ years of experience in Software Engineering or SRE, ideally within high-growth, cloud-native environments
Expertise in Observability: Proven ability to implement SLIs/SLOs and telemetry systems that provide actionable insights into complex distributed systems
Cloud Mastery: Deep architectural knowledge of AWS/GCP/Azure, specifically regarding networking, security, and cost-optimization
Strategic Impact: Demonstrated success in leading cross-functional initiatives that improved system reliability or developer velocity at an organizational scale
System Design & Architecture: Expertise in designing highly available, fault-tolerant distributed systems (Microservices, Event-driven architecture)
Development: Professional-level proficiency in Go, Python, or Rust, with the ability to contribute to core product codebases and build custom internal tooling
Cloud Ecosystems: Deep mastery of AWS, GCP, or Azure (specifically IAM, VPC networking, Transit Gateways, and cross-region redundancy)
Orchestration at Scale: Extensive experience managing Kubernetes (K8s) in production, including Custom Resource Definitions (CRDs), Service Mesh (Istio/Linkerd), and Admission Controllers
Infrastructure as Code (IaC): Advanced usage of Terraform, CloudFormation, or Spacelift, focusing on modularity, state management, and CI/CD integration for infrastructure

Job Responsibility

Help define and execute the strategic vision and roadmap for the Site Reliability Engineering function
Provide leadership and mentorship to more junior SREs, fostering a culture of innovation, collaboration, and operational excellence
Collaborate with leadership and other stakeholders to ensure cross-functional alignment
Participate actively, collaborate effectively with development teams, and influence the design of highly reliable and scalable infrastructure, leveraging cloud technologies and industry best practices
Collaborate with development teams at all stages of the product development lifecycle to ensure systems are resilient (observable, fault-tolerant, recoverable, scalable) and performant
Drive the adoption, definition, and improvement of Service Level Objectives (SLOs)
Implement monitoring, alerting, logging, and tracing solutions to detect and respond to incidents
Oversee incident response efforts, ensuring quick resolution, minimal downtime, and effective RCAs/post-mortems
Automate every operational task, with a special focus on fast incident detection & recovery
Foster a culture of continuous improvement and knowledge sharing

What we offer

A company that is always growing, changing, and innovating
Real career opportunities
Work colleagues that are as smart, hard-working, and driven as you
Disrupting the status quo is in our DNA
We ask “why” a lot
OutSystems nurtures an inclusive culture of diversity, where everyone feels empowered to be their authentic self and perform at their best.

Full-time

