DevOps Site Reliability Engineer Job at Tech Mahindra (Montreal)

Site Reliability Engineer

We are recruiting a Senior SRE for a company that provides an advanced data, ope...

Location

Portugal, Lisboa

Salary:

Not provided

Precise

Expiration Date

Until further notice

Requirements

Up to 5 years of experience in a Site Reliability Engineering (SRE), DevOps, or Production Engineering role, with a deep understanding of SRE principles and best practices
Incident management expertise, including triaging, escalation, and resolution of high-severity outages
Proficiency in at least one coding language (Python or Java) for automation and debugging
Hands-on experience in Kubernetes (K8s) for managing and orchestrating containerized applications
Cloud experience (AWS preferred) with exposure to key services like EC2, S3, Lambda, and CloudWatch
Excellent communication skills to articulate technical challenges and solutions effectively
Strong troubleshooting and problem-solving skills, with experience diagnosing complex production issues
Ability to stay calm under pressure, multitask, and prioritize effectively in fast-moving environments
Fluency in English (spoken and written) is required
Must have the legal right to work in the country

Fulltime

Site Reliability Engineer

We are recruiting a Junior SRE for a company that provides an advanced data, ope...

Location

Portugal, Lisboa

Salary:

Not provided

Precise

Expiration Date

Until further notice

Requirements

Up to 2-3 years of experience in a Site Reliability Engineering (SRE), DevOps, or Production Engineering role, with a deep understanding of SRE principles and best practices
Incident management expertise, including triaging, escalation, and resolution of high-severity outages
Proficiency in at least one coding language (Python or Java) for automation and debugging
Hands-on experience in Kubernetes (K8s) for managing and orchestrating containerized applications
Cloud experience (AWS preferred) with exposure to key services like EC2, S3, Lambda, and CloudWatch
Excellent communication skills to articulate technical challenges and solutions effectively
Strong troubleshooting and problem-solving skills, with experience diagnosing complex production issues
Ability to stay calm under pressure, multitask, and prioritize effectively in fast-moving environments
Fluency in English (spoken and written) is required
Must have the legal right to work in the country

Fulltime

Site Reliability Engineer

Join our client, a leading financial institution at the forefront of innovation,...

Location

United States, Austin

Salary:

57.00 - 63.33 USD / Hour

Aquent

Expiration Date

Until further notice

Requirements

Proven experience leading engineering teams and delivering projects using Scrum and efficient release practices
Strong background in converting high-level designs into low-level designs and providing technical oversight
Demonstrated experience in designing, architecting, and deploying cloud-native applications, specifically on GCP
Proficiency with various database technologies, including MongoDB, Aerospike, SQL Server, and PostgreSQL
Expertise in containerization technologies such as Docker and Kubernetes, and building/managing CI/CD pipelines
Experience leveraging AI-driven software development tools to enhance productivity, code comprehension, and documentation
Proven track record of integrating and applying AI/Machine Learning models for data analytics, visualization, automation, and problem-solving
Ability to maintain high quality standards while delivering within tight schedules
Exceptional collaborative mindset with a bias for action, engaging effectively with product management, architects, and other domains
Strong ability to work with internal, external, and offshore stakeholders

Job Responsibility

Drive Technical Leadership & Project Delivery: Lead engineering teams through the entire project lifecycle, leveraging agile methodologies like Scrum to ensure efficient delivery and robust release practices
Architect & Design Cloud-Native Solutions: Translate high-level architectural visions into detailed low-level designs, providing expert technical oversight for the development and deployment of cutting-edge cloud-native applications
Champion Reliability & Scalability: Design, architect, and deploy highly available and scalable cloud-native applications on platforms such as GCP, ensuring optimal performance and resilience
Optimize Data Management: Leverage your expertise with diverse database technologies, including MongoDB, Aerospike, SQL Server, and PostgreSQL, to build and maintain robust data solutions
Advance DevOps & Automation: Implement and optimize containerization strategies using technologies like Docker and Kubernetes, and establish sophisticated CI/CD pipelines to streamline development and deployment
Innovate with AI/ML: Integrate and apply AI/Machine Learning models to enhance data analytics, visualization, automation, and creatively solve complex business and technical challenges
Foster Collaboration & Mentorship: Work closely with diverse stakeholders across product management, architecture, and other engineering domains, while actively mentoring and coaching multiple teams to elevate technical capabilities
Influence & Present Solutions: Effectively engage subject matter experts, present complex architectural solutions to governance boards and stakeholders, and advocate for data-driven proposals

What we offer

subsidized health, vision, and dental plans
paid sick leave
retirement plans with a match

Site Reliability Engineer

You develop cloud platform according to modern principles. You advise our custom...

Location

Spain, Valencia

Salary:

Not provided

MaibornWolff GmbH

Expiration Date

Until further notice

Requirements

Ideally, a degree in computer science or comparable training
Sound technical understanding
An understanding of how to build and run a secure application in the cloud
Experience with container orchestration, ideally with Kubernetes
Experience with Infrastructure-as-Code tools such as Terraform, Helm, Ansible, or CDK
Experience in setting up the release management process using modern CI/CD systems
Knowledge of a cloud provider (AWS, Azure, Google Cloud), ideally backed by a certification
Development skills in at least one object-oriented, functional or scripting language
Very good English and good German skills

Job Responsibility

Develop cloud platforms according to modern principles
Advise customers on the sensible use of services in the cloud with regard to effort, costs and maintenance
Live a vibrant DevOps culture internally and carry it to customers
Help customers introduce the right release processes and implement them with modern CI/CD tools (Azure DevOps, GitLab, GitHub)
Develop and integrate monitoring and logging infrastructure to improve application maintainability
Design and develop scalable and fail-safe IT architectures

What we offer

Home Office & Office
Flexible Working Hours
Part-Time Models
Working Time Account
Sabbatical
30 days of paid vacation
An annual training budget of 1.5 gross monthly salaries for training, certifications, conferences, and more
Corporate seminars
Christmas parties
Private health and dental insurance

Site Reliability Engineer

As a highly skilled Site Reliability Engineer (SRE), you will contribute to buil...

Location

United States, New York City; San Francisco

Salary:

160000.00 - 300000.00 USD / Year

Hebbia

Expiration Date

Until further notice

Requirements

Bachelor’s degree in Computer Science, Engineering, or a related field, or equivalent practical experience
5+ years software development experience at a venture-backed startup or top technology firm
Proven experience as a Site Reliability Engineer, DevOps Engineer, or similar role
Strong expertise in managing CI/CD pipelines and deployment automation
Proficiency in cloud platforms such as AWS, Azure, or Google Cloud (we are an AWS shop)
Solid understanding of containerization and orchestration technologies such as Docker and Kubernetes
Experience with monitoring and observability tools such as Datadog, Prometheus, Grafana, or similar
Knowledge of infrastructure-as-code (IaC) tools such as Terraform or CloudFormation
Familiarity with security best practices and tools for infrastructure and application security
Excellent problem-solving skills and the ability to troubleshoot complex issues

Job Responsibility

Assist in managing deployment pipelines to facilitate smooth and efficient software releases
Help implement and maintain observability solutions for monitoring system performance and reliability
Support local development environments to optimize developer workflows
Work with development teams to ensure infrastructure aligns with project requirements
Contribute to improving the security of our infrastructure by assisting with proactive measures and audits
Assist in developing and maintaining automation scripts and tools to enhance operational efficiency
Help troubleshoot and resolve infrastructure and application issues to minimize downtime and maintain smooth operations
Participate in evaluating and integrating new technologies to enhance the scalability, reliability, and security of our infrastructure

What we offer

PTO: Unlimited
Insurance: Medical + Dental + Vision + 401K
Eats: Catered lunch daily + DoorDash dinner credit if you ever need to stay late
Parental leave policy: 3 months non-birthing parent, 4 months for birthing parent
Fertility benefits: $15k lifetime benefit
New hire equity grant: competitive equity package with unmatched upside potential

Fulltime

Software Engineer, Site Reliability

As a Site Reliability Engineer (SRE) at Fireworks AI, you will play a critical r...

Location

United States, San Mateo

Salary:

Not provided

Fireworks AI

Expiration Date

Until further notice

Requirements

Bachelor's degree in Computer Science, related technical field, or equivalent practical experience
5+ years of experience in Site Reliability Engineering, DevOps, or a similar role focused on large-scale production systems
Deep expertise in SRE principles and practices, including SLOs, SLIs, operational automation, incident management, and post-mortems
Extensive hands-on experience with public cloud platforms (AWS, GCP, Azure), including compute, networking, storage, and database services
Strong experience with containerization technologies (Docker) and orchestration platforms (Kubernetes)
Proficiency in designing and implementing robust monitoring, logging, and alerting systems using tools like Prometheus, Grafana, ELK stack, and distributed tracing
Solid programming/scripting skills in at least one language (e.g., Python, Go) for automation and tool development
In-depth knowledge of Linux operating systems, networking fundamentals, and system debugging
Proven ability to troubleshoot complex issues across the entire stack
Excellent communication, collaboration, and problem-solving skills

Job Responsibility

Ensuring System Reliability: Ensure systems are designed and implemented with high availability, scalability, and performance. Focus on fault tolerance, disaster recovery, identifying and removing scaling bottlenecks, and performance optimization across our multi-cloud infrastructure
Incident Management & Response: Lead efforts in incident detection, response, and resolution for critical production issues. Drive post-mortems to identify root causes and implement preventative measures to improve system reliability
Observability & Monitoring: Develop, implement, and maintain comprehensive monitoring, alerting, logging, and tracing solutions to provide deep insights into system health and performance
Automation & Toil Reduction: Identify and automate repetitive operational tasks to reduce toil and improve operational efficiency. Develop tools and scripts to streamline deployments, scaling, and system management
Capacity Planning & Performance Tuning: Work proactively on capacity planning to ensure our infrastructure can gracefully handle growth and peak loads. Optimize system performance and resource utilization
Reliability Best Practices: Collaborate with software engineers to embed reliability principles (e.g., SLOs, SLIs, error budgets) into the development lifecycle, promoting a culture of operational excellence
On-call Rotation: Participate in a periodic on-call rotation to support our production environment and respond to critical alerts

Fulltime

Site Reliability Engineer

Corporate Tools is looking for a Site Reliability Engineer. You will be a tradit...

Location

United States

Salary:

175000.00 USD / Year

Corporate Tools

Expiration Date

Until further notice

Requirements

Bachelor's degree in Computer Science, Software Engineering, or equivalent practical experience
5+ years of experience in software engineering
2+ years of experience in site reliability engineering, DevOps, or infrastructure engineering roles
Deep experience with cloud platforms (AWS, Azure, or GCP) and infrastructure as code tools such as Terraform, CloudFormation, or Pulumi
Strong proficiency with Kubernetes, Docker, and container orchestration in production environments
Hands-on experience with observability and monitoring tools like Prometheus, Grafana, OpenTelemetry, Sentry, or New Relic
Proven ability to design and implement highly available, fault-tolerant systems and lead proactive incident response efforts
Experience with performance tuning, database optimization, and caching strategies (e.g., PostgreSQL, Redis, Memcached)
Demonstrated ability to drive reliability improvements, reduce operational toil, and foster a culture of resilience and continuous improvement
Experience leading reliability-focused initiatives such as post-incident reviews, capacity planning, and root cause analysis

Job Responsibility

Stop problems before they start
Fix issues quickly and learn from them
Help keep systems steady, secure, and running
Work closely with DevOps engineers to build out tools and automation
Take ownership

What we offer

100% employer-paid medical, dental and vision for employees
Annual review with raise option
22 days Paid Time Off accrued annually, and 4 holidays
After 3 years, PTO increases to 29 days
Employees transition to flexible time off after 5 years with the company—not accrued, not capped, take time off when you want
Paid Parental Leave
Up to 6% company matching 401(k) with no vesting period
Quarterly allowance
Open concept office with friendly coworkers
Creative environment where you can make a difference

Fulltime

Site Reliability Engineering Manager

Hewlett Packard Enterprise (HPE) is looking for a Site Reliability Engineering M...

Location

India, Bangalore

Salary:

Not provided

Hewlett Packard Enterprise

Expiration Date

Until further notice

Requirements

7–10 years of experience in Site Reliability Engineering, DevOps, or Cloud Infrastructure roles
Minimum 2 years of experience managing or leading cloud operations teams
Deep understanding of cloud platforms (AWS, GCP, or Azure) and cloud-native architectures
Hands-on experience with Kubernetes, containers, infrastructure as code (e.g., Terraform), and configuration management tools
Strong foundation in observability (monitoring, logging, tracing), automation using Python, and incident response
Familiarity with modern CI/CD automation and tools
Excellent communication, stakeholder management, and team-building skills
Experience scaling SRE practices in high-growth or large-scale environments
Ability to balance long-term reliability initiatives with short-term delivery needs

Job Responsibility

Lead and mentor a team of Site Reliability Engineers, supporting their growth, performance, and well-being
Own the reliability strategy for SASE cloud infrastructure systems, including incident management, SLIs/SLOs, and capacity planning
Partner with Engineering, Product, and Security teams to design and deliver highly available, scalable, and resilient cloud-native services
Guide the team in building automation, improving observability, and improving the operational efficiency of our cloud infrastructure
Drive adoption of best practices in monitoring, alerting, on-call operations, and runbook development
Build and maintain a strong engineering culture based on ownership, collaboration, and continuous learning
Define and track key reliability metrics, and report on team performance and system health to leadership
Contribute to hiring, onboarding, and career development for SREs

What we offer

Health & Wellbeing benefits for physical, financial, and emotional wellbeing
Personal & Professional Development programs
Unconditional inclusion in the workplace

Fulltime

DevOps Site Reliability Engineer

Tech Mahindra

Location:
Canada, Montreal

Category:
IT - Software Development

Contract Type:
Not provided

Job Posted:
January 20, 2026

Expiration:
January 31, 2026
