Site Reliability Engineer Job at Hebbia (New York City)

Site Reliability Engineer

We are currently seeking a Site Reliability Engineer to join our team in Guadala...

Location

Mexico, Guadalajara

Salary:

Not provided

NTT DATA

Expiration Date

Until further notice

Requirements

Understand the Microsoft Azure Cloud - ideally Azure Fundamentals certified OR Computer Science/Information Systems Management degree
Familiar with PaaS and IaaS - VMs, Storage, Event Hubs, Service Fabric Cluster (SFC), Azure Kubernetes Service (AKS), Cosmos DB, SQL Server, IoT Hub, Databricks, Key Vault, Data Lake
Understand the concept of Internet of Things (IoT) - telemetry, ingestion, processing, data storage, reporting
Understand the concept of tools such as Octopus, Bamboo, Terraform, Azure DevOps, Jenkins, GitHub, Ansible
Understand the concept of container orchestration platforms (e.g. Kubernetes)
Understand the concept of scripting languages: PowerShell, Python
Understand the difference between NoSQL and SQL databases, and how to maintain them

Job Responsibility

Perform L1.5 activities such as monitoring, deployment, rollback
Monitor the efficiency of the Azure cloud systems to prevent outages and initiate an Incident Management bridge in case of an outage
Troubleshoot Azure resources, escalate to Level 3 (Software Development Team)

Fulltime

Site Reliability Engineer

Location

South Africa, Johannesburg

Salary:

Not provided

Nintex

Expiration Date

Until further notice

Requirements

You provide guidance on infrastructure architecture and contribute to high-quality and successful product releases.
You contribute to your team and domain through successfully leading and consistently delivering on projects of ambiguous scope, high complexity, and critical business impact.
You contribute to relevant guilds, practice forums and other initiatives to improve Nintex’s DevOps and SRE discipline.
You have an in-depth understanding of distributed systems architecture, as well as monitoring and observability practices and tools.
You quickly resolve priority infrastructure issues and help other technical team members or Product Managers understand how to avoid them in the future.
You provide detailed estimates for work items you propose or are assigned.
You assist in decision-making around tooling, automation practices, and testing solutions.
You stay up to date with technology trends and use this knowledge to help your team and the broader Engineering practice.
You run Nintex infrastructure with IaC tools (such as Terraform) and GitHub Actions for automation, containerize our environments (Kubernetes) and leverage cloud technologies to meet our goals.
You build monitoring that alerts on symptoms rather than outages using tools like Prometheus, Grafana, Alertmanager and PagerDuty.

Job Responsibility

You are highly skilled and sufficiently experienced in Nintex DevOps tools and processes to own a long-term program or technology such as Kubernetes.
You write scripts, tools and utilities that support and integrate with delivery pipelines and you integrate telemetry where appropriate.
You are called into incidents and bring trusted knowledge in your platform domain.
You debug and fix infrastructure issues on production environments quickly using the relevant tools and guidelines to prevent recurrence.
You build, promote and support infrastructure patterns and practices within Nintex.
You provide coaching/mentoring to other Engineers on the team.
You lead or contribute to post-mortems for incidents, including root cause analysis and identification of preventative and remedial actions.
You continuously monitor our platform performance and take immediate action to improve it.
You review and advise on appropriate design patterns to solve automation and infrastructure problems without creating technical debt.
You design and build complex infrastructure components for distributed systems such as Kubernetes.

What we offer

Global Gratitude and Recharge Days
Flexible, paid time off policy
Employee wellness programs and counseling resources
Meaningful peer recognition and awards
Paid parental leave
Invention/patenting assistance
Community impact, paid volunteer time, and opportunities
Intercultural learning and celebration
Multiple tools through which to learn and grow, and an incredible global community

Site Reliability Engineer

An Elite FinTech Firm is looking for a highly talented DevOps Engineer/Systems S...

Location

Hong Kong, Hong Kong

Salary:

1200000.00 HKD / Year

Hunter Bond

Expiration Date

Until further notice

Requirements

Genuine passion for Linux & open source
Excellent knowledge of Python
Use of CI/CD, Docker, Ansible, Chef, Puppet
Knowledge of large-scale storage systems (on-prem)

Job Responsibility

Help architect resilient, multi-petabyte storage solutions & build new data centres
Automate anything and everything with Python & config tools
Innovate whilst bringing in new ideas

What we offer

Flexible hours/work options
Working in one of the world’s most elite teams
Invest heavily in cutting-edge and next-gen tech
Technologists only report to other technologists
Brand new skyline Manhattan office
Start-up style environment

Fulltime

Site Reliability Engineer

As a Staff Software Engineer, you will play a key role in designing, building, a...

Location

United States, San Jose

Salary:

120500.00 - 243000.00 USD / Year

Hewlett Packard Enterprise

Expiration Date

Until further notice

Requirements

Minimum of 5 years of hands-on experience in InfraOps, DevOps, or Site Reliability Engineering (SRE)
Proficiency with Linux systems, especially Debian-based distributions
Strong experience with cloud platforms such as AWS and GCP
Expertise in Infrastructure as Code tools like Terraform, Packer, and Ansible
Solid programming skills in Python and/or Golang
Deep understanding of containerization (Docker, Container) and orchestration tools (AWS EKS, GCP GKE)
Experience with GitOps workflows
Proven track record in implementing and maintaining CI/CD pipelines
Strong background in security and familiarity with security programs
Experience with monitoring and logging tools (Prometheus, Grafana, ELK)

Job Responsibility

Enhance Infrastructure as Code (IaC) and enforce best practices
Optimize cloud infrastructure for scalability, security, and cost-effectiveness
Develop internal tools to support and streamline cloud platform operations
Improve CI/CD pipelines and deployment workflows using FluxCD and Jenkins
Address container image vulnerabilities and standardize remediation processes
Build Amazon Machine Images (AMIs) aligned with CIS and STIG benchmarks
Strengthen monitoring, alerting, and observability using Prometheus, Grafana, and logging tools
Troubleshoot complex production issues to ensure system reliability and customer satisfaction
Fine-tune distributed systems such as Apache Kafka and Cassandra
Collaborate with development, security, and operations teams to align infrastructure with application needs

What we offer

Health & Wellbeing
Personal & Professional Development
Unconditional Inclusion

Fulltime

Site Reliability Engineer

An Elite FinTech Firm is looking for a highly talented DevOps Engineer/Systems S...

Location

United Kingdom, London

Salary:

150000.00 GBP / Year

Hunter Bond

Expiration Date

Until further notice

Requirements

Genuine passion for Linux & open source
Excellent knowledge of Python
Use of CI/CD, Docker, Ansible, Chef, Puppet
Knowledge of large-scale storage systems (on-prem)

Job Responsibility

Help architect resilient, multi-petabyte storage solutions & build new data centres
Automate anything and everything with Python & config tools
Innovate whilst bringing in new ideas

What we offer

Flexible hours/work options
Working in one of the world’s most elite teams
Invest heavily in cutting-edge and next-gen tech
Technologists only report to other technologists
Brand new skyline Manhattan office
Start-up style environment

Fulltime

Site Reliability Engineer

Microsoft is a company where passionate innovators come to collaborate, envision...

Location

India, Bangalore

Salary:

Not provided

Microsoft Corporation

Expiration Date

Until further notice

Requirements

Master's Degree in Computer Science, Information Technology, or related field AND 3+ year(s) technical experience in software engineering, network engineering, or systems administration OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 4+ years technical experience in software engineering, network engineering, or systems administration OR equivalent experience
Ability to meet Microsoft, customer and/or government security screening requirements
Must pass Microsoft Cloud background check upon hire/transfer and every two years thereafter

Job Responsibility

Own the end-to-end readiness of Event Stream across Azure regions, including onboarding new regions, driving deployment automation, and ensuring consistent, secure, and compliant service rollout
Work closely with platform, infrastructure, and partner teams (e.g., Event Hubs, Kusto, Fabric platform) to deliver resilient, low-latency streaming experiences on a global scale
Play a key role in advancing our reliability posture, improving availability, monitoring, and incident response across regions
Build strong observability, telemetry, and automated recovery mechanisms to meet high availability and SLA targets
Region Build-out & Deployment: Onboard new regions, drive deployment automation, and ensure consistent service configuration
Reliability & SRE: Improve availability, resiliency, and incident response; own service health across regions
Observability & Operations: Enhance telemetry, monitoring, alerting, and troubleshooting capabilities
Cross-team Collaboration: Partner with platform and infra teams to unblock dependencies and ensure smooth rollout
Production Excellence: Drive root-cause analysis, repair items, and continuous improvement on service reliability

Fulltime

Site Reliability Engineer

Location

United Kingdom, Newcastle

Salary:

Not provided

Trimble Inc.

Expiration Date

Until further notice

Requirements

Bachelor's or Master's degree in Computer Engineering or a related field
At least 5 years of technical experience with a proven ability to take full ownership of production infrastructure
Excellent collaboration skills, with experience leading cross-functional work
Demonstrated success in managing infrastructure in production environments
Expertise in capacity planning and cost optimisation for efficient operations
Extensive expertise managing cloud provider hosted infrastructure, specifically with Microsoft Azure or AWS
Proficient in high-level scripting languages like Python and Infrastructure as Code tools (Terraform), along with containerisation
Demonstrated success with Kubernetes or other containerisation technologies
Familiarity with CI/CD pipelines and tools such as Azure DevOps, Jenkins, Argo CD, Helm, GitHub
Experience with incident management processes and monitoring tools such as Prometheus, Grafana, New Relic, Datadog, Splunk, CloudWatch, Sumo Logic, etc.

Job Responsibility

Develop and maintain scalable infrastructure as code (IaC) using Terraform to ensure reliable and scalable cloud environments
Implement and enhance observability solutions using tools like New Relic, Datadog, Sumo Logic and Splunk for monitoring, logging, and alerting
Perform code deployments and manage CI/CD pipelines using Jenkins, GitHub, and related tooling to ensure smooth and efficient delivery processes
Automate routine tasks and workflows to increase operational efficiency and reduce manual intervention
Evaluate system designs and architectures for reliability, performance, security, and efficiency, ensuring best practices are followed
Lead incident response efforts and conduct deep-dive root cause analysis to implement long-term, innovative technical solutions
Develop and maintain comprehensive runbooks and procedures for incident response and operational tasks
Collaborate with cross-functional teams to review and provide feedback on technical designs, ensuring alignment with SRE principles
Participate in on-call rotations and handle critical incidents with confidence and expertise
Continuously improve documentation for systems and services, contributing to a knowledge-sharing culture within the team

Site Reliability Engineer

Shape the Future of Intelligent Operations as a Site Reliability Engineer (AI Op...

Location

India, Chennai

Salary:

Not provided

Trimble Inc.

Expiration Date

Until further notice

Requirements

1 to 2 years of professional experience in a DevOps, MLOps, or systems engineering environment
Bachelor's degree in Computer Science, Engineering, Information Technology, or a closely related technical field
Direct experience with the Microsoft Azure cloud platform and its specialized ecosystem services (such as Azure ML and Azure DevOps)
Proficiency with Python or other scripting languages (Shell / Bash / PowerShell) for rapid system integration and task automation
Foundational understanding of containerization (Docker), basic orchestration concepts (Kubernetes fundamentals), and version control system workflows (Git)
Solid baseline knowledge of fundamental DevOps principles (CI/CD, system administration) and a basic understanding of the end-to-end machine learning model lifecycle

Job Responsibility

Assist in the deployment and maintenance of machine learning models in production under direct supervision, building skills in containerization and orchestration architectures
Support the development of robust continuous integration and deployment pipelines for ML workflows, including model versioning, automated testing, and release processes
Monitor production ML model performance, detect data drift, and track system health by implementing foundational logging, alerting, and metrics solutions
Contribute to infrastructure automation and configuration management for machine learning workloads, learning to treat infrastructure as software
Partner closely with ML engineers and data scientists to operationalize complex models, ensuring reliability, scale, and strict adherence to established operational patterns

What we offer

Structured environment to accelerate technical skills
Direct guidance from experienced engineering professionals
Projects that improve productivity, quality, safety, transparency and sustainability
Collaborative and supportive team
Entrepreneurial spirit empowering proactive doers
Flexible work arrangements

Fulltime
