Cloud Engineer / Site Reliability Engineer (SRE) Job at Beacon Hill (Orlando)

Senior Site Reliability Engineer (SRE) – Cloud & Distributed Systems

We are seeking an experienced Senior Site Reliability Engineer (SRE) to design, ...

Location

United States, Austin

Salary:

Not provided

Dutech Systems

Expiration Date

Until further notice

Requirements

8+ years of experience in SRE, DevOps, or Systems Engineering
Strong expertise in Linux/Unix systems and system internals
Proficiency in at least one programming/scripting language (Python, Go, Java, Bash)
Experience designing and operating distributed systems
Hands-on experience with cloud platforms (AWS or GCP)
Experience with Docker and Kubernetes
Strong understanding of monitoring, alerting, and logging concepts
Experience managing SLIs, SLOs, and error budgets
Experience with incident management and RCA processes

Job Responsibility

Design, implement, and manage highly available, distributed systems
Maintain and optimize cloud infrastructure (AWS/GCP)
Develop automation scripts using Python, Go, Java, or Bash
Manage containerized environments using Docker and Kubernetes
Define and monitor SLIs, SLOs, and error budgets
Implement monitoring, logging, and alerting solutions
Lead incident management, root cause analysis (RCA), and postmortems
Ensure system security and compliance within operational workflows
Improve system reliability through performance tuning and optimization
Collaborate with engineering teams to enhance deployment and release processes

Intermediate Site Reliability Engineer SRE – AI Reliability & Automation

At PointClickCare our mission is simple: to help providers deliver exceptional c...

Location

Canada, Mississauga

Salary:

115,000.00 - 128,000.00 CAD / Year

PointClickCare

Expiration Date

Until further notice

Requirements

5+ years' experience in software engineering
Experience with SRE principles
Experience with AI/ML in production environments
A passion for automation, intelligent systems, and operational excellence
Strong debugging, problem-solving, and system design skills
Languages: Python, Java, Bash, Terraform
Platforms: Azure, Kubernetes, Docker
Tools: Datadog, Prometheus, AppDynamics, ELK, GitHub Actions
ML/AI: MCP framework, AI agents, Vector store, Agent orchestration (LangChain), RAG
CI/CD: Jenkins, ArgoCD, Spinnaker

Job Responsibility

Build ML-based anomaly detection and pattern recognition systems
Enhance telemetry with smart tagging and metadata for better AI insights
Develop event-driven workflows and self-healing systems using AI triggers
Automate incident response with generative AI and custom AI agent orchestration
Use time-series forecasting and predictive modelling to anticipate failures
Optimise infrastructure with AI-powered autoscaling and cost-aware resource allocation
Build scalable, fault-tolerant systems in a cloud-native environment
Participate in on-call rotations and lead incident response for critical systems
Integrate APIs for streamlined data exchange and system connectivity
Run internal AIOps workshops and help teams adopt AI maturity models

What we offer

Benefits starting from Day 1!
Retirement Plan Matching
Flexible Paid Time Off
Wellness Support Programs and Resources
Parental & Caregiver Leaves
Fertility & Adoption Support
Continuous Development Support Program
Employee Assistance Program
Allyship and Inclusion Communities
Employee Recognition … and more!

Fulltime

Site Reliability Engineer (SRE)

Wissen Technology is hiring for Site Reliability Engineer (SRE). At Wissen Techn...

Location

India, Bangalore South

Salary:

Not provided

Wissen

Expiration Date

Until further notice

Requirements

Strong experience in Java Application Support
Proven expertise with Terraform
Solid Cloud knowledge (AWS, Azure, or GCP)
9+ years of professional experience in SRE or related roles
Hands-on experience with MongoDB and Kafka
Experience with GitHub Actions for CI/CD automation
Strong problem-solving skills and ability to work independently during critical incidents
Excellent communication and stakeholder management skills

Job Responsibility

Ensure reliability, scalability, and performance of mission-critical systems
Provide Java application support and troubleshoot production issues
Implement and maintain Infrastructure as Code (IaC) using Terraform
Manage and optimize cloud infrastructure across AWS, Azure, or GCP
Automate CI/CD pipelines using GitHub Actions
Administer and support MongoDB and Kafka clusters
Drive incident response, root cause analysis, and postmortem documentation
Collaborate with cross-functional teams to enhance observability, monitoring, and alerting capabilities

Fulltime

Senior Site Reliability Engineer (SRE)

The Senior SRE is responsible for deployment, updates, and operational support f...

Location

India, Chennai

Salary:

Not provided

Dalet

Expiration Date

Until further notice

Requirements

Cloud platforms: AWS, Azure
Containerisation & Orchestration: Kubernetes
Infrastructure as Code: Terraform
Configuration Management: Ansible
Packaging & Deployment: Helm
Databases: MariaDB, MongoDB
Monitoring, observability, networking, and cloud security

Job Responsibility

Act as a senior technical authority for APAC Site Reliability Engineering activities
Drive best practices in reliability, operations, and engineering standards
Promote technical excellence, collaboration, and accountability across stakeholders
Make infrastructure complexity transparent to both internal teams and customers, ensuring a consistently excellent client experience
Implement, track, and evolve service performance measures such as SLAs, SLOs, and SLIs
Anticipate risks related to service availability, capacity, performance regressions, and security vulnerabilities
Drive continuous improvement, including leading and facilitating Root Cause Analysis (RCA) activities
Ensure timely execution of deployments, upgrades, maintenance activities, and change requests
Anticipate workload, plan deliverables, and ensure qualification/validation of upcoming tasks
Collaborate closely with engineering to improve platform components, automation, and operational processes

What we offer

Great career opportunities around the world
Truly collaborative environment with supportive leadership
Cutting edge technologies (AI, Cloud, Cybersecurity...)
Talented and passionate team members
Fun working environment

Fulltime

Site Reliability Engineer (SRE) - Identity Access Management IAM

Join us as a Site Reliability Engineer (SRE) - Identity Access Management. You w...

Location

India, Pune

Salary:

Not provided

Barclays

Expiration Date

Until further notice

Requirements

Experience in designing, implementing, deploying, and running highly available, fault-tolerant, auto-scaling and auto-healing systems
Strong expertise in AWS (essential; Azure and GCP (Google Cloud Platform) are a plus), including Kubernetes (ECS is essential; Fargate and GCE are a plus) and serverless architectures
Strong experience in running disaster recovery and zero-downtime solutions, and in designing and implementing continuous delivery across large-scale, distributed, cloud-based microservice and API service solutions with 99.9%+ uptime
Hands-on experience coding in Python, Bash and JSON/YAML (Configuration as Code)
The ability to drive reliability best practices across engineering teams, embed SRE principles into the DevSecOps lifecycle and partner with engineering, security and product teams to balance reliability and feature velocity
Experience in hands-on configuration, deployment and operation of ForgeRock COTS-based IAM (Identity Access Management) solutions (PingGateway, PingAM, PingIDM, PingDS) with embedded security gates, HTTP header signing, access token and data-at-rest encryption, PKI-based self-sovereign identity, or open source

Job Responsibility

Apply software engineering techniques, automation, and incident response best practices to ensure the reliability, availability, and scalability of systems, platforms, and technology
Ensure the availability, performance, and scalability of systems and services through proactive monitoring, maintenance, and capacity planning
Resolve, analyse, and respond to system outages and disruptions, and implement measures to prevent similar incidents from recurring
Develop tools and scripts to automate operational processes, reducing manual workload, increasing efficiency, and improving system resilience
Monitor and optimise system performance and resource usage, identify and address bottlenecks, and implement best practices for performance tuning
Collaborate with development teams to integrate best practices for reliability, scalability, and performance into the software development lifecycle, and work closely with other teams to ensure smooth and efficient operations
Stay informed of industry technology trends and innovations, and actively contribute to the organization's technology communities to foster a culture of technical excellence and growth

What we offer

Competitive holiday allowance
Life assurance
Private medical care
Pension contribution

Fulltime

Principal Site Reliability Engineer (Sovereign Cloud)

Location

Bulgaria, Sofia

Salary:

Not provided

Palo Alto Networks

Expiration Date

Until further notice

Requirements

6+ years as a DevOps engineer with a passion for technology, strong motivation and responsibility
Proficiency in DevOps and Platform Engineering with expertise in AWS, GCP, Terraform, ArgoCD, Kubernetes, and related tools
Experience in developing and maintaining CI/CD pipelines for continuous delivery in agile environments
Skilled in managing cloud infrastructure, particularly with AWS and GCP, and adept in infrastructure as code practices using Terraform/Terragrunt
Demonstrated capability in supporting high-scale SaaS applications, focusing on scalability, reliability, and performance
Strong communication, strategic thinking, and problem-solving skills
Self-disciplined, self-managed, self-motivated, strong sense of ownership, urgency, and drive
Ready to understand and dissect new technology stacks quickly

Job Responsibility

Implement and optimize CI/CD pipelines and cloud infrastructure using our technology stack, ensuring efficient and reliable deployment to production
Participate in the deployment of monitoring and alerting systems to maintain high system performance and reliability
Collaborate with software development and other cross-functional teams to streamline and enhance processes, aiming for efficiency and alignment with business goals
Contribute to the management of the cloud infrastructure, utilizing Infrastructure as Code principles
Participate in on-call rotations to support critical business and production systems

Fulltime

Sr Principal Site Reliability Engineer (Sovereign Cloud)

The Prisma Access team is seeking a seasoned Principal Site Reliability Engineer...

Location

Bulgaria, Sofia

Salary:

Not provided

Palo Alto Networks

Expiration Date

Until further notice

Requirements

10+ years of experience in Infrastructure, SRE, or DevOps roles
BS or MS in Computer Science, a related field, or equivalent professional experience
7+ years of experience with GCP, with expertise in its architecture, services, and PKI concepts for cloud security
Expert troubleshooting skills to resolve cloud infrastructure and service issues, effectively identifying root cause and devising effective solutions
Proficiency in automation using Python and shell scripting
Expertise in Infrastructure as Code (IaC) with Terraform and Helm, leveraging AI tools for development
Solid experience with Kubernetes, container networking, and container workloads
Strong Linux administration skills
Proficiency with CI/CD pipelines, GitOps principles, and tooling like GitLab and Jenkins
Excellent written and verbal communication skills, with the ability to collaborate effectively to drive outcomes

Job Responsibility

Design, build, and operate reliable, secure Cloud infrastructure across multi-cloud environments for our sovereign customers
Lead cross-functional initiatives to ensure applications are production-ready, scalable, secure, and resilient
Develop expertise in new technologies, embracing continuous learning and the adoption of AI tools
Develop tools and automation frameworks, championing Infrastructure as Code (IaC) and Monitoring as Code (MaC) principles
Automate robust deployments and orchestrate end-to-end monitoring and alerting solutions
Participate in on-call rotations to support critical business and production systems
Lead root cause analysis of critical issues, driving improvements and preventing recurrence
Champion the success of SRE and DevOps initiatives, aligning technical decisions with business goals

Fulltime

Sr Principal Site Reliability Engineer (Sovereign Cloud)

Palo Alto Networks runs a large infrastructure and is one of the largest GCP cus...

Location

Bulgaria, Sofia

Salary:

Not provided

Palo Alto Networks

Expiration Date

Until further notice

Requirements

10+ years as an engineer in Infrastructure, Operations, DevOps, or System Engineering
7+ years building high availability, scalable cloud-native applications on AWS and GCP
BS or MS in Computer Science, a related field, or equivalent professional experience required
Expertise in configuration management with a framework such as Ansible, Terraform, Helm
Passion for infrastructure and monitoring as code
Solid experience in container workloads and Kubernetes
Familiarity with PKI and networking concepts
In-depth knowledge of different security controls (App-ID, User-ID, security profile, URL category, content, SSL decryption, firewall MFA, etc.)
Linux administration, internals, and network troubleshooting
Proficiency with programming languages like Golang or Python along with shell scripting to automate tasks

Job Responsibility

Contribute to the success of SRE and DevOps
Develop expertise in new technologies
Work with developers, researchers, data scientists, and security experts
Design, build and operate reliable, secure Cloud infrastructure
Ensure that applications are production-ready, scalable, and reliable
Develop tools and automation frameworks
Automate the deployment of robust services
Orchestrate end-to-end monitoring and alerting
Participate in on-call rotations to support critical business and production systems
Lead root cause analysis of critical business and production issues

Fulltime
