Site Reliability Engineer Job at Gamma (San Francisco)

Site Reliability Engineer

Engineer the future of global finance. At Citi, our Tech team doesn’t just suppo...

Location

Canada, Mississauga

Salary:

94300.00 - 141500.00 USD / Year

Citi

Expiration Date

Until further notice

Requirements

5–8 years of relevant experience in technical support, platform operations, or engineering
Exposure to architecture concepts with the ability to contribute to technical discussions and understand design decisions
Experience working with business partners, engineering teams, or technology stakeholders
Demonstrated experience supporting IT services, platform operations, or infrastructure components
Strong verbal and written communication skills, with the ability to document technical issues clearly
Experience supporting operational workstreams or participating in platform improvement initiatives
Participation in resilience‑related or stability‑focused activities preferred
Ability to collaborate effectively with cross‑functional teams
Strong organizational skills and ability to manage daily workload and task priorities
Working knowledge of Generative AI concepts preferred

Job Responsibility

Understand how application support functions within the broader technology organization and contributes to business objectives
Assist with vendor coordination and day‑to‑day interactions with offshore managed services
Support efforts to improve service levels, including participating in incident management, problem management, and knowledge‑sharing initiatives
Partner with development and engineering teams to support application stability and operational readiness
Assist in collecting capacity, performance, and latency data to support platform planning efforts
Support application onboarding activities using established guidelines and standards
Contribute to fostering a collaborative and supportive team environment that encourages skill development
Participate in cost‑efficiency initiatives such as Root Cause Analysis reviews, knowledge management, and performance tuning
Assist in preparing materials for business review meetings and help align technology activities with business needs
Follow established support processes and tool standards and provide input on improvement opportunities

Fulltime

Site Reliability Engineer

We are looking for a talented Site Reliability Engineer (SRE) with a strong back...

Location

United States, Parsippany

Salary:

Not provided

NTT DATA

Expiration Date

Until further notice

Requirements

Bachelor's degree in Computer Science, Engineering, or a related field
4+ years of experience in site reliability engineering or a similar role
Proficiency in Google Cloud services (Compute Engine, Kubernetes Engine, Cloud Storage, BigQuery, Pub/Sub, etc.)
Familiarity with Google BI and AI/ML tools (Looker, BigQuery ML, Vertex AI, etc.)
Experience with automation tools (Terraform, Ansible, Puppet)
Familiarity with CI/CD pipelines and tools (Azure Pipelines, Jenkins, GitLab CI, etc.)
Strong scripting skills (Python, Bash, etc.)
Knowledge of networking concepts and protocols
Service mesh experience a plus
Experience with monitoring tools (Prometheus, Grafana, etc.)

Job Responsibility

Ensure the reliability and uptime of critical services and infrastructure
Design, implement, and manage cloud infrastructure using Google Cloud services
Develop and maintain automation scripts and tools to improve system efficiency and reduce manual intervention
Implement monitoring solutions and respond to incidents to minimize downtime and ensure quick recovery
Work closely with development and operations teams to improve system reliability and performance
Conduct capacity planning and performance tuning to ensure systems can handle future growth
Create and maintain comprehensive documentation for system configurations, processes, and procedures

Fulltime

Site Reliability Engineer

Microsoft is a company where passionate innovators come to collaborate, envision...

Location

India, Bangalore

Salary:

Not provided

Microsoft Corporation

Expiration Date

Until further notice

Requirements

Master's Degree in Computer Science, Information Technology, or related field AND 1+ year(s) technical experience in software engineering, network engineering, or systems administration OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in software engineering, network engineering, or systems administration OR equivalent experience
Ability to meet Microsoft, customer and/or government security screening requirements is required for this role. These requirements include, but are not limited to, the following specialized security screenings: Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud background check upon hire/transfer and every two years thereafter.

Job Responsibility

Work with all aspects of a high-throughput, multi-tenant service
Collaborate effectively within the team and with partner teams across Microsoft
Be part of the on-call rotation for maintaining service health
Design, implement, and refine chosen solutions in close partnership with Product Management and partner teams
Champion operational excellence via established metrics, process governance, and policy controls for regular assessment and improvement
Document and define existing data engineering processes, data and technology, while evaluating them for optimization
System Reliability & Uptime – Ensuring high availability of services
Incident Management – Detecting, responding to, and mitigating system failures
Performance Monitoring – Tracking system health and resolving bottlenecks
Automation & Tooling – Reducing manual work through scripts and automation

Fulltime

Site Reliability Engineer

We are looking for a Lead Site Reliability Engineer (SRE) with strong experience...

Location

India, Bangalore

Salary:

Not provided

Karix

Expiration Date

Until further notice

Requirements

5+ years of experience in SRE / DevOps / Production Engineering roles
Strong expertise in troubleshooting distributed systems and microservices architecture
Hands-on experience with Kafka, RabbitMQ, and Redis
Strong knowledge of Kubernetes and container orchestration
Experience with CI/CD pipelines and deployment automation
Solid understanding of Linux, networking, and cloud platforms (AWS / Azure / GCP)
Experience with Infrastructure as Code (Terraform, Ansible)
Strong scripting skills (Python, Bash, or similar)
Database experience: MySQL / Oracle / MongoDB
Strong problem-solving, ownership mindset, and ability to lead initiatives

Job Responsibility

Lead troubleshooting and resolution of complex production issues in distributed systems
Drive reliability engineering practices, ensuring high availability and performance of systems
Manage and optimize messaging systems like Apache Kafka, RabbitMQ, and Redis
Architect, manage, and optimize Kubernetes clusters for scalability and resilience
Manage CI/CD pipelines and drive deployment automation
Implement and maintain monitoring, alerting, and observability using Prometheus, Grafana, and ELK stack
Lead incident management, root cause analysis (RCA), and post-mortem reviews
Mentor junior engineers and collaborate with cross-functional teams to improve system design and reliability

What we offer

Impactful Work: Play a key role in ensuring reliability and scalability of platforms that handle large-scale, real-time communication systems
Tremendous Growth Opportunities: Accelerate your career by leading critical reliability initiatives and working on high-scale distributed systems
Innovative Environment: Work in a fast-paced ecosystem that embraces automation, cloud-native technologies, and continuous improvement

Fulltime

Site Reliability Engineer

We are looking for an experienced Site Reliability Engineer to strengthen the re...

Location

United States, San Francisco

Salary:

Not provided

Robert Half

Expiration Date

Until further notice

Requirements

7+ years of experience in Site Reliability Engineering, infrastructure engineering, or a closely related production environment role
Strong background operating, supporting, and troubleshooting distributed systems at scale
Hands-on experience with observability platforms such as Datadog, Grafana, OpenTelemetry, CloudWatch, or similar tools
Proven involvement in on-call operations, incident management, and reliability-focused problem resolution
Practical experience defining and using SLIs, SLOs, and error budgets to guide service reliability decisions
Familiarity with AWS environments, including serverless and container-based architectures
Experience working with relational databases such as Postgres and performance analysis in production systems
Ability to write automation scripts or lightweight tooling in languages such as Python or Bash, with strong judgment around failure modes and system design

Job Responsibility

Establish measurable reliability standards for critical services by creating and maintaining service indicators, objectives, and error budget practices
Take ownership of production stability by monitoring uptime, latency, and availability, and driving improvements that reduce operational risk
Lead live incident response efforts, coordinate troubleshooting during outages, and ensure issues are resolved efficiently and thoroughly
Run blameless post-incident reviews, document findings clearly, and track corrective actions through completion
Design and enhance observability across logs, metrics, and distributed tracing using tools such as Datadog, CloudWatch, Grafana, OpenTelemetry, and Sentry
Improve alert quality and dashboard design so engineering teams can quickly identify meaningful system issues without unnecessary noise
Evaluate system behavior under load, uncover performance constraints, and recommend changes that improve scalability and resource efficiency
Build automation and internal tooling that streamline operational work, strengthen deployment safety, and support incident management, debugging, and capacity planning
Contribute to infrastructure and delivery workflows across AWS, Terraform, Ansible, Linux, and GitHub Actions with a focus on dependable releases and resilient systems
Partner with security and compliance stakeholders to support operational standards, audit readiness, and the integration of monitoring into broader engineering practices

What we offer

Medical, vision, dental, and life and disability insurance
Enrollment in company 401(k) plan

Site Reliability Engineer

Join us as a Site Reliability Engineer at Barclays, where you will play a pivota...

Location

United Kingdom, Glasgow

Salary:

Not provided

Barclays

Expiration Date

Until further notice

Requirements

Experience with Linux/Unix, Java applications, Oracle, SQL (PL/SQL, MySQL), and RDBMS, with exposure to writing queries, tuning concepts, data analysis, etc.
Experience of at least two of the following middleware technologies: JBoss, Apache Tomcat, GlassFish.
Experience of Cloud technologies, AWS, OpenShift/APIs.
Experience of monitoring tools (AppDynamics, ITRS) and scheduling tools (TWS and/or Autosys).

Job Responsibility

Apply software engineering techniques, automation, and best practices in incident response to ensure the reliability, availability, and scalability of systems, platforms, and technology.
Ensure the availability, performance, and scalability of systems and services through proactive monitoring, maintenance, and capacity planning.
Resolve, analyse, and respond to system outages and disruptions, and implement measures to prevent similar incidents from recurring.
Develop tools and scripts to automate operational processes, reducing manual workload, increasing efficiency, and improving system resilience.
Monitor and optimise system performance and resource usage, identify and address bottlenecks, and implement best practices for performance tuning.
Collaborate with development teams to integrate best practices for reliability, scalability, and performance into the software development lifecycle, and work closely with other teams to ensure smooth and efficient operations.
Stay informed of industry technology trends and innovations, and actively contribute to the organization's technology communities to foster a culture of technical excellence and growth.

What we offer

Competitive holiday allowance
Life assurance
Private medical care
Pension contribution

Fulltime

Site Reliability Engineer

Location

Malaysia, Kuala Lumpur

Salary:

15000.00 MYR / Month

Randstad

Expiration Date

August 21, 2026

Requirements

Cloud Infrastructure: AWS
Containerization: Kubernetes and Docker
Operating Systems: Linux and Unix Systems
Database Systems: Oracle Database
Programming/Scripting: Python, Java, or Go for automation scripting
Automation & Infrastructure Tools: Ansible and Terraform
Monitoring & Observability: Prometheus, Grafana, and Nagios
Integration: API and Networking Integration

Job Responsibility

Maintain continuous system monitoring and configure active alerts to prevent failures
Automate manual operational tasks, system monitoring, and infrastructure provisioning
Participate in deep-dive troubleshooting and rigorous post-mortem analysis to minimize downtime
Manage the technical resumption of high-priority, Service at Risk (S@R), and medium/high severity incidents within SLAs
Direct second- and third-level support teams and perform Root Cause Analysis (RCA)
Review system dependencies and manage changes, releases, and rollouts for minimal stability impact
Lead the team to actively achieve the organization's strict conduct, compliance, and market principles
Take end-to-end accountability for incident, problem, change, and risk management related to the production platform
Surface operational/security risks and provide monthly governance dashboards outlining trends and Service Improvement Plans (SIP)

Fulltime
