Lead SRE Job at Inetum (Lisbon)

Lead SRE

About the Role Deliver cloud-native solutions and patterns that are highly elast...

Location

United States of America , Fort Lauderdale

Salary:

Not provided

Beacon Hill

Expiration Date

Until further notice

Job Responsibility

Deliver cloud-native solutions and patterns that are highly elastic
Empower stakeholders and reduce toil through self-service pipelines
Mentor your team in solving deep technical issues, advanced cloud infrastructure topics, and complex coding problems
Set an example of methodical, systematic task execution for your team
Work with project managers and stakeholders to provide status and reporting
Act as an ambassador to other teams, finding common ground and defining clear agreements
Drive projects to schedule
Perform code reviews with an eye toward rigor and best practice
Apply continuous process improvement techniques across the operation
Automate everything

Fulltime

Lead SRE

We have a 6 month contract to hire for a senior, hands-on Site Reliability Engin...

Location

United States , St Louis

Salary:

Not provided

Zeektek

Expiration Date

Until further notice

Requirements

Bachelor's degree
AWS Certified DevOps Engineer – Professional
Dynatrace Professional
One SaaS tool certification (Prometheus Certified Associate (PCA), Datadog, New Relic)
7+ years in SRE/Production Engineering/Platform roles
2+ years leading initiatives or teams
Strong in Linux, networking fundamentals (HTTP, TLS, DNS, TCP), and distributed systems concepts
Proficiency with Go, Python, Shell Scripting, SQL, Java or JVM, JavaScript/TypeScript, YAML/HCL/JSON
Hands-on with IaC (Terraform) and CI/CD (GitLab CI, GitHub Actions, AWS/Azure DevOps)
Deep experience in AWS Cloud infrastructure

Job Responsibility

Lead SRE to drive reliability, scalability, observability (monitoring & alerts) and performance across the production platforms
Own the SLO/SLI strategy, modernize observability and incident response, and partner with application teams to deliver resilient systems
Define and govern SLOs/SLIs/error budgets for critical services
Enforce guardrails and drive reliability roadmaps
Lead performance-tuning collaboration with application teams to ensure high availability and low latency
Define and own infrastructure tuning to ensure scalability leading to high availability
Lead metrics- and automation-driven reliability
Debug systems across layers
Architect and evolve CI/CD and infrastructure-as-code (IaC, Terraform)
Design and build serverless APIs (Lambda, API Gateway, SQS, SNS, DynamoDB, etc.)

What we offer

Weekly Direct Deposit
401K Matching
Competitive medical, dental and vision insurance
Consistent communication throughout your project
ZeekTek Referral Program

Fulltime

Credit Risk Support Lead- SRE

Join Barclays as a Credit Risk Support Lead- SRE role, where to effectively moni...

Location

India , Pune

Salary:

Not provided

Barclays

Expiration Date

Until further notice

Requirements

14+ years’ experience in production support
High-energy, hands-on, and results- and goal-oriented
Expertise in log debugging, root cause analysis and troubleshooting live issues
Experience with observability tools like ESaaS, AppD / ITRS, Netcool
Experience in data analysis to identify underlying themes impacting stability, performance, and customer experience
Ensures and promotes ITIL best practices for Incident, Problem, Change, Release management (including managing and running triages, conducting root cause analysis, post-incident reviews, etc.)
Strong Credit Risk business knowledge
Negotiate SLAs/OLAs with customer and other support elements
Business (IT) Continuity Management
KPI reporting and monitoring

Job Responsibility

Provision of technical support for the service management function to resolve more complex issues for a specific client or group of clients. Develop the support model and service offering to improve the service to customers and stakeholders.
Execution of preventative maintenance tasks on hardware and software and utilisation of monitoring tools/metrics to identify, prevent and address potential issues and ensure optimal performance.
Maintenance of a knowledge base containing detailed documentation of resolved cases for future reference, self-service opportunities and knowledge sharing.
Analysis of system logs, error messages and user reports to identify the root causes of hardware, software and network issues, and providing a resolution to these issues by fixing or replacing faulty hardware components, reinstalling software, or applying configuration changes.
Automation, monitoring enhancements, capacity management, resiliency, business continuity management, front office specific support and stakeholder management.
Identification and remediation or raising, through appropriate process, of potential service impacting risks and issues.
Proactively assess support activities implementing automations where appropriate to maintain stability and drive efficiency. Actively tune monitoring tools, thresholds, and alerting to ensure issues are known when they occur.

What we offer

Competitive holiday allowance
Life assurance
Private medical care
Pension contribution

Fulltime

Credit Risk Support Lead- SRE

Embark on a transformative journey as a Credit Risk Support Lead-SRE. At Barclay...

Location

United States , Whippany

Salary:

150000.00 - 215000.00 USD / Year

Barclays

Expiration Date

Until further notice

Requirements

Good domain knowledge with end-to-end responsibility of IT services, including day-to-day operations, incidents and changes
Robust understanding of regulatory compliance, risk frameworks, audit, and metric monitoring of service health and control effectiveness
Overseeing support teams, effective delegation, and communication with business users and senior stakeholders
Ability to prioritize issues, refine support procedures, and drive continuous improvement across RTB and support processes
Solid understanding of the software development lifecycle and how application support integrates to enhance delivery, stability, and reliability

Job Responsibility

Provision of technical support for the service management function to resolve more complex issues for a specific client or group of clients
Develop the support model and service offering to improve the service to customers and stakeholders
Execution of preventative maintenance tasks on hardware and software and utilisation of monitoring tools/metrics to identify, prevent and address potential issues and ensure optimal performance
Maintenance of a knowledge base containing detailed documentation of resolved cases for future reference, self-service opportunities and knowledge sharing
Analysis of system logs, error messages and user reports to identify the root causes of hardware, software and network issues, and providing a resolution to these issues
Automation, monitoring enhancements, capacity management, resiliency, business continuity management, front office specific support and stakeholder management
Identification and remediation or raising, through appropriate process, of potential service impacting risks and issues
Proactively assess support activities implementing automations where appropriate to maintain stability and drive efficiency
Actively tune monitoring tools, thresholds, and alerting to ensure issues are known when they occur

What we offer

Medical coverage
Dental coverage
Vision coverage
401(k)
Life insurance
Paid leave
Incentive award
Competitive holiday allowance
Life assurance
Private medical care

Fulltime

Site Reliability Engineering (SRE) / Lead Engineer

We are currently seeking a Site Reliability Engineering (SRE) / Lead Engineer to...

Location

Mexico , Guadalajara

Salary:

Not provided

NTT DATA

Expiration Date

Until further notice

Requirements

8-10+ years of experience in SRE, Observability, or DevOps roles, with leadership responsibilities
Hands-on experience with OpenTelemetry for distributed tracing and observability instrumentation
Proven expertise with Application Performance Monitoring (APM) tools such as New Relic, Datadog, AppDynamics, or Dynatrace
Strong proficiency in Infrastructure as Code (IaC) using Terraform
Solid understanding of cloud platforms including AWS, GCP, or Azure
Experience with automation/configuration management tools like Ansible, Chef, or Puppet
Deep knowledge of CI/CD pipelines and tools such as GitHub Actions, Jenkins, or Azure DevOps
Experience managing Kubernetes and containerized environments (Docker, Helm)
Familiarity with log aggregation and analysis platforms like ELK Stack or Splunk
Excellent leadership, communication, and collaboration skills

Job Responsibility

Lead the strategic development and management of observability and reliability frameworks across the organization, ensuring alignment with business goals and technical requirements
Design and implement monitoring and observability solutions, collaborating with engineering teams to define standards and best practices
Manage Infrastructure as Code (IaC) initiatives using Terraform, coordinating with cloud and infrastructure teams to ensure scalable and secure deployments
Drive automation strategies for monitoring, alerting, and logging pipelines, focusing on process improvements and operational efficiency
Develop and maintain comprehensive observability roadmaps, including distributed tracing, logging, and metrics collection strategies
Collaborate with product management, sales, and pre-sales teams to provide technical expertise and support during solution design and customer engagements
Lead cross-functional teams to enhance CI/CD pipelines and deployment reliability, ensuring smooth integration of observability tools and practices
Engage with vendors and strategic partners to evaluate, select, and integrate observability and monitoring solutions, ensuring alignment with organizational needs and fostering strong collaborative relationships
Mentor and develop junior engineers and analysts, fostering a culture of reliability, observability, and operational excellence

Fulltime

SRE Lead Design & Support Engineer

This is a critical enabler achieving a high resiliency during operations and als...

Location

Mexico , Miguel Hidalgo

Salary:

Not provided

Pepsico

Expiration Date

Until further notice

Requirements

8+ years of work experience evolving into an SRE engineer
3-5 years of experience in continuously improving and transforming IT operations ways of working
Bachelor’s degree in Computer Science, Information Technology or a related field
Proven experience as an SRE in designing event diagnostics, performance measures, and alerting solutions to meet SLAs/SLOs/SLIs
Highly quantitative, with great judgment; able to connect dots across ecosystems and work efficiently across teams
Strong expertise in SRE (Site Reliability Engineering) and IT Service Management (ITSM) processes
Hands-on experience in Python, SQL/NoSQL (MySQL, MongoDB, Cassandra, PostgreSQL), AppDynamics, ELK Stack, Grafana, Splunk, Dynatrace, Kafka, and any SRE Ops toolsets
A firm understanding of cloud architecture for distributed environments
Front-end technologies: HTML, CSS, JavaScript, and frameworks like React, Angular, or Vue.js
Back-end technologies: server-side languages and frameworks (Java, Spring Boot, and related technologies) that build the server-side logic, APIs, and database interaction with MySQL, MongoDB, Cassandra, Couchbase

Job Responsibility

Drive new shift-left activities critical to applying Site Reliability Engineering (SRE) and quality assurance principles within the application design / project roadmap that enable resilient outcomes
Apply a pre-emptive approach in production to minimize business impact, using SRE-driven orchestration that connects all components of the ecosystem, diagnosing anomalies before they reach users and remediating through automation
Ensure ecosystem availability and performance in production environments, proactively preventing P1s, P2s, and potential P3s
Engage and influence product and engineering teams during the design and development phases to embed reliability and operability into new services, defining and enforcing event, logging, monitoring, and observability standards across applications
Accountable for ensuring non-functional requirements (NFRs), including SLAs/SLOs/SLIs and error budgets, are embedded early into the product’s offerings as part of the engineering solution
Lead the team in diagnosing anomalies before any user is affected and in driving the necessary remediations across the teams involved in end-to-end ecosystem availability, performance, and consumption of the cloud-architected application ecosystem, leveraging SRE orchestration solutions
Collaborate with engineering and support teams, including participation in escalations and blameless postmortems
Work closely with customer-facing support teams to empower them with SRE insights and tooling
Observe, diagnose, and improve the end-to-end ecosystem performance of the modern-architected application portfolio, i.e. the technical “understanding of interactions” of a full-stack application, alongside peer SRE team members
Continuously optimize the L2/support operations work via SRE workflow automation

What we offer

Opportunities to learn and develop every day through a wide range of programs
Internal digital platforms that promote self-learning
Development programs according to Leadership skills
Specialized training according to the role
Learning experiences with internal and external providers
Recognition programs for seniority, behavior, leadership, moments of life, among others
Financial wellness programs that will help you reach your goals in all stages of life
A flexibility program that will allow you to balance your personal and work life, adapting your working day to your lifestyle
Wellness Line, thousands of Agreements and Discounts, Scholarship programs for your children, Aid Plans for different moments of life

Lead Mainframe SRE

Location

Greece , Athens

Salary:

Not provided

NTT DATA

Expiration Date

Until further notice

Requirements

Bachelor’s or Master’s degree in Computer Science, Engineering, Information Systems, or related field (or equivalent experience)
6-8 years of experience in relevant roles
Extensive experience managing enterprise middleware and batch-processing platforms within large-scale production environments
Strong expertise in re-hosted middleware services and production-control ecosystems, including technologies such as MQ, CICS/TX, Stonebranch UAC, scheduling platforms, and cross-platform operational tooling
Strong experience supporting Linux-hosted middleware environments, DB2 LUW interfaces, queue and log monitoring, and disaster recovery operations
Strong understanding of operational standards, root cause analysis, change-window governance, service continuity, and platform resilience practices
Experience leading modernization, migration, or decommissioning initiatives within critical enterprise platforms
Proven ability to lead small technical teams and coordinate multiple stakeholders, vendors, and operational groups
Strong stakeholder-management and communication skills, with the ability to gather requirements and align technical solutions with business needs
Comfortable operating within highly critical, governance-driven, and cross-functional enterprise environments

Job Responsibility

Own the end-to-end technical service for re-platformed middleware and production-control platforms across production, non-production, and disaster recovery environments
Lead and mentor a small team of engineers, providing technical leadership, operational guidance, and coordination of day-to-day platform activities
Act as the primary technical point of contact for stakeholders, client representatives, and operational teams, gathering requirements and translating business needs into reliable technical solutions
Lead Stonebranch UAC and batch operations activities, including scheduling standards, post-batch support, bundle promotion governance, and resolution of cross-domain batch-processing issues
Drive service improvements, platform upgrades, cluster and scheduler health initiatives, and operational standards across highly critical enterprise services
Act as the command point for high-severity incidents, out-of-hours change windows, root cause analysis activities, and cross-team operational coordination
Coordinate closely with IBM, Stonebranch, infrastructure teams, and application owners to ensure operational continuity and service resilience
Support disaster recovery readiness, service recovery testing, and operational governance activities across the platform landscape
Guide the controlled migration, modernization, or decommissioning of legacy batch-processing services while protecting production stability
Define and maintain standards for availability, operational excellence, monitoring, service governance, and platform reliability

What we offer

Health insurance for the employee and one dependent family member (100% paid by NTT DATA)
Meal vouchers of 120€ per month (x12)
Corporate mobile phone: subscription & device
Teleworking equipment allowance
Internal Trainings Platform Account
Access to Open Up mental health service
28 days of paid annual leave consisting of your legal holidays and compensation days

SRE Team Lead (FedRAMP / Security)

Coralogix is a modern, full-stack observability platform transforming how busine...

Location

United States , Los Angeles

Salary:

230000.00 - 270000.00 USD / Year

Coralogix

Expiration Date

Until further notice

Requirements

2+ years of experience as a Team Lead / Tech Lead
At least 5 years of experience as a DevOps Engineer/ SRE in production environments
At least 2 years of experience with FedRAMP compliance (High/Moderate levels), vulnerability management, and continuous monitoring, including scanning, patching, and reporting - Advantage
In-depth experience with Kubernetes - operating & monitoring are key parts
High familiarity with monitoring tools such as Coralogix, Grafana, Prometheus
Experience in AWS or other cloud providers
Experience with infrastructure as code (Terraform, Crossplane, etc.)
Understanding of networking - from networking layers to different networking protocols (HTTP, gRPC, SSL)
Some software engineering experience, preferably in Golang
An advantage - operating data pipelines

Job Responsibility

Lead and mentor a team of engineers, including hiring, onboarding, and performance management
Work in high scale environments - Coralogix data pipeline processes 55Tb of data each day
Adopt cutting edge technologies with end-to-end responsibility
Building internal tools to expand our platform capabilities
Collaborate with R&D to improve stability & reliability of the system
Lead the product roadmap - our product is designed for engineers. Therefore, our engineers promote, enhance, and take a crucial part in influencing the product roadmap
Perform operational duties for FedRAMP cloud products, including deployments, on-call support, and incident management

What we offer

Comprehensive and inclusive employee benefits for healthcare, dental, and mental health
401(k) plan and match
Paid sick time and paid time off

Fulltime