We are looking for a mid-senior SRE/DevOps Engineer (5–8 years) to build and scale a cloud-native, event-driven platform powering high-throughput logistics and fulfillment systems. This role will be responsible for establishing infrastructure foundations, CI/CD pipelines, observability, and system reliability, while working closely with backend, data, and architecture teams to ensure production stability and scalability.
Job Responsibilities:
Design and implement robust CI/CD pipelines (GitLab CI, Jenkins, or similar)
Enable automated build, test, and deployment workflows
Implement blue-green / canary deployments for zero-downtime releases
Ensure release traceability, rollback mechanisms, and deployment governance
Design, provision, and manage infrastructure on AWS (primary) and/or GCP
Build infrastructure using Infrastructure as Code (Terraform preferred)
Create reusable modules for scalable, secure, and standardized environments
Optimize cost, performance, and scalability of cloud resources
Deploy and manage applications using Docker & Kubernetes
Manage Kubernetes workloads using Helm charts
Implement auto-scaling, resource optimization, and high availability patterns
Ensure platform readiness for high-throughput microservices
Define and implement SLIs, SLOs, and SLAs
Drive improvements in system reliability, uptime, and performance
Lead incident response, debugging, RCA (root cause analysis), and postmortems
Build resilient systems with self-healing and fault-tolerant mechanisms
Implement end-to-end observability across services: metrics (Prometheus / Cloud Monitoring), logs (ELK / Kibana / Cloud Logging), and tracing (OpenTelemetry / Jaeger)
Build actionable alerting systems to reduce noise and improve response time
Enable faster production debugging and performance analysis
Support and scale event-driven architectures (Kafka, Pub/Sub, SQS/SNS, or similar)
Ensure reliability of asynchronous workflows and message processing systems
Work closely with backend teams to improve service resilience and fault handling, optimize event processing and throughput, and support a distributed microservices architecture
Work with PostgreSQL (RDS) on performance tuning, high availability and failover setups, and backup and recovery strategies
Collaborate with data teams supporting Snowflake / data pipelines (nice to have)
Drive production stabilization efforts for high-growth systems
Identify and resolve bottlenecks in performance and scalability
Improve MTTR (Mean Time to Recovery) and incident response efficiency
Enable platform readiness for scale and high transaction volumes
Implement secure DevOps practices
Manage IAM roles, secrets, and access controls
Ensure adherence to cloud security best practices
Requirements:
5–8 years of experience in DevOps / SRE roles
Strong hands-on experience with AWS (preferred) and/or GCP