Software Architect, Reliability Engineering

United States · 227840.00 - 335000.00 USD / Year · Job Posted March 19, 2026

Job Description

As an Architect in SRE, you will drive the technical strategy, vision and outcomes for Twilio’s Reliability Engineering organization. You will define and lead solutions and initiatives that ensure Twilio products are reliable worldwide, and you will define standards and guide engineering teams on best practices for designing, building, and operating resilient systems. This role is pivotal to Twilio’s commitment to operational excellence, scalability, and pragmatic, large-scale systems design in the cloud.

Job Responsibility

  • Partner with senior technical leaders across Twilio to set and communicate the reliability strategy, translating business goals into measurable outcomes
  • Influence company-wide architectural decisions while balancing long-term vision with near-term and compliance needs
  • Lead the design, implementation, and operation of scalable solutions and paved roads that enable reliable, high-traffic services
  • Influence company-wide architectural decisions to focus on availability, performance, resilience, and cost efficiency using Kubernetes, AWS, Terraform, and modern observability
  • Ensure integrity and quality across the service lifecycle: design fault-tolerant architectures, incident response, disaster recovery, and capacity/cost management
  • Collaborate with product and cross-functional teams to identify reliability risks and convert them into actionable designs, programs, and tooling
  • Establish and champion reliability practices and drive systemic improvements
  • Mentor and grow engineers and technical leaders
  • Track and apply emerging SRE, cloud, and large-scale systems best practices; introduce pragmatic innovations that improve reliability at scale

Requirements

  • 15+ years of experience in Reliability Engineering, Software Engineering, or DevOps roles with a focus on infrastructure, backend systems, and reliability, including experience as a principal engineer or architect
  • Strong experience in driving strategic technical decisions and defining long-term technical vision
  • In-depth understanding of the role of Reliability Engineering in a large and diverse SaaS organization
  • Experience driving cross-org technical architecture outcomes
  • Knowledge of cloud architecture, DevOps practices, and large-scale systems design with microservices
  • Bachelor's or Master's degree in Computer Science, Engineering, or a related field (or equivalent experience)
  • Strong production experience, including operational management, scaling, partitioning strategies, and tuning for performance and reliability in high-scale environments
  • Hands-on experience with Kubernetes (e.g., EKS), deploying and managing stateful services, and cloud services like AWS
  • Proficiency in infrastructure-as-code tools such as Terraform or CloudFormation for automating infrastructure
  • Expertise in observability tools (e.g., Prometheus, Grafana, Datadog) for monitoring distributed systems and setting up alerting
  • Proficient in at least one programming language (e.g., Go, Python, Java) for building automation and tooling
  • Experience designing incident response processes, SLOs/SLIs, runbooks, and participating in on-call rotations
  • Experience running cross-functional post-incident reviews and driving improvements
  • Strong understanding of distributed systems principles, including consensus, durability, throughput, and availability tradeoffs
  • Proven track record of leading reliability improvements in data-intensive or mission-critical systems and collaborating with engineering teams
  • Excellent problem-solving, analytical, verbal, and written communication skills, with the ability to work in cross-functional and distributed environments
  • Demonstrated leadership in mentoring teams, influencing decisions, and balancing long-term objectives with short-term needs
  • Ability to influence and build effective working relationships with all levels of the organization

Nice to have

  • Specific experience owning and operating large AWS footprints
  • Knowledge of Kubernetes architecture and concepts
  • Experience with data technologies like Apache Kafka, AWS MSK, or similar for reliable streaming
  • Passion for building reliable products, with prior projects in high-availability systems

What we offer

  • competitive pay
  • generous time off
  • ample parental and wellness leave
  • healthcare
  • a retirement savings program
  • equity plan
  • corporate bonus plan
  • health care insurance
  • 401(k) retirement account
  • paid sick time
  • paid personal time off
  • paid parental leave

Similar Jobs for Software Architect, Reliability Engineering

8 matching positions

Principal Software Engineering Architect

Step into a role where your ideas spark innovation and your impact is demonstrat...
Location: United States, Redmond
Salary: 142800.00 - 274800.00 USD / Year
Microsoft Corporation
Expiration Date: Until further notice
Requirements
  • Bachelor's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
  • 5+ years of experience designing and operating large-scale enterprise services, including production systems
  • Experience building and operating large-scale infrastructure and network management systems
  • Experience with Infrastructure as Code (IaC) tools (e.g., Terraform, ARM, CloudFormation) to automate deployment and configuration
  • Experience designing resilient, secure, and highly available architectures in cloud or hybrid environments
  • Experience applying AI/ML or generative AI technologies (e.g., LLMs) to real-world engineering problems
  • Experience building solutions from concept to production
  • Experience improving monitoring, observability, and incident response for mission-critical systems
Job Responsibility
  • Partner with stakeholders to define user requirements across key scenarios, with an emphasis on AI-driven operations, intelligent automation, and agent-enabled user experiences
  • Lead the identification of dependencies and drive the development of design documents for a product, application, service, or platform, incorporating AI-first and agentic architectures that enable autonomous operations and continuous optimization
  • Mentor others to write and review high-quality, maintainable, and extensible code, while embedding AI-assisted development practices and enabling engineers to effectively leverage copilots and intelligent agents
  • Collaborate with cross-functional teams to drive project plans, release plans, and execution, integrating AI-powered insights and agent-driven workflows to accelerate delivery and improve decision making
  • Take end-to-end ownership of services as a Designated Responsible Individual (DRI), including on-call responsibilities, while advancing autonomous operations through agent-based monitoring, incident detection, and response to improve reliability and resilience
  • Continuously learn and apply new technologies and best practices to improve availability, scalability, and operational excellence, driving adoption of AI-driven observability, predictive insights, and self-healing systems at scale
  • Embody our culture and values.
  • Fulltime

Staff Engineer, Software Reliability Engineering

We are seeking a Staff Engineer to join our dynamic team in Bengaluru, India. In...
Location: India, Bengaluru
Salary: Not provided
Sandisk
Expiration Date: Until further notice
Requirements
  • Bachelor's degree in CSE or ECE or EEE, Software Engineering, or related field
  • Master's degree preferred
  • 5 years of software development experience in Python scripting and test case development
  • Advanced proficiency in programming languages such as Java, Python, or C++
  • Proficient in version control systems, preferably GitHub
  • Solid understanding of software architecture and design patterns
  • Experience with API development and integration
  • Strong skills in performance optimization and debugging
  • Experience with Agile methodologies and full software development lifecycle
  • Excellent problem-solving and analytical skills
Job Responsibility
  • Architect, design, and implement a high-performance, scalable test suite for reliability testing
  • Collaborate with cross-functional teams to define and implement new features and products
  • Lead code reviews and provide mentorship to junior developers
  • Optimize test performance and ensure high-quality, efficient code
  • Troubleshoot and resolve complex technical issues
  • Stay current with emerging technologies and industry trends, recommending improvements to our technology stack
  • Contribute to the development of technical standards and best practices
  • Participate in Agile ceremonies and help drive continuous improvement in our development processes
  • Fulltime

Engineering Manager - Observability & Reliability Engineering Obsession

We are looking for an Engineering Manager to join the OREO (Observability Reliab...
Location: France, Paris
Salary: Not provided
Doctolib
Expiration Date: Until further notice
Requirements
  • 5+ years of software engineering or SRE experience, with a strong technical background in cloud-native environments (preferably AWS, GCP, and/or Kubernetes-based)
  • 3+ years of engineering management experience, leading technical teams (ideally SRE, platform, or infrastructure teams)
  • Deep understanding of observability tooling and architecture (Fluent Bit, OpenTelemetry, Loki, Elasticsearch, Prometheus, Thanos, Datadog)
  • Experience with infrastructure as code (Terraform, OpenTofu) and secrets management systems (Vault, AWS Secrets Manager)
  • Proven ability to balance technical depth with people leadership, able to mentor engineers, review technical designs, and guide architectural decisions
Job Responsibility
  • Lead, coach, and grow a team of Site Reliability Engineers, supporting their technical development and career progression
  • Create a culture of operational excellence, continuous improvement, and psychological safety within the team
  • Conduct regular 1:1s, performance reviews, and career development conversations
  • Recruit, onboard, and retain top SRE talent aligned with Doctolib's mission and values
  • Partner with SREs and senior engineers to define and evolve the observability strategy across the platform, focusing on logging, metrics, tracing, and alerting
  • Own the strategy and evolution of critical transversal services including HashiCorp Vault and Terraform Enterprise
  • Drive prioritization and roadmap planning for large-scale reliability and observability initiatives
  • Ensure alignment between team objectives and broader engineering and business goals
  • Advocate for and allocate resources toward reducing technical debt and improving developer experience
  • Own the team's on-call experience and contribute to the incident response processes, ensuring sustainable practices and continuous improvement
What we offer
  • Free comprehensive health insurance for you and your children
  • Parent Care Program: receive one additional month of leave on top of the legal parental leave
  • Free mental health and coaching services through our partner Moka.care
  • For caregivers and workers with disabilities, a package including an adaptation of the remote policy, extra days off for medical reasons, and psychological support
  • Work from EU countries and the UK for up to 10 days per year, thanks to our flexibility days policy
  • Work Council subsidy to refund part of sport club membership or creative class
  • Up to 14 days of RTT
  • Lunch voucher with Swile card
  • Fulltime

Engineering Manager - Observability & Reliability Engineering Obsession

We are looking for an Engineering Manager to join the OREO (Observability Reliab...
Location: Germany, Berlin
Salary: Not provided
Doctolib
Expiration Date: Until further notice
Requirements
  • 5+ years of software engineering or SRE experience, with a strong technical background in cloud-native environments (preferably AWS, GCP, and/or Kubernetes-based)
  • 3+ years of engineering management experience, leading technical teams (ideally SRE, platform, or infrastructure teams)
  • Deep understanding of observability tooling and architecture (Fluent Bit, OpenTelemetry, Loki, Elasticsearch, Prometheus, Thanos, Datadog)
  • Experience with infrastructure as code (Terraform, OpenTofu) and secrets management systems (Vault, AWS Secrets Manager)
  • Proven ability to balance technical depth with people leadership, able to mentor engineers, review technical designs, and guide architectural decisions
Job Responsibility
  • Lead, coach, and grow a team of Site Reliability Engineers, supporting their technical development and career progression
  • Create a culture of operational excellence, continuous improvement, and psychological safety within the team
  • Conduct regular 1:1s, performance reviews, and career development conversations
  • Recruit, onboard, and retain top SRE talent aligned with Doctolib's mission and values
  • Partner with SREs and senior engineers to define and evolve the observability strategy across the platform, focusing on logging, metrics, tracing, and alerting
  • Own the strategy and evolution of critical transversal services including HashiCorp Vault and Terraform Enterprise
  • Drive prioritization and roadmap planning for large-scale reliability and observability initiatives
  • Ensure alignment between team objectives and broader engineering and business goals
  • Advocate for and allocate resources toward reducing technical debt and improving developer experience
  • Own the team's on-call experience and contribute to the incident response processes, ensuring sustainable practices and continuous improvement
What we offer
  • Free comprehensive health insurance for you and your children
  • Parent Care Program: receive one additional month of leave on top of the legal parental leave
  • Free mental health and coaching services through our partner Moka.care
  • For caregivers and workers with disabilities, a package including an adaptation of the remote policy, extra days off for medical reasons, and psychological support
  • Work from EU countries and the UK for up to 10 days per year, thanks to our flexibility days policy
  • Work Council subsidy to refund part of sport club membership or creative class
  • Up to 14 days of RTT
  • Lunch voucher with Swile card
  • Fulltime

Lead Systems Software Architect

Roku is changing how the world watches TV. Roku is the #1 TV streaming platform ...
Location: United States, Austin
Salary: Not provided
Roku
Expiration Date: Until further notice
Requirements
  • 15+ years of industry experience in embedded systems-level software development
  • Strong experience with embedded Linux or Android-based systems
  • Proficiency in one or more systems programming languages such as C/C++ (Rust or similar is a plus)
  • Deep understanding of ARM-based SoCs, multimedia pipelines, and system constraints
  • Experience with DRM, content protection, secure boot
  • Experience collaborating with SoC vendors and ODM/OEM partners
  • Experience with NPU/DSP/AI accelerator blocks on embedded SoCs
  • Ability to build or integrate end-to-end flows where AI is in the loop
  • Proficient in using AI tools for debugging, code review, test selection, and log analysis
  • Strong communication skills
Job Responsibility
  • Own complex features or subsystems end-to-end, from design and implementation through bring-up, validation, and production support
  • Translate product and business goals into concrete designs, tasks, and implementation plans
  • Design, implement, and maintain core platform software for Roku device programs and platforms
  • Contribute to and influence hardware–software partitioning, platform APIs, and integration patterns
  • Drive and model best practices for coding standards, code reviews, testing strategies, and CI/CD
  • Implement and optimize video/audio pipelines, codecs, and rendering paths
  • Contribute to end-to-end multimedia system design for TVs and streaming devices
  • Define and help maintain benchmarks and test scenarios for media, graphics, and system behavior
  • Implement and maintain secure boot, DRM integrations, and content protection features
  • Lead the product evaluation and enablement of candidate SoCs and companion chipsets
What we offer
  • Global access to mental health and financial wellness support and resources
  • Healthcare (medical, dental, and vision)
  • Life, accident, disability, commuter, and retirement options (401(k)/pension)
  • Time off in accordance with local leave policies
  • Fulltime

Digital Software Engineering Lead Analyst – Vice President

The Digital S/W Engineer Lead Analyst is a lead-level professional role. This in...
Location: India, Pune
Salary: Not provided
Citi
Expiration Date: Until further notice
Requirements
  • 7+ years of progressive software development experience, demonstrating expert-level proficiency in JavaScript and Java frameworks (e.g., React.js, Spring Boot), and databases (e.g., Oracle, MongoDB, PostgreSQL)
  • Expert in Modern Application Architecture: Mastery of modern application architecture principles, including microservices, event-driven architectures, serverless, and cloud-native patterns
  • Deep expertise in Data Structures, Algorithms, and Object-Oriented Design Principles with Java
  • Proven leadership in leveraging and integrating Artificial Intelligence (AI) and Machine Learning (ML) tools to optimize development workflows, enhance code quality, and drive intelligent features
  • Extensive experience with Microservices frameworks (e.g., Spring Boot, Quarkus), Event-Driven Services (e.g., Kafka, RabbitMQ), and advanced Cloud-Native Application Development (AWS, Azure, GCP)
  • Multiple years of experience leading the design and implementation of Service-Oriented and Microservices architectures, including advanced REST, GraphQL, and gRPC implementations
  • Full Stack Architecture & Leadership: Demonstrated ability to architect, design, develop, and maintain complex, enterprise-grade full-stack solutions, encompassing both front-end and back-end components of robust web applications, with an emphasis on scalability and performance
  • Front-End Expertise: Expert-level proficiency in designing and developing highly intuitive, performant, and accessible user interfaces using cutting-edge JavaScript frameworks (e.g., React, Angular, Vue), advanced HTML5, and CSS (e.g., SASS/LESS, CSS-in-JS)
  • Back-End Mastery: Extensive experience in architecting and developing scalable server-side logic and sophisticated APIs using languages such as Java, Python, or similar, with a focus on high-throughput and low-latency systems
  • Advanced Database & Data Architecture Expertise: Comprehensive knowledge of SQL and PL/SQL, with a deep understanding of Relational Database Management Systems (RDBMS), particularly Oracle, including advanced database design, performance tuning, data warehousing, and NoSQL databases
Job Responsibility
  • Strategic Technical Leadership: Provide expert guidance and strategic oversight across the entire software development lifecycle, partnering continuously with senior stakeholders to align technical solutions with business objectives
  • Architectural Stewardship: Lead the design and evolution of robust, scalable, and secure enterprise applications, defining architectural patterns and ensuring adherence to best practices in cutting-edge technologies and software design patterns
  • Team & Project Leadership: Drive complex engineering initiatives within Agile delivery teams, fostering a culture of collaboration, excellence, and continuous improvement. Lead sprint goal achievement, oversee code quality, and actively participate in and lead broader Citi technical communities and advanced Agile/Scrum processes
  • Mentorship & Coaching: Act as a technical mentor and coach for junior and intermediate engineers, fostering their growth, critical thinking, and advanced problem-solving capabilities
  • Advanced Problem Solving & Troubleshooting: Exhibit mastery in analyzing and resolving intricate coding, application performance, and design challenges. Lead cross-functional efforts to diagnose and troubleshoot complex system issues
  • Proactive Root Cause Analysis: Spearhead thorough investigations to identify systemic root causes of development and performance bottlenecks, leading the implementation of comprehensive, long-term defect resolutions and preventative measures
  • Technical Vision & Acumen: Demonstrate a profound and forward-looking understanding of technical requirements, emerging trends, and their strategic implications for solutions under development, ensuring future-proof designs
  • Containerization, Orchestration & Cloud Strategy: Drive the strategic adoption and optimization of Docker for application containerization, Kubernetes for efficient service orchestration, and other cloud-native technologies to build resilient and scalable infrastructure
  • Communication, Risk & Stakeholder Management: Master effective communication of progress, proactively anticipate and mitigate technical and project bottlenecks, provide expert escalation management, and adeptly identify, assess, track, and manage issues and risks at strategic and operational levels
  • Process and System Optimization: Champion and lead initiatives to streamline, automate, and eliminate redundant processes within architecture, build, delivery, production operations, and across various business areas, driving significant efficiency gains and innovation
  • Fulltime

Director, Site Reliability Engineering

As our Director of Infrastructure platform, you will be a key driver of Doctolib...
Location: France, Paris
Salary: Not provided
Doctolib
Expiration Date: Until further notice
Requirements
  • 12+ years in software engineering, including 6+ years leading large (30+) distributed, international platform or infrastructure teams
  • Proven experience driving platform-as-a-product transformations and modularizing large monolithic architectures at scale
  • Demonstrated ability to architect, deliver, and operate secure, reliable, and scalable developer platforms in SaaS, multi-product, or regulated environments
  • Strong process orientation: experience implementing OKRs, robust monitoring/observability, and best-in-class incident management
  • Measurable impact on developer productivity, platform adoption, reliability, and cost-efficiency
  • Effective communicator and influencer, with the ability to align and inspire cross-functional stakeholders
  • Experience leading change and building high-performing, people-first engineering cultures
  • Fluent in English and comfortable in fast-paced, international environments
Job Responsibility
  • Lead and scale a high-performing infrastructure organization of 30+ engineers across Infrastructure, Automation, SRE, and Database teams, while maintaining strong engagement and fostering a culture of excellence and ownership
  • Own the infrastructure platform strategy and roadmap that enables Doctolib's modularization journey, delivers on company OKRs, and ensures predictable execution across all infrastructure and automation initiatives
  • Champion platform-as-a-product by building self-service capabilities (infrastructure provisioning, CI/CD, observability, database management) that transform developer experience and unlock team autonomy across the engineering organization
  • Be the guardian of quality and reliability by establishing world-class incident management, driving measurable improvements in availability and performance, and ensuring infrastructure components operate at the highest standards of security and resilience
  • Accelerate engineering velocity by reducing platform friction, enabling faster modularization, and leveraging AI-augmented development tools to multiply productivity across feature teams
  • Drive the infrastructure transformation from monolith-supporting infrastructure to a modular, multi-service platform architecture - enabling international expansion, product velocity, and operational excellence at scale
  • Act as a senior technical leader within the Platform organization and broader Tech leadership team, bringing strong technical opinions and challenging architectural decisions while clearly articulating how infrastructure investments contribute to company strategy and business outcomes
What we offer
  • Free comprehensive health insurance for you and your children
  • Parent Care Program: receive additional leave on top of the legal parental leave
  • Free mental health and coaching services through our partner Moka.care
  • For caregivers and workers with disabilities, a package including an adaptation of the remote policy, extra days off for medical reasons, and psychological support
  • Work from abroad for up to 10 days per year thanks to our flexibility days policy
  • Work Council subsidy to refund part of sport club membership or creative class
  • Up to 14 days of RTT
  • Lunch voucher with Swile card
  • Fulltime

Principal AI Software Architect

Do you want to be at the forefront of innovating the latest hardware designs to ...
Location: United States, Redmond
Salary: 139900.00 - 274800.00 USD / Year
Microsoft Corporation
Expiration Date: Until further notice
Requirements
  • Bachelor's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, PyTorch, CUDA/Triton
  • Ability to meet Microsoft, customer and/or government security screening requirements
  • Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud Background Check upon hire/transfer and every two years thereafter.
Job Responsibility
  • Leads by example across teams and mentors others to produce extensible, maintainable, well-tested, secure, and performant code used across products that adheres to design specifications
  • Leads efforts to continuously improve code performance, testability, maintainability, effectiveness, and cost, while learning about and accounting for relevant trade-offs
  • Identifies best practices and coding patterns (e.g., leveraging state-of-the-art generative artificial intelligence [GenAI], approaches to source code organization, naming conventions) and provides deep expertise in the coding and validation strategy
  • Creates and applies metrics to drive code quality and stability, appropriate coding patterns, and best practices
  • Identifies and anticipates blockers or unknowns during the development process, escalates them, communicates how they will impact timelines, and then leads efforts to identify and implement strategies and/or opportunities to address them
  • Reviews product code and test code to ensure it meets team standards, contains the correct test coverage, and is appropriate for the product or solution area
  • Brings insight to code reviews to help improve code quality, coaching and providing feedback to develop other engineers' skills
  • Conducts code reviews in a timely fashion that helps accelerate the pace of development on the team. Considers diagnosability, reliability, testability, and maintainability when reviewing code, and understands when code is ready to be shared or delivered
  • Applies and reviews for coding patterns, security risks, compliance issues, and best practices in code reviews, providing feedback on code to drive adherence to best practices
  • Uses automated source code analysis tools that are incorporated into the build/development process
  • Fulltime