We are seeking an experienced and highly motivated Staff Software Engineer to lead a new team dedicated to stabilizing and modernizing our Offline Testing Infrastructure (OTI). OTI is a critical, shared middle-layer infrastructure that underpins our PR testing, test creation, and Verification & Validation (V&V) efforts. This is a high-impact, high-urgency role focused on improving the velocity of our engineering teams and the reliability of our release cycles. The successful candidate will build and lead a small, focused team to transition OTI to a stable, performant, and scalable platform.
Job Responsibilities:
Lead the OTI Team: Serve as the technical lead (TL) for the OTI team within PIE-Compute, driving the strategic vision, execution, and long-term stability of the core infrastructure
Help Define and Optimize the Testing Ecosystem: Lead the design of the next-generation offline testing architecture to meet diverse team needs, reducing redundancy and siloing across the organization
Partner with Test Creation and Test Drive teams to standardize end-to-end test execution and reporting (Creation -> Execution -> Reporting)
Refine the full test lifecycle to ensure performance and scalability, and maintain clear attribution of failures to improve reliability and make debugging more efficient
Own Critical OTI Components and Migrations: Take ownership of the shared OTI components, including maintenance and on-call support
Own various offline test modalities, including step code, workflow code, and general health
Lead the maintenance and development of common OTI tooling, including launching test evaluations, polling APIs, communicating results, and providing recommended pipeline templates
Establish Architecture and Best Practices: Define and enforce data management policies for the testing ecosystem (storage, lifecycling, write strategies, data integrity, and lineage)
Define use cases and feature design for new test modalities, including single versus cross-modality testing strategies
Manage incidents related to offline tests and maintain Standard Operating Procedures (SOPs) for PRs, local workflows, V&V, and releases
Act as a Center of Excellence: Serve as a subject matter expert for optimizing the architecture and performance of Aurora's largest compute use case (offline testing), and provide high-value consulting/architecture support to adjacent teams
Requirements:
Senior or Staff-level experience (P7 equivalent) as a Software Engineer, ideally in infrastructure, developer tooling, or critical shared services
Proven experience leading technical projects and mentoring/directing other engineers
Familiarity with distributed compute technologies, cloud services (e.g., AWS), and large-scale workflow management systems
Demonstrated ability to triage and debug complex, cross-cutting infrastructure issues, including on-call and incident management
Strong communication skills to manage stakeholder alignment and drive cross-team standardization efforts