We are seeking an experienced and versatile professional with expertise in validation strategy, automation, and quality for AI/ML model serving, GPU software stacks, device drivers, firmware, and cross-platform systems (Linux/Windows). You will build test frameworks, drive CI quality gates, perform performance and reliability testing, and lead cross-stack triage to ensure robust releases in a rapidly evolving environment.
Job Responsibilities:
Own end-to-end test strategy for AI/ML workflows (PyTorch, vLLM), GPU runtimes, drivers, and firmware across kernel and user space
Develop scalable automation frameworks spanning unit, integration, HIL (hardware-in-the-loop), system, and end-to-end tests
Implement and maintain CI quality gates (GitHub Actions/Workflows, Jenkins), including automated build, test execution, artifact management, reporting, and flake reduction
Design and execute performance, stress, reliability, soak, and long-haul tests targeting GPU compute, memory, I/O, and serving throughput/latency
Create reproducible environments with containers/orchestration; instrument telemetry and observability for data-driven QA
Apply agentic AI techniques to accelerate test generation, triage, and root cause analysis; integrate intelligent diagnostics into pipelines
Develop rigorous test cases for low-level features (PCIe, DMA, interrupts, memory management), error handling, recovery, and fault injection
Define and track quality KPIs (coverage, defect escape rate, MTTR, performance regressions) and drive continuous improvement
Lead defect triage across hardware, firmware, driver, runtime, and model layers; collaborate with engineering to resolve issues rapidly
Produce comprehensive documentation: test plans, procedures, fixtures, coverage maps, readiness criteria, and retrospectives
Requirements:
8–12 years in QA/Test for systems software or platform engineering, with at least 4 years focused on GPU software, device drivers, or firmware validation
Demonstrable ownership of validation for AI/ML pipelines and serving stacks using PyTorch and at least one modern inference framework (e.g., vLLM), including accuracy baselining and performance regression detection
Proven expertise testing drivers and firmware, with hands-on work in PCIe fundamentals (link training, BARs, MSI/MSI-X), DMA engines, interrupt handling, and memory models, as well as failure modes: error injection, recovery paths, power/thermal events, and persistence across reboot/upgrade cycles
Deep proficiency in Linux (kernel/user space) and practical experience with Windows driver ecosystems; ability to read kernel logs and symbols, trace with ftrace/perf/ETW, perform cross-layer debugging, build custom kernels/modules, and analyze crash dumps (kdump, WinDbg)
Strong programming for test automation: Python for framework and orchestration (pytest or equivalent), robust mocking/fixtures, and data-driven test generation; C/C++ for low-level test harnesses, protocol exercisers, and performance micro-benchmarks; Bash/PowerShell for environment setup, CI scripting, and reproducibility
CI/CD mastery with GitHub Actions/Workflows and/or Jenkins: design gated pipelines with parallelization, artifact management, flaky test quarantine, and automated rollback criteria; integrate metrics, alerts, and quality reports; enforce go/no-go release thresholds
Performance testing rigor: methodology for baselining, variance control, and noise isolation; application of statistical techniques (e.g., confidence intervals, A/B comparisons) to detect regressions
Experience with containerized test environments (Docker) and familiarity with Kubernetes for distributed tests
Exploratory testing mindset: Hypothesis-driven investigation, boundary and adversarial testing, fuzzing (protocol/API), chaos/fault injection, and reverse-engineering of interfaces when documentation is limited
Communication and leadership: clear, concise defect reporting; ability to drive triage across teams, establish and evangelize quality standards, and maintain strong documentation discipline
BS/MS in Computer Science, Computer Engineering, or a related discipline
Nice to have:
Lab ops for QA: rack mounting, server configuration, BMC/IPMI, BIOS/firmware updates, network/storage setup, power/thermal profiling
Front-end/UI testing experience for internal tools: ReactJS, web UI automation, accessibility and usability checks