Senior Software Engineer, Cloud Development Job at Mozilla

Job Description

The AI Platform team is responsible for building the foundational infrastructure that powers intelligent experiences across Mozilla products. This includes model training pipelines, high-throughput inference services, GPU orchestration, and secure, privacy-respecting AI systems that operate reliably at global scale. We’re looking for a Senior Software Engineer with a strong platform mindset to help design, build, and operate Mozilla’s AI platform. In this role, you’ll work at the intersection of machine learning, distributed systems, and production infrastructure—ensuring that models can be trained, deployed, and served efficiently, securely, and at scale. You will collaborate closely with product, infrastructure, and security teams to enable fast iteration while meeting strict performance and privacy requirements.

Job Responsibilities

Design, build, and operate core platform services and APIs used to deploy and serve production workloads at scale
Own service reliability end-to-end, driving improvements in availability, scalability, performance, and operational excellence
Lead efforts to optimize backend services for throughput, latency, and cost efficiency across distributed infrastructure
Design and manage Kubernetes-based workloads, including GitOps deployment pipelines, environment configuration, and resource utilization optimization
Own and improve critical parts of the service lifecycle, including packaging, versioning, testing strategies, validation, and deployment automation
Implement and evolve observability practices (metrics, logging, tracing, alerting) to improve visibility and operational resilience of backend services and pipelines
Partner closely with product, infrastructure, security, and data teams to design scalable platform capabilities that enable new product features
Contribute to technical design discussions, propose architectural improvements, and mentor junior engineers through code reviews and knowledge sharing
Participate in and help improve operational processes, including incident response, on-call rotations, and post-incident reviews

Requirements

Bachelor's degree with 4–6 years of relevant industry experience, or Master's degree with significant hands-on experience building and operating production systems, or equivalent work experience
Strong, modern Python skills, with experience writing clean, maintainable code and working with a fast toolchain (dependency management, linting, formatting, type checks, pre-commit), building both libraries and CLIs that output structured data
Advanced experience with database deployment and management; bonus points for familiarity with Postgres
Proven experience deploying and operating workloads in cloud environments, including production-grade infrastructure on GCP and GKE (artifact registries, managed caches, networking and internal load balancing, VPC, DNS, and separation of nonprod and prod)
Hands-on experience with Kubernetes and Helm, writing charts that deploy across environments with per-environment configuration and progressive feature rollout
Experience with Terraform for provisioning infrastructure across environments, including schema validation and PR-level plan review
Experience designing and running scalable APIs that hold up under load, including health and readiness checks, auth, and clean startup and shutdown
Experience with Grafana or similar tools for metrics, dashboards, and reading application and infrastructure health together during rollouts
Strong problem-solving skills and the ability to debug performance and reliability issues in distributed systems
Clear and effective communication skills, with experience collaborating across engineering, product, and infrastructure teams
On-call experience, including participating in incident response and post-incident reviews

Nice to have

Experience with Ray or Ray Serve for GPU-backed model serving, including setting resource requests and replica counts aligned with available hardware
Experience building stateless ML services such as embedding or similarity models, including multi-model loading, runtime device selection, batch APIs, and handling model-cache and cold-start tradeoffs
Experience running a multi-provider LLM gateway, including routing between providers, migrating models, and mixing self-hosted with third-party serving
Familiarity with containerization and orchestration systems in production environments beyond core Kubernetes/Helm usage
Exposure to privacy-preserving ML techniques, security best practices, or responsible AI system design
Contributions to open-source infrastructure projects or leadership in building reusable internal tooling

What we offer

Generous performance-based bonus plans for all eligible employees
Rich medical, dental, and vision coverage
Generous retirement contributions with 100% immediate vesting
Quarterly all-company wellness days
Country-specific holidays plus a day off for your birthday
One-time home office stipend
Annual professional development budget
Quarterly well-being stipend
Considerable paid parental leave
Employee referral bonus program
Other benefits (life/AD&D, disability, EAP, etc.)
