Site Reliability Product Owner

Boeing

Location:
United States, Kent, Washington

Contract Type:
Not provided

Salary:

224100.00 - 273900.00 USD / Year

Job Description:

The Site Reliability Product Owner leads end-to-end release engineering and operationalization for a growing, multi-application software portfolio across multiple missions and effectivities, owning release coordination, the bug/fix lifecycle, customer and multi-level leadership approvals, incident command, and post-incident reporting. This hands-on, development-focused role requires strong AWS infrastructure and Python automation skills, practical knowledge of signal‑processing algorithm behavior to interpret anomalous system results, and ownership of on-call scheduling with an expectation of ~80% availability while assigned. The Product Owner defines and implements environment-wide monitoring and observations, builds comprehensive monitoring strategies (real‑time system health, anomaly detection, and alerting to pre-empt resource exhaustion and performance degradation), and develops environment monitoring dashboards and application monitoring using APM tools with proactive thresholds. Responsible for CI/CD and release quality, the role validates release candidates through operational and enterprise testing, compiles and coordinates release packages, facilitates the transition of development activities into operational environments, and enforces release control (scheduling, versioning, change control) while tracking and verifying fixes. The position drives continuous improvement by standardizing runbooks, automating deployment and recovery workflows, instrumenting DORA-style KPIs (deployment frequency, lead time, change success rate, MTTR), and partnering with engineering, suppliers, and the customer to reduce downtime, accelerate delivery cadence, and enable future capability growth and proposal support.

Job Responsibility:

  • Oversee end-to-end release engineering and sustainment for a multi-application portfolio supporting multiple missions and effectivities
  • Own release control processes: scheduling, versioning, change control, approvals, and authoritative configuration/deployment records
  • Coordinate and compile release packages
  • Validate release candidates through operational and enterprise testing and facilitate the transition of development activities into operational environments
  • Track, verify, and communicate bug/fix status across the portfolio and obtain customer and multi-level leadership sign‑offs prior to deployments
  • Define, implement, and maintain environment monitoring and observations across all environments, including real‑time system health, anomaly detection, and alerting to pre‑empt resource exhaustion and performance degradation
  • Design and maintain environment monitoring dashboards, application monitoring, and APM monitoring tools with proactive thresholds to surface performance issues
  • Manage on‑call scheduling and incident response
  • Serve as incident commander during outages, lead diagnostics and mitigation, and prepare and present executive incident slide decks and after‑action reports
  • Instrument and track release and operational KPIs (deployment frequency, lead time, change success rate, MTTR) and drive continuous improvement to release cadence and reliability
  • Automate deployment, rollback, and recovery workflows using Python and cloud-native tooling (including serverless patterns) to reduce manual effort and MTTR
  • Advise on signal‑processing algorithm behavior and cloud operations at scale to interpret anomalous outputs and recommend corrective actions
  • Coordinate supplier management and cross‑functional team activities to ensure release readiness, quality, and contractual compliance
  • Maintain and update operational runbooks, playbooks, and run‑to‑failure/response procedures
  • Train and mentor junior SWE staff as the sustainment team grows
  • Support research into emerging technologies and contribute technical inputs for proposals, bids, and future architecture planning
  • Serve as the primary Boeing representative to the customer enterprise for release and sustainment matters, ensuring clear, accurate, and timely stakeholder communications

Requirements:

  • Bachelor’s Degree in an engineering discipline or 18 years’ directly related work experience or 22 years’ related work experience
  • 20+ years of experience in software engineering, with demonstrated expertise in cloud‑native distributed systems, orchestration, and operationalizing services at scale (including serverless and containerized deployments)
  • 1+ years of experience in deploying and managing distributed systems in cloud platforms (e.g., Azure, AWS, GCP)
  • 1+ years of experience with engineering releases
  • 1+ years of experience in managing product backlog, writing user stories, and managing releases
  • 1+ years of experience with cloud platforms (e.g., AWS or Azure), infrastructure as code (e.g., Terraform), and automation tools (e.g., Puppet, Ansible, Chef)
  • 1+ years of experience developing and operating microservice, containerized, or serverless applications
  • 1+ years of experience with signal processing or image processing

Nice to have:

  • 1+ years of incident management experience, including leading post-incident reviews and preparing executive-level incident reports and slide decks
  • 3+ years of experience in Python development, scripting, and automation
  • Experience building operational tooling and automation for deployments and incident response

What we offer:

  • Generous company match to your 401(k)
  • Industry-leading tuition assistance program pays your institution directly
  • Fertility, adoption, and surrogacy benefits
  • Up to $10,000 gift match when you support your favorite nonprofit organizations
  • Health insurance
  • Flexible spending accounts
  • Health savings accounts
  • Retirement savings plans
  • Life and disability insurance programs
  • A number of programs that provide both paid and unpaid time away from work
  • Relocation based on candidate eligibility

Additional Information:

Job Posted:
March 04, 2026

Expiration:
March 18, 2026

Employment Type:
Full-time

Work Type:
On-site work

Similar Jobs for Site Reliability Product Owner

Site Reliability Engineering Manager

The Wikimedia Foundation is looking for an Engineering Manager to join our SRE t...
Location:
United States of America
Salary:
132439.00 - 208378.00 USD / Year
Wikimedia Foundation
Expiration:
Until further notice
Requirements:
  • Prior experience managing teams
  • Prior hands-on experience with software or reliability engineering (within the last 3 years preferred)
  • Ability to analyze complex systems, troubleshoot issues, and devise effective solutions under pressure
  • Proficiency in project management methodologies to effectively plan, execute, and track new and existing initiatives
  • Strong understanding of cloud computing, networking, Linux systems administration, containerization (e.g., Docker, Kubernetes), and infrastructure as code (e.g., Terraform, Ansible) to be able to provide technical support to the team
  • Aptitude for automation and streamlining of tasks
  • Communicate effectively in both spoken and written English
  • Ability to work independently, as an effective part of a globally distributed team
  • Ability to travel several times a year for occasional in-person meetings
  • B.S. or M.S. in Computer Science or the equivalent in related work experience
Job Responsibility:
  • Managing one to two globally distributed teams within Wikimedia’s Site Reliability Engineering organization
  • Providing guidance, mentorship, and support to ensure the team's effectiveness and growth
  • Working with team members to set individual performance goals, and supporting them in meeting and evolving their goals and career path
  • Recruiting, hiring, and helping onboard new team members
  • Triaging incoming workload, maintaining focus on priorities, and setting realistic expectations for both peers and team members
  • Coordinating and communicating with other members of the Wikimedia product & engineering teams on relevant projects, executing complex projects and contributing to the organizational strategy
  • Continuously developing the roadmap of the team in alignment with other SRE and Product & Technology teams, and helping to draft and execute the team’s annual and quarterly plans
  • Project managing new and existing initiatives
  • Leading the definition, refinement, and execution of the processes through which the team manages and performs work
  • Leading incident response, diagnosis, and follow-up on system alerts and outages across Wikimedia’s production infrastructure
Employment Type:
Full-time

Manager, Reliability

Responsible for sustaining and continuously improving various mechanical compone...
Location:
United States, Big Spring
Salary:
Not provided
Delek US
Expiration:
Until further notice
Requirements:
  • 4-year / Bachelor's Degree (Required)
  • Four (4) or more years of experience in a related field (Required)
  • No Licensure or Certification Required
  • Manages and leads the activities of the Reliability engineers and specialists
  • Ensures compliance to Engineering Practices/Mechanical Integrity at the site level
  • Champions initiatives, projects, and programs that support the reliability vision
  • Guides Reliability Engineers to grow their technical and leadership skills
  • Develops working relationships with site leaders to guide teams on reliability centered processes and investigations
  • Single point of contact (SPOC) between Corporate Reliability and site activities
  • Reliability Department budget owner
Job Responsibility:
  • Responsible for sustaining and continuously improving various mechanical components for equipment and tools
  • Ensures the safe, effective operations of the organization's production and supports continuous improvement
  • Manages reliability engineering projects
  • Performs analytical verification
  • Evaluates, tests and tracks results of reliability interventions
  • Initiates reporting for internal or third-party reported incidents
  • Creates, documents, and follows up on corrective actions
  • Prepares routine reports and memos and coordinates communications across all necessary functional groups of the organization
What we offer:
  • Up to a 10% 401(k) match from your hire date, with a vesting timeline of only one year
  • Medical benefits that start on day one with a 30% premium rebate annually
  • Free access to the Calm app
  • Additional annual incentives through the performance management program
Employment Type:
Full-time

Migration Services Product Manager

Join Barclays as a Migration Services Product Manager, where you’ll be responsib...
Location:
United Kingdom, Knutsford
Salary:
Not provided
Barclays
Expiration:
Until further notice
Requirements:
  • Proven experience as a Product Manager, serving critical technology-driven products
  • Extensive knowledge of Agile methodologies within the Product Development Lifecycle
  • Proven experience in Product Discovery, data analysis, and delivery methods
Job Responsibility:
  • Provision of subject matter expertise to support the collaboration between the product owner and the technical side of product development
  • Support the development and implementation of the product strategy and vision defined in the product roadmap and communicate them with the relevant stakeholders and the development team
  • Collaboration with internal stakeholders to gather and prioritise product requirements and features based on business value and feasibility that are well defined, measurable and secure
  • Development and implementation of assessments to ensure continuous testing and improvement of product quality and performance
  • Monitoring of product performance to identify opportunities for optimisation that meet the bank's performance standards
  • Stay abreast of the latest industry trends and technologies to evaluate and adopt new approaches to improve product development and delivery
What we offer:
  • Competitive holiday allowance
  • Life assurance
  • Private medical care
  • Pension contribution
Employment Type:
Full-time

Site Reliability Engineering Intern

Join a mission-driven team at the forefront of distributed systems technology to...
Location:
Ireland, Dublin
Salary:
Not provided
Crusoe
Expiration:
Until further notice
Requirements:
  • Current Enrollment: Must be currently pursuing a Bachelor’s or Master’s degree
  • Graduation Timeline: Must have a projected graduation date of either Fall 2026 or Spring 2027
  • Programming Proficiency: Demonstrated ability to code in at least one major language, such as Python, Java, C++, JavaScript, Rust, or Go
  • CS Fundamentals: Strong foundational knowledge of computer science, specifically in data structures and algorithm design
  • Communication Skills: Proactive communication style with a proven ability to collaborate effectively within a team environment
  • Analytical Mindset: A talent for troubleshooting and a curiosity-driven approach to optimizing complex technical systems
Job Responsibility:
  • System Development: Build and maintain scalable, highly available, and fault-tolerant distributed systems at scale to support intensive computational workloads
  • Product Creation: Develop innovative products and tools from the ground up that will be utilized by a global customer base
  • Production Support: Triage bugs and resolve complex issues within production environments to ensure platform reliability
  • Feature Iteration: Partner with product owners and stakeholders to design, test, and iterate on new features that drive platform growth
  • Cross-Functional Collaboration: Work closely with senior engineers and teammates to align technical tasks with broader company goals
  • Mentorship Engagement: Participate in dedicated one-on-one mentorship sessions to accelerate your professional growth and technical mastery
  • Strategic Problem Solving: Apply analytical thinking to troubleshoot and optimize complex systems for maximum efficiency
Employment Type:
Full-time

Site Reliability Engineer II

Site Reliability Engineer II - (Microsoft 365 Enterprise + Cloud). We are lookin...
Location:
Ireland, Dublin
Salary:
Not provided
Microsoft Corporation
Expiration:
Until further notice
Requirements:
  • Bachelor's Degree in Computer Science or related technical field AND technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
  • Mid-level years of software development: automation-related experience is most valued
  • Scripting languages such as Bash, Python, and PowerShell, or compiled languages such as C and C#, are most relevant, but others are acceptable
  • Awareness of, and ability to reason about, modern software & systems architectures, including load-balancing, queueing, caching, distributed systems failure modes, microservices, and so on
  • Associated troubleshooting skills, including the ability to follow RPC (Remote Procedure Call) call-chains across arbitrary network steps
  • Consequent understanding of monitoring in distributed systems
  • Deep understanding of operating system level concepts such as processes, memory allocation, and the network stack
  • Understanding of how applications are affected by the above, and ability to debug them
  • Experience with working in a team, including coordinating large projects, communicating well, and exercising initiative when presented with problems
  • Practical experience running large scale online systems is always an advantage
Job Responsibility:
  • Researches and maintains deep knowledge of industry trends as well as advances in large-scale distributed systems and cloud technologies
  • Identifies opportunities to create, implement, and/or optimally utilize new tools, technologies, and/or processes to solve ambiguous problems and improve product availability, reliability, efficiency, observability, and/or performance
  • Drives the adoption of innovative solutions across engineering teams working with related products within an organization
  • Apply advanced statistical and machine learning techniques to analyze large datasets and extract meaningful insights
  • Experience working with all service aspects of high throughput and multi-tenant services, ability to understand and design workflows carefully, properly handle errors, write clean and well-factored code with good tests and good maintainability
  • Engages with product engineering teams by partaking in code/design reviews, participating in on-call rotations and incident responses throughout product development and operations cycles
  • Leverages end-to-end technical expertise on underlying systems/platforms and insights from engagements with product engineering teams and telemetry analyses to propose scalable improvements in code and designs with attention to customer/business objectives and incident prevention
  • Develops code, scripts, systems, or platforms that automate moderately complex but repetitive operations processes (e.g., monitoring, alerting, deploying products and updates, debugging) at scale
  • Reviews existing automation code and scripts to evaluate reusability, extendibility, and scalability within an organization
  • Analyzes data from telemetry pipelines and monitoring tools that detail operations metrics (e.g., availability, reliability, performance, efficiency) of systems, platforms, or products operating at scale
Employment Type:
Full-time

Senior Manager, Staff Software Engineering

At Rating Scoring and Data Services in Geico Sales Tech our goal is to build a n...
Location:
United States, Chevy Chase
Salary:
130000.00 - 300000.00 USD / Year
Geico
Expiration:
Until further notice
Requirements:
  • Strong foundation in algorithms, data structures, and core computer science concepts
  • Basic UI/UX, API and prototype design knowledge and experience
  • Proven experience in digital experimentation to optimize the online customer experience
  • Knowledge of DTM tagging, including implementation solution design and processing rule set
  • Hands on experience with Metadata management tools (Microsoft Purview, Collibra, Alation, Informatica, etc.)
  • Knowledge of cloud computing technologies and concepts (SaaS, PaaS, IaaS, etc.)
  • Working knowledge of object-oriented development, Gang of Four (GOF) Design Patterns, Microservices, Dependency Injection with IOC containers, and both frontend and backend unit testing
  • Proven ability to concentrate and demonstrate a capacity for learning technical concepts and adapting to new technologies quickly
  • Strong cloud platform knowledge (AWS, GCP, Azure, etc.)
  • In-depth knowledge of MS Office tools such as PowerPoint, Outlook, Word, and WebEx for effective communication
Job Responsibility:
  • Work with your Director to address project dependencies, negotiate and estimate incremental delivery dates for milestones with the stakeholder community, and deliver projects on time
  • Identify and raise appropriate project risks, in addition to presenting detailed and implementable solutions or alternatives
  • Understand how requirements and design choices may impact systems across multiple areas
  • Report on your team’s progress for project and other key metrics, in addition to presenting detailed and implementable ideas for areas to further improve or influence product or project delivery
  • Initiate and support performance evaluation of team members
  • Cultivate a culture that motivates all levels of performers to higher levels of achievement
  • Build and maintain relationships with your team members to support an environment of trust
  • Foster a culture of growth mindset that acknowledges and expects individuals to grow and be accountable. Influence those you motivate and coach to be receptive to feedback
  • Identify where technical or analytical skill gaps put future team deliverables at risk and craft a plan to remediate, consistently challenge team members to share knowledge and learn new technologies
  • Proficiently execute difficult conversations on development and performance
What we offer:
  • Comprehensive Total Rewards program that offers personalized coverage tailor-made for you and your family’s overall well-being
  • Financial benefits including market-competitive compensation; a 401(k) savings plan vested from day one that offers a 6% match; performance and recognition-based incentives; and tuition assistance
  • Access to additional benefits like mental healthcare as well as fertility and adoption assistance
  • Flexibility: we provide workplace flexibility as well as our GEICO Flex program, which offers the ability to work from anywhere in the US for up to four weeks per year
Employment Type:
Full-time

Manager – AI Infrastructure Operations

As a senior leader on our team, you will be responsible for the overall health, ...
Location:
United States, Sunnyvale
Salary:
Not provided
Cerebras Systems
Expiration:
Until further notice
Requirements:
  • Technical Leadership: 15+ years of experience in managing and operating complex compute infrastructure, with a minimum of 5 years in a senior or leadership role
  • SRE and Operations Expertise: A strong background as a Site Reliability Engineer or in a similar role, with a proven track record of managing large-scale, mission-critical systems
  • Deep Systems Knowledge: Expert-level proficiency in Linux-based systems, Python scripting, and command-line tools for system administration and automation
  • Troubleshooting Acumen: Exceptional ability to lead and resolve complex technical challenges under pressure, especially during customer or engineering escalations
  • On-Call Leadership: Proven experience managing an on-call rotation and responding to 24/7 technical incidents
  • Communication: Excellent communication and leadership skills, with the ability to effectively mentor junior team members and communicate complex technical concepts to a diverse audience
Job Responsibility:
  • Lead and Manage Infrastructure: Oversee the operation and reliability of our advanced AI compute infrastructure, defining strategy and setting a high bar for operational excellence
  • Drive Technical Ownership: Act as the primary owner for critical infrastructure systems, ensuring uptime, performance, and capacity are consistently optimized
  • Handle High-Stakes Escalations: Serve as the final point of contact for complex customer and engineering escalations, providing expert-level, hands-on support and driving issues to a rapid and complete resolution
  • Champion Reliability and Automation: Leverage your SRE experience to develop and implement robust monitoring, alerting, and automation solutions, reducing manual toil and preventing future issues
  • Collaborate and Strategize: Partner with cross-functional teams, including engineering and product, to align on long-term infrastructure strategy and support future AI initiatives
  • Innovate and Improve: Continuously evaluate and improve existing processes, tools, and technologies to enhance system reliability and operational efficiency
What we offer:
  • Build a breakthrough AI platform beyond the constraints of the GPU
  • Publish and open-source cutting-edge AI research
  • Work on one of the fastest AI supercomputers in the world
  • Enjoy job stability with startup vitality
  • A simple, non-corporate work culture that respects individual beliefs

Business Development System Architect

A System Architect offers comprehensive technology assistance throughout the ent...
Location:
Poland, Warsaw
Salary:
Not provided
Brightstar Lottery
Expiration:
Until further notice
Requirements:
  • Proficiency in Technology – Capability to comprehend and design architectures for client solutions collaborating with the product architecture team and other technical subject matter experts. Comprehensive knowledge of solution technologies including software, data, networks, operations, and cloud platforms. Having cloud experience is beneficial
  • Understands Business Needs in Challenging New Markets – Transforms uncertain requirements into technical specifications and solution development independently from customers. In new markets, clients may lack clarity on their needs, requiring structured thinking to align with our offerings
  • Innovation – Being receptive to exploring fresh approaches to address current and emerging challenges, while maintaining the capacity to establish a reliable and resilient system
  • Guide technology decisions on-site with leadership. Form skilled leadership teams for project implementation. Mentor team members and foster leadership development
  • Decision Making – Demonstrating the capability to serve as primary technology decision makers for projects and for significant decisions for a site (in services). Ability to collect the necessary information, contact relevant experts, and make sound tactical/strategic decisions on behalf of the customer solution
  • Strong communication skills – Capability to communicate effectively with collaborators both externally and internally in a concise and straightforward way. Skilled at connecting different teams crucial to project achievement
  • Accountable for solution quality in project delivery through leading technical team
  • Bachelor's degree in IT or equivalent work experience, with 5+ years in business systems analysis or as a product owner
  • A solid grasp of product management and agile software development approaches, applied theories and tools in contemporary product development and product composition
  • Ability to travel (up to 40% of time)
Job Responsibility:
  • Product Architecture – when necessary, define or evolve a product architecture based on new requirements or overall system needs
  • Solution Architecture – when required by System Architecture, help define a particular decision or solution for a customer deployment to guide Brightstar development and customer technical leads
  • Architecture Strategy – help guide Brightstar engineering and product towards technologies and an architecture that meets business needs, being responsible for key development decisions when necessary