Product Reliability Engineer - Defense

United States, Washington, D.C. · 82000.00 - 140000.00 USD / Year · Job Posted February 20, 2026

Job Description

Product Reliability Engineers (PREs) are responsible for the health, performance, and stability of the services that power services at Palantir. PREs take ownership over the entire end-to-end cycle of service reliability, from responding to outages to improving codebases and building lasting solutions. You will tackle critical issues for key customers, introduce observability into complex systems, address tech debt in essential codebases, and inform strategic investments in core products. We are looking for engineers who enjoy deep-dive troubleshooting, feel strong ownership over the problems they encounter, and recognize the urgency of customer-facing outages.

PREs spend the majority of their time on forward-looking product work, including, but not limited to, infrastructure migrations, product contributions to improve stability and observability, and codebase enhancements that increase resilience. During periodic on-call shifts, PREs respond to automated alerts, investigate issues reported by customers, and share technical expertise with adjacent product teams. Whatever the technical issue or question about your service is, you'll play a central and critical role in resolving it, seeking not just a one-time fix but a permanent solution.

We provide new team members with an experienced mentor and a clear onboarding framework to set them up for success in the role.

Job Responsibility

  • Continuously invest in documentation, metrics, monitors and other troubleshooting tools
  • Participate in on-call rotations during business hours and occasional weekends. This is a challenging yet rewarding opportunity to help remediate the most pressing issues across the Palantir fleet
  • Diagnose, resolve, and prevent issues encountered in the field. Deliver end-to-end improvements to core products based on the issues you encounter
  • Improve observability by refactoring codepaths and introducing telemetry
  • Identify and implement data-driven opportunities for improved service resilience
  • Develop strategic opinions on stability investments and inform the vision for long-term product stability

Requirements

  • Engineering background in Computer Science, Mathematics, Software Engineering, Physics or similar field
  • Ability to work with a high degree of ownership and a strong sense of urgency in a dynamic environment
  • Experience producing code in backend languages such as Java, as part of a past role or personal projects
  • Familiarity with storage and data processing systems and cloud infrastructure
  • Strong written and verbal communication and ability to iterate quickly with teammates and incorporate feedback
  • Eligibility and willingness to obtain a US Security clearance

Nice to have

  • Comfortable with and curious about large-scale production systems and technologies, such as load balancing, monitoring, distributed systems, and configuration management
  • Confidence in troubleshooting complex issues independently using observability tools and stack traces
  • Familiarity with monitoring tools such as Prometheus and health checks
  • Experience coding with Java, Go and/or web technologies (e.g. HTML, CSS, JavaScript, Python/Ruby, Django/Flask/Ruby on Rails, etc.) is a plus
  • Track record of identifying bugs in codebases and contributing fixes leading to long term service stability
  • Demonstrated ability to make data-driven decisions and engage with stakeholders on strategy

What we offer

  • Employees (and their eligible dependents) can enroll in medical, dental, and vision insurance as well as voluntary life insurance
  • Employees are automatically covered by Palantir’s basic life, AD&D and disability insurance
  • Commuter benefits
  • Take-what-you-need paid time off, not accrual-based
  • 2 weeks paid time off built into the end of each year (subject to team and business needs)
  • 10 paid holidays throughout the calendar year
  • Supportive leave of absence program including time off for military service and medical events
  • Paid leave for new parents and subsidized back-up care for all parents
  • Fertility and family building benefits including but not limited to adoption, surrogacy, and preservation
  • Stipend to help with expenses that come with a new child
  • Employees can enroll in Palantir’s 401k plan

Similar Jobs for Product Reliability Engineer - Defense

8 matching positions

Product Reliability Engineer - Defense

Product Reliability Engineers (PREs) are responsible for the health, performance...
Location: United States, New York
Salary: 82000.00 - 140000.00 USD / Year
Palantir Technologies
Expiration Date: Until further notice
Requirements
  • Engineering background in Computer Science, Mathematics, Software Engineering, Physics or similar field
  • Ability to work with a high degree of ownership and a strong sense of urgency in a dynamic environment
  • Experience producing code in backend languages such as Java, as part of a past role or personal projects
  • Familiarity with storage and data processing systems and cloud infrastructure
  • Strong written and verbal communication and ability to iterate quickly with teammates and incorporate feedback
  • Eligibility and willingness to obtain a US Security clearance
Job Responsibility
  • Continuously invest in documentation, metrics, monitors and other troubleshooting tools
  • Participate in on-call rotations during business hours and occasional weekends. This is a challenging yet rewarding opportunity to help remediate the most pressing issues across the Palantir fleet.
  • Diagnose, resolve, and prevent issues encountered in the field. Deliver end-to-end improvements to core products based on the issues you encounter.
  • Improve observability by refactoring codepaths and introducing telemetry
  • Identify and implement data-driven opportunities for improved service resilience
  • Develop strategic opinions on stability investments and inform the vision for long-term product stability
What we offer
  • Employees (and their eligible dependents) can enroll in medical, dental, and vision insurance as well as voluntary life insurance
  • Employees are automatically covered by Palantir’s basic life, AD&D and disability insurance
  • Commuter benefits
  • Take-what-you-need paid time off, not accrual-based
  • 2 weeks paid time off built into the end of each year (subject to team and business needs)
  • 10 paid holidays throughout the calendar year
  • Supportive leave of absence program including time off for military service and medical events
  • Paid leave for new parents and subsidized back-up care for all parents
  • Fertility and family building benefits including but not limited to adoption, surrogacy, and preservation
  • Stipend to help with expenses that come with a new child
  • Fulltime

Principal Site Reliability Engineer

Arcadia’s customers rely on us to securely process and deliver high-value health...
Salary: Not provided
The Muse
Expiration Date: Until further notice
Requirements
  • 8+ years of experience in SRE, platform engineering, systems engineering, or related roles operating production services at scale
  • Demonstrated principal-level impact: leading cross-team initiatives, influencing architecture decisions, and driving sustained improvements in reliability and operations
  • Expertise in Kubernetes operations and troubleshooting, including safe rollout/rollback patterns, workload debugging, and operational guardrails
  • Strong GitOps experience with Argo CD; experience building delivery workflows and automation using Argo Workflows
  • Strong infrastructure orchestration and provisioning experience with Crossplane and Terraform; ability to define reusable platform patterns and controls
  • Deep AWS experience (IAM, networking/VPC, compute, storage, managed services, observability) and strong understanding of reliability and failure modes in cloud systems
  • Proficiency in Python for building automation, tooling, and reliability improvements
  • Strong incident management and on-call leadership experience, including measurable improvements (availability, MTTR, alert quality, cost, or operational maturity)
Job Responsibility
  • Act as the technical leader for reliability for one or more domains; set direction and standards while remaining hands-on where it matters most
  • Drive reliability strategy across critical services: define SLOs/SLIs, error budgets, and reliability KPIs aligned to customer journeys and outcomes
  • Own incident response maturity: lead complex incidents, improve incident command practices, and ensure high-quality RCAs with prioritized, tracked remediation
  • Architect and implement automation to reduce toil and risk: runbook automation, self-service tools, and safe operational workflows (Python + Argo Workflows)
  • Advance GitOps delivery practices using Argo CD: promotion strategies, progressive delivery/canaries, and guardrails that reduce deploy risk
  • Scale infrastructure management with Crossplane and Terraform: reusable patterns, policy controls, and paved roads for teams
  • Lead operational readiness and reliability reviews for new features/architectural changes; reinforce non-functional requirements (availability, latency, security, cost)
  • Improve performance and cost efficiency through capacity planning, load testing, right-sizing, and architecture recommendations across AWS services
What we offer
  • Pet Insurance
  • Health Insurance
  • Dental Insurance
  • Vision Insurance
  • FSA
  • HSA
  • HSA With Employer Contribution
  • Life Insurance
  • Short-Term Disability
  • Long-Term Disability

Reliability Engineer – Performance & Life-Cycle Assurance

Mach Industries is seeking a Reliability Engineer who will own the end-to-end re...
Location: United States, Huntington Beach
Salary: 150000.00 - 200000.00 USD / Year
Mach Industries
Expiration Date: Until further notice
Requirements
  • Bachelor’s or Master’s degree in Mechanical Engineering, Electrical/Electronic Engineering, Aerospace Engineering, Systems Engineering or related discipline
  • 5+ years of reliability engineering (or similar) experience in complex hardware-centric systems, preferably in aerospace/defense/unmanned systems or high-reliability industrial/automotive environments
  • Demonstrated experience applying reliability methods such as FMEA, FMECA, and RCFA
  • Strong data-analysis skills: ability to ingest large data sets (field returns, operational logs), perform statistical/trend analysis, build dashboards, derive actionable insights
  • Experience with reliability testing: accelerated life tests, environmental stress screening, vibration/thermal/thermal-cycle/shock/humidity, life-cycle modelling
  • Knowledge of safety‐critical system standards and regulatory requirements (e.g., MIL-STD, DO-178, DO-254)
Job Responsibility
  • Develop, deploy and maintain a reliability program plan for our UAS platforms and key subsystems (hardware, firmware, software) following best-practices (e.g., failure-mode and effects analysis (FMEA))
  • Define reliability and maintainability requirements and metrics (e.g., MTBF, MTBR, availability, mission readiness, failure rate targets) early in the design lifecycle, and track performance through production and field operation
  • Using data (lab testing, manufacturing, field returns, in-service logs) perform analytics to identify trends, root causes of failures (RCFA), latent defects, and reliability risks—then drive corrective and preventive actions
  • Define and oversee reliability test plans, accelerated life testing, environmental stress screening, field-data analysis, degradation modelling and life-cycle modelling in collaboration with test & validation teams
  • Monitor key reliability indicators (e.g., failure-rate trending, early‐life failures, wear-out characteristics, maintenance cost per unit time/mission, parts-life forecasting) and provide actionable insights to leadership
  • Communicate reliability status, risk posture, and improvement plans to senior leadership and stakeholders, including interfacing with defense-customer reliability/quality requirements and audits if applicable
What we offer
  • Offers Equity
  • healthcare
  • dental and vision plans
  • retirement savings
  • paid time off
  • continuing education
  • training
  • career growth
  • Fulltime

Software Engineer, Internship - Defense Tech

Software Engineers at Palantir build software at scale to transform how organiza...
Location: United States, Palo Alto
Salary: 10500.00 USD / Month
Palantir Technologies
Expiration Date: Until further notice
Requirements
  • Engineering background in fields such as Computer Science, Mathematics, Software Engineering, and Physics
  • Familiarity with data structures, storage systems, cloud infrastructure, front-end frameworks, and other technical tools
  • Active US Security clearance, or eligibility and willingness to obtain a US Security clearance prior to start of internship
  • Experience coding in programming languages, such as Java, C++, Python, JavaScript, or similar languages
  • Must be planning on graduating in 2027. This should be your final internship before graduating
Job Responsibility
  • Ownership: We see projects through from beginning to end in spite of obstacles we may encounter
  • Collaboration: We work internally with people from a variety of backgrounds — such as other Software Engineers, Product Managers, Designers and Product Reliability Engineers. We also partner with our business development teams (Forward Deployed Engineers, Deployment Strategists) in order to understand and solve our customers' problems
  • Trust: We trust each other to effectively handle time and priorities, and don't micromanage. We want people to have the space to think for themselves, while feeling supported by their team
What we offer
  • Promoting health and well-being across all areas of Palantirians’ lives is just one of the ways we’re investing in our community
  • Fulltime

Software Engineer - Data Infra Reliability

Luma's mission is to build multimodal AI to expand human imagination and capabil...
Location: United States, Palo Alto
Salary: 220000.00 - 280000.00 USD / Year
Luma AI
Expiration Date: Until further notice
Requirements
  • Deep SRE/DevOps proficiency: You live and breathe Linux, networking, and automation
  • Infrastructure-as-Code Native: You have extensive experience with Terraform, Ansible, or similar tools to manage complex cloud environments (AWS/GCP)
  • Kubernetes Expert: You have managed Kubernetes in production and understand its internals, not just how to deploy containers
  • Python Proficiency: You can write high-quality Python code for automation, tooling, and infrastructure management
  • Data-Minded: You understand the specific challenges of stateful data systems and high-throughput storage (S3/Object Store)
Job Responsibility
  • Automate Everything: Apply Infrastructure-as-Code (IaC) principles using Terraform to provision, manage, and scale our data infrastructure
  • Harden Data Pipelines: Build reliability and fault tolerance into our core data ingestion and processing workflows, ensuring high availability for research jobs
  • Scale Kubernetes & Ray: Operate and optimize large-scale Kubernetes clusters and Ray deployments to handle bursty, high-throughput workloads
  • Define Reliability: Establish Service Level Objectives (SLOs) and observability standards (Prometheus/Grafana) for our data platforms
  • Debug & Heal: Serve as the first line of defense for complex infrastructure failures, diagnosing root causes in distributed storage and compute systems
  • Fulltime

Sr Mechanical Engineer

Sigma Design has collaborated with an aerospace company seeking an experienced S...
Location: United States, Everett
Salary: 137000.00 - 197000.00 USD / Year
Sigma Design
Expiration Date: Until further notice
Requirements
  • Bachelor's Degree in Mechanical Engineering or related degree
  • Minimum of 15 years of electro-mechanical design and architecture experience
  • Demonstrated technical leadership experience at a Senior, Staff, Principal, or equivalent engineering level
  • Experience in one or more of the following areas required (experience in multiple areas strongly preferred):
  • Stress, thermal, and/or fatigue analysis
  • Aerospace regulations and qualification standards (DO-160 preferred)
  • Optical engineering and lighting systems
  • Program and project management
  • Electronics packaging
  • Solid modeling and CAD experience required
Job Responsibility
  • Lead the design, development, and integration of complex electro-mechanical systems and products
  • Serve as a technical leader on multidisciplinary engineering projects, providing guidance and mentorship to engineering staff
  • Participate in and lead cross-functional teams through all phases of product development, validation, and sustaining engineering activities
  • Perform and review advanced engineering analyses including stress, thermal, fatigue, and structural assessments to support product performance and reliability
  • Manage product architecture, system design decisions, and technical trade studies to meet customer, cost, manufacturability, and reliability requirements
  • Oversee product validation and qualification testing, ensuring compliance with applicable industry and customer requirements
  • Collaborate directly with customers and suppliers to define requirements, resolve technical challenges, and support project execution
  • Create and modify 3D models, component drawings, and assembly drawings using CAD software (CREO preferred)
  • Conduct root cause investigations and failure analyses to identify corrective actions and continuous improvement opportunities
  • Manage engineering documentation, configuration control, and product lifecycle data within PLM systems
What we offer
  • Multiple options for medical insurance and dental insurance including some with FSA and HSA
  • 401(k) with up to 4% company match
  • 15 days of accrued PTO and 9 company-paid holidays
  • Quarterly bonus program
  • Voluntary benefits: vision, long-term disability, and life insurance
  • Fulltime

Senior Technical Operator (A-Shift)

We manufacture and supply reliable, high-quality medicines and vaccines to meet ...
Location: United States, Zebulon
Salary: Not provided
GSK
Expiration Date: Until further notice
Requirements
  • High School Diploma or GED
  • 5+ years' experience in Production Operator role
  • 3+ years' experience working at GSK in a Production role
  • OJT certified in all work centers in packaging or the majority of work centers in manufacturing
  • Demonstrated performance in all Process Operator job roles
Job Responsibility
  • Operates, challenges, and cleans equipment, replenishes consumable supplies, and all the duties needed by the business in accordance with cGMPs, Batch Documentation, SOPs, ZSPs and JSAs as required and responsible for independently maintaining inspection readiness of area
  • Actively participates in monitoring equipment for excessive rejects/stoppages, utilizing performance data to escalate any problems occurring in the area that affect product quality, safety and other aspects of line performance
  • Trained in the GSK Production System standards (i.e. 5s, standard work, problem solving, Gemba, process confirmations, and performance management) towards the goal of Zero Accidents, Zero Defects, and Zero Waste. Participates in performance management defect / stoppage trending, steps 1-6 of problem solving, and performing / maintaining OSW and 5S, CIF actions
  • Deliver against safety, quality, waste & performance objectives defined by strategy deployment. Identifies and able to implement process improvement for areas or alternative operating methods to increase safety, quality, and equipment efficiency aligned to strategy
  • Recognized as a subject matter expert on operations / process and provides the first line of defense for equipment troubleshooting. Leads and executes minor repairs on the equipment and completes preventative maintenance. Possesses considerable knowledge of the job, is reliable, and is able to plan own daily activities and produce high-quality, high-quantity work. Participates in validation and engineering trials
  • Fluent in, and a trainer for, the systems/applications required for job performance and for monitoring and identifying trends (e.g. DELTA, myLearning, FreeWeigh, DISY, Active Plant, IP21, SAP work order transactions)
  • Works in coordination with other associates, assistants, and/or technicians to carry out daily job responsibilities, including recognizing and leading troubleshooting technical issues and providing information to next supervisory level
  • Maintains full knowledge of the job and is recognized as a certified OJT trainer for the department and to partner with leads to complete SOP changes. Maintains complete acquaintance with and understanding of the general aspects of the job and the practical applications to problems and situations ordinarily encountered
  • Demonstrates the ability to conduct equipment set up changeovers and start of batch activities to the designed specifications without start up issues and performs start of batch maintenance activities
What we offer
  • Competitive base salary
  • Annual bonus based on company performance
  • Flexible working options available for most roles
  • Learning and career development
  • Access to healthcare & wellbeing programmes
  • Employee recognition programmes
  • Onsite cafeteria
  • Onsite gym
  • Temperature-controlled environment
  • Licensed onsite Health & Wellness clinic
  • Fulltime

Senior Data Scientist - Agentic Systems

The Global Marketing Engines and Experiences (E&E) team within Microsoft is resp...
Location: United States, Redmond
Salary: 119800.00 - 234700.00 USD / Year
Microsoft Corporation
Expiration Date: Until further notice
Requirements
  • Doctorate in Data Science, Mathematics, Statistics, Econometrics, Economics, Operations Research, Computer Science, or related field AND 1+ year(s) data-science experience (e.g., managing structured and unstructured data, applying statistical techniques and reporting results)
  • OR Master's Degree in Data Science, Mathematics, Statistics, Econometrics, Economics, Operations Research, Computer Science, or related field AND 3+ years data-science experience (e.g., managing structured and unstructured data, applying statistical techniques and reporting results)
  • OR Bachelor's Degree in Data Science, Mathematics, Statistics, Econometrics, Economics, Operations Research, Computer Science, or related field AND 5+ years data-science experience (e.g., managing structured and unstructured data, applying statistical techniques and reporting results)
  • OR equivalent experience
Job Responsibility
  • AI-Native Operations
  • Design, build, and ship agentic capabilities that make the analytics team more AI-native, including co-PM agents that triage incoming work, monitor data science workstreams in Azure DevOps, and propose ticket updates that keep our backlog accurate without manual hygiene effort
  • Build PM-layer agents that read across the Azure DevOps (ADO) portfolio to surface risk, estimate effort on new requests, and recommend project plans that managers can adapt rather than write from scratch
  • Establish shared infrastructure and patterns — prompting, evaluation, orchestration, observability, guardrails — that let the rest of the team build downstream agents reliably
  • Analyst Delivery Acceleration
  • Develop LLM-powered internal tools and skills that compress the cycle from analytics request to delivered insight, including capabilities that draft, format, and pressure-test the standard inputs analysts produce for MBR, MMR, and leadership review rhythms
  • Identify the highest-friction parts of the analyst delivery flow and design generative AI interventions that remove rather than relocate the work
  • Partner with the analytics team to instrument adoption, measure time saved, and iterate based on real usage rather than projected value
  • Marketer-Facing Capabilities
  • Lead the design and delivery of marketer-facing generative AI capabilities, anchored by a conversational analytics agent that allows marketers across Brand, Product Marketing Management (PMM), Customer Insights, and demand generation to self-serve on the analytics questions they bring to the team today
What we offer
  • Eligible for benefits and other compensation
  • Fulltime