Product Reliability Engineer - Defense Job at Palantir Technologies (Washington, D.C.)

Product Reliability Engineer - Defense

Product Reliability Engineers (PREs) are responsible for the health, performance...

Location

United States, New York

Salary:

82000.00 - 140000.00 USD / Year

Palantir Technologies

Expiration Date

Until further notice

Requirements

Engineering background in Computer Science, Mathematics, Software Engineering, Physics or similar field
Ability to work with a high degree of ownership and a strong sense of urgency in a dynamic environment
Experience producing code in backend languages such as Java, as part of a past role or personal projects
Familiarity with storage and data processing systems and cloud infrastructure
Strong written and verbal communication and ability to iterate quickly with teammates and incorporate feedback
Eligibility and willingness to obtain a US Security clearance

Job Responsibility

Continuously invest in documentation, metrics, monitors and other troubleshooting tools
Participate in on-call rotations during business hours and occasional weekends. This is a challenging yet rewarding opportunity to help remediate the most pressing issues across the Palantir fleet.
Diagnose, resolve, and prevent issues encountered in the field. Deliver end-to-end improvements to core products based on the issues you encounter.
Improve observability by refactoring codepaths and introducing telemetry
Identify and implement data-driven opportunities for improved service resilience
Develop strategic opinions on stability investments and inform the vision for long-term product stability

What we offer

Employees (and their eligible dependents) can enroll in medical, dental, and vision insurance as well as voluntary life insurance
Employees are automatically covered by Palantir’s basic life, AD&D and disability insurance
Commuter benefits
Take-what-you-need paid time off (not accrual-based)
2 weeks of paid time off built into the end of each year (subject to team and business needs)
10 paid holidays throughout the calendar year
Supportive leave of absence program including time off for military service and medical events
Paid leave for new parents and subsidized back-up care for all parents
Fertility and family building benefits including but not limited to adoption, surrogacy, and preservation
Stipend to help with expenses that come with a new child

Full-time

Principal Site Reliability Engineer

Arcadia’s customers rely on us to securely process and deliver high-value health...

Location

Salary:

Not provided

The Muse

Expiration Date

Until further notice

Requirements

8+ years of experience in SRE, platform engineering, systems engineering, or related roles operating production services at scale
Demonstrated principal-level impact: leading cross-team initiatives, influencing architecture decisions, and driving sustained improvements in reliability and operations
Expertise in Kubernetes operations and troubleshooting, including safe rollout/rollback patterns, workload debugging, and operational guardrails
Strong GitOps experience with Argo CD; experience building delivery workflows and automation using Argo Workflows
Strong infrastructure orchestration and provisioning experience with Crossplane and Terraform; ability to define reusable platform patterns and controls
Deep AWS experience (IAM, networking/VPC, compute, storage, managed services, observability) and strong understanding of reliability and failure modes in cloud systems
Proficiency in Python for building automation, tooling, and reliability improvements
Strong incident management and on-call leadership experience, including measurable improvements (availability, MTTR, alert quality, cost, or operational maturity)

Job Responsibility

Act as the technical leader for reliability for one or more domains; set direction and standards while remaining hands-on where it matters most
Drive reliability strategy across critical services: define SLOs/SLIs, error budgets, and reliability KPIs aligned to customer journeys and outcomes
Own incident response maturity: lead complex incidents, improve incident command practices, and ensure high-quality RCAs with prioritized, tracked remediation
Architect and implement automation to reduce toil and risk: runbook automation, self-service tools, and safe operational workflows (Python + Argo Workflows)
Advance GitOps delivery practices using Argo CD: promotion strategies, progressive delivery/canaries, and guardrails that reduce deploy risk
Scale infrastructure management with Crossplane and Terraform: reusable patterns, policy controls, and paved roads for teams
Lead operational readiness and reliability reviews for new features/architectural changes; reinforce non-functional requirements (availability, latency, security, cost)
Improve performance and cost efficiency through capacity planning, load testing, right-sizing, and architecture recommendations across AWS services

What we offer

Pet Insurance
Health Insurance
Dental Insurance
Vision Insurance
FSA
HSA
HSA With Employer Contribution
Life Insurance
Short-Term Disability
Long-Term Disability

Reliability Engineer – Performance & Life-Cycle Assurance

Mach Industries is seeking a Reliability Engineer who will own the end-to-end re...

Location

United States, Huntington Beach

Salary:

150000.00 - 200000.00 USD / Year

Mach Industries

Expiration Date

Until further notice

Requirements

Bachelor’s or Master’s degree in Mechanical Engineering, Electrical/Electronic Engineering, Aerospace Engineering, Systems Engineering or related discipline
5+ years of reliability engineering (or similar) experience in complex hardware-centric systems, preferably in aerospace/defense/unmanned systems or high-reliability industrial/automotive environments
Demonstrated experience applying reliability methods such as FMEA, FMECA, and RCFA
Strong data-analysis skills: ability to ingest large data sets (field returns, operational logs), perform statistical/trend analysis, build dashboards, and derive actionable insights
Experience with reliability testing: accelerated life tests, environmental stress screening, vibration/thermal/thermal-cycle/shock/humidity, life-cycle modelling
Knowledge of safety‐critical system standards and regulatory requirements (e.g., MIL-STD, DO-178, DO-254)

Job Responsibility

Develop, deploy, and maintain a reliability program plan for our UAS platforms and key subsystems (hardware, firmware, software), following best practices such as failure mode and effects analysis (FMEA)
Define reliability and maintainability requirements and metrics (e.g., MTBF, MTBR, availability, mission readiness, failure rate targets) early in the design lifecycle, and track performance through production and field operation
Analyze data (lab testing, manufacturing, field returns, in-service logs) to identify trends, root causes of failures (RCFA), latent defects, and reliability risks, then drive corrective and preventive actions
Define and oversee reliability test plans, accelerated life testing, environmental stress screening, field-data analysis, degradation modelling and life-cycle modelling in collaboration with test & validation teams
Monitor key reliability indicators (e.g., failure-rate trending, early‐life failures, wear-out characteristics, maintenance cost per unit time/mission, parts-life forecasting) and provide actionable insights to leadership
Communicate reliability status, risk posture, and improvement plans to senior leadership and stakeholders, including interfacing with defense-customer reliability/quality requirements and audits if applicable

What we offer

Equity
Healthcare, dental, and vision plans
Retirement savings
Paid time off
Continuing education
Training
Career growth

Full-time

Software Engineer, Internship - Defense Tech

Software Engineers at Palantir build software at scale to transform how organiza...

Location

United States, Palo Alto

Salary:

10500.00 USD / Month

Palantir Technologies

Expiration Date

Until further notice

Requirements

Engineering background in fields such as Computer Science, Mathematics, Software Engineering, and Physics
Familiarity with data structures, storage systems, cloud infrastructure, front-end frameworks, and other technical tools
Active US Security clearance, or eligibility and willingness to obtain a US Security clearance prior to start of internship
Experience coding in programming languages, such as Java, C++, Python, JavaScript, or similar languages
Must be planning on graduating in 2027. This should be your final internship before graduating

Job Responsibility

Ownership: We see projects through from beginning to end in spite of obstacles we may encounter
Collaboration: We work internally with people from a variety of backgrounds — such as other Software Engineers, Product Managers, Designers and Product Reliability Engineers. We also partner with our business development teams (Forward Deployed Engineers, Deployment Strategists) in order to understand and solve our customers' problems
Trust: We trust each other to effectively handle time and priorities, and don't micromanage. We want people to have the space to think for themselves, while feeling supported by their team

What we offer

Promoting health and well-being across all areas of Palantirians’ lives is just one of the ways we’re investing in our community

Full-time

Software Engineer - Data Infra Reliability

Luma's mission is to build multimodal AI to expand human imagination and capabil...

Location

United States, Palo Alto

Salary:

220000.00 - 280000.00 USD / Year

Luma AI

Expiration Date

Until further notice

Requirements

Deep SRE/DevOps proficiency: You live and breathe Linux, networking, and automation
Infrastructure-as-Code Native: You have extensive experience with Terraform, Ansible, or similar tools to manage complex cloud environments (AWS/GCP)
Kubernetes Expert: You have managed Kubernetes in production and understand its internals, not just how to deploy containers
Python Proficiency: You can write high-quality Python code for automation, tooling, and infrastructure management
Data-Minded: You understand the specific challenges of stateful data systems and high-throughput storage (S3/Object Store)

Job Responsibility

Automate Everything: Apply Infrastructure-as-Code (IaC) principles using Terraform to provision, manage, and scale our data infrastructure
Harden Data Pipelines: Build reliability and fault tolerance into our core data ingestion and processing workflows, ensuring high availability for research jobs
Scale Kubernetes & Ray: Operate and optimize large-scale Kubernetes clusters and Ray deployments to handle bursty, high-throughput workloads
Define Reliability: Establish Service Level Objectives (SLOs) and observability standards (Prometheus/Grafana) for our data platforms
Debug & Heal: Serve as the first line of defense for complex infrastructure failures, diagnosing root causes in distributed storage and compute systems

Full-time

Sr Mechanical Engineer

Sigma Design has collaborated with an aerospace company seeking an experienced S...

Location

United States, Everett

Salary:

137000.00 - 197000.00 USD / Year

Sigma Design

Expiration Date

Until further notice

Requirements

Bachelor's Degree in Mechanical Engineering or related degree
Minimum of 15 years of electro-mechanical design and architecture experience
Demonstrated technical leadership experience at a Senior, Staff, Principal, or equivalent engineering level
Experience in one or more of the following areas required (experience in multiple areas strongly preferred): stress, thermal, and/or fatigue analysis; aerospace regulations and qualification standards (DO-160 preferred); optical engineering and lighting systems; program and project management; electronics packaging
Solid modeling and CAD experience required

Job Responsibility

Lead the design, development, and integration of complex electro-mechanical systems and products
Serve as a technical leader on multidisciplinary engineering projects, providing guidance and mentorship to engineering staff
Participate in and lead cross-functional teams through all phases of product development, validation, and sustaining engineering activities
Perform and review advanced engineering analyses including stress, thermal, fatigue, and structural assessments to support product performance and reliability
Manage product architecture, system design decisions, and technical trade studies to meet customer, cost, manufacturability, and reliability requirements
Oversee product validation and qualification testing, ensuring compliance with applicable industry and customer requirements
Collaborate directly with customers and suppliers to define requirements, resolve technical challenges, and support project execution
Create and modify 3D models, component drawings, and assembly drawings using CAD software (CREO preferred)
Conduct root cause investigations and failure analyses to identify corrective actions and continuous improvement opportunities
Manage engineering documentation, configuration control, and product lifecycle data within PLM systems

What we offer

Multiple options for medical insurance and dental insurance including some with FSA and HSA
401(k) with up to 4% company match
15 days of accrued PTO and 9 company-paid holidays
Quarterly bonus program
Voluntary benefits: vision, long-term disability, and life insurance

Full-time

Senior Technical Operator (A-Shift)

We manufacture and supply reliable, high-quality medicines and vaccines to meet ...

Location

United States, Zebulon

Salary:

Not provided

GSK

Expiration Date

Until further notice

Requirements

High School Diploma or GED
5+ years' experience in a Production Operator role
3+ years' experience working at GSK in a Production role
OJT certified in all work centers in packaging or the majority of work centers in manufacturing
Demonstrated performance in all Process Operator job roles

Job Responsibility

Operates, challenges, and cleans equipment, replenishes consumable supplies, and performs all other duties needed by the business in accordance with cGMPs, Batch Documentation, SOPs, ZSPs, and JSAs as required; is responsible for independently maintaining inspection readiness of the area
Actively participates in monitoring equipment for excessive rejects/stoppages, using performance data to escalate any problems in the area that affect product quality, safety, or other aspects of line performance
Trained in the GSK Production System standards (i.e., 5S, standard work, problem solving, Gemba, process confirmations, and performance management) toward the goal of Zero Accidents, Zero Defects, and Zero Waste. Participates in performance management defect/stoppage trending, steps 1-6 of problem solving, and performing/maintaining OSW, 5S, and CIF actions
Delivers against safety, quality, waste, and performance objectives defined by strategy deployment. Identifies and is able to implement process improvements or alternative operating methods that increase safety, quality, and equipment efficiency in line with strategy
Recognized as a subject matter expert on operations/process and provides the first line of defense for equipment troubleshooting. Leads and executes minor repairs on the equipment and completes preventative maintenance. Possesses considerable knowledge of the job, is reliable, and is able to plan own daily activities and produce high-quality, high-quantity work. Participates in validation and engineering trials
Fluent in, and a trainer for, the systems/applications required for job performance and for monitoring and identifying trends (e.g., DELTA, myLearning, FreeWeigh, DISY, Active Plant, IP21, SAP work order transactions)
Works in coordination with other associates, assistants, and/or technicians to carry out daily job responsibilities, including recognizing technical issues, leading troubleshooting, and providing information to the next supervisory level
Maintains full knowledge of the job, is recognized as a certified OJT trainer for the department, and partners with leads to complete SOP changes. Maintains a complete understanding of the general aspects of the job and their practical application to problems and situations ordinarily encountered
Demonstrates the ability to conduct equipment set-up changeovers and start-of-batch activities to the designed specifications without start-up issues, and performs start-of-batch maintenance activities

What we offer

Competitive base salary
Annual bonus based on company performance
Flexible working options available for most roles
Learning and career development
Access to healthcare & wellbeing programmes
Employee recognition programmes
Onsite cafeteria
Onsite gym
Temperature-controlled environment
Licensed onsite Health & Wellness clinic

Full-time

Senior Data Scientist - Agentic Systems

The Global Marketing Engines and Experiences (E&E) team within Microsoft is resp...

Location

United States, Redmond

Salary:

119800.00 - 234700.00 USD / Year

Microsoft Corporation

Expiration Date

Until further notice

Requirements

Doctorate in Data Science, Mathematics, Statistics, Econometrics, Economics, Operations Research, Computer Science, or related field AND 1+ year(s) data-science experience (e.g., managing structured and unstructured data, applying statistical techniques and reporting results)
OR Master's Degree in Data Science, Mathematics, Statistics, Econometrics, Economics, Operations Research, Computer Science, or related field AND 3+ years data-science experience (e.g., managing structured and unstructured data, applying statistical techniques and reporting results)
OR Bachelor's Degree in Data Science, Mathematics, Statistics, Econometrics, Economics, Operations Research, Computer Science, or related field AND 5+ years data-science experience (e.g., managing structured and unstructured data, applying statistical techniques and reporting results)
OR equivalent experience

Job Responsibility

AI-Native Operations
Design, build, and ship agentic capabilities that make the analytics team more AI-native, including co-PM agents that triage incoming work, monitor data science workstreams in Azure DevOps, and propose ticket updates that keep our backlog accurate without manual hygiene effort
Build PM-layer agents that read across the Azure DevOps (ADO) portfolio to surface risk, estimate effort on new requests, and recommend project plans that managers can adapt rather than write from scratch
Establish shared infrastructure and patterns — prompting, evaluation, orchestration, observability, guardrails — that let the rest of the team build downstream agents reliably
Analyst Delivery Acceleration
Develop LLM-powered internal tools and skills that compress the cycle from analytics request to delivered insight, including capabilities that draft, format, and pressure-test the standard inputs analysts produce for MBR, MMR, and leadership review rhythms
Identify the highest-friction parts of the analyst delivery flow and design generative AI interventions that remove rather than relocate the work
Partner with the analytics team to instrument adoption, measure time saved, and iterate based on real usage rather than projected value
Marketer-Facing Capabilities
Lead the design and delivery of marketer-facing generative AI capabilities, anchored by a conversational analytics agent that allows marketers across Brand, Product Marketing Management (PMM), Customer Insights, and demand generation to self-serve on the analytics questions they bring to the team today

What we offer

Eligible for benefits and other compensation

Full-time
