Reliability Engineer Job at Proscia (Philadelphia)

Job Responsibilities

Deploy, configure, and support Proscia’s container-based application stack in on-premises customer environments
Own system reliability across customer installations, including uptime, performance, backup/recovery, and upgrade workflows
Diagnose and resolve production incidents through deep root-cause analysis across application, container, host, storage, and networking layers, using AI alongside traditional debugging to correlate signals and cut through noise
Optimize performance for large image datasets and AI workloads running on customer-managed compute infrastructure
Improve installation automation, configuration management, and repeatability across diverse environments, integrating agentic workflows into your day-to-day work to keep pace with demands from Engineering
Develop and refine monitoring, logging, and alerting patterns appropriate for customer-hosted deployments
Collaborate closely with Engineering, Customer Success, and Support to translate field learnings into product and operational improvements
Create operational playbooks written with the clarity and structure that make them useful to teammates, customers, and the AI-augmented workflows the team relies on
Contribute to Proscia’s technical presence—whether through internal demos, engineering blog posts, or operational knowledge sharing that raises the bar for how the team works

Requirements

Deep hands-on experience deploying and operating containerized applications using container orchestration in production environments
Strong Linux systems expertise (process management, networking, storage, security hardening, performance tuning)
Expert troubleshooting skills in distributed systems across application, container, and infrastructure layers
Experience with enterprise networking: you can troubleshoot issues and recommend corrections in customer infrastructure
Comfortable operating software in customer-managed and on-premises environments
Experience supporting data-intensive systems, ideally involving large image files or compute-heavy workloads
Working knowledge of observability practices (logs, metrics, tracing) and pragmatic monitoring approaches in non-cloud-native environments
Comfort working directly with customers or customer-facing teams to resolve high-impact issues
You already use AI tools in your operational work, whether for troubleshooting, writing automation, analyzing logs, or however else they fit your practice
A mindset aligned with Proscia’s values: ownership, speed, simplification, and a willingness to challenge the status quo
Experience building with or on top of LLMs, AI agents, or agentic pipelines
Demonstrated fluency applying AI tools to real operational problems beyond basic code completion
Familiarity with prompt engineering, tool use patterns, and evaluation of AI systems—you know when AI output is production-ready and when it needs different guardrails

Nice to Have

Experience with healthcare or regulated environments
Exposure to Kubernetes (for hybrid or future-state deployments)
Experience with infrastructure automation or configuration management tools
Familiarity with database performance tuning for large datasets
Experience supporting GPU-enabled workloads
Open-source contributions, side projects, or a portfolio that shows how you think and build
Background that spans multiple domains or disciplines
Active in technical communities, forums, or meetups

Benefits

Competitive pay
Savings, schedule, and insurance options that promote long-term health and personal growth