We are looking for a DevOps/SRE Engineer in Minneapolis, Minnesota, to support the reliable delivery and day-to-day operation of AI-driven capabilities within customer-facing products. This position is focused on production readiness, service stability, and smooth integration across platforms rather than building the underlying AI features. The ideal candidate will help ensure new functionality is introduced safely, monitored effectively, and maintained with a strong emphasis on performance, cost awareness, and customer impact.
Job Responsibilities
Lead the release of AI-enabled product capabilities from pre-production validation through live deployment, ensuring launches are controlled and dependable
Oversee production health by tracking availability, response times, failures, and service quality, and take prompt action when issues affect performance
Maintain and improve connections between external AI providers, internal model services, and customer-facing applications to support reliable functionality
Administer API credentials, usage thresholds, vendor quotas, and spend controls, proactively identifying risks related to capacity or budget
Create and refine operational dashboards, alerting rules, and response documentation to strengthen support for AI-related incidents
Work closely with product and engineering partners to plan staged rollouts, feature gating, rollback paths, and low-risk release strategies
Support customer-facing teams by explaining AI feature readiness, expected delivery timelines, and practical capabilities in clear business terms
Participate in customer or sales discussions when technical expertise is needed, helping address questions about solution behavior, roadmap direction, and use case alignment
Manage investigation and resolution of customer-impacting incidents by coordinating with internal stakeholders and external vendors while providing timely updates
Monitor usage patterns, operating costs, vendor changes, model retirements, and security notices, and prepare tested mitigation or migration plans before service is affected
Requirements
At least 3 years of experience in DevOps, site reliability, software engineering, or production operations supporting live customer environments
Strong programming ability in Python with practical experience working with APIs, webhooks, and asynchronous service interactions
Proven background running systems in production, with an understanding of reliability, scalability, and incident handling under real-world load
Hands-on experience with monitoring and observability platforms such as Datadog, Grafana, New Relic, Amazon CloudWatch, or comparable tools
Familiarity with at least one major AI platform, such as OpenAI, Anthropic Claude, Azure OpenAI, Amazon Bedrock, or Google Vertex AI, along with production concerns such as latency, fallback design, rate limits, and cost control
Working knowledge of cloud infrastructure and CI/CD practices used to deploy, update, and maintain services consistently
Ability to write clear operational documentation, including runbooks and post-incident summaries, and to lead communication during service disruptions
Strong communication skills with the confidence to explain technical topics to non-technical stakeholders and customers while maintaining sound security and data-handling practices