Wells Fargo is seeking a Lead Infrastructure Engineer to join our AI Platforms and Model Support Group as part of Digital Technology and Innovations. Learn more about the career areas and business divisions at wellsfargojobs.com. The Lead Infrastructure Engineer is responsible for designing, building, and operating highly scalable, resilient production infrastructure platforms that support enterprise Generative AI and Predictive AI workloads. This role provides technical leadership across GPU-accelerated environments, OpenShift/Kubernetes platforms, and advanced AI infrastructure patterns, including large, AI-factory-scale GPU compute architectures. The engineer partners closely with platform, application, and vendor teams to ensure secure, performant, production-grade AI solutions.
Job Responsibilities:
Lead complex initiatives to develop infrastructure to provide solutions for business applications
Participate in various projects intended to continually improve or upgrade the infrastructure
Evaluate internal and external software solutions which could be leveraged to meet target state architecture goals
Review and analyze high-impact outages to ensure the proper processes and procedures are in place to avoid problems in the future
Design, build, deploy and maintain infrastructure solutions through collaborative efforts with the team and third-party vendors
Design, code, test, debug and document programs using Agile development practices
Make decisions on technical designs and implementation plans, and identify project risks and resource requirements
Direct the daily risk and control flow of operations, focusing on policies, procedures and work standards to ensure success
Recommend courses of action to maintain cost effectiveness and achieve results
Collaborate and consult with peers, colleagues and managers to resolve issues and achieve goals
Interact with customers and vendors
Requirements:
5+ years of Technology Infrastructure Engineering and Solutions experience, or equivalent demonstrated through one or a combination of the following: work experience, training, military experience, education
5+ years troubleshooting complex end-to-end architectures (including CI/CD pipelines)
5+ years Linux systems experience
4+ years supporting AI/ML platforms
4+ years of Kubernetes / container platform experience including production support
Nice to have:
Experience with Generative AI and Predictive AI platforms
Hands-on GPU platform operations including scheduling, quota, and performance tuning
Experience with OpenShift in GPU-enabled, multi-tenant environments
Experience designing or operating GPU SuperPods
Deep experience with observability using Grafana, Splunk, and custom telemetry pipelines
Experience building AI- or agent-driven automation tooling (AIOps)
Hands-on experience supporting AI/ML workloads on GCP and Azure, including GPU-backed services and managed AI infrastructure
Experience operating hybrid or multi-cloud AI platforms, with an understanding of cloud-native services, networking, identity, and cost optimization for Generative and Predictive AI
Strong experience monitoring AI signals such as inference latency and GPU utilization
Experience with BCP/DR, resiliency, and highly available architectures
What we offer:
Health benefits
401(k) Plan
Paid time off
Disability benefits
Life insurance, critical illness insurance, and accident insurance