Principal Engineer – Gen AI Platform Inferencing Engineering Job at Wells Fargo (Charlotte)

Job Description

Wells Fargo is seeking a Principal Engineer – Gen AI Platform Inferencing Engineering to lead the development and optimization of our AI model serving and inferencing platforms within Digital Technology's AI Capability Engineering group. This is a software engineering role: you'll write code, build systems, and solve hard problems in the AI inference stack. You'll work deep inside frameworks like vLLM, SGLang, and NVIDIA Dynamo, extending and optimizing them to serve models at enterprise scale. You'll also build the automation, tooling, and deployment infrastructure that connects these runtimes to Kubernetes-native serving layers like KServe, Knative, and OpenShift AI. If you've contributed to inference frameworks, written custom serving logic, or built production ML serving pipelines in Python, we want to hear from you.

Job Responsibilities

Develop, extend, and optimize inference runtime configurations and integrations across vLLM, SGLang, NVIDIA Dynamo, TensorRT-LLM, and Triton
Write Python-based tooling and automation for model onboarding, serving configuration, performance benchmarking, and deployment pipelines
Build and maintain Kubernetes-native model serving infrastructure using KServe, Knative, and OpenShift AI, including custom serving runtimes and inference graphs
Implement and tune inference performance optimizations — continuous batching, speculative decoding, prefix caching, concurrency control, autoscaling policies, and disaggregated prefill/decode pipelines
Develop Helm charts, operators, and Kustomize overlays for deploying and managing inference workloads on OpenShift Container Platform (OCP)
Integrate inference platforms with GPU workload orchestrators (Run:AI or similar) — automating project provisioning, quota management, and workload scheduling
Build observability and testing harnesses — load testing frameworks, latency/throughput profiling scripts, and regression test suites for inference stack upgrades
Partner with AI/ML teams to productionize new models, defining serving architectures, resource requirements, and SLA targets

Requirements

7+ years of software engineering or platform engineering experience, gained through work experience, training, military experience, or education
5+ years of Python programming experience, including building production systems
Experience with inference frameworks such as vLLM, SGLang, NVIDIA Dynamo, TensorRT-LLM, or Triton Inference Server
Experience with Kubernetes-native ML serving such as KServe, Knative, Seldon, or OpenShift AI
Experience with inference optimization: continuous batching, speculative decoding, KV-cache management, prefix caching, quantization-aware serving (FP8, AWQ, GPTQ), or tensor parallelism configuration
Experience with container platform development: writing Helm charts, operators, or custom controllers for OpenShift, GKE, or EKS
Experience with GPU workload orchestration (Run:AI, Kueue, Volcano): scripting workload automation, quota management, or scheduler integrations
Experience with performance and load testing: building benchmarking tools for token throughput, time-to-first-token, batch latency, and autoscaling behavior
Familiarity with NVIDIA GPU fundamentals (CUDA, MIG, NCCL), experience contributing to open-source inference projects, or background in ML observability tooling (Prometheus, Grafana, Arize)

What we offer

Health benefits
401(k) Plan
Paid time off
Disability benefits
Life insurance, critical illness insurance, and accident insurance
Parental leave
Critical caregiving leave
Discounts and savings
Commuter benefits
Tuition reimbursement
Scholarships for dependent children
Adoption reimbursement
