CrawlJobs Logo

Lead Software Engineer, DevOps / MLOps

capitalone.com Logo

Capital One

Location Icon

Location:
United States , New York

Category Icon

Job Type Icon

Contract Type:
Not provided

Salary Icon

Salary:

197300.00 - 245600.00 USD / Year

Job Description:

Lead Software Engineer, DevOps / MLOps (Agentic Workflows, AI/ML Guardrails, Kubernetes, Python, AWS). Do you love building and pioneering in the technology space? Do you enjoy solving complex business problems in a fast-paced, collaborative, inclusive, and iterative delivery environment? At Capital One, you'll be part of a big group of makers, breakers, doers and disruptors, who love to solve real problems and meet real customer needs. We are seeking DevOps Engineers who are passionate about marrying data with emerging technologies to join our team. As a DevOps Engineer, you’ll have the opportunity to be on the forefront of driving a major transformation within Capital One. Intelligent Foundations & Experiences (IFX) is a powerful collective of horizontal technology organizations that are driving Capital One’s real-time intelligent future. Together with our partners in the Enterprise and across lines of business, we deliver broad-reaching technical solutions and advance state-of-the-art science to help every Capital One associate and our 100+M customers succeed.

Job Responsibility:

  • Lead a portfolio of diverse technology projects and a team of developers with deep experience in machine learning, distributed microservices, and full stack systems to create solutions that help meet regulatory needs for the company
  • Share your passion for staying on top of tech trends, experimenting with and learning new technologies, participating in internal & external technology communities, and mentoring other members of the engineering community
  • Collaborate with digital product managers, and deliver robust cloud-based solutions that drive powerful experiences to help millions of Americans achieve financial empowerment
  • Utilize programming languages like Java, Python, SQL, Ruby and Go, Container Orchestration services including Docker and Kubernetes, CM tools including Ansible and Terraform, and a variety of AWS tools and services

Requirements:

  • Bachelor’s degree
  • At least 4 years of experience in DevOps Engineering (Internship experience does not apply)
  • At least 3 years of experience in Cloud Native technologies (Amazon Web Services, Microsoft Azure, Google Cloud Platform)
  • At least 4 years of Unix or Linux system administration experience

Nice to have:

  • 7+ years of DevOps Engineering experience
  • 4+ years of experience with coding and scripting (Python, SQL, Java, JavaScript, Golang, Bash, Perl or Ruby)
  • 4+ years of experience using build and deployment tools (Jenkins, Docker, Kubernetes)
  • 2+ years of experience with distributed database systems (Cassandra, ElasticSearch)
  • 2+ years of experience with deploying clustered web services
  • 2+ years of experience working within Agile Development Practices
  • 2+ years of Machine Learning Operations experience
What we offer:
  • comprehensive, competitive, and inclusive set of health, financial and other benefits that support your total well-being
  • performance based incentive compensation, which may include cash bonus(es) and/or long term incentives (LTI)

Additional Information:

Job Posted:
April 16, 2026

Employment Type:
Fulltime
Job Link Share:

Looking for more opportunities? Search for other job offers that match your skills and interests.

Briefcase Icon

Similar Jobs for Lead Software Engineer, DevOps / MLOps

Senior Software Engineer – AI

NStarX is seeking a highly skilled Senior Software Engineer – AI with a strong f...
Location
Location
India , Hyderabad
Salary
Salary:
Not provided
nstarxinc.com Logo
NStarX
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Bachelor’s or Master’s degree in Computer Science, Machine Learning, Data Science, or a related field (PhD is a plus)
  • 9+ years of experience in AI/ML engineering or related roles
  • 3+ years of experience in Generative AI with team leadership responsibilities
  • Proven track record of production-grade ML and GenAI model development and deployment
  • Programming: Python (preferred)
  • GenAI Frameworks: Hugging Face Transformers, Diffusers, LangChain, TGI
  • Serving & Inference: FastAPI, gRPC, NVIDIA Triton, TorchServe
  • Cloud Platforms: AWS (SageMaker, EKS), GCP (Vertex AI, GKE), Azure (Azure ML, AKS)
  • MLOps & DevOps: Kubeflow, MLflow, GitHub Actions, Jenkins, Helm, Terraform
  • Optimization Techniques: Model quantization, distillation, pipeline and tensor parallelism
Job Responsibility
Job Responsibility
  • Design, develop, and deploy machine learning models and AI algorithms to address complex business challenges
  • Lead and mentor a team of AI/ML engineers, ensuring quality and scalability in solution design and implementation
  • Collaborate closely with cross-functional teams including data scientists, software engineers, product managers, and UX designers
  • Lead the development and deployment of Generative AI applications across text, code, image, and audio modalities using state-of-the-art LLMs
  • Design and implement CI/CD pipelines for the GenAI model lifecycle including training, validation, packaging, and deployment
  • Apply best practices for model performance tuning, cost optimization, and scalable deployment in cloud and hybrid environments
  • Develop prompt engineering, fine-tuning strategies (LoRA, QLoRA, PEFT), and evaluation protocols tailored to business use cases
  • Stay current with emerging trends in AI, ML, and Generative AI and drive adoption across teams
  • Document processes, model architectures, and deployment strategies for traceability and knowledge sharing
  • Work closely with cross-functional teams to gather requirements and deliver high-quality solutions
What we offer
What we offer
  • Competitive salary aligned with market standards
  • Opportunities for professional development and skill enhancement
  • A collaborative and innovative work environment
  • Fulltime
Read More
Arrow Right

Principal Engineer

The Principal AI/ML Operations Engineer leads the architecture, automation, and ...
Location
Location
United States , Pleasanton, California
Salary
Salary:
251000.00 - 314500.00 USD / Year
blackline.com Logo
BlackLine
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Bachelor’s or Master’s degree in Computer Science, Machine Learning, Data Science, or a related field
  • 10+ years in ML infrastructure, DevOps, and software system architecture
  • 4+ years in leading MLOps or AI Ops platforms
  • Strong programming skills in languages such as Python, Java, or Scala
  • Expertise in ML frameworks (TensorFlow, PyTorch, scikit-learn) and orchestration tools (Airflow, Kubeflow, Vertex AI, MLflow)
  • Proven experience operating production pipelines for ML and LLM-based systems across cloud ecosystems (GCP, AWS, Azure)
  • Deep familiarity with LangChain, LangGraph, ADK or similar agentic system runtime management
  • Strong competencies in CI/CD, IaC, and DevSecOps pipelines integrating testing, compliance, and deployment automation
  • Hands-on with observability stacks (Prometheus, Grafana, Newrelic) for model and agent performance tracking
  • Understanding of governance frameworks for Responsible AI, auditability, and cost metering across training and inference workloads
Job Responsibility
Job Responsibility
  • Define enterprise-level standards and reference architectures for ML-Ops and AIOps systems
  • Partner with data science, security, and product teams to set evaluation and governance standards (Guardrails, Bias, Drift, Latency SLAs)
  • Mentor senior engineers and drive design reviews for ML pipelines, model registries, and agentic runtime environments
  • Lead incident response and reliability strategies for ML/AI systems
  • Lead the deployment of AI models and systems in various environments
  • Collaborate with development teams to integrate AI solutions into existing workflows and applications
  • Ensure seamless integration with different platforms and technologies
  • Define and manage MCP Registry for agentic component onboarding, lifecycle versioning, and dependency governance
  • Build CI/CD pipelines automating LLM agent deployment, policy validation, and prompt evaluation of workflows
  • Develop and operationalize experimentation frameworks for agent evaluations, scenario regression, and performance analytics
What we offer
What we offer
  • short-term and long-term incentive programs
  • robust offering of benefit and wellness plans
  • Fulltime
Read More
Arrow Right

Senior AI ML Engineer

We are seeking a highly skilled and experienced Assistant Vice President (AVP), ...
Location
Location
India , Pune
Salary
Salary:
Not provided
https://www.citi.com/ Logo
Citi
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Bachelor's or Master's degree in Computer Science, Data Science, Artificial Intelligence, Machine Learning, Statistics, or a related quantitative field
  • Minimum of 6+ years of professional experience in Data Science, Machine Learning Engineering, or a similar role, with a strong track record of deploying ML models to production
  • Proven experience in a lead or senior technical role
  • Expert-level proficiency in Python programming, including experience with relevant data science libraries (e.g., Pandas, NumPy, Scikit-learn) and deep learning frameworks (e.g., TensorFlow, PyTorch)
  • Strong hands-on experience designing, developing, and deploying RESTful APIs using FastAPI
  • Solid understanding and practical experience with CI/CD tools and methodologies (e.g., Jenkins, GitLab CI, GitHub Actions, Azure DevOps) for MLOps
  • Experience with MLOps platforms, model monitoring, and model versioning
  • Experience with at least one major cloud provider (e.g., AWS, Azure, GCP) for deploying and managing ML workloads
  • Proficiency in SQL and experience working with relational and/or NoSQL databases
  • Deep understanding of machine learning algorithms, statistical modeling, and data mining techniques
Job Responsibility
Job Responsibility
  • Design, develop, and implement advanced machine learning models (e.g., predictive, prescriptive, generative AI) to solve complex business problems, from initial data exploration and feature engineering to model training and evaluation
  • Lead the deployment of AI/ML models into production environments, ensuring scalability, reliability, and performance
  • Build and maintain robust, high-performance APIs (using frameworks like FastAPI) to serve machine learning models and integrate them with existing applications and systems
  • Establish and manage continuous integration and continuous deployment (CI/CD) pipelines for ML code and model deployments, promoting automation and efficiency
  • Collaborate with data engineers to ensure optimal data pipelines and data quality for model development and deployment
  • Conduct rigorous experimentation, A/B testing, and model performance monitoring to continuously improve and optimize AI/ML solutions
  • Promote and enforce best practices in software development, including clean code, unit testing, documentation, and version control
  • Mentor junior team members, contribute to technical discussions, and drive the adoption of new technologies and methodologies within the team
  • Effectively communicate complex technical concepts and model results to both technical and non-technical stakeholders.
What we offer
What we offer
  • Not explicitly stated.
  • Fulltime
Read More
Arrow Right

Senior Software Engineer

We are looking for a talented and experienced Senior Software Engineer to join o...
Location
Location
United States , Westfield
Salary
Salary:
Not provided
https://www.roberthalf.com Logo
Robert Half
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Minimum of 5 years of software engineering experience
  • At least 2 years focused on AI/ML systems
  • At least 2 years in technical leadership roles
  • Proven expertise in developing and deploying agent-based solutions utilizing frameworks like LangChain or AutoGen
  • Strong understanding of multi-tenant systems, including tenant isolation, security protocols, and resource optimization
  • Hands-on experience with infrastructure-as-code tools such as Terraform and serverless compute frameworks
  • Familiarity with modern MLOps practices and AI application patterns like vector databases and retrieval-augmented generation
  • Excellent communication skills, capable of articulating technical concepts to stakeholders and leadership
  • Ability to innovate, adapt, and drive improvements in engineering processes and practices
  • Experience in integrating AI orchestration layers for task-based tools or chat applications
Job Responsibility
Job Responsibility
  • Lead and mentor a small team of AI engineers, guiding technical direction and ensuring timely delivery of high-quality solutions
  • Design and implement AI-driven agents and chat-based tools that function reliably within a multi-tenant architecture
  • Integrate AI workflows with services from providers like OpenAI, Anthropic, or open-source models, ensuring seamless orchestration
  • Optimize platform scalability, performance, and security across tenant environments
  • Take full ownership of projects, managing everything from architectural design to deployment and performance monitoring
  • Collaborate closely with product management, DevOps, and other engineering leads to align project scope and timelines
  • Stay informed on advancements in AI technologies and explore new approaches for real-world applications
  • Foster a fast-paced, experimentation-driven culture that encourages innovative thinking and rapid delivery
  • Champion best practices in engineering to drive continuous improvement and operational excellence
What we offer
What we offer
  • Medical, vision, dental, and life and disability insurance
  • Eligible to enroll in company 401(k) plan
Read More
Arrow Right

Technical Lead, AI and Engineering, HBS Foundry

Technical Lead acts as the primary architectural authority and engineering manag...
Location
Location
United States , Boston
Salary
Salary:
170000.00 - 200000.00 USD / Year
hbs.edu Logo
HBS
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Minimum of seven years’ post-secondary education or relevant work experience
  • Bachelor’s or Master’s degree in Computer Science, Engineering, Data Science, or a related technical field
  • 5+ years of Software Engineering (ideally AI-focused) experience, including 2+ years in a Lead Engineer, Architect, or equivalent technical leadership role
  • Deep understanding of and strong hands-on experience with: Application & Product Engineering
  • Modern web application architecture, including frontend, middleware, backend, and full-stack development
  • Languages and frameworks including Python, TypeScript/JavaScript, NextJS, React, Node.js, and SQL
  • APIs, Services & Data
  • Microservices and API design, including REST and GraphQL
  • Asynchronous and event-driven systems
  • Authentication and authorization
Job Responsibility
Job Responsibility
  • Lead development and implementation of complex information technology projects and architecture to solve problems with wide impact, requiring high levels of functional integration
  • Act as the primary architectural authority and engineering manager for the HBS Foundry platform
  • Architect, design, and provide hands-on contribution to the development and deployment of scalable, secure, and maintainable technical solutions
  • Make technical design choices and follow technical standards
  • Balance fast, lean innovation with practical implementation, technical feasibility, technical debt, and system stability
  • Evaluate emerging AI technologies and determine their applicability to platform needs
  • Manage the engineering team in alignment with product roadmap and critical deadlines
  • Translate business and product requirements into detailed technical specs and engineering tasks
  • Mentor software developers, conduct code reviews, and ensure high-quality coding
  • Oversee the software product development lifecycle, including MLOps, CI/CD pipelines, testing frameworks, and release management, as well as data pipelines, and governance
What we offer
What we offer
  • Generous paid time off including parental leave
  • Medical, dental, and vision health insurance coverage starting on day one
  • Retirement plans with university contributions
  • Wellbeing and mental health resources
  • Support for families and caregivers
  • Professional development opportunities including tuition assistance and reimbursement
  • Commuter benefits, discounts and campus perks
  • Fulltime
Read More
Arrow Right

Staff Software Engineer, AI Agent Platform

The Geico AI Agent Platform team is seeking an exceptional Staff Software Engine...
Location
Location
United States , Chevy Chase; New York City
Salary
Salary:
115000.00 - 260000.00 USD / Year
geico.com Logo
Geico
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Bachelor’s degree in computer science, Engineering, Mathematics, or a related field
  • an advanced degree (master’s or Ph.D.) is highly desirable
  • 6+ years of hands-on experience in designing, implementing, and maintaining multi-tenant AIML systems and platforms in production environments
  • 6+ years of experience working with cloud platforms such as Azure and AWS
  • Extensive expertise in designing and deploying large-scale data pipelines and real-time inference systems and managing the end-to-end AI Agent and/or AIML system development lifecycles, including configuration, evaluation, monitoring, observability and AuthN/AuthR considerations
  • 6+ years of experience working with common backend systems & tools (e.g, Kubernetes, Temporal, OpenSearch, PostgreSQL, Redis, Neo4J, etc.)
  • Deep understanding of Docker, container optimization, and multi-stage builds
  • Experience with Prometheus, Grafana, Open Telemetry and distributed tracing
  • 3+ years of experience building front-end web applications using frameworks such as React and/or Next.JS
  • Deep proficiency in programming languages such as Python, Java, Go, etc., with a strong emphasis on coding excellence
Job Responsibility
Job Responsibility
  • Architect and implement scalable multi-tenant backend systems for building AI agent workflows, including agent configuration, offline evaluation, synthetic data generation, workflow simulation, agent marketplace, etc. using Azure Kubernetes Service (AKS), FastAPI, etc., ensuring economy of scale and control cost of maintenance
  • Collaborate with Design team to architect and implement frontend experiences and workflows for onboarding both technical and non-technical stakeholders, maximizing user adoption and successful AI agent development
  • Develop observability frameworks to ensure 99.9%+ uptime for AI agent platforms through robust monitoring, alerting, and incident response procedures
  • Evaluate and (if desirable) integrate cutting-edge GenAI frameworks, libraries and vendors to maintain a state-of-the-art technology stack, including hybrid cloud solutions with AWS/GCP as backup or specialized use cases
  • Architect and implement scalable, high-performance machine learning platforms and systems capable of processing large data volumes and supporting real-time decision making and workflows
  • Oversee the end-to-end lifecycle of AI agent applications, ensuring robust testing, deployment, and ongoing monitoring
  • Ensure adherence to company production readiness standards, security protocols, and regulatory compliance throughout the development lifecycle
  • Continuously optimize platform performance, reducing latency and improving throughput for AI agent workloads
  • Design and implement backup, recovery, and business continuity plans for hosted platform applications & services
  • Design and maintain robust CI/CD pipelines for ML model deployment using Azure DevOps, GitHub Actions, and MLOps tools
What we offer
What we offer
  • Comprehensive Total Rewards program that offers personalized coverage tailor-made for you and your family’s overall well-being
  • Financial benefits including market-competitive compensation
  • a 401K savings plan vested from day one that offers a 6% match
  • performance and recognition-based incentives
  • and tuition assistance
  • Access to additional benefits like mental healthcare as well as fertility and adoption assistance
  • Supports flexibility- We provide workplace flexibility as well as our GEICO Flex program, which offers the ability to work from anywhere in the US for up to four weeks per year
  • Fulltime
Read More
Arrow Right

Staff Software Engineer - AI/ML Infra

GEICO AI platform and Infrastructure team is seeking an exceptional Senior ML Pl...
Location
Location
United States , Palo Alto
Salary
Salary:
90000.00 - 300000.00 USD / Year
geico.com Logo
Geico
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Bachelor’s degree in computer science, Engineering, or related technical field (or equivalent experience)
  • 8+ years of software engineering experience with focus on infrastructure, platform engineering, or MLOps
  • 3+ years of hands-on experience with machine learning infrastructure and deployment at scale
  • 2+ years of experience working with Large Language Models and transformer architectures
  • Proficient in Python
  • strong skills in Go, Rust, or Java preferred
  • Proven experience working with open source LLMs (Llama 2/3, Qwen, Mistral, Gemma, Code Llama, etc.)
  • Proficient in Kubernetes including custom operators, helm charts, and GPU scheduling
  • Deep expertise in Azure services (AKS, Azure ML, Container Registry, Storage, Networking)
  • Experience implementing and operating feature stores (Chronon, Feast, Tecton, Azure ML Feature Store, or custom solutions)
Job Responsibility
Job Responsibility
  • Design and implement scalable infrastructure for training, fine-tuning, and serving open source LLMs (Llama, Mistral, Gemma, etc.)
  • Architect and manage Kubernetes clusters for ML workloads, including GPU scheduling, autoscaling, and resource optimization
  • Design, implement, and maintain feature stores for ML model training and inference pipelines
  • Build and optimize LLM inference systems using frameworks like vLLM, TensorRT-LLM, and custom serving solutions
  • Ensure 99.9%+ uptime for ML platforms through robust monitoring, alerting, and incident response procedures
  • Design and implement ML platforms using DataRobot, Azure Machine Learning, Azure Kubernetes Service (AKS), and Azure Container Instances
  • Develop and maintain infrastructure using Terraform, ARM templates, and Azure DevOps
  • Implement cost-effective solutions for GPU compute, storage, and networking across Azure regions
  • Ensure ML platforms meet enterprise security standards and regulatory compliance requirements
  • Evaluate and potentially implement hybrid cloud solutions with AWS/GCP as backup or specialized use cases
What we offer
What we offer
  • Comprehensive Total Rewards program that offers personalized coverage tailor-made for you and your family’s overall well-being
  • Financial benefits including market-competitive compensation
  • a 401K savings plan vested from day one that offers a 6% match
  • performance and recognition-based incentives
  • and tuition assistance
  • Access to additional benefits like mental healthcare as well as fertility and adoption assistance
  • Supports flexibility- We provide workplace flexibility as well as our GEICO Flex program, which offers the ability to work from anywhere in the US for up to four weeks per year
  • Fulltime
Read More
Arrow Right

Staff Software Engineer - AI/ML Platform

GEICO AI platform and Infrastructure team is seeking an exceptional Senior ML Pl...
Location
Location
United States , Chevy Chase; New York City; Palo Alto
Salary
Salary:
115000.00 - 300000.00 USD / Year
geico.com Logo
Geico
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Bachelor’s degree in computer science, Engineering, or related technical field (or equivalent experience)
  • 8+ years of software engineering experience with focus on infrastructure, platform engineering, or MLOps
  • 3+ years of hands-on experience with machine learning infrastructure and deployment at scale
  • 2+ years of experience working with Large Language Models and transformer architectures
  • Proficient in Python
  • strong skills in Go, Rust, or Java preferred
  • Proven experience working with open source LLMs (Llama 2/3, Qwen, Mistral, Gemma, Code Llama, etc.)
  • Proficient in Kubernetes including custom operators, helm charts, and GPU scheduling
  • Deep expertise in Azure services (AKS, Azure ML, Container Registry, Storage, Networking)
  • Experience implementing and operating feature stores (Chronon, Feast, Tecton, Azure ML Feature Store, or custom solutions)
Job Responsibility
Job Responsibility
  • Design and implement scalable infrastructure for training, fine-tuning, and serving open source LLMs (Llama, Mistral, Gemma, etc.)
  • Architect and manage Kubernetes clusters for ML workloads, including GPU scheduling, autoscaling, and resource optimization
  • Design, implement, and maintain feature stores for ML model training and inference pipelines
  • Build and optimize LLM inference systems using frameworks like vLLM, TensorRT-LLM, and custom serving solutions
  • Ensure 99.9%+ uptime for ML platforms through robust monitoring, alerting, and incident response procedures
  • Design and implement ML platforms using DataRobot, Azure Machine Learning, Azure Kubernetes Service (AKS), and Azure Container Instances
  • Develop and maintain infrastructure using Terraform, ARM templates, and Azure DevOps
  • Implement cost-effective solutions for GPU compute, storage, and networking across Azure regions
  • Ensure ML platforms meet enterprise security standards and regulatory compliance requirements
  • Evaluate and potentially implement hybrid cloud solutions with AWS/GCP as backup or specialized use cases
What we offer
What we offer
  • Comprehensive Total Rewards program that offers personalized coverage tailor-made for you and your family’s overall well-being
  • Financial benefits including market-competitive compensation
  • a 401K savings plan vested from day one that offers a 6% match
  • performance and recognition-based incentives
  • and tuition assistance
  • Access to additional benefits like mental healthcare as well as fertility and adoption assistance
  • Supports flexibility- We provide workplace flexibility as well as our GEICO Flex program, which offers the ability to work from anywhere in the US for up to four weeks per year
  • Fulltime
Read More
Arrow Right