Lead Software Engineer, DevOps / MLOps Job at Capital One (New York)

Sr. DevOps Engineer AWS

Location

United States

Salary:

145000.00 - 165000.00 USD / Year

Megazone Cloud US

Expiration Date

Until further notice

Requirements

Bachelor's degree or 10+ years of professional or military experience
8+ years of experience as a technical specialist
2+ years of hands-on programming experience in languages such as Python, Ruby, Go, Swift, Java, .NET, C++, or a similar object-oriented language
Experience with architecting and automating cloud native technologies, deploying applications, and provisioning infrastructure
Hands-on experience with Infrastructure as Code, using CloudFormation, Terraform, or other tools
Experience architecting cloud native CI/CD workflows and tools, such as Jenkins, Bamboo, TeamCity, AWS CodeDeploy, and/or GitLab
Hands-on experience with microservices and distributed application architecture, such as containers, Kubernetes, and/or serverless technology
Experience with the full software development lifecycle and delivery using Agile practices
Experience with Chef, Puppet, Salt, or Ansible in production environments
Knowledge of IP networking, VPNs, DNS, load balancing, and firewalls

Job Responsibility

Advise customers on their DevOps journey, manage projects independently and also deliver as part of larger teams
Work with customers and partners, internalizing their context, while using your business and technical skills to design solutions based on requirements and constraints
Work towards customer business outcomes, ensuring there is a strong connection between delivery activities and business objectives
Own and complete key tasks and deliverables, and collaborate with others to define and implement optimal, complete solutions based on stakeholders' needs
Guide customers’ technical investments, maximizing alignment with the platform and ease of adoption as new services and products become available
Design and deliver solutions that solve for new levels of complexity, scale and performance, and in turn, enable breakthrough innovations. Create and apply frameworks, methods, best practices and artifacts that deliver prescriptive guidance to customers, and publish and present them in large forums and across various media platforms
Experience with seamless/automated build scripts used for release management across all environments
Willingness to travel to client locations and deliver professional services

What we offer

Discretionary bonus

Fulltime

DevOps Engineer AWS

Overview Application DevOps Engineer (L5) Key Responsibilities: Previous exper...

Location

United States

Salary:

Not provided

Megazone Cloud US

Expiration Date

Until further notice

Requirements

Bachelor's degree or 5+ years of professional or military experience
5+ years of experience as a technical specialist
2+ years of hands-on programming experience in languages such as Python, Ruby, Go, Swift, Java, .NET, C++, or a similar object-oriented language
Experience with automating cloud native technologies, deploying applications, and provisioning infrastructure
Hands-on experience with Infrastructure as Code, using CloudFormation, Terraform, or other tools
Experience developing cloud native CI/CD workflows and tools, such as Jenkins, Bamboo, TeamCity, AWS CodeDeploy, and/or GitLab
Hands-on experience with microservices and distributed application architecture, such as containers, Kubernetes, and/or serverless technology
Experience with the full software development lifecycle and delivery using Agile practices
Experience with Chef, Puppet, Salt, or Ansible in production environments
Knowledge of IP networking, VPNs, DNS, load balancing, and firewalls

Job Responsibility

Previous experience in a lead DevOps role
Assist on larger projects or run smaller opportunities independently
Technical depth and hands-on implementation experience of various practices and tools in the DevOps toolchain
Comfortable rolling up their sleeves to design and code modules for infrastructure, application, and processes

Fulltime

Staff Software Engineer - AI/ML Infra

GEICO AI platform and Infrastructure team is seeking an exceptional Senior ML Pl...

Location

United States, Chevy Chase; New York City; Palo Alto

Salary:

115000.00 USD / Year

Geico

Expiration Date

Until further notice

Requirements

Bachelor’s degree in Computer Science, Engineering, or a related technical field (or equivalent experience)
8+ years of software engineering experience with focus on infrastructure, platform engineering, or MLOps
3+ years of hands-on experience with machine learning infrastructure and deployment at scale
2+ years of experience working with Large Language Models and transformer architectures
Proficient in Python; strong skills in Go, Rust, or Java preferred
Proven experience working with open source LLMs (Llama 2/3, Qwen, Mistral, Gemma, Code Llama, etc.)
Proficient in Kubernetes, including custom operators, Helm charts, and GPU scheduling
Deep expertise in Azure services (AKS, Azure ML, Container Registry, Storage, Networking)
Experience implementing and operating feature stores (Chronon, Feast, Tecton, Azure ML Feature Store, or custom solutions)

Job Responsibility

Design and implement scalable infrastructure for training, fine-tuning, and serving open source LLMs (Llama, Mistral, Gemma, etc.)
Architect and manage Kubernetes clusters for ML workloads, including GPU scheduling, autoscaling, and resource optimization
Design, implement, and maintain feature stores for ML model training and inference pipelines
Build and optimize LLM inference systems using frameworks like vLLM, TensorRT-LLM, and custom serving solutions
Ensure 99.9%+ uptime for ML platforms through robust monitoring, alerting, and incident response procedures
Design and implement ML platforms using DataRobot, Azure Machine Learning, Azure Kubernetes Service (AKS), and Azure Container Instances
Develop and maintain infrastructure using Terraform, ARM templates, and Azure DevOps
Implement cost-effective solutions for GPU compute, storage, and networking across Azure regions
Ensure ML platforms meet enterprise security standards and regulatory compliance requirements
Evaluate and potentially implement hybrid cloud solutions with AWS/GCP as backup or specialized use cases

What we offer

Comprehensive Total Rewards program that offers personalized coverage tailor-made for you and your family’s overall well-being
Financial benefits including market-competitive compensation; a 401(k) savings plan vested from day one that offers a 6% match; performance and recognition-based incentives; and tuition assistance
Access to additional benefits like mental healthcare as well as fertility and adoption assistance
Flexibility: we provide workplace flexibility as well as our GEICO Flex program, which offers the ability to work from anywhere in the US for up to four weeks per year

Fulltime

Staff Software Engineer - AI/ML Platform

GEICO AI platform and Infrastructure team is seeking an exceptional Senior ML Pl...

Location

United States, Chevy Chase; New York City; Palo Alto

Salary:

115000.00 USD / Year

Geico

Expiration Date

Until further notice

Requirements

Bachelor’s degree in Computer Science, Engineering, or a related technical field (or equivalent experience)
8+ years of software engineering experience with focus on infrastructure, platform engineering, or MLOps
3+ years of hands-on experience with machine learning infrastructure and deployment at scale
2+ years of experience working with Large Language Models and transformer architectures
Proficient in Python; strong skills in Go, Rust, or Java preferred
Proven experience working with open source LLMs (Llama 2/3, Qwen, Mistral, Gemma, Code Llama, etc.)
Proficient in Kubernetes, including custom operators, Helm charts, and GPU scheduling
Deep expertise in Azure services (AKS, Azure ML, Container Registry, Storage, Networking)
Experience implementing and operating feature stores (Chronon, Feast, Tecton, Azure ML Feature Store, or custom solutions)

Job Responsibility

Design and implement scalable infrastructure for training, fine-tuning, and serving open source LLMs (Llama, Mistral, Gemma, etc.)
Architect and manage Kubernetes clusters for ML workloads, including GPU scheduling, autoscaling, and resource optimization
Design, implement, and maintain feature stores for ML model training and inference pipelines
Build and optimize LLM inference systems using frameworks like vLLM, TensorRT-LLM, and custom serving solutions
Ensure 99.9%+ uptime for ML platforms through robust monitoring, alerting, and incident response procedures
Design and implement ML platforms using DataRobot, Azure Machine Learning, Azure Kubernetes Service (AKS), and Azure Container Instances
Develop and maintain infrastructure using Terraform, ARM templates, and Azure DevOps
Implement cost-effective solutions for GPU compute, storage, and networking across Azure regions
Ensure ML platforms meet enterprise security standards and regulatory compliance requirements
Evaluate and potentially implement hybrid cloud solutions with AWS/GCP as backup or specialized use cases

What we offer

Comprehensive Total Rewards program that offers personalized coverage tailor-made for you and your family’s overall well-being
Financial benefits including market-competitive compensation; a 401(k) savings plan vested from day one that offers a 6% match; performance and recognition-based incentives; and tuition assistance
Access to additional benefits like mental healthcare as well as fertility and adoption assistance
Flexibility: we provide workplace flexibility as well as our GEICO Flex program, which offers the ability to work from anywhere in the US for up to four weeks per year

Fulltime

Staff Software Engineer - AI/ML Infra

GEICO AI platform and Infrastructure team is seeking an exceptional Senior ML Pl...

Location

United States, Palo Alto

Salary:

90000.00 USD / Year

Geico

Expiration Date

Until further notice

Requirements

Bachelor’s degree in Computer Science, Engineering, or a related technical field (or equivalent experience)
8+ years of software engineering experience with focus on infrastructure, platform engineering, or MLOps
3+ years of hands-on experience with machine learning infrastructure and deployment at scale
2+ years of experience working with Large Language Models and transformer architectures
Proficient in Python; strong skills in Go, Rust, or Java preferred
Proven experience working with open source LLMs (Llama 2/3, Qwen, Mistral, Gemma, Code Llama, etc.)
Proficient in Kubernetes, including custom operators, Helm charts, and GPU scheduling
Deep expertise in Azure services (AKS, Azure ML, Container Registry, Storage, Networking)
Experience implementing and operating feature stores (Chronon, Feast, Tecton, Azure ML Feature Store, or custom solutions)

Job Responsibility

Design and implement scalable infrastructure for training, fine-tuning, and serving open source LLMs (Llama, Mistral, Gemma, etc.)
Architect and manage Kubernetes clusters for ML workloads, including GPU scheduling, autoscaling, and resource optimization
Design, implement, and maintain feature stores for ML model training and inference pipelines
Build and optimize LLM inference systems using frameworks like vLLM, TensorRT-LLM, and custom serving solutions
Ensure 99.9%+ uptime for ML platforms through robust monitoring, alerting, and incident response procedures
Design and implement ML platforms using DataRobot, Azure Machine Learning, Azure Kubernetes Service (AKS), and Azure Container Instances
Develop and maintain infrastructure using Terraform, ARM templates, and Azure DevOps
Implement cost-effective solutions for GPU compute, storage, and networking across Azure regions
Ensure ML platforms meet enterprise security standards and regulatory compliance requirements
Evaluate and potentially implement hybrid cloud solutions with AWS/GCP as backup or specialized use cases

What we offer

Comprehensive Total Rewards program that offers personalized coverage tailor-made for you and your family’s overall well-being
Financial benefits including market-competitive compensation; a 401(k) savings plan vested from day one that offers a 6% match; performance and recognition-based incentives; and tuition assistance
Access to additional benefits like mental healthcare as well as fertility and adoption assistance
Flexibility: we provide workplace flexibility as well as our GEICO Flex program, which offers the ability to work from anywhere in the US for up to four weeks per year

Fulltime

Staff Software Engineer, Infrastructure

We are seeking a Staff Engineer to help lead critical initiatives of our core in...

Location

United States, Los Angeles

Salary:

230000.00 - 260000.00 USD / Year

Genius Sports

Expiration Date

Until further notice

Requirements

8+ years of experience building and operating infrastructure or DevOps platforms
Experience architecting complex infrastructure with strict uptime and latency requirements across multiple regions
Ability to navigate significant ambiguity and make sound technical decisions that hold up over time
Experience building applications and automations to eliminate toil for engineering teams
Strong communication skills that enable you to drive platform adoption forward for new teams
Track record of making pragmatic tradeoff decisions across architecture, implementation, technical debt, and customer requests
Passion for mentorship and upleveling the team around you to maximize their full potential

Job Responsibility

Work with other InfraPlat leads to define and drive technical vision and implementation for a variety of projects
Engage with stakeholders from product engineering teams to scope requests, identify shared pain points within the org, and prioritize initiatives

Fulltime

Staff Software Engineer, AI Agent Platform

The Geico AI Agent Platform team is seeking an exceptional Staff Software Engine...

Location

United States, Chevy Chase; New York City

Salary:

115000.00 - 260000.00 USD / Year

Geico

Expiration Date

Until further notice

Requirements

Bachelor’s degree in Computer Science, Engineering, Mathematics, or a related field; an advanced degree (master’s or Ph.D.) is highly desirable
6+ years of hands-on experience in designing, implementing, and maintaining multi-tenant AI/ML systems and platforms in production environments
6+ years of experience working with cloud platforms such as Azure and AWS
Extensive expertise in designing and deploying large-scale data pipelines and real-time inference systems, and in managing the end-to-end AI Agent and/or AI/ML system development lifecycles, including configuration, evaluation, monitoring, observability, and AuthN/AuthR considerations
6+ years of experience working with common backend systems and tools (e.g., Kubernetes, Temporal, OpenSearch, PostgreSQL, Redis, Neo4j)
Deep understanding of Docker, container optimization, and multi-stage builds
Experience with Prometheus, Grafana, OpenTelemetry, and distributed tracing
3+ years of experience building front-end web applications using frameworks such as React and/or Next.js
Deep proficiency in programming languages such as Python, Java, Go, etc., with a strong emphasis on coding excellence

Job Responsibility

Architect and implement scalable multi-tenant backend systems for building AI agent workflows (agent configuration, offline evaluation, synthetic data generation, workflow simulation, agent marketplace, etc.) using Azure Kubernetes Service (AKS), FastAPI, and similar tools, ensuring economies of scale and controlling maintenance costs
Collaborate with the Design team to architect and implement frontend experiences and workflows for onboarding both technical and non-technical stakeholders, maximizing user adoption and successful AI agent development
Develop observability frameworks to ensure 99.9%+ uptime for AI agent platforms through robust monitoring, alerting, and incident response procedures
Evaluate and (if desirable) integrate cutting-edge GenAI frameworks, libraries and vendors to maintain a state-of-the-art technology stack, including hybrid cloud solutions with AWS/GCP as backup or specialized use cases
Architect and implement scalable, high-performance machine learning platforms and systems capable of processing large data volumes and supporting real-time decision making and workflows
Oversee the end-to-end lifecycle of AI agent applications, ensuring robust testing, deployment, and ongoing monitoring
Ensure adherence to company production readiness standards, security protocols, and regulatory compliance throughout the development lifecycle
Continuously optimize platform performance, reducing latency and improving throughput for AI agent workloads
Design and implement backup, recovery, and business continuity plans for hosted platform applications & services
Design and maintain robust CI/CD pipelines for ML model deployment using Azure DevOps, GitHub Actions, and MLOps tools

What we offer

Comprehensive Total Rewards program that offers personalized coverage tailor-made for you and your family’s overall well-being
Financial benefits including market-competitive compensation; a 401(k) savings plan vested from day one that offers a 6% match; performance and recognition-based incentives; and tuition assistance
Access to additional benefits like mental healthcare as well as fertility and adoption assistance
Flexibility: we provide workplace flexibility as well as our GEICO Flex program, which offers the ability to work from anywhere in the US for up to four weeks per year

Fulltime

Software Engineer, Forward Deployed

We are seeking a highly skilled and driven Software Engineer with a strong backg...

Location

United States, New York

Salary:

147000.00 - 216000.00 USD / Year

Invisible Technologies

Expiration Date

Until further notice

Requirements

2+ years of software engineering experience, with a strong focus on ML engineering and deploying machine learning models in production
Extensive experience in full-stack development, particularly in backend environments that support AI/ML workloads
Prior experience working directly with clients in use case discovery, product development, and leading client engagements
Strong proficiency in Python, with deep expertise in LLMs, AI Agents, and ML model development
Experience designing and deploying scalable ML systems, such as retrieval-augmented generation (RAG) pipelines and production-grade AI applications
Extensive experience with cloud platforms (AWS, GCP, Azure) and operational best practices for ML workloads
Familiarity with Kubernetes and other container management tools
Ability to write well-structured, organized code and automated unit/E2E tests
Comfortable with polyglot persistence models (SQL vs. NoSQL)
Experience with MLOps frameworks and best practices

Job Responsibility

Develop and Maintain AI/ML Systems: Build robust, scalable backend systems that support machine learning operations and data processing pipelines
Cloud Operations and Management: Oversee and optimize cloud infrastructure to ensure efficient deployment and operation of ML models
Problem Solving: Independently explore and address complex problem spaces to improve system capabilities and performance without extensive guidance
Cross-Functional Collaboration: Work closely with ML engineers and data scientists to integrate advanced ML technologies, ensuring seamless operations across various platforms
Client Engagement: Collaborate directly with Invisible’s clients, working embedded with client teams to support use case discovery, product development, and AI deployment
Innovation and R&D: Actively participate in research and development of new tools that can enhance our AI capabilities and workflows

What we offer

Bonuses and equity are included in offers above entry level

Fulltime
