CrawlJobs Logo

Staff Software Engineer - AI/ML Platform

Geico

Location Icon

Location:
United States , Chevy Chase

Category Icon

Job Type Icon

Contract Type:
Not provided

Salary Icon

Salary:

115000.00 - 300000.00 USD / Year

Job Description:

GEICO AI platform and Infrastructure team is seeking an exceptional Senior ML Platform Engineer to build and scale our machine learning infrastructure with a focus on Large Language Models (LLMs) and AI applications. This role combines deep technical expertise in cloud platforms, container orchestration, and ML operations with strong leadership and mentoring capabilities. You will be responsible for designing, implementing, and maintaining scalable, reliable systems that enable our data science and engineering teams to deploy and operate LLMs efficiently at scale. The candidate must have excellent verbal and written communication skills with a proven ability to work independently and in a team environment.

Job Responsibility:

  • Design and implement scalable infrastructure for training, fine-tuning, and serving open source LLMs (Llama, Mistral, Gemma, etc.)
  • Architect and manage Kubernetes clusters for ML workloads, including GPU scheduling, autoscaling, and resource optimization
  • Design, implement, and maintain feature stores for ML model training and inference pipelines
  • Build and optimize LLM inference systems using frameworks like vLLM, TensorRT-LLM, and custom serving solutions
  • Ensure 99.9%+ uptime for ML platforms through robust monitoring, alerting, and incident response procedures
  • Design and implement ML platforms using DataRobot, Azure Machine Learning, Azure Kubernetes Service (AKS), and Azure Container Instances
  • Develop and maintain infrastructure using Terraform, ARM templates, and Azure DevOps
  • Implement cost-effective solutions for GPU compute, storage, and networking across Azure regions
  • Ensure ML platforms meet enterprise security standards and regulatory compliance requirements
  • Evaluate and potentially implement hybrid cloud solutions with AWS/GCP as backup or specialized use cases
  • Design and maintain robust CI/CD pipelines for ML model deployment using Azure DevOps, GitHub Actions, and MLOps tools
  • Implement automated model training, validation, deployment, and monitoring workflows
  • Set up comprehensive observability using Prometheus, Grafana, Azure Monitor, and custom dashboards
  • Continuously optimize platform performance, reducing latency and improving throughput for ML workloads
  • Design and implement backup, recovery, and business continuity plans for ML platforms
  • Mentor junior engineers and data scientists on platform best practices, infrastructure design, and ML operations
  • Lead comprehensive code reviews focusing on scalability, reliability, security, and maintainability
  • Design and deliver technical onboarding programs for new team members joining the ML platform team
  • Establish and champion engineering standards for ML infrastructure, deployment practices, and operational procedures
  • Create technical documentation, runbooks, and deliver internal training sessions on platform capabilities
  • Work closely with data scientists to understand requirements and optimize workflows for model development and deployment
  • Collaborate with product engineering teams to integrate ML capabilities into customer-facing applications
  • Support research teams with infrastructure for experimenting with cutting-edge LLM techniques and architectures
  • Present technical solutions and platform roadmaps to leadership and cross-functional stakeholders

Requirements:

  • Bachelor’s degree in computer science, Engineering, or related technical field (or equivalent experience)
  • 8+ years of software engineering experience with focus on infrastructure, platform engineering, or MLOps
  • 3+ years of hands-on experience with machine learning infrastructure and deployment at scale
  • 2+ years of experience working with Large Language Models and transformer architectures
  • Proficient in Python
  • strong skills in Go, Rust, or Java preferred
  • Proven experience working with open source LLMs (Llama 2/3, Qwen, Mistral, Gemma, Code Llama, etc.)
  • Proficient in Kubernetes including custom operators, helm charts, and GPU scheduling
  • Deep expertise in Azure services (AKS, Azure ML, Container Registry, Storage, Networking)
  • Experience implementing and operating feature stores (Chronon, Feast, Tecton, Azure ML Feature Store, or custom solutions)
  • Hands-on experience with inference optimization using vLLM, TensorRT-LLM, Triton Inference Server, or similar
  • Advanced experience with Azure DevOps, GitHub Actions, Jenkins, or similar CI/CD platforms
  • Proficiency with Terraform, ARM templates, Pulumi, or CloudFormation
  • Deep understanding of Docker, container optimization, and multi-stage builds
  • Experience with Prometheus, Grafana, ELK stack, Azure Monitor, and distributed tracing
  • Knowledge of both SQL and NoSQL databases, data warehousing, and vector databases
  • Demonstrated track record of mentoring engineers and leading technical initiatives
  • Experience leading design reviews with focus on compliance, performance, and reliability
  • Excellent ability to explain complex technical concepts to diverse audiences
  • Strong analytical and troubleshooting skills for complex distributed systems
  • Experience managing cross-functional technical projects and coordinating with multiple stakeholders

Nice to have:

  • Master’s degree in computer science, Machine Learning, or related field
  • 8+ years of platform engineering or infrastructure experience
  • Experience with Staff Engineer or Tech Lead roles in ML/AI organizations
  • Background in distributed systems and high-performance computing
  • Open-source contributions to ML infrastructure projects or LLM frameworks
  • Multi-Cloud Experience: Hands-on experience with Azure, AWS (SageMaker, EKS) and/or GCP (Vertex AI, GKE)
  • Experience with specialized hardware (A100s, H100s, TPUs, TEEs) and optimization
  • RLHF & Fine-tuning: Experience with Reinforcement Learning from Human Feedback and LLM fine-tuning workflows
  • Experience with Milvus, Pinecone, Weaviate, Qdrant, or similar vector storage solutions
  • Deep experience with MLflow, Kubeflow, DataRobot, or similar platforms
  • Understanding of AI safety principles, model governance, and regulatory compliance
  • Background in regulated industries with understanding of data privacy requirements
  • Experience supporting ML research teams and academic partnerships
  • Deep understanding of GPU optimization, memory management, and high-throughput systems
What we offer:
  • Comprehensive Total Rewards program that offers personalized coverage tailor-made for you and your family’s overall well-being
  • Financial benefits including market-competitive compensation
  • a 401K savings plan vested from day one that offers a 6% match
  • performance and recognition-based incentives
  • and tuition assistance
  • Access to additional benefits like mental healthcare as well as fertility and adoption assistance
  • Supports flexibility- We provide workplace flexibility as well as our GEICO Flex program, which offers the ability to work from anywhere in the US for up to four weeks per year

Additional Information:

Job Posted:
February 21, 2026

Employment Type:
Fulltime
Work Type:
Hybrid work
Job Link Share:

Looking for more opportunities? Search for other job offers that match your skills and interests.

Briefcase Icon

Similar Jobs for Staff Software Engineer - AI/ML Platform

Staff Software Engineer, Backend

The Staff Engineer will work closely with AI/ML engineers, product managers, app...
Location
Location
United States , NYC
Salary
Salary:
160000.00 - 190000.00 USD / Year
conductor.com Logo
Conductor
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Completed studies in Computer Science, Mathematics, engineering or a related field or equivalent professional experience
  • 8+ years of experience in software development, with experience in product-driven companies
  • Strong expertise in system design, distributed computing, and scalable architecture patterns for handling large datasets and high-throughput applications
  • Proficiency in multiple programming languages with strong Python coding skills. Experience with Java is highly valued
  • Strong database experience including both SQL and NoSQL systems, with knowledge of data modeling and optimization techniques
  • Experience with AI/ML technologies including LLMs, vector databases (e.g., Milvus), embeddings, and ML frameworks
  • Knowledge of MLOps practices, model deployment, and AI system integration in production environments
  • Experience working across the full software development lifecycle including CI/CD, monitoring, testing, and production deployment
  • Proven track record of technical leadership, mentoring engineers, and driving engineering excellence within teams
  • Up-to-date with rapidly-evolving technologies and demonstrated ability to evaluate and adopt new tools and frameworks
Job Responsibility
Job Responsibility
  • Lead the technical architecture, design, and implementation of large-scale distributed systems and data platforms to support customer needs and business growth
  • Oversee the planning, execution, and successful delivery of complex engineering projects, ensuring adherence to engineering best practices and quality standards
  • Design and build scalable, high-performance backend systems and APIs that handle millions of requests and large datasets efficiently
  • Architect robust data processing pipelines and ETL workflows using modern cloud technologies and distributed computing frameworks
  • Drive technical decision-making across the engineering organization, evaluating trade-offs and establishing engineering standards and practices
  • Lead cross-functional collaboration with product, AI/ML engineering, data engineering, and infrastructure teams to deliver comprehensive solutions
  • Build and maintain CI/CD pipelines, monitoring systems, and deployment automation to ensure reliable software delivery
  • Implement AI/ML capabilities including LLM integration, vector databases, and intelligent content processing workflows
  • Mentor senior and junior engineers, fostering technical excellence and knowledge sharing within the engineering organization
What we offer
What we offer
  • 100% covered employee medical plan
  • a dental & vision plans
  • 401(k) with employer contribution
  • an unlimited vacation policy
  • 10 sick days
  • short-term disability
  • long-term disability
  • generous paid parental leave
  • employee assistance program
  • flexible savings accounts
  • Fulltime
Read More
Arrow Right

Senior Staff Machine Learning Engineer

Help design our AI platform and develop our next generation of machine learning ...
Location
Location
United States , San Francisco
Salary
Salary:
216500.00 - 324500.00 USD / Year
gofundme.com Logo
GoFundMe
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • 9+ years of hands-on experience in machine learning engineering, AI development, software engineering, or related fields
  • Experience emphasizing secure, large-scale, distributed system design, AI/ML pipeline development, and implementation
  • Extensive experience designing, developing, and operating scalable backend systems
  • Experience applying software engineering best practices such as domain-driven design, event-driven architectures, and microservices
  • Deep expertise in agentic workflows, AI evaluation solutions, prompt management, and secure AI development and testing practices
  • Strong knowledge of relational and document-based databases, data storage paradigms, and efficient RESTful API design
  • Experience establishing robust CI/CD pipelines, automated testing (unit and integration), and deployment practices
  • Strong leadership skills, including effective planning and management of complex projects, mentoring of team members, and fostering a collaborative, high-performing engineering culture
  • Excellent communicator, able to articulate complex technical concepts clearly to both technical and non-technical stakeholders
  • Bachelor's degree in Computer Science, Software Engineering, or a related technical field (preferred)
Job Responsibility
Job Responsibility
  • Design and implement AI platforms to enable scalable and secure access to LLMs from multiple model providers for diverse use cases
  • Design and implement agentic workflows, agentic tool ecosystems, and LLM prompt management solutions
  • Design, build, and optimize scalable model training, fine tuning, and inference pipelines, ensuring robust integration with production systems
  • Influence technical strategy and approach to developing embedding stores, vector databases, and other reusable assets
  • Lead initiatives to streamline ML and AI workflows, improve operational efficiency, and establish standardized procedures to achieve consistent, high-quality results across our AI systems
  • Design and develop backend services and RESTful APIs using Python and FastAPI, integrating seamlessly with ML pipelines and services
  • Take operational responsibility for team-owned services, including performance monitoring, optimization, troubleshooting, and participation in an on-call rotation
  • Collaborate with both technical and non-technical colleagues, including data and applied scientists, software engineers, product managers, and business stakeholders, to deliver reliable and scalable ML-driven products
  • Coach and mentor fellow ML engineers, promoting a culture of collaboration, continuous improvement, and engineering excellence within the team
  • Employ a diverse set of tools and platforms including Python, AWS, Databricks, Docker, Kubernetes, FastAPI, Terraform, Snowflake, Coralogix, and GitHub to build, deploy, and maintain scalable, highly available machine learning infrastructure
What we offer
What we offer
  • Competitive pay
  • Comprehensive healthcare benefits
  • Financial assistance for things like hybrid work, family planning
  • Generous parental leave
  • Flexible time-off policies
  • Mental health and wellness resources
  • Learning, development, and recognition programs
  • Fulltime
Read More
Arrow Right

Staff Data Engineer

A VC-backed retail AI scale-up is expanding its engineering team and is looking ...
Location
Location
United States
Salary
Salary:
Not provided
weareorbis.com Logo
Orbis Consultants
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • 5+ years in software development and data engineering with ownership of production-grade systems
  • Proven expertise in Spark (Databricks, EMR, or similar) and scaling it in production
  • Strong knowledge of distributed computing and modern data modeling approaches
  • Solid programming skills in Python, with an emphasis on clean, maintainable code
  • Hands-on experience with SQL and NoSQL databases (e.g., PostgreSQL, DynamoDB, Cassandra)
  • Excellent communicator who can influence and partner across teams
Job Responsibility
Job Responsibility
  • Design and evolve distributed, cloud-based data infrastructure that supports both real-time and batch processing at scale
  • Build high-performance data pipelines that power analytics, AI/ML workloads, and integrations with third-party platforms
  • Champion data reliability, quality, and observability, introducing automation and monitoring across pipelines
  • Collaborate closely with engineering, product, and AI teams to deliver data solutions for business-critical initiatives
What we offer
What we offer
  • Fully remote
  • great equity
Read More
Arrow Right

Staff Data Engineer

A VC-backed retail AI scale-up is expanding its engineering team and is looking ...
Location
Location
United States
Salary
Salary:
Not provided
weareorbis.com Logo
Orbis Consultants
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • 5+ years in software development and data engineering with ownership of production-grade systems
  • Proven expertise in Spark (Databricks, EMR, or similar) and scaling it in production
  • Strong knowledge of distributed computing and modern data modeling approaches
  • Solid programming skills in Python, with an emphasis on clean, maintainable code
  • Hands-on experience with SQL and NoSQL databases (e.g., PostgreSQL, DynamoDB, Cassandra)
  • Excellent communicator who can influence and partner across teams
Job Responsibility
Job Responsibility
  • Design and evolve distributed, cloud-based data infrastructure that supports both real-time and batch processing at scale
  • Build high-performance data pipelines that power analytics, AI/ML workloads, and integrations with third-party platforms
  • Champion data reliability, quality, and observability, introducing automation and monitoring across pipelines
  • Collaborate closely with engineering, product, and AI teams to deliver data solutions for business-critical initiatives
What we offer
What we offer
  • Fully remote
  • great equity
Read More
Arrow Right

System Software Expert

Designs, develops, troubleshoots and debugs software programs for software enhan...
Location
Location
India , Bangalore
Salary
Salary:
Not provided
https://www.hpe.com/ Logo
Hewlett Packard Enterprise
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Bachelor's or Master's degree in Computer Science, Information Systems, or equivalent
  • Typically 6-10 years experience
  • Extensive experience with multiple software systems design tools and languages
  • Excellent analytical and problem solving skills
  • Experience in overall architecture of software systems for products and solutions
  • Designing and integrating software systems running on multiple platform types into overall architecture
  • Evaluating forms and processes for software systems testing and methodology
  • Excellent written and verbal communication skills
  • mastery in English and local language
  • Strong in programming (python preferred)
Job Responsibility
Job Responsibility
  • Leads multiple project teams of other software systems engineers and internal and outsourced development partners responsible for all stages of design and development for complex products and platforms
  • Manages and expands relationships with internal and outsourced development partners on software systems design and development
  • Reviews and evaluates designs and project activities for compliance with systems design and development guidelines and standards
  • Provides domain-specific expertise and overall software systems leadership and perspective to cross-organization projects, programs, and activities
  • Drives innovation and integration of new technologies into projects and activities in the software systems design organization
  • Provides guidance and mentoring to less- experienced staff members
What we offer
What we offer
  • Health & Wellbeing benefits
  • Personal & Professional Development programs
  • Unconditional Inclusion environment
  • Comprehensive suite of benefits supporting physical, financial and emotional wellbeing
  • Fulltime
Read More
Arrow Right
New

Principal Software Engineer, AI Developer Tools

At Docker, we make app development easier so developers can focus on what matter...
Location
Location
United States , Seattle
Salary
Salary:
232000.00 - 319000.00 USD / Year
docker.com Logo
Docker
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • 10+ years software engineering experience with 3+ years in Staff or Principal Engineer roles
  • Deep expertise in AI/ML technologies with hands-on production experience building LLM-powered applications, AI agents, or AI-assisted developer tools
  • Strong understanding of LLM APIs (OpenAI, Anthropic, etc.), prompt engineering, agent orchestration frameworks, and practical applications of AI in software development workflows
  • Proven track record of architecting and building highly scalable distributed systems and developer-facing platforms
  • Production experience with modern cloud-native infrastructure including Kubernetes, GitOps deployment patterns, observability systems, and CI/CD pipelines
  • Proficiency in Go (preferred), Rust, Java, or Python with strong software engineering fundamentals
  • Experience designing developer tools, platform engineering systems, or internal tools that enable other teams
  • Exceptional product and platform mindset considering business outcomes, developer experience, and technical trade-offs
  • Strong communication skills with ability to influence technical and non-technical stakeholders across the organization
  • Track record of technical mentorship and elevating engineering teams' capabilities
Job Responsibility
Job Responsibility
  • Define the long-term technical vision and architecture for AI-powered developer tools and the self-service platform that enables teams to build their own AI agents
  • Establish architectural patterns, technical standards, and best practices for LLM integration, AI agent development, and production AI systems serving developers
  • Lead technical strategy for platform capabilities including deployment frameworks (ArgoCD/GitOps), observability integration (Grafana), security controls, and operational tooling for AI developer tools
  • Design highly available, scalable infrastructure for hosting AI agents and developer tools with predictable performance and intelligent resource management
  • Drive technical decisions on AI technology choices, LLM provider strategies, prompt engineering approaches, and agent orchestration frameworks
  • Partner with Senior Manager and product leadership to align technical architecture with business objectives and productization opportunities
  • Architect and build production-ready AI agents for developer productivity including code review assistants, test generators, deployment diagnostics, and incident response automation
  • Design and implement the self-service platform infrastructure that reduces time-to-production for new AI tools from weeks to days
  • Build systems that accelerate adoption of AI-native development tools (Claude Code, Cursor, Warp) across Docker's engineering organization
  • Establish reliability, security, and performance standards for AI systems including SLOs, monitoring, incident response, and cost management
What we offer
What we offer
  • Freedom & flexibility
  • fit your work around your life
  • Designated quarterly Whaleness Days plus end of year Whaleness break
  • Home office setup
  • we want you comfortable while you work
  • 16 weeks of paid Parental leave
  • Technology stipend equivalent to $100 net/month
  • PTO plan that encourages you to take time to do the things you enjoy
  • Training stipend for conferences, courses and classes
  • Equity
  • Fulltime
Read More
Arrow Right

System Software Expert

Designs, develops, troubleshoots and debugs software programs for software enhan...
Location
Location
India , Bangalore
Salary
Salary:
Not provided
https://www.hpe.com/ Logo
Hewlett Packard Enterprise
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Bachelor's or Master's degree in Computer Science, Information Systems, or equivalent
  • Typically 6-10 years experience
  • Extensive experience with multiple software systems design tools and languages
  • Excellent analytical and problem solving skills
  • Experience in overall architecture of software systems for products and solutions
  • Designing and integrating software systems running on multiple platform types into overall architecture
  • Evaluating forms and processes for software systems testing and methodology, including writing and execution of test plans, debugging, and testing scripts and tools
  • Excellent written and verbal communication skills
  • mastery in English and local language
  • Strong in programming (python preferred)
Job Responsibility
Job Responsibility
  • Leads multiple project teams of other software systems engineers and internal and outsourced development partners responsible for all stages of design and development for complex products and platforms, including solution design, analysis, coding, testing, and integration
  • Manages and expands relationships with internal and outsourced development partners on software systems design and development
  • Reviews and evaluates designs and project activities for compliance with systems design and development guidelines and standards
  • provides tangible feedback to improve product quality and mitigate failure risk
  • Provides domain-specific expertise and overall software systems leadership and perspective to cross-organization projects, programs, and activities
  • Drives innovation and integration of new technologies into projects and activities in the software systems design organization
  • Provides guidance and mentoring to less-experienced staff members
What we offer
What we offer
  • Comprehensive suite of benefits that supports physical, financial and emotional wellbeing
  • Career development programs
  • Inclusive workplace culture
  • Fulltime
Read More
Arrow Right

System software developer

Designs, develops, troubleshoots and debugs software programs for software enhan...
Location
Location
India , Bangalore
Salary
Salary:
Not provided
https://www.hpe.com/ Logo
Hewlett Packard Enterprise
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Bachelor's or Master's degree in Computer Science, Information Systems, or equivalent
  • Typically 4-6 years experience
  • Good in programming (python preferred)
  • Designing and leading a module/component of a larger solution
  • Good troubleshooting skills
  • Exposure on Networking basics
  • Exposure on AI/ML basics
  • Good exposure on Linux platform internals
  • Good in system software development
Job Responsibility
Job Responsibility
  • Designs enhancements, updates, and programming changes for portions and subsystems of systems software
  • Analyzes design and determines coding, programming, and integration activities required
  • Writes and executes complete testing plans, protocols, and documentation
  • Leads a project team of other software systems engineers
  • Collaborates and communicates with management and development partners
  • Represents the software systems engineering team for all phases of larger projects
  • Provides guidance and mentoring to less-experienced staff members
What we offer
What we offer
  • Health & Wellbeing benefits
  • Personal & Professional Development programs
  • Unconditional Inclusion environment
  • Fulltime
Read More
Arrow Right