Staff Software Engineer - AI/ML Infra

Geico

Location:
United States, Palo Alto

Contract Type:
Not provided

Salary:

90,000.00 - 300,000.00 USD / Year

Job Description:

The GEICO AI Platform and Infrastructure team is seeking an exceptional Senior ML Platform Engineer to build and scale our machine learning infrastructure, with a focus on Large Language Models (LLMs) and AI applications. This role combines deep technical expertise in cloud platforms, container orchestration, and ML operations with strong leadership and mentoring capabilities. You will be responsible for designing, implementing, and maintaining scalable, reliable systems that enable our data science and engineering teams to deploy and operate LLMs efficiently at scale. The candidate must have excellent verbal and written communication skills and a proven ability to work both independently and in a team environment.

Job Responsibilities:

  • Design and implement scalable infrastructure for training, fine-tuning, and serving open source LLMs (Llama, Mistral, Gemma, etc.)
  • Architect and manage Kubernetes clusters for ML workloads, including GPU scheduling, autoscaling, and resource optimization
  • Design, implement, and maintain feature stores for ML model training and inference pipelines
  • Build and optimize LLM inference systems using frameworks like vLLM, TensorRT-LLM, and custom serving solutions
  • Ensure 99.9%+ uptime for ML platforms through robust monitoring, alerting, and incident response procedures
  • Design and implement ML platforms using DataRobot, Azure Machine Learning, Azure Kubernetes Service (AKS), and Azure Container Instances
  • Develop and maintain infrastructure using Terraform, ARM templates, and Azure DevOps
  • Implement cost-effective solutions for GPU compute, storage, and networking across Azure regions
  • Ensure ML platforms meet enterprise security standards and regulatory compliance requirements
  • Evaluate and potentially implement hybrid cloud solutions with AWS/GCP as a backup or for specialized use cases
  • Design and maintain robust CI/CD pipelines for ML model deployment using Azure DevOps, GitHub Actions, and MLOps tools
  • Implement automated model training, validation, deployment, and monitoring workflows
  • Set up comprehensive observability using Prometheus, Grafana, Azure Monitor, and custom dashboards
  • Continuously optimize platform performance, reducing latency and improving throughput for ML workloads
  • Design and implement backup, recovery, and business continuity plans for ML platforms
  • Mentor junior engineers and data scientists on platform best practices, infrastructure design, and ML operations
  • Lead comprehensive code reviews focusing on scalability, reliability, security, and maintainability
  • Design and deliver technical onboarding programs for new team members joining the ML platform team
  • Establish and champion engineering standards for ML infrastructure, deployment practices, and operational procedures
  • Create technical documentation, runbooks, and deliver internal training sessions on platform capabilities
  • Work closely with data scientists to understand requirements and optimize workflows for model development and deployment
  • Collaborate with product engineering teams to integrate ML capabilities into customer-facing applications
  • Support research teams with infrastructure for experimenting with cutting-edge LLM techniques and architectures
  • Present technical solutions and platform roadmaps to leadership and cross-functional stakeholders

Requirements:

  • Bachelor’s degree in Computer Science, Engineering, or related technical field (or equivalent experience)
  • 8+ years of software engineering experience with focus on infrastructure, platform engineering, or MLOps
  • 3+ years of hands-on experience with machine learning infrastructure and deployment at scale
  • 2+ years of experience working with Large Language Models and transformer architectures
  • Proficient in Python; strong skills in Go, Rust, or Java preferred
  • Proven experience working with open source LLMs (Llama 2/3, Qwen, Mistral, Gemma, Code Llama, etc.)
  • Proficient in Kubernetes, including custom operators, Helm charts, and GPU scheduling
  • Deep expertise in Azure services (AKS, Azure ML, Container Registry, Storage, Networking)
  • Experience implementing and operating feature stores (Chronon, Feast, Tecton, Azure ML Feature Store, or custom solutions)
  • Hands-on experience with inference optimization using vLLM, TensorRT-LLM, Triton Inference Server, or similar
  • Advanced experience with Azure DevOps, GitHub Actions, Jenkins, or similar CI/CD platforms
  • Proficiency with Terraform, ARM templates, Pulumi, or CloudFormation
  • Deep understanding of Docker, container optimization, and multi-stage builds
  • Experience with Prometheus, Grafana, ELK stack, Azure Monitor, and distributed tracing
  • Knowledge of both SQL and NoSQL databases, data warehousing, and vector databases
  • Demonstrated track record of mentoring engineers and leading technical initiatives
  • Experience leading design reviews with focus on compliance, performance, and reliability
  • Excellent ability to explain complex technical concepts to diverse audiences
  • Strong analytical and troubleshooting skills for complex distributed systems
  • Experience managing cross-functional technical projects and coordinating with multiple stakeholders

Nice to have:

  • Master’s degree in Computer Science, Machine Learning, or related field
  • 8+ years of platform engineering or infrastructure experience
  • Experience with Staff Engineer or Tech Lead roles in ML/AI organizations
  • Background in distributed systems and high-performance computing
  • Open-source contributions to ML infrastructure projects or LLM frameworks
  • Multi-Cloud Experience: Hands-on experience with Azure, AWS (SageMaker, EKS) and/or GCP (Vertex AI, GKE)
  • Experience with specialized hardware (A100s, H100s, TPUs, TEEs) and optimization
  • RLHF & Fine-tuning: Experience with Reinforcement Learning from Human Feedback and LLM fine-tuning workflows
  • Experience with Milvus, Pinecone, Weaviate, Qdrant, or similar vector storage solutions
  • Deep experience with MLflow, Kubeflow, DataRobot, or similar platforms
  • Understanding of AI safety principles, model governance, and regulatory compliance
  • Background in regulated industries with understanding of data privacy requirements
  • Experience supporting ML research teams and academic partnerships
  • Deep understanding of GPU optimization, memory management, and high-throughput systems

What we offer:

  • Comprehensive Total Rewards program that offers personalized coverage tailor-made for you and your family’s overall well-being
  • Financial benefits including market-competitive compensation; a 401K savings plan vested from day one that offers a 6% match; performance and recognition-based incentives; and tuition assistance
  • Access to additional benefits like mental healthcare as well as fertility and adoption assistance
  • Support for flexibility: we provide workplace flexibility as well as our GEICO Flex program, which offers the ability to work from anywhere in the US for up to four weeks per year

Additional Information:

Job Posted:
February 21, 2026

Employment Type:
Full-time

Work Type:
Hybrid work