GEICO is seeking an experienced Engineer with a passion for building high-performance, low-maintenance, zero-downtime platforms and applications. You will help drive our insurance business transformation as we transition from a traditional IT model to a tech organization with engineering excellence as its mission, while co-creating a culture of psychological safety and continuous improvement. The Senior Staff Engineer in Availability and Incident Management will design and deploy machine learning systems that enable intelligent incident detection, automated root cause analysis, and predictive reliability improvements across the platform. This role focuses on building a multi-agent AI platform where specialized agents autonomously detect anomalies, diagnose failures, recommend remediation actions, and learn from historical patterns to prevent recurring incidents. You will lead the technical strategy for an AI-powered incident response system that reduces mean time to resolution, minimizes operational toil, and enables proactive reliability improvements through predictive analytics and autonomous workflows.
Job Responsibilities:
Design and build a multi-agent AI platform where specialized agents autonomously detect, diagnose, and resolve issues through agent-to-agent (A2A) collaboration
Develop intelligent agents using LLMs and agentic frameworks that coordinate detection, diagnostic, remediation, and knowledge tasks with minimal human intervention
Define agent interaction protocols, A2A communication standards, and evaluation frameworks for agent decision quality and autonomous action safety
Architect vector database solutions (Milvus, pgvector, Qdrant) for semantic search and RAG to enable context-aware agent decision-making
Build end-to-end ML pipelines for severity classification, anomaly detection, failure pattern recognition, and impact forecasting using observability data
Establish scalable orchestration infrastructure for multi-agent workflows with CI/CD, automated evaluation, canary releases, and rollback strategies
Implement monitoring for agent interactions, A2A communication patterns, decision quality, data drift, and system reliability
Lead technical architecture ensuring scalability, observability, and integration with existing alerting, logging, and monitoring systems
Define standards for agent safety, explainability, governance, and human-in-the-loop controls for high-impact automated actions
Partner with SRE, Product, and Engineering teams to translate reliability goals into measurable ML objectives and maintain pragmatic technical roadmaps
Mentor engineers through complex AI platform implementations and establish best practices, coding standards, and technical documentation
Stay current with AI/ML and multi-agent systems, and educate engineering leadership on emerging technologies
Requirements:
Experience building and deploying ML systems in production with cross-functional engineering teams
Fluency in at least two modern languages such as Python, Go, Java, C++, or C# including object-oriented design
Experience architecting multi-component ML platforms using open-source/cloud-agnostic components, including datastores (PostgreSQL; NoSQL such as MongoDB, Cassandra, CosmosDB) and streaming (Kafka, Flink, or Spark Streaming)
Experience with end-to-end ML lifecycle: version control, CI/CD, Kubernetes, testing, monitoring, and production support
Experience with cloud providers (Azure, AWS or GCP) in production ML environments
Experience with observability tools and distributed systems monitoring, logging, tracing, and root cause analysis
Experience building multi-agent systems using LLMs and agentic frameworks (e.g., LangChain, LangGraph, AutoGen, Semantic Kernel, CrewAI)
Hands-on experience with RAG, semantic search, and vector databases (e.g., Milvus, pgvector, Qdrant, Elasticsearch)
Experience designing human-in-the-loop workflows and safety controls for autonomous systems
Strong architecture and design skills with ability to influence technical direction and roadmap
Proven ability to solve complex problems with data-driven approaches
10+ years of professional platform development or general development experience
8+ years of experience with architecture and design
6+ years of experience building and deploying machine learning systems in production
6+ years of experience in open-source frameworks
4+ years of experience with AWS, GCP, Azure, or another cloud service
2+ years of experience with LLMs, agentic AI frameworks, or multi-agent systems
Bachelor’s degree in Computer Science, Information Systems, or equivalent education or work experience
Nice to have:
Experience fine-tuning or deploying open-source LLMs (Llama, Mistral, Phi)
Experience with data warehouse/lakehouse platforms (e.g., Snowflake, Databricks, Parquet, Delta, Iceberg)
What we offer:
Comprehensive Total Rewards program that offers personalized coverage tailor-made for you and your family’s overall well-being
Financial benefits including market-competitive compensation, a 401K savings plan vested from day one that offers a 6% match, performance and recognition-based incentives, and tuition assistance
Access to additional benefits like mental healthcare as well as fertility and adoption assistance
Workplace flexibility as well as our GEICO Flex program, which offers the ability to work from anywhere in the US for up to four weeks per year