The GEICO AI Agent Platform team is seeking an exceptional Staff Software Engineer to build the next-generation enterprise Agent OS and SDKs. This role combines deep technical expertise in platform engineering, application design, and agentic workflows with strong leadership and mentoring capabilities. You will be responsible for designing, implementing, and maintaining scalable, reliable frontend and backend systems that enable our business, product, and engineering teams to build, test, and deploy their own AI agents and workflows. The candidate must have excellent communication skills and a proven track record of delivering business value through technical excellence.
Job Responsibilities:
Architect and implement scalable multi-tenant backend systems for building AI agent workflows (agent configuration, offline evaluation, synthetic data generation, workflow simulation, agent marketplace, etc.) using technologies such as Azure Kubernetes Service (AKS) and FastAPI, ensuring economies of scale and controlling maintenance costs
Collaborate with the Design team to architect and implement frontend experiences and workflows for onboarding both technical and non-technical stakeholders, maximizing user adoption and successful AI agent development
Develop observability frameworks to ensure 99.9%+ uptime for AI agent platforms through robust monitoring, alerting, and incident response procedures
Evaluate and, where beneficial, integrate cutting-edge GenAI frameworks, libraries, and vendors to maintain a state-of-the-art technology stack, including hybrid cloud solutions that use AWS/GCP for backup or specialized use cases
Architect and implement scalable, high-performance machine learning platforms and systems capable of processing large data volumes and supporting real-time decision making and workflows
Oversee the end-to-end lifecycle of AI agent applications, ensuring robust testing, deployment, and ongoing monitoring
Ensure adherence to company production readiness standards, security protocols, and regulatory compliance throughout the development lifecycle
Continuously optimize platform performance, reducing latency and improving throughput for AI agent workloads
Design and implement backup, recovery, and business continuity plans for hosted platform applications & services
Design and maintain robust CI/CD pipelines for ML model deployment using Azure DevOps, GitHub Actions, and MLOps tools
Act as the tech lead for a sub-team, setting technical direction and ensuring consistency in design principles and best practices
Provide hands-on mentorship and guidance during design reviews, code assessments, and performance tuning
Lead by example in tackling complex technical challenges and driving system-wide architectural improvements
Establish and champion engineering standards for ML infrastructure, deployment practices, and operational procedures
Create technical documentation, runbooks, and deliver internal training sessions on platform capabilities
Work closely with data scientists, software engineers, and product teams to seamlessly deploy ML systems into production environments
Translate complex technical concepts into actionable insights for both technical and non-technical stakeholders
Foster a collaborative environment that encourages innovation and the sharing of best practices across teams
Present technical solutions and platform roadmaps to leadership and cross-functional stakeholders
Requirements:
Bachelor’s degree in Computer Science, Engineering, Mathematics, or a related field; an advanced degree (master’s or Ph.D.) is highly desirable
6+ years of hands-on experience in designing, implementing, and maintaining multi-tenant AI/ML systems and platforms in production environments
6+ years of experience working with cloud platforms such as Azure and AWS
Extensive expertise in designing and deploying large-scale data pipelines and real-time inference systems, and in managing the end-to-end AI agent and/or AI/ML system development lifecycle, including configuration, evaluation, monitoring, observability, and authentication/authorization (AuthN/AuthZ) considerations
6+ years of experience working with common backend systems and tools (e.g., Kubernetes, Temporal, OpenSearch, PostgreSQL, Redis, Neo4j)
Deep understanding of Docker, container optimization, and multi-stage builds
Experience with Prometheus, Grafana, OpenTelemetry, and distributed tracing
3+ years of experience building front-end web applications using frameworks such as React and/or Next.js
Deep proficiency in programming languages such as Python, Java, or Go, with a strong emphasis on coding excellence
Proficiency in AI/ML frameworks such as TensorFlow, PyTorch, and LangGraph
Demonstrated track record of mentoring engineers and leading technical initiatives
Proven ability to tackle complex technical challenges, innovate through hands-on experimentation, and set technical standards
Excellent verbal and written communication skills, with the ability to address audiences of diverse seniority levels and professional backgrounds
Nice to have:
Deep expertise operating and/or building AI agent platforms and capabilities such as LangGraph Platform, AutoGen, n8n, and CrewAI
Experience with LLM observability systems such as LangSmith, Langfuse, and Arize Phoenix
Experience building LLM-based AI agent workflows in both no-code/low-code and traditional high-code development environments
Experience using both open-source (e.g., Llama, Qwen, Mistral) and proprietary (e.g., GPT, Claude) LLMs for appropriate tasks
Understanding of AI safety principles, model governance, and regulatory compliance
Background in regulated industries with understanding of data privacy requirements and cybersecurity review processes
What we offer:
Comprehensive Total Rewards program that offers personalized coverage tailor-made for you and your family’s overall well-being
Financial benefits including market-competitive compensation, a 401(k) savings plan vested from day one that offers a 6% match, performance and recognition-based incentives, and tuition assistance
Access to additional benefits like mental healthcare as well as fertility and adoption assistance
Workplace flexibility, including our GEICO Flex program, which offers the ability to work from anywhere in the US for up to four weeks per year