AI Engineer - Instrumentation

United States · 125000.00 - 225000.00 USD / Year · Job Posted December 06, 2025

Job Description

Join Arize AI's Engineering team working on OpenInference – the industry-leading open-source standard for AI observability, LLM and agent instrumentation. You'll be at the forefront of defining how organizations instrument, trace, and observe their AI applications across the entire ecosystem, shaping the future of AI development worldwide.

Job Responsibility

  • Build new LLM and instrumentation libraries for emerging LLM providers and agent frameworks
  • Maintain and enhance existing instrumentation across the Python, TypeScript, and other ecosystems (OpenAI, Anthropic, LlamaIndex, CrewAI, and many more)
  • Drive improvements to semantic conventions and OpenTelemetry standards that define AI observability
  • Collaborate with the global developer community through GitHub, Slack, and conferences, and with Arize PMs and solution architects
  • Take complex problems from ideation to completion with full ownership and accountability

Requirements

  • 3-5+ years of software development experience shipping production code
  • Expert-level proficiency in both Python and TypeScript
  • Community-oriented mindset with genuine passion for collaborative open-source development
  • Deep interest in AI/LLM ecosystem with desire to stay current on emerging technologies
  • Strong analytical skills to distill requirements from diverse sources and stakeholders

Nice to have

  • Experience building SDK clients, instrumentation libraries, or platform APIs
  • Hands-on experience with AI/ML observability or evaluation systems

What we offer

  • competitive equity package
  • comprehensive benefits package, including medical, dental, vision
  • a 401(k) plan
  • unlimited paid time off
  • a generous parental leave plan
  • additional support for mental health and wellness
  • WFH monthly stipend to pay for co-working spaces

Similar Jobs for AI Engineer - Instrumentation

8 matching positions

Applied AI Engineer - MCP

We are hiring a founding group of engineers to kickstart this mission. As an AI ...
Location: India, Chennai
Salary: Not provided
Appian Corporation
Expiration Date: Until further notice
Requirements
  • AI Infrastructure Experience: Professional experience building and deploying production-grade GenAI systems on GCP Vertex AI or AWS Bedrock, including LLM APIs, agent frameworks, and RAG pipelines
  • Python Mastery: Deep proficiency in Python and common AI/ML libraries (e.g., LangChain, LlamaIndex, OpenAI SDK, Google Cloud AI SDK)
  • Agentic Systems: Hands-on experience building multi-step, tool-calling AI agents that operate reliably in production, including tool schema design, structured outputs, and failure handling
  • MCP or Tool Protocol Experience: Familiarity with Model Context Protocol (MCP) or equivalent patterns for exposing enterprise resources to AI systems as callable, permissioned tools
  • Vector & Retrieval Systems: Experience with vector databases (Pinecone, pgvector, or similar) and embedding-based retrieval at scale
  • Technical Foundation: B.S. in Computer Science, Engineering, or a related technical field
  • The Pioneer Mindset: A self-starter who is excited to be part of a "first-of-its-kind" team and thrives in environments where you are building the playbook
Job Responsibility
  • Build the AI Platform: Design, deploy, and scale GenAI infrastructure on GCP Vertex AI (preferred) or AWS Bedrock
  • Build & Operate MCP Servers: Develop and maintain Model Context Protocol (MCP) servers that expose enterprise systems (databases, APIs, internal tools) as structured, AI-consumable capabilities
  • Design the Tool Layer: Build the tool registry and invocation framework that allows agents to interact with internal systems safely and reliably, including schema design, access controls, and error handling
  • Engineer Agent-to-Agent Infrastructure: Architect the communication and coordination layer that allows specialized agents to delegate tasks, share context, and compose into larger autonomous workflows
  • Own the Retrieval Layer: Architect and operate RAG systems, including vector stores, embedding pipelines, chunking strategies, and retrieval evaluation frameworks
  • Establish AI Reliability: Instrument AI systems with logging, tracing, latency monitoring, and evaluation hooks so agents can be trusted, debugged, and improved in production
What we offer
  • health coverage
  • Employee Assistance Program (EAP) with free mental health support
  • life and disability insurance
  • Employee Stock Purchase Program (ESPP)
  • retirement/pension plan
  • wellness dollars
  • tuition reimbursement
  • family-forming benefits
Employment type: Full-time

AI Engineer - Platform

At hyperexponential, we’re building the AI-powered platform that enables the wor...
Location: Poland, Warsaw
Salary: Not provided
hyperexponential
Expiration Date: Until further notice
Requirements
  • Built and deployed production AI infrastructure that scaled to support enterprise-grade reliability and observability
  • Delivered self-service tools or APIs that enabled multiple product teams to accelerate their AI/ML development cycles
  • Implemented evaluation frameworks, A/B testing infrastructure, or monitoring solutions that measured and improved model performance, latency, cost, and quality in production
  • Led initiatives to reduce AI compute costs through optimisation strategies such as intelligent routing or caching
  • Successfully migrated teams from bespoke AI solutions to shared platform services, driving measurable adoption
  • Prioritised and improved developer experience through documentation, support, or workflow enhancements
Job Responsibility
  • Designing and operating scalable AI infrastructure for LLM inference, prompt management, and evaluation pipelines, supporting billions in premium flow
  • Building self-service tools, SDKs, and APIs that empower product teams to move from prototype to production 30% faster
  • Instrumenting production AI/ML workloads with standardised logging, tracing, and evaluation metrics, increasing observability coverage to 100% of deployed models
  • Implementing intelligent routing, caching, and provider optimisation via the LLM gateway, reducing AI compute costs by up to 25%
  • Driving adoption of shared platform services (LLM gateway, evaluation frameworks, monitoring) to replace bespoke solutions, increasing platform adoption across new AI features
  • Championing developer experience by delivering comprehensive documentation and responsive support, resulting in higher internal customer satisfaction
What we offer
  • Share Options at a highly successful Series B company
  • 25 non-working days + Polish bank holidays (B2B) / 26 days of holiday + Polish bank holidays (UoP)
  • £5,000 budget for Learning & Development
  • Mental Health Support and Therapy via Spectrum Life
  • Optional access to Private Healthcare via Luxmed + Multisport (fully self-funded for B2B contractors)
  • Top-spec laptop (macOS or Windows)
  • Company pension (UoP)
  • Company Sick Pay for 10 days at 100% salary (UoP)
  • Monthly Wellbeing allowance via Juno (UoP)
  • Private Healthcare Insurance via Luxmed (UoP)
Employment type: Full-time

AI Engineer - Platform

At hx, AI is central to how we build software and make decisions across the comp...
Location: United Kingdom, London
Salary: Not provided
hyperexponential
Expiration Date: Until further notice
Requirements
  • Built and deployed production AI infrastructure that scaled to support enterprise-grade reliability and observability
  • Delivered self-service tools or APIs that enabled multiple product teams to accelerate their AI/ML development cycles
  • Implemented evaluation frameworks, A/B testing infrastructure, or monitoring solutions that measured and improved model performance, latency, cost, and quality in production
  • Led initiatives to reduce AI compute costs through optimisation strategies such as intelligent routing or caching
  • Successfully migrated teams from bespoke AI solutions to shared platform services, driving measurable adoption
  • Prioritised and improved developer experience through documentation, support, or workflow enhancements
Job Responsibility
  • Designing and operating scalable AI infrastructure for LLM inference, prompt management, and evaluation pipelines, supporting billions in premium flow
  • Building self-service tools, SDKs, and APIs that empower product teams to move from prototype to production 30% faster
  • Instrumenting production AI/ML workloads with standardised logging, tracing, and evaluation metrics, increasing observability coverage to 100% of deployed models
  • Implementing intelligent routing, caching, and provider optimisation via the LLM gateway, reducing AI compute costs by up to 25%
  • Driving adoption of shared platform services (LLM gateway, evaluation frameworks, monitoring) to replace bespoke solutions, increasing platform adoption across new AI features
  • Championing developer experience by delivering comprehensive documentation and responsive support, resulting in higher internal customer satisfaction
What we offer
  • £5,000 training and conference budget for individual and group development
  • 25 days of holiday plus 8 bank holidays (33 days total)
  • Company pension scheme via Penfold
  • Mental health support and therapy via Spectrum.life
  • Individual wellbeing allowance via Juno
  • Private healthcare insurance through AXA
  • Income protection and Life Insurance
  • Cycle to Work Scheme
  • Top-spec equipment (laptop, screens, adjustable desks, etc.)
  • Regular remote and in-person hackathons, lunch and learns, socials, and game nights
Employment type: Full-time

Software Engineer, AI

Meta is seeking a Software Engineer with deep AI specialization to help build an...
Location: Singapore
Salary: Not provided
Meta
Expiration Date: Until further notice
Requirements
  • 6+ years of software engineering experience, with a focus on building and deploying machine learning or AI systems in production environments
  • Experience designing and implementing end-to-end machine learning pipelines, including data preprocessing, model training, evaluation, and serving at scale
  • Experience with deep learning frameworks such as PyTorch or TensorFlow, and proficiency in Python for AI and data engineering workflows
  • Experience applying experimentation methodologies — including A/B testing and metric design — to evaluate AI model performance and drive product decisions
  • Experience building maintainable, well-tested codebases for AI systems, including unit testing, integration testing, and monitoring for model quality and reliability
Job Responsibility
  • Design and implement scalable AI and machine learning systems, including model training pipelines, inference infrastructure, and feature engineering frameworks, to power Meta's core products
  • Develop and optimize large-scale AI models — including large language models, generative AI systems, and ranking and recommendation models — from prototype through production deployment
  • Leverage AI tools and workflows as a force multiplier to expand technical scope across modeling, data analysis, and operational readiness within a single project lifecycle
  • Establish and maintain robust evaluation frameworks, automated testing, and monitoring pipelines to ensure reliability and quality of AI systems in production
  • Own the technical design of AI components and systems, evaluating architectural trade-offs to meet well-defined product and business requirements
  • Instrument AI systems with telemetry, design experiments to validate model hypotheses, and make data-informed decisions that balance short-term goals with long-term model quality
  • Proactively identify performance bottlenecks in model serving and training infrastructure, using profiling and benchmarking to drive latency and throughput improvements
  • Collaborate with product managers, data scientists, and research scientists to translate AI research advances into production-ready features with measurable user impact
  • Contribute to AI safety, privacy, and integrity practices by incorporating responsible AI principles into system design and partnering with cross-functional teams on safeguards
  • Mentor other engineers on AI engineering best practices, advocate for coding and testing standards, and help drive adoption of AI-augmented development workflows across the team

Agentic AI Engineer

Role - Agentic AI Engineer; Location - Boston, MA (ONSITE); Experience – 5+; Typ...
Location: United States, Boston
Salary: 172500.00 USD / Year
Realign
Expiration Date: Until further notice
Requirements
  • Hands-on with agent frameworks such as Semantic Kernel, LangGraph, LangChain Agents or CrewAI
  • Proficiency in Python, modern AI/ML frameworks, and AWS
  • Hands-on with Bedrock AgentCore Memory, Gateway, and Runtime
  • Strong experience with LLMs (fine-tuning, evaluation, prompt engineering)
  • Proficiency in vector databases and embedding models for retrieval tasks
  • Experience with Data Connectors and API gateways that support seamless communication between systems
  • Experience with generative AI concepts such as Retrieval-Augmented Generation (RAG), agentic workflows, training LLMs with structured and unstructured data sets
  • Strong communicator with the ability to present ideas clearly and influence stakeholders - with a passion for enabling data-driven transformation
  • Deep subject matter expertise in AI technologies, including but not limited to Copilot Studio, OpenAI, Semantic Kernel, Azure AI Foundry, Google Gemini, Microsoft 365 and M365 Copilot, Anthropic, or AWS platforms
Job Responsibility
  • Focus on designing, building, and optimizing agentic capabilities using large language models and advanced reasoning frameworks
  • Bring strong technical depth and a passion for building to workstreams in a fast-paced, innovation-driven environment
  • Build serverless agentic product architectures and features on AWS using Bedrock AgentCore Memory, Gateway, Identity, Interpreter, Observability and Runtime
  • Implement and operate A2A/MCP servers on AWS; integrate with Bedrock Agents/Converse APIs
  • Orchestrate multi-agent plans with Strands Agents; deploy to EKS or Lambda
  • Instrument agents with CloudWatch metrics, spans, and traces; enable audits
  • Collaborate with product managers, data engineers, and UX teams to deliver production-ready solutions
Employment type: Full-time

Senior Applied AI Engineer

Build production-grade, multimodal (audio/video/text) systems that convert broad...
Location: United States, New York
Salary: 180000.00 - 240000.00 USD / Year
Genius Sports
Expiration Date: Until further notice
Requirements
  • 5–8+ years of professional software engineering experience (backend and/or ML systems)
  • Strong proficiency in one or more of: Python, Java, Rust
  • Hands-on experience building production services involving LLM or multimodal model integration (including Gemini, ChatGPT or Claude)
  • Comfortable with ambiguity, iterative experimentation, and evidence-based decision-making in an Agile environment
  • Experience with streaming data platforms like Kafka, Pulsar, Flink
  • Experience with AWS Bedrock or Google Vertex AI
  • Familiarity with version control systems (e.g., Git)
  • Excellent problem-solving skills and attention to detail
  • Ability to work independently and as part of a team
  • Strong communication skills
Job Responsibility
  • Build and maintain multimodal agents: audio sensor agents (acoustic events, sentiment, alignment), visual sensor agents (scorebug/overlay reading, basic visual cues when applicable), and specialist and decision logic components (structured event outputs, confidence, traceability)
  • Implement streaming-friendly pipelines: chunking, normalization, time-sync, async execution, and robust retry/backoff for model/tool calls
  • Develop prompt-as-code with strict JSON contracts, schema validation, and deterministic post-processing to reduce brittleness
  • Improve system robustness under noisy inputs by designing fallback behaviors (degraded modes), adding guardrails and confidence thresholds, and instrumenting traces/metrics for latency, cost, and accuracy
  • Partner with product, platform, and domain leads to translate sport rules/edge cases into validation logic and to integrate outputs into downstream consumers (tagging, live feeds, analytics)
  • Contribute to the evaluation workflow by adding test cases, failure mode categories, and regression checks for prompts and model routing
  • Stay up-to-date with emerging Gen AI technologies, tools, and best practices
  • Mentor and support other team members in data engineering principles and practices
What we offer
  • Eligible to take part in Genius Sports Group's benefits plan
  • Competitive salary and range of benefits
  • Committed to supporting employee wellbeing and helping you grow your skills, experience and career
  • Inclusive working environment
Employment type: Full-time

Applied AI Engineer II

Project Sophia is a new generation business application, built ground up from ma...
Location: United States, Redmond
Salary: 100600.00 - 199000.00 USD / Year
Microsoft Corporation
Expiration Date: Until further notice
Requirements
  • Bachelor's Degree in Statistics, Econometrics, Computer Science, Electrical or Computer Engineering, or related field AND 2+ years related experience (e.g., statistics, predictive analytics, research) OR Master's Degree in Statistics, Econometrics, Computer Science, Electrical or Computer Engineering, or related field AND 1+ year(s) related experience (e.g., statistics, predictive analytics, research) OR Doctorate in Statistics, Econometrics, Computer Science, Electrical or Computer Engineering, or related field OR equivalent experience
  • Ability to meet Microsoft, customer and/or government security screening requirements
  • Microsoft Cloud Background Check
  • 2+ years of extensive experience with one or more modern web technologies such as .NET / Node / React / Angular, building RESTful APIs
  • 2+ years of solid experience in an OO Language like C# or Java or Python
  • 1+ years of professional experience working with generative artificial intelligence, large language models, or agent-based systems
  • BS/MS in Computer Science or equivalent or 5+ years of industry experience
  • Excellence in one or more general programming languages including but not limited to: Python, C#, JavaScript, TypeScript
Job Responsibility
  • Design, implement, and ship AI-first product capabilities end-to-end from rapid prototype to production, spanning LLM-powered services, retrieval/grounding pipelines, and intelligent UX experiences that delight users through Sophia’s AI canvas
  • Own implementation across the full stack integrating front-end experiences, back-end services, and AI orchestration layers that connect models, context, and tools to deliver cohesive, extensible, high-performance systems
  • Collaborate with design, research, and platform teams to adapt or fine-tune LLMs/SLMs and multimodal models for real-world customer scenarios, ensuring outcomes are contextual, transparent, and human-centered
  • Build agentic, tool-using, and multimodal workflows that reason across data and services; optimize for safety, latency, reliability, and cost efficiency
  • Contribute to engineering excellence: secure-by-design, accessibility compliance, automated testing, and code craftsmanship across the product lifecycle
  • Instrument and evaluate AI features with telemetry, experimentation, and continuous feedback loops to refine reasoning quality and user experience
  • Drive live-site reliability and operational excellence, participating in On-Call rotations while maintaining a sustainable, high-ownership engineering culture
Employment type: Full-time

AI Engineer

As an AI Engineer at Aspora, you’ll start by shipping internal automation tools ...
Location: India, Bangalore
Salary: Not provided
Aspora
Expiration Date: Until further notice
Requirements
  • At least 5 years of experience training, deploying, and scaling AI/ML models in production environments
  • You understand the pace and mindset of a startup and are excited to contribute to its growth and culture
  • Hands-on experience integrating third-party LLM APIs in production environments
  • Practical understanding of MLOps/LLMOps concerns: evaluation, monitoring, rollouts, incident response
  • Curiosity and drive to experiment with advanced AI techniques while staying grounded in production impact
  • Strong bias for measurable improvements: latency, cost, reliability
Job Responsibility
  • Own the serving and runtime layer for all LLM-powered features, ensuring low latency, high reliability, and cost efficiency in production
  • Integrate LLM and agent capabilities into real product workflows, including tool calling, routing, guardrails, and safe failure modes
  • Build reliable, observable AI workflows across product, ops, and data systems, with strong foundations in retries, idempotency, fallbacks, and human-in-the-loop patterns
  • Design and operate retrieval and knowledge systems end-to-end, from ingestion and indexing to retrieval and response composition
  • Automate and harden operational processes, turning manual workflows into scalable, measurable pipelines
  • Own performance, reliability, and operational excellence across AI systems, measuring end-to-end latency, removing bottlenecks, and implementing pragmatic reliability patterns
  • Instrument and monitor AI systems in production, tracking quality regressions, drift, prompt failures, and retrieval issues with logs, metrics, and traces
What we offer
  • Competitive compensation and equity, aligned with experience, impact and market standards
  • Learning & development support, including professional development budgets and learning stipends
  • Wellness and team engagement programs to support work-life integration and collaboration
Employment type: Full-time