Principal Engineer, AI Model Lifecycle

United States, San Francisco · 260000.00 - 326000.00 USD / Year · Job Posted February 21, 2026

Job Description

The Principal Software Engineer for the Model LifeCycle team will play a crucial role in building a comprehensive managed platform for the entire application development lifecycle, with a specific focus on leveraging Machine Learning models, including Large Language Models (LLMs). This role offers significant 0 → 1 ownership — you'll be designing and building core systems from first principles.

Job Responsibility

  • Manage fine-tuning systems for large foundation models (SFT, PEFT, LoRA, adapters), including multi-node orchestration, checkpointing, failure recovery, and cost-efficient scaling
  • Implement and maintain end-to-end training pipelines for Large Language Models
  • Distillation and reinforcement learning pipelines (e.g., preference optimization, policy optimization, reward modeling)
  • Agent execution infrastructure
  • Dataset, model, and experiment management: versioning, lineage, evaluation, and reproducible fine-tuning at scale
  • Work closely with product, business, and platform teams to shape the core abstractions and APIs of the system
  • Influence long-term architectural decisions around training runtimes, scheduling, storage, and model lifecycle management
  • Contribute to and engage with the open-source LLM ecosystem

Requirements

  • Advanced degree in Computer Science, Engineering, or a related field
  • 10-15+ years of industry experience driving impactful projects in the AI space
  • Proven track record of delivering early-stage projects under tight deadlines
  • Expertise in using cloud-based services such as elastic compute, object storage, virtual private networks, and managed databases
  • Experience in Generative AI (Large Language Models, Multimodal)
  • Deep experience with AI infrastructure, including training and inference

Nice to have

  • Proficiency in Golang or Python for large-scale, production-level services
  • Contributions to open-source AI projects such as vLLM or similar frameworks
  • Performance optimizations on GPU systems and inference frameworks
  • Experience working with PyTorch
  • Experience with training and fine-tuning LLMs

What we offer

  • Restricted Stock Units
  • Health insurance package options that include HDHP and PPO, vision, and dental for you and your dependents
  • Employer contributions to HSA accounts
  • Paid Parental Leave
  • Paid life insurance, short-term and long-term disability
  • Teladoc
  • 401(k) with a 100% match up to 4% of salary
  • Generous paid time off and holiday schedule
  • Cell phone reimbursement
  • Tuition reimbursement
  • Subscription to the Calm app
  • MetLife Legal
  • Company-paid commuter benefit ($300/month)

Similar Jobs for Principal Engineer, AI Model Lifecycle

8 matching positions

Principal Engineer, AI Strategy and Innovation

Shape the architecture and execution of CLEAR’s AI platform strategy, from infra...
Location
United States, New York
Salary
250000.00 - 290000.00 USD / Year
Clear
Expiration Date
Until further notice
Requirements
  • 12+ years in software engineering and/or technical experience with deep expertise in AI systems, ML platforms, and data infrastructure
  • At least 5 years of experience with various AI technologies including GenAI, ML, Deep Learning, RPA or others
  • Proven ability to scale AI capabilities into high-throughput, low-latency environments
  • Strong technical background in cloud-native architectures (AWS or similar) and modern AI/ML stacks (TensorFlow/PyTorch, MLflow, RAG, MCP, etc.)
  • Experience leading AI strategy and platform adoption in enterprise-scale environments
  • Skilled at translating regulatory and compliance requirements into responsible AI practices
  • Track record of partnering closely with Product, Engineering, Analytics, and Security teams as well as business executives
  • Excellent communicator who can set a vision for AI, explain technical trade-offs, and influence executives, peers, and partners
  • Passionate about embedding AI into core products to deliver measurable impact for members and enterprise partners
Job Responsibility
  • Define and scale CLEAR’s AI strategy: spanning data pipelines, ML lifecycle management, and intelligent applications
  • Lead engineering execution for AI models (development, deployment, monitoring, retraining) with a focus on reliability, observability, and ethical AI practices
  • Modernize analytics and intelligence systems to deliver predictive insights and partner-facing transparency in real time
  • Operationalize trust in AI by embedding privacy, compliance, and security into all platforms and workflows
  • Influence cross-functional stakeholders across the business, fostering a culture of technical rigor, collaboration, and innovation, and advising C-suite executives, leaders, and individual contributors
  • Lead the AI Governance group and drive best practices across business functions
  • Track and optimize KPIs on AI adoption, model performance, scalability, and business impact
What we offer
  • Comprehensive healthcare plans
  • Family-building benefits (fertility and adoption/surrogacy support)
  • Flexible time off
  • Annual wellness stipend
  • Free OneMedical memberships for you and your dependents
  • A CLEAR Plus membership
  • A 401(k) retirement plan with employer match
  • Catered lunches every day
  • Fully stocked kitchens
  • Stipends and reimbursement programs for well-being and learning & development
  • Fulltime

Principal Engineer

The Principal AI/ML Operations Engineer leads the architecture, automation, and ...
Location
United States, Pleasanton, California
Salary
251000.00 - 314500.00 USD / Year
BlackLine
Expiration Date
Until further notice
Requirements
  • Bachelor’s or Master’s degree in Computer Science, Machine Learning, Data Science, or a related field
  • 10+ years in ML infrastructure, DevOps, and software system architecture
  • 4+ years in leading MLOps or AI Ops platforms
  • Strong programming skills in languages such as Python, Java, or Scala
  • Expertise in ML frameworks (TensorFlow, PyTorch, scikit-learn) and orchestration tools (Airflow, Kubeflow, Vertex AI, MLflow)
  • Proven experience operating production pipelines for ML and LLM-based systems across cloud ecosystems (GCP, AWS, Azure)
  • Deep familiarity with LangChain, LangGraph, ADK or similar agentic system runtime management
  • Strong competencies in CI/CD, IaC, and DevSecOps pipelines integrating testing, compliance, and deployment automation
  • Hands-on with observability stacks (Prometheus, Grafana, New Relic) for model and agent performance tracking
  • Understanding of governance frameworks for Responsible AI, auditability, and cost metering across training and inference workloads
Job Responsibility
  • Define enterprise-level standards and reference architectures for ML-Ops and AIOps systems
  • Partner with data science, security, and product teams to set evaluation and governance standards (Guardrails, Bias, Drift, Latency SLAs)
  • Mentor senior engineers and drive design reviews for ML pipelines, model registries, and agentic runtime environments
  • Lead incident response and reliability strategies for ML/AI systems
  • Lead the deployment of AI models and systems in various environments
  • Collaborate with development teams to integrate AI solutions into existing workflows and applications
  • Ensure seamless integration with different platforms and technologies
  • Define and manage MCP Registry for agentic component onboarding, lifecycle versioning, and dependency governance
  • Build CI/CD pipelines automating LLM agent deployment, policy validation, and prompt evaluation of workflows
  • Develop and operationalize experimentation frameworks for agent evaluations, scenario regression, and performance analytics
What we offer
  • short-term and long-term incentive programs
  • robust offering of benefit and wellness plans
  • Fulltime

Principal Data Engineer

We are on the lookout for a Principal Data Engineer to help define and lead the ...
Location
United Kingdom
Salary
Not provided
Dotdigital
Expiration Date
Until further notice
Requirements
  • Extensive experience delivering Python-based projects in the data engineering space
  • Extensive experience working with SQL and NoSQL database technologies (e.g. SQL Server, MongoDB & Cassandra)
  • Proven experience with modern data warehousing and large-scale data processing tools (e.g. Snowflake, dbt, BigQuery, ClickHouse)
  • Hands on experience with data orchestration tools like Airflow, Dagster or Prefect
  • Experience using cloud environments (e.g. Azure, AWS, GCP) to process, store and surface large scale data
  • Experience using Kafka or similar event-based architectures (e.g. Pub/Sub via AWS SQS, Azure Event Hubs, AWS Kinesis)
  • Strong grasp of data architecture and data modelling principles for both OLAP and OLTP workloads
  • Capable in the wider software development lifecycle in terms of agile ways of working and continuous integration/deployment of data solutions
  • Experience as a Lead or Principal Engineer on large-scale data initiatives or product builds
  • Demonstrated ability to architect data systems and data structures for high volume, high throughput systems
Job Responsibility
  • Lead the design and implementation of scalable, secure and resilient data systems across streaming, batch and real-time use cases
  • Architect data pipelines, model and storage solutions that power analytical and product use cases, using primarily Python and SQL via orchestration tooling that runs workloads in the cloud
  • Leverage AI to automate both data processing and engineering processes
  • Assure and drive best practices relating to data infrastructure, governance, security and observability
  • Work with technologists across multiple teams to deliver coherent features and data outcomes
  • Support the data team to help adopt data engineering principles
  • Identify, validate and promote new tools and technologies that improve the performance and stability of data services
What we offer
  • Parental leave
  • Medical benefits
  • Paid sick leave
  • Dotdigital day
  • Share reward
  • Wellbeing reward
  • Wellbeing Days
  • Loyalty reward
  • Fulltime

Principal Machine Learning System Engineer

As a Principal Machine Learning System Engineer on the AI & ML Platform team, yo...
Salary
Not provided
Atlassian
Expiration Date
Until further notice
Requirements
  • Extensive experience in building Machine Learning and AI infra/platform/system (generally 5+ years)
  • Comprehensive ML lifecycle expertise: proven experience developing, deploying, and maintaining end-to-end ML systems, from data engineering to model serving and monitoring
  • Large-scale system design: Extensive experience designing and building scalable, fault-tolerant, and high-performance distributed systems for machine learning
  • Proficiency with frameworks and languages: Expert-level proficiency in Python and ML frameworks like PyTorch, TensorFlow, or JAX. Familiarity with other languages like Go, Java, or Scala is also beneficial
  • MLOps and automation: Deep experience implementing MLOps, CI/CD pipelines, and automation for continuous training, deployment, and monitoring of ML models
Job Responsibility
  • Collaborate with your teammates to solve complex problems, from technical design to launch
  • Deliver cutting-edge solutions that are used by other Atlassian teams and products to build AI features that reach millions of customers
  • Deliver code reviews, documentation & bug fixes within a strong engineering culture
  • Partner across engineering teams to take on company-wide initiatives spanning multiple projects
  • Mentor junior members of the team
What we offer
  • health and wellbeing resources
  • paid volunteer days

Principal Automation Engineer

We are seeking a Principal Automation Engineer to lead and drive innovation in a...
Location
India, Bangalore
Salary
Not provided
Hewlett Packard Enterprise
Expiration Date
Until further notice
Requirements
  • Bachelor’s or master’s degree in computer science, cybersecurity, data science, or related engineering field
  • proven experience (8+ years) in cybersecurity, with at least 3+ years in automation-focused roles
  • deep understanding of cybersecurity frameworks and concepts, including attack vectors, threat landscapes, and defence mechanisms
  • strong experience with SIEM/SOAR and EDR/XDR platforms and tools
  • experience in Machine Learning (ML) and Agentic AI applied for security use-cases
  • experience with anomaly detection, behavioural modeling, and predictive analytics in cybersecurity contexts
  • experience integrating machine learning models into security operations workflows in enterprise environments
  • proficiency in languages such as Python, Go, SPL, and YARA-L, and in building automation frameworks
  • hands-on experience with big data technologies and cloud environments (AWS, Azure, GCP)
  • familiarity with regulatory requirements and compliance frameworks (e.g., GDPR, NIST, ISO 27001)
Job Responsibility
  • Drive the SOAR development lifecycle, in support of security operations and engineering teams
  • develop SOAR playbooks and logic
  • build integrations across SIEM, SOAR, EDR, identity platforms, and cloud-native services
  • write, test, and maintain automation scripts and workflows
  • deliver API solutions for SOC and enterprise Business Units
  • design and implement reusable automation services, APIs, and playbooks
  • maintain documentation for scripts, integrations, and workflows
  • debug and resolve technical issues in the automation lifecycle
  • apply advanced analytics, Machine Learning, and AI for security automation
  • partner with SOC/IR leadership and IT stakeholders to gather SOAR requirements and develop solutions
What we offer
  • Health and wellbeing benefits
  • career development programs
  • unconditional inclusion
  • flexibility to manage work and personal needs
  • Fulltime

Principal AI Technology & Innovation Specialist

The Principal AI Technology & Innovation Specialist at NTT DATA is a key role fo...
Location
South Africa, Johannesburg
Salary
Not provided
NTT DATA
Expiration Date
Until further notice
Requirements
  • Bachelor’s degree or equivalent in Computer Science, Artificial Intelligence, Data Science, or a related field
  • Advanced degrees (MSc/PhD) in AI/ML fields preferred
  • TOGAF, COBIT, or related enterprise architecture certifications are beneficial
  • Certifications in machine learning or cloud-based AI platforms (e.g., AWS Certified Machine Learning Specialty, Google Cloud AI Engineer) are advantageous
  • Extensive experience in leading enterprise AI innovation and architecture initiatives
  • Proven track record of evaluating, piloting, and operationalizing AI solutions in enterprise environments
  • Experience working across multiple industries and large-scale IT organizations
  • Hands-on experience in AI/ML development, integration, and lifecycle management
  • Familiarity with regulations governing AI use, such as the EU AI Act, and experience in operationalizing compliance measures
  • Deep knowledge of modern AI paradigms including generative AI (e.g., LLMs), machine learning infrastructure, AI model lifecycle, and MLOps
Job Responsibility
  • Lead the evaluation and strategic assessment of emerging AI technologies, platforms, and vendor solutions, advising on technical and ethical feasibility
  • Design and guide the development of AI capabilities and innovation pilots, translating business goals into AI-enabled solutions
  • Define architectural blueprints for integrating AI technologies into IT systems and product platforms, ensuring security, scalability, and alignment with enterprise standards
  • Develop frameworks for responsible AI adoption including model evaluation, explainability, privacy, compliance (e.g., EU AI Act), and ethical use
  • Partner with product and platform teams to align AI innovations with enterprise technology strategy and business outcomes
  • Drive initiatives for AI prototyping, proof-of-concepts (PoCs), and production readiness assessments
  • Monitor vendor roadmaps and contribute to the strategy for selecting and onboarding external AI capabilities
  • Act as a center of excellence for AI within the IT organization, driving awareness, knowledge sharing, and standardization
  • Collaborate with enterprise architects and platform leads to integrate AI tools into data infrastructure, software architecture, and cloud environments
  • Perform technical due diligence on third-party AI services and models, ensuring fit-for-purpose and cost-effective solutions
What we offer
  • Workplace embraces diversity and inclusion – it’s a place where you can grow, belong and thrive
  • Fulltime

Senior Software Engineer, Managed AI - AI model LifeCycle

The Senior Software Engineer for the Model LifeCycle team will contribute to bui...
Location
United States, San Francisco
Salary
172425.00 - 209000.00 USD / Year
Crusoe
Expiration Date
Until further notice
Requirements
  • Bachelor's degree in Computer Science, Engineering, or a related field
  • Experience delivering production-ready features
  • Familiarity with essential cloud-based services (e.g., compute, storage, networking)
  • Familiarity with Generative AI (Large Language Models, Multimodal)
  • Experience with AI infrastructure components (training, inference)
  • 4-5+ years of industry experience with demonstrated history of consistent success leading a varied portfolio of initiatives across your function
Job Responsibility
  • Implement and maintain systems for fine-tuning large foundation models (SFT, PEFT, LoRA, adapters), including multi-node orchestration, checkpointing, failure recovery, and cost-efficient scaling
  • Implement and maintain end-to-end training pipelines for Large Language Models
  • Implement components for distillation and reinforcement learning pipelines (e.g., preference optimization, policy optimization, reward modeling)
  • Develop and maintain core agent execution infrastructure
  • Implement features for dataset, model, and experiment management, focusing on versioning, lineage, evaluation, and reproducible fine-tuning
  • Work closely with Senior Engineers and Principal Engineers, as well as product and platform teams, to implement system abstractions and APIs
  • Contribute to technical discussions on training runtimes, scheduling, storage, and model lifecycle management
  • Engage with the open-source LLM ecosystem
What we offer
  • Restricted Stock Units
  • Health insurance package options that include HDHP and PPO, vision, and dental for you and your dependents
  • Employer contributions to HSA accounts
  • Paid Parental Leave
  • Paid life insurance, short-term and long-term disability
  • Teladoc
  • 401(k) with a 100% match up to 4% of salary
  • Generous paid time off and holiday schedule
  • Cell phone reimbursement
  • Tuition reimbursement
  • Fulltime

Principal AI Engineer

Location
Canada, Mississauga
Salary
Not provided
Citi
Expiration Date
Until further notice
Requirements
  • Experience: Extensive experience in designing and building AI/ML solutions, with a significant focus on generative AI and Large Language Models (LLMs)
  • Gen AI Expertise: Deep understanding of modern AI architectures and techniques, including Retrieval-Augmented Generation (RAG), fine-tuning, function calling, and AI agentic workflows
  • Programming Proficiency: Expert-level skills in Python and extensive experience with core AI/ML libraries such as PyTorch, TensorFlow
  • System Design: Proven ability to architect and develop large-scale, distributed, multi-tier applications. Strong knowledge of microservices, API design, and system integration
  • MLOps: Solid understanding of MLOps principles and experience with tools for model versioning, deployment, monitoring, and lifecycle management
  • Leadership: Demonstrated experience serving as a technical lead, architect, or principal engineer, with a track record of mentoring team members and driving projects to completion
Job Responsibility
  • Architectural Leadership: Design and architect end-to-end generative AI solutions, from proof-of-concept to production, ensuring scalability, performance, and reliability
  • Technical Strategy: Develop and maintain a comprehensive strategic roadmap for generative AI adoption, evaluating new models, techniques, and platforms to keep our capabilities at the forefront of the industry
  • Solution Development: Lead the hands-on development of complex AI systems, including Retrieval-Augmented Generation (RAG) pipelines, autonomous AI agents, fine-tuning workflows, and custom model integrations
  • Best Practices & Standards: Establish and govern best practices for the full AI development lifecycle, including prompt engineering, model evaluation, MLOps, and data management
  • Cross-Functional Partnership: Collaborate closely with multiple management teams and business units to identify high-impact use cases and ensure the successful integration of AI solutions to meet business goals
  • Mentorship & Guidance: Serve as a senior advisor and coach to other engineers and analysts, fostering a culture of innovation and technical excellence. Allocate work and provide technical direction to the team
  • Risk & Compliance: Appropriately assess risk when business decisions are made, demonstrating consideration for the firm's reputation and safeguarding its clients and assets. Drive compliance with all applicable laws, rules, and regulations, particularly those related to AI ethics, data privacy, and model bias
  • Innovation and Research: Stay abreast of the latest advancements in generative AI research, and translate state-of-the-art developments into practical, innovative solutions
  • Fulltime