CrawlJobs Logo

Software Engineer, AI Infrastructure

United States, New York, NY · Job Posted December 08, 2025
Apply Position
Job Link Share

Job Description

As a Software Engineer on our AI Infrastructure team, you will help design the core systems that power Fireworks AI’s generative AI platform. You will help build infrastructure and tools that ensure the reliability, performance, quality, and availability of our AI system. Our mission is to make Fireworks AI the most reliable and user friendly generative AI platform in the world. You will partner closely with our cloud infrastructure team, product team, and performance team to deliver infrastructure that bridges the gap between our customer and the ultra-performant proprietary Fireworks inference engine.

Job Responsibility

  • Contribute to the design and development of scalable backend infrastructure that supports distributed training, inference, and data pipelines
  • Build and maintain core backend services such as LLM CI/CD pipeline, control plane, and model serving systems
  • Support performance optimization, cost efficiency, and reliability improvements across compute, storage, and networking layers
  • Building frameworks and safeguards to ensure Fireworks AI has the best model quality in the industry
  • Collaborate with performance, training, and product teams to translate research and product needs into infrastructure solutions
  • Participate in code reviews, technical discussions, and continuous integration and deployment processes

Requirements

  • Bachelor’s degree in Computer Science, Engineering, or a related technical field (or equivalent practical experience)
  • 3 years of experience in software engineering, with a focus on infrastructure or machine learning systems
  • Strong programming skills in Python, Go, or a similar language
  • Proven experience in ML infrastructure and tooling (e.g., PyTorch, MLflow, Vertex AI, SageMaker, Kubernetes, etc.)
  • Basic understanding of LLM knowledge (e.g., context length, disaggregated prefill, KV cache memory estimation, etc)

Nice to have

  • 5+ years of experience in software engineering, with a focus on infrastructure or machine learning systems
  • Experience with open source inference engine like vLLM, Sglang, or TRT-LLM
  • Contributions to open-source infrastructure or ML projects
  • Experience in building large scale ML/MLOps infrastructure

What we offer

  • Solve Hard Problems: Tackle challenges at the forefront of AI infrastructure
  • Build What’s Next: Work with bleeding-edge technology that impacts how businesses and developers harness AI globally
  • Ownership & Impact: Join a fast-growing, passionate team where your work directly shapes the future of AI—no bureaucracy, just results
  • Learn from the Best: Collaborate with world-class engineers and AI researchers who thrive on curiosity and innovation

Looking for more opportunities?

Search for other job offers that match your skills and interests.

Similar Jobs for

Software Engineer, AI Infrastructure

8 matching positions

Senior Software Engineer and Principal Software Engineer - Power Point AI Team

The PowerPoint team is embarking on an exciting new chapter - evolving a product...
Location
Location
United States , Redmond
Salary
Salary:
119800.00 - 234700.00 USD / Year
https://www.microsoft.com/ Logo
Microsoft Corporation
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Bachelor's Degree in Computer Science or related technical field AND 4+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
  • Ability to meet Microsoft, customer and/or government security screening requirements are required for this role
  • Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud background check upon hire/transfer and every two years thereafter
  • Bachelor's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
  • 8+ years of experience in backend service engineering, including work on high-scale infrastructures
  • Proficiency in one or more systems programming languages such as C#, C++
  • 1+ years of experience in software engineering, designing and developing systems (and APIs) that deploy and integrate with AI models
  • 2+ years of experience working with rich telemetry, making data driven decisions, and carrying out rapid experimentation
  • 2+ years of experience building software for scale, performance, and reliability
  • Academic or industry experience with building, finetuning, deploying or building eval-driven systems utilizing the models (any category)
Job Responsibility
Job Responsibility
  • Lead design and delivery of complex, scalable AI features ensuring resilience and exceptional user experience
  • Drive technical strategy and architecture decisions across multiple services, influencing partner teams and aligning with compliance and security requirements
  • Champion modern engineering practices, including AI-driven approaches, automation, and cloud-native patterns, across the full development lifecycle
  • Mentor and guide engineers, fostering technical excellence and continuous improvement in security, reliability, and performance
  • Collaborate cross-org to solve challenging technical problems, streamline processes, and reduce operational costs while improving live-site health
  • Design and implement scalable backend services optimized for machine learning workflows and large language model integration
  • Develop and maintain evaluation-driven systems that leverage text and multimodal inputs (e.g., images) to power visual-creation experiences
  • Build and optimize APIs and infrastructure to support high-performance model inference and experimentation at scale
  • Collaborate with product, ML, and design teams to integrate models into user-facing features, ensuring seamless functionality and performance
  • Conduct model evaluations and experiments, analyze results, and iterate on improvements to enhance accuracy and user experience
  • Fulltime
Read More
Arrow Right

Senior Software Engineer - AI Infrastructure (Scheduler) - CoreAI

The AI Platform organization builds the end-to-end Azure AI stack, from the infr...
Location
Location
United States , Redmond
Salary
Salary:
119800.00 - 234700.00 USD / Year
https://www.microsoft.com/ Logo
Microsoft Corporation
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Bachelor's Degree in Computer Science or related technical field AND 4+ years technical engineering experience with coding in languages including, but not limited to, C++, C#, Java, Scala, Rust, Go, TypeScript | OR equivalent experience
  • Ability to meet Microsoft, customer and/or government security screening requirements are required for this role
  • These requirements include but are not limited to the following specialized security screenings: Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud background check upon hire/transfer and every two years thereafter
Job Responsibility
Job Responsibility
  • Work on the design and development of the core AI Infrastructure distributed and in-cluster services that support large scale AI training and inferencing
  • Develop, test, and maintain control plane services written in C#, hosted on Service Fabric or Kubernetes (AKS) clusters
  • Enhance systems and applications to ensure high stability, efficiency and maintainability, low latency, tight cloud security
  • Provide operational support and DRI (on-call) responsibilities for the service
  • Develop and foster a deep understanding of the machine learning concepts, use cases, and relevant services used by our customers
  • Collaborate closely with service engineers, product managers, and internal applied research and data science teams within Microsoft to build better solutions together
  • Provide vision, expertise, and technical leadership to other team members
  • Help to grow talent in these areas
  • Embody our culture and values
  • Fulltime
Read More
Arrow Right

Software Engineer - AI Infrastructure

We’re looking for a software engineer to join our Infrastructure team—building a...
Location
Location
United States , New York City
Salary
Salary:
135000.00 - 280000.00 USD / Year
assembled.com Logo
Assembled
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Have 6+ years of engineering experience, with past ownership of high-scale, production-critical infrastructure
  • Have experience with distributed systems and container orchestration (especially Kubernetes)
  • Have worked with AI/ML platforms or are excited to build foundational infrastructure for LLM-based applications
  • Thrive in fast-paced environments with shifting requirements and ambiguous problem spaces
  • Are motivated by impact, enjoy deep technical challenges, and want to work cross-functionally across security, AI, and product
  • Have strong familiarity with one or more parts of our tech stack: Cloud provider: AWS
  • Orchestration: Kubernetes + Karpenter
  • LLM integration: Experience with OpenAI, Anthropic, or open-source model serving (e.g., vLLM, HuggingFace TGI, Ray Serve)
  • Prompt & embedding infrastructure: Vector databases (e.g., Pinecone, Weaviate, PGVector), semantic search, prompt templating systems
  • Datastores: Postgres + PgBouncer, Snowflake, Redis
Job Responsibility
Job Responsibility
  • Agent service reliability and scaling: We manage and scale the infrastructure that serves LLM-powered agents across chat, email, and voice. This includes selecting inference strategies, integrating with model providers (e.g. OpenAI, Anthropic), and dynamically routing traffic for performance and cost efficiency
  • Prompt and embedding storage systems: Assist relies heavily on dynamically generated prompts and semantic search across support content. The team owns highly-available, fast-access storage and indexing layers optimized for real-time AI interactions
  • Privacy and security: Enterprises expect strict guardrails around AI use. We’re building systems like network-level intrusion detection (IDS/IPS), audit logging, and LLM usage policy enforcement to meet these expectations and unlock new sales channels
  • Observability and usage analytics: We operate systems that surface key metrics—token usage, latency, cost per response, and quality signals—so the Assist team can continuously improve Assist’s performance and accuracy
  • AI-powered developer tools: We are beginning to explore and evangelize the use of AI to accelerate internal engineering workflows—through internal chat agents, pair programming tools, and intelligent automation for deployment, debugging, and on-call. Our goal is to empower engineers across the company to build faster and more confidently with AI
What we offer
What we offer
  • Generous medical, dental, and vision benefits
  • Paid company holidays, sick time, and unlimited time off
  • Monthly credits to spend on each: professional development, general wellness, Assembled customers, and commuting
  • Paid parental leave
  • Hybrid work model with catered lunches everyday (M-F), snacks, and beverages in our SF & NY offices
  • 401(k) plan enrollment
  • Stock options are provided as part of the compensation package
  • Fulltime
Read More
Arrow Right

Staff Infrastructure Software Engineer - AI Platform

We are currently seeking a Staff Software Engineer to join the AI Platform team ...
Location
Location
United Kingdom , Edinburgh
Salary
Salary:
Not provided
addepar.com Logo
Addepar
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Extensive experience as a Software/Backend Engineer, with a track record of taking on increasing responsibility
  • Experience across the full product lifecycle: designing, implementing, shipping, scaling, operationalizing, and maintaining technology/SaaS products
  • Exceptional Programming skills and fundamentals in Python/Go/Java, with a proven track record of building large scale production systems
  • Proficient experience with diverse compute environments including microservices (K8s), Databricks and serverless architectures (e.g. AWS Lambda)
  • Demonstrable experience leading initiatives with infrastructure-as-code tools such as Terraform in complex, multi-account environments
  • Proficient experience with comprehensive monitoring and alerting stacks (e.g. Prometheus/Grafana/Sentry/cloud-native tools), with a focus on observability strategy
  • Excellent interpersonal and communication skills to effectively collaborate with multi-functional teams, articulate complex technical concepts, and influence outcomes
Job Responsibility
Job Responsibility
  • Design and build the production runtime for LLM-based agents and products, creating the services and infrastructure that serve autonomous agents
  • Develop deep application-level knowledge to proactively inform and influence requirements, constraints and best practices for implementing composable, complex AI systems
  • Lead the design, implementation, and automation of production infrastructure on a variety of cloud environments (Kubernetes/Databricks), to enable us to ship and scale AI features instantly
  • Evangelize and promote disciplined, best engineering practices to enforce strong production hygiene and culture
  • Initiate and lead collaborations with cross-functional teams to identify and resolve complex application or infrastructure issues, serving as a technical subject matter expert
  • Architect, build, and maintain advanced, automated CI/CD pipelines e.g. using Jenkins, ArgoCD, AWS CodeBuild/Pipeline, GitHub Actions, or similar, establishing best practices for deployment strategies (e.g., blue/green, canary)
  • Develop systems and best practices monitoring, alerting, and troubleshooting of our probabilistic and AI-driven systems and broader software stack
Read More
Arrow Right

Staff Infrastructure Software Engineer, Enterprise AI

Scale GP is building the next generation of enterprise-grade Generative AI produ...
Location
Location
United States , New York; San Francisco
Salary
Salary:
216200.00 - 270250.00 USD / Year
scale.com Logo
Scale
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Proven experience in a senior role
  • 5+ years of full-time software engineering experience
  • Deep understanding of modern infrastructure practices, including CI/CD, IaC (e.g., Terraform, Helm Charts), container orchestration (e.g., Kubernetes) and observability platforms (e.g., Datadog, Prometheus, Grafana)
  • Extensive experience with at least one major cloud provider (AWS, Azure, or GCP)
  • Strong knowledge of security and compliance in enterprise environments, with a focus on access management, data isolation, and customer-specific VPC setups
  • Proficiency in Python or JavaScript/TypeScript, and SQL
Job Responsibility
Job Responsibility
  • Define the architectural patterns for our multi-cloud infrastructure to support secure, reliable, and scalable Agentic workflows for enterprise customers
  • Lead the infrastructure roadmap with a strong focus on compliance, privacy, and security standards, including designing change management and data isolation strategies
  • Own the development and maintenance of our best-in-class Agentic observability platform (logging, metrics, tracing, and analytics) to proactively ensure system health and enable rapid incident response
  • Drive developer efficiency by building automated tooling and championing Infrastructure-as-Code (IaC) paradigms throughout the engineering organization
  • Solve the toughest engineering problems related to multi-tenancy, data isolation, and high-performance inference at a massive scale, taking end-to-end ownership across the full product lifecycle
What we offer
What we offer
  • Comprehensive health, dental and vision coverage
  • retirement benefits
  • a learning and development stipend
  • generous PTO
  • equity based compensation
  • additional benefits such as a commuter stipend
  • Fulltime
Read More
Arrow Right

Senior Software Engineer, Data Infrastructure & AI

Fullstory Anywhere is one of Fullstory's three primary product verticals, and it...
Location
Location
United States , Atlanta
Salary
Salary:
160000.00 - 170000.00 USD / Year
fullstory.com Logo
Fullstory
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Significant experience building and operating high-throughput data pipelines (batch and/or streaming) in a major cloud platform, including work with cloud data warehouses like BigQuery, Snowflake, or Databricks.
  • Proficiency in Go, Python, Java or a similar language.
  • Hands-on experience with data transformation tooling such as dbt, with a strong understanding of data modeling and pipeline observability.
  • Familiarity with LLM integration patterns and evaluation approaches (e.g., LangSmith, Vertex AI, or comparable frameworks), or demonstrated ability to ramp quickly in applied AI.
  • A track record of owning major system areas end-to-end: driving architectural decisions, maintaining production health, and improving reliability over time.
Job Responsibility
Job Responsibility
  • Maintain, extend, and scale Go microservices that transform and deliver Fullstory session data into customer warehouses and power the team's MCP server that enables AI agent integrations.
  • Develop and maintain dbt models and pipeline orchestration to ensure timely, fault-tolerant data migrations across hundreds of customer destinations.
  • Define evaluation frameworks for LLM outputs using tools like Langsmith and Vertex AI, ensuring AI-powered customer agents produce accurate, useful results.
  • Investigate and resolve production incidents across the data pipeline, implementing systemic fixes that prevent entire classes of failure from recurring.
  • Write technical design documents that drive consensus on architectural changes, proactively surfacing scaling bottlenecks, edge cases, and cross-team dependencies.
  • Demonstrate sound technical judgment by de-risking work through spikes, taking on tech debt deliberately, and knowing when to escalate versus dig in.
What we offer
What we offer
  • Flexibility and Connection
  • flexible PTO policy
  • annual company-wide closure
  • Benefits
  • paid parental leave
  • Bereavement leave, including miscarriage/pregnancy loss
  • Learning opportunities
  • annual learning subsidy
  • Productivity support
  • monthly productivity stipend
  • Fulltime
Read More
Arrow Right

Lead Software Engineer, Backend (AI Infrastructure & Tooling)

Do you love building and pioneering in the technology space? Do you enjoy solvin...
Location
Location
United States , Plano
Salary
Salary:
179400.00 - 204700.00 USD / Year
capitalone.com Logo
Capital One
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Bachelor’s Degree
  • At least 4 years of professional software engineering experience (Internship experience does not apply)
  • At least 1 year experience with cloud computing (AWS, Microsoft Azure, Google Cloud)
Job Responsibility
Job Responsibility
  • Lead a portfolio of diverse technology projects and a team of developers with deep experience in distributed microservices, and full stack systems to create solutions that help meet regulatory needs for the company
  • Share your passion for staying on top of tech trends, experimenting with and learning new technologies, participating in internal & external technology communities, mentoring other members of the engineering community, and from time to time, be asked to code or evaluate code
  • Collaborate with digital product managers, and deliver robust cloud-based solutions that drive powerful experiences to help millions of Americans achieve financial empowerment
  • Utilize programming languages like Java, Python, SQL, Node, Go, and Scala, Open Source RDBMS and NoSQL databases, Container Orchestration services including Docker and Kubernetes, and a variety of AWS tools and services
What we offer
What we offer
  • performance based incentive compensation, which may include cash bonus(es) and/or long term incentives (LTI)
  • comprehensive, competitive, and inclusive set of health, financial and other benefits that support your total well-being
  • Fulltime
Read More
Arrow Right

Senior Software Engineer - Data Platform, AI Infrastructure

We are building a large-scale, productized data platform that powers critical in...
Location
Location
United States , Redmond
Salary
Salary:
119800.00 - 234700.00 USD / Year
https://www.microsoft.com/ Logo
Microsoft Corporation
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Bachelor's Degree in Computer Science or related technical field AND 4+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
  • Ability to meet Microsoft, customer and/or government security screening requirements are required for this role. These requirements include, but are not limited to the following specialized security screenings: Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud Background Check upon hire/transfer and every two years thereafter.
  • Strong programming experience in Python
  • Experience building and operating large-scale distributed systems
  • Hands-on experience with: Backend services or APIs (e.g., FastAPI, Flask, or similar)
  • Cloud-based infrastructure (Azure, AWS, or GCP)
  • Monitoring and observability systems (metrics, logging, alerting)
  • Experience designing systems with reliability, scalability, and operational clarity in mind
  • Proven ability to own and deliver production systems end-to-end
  • Ability to break down ambiguous problems, ask the right questions, and execute effectively
Job Responsibility
Job Responsibility
  • Design, build, and operate core components of a distributed data platform, including: Orchestration systems (e.g., Airflow or equivalent)
  • Backend services and APIs (Python/FastAPI or similar)
  • Monitoring, alerting, and reliability systems
  • Own the end-to-end lifecycle of platform components - from design through deployment, scaling, and maintenance
  • Ensure systems meet requirements for availability, performance, and data reliability at large scale
  • Define and enforce standardized patterns for infrastructure, deployment, and observability across the platform
  • Partner with data engineering teams to enable efficient, reliable data processing workflows
  • Diagnose and resolve complex issues in distributed systems, including performance bottlenecks and failure modes
  • Contribute to infrastructure-as-code and deployment systems to support reproducibility and operational excellence
  • Drive continuous improvements in system robustness, cost efficiency, and operational clarity
What we offer
What we offer
  • Certain roles may be eligible for benefits and other compensation. Find additional benefits and pay information here: https://careers.microsoft.com/us/en/us-corporate-pay
  • Fulltime
Read More
Arrow Right