ML Software Tool Development Engineer

Cerebras Systems

Location:
Canada, Toronto

Contract Type:
Not provided

Salary:
Not provided

Job Description:

Cerebras Systems builds the world's largest AI chip, 56 times larger than GPUs. Our novel wafer-scale architecture provides the AI compute power of dozens of GPUs on a single chip, with the programming simplicity of a single device. This approach allows Cerebras to deliver industry-leading training and inference speeds and empowers machine learning users to run large-scale ML applications effortlessly, without the hassle of managing hundreds of GPUs or TPUs.

Cerebras' current customers include top model labs, global enterprises, and cutting-edge AI-native startups. OpenAI recently announced a multi-year partnership with Cerebras to deploy 750 megawatts of scale, transforming key workloads with ultra-high-speed inference.

Thanks to the groundbreaking wafer-scale architecture, Cerebras Inference offers the fastest generative AI inference solution in the world, over 10 times faster than GPU-based hyperscale cloud inference services. This order-of-magnitude increase in speed is transforming the user experience of AI applications, unlocking real-time iteration and increasing intelligence via additional agentic computation.

Job Responsibility:

  • Lead the design and implementation of system-level debugging, validation, and observability platforms
  • Develop automated systems for collecting and analyzing numerical and execution anomalies
  • Create visualization and analysis tools to enable efficient root-cause investigation
  • Build frameworks for failure classification, regression detection, and anomaly monitoring
  • Extend compilers, runtimes, and programming interfaces to support advanced profiling and instrumentation
  • Improve system bring-up, low-level debug, and validation workflows
  • Partner cross-functionally with compiler, hardware, firmware, runtime, and infrastructure teams
  • Establish best practices for debuggability, reliability, and operational excellence
  • Lead high-impact initiatives
  • Support incident response and drive long-term corrective actions

Requirements:

  • Strong proficiency in C++ and Python, with a track record of building reliable, high-performance systems and tooling
  • Demonstrated experience debugging complex hardware/software systems and driving issues to root cause
  • Experience analyzing system-level data structures, execution graphs, or dependency networks for diagnostics and validation
  • Proven ability to design and build intuitive visualization and analysis tools for complex technical data
  • Experience with compiler internals, custom hardware interfaces, or low-level protocol design
  • Strong written and verbal communication skills, with the ability to explain technical concepts to diverse stakeholders
  • Ability to work independently and lead complex technical projects end-to-end

Nice to have:

  • Familiarity with machine learning training and inference pipelines, especially distributed training and large-model scaling
  • Prior work on high-performance clusters, HPC systems, or custom hardware/software co-design

What we offer:

  • Build a breakthrough AI platform beyond the constraints of the GPU
  • Publish and open source your cutting-edge AI research
  • Work on one of the fastest AI supercomputers in the world
  • Enjoy job stability with startup vitality
  • Enjoy a simple, non-corporate work culture that respects individual beliefs

Additional Information:

Job Posted:
February 20, 2026

Similar Jobs for ML Software Tool Development Engineer

Senior Software Engineer - ML Infrastructure

We build simple yet innovative consumer products and developer APIs that shape h...
Location:
United States, San Francisco
Salary:
180000.00 - 270000.00 USD / Year
Plaid
Expiration Date:
Until further notice
Requirements:
  • 5+ years of industry experience as a software engineer, with strong focus on ML/AI infrastructure or large-scale distributed systems
  • Hands-on expertise in building and operating ML platforms (e.g., feature stores, data pipelines, training/inference frameworks)
  • Proven experience delivering reliable and scalable infrastructure in production
  • Solid understanding of ML Ops concepts and tooling, as well as best practices for observability, security, and reliability
  • Strong communication skills and ability to collaborate across teams
Job Responsibility:
  • Design and implement large-scale ML infrastructure, including feature stores, pipelines, deployment tooling, and inference systems
  • Drive the rollout of Plaid’s next-generation feature store to improve reliability and velocity of model development
  • Help define and evangelize an ML Ops “golden path” for secure, scalable model training, deployment, and monitoring
  • Ensure operational excellence of ML pipelines and services, including reliability, scalability, performance, and cost efficiency
  • Collaborate with ML product teams to understand requirements and deliver solutions that accelerate experimentation and iteration
  • Contribute to technical strategy and architecture discussions within the team
  • Mentor and support other engineers through code reviews, design discussions, and technical guidance
What we offer:
  • medical, dental, vision, and 401(k)
  • Fulltime

Sr. Software Development Engineer

You will safeguard the quality of our AI and GenAI features by evaluating model ...
Location:
India, Hyderabad
Salary:
Not provided
Highspot
Expiration Date:
Until further notice
Requirements:
  • 4+ years of experience as a Software Development Engineer in AI/ML systems
  • Strong coding skills in Python (evaluation pipelines, data processing, metrics computation)
  • Hands-on experience with evaluation frameworks (Ragas or equivalent)
  • Knowledge of vector embeddings, similarity search, and RAG evaluation
  • Familiarity with evaluation metrics (precision, recall, F1, relevance, hallucination detection)
  • Understanding of LLM-as-a-judge evaluation approaches
  • Strong analytical and problem-solving skills; ability to combine human judgment with automated evaluations
  • Bachelor’s or Master’s degree in Computer Science, Data Science, or related field
  • Strong English written and verbal communication skills
Job Responsibility:
  • Evaluation Frameworks – Develop reusable, automated evaluation pipelines using frameworks such as Ragas; integrate LLM-as-a-judge methods for scalable assessments
  • Golden Datasets – Build and maintain high-quality benchmark datasets in collaboration with subject matter experts
  • AI Output Validation – Evaluate results across text, documents, audio, and video, using both automated metrics and human-in-the-loop judgment
  • Metric Evaluation – Implement and track metrics such as precision, recall, F1 score, relevance scoring, and hallucination penalties
  • RAG & Embeddings – Design and evaluate retrieval-augmented generation (RAG) pipelines, vector embedding similarity, and semantic search quality
  • Error & Bias Analysis – Investigate recurring errors, biases, and inconsistencies in model outputs; propose solutions
  • Framework & Tooling Development – Build tools that enable large-scale model evaluation across hundreds of AI agents
  • Cross-Functional Collaboration – Partner with ML engineers, product managers, and QA peers to integrate evaluation frameworks into product pipelines
  • Fulltime

Software Development Engineer II – Machine Learning Operations

We are seeking a Full-Stack Engineer to be a key member of the Everseen ML Opera...
Location:
Serbia, Belgrade
Salary:
Not provided
Everseen
Expiration Date:
Until further notice
Requirements:
  • 2-3 years of work experience in a relevant role at a global SaaS company
  • Experience in ML infrastructure, MLOps, or Platform Engineering
  • Strong programming skills, with front-end development experience in React and Angular
  • Understanding of the ML lifecycle, model versioning, and monitoring
  • Experience with back-end frameworks on top of NodeJS (NestJS)
  • Hands-on experience with Kubernetes, Docker, and cloud services
  • Experience with CI/CD tools (e.g., GitLab, Jenkins)
  • Excellent communication and collaboration skills
  • Experience with Infrastructure as Code (e.g., Terraform)
  • Possesses a comprehensive understanding of technical concepts and terminology relevant to Everseen's products and services
Job Responsibility:
  • Design and develop new features and functionalities
  • Ensure that the developed solutions meet project objectives and enhance user experience
  • Design and implement reusable, testable, efficient, and elegant code based on requirements
  • Ensure adherence to coding standards and best practices
  • Create, maintain, and run unit tests for both new and existing applications and services
  • Aim to deliver defect-free and well-tested solutions
  • Analyze and collect data from various sources such as log files, application stack traces, and thread dumps
  • Utilize data analysis to identify trends, patterns, and potential areas for improvement
  • Create and maintain CI/CD integration using various tools
  • Automate the build, test, and deployment processes to ensure efficiency and reliability
  • Fulltime

Senior Software Engineer – ML Model Compliance & Automation

We are seeking a highly skilled and motivated Senior Software Engineer to lead t...
Location:
India, Jaipur
Salary:
Not provided
InfoObjects
Expiration Date:
Until further notice
Requirements:
  • Experience Required: 3 - 7 yrs
  • GoLang (preferred)
  • Python (preferred)
  • Bash
  • MLOps Tools: KitOps, MLModelCI, MLflow, ONNX, TensorFlow, PyTorch, Docker
  • SBOM & Security: Syft, Grype, Trivy, CycloneDX, SPDX
  • CI/CD: GitHub Actions, GitLab CI, Jenkins, ArgoCD
  • Infra: Kubernetes, Docker, Helm, Terraform
  • Cloud: AWS, GCP, Azure (EKS/GKE/ECS preferred)
  • Version Control: Git, GitOps
Job Responsibility:
  • Model Packaging & Artifact Management: Design and implement workflows for packaging ML models using KitOps, ONNX, MLflow, or TensorFlow SavedModel
  • Manage model artifact versioning, registries, and reproducibility
  • Ensure artifact integrity, consistency, and traceability across CI/CD pipelines
  • Model Profiling & Optimization: Automate model profiling (latency, size, ops) using MLModelCI, TorchServe, or ONNX Runtime
  • Apply quantization, pruning, and format conversions (e.g., FP32→INT8) for optimization
  • Embed profiling and optimization checks into CI/CD pipelines to assess deployment readiness
  • Compliance & SBOM Generation: Develop pipelines to generate and validate SBOMs for ML models
  • Implement compliance checks for licensing, vulnerabilities, and security using CycloneDX, SPDX, Syft, or Trivy
  • Validate schema, dependencies, and runtime environments for production readiness
  • Cloud Integration & Deployment: Automate model registration, endpoint creation, and monitoring setup in AWS/GCP/Azure
  • Fulltime

Software Engineer

Designs, develops, troubleshoots and debugs software programs for software enhan...
Location:
India, Bangalore
Salary:
Not provided
Hewlett Packard Enterprise
Expiration Date:
Until further notice
Requirements:
  • BS degree in Computer Science or equivalent experience
  • 1 to 3 years of experience
  • Expert knowledge of Layer 2 and Layer 3 technologies
  • Deep understanding of Clos-based Data Center network architecture (3-stage and 5-stage) and Data Center Interconnect (DCI)
  • Excellent understanding of features such as Dot1x, DHCP, Firewall, class-of-service, and EVPN-VXLAN
  • Proficient in Class of Service and DCQCN, which are heavily used in AI/ML-based Clos networks
  • Expert knowledge of Python programming
  • Deep understanding of software, networking, and system concepts, including Linux internals, distributed system concepts and network troubleshooting tools
  • Excellent interpersonal and communication skills with a proven ability to develop and maintain effective relationships
  • Strong problem solving and decision-making skills
Job Responsibility:
  • Design topologies and build network configurations that map well-optimized network reference designs
  • Plan, develop and execute automated and manual test plans for the reference design readiness
  • Provide constructive feedback, report issues, and interact with developers to deliver best in class product quality
  • Review requirements from the Product Management, Technical Marketing & Account teams
  • Utilize available network troubleshooting tools, including network packet captures, monitoring devices, log files, and customer inputs to facilitate effective issue resolution
What we offer:
  • Health & Wellbeing
  • Personal & Professional Development
  • Unconditional Inclusion
  • Fulltime

Staff Software Engineer, Backend

The Staff Engineer will work closely with AI/ML engineers, product managers, app...
Location:
United States, NYC
Salary:
160000.00 - 190000.00 USD / Year
Conductor
Expiration Date:
Until further notice
Requirements:
  • Completed studies in Computer Science, Mathematics, Engineering, or a related field, or equivalent professional experience
  • 8+ years of experience in software development, with experience in product-driven companies
  • Strong expertise in system design, distributed computing, and scalable architecture patterns for handling large datasets and high-throughput applications
  • Proficiency in multiple programming languages with strong Python coding skills. Experience with Java is highly valued
  • Strong database experience including both SQL and NoSQL systems, with knowledge of data modeling and optimization techniques
  • Experience with AI/ML technologies including LLMs, vector databases (e.g., Milvus), embeddings, and ML frameworks
  • Knowledge of MLOps practices, model deployment, and AI system integration in production environments
  • Experience working across the full software development lifecycle including CI/CD, monitoring, testing, and production deployment
  • Proven track record of technical leadership, mentoring engineers, and driving engineering excellence within teams
  • Up-to-date with rapidly-evolving technologies and demonstrated ability to evaluate and adopt new tools and frameworks
Job Responsibility:
  • Lead the technical architecture, design, and implementation of large-scale distributed systems and data platforms to support customer needs and business growth
  • Oversee the planning, execution, and successful delivery of complex engineering projects, ensuring adherence to engineering best practices and quality standards
  • Design and build scalable, high-performance backend systems and APIs that handle millions of requests and large datasets efficiently
  • Architect robust data processing pipelines and ETL workflows using modern cloud technologies and distributed computing frameworks
  • Drive technical decision-making across the engineering organization, evaluating trade-offs and establishing engineering standards and practices
  • Lead cross-functional collaboration with product, AI/ML engineering, data engineering, and infrastructure teams to deliver comprehensive solutions
  • Build and maintain CI/CD pipelines, monitoring systems, and deployment automation to ensure reliable software delivery
  • Implement AI/ML capabilities including LLM integration, vector databases, and intelligent content processing workflows
  • Mentor senior and junior engineers, fostering technical excellence and knowledge sharing within the engineering organization
What we offer:
  • 100% covered employee medical plan
  • dental & vision plans
  • 401(k) with employer contribution
  • an unlimited vacation policy
  • 10 sick days
  • short-term disability
  • long-term disability
  • generous paid parental leave
  • employee assistance program
  • flexible savings accounts
  • Fulltime

Sr. Embedded Software Engineer

Location:
Canada, Toronto or Ottawa
Salary:
Not provided
Advanced Technology Search Group
Expiration Date:
Until further notice
Requirements:
  • Bachelor’s in Electrical Engineering, Computer Engineering, or Computer Science
  • Experience with C/C++
  • Experience writing Python scripts
  • Ability to read and understand board schematics and device datasheets
  • Ability to debug embedded software using oscilloscopes and logic analysers
  • Experience with SCM tools (Git or SVN)
  • Strong analytical and problem-solving abilities
  • Strong communication skills
  • Ability to work in a multi-site team environment
Job Responsibility:
  • Design, develop, and optimize embedded software for silicon-based systems throughout the entire lifecycle, from conceptualization to deployment, ensuring seamless integration and optimal performance
  • Collaborate with cross-functional teams including hardware engineers, software developers, and machine learning experts to integrate ML models into embedded systems
  • Architect and implement software frameworks for efficient data processing, device control, and communication protocols
  • Conduct performance analysis, debugging, and optimization of embedded systems for reliability and efficiency
  • Develop software and firmware applications to interact with hardware and third-party interfaces
  • Contribute to the architecture and design of the overall AI solution
  • Develop debug and performance analysis tools for AI solution development
  • Play a role in all the phases of embedded AI software development, from requirement gathering, analysis, design, development, testing and final release to customers
  • Provide clear and timely communication related to status and other key aspects of the project to the leadership team
  • Develop and maintain software documentation, including specifications, design documents, and test plans
  • Fulltime

AI Software Engineer III

Planet DDS is a leading provider of a platform of cloud-based solutions that emp...
Location:
United Kingdom, Glasgow
Salary:
Not provided
Planet DDS
Expiration Date:
Until further notice
Requirements:
  • 5-7 years of professional software engineering experience
  • At least 4 years in AI/ML-focused roles
  • Bachelor’s or Master’s degree in Computer Science, Machine Learning, Artificial Intelligence, or related field
  • Experience working in a SaaS or enterprise software environment
  • Publications or contributions to open-source AI/ML projects
  • Exposure to reinforcement learning, generative AI (LLMs, diffusion models), or real-time inference systems
Job Responsibility:
  • Design, develop, and deploy AI and machine learning models in production environments
  • Architect scalable solutions that integrate AI capabilities into our products and services
  • Collaborate with data scientists, product managers, and backend/front-end engineers to translate prototypes into reliable, maintainable code
  • Own end-to-end development of AI systems, including data ingestion, model training, evaluation, and deployment
  • Implement best practices in model versioning, monitoring, and continuous improvement
  • Contribute to the evolution of our AI/ML infrastructure, including CI/CD pipelines and MLOps tools
  • Stay current on advancements in AI, ML, and deep learning and assess their applicability to business needs
  • Ensure AI solutions are ethical, interpretable, and aligned with regulatory requirements
  • Fulltime