Senior Staff AI Software System Design Engineer

China, Shanghai · Employment contract · Job Posted June 03, 2026

Job Description

As an AICE Software System Design Engineer, you will be responsible for the custom development, debugging, optimization, and technical support of machine learning software for AMD server GPUs.

Job Responsibility

  • Position technical proposals for top customers and provide them with technical support
  • Contribute significantly to customer PoC success
  • Drive custom requirements for AI software
  • Collaborate with different teams to analyze and optimize training and inference workloads and solutions
  • Analyze competitive solutions to identify their strengths and weaknesses and articulate value propositions
  • Apply your knowledge of software engineering best practices

Requirements

  • Expert knowledge in machine learning areas such as frameworks (e.g. vLLM, SGLang, Megatron-LM, DeepSpeed, TensorRT), distribution, kernel operators, compilers, runtime, drivers, and performance optimization for inference or training
  • Strong programming skills in C++ and Python
  • Hands-on experience with industry AI use scenarios, solutions, end-to-end pipelines, frameworks, or SDKs
  • Strong debugging and development skills
  • Bachelor's or Master's degree in Computer Science, Computer Engineering, Electrical Engineering, or equivalent

Nice to have

  • Ability to work independently, define project goals and scope, and lead your own development effort
  • Solid communication skills in both English and Mandarin
  • Knowledge of Linux DRM, HSA, and ROCm KMD/UMD drivers
  • Knowledge of compilers (Triton/TVM)

Similar Jobs for Senior Staff AI Software System Design Engineer

8 matching positions

Senior Staff Software Engineer - AI

GEICO is seeking an experienced Engineer with a passion for building high-perfor...
Location: United States, Seattle, WA; Austin, TX; Palo Alto, CA; Chicago, IL; Dallas, TX
Salary: 110000.00 - 230000.00 USD / Year
Company: Geico
Expiration Date: Until further notice
Requirements
  • Experience building and deploying ML systems in production with cross-functional engineering teams
  • Fluency in at least two modern languages such as Python, Go, Java, C++, or C# including object-oriented design
  • Experience architecting multi-component ML platforms using open-source/cloud-agnostic components, including datastores (PostgreSQL; NoSQL such as MongoDB, Cassandra, CosmosDB) and streaming (Kafka, Flink, or Spark Streaming)
  • Experience with end-to-end ML lifecycle: version control, CI/CD, Kubernetes, testing, monitoring, and production support
  • Experience with cloud providers (Azure, AWS or GCP) in production ML environments
  • Experience with observability tools and distributed systems monitoring, logging, tracing, and root cause analysis
  • Experience building multi-agent systems using LLMs and agentic frameworks (e.g., LangChain, LangGraph, AutoGen, Semantic Kernel, CrewAI)
  • Hands-on experience with RAG, semantic search, and vector databases (e.g., Milvus, pgvector, Qdrant, ElasticSearch)
  • Experience designing human-in-the-loop workflows and safety controls for autonomous systems
  • Strong architecture and design skills with ability to influence technical direction and roadmap
Job Responsibility
  • Design and build a multi-agent AI platform where specialized agents autonomously detect, diagnose, and resolve issues through agent-to-agent (A2A) collaboration
  • Develop intelligent agents using LLMs and agentic frameworks that coordinate detection, diagnostic, remediation, and knowledge tasks with minimal human intervention
  • Define agent interaction protocols, A2A communication standards, and evaluation frameworks for agent decision quality and autonomous action safety
  • Architect vector database solutions (Milvus, pgvector, Qdrant) for semantic search and RAG to enable context-aware agent decision-making
  • Build end-to-end ML pipelines for severity classification, anomaly detection, failure pattern recognition, and impact forecasting using observability data
  • Establish scalable orchestration infrastructure for multi-agent workflows with CI/CD, automated evaluation, canary releases, and rollback strategies
  • Implement monitoring for agent interactions, A2A communication patterns, decision quality, data drift, and system reliability
  • Lead technical architecture ensuring scalability, observability, and integration with existing alerting, logging, and monitoring systems
  • Define standards for agent safety, explainability, governance, and human-in-the-loop controls for high-impact automated actions
  • Partner with SRE, Product, and Engineering teams to translate reliability goals into measurable ML objectives and maintain pragmatic technical roadmaps
What we offer
  • Comprehensive Total Rewards program that offers personalized coverage tailor-made for you and your family’s overall well-being
  • Financial benefits including market-competitive compensation; a 401K savings plan vested from day one that offers a 6% match; performance and recognition-based incentives; and tuition assistance
  • Access to additional benefits like mental healthcare as well as fertility and adoption assistance
  • Supports flexibility: we provide workplace flexibility as well as our GEICO Flex program, which offers the ability to work from anywhere in the US for up to four weeks per year
  • Full-time

Senior Staff Software AI Test Engineer- Prisma SASE

We are seeking Test Engineers with a strong Automation First Mindset as we scale...
Location: United States, Santa Clara
Salary: 126000.00 - 204500.00 USD / Year
Company: Palo Alto Networks
Expiration Date: Until further notice
Requirements
  • Automation skills - Python, Playwright
  • Experience with building automation frameworks and leading the automation effort for the team
  • Working Knowledge of CI/CD pipelines
  • Experience with cloud technologies such as AWS/Azure/GCP
  • Knowledge of common security-related protocols and their design (e.g. SSH, IPsec, TCP/IP, DNS, TLS, SSL)
  • Demonstrated ability to learn quickly and to work in a fast paced, innovative environment learning new technologies and multi-tasking
  • 8+ years of experience
  • Bachelor’s or Master’s degree in Computer Science/Engineering/Networking, or equivalent military experience, required
Job Responsibility
  • Develop and execute sophisticated software tests and frameworks to validate Prisma SASE Functionality and Scale, working closely with Development, Product Management, SRE and Technical Marketing teams
  • Provide Thorough Technical Leadership in the areas of Cloud Based Orchestration, Cloud delivered Security, Cloud Networking and Automation Design
  • Participate in system design so that Quality Assurance is considered throughout the entire lifecycle of the Prisma Access Feature Development
  • Develop and/or Enhance Automated test Infrastructure to enable building Scalable & Flexible tests that reflect real world network deployment scenarios
  • Enhance Test strategies, Automation & Build infrastructure with feedback and analysis from real-world Customer deployments
  • Full-time

Senior Staff Engineer Software

As a Senior or Principal Software Engineer in Cortex Cloud, you will contribute ...
Location: Israel, Tel Aviv
Salary: Not provided
Company: Palo Alto Networks
Expiration Date: Until further notice
Requirements
  • 5+ years of experience building and maintaining production-grade distributed systems
  • Proficiency in Go (Golang) is a strong advantage
  • We are open to engineers with deep expertise in other backend languages (Java, Python, Rust, C#, or Node.js) who are willing to transition to a Go-primary stack and have a focus on clean, well-tested code
  • Strong grasp of system design, data structures, and algorithms in high-scale cloud environments
  • Experience with CI/CD, comprehensive testing (unit, integration, E2E), and rigorous code reviews
  • Proficiency in AWS, GCP, or Azure, including cloud-native services
  • Experience with observability (monitoring, logging, tracing) and system profiling
  • B.Sc. or M.Sc. in Computer Science, Software Engineering, or equivalent technical/military experience
Job Responsibility
  • Contribute to the development and scaling of cloud-native security solutions for enterprise organizations
  • Work within an established team to evolve a high-traffic product, with a focus on refining architecture, optimizing the technology stack, and maintaining engineering standards
  • Write reliable code, influence product direction, and design distributed systems
  • Make technical decisions that impact the long-term stability and performance of cloud workload protection services
  • Work with AI Tools: Utilize platforms such as Gemini, Claude, and Cursor for tasks beyond code generation, including root-cause analysis, system design reviews, and architectural assessment
  • Develop AI-Augmented Workflows: Help refine how AI is integrated into the SDLC, including the orchestration of agents and the development of internal tools that extend AI capabilities across our codebase
  • Maintain Quality Standards: Critically review all generated code and ensure that AI-assisted work aligns with our architectural requirements and security benchmarks
  • Coordinate with AI agents (Product, Architecture, Security) that operate on shared context to assist in managing complex engineering tasks
  • Full-time

Senior Software Engineer, AI Platform

GoodLeap is a technology company delivering best-in-class financing and software...
Location: United States, Austin; San Francisco; Irvine; Roseville
Salary: 173000.00 - 200000.00 USD / Year
Company: GoodLeap
Expiration Date: Until further notice
Requirements
  • 5+ years of experience building and shipping scalable, robust backend services and APIs
  • Strong proficiency in Python and/or TypeScript
  • Solid understanding of distributed systems, service-oriented architecture, and event-driven patterns (e.g. Kafka, RabbitMQ, SQS)
  • Passion for software development, emerging technologies and culture of innovation
  • A collaborative mindset and interest in mentoring teammates and elevating team practices
  • Excellent communication and interpersonal skills
Job Responsibility
  • Build features and extensions to our agentic AI platform using scalable, robust, and AI-first software engineering practices
  • Design tools and infrastructure to enable teams at GoodLeap to easily build and enhance AI agents that empower homeowners, contractors, and operations staff
  • Work alongside a team of AI engineers, product managers, and data scientists to evaluate and improve our agent ecosystem
  • Collaborate with Staff engineers, product, architecture, and design leads to deliver highly-available, fault-tolerant products and services
  • Work on significant and unique technical challenges, evaluate and recommend solutions, and guide decision making by considering technical tradeoffs
  • Grasp both the technical and business perspective so you can help drive innovation
  • Work autonomously and be self-disciplined, requiring minimal supervision or guidance
  • Collaborate with other team members and coach more junior team members to grow both their technical skills and soft skills
What we offer
  • May be eligible for a bonus and equity
  • Full-time

Senior Staff Software Engineer: Data & Storage Platform

Uber’s Data Platform is the heart of the company’s critical decision-making and ...
Location: United States, Seattle; San Francisco; Sunnyvale
Salary: 267000.00 - 297000.00 USD / Year
Company: Uber
Expiration Date: Until further notice
Requirements
  • 14+ Years of Engineering Excellence: Proven experience designing and operating world-class distributed data and storage systems
  • Mastery of Storage Internals: Extensive storage experience is a must, with deep expertise in the following areas
  • Batch & Object Storage: HDFS, Cloud Object Storage (S3/GCS/OCI), and Blobstore metadata management
  • Storage Optimization: Practical experience with Apache Hudi or Apache Iceberg for lakehouse architectures
  • Transactional Systems: Experience with distributed transactional storage (e.g., Docstore, Google Spanner, TiDB)
  • NoSQL & Cache: Cassandra, Redis, and high-throughput Key-Value stores
  • Data + AI Convergence: Deep understanding of how compute fabrics (Spark, Flink, Ray) integrate with vector databases and model-serving platforms
  • Query Engine Proficiency: Architect-level knowledge of Presto, Trino, or Hive for large-scale analytical processing
  • Systems Programming: Expert-level command of Java, Go, Scala, or C++ with a focus on performance tuning and distributed consensus
Job Responsibility
  • Architect the Multi-Modal Fabric: Unify batch, streaming, and AI compute into one intelligent fabric, enabling real-time insights and trustworthy AI agents at a global scale
  • Revolutionize Storage & Catalog: Drive the architecture for a unified catalog and metadata management service for unstructured data, leveraging native cloud object store capabilities
  • Operationalize AI Intelligence: Partner with teams like QueryCopilot and DataIQ to bridge human validation with autonomous reasoning through agentic workflows
  • Lead Storage Modernization: Evolve our massive-scale persistence layers—including Docstore (Transactional Distributed Storage) and Distributed MySQL—to increase resiliency and reduce operational overhead
  • Open Source: Act as a force multiplier by contributing to the community (Hudi, Iceberg, Presto)
What we offer
  • Eligible to participate in Uber's bonus program
  • May be offered an equity award and other types of compensation
  • All full-time employees are eligible to participate in a 401(k) plan
  • Eligible for various benefits
  • Full-time

Senior Staff Engineer, AI

We are seeking a visionary and hands-on Senior Staff AI Engineer to be the found...
Location: India
Salary: Not provided
Company: AlphaSense
Expiration Date: Until further notice
Requirements
  • 10+ years of professional software engineering experience, with a proven track record of building complex, data-intensive, backend systems
  • Deep expertise (5+ years) in building and scaling production-grade services using modern backend frameworks such as FastAPI, Django, Sanic, Spring Boot or similar
  • Significant, hands-on experience (3+ years) in the complete lifecycle of AI/ML models: from experimentation and prototyping to deploying, monitoring, and iterating on them in a high-volume cloud environment
  • Mastery in designing large-scale distributed systems, demonstrating strong knowledge of asynchronous patterns, streaming/queuing/caching strategies, and robust observability (logging, metrics, tracing)
  • Exceptional communication and leadership skills. You can articulate complex technical concepts to diverse audiences and have the ability to influence engineering direction across multiple teams without direct authority
Job Responsibility
  • Spearhead AI Innovation: As the chief technical authority on AI, you will research, evaluate, and prototype cutting-edge solutions using Large Language Models (LLMs), Computer Vision, and other machine learning techniques to solve our most complex data extraction challenges
  • Architect for Scale: Design and build robust, highly scalable, and cost-effective AI services and data processing pipelines. Your architecture will be the backbone for processing millions of documents daily with high reliability and throughput
  • Tackle Real-World AI Challenges: Go beyond theory to systematically solve the practical problems of production AI. This includes managing LLM latency and variance, developing sophisticated prompt engineering strategies, and building fault-tolerant, defensive systems that perform consistently
  • Be a Force Multiplier: Act as the key technical mentor and thought leader for our large engineering team and drive some mission-critical initiatives to production

Staff Software Engineer, AI Agent Platform

The Geico AI Agent Platform team is seeking an exceptional Staff Software Engine...
Location: United States, Chevy Chase; New York City
Salary: 115000.00 - 260000.00 USD / Year
Company: Geico
Expiration Date: Until further notice
Requirements
  • Bachelor’s degree in Computer Science, Engineering, Mathematics, or a related field; an advanced degree (master’s or Ph.D.) is highly desirable
  • 6+ years of hands-on experience in designing, implementing, and maintaining multi-tenant AI/ML systems and platforms in production environments
  • 6+ years of experience working with cloud platforms such as Azure and AWS
  • Extensive expertise in designing and deploying large-scale data pipelines and real-time inference systems and managing the end-to-end AI Agent and/or AI/ML system development lifecycles, including configuration, evaluation, monitoring, observability and AuthN/AuthR considerations
  • 6+ years of experience working with common backend systems & tools (e.g., Kubernetes, Temporal, OpenSearch, PostgreSQL, Redis, Neo4j)
  • Deep understanding of Docker, container optimization, and multi-stage builds
  • Experience with Prometheus, Grafana, OpenTelemetry and distributed tracing
  • 3+ years of experience building front-end web applications using frameworks such as React and/or Next.js
  • Deep proficiency in programming languages such as Python, Java, Go, etc., with a strong emphasis on coding excellence
Job Responsibility
  • Architect and implement scalable multi-tenant backend systems for building AI agent workflows (including agent configuration, offline evaluation, synthetic data generation, workflow simulation, and an agent marketplace) using Azure Kubernetes Service (AKS), FastAPI, and similar tools, ensuring economies of scale and controlling maintenance costs
  • Collaborate with Design team to architect and implement frontend experiences and workflows for onboarding both technical and non-technical stakeholders, maximizing user adoption and successful AI agent development
  • Develop observability frameworks to ensure 99.9%+ uptime for AI agent platforms through robust monitoring, alerting, and incident response procedures
  • Evaluate and (if desirable) integrate cutting-edge GenAI frameworks, libraries and vendors to maintain a state-of-the-art technology stack, including hybrid cloud solutions with AWS/GCP as backup or specialized use cases
  • Architect and implement scalable, high-performance machine learning platforms and systems capable of processing large data volumes and supporting real-time decision making and workflows
  • Oversee the end-to-end lifecycle of AI agent applications, ensuring robust testing, deployment, and ongoing monitoring
  • Ensure adherence to company production readiness standards, security protocols, and regulatory compliance throughout the development lifecycle
  • Continuously optimize platform performance, reducing latency and improving throughput for AI agent workloads
  • Design and implement backup, recovery, and business continuity plans for hosted platform applications & services
  • Design and maintain robust CI/CD pipelines for ML model deployment using Azure DevOps, GitHub Actions, and MLOps tools
What we offer
  • Comprehensive Total Rewards program that offers personalized coverage tailor-made for you and your family’s overall well-being
  • Financial benefits including market-competitive compensation; a 401K savings plan vested from day one that offers a 6% match; performance and recognition-based incentives; and tuition assistance
  • Access to additional benefits like mental healthcare as well as fertility and adoption assistance
  • Supports flexibility: we provide workplace flexibility as well as our GEICO Flex program, which offers the ability to work from anywhere in the US for up to four weeks per year
  • Full-time

Senior Staff Engineer, Software Engineering

We are seeking a highly accomplished Senior Staff Engineer to join our engineeri...
Location: United States, Chevy Chase, MD; Palo Alto, CA; Seattle, WA
Salary: 130000.00 - 260000.00 USD / Year
Company: Geico
Expiration Date: Until further notice
Requirements
  • Deep expertise in infrastructure systems, including compute platforms (Kubernetes, Docker, cloud services), networking, and storage
  • Strong database experience across relational databases (PostgreSQL, MySQL) and NoSQL solutions (MongoDB, Cassandra, Redis, DynamoDB)
  • Demonstrated experience applying AI to solve real-world problems in production environments
  • Expert-level proficiency in at least two programming languages (e.g., Python, Java, Go, Rust)
  • Experience designing and building distributed systems at scale
  • Strong understanding of cloud platforms (Azure OR AWS) and infrastructure-as-code practices
  • Hands-on experience with CI/CD pipelines, build systems, and deployment automation (e.g., GitHub Actions, Jenkins, Azure DevOps, ArgoCD)
  • Background in building real-time data processing systems (Kafka, Flink, Spark)
  • Excellent communication skills with the ability to articulate complex technical concepts to diverse audiences
  • Experience working in a platform engineering team, building internal developer platforms or shared infrastructure services
Job Responsibility
  • Define and drive the technical vision for infrastructure and AI-powered systems across the organization
  • Design, architect, and implement highly scalable, fault-tolerant distributed systems
  • Lead technical decision-making on critical projects, balancing short-term needs with long-term sustainability
  • Establish and champion engineering best practices, design patterns, and coding standards
  • Architect and optimize compute infrastructure for performance, reliability, and cost efficiency
  • Design and implement database solutions (relational and NoSQL) that scale to meet business demands
  • Drive cloud infrastructure strategy, including containerization, orchestration, and serverless architectures
  • Ensure system reliability, observability, and operational excellence across all platform components
  • Identify and prioritize opportunities to apply AI/ML to solve high-impact business problems
  • Stay current with emerging AI technologies and evaluate their applicability to business challenges
What we offer
  • Comprehensive Total Rewards program that offers personalized coverage tailor-made for you and your family’s overall well-being
  • Financial benefits including market-competitive compensation; a 401K savings plan vested from day one that offers a 6% match; performance and recognition-based incentives; and tuition assistance
  • Access to additional benefits like mental healthcare as well as fertility and adoption assistance
  • Supports flexibility: we provide workplace flexibility as well as our GEICO Flex program, which offers the ability to work from anywhere in the US for up to four weeks per year
  • Full-time