Compute Server Platform Architect

United States; Canada, Sunnyvale · Job Posted March 09, 2026

Job Description

As a Compute / Server Platform Architect on the Cluster Architecture Team, you will own the server-side platform architecture that enables Cerebras CS-3-based AI clusters (training and inference) to deliver predictable performance, scalability, and reliability. Our accelerators are network-attached, so the x86 server fleet is a first-class part of the end-to-end system: it runs critical-path runtime functions (for example, orchestration, prompt caching, and IO/control services) and must be co-designed with software for token-level latency, throughput, and cost efficiency. You will translate workload behavior into CPU, memory, IO, PCIe, and host-networking requirements, drive platform evaluations with vendors, and provide technical leadership through qualification and production adoption, in close partnership with other function leaders and TPMs.

Job Responsibility

  • Own the architecture for all server roles in Cerebras clusters, including definitions of server types, configurations, and lifecycle strategy
  • Define and maintain server formulas (counts and ratios per CS-3 count, cluster size, and workload type) including capacity planning and headroom policy
  • Specify platform configurations: CPU SKU and core strategy, vendor roadmap (e.g., AMD, Intel, ARM), memory topology (channels, DIMM type, capacity), PCIe topology and lane budgeting, NIC selection/placement, and local NVMe policy where applicable
  • Translate software and runtime flows into measurable hardware requirements (CPU utilization, memory bandwidth/latency, bursty IO patterns, queueing and concurrency limits) and communicate clear guardrails back to software teams
  • Develop performance and scaling models; validate with microbenchmarks and workload-level experiments; identify bottlenecks and drive cross-stack fixes
  • Define the OS, BIOS, firmware, and driver baseline for each server type; other teams follow these recommendations and apply them to our fleet
  • Stay current on emerging server technologies (CPU generations, new memory technologies, CXL, NVMe evolutions, SmartNIC/DPU capabilities where relevant) and run proof-of-concept evaluations to determine when to adopt
  • Lead technical vendor engagements (OEM/ODM and component vendors): influence roadmap, request platform knobs, and drive joint debugging on performance or reliability issues
  • Define qualification and acceptance criteria (performance, stability, operability) and partner with the Infrastructure Hardware TPM to execute qualification plans and land changes cleanly into production
  • Support bring-up and rare deployment debugging in lab and staging environments; drive root-cause analysis for regressions spanning firmware, drivers, OS, and runtime behavior

Requirements

  • PhD in Computer Science or Electrical/Computer Engineering and 8+ years of industry experience, or Master’s/Bachelor’s in CS or EE and 10+ years of industry experience
  • 5+ years of experience in server platform architecture, systems performance engineering, or large-scale infrastructure design for AI/ML, HPC, or performance-sensitive distributed systems
  • Deep understanding of x86 server architecture: CPU microarchitecture basics, cache hierarchies, NUMA, memory controllers/channels, and memory bandwidth vs latency tradeoffs
  • Strong Linux systems knowledge: profiling and performance analysis, scheduling and syscall overheads, memory management behavior, and practical tuning methodology
  • Experience reasoning about high-performance IO paths, including NIC behavior at a systems level, RDMA/RoCE concepts, and NVMe performance characteristics
  • Proven ability to create capacity and performance models and validate them empirically with a rigorous benchmarking plan
  • Experience working directly with vendors/partners to evaluate platforms, drive issue resolution, and influence roadmaps
  • Strong cross-functional communication skills and ability to drive technical decisions through clear tradeoff documents and reviews
  • Familiarity with application and system software (C, C++, Python)

What we offer

  • Build a breakthrough AI platform beyond the constraints of the GPU
  • Publish and open-source your cutting-edge AI research
  • Work on one of the fastest AI supercomputers in the world
  • Enjoy job stability with startup vitality
  • Work in a simple, non-corporate culture that respects individual beliefs

Similar Jobs for Compute Server Platform Architect

8 matching positions

Ai Platform Architect

As an AI Platform Architect I at Teradyne, you will be a hands-on builder and a ...
Location: United States, North Reading
Salary: 77500.00 - 124500.00 USD / Year
Company: Teradyne
Expiration Date: Until further notice
Requirements
  • 4–6 years of experience in AI/ML engineering, with hands-on expertise in enterprise AI platforms and agentic AI development
  • Bachelor’s degree in Computer Science, Information Systems, Engineering, or a related field
  • Proficiency in Python, SQL, agentic orchestration frameworks
  • Hands-on experience building & deploying end-to-end AI solutions with one or more of the following AI platforms: Azure Foundry, Microsoft Copilot Studio, Vertex AI, and Snowflake Cortex AI
  • Experience developing and implementing multi-environment MLOps & LLMOps pipelines
  • Strong knowledge of AI security, observability, and governance frameworks
  • Proven ability to mentor and upskill teams, fostering a culture of innovation and learning
  • Strong collaboration and communication skills, with the ability to work across technical and business teams
  • Analytical mindset with a focus on delivering measurable business outcomes.
Job Responsibility
  • Construct, configure, and maintain environments and tools for creating and deploying AI solutions
  • Set up secure, role-based environments in Microsoft Copilot Studio, Azure AI Foundry, and Google Vertex AI
  • Build AI agents for real-world use cases
  • Develop and maintain RAG workflows, MCP Servers
  • Use Microsoft 365 services like Power Automate to streamline business processes
  • Create, configure, and manage development and deployment environments for core AI platforms
  • Implement and maintain Role-Based Access Controls (RBAC)
  • Manage platform settings, integrations, and resource allocations
  • Engineer reusable MLOps/LLMOps pipelines using Azure DevOps or GitHub Actions
  • Mentor and upskill teams in agentic AI design, MLOps & LLMOps practices, and solution architecture
What we offer
  • Medical, dental, vision
  • Flexible Spending Accounts
  • Retirement savings plans
  • Life and disability insurance
  • Paid vacation & holidays
  • Tuition assistance programs
  • Fulltime

Senior AI Platform Architect

As an AI Platform Architect II at Teradyne, you will be a pivotal leader at the ...
Location: United States, North Reading
Salary: 139900.00 - 223900.00 USD / Year
Company: Teradyne
Expiration Date: Until further notice
Requirements
  • 6-8 years in Information Technology including Enterprise Architecture and AI/ML/GenAI Engineering. Proven experience in designing and implementing large-scale, enterprise-wide AI solutions.
  • Bachelor's Degree in Computer Science, Information Systems, Engineering, or a related field required
  • Deep understanding of AI and machine learning concepts, including generative AI, large language models (LLMs), and MLOps.
  • Expertise in designing and building solutions on cloud platforms, with a strong preference for Microsoft Azure and a working knowledge of Google Cloud Platform.
  • Hands-on experience with and a deep architectural understanding of our core AI and data platforms: Microsoft 365 Copilot, Microsoft Copilot Studio, Azure AI (including the future Azure AI Foundry), Google Vertex AI, and Snowflake Cortex AI.
  • Strong knowledge of data architecture, data governance, and data security best practices.
  • Familiarity with Infrastructure as Code (IaC) and CI/CD principles.
  • Exceptional leadership and communication skills, with the ability to articulate a clear vision and influence stakeholders at all levels of the organization.
  • Analytical mindset with a focus on delivering measurable business outcomes.
  • Strong strategic thinking and business acumen, with the ability to connect technology solutions to business value.
Job Responsibility
  • Design and evangelize a cohesive, enterprise-wide AI and Data architecture.
  • Create the blueprint for a scalable and secure ecosystem that leverages investments in Microsoft 365 & Copilot, Azure Foundry, Google Vertex AI, and Snowflake Cortex AI.
  • Guide the selection of AI tools, define the data integration strategy, and establish reusable patterns for agentic development.
  • Be a key change agent, responsible for embedding AI into the fabric of the business
  • Bridge the gap between strategy and execution by fostering an AI-ready culture, building institutional literacy.
  • Architect and operationalize the enterprise AI strategy by developing a comprehensive architectural blueprint and component-level roadmap.
  • Lead the design of a unified AI ecosystem that integrates our core platforms.
  • Establish and champion architectural standards, reusable MLOps & LLMOps patterns, and best practices.
  • Architect and lead the strategy for a centralized Model Context Protocol (MCP) Server.
  • Evaluate emerging AI technologies, frameworks, and methodologies, making strategic recommendations.
What we offer
  • Medical
  • Dental
  • Vision
  • Flexible Spending Accounts
  • Retirement savings plans
  • Life and disability insurance
  • Paid vacation & holidays
  • Tuition assistance programs
  • Fulltime

ServiceNow Platform Architect

Aptiv is advancing its global digital transformation through a strategic partner...
Location: Ireland, Dublin
Salary: Not provided
Company: Aptiv plc
Expiration Date: Until further notice
Requirements
  • Bachelor's or Master's degree in Computer Science, Information Technology, Business Administration, or a related field
  • 8+ years hands-on experience with ServiceNow
  • Certifications: CSA, CAD required
  • CTA / CIS preferred
  • Strong experience with ITSM, ITOM, CMDB, Discovery, HAM, SAM, and CSM
  • Deep understanding of ServiceNow architecture patterns and performance optimization
  • Expertise in enterprise integrations and data synchronization
  • Proven ability to lead multi-instance strategies and complex migrations (clones, splits, consolidations)
  • Experience in GenAI / automation (Now Assist, Virtual Agent, NLU, AI Search preferred)
  • Excellent stakeholder management and communication skills
Job Responsibility
  • Define and own the ServiceNow platform architecture and roadmap, aligned with enterprise strategy
  • Establish and enforce platform governance, standards, and best practices
  • Create and maintain technical design documentation, architecture diagrams, and platform standards
  • Lead design and implementation across ServiceNow modules: ITSM, ITOM, CSM, HAM, SAM, CMDB, Discovery, SPM, and GenAI/Now Assist
  • Design scalable solutions using Flow Designer, Workflow Data Fabric, APIs, MID servers, and custom applications
  • Oversee upgrades, patches, cloning, and environment strategy (Dev, Test, Sandbox, Prod)
  • Drive integration strategy with enterprise systems (Entra ID, Azure, AWS, Zabbix, Tanium, Cisco ISE, Salesforce, Microsoft 365)
  • Ensure CMDB health, data quality, and reconciliation strategies
  • Manage and optimize license usage and platform costs
  • Partner with security and infrastructure teams to ensure compliance and risk mitigation
What we offer
  • Personal holidays
  • Healthcare
  • Pension
  • Tax saver scheme
  • Free Onsite Breakfast & Lunch
  • Discounted Corporate Gym Membership
  • Multicultural environment
  • Learning, professional growth and development in a world-recognized international environment
  • Access to internal & external training, coaching & certifications
  • Recognition for innovation and excellence
  • Fulltime

Platform Architect

As a Platform Architect, you will lead the definition and realization of our AI ...
Location: United States, San Jose
Salary: 150000.00 - 275000.00 USD / Year
Company: Etched
Expiration Date: Until further notice
Requirements
  • 8+ years of experience in system or server hardware architecture, ideally in HPC, AI infrastructure, or hyperscale data centers
  • Deep understanding of PCIe protocols and topologies, including bifurcation, retimer tuning, switch fabrics, and accelerator communication
  • Experience with rack-level and multi-rack system design, including shared power and networking infrastructure
  • Strong expertise in BMC systems, control buses, telemetry integration, and orchestration tooling
  • Familiarity with modern high-speed networking technologies: 400G Ethernet, InfiniBand, CXL fabrics, and NIC-switch integration
  • Proven background in power architecture for dense compute systems, including power budgeting, sequencing logic, and VRM optimization
  • Rack-level management infrastructure design experience, including CDU layout, telemetry aggregation, and rack controller implementation
  • Proven track record of building infrastructure for at-scale deployment, such as automated diagnostics, health monitoring, and fleet orchestration frameworks
  • Understanding of thermal design principles such as airflow, heatsink selection, and liquid cooling systems
  • A systems-level perspective with the ability to design scalable, maintainable, and high-performance platforms
Job Responsibility
  • Architect the end-to-end hardware system stack, including server-level components, rack-scale systems, and multi-rack POD designs optimized for AI and high-performance workloads
  • Design and implement advanced PCIe Gen5/Gen6 topologies: root complex architecture, retimer placement, switch hierarchy, and accelerator fan-out strategies
  • Define scalable BMC architecture and platform management features across fleet deployments, including telemetry pipelines, orchestration hooks, and API integrations (e.g., Redfish, IPMI)
  • Specify and lead the implementation of chip-to-chip interconnects such as NVLink, UCIe, and other emerging high-bandwidth, low-latency fabrics
  • Develop integration strategies for power distribution, control planes, cooling systems (air and liquid), and shared interconnect fabrics at the rack level
  • Own the networking architecture across servers and racks, including 400G/800G Ethernet, leaf-spine switching, NIC-to-ToR planning, and cross-rack topology
  • Specify power delivery systems for high-density, multi-kilowatt platforms: VRM selection, power trees, sequencing, and protection logic
  • Guide system design decisions with awareness of mechanical and thermal constraints to ensure performance, manufacturability, and serviceability
  • Contribute to rack-level management infrastructure: CDU planning, telemetry aggregation, rack controller architecture, and out-of-band control
  • Support bring-up and validation teams in debugging complex issues at the system, rack, and POD levels
What we offer
  • Medical, dental, and vision packages with generous premium coverage
  • $500 per month credit for waiving medical benefits
  • Housing subsidy of $2k per month for those living within walking distance of the office
  • Relocation support for those moving to San Jose (Santana Row)
  • Various wellness benefits covering fitness, mental health, and more
  • Daily lunch + dinner in our office
  • Fulltime

Power Platform Architect

Valorem Reply, part of the Reply Network, is a leader in Microsoft-based IT solu...
Location: United States, Chicago, Illinois; Seattle, Washington; Detroit Area, Michigan
Salary: 140000.00 - 170000.00 USD / Year
Company: Valorem Reply
Expiration Date: Until further notice
Requirements
  • Bachelor’s degree in computer science, engineering, or related field
  • 5 years of experience in developing applications using the Microsoft Power Platform and Azure
  • 2 years of experience developing AI solutions
  • Strong knowledge of Power Apps, Power Automate, Power BI, Power Pages, and Copilot Studio, and their capabilities including proficiency in using various data sources and connectors, such as Dataverse, SQL Server, SharePoint, Excel, etc
  • Experience in integrating Power Platform applications with other Microsoft and third-party services, such as custom solutions, Azure, Dynamics 365, Office 365, etc
Job Responsibility
  • Architect and develop traditional applications using the Microsoft Power Platform, leveraging its capabilities and features to meet business requirements and user needs using various app types, such as canvas, model-driven, Copilot Studio and portal apps
  • Architect and develop AI / agentic solutions using Copilot Studio and Foundry and use AI tools to assist in the end-to-end project lifecycle
  • Provide expertise in data storage and access using Dataverse, SQL Server, Azure Blob Storage, and Azure AI Search. Work closely with data engineers to provide data requirements for AI and traditional solutions
  • Deploy and manage Power Platform applications using DevOps tools and processes and ensure compliance with security and governance policies and apply deployment strategies, such as packaging, importing, exporting, and versioning
  • Integrate Power Platform applications with other Microsoft and third-party services, such as custom applications, Azure services, SharePoint, Dynamics 365, SQL Server, etc. using various connectors, such as standard, custom, and premium connectors, and apply integration patterns, such as orchestration, mediation, and transformation
  • Perform unit testing, debugging, and troubleshooting of Power Platform applications, and provide technical support and maintenance
  • Serve as technical lead on projects, directing other Power Platform developers, providing code reviews and feedback, and ensuring proper coding standards (such as naming conventions, code formatting, and code commenting) while building a strong, healthy team culture
  • Provide technical sales guidance, estimations, and grow opportunities by establishing deep relationships with prospective and existing clients
  • Fulltime

Datacenter Server Solution Architect

AMD China is seeking a Server Solution Architect position supporting AMD GPU for...
Location: China, Shenzhen or Beijing
Salary: Not provided
Company: AMD
Expiration Date: Until further notice
Requirements
  • A good relationship with customers is a MUST
  • 5+ years of GPU technical support experience is a MUST
  • Proficiency in both spoken and written English and Mandarin is a MUST
  • Occasional domestic travel is required
  • Understand AMD or NVIDIA GPU architecture
  • Familiar with AI frameworks and models (e.g., vLLM, TensorRT, Llama, DeepSeek)
  • Familiar with AMD or NVIDIA GPU performance benchmarks (e.g., TransferBench, rccl-tests, MLPerf)
  • Familiar with AMD or NVIDIA GPU performance profiling tools (e.g., OmniTrace, Nsight Systems, OmniPerf, Nsight Compute, rocprof, NVIDIA Visual Profiler)
  • GPU programming experience
  • BS or MS in Computer Science, Computer Engineering or Electrical Engineering
Job Responsibility
  • Collaborate with the AMD BDM and Platform team to seek opportunities and provide the technical support and training to customers

Principal Firmware Architect - Hyperscale & AI Rack-Based Compute Systems

The Principal Firmware Architect will be responsible for architecting server and...
Location: United States, Georgetown
Salary: Not provided
Company: Sanmina
Expiration Date: Until further notice
Requirements
  • Proficiency in one or more of the following: AMI BMC FW, OpenBMC FW, HP iLO, Dell iDRAC, UEFI FW (BIOS)
  • Experience with DMTF standards such as MCTP, NC‑SI, PLDM, OVF, Redfish, SPDM
  • Knowledge of security protocols, Root of Trust, and secure design principles
  • Experience with operating systems and driver design/usage
  • Strong background in Intel/AMD/ARM/GPU platform architectures
  • Strong understanding of Baseboard Management Controller (BMC) functionality, telemetry, and controls
  • Working knowledge of server operating systems including Windows Server (2016, 2019, 2022) and Linux (CentOS, Ubuntu, Fedora, SUSE)
  • Knowledge of virtualization technologies (VMware, Citrix, Microsoft)
  • Understanding of software driver implementation, IP schemas, and network protocols
  • Demonstrated ability to learn and apply new technologies
Job Responsibility
  • Develop long‑term hyperscale server firmware and security technology strategies based on customer needs
  • Develop, test, debug, and optimize firmware for ZT hyperscale compute/storage products and proof of concepts
  • Drive adoption of firmware development strategies internally and externally
  • Collaborate directly with customers on new firmware architectures for compute servers, storage servers, and add‑on cards
  • Solve performance and operational challenges to deliver business value through ZT firmware
  • Contribute firmware and security content to System Architecture Specifications for ZT server products
  • Build long‑term technical relationships within the firmware technology ecosystem to influence next‑generation server design
  • Align with customers and partners on security requirements and guide ZT engineering teams accordingly
  • Participate in in‑depth security reviews and drive compliance with industry standards
  • Engage in industry forums, workgroups, and consortiums related to firmware and security initiatives
What we offer
  • Competitive base salary
  • Performance-based annual bonus eligibility
  • 401(k) retirement savings plan
  • Tuition reimbursement for eligible education programs
  • Comprehensive medical, dental, and vision coverage with access to leading providers
  • Mental health resources and employee wellness support programs
  • Company-paid life and disability insurance
  • Paid time off (PTO) and company-paid holidays
  • Parental leave and family care support programs
  • Structured training programs and on-the-job learning opportunities

Expert Cloud Platform Engineer

We are currently seeking an Expert Cloud Platform Engineer (FTE/Hybrid) to join o...
Location: United States, Charlotte
Salary: Not provided
Company: NTT DATA
Expiration Date: Until further notice
Requirements
  • 10 years hands-on experience designing and administering VMware vSphere (ESXi and vCenter) at an enterprise scale
  • Proven ability to deploy and manage software-defined networking and security using VMware NSX
  • Strong operational knowledge of the VMware Aria (formerly vRealize) Suite, particularly Aria Automation and Aria Operations
  • Solid understanding of VMware Cloud Foundation (VCF) architecture and lifecycle management (SDDC Manager)
  • Proficiency in basic system administration, configuration, and troubleshooting for both Red Hat Enterprise Linux (RHEL) and Windows Server environments
  • Solid foundation in TCP/IP networking protocols and enterprise routing/switching principles
  • Hands-on experience managing and integrating core infrastructure services, specifically DNS, DHCP, and IPAM
  • Proficiency in writing and maintaining automation playbooks using Ansible
  • Strong scripting skills in Python for building custom API endpoints, interacting with VMware REST APIs, and automating complex infrastructure tasks
  • 8 years experience with Terraform for infrastructure provisioning and state management
Job Responsibility
  • Architect, deploy, and manage private cloud environments utilizing VMware Cloud Foundation (VCF) 9, ensuring optimal resource allocation and scalability
  • Design and implement automated workflows for VM lifecycle management, day-two operations, and event-driven triggers
  • Develop API services to integrate virtualization platforms with internal catalogs and deployment pipelines
  • Oversee the foundational OS and network layer supporting the virtualized environment, ensuring seamless integration of core IP services and reliable guest OS performance
  • Monitor enterprise infrastructure to ensure maximum uptime for mission-critical internal banking applications
  • Proactively tune CPU, memory, and storage configurations for performance and cost-efficiency
  • Implement and enforce strict security policies, micro-segmentation, and role-based access controls (RBAC) to adhere to US banking regulations and internal audit standards
  • Drive the evolution of platform engineering practice by incorporating infrastructure-as-code (IaC) principles
  • Provide technical guidance and escalation support for junior administrators and operational teams
  • Fulltime