
Research Intern - AI Frameworks (Network Systems and Tools)


Microsoft Corporation

Location: Redmond, United States

Contract Type: Not provided

Salary: 6710.00 - 13270.00 USD / Month

Job Description:

Research Internships at Microsoft provide a dynamic environment for research careers with a network of world-class research labs led by globally recognized scientists and engineers, who pursue innovation in a range of scientific and technical disciplines to help solve complex challenges in diverse fields, including computing, healthcare, economics, and the environment. Advances in Artificial Intelligence (AI) increasingly depend on breakthroughs in systems and architecture, where hardware, models, and software must be co-designed to scale efficiently. This Research Internship offers the opportunity to explore next-generation AI systems through performance modeling, architectural analysis, and emerging inference mechanisms. Research Interns will investigate topics such as disaggregated inference, memory architecture, and interconnect technologies, with a specific focus on request scheduling and key-value (KV) caching optimizations. This role is ideal for students passionate about understanding AI systems end-to-end and shaping the architectural foundations of tomorrow’s intelligent datacenters.

Job Responsibility:

  • Investigate and evaluate emerging disaggregated KV cache architectures
  • Implement a hierarchical storage architecture with multiple tiers:
      • GPU memory: active working set of KV caches currently used by the model
      • CPU DRAM: hot cache for recently used KV chunks, using pinned memory for efficient GPU-CPU transfers
      • Local storage: large-scale local caching (NVMe, local disk)
  • Build a peer-to-peer (P2P) KV cache sharing service architecture that enables direct, high-performance cache transfer between multiple LLM serving instances without requiring centralized cache servers
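The tiered design described in the bullets above can be sketched in ordinary Python. Everything here is an illustrative assumption (the tier names, entry-count capacities, and LRU eviction policy are not from the posting); a real serving stack would track bytes rather than entries, use pinned host memory for GPU-CPU transfers, and back the last tier with NVMe:

```python
from collections import OrderedDict

class TieredKVCache:
    """Toy model of a hierarchical KV cache: GPU memory -> CPU DRAM -> local disk.

    Each tier is modeled as an LRU map with an entry-count capacity.
    Evictions cascade downward; hits are promoted back to the GPU tier.
    """

    def __init__(self, gpu_capacity=2, dram_capacity=4, disk_capacity=16):
        # Ordered fastest to slowest.
        self.tiers = [
            ("gpu", OrderedDict(), gpu_capacity),
            ("dram", OrderedDict(), dram_capacity),
            ("disk", OrderedDict(), disk_capacity),
        ]

    def put(self, key, kv_chunk):
        self._insert(0, key, kv_chunk)

    def get(self, key):
        """Return (tier_name, chunk) for a hit, or None for a full miss."""
        for name, store, _ in self.tiers:
            if key in store:
                chunk = store.pop(key)
                self._insert(0, key, chunk)  # promote hit to the fastest tier
                return name, chunk
        return None  # missed every tier: chunk must be recomputed

    def _insert(self, level, key, chunk):
        if level >= len(self.tiers):
            return  # evicted past the last tier
        _, store, cap = self.tiers[level]
        store[key] = chunk
        store.move_to_end(key)
        if len(store) > cap:
            victim, vchunk = store.popitem(last=False)  # evict LRU entry
            self._insert(level + 1, victim, vchunk)     # demote to next tier
```

Inserting a chunk into a full GPU tier demotes the least recently used entry to DRAM, and a later lookup promotes it back; a miss in every tier means the KV chunk has to be recomputed from the prompt.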

Requirements:

  • Currently enrolled in a PhD program in Computer Science, Electrical/Computer Engineering, or a related field
  • Research experience in areas such as computer architecture, AI/ML systems, performance modeling, distributed systems, or hardware–software co-design
  • Programming skills in Python, C/C++ with experience building prototypes, simulators, or performance analysis tools
  • Familiarity with modern AI workloads and/or deep learning frameworks (e.g., PyTorch)
  • Demonstrated ability to define and pursue original research directions in AI systems or architecture
  • Ability to collaborate effectively with researchers across disciplines and work in cross-group, cross-cultural environments
  • Strong communication and presentation skills for sharing complex technical insights
  • Ability to think creatively and approach system and architecture challenges with unconventional or innovative solutions
  • Experience with PyTorch, CUDA, Triton, or performance-simulation tools
  • Background in large-scale system design, AI inference bottleneck analysis, or modeling cost/performance tradeoffs
  • Understanding of accelerator, memory-system, or interconnect design principles

Additional Information:

Job Posted:
April 20, 2026

Employment Type: Full-time
Work Type: On-site work

Similar Jobs for Research Intern - AI Frameworks (Network Systems and Tools)

Research Intern - AI System Architecture Modeling and Performance

Research Internships at Microsoft provide a dynamic environment for research car...
Location: Hillsboro, United States
Salary: 6710.00 - 13270.00 USD / Month
Microsoft Corporation
Expiration Date: Until further notice
Requirements:
  • Accepted or currently enrolled in a PhD program in Computer Science or related STEM field
  • At least 1 year of experience with performance analysis tools and methodologies, optimization and modeling
  • Proficiency with frameworks such as PyTorch, SGLang, Dynamo, and AI accelerator programming models/compilers such as CUDA and Triton
  • Deep understanding of GPU and AI architectures including memory hierarchies, compute-communication interplay, kernel scheduling and interconnect properties
  • Familiarity with CPU/server architectures including understanding of PCIe topologies and accelerator/NIC/peripheral demand. Solid understanding of CPU involvement in dispatching, scheduling and orchestration of input data pipelines to AI accelerators
  • Hands-on experience with benchmarking, profiling, identifying perf bottlenecks and performance analysis and optimization, including trace generation, event monitoring and instrumentation
  • Familiarity with roofline performance modeling, detailed performance simulations and awareness of speed vs accuracy tradeoffs in various performance modeling methodologies
  • Ability to apply the appropriate performance analysis methodology including devising new or combinatorial approaches in evaluating complex system architecture what-if scenarios
  • Solid verbal and written communication skills
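The roofline performance modeling mentioned in the requirements above reduces to one formula: attainable throughput is the lesser of peak compute and arithmetic intensity times memory bandwidth. A minimal sketch, assuming illustrative hardware numbers (100 TFLOP/s peak, 2 TB/s bandwidth) rather than any specific accelerator:

```python
def roofline_flops(flops, bytes_moved, peak_flops, peak_bandwidth):
    """Attainable FLOP/s for a kernel under the roofline model.

    Arithmetic intensity (FLOP per byte of memory traffic) determines
    whether the kernel is memory-bound or compute-bound.
    """
    intensity = flops / bytes_moved
    return min(peak_flops, intensity * peak_bandwidth)

# Illustrative hardware: 100 TFLOP/s peak compute, 2 TB/s memory bandwidth.
PEAK = 100e12
BW = 2e12

# A GEMV-like kernel doing 2 FLOPs per 4-byte element read has an
# arithmetic intensity of 0.5 FLOP/byte and is therefore memory-bound.
attainable = roofline_flops(flops=2e9, bytes_moved=4e9,
                            peak_flops=PEAK, peak_bandwidth=BW)
```

With these numbers the ridge point is 50 FLOP/byte: below it a kernel is capped by memory bandwidth (0.5 FLOP/byte yields 1 TFLOP/s), above it by the 100 TFLOP/s compute roof.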
Job Responsibility:
  • Research Interns put inquiry and theory into practice. Alongside fellow doctoral candidates and some of the world’s best researchers, Research Interns learn, collaborate, and network for life
  • Research Interns not only advance their own careers, but they also contribute to exciting research and development strides
  • During the 12-week internship, Research Interns are paired with mentors and expected to collaborate with other Research Interns and researchers, present findings, and contribute to the vibrant life of the community
  • As a Research Intern, you will be at the forefront of hardware/software co-design and have a direct impact in answering critical questions around designing an optimized AI system and evaluating real-world impact on Azure’s supporting hyperscale infrastructure
  • This role will evaluate opportunities to co-optimize central processing unit (CPU), graphics processing unit (GPU) and networking infrastructure for the Maia accelerator ecosystem
  • You will be expected to identify system stress points, propose novel architectural ideas, and create methodologies using a combination of workload characterization, modeling and benchmarking to evaluate their effectiveness
Employment Type: Full-time

Research Intern - AI Systems and Tools

Research Internships at Microsoft provide a dynamic environment for research car...
Location: Redmond, United States
Salary: 6710.00 - 13270.00 USD / Month
Microsoft Corporation
Expiration Date: Until further notice
Requirements:
  • Currently enrolled in a PhD program in Computer Science, Computer Engineering, Electrical Engineering, or a related STEM field
  • At least 1 year of experience in conducting research, writing peer-reviewed publications and software development
  • At least 1 year of experience with software development in C++
  • Research Interns are expected to be physically located in their manager’s Microsoft worksite location for the duration of their internship
  • Submit a minimum of two reference letters for this position as well as a cover letter and any relevant work or research samples
Job Responsibility:
  • Work on multiple levels of the AI system and infrastructure stack, with an emphasis on developer tools for Microsoft's custom Maia AI hardware
  • Work on device firmware, software running on the host, as well as higher level analysis and integration with AI/ML frameworks such as PyTorch
  • Work with partner teams to build new tools that help developers author highly efficient kernels for state-of-the-art models to execute on AI accelerators
  • Learn, collaborate, and network for life
  • Collaborate with other Research Interns and researchers, present findings, and contribute to the vibrant life of the community
Employment Type: Full-time

Field AI Engineer

At JFrog, we’re reinventing DevOps to help the world’s greatest companies innova...
Location: Sunnyvale, United States
Salary: Not provided
JFrog
Expiration Date: Until further notice
Requirements:
  • 4+ years in a technical hands-on role: engineering, ML, solutions engineering, forward-deployed engineering, or similar
  • Hands-on experience with modern AI/ML tooling: LLMs, RAG pipelines, agentic frameworks (LangChain, LlamaIndex, AutoGen, or equivalent)
  • Strong Python skills; comfort with APIs, cloud infrastructure, and developer tooling
  • Demonstrated presence in the AI community — meetups, open-source, writing, speaking, any of it
  • Based in San Francisco or willing to be deeply embedded here
  • Technically deep: you've built real things with AI. You can evaluate a new tool in an afternoon, spot architectural tradeoffs immediately, and hold your own in any technical conversation
  • Genuinely embedded: you're already showing up at SF AI events, or you will be within 30 days of starting. You find energy in these rooms, not obligation
  • A strong synthesizer: you can take a noisy week of conversations and distill it into three things that actually matter
  • Enterprise-aware: you understand what it means to deploy AI inside a large organization: procurement, security, integration complexity, change management, organizational politics
Job Responsibility:
  • Attend AI meetups, research demos, hackathons, and conferences across the SF Bay Area as a consistent, trusted presence
  • Build a structured view of the AI vendor and startup landscape — what's emerging, what's enterprise-ready, what's overhyped
  • Run a regular intelligence loop: weekly signal briefs, monthly deep-dives for leadership, and ad hoc alerts when something important surfaces
  • Identify and track tools, frameworks, and architectural patterns gaining real adoption before they become mainstream
  • Build authentic relationships with AI researchers, startup founders, enterprise architects, and practitioners — as a peer, not a vendor
  • Become a known and trusted voice in the SF AI scene: someone people want to loop in, not avoid
  • Maintain a living network map of who's building what, who's thinking about what problems, and where the interesting work is happening
  • Get hands-on with new tools, APIs, and agentic frameworks — prototype and evaluate them firsthand before forming a view
  • Engage credibly in deep technical conversations about LLMs, RAG architectures, agentic systems, fine-tuning, prompt engineering, and enterprise AI infrastructure
What we offer:
  • equity package of restricted stock units (RSU)
  • eligibility to participate in our Employee Stock Purchase Plan
  • comprehensive benefits including medical, dental, vision, retirement, wellness and much more
Employment Type: Full-time

Staff Software Engineer, GPU Infrastructure (HPC)

The internal infrastructure team is responsible for building world-class infrast...
Location: Not provided
Salary: Not provided
Cohere
Expiration Date: Until further notice
Requirements:
  • Deep expertise in ML/HPC infrastructure: Experience with GPU/TPU clusters, distributed training frameworks (JAX, PyTorch, TensorFlow), and high-performance computing (HPC) environments
  • Kubernetes at scale: Proven ability to deploy, manage, and troubleshoot cloud-native Kubernetes clusters for AI workloads
  • Strong programming skills: Proficiency in Python (for ML tooling) and Go (for systems engineering), with a preference for open-source contributions over reinventing solutions
  • Low-level systems knowledge: Familiarity with Linux internals, RDMA networking, and performance optimization for ML workloads
  • Research collaboration experience: A track record of working closely with AI researchers or ML engineers to solve infrastructure challenges
  • Self-directed problem-solving: The ability to identify bottlenecks, propose solutions, and drive impact in a fast-paced environment
Job Responsibility:
  • Build and scale ML-optimized HPC infrastructure: Deploy and manage Kubernetes-based GPU/TPU superclusters across multiple clouds, ensuring high throughput and low-latency performance for AI workloads
  • Optimize for AI/ML training: Collaborate with cloud providers to fine-tune infrastructure for cost efficiency, reliability, and performance, leveraging technologies like RDMA, NCCL, and high-speed interconnects
  • Troubleshoot and resolve complex issues: Proactively identify and resolve infrastructure bottlenecks, performance degradation, and system failures to ensure minimal disruption to AI/ML workflows
  • Enable researchers with self-service tools: Design intuitive interfaces and workflows that allow researchers to monitor, debug, and optimize their training jobs independently
  • Drive innovation in ML infrastructure: Work closely with AI researchers to understand emerging needs (e.g., JAX, PyTorch, distributed training) and translate them into robust, scalable infrastructure solutions
  • Champion best practices: Advocate for observability, automation, and infrastructure-as-code (IaC) across the organization, ensuring systems are maintainable and resilient
  • Mentorship and collaboration: Share expertise through code reviews, documentation, and cross-team collaboration, fostering a culture of knowledge transfer and engineering excellence
What we offer:
  • An open and inclusive culture and work environment
  • Work closely with a team on the cutting edge of AI research
  • Weekly lunch stipend, in-office lunches & snacks
  • Full health and dental benefits, including a separate budget to take care of your mental health
  • 100% Parental Leave top-up for up to 6 months
  • Personal enrichment benefits towards arts and culture, fitness and well-being, quality time, and workspace improvement
  • Remote-flexible, offices in Toronto, New York, San Francisco, London and Paris, as well as a co-working stipend
  • 6 weeks of vacation (30 working days!)
Employment Type: Full-time

Staff II Software Engineer AI/ML Ops

We're looking for a Lead Data Engineer to design, build, and optimize data pipel...
Location: Pleasanton, United States
Salary: 245000.00 - 307000.00 USD / Year
BlackLine
Expiration Date: Until further notice
Requirements:
  • Strong programming skills in languages such as Python, Java, or Scala
  • Expertise in ML frameworks (TensorFlow, PyTorch, scikit-learn) and orchestration tools (Airflow, Kubeflow, Vertex AI, MLflow)
  • Proven experience operating production pipelines for ML and LLM-based systems across cloud ecosystems (GCP, AWS, Azure)
  • Deep familiarity with LangChain, LangGraph, ADK or similar agentic system runtime management
  • Strong competencies in CI/CD, IaC, and DevSecOps pipelines integrating testing, compliance, and deployment automation
  • Hands-on experience with observability stacks (Prometheus, Grafana, New Relic) for model and agent performance tracking
  • Understanding of governance frameworks for Responsible AI, auditability, and cost metering across training and inference workloads
  • Proficiency in containerization technologies (e.g., Docker, Kubernetes)
  • Proficient in scripting languages (e.g., Bash, Python) for automation
  • Experience with workflow orchestration tools (e.g., Apache Airflow)
Job Responsibility:
  • Lead data pipeline development: Build and maintain PySpark ETL pipelines with high data quality and performance
  • Manage integrations: Establish robust connections to client data sources via APIs and tools like FiveTran, Plaid, and BlackLine's own internal connector ecosystem
  • Ensure reliability: Monitor pipeline performance, automate testing, and validate data accuracy
  • Optimize for scale: Implement performance improvements (e.g., CDC mechanisms, indexing strategies) for large-scale datasets
  • Collaborate & innovate: Work with business stakeholders to refine data requirements and integrate cutting-edge AI and big data technologies
  • Partner with data science, security, and product teams to set evaluation and governance standards (Guardrails, Bias, Drift, Latency SLAs)
  • Mentor senior engineers and drive design reviews for ML pipelines, model registries, and agentic runtime environments
  • Lead incident response and reliability strategies for ML/AI systems
  • Collaborate with development teams to integrate AI solutions into existing workflows and applications
  • Ensure seamless integration with different platforms and technologies
What we offer:
  • Short-term and long-term incentive programs
  • Robust offering of benefit and wellness plans
Employment Type: Full-time

Senior Software Engineer - Together Cloud Infrastructure

Together AI is building the AI Acceleration Cloud, an end-to-end platform for th...
Location: San Francisco, United States
Salary: 160000.00 - 230000.00 USD / Year
Together AI
Expiration Date: Until further notice
Requirements:
  • 5+ years of professional software development experience and proficiency in at least one backend programming language (Golang desired)
  • 5+ years experience writing high-performance, well-tested, production quality code
  • Demonstrated experience with building and operating high-performance and/or globally distributed micro-service architectures across one or more cloud providers (AWS, Azure, GCP)
  • Excellent communication skills – able to write clear design docs and work effectively with both technical and non-technical team members
  • Deep experience with Kubernetes internals a big plus, such as implementing non-trivial Kubernetes operators, device/storage/network plugins, custom schedulers, or patches to any of these or to Kubernetes itself
  • Deep experience with VMs/hypervisors a big plus, such as QEMU/KVM, cloud-hypervisor, VFIO, virtio, PCIE passthrough, Kubevirt, SR-IOV
  • Deep experience with DC networking tech + solutions a big plus, such as VLAN, VXLAN, VPN, VPC, OVS/OVN
  • Experience with Cluster API or similar a big plus
  • Experience working on high-performance compute, networking, and/or storage a big plus
  • Experience virtualizing GPUs and/or Infiniband a big plus
Job Responsibility:
  • Design, build, and maintain performant, secure, and highly-available backend services/operators that run in our data centers and automate hardware management, such as Infiniband partitioning, in-DC parallel storage provisioning, and VM provisioning
  • Design and build out the IaaS software layer for a new GB200 data center with thousands of GPUs
  • Work on a global multi-exabyte high-performance object store, serving massive datasets for pretraining
  • Build advanced observability stacks for our customers with automated node lifecycle management for fault-tolerant distributed pretraining
  • Perform architecture and research work for decentralized AI workloads
  • Work on the core, open-source Together AI platform
  • Create services, tools, and developer documentation
  • Create testing frameworks for robustness and fault-tolerance
What we offer:
  • competitive compensation
  • startup equity
  • health insurance
  • other benefits
  • flexibility in terms of remote work
Employment Type: Full-time

Principal Site Reliability Engineer

As a Principal Site Reliability Engineer for the ADEM (Autonomous Digital Experi...
Location: Santa Clara, United States
Salary: Not provided
Palo Alto Networks
Expiration Date: Until further notice
Requirements:
  • 7+ years as an engineer in Infrastructure, Operations, DevOps, or System Engineering
  • Must be familiar with, and demonstrate proficiency in, AI code-assist and productivity tools such as Claude Code, Cursor, Windsurf, or GitHub Copilot to accelerate development and troubleshooting
  • Expertise in building high-availability, scalable cloud-native applications on GCP (preferred) or AWS
  • Expertise in configuration management and IaC (Terraform, Helm, Ansible)
  • Strong proficiency in programming languages like Python, Go, or Java
  • Deep experience in Kubernetes (GKE/EKS), container networking, and Linux internals
  • Experience with GitOps principles and tools like GitLab CI and ArgoCD
  • Familiarity with compliance and security frameworks (FedRAMP, SOC2) and automating policy-as-code
  • Excellent communication skills, with a "rally support" mindset to collaborate across multi-functional teams
  • BS or MS in Computer Science, a related field, or equivalent professional/military experience
Job Responsibility:
  • Drive the success of SRE and DevOps through expert contributions in CI/CD and AIOps initiatives, moving the organization toward self-healing infrastructure
  • Architect "Golden Paths" for service delivery, ensuring that SLOs, error budgets, and automated canary analysis are integrated by default
  • Design, build, and operate reliable, secure Cloud infrastructure that supports high-scale synthetic monitoring and Real User Monitoring (RUM)
  • Ensure applications are production-ready, scalable, and resilient, collaborating closely with developers, researchers, and data scientists
  • Develop tools and automation frameworks that champion Infrastructure as Code (IaC) and Monitoring as Code (MaC)
  • Lead root cause analysis (RCA) of critical business and production issues, driving improvements that prevent recurrence
Employment Type: Full-time

Engineering Director

We are seeking a seasoned Engineering Director who thrives in challenging and fa...
Location: Aguadilla, Puerto Rico
Salary: Not provided
Hewlett Packard Enterprise
Expiration Date: Until further notice
Requirements:
  • Significant work experience as a director or in a similar position working across multiple stakeholder organizations, with 10+ years of people leadership experience specific to SW and Cloud engineering
  • Solid experience leading SW development across storage, networking, on-prem, and SaaS is a must
  • Experience in setting up geographically distributed sites
  • Must have a strong background in software development lifecycle including cloud infrastructure
  • Familiarity with agile methodologies and tools like JIRA
  • Prior experience in cloud product development and deployments; end-to-end ownership and accountability
  • Solid understanding of fundamental AI and machine learning concepts, including supervised and unsupervised learning, deep learning, reinforcement learning, natural language processing, computer vision, and statistical modeling
  • Extensive business acumen, technical knowledge, and industry experience encompassing one or more engineering, technology, and product domains
  • Demonstrated abilities to drive transformation across a business with exceptional skills in the management of change
Job Responsibility:
  • Oversee the Puerto Rico Site daily operations, strategic planning and cross-functional team leadership for Hybrid Cloud
  • Recruit, mentor, and manage teams of AI/ML engineers, QA Engineers, Design Engineers and innovation specialists to deliver cutting-edge solutions
  • Continuously evaluate new tools, platforms, and frameworks in AI/ML to drive competitive advantage and operational efficiency
  • Ensure alignment with corporate goals while fostering a high-performance culture, operational efficiency, and employee engagement
  • Lead the development and execution of AI/ML strategies that align with business goals and drive innovation across products, services, or operations
  • Create strategic and tactical operations and resource plans, goals, and priorities for assigned organization based on business and technology roadmap and functional objectives
  • Engage with various senior leaders across the organization, program managers, R&D, support, Quality, product managers, technical leaders and executives to communicate program status, escalate issues, and guide and influence strategic decision-making
  • Manage senior relationships and escalated issues with outsourced partners and suppliers, including setting expectations regarding deliverables, product quality, schedules, and costs; ensure the organization is effectively leveraging outsourced resources
  • Identify opportunities for and drive organizational initiatives and programs to support business process improvements and cost reductions
What we offer:
  • Health & Wellbeing
  • Personal & Professional Development
  • Unconditional Inclusion
Employment Type: Full-time