Research Intern - AI Network Observability

United States, Mountain View · 6710.00 - 13270.00 USD / Month · Job Posted April 20, 2026

Job Description

Research Internships at Microsoft provide a dynamic environment for research careers with a network of world-class research labs led by globally recognized scientists and engineers, who pursue innovation in a range of scientific and technical disciplines to help solve complex challenges in diverse fields, including computing, healthcare, economics, and the environment.

As a Research Intern in the Strategic Planning and Architecture (SPARC) group, you will contribute to the research, design, and development of tools to provide insights into multi-path network transports for large-scale Artificial Intelligence (AI) datacenter environments. Your work will focus on building high-performance tracing and analysis systems capable of capturing packet-level behavior at extremely high speeds (up to 800 Gbps). These tools will enhance observability for next-generation transport protocols supporting AI workloads. The role offers opportunities to prototype solutions on real hardware and collaborate with engineers to improve reliability and strengthen the explainability of AI intra-datacenter networking.

Job Responsibility

  • Engage early with your mentors to clearly formulate a plan of work for the 12 weeks of the Research Internship
  • Clearly and frequently document and communicate your progress, adjusting the plan as the project evolves
  • Show initiative and think unconventionally to derive creative and innovative solutions

Requirements

  • Currently enrolled in a PhD program in Computer Science or a related STEM field
  • Research Interns are expected to be physically located in their manager’s Microsoft worksite location for the duration of their internship
  • Applicants should demonstrate depth of knowledge in datacenter networking and systems research
  • Experience in high-performance programming of network data paths (e.g., using C++)
  • Experience in RDMA and/or DPDK
  • Experience in RoCE; knowledge of TCP, UDP, IP, and Ethernet

Similar Jobs for Research Intern - AI Network Observability

8 matching positions

Senior Software Engineer - Kubernetes & ServiceMesh

Join us in building Roku’s next-generation cloud-agnostic platform that powers K...
Location: India, Bengaluru
Salary: Not provided
Company: Roku
Expiration Date: Until further notice
Requirements
  • Strong hands-on experience with cloud technologies (AWS preferred; GCP or Azure is a plus), specifically in architecting and managing performant, large-scale systems handling significant traffic/data
  • Deep knowledge of Kubernetes (EKS, GKE, AKS, or similar) and service mesh technologies
  • Proficiency in Go or another programming language, Python or another scripting language
  • Experience designing infrastructure and building automation tools, while collaborating with internal team members and external stakeholders
  • Experience building CI/CD pipelines and following modern deployment practices
  • Familiarity with observability tools (Prometheus, Thanos, Loki, Grafana, etc.)
  • Ability to work independently and communicate effectively with technical and non-technical stakeholders
  • Passion for learning and solving complex infrastructure challenges
  • Experience integrating AI tools to improve processes and reduce operational toil (a plus)
Job Responsibility
  • Architect, design, and deploy Roku’s next-generation cloud platform and service mesh
  • Build and own solutions to Roku's compute problems using Docker, Kubernetes, Istio/Envoy, Terraform and scripting to evolve our tech stack and deployments
  • Proactively drive the research and implementation of new technologies to enhance scalability, reliability, and developer experience
  • Integrate security best practices into infrastructure design and automation
  • Build tooling to visualize inefficiencies and optimize costs across shared-tenancy clusters, including network traffic insights, cross-cluster communication efficiency, and cost attribution
  • Collaborate with internal teams to migrate workloads to Kubernetes + Istio, leveraging open-source observability tools
  • Work closely with the Observability team to scale monitoring and logging solutions for a holistic view of the platform
  • Leverage SRE principles to maintain high availability and streamline onboarding workflows
  • Mentor team members and help define best practices for infrastructure and automation
What we offer
  • global access to mental health and financial wellness support and resources
  • healthcare (medical, dental, and vision)
  • life insurance
  • accident insurance
  • disability insurance
  • commuter benefits
  • retirement options (401(k)/pension)
  • time off
Employment type: Full-time

Senior Full Stack Engineer - Go / React.js

Rapid7’s Metasploit team is building the future of the world’s best-known softwa...
Location: Czechia, Prague
Salary: Not provided
Company: Rapid7
Expiration Date: Until further notice
Requirements
  • 6+ years of experience in software development using Go, JavaScript, TypeScript, and React (Next.js), or equivalent programming languages
  • Experience with modern cloud infrastructure (AWS, GCP, or Azure)
  • Experience with design patterns
  • Experience with message queues (RabbitMQ, SQS)
  • Understanding of APIs, interprocess communication, and modern networking and deployment tooling (AWS, Docker)
  • High level of accountability and ownership
  • Leading with empathy and strong user focus
  • Ability to learn and evaluate new technologies quickly
  • Interest in or experience with offensive security, penetration testing, or SOC analysis
  • Product-driven mindset
Job Responsibility
  • Develop and enhance AI-powered applications within the Metasploit ecosystem
  • Architect and implement performant, scalable, and reliable solutions that support AI-driven interactions in web development
  • Collaborate cross-functionally with researchers, engineers and product teams to push the boundaries of AI in cybersecurity
  • Ensure an exceptional user experience through user-friendly UI/UX
  • Diagnose and resolve complex issues, ensuring the reliability and performance of AI-powered products
  • Build tooling and automation to enhance incident response, developer experience, observability, and internal debugging workflows
  • Champion your teammates' successes, and support each other when needed
Employment type: Full-time

Senior Fullstack Engineer - Go / React.js

Rapid7’s Metasploit team is building the future of the world’s best-known softwa...
Location: United Kingdom
Salary: Not provided
Company: Rapid7
Expiration Date: Until further notice
Requirements
  • 6+ years of experience in software development using Go, JavaScript, TypeScript, and React (Next.js), or equivalent programming languages
  • Experience with modern cloud infrastructure (AWS, GCP, or Azure)
  • Experience with design patterns
  • Experience with message queues (RabbitMQ, SQS)
  • Understanding of APIs, interprocess communication, and modern networking and deployment tooling (AWS, Docker)
  • High level of accountability and ownership, taking responsibility for outcomes and proactively driving work forward with minimal oversight
  • Leading with empathy and strong user focus
  • Ability to learn and evaluate new technologies quickly, digging into code to find answers
  • Interest in or experience with offensive security, penetration testing, or SOC analysis
  • Product-driven mindset
Job Responsibility
  • Develop and enhance AI-powered applications within the Metasploit ecosystem
  • Architect and implement performant, scalable, and reliable solutions that support AI-driven interactions in web development
  • Collaborate cross-functionally with researchers, engineers and product teams to push the boundaries of AI in cybersecurity
  • Ensure an exceptional user experience through user-friendly UI/UX
  • Diagnose and resolve complex issues, ensuring the reliability and performance of AI-powered products
  • Build tooling and automation to enhance incident response, developer experience, observability, and internal debugging workflows
  • Champion your teammates' successes, and support each other when needed
Employment type: Full-time

Enterprise Account Executive

We are looking for a fast-paced, client-obsessed Account Executive with an entre...
Location: Australia
Salary: 250000.00 - 300000.00 USD / Year
Company: Arize
Expiration Date: Until further notice
Requirements
  • 5+ years of enterprise SaaS sales experience: hungry, aggressive, and motivated
  • Familiarity with, or willingness to learn, sales technologies to find and attract prospects
  • Self-starter and comfortable working in limited process environments
  • Full-cycle sales experience and ability to navigate the complexities of enterprise deals
  • Fast-paced and focused on helping prospects / customers
  • Team player: Collaboration with peers and other organizations within Arize is critical to success; we deeply value the success of the collective team over individual gains
  • Strong communication skills: Clearly and objectively communicate observations from the field
Job Responsibility
  • Be a networker, seller and closer
  • Build relationships with AI/ML stakeholders and be an active member of the community
  • Conduct discovery with prospects and share the Arize vision
  • Run a sophisticated prospecting strategy to 'get the word out' and find deals
  • Create sales plays, write talk tracks and strategically identify new business opportunities
  • Deeply research accounts, stakeholders and competitors
  • Manage proofs of concept, drive adoption and grow accounts
  • Manage and navigate internal / external stakeholders to ensure success
  • Understand use cases, scope licensing and find more workloads
  • BANT or MEDDIC methodology preferred
What we offer
  • competitive equity package
  • medical
  • dental
  • vision
  • 401(k) plan
  • unlimited paid time off
  • generous parental leave plan
  • others for mental and wellness support
  • WFH monthly stipend to pay for co-working spaces
Employment type: Full-time

Business Consultant, Digital Commerce

As a Consultant - Digital Commerce, you will work as part of our Strategy and Gr...
Location: United States
Salary: Not provided
Company: Columbus United Kingdom
Expiration Date: Until further notice
Requirements
  • Deep expertise in Retail, Manufacturing, Food and Beverages, and Life Sciences
Job Responsibility
  • Lead strategic commitments
  • Gather information and create insight based on analysis to draw meaningful conclusions, identify implications for recommendations, and gain the customer's understanding and acceptance
  • Follow Columbus framework approaches, tailoring them to specific customer needs
  • Collaborate with and contribute to the Advisory competence network, helping to continuously improve the offering through retrospectives on completed work
  • Manage activities and deliveries according to plan, review drafts and provide feedback / coaching to other project members
  • Act as a catalyst and coach in projects that span over your area of expertise
  • Evaluate and manage risks and problems and ensure that the project objectives are achieved
  • Use AI appropriately to drive quality and accelerate delivery
  • Establish a reliable relationship with our clients’ management where they see you as an advisor
  • Develop and maintain strong, trust-based relationships with our customers (at all levels)
What we offer
  • Health Insurance
  • Life Insurance
  • Dental Insurance
  • Vision Insurance
  • Short-Term Disability
  • Long-Term Disability
  • paid vacation
  • sick leave
  • holidays
  • 401(k)
Employment type: Full-time

Staff Software Engineer - AI/ML Infra

GEICO AI platform and Infrastructure team is seeking an exceptional Senior ML Pl...
Location: United States, Chevy Chase; New York City; Palo Alto
Salary: 115000.00 USD / Year
Company: Geico
Expiration Date: Until further notice
Requirements
  • Bachelor’s degree in Computer Science, Engineering, or a related technical field (or equivalent experience)
  • 8+ years of software engineering experience with focus on infrastructure, platform engineering, or MLOps
  • 3+ years of hands-on experience with machine learning infrastructure and deployment at scale
  • 2+ years of experience working with Large Language Models and transformer architectures
  • Proficient in Python; strong skills in Go, Rust, or Java preferred
  • Proven experience working with open source LLMs (Llama 2/3, Qwen, Mistral, Gemma, Code Llama, etc.)
  • Proficient in Kubernetes, including custom operators, Helm charts, and GPU scheduling
  • Deep expertise in Azure services (AKS, Azure ML, Container Registry, Storage, Networking)
  • Experience implementing and operating feature stores (Chronon, Feast, Tecton, Azure ML Feature Store, or custom solutions)
Job Responsibility
  • Design and implement scalable infrastructure for training, fine-tuning, and serving open source LLMs (Llama, Mistral, Gemma, etc.)
  • Architect and manage Kubernetes clusters for ML workloads, including GPU scheduling, autoscaling, and resource optimization
  • Design, implement, and maintain feature stores for ML model training and inference pipelines
  • Build and optimize LLM inference systems using frameworks like vLLM, TensorRT-LLM, and custom serving solutions
  • Ensure 99.9%+ uptime for ML platforms through robust monitoring, alerting, and incident response procedures
  • Design and implement ML platforms using DataRobot, Azure Machine Learning, Azure Kubernetes Service (AKS), and Azure Container Instances
  • Develop and maintain infrastructure using Terraform, ARM templates, and Azure DevOps
  • Implement cost-effective solutions for GPU compute, storage, and networking across Azure regions
  • Ensure ML platforms meet enterprise security standards and regulatory compliance requirements
  • Evaluate and potentially implement hybrid cloud solutions with AWS/GCP as backup or for specialized use cases
What we offer
  • Comprehensive Total Rewards program that offers personalized coverage tailor-made for you and your family’s overall well-being
  • Financial benefits including market-competitive compensation; a 401K savings plan vested from day one that offers a 6% match; performance and recognition-based incentives; and tuition assistance
  • Access to additional benefits like mental healthcare as well as fertility and adoption assistance
  • Flexibility: we provide workplace flexibility as well as our GEICO Flex program, which offers the ability to work from anywhere in the US for up to four weeks per year
Employment type: Full-time

Staff Software Engineer - AI/ML Platform

GEICO AI platform and Infrastructure team is seeking an exceptional Senior ML Pl...
Location: United States, Chevy Chase; New York City; Palo Alto
Salary: 115000.00 USD / Year
Company: Geico
Expiration Date: Until further notice
Requirements
  • Bachelor’s degree in Computer Science, Engineering, or a related technical field (or equivalent experience)
  • 8+ years of software engineering experience with focus on infrastructure, platform engineering, or MLOps
  • 3+ years of hands-on experience with machine learning infrastructure and deployment at scale
  • 2+ years of experience working with Large Language Models and transformer architectures
  • Proficient in Python; strong skills in Go, Rust, or Java preferred
  • Proven experience working with open source LLMs (Llama 2/3, Qwen, Mistral, Gemma, Code Llama, etc.)
  • Proficient in Kubernetes, including custom operators, Helm charts, and GPU scheduling
  • Deep expertise in Azure services (AKS, Azure ML, Container Registry, Storage, Networking)
  • Experience implementing and operating feature stores (Chronon, Feast, Tecton, Azure ML Feature Store, or custom solutions)
Job Responsibility
  • Design and implement scalable infrastructure for training, fine-tuning, and serving open source LLMs (Llama, Mistral, Gemma, etc.)
  • Architect and manage Kubernetes clusters for ML workloads, including GPU scheduling, autoscaling, and resource optimization
  • Design, implement, and maintain feature stores for ML model training and inference pipelines
  • Build and optimize LLM inference systems using frameworks like vLLM, TensorRT-LLM, and custom serving solutions
  • Ensure 99.9%+ uptime for ML platforms through robust monitoring, alerting, and incident response procedures
  • Design and implement ML platforms using DataRobot, Azure Machine Learning, Azure Kubernetes Service (AKS), and Azure Container Instances
  • Develop and maintain infrastructure using Terraform, ARM templates, and Azure DevOps
  • Implement cost-effective solutions for GPU compute, storage, and networking across Azure regions
  • Ensure ML platforms meet enterprise security standards and regulatory compliance requirements
  • Evaluate and potentially implement hybrid cloud solutions with AWS/GCP as backup or for specialized use cases
What we offer
  • Comprehensive Total Rewards program that offers personalized coverage tailor-made for you and your family’s overall well-being
  • Financial benefits including market-competitive compensation; a 401K savings plan vested from day one that offers a 6% match; performance and recognition-based incentives; and tuition assistance
  • Access to additional benefits like mental healthcare as well as fertility and adoption assistance
  • Flexibility: we provide workplace flexibility as well as our GEICO Flex program, which offers the ability to work from anywhere in the US for up to four weeks per year
Employment type: Full-time

Staff Software Engineer - AI/ML Infra

GEICO AI platform and Infrastructure team is seeking an exceptional Senior ML Pl...
Location: United States, Palo Alto
Salary: 90000.00 USD / Year
Company: Geico
Expiration Date: Until further notice
Requirements
  • Bachelor’s degree in Computer Science, Engineering, or a related technical field (or equivalent experience)
  • 8+ years of software engineering experience with focus on infrastructure, platform engineering, or MLOps
  • 3+ years of hands-on experience with machine learning infrastructure and deployment at scale
  • 2+ years of experience working with Large Language Models and transformer architectures
  • Proficient in Python; strong skills in Go, Rust, or Java preferred
  • Proven experience working with open source LLMs (Llama 2/3, Qwen, Mistral, Gemma, Code Llama, etc.)
  • Proficient in Kubernetes, including custom operators, Helm charts, and GPU scheduling
  • Deep expertise in Azure services (AKS, Azure ML, Container Registry, Storage, Networking)
  • Experience implementing and operating feature stores (Chronon, Feast, Tecton, Azure ML Feature Store, or custom solutions)
Job Responsibility
  • Design and implement scalable infrastructure for training, fine-tuning, and serving open source LLMs (Llama, Mistral, Gemma, etc.)
  • Architect and manage Kubernetes clusters for ML workloads, including GPU scheduling, autoscaling, and resource optimization
  • Design, implement, and maintain feature stores for ML model training and inference pipelines
  • Build and optimize LLM inference systems using frameworks like vLLM, TensorRT-LLM, and custom serving solutions
  • Ensure 99.9%+ uptime for ML platforms through robust monitoring, alerting, and incident response procedures
  • Design and implement ML platforms using DataRobot, Azure Machine Learning, Azure Kubernetes Service (AKS), and Azure Container Instances
  • Develop and maintain infrastructure using Terraform, ARM templates, and Azure DevOps
  • Implement cost-effective solutions for GPU compute, storage, and networking across Azure regions
  • Ensure ML platforms meet enterprise security standards and regulatory compliance requirements
  • Evaluate and potentially implement hybrid cloud solutions with AWS/GCP as backup or for specialized use cases
What we offer
  • Comprehensive Total Rewards program that offers personalized coverage tailor-made for you and your family’s overall well-being
  • Financial benefits including market-competitive compensation; a 401K savings plan vested from day one that offers a 6% match; performance and recognition-based incentives; and tuition assistance
  • Access to additional benefits like mental healthcare as well as fertility and adoption assistance
  • Flexibility: we provide workplace flexibility as well as our GEICO Flex program, which offers the ability to work from anywhere in the US for up to four weeks per year
Employment type: Full-time