Senior Researcher - AI and Systems Reliability

Microsoft Corporation

Location:
United States, Redmond

Contract Type:
Not provided

Salary:
119800.00 - 234700.00 USD / Year

Job Description:

Help shape the future of reliable AI systems. At Microsoft Research’s AI and Systems Reliability Group (Redmond, WA), we push the boundaries of foundational research and turn ideas into impact across Microsoft and beyond. Our mission is to tackle ambitious challenges that redefine the computing landscape. We are seeking a Senior Researcher in AI and Systems Reliability to work in areas such as distributed systems and reliability, formal methods and verification, machine learning for system reliability, and reliability of machine learning systems. As AI (Artificial Intelligence) technologies such as large language models become central to everyday computing, we look for experts who can bring formal rigor and reliability guarantees to AI-powered personal, mobile, and datacenter platforms.

Job Responsibility:

  • Define a novel research agenda, driving forward an effective program of basic, fundamental, and applied research
  • Collaborate and build new ideas with members of the group and others
  • Have the direct opportunity to realize your ideas in products and services used worldwide

Requirements:

  • PhD (completed or in progress) in Computer Science or Computer Science Engineering
  • A research program demonstrated by journal and conference publications (NeurIPS, SOSP, OSDI)
  • Firm understanding of Distributed Systems and Cloud Systems
  • Demonstrable ability to work in a multi-disciplinary team
  • Effective communication skills and ability to work in a collaborative environment
  • A PhD that was focused on any one of the following core areas of research: datacenter networking, distributed systems, formal methods and verification, high performance computing, ML Systems, operating systems, programming languages, storage systems, systems reliability, systems security and software engineering

Additional Information:

Job Posted:
April 01, 2026

Employment Type:
Fulltime
Work Type:
On-site work
Similar Jobs for Senior Researcher - AI and Systems Reliability

Senior Generative AI Engineer

The Citi Innovation Lab is a leader in creating new ideas, innovative technology...
Location:
Israel, Tel Aviv
Salary:
Not provided
Citi
Expiration Date:
Until further notice
Requirements:
  • Hands-on experience with transformer-based models and their applications
  • Strong understanding of LLM, LLM model selection, benchmarking, and optimization
  • Experience with RAG systems and vector databases
  • Proficiency in developing and deploying AI agents
  • Knowledge of open-source models and methods, including benchmarks for evaluating AI performance
  • Knowledge of security risks and mitigation strategies for autonomous AI agents, including OWASP guidelines
  • Proficiency in Python and experience with libraries such as Pandas, Tabula, and TensorFlow/PyTorch
  • Strong problem-solving skills and attention to detail
  • Excellent communication and documentation skills
Job Responsibility:
  • Develop and implement enterprise-scale, cutting-edge models such as visual document understanding and text2code
  • Implement and optimize vector-based retrieval systems for RAG, covering embedding models, ANN indexing, hybrid search, and re-ranking
  • Implement autonomous AI agents that perform adaptive, error-resistant data extraction and content validation tasks
  • Develop and deploy enterprise software applications using state-of-the-art practices such as microservices and modular code, and write unit and integration tests to ensure the accuracy and reliability of the AI applications
  • Ensure data privacy and security in all AI-driven processes, adhering to OWASP guidelines and Citi’s stringent authentication and authorization policies
  • Collaborate with cross-functional teams to integrate AI solutions into existing workflows
  • Document the development process and create comprehensive technical specifications
  • Manage and maintain AI applications, ensuring best practices in model management and versioning
  • Deploy the resulting AI applications using industrial-strength frameworks and processes, including Kubernetes and OpenShift, for scalable and efficient on-premises operations
  • Research, develop, and use transformer-based models for enhanced application performance
Employment Type:
Fulltime

Senior Staff Machine Learning Engineer (AI Agent)

At Cresta, the AI Agent team is on a mission to create state-of-the-art AI Agent...
Location:
United States; Canada
Salary:
Not provided
Cresta
Expiration Date:
Until further notice
Requirements:
  • Bachelor’s Degree in Computer Science, Mathematics, or a related field
  • Master’s or Ph.D. preferred, or equivalent professional experience
  • 7+ years of hands-on industry experience with AI and machine learning
  • 3+ years of experience working with LLMs in large-scale production environments
  • Expert knowledge of machine learning concepts and methods, especially those related to NLP, Generative AI, and working with LLMs
  • Proven leadership in designing and deploying AI solutions at scale
  • Extensive practical knowledge of modern machine learning frameworks and technologies (e.g., PyTorch, Tensorflow, Hugging Face, NumPy)
  • Experience with distributed systems and cloud-based AI infrastructure
  • Strong problem-solving and strategic thinking abilities
  • Proven ability to lead cross-functional teams and work collaboratively to deliver innovative AI solutions in production
Job Responsibility:
  • Design, develop, and deploy Cresta’s AI Agent solutions and proprietary models
  • Focus on practical AI challenges such as improving reasoning, planning capabilities, and evaluation in real-world scenarios
  • Collaborate with cross-functional teams including front-end and back-end software engineers to integrate AI Agents into Cresta’s customer solutions
  • Lead initiatives to scale AI systems for production environments, ensuring performance and reliability across use cases
  • Contribute to solving cutting-edge problems in AI and help define the future roadmap for Cresta’s AI Agents
  • Innovate and research ways to improve security, cost-efficiency, and reliability of AI systems
What we offer:
  • Variety of medical, dental, and vision plans
  • Paid parental leave
  • Monthly Health & Wellness allowance
  • Work from home office stipend
  • Lunch reimbursement for in-office employees
  • PTO: 3 weeks in Canada
  • Base salary, equity, and a variety of benefits
Employment Type:
Fulltime

Artificial intelligence-assisted Reliability Engineer Intern

Amazon development center develops innovative consumer-centric safety total solu...
Location:
Taiwan, Taipei
Salary:
Not provided
Amazon Pforzheim GmbH
Expiration Date:
Until further notice
Requirements:
  • Enrolled in or have completed a Master's degree or above in engineering or equivalent
  • Speak, write, and read fluently in Mandarin
  • Currently enrolled in a master’s program in Computer Science, Electrical Engineering, Chemical Engineering, Mechanical Engineering, or related field with focus on deep learning and computer vision
  • Programming experience in C++ and Python
  • Experience in implementing computer vision algorithms using multiple toolkits
  • Proficient in both English and Chinese
Job Responsibility:
  • Develop and implement novel machine learning algorithms, focusing on computer vision and generative AI applications
  • Design and optimize scalable AI systems for large-scale datasets, ensuring high performance and accuracy
  • Apply state-of-the-art Machine Learning and AI research to solve complex reliability challenges and product lifespan projection
  • Create clear technical documentation and reports to effectively communicate concepts and results
  • Work closely with senior reliability engineers in reporting reliability execution progress, and failure analysis
  • Participate in evaluating and developing reliability test methodologies to reduce test time and increase test coverage
Employment Type:
Parttime

Staff AI Embedded Software Engineer - Connected Devices

As a Staff Embedded Software Engineer, you will lead critical software engineeri...
Location:
United States, Seattle; Boston; Scottsdale
Salary:
168750.00 - 270000.00 USD / Year
Axon
Expiration Date:
Until further notice
Requirements:
  • 12+ years of professional software development experience, with extensive expertise in C/C++, Go, Python, or comparable systems programming languages, including significant experience building AI- and data-intensive systems
  • Deep, demonstrated expertise in embedded systems architecture, firmware integration, and device-level software engineering, combined with hands-on experience deploying and optimizing AI inference workloads on constrained edge platforms (MCUs, SoCs, NPUs)
  • Proven experience designing, training, and operating machine learning models at scale, including ownership of data pipelines, model evaluation, and iterative improvement in production environments
  • Practical experience with large-scale AI systems, including foundation models and LLMs, such as fine-tuning, adaptation, or integration into real-world products
  • Proven track record of addressing and resolving system-wide challenges in performance, scalability, reliability, security, and safety across AI-enabled and mission-critical systems
  • 7+ years mentoring senior engineers and leading complex, strategic engineering initiatives across multiple teams, including setting technical direction for AI-enabled products
  • Advanced understanding of computer science fundamentals, data structures, algorithms, and high-standard software design practices, applied to both embedded and large-scale AI systems
  • Experience with networking and distributed system concepts relevant to connected and AI-enabled devices
Job Responsibility:
  • Define and significantly advance embedded software architectures for Axon’s current and future connected device products, including AI-enabled systems spanning on-device inference and cloud-assisted workflows
  • Lead the technical direction for AI-enabled capabilities across connected devices, including collaboration on large-scale model training, data strategy, deployment, and iterative improvement in production, across multiple product lines
  • Partner with research, product, and platform teams to explore and integrate emerging AI approaches, including foundation models and multimodal systems, shaping Axon’s medium and long-term AI strategy for connected devices
  • Establish and enforce Axon-wide standards for embedded software and AI system design, including reliability, scalability, safety, observability, and lifecycle management
  • Identify and mitigate risks associated with AI systems, including model failure modes, data drift, and operational edge cases, and drive architectural decisions that ensure safe and reliable behavior in real-world conditions
  • Provide executive-level guidance and mentorship, significantly enhancing the capabilities and technical decision-making of the embedded software engineering teams
  • Continuously improve software engineering practices and drive excellence through strategic retrospectives, planning sessions, and innovation cycles
What we offer:
  • Competitive salary and 401k with employer match
  • Discretionary paid time off
  • Paid parental leave for all
  • Medical, Dental, Vision plans
  • Fitness Programs
  • Emotional & Mental Wellness support
  • Learning & Development programs
  • Snacks in our offices
Employment Type:
Fulltime

Senior Machine Learning Engineer, Agentic

Join us in building the future of finance. Our mission is to democratize finance...
Location:
United States, Bellevue; Menlo Park; New York; Washington; Denver; Westlake; Chicago; Lake Mary; Clearwater; Gainesville
Salary:
146000.00 - 220000.00 USD / Year
Robinhood
Expiration Date:
Until further notice
Requirements:
  • Strong technical expertise in software development, with understanding of agentic workflows—including reasoning loops, tool invocation, memory, and orchestration of autonomous AI agents
  • Hands-on experience using Large Language Models, including prompt engineering, fine-tuning, model distillation, and deploying optimized models (e.g. via DPO, PPO) into production environments
  • Proven ability to build and scale ML/AI systems, from experimentation to deployment—owning dataset generation, evaluation pipelines, A/B testing, and performance monitoring
  • Leadership and mentorship capabilities, with a track record of guiding complex technical projects and supporting the growth of teammates through code/design reviews and technical direction
  • Excellent communication and collaboration skills, with the ability to translate technical ideas into actionable plans and work effectively with cross-functional partners, including product and infrastructure teams
  • Innovation mindset and commitment to continuous learning and a bias toward action, staying at the forefront of ML/AI trends, agentic systems research, and best practices in tooling, safety, and evaluation
Job Responsibility:
  • Design and create tools and workflows for agent development that support rapid prototyping—define agents, compose toolchains, and construct reasoning loops with minimal overhead
  • Build platform solutions to support scalable experimentation, synthetic dataset generation, and multi-agent evaluation across diverse tasks and domains
  • Develop feedback and optimization pipelines that incorporate both automated metrics and human-in-the-loop evaluation signals to fine-tune agent behavior
  • Implement and scale optimization techniques such as Direct Preference Optimization (DPO), Proximal Policy Optimization (PPO), and reward modeling to improve agent performance
  • Launch and support fine-tuned models in production environments with robust evaluation, rollback strategies, and performance monitoring
  • Collaborate closely with applied AI/ML teams to translate state-of-the-art research in agentic reasoning, planning, and tool use into reliable, production-ready systems
What we offer:
  • Market competitive and pay equity-focused compensation structure
  • 100% paid health insurance for employees with 90% coverage for dependents
  • Annual lifestyle wallet for personal wellness, learning and development, and more
  • Lifetime maximum benefit for family forming and fertility benefits
  • Dedicated mental health support for employees and eligible dependents
  • Generous time away including company holidays, paid time off, sick time, parental leave, and more
  • Lively office environment with catered meals, fully stocked kitchens, and geo-specific commuter benefits
  • Bonus opportunities
  • Equity
Employment Type:
Fulltime

Software Engineer - PostgreSQL for AI Workloads

Microsoft’s Azure Data engineering team is leading the transformation of analyti...
Location:
Spain, Barcelona
Salary:
Not provided
Microsoft Corporation
Expiration Date:
Until further notice
Requirements:
  • Bachelor's Degree in Computer Science or related technical field AND equivalent technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
  • Proven technical engineering capabilities in systems-level engineering, including work on database engines, distributed systems, or backend infrastructure
  • Proficiency in one or more systems programming languages such as C, C++, or Rust
  • Experience working with PostgreSQL or similar engines at the extension, indexing, or query execution level
  • Demonstrated ability to design and deliver reliable, performant systems in a collaborative environment
  • Ability to meet Microsoft, customer, and/or government security screening requirements is required for this role. These requirements include, but are not limited to, the following specialized security screening: Microsoft Cloud Background Check. This position will be required to pass the Microsoft Cloud background check upon hire/transfer and every two years thereafter.
Job Responsibility:
  • Design and implement AI-native capabilities in PostgreSQL, including vector indexing, semantic and full-text search operators, hybrid search patterns, and graph query features
  • Own or contribute to the design and implementation of major AI-native subsystems, from early technical exploration through production readiness, with guidance from senior engineers as appropriate
  • Build and enhance high-performance PostgreSQL extensions and core engine integrations using C, C++, or Rust, with a strong focus on performance, correctness, and maintainability
  • Contribute to end-to-end development, including performance analysis, debugging, tuning, operability, and service integration in cloud database environments
  • Work effectively in high-ambiguity problem spaces, evaluating technical tradeoffs through experimentation as patterns and best practices emerge
  • Collaborate closely with senior engineers, product managers, and AI researchers to translate requirements into scalable, intuitive, and reliable systems
  • Participate actively in technical design discussions, code reviews, and the evolution of engineering standards, while deepening understanding of PostgreSQL internals and systems design
  • Help shape the developer experience through APIs, control plane integration, and extensibility mechanisms
  • Learn, apply, and promote best practices for building reliable, observable, and operable systems in a production cloud database service
  • Stay informed and curious about research and industry trends in databases, search systems, graph systems, and AI-powered data platforms.
Employment Type:
Fulltime

Staff AI Engineer

As a Staff AI Engineer, you will be a senior technical leader responsible for de...
Location:
India, Hyderabad
Salary:
Not provided
Teradata
Expiration Date:
Until further notice
Requirements:
  • Bachelor’s degree in Computer Science, Engineering, or equivalent practical experience
  • 8+ years of experience building backend services, distributed systems, or data/AI platforms
  • Strong proficiency in Java, Go, or Python, with experience building large‑scale services
  • Deep understanding of distributed system design, scalability, fault tolerance, and cloud‑native architectures
  • Proven experience designing and operating production systems with SQL and NoSQL data stores
Job Responsibility:
  • Lead the design and evolution of large‑scale, distributed AI systems that power Teradata’s AI platform and AI‑native products
  • Own end‑to‑end architecture for critical AI capabilities such as agentic workflows, RAG pipelines, vector search, semantic retrieval, and AI orchestration frameworks
  • Drive technical strategy and architectural consistency across multiple engineering teams
  • Design and implement production‑grade AI systems using LLMs, embeddings, vector databases, and agent‑based architectures
  • Build scalable, secure, and reusable platform services and APIs supporting AI workloads across the software development lifecycle
  • Define and implement guardrails for reliability, safety, governance, and cost control in enterprise AI systems
  • Partner with product management, architecture, research, and cloud platform teams to translate business requirements into scalable AI solutions
  • Influence roadmap decisions by providing deep technical insight, trade‑off analysis, and long‑term platform thinking
  • Act as a technical escalation point for complex system design, performance, and reliability challenges
  • Drive best practices for testing, observability, evaluation, and production readiness of AI systems
What we offer:
  • People-first culture
  • Flexible work model
  • Focus on well-being
  • Inclusive environment
Employment Type:
Fulltime

Senior Software Engineer - AI

Are you looking for an opportunity to work with the latest Azure offerings and p...
Location:
India, Bangalore
Salary:
Not provided
Microsoft Corporation
Expiration Date:
Until further notice
Requirements:
  • 7+ years of experience in Software Development
  • Strong programming expertise in one or more languages such as Python, Go, Java, or C#, with experience designing production-grade services and APIs
  • Experience building AI-powered applications, including integrating LLMs, implementing agent or Copilot workflows, and orchestrating multi-step AI interactions
  • Hands-on experience with LLM application frameworks and orchestration tools such as Semantic Kernel, LangChain, or similar agent frameworks
  • Familiarity with retrieval-augmented generation (RAG) architectures, vector databases, embeddings, and semantic search systems
  • Experience evaluating and improving model performance through prompt design, evaluation frameworks, fine-tuning, or feedback loops
  • Solid understanding of distributed systems concepts including scalability, reliability, observability, caching, and asynchronous processing
  • Experience deploying and operating AI workloads in cloud environments (preferably Azure), including containerized services and GPU-enabled infrastructure
  • Understanding of Responsible AI practices, including model governance, safety, privacy, and evaluation of AI behaviour in production systems
  • Ability to work across product, research, and engineering teams to translate product scenarios into scalable AI system architectures
Job Responsibility:
  • Design, build, and operate scalable AI systems that power intelligent product experiences, including Copilot and agent-driven workflows
  • Architect and implement backend services that support multi-step AI interactions, including orchestration pipelines, context management, memory/state persistence, and tool execution
  • Integrate large language models (LLMs), APIs, and internal services to enable context-aware, human-in-the-loop experiences across customer scenarios
  • Build and maintain data and inference pipelines that support model training, fine-tuning, evaluation, and real-time inference across diverse data sources
  • Evaluate, benchmark, and tune AI/ML models (LLMs and traditional models) to meet product requirements for accuracy, latency, reliability, and safety
  • Implement robust retrieval, grounding, and knowledge integration mechanisms (e.g., RAG systems, semantic indexing, vector search) to power intelligent applications
  • Collaborate with product managers, software engineers, and researchers to translate product vision into production-ready AI capabilities and measurable outcomes
  • Ensure reliability, observability, and governance of AI systems, including monitoring model performance, data quality, and responsible AI practices
  • Build reusable platforms, APIs, and tools that enable teams to rapidly develop AI-powered features and self-service intelligent applications
Employment Type:
Fulltime