The GenAI Infrastructure and Solutions team is building large-scale GenAI training infrastructure, LLM-based solutions, and tools. We provide the infrastructure for teams in CoreAI and other Microsoft groups to fine-tune LLMs and serve agentic workloads for their own scenarios. As a Principal Software Engineer, you will work on the infrastructure and tools that support large-scale model fine-tuning, evaluation, and inference.
Job Responsibilities:
Lead the collaboration with engineers and researchers to build and optimize training infrastructure and tools for LLMs, SLMs, multimodal, and code-specific models.
Design, build and improve services with high scalability and reliability.
Design and implement services that serve production traffic and meet security and privacy requirements.
Lead the efforts to deliver and improve engineering systems and practices to ensure service quality in complex cloud environments.
Contribute to the deployment and monitoring of services in production environments.
Requirements:
Bachelor's Degree in Computer Science or a related technical field and 8+ years of technical engineering experience coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python, or equivalent experience.
6+ years of experience designing, developing, and shipping high-quality software.
3+ years of experience with distributed systems and cloud-based infrastructure.
2+ years of experience with containerization tools (e.g., Docker, Kubernetes).
2+ years of experience with DevOps practices (CI/CD, automated testing, deployment, etc.).
Passionate and self-motivated, with a strong ability to learn independently, enter new domains, and manage through uncertainty in an innovative team environment.
Familiarity with virtualization technology.
Familiarity with production ML systems and concepts like model serving, caching, batching, and monitoring.