Senior Researcher - LLM Systems (Research Sciences IC4): Microsoft (Systems Innovation Initiative)
Dec 6, 2025
Location: Redmond, Washington, United States
Deadline: Not specified
Experience: Senior
Continent: North America
Salary: $158,400 - $258,000 per year
This role sits within Microsoft's company-wide Systems Innovation initiative, which focuses on advancing efficiency across the entire AI stack (models, frameworks, cloud infrastructure, and hardware). The position is on an Applied Research team driving mid- and long-term product innovations that will materially impact Microsoft Copilots and hundreds of millions of global customers.
The candidate will drive algorithmic and systems innovations, inventing, analyzing, and productionizing the next generation of serving architectures for transformer-based models across cloud and edge.
Key Responsibilities
Inference Optimization: Invent and evaluate algorithms for dynamic batching, routing, and scheduling for transformer inference under multi-tenant Service Level Objectives (SLOs) and variable sequence lengths (a minimal illustrative sketch follows this list).
Systems Design: Design and implement caching layers (e.g., KV cache paging/offload, prompt/result caching) and memory pressure controls to maximize GPU/accelerator utilization.
Deployment & Policy: Develop endpoint configuration policies (e.g., parallelism, quantization, speculative decoding) and safe rollout mechanisms.
Benchmarking: Profile and optimize end-to-end serving pipelines against metrics such as token-level latency, throughput per dollar, and cold-start behavior.
Collaboration & Output: Collaborate with model, kernel, and hardware teams; publish research, file patents, and contribute to open-source serving frameworks.
Mentorship: Document designs and operational playbooks, and mentor researchers/engineers on the team.
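To make the first responsibility concrete, here is a minimal, hypothetical Python sketch (illustrative only, not Microsoft code or part of the posting) of earliest-deadline-first admission for dynamic batching: requests are packed into each serving step under a fixed token budget. The names TokenBudgetScheduler and Request, and the token-budget policy itself, are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
import heapq

@dataclass(order=True)
class Request:
    deadline_ms: float                   # SLO deadline; earlier deadlines are served first
    seq_len: int = field(compare=False)  # tokens this request contributes to the step
    rid: int = field(compare=False)      # request id (excluded from ordering)

class TokenBudgetScheduler:
    """Hypothetical earliest-deadline-first admission under a per-step token budget."""

    def __init__(self, token_budget: int):
        self.token_budget = token_budget
        self.queue: list[Request] = []   # min-heap ordered by deadline_ms

    def submit(self, req: Request) -> None:
        heapq.heappush(self.queue, req)

    def next_batch(self) -> list[Request]:
        batch: list[Request] = []
        used = 0
        # Greedily admit the most urgent requests until the budget would overflow.
        while self.queue and used + self.queue[0].seq_len <= self.token_budget:
            req = heapq.heappop(self.queue)
            batch.append(req)
            used += req.seq_len
        return batch

# Usage: three requests compete for a 4096-token step; the two most urgent fit.
sched = TokenBudgetScheduler(token_budget=4096)
for rid, (deadline, tokens) in enumerate([(50.0, 1024), (20.0, 2048), (80.0, 3072)]):
    sched.submit(Request(deadline_ms=deadline, seq_len=tokens, rid=rid))
print([r.rid for r in sched.next_batch()])  # -> [1, 0]
```

A production scheduler would also have to handle preemption, KV-cache memory pressure, and per-tenant fairness; this sketch covers only the admission step the bullet above alludes to.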
Qualifications
Required Qualifications
Education: Doctorate in a relevant field OR equivalent experience.
Experience: 2+ years of experience in queuing/scheduling theory and practical request orchestration under SLO constraints.
Coding: 2+ years of experience in C++ and Python for high-performance systems, with reliable code quality and profiling/debugging skills.
Impact: Demonstrated research impact (publications and/or patents) and shipping systems that run at scale.
Preferred Qualifications
Transformer Expertise: Deep understanding of transformer inference efficiency techniques (paged Key-Value (KV) caching, speculative decoding, LoRA, sequence packing/continuous batching, quantization).
Serving Frameworks: Hands-on experience with inference serving frameworks (e.g., vLLM, Triton Inference Server, TensorRT-LLM, ONNX Runtime/ORT, Ray Serve, DeepSpeed-MII).
Cloud Systems: Background in cost/performance modeling, autoscaling, multi-region disaster recovery (DR), and familiarity with GPU/accelerator memory management concepts.