Senior Researcher: Microsoft
Dec 7, 2025
Location: United States
Deadline: Not specified
Experience: Senior
Continent: North America
Salary: $158,400 - $258,000 per year
This role sits within Microsoft's company-wide Systems Innovation initiative, which works to advance efficiency across AI systems—including models, AI frameworks, cloud infrastructure, and hardware. As part of an Applied Research team, you will drive mid- and long-term product innovations that impact hundreds of millions of customers.
The role blends rigorous research (theory and measurement) with hands-on engineering. You will focus on inventing, analyzing, and productionizing the next generation of serving architectures for transformer-based models across cloud and edge.
Key Responsibilities
Algorithmic Innovation: Invent and evaluate algorithms for dynamic batching, routing, and scheduling for transformer inference under multi-tenant Service Level Objectives (SLOs) and variable sequence lengths.
System Optimization: Design and implement caching layers (e.g., KV cache paging/offload, prompt/result caching) and memory pressure controls to maximize GPU/accelerator utilization.
Configuration & Safety: Develop endpoint configuration policies (e.g., tensor/pipeline parallelism, quantization profiles, speculative decoding) and safe rollout mechanisms.
Performance Tuning: Profile and optimize end-to-end serving pipelines, focusing on metrics such as per-token latency, end-to-end p95/p99 latency, throughput per dollar, and cold-start behavior.
Collaboration & Impact: Collaborate with model, kernel, and hardware teams; publish research, file patents, and contribute to open-source serving frameworks.
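To make the latency metrics above concrete, here is a minimal sketch of how p95/p99 latencies are commonly computed from request traces using the nearest-rank method; all names and sample values are hypothetical illustrations, not part of any Microsoft system.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile of a list of latency samples (ms)."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Nearest-rank method: take the ceil(pct/100 * n)-th value, 1-indexed.
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[rank - 1]

# Hypothetical end-to-end request latencies in milliseconds.
latencies_ms = [120, 95, 210, 180, 99, 450, 130, 105, 160, 300]
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)
```

In production serving systems these percentiles are usually tracked with streaming sketches (e.g., t-digest) rather than by sorting full traces, since traces at scale are too large to hold in memory.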
Qualifications
Required Qualifications
Education: Doctorate in a relevant field OR equivalent experience.
Experience: 2+ years of experience in queuing/scheduling theory and practical request orchestration under SLO constraints.
Technical Skills: 2+ years of experience in C++ and Python for high-performance systems, with reliable code quality and profiling/debugging skills.
Track Record: Demonstrated research impact (publications and/or patents) and experience shipping systems that run at scale.
Security: Ability to pass the Microsoft Cloud Background Check.
Preferred Qualifications
Transformer Efficiency: Deep understanding of techniques like attention mechanisms, paged Key-Value (KV) caching, speculative decoding, Low-Rank Adaptation (LoRA), sequence packing, and quantization.
Systems Modeling: Background in cost/performance modeling, autoscaling, and multi-region disaster recovery (DR).
Frameworks: Hands-on experience with inference serving frameworks such as vLLM, Triton Inference Server, TensorRT-LLM, ONNX Runtime (ORT), Ray Serve, or DeepSpeed-MII.
Hardware: Familiarity with GPU/accelerator memory management concepts to co-design cache and throughput policies.
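As a minimal sketch of the paged KV caching named above, in the spirit of vLLM's PagedAttention: logical token positions map to fixed-size physical blocks drawn from a shared pool, so sequences of varying length share accelerator memory without fragmentation. All class and variable names here are hypothetical illustrations, not any framework's real API.

```python
BLOCK_SIZE = 16  # tokens per physical KV block (illustrative value)

class BlockTable:
    """Tracks which physical blocks back one sequence's KV cache."""

    def __init__(self, free_blocks):
        self.free_blocks = free_blocks  # shared pool of free block ids
        self.blocks = []                # physical blocks owned by this sequence
        self.num_tokens = 0

    def append_token(self):
        """Reserve space for one more token, allocating a block on demand."""
        if self.num_tokens % BLOCK_SIZE == 0:
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted; evict or offload")
            self.blocks.append(self.free_blocks.pop())
        self.num_tokens += 1

    def physical_slot(self, pos):
        """Map a logical token position to (block_id, offset_within_block)."""
        return self.blocks[pos // BLOCK_SIZE], pos % BLOCK_SIZE

pool = list(range(8))        # 8 free physical blocks in the shared pool
seq = BlockTable(pool)
for _ in range(20):          # 20 tokens span two 16-token blocks
    seq.append_token()
```

The design choice this illustrates is indirection: attention kernels look up `(block_id, offset)` through the table, so blocks need not be contiguous, and memory-pressure controls can page or offload individual blocks.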