Senior Researcher: Microsoft
Dec 7, 2025
Location: United States
Deadline: Not specified
Experience: Senior
Continent: North America
Salary: $158,400 - $258,000 per year
This role sits within Microsoft's company-wide Systems Innovation initiative, which works to advance efficiency across AI systems—including models, AI frameworks, cloud infrastructure, and hardware. As part of an Applied Research team, you will drive mid- and long-term product innovations that impact hundreds of millions of customers.
The role blends rigorous research (theory and measurement) with hands-on engineering. You will focus on inventing, analyzing, and productionizing the next generation of serving architectures for transformer-based models across cloud and edge.
Key Responsibilities
Algorithmic Innovation: Invent and evaluate algorithms for dynamic batching, routing, and scheduling for transformer inference under multi-tenant Service Level Objectives (SLOs) and variable sequence lengths.
System Optimization: Design and implement caching layers (e.g., KV cache paging/offload, prompt/result caching) and memory pressure controls to maximize GPU/accelerator utilization.
Configuration & Safety: Develop endpoint configuration policies (e.g., tensor/pipeline parallelism, quantization profiles, speculative decoding) and safe rollout mechanisms.
Performance Tuning: Profile and optimize end-to-end serving pipelines, focusing on metrics such as per-token latency, end-to-end p95/p99 latency, throughput per dollar, and cold-start behavior.
Collaboration & Impact: Collaborate with model, kernel, and hardware teams; publish research, file patents, and contribute to open-source serving frameworks.
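To make the latency metrics above concrete, here is a minimal sketch of how p95/p99 latencies are commonly computed from request traces using the nearest-rank method; all names and sample values are hypothetical illustrations, not part of any Microsoft system.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile of a list of latency samples (ms)."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Nearest-rank method: take the ceil(pct/100 * n)-th value, 1-indexed.
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[rank - 1]

# Hypothetical end-to-end request latencies in milliseconds.
latencies_ms = [120, 95, 210, 180, 99, 450, 130, 105, 160, 300]
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)
```

In production serving systems these percentiles are usually tracked with streaming sketches (e.g., t-digest) rather than by sorting full traces, since traces at scale are too large to hold in memory.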
Qualifications
Required Qualifications
Education: Doctorate in a relevant field OR equivalent experience.
Experience: 2+ years of experience in queuing/scheduling theory and practical request orchestration under SLO constraints.
Technical Skills: 2+ years of experience in C++ and Python for high-performance systems, with reliable code quality and profiling/debugging skills.
Track Record: Demonstrated research impact (publications and/or patents) and experience shipping systems that run at scale.
Security: Ability to pass the Microsoft Cloud Background Check.
Preferred Qualifications
Transformer Efficiency: Deep understanding of techniques like attention mechanisms, paged Key-Value (KV) caching, speculative decoding, Low-Rank Adaptation (LoRA), sequence packing, and quantization.
Systems Modeling: Background in cost/performance modeling, autoscaling, and multi-region disaster recovery (DR).
Frameworks: Hands-on experience with inference serving frameworks such as vLLM, Triton Inference Server, TensorRT-LLM, ONNX Runtime (ORT), Ray Serve, or DeepSpeed-MII.
Hardware: Familiarity with GPU/accelerator memory management concepts to co-design cache and throughput policies.
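As a minimal sketch of the paged KV caching named above, in the spirit of vLLM's PagedAttention: logical token positions map to fixed-size physical blocks drawn from a shared pool, so sequences of varying length share accelerator memory without fragmentation. All class and variable names here are hypothetical illustrations, not any framework's real API.

```python
BLOCK_SIZE = 16  # tokens per physical KV block (illustrative value)

class BlockTable:
    """Tracks which physical blocks back one sequence's KV cache."""

    def __init__(self, free_blocks):
        self.free_blocks = free_blocks  # shared pool of free block ids
        self.blocks = []                # physical blocks owned by this sequence
        self.num_tokens = 0

    def append_token(self):
        """Reserve space for one more token, allocating a block on demand."""
        if self.num_tokens % BLOCK_SIZE == 0:
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted; evict or offload")
            self.blocks.append(self.free_blocks.pop())
        self.num_tokens += 1

    def physical_slot(self, pos):
        """Map a logical token position to (block_id, offset_within_block)."""
        return self.blocks[pos // BLOCK_SIZE], pos % BLOCK_SIZE

pool = list(range(8))        # 8 free physical blocks in the shared pool
seq = BlockTable(pool)
for _ in range(20):          # 20 tokens span two 16-token blocks
    seq.append_token()
```

The design choice this illustrates is indirection: attention kernels look up `(block_id, offset)` through the table, so blocks need not be contiguous, and memory-pressure controls can page or offload individual blocks.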