Long-form · cross-posted from mlsys.wuklab.io

Blog.

Long-form posts I have co-authored on the WukLab blog, covering system measurements and design decisions behind production LLM serving and post-training infrastructure. The full archive (including lab-mates' work) lives at mlsys.wuklab.io.

May 17, 2026
MLSys 2026
Beat the Long Tail: Distribution-Aware Speculative Decoding for RL Training
Vikranth Srivatsa, Yiying Zhang
Accelerates reinforcement-learning rollouts by targeting long generations with an adaptive speculative-decoding framework.
May 17, 2026
Pre-print
TClone: Decoupling Fast Branch Creation from Durable Checkpointing for Computer-Use Agents
Yutong Huang, Vikranth Srivatsa, Alex Asch, Hansin Tushar Patwa, Yiying Zhang
A workspace-versioning system enabling fast branching for agents through copy-on-write memory and filesystem sharing.
May 16, 2026
Multi-SLO serving
Nitsum: Serving Tiered LLM Requests with Adaptive Tensor Parallelism
Vikranth Srivatsa, Zijian He, Pu Guo, Dongming Li, Yiying Zhang
A runtime system treating tensor parallelism as a control surface to dynamically optimize mixed-priority workloads.
November 25, 2024
KDD 2026
Cognify: A Comprehensive, Multi-Faceted Gen-AI Workflow Optimizer
Yiying Zhang, Reyna Abhyankar, Zijian He (paper co-authored with Vikranth Srivatsa)
An autotuning framework for generative-AI workflows. Its search algorithm, AdaSeek, searches hierarchically across workflow structure, operators, and prompts, improving quality by up to 2.8× while cutting cost by up to 10× and latency by up to 2.7×. Accepted to KDD 2026.
September 10, 2024
Systems analysis
Can Scheduling Overhead Dominate LLM Inference Performance?
Vikranth Srivatsa, Dongming Li, Yiying Zhang, Reyna Abhyankar
An analysis of CPU scheduling overhead in modern LLM serving systems. The findings informed a redesign of the vLLM scheduler that improved performance by approximately 30%.
