Selected systems & earlier work

Projects.

A catalogue of the research systems I have built or co-built during my PhD, alongside earlier undergraduate work and creative side projects.

Research systems

DAS

MLSys 2026 · 2026

DAS: Distribution-Aware Speculative Decoding for RL Post-Training

A speculative-decoding framework that accelerates reinforcement-learning rollout generation for large language models. A length-aware speculation policy prioritizes aggressive decoding on long-tail trajectories, reducing rollout makespan by up to 50% on agentic reasoning workloads with no change in accuracy.

RL post-training · speculative decoding · scheduling

TClone

Pre-print · 2025

TClone: Low-Latency Forking of Live GUI Environments for Computer-Use Agents

Low-latency workspace versioning for computer-use agents, built on Linux kernel modifications, CRIU extensions, copy-on-write memory sharing, and filesystem versioning. Enables fast branching of live GUI workspaces with isolated process, memory, filesystem, and GUI state. Evaluated across 600+ agent tasks, achieving up to 1.9× lower end-to-end task latency than KVM baselines.

computer-use agents · kernel systems · CRIU

Nitsum

Pre-print · 2025

Nitsum: Serving Tiered LLM Requests with Adaptive Tensor Parallelism

A Rust-based global scheduler with efficient Python/C++ local schedulers and custom CUDA kernels, treating tensor parallelism as a control surface to dynamically meet strict per-tenant SLOs across mixed-priority workloads.

multi-SLO serving · tensor parallelism · CUDA

Preble: Distributed Prompt Scheduling for LLM Serving

ICLR 2025 · 2024

The first distributed LLM serving platform targeting prompt sharing. Co-optimizes KV-state reuse with computation load-balancing through a new scheduling algorithm and a hierarchical scheduling mechanism, outperforming the prior state of the art by 1.5–14.5× on average latency and 2–10× on p99.

distributed scheduling · prefix caching · load balancing

InferCept: Inference for Tool-Augmented LLMs

ICML 2024 · 2024

The first LLM inference framework targeting augmented language models. Minimizes GPU resource waste caused by external interceptions and dedicates the saved memory to additional requests, improving overall serving throughput by 1.6–2× and completing 2× more requests per second over the prior state of the art.

augmented LLMs · memory management · throughput

Cognify

KDD 2026 · 2025

Cognify: Hierarchical Autotuning for Gen-AI Workflows

An autotuning framework for generative-AI workflows. The AdaSeek algorithm performs hierarchical search across workflow structure, operators, and prompts under a fixed budget, improving generation quality by up to 2.8× while reducing monetary cost by up to 10× and end-to-end latency by 2.7×.

workflow optimization · autotuning · RAG

Scheduling Overhead Analysis in vLLM and SGLang

Blog post · 2024

An investigation of CPU scheduling overhead in modern LLM inference systems. The findings informed the vLLM scheduler redesign, yielding approximately 30% better performance, and were published as a widely circulated post at mlsys.wuklab.io.

systems analysis · vLLM · SGLang

Earlier work & side projects

Cloudless: multi-cloud serverless scheduling

UC Berkeley Thesis · 2023

A serverless scheduling framework spanning client, 5G, and cloud (AWS, GCP, Azure, Kubernetes). Heuristic and linear-programming-based code placement on compute nodes yielded a 70% cost reduction and a 30% performance improvement.

Distributed Systems · Serverless · 5G / Edge

NeurIPS 2021 Workshop · 2021

Worst-group Generalization

Investigated the effect of model size on subgroup robustness under empirical risk minimization. Found counter-evidence to the hypothesis that overparameterization hurts rare-subgroup accuracy, in the setting where subgroup labels are unknown.

ML Robustness · Distribution Shift

Smart Speaker: passive listening privacy

Berkeley CLTC · 2022

A smart-speaker testbed for studying the privacy implications of always-listening devices, with NLP intent classifiers and noise-filtering algorithms for passive-listening detection. Published at UC Berkeley CLTC.

Privacy · NLP · Hardware

2021

Fluid Simulation

A 3D Navier–Stokes water simulation in C++ with KD-tree-accelerated particle queries and parallel solvers.

Graphics · Simulation · C++

2020

By A Thread

A 3D-animated short about a mouse trying to reach the moon. Modeled in Maya and edited in After Effects and Premiere.

Animation · 3D · Maya