KVServe: Service-Aware KV Cache Compression for Communication-Efficient Disaggregated LLM Serving
Abstract
KVServe is a service-aware, adaptive framework for KV cache communication compression in disaggregated large language model serving. It selects compression strategies at runtime rather than fixing them in advance, and reports up to 9.13× job completion time (JCT) speedup and up to 32.8× time-to-first-token (TTFT) reduction.
LLMs are widely adopted in production, pushing inference systems to their limits. Disaggregated LLM serving (e.g., prefill-decode (PD) separation and KV state disaggregation) improves scalability and cost efficiency, but it also turns the KV cache into an explicit payload crossing network and storage boundaries, making KV transfer a dominant end-to-end bottleneck. Existing KV compression methods are typically static runtime configurations, even though the production service context varies over time in workload mix, bandwidth, and SLO/quality budgets. As a result, a fixed choice can be suboptimal or even increase latency. We present KVServe, the first service-aware and adaptive KV communication compression framework for disaggregated LLM serving. KVServe (1) unifies KV compression into a modular strategy space with new components and cross-method recomposition; (2) introduces a Bayesian Profiling Engine that efficiently searches this space and distills a 3D Pareto candidate set, reducing offline search overhead by 50×; and (3) deploys a Service-Aware Online Controller that combines an analytical latency model with a lightweight bandit to select profiles under constraints and correct offline-to-online mismatch. Integrated into vLLM and evaluated across datasets, models, GPUs, and networks, KVServe achieves up to 9.13× JCT speedup in PD-separated serving and up to 32.8× TTFT reduction in KV-disaggregated serving.
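To make concrete the claim that a fixed compression choice can increase latency, here is a minimal sketch of an analytical latency model with constrained profile selection. It is an illustration only and not KVServe's actual model or API: the `Profile` fields, the function names, and all numbers are hypothetical, and the bandit-based online correction described in the abstract is omitted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    """A hypothetical compression profile (all values are illustrative)."""
    name: str
    ratio: float            # compressed size / original size
    codec_s_per_gb: float   # compress + decompress time per GB of original KV
    quality_loss: float     # assumed quality drop relative to no compression


def transfer_latency_s(kv_gb: float, bandwidth_gbit_s: float, p: Profile) -> float:
    """Codec time plus wire time for moving kv_gb of KV cache."""
    codec = kv_gb * p.codec_s_per_gb
    wire = kv_gb * p.ratio * 8 / bandwidth_gbit_s  # GB -> Gbit
    return codec + wire


def pick(profiles, kv_gb, bandwidth_gbit_s, max_quality_loss):
    """Lowest-latency profile among those within the quality budget."""
    feasible = [p for p in profiles if p.quality_loss <= max_quality_loss]
    return min(feasible, key=lambda p: transfer_latency_s(kv_gb, bandwidth_gbit_s, p))


# Assumed candidate set; a real system would obtain these from profiling.
PROFILES = [
    Profile("none", ratio=1.0, codec_s_per_gb=0.0, quality_loss=0.0),
    Profile("light", ratio=0.5, codec_s_per_gb=0.05, quality_loss=0.0),
    Profile("aggressive", ratio=0.125, codec_s_per_gb=0.2, quality_loss=0.02),
]

KV_GB = 2.0
slow_loose = pick(PROFILES, KV_GB, 10, max_quality_loss=0.05)    # aggressive
slow_strict = pick(PROFILES, KV_GB, 10, max_quality_loss=0.01)   # light
fast_loose = pick(PROFILES, KV_GB, 400, max_quality_loss=0.05)   # none
print(slow_loose.name, slow_strict.name, fast_loose.name)
```

Under these assumed numbers, the aggressive profile is the fastest option on a 10 Gbit/s link but roughly ten times slower than sending uncompressed KV on a 400 Gbit/s link, because codec time dominates once the wire is fast. The best choice therefore depends on bandwidth and the quality budget, which is the motivation for selecting profiles at runtime.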
Community
We’re excited to share that KVServe has been accepted by SIGCOMM 2026 🎉
KVServe is a service-aware KV cache compression framework for communication-efficient disaggregated LLM serving. It treats KV compression as a runtime strategy selection problem rather than a fixed configuration, and achieves up to 9.13× JCT speedup in PD-separated serving and 32.8× TTFT reduction in KV-disaggregated serving. :)
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- TokenDance: Scaling Multi-Agent LLM Serving via Collective KV Cache Sharing (2026)
- SparKV: Overhead-Aware KV Cache Loading for Efficient On-Device LLM Inference (2026)
- CacheFlow: Efficient LLM Serving with 3D-Parallel KV Cache Restoration (2026)
- SplitZip: Ultra Fast Lossless KV Compression for Disaggregated LLM Serving (2026)
- One Pool, Two Caches: Adaptive HBM Partitioning for Accelerating Generative Recommender Serving (2026)
- KV-RM: Regularizing KV-Cache Movement for Static-Graph LLM Serving (2026)
- Prefill-as-a-Service: KVCache of Next-Generation Models Could Go Cross-Datacenter (2026)