@inproceedings{jin-etal-2025-distributed,
title = "Distributed {LLM} Serving on Consumer-Grade {GPU}s by Reconciling Computation and Communication",
author = "Jin, Lewei and
Zhang, Kui and
Chen, Yongqi and
Zhuoyifan and
Li, Renjie and
Gao, Yi and
Yang, Bowei and
Cai, Zhengong and
Dong, Wei",
editor = "Christodoulopoulos, Christos and
Chakraborty, Tanmoy and
Rose, Carolyn and
Peng, Violet",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2025",
month = nov,
year = "2025",
address = "Suzhou, China",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.findings-emnlp.957/",
pages = "17633--17642",
ISBN = "979-8-89176-335-7",
abstract = "Large language models are reshaping internet services. Serving these models is often costly, as it requires multiple high-end GPUs. Consumer-grade GPUs offer cheaper computational power, providing an opportunity for more cost-efficient LLM serving.Prior efforts have explored distributed serving at scale, primarily focusing on model deployment strategies. However, communication efficiency has emerged as a challenge due to the imbalance in data transfer volumes between the two phases of inference: prefill and decode. Prefill requests can involve transmitting up to 1000 times more data than decode requests, leading to decode requests being delayed. Consequently, servers are underutilized while waiting for decode requests. In this paper, we present MoLink, an efficient distributed LLM serving system. It splits the prolonged transmission volume of prefill requests into smaller chunks and carefully scheduling their transmission. It consists of two parts: (i) a transmission scheduling algorithm that fairly determines whether to transmit prefill or decode requests, and (ii) a chunking determination algorithm that determines the transmit volume for prefill requests just-in-time. Our evaluation demonstrates that MoLink reduces TTFT, TPOT, and latency compared to the state-of-the-art distributed LLM serving system, with a maximum reduction of up to 46{\%}."
}<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="jin-etal-2025-distributed">
<titleInfo>
<title>Distributed LLM Serving on Consumer-Grade GPUs by Reconciling Computation and Communication</title>
</titleInfo>
<name type="personal">
<namePart type="given">Lewei</namePart>
<namePart type="family">Jin</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Kui</namePart>
<namePart type="family">Zhang</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Yongqi</namePart>
<namePart type="family">Chen</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name>
<namePart>Zhuoyifan</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Renjie</namePart>
<namePart type="family">Li</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Yi</namePart>
<namePart type="family">Gao</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Bowei</namePart>
<namePart type="family">Yang</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Zhengong</namePart>
<namePart type="family">Cai</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Wei</namePart>
<namePart type="family">Dong</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<originInfo>
<dateIssued>2025-11</dateIssued>
</originInfo>
<typeOfResource>text</typeOfResource>
<relatedItem type="host">
<titleInfo>
<title>Findings of the Association for Computational Linguistics: EMNLP 2025</title>
</titleInfo>
<name type="personal">
<namePart type="given">Christos</namePart>
<namePart type="family">Christodoulopoulos</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Tanmoy</namePart>
<namePart type="family">Chakraborty</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Carolyn</namePart>
<namePart type="family">Rose</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Violet</namePart>
<namePart type="family">Peng</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<originInfo>
<publisher>Association for Computational Linguistics</publisher>
<place>
<placeTerm type="text">Suzhou, China</placeTerm>
</place>
</originInfo>
<genre authority="marcgt">conference publication</genre>
<identifier type="isbn">979-8-89176-335-7</identifier>
</relatedItem>
<abstract>Large language models are reshaping internet services. Serving these models is often costly, as it requires multiple high-end GPUs. Consumer-grade GPUs offer cheaper computational power, providing an opportunity for more cost-efficient LLM serving.Prior efforts have explored distributed serving at scale, primarily focusing on model deployment strategies. However, communication efficiency has emerged as a challenge due to the imbalance in data transfer volumes between the two phases of inference: prefill and decode. Prefill requests can involve transmitting up to 1000 times more data than decode requests, leading to decode requests being delayed. Consequently, servers are underutilized while waiting for decode requests. In this paper, we present MoLink, an efficient distributed LLM serving system. It splits the prolonged transmission volume of prefill requests into smaller chunks and carefully scheduling their transmission. It consists of two parts: (i) a transmission scheduling algorithm that fairly determines whether to transmit prefill or decode requests, and (ii) a chunking determination algorithm that determines the transmit volume for prefill requests just-in-time. Our evaluation demonstrates that MoLink reduces TTFT, TPOT, and latency compared to the state-of-the-art distributed LLM serving system, with a maximum reduction of up to 46%.</abstract>
<identifier type="citekey">jin-etal-2025-distributed</identifier>
<location>
<url>https://aclanthology.org/2025.findings-emnlp.957/</url>
</location>
<part>
<date>2025-11</date>
<extent unit="page">
<start>17633</start>
<end>17642</end>
</extent>
</part>
</mods>
</modsCollection>
%0 Conference Proceedings
%T Distributed LLM Serving on Consumer-Grade GPUs by Reconciling Computation and Communication
%A Jin, Lewei
%A Zhang, Kui
%A Chen, Yongqi
%A Zhuoyifan
%A Li, Renjie
%A Gao, Yi
%A Yang, Bowei
%A Cai, Zhengong
%A Dong, Wei
%Y Christodoulopoulos, Christos
%Y Chakraborty, Tanmoy
%Y Rose, Carolyn
%Y Peng, Violet
%S Findings of the Association for Computational Linguistics: EMNLP 2025
%D 2025
%8 November
%I Association for Computational Linguistics
%C Suzhou, China
%@ 979-8-89176-335-7
%F jin-etal-2025-distributed
%X Large language models are reshaping internet services. Serving these models is often costly, as it requires multiple high-end GPUs. Consumer-grade GPUs offer cheaper computational power, providing an opportunity for more cost-efficient LLM serving. Prior efforts have explored distributed serving at scale, primarily focusing on model deployment strategies. However, communication efficiency has emerged as a challenge due to the imbalance in data transfer volumes between the two phases of inference: prefill and decode. Prefill requests can involve transmitting up to 1000 times more data than decode requests, leading to decode requests being delayed. Consequently, servers are underutilized while waiting for decode requests. In this paper, we present MoLink, an efficient distributed LLM serving system. It splits the prolonged transmission volume of prefill requests into smaller chunks and carefully schedules their transmission. It consists of two parts: (i) a transmission scheduling algorithm that fairly determines whether to transmit prefill or decode requests, and (ii) a chunking determination algorithm that determines the transmit volume for prefill requests just-in-time. Our evaluation demonstrates that MoLink reduces TTFT, TPOT, and latency compared to the state-of-the-art distributed LLM serving system, with a maximum reduction of up to 46%.
%U https://aclanthology.org/2025.findings-emnlp.957/
%P 17633-17642
Markdown (Informal)
[Distributed LLM Serving on Consumer-Grade GPUs by Reconciling Computation and Communication](https://aclanthology.org/2025.findings-emnlp.957/) (Jin et al., Findings 2025)
ACL
- Lewei Jin, Kui Zhang, Yongqi Chen, Zhuoyifan, Renjie Li, Yi Gao, Bowei Yang, Zhengong Cai, and Wei Dong. 2025. Distributed LLM Serving on Consumer-Grade GPUs by Reconciling Computation and Communication. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 17633–17642, Suzhou, China. Association for Computational Linguistics.