In Defense of Structural Sparse Adapters for Concurrent LLM Serving

Junda Su; Zirui Liu; Zeju Qiu; Weiyang Liu; Zhaozhuo Xu

Abstract

Adapting large language models (LLMs) to specific tasks remains challenging due to the extensive retraining required, prompting the need for efficient adapter techniques. However, serving multiple adapters concurrently, each with its own matrix shapes, poses significant system-level challenges. To address these issues, we identify an opportunity in structurally sparse adapters, which, unlike low-rank adapters, maintain consistent matrix shapes while varying in sparsity patterns. Leveraging this characteristic, we introduce SpartanServe, a system designed for efficient concurrent serving of LLMs using multiple structurally sparse adapters. SpartanServe employs a unified matrix multiplication operation and a novel memory management technique to enable effective batching. Furthermore, Triton kernels accelerate matrix multiplication in the serving process. Experimental results demonstrate that SpartanServe achieves a 2.12× speedup over S-LoRA when serving 96 adapters on a single NVIDIA A100 GPU (40 GB), showcasing its efficacy in concurrent LLM serving.
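The shape argument in the abstract can be illustrated with a small NumPy sketch. It is not SpartanServe's implementation: the block-row sparsity pattern, the helper names `make_sparse_delta` and `batched_adapter_forward`, and the tiny dimensions are all assumptions made for illustration, and the paper's Triton kernels and memory management are not reproduced. The sketch only shows that adapters sharing one matrix shape can be stacked into a single tensor and applied to a mixed batch in one operation, even when their sparsity patterns differ.

```python
import numpy as np

def make_sparse_delta(rng, d_out, d_in, block, keep_blocks):
    """Hypothetical structurally sparse adapter: a (d_out, d_in) weight delta
    whose nonzeros are confined to a few randomly chosen block-rows."""
    delta = np.zeros((d_out, d_in))
    chosen = rng.choice(d_out // block, size=keep_blocks, replace=False)
    for b in chosen:
        delta[b * block:(b + 1) * block, :] = rng.standard_normal((block, d_in))
    return delta

def batched_adapter_forward(x, base_w, deltas, adapter_ids):
    """Apply a different adapter to each request in one batched operation.

    x: (batch, d_in), base_w: (d_out, d_in),
    deltas: (n_adapters, d_out, d_in), adapter_ids: (batch,)
    """
    base_out = x @ base_w.T                      # shared base model weights
    per_request = deltas[adapter_ids]            # gather: (batch, d_out, d_in)
    adapter_out = np.einsum("boi,bi->bo", per_request, x)
    return base_out + adapter_out

rng = np.random.default_rng(0)
d_in, d_out = 8, 8
base_w = rng.standard_normal((d_out, d_in))

# Three adapters with different sparsity levels but identical shapes,
# so they stack into one tensor without padding.
deltas = np.stack([
    make_sparse_delta(rng, d_out, d_in, block=2, keep_blocks=k)
    for k in (1, 2, 3)
])

x = rng.standard_normal((4, d_in))               # four concurrent requests
adapter_ids = np.array([0, 2, 1, 2])             # adapter chosen per request
y = batched_adapter_forward(x, base_w, deltas, adapter_ids)
print(y.shape)
```

By contrast, low-rank adapters with different ranks produce factor matrices of different shapes, so stacking them this way would first require padding or custom gather logic.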

Anthology ID: 2024.findings-emnlp.284
Volume: Findings of the Association for Computational Linguistics: EMNLP 2024
Month: November
Year: 2024
Address: Miami, Florida, USA
Editors: Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 4948–4953
URL: https://aclanthology.org/2024.findings-emnlp.284
Cite (ACL): Junda Su, Zirui Liu, Zeju Qiu, Weiyang Liu, and Zhaozhuo Xu. 2024. In Defense of Structural Sparse Adapters for Concurrent LLM Serving. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 4948–4953, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal): In Defense of Structural Sparse Adapters for Concurrent LLM Serving (Su et al., Findings 2024)
PDF: https://aclanthology.org/2024.findings-emnlp.284.pdf
