@inproceedings{hu-etal-2025-stitchllm,
    title = "{S}titch{LLM}: Serving {LLM}s, One Block at a Time",
    author = "Hu, Bodun and
      Li, Shuozhe and
      Agarwal, Saurabh and
      Lee, Myungjin and
      Jajoo, Akshay and
      Li, Jiamin and
      Xu, Le and
      Kim, Geon-Woo and
      Kim, Donghyun and
      Xu, Hong and
      Zhang, Amy and
      Akella, Aditya",
    editor = "Che, Wanxiang and
      Nabende, Joyce and
      Shutova, Ekaterina and
      Pilehvar, Mohammad Taher",
    booktitle = "Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = jul,
    year = "2025",
    address = "Vienna, Austria",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.acl-long.1305/",
    doi = "10.18653/v1/2025.acl-long.1305",
    pages = "26887--26903",
    ISBN = "979-8-89176-251-0",
    abstract = "The rapid evolution of large language models (LLMs) has revolutionized natural language processing (NLP) tasks such as text generation, translation, and comprehension. However, the increasing computational demands and inference costs of these models present significant challenges. This study investigates the dynamic and efficient utilization of pre-trained weights from open-sourced LLMs of varying parameter sizes to achieve an optimal balance between computational efficiency and task performance. Drawing inspiration from the dual-process theory of human cognition, we introduce StitchLLM: a dynamic model routing framework that employs a powerful bottom model to process all queries, and uses a lightweight routing mechanism to allocate computational resources appropriately. Our novel framework optimizes efficiency and maintains performance, leveraging a trainable stitching layer for seamless integration of decoder layers across different LLMs. Experimental results demonstrate that StitchLLM improves system throughput while minimizing performance degradation, offering a flexible solution for deploying LLMs in resource-constrained settings."
}

<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
  <mods ID="hu-etal-2025-stitchllm">
    <titleInfo>
      <title>StitchLLM: Serving LLMs, One Block at a Time</title>
    </titleInfo>
    <name type="personal">
      <namePart type="given">Bodun</namePart>
      <namePart type="family">Hu</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <name type="personal">
      <namePart type="given">Shuozhe</namePart>
      <namePart type="family">Li</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <name type="personal">
      <namePart type="given">Saurabh</namePart>
      <namePart type="family">Agarwal</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <name type="personal">
      <namePart type="given">Myungjin</namePart>
      <namePart type="family">Lee</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <name type="personal">
      <namePart type="given">Akshay</namePart>
      <namePart type="family">Jajoo</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <name type="personal">
      <namePart type="given">Jiamin</namePart>
      <namePart type="family">Li</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <name type="personal">
      <namePart type="given">Le</namePart>
      <namePart type="family">Xu</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <name type="personal">
      <namePart type="given">Geon-Woo</namePart>
      <namePart type="family">Kim</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <name type="personal">
      <namePart type="given">Donghyun</namePart>
      <namePart type="family">Kim</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <name type="personal">
      <namePart type="given">Hong</namePart>
      <namePart type="family">Xu</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <name type="personal">
      <namePart type="given">Amy</namePart>
      <namePart type="family">Zhang</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <name type="personal">
      <namePart type="given">Aditya</namePart>
      <namePart type="family">Akella</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <originInfo>
      <dateIssued>2025-07</dateIssued>
    </originInfo>
    <typeOfResource>text</typeOfResource>
    <relatedItem type="host">
      <titleInfo>
        <title>Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</title>
      </titleInfo>
      <name type="personal">
        <namePart type="given">Wanxiang</namePart>
        <namePart type="family">Che</namePart>
        <role>
          <roleTerm authority="marcrelator" type="text">editor</roleTerm>
        </role>
      </name>
      <name type="personal">
        <namePart type="given">Joyce</namePart>
        <namePart type="family">Nabende</namePart>
        <role>
          <roleTerm authority="marcrelator" type="text">editor</roleTerm>
        </role>
      </name>
      <name type="personal">
        <namePart type="given">Ekaterina</namePart>
        <namePart type="family">Shutova</namePart>
        <role>
          <roleTerm authority="marcrelator" type="text">editor</roleTerm>
        </role>
      </name>
      <name type="personal">
        <namePart type="given">Mohammad</namePart>
        <namePart type="given">Taher</namePart>
        <namePart type="family">Pilehvar</namePart>
        <role>
          <roleTerm authority="marcrelator" type="text">editor</roleTerm>
        </role>
      </name>
      <originInfo>
        <publisher>Association for Computational Linguistics</publisher>
        <place>
          <placeTerm type="text">Vienna, Austria</placeTerm>
        </place>
      </originInfo>
      <genre authority="marcgt">conference publication</genre>
      <identifier type="isbn">979-8-89176-251-0</identifier>
    </relatedItem>
    <abstract>The rapid evolution of large language models (LLMs) has revolutionized natural language processing (NLP) tasks such as text generation, translation, and comprehension. However, the increasing computational demands and inference costs of these models present significant challenges. This study investigates the dynamic and efficient utilization of pre-trained weights from open-sourced LLMs of varying parameter sizes to achieve an optimal balance between computational efficiency and task performance. Drawing inspiration from the dual-process theory of human cognition, we introduce StitchLLM: a dynamic model routing framework that employs a powerful bottom model to process all queries, and uses a lightweight routing mechanism to allocate computational resources appropriately. Our novel framework optimizes efficiency and maintains performance, leveraging a trainable stitching layer for seamless integration of decoder layers across different LLMs. Experimental results demonstrate that StitchLLM improves system throughput while minimizing performance degradation, offering a flexible solution for deploying LLMs in resource-constrained settings.</abstract>
    <identifier type="citekey">hu-etal-2025-stitchllm</identifier>
    <identifier type="doi">10.18653/v1/2025.acl-long.1305</identifier>
    <location>
      <url>https://aclanthology.org/2025.acl-long.1305/</url>
    </location>
    <part>
      <date>2025-07</date>
      <extent unit="page">
        <start>26887</start>
        <end>26903</end>
      </extent>
    </part>
  </mods>
</modsCollection>

%0 Conference Proceedings
%T StitchLLM: Serving LLMs, One Block at a Time
%A Hu, Bodun
%A Li, Shuozhe
%A Agarwal, Saurabh
%A Lee, Myungjin
%A Jajoo, Akshay
%A Li, Jiamin
%A Xu, Le
%A Kim, Geon-Woo
%A Kim, Donghyun
%A Xu, Hong
%A Zhang, Amy
%A Akella, Aditya
%Y Che, Wanxiang
%Y Nabende, Joyce
%Y Shutova, Ekaterina
%Y Pilehvar, Mohammad Taher
%S Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
%D 2025
%8 July
%I Association for Computational Linguistics
%C Vienna, Austria
%@ 979-8-89176-251-0
%F hu-etal-2025-stitchllm
%X The rapid evolution of large language models (LLMs) has revolutionized natural language processing (NLP) tasks such as text generation, translation, and comprehension. However, the increasing computational demands and inference costs of these models present significant challenges. This study investigates the dynamic and efficient utilization of pre-trained weights from open-sourced LLMs of varying parameter sizes to achieve an optimal balance between computational efficiency and task performance. Drawing inspiration from the dual-process theory of human cognition, we introduce StitchLLM: a dynamic model routing framework that employs a powerful bottom model to process all queries, and uses a lightweight routing mechanism to allocate computational resources appropriately. Our novel framework optimizes efficiency and maintains performance, leveraging a trainable stitching layer for seamless integration of decoder layers across different LLMs. Experimental results demonstrate that StitchLLM improves system throughput while minimizing performance degradation, offering a flexible solution for deploying LLMs in resource-constrained settings.
%R 10.18653/v1/2025.acl-long.1305
%U https://aclanthology.org/2025.acl-long.1305/
%U https://doi.org/10.18653/v1/2025.acl-long.1305
%P 26887-26903

Markdown (Informal)
[StitchLLM: Serving LLMs, One Block at a Time](https://aclanthology.org/2025.acl-long.1305/) (Hu et al., ACL 2025)

ACL
Bodun Hu, Shuozhe Li, Saurabh Agarwal, Myungjin Lee, Akshay Jajoo, Jiamin Li, Le Xu, Geon-Woo Kim, Donghyun Kim, Hong Xu, Amy Zhang, and Aditya Akella. 2025. StitchLLM: Serving LLMs, One Block at a Time. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 26887–26903, Vienna, Austria. Association for Computational Linguistics.
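
To make the abstract's architecture concrete, here is a minimal, hypothetical sketch of the dual-path routing idea it describes: shared bottom decoder blocks process every query, a lightweight router scores the intermediate hidden states, and a trainable stitching layer bridges into a different model's upper decoder blocks when more capacity is warranted. This is not the authors' implementation; every class name, dimension, and the simple thresholding policy below are illustrative assumptions.

```python
# Hypothetical sketch of the routing idea in the StitchLLM abstract.
# None of these names or design details come from the paper.
import torch
import torch.nn as nn


class StitchingLayer(nn.Module):
    """Trainable linear bridge between two models' hidden dimensions."""

    def __init__(self, dim_in: int, dim_out: int):
        super().__init__()
        self.proj = nn.Linear(dim_in, dim_out)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        return self.proj(hidden)


class Router(nn.Module):
    """Lightweight scorer deciding which upper stack a query continues into."""

    def __init__(self, dim_in: int):
        super().__init__()
        self.score = nn.Linear(dim_in, 1)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # Mean-pool over the sequence, then squash to a score in [0, 1].
        return torch.sigmoid(self.score(hidden.mean(dim=1)))


def stitched_forward(bottom_blocks, upper_small, upper_large,
                     stitch, router, hidden, threshold=0.5):
    """Run the shared bottom blocks, then route to one of two upper stacks.

    `hidden` is a (1, seq_len, dim) tensor of embedded tokens; routing is
    per query (batch size 1) for clarity.
    """
    for block in bottom_blocks:          # every query pays this cost
        hidden = block(hidden)
    if router(hidden).item() >= threshold:   # router flags a "hard" query
        hidden = stitch(hidden)               # bridge into the larger stack
        for block in upper_large:
            hidden = block(hidden)
    else:                                     # "easy" query stays on the small path
        for block in upper_small:
            hidden = block(hidden)
    return hidden


if __name__ == "__main__":
    # Toy usage: identity "blocks" stand in for decoder layers.
    dim = 16
    x = torch.randn(1, 8, dim)
    out = stitched_forward(
        bottom_blocks=[nn.Identity()],
        upper_small=[nn.Identity()],
        upper_large=[nn.Identity()],
        stitch=StitchingLayer(dim, dim),
        router=Router(dim),
        hidden=x,
    )
    print(out.shape)  # torch.Size([1, 8, 16])
```

Under these assumptions, the router's threshold is the knob trading throughput against quality: routing more queries to the small upper stack raises throughput, while the stitching layer is what lets hidden states from one model's lower blocks feed another model's upper blocks at all.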