@inproceedings{dialameh-etal-2025-echo,
title = "{ECHO}-{LL}a{MA}: Efficient Caching for High-Performance {LL}a{MA} Training",
author = "Dialameh, Maryam and
Karim, Rezaul and
Rajabzadeh, Hossein and
Mohamed Awad, Omar and
Chen, Boxing and
Kwon, Hyock Ju and
Ahmed, Walid and
Liu, Yang",
editor = "Potdar, Saloni and
Rojas-Barahona, Lina and
Montella, Sebastien",
booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track",
month = nov,
year = "2025",
address = "Suzhou (China)",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.emnlp-industry.156/",
pages = "2252--2269",
ISBN = "979-8-89176-333-3",
    abstract = "This paper introduces ECHO-LLaMA, an efficient LLaMA architecture designed to improve both the training speed and inference throughput of LLaMA models while maintaining their learning capacity. ECHO-LLaMA adapts LLaMA models to use shared KV caching across certain layers, significantly reducing KV computational complexity while maintaining or improving language performance. Experimental results demonstrate that ECHO-LLaMA achieves up to 77{\%} higher tokens-per-second throughput during training, up to 16{\%} higher Model FLOPs Utilization (MFU), and up to 14{\%} lower loss when trained on an equal number of tokens. Furthermore, on the 1.1B model, ECHO-LLaMA delivers approximately 7{\%} higher test-time throughput compared to the baseline. By introducing a computationally efficient adaptation mechanism, ECHO-LLaMA offers a scalable and cost-effective solution for pretraining and finetuning large language models, enabling faster and more resource-efficient training without compromising performance."
}
<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="dialameh-etal-2025-echo">
<titleInfo>
<title>ECHO-LLaMA: Efficient Caching for High-Performance LLaMA Training</title>
</titleInfo>
<name type="personal">
<namePart type="given">Maryam</namePart>
<namePart type="family">Dialameh</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Rezaul</namePart>
<namePart type="family">Karim</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Hossein</namePart>
<namePart type="family">Rajabzadeh</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Omar</namePart>
<namePart type="family">Mohamed Awad</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Boxing</namePart>
<namePart type="family">Chen</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Hyock</namePart>
<namePart type="given">Ju</namePart>
<namePart type="family">Kwon</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Walid</namePart>
<namePart type="family">Ahmed</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Yang</namePart>
<namePart type="family">Liu</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<originInfo>
<dateIssued>2025-11</dateIssued>
</originInfo>
<typeOfResource>text</typeOfResource>
<relatedItem type="host">
<titleInfo>
<title>Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track</title>
</titleInfo>
<name type="personal">
<namePart type="given">Saloni</namePart>
<namePart type="family">Potdar</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Lina</namePart>
<namePart type="family">Rojas-Barahona</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Sebastien</namePart>
<namePart type="family">Montella</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<originInfo>
<publisher>Association for Computational Linguistics</publisher>
<place>
<placeTerm type="text">Suzhou (China)</placeTerm>
</place>
</originInfo>
<genre authority="marcgt">conference publication</genre>
<identifier type="isbn">979-8-89176-333-3</identifier>
</relatedItem>
<abstract>This paper introduces ECHO-LLaMA, an efficient LLaMA architecture designed to improve both the training speed and inference throughput of LLaMA models while maintaining their learning capacity. ECHO-LLaMA adapts LLaMA models to use shared KV caching across certain layers, significantly reducing KV computational complexity while maintaining or improving language performance. Experimental results demonstrate that ECHO-LLaMA achieves up to 77% higher tokens-per-second throughput during training, up to 16% higher Model FLOPs Utilization (MFU), and up to 14% lower loss when trained on an equal number of tokens. Furthermore, on the 1.1B model, ECHO-LLaMA delivers approximately 7% higher test-time throughput compared to the baseline. By introducing a computationally efficient adaptation mechanism, ECHO-LLaMA offers a scalable and cost-effective solution for pretraining and finetuning large language models, enabling faster and more resource-efficient training without compromising performance.</abstract>
<identifier type="citekey">dialameh-etal-2025-echo</identifier>
<location>
<url>https://aclanthology.org/2025.emnlp-industry.156/</url>
</location>
<part>
<date>2025-11</date>
<extent unit="page">
<start>2252</start>
<end>2269</end>
</extent>
</part>
</mods>
</modsCollection>
%0 Conference Proceedings
%T ECHO-LLaMA: Efficient Caching for High-Performance LLaMA Training
%A Dialameh, Maryam
%A Karim, Rezaul
%A Rajabzadeh, Hossein
%A Mohamed Awad, Omar
%A Chen, Boxing
%A Kwon, Hyock Ju
%A Ahmed, Walid
%A Liu, Yang
%Y Potdar, Saloni
%Y Rojas-Barahona, Lina
%Y Montella, Sebastien
%S Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track
%D 2025
%8 November
%I Association for Computational Linguistics
%C Suzhou (China)
%@ 979-8-89176-333-3
%F dialameh-etal-2025-echo
%X This paper introduces ECHO-LLaMA, an efficient LLaMA architecture designed to improve both the training speed and inference throughput of LLaMA models while maintaining their learning capacity. ECHO-LLaMA adapts LLaMA models to use shared KV caching across certain layers, significantly reducing KV computational complexity while maintaining or improving language performance. Experimental results demonstrate that ECHO-LLaMA achieves up to 77% higher tokens-per-second throughput during training, up to 16% higher Model FLOPs Utilization (MFU), and up to 14% lower loss when trained on an equal number of tokens. Furthermore, on the 1.1B model, ECHO-LLaMA delivers approximately 7% higher test-time throughput compared to the baseline. By introducing a computationally efficient adaptation mechanism, ECHO-LLaMA offers a scalable and cost-effective solution for pretraining and finetuning large language models, enabling faster and more resource-efficient training without compromising performance.
%U https://aclanthology.org/2025.emnlp-industry.156/
%P 2252-2269
Markdown (Informal)
[ECHO-LLaMA: Efficient Caching for High-Performance LLaMA Training](https://aclanthology.org/2025.emnlp-industry.156/) (Dialameh et al., EMNLP 2025)
ACL
- Maryam Dialameh, Rezaul Karim, Hossein Rajabzadeh, Omar Mohamed Awad, Boxing Chen, Hyock Ju Kwon, Walid Ahmed, and Yang Liu. 2025. ECHO-LLaMA: Efficient Caching for High-Performance LLaMA Training. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 2252–2269, Suzhou (China). Association for Computational Linguistics.
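
The abstract above describes cross-layer KV-cache sharing only at a high level. As a rough illustration, the toy PyTorch sketch below shows one way a decoder stack could let some attention layers reuse the key/value projections computed by an earlier layer, so that KV work (and the KV cache at inference) is shared between them. All names here (ToyAttention, ToySharedKVStack, share_every) are hypothetical; the block omits causal masking, multi-head attention, and residual/MLP paths, and it is not the ECHO-LLaMA implementation described in the paper.

```python
# Illustrative sketch only: a toy attention stack in which every
# `share_every`-th layer computes K/V and the layers in between reuse that
# (K, V) pair. This is an assumption-laden reading of "shared KV caching
# across certain layers" from the abstract, not the paper's actual method.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyAttention(nn.Module):
    """Single-head attention that either computes its own K/V or reuses a
    (K, V) pair produced by an earlier, KV-owning layer."""

    def __init__(self, d_model: int, owns_kv: bool):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model, bias=False)
        self.o_proj = nn.Linear(d_model, d_model, bias=False)
        self.owns_kv = owns_kv
        if owns_kv:  # only KV-owning layers pay for K/V projections
            self.k_proj = nn.Linear(d_model, d_model, bias=False)
            self.v_proj = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x, shared_kv=None):
        q = self.q_proj(x)
        if self.owns_kv:
            k, v = self.k_proj(x), self.v_proj(x)
        else:
            k, v = shared_kv  # reuse K/V from the designated earlier layer
        attn = F.softmax(q @ k.transpose(-2, -1) / q.size(-1) ** 0.5, dim=-1)
        return self.o_proj(attn @ v), (k, v)


class ToySharedKVStack(nn.Module):
    """Stack where layers 0, share_every, 2*share_every, ... own K/V and the
    remaining layers reuse the most recently computed (K, V)."""

    def __init__(self, n_layers=4, d_model=32, share_every=2):
        super().__init__()
        self.layers = nn.ModuleList(
            [ToyAttention(d_model, owns_kv=(i % share_every == 0))
             for i in range(n_layers)]
        )

    def forward(self, x):
        kv = None
        for layer in self.layers:
            x, kv = layer(x, shared_kv=kv)
        return x


if __name__ == "__main__":
    model = ToySharedKVStack()
    tokens = torch.randn(1, 8, 32)  # (batch, seq_len, d_model)
    print(model(tokens).shape)      # torch.Size([1, 8, 32])
```

In this toy sharing pattern, the layers that do not own K/V projections hold fewer parameters and skip the K/V matmuls entirely, which is the general kind of saving the abstract's training-throughput and MFU claims point at; the real architecture and sharing schedule are specified in the paper itself.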