Cost-Optimal Grouped-Query Attention for Long-Context Modeling

Yingfa Chen; Yutong Wu; Chenyang Song; Zhen Leng Thai; Xingyu Shen; Xu Han (韩旭); Zhiyuan Liu; Maosong Sun

doi:10.18653/v1/2025.emnlp-main.272

Abstract

Grouped-Query Attention (GQA) is a widely adopted strategy for reducing the computational cost of attention layers in large language models (LLMs). However, current GQA configurations are often suboptimal because they overlook how context length influences inference cost. Since inference cost grows with context length, the most cost-efficient GQA configuration should vary accordingly. In this work, we analyze the relationship among context length, model size, GQA configuration, and model loss, and introduce two innovations: (1) we decouple the total head size from the hidden size, enabling more flexible control over attention FLOPs; and (2) we jointly optimize the model size and the GQA configuration to arrive at a better allocation of inference resources between attention layers and other components. Our analysis reveals that commonly used GQA configurations are highly suboptimal for long-context scenarios. Moreover, we propose a recipe for deriving cost-optimal GQA configurations. Our results show that for long-context scenarios, one should use fewer attention heads while scaling up the model size. Configurations selected by our recipe can reduce both memory usage and FLOPs by more than 50% compared to Llama-3’s GQA, with *no degradation in model capabilities*. Our findings offer valuable insights for designing efficient long-context LLMs.
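The dependence of inference cost on context length and head counts can be illustrated with a back-of-the-envelope cost model. The sketch below is not the paper's recipe or its exact cost formulas. It assumes the standard approximations that the KV cache stores one key and one value vector per KV head per layer per token, and that decoding one token costs about 4 FLOPs per query-head dimension per cached token (one multiply-add each for the attention scores and the weighted sum of values). The `GQAConfig` class, the function names, and the `fewer_heads` configuration are hypothetical and chosen only for illustration; the `baseline` shape (32 layers, 32 query heads, 8 KV heads, head dimension 128) is assumed to match the publicly documented Llama-3-8B attention layout.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GQAConfig:
    """Hypothetical container for the attention shape of a GQA model."""
    n_layers: int
    n_q_heads: int    # number of query heads
    n_kv_heads: int   # number of key/value heads (shared across query groups)
    head_dim: int     # per-head dimension, not tied to the hidden size here


def kv_cache_bytes(cfg: GQAConfig, context_len: int, bytes_per_elem: int = 2) -> int:
    """Approximate KV cache size: keys and values for every layer, KV head, and token."""
    return 2 * cfg.n_layers * cfg.n_kv_heads * cfg.head_dim * context_len * bytes_per_elem


def attn_flops_per_token(cfg: GQAConfig, context_len: int) -> int:
    """Approximate attention FLOPs to decode one token against `context_len` cached tokens.

    Counts QK^T scores (2 * head_dim FLOPs per cached token per query head)
    plus the weighted sum over values (another 2 * head_dim). Projections,
    softmax, and feed-forward layers are deliberately ignored.
    """
    return 4 * cfg.n_layers * cfg.n_q_heads * cfg.head_dim * context_len


# Assumed baseline shape and a hypothetical configuration with fewer heads.
baseline = GQAConfig(n_layers=32, n_q_heads=32, n_kv_heads=8, head_dim=128)
fewer_heads = GQAConfig(n_layers=32, n_q_heads=8, n_kv_heads=2, head_dim=128)

for context_len in (8_192, 131_072):
    for name, cfg in (("baseline", baseline), ("fewer_heads", fewer_heads)):
        cache_gib = kv_cache_bytes(cfg, context_len) / 2**30
        gflops = attn_flops_per_token(cfg, context_len) / 1e9
        print(f"T={context_len:>7} {name:<12} KV cache={cache_gib:6.2f} GiB "
              f"attention={gflops:7.2f} GFLOPs/token")
```

Under these assumptions both quantities grow linearly with context length, so the absolute savings from using fewer heads widen as the context gets longer. The sketch covers only the cost side: it says nothing about model loss, which the paper accounts for by jointly choosing the model size and the GQA configuration.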

Anthology ID: 2025.emnlp-main.272
Volume: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month: November
Year: 2025
Address: Suzhou, China
Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 5360–5376
URL: https://aclanthology.org/2025.emnlp-main.272/
DOI: 10.18653/v1/2025.emnlp-main.272
Cite (ACL): Yingfa Chen, Yutong Wu, Chenyang Song, Zhen Leng Thai, Xingyu Shen, Xu Han, Zhiyuan Liu, and Maosong Sun. 2025. Cost-Optimal Grouped-Query Attention for Long-Context Modeling. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 5360–5376, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): Cost-Optimal Grouped-Query Attention for Long-Context Modeling (Chen et al., EMNLP 2025)
PDF: https://aclanthology.org/2025.emnlp-main.272.pdf
Checklist: 2025.emnlp-main.272.checklist.pdf