Context Length Alone Hurts LLM Performance Despite Perfect Retrieval

Yufeng Du; Minyang Tian; Srikanth Ronanki; Subendhu Rongali; Sravan Babu Bodapati; Aram Galstyan; Azton Wells; Roy Schwartz; Eliu A Huerta; Hao Peng

doi:10.18653/v1/2025.findings-emnlp.1264

Context Length Alone Hurts LLM Performance Despite Perfect Retrieval

Yufeng Du, Minyang Tian, Srikanth Ronanki, Subendhu Rongali, Sravan Babu Bodapati, Aram Galstyan, Azton Wells, Roy Schwartz, Eliu A Huerta, Hao Peng

Abstract

Large language models (LLMs) often fail to scale their performance on long-context tasks performance in line with the context lengths they support. This gap is commonly attributed to retrieval failures—the models’ inability to identify information in the long inputs that is relevant to the task they are solving. Accordingly, recent efforts often focus on evaluating and improving LLMs’ retrieval performance: if retrieval is perfect, a model should, in principle, perform just as well on a long input as it does on a short one—or should it? This paper presents findings that the answer to this question may be negative. Our systematic experiments across 5 open- and closed-source LLMs on math, question answering, and coding tasks reveal that, even when models can perfectly retrieve all relevant information, their performance still degrades substantially (13.9%–85%) as input length increases but remains well within their claimed context lengths. This failure occurs even when the irrelevant tokens are replaced with minimally distracting whitespace, and, more surprisingly, when they are all masked and the models are forced to attend only to the relevant tokens. A similar performance drop is observed when all relevant evidence is placed immediately before the question. Our findings reveal a previously-unrealized limitation: the sheer length of the input alone can hurt LLM performance, independent of retrieval quality and without any distraction. They motivate our simple, model-agnostic mitigation strategy that transforms a long-context task into a short-context one by prompting the model to recite the retrieved evidence before attempting to solve the problem. On RULER, we observe a consistent improvement of GPT-4o up to 4% on an already strong baseline.

Anthology ID:: 2025.findings-emnlp.1264
Volume:: Findings of the Association for Computational Linguistics: EMNLP 2025
Month:: November
Year:: 2025
Address:: Suzhou, China
Editors:: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:: Findings
SIG:
Publisher:: Association for Computational Linguistics
Note:
Pages:: 23281–23298
Language:
URL:: https://aclanthology.org/2025.findings-emnlp.1264/
DOI:: 10.18653/v1/2025.findings-emnlp.1264
Bibkey:
Cite (ACL):: Yufeng Du, Minyang Tian, Srikanth Ronanki, Subendhu Rongali, Sravan Babu Bodapati, Aram Galstyan, Azton Wells, Roy Schwartz, Eliu A Huerta, and Hao Peng. 2025. Context Length Alone Hurts LLM Performance Despite Perfect Retrieval. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 23281–23298, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):: Context Length Alone Hurts LLM Performance Despite Perfect Retrieval (Du et al., Findings 2025)
Copy Citation:
PDF:: https://aclanthology.org/2025.findings-emnlp.1264.pdf
Checklist:: 2025.findings-emnlp.1264.checklist.pdf

PDF Cite Search Checklist Fix data