Long-context Language Models Fail in Basic Retrieval Tasks Without Sufficient Reasoning Steps

Yijiong Yu, Zhixiao Qi, Yongfeng Huang, Wei Wang, Weifeng.liu, Ran Chen, Ji Pei


Abstract
Long-context language models (LCLMs), characterized by their extensive context windows, are becoming popular. However, although they are nearly perfect at standard long-context retrieval tasks, our evaluations demonstrate that they fail in some basic cases. We further find that these failures can be well addressed with a sufficient number of reasoning steps, guided by specific CoT prompts. This result highlights the potential necessity of solving specific long-context tasks with long-CoT methods, whereas previous long-context benchmarks have ignored the need for long reasoning in long-context tasks and treated them as direct QA tasks. Our code and datasets are available at https://github.com/yuyijiong/hard_retrieval_for_llm
Anthology ID:
2025.findings-emnlp.301
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2025
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
5615–5634
URL:
https://aclanthology.org/2025.findings-emnlp.301/
Cite (ACL):
Yijiong Yu, Zhixiao Qi, Yongfeng Huang, Wei Wang, Weifeng.liu, Ran Chen, and Ji Pei. 2025. Long-context Language Models Fail in Basic Retrieval Tasks Without Sufficient Reasoning Steps. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 5615–5634, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Long-context Language Models Fail in Basic Retrieval Tasks Without Sufficient Reasoning Steps (Yu et al., Findings 2025)
PDF:
https://aclanthology.org/2025.findings-emnlp.301.pdf
Checklist:
2025.findings-emnlp.301.checklist.pdf