Zhixiao Qi


2025

Long-context Language Models Fail in Basic Retrieval Tasks Without Sufficient Reasoning Steps
Yijiong Yu | Zhixiao Qi | Yongfeng Huang | Wei Wang | Weifeng Liu | Ran Chen | Ji Pei
Findings of the Association for Computational Linguistics: EMNLP 2025

Long-context language models (LCLMs), characterized by their extensive context windows, are becoming popular. However, although they are nearly perfect at standard long-context retrieval tasks, our evaluations show that they fail in some basic cases. We further find that these failures can be largely resolved with a sufficient number of reasoning steps, guided by specific CoT prompts. This result highlights the potential necessity of solving certain long-context tasks with long-CoT methods, whereas previous long-context benchmarks have ignored the need for long reasoning and treated these tasks as direct QA. Our code and datasets are available at https://github.com/yuyijiong/hard_retrieval_for_llm
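
The contrast the abstract draws between direct QA prompting and CoT-guided retrieval can be illustrated with a small sketch. The snippet below is a hypothetical illustration, not the paper's actual prompts or evaluation code: the prompt wording, question, input file, and model name are assumptions, and the client calls follow the standard OpenAI Python SDK.

```python
# Hypothetical sketch: comparing a direct QA prompt with a CoT-style prompt
# that asks the model for intermediate reasoning steps before answering.
# The prompt wording, question, and model name are illustrative assumptions;
# the client usage follows the OpenAI Python SDK.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

long_context = open("documents.txt").read()    # placeholder long context
question = "Which document mentions event X?"  # hypothetical retrieval question

# Direct QA prompt: ask for the answer immediately.
direct_prompt = f"{long_context}\n\nQuestion: {question}\nAnswer:"

# CoT-style prompt: explicitly request step-by-step reasoning
# (listing and checking candidate passages) before the final answer.
cot_prompt = (
    f"{long_context}\n\nQuestion: {question}\n"
    "Think step by step: first list every passage that might be relevant, "
    "then check each candidate against the question, and only then give the final answer."
)

for name, prompt in [("direct", direct_prompt), ("cot", cot_prompt)]:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # any long-context chat model
        messages=[{"role": "user", "content": prompt}],
    )
    print(f"[{name}] {response.choices[0].message.content}")
```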