You Only Use Reactive Attention Slice When Retrieving From Long Context

Yun Joon Soh, Hanxian Huang, Yuandong Tian, Jishen Zhao


Abstract
Retrieval-Augmented Generation is a powerful method for enhancing language models (LMs), but existing retrieval techniques are limited. Embedding-based methods are often inaccurate due to their reliance on lexical similarity, while neural retrievers are computationally expensive to train. To overcome these issues, we introduce You Only Use Reactive Attention slice (YOURA), a training-free and fine-tuning-free attention-based retrieval technique. When retrieving, YOURA uses a novel reaction score heuristic, which quantifies how an LM’s self-attention “reacts” to a user query. We also propose a sentence extraction algorithm to efficiently preprocess the context. Evaluations on three open-source LMs using the LongBench and BABILong datasets show YOURA’s effectiveness. Our framework improves QA task accuracy by up to 15% and inference throughput by up to 31% compared to embedding-based retrieval.
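The abstract describes the reaction score only informally. For intuition, below is a minimal sketch, assuming the score is the change in attention mass each context sentence receives from the final token once the user query is appended; the model choice (gpt2 as a stand-in), the prompt format, and the token-to-sentence mapping are all illustrative assumptions, not the authors' exact algorithm.

# Hypothetical sketch of a "reaction score": rank context sentences by how
# much the attention they receive changes once the user query is appended.
# The scoring rule, model, and token-offset mapping are illustrative
# assumptions, not YOURA's published method.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"  # stand-in; the paper evaluates three larger open-source LMs
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL).eval()

def sentence_attention_mass(sentences, query=None):
    """Attention each context sentence receives from the final token,
    averaged over all layers and heads."""
    text = " ".join(sentences) + (f" Question: {query}" if query else "")
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, output_attentions=True)
    att = torch.stack(out.attentions)                 # (layers, 1, heads, seq, seq)
    from_last = att[:, 0, :, -1, :].mean(0).mean(0)   # (seq,) attention from last token
    masses, pos = [], 0
    for i, sent in enumerate(sentences):
        # Assumption: re-tokenizing each sentence (with the joining space)
        # approximates its token offsets inside the full prompt.
        piece = (" " if i else "") + sent
        n = len(tok(piece, add_special_tokens=False).input_ids)
        masses.append(from_last[pos:pos + n].sum().item())
        pos += n
    return masses

def reaction_scores(sentences, query):
    """How strongly each sentence's attention 'reacts' to the query:
    attention mass with the query minus attention mass without it."""
    base = sentence_attention_mass(sentences)
    reacted = sentence_attention_mass(sentences, query)
    return [r - b for r, b in zip(reacted, base)]

sents = ["Alice moved to Paris in 2019.",
         "The weather was mild that spring.",
         "She works as a chemist."]
print(reaction_scores(sents, "Where does Alice live?"))

Sentences with the highest scores would then be retrieved into the prompt; because the scores come from the LM's own attention rather than an embedding index, no retriever training or fine-tuning is involved, which matches the training-free framing above.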
Anthology ID:
2025.findings-emnlp.1125
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2025
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
20665–20686
URL:
https://aclanthology.org/2025.findings-emnlp.1125/
Cite (ACL):
Yun Joon Soh, Hanxian Huang, Yuandong Tian, and Jishen Zhao. 2025. You Only Use Reactive Attention Slice When Retrieving From Long Context. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 20665–20686, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
You Only Use Reactive Attention Slice When Retrieving From Long Context (Soh et al., Findings 2025)
PDF:
https://aclanthology.org/2025.findings-emnlp.1125.pdf
Checklist:
 2025.findings-emnlp.1125.checklist.pdf