Semantic Masking in a Needle-in-a-haystack Test for Evaluating Large Language Model Long-Text Capabilities

Ken Shi, Gerald Penn


Abstract
In this paper, we introduce the concept of Semantic Masking, where semantically coherent surrounding text (the haystack) interferes with the retrieval and comprehension of specific information (the needle) embedded within it. We propose the Needle-in-a-Haystack-QA Test, an evaluation pipeline that assesses LLMs’ long-text capabilities through question answering while explicitly accounting for the Semantic Masking effect. Our experiments demonstrate that Semantic Masking impacts LLM performance significantly more than text length does. By accounting for Semantic Masking, we provide a more accurate assessment of LLMs’ true proficiency in utilizing extended contexts, paving the way for future research to develop models that are not only capable of handling longer inputs but are also adept at navigating complex semantic landscapes.
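
The abstract describes the Needle-in-a-Haystack-QA setup only at a high level. As a minimal illustrative sketch (not the authors' released pipeline), the Python snippet below shows how a single test item might be constructed: a needle fact is embedded at a chosen depth inside haystack text and paired with a question answerable only from the needle. All names here (build_item, depth, the example strings) are hypothetical.

def build_item(haystack_paragraphs, needle_sentence, question, answer, depth=0.5):
    # depth is the relative insertion point: 0.0 = start of the haystack, 1.0 = end.
    k = int(round(depth * len(haystack_paragraphs)))
    context = haystack_paragraphs[:k] + [needle_sentence] + haystack_paragraphs[k:]
    return {
        "context": "\n\n".join(context),
        "question": question,
        "gold_answer": answer,
        "needle_depth": depth,
    }

# Example: a topically coherent haystack would probe the Semantic Masking effect,
# whereas unrelated filler corresponds to the classic needle-in-a-haystack setting.
item = build_item(
    haystack_paragraphs=["First filler paragraph ...", "Second filler paragraph ..."],
    needle_sentence="The launch code mentioned in the briefing was 7-4-2-9.",
    question="What launch code was mentioned in the briefing?",
    answer="7-4-2-9",
)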
Anthology ID: 2025.wraicogs-1.2
Volume: Proceedings of the First Workshop on Writing Aids at the Crossroads of AI, Cognitive Science and NLP (WRAICOGS 2025)
Month: January
Year: 2025
Address: Abu Dhabi, UAE
Editors: Michael Zock, Kentaro Inui, Zheng Yuan
Venues: WRAICOGS | WS
Publisher: International Committee on Computational Linguistics
Pages: 16–23
URL: https://aclanthology.org/2025.wraicogs-1.2/
Cite (ACL): Ken Shi and Gerald Penn. 2025. Semantic Masking in a Needle-in-a-haystack Test for Evaluating Large Language Model Long-Text Capabilities. In Proceedings of the First Workshop on Writing Aids at the Crossroads of AI, Cognitive Science and NLP (WRAICOGS 2025), pages 16–23, Abu Dhabi, UAE. International Committee on Computational Linguistics.
Cite (Informal): Semantic Masking in a Needle-in-a-haystack Test for Evaluating Large Language Model Long-Text Capabilities (Shi & Penn, WRAICOGS 2025)
PDF: https://aclanthology.org/2025.wraicogs-1.2.pdf