LiteraryQA: Towards Effective Evaluation of Long-document Narrative QA

Tommaso Bonomo; Luca Gioffré; Roberto Navigli

doi:10.18653/v1/2025.emnlp-main.1729

Abstract

Question Answering (QA) on narrative text poses a unique challenge to current systems, requiring a deep understanding of long, complex documents. However, the reliability of NarrativeQA, the most widely used benchmark in this domain, is hindered by noisy documents and flawed QA pairs. In this work, we introduce LiteraryQA, a high-quality subset of NarrativeQA focused on literary works. Using a human- and LLM-validated pipeline, we identify and correct low-quality QA samples while removing extraneous text from source documents. We then carry out a meta-evaluation of automatic metrics to clarify how systems should be evaluated on LiteraryQA. This analysis reveals that all n-gram-based metrics have a low system-level correlation to human judgment, while LLM-as-a-Judge evaluations, even with small open-weight models, can strongly agree with the ranking identified by humans. Finally, we benchmark a set of long-context LLMs on LiteraryQA. We release our code and data at https://github.com/sapienzaNLP/LiteraryQA.

Anthology ID: 2025.emnlp-main.1729
Volume: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month: November
Year: 2025
Address: Suzhou, China
Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 34086–34107
URL: https://aclanthology.org/2025.emnlp-main.1729/
DOI: 10.18653/v1/2025.emnlp-main.1729
Cite (ACL): Tommaso Bonomo, Luca Gioffré, and Roberto Navigli. 2025. LiteraryQA: Towards Effective Evaluation of Long-document Narrative QA. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 34086–34107, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): LiteraryQA: Towards Effective Evaluation of Long-document Narrative QA (Bonomo et al., EMNLP 2025)
PDF: https://aclanthology.org/2025.emnlp-main.1729.pdf
Checklist: 2025.emnlp-main.1729.checklist.pdf
