ChapterBreak: A Challenge Dataset for Long-Range Language Models

Simeng Sun; Katherine Thai; Mohit Iyyer

doi:10.18653/v1/2022.naacl-main.271

Abstract

While numerous architectures for long-range language models (LRLMs) have recently been proposed, a meaningful evaluation of their discourse-level language understanding capabilities has not yet followed. To this end, we introduce ChapterBreak, a challenge dataset that provides an LRLM with a long segment from a narrative that ends at a chapter boundary and asks it to distinguish the beginning of the ground-truth next chapter from a set of negative segments sampled from the same narrative. A fine-grained human annotation reveals that our dataset contains many complex types of chapter transitions (e.g., parallel narratives, cliffhanger endings) that require processing global context to comprehend. Experiments on ChapterBreak show that existing LRLMs fail to effectively leverage long-range context, substantially underperforming a segment-level model trained directly for this task. We publicly release our ChapterBreak dataset to spur more principled future research into LRLMs.
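The task setup described in the abstract (pick the true next-chapter beginning from a set of same-narrative negatives, given a long prefix) can be sketched as a ranking problem. The following is a minimal illustration only, not the authors' evaluation code: the field names `prefix`, `candidates`, and `gold` are hypothetical rather than the released dataset's schema, and `overlap_score` is a toy word-overlap scorer standing in for whatever score a real LRLM would assign to a candidate continuation.

```python
from collections import Counter
from typing import Callable, Dict, List, Sequence


def overlap_score(prefix: str, candidate: str) -> float:
    """Toy stand-in for a model score (assumption, not an LRLM):
    the fraction of candidate tokens that also occur in the prefix."""
    prefix_counts = Counter(prefix.lower().split())
    tokens = candidate.lower().split()
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if prefix_counts[t] > 0) / len(tokens)


def pick_suffix(
    prefix: str,
    candidates: Sequence[str],
    score: Callable[[str, str], float] = overlap_score,
) -> int:
    """Return the index of the highest-scoring candidate continuation."""
    scores = [score(prefix, c) for c in candidates]
    return max(range(len(candidates)), key=scores.__getitem__)


def accuracy(
    examples: List[Dict],
    score: Callable[[str, str], float] = overlap_score,
) -> float:
    """Fraction of examples where the gold continuation is ranked first."""
    correct = sum(
        pick_suffix(ex["prefix"], ex["candidates"], score) == ex["gold"]
        for ex in examples
    )
    return correct / len(examples)


# Hypothetical toy example: one prefix, one true continuation, two negatives.
toy_examples = [
    {
        "prefix": "the captain locked the cabin door and the storm hit the ship",
        "candidates": [
            "a quiet morning in the village market",
            "the ship rolled as the captain fought the storm",
            "she had never seen the desert before",
        ],
        "gold": 1,
    },
]

if __name__ == "__main__":
    print(accuracy(toy_examples))
```

Swapping `overlap_score` for a function that returns a language model's likelihood of the candidate given the prefix would turn the same loop into a model evaluation; a surface-overlap scorer like the one here would not be expected to handle the complex chapter transitions the abstract mentions.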

Anthology ID: 2022.naacl-main.271
Volume: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
Month: July
Year: 2022
Address: Seattle, United States
Editors: Marine Carpuat, Marie-Catherine de Marneffe, Ivan Vladimir Meza Ruiz
Venue: NAACL
Publisher: Association for Computational Linguistics
Pages: 3704–3714
URL: https://aclanthology.org/2022.naacl-main.271/
DOI: 10.18653/v1/2022.naacl-main.271
Cite (ACL): Simeng Sun, Katherine Thai, and Mohit Iyyer. 2022. ChapterBreak: A Challenge Dataset for Long-Range Language Models. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3704–3714, Seattle, United States. Association for Computational Linguistics.
Cite (Informal): ChapterBreak: A Challenge Dataset for Long-Range Language Models (Sun et al., NAACL 2022)
PDF: https://aclanthology.org/2022.naacl-main.271.pdf
Video: https://aclanthology.org/2022.naacl-main.271.mp4