LumberChunker: Long-Form Narrative Document Segmentation

André Duarte, João Marques, Miguel Graça, Miguel Freire, Lei Li, Arlindo Oliveira


Abstract
Modern NLP tasks increasingly rely on dense retrieval methods to access up-to-date and relevant contextual information. We are motivated by the premise that retrieval benefits from segments that can vary in size such that a content’s semantic independence is better captured. We propose LumberChunker, a method leveraging an LLM to dynamically segment documents, which iteratively prompts the LLM to identify the point within a group of sequential passages where the content begins to shift. To evaluate our method, we introduce GutenQA, a benchmark with 3000 “needle in a haystack” type of question-answer pairs derived from 100 public domain narrative books available on Project Gutenberg. Our experiments show that LumberChunker not only outperforms the most competitive baseline by 7.37% in retrieval performance (DCG@20) but also that, when integrated into a RAG pipeline, LumberChunker proves to be more effective than other chunking methods and competitive baselines, such as the Gemini 1.5M Pro.
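The iterative procedure described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the word-count window budget, and the `find_shift` callback (which stands in for the LLM prompt) are all assumptions made for the example. A toy detector based on a leading topic word replaces the LLM so the sketch runs on its own.

```python
from typing import Callable, List, Sequence


def chunk_by_content_shift(
    paragraphs: Sequence[str],
    find_shift: Callable[[Sequence[str]], int],
    budget_words: int = 200,  # hypothetical window size, not a value from the abstract
) -> List[List[str]]:
    """Split sequential paragraphs into variable-size chunks.

    `find_shift` is a stand-in for the LLM call: given a window of
    consecutive paragraphs, it returns the 0-based index of the first
    paragraph where the content begins to shift (or len(window) if
    no shift is found).
    """
    chunks: List[List[str]] = []
    start = 0
    while start < len(paragraphs):
        # Grow a window of sequential paragraphs until the word budget is exceeded.
        end, words = start, 0
        while end < len(paragraphs) and words <= budget_words:
            words += len(paragraphs[end].split())
            end += 1
        window = paragraphs[start:end]

        # Ask the detector where the content shifts; clamp the answer so
        # every iteration consumes at least one paragraph.
        shift = max(1, min(find_shift(window), len(window)))
        chunks.append(list(window[:shift]))

        # The next window starts at the paragraph where the shift was detected.
        start += shift
    return chunks


def toy_find_shift(window: Sequence[str]) -> int:
    """Toy detector (assumption): the first word of a paragraph is its topic."""
    topic = window[0].split()[0]
    for i, paragraph in enumerate(window):
        if paragraph.split()[0] != topic:
            return i
    return len(window)


paragraphs = ["sea a b", "sea c d", "land e f", "land g h", "sky i j"]
chunks = chunk_by_content_shift(paragraphs, toy_find_shift, budget_words=100)
print(chunks)
```

In a real pipeline, `find_shift` would wrap a prompt that presents the numbered paragraphs to an LLM and parses the returned paragraph index. Clamping the index keeps the loop from stalling if the model returns an out-of-range answer.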
Anthology ID:
2024.findings-emnlp.377
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2024
Month:
November
Year:
2024
Address:
Miami, Florida, USA
Editors:
Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
6473–6486
URL:
https://aclanthology.org/2024.findings-emnlp.377
Cite (ACL):
André Duarte, João Marques, Miguel Graça, Miguel Freire, Lei Li, and Arlindo Oliveira. 2024. LumberChunker: Long-Form Narrative Document Segmentation. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 6473–6486, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):
LumberChunker: Long-Form Narrative Document Segmentation (Duarte et al., Findings 2024)
PDF:
https://aclanthology.org/2024.findings-emnlp.377.pdf
Software:
 2024.findings-emnlp.377.software.zip
Data:
 2024.findings-emnlp.377.data.zip