LumberChunker: Long-Form Narrative Document Segmentation

André V. Duarte; João DS Marques; Miguel Graça; Miguel Freire; Lei Li; Arlindo L. Oliveira

doi:10.18653/v1/2024.findings-emnlp.377

Abstract

Modern NLP tasks increasingly rely on dense retrieval methods to access up-to-date and relevant contextual information. We are motivated by the premise that retrieval benefits from segments that can vary in size such that a content’s semantic independence is better captured. We propose LumberChunker, a method leveraging an LLM to dynamically segment documents, which iteratively prompts the LLM to identify the point within a group of sequential passages where the content begins to shift. To evaluate our method, we introduce GutenQA, a benchmark with 3000 “needle in a haystack” type of question-answer pairs derived from 100 public domain narrative books available on Project Gutenberg. Our experiments show that LumberChunker not only outperforms the most competitive baseline by 7.37% in retrieval performance (DCG@20) but also that, when integrated into a RAG pipeline, LumberChunker proves to be more effective than other chunking methods and competitive baselines, such as the Gemini 1.5M Pro.
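The iterative segmentation loop described in the abstract can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the `find_shift` callback stands in for the LLM prompt, the word `budget` that sizes each group of passages is a made-up parameter, and `toy_shift` is a hypothetical rule-based stand-in so the example runs without a model.

```python
from typing import Callable, List, Sequence


def segment(paragraphs: Sequence[str],
            find_shift: Callable[[Sequence[str]], int],
            budget: int = 6) -> List[str]:
    """Split paragraphs into variable-size chunks.

    find_shift(window) is assumed to return the index inside `window`
    of the first paragraph where the content shifts, or len(window)
    if no shift is seen. `budget` (in words) is an illustrative knob.
    """
    chunks: List[str] = []
    start = 0
    n = len(paragraphs)
    while start < n:
        # Collect consecutive paragraphs until the word budget is reached.
        end, words = start, 0
        while end < n and words < budget:
            words += len(paragraphs[end].split())
            end += 1
        window = paragraphs[start:end]
        # Ask the (stand-in) model where the content begins to shift.
        shift = find_shift(window)
        # Clamp so every iteration emits at least one paragraph.
        shift = max(1, min(shift, len(window)))
        chunks.append("\n".join(window[:shift]))
        # The next group starts at the detected shift point.
        start += shift
    return chunks


def toy_shift(window: Sequence[str]) -> int:
    """Hypothetical stand-in for the LLM: a 'shift' is a change in the
    first character of a paragraph (used here as a fake topic tag)."""
    for i, paragraph in enumerate(window):
        if paragraph[0] != window[0][0]:
            return i
    return len(window)


paragraphs = ["A one two", "A three", "B four", "B five six", "C seven"]
chunks = segment(paragraphs, toy_shift, budget=6)
```

In a real pipeline, `find_shift` would wrap a prompt that shows the model the numbered passages in the group and asks for the identifier of the first passage whose content departs from the preceding ones; chunk sizes then follow the content rather than a fixed length.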

Anthology ID: 2024.findings-emnlp.377
Volume: Findings of the Association for Computational Linguistics: EMNLP 2024
Month: November
Year: 2024
Address: Miami, Florida, USA
Editors: Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 6473–6486
URL: https://aclanthology.org/2024.findings-emnlp.377/
DOI: 10.18653/v1/2024.findings-emnlp.377
Cite (ACL): André V. Duarte, João DS Marques, Miguel Graça, Miguel Freire, Lei Li, and Arlindo L. Oliveira. 2024. LumberChunker: Long-Form Narrative Document Segmentation. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 6473–6486, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal): LumberChunker: Long-Form Narrative Document Segmentation (Duarte et al., Findings 2024)
PDF: https://aclanthology.org/2024.findings-emnlp.377.pdf
Software: 2024.findings-emnlp.377.software.zip
Data: 2024.findings-emnlp.377.data.zip
