Document Segmentation Matters for Retrieval-Augmented Generation

Zhitong Wang; Cheng Gao; Chaojun Xiao; Yufei Huang; Shuzheng Si; Kangyang Luo; Yuzhuo Bai; Wenhao Li; Tangjian Duan; Chuancheng Lv; Guoshan Lu; Gang Chen; Fanchao Qi; Maosong Sun

doi:10.18653/v1/2025.findings-acl.422

Abstract

Retrieval-augmented generation (RAG) enhances large language models (LLMs) by integrating external knowledge. A critical yet underexplored challenge in RAG is document segmentation, also known as document chunking. Existing widely used rule-based chunking methods usually lead to suboptimal splits, where overly large chunks introduce irrelevant information and small chunks lack semantic coherence. Existing semantic-based approaches either require costly LLM calls or fail to adaptively group contextually related sentences. To address these limitations, we propose PIC (Pseudo-Instruction for document Chunking), a simple yet effective method that leverages document summaries as pseudo-instructions to guide chunking. By computing semantic similarity between sentences and the summary, PIC dynamically groups sentences into chunks that align with the document’s key themes, ensuring semantic completeness and relevance to potential user instructions. Experiments on multiple open-domain question-answering benchmarks demonstrate that PIC can significantly improve retrieval accuracy (Hits@k) and end-to-end QA performance (Exact Match) without any additional training.
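
As a rough illustration of the idea described in the abstract (scoring each sentence against a document summary and grouping neighbouring sentences accordingly), a minimal sketch might look like the following. It is not the authors' implementation. The abstract does not specify the similarity model or the grouping rule, so two assumptions are made here: a bag-of-words cosine similarity stands in for a real sentence embedding model, and consecutive sentences are grouped while they stay on the same side of the mean similarity. The function names `bow`, `cosine`, and `chunk_by_summary` are hypothetical.

```python
import math
import re
from collections import Counter


def bow(text):
    """Bag-of-words counts; an assumed stand-in for a sentence embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def chunk_by_summary(sentences, summary):
    """Group consecutive sentences using their similarity to the summary.

    Assumed grouping rule (not taken from the paper): a new chunk starts
    whenever a sentence's similarity crosses the mean similarity.
    """
    if not sentences:
        return []
    summary_vec = bow(summary)
    sims = [cosine(bow(s), summary_vec) for s in sentences]
    threshold = sum(sims) / len(sims)

    chunks = [[sentences[0]]]
    for prev_sim, cur_sim, sentence in zip(sims, sims[1:], sentences[1:]):
        if (prev_sim >= threshold) == (cur_sim >= threshold):
            chunks[-1].append(sentence)   # same side of the mean: extend chunk
        else:
            chunks.append([sentence])     # crossed the mean: start a new chunk
    return [" ".join(chunk) for chunk in chunks]


if __name__ == "__main__":
    doc = [
        "Cats purr when content.",
        "Cats sleep most of the day.",
        "The stock market fell today.",
        "Bond yields rose sharply.",
    ]
    for chunk in chunk_by_summary(doc, "Cats purr and sleep."):
        print(chunk)
```

In practice the bag-of-words vectors would be replaced by embeddings from a pretrained encoder, and the summary would come from a summarization model; both choices are left open by the abstract.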

Anthology ID: 2025.findings-acl.422
Volume: Findings of the Association for Computational Linguistics: ACL 2025
Month: July
Year: 2025
Address: Vienna, Austria
Editors: Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 8063–8075
URL: https://aclanthology.org/2025.findings-acl.422/
DOI: 10.18653/v1/2025.findings-acl.422
Cite (ACL): Zhitong Wang, Cheng Gao, Chaojun Xiao, Yufei Huang, Shuzheng Si, Kangyang Luo, Yuzhuo Bai, Wenhao Li, Tangjian Duan, Chuancheng Lv, Guoshan Lu, Gang Chen, Fanchao Qi, and Maosong Sun. 2025. Document Segmentation Matters for Retrieval-Augmented Generation. In Findings of the Association for Computational Linguistics: ACL 2025, pages 8063–8075, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal): Document Segmentation Matters for Retrieval-Augmented Generation (Wang et al., Findings 2025)
PDF: https://aclanthology.org/2025.findings-acl.422.pdf
