Yi Quan Lee
2024
MC-indexing: Effective Long Document Retrieval via Multi-view Content-aware Indexing
Kuicai Dong
|
Derrick Goh Xin Deik
|
Yi Quan Lee
|
Hao Zhang
|
Xiangyang Li
|
Cong Zhang
|
Yong Liu
Findings of the Association for Computational Linguistics: EMNLP 2024
Long document question answering (DocQA) aims to answer questions from long documents over 10k words. They usually contain content structures such as sections, sub-sections, and paragraph demarcations. However, the indexing methods of long documents remain under-explored, while existing systems generally employ fixed-length chunking. As they do not consider content structures, the resultant chunks can exclude vital information or include irrelevant content. Motivated by this, we propose the **M**ulti-view **C**ontent-aware indexing (**MC-indexing**) for more effective long DocQA via (i) segment structured document into content chunks, and (ii) represent each content chunk in raw-text, keywords, and summary views. We highlight that MC-indexing requires neither training nor fine-tuning. Having plug-and-play capability, it can be seamlessly integrated with any retrievers to boost their performance. Besides, we propose a long DocQA dataset that includes not only question-answer pair, but also document structure and answer scope. When compared to state-of-art chunking schemes, MC-indexing has significantly increased the recall by **42.8%**, **30.0%**, **23.9%**, and **16.3%** via top k = 1.5, 3, 5, and 10 respectively. These improved scores are the average of 8 widely used retrievers (2 sparse and 6 dense) via extensive experiments.
Search
Co-authors
- Kuicai Dong 1
- Derrick Goh Xin Deik 1
- Hao Zhang 1
- Xiangyang Li 1
- Cong Zhang 1
- show all...
- Yong Liu 1