Improving Long Document Topic Segmentation Models With Enhanced Coherence Modeling

Hai Yu; Chong Deng; Qinglin Zhang; Jiaqing Liu; Qian Chen; Wen Wang

doi:10.18653/v1/2023.emnlp-main.341

Abstract

Topic segmentation is critical for obtaining structured documents and improving downstream tasks such as information retrieval. Owing to their ability to automatically discover cues of topic shift from abundant labeled data, recent supervised neural models have greatly advanced long document topic segmentation, but they leave the deeper relationship between coherence and topic segmentation underexplored. Therefore, this paper enhances the ability of supervised models to capture coherence from both logical structure and semantic similarity perspectives to further improve topic segmentation performance, proposing Topic-aware Sentence Structure Prediction (TSSP) and Contrastive Semantic Similarity Learning (CSSL). Specifically, the TSSP task forces the model to comprehend structural information by learning the original relations between adjacent sentences in a disarrayed document, which is constructed by jointly disrupting the original document at the topic and sentence levels. Moreover, we utilize inter- and intra-topic information to construct contrastive samples and design the CSSL objective to ensure that sentence representations in the same topic have higher similarity, while those in different topics are less similar. Extensive experiments show that the Longformer with our approach significantly outperforms previous state-of-the-art (SOTA) methods. Our approach improves the F₁ of the previous SOTA by 3.42 points (73.74 → 77.16) and reduces P_k by 1.11 points (15.0 → 13.89) on WIKI-727K, and achieves an average relative reduction of 4.3% in P_k on WikiSection. The average relative P_k drop of 8.38% on two out-of-domain datasets also demonstrates the robustness of our approach.
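The abstract reports P_k without defining it. As a minimal sketch, assuming the standard definition of the metric (Beeferman et al., 1999), P_k is the fraction of sentence pairs a fixed distance k apart on which the reference and the predicted segmentation disagree about whether the two sentences fall in the same segment; lower is better. The function name `pk`, the 0/1 label convention (1 marks a sentence that ends a segment), and the default choice of k are illustrative assumptions, not details taken from the paper or its evaluation code.

```python
from typing import Optional, Sequence


def pk(reference: Sequence[int], hypothesis: Sequence[int],
       k: Optional[int] = None) -> float:
    """Illustrative P_k: probability that two sentences k apart are
    inconsistently classified as same-segment vs. different-segment.

    Both inputs hold one 0/1 label per sentence; a 1 means a segment
    boundary follows that sentence (assumed convention).
    """
    n = len(reference)
    if len(hypothesis) != n:
        raise ValueError("reference and hypothesis must have equal length")

    if k is None:
        # Common convention: half the mean reference segment length.
        # The final label is skipped because the document end is not
        # an interior boundary.
        num_segments = sum(reference[:-1]) + 1
        k = max(1, int(round(n / num_segments / 2)))

    if n <= k:
        raise ValueError("document is too short for window size k")

    errors = 0
    for i in range(n - k):
        # Sentences i and i+k share a segment iff no boundary label
        # occurs on sentences i .. i+k-1.
        ref_same = sum(reference[i:i + k]) == 0
        hyp_same = sum(hypothesis[i:i + k]) == 0
        if ref_same != hyp_same:
            errors += 1
    return errors / (n - k)


if __name__ == "__main__":
    ref = [0, 0, 1, 0, 0, 1, 0, 0, 0]   # three segments of three sentences
    print(pk(ref, ref))                 # perfect prediction
    print(pk(ref, [0] * 9, k=2))        # prediction with no boundaries
```

Under this definition a perfect segmentation scores 0, and the abstract's scores (for example 15.0 → 13.89) appear to be the same quantity expressed as a percentage.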

Anthology ID: 2023.emnlp-main.341
Volume: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Month: December
Year: 2023
Address: Singapore
Editors: Houda Bouamor, Juan Pino, Kalika Bali
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 5592–5605
URL: https://aclanthology.org/2023.emnlp-main.341
DOI: 10.18653/v1/2023.emnlp-main.341
Cite (ACL): Hai Yu, Chong Deng, Qinglin Zhang, Jiaqing Liu, Qian Chen, and Wen Wang. 2023. Improving Long Document Topic Segmentation Models With Enhanced Coherence Modeling. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 5592–5605, Singapore. Association for Computational Linguistics.
Cite (Informal): Improving Long Document Topic Segmentation Models With Enhanced Coherence Modeling (Yu et al., EMNLP 2023)
PDF: https://aclanthology.org/2023.emnlp-main.341.pdf
Video: https://aclanthology.org/2023.emnlp-main.341.mp4
