Data Selection Curriculum for Abstractive Text Summarization

Shichao Sun, Ruifeng Yuan, Jianfei He, Ziqiang Cao, Wenjie Li, Xiaohua Jia


Abstract
Abstractive Text Summarization (ATS) models are commonly trained on large-scale data that is randomly shuffled. However, the impact of data selection and data ordering on ATS models remains a relatively unexplored research area, where a significant challenge lies in accurately assessing the learning difficulty of each training instance. This study introduces a Data Selection Curriculum (DSC) scoring system that incorporates both the difficulty of improving the ATS model via an instance and the expected performance on that instance. By selectively excluding excessively simple and overly complex instances, training efficiency can be improved. Furthermore, inspired by human learners, curriculum learning is integrated to accelerate convergence and improve performance by gradually increasing the learning difficulty. Experimental results on the CNN/DailyMail dataset demonstrate that our approach surpasses potent baselines while using a mere 20% of the available instances.
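As a reading aid, the select-then-order idea described in the abstract can be sketched in Python as follows. The combined score, the cut-off fractions, and the `difficulty`/`performance` callables are illustrative assumptions, not the paper's exact formulation.

```python
from typing import Callable, List

def build_curriculum(
    data: List[str],
    difficulty: Callable[[str], float],   # proxy for how hard an instance is to learn
    performance: Callable[[str], float],  # proxy for expected performance on it
    keep_ratio: float = 0.2,
) -> List[str]:
    """Score instances, drop the extremes, and order the rest easy-to-hard."""
    # Higher combined score = harder to learn with lower expected payoff.
    scored = sorted(data, key=lambda x: difficulty(x) - performance(x))
    n = len(scored)
    lo = int(0.1 * n)              # skip the excessively simple head of the ranking
    hi = lo + int(keep_ratio * n)  # keep a middle band (~20% of the instances)
    return scored[lo:hi]           # ascending score = easy-to-hard curriculum order

# Toy usage with document length as a stand-in difficulty signal.
docs = ["a short article ...", "a much longer, denser article ..."]
plan = build_curriculum(docs, difficulty=lambda x: float(len(x)),
                        performance=lambda x: 0.0)
```

In practice the two callables would come from the trained scorer (e.g., per-instance training loss and an expected ROUGE estimate); the sketch only fixes the pipeline shape: score, trim both tails, train in ascending difficulty.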
Anthology ID:
2023.findings-emnlp.537
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2023
Month:
December
Year:
2023
Address:
Singapore
Editors:
Houda Bouamor, Juan Pino, Kalika Bali
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
7990–7995
URL:
https://aclanthology.org/2023.findings-emnlp.537
DOI:
10.18653/v1/2023.findings-emnlp.537
Cite (ACL):
Shichao Sun, Ruifeng Yuan, Jianfei He, Ziqiang Cao, Wenjie Li, and Xiaohua Jia. 2023. Data Selection Curriculum for Abstractive Text Summarization. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 7990–7995, Singapore. Association for Computational Linguistics.
Cite (Informal):
Data Selection Curriculum for Abstractive Text Summarization (Sun et al., Findings 2023)
PDF:
https://aclanthology.org/2023.findings-emnlp.537.pdf