MassiveSumm: a very large-scale, very multilingual, news summarisation dataset

Daniel Varab; Natalie Schluter

doi:10.18653/v1/2021.emnlp-main.797

Abstract

Current research in automatic summarisation is unapologetically anglo-centered, a persistent state of affairs that also predates neural net approaches. High-quality automatic summarisation datasets are notoriously expensive to create, posing a challenge for any language. However, with digitalisation, archiving, and social media advertising of newswire articles, recent work has shown how, with careful application of methodology, large-scale datasets can now be simply gathered instead of written. In this paper, we present a large-scale multilingual summarisation dataset containing 28.8 million articles in 92 languages and more than 35 writing scripts. This is both the largest, most inclusive existing automatic summarisation dataset and one of the largest, most inclusive datasets ever published for any NLP task. We present the first investigation into the efficacy of resource building from news platforms in the low-resource language setting. Finally, we provide some first insight into how low-resource language settings impact state-of-the-art automatic summarisation system performance.

Anthology ID: 2021.emnlp-main.797
Volume: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
Month: November
Year: 2021
Address: Online and Punta Cana, Dominican Republic
Editors: Marie-Francine Moens, Xuanjing Huang, Lucia Specia, Scott Wen-tau Yih
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 10150–10161
URL: https://aclanthology.org/2021.emnlp-main.797
DOI: 10.18653/v1/2021.emnlp-main.797
Cite (ACL): Daniel Varab and Natalie Schluter. 2021. MassiveSumm: a very large-scale, very multilingual, news summarisation dataset. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 10150–10161, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
Cite (Informal): MassiveSumm: a very large-scale, very multilingual, news summarisation dataset (Varab & Schluter, EMNLP 2021)
PDF: https://aclanthology.org/2021.emnlp-main.797.pdf
Video: https://aclanthology.org/2021.emnlp-main.797.mp4
Code: danielvarab/massive-summ
Data: CNN/Daily Mail, DaNewsroom, Global Voices, MLSUM, NEWSROOM, New York Times Annotated Corpus