MassiveSumm: a very large-scale, very multilingual, news summarisation dataset

Daniel Varab, Natalie Schluter


Abstract
Current research in automatic summarisation is unapologetically anglo-centered, a persistent state of affairs that also predates neural net approaches. High-quality automatic summarisation datasets are notoriously expensive to create, posing a challenge for any language. However, with the digitalisation, archiving, and social media advertising of newswire articles, recent work has shown how, with careful application of methodology, large-scale datasets can now simply be gathered instead of written. In this paper, we present a large-scale multilingual summarisation dataset containing articles in 92 languages, spread across 28.8 million articles, in more than 35 writing scripts. This is both the largest, most inclusive automatic summarisation dataset in existence, and one of the largest, most inclusive datasets ever published for any NLP task. We present the first investigation of the efficacy of resource building from news platforms in the low-resource language setting. Finally, we provide some first insights into how low-resource language settings impact the performance of state-of-the-art automatic summarisation systems.
Anthology ID:
2021.emnlp-main.797
Volume:
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2021
Address:
Online and Punta Cana, Dominican Republic
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
10150–10161
URL:
https://aclanthology.org/2021.emnlp-main.797
DOI:
10.18653/v1/2021.emnlp-main.797
PDF:
https://aclanthology.org/2021.emnlp-main.797.pdf
Code
 danielvarab/massive-summ
Data
CNN/Daily Mail, Global Voices, MLSUM, NEWSROOM, New York Times Annotated Corpus