A Multilingual Simplified Language News Corpus

Renate Hauser, Jannis Vamvas, Sarah Ebling, Martin Volk


Abstract
Simplified language news articles are being offered by specialized web portals in several countries. The thousands of articles that have been published over the years are a valuable resource for natural language processing, especially for efforts towards automatic text simplification. In this paper, we present SNIML, a large multilingual corpus of news in simplified language. The corpus contains 13k simplified news articles written in one of six languages: Finnish, French, Italian, Swedish, English, and German. All articles are shared under open licenses that permit academic use. The level of text simplification varies depending on the news portal. We believe that even though SNIML is not a parallel corpus, it can be useful as a complement to the more homogeneous but often smaller corpora of news in the simplified variety of one language that are currently in use.
Anthology ID:
2022.readi-1.4
Volume:
Proceedings of the 2nd Workshop on Tools and Resources to Empower People with REAding DIfficulties (READI) within the 13th Language Resources and Evaluation Conference
Month:
June
Year:
2022
Address:
Marseille, France
Editors:
Rodrigo Wilkens, David Alfter, Rémi Cardon, Núria Gala
Venue:
READI
Publisher:
European Language Resources Association
Pages:
25–30
URL:
https://aclanthology.org/2022.readi-1.4
Cite (ACL):
Renate Hauser, Jannis Vamvas, Sarah Ebling, and Martin Volk. 2022. A Multilingual Simplified Language News Corpus. In Proceedings of the 2nd Workshop on Tools and Resources to Empower People with REAding DIfficulties (READI) within the 13th Language Resources and Evaluation Conference, pages 25–30, Marseille, France. European Language Resources Association.
Cite (Informal):
A Multilingual Simplified Language News Corpus (Hauser et al., READI 2022)
PDF:
https://aclanthology.org/2022.readi-1.4.pdf