DBpedia Abstracts: A Large-Scale, Open, Multilingual NLP Training Corpus

Martin Brümmer, Milan Dojchinovski, Sebastian Hellmann


Abstract
The ever-increasing importance of machine learning in Natural Language Processing is accompanied by an equally increasing need for large-scale training and evaluation corpora. Owing to its size, openness, and relative quality, Wikipedia has already served as a source of such data, but on a limited scale. This paper introduces the DBpedia Abstract Corpus, a large-scale, open corpus of annotated Wikipedia texts in six languages, featuring over 11 million texts and over 97 million entity links. It describes the properties of the Wikipedia texts, the corpus creation process, the corpus format, and interesting use cases such as Named Entity Linking training and evaluation.
Anthology ID:
L16-1532
Volume:
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
Month:
May
Year:
2016
Address:
Portorož, Slovenia
Editors:
Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Sara Goggi, Marko Grobelnik, Bente Maegaard, Joseph Mariani, Helene Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
3339–3343
URL:
https://aclanthology.org/L16-1532
Cite (ACL):
Martin Brümmer, Milan Dojchinovski, and Sebastian Hellmann. 2016. DBpedia Abstracts: A Large-Scale, Open, Multilingual NLP Training Corpus. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 3339–3343, Portorož, Slovenia. European Language Resources Association (ELRA).
Cite (Informal):
DBpedia Abstracts: A Large-Scale, Open, Multilingual NLP Training Corpus (Brümmer et al., LREC 2016)
PDF:
https://aclanthology.org/L16-1532.pdf
Data
DBpedia