Quality Assessment of the Reuters Vol. 2 Multilingual Corpus

Robin Eriksson

Abstract

We introduce a framework for quality assurance of corpora and apply it to the Reuters Multilingual Corpus (RCV2). The assessment of this standard newsprint corpus reveals a significant duplication problem and, to a lesser extent, a problem with corrupted articles. Of the roughly 487,000 articles in the raw collection, almost one tenth are trivial duplicates. A smaller fraction of articles appear to be corrupted and should be excluded for that reason. The detailed results are being made available as online appendices to this article. This effort also demonstrates the beginnings of a constraint-based methodological framework for quality assessment and quality assurance of corpora. As a first implementation of this framework, we have investigated constraints to verify sample integrity and to diagnose sample duplication, entropy aberrations, and tagging inconsistencies. To help identify near-duplicates in the corpus, we have employed both entropy measurements and a simple byte bigram incidence digest.
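
The abstract names two signals for spotting near-duplicates: entropy measurements and a byte bigram incidence digest. The abstract does not specify how either is computed, so the following Python sketch is only one plausible reading, not the paper's method. It assumes that entropy means Shannon entropy over byte frequencies, that the digest is the set of byte bigrams occurring at least once, and that digests are compared with Jaccard similarity. The function names `byte_entropy`, `bigram_incidence`, and `jaccard` are hypothetical.

```python
import math
from collections import Counter


def byte_entropy(data: bytes) -> float:
    """Shannon entropy of the byte distribution, in bits per byte (assumed measure)."""
    if not data:
        return 0.0
    counts = Counter(data)
    total = len(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def bigram_incidence(data: bytes) -> frozenset:
    """Set of byte bigrams that occur at least once (incidence, not counts)."""
    return frozenset(data[i:i + 2] for i in range(len(data) - 1))


def jaccard(a: frozenset, b: frozenset) -> float:
    """Jaccard similarity of two digests; two empty digests count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical article texts, used only to exercise the functions.
first = "Oil prices rose on Monday.".encode("utf-8")
second = "Oil prices rose on Tuesday.".encode("utf-8")

similarity = jaccard(bigram_incidence(first), bigram_incidence(second))
entropy_gap = abs(byte_entropy(first) - byte_entropy(second))
print(f"digest similarity: {similarity:.3f}, entropy gap: {entropy_gap:.3f}")
```

Under these assumptions, a pair with high digest similarity and a small entropy gap would be a candidate near-duplicate, while an article whose entropy falls far from the rest of its language group would be a candidate for corruption. Any thresholds would need to be chosen empirically.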

Anthology ID: L16-1286
Volume: Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
Month: May
Year: 2016
Address: Portorož, Slovenia
Editors: Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Sara Goggi, Marko Grobelnik, Bente Maegaard, Joseph Mariani, Helene Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue: LREC
Publisher: European Language Resources Association (ELRA)
Pages: 1813–1819
URL: https://aclanthology.org/L16-1286/
Cite (ACL): Robin Eriksson. 2016. Quality Assessment of the Reuters Vol. 2 Multilingual Corpus. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 1813–1819, Portorož, Slovenia. European Language Resources Association (ELRA).
Cite (Informal): Quality Assessment of the Reuters Vol. 2 Multilingual Corpus (Eriksson, LREC 2016)
PDF: https://aclanthology.org/L16-1286.pdf
