OpusTools and Parallel Corpus Diagnostics

Mikko Aulamo, Umut Sulubacak, Sami Virpioja, Jörg Tiedemann


Abstract
This paper introduces OpusTools, a package for downloading and processing parallel corpora included in the OPUS corpus collection. The package implements tools for accessing compressed data in their archived release format and makes it possible to easily convert between common formats. OpusTools also includes tools for language identification and data filtering as well as tools for importing data from various sources into the OPUS format. We show the use of these tools in parallel corpus creation and data diagnostics. The latter is especially useful for the identification of potential problems and errors in the extensive data set. Using these tools, we can now monitor the validity of data sets and improve the overall quality and consistency of the data collection.
Anthology ID:
2020.lrec-1.467
Volume:
Proceedings of the Twelfth Language Resources and Evaluation Conference
Month:
May
Year:
2020
Address:
Marseille, France
Editors:
Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
3782–3789
Language:
English
URL:
https://aclanthology.org/2020.lrec-1.467
Cite (ACL):
Mikko Aulamo, Umut Sulubacak, Sami Virpioja, and Jörg Tiedemann. 2020. OpusTools and Parallel Corpus Diagnostics. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 3782–3789, Marseille, France. European Language Resources Association.
Cite (Informal):
OpusTools and Parallel Corpus Diagnostics (Aulamo et al., LREC 2020)
PDF:
https://aclanthology.org/2020.lrec-1.467.pdf