Billions of Parallel Words for Free: Building and Using the EU Bookshop Corpus

Raivis Skadiņš; Jörg Tiedemann; Roberts Rozis; Daiga Deksne

Abstract

The European Union is a rich source of high-quality documents with translations into several languages. Parallel corpora built from its publications are frequently used in various tasks, machine translation in particular. A source that has not yet been systematically explored is the EU Bookshop, an online service and archive of publications from various European institutions. The service contains a large body of publications in the 24 official languages of the EU. This paper describes our efforts in collecting those publications and converting them to a format that is useful for natural language processing, in particular statistical machine translation (SMT). We report our procedure for crawling the website and the various pre-processing steps that were necessary to clean up the data after conversion from the original PDF files. Furthermore, we demonstrate the use of this dataset in training SMT models for English, French, German, Spanish, and Latvian.

Anthology ID: L14-1652
Volume: Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
Month: May
Year: 2014
Address: Reykjavik, Iceland
Editors: Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Hrafn Loftsson, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue: LREC
Publisher: European Language Resources Association (ELRA)
Pages: 1850–1855
URL: http://www.lrec-conf.org/proceedings/lrec2014/pdf/846_Paper.pdf
Cite (ACL): Raivis Skadiņš, Jörg Tiedemann, Roberts Rozis, and Daiga Deksne. 2014. Billions of Parallel Words for Free: Building and Using the EU Bookshop Corpus. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 1850–1855, Reykjavik, Iceland. European Language Resources Association (ELRA).
Cite (Informal): Billions of Parallel Words for Free: Building and Using the EU Bookshop Corpus (Skadiņš et al., LREC 2014)
PDF: http://www.lrec-conf.org/proceedings/lrec2014/pdf/846_Paper.pdf
