Exploiting a Large Strongly Comparable Corpus

Thierry Etchegoyhen, Andoni Azpeitia, Naiara Pérez


Abstract
This article describes a large comparable corpus for Basque and Spanish and the methods employed to build a parallel resource from the original data. The EITB corpus, a strongly comparable corpus in the news domain, is to be shared with the research community as an aid for the development and testing of methods in comparable corpora exploitation, and as a basis for the improvement of data-driven machine translation systems for this language pair. Competing approaches were explored for the alignment of comparable segments in the corpus, resulting in the design of a simple method which outperformed a state-of-the-art method on the corpus test sets. The method we present is highly portable, computationally efficient, and significantly reduces deployment work, a welcome result for the exploitation of comparable corpora.
Anthology ID:
L16-1560
Volume:
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
Month:
May
Year:
2016
Address:
Portorož, Slovenia
Editors:
Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Sara Goggi, Marko Grobelnik, Bente Maegaard, Joseph Mariani, Helene Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
3523–3529
URL:
https://aclanthology.org/L16-1560
Cite (ACL):
Thierry Etchegoyhen, Andoni Azpeitia, and Naiara Pérez. 2016. Exploiting a Large Strongly Comparable Corpus. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 3523–3529, Portorož, Slovenia. European Language Resources Association (ELRA).
Cite (Informal):
Exploiting a Large Strongly Comparable Corpus (Etchegoyhen et al., LREC 2016)
PDF:
https://aclanthology.org/L16-1560.pdf