PaCo2: A Fully Automated tool for gathering Parallel Corpora from the Web

Iñaki San Vicente, Iker Manterola


Abstract
The importance of parallel corpora in the NLP field is fully acknowledged. This paper presents a tool that can build parallel corpora given just a seed word list and a pair of languages. Our approach is similar to others proposed in the literature, but introduces a new phase to the process. While most systems leave the task of finding websites containing parallel content up to the user, PaCo2 (Parallel Corpora Collector) takes care of that as well. The tool is language independent as far as possible, and adapting the system to work with new languages is fairly straightforward. The different modules have been evaluated for the Basque-Spanish, Spanish-English and Portuguese-English language pairs. Even though there is still room for improvement, the results are positive: the corpora created contain very good quality translation units, and the quality is maintained across the various language pairs. Details of the corpora created up until now are also provided.
Anthology ID:
L12-1085
Volume:
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
Month:
May
Year:
2012
Address:
Istanbul, Turkey
Editors:
Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Mehmet Uğur Doğan, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
1–6
URL:
http://www.lrec-conf.org/proceedings/lrec2012/pdf/231_Paper.pdf
Cite (ACL):
Iñaki San Vicente and Iker Manterola. 2012. PaCo2: A Fully Automated tool for gathering Parallel Corpora from the Web. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 1–6, Istanbul, Turkey. European Language Resources Association (ELRA).
Cite (Informal):
PaCo2: A Fully Automated tool for gathering Parallel Corpora from the Web (San Vicente & Manterola, LREC 2012)