TweetMT: A Parallel Microblog Corpus

Iñaki San Vicente, Iñaki Alegría, Cristina España-Bonet, Pablo Gamallo, Hugo Gonçalo Oliveira, Eva Martínez Garcia, Antonio Toral, Arkaitz Zubiaga, Nora Aranberri


Abstract
We introduce TweetMT, a parallel corpus of tweets in four language pairs that combine five languages (Spanish from/to Basque, Catalan, Galician and Portuguese), all of which have official status in the Iberian Peninsula. The corpus was created by combining automatic collection and crowdsourcing, and it is publicly available. It is intended for the development and testing of microtext machine translation systems. In this paper, we describe the methodology followed to build the corpus and present the results of the shared task in which it was tested.
Anthology ID:
L16-1469
Volume:
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
Month:
May
Year:
2016
Address:
Portorož, Slovenia
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
2936–2941
URL:
https://aclanthology.org/L16-1469
Cite (ACL):
Iñaki San Vicente, Iñaki Alegría, Cristina España-Bonet, Pablo Gamallo, Hugo Gonçalo Oliveira, Eva Martínez Garcia, Antonio Toral, Arkaitz Zubiaga, and Nora Aranberri. 2016. TweetMT: A Parallel Microblog Corpus. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 2936–2941, Portorož, Slovenia. European Language Resources Association (ELRA).
Cite (Informal):
TweetMT: A Parallel Microblog Corpus (Vicente et al., LREC 2016)
PDF:
https://aclanthology.org/L16-1469.pdf