Development of a Guarani - Spanish Parallel Corpus

Luis Chiruzzo, Pedro Amarilla, Adolfo Ríos, Gustavo Giménez Lugo


Abstract
This paper presents the development of a Guarani - Spanish parallel corpus with sentence-level alignment. The Guarani sentences in the corpus are written in Jopara, the dialect of Guarani spoken in Paraguay, which is based on Guarani grammar and may include Spanish loanwords or neologisms. The corpus contains around 14,500 sentence pairs extracted from web sources and aligned using a semi-automatic process, with 228,000 Guarani tokens and 336,000 Spanish tokens.
Anthology ID:
2020.lrec-1.320
Volume:
Proceedings of the Twelfth Language Resources and Evaluation Conference
Month:
May
Year:
2020
Address:
Marseille, France
Editors:
Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
2629–2633
Language:
English
URL:
https://aclanthology.org/2020.lrec-1.320
Cite (ACL):
Luis Chiruzzo, Pedro Amarilla, Adolfo Ríos, and Gustavo Giménez Lugo. 2020. Development of a Guarani - Spanish Parallel Corpus. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 2629–2633, Marseille, France. European Language Resources Association.
Cite (Informal):
Development of a Guarani - Spanish Parallel Corpus (Chiruzzo et al., LREC 2020)
PDF:
https://aclanthology.org/2020.lrec-1.320.pdf