A Parallel Corpus Mixtec-Spanish

Cynthia Montaño; Gerardo Sierra; Gemma Bel-Enguix; Helena Gomez

Abstract

This work describes the compilation of a Spanish-Mixtec parallel corpus. Few Spanish-Mixtec parallel texts exist, and most of the sources are non-digital books. As a result, we must deal with errors introduced when digitizing the sources, with difficulties in sentence alignment, and with the lack of a standard orthography. Our parallel corpus consists of sixty texts drawn from books and digital repositories. These documents belong to different domains: history, traditional stories, didactic material, recipes, ethnographic descriptions of each town, and instruction manuals for disease prevention. We have classified this material into five major categories: didactic (6 texts), educative (6 texts), interpretative (7 texts), narrative (39 texts), and poetic (2 texts). The final token count is 49,814 Spanish words and 47,774 Mixtec words. The texts come from the states of Oaxaca (48 texts), Guerrero (9 texts), and Puebla (3 texts). These figures show that the corpus is unbalanced with respect to the representation of the different territories. While 55% of speakers are in Oaxaca, 80% of the texts come from this state. Guerrero has 30% of the speakers and 15% of the texts, and Puebla, with 15% of the speakers, accounts for 5% of the corpus.
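The imbalance reported in the abstract can be checked with a few lines of arithmetic. The sketch below is only illustrative: the text counts and speaker percentages are copied from the abstract, while the variable and function names (`texts_by_state`, `speaker_share`, `text_share`, `representation_gap`) are hypothetical and not part of the authors' work.

```python
# Figures copied from the abstract; all names here are illustrative assumptions.
texts_by_state = {"Oaxaca": 48, "Guerrero": 9, "Puebla": 3}
speaker_share = {"Oaxaca": 55, "Guerrero": 30, "Puebla": 15}  # percent of speakers
texts_by_category = {
    "didactic": 6,
    "educative": 6,
    "interpretative": 7,
    "narrative": 39,
    "poetic": 2,
}

total_texts = sum(texts_by_state.values())


def text_share(state: str) -> float:
    """Percentage of corpus texts that come from the given state."""
    return 100 * texts_by_state[state] / total_texts


def representation_gap(state: str) -> float:
    """Text share minus speaker share, in percentage points.

    Positive values mean the state is over-represented in the corpus.
    """
    return text_share(state) - speaker_share[state]


if __name__ == "__main__":
    for state in texts_by_state:
        print(
            f"{state}: {text_share(state):.0f}% of texts, "
            f"{speaker_share[state]}% of speakers, "
            f"gap {representation_gap(state):+.0f} points"
        )
```

Under these figures, Oaxaca is over-represented by 25 percentage points, while Guerrero and Puebla are under-represented by 15 and 10 points respectively. The category counts and the state counts both sum to the sixty texts stated in the abstract.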

Anthology ID: W19-3650
Volume: Proceedings of the 2019 Workshop on Widening NLP
Month: August
Year: 2019
Address: Florence, Italy
Editors: Amittai Axelrod, Diyi Yang, Rossana Cunha, Samira Shaikh, Zeerak Waseem
Venue: WiNLP
Publisher: Association for Computational Linguistics
Pages: 157–159
URL: https://aclanthology.org/W19-3650/
Cite (ACL): Cynthia Montaño, Gerardo Sierra Martínez, Gemma Bel-Enguix, and Helena Gomez. 2019. A Parallel Corpus Mixtec-Spanish. In Proceedings of the 2019 Workshop on Widening NLP, pages 157–159, Florence, Italy. Association for Computational Linguistics.
Cite (Informal): A Parallel Corpus Mixtec-Spanish (Montaño et al., WiNLP 2019)
