Parallel Corpus for Indigenous Language Translation: Spanish-Mazatec and Spanish-Mixtec

In this paper, we present a parallel Spanish-Mazatec and Spanish-Mixtec corpus for machine translation (MT) tasks, where Mazatec and Mixtec are two indigenous Mexican languages. We evaluated the usability of the collected corpus using three different approaches: transformer, transfer learning, and fine-tuning pre-trained multilingual MT models. Fine-tuning the Facebook M2M100-48 model outperformed the other approaches, with BLEU scores of 12.09 and 22.25 for Mazatec-Spanish and Spanish-Mazatec translations, respectively, and 16.75 and 22.15 for Mixtec-Spanish and Spanish-Mixtec translations, respectively. The results indicate that translation performance is influenced by the dataset size (9,799 sentences in Mazatec and 13,235 sentences in Mixtec) and is more effective when indigenous languages are used as target languages. The findings emphasize the importance of creating parallel corpora for indigenous languages and fine-tuning models for low-resource translation tasks. Future research will investigate zero-shot and few-shot learning approaches to further improve translation performance in low-resource settings.


Introduction
Natural Language Processing (NLP), a sub-field of Artificial Intelligence (AI), has been attracting considerable research and development attention as a result of the surge in the number of applications it has across a variety of industries (Kalyanathaya et al., 2019). Machine Translation (MT), Sentiment or Opinion Analysis, POS Tagging, Question Classification (QC) and Answering (QA), Chunking, Named Entity Recognition (NER), Emotion Detection, and Semantic Role Labeling are currently highly researched areas in various high-resource languages (Tonja et al., 2023a).
The domain of machine translation (MT) is advancing at a rapid pace due to the growing prevalence of computational tasks and the expanding global reach of the Internet, which caters to diverse, multilingual communities (Kenny, 2018). MT systems have demonstrated remarkable translation outcomes for language pairs that possess abundant resources, such as English-Spanish, English-French, English-Russian, and English-Portuguese. However, in scenarios with limited or no resources, MT systems encounter difficulties due to the primary obstacle of inadequate training data for certain languages (Mager et al., 2018; Tonja et al., 2021, 2022, 2023b).
Low-resource languages have been suffering from a lack of new language technology designs. When resources are limited and only a small amount of unlabeled data is available, it is very hard to reach a true breakthrough in creating powerful novel methods for language applications (Tonja et al., 2022); the problem becomes worse when no parallel dataset exists for a language.
Mexico is a multicultural and multilingual country with 68 officially recognized indigenous languages, 238 variants, and Spanish, a widely used language spoken by 90 percent of the population (Mager et al., 2021). Few language technologies have been developed for indigenous languages spoken in Northern and Southern America; moreover, many indigenous languages spoken in the Americas face a risk of extinction (Mager et al., 2018).
Indigenous language speakers often experience feelings of shame or reluctance to use their native languages, primarily due to limited opportunities for application in the presence of pervasive, dominant majority languages (Hornberger, 2008; Skutnabb-Kangas, 2000). This phenomenon can be attributed to social and cultural pressures that prioritize the use of majority languages over minority languages, thereby marginalizing indigenous linguistic communities and undermining the value of their linguistic heritage (Hinton, 2011).
In this paper, we introduce the first parallel corpus for machine translation tasks for two indigenous languages spoken in Mexico, along with benchmark experimental results. The contributions of our work are the following:
• We introduce the first parallel corpus for machine translation for the Mazatec and Mixtec languages.
• We evaluate the usability of the collected corpus and present benchmark results using transformer, transfer learning, and fine-tuning approaches.
• We open-source the parallel corpus and the scripts used in this paper.
The rest of the paper is organized as follows: Section 2 describes previous research related to this study, Section 3 describes the properties of the Mazatec and Mixtec languages, Section 4 describes the statistics of the collected dataset, Section 5 describes the models used for baseline experiments and their results, and Section 6 concludes the paper.

Related work
Due to the enormous increase in available data for different languages, machine translation is currently one of the most researched areas in NLP and has shown promising results for high-resource languages (Tonja et al., 2022). Among the MT approaches used by different researchers, neural machine translation (NMT) is one of the current state-of-the-art approaches, trained on huge datasets containing sentences in a source language and their equivalent target-language translations (Belay et al., 2022). NMT takes advantage of huge translation memories with hundreds of thousands or even millions of translation units (Forcada, 2017). However, NMT for low-resource languages still under-performs due to the scarcity of parallel datasets (Tonja et al., 2022, 2023b). Many researchers have explored different approaches to solving low-resource machine translation problems. Zoph et al. (2016) proposed a transfer learning method to improve the MT performance of low-resource languages: the authors first train a high-resource language pair (the parent model), then transfer some of the learned parameters to the low-resource pair (the child model) to initialize and constrain training. The data augmentation approach proposed by Fadaee et al. (2017) targets low-frequency words by generating new sentence pairs containing rare words in new, synthetically created contexts. Pourdamghani and Knight (2019) proposed using high-resource language resources to improve MT performance for low-resource languages without requiring any parallel data. Copying monolingual data of the target language was proposed by Currey et al. (2017) to improve the performance of low-resource MT. Tonja et al. (2023b) proposed the use of source-side monolingual data as another way of improving low-resource MT performance. A transfer learning method in which one first trains a "parent" model for a high-resource language pair and then continues training on a low-resource pair only by replacing the training corpus was proposed by Kocmi and Bojar (2018). Mixing low-resource language resources during training, as proposed by Tonja et al. (2022), showed an improvement in MT performance for low-resource languages.
There have been promising research works done for indigenous languages. Feldman and Coto-Solano (2020) presented an NMT model and a dataset for the Bribri Chibchan language for Bribri-Spanish translation. Kann et al. (2022) compiled AmericasNLI, a natural language inference dataset covering 10 indigenous languages of the Americas; they conducted experiments with pre-trained models, exploring zero-shot learning in combination with model adaptation. Oncevay (2021) proposed the first multilingual translation models for four languages spoken in Peru (Aymara, Ashaninka, Quechua, and Shipibo-Konibo), providing both many-to-Spanish and Spanish-to-many models that outperformed pairwise baselines. Zheng et al. (2021) presented a low-resource MT system that improves translation accuracy using cross-lingual language model pre-training. The authors used an mBART implementation of fairseq to pre-train on a large set of monolingual data from a diverse set of high-resource languages before fine-tuning on 10 low-resource indigenous American languages: Aymara, Bribri, Asháninka, Guaraní, Wixarika, Náhuatl, Hñähñu, Quechua, Shipibo-Konibo, and Rarámuri. On average, their proposed system achieved BLEU scores that were 1.64 higher and chrF scores that were 0.0749 higher than the baseline. Nagoudi et al. (2021) introduced IndT5, the first Transformer language model for 10 indigenous American languages: Aymara, Bribri, Asháninka, Guaraní, Wixarika, Náhuatl, Hñähñu, Quechua, Shipibo-Konibo, and Rarámuri. To train IndT5, they built IndCorpus, a new dataset for the ten indigenous languages and Spanish.

Mazatec
The Mazatec language comprises a collection of closely related indigenous languages spoken primarily in the Northern region of Oaxaca, with smaller populations in the adjacent states of Puebla and Veracruz in Mexico. Approximately 200,000 individuals speak Mazatec; however, this number may fluctuate depending on which particular dialects or linguistic variations are taken into account (Léonard et al., 2019).
Word order -Typically, Mazatec exhibits a VSO (Verb-Subject-Object) word order; however, alternative structures such as SVO can also occur depending on the sentence, the focus of the statement, and the context.
Example sentence: Kitsaara kji xi makjíñeni kua apana (I gave a pill for the headache to my father) -VSO order

Mixtec
The Mixtec language comprises a group of closely related indigenous languages predominantly spoken in the region known as La Mixteca, which spans the states of Oaxaca, Puebla, and Guerrero in Southern Mexico. Estimates indicate that there are approximately 500,000 speakers of Mixtec; however, this number may fluctuate depending on the specific dialects or language varieties considered (Josserand, 1983).
Here are a couple of example sentences:
• Ka'nu ña'a nuu ntaa (Sitting on the plain) - VSO order
• Ña'a nuu ntaa ka'nu (On the plain, sitting) - SVO order
Note that the Mixtec language has many dialects, so the phonetic inventory, numerals, word order, and example sentences provided here may vary across different Mixtec-speaking communities. The examples given here are intended to provide a general overview of the language's features.

Parallel Dataset
Data is one of the crucial building blocks of any NLP application (Belay et al., 2022; Tonja et al., 2023a), and a parallel corpus is essential to the success of any machine translation task. For Mazatec and Mixtec, we were unable to find publicly available datasets for the MT task. We collected datasets for these two indigenous Mexican languages from two main domains: religious texts and the constitution. We also collected additional resources for the Mixtec language from different textbooks that have Spanish translations. Table 1 shows the statistics of the collected parallel corpus for Mazatec and Mixtec.
Text Alignment - We took a base directory path where the text files were stored as input, then read and merged the content of all text files in the directory, obtaining a list of lists containing the content of each file. We proceeded to iterate through each file in the directory and read its contents line by line. Each line was normalized using Unicode Normalization Form KC (NFKC) before being appended to the resulting list. We added a function that takes a language code lang as input, which determines the filename of the text file to be read from a predefined folder. The function read the file line by line, normalized each line using NFKC, and concatenated the lines into a single string; the result was returned as an array.
Another function took two lists as input: one containing the content of the files to be aligned, and the other containing the filenames for the output files. We then iterated through the content list and aligned the text by iterating through the chapters and paragraphs of each translation. The aligned text was written to the corresponding output file as tab-separated values (TSV). We then defined the root path where the input files were located, initialized the name and content arrays, and called the function that populated the content array with the pre-processed text. Finally, the function that writes the files was called to align and write the output files.
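The reading, normalization, and TSV-writing steps above can be sketched as follows. This is a minimal illustration, not the authors' exact script; file names and directory layout are assumptions.

```python
import unicodedata
from pathlib import Path

def read_normalized(path):
    """Read a text file and NFKC-normalize each line, as described above."""
    with open(path, encoding="utf-8") as f:
        return [unicodedata.normalize("NFKC", line.strip()) for line in f]

def align_to_tsv(src_lines, tgt_lines, out_path):
    """Write corresponding source/target lines as tab-separated values,
    skipping pairs where either side is empty."""
    with open(out_path, "w", encoding="utf-8") as out:
        for src, tgt in zip(src_lines, tgt_lines):
            if src and tgt:
                out.write(f"{src}\t{tgt}\n")
```

NFKC normalization is important here because the same accented or tonal character can be encoded in multiple Unicode forms; normalizing both sides avoids spurious mismatches during alignment.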
Pre-processing - After aligning the texts of the two indigenous languages with their equivalent Spanish translations, we pre-processed the corpus before splitting it for our experiments. The pre-processing steps included removing numbers and special-character symbols such as ';', '"', '?', etc. For the baseline experiment, we split the pre-processed corpus into training, development, and test sets in a 70:10:20 ratio. Table 2 shows the dataset split used for our experiments.
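A sketch of this cleaning and 70:10:20 split is shown below. The exact character set removed and the shuffling seed are illustrative assumptions, not the authors' exact configuration.

```python
import random
import re

# Hypothetical character class covering digits and special symbols
# such as ; , " ? mentioned in the pre-processing description.
_SPECIALS = re.compile(r'[0-9;,"?¿!¡:()\[\]]')

def preprocess(pairs):
    """Remove numbers and special-character symbols from both sides of
    each sentence pair, dropping pairs that become empty."""
    cleaned = []
    for src, tgt in pairs:
        src = _SPECIALS.sub("", src).strip()
        tgt = _SPECIALS.sub("", tgt).strip()
        if src and tgt:
            cleaned.append((src, tgt))
    return cleaned

def split_70_10_20(pairs, seed=42):
    """Shuffle and split sentence pairs into train/dev/test (70:10:20)."""
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)
    n = len(pairs)
    n_train, n_dev = int(0.7 * n), int(0.1 * n)
    return (pairs[:n_train],
            pairs[n_train:n_train + n_dev],
            pairs[n_train + n_dev:])
```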

Baseline Experiment and Discussion
In this section, we discuss the models used for the baseline experiments, the hyper-parameters used, the benchmark results, and the discussion. We used three approaches to evaluate the usability of the collected corpus. These are:
• Transformer - a type of neural network architecture first introduced in the paper Attention Is All You Need (Vaswani et al., 2017).
The key innovation of the Transformer architecture is the attention mechanism, which allows the network to selectively focus on different parts of the input sequence when making predictions. This is in contrast to traditional recurrent neural networks (RNNs), which process input sequentially and are prone to the vanishing gradient problem.
In the transformer architecture, the input sequence is processed in parallel by multiple layers of self-attention and feed-forward neural networks. Each layer can be thought of as a "block" that takes the output of the previous layer as input and applies its own set of transformations to it. The self-attention mechanism allows the network to weigh the importance of each element in the input sequence when making predictions, while the feed-forward networks help to capture nonlinear relationships between the elements.
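The weighting described above is the scaled dot-product attention of Vaswani et al. (2017), which can be sketched in a few lines of NumPy; this is a single-head illustration without masking or learned projections.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V, the core operation of
    self-attention (Vaswani et al., 2017). Returns the output and the
    attention weights so the per-position weighting can be inspected."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # relevance of every key to every query
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over key positions
    return weights @ V, weights                     # weighted sum of value vectors
```

Each row of `weights` sums to one, so the output for each position is a convex combination of the value vectors, which is what lets the model "focus" on different parts of the input.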
Currently, transformers are the state-of-the-art approach and are widely used in NLP tasks such as MT, text summarization, sentiment analysis, etc. We used the base transformer configuration as described by Vaswani et al. (2017).
• Transfer learning - refers to the process of leveraging pre-trained language models to improve the performance of downstream NLP tasks. Specifically, transfer learning involves using a pre-trained model to initialize the parameters of an MT system and then fine-tuning the system on a smaller dataset specific to the target language pair or domain.
Transfer learning can be especially useful in MT because training a high-quality MT system from scratch requires a large amount of data and computational resources, which may not be available for all language pairs or domains. By leveraging pre-trained models, transfer learning allows MT systems to achieve high performance with fewer data and fewer resources. For our baseline experiments, we used English-Spanish as the parent language pair, with two pre-trained models (opus-mt-es-en and opus-mt-tc-big-en-es) available from Hugging Face, trained for English-Spanish on the OPUS dataset (Tiedemann and Thottingal, 2020) by the Helsinki-NLP group.
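The parent-to-child initialization underlying this approach (Zoph et al., 2016) can be illustrated schematically: parameters that are shared between the two models (e.g. via a joint subword vocabulary) are copied from the trained parent, while the rest are freshly initialized. The parameter names below are hypothetical, not those of the OPUS models.

```python
import numpy as np

def init_child_from_parent(parent_params, child_shapes, seed=0):
    """Initialize a child (low-resource) model from a trained parent:
    copy every parameter whose name and shape match the parent's, and
    randomly initialize the rest (e.g. embedding rows for vocabulary
    items the parent never saw)."""
    rng = np.random.default_rng(seed)
    child = {}
    for name, shape in child_shapes.items():
        if name in parent_params and parent_params[name].shape == shape:
            child[name] = parent_params[name].copy()       # transferred from parent
        else:
            child[name] = rng.normal(0.0, 0.02, size=shape)  # re-initialized
    return child
```

Training then continues on the low-resource pair from this initialization, which constrains the child toward the parent's learned representations instead of starting from scratch.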
• Fine-tuning - the process of taking a pre-trained MT model and adapting it to a specific translation task, such as translating between a particular language pair or within a specific domain. Fine-tuning involves taking the pre-trained model, which has already learned representations of words and phrases from a large corpus of text, and training it on a smaller dataset of task-specific examples. This updates the parameters of the pre-trained model to better capture the patterns and structures present in the target translation task.
Fine-tuning can be useful in MT because it allows the pre-trained model to quickly adapt to a new task without having to train a new model from scratch. This is especially beneficial when working with limited data or when there is a need to quickly adapt to changing translation requirements. We used two commonly known pre-trained multilingual MT models:
- M2M100-48 - a multilingual encoder-decoder (seq-to-seq) model trained for many-to-many multilingual translation (Fan et al., 2020). We used the model with 48M parameters due to computing resource limitations.
- mBART50 - a multilingual sequence-to-sequence model pre-trained using the multilingual denoising pre-training objective (Tang et al., 2020).
Hyper-parameters - For the transformer approach, we tokenized the source and target parallel sentences into subword tokens using Byte Pair Encoding (BPE) (Gage, 1994). The BPE representation was chosen in order to remove vocabulary overlap during dataset combinations. For the other approaches, we applied each model's own tokenizer. Table 3 shows the hyper-parameters used in our baseline experiments.
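The BPE merge-learning step can be sketched in pure Python, following the standard adaptation of Gage (1994) for subword tokenization: starting from characters, repeatedly merge the most frequent adjacent symbol pair. This is an illustrative sketch, not the tool used in the paper.

```python
from collections import Counter

def learn_bpe_merges(word_freqs, num_merges):
    """Learn BPE merge operations from a {word: frequency} dict: each word
    starts as a character sequence plus an end-of-word marker, and the most
    frequent adjacent symbol pair is merged repeatedly."""
    vocab = {tuple(w) + ("</w>",): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the winning merge to every word in the vocabulary.
        merged = {}
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] = freq
        vocab = merged
    return merges
```

Because the merges are learned jointly over the source and target sides, frequent subwords are shared across languages, which is what makes BPE useful when combining datasets.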

Results
Table 4 and Figure 1 show the benchmark experimental results for bi-directional neural machine translation for Mazatec (maq)-Spanish (spa) and Mixtec (xtn)-Spanish (spa). In our baseline experiments, we observed that employing a transformer model for low-resource languages yields sub-optimal results compared to transfer learning and fine-tuning methodologies. As demonstrated in Table 4 and Figure 1, the performance of the transformer was inferior to the alternative approaches utilized in the study. This finding substantiates the hypothesis that the efficacy of transformer models is heavily reliant on the availability of extensive parallel corpora.
The transfer learning approach showed more promising results for the indigenous low-resource languages than the transformer approach. Of the two models used in the transfer learning experiment, the model with the transformer-big configuration outperformed the model with the transformer-base configuration, which shows that the transfer learning approach depends on the size of the model. Similarly, when applying the transfer learning approach to indigenous low-resource languages by utilizing models trained on high-resource languages, better results were obtained when Spanish was used as the source language than when Spanish was used as the target language.
The fine-tuning approach outperformed the rest of the approaches in our baseline experiment in both translation directions, which shows that fine-tuning a multilingual pre-trained translation model for low-resource languages outperforms the other models. Of the two multilingual models used in the experiment, the M2M100-48 model outperformed the mBART50 multilingual model: it showed BLEU score improvements of 4.7 and 5.5 on average for Mazatec (maq)-Spanish (spa) and Spanish (spa)-Mazatec (maq) translation, and of 10.2 and 7.5 on average for Mixtec (xtn)-Spanish (spa) and Spanish (spa)-Mixtec (xtn) translation, when compared to the other models used in the experiments. When comparing the results of the two languages across all approaches, Mixtec (xtn)-Spanish (spa) translation performed better than Mazatec (maq)-Spanish (spa) translation when Spanish was used as the target language, which shows that the availability of parallel corpora for a language pair has a high impact on the performance of translation models. The overall results show that fine-tuning multilingual MT models on our selected indigenous low-resource languages gives promising results.

Discussion
In our analysis, we conducted an error analysis to identify the strengths and weaknesses of the three approaches: transformer, transfer learning, and fine-tuning. We found that the transformer approach, which relies on large parallel corpora, yielded sub-optimal results for low-resource languages. It struggled to capture the linguistic patterns and structures specific to indigenous languages. This limitation indicates that the transformer model's performance is highly dependent on the availability of extensive parallel corpora for effective machine translation.
On the other hand, the transfer learning approach showed more promising results for low-resource indigenous languages. We observed that models pre-trained on high-resource languages, such as Spanish, and fine-tuned on the indigenous languages improved translation quality. However, even with transfer learning, the performance was not satisfactory, and some errors persisted across all three approaches.
The general error that all three approaches failed to address adequately was the translation of domain-specific and culturally specific terms in Mazatec and Mixtec.These languages have unique vocabulary and cultural nuances that require a deeper understanding and context to ensure accurate translation.The limited availability of domain-specific parallel corpora for these languages hampered the models' ability to capture and translate such terms effectively.

Conclusion
In this paper, we presented a parallel corpus for two indigenous Mexican languages, Mazatec (maq) and Mixtec (xtn), for machine translation tasks and evaluated the usability of the collected corpus using three different approaches. Among these approaches, fine-tuning multilingual pre-trained MT models outperformed the rest; Facebook's M2M100-48 outperformed all other models, with BLEU scores of 12.09 and 22.25 for maq-spa and spa-maq, respectively, and 16.75 and 22.15 for xtn-spa and spa-xtn, respectively. We noticed from the experimental results that dataset size has less impact when indigenous languages are used as the target than as the source. This observation highlights the potential benefits of developing and fine-tuning models specifically designed for translation tasks involving low-resource languages. Moreover, it underscores the value of creating and employing parallel corpora tailored to indigenous languages, as these resources can significantly improve machine translation performance, particularly when used in conjunction with advanced multilingual pre-trained models.
Our BLEU results for Mazatec-Spanish and Mixtec-Spanish translation were too low, even in the best configuration, to be usable in real-life applications, but translation in the opposite direction achieved BLEU scores above 22, enabling uses such as government web pages and apps that present hints to Mixtec and Mazatec native speakers with a low level of Spanish comprehension. This could significantly increase the usefulness of the speakers' native languages, thus promoting the use of these languages and their preservation.
In future research, we plan to investigate the efficacy of advanced techniques, including zero-shot and few-shot learning, for low-resource languages in the context of limited parallel datasets.These methodologies hold promise for effectively leveraging sparse data available in low-resource settings, as they capitalize on pre-existing knowledge from related tasks or languages without requiring extensive fine-tuning or additional annotated data.By exploring these approaches, we aim to uncover potential benefits and improvements in the machine translation performance of low-resource languages, thus contributing to developing more robust and accurate translation systems for underrepresented linguistic communities.

Figure 1: Benchmark results of selected approaches

Table 1: Parallel dataset distribution of Mazatec-Spanish and Mixtec-Spanish

Table 2: Dataset split used in baseline experiments