Enhancing Translation for Indigenous Languages: Experiments with Multilingual Models

This paper describes CIC NLP’s submission to the AmericasNLP 2023 Shared Task on machine translation systems for indigenous languages of the Americas. We present the system descriptions for three methods. We used two multilingual models, M2M-100 and mBART50, and one bilingual (one-to-one) model, the Helsinki-NLP Spanish-English translation model, and experimented with different transfer learning setups. We experimented with eleven indigenous languages of the Americas and report the setups we used as well as the results we achieved. Overall, the mBART50 setup was able to improve upon the baseline for three of the eleven languages.


Introduction
While machine translation systems have shown commendable performance in recent years, their performance still lags for low-resource languages (Hadgu et al., 2022; Tonja et al., 2023). Since low-resource languages suffer from a lack of sufficient data (Siddhant et al., 2022; Haddow et al., 2022), most models and methods developed for high-resource languages do not work well in low-resource settings. Additionally, low-resource languages are linguistically diverse and have properties that diverge from the mainstream languages of NLP studies (Zheng et al., 2021).
Though low-resource languages lack sufficient data to train large models, some such languages still have a large number of native speakers (Zheng et al., 2021). While the availability of language technologies such as machine translation systems can be helpful for such linguistic communities, they could also bring harm and exposure to exploitation (Hovy and Spruit, 2016). Borrowing from human-computer interaction (HCI) studies (Schneider et al., 2018), we want to acknowledge our belief that low-resource language speakers should be empowered to create technologies that benefit their communities. Many indigenous communities have community-rooted efforts for preserving their languages and building language technologies for their communities, and we hope that methods from shared tasks like this one will contribute to their efforts.
Improving machine translation systems for low-resource languages is an active research area, and different approaches (Zoph et al., 2016; Karakanta et al., 2018; Ortega et al., 2020a; Goyal et al., 2020; Tonja et al., 2022; Imankulova et al., 2017) have been proposed to improve the performance of systems geared toward low-resource languages. We participated in the AmericasNLP 2023 Shared Task in hopes of contributing new approaches for low-resource machine translation that are likely to be helpful for community members interested in developing and adapting these technologies for their languages.
In recent years, large pre-trained models have been used for downstream NLP tasks, including machine translation (Brants et al., 2007), because they achieve higher performance on downstream tasks than traditional approaches (Han et al., 2021). One trend is to take these pre-trained models and fine-tune them on smaller datasets for specific tasks (Sun et al., 2019). This method has shown promising results in downstream NLP tasks for languages with low or limited resources (Tars et al., 2022; Zhao and Zhang, 2022). In our experiments, we used multilingual and bilingual models and employed different fine-tuning strategies for the eleven languages in the 2023 Shared Task (Ebrahimi et al., 2023).
In this paper, we describe the system setups we used and the results we obtained from our experiments. One of our systems improves upon the baseline for three languages. We also reflect on the setups we experimented with but ended up not submitting, in hopes that future work could improve upon them.

Languages and Datasets
In this section, we present the languages and datasets used in our shared task submission. Table 1 provides an overview of the languages, their linguistic families, and the number of parallel sentences.
Aymara Aymara is an Aymaran language spoken by the Aymara people of the Bolivian Andes. It is one of only a handful of Native American languages with over one million speakers (Homola, 2012). Aymara, along with Spanish and Quechua, is an official language in Bolivia and Peru. The Aymara-Spanish data come from Global Voices (Tiedemann, 2012).

Bribri The Bribri language is spoken in Southern Costa Rica. Bribri has two major orthographies, Jara and Constenla, and the writing is not standardized, which results in spelling variations across documents. In this case, the sentences use an intermediate representation to unify the existing orthographies. The Bribri-Spanish data (Feldman and Coto-Solano, 2020) came from six different sources.
Asháninka Asháninka is an Arawakan language spoken by the Asháninka people of Peru and Acre, Brazil. It is primarily spoken in the Satipo Province located in the Amazon forest. The parallel data for Asháninka-Spanish come mainly from three sources (Cushimariano Romano and Sebastián Q., 2008; Ortega et al., 2020b; Mihas, 2011) and translations by Richard Castro.
Chatino Chatino is a group of indigenous Mesoamerican languages. These languages are a branch of the Zapotecan family within the Oto-Manguean language family. They are natively spoken by 45,000 Chatino people (Cruz and Woodbury, 2006) whose communities are located in the southern portion of the Mexican state of Oaxaca. The parallel data for Chatino-Spanish are publicly available.
Guarani Guarani is a South American language that belongs to the Tupi-Guarani family (Britton, 2005) of the Tupian languages. It is one of the official languages of Paraguay (along with Spanish), where it is spoken by the majority of the population and where half of the rural population are monolingual speakers of the language (Mortimer, 2006).
Wixarika Wixarika is an indigenous language of Mexico that belongs to the Uto-Aztecan language family (de la Federación, 2003). It is spoken by the ethnic group widely known as the Huichol (self-designation Wixaritari), whose mountainous territory extends over portions of the Mexican states of Jalisco, San Luis Potosí, Nayarit, Zacatecas, and Durango, but mostly Jalisco, as well as by communities in the United States (La Habra, California, and Houston, Texas).
Nahuatl Nahuatl is a Uto-Aztecan language and was spoken by the Aztec and Toltec civilizations of Mexico. The Nahuatl language has no standard orthography and has wide dialectal variations (Zheng et al., 2021).
Hñähñu Hñähñu, also known as Otomí, belongs to the Oto-Pamean family and has been spoken in central Mexico for many centuries (Lastra, 2001). Otomí is a tonal language with Subject-Verb-Object (SVO) word order (Ebrahimi et al., 2022). It is spoken in several states across Mexico.
Quechua The Quechua-Spanish data (Agić and Vulić, 2019; Tiedemann, 2012) have three different sources: the Jehovah's Witnesses texts, the Peruvian Ministry of Education, and dictionary entries and samples collected by Diego Huarcaya. The Quechua language, also known as Runasimi, is spoken in Peru and is the most widely spoken pre-Columbian language family of the Americas (Ebrahimi et al., 2022).
Rarámuri Rarámuri, also known as Tarahumara, is a Uto-Aztecan language spoken in Northern Mexico (Caballero, 2017). Rarámuri is a polysynthetic and agglutinative language spoken mainly in the Sierra Madre Occidental region of Mexico (Ebrahimi et al., 2022).

Models
We experimented with two multilingual and one bilingual translation model with different transfer learning setups. We used M2M-100 and mBART50 for the multilingual experiments and the Helsinki-NLP Spanish-English model for the bilingual experiment. Figure 1 shows the models used in these experiments.

Bilingual models
For the bilingual model, as shown in Figure 1a, we use a publicly available Spanish-English pre-trained model from Hugging Face trained by Helsinki-NLP. The pre-trained MT models released by Helsinki-NLP are trained on OPUS, an open-source parallel corpus covering 500 languages (Tiedemann and Thottingal, 2020; Tiedemann, 2020). This model is trained using the Marian NMT framework (Junczys-Dowmunt et al., 2018). Each model has six self-attention layers in both the encoder and the decoder, and each layer has eight attention heads.
We used this model with the intention that a model trained on high-resource languages would improve translation performance for low-resource indigenous languages. We fine-tuned the Spanish-English model for each of the Spanish-to-Indigenous language pairs.
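As a rough illustration of this setup, the sketch below loads the publicly available Helsinki-NLP Spanish-English Marian checkpoint (presumably Helsinki-NLP/opus-mt-es-en) from Hugging Face and fine-tunes it on a Spanish-Indigenous parallel corpus with the transformers Seq2SeqTrainer. The toy sentences, in-memory dataset, and hyperparameters are illustrative assumptions, not necessarily the exact configuration used in our experiments.

```python
# Minimal sketch (not the released training code): fine-tune the Helsinki-NLP
# Spanish-English Marian model on one Spanish-Indigenous language pair.
from transformers import (MarianMTModel, MarianTokenizer, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments, DataCollatorForSeq2Seq)
from datasets import Dataset

model_name = "Helsinki-NLP/opus-mt-es-en"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# Hypothetical parallel data: Spanish source, indigenous-language target.
pairs = {"es": ["buenos días", "gracias"],
         "tgt": ["<target sentence 1>", "<target sentence 2>"]}
ds = Dataset.from_dict(pairs)

def preprocess(batch):
    # Tokenize source with the Spanish subword model and targets with the
    # (reused) English-side subword model of the pre-trained checkpoint.
    model_inputs = tokenizer(batch["es"], truncation=True, max_length=128)
    labels = tokenizer(text_target=batch["tgt"], truncation=True, max_length=128)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

ds = ds.map(preprocess, batched=True, remove_columns=["es", "tgt"])

args = Seq2SeqTrainingArguments(
    output_dir="opus-mt-es-indigenous",
    num_train_epochs=5,               # illustrative; epochs for this model are not reported
    per_device_train_batch_size=8,
    learning_rate=2e-5,
)
trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=ds,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```

Reusing the English-side vocabulary for the indigenous target text is a simplification of this transfer setup; unseen subwords simply fall back to the pre-trained model's existing segmentation.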

Multilingual models
For the multilingual models, we used the many-to-many multilingual translation model that can translate directly between any pair of 100 languages (M2M-100) (Fan et al., 2021) with 48M parameters, and mBART50 (Tang et al., 2020), a sequence-to-sequence denoising autoencoder pre-trained on large-scale monolingual corpora in 50 languages. We fine-tuned the multilingual models in two ways:
1. We fine-tuned both multilingual models on each Spanish-Indigenous language pair for 5 epochs and evaluated their performance on the development data before training the final submission system. As shown in Figure 1b, for the final system we only fine-tuned mBART50 on the Spanish-Indigenous data, based on its development set performance.
2. We first fine-tuned the multilingual models on a Spanish-All dataset (a mixture of all indigenous language data) to produce an intermediate model, and then fine-tuned the intermediate model on each of the Spanish-Indigenous language pairs, as shown in Figure 1c. For this experiment, we combined the training data of all language pairs to form a Spanish-All parallel corpus, fine-tuned m2m100-48 on this combined dataset for five epochs, and saved the resulting model, referred to here as m2m100-48-inter. We then fine-tuned m2m100-48-inter again on each Spanish-Indigenous language pair for another 5 epochs and evaluated the performance on the development set before training the final submission system.
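The sketch below illustrates this two-stage setup with M2M-100: stage 1 fine-tunes on the concatenated Spanish-All data to obtain the intermediate model, and stage 2 continues fine-tuning on a single language pair. The checkpoint name, language-code handling, placeholder data, and hyperparameters are assumptions rather than our exact configuration; in particular, the indigenous target languages are not among M2M-100's language codes, so the Spanish code is reused here purely as a placeholder. The single-stage mBART50 setup in (1) follows the same loop with the corresponding mBART50 checkpoint and tokenizer.

```python
# Illustrative sketch of the two-stage fine-tuning setup with M2M-100.
import torch
from torch.optim import AdamW
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

checkpoint = "facebook/m2m100_418M"                   # closest public checkpoint (assumed)
tokenizer = M2M100Tokenizer.from_pretrained(checkpoint)
tokenizer.src_lang, tokenizer.tgt_lang = "es", "es"   # placeholder target language code
model = M2M100ForConditionalGeneration.from_pretrained(checkpoint)

def finetune(model, src_texts, tgt_texts, epochs=5, lr=2e-5, batch_size=8):
    """Plain training loop over a parallel corpus given as lists of strings."""
    optimizer = AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for i in range(0, len(src_texts), batch_size):
            batch = tokenizer(src_texts[i:i + batch_size],
                              text_target=tgt_texts[i:i + batch_size],
                              return_tensors="pt", padding=True, truncation=True)
            loss = model(**batch).loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
    return model

# Stage 1: concatenate every Spanish-Indigenous training set into Spanish-All.
# `pairs` is a hypothetical dict {lang_code: (spanish_sentences, target_sentences)}.
pairs = {"aym": (["<es sentence>"], ["<aym sentence>"]),
         "quy": (["<es sentence>"], ["<quy sentence>"])}
all_src = [s for src, _ in pairs.values() for s in src]
all_tgt = [t for _, tgt in pairs.values() for t in tgt]
inter_model = finetune(model, all_src, all_tgt, epochs=5)
inter_model.save_pretrained("m2m100-48-inter")

# Stage 2: continue fine-tuning the intermediate model on one language pair.
quy_src, quy_tgt = pairs["quy"]
final_model = finetune(inter_model, quy_src, quy_tgt, epochs=5)
final_model.save_pretrained("m2m100-48-es-quy")
```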
Evaluation We used the chrF2 metric (Popović, 2017) to evaluate our MT systems.
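For reference, chrF2 can be computed, for example, with the sacrebleu library, whose default CHRF configuration (character n-grams up to order 6, no word n-grams, beta = 2) corresponds to the chrF2 signature. The hypothesis and reference strings below are placeholders.

```python
# Minimal sketch of chrF2 scoring with sacrebleu.
from sacrebleu.metrics import CHRF

hypotheses = ["<system translation 1>", "<system translation 2>"]  # illustrative
references = [["<reference 1>", "<reference 2>"]]                  # one reference stream, aligned with hypotheses

chrf = CHRF()                       # char_order=6, word_order=0, beta=2 -> chrF2
score = chrf.corpus_score(hypotheses, references)
print(score)                        # e.g. "chrF2 = 31.25"
```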

Results
We submitted three systems (two multilingual and one bilingual), as shown in Table 2: m2m100-48-inter, mBART50, and Helsinki-NLP. We include the development set performance for all the models we trained before the final model, so that these results can be compared with the final models evaluated on the test set. From the development set results, it can be seen that fine-tuning the multilingual model directly on each Spanish-Indigenous language pair outperforms the fine-tuned bilingual and m2m100-48-inter models. Of all the models evaluated on the development set, mBART50 performed best on average. Our test results are comparable to the strongest baseline shared by AmericasNLP 2023, and our model outperformed the baseline for the Spanish-Bribri (es-bzd), Spanish-Asháninka (es-cni), and Spanish-Quechua (es-quy) pairs. Similarly, mBART50 outperformed the other models on average on the test set.

Conclusion
In this work, we present the system descriptions and results for our submission to the AmericasNLP 2023 Shared Task on Machine Translation into Indigenous Languages. We used pre-trained models and tested different fine-tuning strategies for the eleven languages provided for the shared task. We used one bilingual model (the Helsinki-NLP Spanish-English model) and two multilingual models (M2M-100 and mBART50) for our experiments. In addition to fine-tuning on the individual languages' data, we concatenated the data from all eleven languages to create a Spanish-All dataset and fine-tuned the M2M-100 model on it before fine-tuning for the individual languages. Our mBART50 model beat the strong baseline for three languages.

Figure 1: Experiments on (a) fine-tuning the bilingual model and (b), (c) fine-tuning multilingual models. For (a) we fine-tuned the bilingual Spanish-English model on Spanish-Indigenous pairs; for (b) we fine-tuned the multilingual mBART50 model on Spanish-Indigenous pairs; and for (c) we fine-tuned the multilingual m2m100-48 model first on Spanish-All to produce the m2m100-48-inter model and then fine-tuned m2m100-48-inter on Spanish-Indigenous pairs.

Table 1: Information about the languages with which we experimented, including the ISO language code and language family, as well as the number of sentences in the training, development, and test sets for each language.

Table 2: Results of the baseline (Vázquez et al., 2021) and our three submissions, computed on the development and test sets. M1, M2, M3, and M4 represent the M2M100-48, M2M100-48-inter, mBART50, and Helsinki-NLP models, respectively. The development set evaluations are used to select the best-performing model before working on the submission data. The development set was not included in training when evaluating on it, but it was added to the training data for the final submission. Bold results indicate models that outperform the baseline (Vázquez et al., 2021); among our three model setups (excluding the baseline), bold marks the best-performing model for each individual language.