A Large-Scale Study of Machine Translation in the Turkic Languages

Recent advances in neural machine translation (NMT) have pushed the quality of machine translation systems to the point where they are becoming widely adopted to build competitive systems. However, a large number of languages are yet to reap the benefits of NMT. In this paper, we provide the first large-scale case study of the practical application of MT in the Turkic language family, in order to realize the gains of NMT for Turkic languages under high-resource to extremely low-resource scenarios. In addition to presenting an extensive analysis that identifies the bottlenecks in building competitive systems and ameliorating data scarcity, our study makes several key contributions: i) a large parallel corpus covering 22 Turkic languages, consisting of common public datasets combined with new datasets of approximately 2 million parallel sentences; ii) bilingual baselines for 26 language pairs; iii) novel high-quality test sets in three different translation domains; and iv) human evaluation scores. All models, scripts, and data will be released to the public.


Introduction
Having been studied widely over the last few decades, machine translation (MT) evaluation has traditionally focused on European languages, due to limitations of the available technology as well as resources. Although low-resource MT has recently started to gain more attention and new evaluation benchmarks are becoming available (Guzmán et al., 2019; Ojha et al., 2020; Fraser, 2020; Ansari et al., 2020), a large number of underrepresented languages are still excluded from MT evaluation. In addition to the cost of preparing such labor-intensive annotations, the lack of training resources also limits the evaluation of MT models in terms of their applicability across a wide range of world languages. On the other hand, many studies have pointed to the limited applicability of prominent methods in MT research, including models and evaluation metrics (Birch et al., 2008; Stanojević et al., 2015; Bugliarello et al., 2020), in translating languages with varying linguistic typology. In order to extend the evaluation of state-of-the-art methods in MT (Joshi et al., 2019) and ultimately aid in designing methods with a wider range of applicability, in this paper we present a large-scale case study of MT methods in the very challenging case of the Turkic language family.¹ The Turkic language family consists of around 35 languages spoken by around 200 million people across Eurasia. Around 20 of these are official languages of a state or sub-national entity, with the remainder being minority languages. The languages are distinct in their highly complex use of morphology, which creates extremely sparse vocabularies, presenting a challenging case for the evaluation of MT systems (Tantug et al., 2008) and generally leading to worse performance in n-gram language models (Bender, 2011; Tsarfaty et al., 2020).

¹ https://github.com/turkic-interlingua/til-mt
Table 1 presents the amount of resources and the number of speakers for the Turkic languages, which aids our analysis of the feasibility of crowdsourcing, based on the approach of Moshagen et al. (2014).
Our study includes the preparation of novel public resources covering many languages in the Turkic family, most of which are included in parallel corpora for the first time. We also present new benchmarks for MT that can be used to assess the different factors determining the limits of MT methods across languages, such as data size, evaluation metrics, translation domain, linguistic typology, relatedness, and writing system. We test the use of our resources in MT and present the first evaluation results for many Turkic languages. Our novel resources consist of i) a large-scale multicentric parallel corpus of 75M+ sentence pairs in 22 Turkic languages and their translations into English, Russian, as well as in-family languages, covering over 400 translation directions; ii) 3 new test sets for each translation direction, curated from our corpus in 3 different translation domains; and iii) bilingual baselines for 26 different language pairs. Our baselines are evaluated using automatic metrics as well as human assessments against commercial or open-source systems where applicable. We release our parallel corpora, test sets, and baseline systems publicly to encourage future research on Turkic languages.

Turkic Languages & MT
This section gives a brief overview of Turkic languages from a linguistic perspective and highlights previous work on MT of these languages. In our study, we include 22 Turkic languages: Altai, Azerbaijani, Bashkir, Crimean Tatar, Chuvash, Gagauz, Karachay-Balkar, Karakalpak, Khakas, Kazakh, Kumyk, Kyrgyz, Sakha, Salar, Shor, Turkmen, Turkish, Tatar, Tuvan, Uyghur, Urum, and Uzbek. Several other widely spoken languages, such as Nogai, Khorasani Turkic, Qashqai, and Khalaj, were left out of our study due to the lack of any available parallel corpora. Future work will focus on extending the corpus to these languages as well.

Linguistic Typology
The Turkic languages are spoken in a wide area that stretches from south-eastern Europe to northeastern Asia. The languages are of the agglutinative morphological type and uniformly have Subject-Object-Verb main constituent order.
Nominal morphology is highly similar between the languages, with all of them exhibiting inflection for number, possession, and case. The number of cases varies, but the six core cases of nominative, genitive, accusative, dative, locative, and ablative are extant in the vast majority of the languages. As part of the nominal inflectional system, the languages also have a derivational process whereby locatives and genitives can be pronominalized and constitute full noun phrases in their own right. Verbal inflection, on the other hand, is more heterogeneous between the languages, with each language having a variety of strategies for encoding tense, aspect, voice, modality, and evidentiality. One common feature, however, is that each of the languages has an extensive system of non-finite forms: verbal adjectives, verbal nouns, and verbal adverbs. These are full clauses that can be used as either modifiers (in the case of verbal adjectives and verbal adverbs) or heads (in the case of verbal nouns). Many of the languages also have constructions consisting of a non-finite verbal form and an auxiliary verb which constitute a single predicate, with the auxiliary verb giving extra information about tense or mood (Johanson and Johanson, 2015).
The modern Turkic languages are written in a variety of scripts, with Latin, Cyrillic, and Perso-Arabic being the most common. Many of the languages have been written in several writing systems over the past century, making collecting texts more problematic. For example, we can find instances where the same language has texts written in Perso-Arabic before the 1920s, in Latin until the 1930s, in Cyrillic until the 1990s, and then in Latin again (Róna-Tas, 2015). In addition, many languages have gone through several orthographic norms based on the same script, and some languages are currently written in different scripts depending on which country the speakers are in. This orthographic diversity makes collecting and collating text resources difficult, as many texts may be available only in a previously used orthography, and conversion between orthographic systems is never deterministic owing to the large number of loan words in many texts.

MT of Turkic Languages
The need for more comprehensive and diverse multilingual parallel corpora has sped up the creation of such large-scale resources for many language families and linguistic regions (Koehn, 2005; Choudhary and Jha, 2011; Post et al., 2012; Nomoto et al., 2018; Esplà-Gomis et al., 2019). Tiedemann (2020) released a large-scale corpus for over 500 languages covering thousands of translation directions. The corpus includes 14 Turkic languages and provides bilingual baselines for all translation directions present in the corpus. However, the varying and limited size of the test sets does not allow for extensive analysis and comparison between different model artifacts, linguistic features, and translation domains. Khusainov et al. (2020) collected a large-scale Russian-Turkic parallel corpus for 6 language pairs and report bilingual baselines using a number of NMT-based approaches, although the dataset, test sets, and models are not publicly released, which limits their use as a comparable benchmark. Alkım and Çebi (2019) introduce a rule-based MT framework for Turkic languages and demonstrate its performance on 4 language pairs. Washington et al. (2019) demonstrate several rule-based MT systems built for Turkic languages, which are available through the Apertium website.

TIL Corpus
Our parallel corpus is collected by unifying publicly available datasets with additional parallel data we prepare by crawling public-domain resources. Table 2 shows the total number of sentences in each language across the corpus, along with the number of sentences that are newly introduced (previously unavailable). This section describes the details of our data collection process.

Public Datasets
In our corpus we include the following public datasets: • The Tatoeba corpus (Tiedemann, 2020) provides training and test sets for over 500 languages and thousands of translation pairs. It uses the latest version of OPUS (Tiedemann and Nygaard, 2004) as training sets and uses parallel sentences from the Tatoeba project for testing. Tatoeba covers 58 language pairs of interest. For the purposes of our corpus, we merge the training, development, and test sets into a single set for all available languages.
• JW300 (Agić and Vulić, 2019) is a public dataset available for download through OPUS. Although most of the parallel data in JW300 was provided through the Tatoeba corpus, we have identified several pairs that were missing in Tatoeba but present in JW300. To avoid further data loss, we have obtained the JW300 dataset directly from OPUS and deduplicated it against the Tatoeba corpus. This dataset provided data for 59 language pairs of interest and resulted in 5.2 million parallel sentences.
• GoURMET is another dataset available through OPUS that provides parallel sentences.

In addition to these, with permission from the owners, we include privately owned corpora for English-Azerbaijani (containing data from news articles), English-Uzbek (containing data from the KhanAcademy website localization), and Bashkir-Russian (a mix of data from news articles and literary works).

Data Crawling
We obtained additional parallel data from a few public-domain websites that contain large amounts of text translated into many different languages. One of these is TED Talks, which contains talks across various domains translated by volunteers. Qi et al. (2018) compiled a TED dataset for 60 languages; however, only a few Turkic languages were available at their time of curation. We have compiled an updated version of this dataset and obtained sentence pairs for 8 Turkic languages. Bible.is is another website that contains an extensive list of languages into which religious texts and books are translated. 19 out of the 22 Turkic languages are covered by this source, with an average of approximately 8,000 sentence pairs per translation direction. Additionally, we crawled other public websites, online dictionaries, and resources with parallel data that were identified by native speakers of these languages. The full list of online resources used in our crawling is given in the Appendices.

Data Alignment
All crawled documents are aligned using Hunalign (Varga et al., 2005), with a threshold of either 0.2 or 0.4 depending on the availability of a native speaker for the language. When crawling pre-aligned sources such as TED Talks, we noticed serious alignment issues for certain Turkic languages, especially when the source and target differ greatly in size. In these cases, we split both sides into sentences using the NLTK sentence tokenizer and realign them with Hunalign. Specifically for the Bible dataset, all data is first aligned at the verse level and then split into sentence-level bitexts whenever possible. This results in parallel texts that are relatively longer while ensuring higher-quality alignments.
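The split-and-realign step above can be sketched as follows. This is a toy illustration, not our exact pipeline: `sent_split` is a naive regex splitter standing in for the NLTK tokenizer, and the length-ratio filter is a simplified stand-in for Hunalign's length-based alignment scoring (the real tool also uses a dictionary component).

```python
import re

def sent_split(text):
    # Naive sentence splitter; a stand-in for nltk.sent_tokenize.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]

def align_by_length(src_sents, tgt_sents, threshold=0.4):
    """Greedy 1-1 alignment that keeps pairs whose character-length ratio
    clears `threshold` -- a simplified sketch of length-based filtering."""
    pairs = []
    for s, t in zip(src_sents, tgt_sents):
        ratio = min(len(s), len(t)) / max(len(s), len(t))
        if ratio >= threshold:
            pairs.append((s, t))
    return pairs
```

In practice the two sentence lists would then be written to files and passed to the Hunalign binary, which produces the final alignment ladder.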

Data Preprocessing
Many of the languages in our dataset are written in multiple scripts, which creates consistency problems when building MT systems. We therefore transliterate three of the languages in our dataset that have a high mix of scripts: notably, Uzbek is transliterated into Latin script, while all Karakalpak text is converted into Cyrillic. Although the performance of the transliteration tools for Uzbek and Karakalpak was not strictly evaluated, the tools we used are recommended and widely adopted by native speakers of these languages. Once the entire corpus is combined, we deduplicate the sentences in each language pair.
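The final deduplication step amounts to removing exact duplicate sentence pairs per language pair; a minimal sketch:

```python
def deduplicate(pairs):
    """Remove exact duplicate (source, target) sentence pairs, keeping the
    first occurrence and preserving corpus order."""
    seen = set()
    unique = []
    for src, tgt in pairs:
        key = (src.strip(), tgt.strip())
        if key not in seen:
            seen.add(key)
            unique.append((src, tgt))
    return unique
```

Keying on the full pair (rather than the source side alone) retains legitimate alternative translations of the same source sentence.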

Bilingual Baselines
We train bilingual baselines for 26 language pairs in three different resource categories: high (>5M), medium (100K-5M) and low (<100K). The choice of pairs to train was based on multiple factors such as the availability of test sets, native speakers (for human evaluation), and other comparable MT systems.

Model Details
All models are Transformers (Vaswani et al., 2017) (transformer-base) whose exact configuration depends on the amount of training data available. Models for low-resource pairs use 256-dimensional embeddings and hidden layers. Models for mid-resource pairs use 512-dimensional embeddings and hidden layers. Models for high-resource pairs use the same 512-dimensional embedding and hidden layer sizes for the encoder, but both dimensions are increased to 1024 for the decoder. All models are trained with the Adam optimizer (Kingma and Ba, 2015) over cross-entropy loss, with a maximum learning rate of 3 × 10^-4 and a minimum of 1 × 10^-8; the learning rate warms up for the first 4800 training steps and then decays after reaching the maximum. We use a training batch size of 4096. We use perplexity as our early-stopping metric with a patience of 5 epochs. We set a dropout (Srivastava et al., 2014) probability of 0.3 in both the encoder and the decoder. We apply byte pair encoding (BPE) (Sennrich et al., 2015; Dong et al., 2015) with a joint vocabulary size of 4K for low-resource and 32K for mid/high-resource scenarios.
All models use the Joey NMT (Kreutzer et al., 2019) implementation, with NVIDIA apex where possible to speed up training. Models were trained on preemptible GPUs freely available through Google Colab.
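The warmup-then-decay schedule described above can be sketched as a small function. The linear warmup, 4800-step horizon, and learning-rate ceiling and floor come from the configuration above; the inverse-square-root decay shape is an assumption, since the exact decay is set in the Joey NMT config.

```python
def learning_rate(step, peak=3e-4, floor=1e-8, warmup=4800):
    """Linear warmup to `peak` over `warmup` steps, then inverse-sqrt decay,
    clamped to `floor`. A sketch of the schedule, not Joey NMT's exact code."""
    if step < warmup:
        return peak * step / warmup
    return max(floor, peak * (warmup / step) ** 0.5)
```

For example, the rate peaks at step 4800 and halves by step 19200 (4× the warmup horizon) under the inverse-square-root assumption.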

Test Sets
High-quality and diverse test sets are essential in evaluating the strength and weaknesses of MT systems. We curate 3 test sets covering 3 translation domains: religious (Bible), conversational (TED Talks), and news (X-WMT).
The Bible dataset is the main source that exists across almost all of the 24 language pairs included in our corpus. From this dataset, the roughly 400 to 800 most commonly present sentences for every language pair were separated out to create a test set. This yields test sets that are comparable across language pairs, which we find essential for controlled evaluation and believe will be a useful resource in future studies involving multilingual models.
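Selecting the most commonly shared sentences can be sketched as follows. `pair_corpora`, mapping each language pair to the set of pivot-side sentences it contains, is a hypothetical layout for illustration; sentences are ranked by how many pairs contain them.

```python
from collections import Counter

def most_shared_sentences(pair_corpora, k):
    """Rank sentences by the number of language pairs they appear in and
    return the top-k -- a sketch of carving a comparable test set out of
    a multi-parallel source such as the Bible."""
    counts = Counter()
    for sentences in pair_corpora.values():
        counts.update(set(sentences))
    return [s for s, _ in counts.most_common(k)]
```

Taking the top-k by coverage maximizes the overlap of test sentences across pairs, which is what makes the resulting scores comparable.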
TED Talks is another resource we use for collecting sentences across multiple languages, creating a language-wise comparable test set in the conversational domain. This also makes our evaluation comparable across domains. After deduplication, 3,000-5,000 sentences per language pair are selected for our TED Talks test set.

Evaluation
Automatic evaluation metrics are commonplace in MT research, and a recent line of work explores better metrics that capture translation quality beyond syntactic and lexical features (Zhang et al., 2019; Sellam et al., 2020; Rei et al., 2020). However, methods relying on contextual embeddings to capture the semantic similarity between the hypothesis and the references fall short in terms of language coverage. This is largely because pretraining these evaluation models requires a significant amount of monolingual data, which most low-resource languages lack. In this study, we evaluate our systems using both automatic metrics and human evaluation of translations.

Automatic Metrics for MT
We employ two widely adopted metrics: BLEU (Papineni et al., 2002) and ChrF (Popović, 2015). BLEU uses modified n-gram precision, where the consecutive n-grams of the system translation are compared with the consecutive n-grams of the reference translation; we use the standard SacreBLEU implementation (Post, 2018). ChrF applies the same method at the level of character n-grams; we use the original implementation from the paper (https://github.com/m-popovic/chrF), as provided through the NLTK library.
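As a minimal illustration of the character n-gram matching that underlies ChrF (not the official implementation, which averages over several n-gram orders and uses a recall-weighted F-beta rather than plain F1):

```python
from collections import Counter

def char_ngram_f1(hypothesis, reference, n=3):
    """F1 over character n-grams of a single order -- a toy sketch of the
    matching behind ChrF."""
    def ngrams(text):
        text = text.replace(" ", "")  # ChrF ignores whitespace by default
        return Counter(text[i:i + n] for i in range(len(text) - n + 1))

    hyp, ref = ngrams(hypothesis), ngrams(reference)
    overlap = sum((hyp & ref).values())  # clipped n-gram matches
    if not overlap:
        return 0.0
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

Because matching happens below the word level, a hypothesis with a correct stem but a wrong suffix still earns partial credit, which is why ChrF is gentler than BLEU on morphologically rich languages.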

Human Evaluation
To perform a more holistic analysis of MT systems, it is critical to involve native speakers in the evaluation process. We conducted a human evaluation campaign using a randomly sampled subset of 250 sentences from X-WMT, or from the Bible whenever X-WMT was not available, to evaluate the outputs of 14 bilingual baseline models. Our assessment is based on Direct Assessment (DA) (Nießen et al., 2000; Papineni et al., 2002; Doddington, 2002), where annotators were asked to rate a translation for adequacy and fluency on a 5-point Likert scale. All participants of the study were bilingual speakers of the source and target language. To better understand the importance of directionality (e.g., English-X vs. X-English) and to avoid variance in scores, we ensure that both directions of the same pair are evaluated by the same annotator whenever possible. When reporting, we average the scores for each pair but report adequacy and fluency separately. Adequacy is defined as how much information is preserved in the translation: a score of 1 means the translation is meaningless and has no correlation with the target sentence, while a score of 5 means the translation retains all of the information. Fluency is defined as how grammatically, syntactically, and stylistically correct the translation is: a score of 1 means the sentence makes no sense grammatically or syntactically, while a score of 5 means the sentence is perfectly correct.

Results & Discussion
The upper section of Table 4 shows the bilingual baselines for high-resource pairs and their evaluation scores in the three domains. Despite the large training size, both models perform relatively modestly on the Bible and TED Talks test sets, with the en-tr model slightly better than ru-tr. Our hypothesis is that the domain of the Bible test set is far from the rest of the training set for both pairs, as most of the training data for Turkish comes from OpenSubtitles. Another likely bottleneck is the suboptimal model size and hyperparameters, which were not tuned due to limited computational resources.
Baseline results for the mid- and low-resource pairs are given in the lower part of Table 4. A notable pattern is the gap in BLEU scores between models when translating into and out of non-Turkic languages. However, these differences are not as prominent when evaluated using ChrF, a character-level metric. This can partially be attributed to the complex morphology of Turkic languages, which penalizes lexical mispredictions at a much higher rate than in English, for example (Tantug et al., 2008), and in turn leads to lower BLEU scores. To examine this phenomenon in more detail, we compare the X-WMT results against human evaluations of the translations these models produced in Section 6.1.
Another notable aspect is the importance of scripts for model performance. Language pairs with more than one script consistently underperform (in both automatic and human evaluations) those where the source and target language use the same script. In fact, the 6 best models on the X-WMT test sets all have Latin scripts on both the source and target side. Suboptimal performance in the face of a script disparity is a known phenomenon (Anastasopoulos and Neubig, 2019; Murikinati et al., 2020; Aji et al., 2020; Amrhein and Sennrich, 2020), where techniques such as transliteration have been shown to improve performance. This is mostly attributable to the model's inability to effectively represent both languages in a shared space when they do not share the same script, which can be damaging for downstream performance.

Comparing Human Evaluations to BLEU
Using the Direct Assessment (DA) surveys described in Section 5.2, we obtain average adequacy and fluency scores for almost all baseline models. Figure 1 shows the BLEU/ChrF scores and the adequacy/fluency scores. Comparing the scores from native speakers of these languages, it is evident that the disparities in BLEU scores between two translation directions are exaggerated and even misleading (e.g., en-az vs. az-en). Results in the human evaluations for mid-resource pairs are clustered much more closely than in the BLEU/ChrF figure. These results further highlight the pitfalls of automatic metrics for MT evaluation and emphasize the role of native speakers in the MT process.

Table 6: Correlation between scores from human evaluation and automatic metrics for translating into Turkic and non-Turkic languages. Correlation is measured using Pearson's r.
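The correlation reported in Table 6 is the standard Pearson's r between automatic metric scores and averaged human scores; a minimal self-contained implementation:

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists,
    e.g. per-system BLEU scores vs. average adequacy scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

A value near 1 means the metric ranks systems much like human judges do; values near 0 indicate the metric carries little information about human preference.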

Turkic Languages on the target side
Even though BLEU scores do not offer a holistic way to compare two MT systems, they are effective in telling which system performs better. As is clear from the results in Table 4, the performance of the baseline systems as measured by BLEU when translating from English into a Turkic language is substantially worse than when translating from a Turkic language into English; translating into the Turkic language is typically twice as bad in terms of BLEU as translating from it. The reliability of the BLEU score also decreases when translating into morphologically rich languages, and BLEU has indeed been shown to correlate poorly with human judgments for Turkic languages (Ma et al., 2018, 2019). Table 6 shows the correlation between BLEU/ChrF and adequacy/fluency scores. BLEU correlates with adequacy and fluency considerably better when the target side is a non-Turkic language, which reinforces our earlier points about morphology. ChrF's correlation with adequacy is about the same regardless of the target language.

We also compare our baselines against available commercial and open-source systems: Google Translate, Yandex Translate, and Apertium (Forcada et al., 2011). Google Translate results are significantly higher than our baselines and the other MT systems. There are several reasons for these score disparities. First, commercial systems have access to more training data, possibly including the public data we exclude from our test sets. Moreover, several test-set translators used Google Translate to produce draft translations and performed post-edits afterwards (e.g., en-uz), which creates a bias favoring sentences generated by Google's service. A safer comparison for the baselines is Yandex Translate, which, despite its lower performance, supports more Turkic languages (8 in Google and 9 in Yandex). However, it is important to note that their API yielded worse results than their web interface.

(Figure 1a: BLEU and ChrF scores for select pairs. Note: ChrF scores were multiplied by 20 for better visibility.)
Apertium is a rule-based MT framework that supports several Turkic-Turkic pairs, and we include its results whenever available. For those pairs, the results are comparable with our baselines and Yandex Translate.

Conclusion & Future Work
In this paper, we introduce a large parallel corpus covering 22 Turkic languages, along with in-domain and out-of-domain evaluation sets. We also train the first baseline models for several language pairs and take initial steps to address the challenges associated with machine translation for the Turkic languages. This study was carried out in a participatory research setting by a diverse community of researchers, engineers, language specialists, and native speakers of Turkic languages. Future work will focus on methods for effective cross-lingual transfer, extending the coverage of the corpus to more languages and domains, and increasing the size of the test sets to provide more comprehensive benchmarks.