CHIA: CHoosing Instances to Annotate for Machine Translation



Introduction
Machine translation (MT) systems have been widely adopted into daily use and facilitate easy communication and access to information. Advances in neural MT have enabled systems to approach human performance (Hassan et al., 2018; Popel et al., 2020). However, such high-performing MT systems are only available for a small subset of the world's languages, as they require large training corpora (Mueller et al., 2020; Koehn and Knowles, 2017). While unsupervised methods (Lample et al., 2018; Artetxe et al., 2018) that use limited or no parallel data are effective for many languages, they perform poorly on low-resource language pairs (Guzmán et al., 2019) and are outperformed by supervised methods (Kim et al., 2020).

* Equal contribution

Figure 1: Overview of our method, CHIA. Given data in high-resource languages such as English and Spanish, CHIA selects data to annotate for low-resource languages such as Irish or Burmese.
To ensure that advances in MT benefit all communities and users equally, we need efficient ways to collect parallel data. For high-resource languages with a large number of speakers, parallel data sources exist (Koehn, 2005; Agić and Vulić, 2019; McCarthy et al., 2020), and crowdsourcing has proved cheap and effective (Post et al., 2012). However, for low-resource languages, collecting sufficient parallel sentences is harder, as bilingual translators may be difficult to find or expensive to hire. The number of instances that can be manually translated for use during model development is thus limited by time or monetary constraints.
Here, we explore how existing or easily obtainable parallel sentences in high-resource languages, e.g., English and Spanish, can help us construct high-quality datasets for a target language with low or no resources, under a limited annotation budget. We present CHIA (choosing instances to annotate), which requires a multi-way parallel dataset and automatically identifies the instances that, if translated to construct a training set for a new language, will result in the strongest MT system.
CHIA is based on cross-lingual information: First, we identify the most effective instances to train MT systems between the center language - the language from which we wish to translate into a low-resource language - and multiple high-resource languages. For this, we utilize MT model training dynamics to identify examples that help a model learn, as proposed by Swayamdipta et al. (2020). We extend their method, originally proposed for classification tasks, to sequence-to-sequence tasks. Second, we use the intersection between the sets of informative instances for different language pairs to determine which instances will be most beneficial for training an MT system for a new language pair, cf. Figure 1.
We perform experiments on two multi-way parallel datasets, the Europarl corpus (Koehn, 2005) and the JHU Bible corpus (McCarthy et al., 2020). Our language pairs consist of English - our center language - and 15 simulated low-resource languages as well as 5 truly low-resource languages. We show that, on average, training on examples selected by CHIA results in gains of 1.59 BLEU compared to training on randomly selected instances. We further examine the characteristics of the selected training examples, and find that CHIA does not rely on simple properties such as sentence length or number of unique words.
CHIA is based on the two contributions we present in this paper: 1) a method to identify the most useful training instances for sequence-to-sequence tasks, and 2) an empirical demonstration that this method can be used to identify examples that will be beneficial for a new low-resource language, based on a set of high-resource languages.

Background: Dataset Cartography
Individual instances in a training set have varying impact on a model's learning behavior (Lewis and Gale, 1994; Cohn et al., 1995). To identify the examples that contribute the most to the training process for classification tasks, Swayamdipta et al. (2020) propose to look at two metrics for each instance i, confidence $c_i$ and variability $v_i$:

$$c_i = \frac{1}{E} \sum_{e=1}^{E} p_{\theta_e}(y_i^* \mid x_i) \quad (1)$$

$$v_i = \sqrt{\frac{1}{E} \sum_{e=1}^{E} \left( p_{\theta_e}(y_i^* \mid x_i) - c_i \right)^2} \quad (2)$$

with E being the number of training epochs, $\theta_e$ being the model parameters at the end of epoch e, $y_i^*$ being the true label in the training set, and $x_i$ being the respective input. Thus, $c_i$ corresponds to the average probability of the true label of an example over epochs, and $v_i$ is the confidence's standard deviation.
Based on confidence and variability, Swayamdipta et al. (2020) partition the training set into three regions: instances with high confidence and low variability are designated as easy-to-learn, instances with high variability are designated as ambiguous, and instances with low confidence and low variability are designated as hard-to-learn. They show that instances in the ambiguous region contribute the most to the model's ability to generalize out of the training distribution, i.e., ambiguous instances are the most effective training examples.
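As a minimal sketch, the two cartography metrics can be computed directly from the per-epoch probabilities of an instance's gold label (the function and variable names here are ours, for illustration only):

```python
import numpy as np

def training_dynamics(gold_probs):
    """Dataset-cartography statistics for one training instance.

    gold_probs: length-E sequence with the probability the model assigns
    to the instance's true label at the end of each of E training epochs.
    Returns (confidence, variability): the mean of those probabilities
    and their standard deviation.
    """
    p = np.asarray(gold_probs, dtype=float)
    return p.mean(), p.std()

# An instance whose gold-label probability fluctuates across epochs has
# mid-range confidence but high variability, i.e., it is "ambiguous".
c, v = training_dynamics([0.2, 0.8, 0.3, 0.9, 0.4])
```

Instances are then binned by these two scores: high confidence with low variability is easy-to-learn, low confidence with low variability is hard-to-learn, and high variability is ambiguous.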

Computing MT Training Dynamics
Our first contribution is a generalization of Swayamdipta et al. (2020)'s method for classification tasks to sequence-to-sequence tasks like machine translation. Importantly, we have more than one gold label per example: each ground-truth translation consists of a sequence of gold labels $y_i^*$ of length T. We modify Equation 1 as follows to compute the confidence $c_i^S$ for a gold sequence:

$$c_i^S = \frac{1}{E} \sum_{e=1}^{E} \frac{1}{T} \sum_{t=1}^{T} p_{\theta_e}(y_{i,t}^* \mid x_i, y_{i,<t}^*)$$

We then compute the variability $v_i^S$ as:

$$v_i^S = \sqrt{\frac{1}{E} \sum_{e=1}^{E} \left( \frac{1}{T} \sum_{t=1}^{T} p_{\theta_e}(y_{i,t}^* \mid x_i, y_{i,<t}^*) - c_i^S \right)^2}$$

Based on $v^S$ and $c^S$, we group training instances into three sets: hard-to-learn examples H, easy-to-learn E, and ambiguous A.
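Under the natural reading of the equations above - average the gold-token probabilities within a sequence, then take the mean and standard deviation across epochs - the sequence-level extension can be sketched as follows (our own illustration, not the authors' code):

```python
import numpy as np

def sequence_dynamics(token_probs):
    """Sequence-level confidence c_i^S and variability v_i^S.

    token_probs: (E, T) array with the probability assigned to each of
    the T gold target tokens at the end of each of E epochs. Averaging
    over tokens first yields one per-epoch sequence confidence; its mean
    and standard deviation over epochs mirror the classification case.
    """
    p = np.asarray(token_probs, dtype=float)
    per_epoch = p.mean(axis=1)  # average over the T gold tokens
    return per_epoch.mean(), per_epoch.std()

# Three epochs, two gold tokens per sequence.
c_s, v_s = sequence_dynamics([[0.1, 0.3], [0.5, 0.7], [0.8, 0.8]])
```

This collapses each translation to a single scalar per epoch, so the downstream binning into H, E, and A works exactly as in the classification setting.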

Selecting Instances for New Languages
The above method for beneficial-instance detection requires that a model has already been trained on all available data, i.e., it chooses a subset of already existing data, and cannot be used to select new instances for annotation. Thus, our second contribution is the second step of CHIA, which chooses instances to annotate (i.e., translate) for a new language. Importantly, CHIA does not rely on any existing parallel data between the two languages.
CHIA assumes that the following is given: (1) an n-way parallel corpus, (2) a center language $L_c$, which is one of the n languages in the corpus and from which we want to translate into a new language, and (3) separate models for translating between the center language and all n - 1 other languages, together with their training dynamics. The latter enables us to compute sets $A_{L_s L_t}$, which contain the ambiguous instances for training an MT system between a source language $L_s$ and a target language $L_t$. We assume that the center language is either the source or target language in all cases.
Once we have the ambiguous sets corresponding to each language pair, we select instances that lie at the intersection of multiple ambiguous sets. Specifically, for each instance i in the n-way parallel dataset, we count the number of language pairs $l_i$ for which $i \in A_{L_s L_t}$, where $0 \le l_i \le n$. We rank each instance i by $l_i$ and select the top k instances, where k is the desired size of the dataset for a new target language. The selected instances in the source language can then be used for constructing a parallel dataset in the target language. In practice, a human translator would manually translate the selected instances into the new target language.
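This second step reduces to counting set memberships and taking the top k. A minimal sketch (our own illustration; the text does not specify tie-breaking, so we break ties by instance index for determinism):

```python
from collections import Counter

def select_for_annotation(ambiguous_sets, k):
    """Rank instances by the number of seen language pairs whose
    ambiguous set A_{Ls,Lt} contains them; return the top k indices
    to send to a human translator."""
    counts = Counter()
    for a in ambiguous_sets:
        counts.update(a)
    # Sort by descending count l_i, then by index to break ties.
    ranked = sorted(counts, key=lambda i: (-counts[i], i))
    return ranked[:k]

# Instance 2 is ambiguous for all three pairs, so it is chosen first.
chosen = select_for_annotation([{1, 2, 3}, {2, 3, 5}, {2, 4}], k=2)
```

Note that the ranking only uses models trained between the center language and the seen languages; no data in the new target language is needed.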

Europarl
The Europarl corpus covers parliamentary proceedings. It contains multi-way parallel sentences in 21 European languages, which we filter for those sentences that are parallel across all languages. Our final dataset contains 180,000 sentences per language. To explore the effectiveness of CHIA for different dataset sizes, we create subsets of our data with 20k, 40k, 80k, and 160k sentences.

Seen languages. We create a set of 10 seen language pairs, which we use to compute training dynamics and to identify ambiguous instances. Our seen language pairs consist of our center language English paired in two directions with Greek, German, Finnish, Spanish, and Slovak.

Unseen languages. We create a set of 30 unseen or evaluation language pairs, consisting of English, our center language, paired in two directions with Bulgarian, Czech, Danish, Estonian, French, Hungarian, Italian, Lithuanian, Latvian, Dutch, Polish, Portuguese, Romanian, Slovene, and Swedish.

Validation and test sets. We randomly select 7.8k and 18.2k English sentences as our validation and test sets, respectively. We then choose the parallel sentences in the other languages corresponding to these English sentences, keeping the validation and test sets the same for all languages.

JHU Bible Corpus
The Johns Hopkins University Bible corpus covers 1,611 languages across 95 different language families, with dataset sizes for each language ranging from 8k sentences (verses, in the corpus) to 40k sentences. Similar to above, we create sets of seen and unseen languages to be used for identifying ambiguous instances and for evaluation, respectively. We select languages that have at least 30k sentences, yielding a multi-way parallel dataset of 29k sentences.

Seen languages. We use 10 high-resource, European seen language pairs, consisting of English as our center language paired in two directions with Bulgarian, Italian, Finnish, German, and Greek.

Unseen languages. We evaluate on both European languages and low-resource languages from a typologically diverse set of families. The European languages comprise 14 language pairs, consisting of English paired in two directions with Swedish, Portuguese, Lithuanian, Danish, Dutch, Czech, and French. The low-resource languages comprise ten language pairs, consisting of English paired in two directions with Lashi (Sino-Tibetan), Tampulma (Niger-Congo), Cebuano (Austronesian), Yucatec Maya (Mayan), and Dyula (Niger-Congo).

Validation and test sets. We randomly select 1.8k sentences as the validation set and 7.4k sentences as the test set from the multi-way parallel corpus, similar to the Europarl setup.

Experimental Setup
Machine translation model. Our MT model is a standard Transformer (Vaswani et al., 2017) implemented in PyTorch (Paszke et al., 2019). Our hyperparameters are as follows: 6 layers, 4 attention heads, an embedding size of 512, and a hidden dimension of 1024. During training, we use dropout (Srivastava et al., 2014) with a probability of 0.3 on the embedding layer and dropout with a probability of 0.2 on the attention layers. We train the models for a maximum of 100 epochs, using early stopping with a patience of 15. We employ the Adam optimizer (Kingma and Ba, 2015) with beta values of 0.9 and 0.98 and a learning rate of 0.0005. Each model was trained for a maximum of six hours on a single NVIDIA V100 GPU.
CHIA hyperparameters. We select the 33% of instances with the highest variability as our ambiguous sets, following Swayamdipta et al. (2020). For the Europarl corpus, this results in new training sets of size 6.6k, 13.2k, 26.4k, and 52.8k. For the Bible, this gives a new training set of size 6.6k.
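Given per-instance variability scores, extracting the ambiguous set is a single top-fraction selection. A sketch (the exact rounding at the cutoff is our assumption, not stated in the text):

```python
import numpy as np

def ambiguous_region(variabilities, fraction=0.33):
    """Indices of the `fraction` of instances with the highest
    variability, forming the ambiguous set A."""
    v = np.asarray(variabilities, dtype=float)
    k = int(len(v) * fraction)  # number of instances to keep
    # argsort of the negated scores ranks highest variability first.
    return set(np.argsort(-v)[:k].tolist())

top = ambiguous_region([0.1, 0.9, 0.5, 0.3, 0.8, 0.2], fraction=0.5)
```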
Random baseline. We compare CHIA to random sampling of instances, or Random. To account for variation caused by randomness, we create three independent random sets for each language pair and data size. The results we report for the random baseline are the average performances of models trained on these random sets.

Europarl: MT Training Dynamics
Figure 3 shows the performance of models trained on the seen languages. For each data subset, we report the difference in BLEU between a model trained on instances selected using CHIA and Random. The exact scores for both methods, as well as model performance on the entire dataset, can be found in the appendix.

In 39 out of the 40 models we investigate, models trained on instances chosen by CHIA outperform models trained on randomly selected instances. Further, we observe that the improvement in performance is greater when the size of the dataset is small. On dataset sizes of 6.6k, 13.2k, 26.4k, and 52.8k, the average BLEU score improvements are 4.30, 4.41, 2.48, and 1.29, respectively.

Looking at individual subsets, we observe that the ambiguous instances that are most effective change with dataset size. For subsets of size 6.6k, with CHIA, we see the largest improvement of 10.76 BLEU on the Spanish-English model. In contrast, on the Finnish-English dataset, there is a drop of 1.09 BLEU when instances are selected using CHIA. For subsets of size 13.2k, the English-German model shows the largest improvement with 6.26 BLEU, whereas the models trained on English-Finnish have the smallest improvement with 1.02 BLEU. For subsets of size 26.4k, the English-Greek model shows the largest improvement with 3.06 BLEU, whereas the Spanish-English model has the smallest improvement of 1.61 BLEU. For subsets of size 52.8k, the German-English model shows the largest improvement with 1.85 BLEU, whereas the Slovak-English model has the smallest improvement with 0.67 BLEU.

Europarl: Selecting Effective Instances for New Languages
Figure 4 shows the performance of models on the unseen languages when trained on instances selected using CHIA, in comparison to randomly selected instances. The exact scores for both methods can be found in the appendix.

We see that out of the 120 models we investigate, 118 outperform Random when trained on instances selected using CHIA. The correlation between BLEU score improvements and dataset sizes is less pronounced in this case than with the seen languages: on average, the BLEU score improvements obtained by selecting instances through CHIA on dataset sizes of 6.6k, 13.2k, 26.4k, and 52.8k are 0.98, 2.54, 1.41, and 0.60, respectively.

As with the performance on the seen languages, we observe changes with dataset size. For subsets of size 6.6k, the English-Romanian model shows the largest improvement with 3.57 BLEU, whereas the models trained on English-Finnish have the smallest improvement with 0.32 BLEU. For subsets of size 13.2k, the Slovene-English model shows the largest improvement with 4.71 BLEU, whereas the models trained on English-Estonian have the smallest improvement with 0.79 BLEU. For subsets of size 26.4k, the English-Lithuanian model shows the largest improvement with 2.32 BLEU, whereas the models trained on Portuguese-English have the smallest improvement with 0.32 BLEU. For subsets of size 52.8k, the English-Croatian model shows the largest improvement with 1.11 BLEU, whereas models trained with CHIA on Danish-English and French-English underperform randomly chosen instances by 0.1 and 0.05 points, respectively.

Overall, our results show the benefit of using CHIA to select sentences that should be translated for an unseen target language. Depending on the desired size of the dataset, selecting sentences with CHIA can result in MT model performance improvements of up to 4.71 points over randomly selecting sentences.

Bible
Figure 5 shows the performance of CHIA on the Bible; exact scores can be found in the appendix.
From the first subplot, we see that for all seen languages, models trained on instances chosen by CHIA outperform those trained on randomly selected instances. The maximum improvement (8.76 BLEU) is seen for English-Greek, and the average improvement is 4.50 BLEU.

The second and third subplots show the performance on unseen languages. First, for European languages, we see a maximum improvement of 5.31 BLEU for English-Dutch and an average improvement of 2.12 BLEU. More importantly, we also obtain consistent improvements for languages from low-resource language families: the largest improvement (3.8 BLEU) is for Cebuano-English, while the smallest improvement (0.14 BLEU) is for English-Yucatec Maya. On average, we observe an improvement of 1.28 BLEU when instances are chosen through CHIA.

Number of Overlapping Ambiguous Sets
As described in Section 3.3, we determine our final informative instances using ambiguous sets from all 10 seen language pairs. We perform a case study on the Europarl corpus to investigate the benefit of using CHIA when fewer seen languages are available for identifying ambiguous instances. Specifically, for each language pair in our unseen set, we train MT models on the ambiguous sets from each of the 10 language pairs in our seen set.

The results are presented in Table 1. The last row indicates the average performance across all unseen languages when each seen language pair is used as the informative set. The overall average performance in this setting, computed as the average of the last row, is -0.60, indicating that models trained on instances selected from a single language pair underperform randomly selected instances. In contrast, for the same dataset size of 6.6k, when all 10 seen languages are used, the average BLEU score increases by 0.99 for models trained on instances selected using CHIA vs. Random (as described in Section 4.2).
Looking at individual unseen language pairs, we see that for 22 out of 30 language pairs, using the intersection of all 10 seen language pairs (All) gives an improvement over using any single seen language pair. Out of the eight exceptions, for five of them, using ambiguous sentences from the English-Finnish dataset gives the largest improvement: these are Danish-English, Czech-English, Lithuanian-English, Swedish-English, and English-Danish. Additionally, for Estonian-English and English-Slovene, using the Finnish-English dataset gives the largest improvement. Finally, for English-Swedish, using the German-English dataset gives the largest improvement.

Further, we observe that out of all individual seen language pairs, using the English-Finnish dataset achieves the largest improvement over using randomly selected sentences for 19 out of 30 unseen language pairs. Next, for 5 out of 30 unseen language pairs, using the Finnish-English dataset achieves the largest improvement. This is notable since, looking at Figure 3, we see that models trained on ambiguous sets from the Finnish-English and English-Finnish datasets achieve the lowest performance improvements on their respective validation sets.

Analysis of Ambiguous Instances
To investigate the characteristics of sentences selected by CHIA, we analyze whether ambiguous sentences are similar to randomly sampled sentences in length and number of types. We examine sentences in the training set of the Europarl corpus; the results are presented in Table 2.

We find that sentences chosen by CHIA are about the same length as sentences chosen randomly: in the source set, sentences chosen by CHIA are 1.07% longer on average, and in the target set, they are 0.09% shorter on average. We also find that the number of distinct word types is comparable between both: in the source set, sentences chosen by CHIA have 1.58% fewer types than sentences chosen randomly, whereas in the target set, they have 3.84% fewer types.
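The two surface statistics compared in Table 2 can be computed as follows (a sketch with hypothetical helper names; we tokenize on whitespace purely for illustration):

```python
def surface_stats(sentences):
    """Mean sentence length (in tokens) and number of distinct word
    types for a set of sentences, as compared in Table 2."""
    tokenized = [s.split() for s in sentences]
    mean_len = sum(len(t) for t in tokenized) / len(tokenized)
    n_types = len({w for t in tokenized for w in t})
    return mean_len, n_types

def percent_difference(chia_value, random_value):
    """Percentage difference of a statistic between CHIA-selected and
    randomly selected sentences."""
    return 100.0 * (chia_value - random_value) / random_value

mean_len, n_types = surface_stats(["the cat sat", "the dog ran far"])
```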
This indicates that CHIA does not merely rely on simple surface characteristics of the training sentences, and instead identifies sentences that are truly beneficial to a model's learning. We leave it to future work to design more complex probes that can characterize the nature of the sentences selected by CHIA.

Related Work
Active learning. Active learning (Cohn et al., 1995; Settles, 2009; Ren et al., 2021) provides a way to identify the most useful instances that help a model learn, thereby optimizing limited labeled training data or annotation budgets. The active learning strategy most relevant to ours is uncertainty sampling (Lewis and Gale, 1994; Lewis and Catlett, 1994), where a learner trained on seed data is used to choose examples whose labels it is least certain about and to pass those examples to an oracle, typically a human annotator, for annotation. This strategy has been utilized in NLP for tasks including text classification (Lewis and Gale, 1994; Zhu et al., 2008), named entity recognition (Shen et al., 2018), and dependency parsing (Li et al., 2016), inter alia.

Feature                    Percentage difference
Source length                          +1.07
Target length                          -0.09
Source number of types                 -1.58
Target number of types                 -3.84

Table 2: Analysis of the characteristics of chosen sentences. Percentage difference is between sentences selected by CHIA and sentences selected randomly.
Low-resource machine translation. While our work provides a way to obtain high-quality parallel data for low-resource machine translation, existing research has investigated how monolingual data can be used to develop MT systems (Pytlik and Yarowsky, 2006; Klementiev et al., 2012; Gülçehre et al., 2015; Sennrich et al., 2016; Zhang and Zong, 2016; Domhan and Hieber, 2017; Gibadullin et al., 2019). Monolingual corpora can be used to generate pseudo-parallel data through back-translation (Sennrich et al., 2016; Hoang et al., 2018), round-trip training (Cheng et al., 2016), or copying target language sentences to the source (Currey et al., 2017). Monolingual corpora have also been exploited by unsupervised methods (Lample et al., 2018; Artetxe et al., 2018; Liu et al., 2020) that need limited or no parallel data. However, on truly low-resource languages, existing unsupervised methods have been found to perform poorly (Guzmán et al., 2019; Marchisio et al., 2020).
Collecting parallel data in an efficient and cost-effective manner is thus important for building and evaluating MT systems for low-resource languages. For medium- and high-resource languages, parallel data can be collected by scraping the web (Bañón et al., 2020; Ramesh et al., 2021), or from religious corpora (Resnik et al., 1999; Agić and Vulić, 2019; McCarthy et al., 2020) or parliamentary proceedings (Koehn, 2005). Since such resources are not available for low-resource languages, other techniques such as crowdsourcing have been used, relying on bilingual speakers (Post et al., 2012), as well as monolingual speakers with images or GIFs as a pivot (Madaan et al., 2020; Bhatnagar et al., 2021).

Conclusion
We propose CHIA, an algorithm for choosing informative instances to annotate when creating MT datasets, thereby maximizing a limited annotation budget. CHIA is based on our two contributions: 1) by extending prior work, we propose a method to identify beneficial training instances for sequence-to-sequence tasks; 2) using this method, we show that we can leverage existing parallel data in high-resource languages to identify informative instances for new languages. We find that, in comparison to randomly selected data, MT models trained on data selected using CHIA achieve average improvements in BLEU score of 1.59 points. Notably, CHIA is effective even when evaluating on low-resource languages, providing an efficient data annotation strategy.
Limitations and Future Work

In our experiments, we use English as the center language. In future work, we will investigate alternative center languages, as well as the effectiveness of our method without a center language. Additionally, a limitation of CHIA is that it requires a multi-way parallel dataset containing the center language, which might be difficult to find for specific domains. Our case study further indicates that selecting data using a single high-resource language alone may not be adequate to achieve performance improvements. In future work, we will investigate the effect of different numbers and combinations of seen languages on our method.
Furthermore, while we apply CHIA to select useful data for machine translation, our method can potentially be applied to collecting data in a low-resource language for any NLP task, provided a multi-way parallel dataset exists; this may be investigated in future work.

A.1 BLEU Scores for Europarl
The exact values for each model, trained on ambiguous subsets chosen by CHIA, on three randomly selected subsets, and on the entire available dataset, can be found below in Tables 3, 4, 5, and 6. The first ten rows are for our seen languages, and the next 30 are for our unseen languages.

A.2 BLEU scores for Bible
The exact BLEU scores for CHIA and Random are reported below in Table 7. The first ten rows correspond to the seen languages, and the rest to the low-resource languages.

Figure 2: CHIA steps: (1a) Train MT models for each high-resource language (Transformer model figure from Vaswani et al. (2017)). (1b) Using training dynamics, identify high-variability or ambiguous instances for each language pair. For readability, we plot the probability here as confidence, whereas we use loss in our experiments. (2) Using all ambiguous sets, find instances at the intersection that are most informative. We first select instances in the yellow region, then the green, and finally the red.

Figure 3: Difference in BLEU score between models trained on sentences chosen using CHIA and randomly selected sentences. The languages reported here are from our seen set of Europarl languages.

Figure 4: Difference in BLEU score between models trained on sentences chosen using CHIA and randomly selected sentences. The languages reported here are from our unseen set of Europarl languages.

Figure 5: Difference in BLEU score between models trained on sentences chosen using CHIA and randomly selected sentences. All results are reported on the Bible corpus.

Table 3: BLEU scores for language pairs for each setting where the base dataset size is 20,000.

Table 4: BLEU scores for language pairs for each setting where the base dataset size is 40,000.

Table 5: BLEU scores for language pairs for each setting where the base dataset size is 80,000.

Table 6: BLEU scores for language pairs for each setting where the base dataset size is 160,000.