Exploring Diversity in Back Translation for Low-Resource Machine Translation

Back translation is one of the most widely used methods for improving the performance of neural machine translation systems. Recent research has sought to enhance the effectiveness of this method by increasing the 'diversity' of the generated translations. We argue that the definitions and metrics used to quantify 'diversity' in previous work have been insufficient. This work puts forward a more nuanced framework for understanding diversity in training data, splitting it into lexical diversity and syntactic diversity. We present novel metrics for measuring these different aspects of diversity and carry out empirical analysis into the effect of these types of diversity on final neural machine translation model performance for low-resource English↔Turkish and mid-resource English↔Icelandic. Our findings show that generating back translation using nucleus sampling results in higher final model performance, and that this method of generation has high levels of both lexical and syntactic diversity. We also find evidence that lexical diversity is more important than syntactic for back translation performance.


Introduction
The data augmentation technique of back translation (BT) is used in nearly every current neural machine translation (NMT) system to reach optimal performance (Edunov et al., 2020; Barrault et al., 2020; Akhbardeh et al., 2021, inter alia). It involves creating a pseudo-parallel dataset by translating target-side monolingual data into the source language using a secondary NMT system (Sennrich et al., 2016). In this way, it enables the incorporation of monolingual data into the NMT system. Whilst adding data in this way helps nearly all language pairs, it is particularly important for low-resource NMT where parallel data is scarce by definition.
Because of its ubiquity, there has been extensive research into how to improve BT (Burlot and Yvon, 2018; Hoang et al., 2018; Fadaee and Monz, 2018; Caswell et al., 2019), especially in ways which increase the 'diversity' of the back-translated dataset (Edunov et al., 2018; Soto et al., 2020). Previous work (Gimpel et al., 2013; Ott et al., 2018; Vanmassenhove et al., 2019) has found that machine translations lack the diversity of human productions. This is because most translation systems use some form of maximum a posteriori (MAP) estimation, meaning that they will always favour the most probable output. Edunov et al. (2018) and Soto et al. (2020) argue that this makes standard BT data worse training data since it lacks 'richness' or diversity.
Despite the focus on increasing diversity in BT, what 'diversity' actually means in the context of NMT training data is ill-defined. In fact, Tevet and Berant (2021) point out that there is no standard metric for measuring diversity. Most previous work uses the BLEU score between candidate sentences or another n-gram based metric to estimate similarity (Zhu et al., 2018; Hu et al., 2019; He et al., 2018; Shen et al., 2019; Shu et al., 2019; Holtzman et al., 2020; Thompson and Post, 2020). However, such metrics mostly measure changes in the vocabulary or spelling. Because of this, they are likely to be less sensitive to other kinds of variety such as changes in structure.
We argue that quantifying 'diversity' using n-gram-based metrics alone is insufficient. Instead, we split diversity into two aspects: variety in the word choice and spelling, and variety in structure. We call these aspects lexical diversity and syntactic diversity respectively. Here, we follow recent work in natural language generation and particularly paraphrasing (e.g. Iyyer et al., 2018; Krishna et al., 2020; Goyal and Durrett, 2020; Huang and Chang, 2021; Hosking and Lapata, 2021) which explicitly models the meaning and form of the input separately. Of course, there are likely more kinds of diversity than this, but this distinction provides a common-sense framework to extend our understanding of the concept. To our knowledge, no other previous work in data augmentation has attempted to isolate and automatically measure syntactic and lexical diversity.
Building from our definition, we introduce novel metrics aimed at measuring lexical and syntactic diversity separately. We then carry out an empirical study into what effect training data with these two kinds of diversity has on final NMT performance in the context of low-resource machine translation. We do this by creating BT datasets using different generation methods and measuring their diversity. We then evaluate what impact different aspects of diversity have on final model performance. We find that a high level of diversity is beneficial for final NMT performance, though lexical diversity seems more important than syntactic diversity. Importantly, though, there are limits to both: the data should not be so 'diverse' that it affects the adequacy of the parallel data.
We summarise our contributions as follows:
• We put forward a more nuanced definition of 'diversity' in NMT training data, splitting it into lexical diversity and syntactic diversity. We present two novel metrics for measuring these different aspects of diversity.
• We carry out empirical analysis into the effect of these types of diversity on final NMT model performance for low-resource English↔Turkish and mid-resource English↔Icelandic.
• We find that nucleus sampling is the highest-performing method of generating BT, and that it combines both lexical and syntactic diversity.
• We make our code publicly available. 1

Methods
We explain each method we use for creating diverse BT datasets in Section 2.1, then discuss our metrics for diversity in Section 2.2.

Generating diverse back translation
We use four methods to generate diverse BT datasets: beam search, pure sampling, nucleus sampling, and syntax-group fine-tuning. The first three were chosen because they are in common use and so more relevant for future work. The last, syntax-group fine-tuning, aims to increase syntactic diversity specifically and so allows us to separate its effect on final NMT performance from lexical diversity. For each method, we create a diverse BT dataset by generating three candidate translations for each input sentence. This allows us to measure diversity whilst keeping the 'meaning' of the sentence as similar as possible. In this way, we measure inter-sentence diversity as a proxy for the diversity of the dataset as a whole. We discuss our datasets in detail in Section 3.1.

¹ github.com/laurieburchell/exploring-diversity-bt
Beam search Beam search is the most common search algorithm used to decode in NMT systems.
Whilst it is generally successful in finding a high-probability output, the translations it produces tend to lack diversity since it will always default to the most likely alternative in the case of ambiguity (Ott et al., 2018). We use beam search to generate three datasets for each language pair, using a beam size of five and no length penalty:
• base: three million input sentences used to generate one output per input (BT dataset length: three million)
• beam: three million input sentences used to generate three outputs per input (BT dataset length: nine million)
• base-big: nine million input sentences used to generate one output per input (BT dataset length: nine million)

Pure sampling An alternative to beam search is sampling from the model distribution. At each decoding step, we sample from the learned distribution without restriction to generate output. This means we are likely to generate a much wider range of tokens than if we restricted our choice to the most likely ones (as in beam search). However, it also means that the generated text is less likely to be adequate (have the same meaning as the input), as the output space is not necessarily restricted to choices which best reflect the meaning of the input. In other words, the output may be diverse, but not the kind of diversity we want for NMT training data. We create one dataset per language pair (sampling) by generating three candidate translations for each of the three million monolingual input sentences, resulting in a nine-million-line BT dataset. We set our beam size to five when generating.
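The pure-sampling decode loop can be sketched as follows. This is an illustrative stand-in rather than the paper's actual implementation: `step_logits` is a hypothetical interface representing one step of the backward NMT decoder.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_decode(step_logits, vocab, eos="</s>", max_len=50, seed=0):
    """Pure (ancestral) sampling: draw each token from the full model
    distribution rather than taking the argmax or keeping a beam.

    `step_logits(prefix)` is a hypothetical stand-in for the decoder: it
    returns one logit per vocabulary item given the tokens so far."""
    rng = random.Random(seed)
    prefix = []
    for _ in range(max_len):
        probs = softmax(step_logits(prefix))
        token = rng.choices(vocab, weights=probs, k=1)[0]
        if token == eos:
            break
        prefix.append(token)
    return prefix
```

In practice one would run such a decode three times per input sentence to obtain the candidate triples; NMT toolkits typically expose this as a sampling-based generation option.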
Nucleus sampling Nucleus or top-p sampling is another sampling-based method, introduced by Holtzman et al. (2020). Unlike pure sampling, which samples from the entire distribution, top-p sampling only samples from the highest probability tokens whose cumulative probability mass exceeds the pre-chosen threshold p. The intuition is that when only a small number of tokens are likely, we want to limit our sampling space to those. However, when there are many likely hypotheses, we want to widen the number of tokens we might sample from. We chose this method in the hope it represents a middle ground between high-probability but repetitive beam search generations, and more diverse but potentially low-adequacy pure sampling generation. We create one dataset per language pair (nucleus) by generating three hypothesis translations for each of the three million monolingual input sentences. Each dataset is therefore nine million lines long. We set the beam size to five and p to 0.95.
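The core of nucleus sampling is the per-step filtering of the token distribution. A minimal sketch (the function and variable names are ours, not taken from any toolkit):

```python
def top_p_filter(probs, vocab, p=0.95):
    """Restrict a token distribution to its nucleus: the smallest set of
    highest-probability tokens whose cumulative mass reaches p, with the
    kept probabilities renormalised to sum to one."""
    ranked = sorted(zip(probs, vocab), reverse=True)
    kept, cumulative = [], 0.0
    for prob, token in ranked:
        kept.append((prob, token))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for prob, _ in kept)
    return [prob / total for prob, _ in kept], [token for _, token in kept]
```

At each decoding step one samples from the filtered distribution instead of the full one. With p = 0.95, as used here, the low-probability tail is excluded when one token dominates, while genuinely ambiguous steps still retain many candidates.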
Syntax-group fine-tuning For our analysis in this paper, we want to generate diverse BT in a way which focuses on syntactic diversity over lexical diversity, so that we can separate out its effect on final NMT performance. We therefore take a fine-tuning approach for our final generation method. To do this, we generate the dependency parse of each sentence in the English side of the parallel data for each language pair using the Stanford neural network dependency parser (Chen and Manning, 2014). We then label each pair of parallel sentences in the training data according to the first split in the corresponding syntactic parse tree. We then create three fine-tuning training datasets out of the three biggest syntactic groups.² Finally, we take NMT models trained on parallel data alone and restart training on each syntactic-group dataset, resulting in three NMT systems which are fine-tuned to produce a particular syntactic structure. We are only able to create such models for translation into English, as good syntactic parsers are not available for the other languages in our study.
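The labelling step can be sketched as follows, assuming bracketed parse strings; the helper names are ours, and a real parser's output format may differ.

```python
import re
from collections import Counter

def first_split(parse):
    """Extract the top production (the 'first split') of a bracketed
    parse string, e.g.
    '(S (NP (PRP I)) (VP (VBD ran)) (. .))' -> 'S -> NP VP .'."""
    depth, root, children = 0, None, []
    for match in re.finditer(r"\(([^\s()]+)|(\))", parse):
        if match.group(2):           # closing bracket
            depth -= 1
        else:                        # opening bracket with its label
            if depth == 0:
                root = match.group(1)
            elif depth == 1:
                children.append(match.group(1))
            depth += 1
    return f"{root} -> {' '.join(children)}"

def biggest_syntax_groups(parses, n=3):
    """Group sentences by their first split; return the n largest labels."""
    counts = Counter(first_split(p) for p in parses)
    return [label for label, _ in counts.most_common(n)]
```

Each parallel sentence pair would then be routed into the fine-tuning dataset matching its label.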
To verify this method works as expected, we translated the test set for each language pair with the model trained on parallel data only. We then translated the same test set with each fine-tuned model and checked it was producing more of the required syntactic group. We did indeed find that fine-tuning resulted in more candidate sentences from the required group. Figure 1 gives an example of the different pattern of productions between the parallel-only model and a model fine-tuned on a particular syntactic group (S → PP , NP VP .).

² For English-Turkish, we combine the third and fourth largest syntactic groups to create the third fine-tuning dataset, as the third-largest syntactic group alone was not large enough for successful fine-tuning.

[Figure 1: counts of the productions S → NP VP ., S → PP , NP VP ., S → " S , " NP VP . and S → S , CC S . for the parallel-only model versus the fine-tuned model.]

Diversity metrics
We use three primary metrics to measure lexical and syntactic diversity: i-BLEU, i-chrF, and tree kernel difference. As mentioned in Section 2.1, we generate three output sentences for each input to our BT systems and measure inter-sentence diversity as a proxy for the diversity produced by the system. To limit compute time, we calculate all inter-sentence metrics over a sample of 30,000 sentence groups rather than the whole BT dataset.
i-BLEU Following previous work, we calculate the BLEU score between all sentence pairs generated from the same input (Papineni et al., 2002), take the mean and then subtract it from one to give inter-sentence or i-BLEU (Zhu et al., 2018). We believe that lexical diversity as we define it is the main driver of this metric, since BLEU scores are calculated based on n-gram overlap and so the biggest changes to the score will result from changes to the words used (though changes in ordering of words and their morphology will also have an effect). The higher the i-BLEU score, the higher the diversity of output.
i-chrF Building from i-BLEU, we introduce i-chrF, which is generated in the same way as i-BLEU but using chrF (Popović, 2015). Since chrF is also based on n-gram overlap, we believe it will also mostly measure lexical diversity. However, i-chrF is based on character rather than word overlap, and so should be less affected by morphological changes to the form of words than i-BLEU. We calculate both chrF and BLEU scores using the sacreBLEU toolkit (Post, 2018).
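Both metrics follow the same recipe: mean pairwise similarity among the candidates generated from one input, subtracted from one. A sketch, with a toy unigram-overlap similarity standing in for BLEU or chrF (the paper computes the real scores with sacreBLEU):

```python
from itertools import permutations

def inter_sentence_diversity(candidates, similarity):
    """i-BLEU/i-chrF style score: one minus the mean pairwise similarity
    between candidate translations of the same input. `similarity` maps
    (hypothesis, reference) to a score in [0, 1]; plug in BLEU for
    i-BLEU or chrF for i-chrF. Ordered pairs are used since BLEU is
    asymmetric in hypothesis and reference."""
    pairs = list(permutations(candidates, 2))
    mean_similarity = sum(similarity(h, r) for h, r in pairs) / len(pairs)
    return 1.0 - mean_similarity

def unigram_f1(hypothesis, reference):
    """Toy stand-in similarity: F1 over word types."""
    h, r = set(hypothesis.split()), set(reference.split())
    if not h or not r:
        return 0.0
    return 2 * len(h & r) / (len(h) + len(r))
```

With the sacreBLEU library installed, `similarity` could instead be something like `lambda h, r: sacrebleu.sentence_bleu(h, [r]).score / 100` (an assumption about usage, not code from the paper).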
Tree kernel difference We propose a novel metric which focuses on syntactic diversity: mean tree kernel difference. To calculate it, we first generate the dependency parse of each candidate sentence using the Stanford neural network dependency parser (Chen and Manning, 2014). We replace all terminals with a dummy token to minimise the effect of lexical differences, then we calculate the tree kernel for each pair of parses using code from Conklin et al. (2021), which is in turn based on Moschitti (2006). Finally, we calculate the mean across all pairs to give the mean tree kernel difference for each set of generated sentences.
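A minimal version of this computation is sketched below, using the classic Collins-Duffy subtree kernel as a stand-in for the exact kernel in the cited code, with parses represented as nested tuples whose terminals are already masked to "X"; the decay value is an arbitrary illustrative choice.

```python
from itertools import combinations, product
from math import sqrt

def nodes(tree):
    """All internal nodes of a nested-tuple tree like ('S', ('NP', 'X'), ...)."""
    if isinstance(tree, str):
        return []
    return [tree] + [n for child in tree[1:] for n in nodes(child)]

def production(node):
    """Label of a node plus the labels of its immediate children."""
    return (node[0],) + tuple(c[0] if isinstance(c, tuple) else c for c in node[1:])

def delta(n1, n2, lam=0.5):
    """Collins-Duffy contribution of a node pair: counts shared subtrees."""
    if isinstance(n1, str) or isinstance(n2, str):
        return 0.0
    if production(n1) != production(n2):
        return 0.0
    if all(isinstance(c, str) for c in n1[1:]):  # preterminal node
        return lam
    result = lam
    for c1, c2 in zip(n1[1:], n2[1:]):
        result *= 1.0 + delta(c1, c2, lam)
    return result

def kernel(t1, t2):
    return sum(delta(a, b) for a, b in product(nodes(t1), nodes(t2)))

def tree_kernel_difference(t1, t2):
    """One minus normalised similarity; higher means more diverse syntax."""
    return 1.0 - kernel(t1, t2) / sqrt(kernel(t1, t1) * kernel(t2, t2))

def mean_tree_kernel_difference(trees):
    """Mean difference over all unordered pairs of candidate parses."""
    pairs = list(combinations(trees, 2))
    return sum(tree_kernel_difference(a, b) for a, b in pairs) / len(pairs)
```

Identical parses score zero; structurally different parses score above zero, so the mean over each generated triple rises with syntactic diversity.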
We are only able to calculate the tree kernel metric for the English datasets due to the lack of reliable parsers in Turkish and Icelandic, though this method could extend to any language with a reasonable parser available. The higher the score, the higher the diversity of the output.
Summary statistics We calculate mean word length, mean sentence length, and vocabulary size over the entire generated dataset as summary statistics. We use the definition of 'word' as understood by the bash wc command to calculate all metrics, since we are only interested in a rough measure to check for degenerate results.
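These statistics reduce to a few lines; as in the paper, a 'word' is any whitespace-separated token, matching the behaviour of `wc`:

```python
def summary_stats(lines):
    """Mean word length, mean sentence length (in words), and vocabulary
    size over a generated dataset, given one sentence per line."""
    words = [w for line in lines for w in line.split()]
    return {
        "mean_word_len": sum(len(w) for w in words) / len(words),
        "mean_sent_len": len(words) / len(lines),
        "vocab_size": len(set(words)),
    }
```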

Experiments
Having discussed the methods by which we generate diverse BT datasets and the metrics with which we measure the diversity in these datasets, we now outline our experimental set up for testing the effect of training data diversity on final NMT model performance.

Data and preprocessing
We carry out our experiments on two language pairs: low-resource Turkish-English and mid-resource Icelandic-English. These languages are sufficiently low-resource that augmenting the training data will likely be beneficial, but well-resourced enough that we can still train a reasonable back-translation model on the available parallel data alone.
Data provenance The Turkish-English parallel data is from the WMT 2018 news translation task (Bojar et al., 2018). The training data is from the SETIMES dataset, a parallel dataset of news articles in Balkan languages (Tiedemann, 2012). We use the development set from WMT 2016 and the test sets from WMT 2016-18.
The English monolingual data is made up of news crawl data from 2016 to 2020, version 16 of the news-commentary crawl,⁴ and crawled news discussions from 2012 to 2019.⁵ The Turkish monolingual data is news crawl data from 2016 to 2020.⁶ The Icelandic monolingual data is made up of news crawl data from 2020, and part one of the Icelandic Gigaword dataset (Steingrímsson et al., 2018).
Data cleaning Our cleaning scripts are adapted from those provided by the Bergamot project.⁷ The full data preparation procedure is provided in the repo accompanying this paper. After cleaning, the Turkish-English parallel dataset contains 202 thousand lines and the Icelandic-English parallel dataset contains 3.97 million lines. The English, Icelandic, and Turkish cleaned monolingual datasets contain 487 million, 39.9 million, and 26.1 million lines respectively. We select nine million lines of each monolingual dataset for BT at random, since all the monolingual datasets are in the same domain as the test sets.

Text pre-processing We learn a joint BPE model with SentencePiece using the concatenated training data for each language pair (Kudo and Richardson, 2018). We set the vocabulary size to 16,000 and character coverage to 1.0. All other settings are default. We apply this model to the training, development, and test data. We remove the BPE segmentation before calculating any metrics.
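The corresponding SentencePiece training call looks roughly like this (the file names are placeholders, and this is a sketch of the settings described above rather than the paper's exact script):

```python
import sentencepiece as spm

# Train a joint BPE model on the concatenated source+target training data.
# Settings follow the description above: BPE, vocabulary size 16,000,
# character coverage 1.0; everything else is left at library defaults.
spm.SentencePieceTrainer.train(
    input="train.en-tr.concat.txt",   # placeholder path
    model_prefix="joint_bpe",
    model_type="bpe",
    vocab_size=16000,
    character_coverage=1.0,
)

# Applying segmentation, and removing it again before scoring:
sp = spm.SentencePieceProcessor(model_file="joint_bpe.model")
pieces = sp.encode("An example sentence.", out_type=str)
restored = sp.decode(pieces)
```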

Model training
Model architecture and infrastructure All NMT models in this paper are transformer models (Vaswani et al., 2017). We give full details about hyper-parameters and infrastructure in Appendix A.2.
Parallel-only models for back translation For each language pair and in both directions, we train an NMT model on the cleaned parallel data alone using the relevant hyper-parameter settings in Table 5. We measure the performance of these models by calculating the BLEU score (Papineni et al., 2002) using the sacreBLEU toolkit (Post, 2018)⁸ and by evaluating the translations with COMET using the wmt20-comet-da model (Rei et al., 2020).

Generating back translation We use these parallel-only models to generate the diverse BT datasets as described in Section 2.1. We translate the same three million sentences of monolingual data each time for consistency, translating an additional six million lines of monolingual data for the base-big dataset.

Training final models We train final models for each language direction on the concatenation of the parallel data and each back-translation dataset (back-translation on the source side, original monolingual data as target). We measure the final performance of these models using BLEU and COMET as before.

Results and Analysis
Final model performance
Figures 2 and 3 show the mean BLEU and COMET scores achieved by the final models trained on the concatenation of the parallel data and the different BT datasets. In most cases, adding any BT data to the training data results in some improvement over the parallel-only baseline for both scores. However, augmenting the training data with BT produced with nucleus sampling nearly always results in the strongest performance, with mean gains of 2.88 BLEU or 0.078 COMET. This compares to mean gains of 2.24 BLEU or 0.026 COMET when using the baseline BT dataset of three million lines translated with beam search. Pure sampling tends to perform similarly but not quite as well as nucleus sampling. Based on this result, we suggest that future work generate BT with nucleus sampling rather than pure sampling.

Diversity metrics
We give the diversity metrics for each language pair and each generated dataset in Tables 1 to 4.⁹ Sentence and word lengths are comparable across the same language for all generation methods, suggesting that each method is generating tokens from roughly the right language distribution. However, the vocabulary size is much larger for nucleus compared to base or beam, and that of sampling is around twice that of nucleus. Examining the data, we find many neologisms (that is, 'words' which do not appear in the training data) for nucleus and more still for sampling. We note that the syntax-groups dataset has a much smaller vocabulary again; this is what we would hope if the generation method is producing syntactic rather than lexical diversity as required. We give representative examples of generated triples in Appendix A.1, along with some explanation of how the phenomena they demonstrate fit into the general trend of the dataset.
Effect on performance With respect to the inter-sentence diversity metrics (i-BLEU, i-chrF, and tree kernel scores), we see that the sampling dataset has the highest diversity scores, followed by nucleus, then syntax-groups, then beam. Taken together with the performance scores and the summary statistics, this suggests that NMT data benefits from a high level of diversity, but not so high that the two halves of the parallel data no longer have the same meaning (as shown by the very high vocabulary size for sampling).
Metric correlation There is a high correlation between i-BLEU, i-chrF, and tree kernel score for the beam, sampling, and nucleus datasets. This is not entirely unexpected: it is likely to be difficult if not impossible to disentangle lexical and syntactic diversity, since changing sentence structure will also affect the word choice and vice versa. This correlation is much weaker for the syntax-groups dataset: whilst the tree-kernel scores are comparable to the sampling and nucleus datasets, there is a much smaller increase in the other (lexical) diversity scores. This suggests that this generation method encourages relatively more syntactic variation than lexical compared to the other diverse generation methods, as was its original aim (see the paragraph on syntax-group fine-tuning in Section 2.1). The fact that the final model trained on this BT dataset has lower performance compared to other forms of diversity suggests that lexical diversity is more important than syntactic diversity when undertaking data augmentation. We leave it to future work to investigate this hypothesis further.

Data augmentation versus more monolingual data
The right-most cross in each quadrant of Figures 2 and 3 gives the performance of base-big, the dataset where we simply add six million more lines of new data rather than carrying out data augmentation. Interestingly, pure and nucleus sampling both often outperform base-big. This may be because the model over-fits to too much back-translated data, whereas having multiple sufficiently-diverse pseudo-source sentences for each target sentence has a regularising effect on the model.
To further support this hypothesis, Figure 4 gives training perplexity for the first 50,000 steps of training for the final Icelandic→English models, which are representative of the results for the other language pairs. We see that the base-big dataset has the lowest training perplexity at each step, suggesting this data is easier to model. Conversely, the model has highest training perplexity on the sampling and nucleus datasets, suggesting generating the data this way has a regularising effect.

Translationese effect
Several studies have found that back-translated text is easier to translate than forward-translated text, and so inflates intrinsic metrics like BLEU (Edunov et al., 2020; Graham et al., 2020; Roberts et al., 2020). To use a concrete example, the WMT test sets for English to Turkish are made up of half native English translated into Turkish, and half native Turkish translated into English. We want models that perform well when translating from native text (in this example: the native English side), as this is the usual direction of translation. However, half the test set is made up of translations on the source side. The translationese effect means that the model will usually get higher scores on this half of the test set, potentially inflating the score. Consequently, the intrinsic metrics could suggest choosing a model that does not actually perform well on the desired task (translating from native text). We investigate this effect in our own work by examining the mean BLEU scores for each model on each half of the test sets, giving the results in Figure 5. Each bar indicates the mean percentage change in BLEU scores over the parallel-only baseline model for the models trained on the different BT datasets, so a larger bar means a better performing model. The left-hand bars in each quadrant show the performance of each model on the back-translated half of the test set (to native) and the right-hand bars give the performance of each model on the forward-translated half of the test set (from native).
We see a significant translationese effect for all models, as the percentage change in scores over the baseline are much higher when the models translate already translated text (the left-hand side bars are higher than the right-hand ones). However, it appears that the nucleus dataset is less affected by the translationese effect than the other datasets, since it shows less of a decline in performance when translating native text. This may be due to a similar regularising effect as discussed previously, as it is more difficult for the model to overfit to BT data when it is generated with nucleus sampling. A direction for future research is how to obtain the benefits of using monolingual data (as BT does) without exacerbating the translationese effect.

Related work
Improving back translation The original paper introducing BT by Sennrich et al. (2016) found that using a higher-quality NMT system for BT led to higher BLEU scores in the final trained system. Later work presents a comprehensive survey of BT and its variants as applied to low-resource NMT.

Diversity in machine translation
Most of the work on the lack of diversity in machine-translated text is in the context of automatic evaluation (Edunov et al., 2020; Graham et al., 2020; Roberts et al., 2020). As for diversity in BT specifically, Edunov et al. (2018) argue that MAP prediction, as is typically used to generate BT through beam search, leads to overly-regular synthetic source sentences which do not cover the true data distribution. They propose instead generating BT with sampling or noised beam outputs, and find model performance increases for all but the lowest resource scenarios. Alternatively, Soto et al. (2020) generate diverse BT by training multiple machine-translation systems with varying architectures.
Generating diversity Increasing diversity in BT is part of the broader field of diverse generation, by which we mean methods to vary the surface form of a production whilst keeping the meaning as similar as possible. Aside from generating diverse translations (Gimpel et al., 2013; He et al., 2018; Shen et al., 2019; Nguyen et al., 2020; Li et al., 2021), it is also used in question answering systems (Sultan et al., 2020), visually-grounded generation (Vijayakumar et al., 2018), conversation models (Li et al., 2016), and particularly paraphrasing (Mallinson et al., 2017; Wieting and Gimpel, 2018; Hu et al., 2019; Thompson and Post, 2020; Goyal and Durrett, 2020; Krishna et al., 2020). Some recent work such as Iyyer et al. (2018), Huang and Chang (2021), and Hosking and Lapata (2021) explicitly models the meaning and the form of the input separately. In this way, they aim to vary the syntax of the output whilst preserving the semantics so as to generate more diverse paraphrases. Unfortunately, these methods are difficult to apply to a low-resource scenario as they require external resources (e.g. accurate syntactic parsers, large-scale paraphrase data) which are not available for most of the world's languages.

Conclusion
In this paper, we introduced a two-part framework for understanding diversity in NMT data, splitting it into lexical diversity and syntactic diversity.
Our empirical analysis suggests that whilst high amounts of both types of diversity are important in training data, lexical diversity may be more beneficial than syntactic. In addition, achieving high diversity in BT should not come at the expense of adequacy.

Acknowledgements
Finally, the authors would like to thank our anonymous reviewers for their time and helpful comments, and we give special thanks to Henry Conklin and Bailin Wang for their help with tree kernels and many useful discussions.