Multi-Sentence Resampling: A Simple Approach to Alleviate Dataset Length Bias and Beam-Search Degradation

Neural Machine Translation (NMT) is known to suffer from a beam-search problem: after a certain point, increasing beam size causes an overall drop in translation quality. This effect is especially pronounced for long sentences. While much work has been done analyzing this phenomenon, primarily for autoregressive NMT models, there is still no consensus on its underlying cause. In this work, we analyze errors that cause major quality degradation with large beams in NMT and Automatic Speech Recognition (ASR). We show that a factor that strongly contributes to the quality degradation with large beams is dataset length bias: NMT datasets are strongly biased towards short sentences. To mitigate this issue, we propose a new data augmentation technique, Multi-Sentence Resampling (MSR). This technique extends the training examples by concatenating several sentences from the original dataset to make a long training example. We demonstrate that MSR significantly reduces degradation with growing beam size and improves final translation quality on the IWSLT15 En-Vi, IWSLT17 En-Fr, and WMT14 En-De datasets.


Introduction
In this work, we address the beam-search problem in Neural Machine Translation (Koehn and Knowles, 2017). Beam search is the standard hypothesis search method for autoregressive sequence generation. Large beams provide more probable hypotheses than small beams; however, the overall translation quality drops with growing beam size after a certain point. This effect is especially strong for long sentences and is connected with the fact that NMT models are biased towards assigning high probabilities to short hypotheses. Stahlberg and Byrne (2019) showed that exact search by likelihood for neural machine translation finds the empty string as the optimal hypothesis in more than 50% of cases.
One of the best-known methods to mitigate quality degradation with growing beam size is length normalization (Bahdanau et al., 2016; Wu et al., 2016). This technique normalizes the log-likelihood of a hypothesis in beam search by its length, thus promoting long hypotheses. Other methods add a reward to each token's score during the decoding process (Yang et al., 2018; Murray and Chiang, 2018).
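As a minimal sketch of the idea behind length normalization, assume a hypothesis is represented by its per-token log-probabilities. The function names, the `alpha` exponent, and the toy probabilities below are illustrative assumptions, not the exact formulation of any cited paper or toolkit.

```python
import math

def raw_score(token_logprobs):
    """Unnormalized score: the sum of per-token log-probabilities."""
    return sum(token_logprobs)

def length_normalized_score(token_logprobs, alpha=1.0):
    """Score divided by length**alpha (alpha=1.0 gives the per-token average)."""
    return sum(token_logprobs) / (len(token_logprobs) ** alpha)

# Hypothetical hypotheses: a short one with less confident tokens
# and a longer one with more confident tokens.
short_hyp = [math.log(0.5)] * 2
long_hyp = [math.log(0.6)] * 4

# Without normalization the short hypothesis wins simply because
# fewer negative terms are summed.
print(raw_score(short_hyp) > raw_score(long_hyp))
# With normalization the longer, more confident hypothesis wins.
print(length_normalized_score(short_hyp) < length_normalized_score(long_hyp))
```

Both comparisons hold for these toy values, which shows how dividing by length removes the built-in advantage of short hypotheses.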
While the beam-search problem has been extensively studied (Sountsov and Sarawagi, 2016; Murray and Chiang, 2018; Kumar and Sarawagi, 2019; Meister et al., 2020; Eikema and Aziz, 2020; Yang et al., 2020; Wang and Sennrich, 2020), there is still no consensus on the underlying reason for such model behavior. Furthermore, prior work has investigated this problem primarily for NMT models, giving little attention to other domains that are also known to suffer from it, such as Automatic Speech Recognition (ASR) (Chorowski and Jaitly, 2017; Zhou et al., 2020).
Murray and Chiang (2018) noticed that, since a negative log-probability is added to the hypothesis' score at each step of beam-search generation, there is no way to downgrade the probability of an already generated sequence of tokens afterward if the model has overestimated it. Consequently, models are biased towards finalizing a short hypothesis by generating an end-of-sequence token (EOS) rather than generating a long continuation. Their experiments show a connection between quality degradation and decreasing length of hypotheses with growing beam size. Our work is in agreement with their explanation. Moreover, we show that the main quality degradation with large beams in NMT and ASR comes from short translations obtained by early termination of long hypotheses from small beams.
Our work examines how the distribution of sentence lengths in a dataset affects the beam-search problem. We demonstrate that the beam-search problem is strongly connected with the distribution of sentence lengths in training datasets. Specifically, we show that common NMT datasets, such as IWSLT and WMT, exhibit a strongly skewed distribution of sentence lengths, with the mode at short sentences. NMT models trained on them learn biased probability distributions and fail on long sentences during inference. In contrast, for ASR models trained on Librispeech (Panayotov et al., 2015), where the distribution of sentence lengths is more symmetrical and biased towards longer sentences, the beam-search degradation occurs at much larger lengths. Based on our findings, we propose a simple and effective dataset augmentation technique that makes training examples longer: Multi-Sentence Resampling. It creates a new dataset where each training sample can be a concatenation of multiple sentences. Our method alleviates quality degradation with growing beam size and increases the final quality of the model.
The key contributions of our work are as follows:
• We show that quality degradation with growing beam size comes mostly from short translations, which are early finalized prefixes of long hypotheses;
• We show that training datasets that are biased towards short sentences strongly contribute to the beam-search problem;
• We introduce Multi-Sentence Resampling, a simple and effective dataset augmentation technique that alleviates the beam-search problem and increases the final translation quality.

Quality degradation analysis
This section analyzes quality degradation with growing beam size for two systems: Neural Machine Translation (NMT) and Automatic Speech Recognition (ASR). ASR is also known to suffer from the beam-search problem (Zhou et al., 2020; Chorowski and Jaitly, 2017). The models, training, and evaluation processes of the NMT and ASR systems in our work are almost identical. However, the ASR dataset (Librispeech) has a property that differs from the Machine Translation setting: the average length of target sentences in the training dataset is much larger than in the test set, which is why we chose this task for comparison.

Experimental setup
In order to make an informative comparative analysis of beam-search quality degradation between NMT and ASR, we aimed to make the model and experimental setup for both tasks as similar as possible to minimize mismatch. Specifically, the vocabulary, pre-processing, and models (except the first two layers of the ASR encoder) are identical between the two tasks.

Datasets and preprocessing
We use the IWSLT2017 Fr-En, IWSLT2015 En-Vi, WMT2014 En-De, and Librispeech (Panayotov et al., 2015) datasets. The information about them is summarized in Table 1. We use the standard validation and test splits for WMT; for the IWSLT En-Vi pair we use test 2012 for validation and test 2013 as a test set; for the IWSLT En-Fr pair we use development set 2010 for validation and test 2015 as a test set. The bulk of the analysis is done on IWSLT17 Fr-En and Librispeech clean, as they are similar in the number of target-side tokens. As there is no information about the case in Librispeech, we converted Librispeech and IWSLT2017 Fr-En to lowercase to have similar conditions for these datasets. For all other datasets the casing is unchanged. We preprocess all datasets with the Moses toolkit and use BPE (Sennrich et al., 2016) with vocabulary size 32k for WMT and 5k for the other datasets, as small vocabularies are beneficial for small datasets (Ding et al., 2019).

Model and Optimization
For NMT, we use the Transformer base model (Vaswani et al., 2017) from fairseq (Ott et al., 2019). For IWSLT, we use a batch size of 8k tokens and a dropout coefficient of 0.2; all other parameters are kept as in (Vaswani et al., 2017). Models are trained until convergence on a validation dataset.
For ASR, we use Transformer base with two additional convolutional layers in the encoder, as suggested in (Wang et al., 2020a). All parameters for ASR are kept as in the original paper.

Inference and Evaluation
To produce length-normalized hypotheses, we use standard beam search from fairseq (Ott et al., 2019). For evaluation, we average the last 5 checkpoints and use BLEU (Papineni et al., 2002) computed via Sacrebleu (Post, 2018).
For evaluating the ASR system, we use word error rate (WER) (Marzal and Vidal, 1993), a standard metric that measures the edit distance from the generated sequence to the reference.

Degradation analysis
In this section, we analyse which errors contribute to quality degradation with growing beam size in ASR and NMT and provide additional evidence connecting the beam-search quality degradation with the shortening of hypotheses on large beams.
Here, and later in this work, we refer to models decoded with length-normalized beam search as normalized, and to models decoded without length normalization as unnormalized. Figure 1 shows the quality of the IWSLT Fr-En and Librispeech models on the test sets with beam size growing from 1 to 800. Normalized models do not show quality degradation with growing beam size. However, without length normalization, quality drops significantly with increasing beam size.
To show which test samples cause a drop in quality, we divide the test sets into several categories, based on the hypotheses from beam size 5 and beam size 400. These categories are the following:
• b400 ≥ b5: sentences on which the sentence-level BLEU of the top hypothesis from beam size 400 is greater than or equal to the sentence-level BLEU of the corresponding hypothesis from beam size 5. In other words, all cases where quality improved or did not change with the large beam size;
• b400 ≲ b5: sentences where the best hypothesis from beam 400 is a prefix of the corresponding best hypothesis from beam 5 (ignoring the EOS token and "." before EOS). An example of this category is the pair of hypotheses "I can" from beam size 400 and "I can do this tomorrow if you wait." from beam size 5: the first is a prefix of the second;
• b400 < b5: all other cases that are not in the first two categories. In other words, examples where quality drops but the top hypothesis from beam 400 is not a prefix of the corresponding top hypothesis from beam 5.
Table 2 shows how hypotheses are distributed among the categories. The smallest is the prefix category, "b400 ≲ b5". Such examples make up only 3% of cases in IWSLT and nearly 1% in Librispeech for the unnormalized models. This is significantly less than the category "b400 < b5", which represents all other cases where quality drops.
Although examples where the EOS token appeared too early during the generation of a reasonable, long hypothesis form the smallest category, they make the greatest contribution to the overall quality degradation with growing beam size.
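The categorization above can be sketched as follows. This is only an illustration under stated assumptions: hypotheses are lists of tokens with EOS already removed, sentence-level BLEU scores are computed elsewhere and passed in, and the order of the checks (BLEU comparison first, then the prefix test) as well as the function and label names are choices made for this sketch.

```python
def strip_final_period(tokens):
    """Drop a trailing "." so that it is ignored by the prefix test."""
    return tokens[:-1] if tokens and tokens[-1] == "." else tokens

def is_proper_prefix(short, long):
    """True if `short` is strictly shorter than `long` and matches its start."""
    return len(short) < len(long) and long[:len(short)] == short

def categorize(hyp_b5, hyp_b400, bleu_b5, bleu_b400):
    """Assign one test sentence to one of the three categories."""
    if bleu_b400 >= bleu_b5:
        return "improved_or_equal"
    if is_proper_prefix(strip_final_period(hyp_b400), strip_final_period(hyp_b5)):
        return "prefix"
    return "other"

# The example pair from the text, with made-up BLEU values.
hyp_b5 = "I can do this tomorrow if you wait .".split()
hyp_b400 = "I can".split()
print(categorize(hyp_b5, hyp_b400, bleu_b5=40.0, bleu_b400=2.0))  # prefix
```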
Consider Table 3, which shows quality in terms of BLEU/WER for different categories and beam sizes. The biggest drop in quality is in the prefix category: BLEU drops from 38.71 to almost 0 for IWSLT, and WER increases from 22.9 to 86.5 for Librispeech. Performance within the category "b400 < b5" degrades more modestly, losing 5 BLEU and gaining 3.52 WER, respectively. Weighted by the fraction of each category within the datasets, the prefix category contributes nearly 3 times more than the non-prefix category to the overall BLEU drop on IWSLT (1.16 vs. 0.42), and nearly 4 times more to the overall WER increase on Librispeech (0.57 vs. 0.13).
Table 4 shows that the prefix category (early EOS) is also the most significant in terms of length reduction with growing beam size. The length for beam 400 in this category is nearly 84% lower than for beam 5. Interestingly, early EOS occurs mainly in examples where the top hypothesis from beam 5 is long: on average 53.8 tokens in IWSLT and 67.17 in Librispeech, which is much longer than the average lengths over the whole test datasets, 24.53 and 26.73 respectively. This observation adds further evidence for the connection between hypothesis shortening and quality degradation with growing beam size.
Our findings relate to work studying calibration problems of NMT, which shows that NMT architectures are poorly calibrated, especially for the EOS token (Kumar and Sarawagi, 2019; Wang et al., 2020b).
Table 4: Average token lengths of best hypotheses from different beam sizes and categories. Column "Contribution to shortening" represents the difference between columns "beam 400" and "beam 5" weighted by the fraction of the corresponding category in the dataset.
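The weighting behind the contribution figures quoted for Table 3 can be illustrated with a short calculation. The function name is hypothetical, and the inputs are the rounded values quoted in the text (about 3% of test sentences in the IWSLT prefix category, BLEU falling from 38.71 to roughly 0), so the result is only approximate.

```python
def weighted_contribution(fraction, score_small_beam, score_large_beam):
    """Absolute change in a category's score, weighted by the
    category's share of the test set."""
    return fraction * abs(score_small_beam - score_large_beam)

# IWSLT prefix category, using the rounded numbers quoted in the text.
prefix_iwslt = weighted_contribution(0.03, 38.71, 0.0)
print(round(prefix_iwslt, 2))  # 1.16
```

A small category can therefore dominate the overall degradation when its per-category drop is large enough.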

Dataset length-bias
Having found further evidence to link length bias with the beam search problem, we examine and compare the distribution of sentence lengths in typical NMT and ASR datasets.
Consider Figure 2, which shows the distribution of sentence lengths in the IWSLT17 Fr-En, IWSLT15 En-Vi, WMT14 De-En, and Librispeech datasets. The NMT datasets have an average sentence length between 20 and 25 tokens and exhibit a strong, asymmetric skew towards short sentences. In contrast, the Librispeech training dataset exhibits a more symmetric distribution of sentence lengths, with an average length of more than 40. At the same time, the distribution of lengths in its test and validation sets is similar to that of the NMT datasets. As a result, during training on Librispeech, the model sees a far larger and more diverse set of long sentences than is encountered in evaluation.
We now investigate how quality relates to the length of the target sentences. Figure 3 shows BLEU/WER scores on the test sets for buckets based on the reference length. Comparing Figure 2 with Figure 3, we note an interesting feature: quality degradation on unnormalized beam 400 in machine translation tasks starts after a length of around 30 and mostly monotonically increases as we go to longer sentences. In contrast, on Librispeech, the quality starts to drop only after length 60, which correlates with the distribution of lengths of the training examples. Specifically, in the IWSLT17 Fr-En dataset, only 30% of training sentences have lengths greater than 30 and less than 10% are longer than 50 tokens. In contrast, Librispeech has many training sentences with a length of 60 tokens or less, and their number drops rapidly only after this value, with about 5% of sentences having a length greater than 65 tokens.
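Statistics of this kind, such as the share of training sentences longer than a given length or the grouping of test sentences into reference-length buckets, can be computed with a few lines of code. The sketch below assumes lengths are already available as integers (for example, BPE token counts); the function names, the bucket width, and the toy lengths are illustrative.

```python
from collections import defaultdict

def fraction_longer_than(lengths, threshold):
    """Share of sentences whose length is strictly greater than `threshold`."""
    if not lengths:
        raise ValueError("lengths must be non-empty")
    return sum(1 for n in lengths if n > threshold) / len(lengths)

def bucket_by_length(lengths, width=10):
    """Map each bucket start (0, width, 2*width, ...) to the indices of the
    sentences whose length falls into [start, start + width)."""
    buckets = defaultdict(list)
    for index, n in enumerate(lengths):
        buckets[(n // width) * width].append(index)
    return dict(buckets)

# Toy lengths, not taken from any of the datasets in the paper.
toy_lengths = [5, 12, 18, 25, 31, 40, 55, 8, 14, 22]
print(fraction_longer_than(toy_lengths, 30))  # 0.3
print(fraction_longer_than(toy_lengths, 50))  # 0.1
print(sorted(bucket_by_length(toy_lengths)))  # [0, 10, 20, 30, 40, 50]
```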
Thus, we can clearly see that beam-search quality degrades on sentences whose reference lengths are underrepresented in the training datasets. This brings us to one of the main ideas of our work: training datasets biased towards short sentences strongly contribute to the quality degradation with growing beam size. In typical Machine Translation training datasets (IWSLT and WMT), long sentences are significantly underrepresented, causing models to overfit to shorter sentences and overestimate the probabilities of short prefixes. In the next section, we propose a data augmentation strategy to alleviate this issue.
Multi-Sentence Resampling in pseudocode (D is the original dataset, S is the number of examples to generate, N is the maximum number of concatenated sentences, R is the resulting dataset):
    R ← empty list
    for i in 1..S do
        n ← random integer from 1 to N
        new_source = ""
        new_target = ""
        for k in 1..n do
            (cur_source, cur_target) = sample random example from D
            new_source += cur_source
            new_target += cur_target
        end
        R.append((new_source, new_target))
    end
    return R

Multi-Sentence Resampling
In this section, we introduce Multi-Sentence Resampling (MSR), a simple data augmentation method that alleviates the beam-search problem by addressing dataset length bias and increases the overall quality of translation models. Specifically, MSR augments a dataset such that, instead of one sentence, each training example consists of 1 to N sentences, randomly chosen from the dataset and concatenated one after another. It preserves the order of sentences: the source side is concatenated to the source side, and the target side is concatenated to the target side of a new training example. For each new training example, MSR randomly chooses from 1 to N sentences, so that the model does not overfit to a particular number of sentences. Figure 4 illustrates the procedure.
In contrast to other methods that use rescoring of hypotheses and per-token rewards (Yang et al., 2018) or predict the target length separately (Yang et al., 2020), our method does not change the search procedure.
Figure 5 illustrates how the length distribution of training examples changes in the IWSLT17 Fr-En dataset for N from 2 to 5. With growing N, the distributions shift towards longer examples: if the number of concatenated sentences is drawn uniformly from 1 to N, the expected length of a training example is L(N + 1)/2, where L is the average length of the original dataset.
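A minimal Python sketch of the resampling procedure described above is given below. It is not the authors' implementation: it assumes that the dataset is a list of (source, target) string pairs, that sentences are sampled with replacement, that the number of sentences per example is drawn uniformly, and that sentences are joined with a single space. The function and parameter names are hypothetical.

```python
import random

def multi_sentence_resampling(dataset, num_examples, max_sentences, seed=0, sep=" "):
    """Build `num_examples` new (source, target) pairs, each the concatenation
    of 1..max_sentences pairs sampled at random from `dataset`."""
    if not dataset:
        raise ValueError("dataset must be non-empty")
    if max_sentences < 1:
        raise ValueError("max_sentences must be at least 1")
    rng = random.Random(seed)
    resampled = []
    for _ in range(num_examples):
        n = rng.randint(1, max_sentences)  # inclusive on both ends
        pairs = [rng.choice(dataset) for _ in range(n)]
        # Sources and targets are concatenated in the same order, so the
        # k-th source sentence still aligns with the k-th target sentence.
        new_source = sep.join(src for src, _ in pairs)
        new_target = sep.join(tgt for _, tgt in pairs)
        resampled.append((new_source, new_target))
    return resampled

# Toy usage with placeholder "sentences".
toy_data = [(f"s{i}", f"t{i}") for i in range(5)]
for src, tgt in multi_sentence_resampling(toy_data, num_examples=3, max_sentences=4):
    print(src, "|||", tgt)
```

Fixing the seed keeps the augmented dataset reproducible, which makes it easier to compare runs with different values of N.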

Experiments
In this section, we provide an empirical evaluation of Multi-Sentence Resampling on the IWSLT and WMT datasets.

Experimental setup
To compare with the original Transformer paper (Vaswani et al., 2017), where the authors used different beam-search parameters and BLEU computation, we added an additional part to Table 5. For this part we changed the length penalty to 0.6 in beam search and computed BLEU as in (Vaswani et al., 2017). As baselines we use standard models trained without data augmentation.
In our experiments, we set M = 10 for the IWSLT datasets, i.e., the new training dataset is 10 times larger than the original one, and M = 5 for WMT2014 En-De, as this dataset is much bigger.
Figure 6 compares quality degradation with growing beam size for the baseline and MSR with N = 4 on the IWSLT17 Fr-En and WMT14 En-De datasets. As an additional baseline, we compare MSR with a simpler strategy on IWSLT: resampling the dataset multiple times so that the probability of a sentence is proportional to its length. This way, long sentences occur more frequently during training. There are several interesting points in this comparison. Firstly, Multi-Sentence Resampling achieves significantly better quality than the baseline on both datasets. Secondly, while the baseline's quality rapidly drops with growing beam size, the quality of Multi-Sentence Resampling drops much more slowly. In particular, MSR with N = 4 and beam size 800 has better quality than the baseline with any beam size on IWSLT. On WMT, improvements for large beam sizes are more modest, which is expected, as data augmentation has less effect on larger datasets. However, MSR works on par with the length-normalized baseline up to beam size 400. Thirdly, simple resampling on IWSLT works slightly better than the baseline in the unnormalized setting; however, it reduces quality in the normalized case. The benefits of simple resampling are limited because, unlike with MSR, the set of long sentences severely lacks diversity, and the model overfits to it during training.
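The simpler length-proportional resampling baseline can be sketched as follows. The text does not specify how sentence length is measured for this baseline, so this sketch assumes whitespace tokens on the target side; the function name and the toy data are illustrative.

```python
import random

def length_proportional_resampling(dataset, num_examples, seed=0):
    """Sample (source, target) pairs with replacement, with probability
    proportional to the number of whitespace tokens in the target."""
    rng = random.Random(seed)
    weights = [len(tgt.split()) for _, tgt in dataset]
    return rng.choices(dataset, weights=weights, k=num_examples)

# Toy data: one short pair and one pair that is nine times longer.
toy_pairs = [("a", "x"), ("a b c d e f g h i", "x x x x x x x x x")]
sample = length_proportional_resampling(toy_pairs, num_examples=1000)
print(sample.count(toy_pairs[1]) > sample.count(toy_pairs[0]))
```

Unlike Multi-Sentence Resampling, this only repeats the long sentences that already exist, which matches the lack of diversity noted above.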

Quality with growing beam size
We analyze how the value of the hyperparameter N in MSR affects beam-search quality in Figure 8, which shows the quality of trained models for different values of N across a range of beam sizes. On small beams, all values of N perform without significant difference. On large beams, quality grows with N up to 4 and decreases for larger N.
An additional analysis of how the effects of Multi-Sentence Resampling vary with reference length and the number of concatenated sentences is provided in Figure 7. Baseline unnormalized beam search with beam width 400 performs poorly on long sentences: quality degradation increases starting from length 30. On the other hand, MSR with N = 4 has almost no degradation for long sentences. Additionally, experiments with N = 2 and N = 3 show that quality degradation on long sentences decreases as we increase N, likely because we start fitting the model to far longer sequence lengths than encountered in the test set.
Table 5 examines the effects of MSR on a range of translation tasks. All scores are computed with Sacrebleu and default beam search from fairseq (Ott et al., 2019), except "WMT14, eval as in (Vaswani et al., 2017)". In this table, all MSR experiments are conducted with N = 4, a simple default choice of N. We can make the following observations. Firstly, on all datasets and translation directions, models trained with Multi-Sentence Resampling statistically significantly outperform baselines: by 0.42 to 0.76 BLEU on Fr-En, En-Fr, Vi-En and En-Vi, and by nearly 0.3 BLEU for En-De and De-En. Secondly, length-normalized models work significantly better than models without length normalization in only 2 directions out of 6. We did not tune length-normalization hyperparameters in our experiments; however, our results suggest that length normalization may be unnecessary in some cases.
Table 5: BLEU scores. Bold indicates the best score and all scores whose difference from the best is not statistically significant (with p-value of 0.05). Statistical significance is computed via bootstrap (Koehn, 2004).
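For readers unfamiliar with bootstrap significance testing for translation metrics, the following is a generic sketch of paired bootstrap resampling. It assumes a paired variant of the test and takes an arbitrary corpus-level metric as a callable; the exact procedure and tooling used for Table 5 are not specified beyond the citation, and all names here are hypothetical.

```python
import random

def paired_bootstrap(hyps_a, hyps_b, refs, metric, num_samples=1000, seed=0):
    """Return the fraction of bootstrap resamples of the test set on which
    system A scores strictly higher than system B under `metric`."""
    if not (len(hyps_a) == len(hyps_b) == len(refs)):
        raise ValueError("all inputs must have the same length")
    rng = random.Random(seed)
    n = len(refs)
    wins = 0
    for _ in range(num_samples):
        # Resample sentence indices with replacement, same indices for both systems.
        idx = [rng.randrange(n) for _ in range(n)]
        sampled_refs = [refs[i] for i in idx]
        score_a = metric([hyps_a[i] for i in idx], sampled_refs)
        score_b = metric([hyps_b[i] for i in idx], sampled_refs)
        wins += score_a > score_b
    return wins / num_samples

def exact_match(hyps, refs):
    """Toy corpus-level metric standing in for BLEU in this sketch."""
    return sum(h == r for h, r in zip(hyps, refs)) / len(refs)

toy_refs = ["a", "b", "c", "d"]
print(paired_bootstrap(toy_refs, ["x"] * 4, toy_refs, exact_match, num_samples=50))  # 1.0
```

Under this scheme, system A is usually called significantly better at the 0.05 level when it wins on at least 95% of the resamples.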

Training time
As with any regularization, Multi-Sentence Resampling increases the training time of models. Although we expanded the dataset by 10x and 5x for IWSLT and WMT, respectively, the training time in both cases did not increase in proportion to the size of the dataset. Table 5 shows that MSR with N = 4 increases training time until convergence by 80% on average across the datasets used, compared to the default training. This suggests that it is possible to make a more memory-efficient MSR implementation as part of the data processing pipeline, which performs MSR on the fly, without the need to pre-process a training dataset that is 5-10x larger than the original. However, we leave this to future work.

Conclusions
In this work, we analyzed errors that cause quality degradation with growing beam size in NMT and ASR. We demonstrated that the major contribution to quality degradation on large beams comes from short translations, which are early terminated prefixes of hypotheses that are long when decoding with small beams. By contrasting NMT with ASR, we showed that the reference length at which beam-search degradation begins to grow is connected with the low number of sentences longer than this length during training. Thus, typical NMT datasets, which are biased towards short sentences, strongly contribute to the degradation with large beams. Based on this finding, we introduced Multi-Sentence Resampling, a simple data augmentation technique. It concatenates several sentences from a dataset, increasing the length of training examples. Models trained with Multi-Sentence Resampling were shown to consistently outperform baseline models on the IWSLT15 En-Vi, IWSLT17 En-Fr, and WMT14 En-De datasets. Thus, we demonstrate that it is possible to mitigate beam-search degradation with data augmentation. Future research directions include adapting Multi-Sentence Resampling to other domains like ASR and studying beam-search problems for document-level Machine Translation, where adjacent sentences are naturally connected.