Sequence Length is a Domain: Length-based Overfitting in Transformer Models

Transformer-based sequence-to-sequence architectures, while achieving state-of-the-art results on a large number of NLP tasks, can still suffer from overfitting during training. In practice, this is usually countered either by applying regularization methods (e.g. dropout, L2 regularization) or by providing huge amounts of training data. Additionally, Transformer and other architectures are known to struggle when generating very long sequences. For example, in machine translation, neural-based systems perform worse on very long sequences than the preceding phrase-based translation approaches (Koehn and Knowles, 2017). We present results which suggest that the issue might also lie in the mismatch between the length distributions of the training and validation data, combined with the aforementioned tendency of neural networks to overfit to the training data. We demonstrate on simple string editing tasks and a machine translation task that the Transformer model performance drops significantly when it faces sequences of length diverging from the length distribution in the training data. Additionally, we show that the observed drop in performance is due to the hypothesis length corresponding to the lengths seen by the model during training rather than the length of the input sequence.


Introduction
Current state-of-the-art Transformer-based sequence generation models, either fine-tuned for chosen downstream tasks (Devlin et al., 2019), or trained from scratch for specific tasks such as machine translation (Vaswani et al., 2017) or speech recognition (Pham et al., 2019), more and more often achieve performance comparable to that of humans (Hassan et al., 2018; Popel et al., 2020; Nguyen et al., 2020). However, such models frequently require billions of trainable parameters together with huge amounts of data (billions of tokens) to reach such performance (Brown et al., 2020).
The good performance on held-out test sets seems to confirm the good generalization power of these models, although the inherent strong biases, sometimes leading to the use of foul and toxic language, preserving stereotypes, etc., are well acknowledged (Gehman et al., 2020). Brown et al. (2020) claim that their Transformer model is also capable of simple arithmetic; however, it remains to be validated whether the model truly learns the arithmetic algorithms or simply encodes a lookup table for a subset of specific examples.
In this paper, we argue that the assumed generalization power of the current state-of-the-art Transformer-based language generators does not come from the architecture itself but rather from the sheer volume of training data and the model's ability to exploit the similarities between the training and validation data. We demonstrate how the Transformer-based sequence-to-sequence models fail when the target sequence lengths of the training and validation data do not match. We show that this holds not only for very long test sequences but can be observed even with short sequences if they are omitted from the training data. Furthermore, we show that we can artificially improve the test performance on longer sequences by only using shorter training sequences and concatenating them into longer training examples.
We do not make claims about the Transformer's (in)ability to handle long-distance dependencies, but our results suggest that a considerably simpler cause, a mismatch in sequence length, can also contribute to the performance drop. We think that our findings can lead to a better understanding of the Transformer architecture and help to design better training schemes (e.g. curriculum learning).

Related Work
The problem of modeling very long sequences has been studied mainly in the context of recurrent neural networks (RNNs). Early studies showed that using LSTMs (Sutskever et al., 2014) and introducing attention (Bahdanau et al., 2014; Luong et al., 2015) can improve the model performance on long sequences. However, these models were still outperformed on long sequences by phrase-based models (Koehn and Knowles, 2017). This problem was not resolved with the introduction of Transformers (Vaswani et al., 2017). Surprisingly, even though there were previous studies explaining the weaknesses of RNNs with respect to long sequence modeling (Hochreiter and Schmidhuber, 1997; Hochreiter, 1998), similar analyses are yet to be done for Transformers, which are fundamentally different from RNNs.
There is an ongoing debate about the proper way of splitting the available data into training and evaluation subsets. Gorman and Bedrick (2019) show that using only standard dataset splits can lead to a biased evaluation resulting in overestimating the generalization ability of the model. Furthermore, Søgaard et al. (2020) argue that even using randomly sampled dataset splits does not solve the overestimation problem. They instead suggest using multiple test sets, possibly of an adversarial nature, to properly evaluate the generalization ability of the model.
In the following experiments, we evaluate the vanilla Transformer on such adversarial splits created with respect to the lengths of the modeled sequences. Although similar analyses were performed in the past (Neishi and Yoshinaga, 2019; Kondo et al., 2021), they were done at a smaller scale and mainly in the context of source-side length bucketing.

Experiments
We demonstrate the lack of ability to generalize to sequences of lengths not seen during training on two separate tasks: string editing and machine translation (MT).
We use the Fairseq framework for sequence-to-sequence learning (Ott et al., 2019) in our experiments. Details about the model parameters and training are available in Appendix A.

String Editing Operations
In the first set of experiments, we focus on learning simple string editing algorithms. We chose this task because we think it is an interesting alternative to standard NLP tasks, which often struggle with evaluation ambiguity (multiple possible outputs in MT or text generation with a nuanced degree of quality) and proper training/validation separation (partial overlap between train and test sentences leading to a lack of clarity about how much the model actually generalizes to new inputs). We chose to study the following tasks:
• copy: copy the input sequence to the output,
• unshift X, push X: add a single character (X) to the beginning or the end of the sequence, respectively,
• shift, pop: remove a single character from the beginning or the end, respectively,
• reverse: reverse the character order in the input sequence.
As for the experiment setup, we generate a dataset of sequences consisting of two characters (e.g. 0 and 1), separated by whitespace, with no duplicate sequences. Then, we split the dataset into three separate buckets according to sequence length. Given these data splits, we create datasets for each task by adding the task label, a character argument (0 or 1 for unshift and push, sampled from a Bernoulli distribution with p = 0.5 for the character 0; − for the other tasks) and a separator (|) to the beginning of each sequence. We create target sequences for each task according to the respective task definition. Table 1 shows examples of the network's inputs.
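A minimal data-generation sketch following the description above is shown below; the sampling scheme, file-less in-memory setup and exact label spelling are illustrative assumptions rather than the exact scripts used in the paper.

```python
import random

SEPARATOR = "|"

def make_example(task, seq, arg="-"):
    """Build one (source, target) pair, e.g. ("push 1 | 0 1 0", "0 1 0 1")."""
    src = f"{task} {arg} {SEPARATOR} {' '.join(seq)}"
    if task == "copy":
        tgt = seq
    elif task == "unshift":
        tgt = [arg] + seq
    elif task == "push":
        tgt = seq + [arg]
    elif task == "shift":
        tgt = seq[1:]
    elif task == "pop":
        tgt = seq[:-1]
    elif task == "reverse":
        tgt = seq[::-1]
    else:
        raise ValueError(f"unknown task: {task}")
    return src, " ".join(tgt)

def generate_bucket(num_examples, min_len, max_len, seed=0):
    """Generate unique binary sequences with lengths in [min_len, max_len]."""
    rng = random.Random(seed)
    seen = set()
    while len(seen) < num_examples:
        length = rng.randint(min_len, max_len)
        seen.add(tuple(rng.choice("01") for _ in range(length)))
    return [list(s) for s in seen]

# Example: a few training examples from the 11-15 bucket for the push task,
# with the character argument sampled from a Bernoulli(0.5) distribution.
rng = random.Random(1)
for seq in generate_bucket(5, 11, 15):
    print(make_example("push", seq, arg=rng.choice("01")))
```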
For each task, we train a separate network on the 11-15 training data. Model details are available in Appendix A.1. We evaluate the models by measuring accuracy ACC = num_correct/num_examples, where num_correct is the number of exact matches between the hypothesis and reference strings. Table 2 shows the accuracy of the models trained on each task and evaluated on the varying test set buckets. We can see that the models generalize very well on unseen sequences with length similar to the training sequences, all reaching perfect accuracy except for the reverse task. On the other hand, when facing shorter or longer sequences, the performance drops significantly. Table 2 also shows the results of training a network on all tasks simultaneously (all; obtained by concatenating and shuffling the respective training data and evaluating on the concatenation of the respective test sets). The resulting performance is similar to that of a single-task model.
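The exact-match accuracy defined above can be computed in a few lines; a minimal sketch, assuming plain-text hypothesis and reference files with one sequence per line:

```python
def exact_match_accuracy(hyp_path, ref_path):
    """ACC = num_correct / num_examples over exact string matches."""
    with open(hyp_path, encoding="utf-8") as h, open(ref_path, encoding="utf-8") as r:
        pairs = list(zip(h, r))
    num_correct = sum(hyp.strip() == ref.strip() for hyp, ref in pairs)
    return num_correct / len(pairs)
```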
These results suggest that the length distribution similarity between the training and validation data is important and that the vanilla Transformer decoder is prone to overfitting to the sequence lengths seen during training.

Machine Translation
To see whether our findings within the string editing tasks also hold for natural language which has more complex structure, we perform similar experiments on English-Czech translation.
We use CzEng 2.0 (Kocmi et al., 2020) as our training corpus, a concatenation of the WMT2020 (Barrault et al., 2020) newstest13-16 test sets as our held-out test set, and a concatenation of newstest17-20 for the final evaluation. We tokenize our data using the Moses tokenizer. We apply byte-pair encoding (Sennrich et al., 2016) on our training data to create a subword segmentation of size 30k. We split all tokenized and BPE-segmented datasets into buckets of sizes 1-10, 11-20, ..., 91-100 (labeled as 10, 20, ..., 100, respectively) based on the number of tokens on the target side. Table 3 shows the sizes of the respective training corpora. We train a separate model for each training bucket. Details on the model hyper-parameters are available in Appendix A.2.
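The bucketing step can be sketched as follows; the file names are placeholders and the token counts are taken over the BPE-segmented target side, in line with the bucket boundaries above:

```python
from collections import defaultdict

def bucket_id(num_tokens, width=10, max_bucket=100):
    """Map a target-side token count to its bucket label: 1-10 -> 10, 11-20 -> 20, ..."""
    return min(((num_tokens - 1) // width + 1) * width, max_bucket)

def split_into_buckets(src_path, tgt_path):
    """Group parallel sentence pairs by the target-side (BPE token) length."""
    buckets = defaultdict(list)
    with open(src_path, encoding="utf-8") as fs, open(tgt_path, encoding="utf-8") as ft:
        for src, tgt in zip(fs, ft):
            n = len(tgt.split())
            if n <= 100:  # pairs longer than the last bucket are left out here
                buckets[bucket_id(n)].append((src.rstrip("\n"), tgt.rstrip("\n")))
    return buckets
```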
We evaluate how the length of the training data affects the performance with respect to the length of the test data using BLEU (Papineni et al., 2002), namely the SacreBLEU implementation (Post, 2018). Figure 1 (Top) shows that regardless of the training bucket, the model performs best when presented with data of target-side length similar to the length of the training data. This confirms that the model overfits to the length of the training data, affecting its performance even on shorter sentences. The performance further decreases with increasing train-test length difference, although it needs to be noted that the BLEU scores between different test set buckets are not directly comparable due to the nature of the scoring metric and the fact that each test set bucket contains different test examples. Figure 1 (Bottom) explains the main reason behind the BLEU decrease: the increased hypothesis/reference length ratio, further supporting the length overfitting argument. Note that the lower performance of the models trained on the 70 and 80 buckets might be due to the significantly smaller size of their training data (< 1M sentence pairs).
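A minimal sketch of the evaluation behind Figure 1, based on the sacrebleu Python API; whether the length ratio is averaged per sentence or computed at corpus level is our assumption here, and the file paths are placeholders:

```python
import sacrebleu

def evaluate(hyp_path, ref_path):
    """Return (corpus BLEU, average hypothesis/reference length ratio)."""
    with open(hyp_path, encoding="utf-8") as h, open(ref_path, encoding="utf-8") as r:
        hyps = [line.strip() for line in h]
        refs = [line.strip() for line in r]
    bleu = sacrebleu.corpus_bleu(hyps, [refs])
    # Per-sentence hypothesis/reference length ratio (whitespace tokens),
    # the quantity plotted in the bottom panel of Figure 1.
    ratio = sum(len(x.split()) / max(len(y.split()), 1)
                for x, y in zip(hyps, refs)) / len(hyps)
    return bleu.score, ratio
```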
In Appendix B, we also provide a case study of the models trained on various length buckets.
The length-controlled experiment results presented by Neishi and Yoshinaga (2019), while not directly focused on exploring the target-side length overfitting phenomenon, point to a similar behavior of vanilla Transformers with regard to both longer and shorter test sentences. Based on their results, the replacement of the absolute positional embeddings with a variation of relative-position embeddings (Shaw et al., 2018; Neishi and Yoshinaga, 2019) seems like a promising approach towards alleviating the length overfitting problem.
To see whether we can exploit the target-side length overfitting behaviour, we also set up a separate experiment, similar to Kondo et al. (2021). We take the 10, 20 and 30 training buckets and concatenate the sentences in each of them to create synthetic datasets with target-side lengths 51-60 (containing on average 6, 3 and 2 sentences per training example, respectively). We apply the same training strategy using the synthetic data to see how strongly the length of the training examples (although artificial) can affect the model performance on test examples of similar length. Figure 2 shows that the simple concatenation of shorter training sentence pairs can lead to a performance similar to that of the model trained on genuinely longer sentences. Only the performance of the model trained on the concatenation of very short sentences (the line "TrainBucket.Concat=10" in Figure 2) drops significantly, possibly because the model does not learn to handle any dependencies beyond the length of 10, and such dependencies seem to emerge in test sentences with length over 40, where the model starts to underperform. Kondo et al. (2021) show that augmenting the existing training data with additional training examples created by concatenation of shorter sentences can help to improve model performance on very long sentences. Our results show that the synthetic concatenated data on their own can be sufficient to train a model that is competitive when applied to sentences from a similar target-length domain as the training examples. We also argue that due to a different bucket preparation strategy (based on the source-side length in the previous work), the target-side length overfitting phenomenon is not as clear in Kondo et al. (2021) as in our work. In Appendix C, we provide additional results from the experiments where the dataset bucketing is based on the source-side length instead of the target-side length for comparison.
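A minimal sketch of such concatenation-based augmentation is given below; the paper does not specify the exact merging strategy beyond the average number of sentences per synthetic example, so the greedy scheme here is only an illustration:

```python
def concatenate_bucket(pairs, tgt_min=51, tgt_max=60):
    """Greedily merge consecutive (src, tgt) pairs so that the merged target
    length (in tokens) lands in [tgt_min, tgt_max]; merges that overshoot
    tgt_max are discarded and a new synthetic example is started."""
    merged, cur_src, cur_tgt = [], [], []
    for src, tgt in pairs:
        cur_src.append(src)
        cur_tgt.append(tgt)
        tgt_len = sum(len(t.split()) for t in cur_tgt)
        if tgt_len >= tgt_min:
            if tgt_len <= tgt_max:
                merged.append((" ".join(cur_src), " ".join(cur_tgt)))
            cur_src, cur_tgt = [], []  # start a new synthetic example either way
    return merged
```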

Conclusion
We showed in our targeted experiments that vanilla Transformer sequence-to-sequence models have a strong tendency to overfit with regard to the target-side length of the training sequences. On a simple algorithmic task, we documented that the Transformer can generalize very well to unseen examples within the same length bucket but falls short if the same task is required for input of a different length, shorter or longer. The algorithm of the task, even if very simple, is not learned. We further confirmed the overfitting problem on the machine translation task. This suggests that long-distance dependencies are not the only reason behind the decreased performance when translating very long sentences. We think that our findings can shed new light on specific areas of deep learning research, namely domain adaptation and curriculum learning.
We also showed that data augmentation can tackle the data sparsity in the domain of very long sentences.

A Model Details
In the following section, we provide the details of the used models and their training. All the described models are implemented in Fairseq (Ott et al., 2019). During training, we use word-level cross-entropy loss with teacher forcing (Bahdanau et al., 2014; Vaswani et al., 2017), which is the current, widely used approach to sequence-to-sequence Transformer training. During decoding, we use beam search with beam size 4 and length penalty 0.6.
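For reference, the length penalty rescales each completed hypothesis score by its length before the final ranking; a small sketch of this normalization (following the common score / length^lenpen scheme, which to our knowledge is what Fairseq's generator applies with --lenpen):

```python
def length_normalized_score(sum_log_prob, hyp_length, lenpen=0.6):
    """Length-normalized beam score: a higher lenpen favors longer hypotheses."""
    return sum_log_prob / (hyp_length ** lenpen)

def rerank(hypotheses, lenpen=0.6):
    """hypotheses: list of (token_list, sum_log_prob); returns best-first order."""
    return sorted(hypotheses,
                  key=lambda h: length_normalized_score(h[1], len(h[0]), lenpen),
                  reverse=True)
```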

A.2 Machine Translation
In the machine translation experiments, we use the transformer parameter setting with the following modifications:

B Case Study

Figure 3 shows example outputs from models trained on various target-length training buckets (the 10-, 30- and 60-bucket) produced by translating chosen 30-bucket test set inputs. The examples demonstrate that the models have a tendency to produce outputs with length similar to the training data while trying to satisfy the translation of the source sentence. This results in the longer, 60-bucket model repeating certain phrases or sentences while introducing grammatical errors (e.g. wrong agreement, preposition choice) or mistranslations. On the other hand, the shorter, 10-bucket model manages to drop parts of the input sentence while maintaining a reasonable fluency and grammatical correctness of the output. Figure 4 shows examples of outputs from models trained on the synthetic 60-bucket data created by concatenation of the shorter training buckets. At first glance, all three hypotheses are very similar and are reasonably good translations of the source sentence; however, all systems made a wrong surface form and preposition choice for "na Vinohradech" (the same grammatical mistake as with "na Žižkově" in Figure 3), producing an incorrect but literal translation of the English "in Vinohrady". Additionally, all three systems chose a literal translation of the word "approach", which is incorrect in the given context. The incorrect surface form of the translation "založeno na doporučení" in Hyp1 suggests that training a model on a concatenation of very short sentences may lead to incorrect modeling of long-range dependencies. Surprisingly, the Hyp3 system mistranslated the phrase "work on the reconstruction" ("k rekonstrukci" in the output) while the Hyp2 system produced a correct translation, though this error is most likely a result of a different set of training sentences in the Hyp2 and Hyp3 training data rather than the length of the training sentences (before concatenation).

C Source-Side Bucketing Experiments
For comparison, we repeated the translation experiments using source-side length-based bucketing of the training and validation data. Figure 6 shows the performance of the bucketed models with respect to test sets of various bucket sizes. While the results are similar to the target-side bucketing experiments, the overfitting phenomenon is less clear in several cases (e.g. the 20-bucket system reaching a higher BLEU than the 10-bucket system on the 10-bucket test set, or the relative system ranking on the 60-bucket test set).
We think that the possible reason is the difference between the source-side length and the length of the training/validation reference, leading to a possible overlap of target-side lengths between the different train/validation buckets. Figure 5 shows the distributions of target-side lengths within each training and validation bucket. Although the length-wise overlap between the target sides of the training/validation examples is manifested mostly in the 1st and 4th quartiles, we think that it helps to support the argument that length-based overfitting should be studied with respect to the target-side length instead of the source-side length. Furthermore, the length of the target side (Czech) in the test dataset is generally smaller than the source side (English), resulting in an additional domain mismatch between the training and test buckets. Note that very long target-side outliers in the training data are most likely a result of imperfect sentence-pair filtering after the inclusion of additional synthetic parallel data (forward and backward translation) in the CzEng 2.0 corpus.
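The per-bucket target-length statistics behind Figure 5 can be reproduced with a short script; a minimal sketch, where the file naming and the whitespace token counting over the BPE-segmented target side are illustrative assumptions:

```python
import numpy as np

# Hypothetical file layout: one target-side (Czech) sentence per line
# for each source-side length bucket, e.g. "train.bucket20.cs".
BUCKETS = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

for bucket in BUCKETS:
    lengths = []
    with open(f"train.bucket{bucket}.cs", encoding="utf-8") as f:
        for line in f:
            lengths.append(len(line.split()))
    lengths = np.array(lengths)
    q1, median, q3 = np.percentile(lengths, [25, 50, 75])
    print(f"source bucket {bucket:>3}: "
          f"target length Q1={q1:.1f} median={median:.1f} Q3={q3:.1f} "
          f"min={lengths.min()} max={lengths.max()}")
```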
Based on the reviewer's suggestion, we also measured the effect of fine-tuning a system trained on the whole training dataset using the source-side bucketed training data. Each system was fine-tuned for 30 epochs, although it is important to note that the validation BLEU of each fine-tuned system was dropping during training (compared to the BLEU of the initial model) when evaluated against the whole non-bucketed validation dataset. In Figure 7, we can see a growing effect of catastrophic forgetting (Kirkpatrick et al., 2017): all models initially saw all lengths during pretraining but specialized for a specific length bucket during fine-tuning. Interestingly, the forgetting effect is stronger for test buckets that are longer than the fine-tuning lengths, while the models show much better retention of the ability to model shorter sentences.

Lastly, we also performed a comparison between the baseline MT system and a combination of systems trained on the individual source-side length-bucket training datasets. We extracted sentences from our test dataset with source-side length 0-80, translated them with the respective systems and computed the BLEU scores using MultEval (Clark et al., 2011; https://github.com/jhclark/multeval). We compared a system combination trained using only a specific length-bucket dataset (bucketed), applied on the respective "in-domain" parts of the test dataset. We also provide a comparison with a system combination initialized by the baseline model and then fine-tuned on the respective length-bucket datasets (bucketed.tuning). Additionally, we trained a system on the whole CzEng 2.0 with additional source-side labels indicating the length bucket in which a given training example ended up after the source-side length-based dataset splitting (bucket.labels), e.g. "<20> Example sentence..." for a sentence from the bucket 11-20. This model was evaluated on the same test dataset with inclusion of these source-side length-bucket labels.

System            BLEU
baseline          19.1 ± 0.2
bucketed          18.9 ± 0.2
bucketed.tuning   17.1 ± 0.2
bucket.labels     16.7 ± 0.2

Table 4: Comparison of the translation performance of the baseline model trained on the whole CzEng 2.0 and the source-length specialized models. bucketed is a combination of systems trained on the source-length bucketed training data; bucketed.tuning is a similar combination where the systems were first initialized by the baseline model and then fine-tuned for 30 epochs on their respective buckets; bucket.labels is a system trained on the whole CzEng 2.0 with the inclusion of the source-side bucket length labels on the input. The systems were evaluated using MultEval (Clark et al., 2011) with bootstrapping over a test dataset containing sentences of source-side lengths 0-80. Only a single optimizer run was performed for each evaluated system.
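The bucket.labels input preparation can be illustrated with a short sketch; the "<20>"-style label format follows the example above, while the whitespace tokenization and helper names are our own simplifications:

```python
def bucket_label(src_tokens, bucket_size=10, max_bucket=100):
    """Return the source-length bucket label, e.g. "<20>" for lengths 11-20."""
    length = len(src_tokens)
    # Round the length up to the nearest bucket boundary (1-10 -> <10>, 11-20 -> <20>, ...).
    bucket = min(((length - 1) // bucket_size + 1) * bucket_size, max_bucket)
    return f"<{bucket}>"

def add_bucket_labels(src_lines):
    """Prepend the source-length bucket label to every source line."""
    for line in src_lines:
        tokens = line.split()
        yield f"{bucket_label(tokens)} {line.strip()}"

# Example usage:
if __name__ == "__main__":
    for labeled in add_bucket_labels(["Example sentence with exactly twelve tokens here , really , yes ."]):
        print(labeled)  # -> "<20> Example sentence with ..."
```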
The results in Table 4 suggest that the length-based specialization of the models does not outperform the baseline. One of the possible explanations is the fact that the baseline system was trained on the whole CzEng 2.0, which contains even sentences longer than 80. Although bucket.labels was also trained on the whole CzEng 2.0, the results suggest that a simple inclusion of the source-length bucket information does not contribute towards a better translation performance.

Figure 7: Top: Varying performance on test data of Transformers trained on all of CzEng 2.0 and fine-tuned only on the data from a specific source-side length bucket (various lines) when evaluated on a specific test bucket (x-axis). BLEU scores are not directly comparable across different test sets (i.e. horizontally). Bottom: Average ratio between a hypothesis and reference. The dashed line indicates a ratio of 1.0. We preserve the scaling of all the plots for better comparability across the figures.