Sequence Length is a Domain: Length-based Overfitting in Transformer Models

Dusan Varis, Ondřej Bojar


Abstract
Transformer-based sequence-to-sequence architectures, while achieving state-of-the-art results on a large number of NLP tasks, can still suffer from overfitting during training. In practice, this is usually countered either by applying regularization methods (e.g. dropout, L2-regularization) or by providing huge amounts of training data. Additionally, Transformer and other architectures are known to struggle when generating very long sequences. For example, in machine translation, neural systems perform worse on very long sequences than the preceding phrase-based translation approaches (Koehn and Knowles, 2017). We present results suggesting that the issue may also stem from a mismatch between the length distributions of the training and validation data, combined with the aforementioned tendency of neural networks to overfit to the training data. We demonstrate on simple string-editing tasks and a machine translation task that the Transformer model's performance drops significantly on sequences whose length diverges from the length distribution of the training data. Additionally, we show that the observed drop in performance is tied to the hypothesis length matching the lengths seen by the model during training, rather than to the length of the input sequence.
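As a minimal sketch (not from the paper) of the length-distribution mismatch the abstract describes, the snippet below buckets sentences of a tokenized corpus by length and prints the share of each bucket, so the training and test distributions can be compared side by side. The file names and bucket width are illustrative assumptions, not the authors' setup.

```python
from collections import Counter

def length_histogram(path, bucket_size=10):
    """Count sentences per length bucket (length in whitespace tokens)."""
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            n_tokens = len(line.split())
            counts[(n_tokens // bucket_size) * bucket_size] += 1
    return counts

def report(counts, label, bucket_size=10):
    """Print the fraction of sentences falling into each length bucket."""
    total = sum(counts.values())
    print(f"{label} ({total} sentences)")
    for start in sorted(counts):
        share = counts[start] / total
        print(f"  {start:>3}-{start + bucket_size - 1:>3} tokens: {share:6.1%}")

if __name__ == "__main__":
    # Hypothetical file names; any one-sentence-per-line tokenized corpora work.
    report(length_histogram("train.src"), "training data")
    report(length_histogram("test.src"), "test data")
```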
Anthology ID:
2021.emnlp-main.650
Volume:
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2021
Address:
Online and Punta Cana, Dominican Republic
Editors:
Marie-Francine Moens, Xuanjing Huang, Lucia Specia, Scott Wen-tau Yih
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
8246–8257
URL:
https://aclanthology.org/2021.emnlp-main.650
DOI:
10.18653/v1/2021.emnlp-main.650
Cite (ACL):
Dusan Varis and Ondřej Bojar. 2021. Sequence Length is a Domain: Length-based Overfitting in Transformer Models. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 8246–8257, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
Cite (Informal):
Sequence Length is a Domain: Length-based Overfitting in Transformer Models (Varis & Bojar, EMNLP 2021)
PDF:
https://aclanthology.org/2021.emnlp-main.650.pdf
Video:
https://aclanthology.org/2021.emnlp-main.650.mp4