Distributional Properties of Subword Regularization

Marco Cognetta, Vilém Zouhar, Naoaki Okazaki


Abstract
Subword regularization, used widely in NLP, improves model performance by reducing the dependency on exact tokenizations, augmenting the training corpus, and exposing the model to more unique contexts during training. BPE and MaxMatch, two popular subword tokenization schemes, have stochastic dropout regularization variants. However, there has not been an analysis of the distributions formed by them. We show that these stochastic variants are heavily biased towards a small set of tokenizations per word. If the benefits of subword regularization are as mentioned, we hypothesize that this bias artificially limits the effectiveness of these schemes. Thus, we propose an algorithm to uniformly sample tokenizations that we use as a drop-in replacement for the stochastic aspects of existing tokenizers, and find that it improves machine translation quality.
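The paper's sampler is not reproduced here, but as a rough illustration of the idea, below is a minimal Python sketch of one standard way to sample a tokenization of a word uniformly at random: a dynamic program counts the tokenizations of each suffix, then tokens are drawn left to right in proportion to those counts. The vocabulary, word, and function names are hypothetical, not taken from the paper.

```python
import random

def count_tokenizations(word: str, vocab: set[str]) -> list[int]:
    """DP table: counts[i] = number of tokenizations of word[i:]."""
    n = len(word)
    counts = [0] * (n + 1)
    counts[n] = 1  # the empty suffix has exactly one tokenization
    for i in range(n - 1, -1, -1):
        for j in range(i + 1, n + 1):
            if word[i:j] in vocab:
                counts[i] += counts[j]
    return counts

def sample_uniform_tokenization(word: str, vocab: set[str]) -> list[str]:
    """Sample one tokenization of `word` uniformly at random.

    Illustrative sketch only; not the paper's implementation.
    """
    counts = count_tokenizations(word, vocab)
    if counts[0] == 0:
        raise ValueError("word cannot be tokenized with this vocabulary")
    tokens, i, n = [], 0, len(word)
    while i < n:
        # Choose the next token proportionally to how many complete
        # tokenizations of the remaining suffix begin with it.
        r = random.randrange(counts[i])
        for j in range(i + 1, n + 1):
            if word[i:j] in vocab:
                if r < counts[j]:
                    tokens.append(word[i:j])
                    i = j
                    break
                r -= counts[j]
    return tokens

# Hypothetical toy vocabulary; single characters guarantee coverage.
vocab = {"u", "n", "b", "e", "l", "i", "v", "a",
         "un", "be", "lie", "able", "believ"}
print(sample_uniform_tokenization("unbelievable", vocab))
```

Each complete tokenization is drawn with probability 1/counts[0], since the per-step probabilities counts[j]/counts[i] telescope along the chosen path; this is what makes the counting-then-sampling scheme exactly uniform rather than merely randomized.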
Anthology ID:
2024.emnlp-main.600
Volume:
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2024
Address:
Miami, Florida, USA
Editors:
Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
10753–10763
URL:
https://aclanthology.org/2024.emnlp-main.600
Cite (ACL):
Marco Cognetta, Vilém Zouhar, and Naoaki Okazaki. 2024. Distributional Properties of Subword Regularization. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 10753–10763, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):
Distributional Properties of Subword Regularization (Cognetta et al., EMNLP 2024)
PDF:
https://aclanthology.org/2024.emnlp-main.600.pdf