Assessing the Importance of Frequency versus Compositionality for Subword-based Tokenization in NMT

Benoist Wolleb; Romain Silvestri; Georgios Vernikos; Ljiljana Dolamic; Andrei Popescu-Belis

Abstract

Subword tokenization is the de facto standard for tokenization in neural language models and machine translation systems. Three advantages are frequently put forward in favor of subwords: shorter encoding of frequent tokens, compositionality of subwords, and the ability to deal with unknown words. As their relative importance is not yet entirely clear, we propose a tokenization approach that enables us to separate frequency (the first advantage) from compositionality, thanks to the use of Huffman coding, which tokenizes words using a fixed number of symbols. Experiments with CS-DE, EN-FR and EN-DE NMT show that frequency alone accounts for approximately 90% of the BLEU scores reached by BPE, hence compositionality has less importance than previously thought.
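To make the idea of frequency-only tokenization concrete, the following is a minimal sketch of textbook binary Huffman coding applied to whole words. It is not the authors' implementation: the function name `huffman_word_codes`, the two-symbol alphabet `"ab"`, and the toy corpus are assumptions made for illustration, and the number of symbols used in the paper is not stated in this abstract. The sketch only shows the property the abstract relies on: frequent words receive shorter codes, while the codes themselves carry no compositional (morphological) information.

```python
import heapq
from collections import Counter
from itertools import count


def huffman_word_codes(corpus_tokens, symbols="ab"):
    """Map each word to a string over a 2-symbol alphabet (hypothetical helper).

    Standard binary Huffman construction: the two least frequent subtrees
    are merged repeatedly, so frequent words end up with shorter codes.
    """
    freqs = Counter(corpus_tokens)
    if not freqs:
        return {}
    if len(freqs) == 1:
        # A single word still needs a non-empty code.
        return {next(iter(freqs)): symbols[0]}

    tie = count()  # tie-breaker so the heap never compares dicts
    heap = [(f, next(tie), {w: ""}) for w, f in freqs.items()]
    heapq.heapify(heap)

    while len(heap) > 1:
        f0, _, left = heapq.heappop(heap)
        f1, _, right = heapq.heappop(heap)
        # Prefix one symbol to every code in each merged subtree.
        merged = {w: symbols[0] + c for w, c in left.items()}
        merged.update({w: symbols[1] + c for w, c in right.items()})
        heapq.heappush(heap, (f0 + f1, next(tie), merged))

    return heap[0][2]


if __name__ == "__main__":
    tokens = "the cat sat on the mat the cat".split()
    codes = huffman_word_codes(tokens)
    for word, code in sorted(codes.items(), key=lambda kv: len(kv[1])):
        print(word, code)
```

In this sketch each symbol of a word's code would play the role of one token fed to the NMT system, so sequence length depends only on word frequency and not on any shared substrings between words.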

Anthology ID: 2023.eamt-1.14
Volume: Proceedings of the 24th Annual Conference of the European Association for Machine Translation
Month: June
Year: 2023
Address: Tampere, Finland
Editors: Mary Nurminen, Judith Brenner, Maarit Koponen, Sirkku Latomaa, Mikhail Mikhailov, Frederike Schierl, Tharindu Ranasinghe, Eva Vanmassenhove, Sergi Alvarez Vidal, Nora Aranberri, Mara Nunziatini, Carla Parra Escartín, Mikel Forcada, Maja Popovic, Carolina Scarton, Helena Moniz
Venue: EAMT
Publisher: European Association for Machine Translation
Pages: 137–146
URL: https://aclanthology.org/2023.eamt-1.14/
Cite (ACL): Benoist Wolleb, Romain Silvestri, Georgios Vernikos, Ljiljana Dolamic, and Andrei Popescu-Belis. 2023. Assessing the Importance of Frequency versus Compositionality for Subword-based Tokenization in NMT. In Proceedings of the 24th Annual Conference of the European Association for Machine Translation, pages 137–146, Tampere, Finland. European Association for Machine Translation.
Cite (Informal): Assessing the Importance of Frequency versus Compositionality for Subword-based Tokenization in NMT (Wolleb et al., EAMT 2023)
PDF: https://aclanthology.org/2023.eamt-1.14.pdf
