A Statistical Extension of Byte-Pair Encoding

David Vilar; Marcello Federico

doi:10.18653/v1/2021.iwslt-1.31

Abstract

Sub-word segmentation is currently a standard tool for training neural machine translation (MT) systems and for other NLP tasks. The goal is to split words (both in the source and target languages) into smaller units, which then constitute the input and output vocabularies of the MT system. Reducing the size of the input and output vocabularies is meant to increase the generalization capabilities of the translation model, enabling the system to translate and generate infrequent and new (unseen) words at inference time by combining previously seen sub-word units. Ideally, we would expect the created units to have some linguistic meaning, so that words are created in a compositional way. However, the most popular word-splitting method, Byte-Pair Encoding (BPE), which originates from the data compression literature, includes no explicit criteria either to favor linguistic splittings or to find the optimal sub-word granularity for the given training data. In this paper, we propose a statistically motivated extension of the BPE algorithm and an effective convergence criterion that avoids the costly experimentation cycle needed to select the best sub-word vocabulary size. Experimental results with morphologically rich languages show that our model achieves nearly optimal BLEU scores and produces morphologically better word segmentations, which allows it to generalize better than BPE when translating sentences containing new words, as shown via human evaluation.

Anthology ID: 2021.iwslt-1.31
Volume: Proceedings of the 18th International Conference on Spoken Language Translation (IWSLT 2021)
Month: August
Year: 2021
Address: Bangkok, Thailand (online)
Editors: Marcello Federico, Alex Waibel, Marta R. Costa-jussà, Jan Niehues, Sebastian Stuker, Elizabeth Salesky
Venue: IWSLT
SIG: SIGSLT
Publisher: Association for Computational Linguistics
Pages: 263–275
URL: https://aclanthology.org/2021.iwslt-1.31/
DOI: 10.18653/v1/2021.iwslt-1.31
Cite (ACL): David Vilar and Marcello Federico. 2021. A Statistical Extension of Byte-Pair Encoding. In Proceedings of the 18th International Conference on Spoken Language Translation (IWSLT 2021), pages 263–275, Bangkok, Thailand (online). Association for Computational Linguistics.
Cite (Informal): A Statistical Extension of Byte-Pair Encoding (Vilar & Federico, IWSLT 2021)
PDF: https://aclanthology.org/2021.iwslt-1.31.pdf
