HFT: High Frequency Tokens for Low-Resource NMT

Edoardo Signoroni; Pavel Rychlý

Abstract

Tokenization has been shown to impact the quality of downstream tasks such as Neural Machine Translation (NMT), which is susceptible to out-of-vocabulary words and low-frequency training data. Current state-of-the-art algorithms address out-of-vocabulary words, large vocabulary sizes, and token frequency through subword segmentation. We argue, however, that there is still room for improvement, in particular regarding low-frequency tokens in the training data. In this paper, we present “High Frequency Tokenizer”, or HFT, a new language-independent subword segmentation algorithm that addresses this issue. We also propose a new metric to measure the frequency coverage of a tokenizer’s vocabulary, based on a frequency-rank-weighted average of the frequency values of its items. We experiment with a diverse set of language corpora, vocabulary sizes, and writing systems, and report improvements in both frequency statistics and the average length of the output. We also observe a positive impact on downstream NMT.
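
The abstract describes the proposed metric only as a frequency-rank-weighted average of token frequencies and does not give its exact form. The sketch below shows one possible reading, offered as an assumption and not as the authors' definition: tokens are ranked from most to least frequent, and each frequency is weighted by its rank, so that the low-frequency tail counts more. The function name `rank_weighted_frequency` and the toy segmentations are hypothetical.

```python
from collections import Counter


def rank_weighted_frequency(tokens):
    """Rank-weighted average of token frequencies.

    Assumption for illustration only: weight = rank, where rank 1 is the
    most frequent token. The paper's actual metric may be defined differently.
    """
    counts = Counter(tokens)
    if not counts:
        return 0.0
    # Frequencies sorted from most to least frequent.
    freqs = sorted(counts.values(), reverse=True)
    weighted_sum = sum(rank * freq for rank, freq in enumerate(freqs, start=1))
    total_weight = sum(range(1, len(freqs) + 1))
    return weighted_sum / total_weight


# Two hypothetical segmentations of the same three words.
seg_a = ["low", "er", "low", "est", "new", "er"]  # subword units, some repeated
seg_b = ["lower", "lowest", "newer"]              # whole words, each seen once

print(rank_weighted_frequency(seg_a))
print(rank_weighted_frequency(seg_b))
```

Under this reading, a segmentation whose vocabulary items recur more often in the corpus scores higher, which matches the abstract's stated concern with low-frequency tokens.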

Anthology ID: 2022.loresmt-1.8
Volume: Proceedings of the Fifth Workshop on Technologies for Machine Translation of Low-Resource Languages (LoResMT 2022)
Month: October
Year: 2022
Address: Gyeongju, Republic of Korea
Editors: Atul Kr. Ojha, Chao-Hong Liu, Ekaterina Vylomova, Jade Abbott, Jonathan Washington, Nathaniel Oco, Tommi A Pirinen, Valentin Malykh, Varvara Logacheva, Xiaobing Zhao
Venue: LoResMT
Publisher: Association for Computational Linguistics
Pages: 56–63
URL: https://aclanthology.org/2022.loresmt-1.8
Cite (ACL): Edoardo Signoroni and Pavel Rychlý. 2022. HFT: High Frequency Tokens for Low-Resource NMT. In Proceedings of the Fifth Workshop on Technologies for Machine Translation of Low-Resource Languages (LoResMT 2022), pages 56–63, Gyeongju, Republic of Korea. Association for Computational Linguistics.
Cite (Informal): HFT: High Frequency Tokens for Low-Resource NMT (Signoroni & Rychlý, LoResMT 2022)
PDF: https://aclanthology.org/2022.loresmt-1.8.pdf
Code: edoardosignoroni/hftoks-eval
