@inproceedings{kumar-thawani-2022-bpe,
title = "{BPE} beyond Word Boundary: How {NOT} to use Multi Word Expressions in Neural Machine Translation",
author = "Kumar, Dipesh and
Thawani, Avijit",
editor = "Tafreshi, Shabnam and
Sedoc, Jo{\~a}o and
Rogers, Anna and
Drozd, Aleksandr and
Rumshisky, Anna and
Akula, Arjun",
booktitle = "Proceedings of the Third Workshop on Insights from Negative Results in NLP",
month = may,
year = "2022",
address = "Dublin, Ireland",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2022.insights-1.24",
doi = "10.18653/v1/2022.insights-1.24",
pages = "172--179",
abstract = "BPE tokenization merges characters into longer tokens by finding frequently occurring \textbf{contiguous} patterns \textbf{within} the word boundary. An intuitive relaxation would be to extend a BPE vocabulary with multi-word expressions (MWEs): bigrams ($in\_a$), trigrams ($out\_of\_the$), and skip-grams ($he . his$). In the context of Neural Machine Translation (NMT), we replace the least frequent subword/whole-word tokens with the most frequent MWEs. We find that these modifications to BPE end up hurting the model, resulting in a net drop of BLEU and chrF scores across two language pairs. We observe that naively extending BPE beyond word boundaries results in incoherent tokens which are themselves better represented as individual words. Moreover, we find that Pointwise Mutual Information (PMI) instead of frequency finds better MWEs (e.g., $New\_York$, $Statue\_of\_Liberty$, $neither . nor$) which consistently improves translation performance. We release all code at \url{https://github.com/pegasus-lynx/mwe-bpe}.",
}
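The abstract contrasts ranking candidate MWEs by raw frequency with ranking them by pointwise mutual information (PMI). As a minimal sketch, the hypothetical snippet below scores adjacent bigrams by PMI over a pre-tokenized corpus; the function name, the `min_count` threshold, and the corpus handling are illustrative assumptions rather than the authors' code, which is available at the repository linked in the record above.

```python
import math
from collections import Counter

def pmi_bigrams(sentences, min_count=2):
    """Rank adjacent-word bigrams by pointwise mutual information:

        PMI(x, y) = log( p(x, y) / (p(x) * p(y)) )

    High-PMI pairs (e.g. "New York") co-occur far more often than their
    unigram frequencies predict; raw-frequency ranking instead surfaces
    pairs like "in a", which the paper found to hurt translation quality.
    Illustrative sketch only, not the paper's implementation.
    """
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))

    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    scores = {}
    for (x, y), count in bigrams.items():
        if count < min_count:  # skip rare pairs: their PMI is unreliable
            continue
        p_xy = count / n_bi
        p_x = unigrams[x] / n_uni
        p_y = unigrams[y] / n_uni
        scores[(x, y)] = math.log(p_xy / (p_x * p_y))

    # Highest-PMI bigrams first: these are the MWE candidates.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical usage on a toy corpus:
corpus = [
    "he lives in a flat in New York".split(),
    "she moved to New York in a hurry".split(),
]
for (x, y), score in pmi_bigrams(corpus)[:5]:
    print(f"{x}_{y}\t{score:.2f}")
```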
<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
  <mods ID="kumar-thawani-2022-bpe">
    <titleInfo>
      <title>BPE beyond Word Boundary: How NOT to use Multi Word Expressions in Neural Machine Translation</title>
    </titleInfo>
    <name type="personal">
      <namePart type="given">Dipesh</namePart>
      <namePart type="family">Kumar</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <name type="personal">
      <namePart type="given">Avijit</namePart>
      <namePart type="family">Thawani</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <originInfo>
      <dateIssued>2022-05</dateIssued>
    </originInfo>
    <typeOfResource>text</typeOfResource>
    <relatedItem type="host">
      <titleInfo>
        <title>Proceedings of the Third Workshop on Insights from Negative Results in NLP</title>
      </titleInfo>
      <name type="personal">
        <namePart type="given">Shabnam</namePart>
        <namePart type="family">Tafreshi</namePart>
        <role>
          <roleTerm authority="marcrelator" type="text">editor</roleTerm>
        </role>
      </name>
      <name type="personal">
        <namePart type="given">João</namePart>
        <namePart type="family">Sedoc</namePart>
        <role>
          <roleTerm authority="marcrelator" type="text">editor</roleTerm>
        </role>
      </name>
      <name type="personal">
        <namePart type="given">Anna</namePart>
        <namePart type="family">Rogers</namePart>
        <role>
          <roleTerm authority="marcrelator" type="text">editor</roleTerm>
        </role>
      </name>
      <name type="personal">
        <namePart type="given">Aleksandr</namePart>
        <namePart type="family">Drozd</namePart>
        <role>
          <roleTerm authority="marcrelator" type="text">editor</roleTerm>
        </role>
      </name>
      <name type="personal">
        <namePart type="given">Anna</namePart>
        <namePart type="family">Rumshisky</namePart>
        <role>
          <roleTerm authority="marcrelator" type="text">editor</roleTerm>
        </role>
      </name>
      <name type="personal">
        <namePart type="given">Arjun</namePart>
        <namePart type="family">Akula</namePart>
        <role>
          <roleTerm authority="marcrelator" type="text">editor</roleTerm>
        </role>
      </name>
      <originInfo>
        <publisher>Association for Computational Linguistics</publisher>
        <place>
          <placeTerm type="text">Dublin, Ireland</placeTerm>
        </place>
      </originInfo>
      <genre authority="marcgt">conference publication</genre>
    </relatedItem>
    <abstract>BPE tokenization merges characters into longer tokens by finding frequently occurring contiguous patterns within the word boundary. An intuitive relaxation would be to extend a BPE vocabulary with multi-word expressions (MWEs): bigrams (in_a), trigrams (out_of_the), and skip-grams (he . his). In the context of Neural Machine Translation (NMT), we replace the least frequent subword/whole-word tokens with the most frequent MWEs. We find that these modifications to BPE end up hurting the model, resulting in a net drop of BLEU and chrF scores across two language pairs. We observe that naively extending BPE beyond word boundaries results in incoherent tokens which are themselves better represented as individual words. Moreover, we find that Pointwise Mutual Information (PMI) instead of frequency finds better MWEs (e.g., New_York, Statue_of_Liberty, neither . nor) which consistently improves translation performance. We release all code at https://github.com/pegasus-lynx/mwe-bpe.</abstract>
    <identifier type="citekey">kumar-thawani-2022-bpe</identifier>
    <identifier type="doi">10.18653/v1/2022.insights-1.24</identifier>
    <location>
      <url>https://aclanthology.org/2022.insights-1.24</url>
    </location>
    <part>
      <date>2022-05</date>
      <extent unit="page">
        <start>172</start>
        <end>179</end>
      </extent>
    </part>
  </mods>
</modsCollection>
%0 Conference Proceedings
%T BPE beyond Word Boundary: How NOT to use Multi Word Expressions in Neural Machine Translation
%A Kumar, Dipesh
%A Thawani, Avijit
%Y Tafreshi, Shabnam
%Y Sedoc, João
%Y Rogers, Anna
%Y Drozd, Aleksandr
%Y Rumshisky, Anna
%Y Akula, Arjun
%S Proceedings of the Third Workshop on Insights from Negative Results in NLP
%D 2022
%8 May
%I Association for Computational Linguistics
%C Dublin, Ireland
%F kumar-thawani-2022-bpe
%X BPE tokenization merges characters into longer tokens by finding frequently occurring contiguous patterns within the word boundary. An intuitive relaxation would be to extend a BPE vocabulary with multi-word expressions (MWEs): bigrams (in_a), trigrams (out_of_the), and skip-grams (he . his). In the context of Neural Machine Translation (NMT), we replace the least frequent subword/whole-word tokens with the most frequent MWEs. We find that these modifications to BPE end up hurting the model, resulting in a net drop of BLEU and chrF scores across two language pairs. We observe that naively extending BPE beyond word boundaries results in incoherent tokens which are themselves better represented as individual words. Moreover, we find that Pointwise Mutual Information (PMI) instead of frequency finds better MWEs (e.g., New_York, Statue_of_Liberty, neither . nor) which consistently improves translation performance. We release all code at https://github.com/pegasus-lynx/mwe-bpe.
%R 10.18653/v1/2022.insights-1.24
%U https://aclanthology.org/2022.insights-1.24
%U https://doi.org/10.18653/v1/2022.insights-1.24
%P 172-179
Markdown (Informal)
[BPE beyond Word Boundary: How NOT to use Multi Word Expressions in Neural Machine Translation](https://aclanthology.org/2022.insights-1.24) (Kumar & Thawani, insights 2022)
ACL
Dipesh Kumar and Avijit Thawani. 2022. BPE beyond Word Boundary: How NOT to use Multi Word Expressions in Neural Machine Translation. In Proceedings of the Third Workshop on Insights from Negative Results in NLP, pages 172–179, Dublin, Ireland. Association for Computational Linguistics.