Finding the Optimal Byte-Pair Encoding Merge Operations for Neural Machine Translation in a Low-Resource Setting

Kristine Adlaon, Nelson Marcos


Abstract
This paper investigates the impact of different Byte Pair Encoding (BPE) configurations, specifically the number of merge operations, on neural machine translation (NMT) performance for the Filipino-Cebuano language pair across various text domains. Results demonstrate that smaller BPE configurations, notably 2k, 5k, and 8k merge operations, consistently yield higher BLEU scores, indicating improved translation quality through finer tokenization granularity. Conversely, larger BPE configurations and the absence of BPE result in lower BLEU scores, suggesting a decline in translation quality due to coarser tokenization. These findings clarify how vocabulary size and tokenization granularity affect translation quality, knowledge that can inform the development of translation systems for languages with little parallel text available for training.
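The merge-operation counts compared in the paper (e.g. 2k, 5k, 8k) are the stopping parameter of the standard BPE learning procedure (Sennrich et al., 2016): repeatedly find the most frequent adjacent symbol pair in the training corpus and merge it into a new symbol. The sketch below is a minimal, illustrative implementation of that generic procedure, not the authors' actual pipeline; the function names and the tiny example vocabulary are invented for illustration.

```python
import re
from collections import Counter

def get_pair_stats(vocab):
    """Count frequencies of adjacent symbol pairs across the vocabulary."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_pair(pair, vocab):
    """Rewrite the vocabulary, fusing every occurrence of `pair` into one symbol."""
    bigram = re.escape(" ".join(pair))
    # Match the pair only at symbol boundaries (not inside another symbol).
    pattern = re.compile(r"(?<!\S)" + bigram + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

def learn_bpe(word_freqs, num_merges):
    """Learn `num_merges` BPE merge operations from a {word: frequency} dict.

    `num_merges` is the configuration the paper tunes (2k, 5k, 8k, ...).
    """
    # Start from characters, with an end-of-word marker.
    vocab = {" ".join(w) + " </w>": f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_stats(vocab)
        if not pairs:
            break  # corpus fully merged before reaching num_merges
        best = max(pairs, key=pairs.get)
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges

# Toy example: with few merges, frequent subwords like "es" emerge first.
merges = learn_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, 10)
```

A smaller `num_merges` leaves words split into more, shorter subword units (finer granularity); a larger value, or skipping BPE entirely, yields longer, rarer units, which the paper finds hurts BLEU in this low-resource setting.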
Anthology ID:
2024.findings-emnlp.860
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2024
Month:
November
Year:
2024
Address:
Miami, Florida, USA
Editors:
Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
14673–14682
URL:
https://aclanthology.org/2024.findings-emnlp.860
Cite (ACL):
Kristine Adlaon and Nelson Marcos. 2024. Finding the Optimal Byte-Pair Encoding Merge Operations for Neural Machine Translation in a Low-Resource Setting. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 14673–14682, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):
Finding the Optimal Byte-Pair Encoding Merge Operations for Neural Machine Translation in a Low-Resource Setting (Adlaon & Marcos, Findings 2024)
PDF:
https://aclanthology.org/2024.findings-emnlp.860.pdf