Integrating Multi-scale Contextualized Information for Byte-based Neural Machine Translation

Langlin Huang, Yang Feng


Abstract
Subword tokenization is a common method for vocabulary building in Neural Machine Translation (NMT) models. However, increasingly complex tasks have revealed its disadvantages. First, a vocabulary cannot be modified once it is learned, making it hard to adapt to new words. Second, in multilingual translation, the imbalance in data volumes across languages spreads to the vocabulary, degrading translation quality for low-resource languages. While byte-based tokenization addresses these issues, byte-based models struggle with the low information density inherent in UTF-8 byte sequences. Previous works enhance token semantics through local contextualization but fail to select an appropriate contextualizing scope based on the input. Consequently, we propose the Multi-Scale Contextualization (MSC) method, which learns contextualized information of varying scales across different hidden state dimensions. It then leverages the attention module to dynamically integrate the multi-scale contextualized information. Experiments show that MSC significantly outperforms subword-based and other byte-based methods in both multilingual and out-of-domain scenarios. Code can be found at https://github.com/ictnlp/Multiscale-Contextualization.
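To make the core idea concrete, here is a minimal NumPy sketch of the multi-scale contextualization step described above: each group of hidden dimensions is contextualized with a different local window size, so that a downstream attention module can mix information from several scales. The function name, the choice of a simple local mean as the contextualizer, and the specific scale values are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def multiscale_contextualize(h, scales=(1, 2, 4, 8)):
    """Toy sketch of multi-scale contextualization (MSC idea).

    h: array of shape (seq_len, d_model) holding byte-level hidden states.
    scales: one local window size per dimension group; d_model must be
    divisible by len(scales). Each group is smoothed over its own window,
    so different dimensions carry context of different scales.
    """
    seq_len, d_model = h.shape
    assert d_model % len(scales) == 0, "d_model must split evenly across scales"
    group = d_model // len(scales)
    out = np.empty_like(h)
    for i, w in enumerate(scales):
        sl = slice(i * group, (i + 1) * group)
        # Pad on the left by repeating the first state, then take a
        # causal local mean over the previous w positions (a stand-in
        # for the paper's learned local contextualization).
        padded = np.pad(h[:, sl], ((w - 1, 0), (0, 0)), mode="edge")
        out[:, sl] = np.stack(
            [padded[t:t + w].mean(axis=0) for t in range(seq_len)]
        )
    return out
```

In the full model, standard self-attention then operates over this mixed representation, which lets each position weight the dimension groups (and hence the scales) that are most informative for the input.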
Anthology ID:
2024.findings-acl.583
Volume:
Findings of the Association for Computational Linguistics: ACL 2024
Month:
August
Year:
2024
Address:
Bangkok, Thailand and virtual meeting
Editors:
Lun-Wei Ku, Andre Martins, Vivek Srikumar
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
9794–9801
URL:
https://aclanthology.org/2024.findings-acl.583
Cite (ACL):
Langlin Huang and Yang Feng. 2024. Integrating Multi-scale Contextualized Information for Byte-based Neural Machine Translation. In Findings of the Association for Computational Linguistics: ACL 2024, pages 9794–9801, Bangkok, Thailand and virtual meeting. Association for Computational Linguistics.
Cite (Informal):
Integrating Multi-scale Contextualized Information for Byte-based Neural Machine Translation (Huang & Feng, Findings 2024)
PDF:
https://aclanthology.org/2024.findings-acl.583.pdf