MoCE: Adaptive Mixture of Contextualization Experts for Byte-based Neural Machine Translation

Langlin Huang, Mengyu Bu, Yang Feng


Abstract
Byte-based machine translation systems have shown significant potential in massively multilingual settings. Unicode encoding, which maps each character to a specific byte sequence, eliminates unknown words even in new languages, enabling broad language scalability. However, byte-level tokenization yields sequences that are hard to interpret because each byte carries little semantic information. Local contextualization has proven effective for assigning initial semantics to tokens and improving sentence comprehension. Nevertheless, variations in encoding rules across languages call for an adaptive contextualization approach. To this end, we propose Adaptive MultiScale-Headed Attention (Ada-MSHA), which adaptively selects and mixes attention heads, treated as contextualization experts. This increases the flexibility of contextualization scales and the potential to discover better strategies than previous methods. Experimental results show that our method outperforms existing methods without extensive manual tuning of hyperparameters, and surpasses subword-based models with fewer parameters on the Ted-59 dataset.
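
As a rough illustration of the idea in the abstract, the PyTorch sketch below implements one plausible reading of Ada-MSHA: each attention head attends only over a local window of a different size (its contextualization scale), and a token-wise gating network softly mixes the heads as experts. This is a minimal sketch under stated assumptions, not the paper's implementation; the names (AdaMSHASketch, window_sizes, gate) and the exact windowing and gating details are illustrative choices.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaMSHASketch(nn.Module):
    """Hypothetical sketch: multi-scale local attention heads mixed by a
    per-token gate, treating each head as a contextualization expert."""

    def __init__(self, d_model=512, window_sizes=(1, 2, 4, 8)):
        super().__init__()
        self.n_heads = len(window_sizes)
        self.window_sizes = window_sizes
        self.d_head = d_model // self.n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        # Router producing a mixture weight per token per head ("expert").
        self.gate = nn.Linear(d_model, self.n_heads)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x):  # x: (batch, seq_len, d_model)
        B, T, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        def split(t):  # -> (B, n_heads, T, d_head)
            return t.view(B, T, self.n_heads, self.d_head).transpose(1, 2)

        q, k, v = split(q), split(k), split(v)
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5  # (B, H, T, T)

        # Head h may only attend within +/- window_sizes[h] positions,
        # giving each head a different contextualization scale.
        idx = torch.arange(T)
        dist = (idx[None, :] - idx[:, None]).abs()              # (T, T)
        masks = torch.stack([dist <= w for w in self.window_sizes])  # (H, T, T)
        scores = scores.masked_fill(~masks[None].to(x.device), float('-inf'))
        head_out = F.softmax(scores, dim=-1) @ v                # (B, H, T, d_head)

        # Token-wise soft mixture over heads: the "mixture of experts".
        gates = F.softmax(self.gate(x), dim=-1)                 # (B, T, H)
        mixed = gates.transpose(1, 2).unsqueeze(-1) * head_out  # (B, H, T, d_head)
        mixed = mixed.transpose(1, 2).reshape(B, T, -1)         # (B, T, d_model)
        return self.out(mixed)

For example, AdaMSHASketch(d_model=512)(torch.randn(2, 16, 512)) returns a (2, 16, 512) tensor. In a byte-level encoder, a layer of this kind would sit early in the stack so that individual bytes receive locally contextualized representations before global attention is applied.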
Anthology ID:
2025.naacl-long.47
Volume:
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Month:
April
Year:
2025
Address:
Albuquerque, New Mexico
Editors:
Luis Chiruzzo, Alan Ritter, Lu Wang
Venue:
NAACL
Publisher:
Association for Computational Linguistics
Pages:
1011–1028
URL:
https://aclanthology.org/2025.naacl-long.47/
Cite (ACL):
Langlin Huang, Mengyu Bu, and Yang Feng. 2025. MoCE: Adaptive Mixture of Contextualization Experts for Byte-based Neural Machine Translation. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 1011–1028, Albuquerque, New Mexico. Association for Computational Linguistics.
Cite (Informal):
MoCE: Adaptive Mixture of Contextualization Experts for Byte-based Neural Machine Translation (Huang et al., NAACL 2025)
PDF:
https://aclanthology.org/2025.naacl-long.47.pdf