Text Filter Based on Automatically Acquired Vocabularies for Multilingual Machine Translation

Kenji Imamura, Masao Utiyama


Abstract
In this paper, we propose a text filter designed to support multiple languages. The method simply aggregates vocabulary from a monolingual corpus and compares it against the input. Despite its simplicity, the approach proves highly effective in removing code-mixed text. When combined with existing language identification techniques, our method can enhance the purity of the corpus in the target language. Consequently, applying it to parallel corpora for machine translation has the potential to improve translation quality. Additionally, the proposed method supports the incremental addition of new languages without the need to retrain those already learned. This feature makes it easy to apply our method to low-resource languages.
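The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the whitespace tokenization, the coverage ratio, the `threshold` value, and the function names `build_vocabulary`, `coverage`, and `keep` are all assumptions made here for illustration. The sketch builds one vocabulary per language from monolingual text and keeps an input only when enough of its tokens are in that vocabulary.

```python
from collections import Counter


def build_vocabulary(corpus_lines, min_count=1):
    """Collect tokens seen at least min_count times in a monolingual corpus.

    Assumption: lowercased whitespace tokenization stands in for whatever
    segmentation the paper actually uses.
    """
    counts = Counter(tok for line in corpus_lines for tok in line.lower().split())
    return {tok for tok, n in counts.items() if n >= min_count}


def coverage(text, vocabulary):
    """Fraction of tokens in text found in vocabulary (0.0 for empty input)."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return sum(tok in vocabulary for tok in tokens) / len(tokens)


def keep(text, vocabulary, threshold=0.8):
    """Keep text only if its vocabulary coverage reaches the threshold.

    Assumption: 0.8 is an arbitrary illustrative cutoff, not a value from the paper.
    """
    return coverage(text, vocabulary) >= threshold


# One vocabulary per language, stored independently of the others.
vocabularies = {
    "en": build_vocabulary(["the cat sat on the mat", "a dog ran in the park"]),
}

# Adding a language only adds a new entry; existing vocabularies are untouched.
vocabularies["fr"] = build_vocabulary(["le chat dort sur le tapis"])

monolingual_ok = keep("the dog sat on the mat", vocabularies["en"])
code_mixed_ok = keep("the chat dort sur le mat", vocabularies["en"])
```

In this toy setup the purely English sentence passes the English filter, while the English-French mixed sentence falls below the threshold and is rejected. Because each vocabulary is a separate set, supporting another language amounts to building one more set, which is consistent with the incremental property the abstract describes.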
Anthology ID:
2026.loresmt-1.3
Volume:
Proceedings for the Ninth Workshop on Technologies for Machine Translation of Low Resource Languages (LoResMT 2026)
Month:
March
Year:
2026
Address:
Rabat, Morocco
Editors:
Atul Kr. Ojha, Chao-hong Liu, Ekaterina Vylomova, Flammie Pirinen, Jonathan Washington, Nathaniel Oco, Xiaobing Zhao
Venues:
LoResMT | WS
Publisher:
Association for Computational Linguistics
Pages:
37–48
URL:
https://aclanthology.org/2026.loresmt-1.3/
Cite (ACL):
Kenji Imamura and Masao Utiyama. 2026. Text Filter Based on Automatically Acquired Vocabularies for Multilingual Machine Translation. In Proceedings for the Ninth Workshop on Technologies for Machine Translation of Low Resource Languages (LoResMT 2026), pages 37–48, Rabat, Morocco. Association for Computational Linguistics.
Cite (Informal):
Text Filter Based on Automatically Acquired Vocabularies for Multilingual Machine Translation (Imamura & Utiyama, LoResMT 2026)
PDF:
https://aclanthology.org/2026.loresmt-1.3.pdf