Robust Neural Machine Translation for Abugidas by Glyph Perturbation

Hour Kaing, Chenchen Ding, Hideki Tanaka, Masao Utiyama


Abstract
Neural machine translation (NMT) systems are vulnerable when trained on limited data. This is a common scenario for low-resource tasks in the real world. To increase robustness, one solution is to intentionally add realistic noise during the training phase. Noise simulation using text perturbation has proven effective for writing systems that use Latin letters. In this study, we further explore perturbation techniques for the more complex abugida writing systems, where the visual similarity of complex glyphs is used to capture the essential nature of these scripts. In addition to the generated noise, we propose a training strategy to improve robustness. We conducted experiments on six languages: Bengali, Hindi, Myanmar, Khmer, Lao, and Thai. By training the systems to overcome the introduced noise, we obtained non-degenerate NMT systems with improved robustness for low-resource abugida tasks.
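
As an illustration of the perturbation idea described in the abstract, the sketch below injects noise by swapping characters for visually similar glyphs at a fixed rate. It is a minimal sketch only: the SIMILAR_GLYPHS table (a few assumed Thai pairs), the perturb function, and the rate value are hypothetical and do not reflect the paper's actual similarity resources, scripts, or settings.

import random

# Hypothetical table of visually similar glyph substitutions (Thai examples).
# The paper builds similarity from glyph shapes; these pairs are assumed here.
SIMILAR_GLYPHS = {
    "\u0e14": ["\u0e15"],  # Thai DO DEK   <-> TO TAO
    "\u0e1a": ["\u0e1b"],  # Thai BO BAIMAI <-> PO PLA
    "\u0e1c": ["\u0e1e"],  # Thai PHO PHUNG <-> PHO PHAN
}

def perturb(sentence: str, rate: float = 0.1) -> str:
    """Replace each character with a visually similar glyph with probability `rate`."""
    chars = []
    for ch in sentence:
        candidates = SIMILAR_GLYPHS.get(ch)
        if candidates and random.random() < rate:
            chars.append(random.choice(candidates))
        else:
            chars.append(ch)
    return "".join(chars)

# Usage: add noisy copies of source sentences to the training data.
if __name__ == "__main__":
    src = "\u0e1a\u0e49\u0e32\u0e19"  # Thai word for "house"
    print(perturb(src, rate=0.3))
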
Anthology ID:
2024.eacl-short.27
Volume:
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 2: Short Papers)
Month:
March
Year:
2024
Address:
St. Julian’s, Malta
Editors:
Yvette Graham, Matthew Purver
Venue:
EACL
Publisher:
Association for Computational Linguistics
Pages:
311–318
URL:
https://aclanthology.org/2024.eacl-short.27
Cite (ACL):
Hour Kaing, Chenchen Ding, Hideki Tanaka, and Masao Utiyama. 2024. Robust Neural Machine Translation for Abugidas by Glyph Perturbation. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 2: Short Papers), pages 311–318, St. Julian’s, Malta. Association for Computational Linguistics.
Cite (Informal):
Robust Neural Machine Translation for Abugidas by Glyph Perturbation (Kaing et al., EACL 2024)
PDF:
https://aclanthology.org/2024.eacl-short.27.pdf