CantonMT: Cantonese to English NMT Platform with Fine-Tuned Models using Real and Synthetic Back-Translation Data

Kung Hong, Lifeng Han, Riza Batista-Navarro, Goran Nenadic


Abstract
Neural Machine Translation (NMT) for low-resource languages remains a challenge for many NLP researchers. In this work, we deploy a standard data augmentation methodology by back-translation to a new language translation direction, i.e., Cantonese-to-English. We present the models we fine-tuned using the limited amount of real data and the synthetic data we generated using back-translation by three models: OpusMT, NLLB, and mBART. We carried out automatic evaluation using a range of different metrics including those that are lexical-based and embedding-based. Furthermore, we create a user-friendly interface for the models we included in this project, CantonMT, and make it available to facilitate Cantonese-to-English MT research. Researchers can add more models to this platform via our open-source CantonMT toolkit, available at https://github.com/kenrickkung/CantoneseTranslation.
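
As a rough illustration of the back-translation idea mentioned in the abstract, the sketch below pairs monolingual English sentences with machine-generated Cantonese sources and merges them with real parallel data. It is not the authors' pipeline: the function names `back_translate` and `build_training_set` are hypothetical, the reverse translator is passed in as a plain callable (in practice it might wrap a model such as OpusMT, NLLB, or mBART), and the choice of which side is monolingual is an assumption that the abstract does not spell out.

```python
from typing import Callable, Iterable, List, Tuple

# (Cantonese source, English target); this layout is an assumption for the sketch.
Pair = Tuple[str, str]


def back_translate(
    english_sentences: Iterable[str],
    en_to_yue: Callable[[str], str],
) -> List[Pair]:
    """Pair each monolingual English sentence with a synthetic Cantonese source.

    `en_to_yue` is a hypothetical reverse-direction translator supplied by
    the caller; it is not an API of the CantonMT toolkit.
    """
    pairs: List[Pair] = []
    for en in english_sentences:
        en = en.strip()
        if not en:
            continue  # skip blank lines in the monolingual corpus
        yue = en_to_yue(en).strip()
        if yue:
            pairs.append((yue, en))
    return pairs


def build_training_set(
    real_pairs: Iterable[Pair],
    synthetic_pairs: Iterable[Pair],
) -> List[Pair]:
    """Concatenate real and synthetic pairs, real first, dropping exact duplicates."""
    seen = set()
    merged: List[Pair] = []
    for pair in list(real_pairs) + list(synthetic_pairs):
        if pair not in seen:
            seen.add(pair)
            merged.append(pair)
    return merged
```

The merged list could then be fed to whatever fine-tuning routine is in use; keeping the real pairs first and de-duplicating is only one possible design choice, made here to keep the sketch deterministic.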
Anthology ID:
2024.eamt-1.49
Volume:
Proceedings of the 25th Annual Conference of the European Association for Machine Translation (Volume 1)
Month:
June
Year:
2024
Address:
Sheffield, UK
Editors:
Carolina Scarton, Charlotte Prescott, Chris Bayliss, Chris Oakley, Joanna Wright, Stuart Wrigley, Xingyi Song, Edward Gow-Smith, Rachel Bawden, Víctor M Sánchez-Cartagena, Patrick Cadwell, Ekaterina Lapshinova-Koltunski, Vera Cabarrão, Konstantinos Chatzitheodorou, Mary Nurminen, Diptesh Kanojia, Helena Moniz
Venue:
EAMT
Publisher:
European Association for Machine Translation (EAMT)
Pages:
590–599
URL:
https://aclanthology.org/2024.eamt-1.49
Cite (ACL):
Kung Hong, Lifeng Han, Riza Batista-Navarro, and Goran Nenadic. 2024. CantonMT: Cantonese to English NMT Platform with Fine-Tuned Models using Real and Synthetic Back-Translation Data. In Proceedings of the 25th Annual Conference of the European Association for Machine Translation (Volume 1), pages 590–599, Sheffield, UK. European Association for Machine Translation (EAMT).
Cite (Informal):
CantonMT: Cantonese to English NMT Platform with Fine-Tuned Models using Real and Synthetic Back-Translation Data (Hong et al., EAMT 2024)
PDF:
https://aclanthology.org/2024.eamt-1.49.pdf