KC4MT: A High-Quality Corpus for Multilingual Machine Translation

Vinh Van Nguyen, Ha Nguyen, Huong Thanh Le, Thai Phuong Nguyen, Tan Van Bui, Luan Nghia Pham, Anh Tuan Phan, Cong Hoang-Minh Nguyen, Viet Hong Tran, Anh Huu Tran


Abstract
Multilingual parallel corpora are an important resource for many natural language processing (NLP) applications. In machine translation, the size and quality of the training corpus largely determine the quality of the translation models. In this work, we present a method for building a high-quality multilingual parallel corpus in the news domain for several low-resource languages, including Vietnamese, Lao, and Khmer, to improve the quality of multilingual machine translation for these languages. We also publicly release the resulting corpus, which contains 500,000 Vietnamese-Chinese sentence pairs, 150,000 Vietnamese-Lao sentence pairs, and 150,000 Vietnamese-Khmer sentence pairs.
Anthology ID:
2022.lrec-1.588
Volume:
Proceedings of the Thirteenth Language Resources and Evaluation Conference
Month:
June
Year:
2022
Address:
Marseille, France
Editors:
Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
5494–5502
URL:
https://aclanthology.org/2022.lrec-1.588
Cite (ACL):
Vinh Van Nguyen, Ha Nguyen, Huong Thanh Le, Thai Phuong Nguyen, Tan Van Bui, Luan Nghia Pham, Anh Tuan Phan, Cong Hoang-Minh Nguyen, Viet Hong Tran, and Anh Huu Tran. 2022. KC4MT: A High-Quality Corpus for Multilingual Machine Translation. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 5494–5502, Marseille, France. European Language Resources Association.
Cite (Informal):
KC4MT: A High-Quality Corpus for Multilingual Machine Translation (Nguyen et al., LREC 2022)
PDF:
https://aclanthology.org/2022.lrec-1.588.pdf
Data
OPUS