PrahokBART: A Pre-trained Sequence-to-Sequence Model for Khmer Natural Language Generation

Hour Kaing, Raj Dabre, Haiyue Song, Van-Hien Tran, Hideki Tanaka, Masao Utiyama


Abstract
This work introduces PrahokBART, a compact pre-trained sequence-to-sequence model trained from scratch for Khmer using carefully curated Khmer and English corpora. We focus on improving the quality of the pre-training corpus and on addressing linguistic issues of Khmer that existing multilingual models ignore, by incorporating linguistic components such as word segmentation and normalization. We evaluate PrahokBART on three generative tasks: machine translation, text summarization, and headline generation. Our results demonstrate that it outperforms mBART50, a strong multilingual pre-trained model. Additionally, our analysis provides insights into the impact of each linguistic module and evaluates how effectively our model handles spacing during text generation, which is crucial for the naturalness of Khmer text.
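
The abstract names mBART50 as the baseline for comparison. As a point of reference only (this is not PrahokBART, whose checkpoint and Khmer-specific preprocessing are described in the paper), a minimal Khmer-to-English translation sketch with the public facebook/mbart-large-50-many-to-many-mmt checkpoint in Hugging Face Transformers might look like this:

    # Minimal sketch (not the authors' code): Khmer -> English translation with the
    # public mBART50 many-to-many checkpoint named as the baseline in the abstract.
    from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

    model_name = "facebook/mbart-large-50-many-to-many-mmt"
    model = MBartForConditionalGeneration.from_pretrained(model_name)
    tokenizer = MBart50TokenizerFast.from_pretrained(model_name)

    # Khmer source sentence ("Hello").
    tokenizer.src_lang = "km_KH"
    encoded = tokenizer("សួស្តី", return_tensors="pt")

    # Force the decoder to start generation with the English language token.
    generated = model.generate(
        **encoded,
        forced_bos_token_id=tokenizer.lang_code_to_id["en_XX"],
        max_length=64,
    )
    print(tokenizer.batch_decode(generated, skip_special_tokens=True))
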
Anthology ID:
2025.coling-main.87
Volume:
Proceedings of the 31st International Conference on Computational Linguistics
Month:
January
Year:
2025
Address:
Abu Dhabi, UAE
Editors:
Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, Steven Schockaert
Venue:
COLING
Publisher:
Association for Computational Linguistics
Pages:
1309–1322
URL:
https://aclanthology.org/2025.coling-main.87/
Cite (ACL):
Hour Kaing, Raj Dabre, Haiyue Song, Van-Hien Tran, Hideki Tanaka, and Masao Utiyama. 2025. PrahokBART: A Pre-trained Sequence-to-Sequence Model for Khmer Natural Language Generation. In Proceedings of the 31st International Conference on Computational Linguistics, pages 1309–1322, Abu Dhabi, UAE. Association for Computational Linguistics.
Cite (Informal):
PrahokBART: A Pre-trained Sequence-to-Sequence Model for Khmer Natural Language Generation (Kaing et al., COLING 2025)
PDF:
https://aclanthology.org/2025.coling-main.87.pdf