Next-Level Cantonese-to-Mandarin Translation: Fine-Tuning and Post-Processing with LLMs

Yuqian Dai, Chun Fai Chan, Ying Ki Wong, Tsz Ho Pun


Abstract
Large Language Models (LLMs) have improved performance across a range of natural language processing tasks. Despite these gains, LLMs still face significant challenges, such as grammatical errors and code-switching into English, when applied to low-resource languages like Cantonese in Machine Translation (MT) scenarios. Addressing the distinctive linguistic and contextual challenges of Cantonese, we present a novel strategy to improve the understanding and translation capabilities of LLMs for Cantonese-to-Mandarin MT. Our strategy comprises three key components: (1) syntax and Part-of-Speech (POS) fine-tuning, in which we fine-tune the LLM on the Universal Dependencies (UD) corpus, focusing on the linguistic structures of Cantonese; (2) specialized Cantonese-to-Mandarin sentence pairs, collected from diverse sources such as Cantonese grammar textbooks and manually translated sentences spanning various domains, to expose the model to a wide range of linguistic contexts; and (3) post-processing with additional LLMs, in which further LLMs refine the initial translations by correcting Mandarin grammar and punctuation. Empirical evaluations on human-created test sets show that our strategy improves translation performance and outperforms existing commercial translation models by at least 3 BLEU points. The strategy also benefits other LLMs and the reversed translation direction, demonstrating its generalizability and effectiveness.
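To make the post-processing component (3) concrete, the following is a minimal sketch of how a second LLM could refine a draft translation. The prompt wording, model name, and use of the OpenAI chat API are illustrative assumptions, not the authors' actual configuration.

```python
# Hypothetical sketch of an LLM post-processing pass: a second model
# polishes a draft Mandarin translation, fixing grammar and punctuation.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

POST_EDIT_PROMPT = (
    "You are a Mandarin copy editor. Correct any grammatical or "
    "punctuation errors in the following Mandarin sentence while "
    "preserving its meaning. Return only the corrected sentence.\n\n{draft}"
)

def post_edit(draft_translation: str, model: str = "gpt-4o-mini") -> str:
    """Ask a second LLM to polish a draft Cantonese-to-Mandarin translation."""
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": POST_EDIT_PROMPT.format(draft=draft_translation),
        }],
        temperature=0,  # deterministic edits
    )
    return response.choices[0].message.content.strip()

# Example: a draft with a residual Cantonese aspect marker ("咗")
# that the post-editing pass should normalize to Mandarin.
print(post_edit("我今天去咗图书馆借书。"))
```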
Anthology ID:
2025.loreslm-1.32
Volume:
Proceedings of the First Workshop on Language Models for Low-Resource Languages
Month:
January
Year:
2025
Address:
Abu Dhabi, United Arab Emirates
Editors:
Hansi Hettiarachchi, Tharindu Ranasinghe, Paul Rayson, Ruslan Mitkov, Mohamed Gaber, Damith Premasiri, Fiona Anting Tan, Lasitha Uyangodage
Venues:
LoResLM | WS
Publisher:
Association for Computational Linguistics
Pages:
427–436
URL:
https://aclanthology.org/2025.loreslm-1.32/
Cite (ACL):
Yuqian Dai, Chun Fai Chan, Ying Ki Wong, and Tsz Ho Pun. 2025. Next-Level Cantonese-to-Mandarin Translation: Fine-Tuning and Post-Processing with LLMs. In Proceedings of the First Workshop on Language Models for Low-Resource Languages, pages 427–436, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
Cite (Informal):
Next-Level Cantonese-to-Mandarin Translation: Fine-Tuning and Post-Processing with LLMs (Dai et al., LoResLM 2025)
PDF:
https://aclanthology.org/2025.loreslm-1.32.pdf