Cantonese Natural Language Processing in the Transformers Era

Rong Xiang, Ming Liao, Jing Li


Abstract
Despite being spoken by a large population of speakers worldwide, Cantonese is under-resourced in terms of data scale and diversity compared to other major languages. This limitation has excluded it from the current "pre-training and fine-tuning" paradigm that is dominated by Transformer architectures. In this paper, we provide a comprehensive review of the existing resources and methodologies for Cantonese Natural Language Processing, covering recent progress in language understanding, text generation, and the development of language models. We finally discuss two aspects of the Cantonese language that could make it challenging even for state-of-the-art architectures: colloquialism and multilinguality.
Anthology ID:
2024.sighan-1.8
Volume:
Proceedings of the 10th SIGHAN Workshop on Chinese Language Processing (SIGHAN-10)
Month:
August
Year:
2024
Address:
Bangkok, Thailand
Editors:
Kam-Fai Wong, Min Zhang, Ruifeng Xu, Jing Li, Zhongyu Wei, Lin Gui, Bin Liang, Runcong Zhao
Venues:
SIGHAN | WS
Publisher:
Association for Computational Linguistics
Pages:
69–79
URL:
https://aclanthology.org/2024.sighan-1.8
Cite (ACL):
Rong Xiang, Ming Liao, and Jing Li. 2024. Cantonese Natural Language Processing in the Transformers Era. In Proceedings of the 10th SIGHAN Workshop on Chinese Language Processing (SIGHAN-10), pages 69–79, Bangkok, Thailand. Association for Computational Linguistics.
Cite (Informal):
Cantonese Natural Language Processing in the Transformers Era (Xiang et al., SIGHAN-WS 2024)
PDF:
https://aclanthology.org/2024.sighan-1.8.pdf