The Extraction and Fine-grained Classification of Written Cantonese Materials through Linguistic Feature Detection

Chaak-ming Lau, Mingfei Lau, Ann Wai Huen To


Abstract
This paper presents a linguistically informed, non-machine-learning tool for classifying Written Cantonese, Standard Written Chinese, and the intermediate varieties used by Cantonese-speaking users from Hong Kong, which are often grouped under a single “Traditional Chinese” label. Our approach addresses the lack of textual materials for Cantonese NLP, a consequence of the lower sociolinguistic status of Written Cantonese and of users switching between these varieties without sufficient language labeling. The tool relies on key strings and quotation markers, both of which reduce to string operations, to extract Written Cantonese sentences and documents from materials mixed with Standard Written Chinese. This allows flexible and efficient extraction of high-quality Cantonese data from large datasets, tailored to specific classification needs. Because it bypasses model inference, the tool can process large amounts of data at low cost, which is particularly significant for marginalized languages. The tool also aims to provide a baseline measure for future classification systems, and the approach may be applicable to other low-resource regional or diglossic languages.
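To make the idea of classification by key strings and quotation markers concrete, the following is a minimal sketch and not the authors' implementation. The marker lists, the label names, and the decision rules are all illustrative assumptions; the paper's actual feature inventory and categories may differ. It only shows how such a classifier can be built from plain string operations, with no model inference.

```python
# Illustrative feature lists (assumptions, not the paper's inventory).
CANTONESE_MARKERS = ("嘅", "咗", "喺", "哋", "唔", "冇", "啲", "嚟", "佢")
SWC_MARKERS = ("的", "是", "在", "他", "她", "們", "沒", "這", "那")

# Quotation pairs used to separate quoted speech from the surrounding text.
QUOTE_PAIRS = (("「", "」"), ("“", "”"))


def split_quotes(text):
    """Return (matrix_text, quoted_text) using simple non-nested matching."""
    closers = {opener: closer for opener, closer in QUOTE_PAIRS}
    matrix, quoted = [], []
    closing = None  # the closing mark we are waiting for, if any
    for ch in text:
        if closing is None and ch in closers:
            closing = closers[ch]      # entering a quotation
        elif closing is not None and ch == closing:
            closing = None             # leaving the quotation
        elif closing is None:
            matrix.append(ch)
        else:
            quoted.append(ch)
    return "".join(matrix), "".join(quoted)


def count_markers(text, markers):
    """Count occurrences of every marker string in the text."""
    return sum(text.count(m) for m in markers)


def label_segment(text):
    """Label one span of text from the presence of marker strings."""
    canto = count_markers(text, CANTONESE_MARKERS)
    swc = count_markers(text, SWC_MARKERS)
    if canto and swc:
        return "mixed"
    if canto:
        return "cantonese"
    if swc:
        return "swc"
    return "neutral"


def classify(text):
    """Hypothetical fine-grained label for a sentence or document."""
    matrix, quoted = split_quotes(text)
    # Cantonese dialogue embedded in a Standard Written Chinese narrative
    # gets its own label instead of being lumped into "mixed".
    if (quoted and label_segment(matrix) == "swc"
            and label_segment(quoted) == "cantonese"):
        return "cantonese_quotes_in_swc"
    return label_segment(matrix + quoted)
```

Under these assumptions, `classify("我哋唔喺度")` is labeled as Cantonese and `classify("我們不在這裡")` as Standard Written Chinese, while a sentence such as `他說:「我唔知」` is kept apart as quoted Cantonese inside a Standard Written Chinese frame. A production tool would need a much larger, linguistically vetted marker inventory and handling of ambiguous characters.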
Anthology ID: 2024.eurali-1.4
Volume: Proceedings of the 2nd Workshop on Resources and Technologies for Indigenous, Endangered and Lesser-resourced Languages in Eurasia (EURALI) @ LREC-COLING 2024
Month: May
Year: 2024
Address: Torino, Italia
Editors: Atul Kr. Ojha, Sina Ahmadi, Silvie Cinková, Theodorus Fransen, Chao-Hong Liu, John P. McCrae
Venues: EURALI | WS
Publisher: ELRA and ICCL
Pages: 24–29
URL: https://aclanthology.org/2024.eurali-1.4
Cite (ACL): Chaak-ming Lau, Mingfei Lau, and Ann Wai Huen To. 2024. The Extraction and Fine-grained Classification of Written Cantonese Materials through Linguistic Feature Detection. In Proceedings of the 2nd Workshop on Resources and Technologies for Indigenous, Endangered and Lesser-resourced Languages in Eurasia (EURALI) @ LREC-COLING 2024, pages 24–29, Torino, Italia. ELRA and ICCL.
Cite (Informal): The Extraction and Fine-grained Classification of Written Cantonese Materials through Linguistic Feature Detection (Lau et al., EURALI-WS 2024)
PDF: https://aclanthology.org/2024.eurali-1.4.pdf