Bhawani Selvaretnam


2024

Malaysian English News Decoded: A Linguistic Resource for Named Entity and Relation Extraction
MohanRaj Chanthran | Lay-Ki Soon | Huey Fang Ong | Bhawani Selvaretnam
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Standard English and Malaysian English exhibit notable differences, posing challenges for natural language processing (NLP) tasks on Malaysian English. An experiment applying state-of-the-art Named Entity Recognition (NER) solutions to Malaysian English news articles shows that they cannot handle the morphosyntactic variations in Malaysian English. Most existing datasets are based on Standard English, which is not sufficient to improve NLP tasks in Malaysian English. To the best of our knowledge, there is no annotated dataset that can be used to improve such models. To address this issue, we have constructed a Malaysian English News (MEN) dataset, which contains 200 news articles that are manually annotated with entities and relations. We then fine-tuned the spaCy NER tool and validated that a dataset tailor-made for Malaysian English could significantly improve the performance of NER in Malaysian English. This paper presents our data acquisition efforts, the annotation methodology, and a detailed analysis of the annotated dataset. To ensure the quality of the annotation, we measured the Inter-Annotator Agreement (IAA), and any disagreements were resolved by a subject matter expert through adjudication. After a rigorous quality check, we have developed a dataset with 6,061 entities and 3,268 relation instances. Finally, we discuss the spaCy fine-tuning setup and analyze NER performance. This unique dataset will contribute significantly to the advancement of NLP research in Malaysian English, allowing researchers to accelerate their progress, particularly in NER and relation extraction.
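
The abstract mentions measuring Inter-Annotator Agreement but does not say which statistic was used. As a rough illustration only, the sketch below assumes Cohen's kappa over per-token labels from two annotators; the function name, the label names, and the sample annotations are hypothetical and not taken from the MEN dataset.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items.

    Assumption: agreement is measured per token; the paper may use a
    different unit or a different IAA statistic.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty items")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's label distribution.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical token-level annotations ("O" marks a non-entity token).
annotator_a = ["PERSON", "O", "O", "LOCATION", "O", "ORGANIZATION"]
annotator_b = ["PERSON", "O", "O", "LOCATION", "O", "O"]
kappa = cohen_kappa(annotator_a, annotator_b)
```

In a workflow like the one described, items where the two label sequences differ would be the ones passed to the adjudicator.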

2023

How well ChatGPT understand Malaysian English? An Evaluation on Named Entity Recognition and Relation Extraction
Mohanraj Chanthran | Lay-Ki Soon | Ong Huey Fang | Bhawani Selvaretnam
Proceedings of the Third Workshop on Natural Language Generation, Evaluation, and Metrics (GEM)

Recently, ChatGPT has attracted a lot of interest from both researchers and the general public. While the performance of ChatGPT in Named Entity Recognition and Relation Extraction from Standard English texts is satisfactory, it remains to be seen whether it can perform similarly for Malaysian English. Malaysian English is unique in that it exhibits morphosyntactic and semantic adaptation from local contexts. In this study, we assess ChatGPT’s capability in extracting entities and relations from the Malaysian English News (MEN) dataset. We propose a three-step methodology referred to as educate-predict-evaluate. The performance of ChatGPT is assessed using F1-Score across 18 unique prompt settings, which were carefully engineered for a comprehensive review. From our evaluation, we found that ChatGPT does not perform well in extracting entities from Malaysian English news articles, with a highest F1-Score of 0.497. Further analysis shows that the morphosyntactic adaptation in Malaysian English causes this limitation. Interestingly, however, this morphosyntactic adaptation does not affect the performance of ChatGPT for relation extraction.
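
To make the evaluation metric concrete, here is a minimal sketch of an entity-level F1-Score. It assumes exact matching on (mention, label) pairs, which may differ from the matching rule used in the paper; the function name, label names, and example entities are hypothetical.

```python
def entity_f1(gold, predicted):
    """F1 over (mention, label) pairs, counting only exact matches.

    Assumption: an entity is correct only if both the mention text and
    the label match a gold annotation.
    """
    gold_set, pred_set = set(gold), set(predicted)
    true_positives = len(gold_set & pred_set)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred_set)
    recall = true_positives / len(gold_set)
    return 2 * precision * recall / (precision + recall)


# Hypothetical gold annotations and model output for one article.
gold = [
    ("Kuala Lumpur", "LOCATION"),
    ("Petronas", "ORGANIZATION"),
    ("Anwar Ibrahim", "PERSON"),
]
predicted = [
    ("Kuala Lumpur", "LOCATION"),
    ("Petronas", "PERSON"),  # right mention, wrong label: counted as an error
]
score = entity_f1(gold, predicted)
```

Here one of two predictions is correct (precision 0.5) and one of three gold entities is found (recall 1/3), so the F1 is 0.4.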