Malaysian English News Decoded: A Linguistic Resource for Named Entity and Relation Extraction

Mohanraj Chanthran; Lay-Ki Soon; Huey Fang Ong; Bhawani Selvaretnam

Abstract

Standard English and Malaysian English exhibit notable differences, posing challenges for natural language processing (NLP) tasks on Malaysian English. An experiment applying state-of-the-art Named Entity Recognition (NER) solutions to Malaysian English news articles shows that they cannot handle the morphosyntactic variations of Malaysian English. Most existing datasets are based on Standard English, which is not sufficient to improve NLP tasks on Malaysian English, and to the best of our knowledge no annotated dataset exists that could be used to improve such models. To address this issue, we have constructed the Malaysian English News (MEN) dataset, which contains 200 news articles manually annotated with entities and relations. We then fine-tuned the spaCy NER tool and validated that a dataset tailor-made for Malaysian English can significantly improve NER performance on Malaysian English. This paper presents our data acquisition efforts, the annotation methodology, and a detailed analysis of the annotated dataset. To ensure annotation quality, we measured the Inter-Annotator Agreement (IAA), and disagreements were resolved by a subject matter expert through adjudication. After a rigorous quality check, the resulting dataset contains 6,061 entities and 3,268 relation instances. Finally, we discuss the spaCy fine-tuning setup and analyze NER performance. This unique dataset will contribute significantly to the advancement of NLP research in Malaysian English, allowing researchers to accelerate their progress, particularly in NER and relation extraction.
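The abstract mentions measuring Inter-Annotator Agreement but does not say which metric was used. As a rough illustration only, the sketch below computes Cohen's kappa over token-level entity labels from two annotators. The choice of Cohen's kappa, the function name `cohen_kappa`, and the example labels are all assumptions made for this sketch and are not taken from the paper.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length label sequences (illustrative helper)."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both annotators must label the same items")
    n = len(labels_a)
    if n == 0:
        raise ValueError("no items to compare")

    # Observed agreement: fraction of items given the same label by both annotators.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Expected chance agreement, from each annotator's label distribution.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    all_labels = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[label] * counts_b[label] for label in all_labels) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical label everywhere.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical token-level labels; not drawn from the MEN dataset.
annotator_a = ["PER", "O", "LOC", "O", "ORG", "O"]
annotator_b = ["PER", "O", "LOC", "O", "O", "O"]

kappa = cohen_kappa(annotator_a, annotator_b)
print(round(kappa, 3))  # 0.727
```

In this toy example the annotators agree on five of six tokens, and the kappa of about 0.727 is lower than the raw agreement of about 0.833 because agreement on the frequent `O` label is partly expected by chance.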

Anthology ID: 2024.lrec-main.959
Original: 2024.lrec-main.959v1
Version 2: 2024.lrec-main.959v2
Volume: Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Month: May
Year: 2024
Address: Torino, Italia
Editors: Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue
Venues: LREC | COLING
Publisher: ELRA and ICCL
Pages: 10999–11022
URL: https://aclanthology.org/2024.lrec-main.959/
Cite (ACL): MohanRaj Chanthran, Lay-Ki Soon, Huey Fang Ong, and Bhawani Selvaretnam. 2024. Malaysian English News Decoded: A Linguistic Resource for Named Entity and Relation Extraction. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 10999–11022, Torino, Italia. ELRA and ICCL.
Cite (Informal): Malaysian English News Decoded: A Linguistic Resource for Named Entity and Relation Extraction (Chanthran et al., LREC-COLING 2024)
PDF: https://aclanthology.org/2024.lrec-main.959.pdf
