Lay-Ki Soon

2025

2024

Norm violations occur when individuals fail to conform to culturally accepted behaviors, which may lead to potential conflicts. Remediating norm violations requires social awareness and cultural sensitivity of the nuances at play. To equip interactive AI systems with a remediation ability, we offer ReNoVi — a large-scale corpus of 9,258 multi-turn dialogues annotated with social norms, as well as define a sequence of tasks to help understand and remediate norm violations step by step. ReNoVi consists of two parts: 512 human-authored dialogues (real data), and 8,746 synthetic conversations generated by ChatGPT through prompt learning. While collecting sufficient human-authored data is costly, synthetic conversations provide suitable amounts of data to help mitigate the scarcity of training data, as well as the chance to assess the alignment between LLMs and humans in the awareness of social norms. We thus harness the power of ChatGPT to generate synthetic training data for our task. To ensure the quality of both human-authored and synthetic data, we follow a quality control protocol during data collection. Our experimental results demonstrate the importance of remediating norm violations in socio-cultural conversations, as well as the improvement in performance obtained from synthetic data.

pdf bib
‘Finance Wizard’ at the FinLLM Challenge Task: Financial Text Summarization
Meisin Lee | Lay-Ki Soon
Proceedings of the Eighth Financial Technology and Natural Language Processing and the 1st Agent AI for Scenario Planning

pdf bib abs
Malaysian English News Decoded: A Linguistic Resource for Named Entity and Relation Extraction
MohanRaj Chanthran | Lay-Ki Soon | Huey Fang Ong | Bhawani Selvaretnam
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Standard English and Malaysian English exhibit notable differences, posing challenges for natural language processing (NLP) tasks on Malaysian English. An experiment using state-of-the-art Named Entity Recognition (NER) solutions in Malaysian English news articles highlights that they cannot handle morphosyntactic variations in Malaysian English. Unfortunately, most of the existing datasets are mainly based on Standard English, which is not sufficient to enhance NLP tasks in Malaysian English. To the best of our knowledge, there is no annotated dataset that can be used to improve the model. To address this issue, we have constructed a Malaysian English News (MEN) dataset, which contains 200 news articles that are manually annotated with entities and relations. We then fine-tuned the spaCy NER tool and validated that having a dataset tailor-made for Malaysian English could significantly improve the performance of NER in Malaysian English. This paper presents our efforts to acquire data, the annotation methodology, and a detailed analysis of the annotated dataset. To ensure the quality of the annotation, we have measured the Inter-Annotator Agreement (IAA), and any disagreements were resolved by a subject matter expert through adjudication. After a rigorous quality check, we have developed a dataset with 6,061 entities and 3,268 relation instances. Finally, we discuss spaCy fine-tuning setup and analysis of NER performance. This unique dataset will contribute significantly to the advancement of NLP research in Malaysian English, allowing researchers to accelerate their progress, particularly in NER and relation extraction.

pdf bib abs
Bridging the Gap: Transfer Learning from English PLMs to Malaysian English
MohanRaj Chanthran | Lay-Ki Soon | Huey Fang Ong | Bhawani Selvaretnam
Proceedings of the 9th Workshop on Representation Learning for NLP (RepL4NLP-2024)

Malaysian English is a low resource creole languages, where it carries the elements of Malay, Chinese, and Tamil languages, in addition to Standard English. Named Entity Recognition (NER) models underperforms when capturing entities from Malaysian English text due to its distinctive morphosyntactic adaptations, semantic features and code-switching (mixing English and Malay). Considering these gaps, we introduce MENmBERT and MENBERT, a pre-trained language model with contextual understanding, specifically tailored for Malaysian English. We have fine-tuned MENmBERT and MENBERT using manually annotated entities and relations from the Malaysian English News Article (MEN) Dataset. This fine-tuning process allows the PLM to learn representations that capture the nuances of Malaysian English relevant for NER and RE tasks. MENmBERT achieved a 1.52% and 26.27% improvement on NER and RE tasks respectively compared to the bert-base-multilingual-cased model. While the overall performance for NER does not have significant improvement, our further analysis shows that there is a significant improvement when evaluated by the 12 entity labels. These findings suggest that pre-training language models on language-specific and geographically-focused corpora can be a promising approach for improving NER performance in low-resource settings. The dataset and code published through this paper provide valuable resources for NLP research work focusing on Malaysian English.

2023

Large Language Models (LLMs), such as ChatGPT, have drawn a lot of attentions recently in the legal domain due to its emergent ability to tackle a variety of legal tasks. However, it is still unknown if LLMs are able to analyze a legal case and perform reasoning in the same manner as lawyers. Therefore, we constructed a novel corpus consisting of scenarios pertain to Contract Acts Malaysia and Australian Social Act for Dependent Child. ChatGPT is applied to perform analysis on the corpus using the IRAC method, which is a framework widely used by legal professionals for organizing legal analysis. Each scenario in the corpus is annotated with a complete IRAC analysis in a semi-structured format so that both machines and legal professionals are able to interpret and understand the annotations. In addition, we conducted the first empirical assessment of ChatGPT for IRAC analysis in order to understand how well it aligns with the analysis of legal professionals. Our experimental results shed lights on possible future research directions to improve alignments between LLMs and legal experts in terms of legal reasoning.

pdf bib abs
How well ChatGPT understand Malaysian English? An Evaluation on Named Entity Recognition and Relation Extraction
Mohanraj Chanthran | Lay-Ki Soon | Ong Huey Fang | Bhawani Selvaretnam
Proceedings of the Third Workshop on Natural Language Generation, Evaluation, and Metrics (GEM)

Recently, ChatGPT has attracted a lot of interest from both researchers and the general public. While the performance of ChatGPT in Named Entity Recognition and Relation Extraction from Standard English texts is satisfactory, it remains to be seen if it can perform similarly for Malaysian English. Malaysian English is unique as it exhibits morphosyntactic and semantical adaptation from local contexts. In this study, we assess ChatGPT’s capability in extracting entities and relations from the Malaysian English News (MEN) dataset. We propose a three-step methodology referred to as educate-predict-evaluate. The performance of ChatGPT is assessed using F1-Score across 18 unique prompt settings, which were carefully engineered for a comprehensive review. From our evaluation, we found that ChatGPT does not perform well in extracting entities from Malaysian English news articles, with the highest F1-Score of 0.497. Further analysis shows that the morphosyntactic adaptation in Malaysian English caused the limitation. However, interestingly, this morphosyntactic adaptation does not impact the performance of ChatGPT for relation extraction.

2022

pdf bib abs
CrudeOilNews: An Annotated Crude Oil News Corpus for Event Extraction
Meisin Lee | Lay-Ki Soon | Eu Gene Siew | Ly Fie Sugianto
Proceedings of the Thirteenth Language Resources and Evaluation Conference

In this paper, we present CrudeOilNews, a corpus of English Crude Oil news for event extraction. It is the first of its kind for Commodity News and serves to contribute towards resource building for economic and financial text mining. This paper describes the data collection process, the annotation methodology, and the event typology used in producing the corpus. Firstly, a seed set of 175 news articles were manually annotated, of which a subset of 25 news was used as the adjudicated reference test set for inter-annotator and system evaluation. The inter-annotator agreement was generally substantial, and annotator performance was adequate, indicating that the annotation scheme produces consistent event annotations of high quality. Subsequently, the dataset is expanded through (1) data augmentation and (2) Human-in-the-loop active learning. The resulting corpus has 425 news articles with approximately 11k events annotated. As part of the active learning process, the corpus was used to train basic event extraction models for machine labeling; the resulting models also serve as a validation or as a pilot study demonstrating the use of the corpus in machine learning purposes. The annotated corpus is made available for academic research purpose at https://github.com/meisin/CrudeOilNews-Corpus

2021

pdf bib abs
Effective Use of Graph Convolution Network and Contextual Sub-Tree for Commodity News Event Extraction
Meisin Lee | Lay-Ki Soon | Eu-Gene Siew
Proceedings of the Third Workshop on Economics and Natural Language Processing

Event extraction in commodity news is a less researched area as compared to generic event extraction. However, accurate event extraction from commodity news is useful in abroad range of applications such as under-standing event chains and learning event-event relations, which can then be used for commodity price prediction. The events found in commodity news exhibit characteristics different from generic events, hence posing a unique challenge in event extraction using existing methods. This paper proposes an effective use of Graph Convolutional Networks(GCN) with a pruned dependency parse tree, termed contextual sub-tree, for better event ex-traction in commodity news. The event ex-traction model is trained using feature embed-dings from ComBERT, a BERT-based masked language model that was produced through domain-adaptive pre-training on a commodity news corpus. Experimental results show the efficiency of the proposed solution, which out-performs existing methods with F1 scores as high as 0.90. Furthermore, our pre-trained language model outperforms GloVe by 23%, and BERT and RoBERTa by 7% in terms of argument roles classification. For the goal of re-producibility, the code and trained models are made publicly available.

2019

pdf bib abs
Hybrid Models for Aspects Extraction without Labelled Dataset
Wai-Howe Khong | Lay-Ki Soon | Hui-Ngo Goh
Proceedings of the Second Workshop on Fact Extraction and VERification (FEVER)

One of the important tasks in opinion mining is to extract aspects of the opinion target. Aspects are features or characteristics of the opinion target that are being reviewed, which can be categorised into explicit and implicit aspects. Extracting aspects from opinions is essential in order to ensure accurate information about certain attributes of an opinion target is retrieved. For instance, a professional camera receives a positive feedback in terms of its functionalities in a review, but its overly high price receives negative feedback. Most of the existing solutions focus on explicit aspects. However, sentences in reviews normally do not state the aspects explicitly. In this research, two hybrid models are proposed to identify and extract both explicit and implicit aspects, namely TDM-DC and TDM-TED. The proposed models combine topic modelling and dictionary-based approach. The models are unsupervised as they do not require any labelled dataset. The experimental results show that TDM-DC achieves F1-measure of 58.70%, where it outperforms both the baseline topic model and dictionary-based approach. In comparison to other existing unsupervised techniques, the proposed models are able to achieve higher F1-measure by approximately 3%. Although the supervised techniques perform slightly better, the proposed models are domain-independent, and hence more versatile.