Vijay Mago

2024

Classification of Buddhist Verses: The Efficacy and Limitations of Transformer-Based Models
Nikita Neveditsin | Ambuja Salgaonkar | Pawan Lingras | Vijay Mago
Proceedings of the 4th International Conference on Natural Language Processing for Digital Humanities

This study assesses the ability of machine learning to classify verses from Buddhist texts into two categories: Therigatha and Theragatha, attributed to female and male authors, respectively. It highlights the difficulties in data preprocessing and the use of Transformer-based models on Devanagari script due to limited vocabulary, demonstrating that simple statistical models can be equally effective. The research suggests areas for future exploration, provides the dataset for further study, and acknowledges existing limitations and challenges.

2023

pdf bib abs

Facilitating learning outcome assessment– development of new datasets and analysis of pre-trained language models
Akriti Jindal | Kaylin Kainulainen | Andrew Fisher | Vijay Mago
Proceedings of the 2023 CLASP Conference on Learning with Small Data (LSD)

Student mobility reflects academic transfer from one postsecondary institution to another and facilitates students’ educational goals of obtaining multiple credentials and/or advanced training in their field. This process often relies on transfer credit assessment, based on the similarity between learning outcomes, to determine what knowledge and skills were obtained at the sending institution as well as what knowledge and skills need to still be acquired at the receiving institution. As human evaluation can be both a challenging and time-consuming process, algorithms based on natural language processing can be a reliable tool for assessing transfer credit. In this article, we propose two novel datasets in the fields of Anatomy and Computer Science. Our aim is to probe the similarity between learning outcomes utilising pre-trained embedding models and compare their performance to human-annotated results. We found that ALBERT, MPNeT and DistilRoBERTa demonstrated the best ability to predict the similarity between pairs of learning outcomes. However, Davinci - a GPT-3 model which is expected to predict better results - is only able to provide a good qualitative explanation and not an accurate similarity score. The codes and datasets are available at https://github.com/JAkriti/New-Dataset-and-Performance-of-Embedding-Models.

pdf bib abs

Augmenting Reddit Posts to Determine Wellness Dimensions impacting Mental Health
Chandreen Liyanage | Muskan Garg | Vijay Mago | Sunghwan Sohn
Proceedings of the 22nd Workshop on Biomedical Natural Language Processing and BioNLP Shared Tasks

Amid ongoing health crisis, there is a growing necessity to discern possible signs of Wellness Dimensions (WD) manifested in self-narrated text. As the distribution of WD on social media data is intrinsically imbalanced, we experiment the generative AI techniques for data augmentation to enable further improvement in the pre-screening task of classifying WD. To this end, we propose a simple yet effective data augmentation approach through prompt-based Generative AI models, and evaluate the ROUGE scores and syntactic/ semantic similarity among existing interpretations and augmented data. Our approach with ChatGPT model surpasses all the other methods and achieves improvement over baselines such as Easy-Data Augmentation (EDA) and Backtranslation (BT).

pdf bib abs

An Annotated Dataset for Explainable Interpersonal Risk Factors of Mental Disturbance in Social Media Posts
Muskan Garg | Amirmohammad Shahbandegan | Amrit Chadha | Vijay Mago
Findings of the Association for Computational Linguistics: ACL 2023

With a surge in identifying suicidal risk and its severity in social media posts, we argue that a more consequential and explainable research is required for optimal impact on clinical psychology practice and personalized mental healthcare. The success of computational intelligence techniques for inferring mental illness from social media resources, points to natural language processing as a lens for determining Interpersonal Risk Factors (IRF) in human writings. Motivated with limited availability of datasets for social NLP research community, we construct and release a new annotated dataset with human-labelled explanations and classification of IRF affecting mental disturbance on social media: (i) Thwarted Belongingness (TBe), and (ii) Perceived Burdensomeness (PBu). We establish baseline models on our dataset facilitating future research directions to develop real-time personalized AI models by detecting patterns of TBe and PBu in emotional spectrum of user’s historical social media profile.

2022

pdf bib abs

The social NLP researchers and mental health practitioners have witnessed exponential growth in the field of mental health detection and analysis on social media. It has become important to identify the reason behind mental illness. In this context, we introduce a new dataset for Causal Analysis of Mental health in Social media posts (CAMS). We first introduce the annotation schema for this task of causal analysis. The causal analysis comprises of two types of annotations, viz, causal interpretation and causal categorization. We show the efficacy of our scheme in two ways: (i) crawling and annotating 3155 Reddit data and (ii) re-annotate the publicly available SDCNL dataset of 1896 instances for interpretable causal analysis. We further combine them as CAMS dataset and make it available along with the other source codes https://anonymous.4open.science/r/CAMS1/. Our experimental results show that the hybrid CNN-LSTM model gives the best performance over CAMS dataset.