Sang Yun Kwon


2024

pdf bib
On the Utility of Pretraining Language Models on Synthetic Data
Alcides Alcoba Inciarte | Sang Yun Kwon | El Moatez Billah Nagoudi | Muhammad Abdul-Mageed
Proceedings of The Second Arabic Natural Language Processing Conference

Development of pre-trained language models has predominantly relied on large amounts of datasets. However, this dependence on abundant data has limited the applicability of these models in low-resource settings. In this work, we investigate the utility of exploiting synthetic datasets acquired from different sources to pre-train language models for Arabic. Namely, we leverage data derived based on four different methods: optical character recognition (OCR), automatic speech recognition (ASR), machine translation (MT), and generative language models. We use these datasets to pre-train models in three different architectures: encoder-only (BERTBase), encoder-decoder (T5), and decoder-only (GPT-2). We test the capabilities of resulting models on Arabic natural language understanding (NLU) tasks using the ORCA benchmark. Our results show that utilizing synthetic data can achieve performance comparable to, or even surpassing, those trained on gold data. For example, our model based on a GPT-2 architecture trained on a combined synthetic dataset surpasses the baseline model ARBERTv2. Overall, our models pre-trained on synthetic data demonstrate robust performance across various tasks. This highlights the potential of synthetic datasets in augmenting language model training in low-resource settings.

2023

pdf bib
SIDLR: Slot and Intent Detection Models for Low-Resource Language Varieties
Sang Yun Kwon | Gagan Bhatia | Elmoatez Billah Nagoudi | Alcides Alcoba Inciarte | Muhammad Abdul-mageed
Tenth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2023)

Intent detection and slot filling are two critical tasks in spoken and natural language understandingfor task-oriented dialog systems. In this work, we describe our participation in slot and intent detection for low-resource language varieties (SID4LR) (Aepli et al., 2023). We investigate the slot and intent detection (SID) tasks using a wide range of models and settings. Given the recent success of multitask promptedfinetuning of the large language models, we also test the generalization capability of the recent encoder-decoder model mT0 (Muennighoff et al., 2022) on new tasks (i.e., SID) in languages they have never intentionally seen. We show that our best model outperforms the baseline by a large margin (up to +30 F1 points) in both SID tasks.