Md Tawkat Islam Khondaker

Also published as: Md Tawkat Islam Khondaker

2024

Benchmarking LLaMA-3 on Arabic Language Generation Tasks
Md Tawkat Islam Khondaker | Numaan Naeem | Fatimah Khan | AbdelRahim Elmadany | Muhammad Abdul-Mageed
Proceedings of the Second Arabic Natural Language Processing Conference

Open-source large language models (LLMs) have exhibited remarkable performance on a variety of NLP tasks, often catching up with closed-source LLMs like ChatGPT. Among these open LLMs, LLaMA-3-70B has emerged as the most recent and most prominent. However, how LLaMA-3-70B performs in multilingual settings, especially in a morphologically rich language like Arabic, has yet to be explored. In this work, we aim to bridge this gap by evaluating LLaMA-3-70B on a diverse set of Arabic natural language generation (NLG) benchmarks. To the best of our knowledge, this is the first study that comprehensively evaluates LLaMA-3-70B on Arabic NLG tasks. Our study reveals that LLaMA-3-70B lags behind closed LLMs like ChatGPT in both Modern Standard Arabic (MSA) and Dialectal Arabic (DA). We further compare the performance of LLaMA-3-70B with our smaller, dedicated finetuned Arabic models. We find that both LLaMA-3-70B and ChatGPT are outperformed by these comparatively small dedicated Arabic models, indicating scope for improvement with Arabic-focused LLMs.

DetoxLLM: A Framework for Detoxification with Explanations
Md Tawkat Islam Khondaker | Muhammad Abdul-Mageed | Laks V. S. Lakshmanan
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Prior work on detoxification is fragmented: it does not cover all aspects of detoxification needed in real-world scenarios. Notably, prior work develops detoxification models only for a seen subset of platforms, leaving unexplored how the models would perform on unseen platforms. Additionally, this work does not address non-detoxifiability, a phenomenon whereby toxic text cannot be detoxified without altering its meaning. We propose DetoxLLM, the first comprehensive end-to-end detoxification framework, which attempts to alleviate these limitations. We first introduce a cross-platform pseudo-parallel corpus, built by applying multi-step data processing and generation strategies that leverage ChatGPT. We then train a suite of detoxification models on our cross-platform corpus. We show that our detoxification models outperform the SoTA model trained on a human-annotated parallel corpus. We further introduce explanations to promote transparency and trustworthiness. DetoxLLM additionally offers a unique paraphrase detector dedicated to the detoxification task to handle non-detoxifiable cases. Through experimental analysis, we demonstrate the effectiveness of our cross-platform corpus and the robustness of DetoxLLM against adversarial toxicity.

2023

Despite the purported multilingual proficiency of instruction-finetuned large language models (LLMs) such as ChatGPT and Bard, the linguistic inclusivity of these models remains insufficiently explored. To address this gap, we present a thorough assessment of the machine translation proficiency of Bard and ChatGPT (encompassing both GPT-3.5 and GPT-4) across ten varieties of Arabic. Our evaluation covers diverse Arabic varieties such as Classical Arabic (CA), Modern Standard Arabic (MSA), and several country-level dialectal variants. Our analysis indicates that LLMs may struggle with dialects for which few public datasets exist, but on average they translate dialects better than existing commercial systems. On CA and MSA, however, instruction-tuned LLMs trail behind commercial systems such as Google Translate. Finally, we undertake a human-centric study to scrutinize how well the relatively recent model Bard follows human instructions during translation tasks. Our analysis reveals that Bard has only a limited ability to align with human instructions in translation contexts. Collectively, our findings underscore that prevailing LLMs remain far from inclusive, with only limited ability to cater to the linguistic and cultural intricacies of diverse communities.

GPTAraEval: A Comprehensive Evaluation of ChatGPT on Arabic NLP
Md Tawkat Islam Khondaker | Abdul Waheed | El Moatez Billah Nagoudi | Muhammad Abdul-Mageed
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

ChatGPT’s emergence heralds a transformative phase in NLP, demonstrated particularly by its excellent performance on many English benchmarks. However, the model’s efficacy across diverse linguistic contexts remains largely uncharted. This work aims to bridge this knowledge gap, with a primary focus on assessing ChatGPT’s capabilities on Arabic languages and dialectal varieties. Our comprehensive study conducts a large-scale automated and human evaluation of ChatGPT, encompassing 44 distinct language understanding and generation tasks on over 60 different datasets. To our knowledge, this is the first extensive performance analysis of ChatGPT in Arabic NLP. Our findings indicate that, despite its remarkable performance in English, ChatGPT is consistently surpassed by smaller models that have been finetuned on Arabic. We further undertake a meticulous comparison of ChatGPT and GPT-4 on Modern Standard Arabic (MSA) and Dialectal Arabic (DA), revealing that both models handle Arabic dialects less well than MSA. Although we also explore and confirm the utility of employing GPT-4 as a potential alternative to human evaluation, our work adds to a growing body of research underscoring the limitations of ChatGPT.

JASMINE: Arabic GPT Models for Few-Shot Learning
El Moatez Billah Nagoudi | Muhammad Abdul-Mageed | AbdelRahim Elmadany | Alcides Inciarte | Md Tawkat Islam Khondaker
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Scholarship on generative pretraining (GPT) remains acutely Anglocentric, leaving serious gaps in our understanding of the whole class of autoregressive models. For example, we have little knowledge about the potential of these models and their societal impacts in diverse linguistic and cultural settings. We alleviate this issue for Arabic, a wide collection of languages and dialectal varieties with more than 400 million speakers, by introducing JASMINE. JASMINE is a suite of powerful Arabic autoregressive Transformer language models ranging in size from 300 million to 6.7 billion parameters, pretrained on a large and diverse dataset (~235 GB of text). We also carefully design and release a comprehensive benchmark for both automated and human evaluation of Arabic autoregressive models, with coverage of potential social biases, harms, and toxicity. Using our novel benchmark, we evaluate JASMINE extensively, showing powerful performance both intrinsically and in few-shot learning on a wide range of NLP tasks. We aim to responsibly release our models and evaluation benchmark to interested researchers, along with code for experimenting with them.

PACT: Pretraining with Adversarial Contrastive Learning for Text Classification
Md Tawkat Islam Khondaker | Muhammad Abdul-Mageed | Laks V. S. Lakshmanan
Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

Cross-Platform and Cross-Domain Abusive Language Detection with Supervised Contrastive Learning
Md Tawkat Islam Khondaker | Muhammad Abdul-Mageed | Laks V. S. Lakshmanan
The 7th Workshop on Online Abuse and Harms (WOAH)

The prevalence of abusive language on different online platforms has been a major concern, raising the need for automated cross-platform abusive language detection. However, prior work focuses on concatenating data from multiple platforms, implicitly adopting the Empirical Risk Minimization (ERM) approach. In this work, we address this challenge from the perspective of a domain generalization objective. We design SCL-Fish, a meta-learning algorithm integrated with supervised contrastive learning, to detect abusive language on unseen platforms. Our experimental analysis shows that SCL-Fish outperforms ERM and existing state-of-the-art models. We also show that SCL-Fish is data-efficient and achieves performance comparable to large-scale pretrained models finetuned for the abusive language detection task.

2022

A Benchmark Study of Contrastive Learning for Arabic Social Meaning
Md Tawkat Islam Khondaker | El Moatez Billah Nagoudi | AbdelRahim Elmadany | Muhammad Abdul-Mageed | Laks V. S. Lakshmanan
Proceedings of the Seventh Arabic Natural Language Processing Workshop (WANLP)

Contrastive learning (CL) has brought significant progress to various NLP tasks. Despite this progress, CL has not been applied to Arabic NLP. Nor is it clear how much benefit it could bring to particular classes of tasks such as social meaning (e.g., sentiment analysis, dialect identification, hate speech detection). In this work, we present a comprehensive benchmark study of state-of-the-art supervised CL methods on a wide array of Arabic social meaning tasks. Through an extensive empirical analysis, we show that CL methods outperform vanilla finetuning on most of the tasks. We also show that CL can be data-efficient and quantify this efficiency, demonstrating the promise of these methods in low-resource settings, depending on the particular downstream task (especially its label granularity).

Co-authors

Laks V. S. Lakshmanan 1

Samar Magdy 1

Numaan Naeem 1

Venues

EMNLP 3

ArabicNLP 2

WS 2

AACL 1

IJCNLP 1

WANLP 1

WOAH 1
