Md. Arid Hasan

Also published as: Md Arid Hasan

2025

MemeIntel: Explainable Detection of Propagandistic and Hateful Memes
Mohamed Bayan Kmainasi | Abul Hasnat | Md Arid Hasan | Ali Ezzat Shahroor | Firoj Alam
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

The proliferation of multimodal content on social media presents significant challenges in understanding and moderating complex, context-dependent issues such as misinformation, hate speech, and propaganda. While efforts have been made to develop resources and propose new methods for automatic detection, limited attention has been given to label detection and the generation of explanation-based rationales for predicted labels. To address this challenge, we introduce MemeXplain, an explanation-enhanced dataset for propaganda memes in Arabic and hateful memes in English, making it the first large-scale resource for these tasks. To solve these tasks, we propose a novel multi-stage optimization approach and train Vision-Language Models (VLMs). Our results demonstrate that this approach significantly improves performance over the base model for both label detection and explanation generation, outperforming the current state-of-the-art with an absolute improvement of approximately 3% on ArMeme and 7% on Hateful Memes. For reproducibility and future research, we aim to make the MemeXplain dataset and scripts publicly available.

Natural Question Answering (QA) datasets play a crucial role in evaluating the capabilities of large language models (LLMs), ensuring their effectiveness in real-world applications. Despite the numerous QA datasets that have been developed and some work done in parallel, there is a notable lack of a framework and large-scale region-specific datasets queried by native users in their own languages. This gap hinders effective benchmarking and the development of fine-tuned models for regional and cultural specificities. In this study, we propose a scalable, language-independent framework, NativQA, to seamlessly construct culturally and regionally aligned QA datasets in native languages for LLM evaluation and tuning. We demonstrate the efficacy of the proposed framework by designing a multilingual natural QA dataset, MultiNativQA, consisting of approximately 64K manually annotated QA pairs in seven languages, ranging from high- to extremely low-resource, based on queries from native speakers from 9 regions covering 18 topics. We benchmark both open- and closed-source LLMs using the MultiNativQA dataset. The dataset and related experimental scripts are publicly available for the community at: https://huggingface.co/datasets/QCRI/MultiNativQA and https://gitlab.com/nativqa/multinativqa.

Arabic, with its rich diversity of dialects, remains significantly underrepresented in Large Language Models, particularly in dialectal variations. We address this gap by introducing seven synthetic datasets in dialects alongside Modern Standard Arabic (MSA), created using Machine Translation (MT) combined with human post-editing. We present AraDiCE, a benchmark for Arabic Dialect and Cultural Evaluation. We evaluate LLMs on dialect comprehension and generation, focusing specifically on low-resource Arabic dialects. Additionally, we introduce the first-ever fine-grained benchmark designed to evaluate cultural awareness across the Gulf, Egypt, and Levant regions, providing a novel dimension to LLM evaluation. Our findings demonstrate that while Arabic-specific models like Jais and AceGPT outperform multilingual models on dialectal tasks, significant challenges persist in dialect identification, generation, and translation. This work contributes ≈45K post-edited samples, a cultural benchmark, and highlights the importance of tailored training to improve LLM performance in capturing the nuances of diverse Arabic dialects and cultural contexts. We have released the dialectal translation models and benchmarks developed in this study (https://huggingface.co/datasets/QCRI/AraDiCE).

CultranAI at PalmX 2025: Data Augmentation for Cultural Knowledge Representation
Hunzalah Hassan Bhatti | Youssef Ahmed | Md Arid Hasan | Firoj Alam
Proceedings of The Third Arabic Natural Language Processing Conference: Shared Tasks

Overview of BLP-2025 Task 1: Bangla Hate Speech Identification
Md Arid Hasan | Firoj Alam | Md Fahad Hossain | Usman Naseem | Syed Ishtiaque Ahmed
Proceedings of the Second Workshop on Bangla Language Processing (BLP-2025)

Online discourse in Bangla is rife with nuanced toxicity expressed through code-mixing, dialectal variation, and euphemism. Effective moderation thus requires fine-grained detection of hate type, target, and severity, rather than a binary label. To address this, we organized the Bangla Hate Speech Identification Shared Task at the BLP 2025 workshop, co-located with IJCNLP-AACL 2025, comprising three subtasks: (1A) hate-type detection, (1B) hate-target detection, and (1C) joint prediction of type, target, and severity in a multi-task setup. The subtasks attracted 161, 103, and 90 participants, with 36, 23, and 20 final submissions, respectively, while a total of 19 teams submitted system description papers. The submitted systems employed a wide range of approaches, ranging from classical machine learning to fine-tuned pretrained models and zero-/few-shot LLMs. We describe the task setup, datasets, and evaluation framework, and summarize participant systems. All datasets and evaluation scripts are publicly released.

There has been significant research on propagandistic content detection across different modalities and languages. However, most studies have primarily focused on detection, with little attention given to explanations justifying the predicted label. This is largely due to the lack of resources that provide explanations alongside annotated labels. To address this issue, we propose a multilingual (i.e., Arabic and English) explanation-enhanced dataset, the first of its kind. Additionally, we introduce an explanation-enhanced LLM for both label detection and rationale-based explanation generation. Our findings indicate that the model performs comparably while also generating explanations. We will make the dataset and experimental resources publicly available for the research community (https://github.com/firojalam/PropXplain).

2024

We present an overview of the second edition of the ArAIEval shared task, organized as part of the ArabicNLP 2024 conference co-located with ACL 2024. In this edition, ArAIEval offers two tasks: (i) detection of propagandistic textual spans with persuasion techniques identification in tweets and news articles, and (ii) distinguishing between propagandistic and non-propagandistic memes. A total of 14 teams participated in the final evaluation phase, with 6 and 9 teams participating in Tasks 1 and 2, respectively. Finally, 11 teams submitted system description papers. Across both tasks, we observed that fine-tuning transformer models such as AraBERT was at the core of the majority of the participating systems. We provide a description of the task setup, including a description of the dataset construction and the evaluation setup. We further provide a brief overview of the participating systems. All datasets and evaluation scripts are released to the research community. We hope this will enable further research on these important tasks in Arabic.

ArMeme: Propagandistic Content in Arabic Memes
Firoj Alam | Abul Hasnat | Fatema Ahmad | Md. Arid Hasan | Maram Hasanain
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

With the rise of digital communication, memes have become a significant medium for cultural and political expression that is often used to mislead audiences. Identifying such misleading and persuasive multimodal content has become more important among various stakeholders, including social media platforms, policymakers, and the broader society, as it often causes harm to individuals, organizations, and/or society. While there have been efforts to develop AI-based automatic systems for resource-rich languages (e.g., English), there has been little to no such work for medium- to low-resource languages. In this study, we focused on developing an Arabic memes dataset with manual annotations of propagandistic content. We annotated ∼6K Arabic memes collected from various social media platforms, which is a first resource for Arabic multimodal research. We provide a comprehensive analysis aiming to develop computational tools for their detection. We made the dataset publicly available for the community.

Zero- and Few-Shot Prompting with LLMs: A Comparative Study with Fine-tuned Models for Bangla Sentiment Analysis
Md. Arid Hasan | Shudipta Das | Afiyat Anjum | Firoj Alam | Anika Anjum | Avijit Sarker | Sheak Rashed Haider Noori
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

The rapid expansion of the digital world has propelled sentiment analysis into a critical tool across diverse sectors such as marketing, politics, customer service, and healthcare. While there have been significant advancements in sentiment analysis for widely spoken languages, low-resource languages, such as Bangla, remain largely under-researched due to resource constraints. Furthermore, the recent unprecedented performance of Large Language Models (LLMs) in various applications highlights the need to evaluate them in the context of low-resource languages. In this study, we present a sizeable manually annotated dataset encompassing 33,606 Bangla news tweets and Facebook comments. We also investigate zero- and few-shot in-context learning with several language models, including Flan-T5, GPT-4, and Bloomz, offering a comparative analysis against fine-tuned models. Our findings suggest that monolingual transformer-based models consistently outperform other models, even in zero and few-shot scenarios. To foster continued exploration, we intend to make this dataset and our research tools publicly available to the broader research community.

2023

Z-Index at BLP-2023 Task 2: A Comparative Study on Sentiment Analysis
Prerona Tarannum | Md. Arid Hasan | Krishno Dey | Sheak Rashed Haider Noori
Proceedings of the First Workshop on Bangla Language Processing (BLP-2023)

In this study, we report our participation in Task 2 of the BLP-2023 shared task. The main objective of this task is to determine the sentiment (Positive, Neutral, or Negative) of a given text. We first removed URLs, hashtags, and other noise and then applied traditional and pretrained language models. We submitted multiple systems to the leaderboard; BanglaBERT with tokenized data provided the best result, and we ranked 5th in the competition with an F1-micro score of 71.64. Our study also reports that the importance of tokenization is lessening in the realm of pretrained language models. In further experiments, our evaluation shows that BanglaBERT outperforms the other models, and that predicting the neutral class is still challenging for all the models.

BLP-2023 Task 2: Sentiment Analysis
Md. Arid Hasan | Firoj Alam | Anika Anjum | Shudipta Das | Afiyat Anjum
Proceedings of the First Workshop on Bangla Language Processing (BLP-2023)

We present an overview of the BLP Sentiment Shared Task, organized as part of the inaugural BLP 2023 workshop, co-located with EMNLP 2023. The task is defined as the detection of sentiment in a given piece of social media text. This task attracted interest from 71 participants, among whom 29 and 30 teams submitted systems during the development and evaluation phases, respectively. In total, participants submitted 597 runs. However, only 15 teams submitted system description papers. The approaches in the submitted systems range from classical machine learning models and fine-tuned pre-trained models to Large Language Models (LLMs) in zero- and few-shot settings. In this paper, we provide a detailed account of the task setup, including dataset development and evaluation setup. Additionally, we provide a succinct overview of the systems submitted by the participants. All datasets and evaluation scripts from the shared task have been made publicly available for the research community, to foster further research in this domain.

Semantics Squad at BLP-2023 Task 2: Sentiment Analysis of Bangla Text with Fine Tuned Transformer Based Models
Krishno Dey | Md. Arid Hasan | Prerona Tarannum | Francis Palma
Proceedings of the First Workshop on Bangla Language Processing (BLP-2023)

Sentiment analysis (SA) is a crucial task in natural language processing, especially in contexts with a variety of linguistic features, like Bangla. We participated in BLP-2023 Shared Task 2 on SA of Bangla text. We investigated the performance of six transformer-based models for SA in Bangla on the shared task dataset. We fine-tuned these models and conducted a comprehensive performance evaluation. We ranked 20th on the leaderboard of the shared task with a blind submission that used BanglaBERT Small. BanglaBERT outperformed other models with 71.33% accuracy, and the closest model was BanglaBERT Large, with an accuracy of 70.90%. BanglaBERT consistently outperformed others, demonstrating the benefits of models developed using sizable datasets in Bangla.

Semantics Squad at BLP-2023 Task 1: Violence Inciting Bangla Text Detection with Fine-Tuned Transformer-Based Models
Krishno Dey | Prerona Tarannum | Md. Arid Hasan | Francis Palma
Proceedings of the First Workshop on Bangla Language Processing (BLP-2023)

This study investigates the application of Transformer-based models for violence threat identification. We participated in the BLP-2023 Shared Task 1, and in our initial submission, BanglaBERT large achieved 5th position on the leaderboard with a macro F1 score of 0.7441, approaching the highest baseline of 0.7879 established for this task. In contrast, the top-performing system on the leaderboard achieved an F1 score of 0.7604. Subsequent experiments involving m-BERT, XLM-RoBERTa base, XLM-RoBERTa large, BanglishBERT, BanglaBERT, and BanglaBERT large models revealed that BanglaBERT achieved an F1 score of 0.7441, which closely approximated the baseline. Remarkably, m-BERT and XLM-RoBERTa base also approximated the baseline with macro F1 scores of 0.6584 and 0.6968, respectively. A notable finding from our study is the underperformance of larger models on the shared task dataset, which requires further investigation. Our findings underscore the potential of transformer-based models in identifying violence threats, offering valuable insights to enhance safety measures on online platforms.

2022

SemEval-2022 Task 3: PreTENS-Evaluating Neural Networks on Presuppositional Semantic Knowledge
Roberto Zamparelli | Shammur Chowdhury | Dominique Brunato | Cristiano Chesi | Felice Dell’Orletta | Md. Arid Hasan | Giulia Venturi
Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022)

We report the results of the SemEval 2022 Task 3, PreTENS, on evaluating the acceptability of simple sentences containing constructions whose two arguments are presupposed to be or not to be in an ordered taxonomic relation. The task featured two sub-tasks: (i) a binary prediction task and (ii) a regression task, predicting acceptability on a continuous scale. The sentences were artificially generated in three languages (English, Italian and French). 21 systems, with 8 system papers, were submitted for the task, all based on various types of fine-tuned transformer systems, often with ensemble methods and various data augmentation techniques. The best systems reached an F1-macro score of 94.49 (sub-task1) and a Spearman correlation coefficient of 0.80 (sub-task2), with interesting variations in specific constructions and/or languages.
