Md Fahim


2024

BanglaTLit: A Benchmark Dataset for Back-Transliteration of Romanized Bangla
Md Fahim | Fariha Tanjim Shifat | Fabiha Haider | Deeparghya Dutta Barua | MD Sakib Ul Rahman Sourove | Md Farhan Ishmam | Md Farhad Alam Bhuiyan
Findings of the Association for Computational Linguistics: EMNLP 2024

Low-resource languages like Bangla are severely limited by the lack of datasets. Romanized Bangla text is ubiquitous on the internet and offers a rich, largely untapped source of data for Bangla NLP tasks. However, due to its informal nature, romanized text often lacks the structure and consistency needed to provide insights. We address these challenges by proposing: (1) BanglaTLit, a large-scale romanized Bangla back-transliteration dataset consisting of 42.7k samples; (2) BanglaTLit-PT, a pre-training corpus of 245.7k romanized Bangla samples; (3) encoders further pre-trained on BanglaTLit-PT that achieve state-of-the-art performance on several romanized Bangla classification tasks; and (4) multiple back-transliteration baselines, including a novel encoder-decoder architecture that uses the further pre-trained encoders. Our results show the potential of automated Bangla back-transliteration for utilizing the untapped sources of romanized Bangla to enrich the language. The code and datasets are publicly available: https://github.com/farhanishmam/BanglaTLit.
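As an illustration of the baseline family described above, here is a minimal sketch of pairing a further pre-trained encoder with a warm-started decoder for back-transliteration via Hugging Face's EncoderDecoderModel; the encoder checkpoint path is a placeholder, and initializing the decoder from BanglaBERT is an assumption, not the paper's exact configuration.

    from transformers import AutoTokenizer, EncoderDecoderModel

    enc_ckpt = "path/to/further-pretrained-encoder"   # placeholder path
    dec_ckpt = "csebuetnlp/banglabert"                # assumed decoder warm-start

    enc_tok = AutoTokenizer.from_pretrained(enc_ckpt)
    dec_tok = AutoTokenizer.from_pretrained(dec_ckpt)
    model = EncoderDecoderModel.from_encoder_decoder_pretrained(enc_ckpt, dec_ckpt)

    # Generation settings required when warm-starting an encoder-decoder
    model.config.decoder_start_token_id = dec_tok.cls_token_id
    model.config.pad_token_id = dec_tok.pad_token_id

    src = enc_tok("ami tomake bhalobashi", return_tensors="pt")          # romanized input
    tgt = dec_tok("আমি তোমাকে ভালোবাসি", return_tensors="pt").input_ids   # Bangla target
    loss = model(input_ids=src.input_ids,
                 attention_mask=src.attention_mask,
                 labels=tgt).loss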

2023

Contextual Bangla Neural Stemmer: Finding Contextualized Root Word Representations for Bangla Words
Md Fahim | Amin Ahsan Ali | M Ashraful Amin | Akmmahbubur Rahman
Proceedings of the First Workshop on Bangla Language Processing (BLP-2023)

Stemmers are commonly used in NLP to reduce words to their root form. However, this process may discard important information and yield incorrect root forms, affecting the accuracy of downstream NLP tasks. To address these limitations, we propose a Contextual Bangla Neural Stemmer for the Bangla language that enhances word representations. Our method splits words into characters within a Neural Stemming Block, obtaining vector representations for both stem words and unknown vocabulary words. A loss function aligns these representations with Word2Vec representations, followed by contextual word representations from a Universal Transformer encoder. Mean pooling generates sentence-level representations that are aligned with BanglaBERT’s representations via an MLP layer. The proposed model also aims to build good representations for out-of-vocabulary (OOV) words. Experiments on five Bangla datasets show around a 5% average improvement over the vanilla approach. Notably, our method avoids retraining BERT, focusing instead on root-word detection and addressing OOV and sub-word issues. By incorporating our approach into a large corpus-based language model, we expect further improvements in aspects like explainability.
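A minimal sketch of the character-level stemming idea, assuming a GRU over character embeddings and an MSE alignment loss against Word2Vec vectors; the dimensions and the GRU choice are illustrative, not the paper's exact architecture.

    import torch
    import torch.nn as nn

    class NeuralStemmingBlock(nn.Module):
        # Character-level word encoder; its output is aligned to pretrained
        # Word2Vec vectors with an MSE loss (all dimensions are assumptions).
        def __init__(self, n_chars, char_dim=64, word_dim=300):
            super().__init__()
            self.char_emb = nn.Embedding(n_chars, char_dim, padding_idx=0)
            self.char_rnn = nn.GRU(char_dim, word_dim, batch_first=True)

        def forward(self, char_ids):            # (batch, max_word_len)
            h, _ = self.char_rnn(self.char_emb(char_ids))
            return h[:, -1]                     # final state as the word vector

    block = NeuralStemmingBlock(n_chars=80)     # assumed character inventory size
    char_ids = torch.randint(1, 80, (4, 12))    # 4 words, 12 characters each
    stem_vecs = block(char_ids)
    w2v_targets = torch.randn(4, 300)           # stand-in for Word2Vec vectors
    align_loss = nn.functional.mse_loss(stem_vecs, w2v_targets)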

Investigating the Effectiveness of Graph-based Algorithm for Bangla Text Classification
Farhan Dehan | Md Fahim | Amin Ahsan Ali | M Ashraful Amin | Akmmahbubur Rahman
Proceedings of the First Workshop on Bangla Language Processing (BLP-2023)

In this study, we examine and analyze the behavior of several graph-based models for Bangla text classification tasks. Graph-based algorithms create heterogeneous graphs from text data, where each node represents either a word or a document and each edge indicates a relationship between two words or between a word and a document. We applied the BERT model and several graph-based models, including TextGCN, GAT, BertGAT, and BertGCN, to five Bangla text datasets: SentNoB, Sarcasm detection, BanFakeNews, Hate speech detection, and Emotion detection. The BERT model outperformed the TextGCN and GAT models by a large margin in terms of accuracy, macro F1 score, and weighted F1 score. BertGCN and BertGAT are shown to outperform both the standalone graph models and the BERT model. BertGAT excelled on the Emotion detection dataset and achieved a 1%-2% performance boost over BERT on the Sarcasm detection, Hate speech detection, and BanFakeNews datasets, whereas BertGCN outperformed BertGAT by 1% on the SentNoB and BanFakeNews datasets and by 2% on the Sarcasm detection, Hate speech detection, and Emotion detection datasets. We also examined different variations of the graph structure and analyzed their effects.
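For concreteness, here is a small sketch of the heterogeneous graph construction the abstract refers to, with TF-IDF word-document edge weights and window-based positive PMI word-word weights, following the common TextGCN recipe (Yao et al., 2019); the corpus is a placeholder.

    import math
    from collections import Counter
    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = ["bangla comment one", "another bangla comment"]   # placeholder corpus
    tfidf = TfidfVectorizer().fit(docs)
    doc_word_edges = tfidf.transform(docs)      # word-document edge weights

    def pmi_word_edges(docs, window=10):
        # Positive PMI over sliding co-occurrence windows -> word-word edges.
        word_cnt, pair_cnt, n_windows = Counter(), Counter(), 0
        for doc in docs:
            toks = doc.split()
            for i in range(max(1, len(toks) - window + 1)):
                win = set(toks[i:i + window])
                n_windows += 1
                word_cnt.update(win)
                pair_cnt.update((a, b) for a in win for b in win if a < b)
        edges = {}
        for (a, b), c in pair_cnt.items():
            pmi = math.log(c * n_windows / (word_cnt[a] * word_cnt[b]))
            if pmi > 0:
                edges[(a, b)] = pmi
        return edges

    word_word_edges = pmi_word_edges(docs)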

BaTEClaCor: A Novel Dataset for Bangla Text Error Classification and Correction
Nabilah Oshin | Syed Hoque | Md Fahim | Amin Ahsan Ali | M Ashraful Amin | Akmmahbubur Rahman
Proceedings of the First Workshop on Bangla Language Processing (BLP-2023)

In the dynamic realm of Bangla online communication, users often bend the language or make errors due to various factors. We attempt to detect, categorize, and correct those errors by employing several machine learning and deep learning models. To contribute to the preservation and authenticity of the Bangla language, we introduce a meticulously categorized organic dataset encompassing 10,000 authentic Bangla comments from a commonly used social media platform. Through a rigorous comparative analysis of distinct models, our study highlights BanglaBERT’s superiority in error-category classification and underscores the effectiveness of BanglaT5 for text correction. When fine-tuned and tested on our proposed dataset, BanglaBERT achieves accuracies of 79.1% and 74.1% for binary and multiclass error-category classification, respectively. Moreover, BanglaT5 achieves the best ROUGE-L score (0.8459) when fine-tuned and tested against our corrected ground truths. Beyond algorithmic exploration, this endeavor represents a significant stride in enhancing the quality of digital discourse in the Bangla-speaking community, fostering linguistic precision and coherence in online interactions. The dataset and code are available at https://github.com/SyedT1/BaTEClaCor.
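A hedged sketch of the resulting two-stage detect-then-correct pipeline: a fine-tuned BanglaBERT classifier flags the error category and a fine-tuned BanglaT5 model generates the correction. The checkpoint paths and the "no error" class index are placeholders/assumptions.

    import torch
    from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                              AutoModelForSeq2SeqLM)

    clf_tok = AutoTokenizer.from_pretrained("path/to/banglabert-error-clf")   # placeholder
    clf = AutoModelForSequenceClassification.from_pretrained("path/to/banglabert-error-clf")
    t5_tok = AutoTokenizer.from_pretrained("path/to/banglat5-corrector")      # placeholder
    t5 = AutoModelForSeq2SeqLM.from_pretrained("path/to/banglat5-corrector")

    comment = "<noisy Bangla comment>"
    with torch.no_grad():
        logits = clf(**clf_tok(comment, return_tensors="pt")).logits
    if logits.argmax(-1).item() != 0:           # assumption: class 0 = "no error"
        out = t5.generate(**t5_tok(comment, return_tensors="pt"), max_new_tokens=64)
        comment = t5_tok.decode(out[0], skip_special_tokens=True)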

Aambela at BLP-2023 Task 1: Focus on UNK tokens: Analyzing Violence Inciting Bangla Text with Adding Dataset Specific New Word Tokens
Md Fahim
Proceedings of the First Workshop on Bangla Language Processing (BLP-2023)

BLP-2023 Task 1 aims to develop systems for detecting and analyzing violence-inciting Bangla YouTube comments. Bangla language models like BanglaBERT have demonstrated remarkable performance on various Bangla natural language processing tasks across different domains. We utilized BanglaBERT for the violence detection task, employing three different classification heads. Because BanglaBERT’s vocabulary lacks certain crucial words, our model incorporates some of them as new special tokens, selected by their frequency in the dataset, and their embeddings are learned during training. The model achieved 2nd position on the leaderboard with a macro-F1 score of 76.04% on the official test set. With the addition of the new tokens, we achieved a 76.90% macro-F1 score, surpassing the top score (76.044%) on the test set.
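A minimal sketch of the vocabulary-extension step described above, using the Hugging Face API; the word list and the three-way label count are placeholders/assumptions.

    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    tokenizer = AutoTokenizer.from_pretrained("csebuetnlp/banglabert")
    model = AutoModelForSequenceClassification.from_pretrained(
        "csebuetnlp/banglabert", num_labels=3)  # three-way violence labels (assumption)

    frequent_unk_words = ["<word-1>", "<word-2>"]   # chosen by dataset frequency (placeholder)
    tokenizer.add_tokens(frequent_unk_words)
    model.resize_token_embeddings(len(tokenizer))   # new embedding rows are randomly
                                                    # initialized and learned in fine-tuning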

Aambela at BLP-2023 Task 2: Enhancing BanglaBERT Performance for Bangla Sentiment Analysis Task with In Task Pretraining and Adversarial Weight Perturbation
Md Fahim
Proceedings of the First Workshop on Bangla Language Processing (BLP-2023)

This paper introduces the top-performing approach of “Aambela” for BLP-2023 Task 2: “Sentiment Analysis of Bangla Social Media Posts”. The objective of the task was to create systems capable of automatically detecting sentiment in Bangla text from diverse social media posts. My approach comprised fine-tuning a Bangla language model with three distinct classification heads. To enhance performance, we employed two robust text classification techniques: in-task pretraining and adversarial weight perturbation. To arrive at a final prediction, we employed a mode-based ensemble of the predictions from different models, which ultimately resulted in 1st place in the competition.
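A minimal sketch of the mode-based (majority-vote) ensemble, with illustrative label predictions standing in for the actual model outputs.

    from statistics import mode

    # Per-model predicted labels for three examples (illustrative values).
    model_preds = [
        ["pos", "neg", "neu"],   # model A
        ["pos", "neg", "neg"],   # model B
        ["pos", "pos", "neu"],   # model C
    ]
    final = [mode(votes) for votes in zip(*model_preds)]
    print(final)                 # ['pos', 'neg', 'neu']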

EDAL: Entropy based Dynamic Attention Loss for HateSpeech Classification
Md Fahim | Amin Ahsan Ali | Md Ashraful Amin | Akm Mahbubur Rahman
Proceedings of the 37th Pacific Asia Conference on Language, Information and Computation