Md Farhan Ishmam


2025

BnSentMix: A Diverse Bengali-English Code-Mixed Dataset for Sentiment Analysis
Sadia Alam | Md Farhan Ishmam | Navid Hasin Alvee | Md Shahnewaz Siddique | Md Azam Hossain | Abu Raihan Mostofa Kamal
Proceedings of the First Workshop on Language Models for Low-Resource Languages

The widespread availability of code-mixed data in digital spaces can provide valuable insights into low-resource languages like Bengali, which have limited annotated corpora. Sentiment analysis, a pivotal text classification task, has been explored across multiple languages, yet code-mixed Bengali remains underrepresented with no large-scale, diverse benchmark. Code-mixed text is particularly challenging as it requires the understanding of multiple languages and their interaction in the same text. We address this limitation by introducing BnSentMix, a sentiment analysis dataset on code-mixed Bengali comprising 20,000 samples with 4 sentiment labels, sourced from Facebook, YouTube, and e-commerce sites. By aggregating multiple sources, we ensure linguistic diversity reflecting realistic code-mixed scenarios. We implement a novel automated text filtering pipeline using fine-tuned language models to detect code-mixed samples and expand code-mixed text corpora. We further propose baselines using machine learning, neural networks, and transformer-based language models. The availability of a diverse dataset is a critical step towards democratizing NLP and ultimately contributing to a better understanding of code-mixed languages.

2024

BanglaTLit: A Benchmark Dataset for Back-Transliteration of Romanized Bangla
Md Fahim | Fariha Tanjim Shifat | Fabiha Haider | Deeparghya Dutta Barua | MD Sakib Ul Rahman Sourove | Md Farhan Ishmam | Md Farhad Alam Bhuiyan
Findings of the Association for Computational Linguistics: EMNLP 2024

Low-resource languages like Bangla are severely limited by the lack of datasets. Romanized Bangla texts are ubiquitous on the internet, offering a rich source of data for Bangla NLP tasks and extending the available data sources. However, due to their informal nature, romanized texts often lack the structure and consistency needed to provide insights. We address these challenges by proposing: (1) BanglaTLit, a large-scale Bangla transliteration dataset consisting of 42.7k samples, (2) BanglaTLit-PT, a pre-training corpus on romanized Bangla with 245.7k samples, (3) encoders further pre-trained on BanglaTLit-PT achieving state-of-the-art performance in several romanized Bangla classification tasks, and (4) multiple back-transliteration baseline methods, including a novel encoder-decoder architecture using further pre-trained encoders. Our results show the potential of automated Bangla back-transliteration in utilizing the untapped sources of romanized Bangla to enrich this language. The code and datasets are publicly available: https://github.com/farhanishmam/BanglaTLit.