Shrestha Datta
2026
BanglaIPA: Towards Robust Text-to-IPA Transcription with Contextual Rewriting in Bengali
Jakir Hasan | Shrestha Datta | Md Saiful Islam | Shubhashis Roy Dipta | Ameya Debnath
Proceedings of the Second Workshop on Language Models for Low-Resource Languages (LoResLM 2026)
Despite its widespread use, Bengali lacks a robust automated International Phonetic Alphabet (IPA) transcription system that effectively supports both standard and regional dialectal texts. Existing approaches struggle with regional variations and numerical expressions, and they generalize poorly to previously unseen words. To address these limitations, we propose BanglaIPA, a novel IPA generation system that integrates a character-based vocabulary with word-level alignment. The proposed system accurately handles Bengali numerals and demonstrates strong performance across regional dialects. BanglaIPA improves inference efficiency by leveraging a precomputed word-to-IPA mapping dictionary for previously observed words. The system is evaluated on standard Bengali and six regional variations from the DUAL-IPA dataset. Experimental results show that BanglaIPA outperforms baseline IPA transcription models by 58.4-78.7% and achieves an overall mean word error rate of 11.4%, highlighting its robustness in phonetic transcription generation for the Bengali language.
2025
From Scarcity to Capability: Empowering Fake News Detection in Low-Resource Languages with LLMs
Hrithik Majumdar Shibu | Shrestha Datta | Md. Sumon Miah | Nasrullah Sami | Mahruba Sharmin Chowdhury | Md Saiful Islam
Proceedings of the First Workshop on Natural Language Processing for Indo-Aryan and Dravidian Languages
The rapid spread of fake news presents a significant global challenge, particularly in low-resource languages like Bangla, which lack adequate datasets and detection tools. Although manual fact-checking is accurate, it is too expensive and slow to prevent the dissemination of fake news. Addressing this gap, we introduce BanFakeNews-2.0, a robust dataset to enhance Bangla fake news detection. This version includes 11,700 additional, meticulously curated fake news articles validated from credible sources, creating a proportional dataset of 47,000 authentic and 13,000 fake news items across 13 categories. In addition, we created a manually curated independent test set of 460 fake and 540 authentic news items for rigorous evaluation. We collected fake news from credible sources and manually verified it while preserving its linguistic richness. We develop a benchmark system utilizing transformer-based architectures, including fine-tuned Bidirectional Encoder Representations from Transformers variants (F1 of 87%) and Large Language Models with Quantized Low-Rank Approximation (F1 of 89%), that significantly outperforms traditional methods. BanFakeNews-2.0 offers a valuable resource to advance research and application in fake news detection for low-resource languages. We publicly release our dataset and model on GitHub to foster research in this direction.
2023
SUST_Black Box at BLP-2023 Task 1: Detecting Communal Violence in Texts: An Exploration of MLM and Weighted Ensemble Techniques
Hrithik Shibu | Shrestha Datta | Zhalok Rahman | Shahrab Sami | Md. Sumon Miah | Raisa Fairooz | Md Mollah
Proceedings of the First Workshop on Bangla Language Processing (BLP-2023)
In this study, we address the shared task of classifying violence-inciting texts from YouTube comments related to violent incidents in the Bengal region. We applied domain adaptation by fine-tuning pre-existing Masked Language Models (MLMs) on a diverse array of informal texts. We combined Transfer Learning, Stacking, and Ensemble techniques to enhance our model’s performance. Our integrated system, combining the BanglaBERT model refined through MLM and our Weighted Ensemble approach, achieved macro F1 scores of 71% and 72%, respectively, while the MLM approach secured the 18th position among participants. This underscores the robustness and precision of our proposed approach in detecting and categorizing violent narratives in online text.