BnSentMix: A Diverse Bengali-English Code-Mixed Dataset for Sentiment Analysis

Sadia Alam; Md Farhan Ishmam; Navid Hasin Alvee; Md Shahnewaz Siddique; Md Azam Hossain; Abu Raihan Mostofa Kamal

Abstract

The widespread availability of code-mixed data in digital spaces can provide valuable insights into low-resource languages like Bengali, which have limited annotated corpora. Sentiment analysis, a pivotal text classification task, has been explored across multiple languages, yet code-mixed Bengali remains underrepresented with no large-scale, diverse benchmark. Code-mixed text is particularly challenging as it requires the understanding of multiple languages and their interaction in the same text. We address this limitation by introducing BnSentMix, a sentiment analysis dataset on code-mixed Bengali comprising 20,000 samples with 4 sentiment labels, sourced from Facebook, YouTube, and e-commerce sites. By aggregating multiple sources, we ensure linguistic diversity reflecting realistic code-mixed scenarios. We implement a novel automated text filtering pipeline using fine-tuned language models to detect code-mixed samples and expand code-mixed text corpora. We further propose baselines using machine learning, neural networks, and transformer-based language models. The availability of a diverse dataset is a critical step towards democratizing NLP and ultimately contributing to a better understanding of code-mixed languages.

Anthology ID: 2025.loreslm-1.4
Volume: Proceedings of the First Workshop on Language Models for Low-Resource Languages
Month: January
Year: 2025
Address: Abu Dhabi, United Arab Emirates
Editors: Hansi Hettiarachchi, Tharindu Ranasinghe, Paul Rayson, Ruslan Mitkov, Mohamed Gaber, Damith Premasiri, Fiona Anting Tan, Lasitha Uyangodage
Venues: LoResLM | WS
Publisher: Association for Computational Linguistics
Pages: 68–77
URL: https://aclanthology.org/2025.loreslm-1.4/
Cite (ACL): Sadia Alam, Md Farhan Ishmam, Navid Hasin Alvee, Md Shahnewaz Siddique, Md Azam Hossain, and Abu Raihan Mostofa Kamal. 2025. BnSentMix: A Diverse Bengali-English Code-Mixed Dataset for Sentiment Analysis. In Proceedings of the First Workshop on Language Models for Low-Resource Languages, pages 68–77, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
Cite (Informal): BnSentMix: A Diverse Bengali-English Code-Mixed Dataset for Sentiment Analysis (Alam et al., LoResLM 2025)
PDF: https://aclanthology.org/2025.loreslm-1.4.pdf
Optional supplementary material: 2025.loreslm-1.4.OptionalSupplementaryMaterial.zip
