Generative Data Augmentation for Improving Semantic Classification

Shadman Rohan; Mahmud Elahi Akhter; Ibraheem Muhammad Moosa; Nabeel Mohammed; Amin Ahsan Ali; Akmmahbubur Rahman

Abstract

We study sentence-level generative data augmentation for Bangla semantic classification across four public datasets and three pretrained model families (BanglaBERT, XLM-Indic, mBERT). We evaluate two widely used, reproducible techniques—paraphrasing (mT5-based) and round-trip backtranslation (Bn–En–Bn)—and analyze their impact under realistic class imbalance. Overall, augmentation often helps, but gains are tightly coupled to label quality: paraphrasing typically outperforms backtranslation and yields the most consistent improvements for the monolingual model, whereas multilingual encoders benefit less and can be more sensitive to noisy minority-class expansions. A key empirical observation is that the neutral class appears to be a major source of annotation noise, which degrades decision boundaries and can cap the benefits of augmentation even when positive/negative classes are clean and polarized. We provide practical guidance for Bangla sentiment pipelines: (i) use simple sentence-level augmentation to rebalance classes when labels are reliable; (ii) allocate additional curation and higher inter-annotator agreement targets to the neutral class. Our results indicate when augmentation helps and suggest that data quality—not model choice alone—can become the limiting factor.
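The rebalancing recommendation in (i) can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names `rebalance_with_augmentation` and `round_trip` are hypothetical, and the augmenter is a placeholder callable standing in for an mT5 paraphraser or a Bn–En–Bn translation pair, neither of which is loaded here. The sketch tops up every minority class to the size of the largest class by augmenting randomly chosen sentences from that class.

```python
import random
from collections import Counter
from typing import Callable, Dict, List, Tuple

Example = Tuple[str, str]  # (text, label)


def round_trip(text: str,
               bn_to_en: Callable[[str], str],
               en_to_bn: Callable[[str], str]) -> str:
    """Round-trip backtranslation: Bangla -> English -> Bangla.

    Both translators are caller-supplied callables (assumed, not provided here).
    """
    return en_to_bn(bn_to_en(text))


def rebalance_with_augmentation(data: List[Example],
                                augment: Callable[[str], str],
                                seed: int = 0) -> List[Example]:
    """Return data plus augmented copies so every class matches the largest one.

    `augment` is any sentence-level augmenter (paraphraser or backtranslator).
    Augmented sentences inherit the label of their source sentence, so noisy
    source labels are replicated as well.
    """
    rng = random.Random(seed)
    counts = Counter(label for _, label in data)
    target = max(counts.values())

    # Group original sentences by label so we only sample real data.
    by_label: Dict[str, List[str]] = {}
    for text, label in data:
        by_label.setdefault(label, []).append(text)

    balanced = list(data)
    for label, texts in by_label.items():
        for _ in range(target - len(texts)):
            balanced.append((augment(rng.choice(texts)), label))
    return balanced


if __name__ == "__main__":
    toy = [("a", "positive"), ("b", "positive"), ("c", "positive"),
           ("d", "negative")]
    # Placeholder augmenter; a real one would call a paraphrasing model.
    result = rebalance_with_augmentation(toy, lambda t: t + " (aug)")
    print(Counter(label for _, label in result))
```

Because each augmented sentence copies its source label, expanding a class with unreliable labels (the neutral class, per the abstract) multiplies that noise, which is why the guidance ties rebalancing to label reliability.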

Anthology ID: 2025.banglalp-1.28
Volume: Proceedings of the Second Workshop on Bangla Language Processing (BLP-2025)
Month: December
Year: 2025
Address: Mumbai, India
Editors: Firoj Alam, Sudipta Kar, Shammur Absar Chowdhury, Naeemul Hassan, Enamul Hoque Prince, Mohiuddin Tasnim, Md Rashad Al Hasan Rony, Md Tahmid Rahman Rahman
Venues: BanglaLP | WS
Publisher: Association for Computational Linguistics
Pages: 347–356
URL: https://aclanthology.org/2025.banglalp-1.28/
Cite (ACL): Shadman Rohan, Mahmud Elahi Akhter, Ibraheem Muhammad Moosa, Nabeel Mohammed, Amin Ahsan Ali, and Akmmahbubur Rahman. 2025. Generative Data Augmentation for Improving Semantic Classification. In Proceedings of the Second Workshop on Bangla Language Processing (BLP-2025), pages 347–356, Mumbai, India. Association for Computational Linguistics.
Cite (Informal): Generative Data Augmentation for Improving Semantic Classification (Rohan et al., BanglaLP 2025)
PDF: https://aclanthology.org/2025.banglalp-1.28.pdf
