Boosting Sentiment Analysis in Persian through a GAN-Based Synthetic Data Augmentation Method

Masoumeh Mohammadi; Mohammad Ruhul Amin; Shadi Tavakoli

Abstract

This paper presents a novel Sentiment Analysis (SA) dataset for the low-resource Persian language, together with a data augmentation technique that uses Generative Adversarial Networks (GANs) to generate synthetic data, increasing the volume and variety of the data in order to achieve state-of-the-art performance. We propose a novel annotated SA dataset, called Senti-Persian, consisting of 67,743 public comments on movie reviews from Iranian websites (Namava, Filimo and Aparat) and social media (YouTube, Twitter and Instagram). Each review is labeled with one of three polarity labels: positive, negative, or neutral. Our study includes a novel text augmentation model based on GANs. The generator was designed following the linguistic properties of Persian, while the discriminator was designed based on the cosine similarity of the vectorized original and generated sentences, i.e. using the CLS embeddings of BERT. An SA task was applied to both the collected and the augmented datasets, and we observed a significant improvement in accuracy, from 88.4% on the original dataset to 96% when augmented with synthetic data. The Senti-Persian dataset, including both the original and the augmented data, will be available on GitHub.
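The similarity check that the abstract attributes to the discriminator can be illustrated with a minimal sketch. It assumes the sentence vectors (for example, BERT CLS embeddings) have already been computed and are passed in as plain lists of floats. The function names and the `threshold` value are hypothetical and are not taken from the paper, which does not state how the similarity score is turned into a decision here.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Similarity is undefined for a zero vector; treat it as no similarity.
        return 0.0
    return dot / (norm_u * norm_v)


def accept_generated(
    original_vec: Sequence[float],
    generated_vec: Sequence[float],
    threshold: float = 0.8,  # hypothetical cut-off, not from the paper
) -> bool:
    """Keep a generated sentence only if its vector is close to the original's."""
    return cosine_similarity(original_vec, generated_vec) >= threshold
```

In a fuller pipeline, the two vectors would come from an encoder applied to the original and the generated sentence. A fixed threshold is only one possible way to use the score.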

Anthology ID: 2025.abjadnlp-1.7
Volume: Proceedings of the 1st Workshop on NLP for Languages Using Arabic Script
Month: January
Year: 2025
Address: Abu Dhabi, UAE
Editor: Mo El-Haj
Venues: AbjadNLP | WS
Publisher: Association for Computational Linguistics
Pages: 54–63
URL: https://aclanthology.org/2025.abjadnlp-1.7/
Cite (ACL): Masoumeh Mohammadi, Mohammad Ruhul Amin, and Shadi Tavakoli. 2025. Boosting Sentiment Analysis in Persian through a GAN-Based Synthetic Data Augmentation Method. In Proceedings of the 1st Workshop on NLP for Languages Using Arabic Script, pages 54–63, Abu Dhabi, UAE. Association for Computational Linguistics.
Cite (Informal): Boosting Sentiment Analysis in Persian through a GAN-Based Synthetic Data Augmentation Method (Mohammadi et al., AbjadNLP 2025)
PDF: https://aclanthology.org/2025.abjadnlp-1.7.pdf
