Shadi Tavakoli
2025
Boosting Sentiment Analysis in Persian through a GAN-Based Synthetic Data Augmentation Method
Masoumeh Mohammadi | Mohammad Ruhul Amin | Shadi Tavakoli
Proceedings of the 1st Workshop on NLP for Languages Using Arabic Script
This paper presents a novel Sentiment Analysis (SA) dataset for the low-resource Persian language, together with a data augmentation technique that uses Generative Adversarial Networks (GANs) to generate synthetic data, increasing the volume and variety of the data to achieve state-of-the-art performance. We propose a novel annotated SA dataset, called Senti-Persian, consisting of 67,743 public comments on movie reviews from Iranian websites (Namava, Filimo, and Aparat) and social media (YouTube, Twitter, and Instagram). These reviews are labeled with one of three polarity labels: positive, negative, or neutral. Our study includes a novel text augmentation model based on GANs. The generator was designed according to the linguistic properties of Persian, while the discriminator is based on the cosine similarity between the vectorized original and generated sentences, i.e., using BERT CLS embeddings. An SA task was applied to both the collected and the augmented datasets, and we observed a significant improvement in accuracy, from 88.4% for the original dataset to 96% when augmented with synthetic data. The Senti-Persian dataset, including both the original and the augmented data, will be available on GitHub.
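The similarity check described for the discriminator can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how the similarity score is used, so the acceptance rule, the `threshold` value, and the function names `cosine_similarity` and `accept_generated` are assumptions made for illustration. Short toy vectors stand in for the BERT CLS embeddings of an original and a generated sentence.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # A zero vector has no direction; treat it as dissimilar.
        return 0.0
    return dot / (norm_a * norm_b)


def accept_generated(
    original_vec: Sequence[float],
    generated_vec: Sequence[float],
    threshold: float = 0.8,  # hypothetical cut-off, not taken from the paper
) -> bool:
    """Keep a generated sentence only if its embedding is close to the original's."""
    return cosine_similarity(original_vec, generated_vec) >= threshold


# Toy stand-ins for CLS embeddings (real BERT-base vectors have 768 dimensions).
original = [0.2, 0.7, 0.1]
close_paraphrase = [0.25, 0.65, 0.12]
unrelated = [0.9, -0.3, 0.0]

print(accept_generated(original, close_paraphrase))
print(accept_generated(original, unrelated))
```

In practice the two vectors would come from a BERT encoder applied to the original and generated Persian sentences; the thresholded decision here is only one plausible way to turn the similarity score into a discriminator signal.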