Cross-lingual sentiment classification in low-resource Bengali language

Salim Sazzed


Abstract
Sentiment analysis research in low-resource languages such as Bengali remains largely unexplored due to the scarcity of annotated data and the lack of text-processing tools. In this work, we therefore focus on generating resources and demonstrating the applicability of the cross-lingual sentiment analysis approach in Bengali. For benchmarking, we created and annotated a comprehensive corpus of around 12,000 Bengali reviews. To address the lack of standard text-processing tools in Bengali, we leverage resources from English through machine translation. We measure the performance of supervised machine learning (ML) classifiers on the machine-translated English corpus and compare it with their performance on the original Bengali corpus. In addition, we examine sentiment preservation in the machine-translated corpus using Cohen’s Kappa and Gwet’s AC1. To circumvent the laborious data labeling process, we explore lexicon-based methods and study the applicability of cross-domain labeled data from the resource-rich language. We find that supervised ML classifiers show comparable performance on the Bengali and machine-translated English corpora. By utilizing labeled data, they achieve 15%-20% higher F1 scores than both lexicon-based and transfer learning-based methods. We also observe that machine translation does not alter the sentiment polarity of the review in most cases. Our experimental results demonstrate that the machine-translation-based cross-lingual approach can be an effective way to perform sentiment classification in Bengali.
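
The abstract names Cohen's Kappa and Gwet's AC1 as the agreement measures used to check sentiment preservation. As a rough illustration of how these two statistics are computed for a pair of label sequences (for example, the sentiment assigned to a review in Bengali and to its machine translation), here is a minimal sketch using the standard two-rater formulas. The function name `agreement_scores` and the toy labels are assumptions made for this example; they are not taken from the paper or its released code.

```python
from collections import Counter


def agreement_scores(labels_a, labels_b):
    """Return (Cohen's kappa, Gwet's AC1) for two equal-length label lists.

    Hypothetical helper for illustration; not the paper's implementation.
    """
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("label lists must be non-empty and of equal length")

    n = len(labels_a)
    categories = sorted(set(labels_a) | set(labels_b))
    q = len(categories)
    if q < 2:
        raise ValueError("at least two distinct categories are required")

    # Observed agreement: fraction of items given the same label by both.
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    count_a = Counter(labels_a)
    count_b = Counter(labels_b)

    # Cohen's kappa: chance agreement from the product of the marginals.
    pe_kappa = sum((count_a[c] / n) * (count_b[c] / n) for c in categories)

    # Gwet's AC1: chance agreement from the averaged marginals.
    pi = {c: (count_a[c] + count_b[c]) / (2 * n) for c in categories}
    pe_ac1 = sum(pi[c] * (1 - pi[c]) for c in categories) / (q - 1)

    kappa = (p_observed - pe_kappa) / (1 - pe_kappa)
    ac1 = (p_observed - pe_ac1) / (1 - pe_ac1)
    return kappa, ac1


if __name__ == "__main__":
    # Toy labels (1 = positive, 0 = negative); made up for this sketch.
    original = [1, 1, 1, 0, 0, 1, 0, 1, 1, 0]
    translated = [1, 1, 0, 0, 0, 1, 0, 1, 1, 1]
    kappa, ac1 = agreement_scores(original, translated)
    print(f"Cohen's kappa: {kappa:.3f}, Gwet's AC1: {ac1:.3f}")
```

The two measures share the same observed agreement and differ only in how chance agreement is estimated, which is why AC1 tends to be less sensitive than kappa when one class dominates the labels.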
Anthology ID:
2020.wnut-1.8
Volume:
Proceedings of the Sixth Workshop on Noisy User-generated Text (W-NUT 2020)
Month:
November
Year:
2020
Address:
Online
Editors:
Wei Xu, Alan Ritter, Tim Baldwin, Afshin Rahimi
Venue:
WNUT
Publisher:
Association for Computational Linguistics
Pages:
50–60
URL:
https://aclanthology.org/2020.wnut-1.8
DOI:
10.18653/v1/2020.wnut-1.8
Cite (ACL):
Salim Sazzed. 2020. Cross-lingual sentiment classification in low-resource Bengali language. In Proceedings of the Sixth Workshop on Noisy User-generated Text (W-NUT 2020), pages 50–60, Online. Association for Computational Linguistics.
Cite (Informal):
Cross-lingual sentiment classification in low-resource Bengali language (Sazzed, WNUT 2020)
PDF:
https://aclanthology.org/2020.wnut-1.8.pdf
Code:
sazzadcsedu/bn-dataset
Data:
IMDb Movie Reviews