RSM-NLP at BLP-2023 Task 2: Bangla Sentiment Analysis using Weighted and Majority Voted Fine-Tuned Transformers

This paper describes our approach to submissions made at Shared Task 2 at BLP Workshop - Sentiment Analysis of Bangla Social Media Posts. Sentiment Analysis is an active research area in the digital age. With the rapid and constant growth of online social media sites and services and the increasing amount of textual data, the application of automatic Sentiment Analysis is on the rise. However, most of the research in this domain is based on the English language. Although Bangla is the world’s sixth most widely spoken language, little work has been done on it. This task aims to promote work on Bangla Sentiment Analysis while identifying the polarity of social media content by determining whether the sentiment expressed in the text is Positive, Negative, or Neutral. Our approach consists of experimenting with and fine-tuning various multilingual and pre-trained BERT-based models on our downstream task and using Majority Voting and Weighted ensemble models that outperform the individual baseline model scores. Our system scored 0.711 for the multiclass classification task and ranked 10th among the participants on the leaderboard for the shared task. Our code is available at https://github.com/ptnv-s/RSM-NLP-BLP-Task2 .


Introduction
In an era of rapidly growing social media platforms, blogs, and online reviews, sentiment analysis has become the need of the hour. Also known as opinion mining, sentiment analysis is a computational linguistic task aimed at determining whether a text carries a positive, negative, or neutral sentiment (Khan et al., 2020). Sentiment analysis has diverse uses, including preventing adolescent suicide by detecting cyberbullying and mitigating unjust actions that target specific communities through hate speech detection, among numerous other applications (Islam et al., 2020). Approximately 284.3 million people worldwide speak Bangla as their primary language. Bangla speakers increasingly engage on social media platforms like Instagram, Facebook, Reddit, and Twitter, and express opinions on microblogging platforms, news portals, and online shopping sites. However, manually analyzing the vast volumes of rapidly generated data in the digital age is tedious. This is where sentiment analysis can be applied (Hassan et al., 2016). Most sentiment analysis research focuses on English, leaving Bangla sentiment analysis in its nascent stages. Recently, some works have addressed this issue. However, none of these studies have fully embraced the different perspectives of Bangla.
To address this problem, we present our contributions to Shared Task 2 at BLP Workshop - Sentiment Analysis of Bangla Social Media Posts. This task aims to detect the polarity associated with a given social media text. This multiclass classification task involves determining whether the sentiment expressed in the text is Positive, Negative, or Neutral. For this problem statement, we have conducted various experiments using multilingual BERT models (Bhattacharjee et al., 2022; Sanh et al., 2019a; Das et al., 2022; Sarker, 2020) and various pre-trained transformers (Liu et al., 2019a) by fine-tuning them on the downstream task. We also apply Majority Voting and Weighted ensembling on the top-k models to show how these methods affect the models' performance and how an ensemble of these models performs better than the individual baselines.

Problem and Data Description
The EMNLP 2023 Bangla Workshop Task 2: Sentiment Analysis of Bangla Social Media Posts (Hasan et al., 2023a; Islam et al., 2021; Hasan et al., 2023b) aims to detect the polarity of the sentiment associated with a given text extracted from social media. From the entire set of labels, over 14,000 were classified as negative, approximately 12,000 as positive, and roughly 6,000 as neutral, as indicated in the distribution chart in Figure 1. A few samples of the dataset are shown in Table 1. The dataset includes the MUltiplatform BAngla SEntiment (MUBASE) dataset and the SentNob dataset (Islam et al., 2021). SentNob comprises public comments from social media on news and videos across 13 domains, such as agriculture, politics, and education. It is manually annotated with a moderate agreement score of 0.53. MUBASE, on the other hand, is a sizable compilation of multiplatform data, including Facebook posts and tweets, each manually tagged for sentiment polarity. These datasets provide a comprehensive and diverse landscape for studying Bangla sentiment analysis.

Sentiment Analysis
Sentiment analysis is an NLP task that uses computational methods to determine and extract the emotional tone expressed in a piece of text (Hogenboom et al., 2014). There are several different approaches to sentiment analysis. Early approaches primarily employed rule-based methods and lexicon-based techniques (Obaidat et al., 2015) to determine the sentiment context of texts. One significant area of application of sentiment analysis is social media posts, as in (Tang et al., 2014) and (Taboada et al., 2011); a sentiment lexicon with a linguistic rule-based approach was used to create a sentiment detection mechanism from tweets (Reckman et al.). Following this, contemporary advancements have introduced machine learning and deep learning techniques that significantly boost accuracy by extracting intricate patterns from annotated datasets. Sentiment analysis remains a challenging task because of the complexity of human language and the nuances of sentiment expression. The accuracy of the task may be improved by using larger datasets, more complex and fine-tuned models (Hassan et al., 2016), ensembling, etc. Modern approaches leverage large-scale Pretrained Language Models (PLMs), such as Transformers, BERTs (Devlin et al., 2018), and NLUs (Bender and Koller, 2020), alongside refined fine-tuning mechanisms (Hasan et al., 2023b). They excel at capturing the intricate associations between words within the text and their corresponding polarity. In today's world, with the introduction of free-to-use models like ChatGPT, sentiment analysis has opened up to new possibilities (Wang et al., 2023).

Bangla Language Processing
The Bangla language is the 7th most spoken language, with 265 million speakers worldwide (Sen et al., 2022). However, since English is the predominant language used for technical knowledge, journals, and documentation, many Bangla-speaking people face hurdles in utilizing these resources. Research on Bangla Natural Language Processing (BNLP) began in the early 1990s, focusing on rule-based lexical and morphological analysis (Alam et al., 2021). From the modeling perspective, most earlier endeavors are rule-based, statistical, or classical machine learning-based approaches (Kudo and Matsumoto, 2001). For sequence tagging tasks such as NER and G2P, algorithms including Hidden Markov Models (HMMs) (Brants, 2000), Conditional Random Fields (CRFs) (Lafferty et al., 2001), Maximum Entropy (ME) (Ratnaparkhi, 1996) and Maximum Entropy Markov Models (MEMMs) (McCallum et al., 2000) have been used successfully. Only very recently have a small number of studies explored deep learning-based approaches. As depicted in (Alam et al., 2021), there has been significant work in resource and model development in Bangla sentiment analysis. In (Das and Bandyopadhyay, 2010), the authors proposed a computational technique for generating an equivalent SentiWordNet (Bangla) from publicly available English sentiment lexicons and an English-Bangla bilingual dictionary with a few easily adaptable noise reduction techniques. With the introduction of BERT, however, many works focused on fine-tuning multilingual BERTs (Ashrafi et al., 2020; Das et al., 2021), with BanglaBERT (Sarker, 2020) being the first model pre-trained on a Bangla text corpus.

Bangla Sentiment Analysis
Sentiment analysis is a tool to extract the emotional tone of a text. It is used for cyberbullying detection, hate speech mitigation and market research. Bangla is the 7th most spoken language, and sentiment analysis for Bangla is still in its early stages. The first attempt to perform sentiment analysis in the context of Indian languages, including Bangla, was made as recently as 2015 (Patra et al., 2015). The lack of accurately annotated data is one of the biggest bottlenecks to advancing Bangla sentiment analysis. (Islam et al., 2021) and (Rahman et al., 2018) describe the creation of datasets for this purpose. A word2vec model was tuned with word co-occurrence scores for sentiment analysis in (Al-Amin et al., 2017), achieving an accuracy of 75.5%. In (Wahid et al., 2019), aspect-based sentiment analysis data was examined, with a reported accuracy of 95%. However, challenges were encountered when rephrasing common and proper nouns in Bangla. Across most studies, however, transformer models have consistently outperformed other algorithms and models, prompting a significant amount of research in the area. In (Chowdhury et al., 2019), opinion mining was conducted on a dataset of 4,000 manually translated Bangla movie reviews, with the objective of classifying them as positive or negative. The LSTM approach achieved an accuracy of 82.42%. A Bi-LSTM architecture was applied by (Sharfuddin et al., 2018) to a labeled dataset of 10,000 Facebook comments in Bangla, resulting in an accuracy of 85.67%. However, the study faced significant data preprocessing difficulties. In (Tripto and Ali, 2018), a combination of CNN and LSTM was employed to extract six distinct emotions from various types of Bangla YouTube video comments. The reported accuracies were 65.97% and 54.24% for three- and five-label sentiment classification, respectively. A common issue faced by authors using CNNs was that proper tuning between layers could not be achieved. In another study (Hossain et al., 2020), 1000 online restaurant reviews were collected from the Foodpanda website for SA, and a combined CNN and LSTM architecture with a 300-dimensional pretrained Word2Vec model was deployed, reaching a validation accuracy of 75.01%. (Rezaul Karim et al., 2020) developed a novel word embedding system for Bangla texts, BanglaFastText, incorporating it into a Multichannel Convolutional LSTM (MConv-LSTM). In (Islam et al., 2020), the authors performed SA on 1002 public comments from newspapers with the help of the BERT pretrained model and achieved an accuracy of 71% with a GRU on two-class sentiment. In (Hasan et al., 2020a), the performance of multiple classical machine learning algorithms and deep learning models was compared on several sentiment-labeled datasets, showing that pre-trained transformer models such as BERT and XLM-RoBERTa yielded the highest scores.

System Overview
We conducted extensive experiments for the given task involving Bangla sentiment analysis. We fine-tuned various multilingual and pre-trained transformer architectures, including BERT (Kenton and Toutanova, 2019), DistilBERT (Sanh et al., 2019b), RoBERTa (Liu et al., 2019b), and various pre-trained BERT models (Das et al., 2022; Sarker, 2020) on our downstream task of polarity classification. We shortlist the top-k models based on the performance metrics and ensemble their predictions using Majority Voted and Weighted ensembles.

Fine-Tuning Transformers
We used multiple transformer architectures to observe the effect of the model architecture and the pre-training dataset on the downstream task. For multiclass classification, we added a linear layer on top of each model as a classification head and fine-tuned the models on the task.
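To make the role of the classification head concrete, the following is a minimal sketch in plain Python, not the implementation used in our experiments. It assumes the encoder has already produced a pooled sentence vector; the names `make_head` and `classify`, the hidden size, the random initialization, and the label order are all hypothetical. In practice the head would be a trainable layer of a deep learning framework, updated jointly with the encoder during fine-tuning.

```python
import math
import random

# Label order is an assumption made for this illustration only.
LABELS = ["Negative", "Neutral", "Positive"]


def make_head(hidden_size, num_labels, seed=0):
    """Create a randomly initialised linear layer (weights and bias)."""
    rng = random.Random(seed)
    weights = [
        [rng.uniform(-0.1, 0.1) for _ in range(hidden_size)]
        for _ in range(num_labels)
    ]
    bias = [0.0] * num_labels
    return weights, bias


def softmax(logits):
    """Turn logits into probabilities, shifting by the max for stability."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(pooled, weights, bias):
    """Apply the linear head to a pooled encoder vector and pick a label."""
    logits = [
        sum(w * x for w, x in zip(row, pooled)) + b
        for row, b in zip(weights, bias)
    ]
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs
```

The head maps a hidden vector of any size to three polarity scores, which is why the same head design can sit on top of each of the encoders we compare.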
We have used various models for our experiments, including BERT (Kenton and Toutanova,  (Khanuja et al., 2021) and multilingual BERT models, trained to detect abusive speech using multiple datasets in 8 Indian languages.

Bengali-abusive-MuRIL (Das et al., 2022) is also fine-tuned from MuRIL (Khanuja et al., 2021), trained specifically on the Bangla abusive speech dataset. These have been referred to as HF-PT-BERT-1 and HF-PT-BERT-3 in Table 1, respectively. BanglaBERT (Bhattacharjee et al., 2022) is a fine-tuned ELECTRA (Clark et al., 2020) model which is trained on a Bangla Wikipedia dump dataset as well as data from 110 Bangla websites. BanglishBERT (Bhattacharjee et al., 2022) is similar to BanglaBERT; instead, it was trained on both English and Bangla data to allow zero-shot cross-lingual transfer.

Ensembling Predictions
To increase the overall performance of the predictions and the robustness of the predictive model, models were first individually tuned on the downstream task dataset. The predictions from these models were combined using the two ensembling methods on top-3, top-5, and all model predictions:
Majority Voting: The most frequently occurring prediction from all the models for each training instance was chosen as the final label.
Weighted: Each model was assigned a weight based on its accuracy score on the training dataset. Each model voted on the prediction class with its weight, and the prediction with the highest final vote was chosen as the final label.
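One form of the weighted rule that is consistent with this description can be written as follows. This is stated as an assumption, since any normalization of the weights is left unspecified; the class index c and the superscript notation are introduced here purely for illustration, and dividing by the sum of the weights would not change the selected class.

\[
y_i = \arg\max_{c} \sum_{j=1}^{k} a_j \, p_{ij}^{(c)}
\]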
Here, y_i denotes the Weighted ensemble prediction of the ith sample, p_ij the ith probabilistic prediction for each polarity made by the jth model, a_j the accuracy of the jth model on the training set, and k the number of models being considered for the ensemble.
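The two ensembling schemes can be sketched in a few lines of plain Python. This is an illustrative sketch rather than our released code: the function names, the tie-breaking rule for majority voting, and the use of unnormalized accuracy weights are assumptions.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine hard labels; predictions[j][i] is model j's label for sample i."""
    final = []
    for sample_votes in zip(*predictions):
        counts = Counter(sample_votes)
        top = max(counts.values())
        # Tie-break (an assumption): keep the tied label that appears
        # first in model order.
        final.append(next(v for v in sample_votes if counts[v] == top))
    return final


def weighted_vote(probabilities, accuracies):
    """Combine class probabilities weighted by each model's accuracy.

    probabilities[j][i][c] is model j's probability for class c on sample i;
    accuracies[j] is the weight a_j of model j. Returns one class index
    per sample.
    """
    n_samples = len(probabilities[0])
    n_classes = len(probabilities[0][0])
    final = []
    for i in range(n_samples):
        scores = [
            sum(a * probs[i][c] for probs, a in zip(probabilities, accuracies))
            for c in range(n_classes)
        ]
        final.append(max(range(n_classes), key=scores.__getitem__))
    return final
```

Hard weighted voting, where each model casts its full weight for a single class, is the special case in which every probability vector is one-hot.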

Experiments & Results
The dataset used for the task is organized in 3 columns: id, text, and label. It has also been partitioned into a train set with 35266 samples, a dev set with 3935 samples, and a dev-test set with 3427 samples. The distribution in the training set is shown in Figure 1.
The preprocessing pipeline before model training included tokenizing, padding, and truncating the text data to ensure uniformity and manage lengthy inputs. We used the AdamW optimizer with a learning rate of 2 × 10^-5 and a batch size of 32 over 32 epochs, chosen to strike a balance between convergence speed and stability. The data was tokenized with the Huggingface AutoTokenizer using a maximum sequence length of 512 tokens.
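The effect of padding and truncation on an already tokenized input can be illustrated with a small plain-Python sketch. It is not the Huggingface API: the function name `pad_and_truncate` and the padding id of 0 are hypothetical, and real tokenizers additionally preserve special tokens when truncating.

```python
PAD_ID = 0  # hypothetical padding token id, chosen for illustration


def pad_and_truncate(token_ids, max_length=512, pad_id=PAD_ID):
    """Clip a token id sequence to max_length, then right-pad it.

    Returns the fixed-length ids together with an attention mask that
    marks real tokens with 1 and padding with 0.
    """
    clipped = list(token_ids[:max_length])
    n_pad = max_length - len(clipped)
    input_ids = clipped + [pad_id] * n_pad
    attention_mask = [1] * len(clipped) + [0] * n_pad
    return input_ids, attention_mask
```

Every sequence in a batch then has the same length, and the attention mask tells the model which positions to ignore.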
We evaluated models using four metrics: accuracy, precision, recall, and F1-score. F1-score is a good metric for imbalanced datasets because it takes into account both precision and recall. The results of our experiments over the official test set are shown in Tables 2 and 3. For individual models, as shown in Table 2, we observe that DistilBERT and BanglaBERT(Base) show the best performance on the test data, with an F1-score of 0.701.
We built ensembles of both types (Majority-Voted and Weighted) with the top 3 models (BanglaBERT (Sarker, 2020), BanglishBERT, HF-PT-BERT-1 (Das et al., 2022)), the top 5 models (HF-PT-BERT-2, BanglishBERT, HF-PT-BERT-1 (Das et al., 2022), BanglaBERT(Base), HF-PT-BERT-3 (Das et al., 2022)) and, lastly, all the models. As shown in Table 3, the majority ensemble generally performs better than the weighted ensemble. The majority-voted ensemble using predictions from all the models had the highest F1 score of 0.711. Furthermore, an ensemble of 3 models yielded almost optimal results. Using more than three models resulted in a marginal increase in performance but significantly increased resource utilization. Thus, the use of more than three models seems unproductive.

Conclusion
In this work, we benchmarked various multilingual and pre-trained BERT-based models, namely RoBERTa (Liu et al., 2019a), DistilBERT (Sanh et al., 2019a), BanglaBERT (Bhattacharjee et al., 2022), BanglishBERT (Hasan et al., 2020b) and various pre-trained BERT models (Das et al., 2022; Sarker, 2020), for Bangla Sentiment Analysis (Hasan et al., 2023a; Islam et al., 2021; Hasan et al., 2023b), identifying the polarity of social media content by determining whether the sentiment expressed in the text is Positive, Negative, or Neutral. We also used Majority Voting and Weighted ensemble models that outperform the individual baseline model scores.
Our system achieved a micro F1-Score of 0.711 for the multiclass classification task and ranked 10th among the participants on the leaderboard for the shared task.

Figure 1: Frequency of Task 2 labels in training set

Table 2: Results of Base-Models on Test-Set of Shared-Task Dataset where Acc. is Accuracy, Pre. is Precision,

Table 3: Results of ensemble models on Test-Set of Shared-Task Dataset where Method is the method of ensembling, Top refers to top-k models chosen, Acc. is Accuracy, Pre. is Precision, Rec. is Recall & F1 refers to F1-Score
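As a reading aid for the metrics named in the table captions, the following plain-Python sketch computes accuracy together with micro- and macro-averaged F1 from per-class counts. The helper names and the toy labels are hypothetical and are not taken from the task data. For single-label multiclass classification, micro-averaged F1 coincides with accuracy.

```python
def per_class_counts(y_true, y_pred, labels):
    """Return {label: (tp, fp, fn)} counted one class versus the rest."""
    counts = {}
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        counts[label] = (tp, fp, fn)
    return counts


def f1_from_counts(tp, fp, fn):
    """F1 = 2TP / (2TP + FP + FN); defined as 0.0 when the class is empty."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the gold label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def micro_f1(y_true, y_pred, labels):
    """Pool the counts over all classes, then compute a single F1."""
    counts = per_class_counts(y_true, y_pred, labels).values()
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    return f1_from_counts(tp, fp, fn)


def macro_f1(y_true, y_pred, labels):
    """Compute F1 per class, then take the unweighted mean."""
    counts = per_class_counts(y_true, y_pred, labels)
    return sum(f1_from_counts(*counts[label]) for label in labels) / len(labels)
```

Macro averaging gives the smaller Neutral class the same influence as the larger classes, whereas micro averaging weights every sample equally.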