LowResource at BLP-2023 Task 2: Leveraging BanglaBert for Low Resource Sentiment Analysis of Bangla Language

This paper describes the system of the LowResource Team for Task 2 of BLP-2023, which involves conducting sentiment analysis on a dataset composed of public posts and comments from diverse social media platforms. Our primary aim was to utilize BanglaBert, a BERT model pre-trained on a large Bangla corpus, using various strategies including fine-tuning, dropping random tokens, and using several external datasets. Our final model is an ensemble of the three best BanglaBert variations. Our system ranked 3rd overall on the Test Set among 30 participating teams with a score of 0.718. Additionally, we discuss the promising systems that did not perform well, namely task-adaptive pre-training and paraphrasing using BanglaT5. Our training code is publicly available at https://github.com/Aunabil4602/bnlp-workshop-task2-2023


Introduction
In the field of Natural Language Processing, sentiment analysis has earned significant attention as a research area dedicated to the analysis of textual content. A considerable body of research on sentiment analysis in Bangla has been conducted. Some of these works (e.g. Islam et al. (2021), Kabir et al. (2023)) introduce new datasets. In parallel, other works (e.g. Amin et al. (2019), Al-Amin et al. (2017)) propose novel approaches. In spite of these numerous works, there is still considerable room to improve sentiment analysis for Bangla.
In this paper, we describe our system for Task 2 of the Bangla Language Processing Workshop @EMNLP-2023 (Hasan et al., 2023a). We employ various systems based on BanglaBert and BanglaBert-Large (Bhattacharjee et al., 2022). Our experimental systems include fine-tuning, improving generalization by dropping random tokens, and using several external datasets. Additionally, we describe alternate potential methods that did not score well in the results section 6. To illustrate, we explore Task Adaptive Pre-Training (Gururangan et al., 2020), which, in fact, has been used by this year's winner of SemEval Task 12 (Muhammad et al., 2023) on sentiment analysis of African languages, and generating paraphrases using BanglaT5 (Bhattacharjee et al., 2023). Moreover, we notice a significant drop in the score of our best model on the final Test Set. We describe this as our limitation in section 7.

Related Works
Many of the related works primarily focus on novel datasets covering diverse domains. Islam et al. (2022) have developed a dataset comprised of various public comments from social media platforms. Rahman and Dey (2018) have created their datasets based on Cricket and Restaurant reviews. Most recently, Kabir et al. (2023) have introduced another new sentiment analysis dataset. In recent years, Large Language Models (LLMs), trained on huge corpora, have become popular for their capability to understand language and the ease with which they can be fine-tuned for any task like sentiment analysis. LLMs for the Bangla language (e.g. BanglaBert (Bhattacharjee et al., 2022), sahajBERT (Diskin et al., 2021), BanglaT5 (Bhattacharjee et al., 2023)) are also available, which opens opportunities to work on various tasks for Bangla.

Task Description
This is a multi-class classification task where the objective is to classify the sentiment of a given text into 3 classes: Positive, Negative, and Neutral. The score is calculated using micro-F1. The task consists of two phases: a development phase followed by a test phase. The final standing is based on the score on the test set provided during the test phase.
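As a reference for the metric, micro-averaged F1 pools true positives, false positives, and false negatives across all classes before computing precision and recall. A minimal sketch (the official scorer may differ in details):

```python
def micro_f1(y_true, y_pred, labels=("Positive", "Negative", "Neutral")):
    """Micro-averaged F1: pool TP/FP/FN over all classes, then compute F1."""
    tp = fp = fn = 0
    for label in labels:
        for t, p in zip(y_true, y_pred):
            if p == label and t == label:
                tp += 1          # predicted this class, correctly
            elif p == label:
                fp += 1          # predicted this class, wrongly
            elif t == label:
                fn += 1          # missed this class
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Note that for single-label multi-class tasks like this one, micro-F1 coincides with plain accuracy.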

Dataset Description
The dataset comprises the MUBASE (Hasan et al., 2023b) and SentiNob (Islam et al., 2021) datasets. The SentiNob dataset consists of various public comments collected from social media platforms. It covers 13 different domains, for example politics, education, and agriculture. On the other hand, the MUBASE dataset consists of posts collected from Twitter and Facebook. The sample sizes of the different sets given for training, validation, and testing are shown in Table 2.

System Description
Here, we discuss several systems that we have experimented with for the task, including the preprocessing of the dataset. For TAPT, we use the Electra objective (Clark et al., 2020), which was originally used to pre-train these models. We don't perform DAPT since the models' pre-training corpora already cover the task domains.

2-Stage Fine-Tuning of LLMs
In the first stage, we fine-tune BanglaBert using the external data only; we don't include any of the given task data here. In the next stage, we do regular fine-tuning on the train set. We use the term "2FT" as a short form for this approach. The list of external datasets and their sample sizes is shown in Table 10.
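Schematically, 2FT is two consecutive Trainer runs that share the same model weights. The sketch below assumes the HuggingFace model id `csebuetnlp/banglabert` and placeholder hyper-parameters; the actual values come from our hyper-parameter search:

```python
def two_stage_finetune(external_ds, task_ds, model_name="csebuetnlp/banglabert"):
    """Stage 1: fine-tune on the label-mapped external data only.
    Stage 2: continue fine-tuning the same weights on the task's train set.
    Epoch count and fp16 flag are illustrative placeholders."""
    from transformers import (AutoModelForSequenceClassification,
                              Trainer, TrainingArguments)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name, num_labels=3)
    for stage, dataset in (("external", external_ds), ("task", task_ds)):
        args = TrainingArguments(output_dir=f"2ft-{stage}",
                                 num_train_epochs=3,
                                 fp16=True)  # mixed precision training
        Trainer(model=model, args=args, train_dataset=dataset).train()
    return model
```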

Data augmentation
We experiment with 2 data augmentation techniques to improve generalization. First, instead of dropping random words (Bayer et al., 2022), we drop random tokens (RTD), since dropping whole words might change the meaning. We apply RTD on the fly during training. Second, we employ paraphrasing as data augmentation using BanglaT5 (Bhattacharjee et al., 2023).
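Random token drop can be sketched as an on-the-fly filter over token ids; `drop_prob` is a hypothetical value to be tuned, and special tokens (e.g. [CLS]/[SEP] ids) should be protected via `keep`:

```python
import random

def random_token_drop(token_ids, drop_prob=0.1, keep=()):
    """Drop each token independently with probability drop_prob.
    Token ids listed in `keep` (e.g. special tokens) always survive."""
    return [t for t in token_ids
            if t in keep or random.random() >= drop_prob]
```

Because the drop is applied per batch during training rather than to the stored dataset, the model sees a slightly different version of each sentence every epoch.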

Preprocessing of Data
We remove the duplicates found in the training set and development set. We replace any URL and username with URL and USER tags respectively, similar to Nguyen et al. (2020). While using BanglaBert, we normalize the sentences with their specific normalizer1 as required by the model. All of the sentences are tokenized by the individual tokenizer required by each model. We set the max tokenized length to 128 for each text.
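The URL/username replacement can be sketched with two regular expressions; the patterns below are illustrative, and BanglaBert's own normalizer is applied as a separate step not shown here:

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")   # bare URLs
USER_RE = re.compile(r"@\w+")                    # @-mentions

def preprocess(text):
    """Replace URLs and usernames with placeholder tags (cf. Nguyen et al., 2020)."""
    text = URL_RE.sub("URL", text)
    text = USER_RE.sub("USER", text)
    return text
```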
We use several external datasets. However, most of their labels don't match the labels of this task. For the initial fine-tuning of the LLMs, we first map the different labels to the three labels of this task. The label mapping is shown in Table 11. For TAPT, we don't need any of these labels since we do masked language modeling. Finally, we also remove the duplicates found in the external datasets.
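A minimal sketch of the mapping step, with hypothetical external labels standing in for the actual mapping of Table 11:

```python
# Hypothetical external labels; the real mapping is given in Table 11.
LABEL_MAP = {
    "very positive": "Positive",
    "positive": "Positive",
    "very negative": "Negative",
    "negative": "Negative",
    "neutral": "Neutral",
}

def map_external_labels(samples):
    """Keep only samples whose external label maps onto the task's 3 classes."""
    return [(text, LABEL_MAP[label]) for text, label in samples
            if label in LABEL_MAP]
```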

Experimental Setup
We use Models and Trainer from Huggingface2 (PyTorch version). We employ mixed precision training (Micikevicius et al., 2017), which enables faster training and consumes less GPU memory. Moreover, we wrote our code such that the results are reproducible. All of the experiments are done using a single V100 GPU on Google Colaboratory3. We do a hyper-parameter search over the learning rate, batch size, dropout ratios, and total epochs. We start the search from the parameter settings suggested by Gururangan et al. (2020).
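Reproducibility amounts to seeding every RNG source before training; a stdlib-only sketch (in the actual setup numpy and torch would be seeded analogously):

```python
import os
import random

def set_reproducible_seed(seed=42):
    """Seed the stdlib RNG and hashing; seed value here is illustrative."""
    random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    # In the full setup one would also call numpy.random.seed(seed),
    # torch.manual_seed(seed), and enable deterministic cuDNN kernels.
```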

Results
To begin, we discuss the systems that have scored well on the Development-Test Set. The top individual model is BanglaBert-Large with random token drop, which has scored 0.733; even without any enhancement, it scores 0.723. The next best single model is BanglaBert with random token drop (RTD) and 2-stage fine-tuning, which has scored 0.729. Table 3 shows the scores of our selected models on the Development-Test Set. Here, we see that both the usage of external datasets and RTD have benefited BanglaBert and BanglaBert-Large. We have built an ensemble of the 3 best individual models (model IDs 3, 5, and 6) that has scored 0.734, where we decide the class based on majority voting; in case of a tie, we use the class predicted by the best model. We chose only the 3 best models for the ensemble because the other models' scores were low, and taking an odd number of models helps to decide the output class in case of a tie.
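The ensembling rule can be sketched as plain majority voting with the best single model as the tie-breaker:

```python
from collections import Counter

def majority_vote(predictions, best_model_idx=0):
    """predictions: one label list per model, aligned over the same samples;
    predictions[best_model_idx] is the best single model (used on ties)."""
    ensembled = []
    for votes in zip(*predictions):
        top, count = Counter(votes).most_common(1)[0]
        # With 3 models, a tie means all three disagree: defer to best model.
        if count == 1:
            ensembled.append(votes[best_model_idx])
        else:
            ensembled.append(top)
    return ensembled
```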
We have submitted the ensembled model as our best model in the test phase, and it has scored 0.718. Moreover, we have submitted the 3 best individual models. Our scores on the Test Set are shown in Table 4. Here, we have found some inconsistency: BanglaBert-Large with random token drop, which we considered the best model based on the Development-Test Set, performed worst among the 3 models, while BanglaBert with random token drop and pre-fine-tuning on external data, our 2nd best model, performed the best. More importantly, every variant of BanglaBert-Large has scored low on the Test Set. We discuss this further in section 7. Finally, Table 6 shows the confusion matrix of our ensembled model on the Test Set. We see that our model performed worst on detecting the Neutral class, i.e. only 412 out of 1277 samples were correct, an accuracy of 32%, whereas the accuracies of the Positive and Negative classes are 78% and 83% respectively.
There are some systems that didn't achieve favorable performance from the beginning of our experiments. Firstly, TAPT didn't improve our results, but rather decreased the score by 0.039 with respect to simple fine-tuning, as shown in Table 5. What we can infer is that TAPT is supposed to help adapt BanglaBert to the task domain, but it overfitted on the training samples, whereas the original model was already in a good optimum that covered the task domain well.
Paraphrasing to create additional data using BanglaT5 also didn't work well. Its score is shown in Table 5. The most plausible reason is that the paraphrased sentences, although fluent, were not diverse enough from the original sentences. Examples of generated paraphrases are shown in Figure 1.
Other than BanglaBert, we try XLM-Roberta-Large, a multi-lingual model, which has been used by several task winners (e.g. Wang et al. (2022)).
However, it has scored low on the Development-Test Set even with all enhancements; its score is also shown in Table 3. To analyze the score inconsistency, we run BanglaBert and BanglaBert-Large with different random seeds and evaluate them on the Test Set. As anticipated, the models show varying performance when initialized with different seeds. Table 7 shows the results of this experiment. Moreover, we have found that the average score of BanglaBert is better than that of BanglaBert-Large. In fact, this result is consistent with the finding of the authors of BanglaBert that BanglaBert-Large performs worse than BanglaBert on sentiment analysis on the SentiNob dataset4. Thus, before choosing a model, its average score over different seeds needs to be evaluated when the training data is small.

TAPT is a popular method for pre-training, but it has been ineffective for our task. However, we have inferred this from only a few experiments. Thus, we suggest that more research be done on the effectiveness of TAPT, as well as DAPT, on BanglaBert.
Our research has mostly been based on fine-tuning. As future work, we would like to explore common data augmentation techniques (Bayer et al., 2022) on the given data. Besides, several multilingual pre-trained models that include the Bangla language need to be explored along with more sophisticated methods, which may achieve even better results.

Conclusion
In this paper, we presented our systems based on BanglaBert and BanglaBert-Large for the sentiment analysis task. We used simple techniques like 2-stage fine-tuning, using external datasets, and dropping random tokens. Our system scored 3rd overall in the task. We also discussed some potential systems that didn't demonstrate satisfactory performance. More importantly, we discussed the score inconsistency of our best model between the Development-Test Set and the Test Set as our limitation. Finally, we discussed some directions for future research, like applying TAPT and DAPT to BanglaBert and trying more data augmentations or other sophisticated methods.

Table 1 :
Top 5 of the final standings of BLP-2023 Task 2. Our team stands 3rd among 30 participants.
Utilization of random token drop and external datasets has benefited our systems by improving micro-F1 scores by around 0.006 to 0.01. Our best model, an ensemble of the three top models based on the Development-Test Set score, has scored a micro-F1 of 0.7179, standing 3rd overall among 30 participants. Table 1 shows the final standings of the task.

Table 2 :
Sample sizes of the various sets provided in Task 2.
Fine-tuning of LLMs
We fine-tune pre-trained LLMs for the task; namely, we use BanglaBert and BanglaBert-Large. Besides, we also use XLM-Roberta-Large (Conneau et al., 2020), a multi-lingual model. We don't explore multi-lingual models much, since we have found that monolingual models are used more often than multi-lingual models on monolingual tasks due to their higher scores (Muhammad et al., 2023).

Task Adaptive Pre-training of LLMs
Gururangan et al. (2020) suggest that Domain Adaptive Pre-Training (DAPT) and Task Adaptive Pre-Training (TAPT) improve the scores on the corresponding task. Here, we do TAPT on BanglaBert and BanglaBert-Large using the Electra pre-training method.

Table 5 :
Performance of TAPT and Paraphrasing on BanglaBert-Large in comparison with fine-tuning on Development Set.

Table 6 :
Confusion Matrix of the Ensembled model on Test Set.