BLP-2023 Task 2: Sentiment Analysis

We present an overview of the BLP Sentiment Shared Task, organized as part of the inaugural BLP 2023 workshop, co-located with EMNLP 2023. The task is defined as the detection of sentiment in a given piece of social media text. This task attracted interest from 71 participants, among whom 29 and 30 teams submitted systems during the development and evaluation phases, respectively. In total, participants submitted 597 runs. However, only 15 teams submitted system description papers. The approaches in the submitted systems range from classical machine learning models and fine-tuning pre-trained models to leveraging Large Language Models (LLMs) in zero- and few-shot settings. In this paper, we provide a detailed account of the task setup, including dataset development and evaluation setup. Additionally, we provide a succinct overview of the systems submitted by the participants. All datasets and evaluation scripts from the shared task have been made publicly available for the research community, to foster further research in this domain.


Introduction
Sentiment analysis has emerged as a significant sub-field in Natural Language Processing (NLP), with a wide array of applications encompassing social media monitoring, brand reputation management, market research, and customer feedback analysis, among others. The advancement of sentiment analysis systems has been driven by substantial research efforts, addressing its indispensable utility across diverse fields such as business, finance, politics, education, and services (Cui et al., 2023). Traditionally, analysis has been conducted across various types of content and domains including news articles, blog posts, customer reviews, and social media posts, and has extended over different modalities like textual and multimodal analyses (Hussein, 2018; Dashtipour et al., 2016).
At its core, the task of sentiment analysis is defined as the extraction and identification of polarities (e.g., positive, neutral, and negative) expressed within texts. However, its scope has broadened to encompass the identification of: (i) the target (i.e., an entity) or aspect of the entity on which sentiment is expressed, (ii) the opinion holder, and (iii) the time at which it is expressed (Liu, 2020). Such advancements have primarily been made for high-resource languages.
Research on fundamental sentiment analysis remains an ongoing exploration, especially for many low-resource languages, primarily due to the scarcity of datasets and consolidated community effort. Although there has been a recent surge in interest (Batanović et al., 2016; Nabil et al., 2015; Muhammad et al., 2023), the field continues to pose significant challenges. Similar to other low-resource languages, the challenges for sentiment analysis in Bangla have been reported in recent studies (Alam et al., 2021a; Islam et al., 2021, 2023). Alam et al. (2021a) emphasized the primary challenges associated with Bangla sentiment analysis, specifically issues of duplicate instances in the data, inadequate reporting of annotation agreement, and generalization. These challenges were also highlighted in (Islam et al., 2021), further emphasizing the need to address them for effective sentiment analysis in Bangla.
To advance research in Bangla sentiment analysis, we emphasized community engagement and organized a shared task at BLP 2023. Similar efforts have primarily been conducted for other languages as part of the SemEval Workshop. The analysis of sentiment in tweets serves as an example of such efforts, particularly focusing on Arabic and English (Rosenthal et al., 2017). An earlier attempt at such an endeavor for Bangla is reported in (Patra et al., 2015), which mainly focused on tweets. Our initiative differs significantly from theirs in terms of datasets (e.g., data from multiple social media platforms and diverse domains) and evaluation setup.
A total of 71 teams registered for the task, out of which 30 made an official submission on the test set, and 15 of the participating teams submitted a system description paper.
The remainder of the paper is structured as follows: Section 2 provides an overview of the relevant literature. Section 3 discusses the task and dataset. Section 4 describes the organization of the task and the evaluation measures. An overview of the participating systems is provided in Section 5. Lastly, Section 6 concludes the paper.

Related Work
The current state-of-the-art research for Bangla sentiment classification mainly focuses on two key aspects: dataset development and model development. Notable recent work in this direction includes (Chowdhury and Chowdhury, 2014; Alam et al., 2021a; Islam et al., 2021; Kabir et al., 2023; Islam et al., 2023). Kabir et al. (2023) curated the largest dataset from book reviews, with annotations based on the review ratings. Although the dataset encompasses a large number of reviews, the class distribution poses a challenge for the Negative and Neutral classes. A well-balanced dataset was explored in (Islam et al., 2021), comprising ∼15K manually annotated comments spanning 13 different domains. This dataset is also used as part of this shared task.
Given the significant capabilities that Large Language Models (LLMs) have demonstrated across diverse applications and scenarios, Hasan et al. (2023) explored various LLMs such as Flan-T5 (large and XL) (Chung et al., 2022), Bloomz (1.7B, 3B, 7.1B, 176B-8bit) (Muennighoff et al., 2022), and GPT-4 (OpenAI, 2023), comparing the results with fine-tuned models. The resulting performance demonstrates that fine-tuned models continue to outperform zero- and few-shot prompting. However, the performance of LLMs points to a promising direction for developing systems with limited datasets in new domains.
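The prompts used in these zero- and few-shot experiments are not reproduced here. As a purely illustrative sketch of how such a prompt could be constructed for this three-class task (the instruction wording and example pairs below are assumptions, not the prompts actually used):

```python
def build_prompt(text, few_shot=()):
    """Construct a zero- or few-shot sentiment prompt.

    few_shot: optional (example_text, example_label) pairs;
    an empty tuple yields a zero-shot prompt.
    """
    parts = ["Classify the sentiment of the following text as "
             "Positive, Negative, or Neutral."]
    # In the few-shot setting, labeled demonstrations precede the query.
    for ex_text, ex_label in few_shot:
        parts.append(f"Text: {ex_text}\nSentiment: {ex_label}")
    # The query is left open-ended for the model to complete.
    parts.append(f"Text: {text}\nSentiment:")
    return "\n\n".join(parts)
```

The returned string would then be sent to the LLM, whose completion is mapped back to one of the three labels.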
Though there is a surge of research interest and progress, utilizing such systems in real applications remains a challenge in terms of performance and generalization capability. This shared task aimed to advance research through community effort and focus on a standard evaluation setup. As a starting point, we aimed to classify sentiment into three sentiment polarities: positive, neutral, and negative. This approach can be further extended in future studies.
Task and Dataset

Task
The task is defined as "detect the sentiment associated with a given text". This is a multi-class classification task that involves determining whether the sentiment expressed in the text is Positive, Negative, or Neutral.

Dataset
We utilized the MUBASE (Hasan et al., 2023) and SentNoB (Islam et al., 2021) datasets for the task. Both datasets were annotated by multiple annotators, with inter-annotator agreement of 0.84 for MUBASE and 0.53 for SentNoB, respectively. The SentNoB data is curated from newspapers and YouTube video comments, covering 13 different topics such as Politics, National, International, Food, Sports, Teach, etc. The MUBASE dataset consists of comments on popular news media sources such as BBC Bangla, Prothom Alo, and BD24Live, which were collected from Facebook and Twitter.
We further analyzed the distribution of sentences based on the number of words associated with each class label, as depicted in Table 1. We created various sentence-length buckets to understand and define the sequence length for training the transformer-based models. More than 80% of the posts comprise twenty words or fewer, a finding consistent with the typical length of social media posts observed in previous studies (Alam et al., 2021b). Moreover, the average numbers of words and sentences per data point are 15.87 and 1.03, respectively.

For evaluation, we used the micro-F1 score; the evaluation scripts, along with the data, are available online. As reference points, we provided both a majority and a random baseline. The majority baseline always predicts the most common class in the training data and assigns this class to each instance in the test dataset. Conversely, the random baseline assigns one of the classes randomly to each instance in the test dataset.
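The evaluation setup above can be sketched as follows. Note that for single-label multi-class classification, micro-averaged F1 coincides with accuracy, since every misclassified instance contributes exactly one false positive and one false negative.

```python
import random
from collections import Counter

def micro_f1(gold, pred):
    # Micro-averaging pools true positives, false positives, and false
    # negatives over all classes before computing F1.  In single-label
    # multi-class classification this reduces to accuracy.
    tp = sum(g == p for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    fp = fn = len(gold) - tp
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def majority_baseline(train_labels, n_test):
    # Predict the most frequent training label for every test instance.
    label = Counter(train_labels).most_common(1)[0][0]
    return [label] * n_test

def random_baseline(train_labels, n_test, seed=42):
    # Assign one of the observed classes uniformly at random.
    rng = random.Random(seed)
    classes = sorted(set(train_labels))
    return [rng.choice(classes) for _ in range(n_test)]
```

The seed used for the random baseline here is an arbitrary choice for reproducibility; the released evaluation scripts are the authoritative implementation.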

Task Organization
For the shared task, we provided four sets of data: the training set, development set, development-test set, and test set, as outlined in Table 3. The purpose of the development set is hyperparameter tuning. We provided the development-test set without labels to allow participants to evaluate their systems during the system development phase. The test set was designated for the final system evaluation and ranking. We ran the shared task in two phases and hosted the submission system on the CodaLab platform.

Development Phase In the first phase, only the training set, development set, and development-test set were made available, with no gold labels provided for the latter. Participants competed against each other to achieve the best performance on the development-test set. A live leaderboard was made available to keep track of all submissions.
Test Phase In the second phase, the test set was released without labels, and participants were given just four days to submit their final predictions. The test set was used for evaluation and ranking. The leaderboard was set to private during the evaluation phase, and participants were allowed to submit multiple systems without seeing the scores. The last valid submission was considered for the official ranking.
After the competition concluded, we released the test set with gold labels to enable participants to conduct further experiments and error analysis.
Results and Overview of the Systems

Results
A total of 29 and 30 teams submitted their systems during the development and evaluation phases, respectively. In Table 4, we report the results of the submitted systems on the dev-test and test sets. We also include the results for the majority and random baselines. The ranking in the table was determined by the results on the test set. Note that some teams participated in the development phase but did not participate in the evaluation phase, and vice versa, as indicated by the symbol ✗. Additionally, the teams marked with * did not submit a system description paper.
Upon comparing the results from the dev-test and test sets across different teams, the performance difference between them appears minimal. The models did not exhibit overfitting; in some cases, performance on the test set even surpassed that on the dev-test set.
As can be seen in Table 4, all but one system outperformed the random baseline, whereas 26 systems outperformed the majority baseline. The best system, Aambela (Fahim, 2023), achieved a micro-F1 score of 0.73, an absolute improvement of 0.23. The team mainly fine-tuned BanglaBERT and multilingual BERT along with adversarial weight perturbation. The second-best system, Knowdee (Liu et al., 2023), used data augmentation with pseudo-labeling, with labels obtained from an ensemble of models. The third-best system, LowResource (Chakma and Hasan, 2023), used an ensemble of different fine-tuned models.
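Pseudo-labeling of the kind used by the second-best system can be illustrated schematically: an ensemble scores unlabeled texts, and only examples where the averaged prediction is sufficiently confident are added to the training data. The confidence threshold and averaging rule below are illustrative assumptions, not the team's reported configuration.

```python
def pseudo_label(unlabeled, models, threshold=0.9):
    """Select (text, label) pairs from unlabeled data.

    models: callables mapping a text to a dict of class probabilities.
    An example is kept only when the ensemble's averaged probability
    for its top class reaches `threshold`.
    """
    selected = []
    for text in unlabeled:
        probs = [m(text) for m in models]
        classes = probs[0].keys()
        # Average each class probability across the ensemble.
        avg = {c: sum(p[c] for p in probs) / len(probs) for c in classes}
        label, conf = max(avg.items(), key=lambda kv: kv[1])
        if conf >= threshold:
            selected.append((text, label))
    return selected
```

The selected pairs would then be mixed into the labeled training set for a further round of fine-tuning.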
In Table 5, we report an overview of the approaches of the submitted systems. The most used models are multilingual BERT, BanglaBERT, and XLM-RoBERTa; specifically, 9, 8, and 14 out of 15 teams utilized multilingual BERT, BanglaBERT, and XLM-RoBERTa, respectively. Ensembles of fine-tuned models produced the best systems for this task. Additionally, two teams applied few-shot learning using the mT5, BanglaBERT large, and GPT-3.5 models. However, these teams did not provide details regarding their prompts.

Discussion
From the official ranking presented in Table 4, nearly every team outperformed the random baseline system. The performance difference among the top 22 teams is very small compared with the gap to the 23rd-ranked team. In Table 6, we present the per-class performances of the top five teams. Although most teams outperformed the random baseline by a large margin, the neutral class remains the most difficult to identify. The low performance on the neutral class might be due to its skewed distribution in the dataset. Data augmentation, up-sampling the minority class, and class re-weighting are common approaches typically used to address such issues. Although some systems employed data augmentation, this issue does not appear to have been thoroughly considered across all teams.
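Class re-weighting, one of the remedies mentioned above, is commonly realized as inverse-frequency weights that scale the loss contribution of each class. A minimal sketch of one standard weighting scheme (this is a generic recipe, not a configuration reported by any team):

```python
from collections import Counter

def inverse_frequency_weights(labels):
    # Weight each class c by n / (k * count_c), where n is the number
    # of examples and k the number of classes.  Rare classes (e.g. a
    # skewed Neutral class) then contribute more to the training loss.
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * counts[c]) for c in counts}
```

With this normalization the weighted class counts sum back to the dataset size, so the overall loss scale is preserved while rare classes are emphasized.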

Participating Systems
Below, we provide a brief description of the participating systems and their leaderboard rank.
Aambela (Fahim, 2023) (rank 1) emerged as the best-performing team in the shared task, fine-tuning the pretrained models BanglaBERT (Bhattacharjee et al., 2022a) and multilingual BERT (Devlin et al., 2019) using two classification heads. Initially, the author removed URLs and HTML tags, then applied a normalizer to the preprocessed text. Adversarial weight perturbation was utilized to enhance training robustness, and 5-fold cross-validation was also conducted.
Knowdee (Liu et al., 2023) (rank 2) partitioned the dataset into 10 folds and generated pseudo-labels for unlabeled data using a fine-tuned ensemble of models, with the additional use of training data. They also employed standard data augmentation.
LowResource (Chakma and Hasan, 2023) (rank 3) fine-tuned both the base and large versions of BanglaBERT (Bhattacharjee et al., 2022a), employing random token dropping, and also fine-tuned XLM-RoBERTa (Conneau et al., 2020). During the development phase, they created an ensemble of three models. However, for the evaluation phase, they ensembled only two variants of BanglaBERT, with one of them being fine-tuned using external data. Additionally, they employed task-adaptive pretraining and paraphrasing techniques utilizing BanglaT5 (Bhattacharjee et al., 2022b).
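Random token dropping as an augmentation step can be sketched as follows; the drop probability and whitespace tokenization are illustrative assumptions, not the team's reported settings.

```python
import random

def drop_tokens(text, p=0.1, seed=None):
    # Independently drop each whitespace token with probability p,
    # producing a perturbed copy of the input for augmentation.
    rng = random.Random(seed)
    tokens = text.split()
    kept = [t for t in tokens if rng.random() >= p]
    # Never return an empty string; fall back to the original text.
    return " ".join(kept) if kept else text
```

Each training epoch can apply this with a fresh seed so the model sees slightly different variants of the same sentence.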
LowResourceNLU (Veeramani et al., 2023) (rank 4) fine-tuned BanglaBERT base and large (Bhattacharjee et al., 2022a), with MLM and classification heads, and multilingual BERT (Devlin et al., 2019), jointly on the XNLI and shared task datasets. They also created an ensemble of all three transformer-based models and applied multi-step aggregation to capture the most confident class predicted across all models.
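One simple reading of "most confident class predicted across all models" is to take, over all ensemble members, the single prediction with the highest probability. The sketch below implements that simplified rule; the team's actual multi-step aggregation may differ.

```python
def most_confident_class(per_model_probs):
    """Return the class whose single highest probability across all
    models is largest.

    per_model_probs: one dict of class probabilities per model.
    """
    best_label, best_conf = None, -1.0
    for probs in per_model_probs:
        label, conf = max(probs.items(), key=lambda kv: kv[1])
        if conf > best_conf:
            best_label, best_conf = label, conf
    return best_label
```

Unlike majority voting, this rule lets one highly confident model override two uncertain ones.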
Z-Index (Tarannum et al., 2023) (rank 5) utilized standard preprocessing techniques to remove URLs, usernames, emojis, and hashtags from the text. Initially, they employed the classical SVM and Random Forest models, and later fine-tuned both the base and large variants of BanglaBERT (Bhattacharjee et al., 2022a), as well as multilingual BERT (Devlin et al., 2019). The models were trained using the provided training set.
Embeddings (Tonmoy, 2023) (rank 9) fine-tuned the pretrained models BanglaBERT (Bhattacharjee et al., 2022a), BanglaGPT2, Indic-BERT (Kakwani et al., 2020), and multilingual BERT (Devlin et al., 2019) using the cross-entropy loss function. Later, to reduce the computational cost, they investigated performance with the self-adjusting dice loss, focal loss, and F1-micro loss. They also combined the training, dev, and dev-test sets as training data, using the test data to evaluate the performance of the models.
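Focal loss, one of the alternatives this team explored, down-weights well-classified examples so training focuses on hard ones. A minimal per-example sketch (the focusing parameter γ = 2 is a common default, not the team's reported value):

```python
import math

def focal_loss(probs, gold_idx, gamma=2.0):
    # Focal loss scales the cross-entropy term -log(p_t) by
    # (1 - p_t)^gamma, so confidently correct predictions contribute
    # little and hard examples dominate the gradient.
    p_t = probs[gold_idx]
    return -((1.0 - p_t) ** gamma) * math.log(p_t)
```

Setting gamma to 0 recovers ordinary cross-entropy, which makes the relationship between the two losses easy to verify.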
Error Point (Das et al., 2023) (rank 27) performed preprocessing by removing duplicate text, filtering based on text length, and eliminating punctuation, links, emojis, non-character elements, and stopwords. They also carried out data augmentation. For their analysis, they utilized classical algorithms such as Logistic Regression, Decision Tree, Random Forest, Multinomial Naive Bayes, SVM, and SGD, using n-grams to represent the input. Additionally, they employed deep learning models, namely LSTM and LSTM-CNN.

Conclusion and Future Work
We presented an overview of the shared task 2 (sentiment analysis) at the BLP Workshop 2023. Task 2 aimed to classify the sentiment in textual content. Notable systems employed ensembles of pretrained language models, with the language-specific BanglaBERT being the most popular. Interesting approaches, including P-tuning, few-shot learning, LLMs, and different loss functions, were also explored to tackle the problem. In general, numerous models, including different kinds of transformers, were used in the current submissions for the task.
In future work, we plan to extend the task in various ways, such as aspect-based sentiment analysis and incorporating multiple modalities.

Limitations
The BLP-2023 sentiment analysis shared task primarily focuses on sentiment polarity classification (positive, negative, and neutral) at the post level. This approach limits the identification of specific sentiment aspects and other crucial elements associated with them. Future editions of the task will address this aspect. Moreover, this edition focused solely on unimodal (text-only) models, leaving multimodal models for future study.

Table 1 :
Detailed class label distribution of the shared task data splits. Pos: Positive, Neu: Neutral, Neg: Negative.

Table 2 :
Data sources used in various splits for the shared task. DT: Dev-Test.

For the shared task, we combined the MUBASE (Hasan et al., 2023) training set with the SentNoB (Islam et al., 2021) training set, resulting in a total of 35,266 entries for the training set. The SentNoB development set was used as the shared task development set. Additionally, the MUBASE development set served as the dev-test set for the shared task, while the MUBASE test set was utilized for system evaluation and participant ranking. The specifics of the data sources are outlined in Table 2, and the detailed distribution of the data splits is presented in Table 3.

Table 3 :
Class label distribution of the shared task dataset. DT: Dev-Test, Pos: Positive, Neu: Neutral, Neg: Negative.

Table 4 :
Official ranking of the shared task on the test set. *: no working note submitted. -: run submitted after the deadline. ✗: the team did not submit a system in the respective phase.

Table 6 :
F1 scores of the baseline and top five systems for each class.