SentNoB: A Dataset for Analysing Sentiment on Noisy Bangla Texts

In this paper, we propose an annotated sentiment analysis dataset of informally written Bangla texts. The dataset comprises public comments on news and videos collected from social media, covering 13 different domains including politics, education, and agriculture. Each comment is labeled with one of three polarity labels: positive, negative, or neutral. A significant characteristic of the dataset is that the comments are noisy in terms of mixed dialects and grammatical incorrectness. Our experiments to develop a benchmark classification system show that hand-crafted lexical features outperform neural network and pre-trained language models. We have made the dataset and the accompanying models presented in this paper publicly available at https://git.io/JuuNB.


Introduction
Sentiment analysis is one of the classic problems in computational linguistics, and it has shown a massive impact on different real-life applications. The capability to quantify the sentiment polarity of English texts has enabled solutions for a diverse set of problems, such as understanding the possible movement of stock markets, gauging public sentiment towards an event or product, and measuring client satisfaction in customer support. A major reason behind this success is the collaborative effort invested in creating public resources like Sentiment140 (Go et al., 2009; Mohammad et al., 2013), SentiWordNet (Baccianella et al., 2010), the IMDB review corpus (Maas et al., 2011), the Stanford Sentiment Treebank (Socher et al., 2013), TS-Lex (Tang et al., 2014), and the SemEval Twitter sentiment analysis corpus (Rosenthal et al., 2017).
Bangla is the sixth most spoken language worldwide and the second most spoken Indo-Aryan language after Hindi (Eberhard et al., 2021), with 268M speakers. Bangla is the native language of Bangladesh and some regions of India, such as West Bengal. While technology is dramatically improving the lives of people in these densely populated and economically burgeoning regions, there is a timely need for technologies that can understand the language, enhancing the overall impact on social welfare and businesses. Existing sentiment analysis datasets for a low-resource language like Bangla suffer from three major limitations: 1) none-to-slight inter-annotator agreement scores that call the annotation reliability into question (e.g., 0.11 in Ashik et al., 2019; 0.18 in Islam et al., 2020), 2) lack of cross-domain generalization capability due to strong domain dependency (Wahid et al., 2019; Rahman et al., 2019; Sazzed, 2020), and 3) lack of public availability for further research (Karim et al., 2020; Nabi et al., 2016; Hassan et al., 2016; Sharmin and Chakma, 2020; Choudhary et al., 2018; Das and Bandyopadhyay, 2009).
In this paper, we aim to create a domain-representative sentiment polarity classification dataset by collecting public opinions on various topics. During the data collection and annotation process, we invest effort in improving the quality of the dataset through data curation: on the one hand, we remove duplicates; on the other, we grow the vocabulary by incorporating instances that increase the unique-word percentage. Our contributions can be summarized as follows:
• We propose SentNoB, a dataset for analysing Sentiment in Noisy Bangla texts. The dataset is a collection of ≈15K social media comments on news and videos from 13 different domains. Instances from the dataset demonstrate heavy usage of different local dialects, along with spelling and grammatical errors. We show some examples in Table 1.
• We experiment with different techniques such as linguistic features, recurrent neural networks, and a pre-trained language model, and show that old-school lexical features like word n-grams deliver superior classification performance. We shed light on different aspects of the problem throughout our analysis.
• We make our dataset and model publicly available to foster research in this direction.

Development of SentNoB
Data Collection We defined the following objectives before creating the dataset, as we believe they enhance the generalization capability of SentNoB: 1) samples should represent many different domains to encourage domain-independent solutions, and 2) samples should make the dataset less repetitive. We start by collecting public comments on articles on the 13 most popular topics from Prothom Alo, the most circulated newspaper in Bangladesh. We then collect comments from a set of YouTube videos on similar topics. Out of the ≈31K collected comments, we keep those written only in the Bangla alphabet. To reduce repetitiveness and noise, we remove duplicates and exclude instances shorter than three or longer than 50 tokens. Additionally, we aim to increase the vocabulary size by incorporating as many different words as possible, so we prioritize for annotation the instances that increase the percentage of unique words in the dataset. A diverse vocabulary poses a modeling challenge but ultimately helps to create more robust classification systems that generalize well.
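The curation pipeline above (deduplication, length filtering, and unique-word prioritization) can be sketched as follows. This is an illustrative reconstruction, not the authors' released code; the greedy novel-word ranking is one plausible reading of "prioritize the instances that will increase the unique word percentage", and the length thresholds follow the text.

```python
def curate(comments, min_len=3, max_len=50):
    """Deduplicate, filter by token length, and order candidates so that
    comments introducing the most unseen words come first for annotation."""
    seen_texts, pool = set(), []
    for c in comments:
        toks = c.split()
        # Drop exact duplicates and instances outside the 3-50 token range.
        if c in seen_texts or not (min_len <= len(toks) <= max_len):
            continue
        seen_texts.add(c)
        pool.append(toks)

    vocab, kept = set(), []
    # Greedy pass: repeatedly pick the comment with the highest fraction
    # of tokens not yet in the running vocabulary.
    while pool:
        best = max(pool, key=lambda t: len(set(t) - vocab) / len(t))
        vocab.update(best)
        kept.append(" ".join(best))
        pool.remove(best)
    return kept
```

The greedy pass is quadratic in the pool size; for ≈31K comments a batched or sampled variant would likely be used in practice.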
Annotation Three different annotators label each instance with one of five polarity labels: Strong Negative, Moderate Negative, Neutral, Moderate Positive, and Strong Positive. For this task, we employed ten undergraduate students and provided them with detailed annotation guidelines. We use majority voting to assign the final class label, keeping the Neutral class unchanged but combining the two intensities of each polar class into a single Positive or Negative label. An inter-annotator agreement (Fleiss, 1971) score of 0.53 indicates moderate agreement across the dataset. To our knowledge, this is the highest such score among the Bangla datasets that have made their agreement score public.
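The label aggregation can be sketched as below. The paper does not state whether the vote is taken before or after collapsing intensities; this sketch assumes intensities are collapsed first and a two-of-three majority then decides, with unresolved ties left to further adjudication.

```python
from collections import Counter

# Collapse the five annotation labels into the three final polarity classes.
COLLAPSE = {
    "strong_negative": "negative", "moderate_negative": "negative",
    "neutral": "neutral",
    "moderate_positive": "positive", "strong_positive": "positive",
}

def final_label(annotations):
    """Collapse intensities, then take the majority vote of three annotators.
    Returns None when no class reaches a two-of-three majority."""
    collapsed = [COLLAPSE[a] for a in annotations]
    label, count = Counter(collapsed).most_common(1)[0]
    return label if count >= 2 else None
```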

Statistics and Analysis
In total, we have 15,728 instances in the final dataset (Table 2). The average length of an instance is 1.63 ± 1.03 sentences, and the average sentence length is 15.37 ± 9.93 words. 40.8% of the data are labeled as Positive, 36.3% Negative, and 22.9% Neutral. Figure 1 shows the topic distribution of the dataset. While 42.73% of the instances come from national and political news, we have less data on fashion and agriculture. We observe that agreement decreases with instance length: all three annotators agreed on 36% of texts with 11-20 tokens, 15.07% of texts with 21-30 tokens, and 7.08% of texts with 31-40 tokens. This is intuitive, as longer texts can contain contradictory sentiment across segments and often challenge annotators' own biases and perspectives. For example, we observe low agreement on data from the politics and national domains, as these domains demonstrate heavy partisanship.

Methodology
In this section, we describe the methods we investigate to develop a benchmark model for classifying sentiment polarity on SentNoB. We start by training linear SVM (Cortes and Vapnik, 1995) models with traditional hand-engineered linguistic features. Then, we experiment with recurrent neural network models and pre-trained transformer-based language models, given their recent success on a wide variety of NLP tasks.

Linguistic Features
Lexical We extract word (1-3) and character (2-5) n-grams from the instances, as these lexical representations have shown strong performance in different classification tasks. We then vectorize each instance with the TF-IDF weighted scores for each n-gram. Semantic To utilize semantic information from the texts, we experiment with FastText (Grave et al., 2018) pre-trained Bangla word embeddings, representing a text as the mean of its word vectors; out-of-vocabulary words are represented with a zero vector. FastText covers only 81.75% of our vocabulary, since its training data are formal Bangla texts from Wikipedia, whereas our dataset consists of informal Bangla texts written by general users on the internet. We consider the FastText embedding only for the linguistic feature-based experiments.
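The lexical feature pipeline, combining word and character TF-IDF n-grams fed to a linear SVM, can be sketched with Scikit-learn as below. The toy texts, labels, and C value are placeholders; the real experiments use the SentNoB training split and tune C on the development set.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import LinearSVC

# Toy stand-in data (romanized here for display; the dataset is in Bangla script).
texts = ["khub bhalo laglo", "ekdom baje video", "ajke khela hobe",
         "darun hoyeche", "khub baje obostha", "news ta dekhlam"]
labels = ["pos", "neg", "neu", "pos", "neg", "neu"]

# Word 1-3 grams and character 2-5 grams, each TF-IDF weighted, concatenated.
features = FeatureUnion([
    ("word", TfidfVectorizer(analyzer="word", ngram_range=(1, 3))),
    ("char", TfidfVectorizer(analyzer="char", ngram_range=(2, 5))),
])
clf = Pipeline([("feats", features), ("svm", LinearSVC(C=1.0))])
clf.fit(texts, labels)
pred = clf.predict(["khub darun laglo"])
```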

Recurrent Neural Networks
We use a bidirectional long short-term memory (Bi-LSTM; Hochreiter and Schmidhuber, 1997) network that encodes a text in the forward and backward directions, producing a hidden vector for each direction at every time step. We concatenate the two vectors and apply an attention mechanism (Bahdanau et al., 2015) that learns to put more weight on the words crucial for correct classification. We compute the attention-weighted sum of the vectors and predict the sentiment polarity through an output layer. Instead of initializing the embedding layer with pre-trained embeddings (e.g., FastText), we use random initialization, which performed better in some initial experiments.
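The architecture above can be sketched in PyTorch as follows. Dimensions (embed_dim, hidden_dim) are illustrative assumptions, not the tuned values, and the single-layer additive scoring is one common form of the attention mechanism.

```python
import torch
import torch.nn as nn

class BiLSTMAttention(nn.Module):
    """Bi-LSTM encoder + attention-weighted sum + softmax output layer."""
    def __init__(self, vocab_size, embed_dim=100, hidden_dim=128, num_classes=3):
        super().__init__()
        # Randomly initialised embeddings (no pre-trained vectors), as in the paper.
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim,
                            batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden_dim, 1)   # scores each time step
        self.out = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, x):                       # x: (batch, seq_len) token ids
        h, _ = self.lstm(self.embedding(x))     # (batch, seq_len, 2*hidden)
        weights = torch.softmax(self.attn(h), dim=1)   # attention over time
        context = (weights * h).sum(dim=1)      # attention-weighted sum
        return self.out(context)                # (batch, num_classes) logits

model = BiLSTMAttention(vocab_size=5000)
logits = model(torch.randint(1, 5000, (4, 20)))  # a batch of 4 padded sequences
```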

Pre-trained Language Model
In recent years, large pre-trained language models like BERT (Devlin et al., 2018) have shown impressive performance on a wide range of linguistic tasks across many languages. Therefore, we assess the performance of such a model by fine-tuning it on our dataset. We choose multi-lingual BERT (mBERT), as its training data included Bangla texts, and fine-tune only the output layer with our training data due to computing resource limitations.
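Fine-tuning only the output layer amounts to freezing the encoder's parameters before training. The sketch below illustrates the mechanism on a tiny stand-in module (to avoid downloading mBERT here); with Hugging Face Transformers the same pattern applies to the encoder of a sequence classification model.

```python
import torch.nn as nn

# Stand-in for mBERT: an encoder plus a classification head over 3 polarity classes.
class TinyClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Embedding(100, 16), nn.Linear(16, 16))
        self.head = nn.Linear(16, 3)

model = TinyClassifier()
for p in model.encoder.parameters():   # freeze everything except the head
    p.requires_grad = False

# Only the output layer's weight and bias remain trainable; the optimizer is
# then built from these parameters alone.
trainable = {n for n, p in model.named_parameters() if p.requires_grad}
```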

Experimental Setup
We implement our experimental framework using PyTorch (Paszke et al., 2019), Transformers (Wolf et al., 2020), and Scikit-learn (Pedregosa et al., 2011). We evaluate our methods using micro-averaged F1. As baseline systems, we compare our results with the majority, random, and weighted random baselines. To reduce noise, we replace numerical tokens with a CC token and normalize English and Bangla sentence stoppers. Due to the class imbalance, we perform a per-topic stratified split to create training (80%), development (10%), and test (10%) sets. While we evaluate all individual features using the same hyper-parameter setting, we tune the SVM regularization parameter C on the validation set for the best performing feature combination. To train the Bi-LSTM model with mini-batches, we left-pad the instances and perform hyper-parameter tuning on the learning rate, batch size, dropout rate, and number of LSTM cells and layers. For fine-tuning mBERT, we tune only the learning rate and batch size.
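The per-topic stratified 80/10/10 split can be sketched as below. This is one plausible reading of the setup: instances are assumed to be (text, topic, label) tuples, each topic is split separately, and stratification is on the class label within each topic.

```python
from sklearn.model_selection import train_test_split

def per_topic_split(instances, seed=42):
    """Split each topic's instances into 80/10/10 train/dev/test,
    stratified by class label within the topic."""
    train, dev, test = [], [], []
    for topic in {t for _, t, _ in instances}:
        subset = [x for x in instances if x[1] == topic]
        tr, rest = train_test_split(
            subset, test_size=0.2,
            stratify=[x[2] for x in subset], random_state=seed)
        # Halve the held-out 20% into development and test portions.
        dv, te = train_test_split(
            rest, test_size=0.5,
            stratify=[x[2] for x in rest], random_state=seed)
        train += tr
        dev += dv
        test += te
    return train, dev, test
```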

Results and Analysis
We report our experimental results on the test set in Table 3. The majority baseline achieves a 41.24 F1 score by assigning the dominant label (+ve) to every instance, which is better than the random baselines (34.53 and 32.60). Among the word n-grams, we observe better performance with unigrams (63.19) compared to bigrams (59.68) and trigrams (55.56). Combining bigrams with unigrams lifts the unigram F1 by 1.25 (to 64.44), but adding trigrams to that combination reduces the rate of improvement, and we achieve 63.60 F1. We observe similar classification performance with the character n-grams. While character 3-, 4-, and 5-grams perform around 3-4% higher than character bigrams, the difference among their F1 scores is small. Surprisingly, different combinations of the character n-grams do not show significantly higher gains: combining all character n-grams yields a small gain of 0.44 over the strongest character 5-gram feature, and we do not observe any significant shift in the precision and recall scores for character n-gram combinations. This implies that the task depends heavily on word units and does not rely much on subword-level information. Integrating all word n-grams with all character n-grams achieves the best F1 of 64.61, and improves both precision and recall. The embedding feature demonstrates poor performance (F1=56.46), and combining it with the lexical features does not show any improvement.

According to our results, linguistic feature combinations perform better than the neural models on our dataset. Although the Bi-LSTM model's precision is close to that of the lexical feature combination, its recall is ≈8% lower (64.97 vs 73.39). We observe that mBERT's performance (F1=52.79) is significantly lower than the Bi-LSTM model's. There are two possible reasons for this: a) mBERT's training data is compiled from formal Bangla text from Wikipedia, whereas our dataset contains informal and noisy Bangla texts, and b) fine-tuning only the output layer leaves mBERT under-trained for the task. To test the first hypothesis, we randomly sample 100 instances from the training and validation sets and manually translate them into formal Bangla. We then perform few-shot experiments on mBERT with different train-test combinations of the formal and informal versions. Although the dataset for this experiment is very small, the results in Table 4 indicate that the first hypothesis does not hold. If it were true, we would observe the best performance when both training and test sets are made of formal texts, but the results are quite the opposite: the best F1 is achieved when the training material is formal text and the test set is informal text. This suggests that fine-tuning only the output layer of mBERT probably leaves the model under-trained for this task. However, the poorer performance of FastText embeddings (pre-trained on Wikipedia) compared to random embeddings in the Bi-LSTM model adds some support to the first hypothesis. In the future, we plan to investigate further in this direction. Performance by Topic Analysing the results per topic and per class in Table 5, we find that the F1 difference between the +ve and -ve classes is small (78.99 vs 76.29), but an F1 of 42.25 indicates that Neutral samples are the hardest to identify.
F1 for the Negative class is comparatively higher for topics like Politics and Economy, as ideological conflicts are mostly responsible for negativity in these topics. Additionally, we find that people tend to speak more about their negative experiences with Food, Travel, and Tech products, and our approach shows higher recall on these topics. Interestingly, +ve instances are harder to identify for Tech. Although we have a very small amount of data for Education, Fashion, and Agriculture, the +ve class's performance is significantly higher for these topics. Table 6 shows some of the strongest n-gram features for each class. We observe that n-grams expressing strong positive emotions and compliments, mostly adjectives, act as indicators of the positive class. On the other hand, negative samples are often associated with police, crime, lack of trust in the judicial system, and slang. The strongest n-grams for the neutral class are mostly nouns or factual information.

Dominant Features
We notice that many of the strongest n-grams are misspelled. Therefore, we believe pre-processing techniques like spell correction and word segmentation can help normalize such noise and improve performance.

Conclusion
In this paper, we present SentNoB, a dataset for analysing sentiment in noisy Bangla texts collected from the comments sections of Bangla news and videos from 13 different domains. SentNoB contains ≈15K instances labeled with a positive, negative, or neutral class label. We found that lexical feature combinations demonstrate stronger classification performance than neural models. As future work, we will focus on different pre-processing techniques and further investigation of pre-trained language models.

References
Md Akhter-Uz-Zaman Ashik, Shahriar Shovon, and Summit Haque. 2019. Data set for sentiment analysis on Bengali news comments and its baseline evaluation.