BanglaBook: A Large-scale Bangla Dataset for Sentiment Analysis from Book Reviews

The analysis of consumer sentiment, as expressed through reviews, can provide a wealth of insight regarding the quality of a product. While sentiment analysis has been widely explored in many popular languages, relatively little attention has been given to the Bangla language, mostly due to a lack of relevant data and cross-domain adaptability. To address this limitation, we present BanglaBook, a large-scale dataset of Bangla book reviews consisting of 158,065 samples classified into three broad categories: positive, negative, and neutral. We provide a detailed statistical analysis of the dataset and employ a range of machine learning models to establish baselines, including SVM, LSTM, and Bangla-BERT. Our findings demonstrate a substantial performance advantage of pre-trained models over models that rely on manually crafted features, emphasizing the necessity for additional training resources in this domain. Additionally, we conduct an in-depth error analysis by examining sentiment unigrams, which may provide insight into common classification errors in under-resourced languages like Bangla. Our code and data are publicly available at https://github.com/mohsinulkabir14/BanglaBook.


Introduction
The resources publicly available for scholarly investigation in the realm of Sentiment Analysis (SA) for the Bangla language are scarce and limited in quantity (Khatun and Rabeya, 2022; Sazzed, 2021; Rahman et al., 2019) despite its literary gravitas as the 6th most spoken language in the world, with approximately 200 million speakers. In the existing literature on Bangla text SA, as shown in Table 5, the largest dataset consists of 20,468 samples (Islam et al., 2022), while the smallest has a mere 1,050 samples (Tabassum and Khan, 2019). Besides these, Islam et al. (2020) created a dataset consisting of 17,852 samples and Islam et al. (2021) utilized a dataset of 15,728 samples. All other datasets either have fewer than 15,000 samples or are publicly unavailable. Another limitation of the existing research in Bangla text SA is the deficiency of datasets with product-specific review samples. Most of the available Bangla SA datasets focus on user-generated textual content from cyberspace. The insights derived from these may not accurately represent sentiment in the context of product reviews, thus hindering their usefulness for businesses. The tonal and linguistic analysis of reviews from product-specific datasets can help businesses gain valuable insights into customer attitudes, preferences, and experiences, which can then be leveraged to improve products and services, design targeted marketing campaigns, and make more informed business decisions. In this paper, we introduce a large-scale dataset, BANGLABOOK, consisting of 158,065 samples of book reviews written in the Bangla language and collected from online bookshops. To the best of our knowledge, this is the largest dataset for Bangla sentiment analysis. We analyze the dataset's statistical characteristics, employ various ML techniques to establish a performance benchmark for validating the dataset, and conduct a thorough evaluation of the classification errors.

Dataset Construction
In order to create this dataset, we collect a total of 204,659 book reviews from two online bookshops (Rokomari and Wafilife) using a web scraper developed with several Python libraries, including BeautifulSoup, Selenium, Pandas, Openpyxl, and Webdriver, to collect and process the raw data.
For the data collection and preparation process of the BANGLABOOK dataset, we first compile a list of author URLs from the online bookstores. From there, we procure URLs for the books. Using these book URLs, we meticulously scrape information such as book titles, author names, book categories, review texts, reviewer names, review dates, and ratings.
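As a rough illustration of this scraping step, the sketch below parses review fields out of a fetched book page with BeautifulSoup. The CSS selectors (`div.review`, `span.rating`, etc.) are hypothetical placeholders, not the actual markup of Rokomari or Wafilife, which would need to be inspected separately.

```python
from bs4 import BeautifulSoup

def parse_reviews(html):
    """Extract reviewer, rating, and review text from a book page.

    NOTE: the CSS class names used here are illustrative assumptions;
    the real selectors depend on each bookshop's HTML structure.
    """
    soup = BeautifulSoup(html, "html.parser")
    reviews = []
    for node in soup.select("div.review"):
        reviews.append({
            "reviewer": node.select_one("span.reviewer").get_text(strip=True),
            "rating": int(node.select_one("span.rating").get_text(strip=True)),
            "text": node.select_one("p.review-text").get_text(strip=True),
        })
    return reviews
```

In practice, pages rendered with JavaScript would be fetched through Selenium first, and the parsed rows accumulated into a Pandas DataFrame before export.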

Labeling & Translation
If a review does not have a rating, we deem it unannotated. Reviews with a rating of 1 or 2 are classified as negative, a rating of 3 is considered neutral, and a rating of 4 or 5 is classified as positive. Two manual experiments are carried out to validate the use of ratings as a measure of sentiment in product reviews. In the first experiment, around 10% of the reviews are randomly selected and annotated manually. The annotated labels are cross-checked with the original labels, resulting in a 96.7% accuracy in the corresponding labels. In addition, we consult the work of Wang et al. (2020), which explored the issue of incongruous sentiment expressions with regard to ratings. Specifically, that study scrutinized two categories of reviews: high ratings lacking a positive sentiment, and low ratings lacking a negative sentiment. We perform an analysis to identify such inconsistencies within our dataset and discover that only a minuscule 3.41% of the samples exhibit this pattern. This figure is relatively insignificant considering the substantially large scale of our dataset. After discarding the unannotated reviews, we curate a final dataset of 158,065 annotated reviews. Of these, 89,371 are written entirely in Bangla.
The remaining 68,694 reviews were written in Romanized Bangla, English, or a mix of languages.
They are translated into Bangla with Google Translate via a custom Python program using the googletrans library. The translations are subsequently reviewed manually to confirm their accuracy. The majority of inaccurate translations primarily comprise spelling errors and instances where English words remain untranslated within samples containing a mix of Bangla and English text. The evaluation of translated samples involves a thorough assessment by postgraduate native Bangla speakers, who critically compare the translated text against the original untranslated text to ascertain the correctness of the translation.
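The rating-to-label scheme described above can be expressed as a small helper function; this is a sketch of the rule as stated, not the exact preprocessing code from our repository:

```python
def label_from_rating(rating):
    """Map a 1-5 star rating to a sentiment label.

    Reviews without a rating are treated as unannotated and are
    discarded from the final dataset, per the scheme above.
    """
    if rating is None:
        return None  # unannotated
    if rating <= 2:
        return "negative"
    if rating == 3:
        return "neutral"
    return "positive"  # ratings of 4 or 5
```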

Statistical Analysis
Tables 1 and 2 summarize the statistics of the dataset, and Figure 1 shows the number of reviews that correspond to each rating on a scale of 1 to 5. Upon analyzing the sentiment chart, it appears that the majority of the reviews (124,084 + 17,503 = 141,587 samples) are positive, while a much smaller portion are negative (2,728 + 6,946 = 9,674 samples). A relatively small fraction of the reviews are neutral (6,804 samples). This suggests that, overall, the books have been well received by the readers, with the majority expressing favorable opinions. The distribution of the dataset is representative of real-world scenarios and aligns well with previous content-analysis work on book reviews (Lin et al., 2005; Sorensen and Rasmussen, 2004). Figure 2 illustrates the sentiment distribution among the 5 most frequently reviewed categories of books. We can gain some salient insights from the popularity of these genres. Contemporary novels are bestsellers as they reflect current events, social issues, and trends, making them relatable and thought-provoking for readers, while self-help and religious books provide guidance, inspiration, and a sense of purpose, catering to individuals' quest for personal growth and spiritual fulfillment.

Developing a Benchmark for BANGLABOOK
A series of baseline models with combinations of different lexical and semantic features is chosen to evaluate the BANGLABOOK dataset. This section provides an overview of the models, the evaluation metrics, the results, and an analysis of the experimental findings.

Baseline Models & Features
For the lexical features, we extract bag-of-words (BoW), char n-grams (1-3), and word n-grams (1-3) from the reviews, as these representations have performed well in different classification tasks (Islam et al., 2022). After extraction, the features are vectorized using TF-IDF and a count vectorizer and trained on a series of ML models such as Random Forest (Breiman, 2001), XGBoost (Chen and Guestrin, 2016), linear SVM (Cortes and Vapnik, 1995), Logistic Regression (le Cessie and van Houwelingen, 1992), and Multinomial Naive Bayes (John and Langley, 1995). We choose LSTM (Hochreiter and Schmidhuber, 1997) with GloVe (Pennington et al., 2014) embeddings for its ability to capture context along with recent dependencies. Due to the recent success of BERT (Devlin et al., 2019) in various downstream NLP tasks, we also fine-tune two available transformer-based models for Bangla: Bangla-BERT (base-uncased, 110M parameters) (Sarker, 2020) and Bangla-BERT (large, 2.5B parameters) (Bhattacharjee et al., 2022). We select the F1-score and the weighted average F1-score to evaluate the models because the dataset has an uneven class distribution. The F1-score is the harmonic mean of precision and recall, and it helps balance the metric across imbalanced positive/negative samples (Sokolova et al., 2006). All our experiments are run on Google Colaboratory using scikit-learn, pytorch, and transformers (Vaswani et al., 2017). The training, testing, and validation split of the entire dataset is 70-20-10, with previously unseen samples in the test and validation sets. To summarize, pre-trained models (i.e., Bangla-BERT), which are trained on extensive corpora and thereby exposed to broad general language knowledge, achieve significantly superior classification performance compared to the other models and word embeddings. Models trained on handcrafted features also perform reasonably well. It should be noted that Bangla pre-trained models are still under development, and further training on expansive corpora has the potential to enhance their ability to generalize and achieve even more impressive results.
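As an illustration of the feature-based baselines, the sketch below builds a TF-IDF char n-gram pipeline with a linear SVM in scikit-learn and scores it with the weighted F1-score. The toy reviews and the hyperparameters (e.g., `C=10`) are illustrative assumptions, not the exact settings behind the reported results.

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.metrics import f1_score

def build_baseline():
    # TF-IDF over character 1-3-grams feeding a linear SVM;
    # C=10 is an illustrative choice that favors fitting the data.
    return make_pipeline(
        TfidfVectorizer(analyzer="char", ngram_range=(1, 3)),
        LinearSVC(C=10),
    )

# Tiny synthetic Bangla reviews purely for demonstration.
texts = ["বইটি খুব ভালো", "বইটি ভালো লেগেছে", "বইটি ভালো না", "একদম ভালো না"]
labels = ["positive", "positive", "negative", "negative"]

clf = build_baseline().fit(texts, labels)
preds = clf.predict(texts)
# Weighted F1 averages per-class F1 by class support, which is why it
# suits the uneven class distribution of the real dataset.
print(f1_score(labels, preds, average="weighted"))
```

The same pipeline shape applies to the word n-gram and BoW variants by changing the `analyzer` and `ngram_range` arguments, or by swapping `TfidfVectorizer` for `CountVectorizer`.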

Error Analysis
In the 'Positive' class, all the models produce excellent classification results. While some models perform reasonably well on the 'Negative' class, nearly all of them perform poorly on the 'Neutral' class. The class imbalance of the dataset, as shown in Figure 1, is one obvious cause of this fluctuation. The confusion matrix for Bangla-BERT on our dataset, presented in Figure 3, reveals that most of the 'Negative' and 'Neutral' samples are misclassified as 'Positive' by our classifiers. To further analyze the misclassifications, we examine the word unigrams (W1) of these three classes. We find 124,796 unique W1 for the 'Positive' class, 20,714 for the 'Negative' class, and 19,096 for the 'Neutral' class. 77.57% of the W1 from the 'Neutral' class and 79.83% of the W1 from the 'Negative' class are also found in the 'Positive' class. Table 4 depicts the most frequent W1 conveying the strongest sentiments in each class. With only one distinct 'Neutral' W1, and even the 'Negative' class containing multiple positive W1, the dominance of 'Positive' sentiment W1 over the other two classes is evident. This may have contributed to the lack of distinctive words in the 'Negative' and 'Neutral' classes, which inevitably prevented the feature-based models from generalizing.
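The unigram-overlap analysis above can be sketched as a small function that collects each class's unique word unigrams and measures what fraction of them also appear in the 'Positive' vocabulary; this is a simplified illustration on whitespace-tokenized text, not the exact analysis script.

```python
from collections import defaultdict

def overlap_with_positive(samples):
    """Given (label, text) pairs, return, for each non-positive class,
    the fraction of its unique word unigrams that also occur in the
    'positive' class's vocabulary."""
    vocab = defaultdict(set)
    for label, text in samples:
        vocab[label].update(text.split())  # naive whitespace tokenization
    positive_vocab = vocab["positive"]
    return {
        label: len(words & positive_vocab) / len(words)
        for label, words in vocab.items()
        if label != "positive"
    }
```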

Morphology and Negation Patterns of Bangla
Understanding the morphology and negation patterns of a language is of paramount importance for sentiment analysis because negation can alter the meaning of words and phrases, thereby affecting the overall sentiment conveyed by a text. We provide a concise yet insightful recapitulation of the topic for Bangla, accompanied by review samples from our dataset BANGLABOOK as examples. From a linguistic typological standpoint, Bangla is categorized as a subject-object-verb (SOV) language because the subject, object, and verb generally adhere to that order in its sentential structure (Ramchand, 2004). The most common reversal of polarity from positive to negative is the use of ni (িন) as a tensed negative. The Bangla language has no negative adverbs or pronouns (Thompson, 2006). This is why the negative element responsible for the reversal of polarity transcends the word level to the sentence level, rendering almost all negations in Bangla manifest at the syntactic level (Thompson, 2006).
In cases of double negatives, we see the involvement of lexical negation, a morphological feature that works with negative affixes (prefixes and suffixes) attached to a root word. Prefixes in Bangla have two different phonetic variations, or allophones, depending on whether the prefix precedes a vowel or a consonant. The same is true for prefixes that imbue a negative connotation to a root word, e.g., o (অ) and on (অন্ ). For example, িকন্তু এই বইিট এই অপূ ণর্ তা েঢেক েফেলেছ। Translation: But this book has covered up this incompleteness.
Another negative prefix that precedes a root word to invert its polarity is nir (িনর্ ).For example,

েলখেকর িনরলস শৰ্ম েলখায় ফু েট উেঠেছ।
Translation: The relentless effort of the author is reflected in the writing.
On the contrary, the suffix hin (হীন) succeeds a root word to convert it to the corresponding negative form. For example, এরকম িভিত্তহীন কাল্পিনক গল্প িশশুেদর না পড়াই ভােলা। Translation: It is better for children not to read such baseless fictional stories.
The expression of negative sentiment is, therefore, very nuanced in the Bangla language, as every occurrence of negation is intertwined with features like tense, the hierarchy of syntax, verb status, case-specific issues, and the sequential arrangement of words (Thompson, 2006).
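A naive surface-level detector for the negation markers discussed above might look as follows. It is purely illustrative: these affix lists are incomplete, the substrings also occur in words that are not negated at all, and, as noted above, real Bangla negation operates at the syntactic level and would require morphological analysis.

```python
# Illustrative, incomplete lists of negation markers discussed above,
# written in logical Unicode order (অ / অন / নির prefixes, হীন suffix,
# and the tensed negative particle নি).
NEG_PREFIXES = ("অ", "অন", "নির")
NEG_SUFFIXES = ("হীন",)
NEG_PARTICLE = "নি"

def has_surface_negation(token):
    """Flag a token carrying one of the surface negation markers.

    This over-triggers badly (e.g., many words begin with অ without
    being negated), so it is a sketch of the idea, not a usable rule.
    """
    return (
        token.endswith(NEG_SUFFIXES)
        or token.endswith(NEG_PARTICLE)
        or token.startswith(NEG_PREFIXES)
    )
```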

Conclusion
This paper introduces BANGLABOOK, the largest Bangla book review dataset, with 158,065 samples, each labeled with one of three user sentiments. We provide extensive statistical analysis and strong baselines demonstrating the utility of the dataset. Given its massive size and fine-grained sentiment distribution, BANGLABOOK has the potential to alleviate the resource scarcity in Bangla language research.

Limitations
Many of the reviews gathered for constructing BANGLABOOK are discarded because they lack a corresponding rating. A manual annotation process would have yielded a much larger dataset, but this was not feasible due to resource constraints. Moreover, one of the challenges in validating the dataset is the lack of statistical models and word embeddings pre-trained on the Bangla language. Some pre-trained Bangla-BERT models, yet to be trained on extensive corpora, have only recently been proposed.

Table 1 :
Summary statistics of our dataset. Bangla† denotes Romanized Bangla text.

Table 3 :
Category-wise Binary Task F1-score and Weighted Average F1-score of each method on BANGLABOOK.

Table 4 :
Most frequent word unigrams conveying the strongest sentiments of each class, with English translations. The colors respectively denote Positive, Neutral, and Negative sentiments.

Improving transformer-based models for Bangla can enhance sub-word-level contextual understanding, which will consequently help in more accurate identification of the sentiments in BANGLABOOK (Islam et al., 2022).