Abusive text detection in low-resource languages such as Bengali is a challenging task due to the inadequacy of resources and tools. The ubiquity of transliterated Bengali comments in social media makes the task even more involved, as monolingual approaches cannot capture them. Unfortunately, no transliterated Bengali corpus is publicly available yet for abusive content analysis. Therefore, in this paper, we introduce an annotated corpus of 3000 transliterated Bengali comments categorized into two classes, abusive and non-abusive, with 1500 comments in each. For baseline evaluations, we employ several supervised machine learning (ML) and deep learning-based classifiers. We find that the support vector machine (SVM) shows the highest efficacy for identifying abusive content. We make the annotated corpus freely available to researchers to aid abusive content detection in Bengali social media data.
A Hybrid Approach of Opinion Mining and Comparative Linguistic Analysis of Restaurant Reviews
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)
The existing research on sentiment analysis has mainly utilized data curated from a limited set of geographical regions and demographics (e.g., USA, UK, China) due to commercial interest and the availability of review data. Since users’ attitudes and preferences can be affected by numerous sociocultural factors and demographic characteristics, it is necessary to have annotated review datasets that represent various demographics. In this work, we first construct a review dataset, BanglaRestaurant, that contains over 2300 customer reviews of a number of Bangladeshi restaurants. Then, we present a hybrid methodology that yields improvement over the best performing lexicon-based and machine learning (ML) based classifiers without using any labeled data. Finally, we investigate how the demography (i.e., geography and nativeness in English) of users affects the linguistic characteristics of the reviews by contrasting two datasets, BanglaRestaurant and Yelp. The comparative results demonstrate the efficacy of the proposed hybrid approach. The data analysis reveals that demography plays an influential role in the linguistic aspects of reviews.
Bengali is a low-resource language that lacks tools and resources for profane and obscene textual content detection. To date, no lexicon exists for detecting obscenity in Bengali social media text. This study introduces a Bengali obscene lexicon consisting of over 200 Bengali terms that can be considered filthy, slang, profane or obscene. A semi-automatic methodology is presented for developing the profane lexicon that leverages an obscene corpus, word embedding, and part-of-speech (POS) taggers. The developed lexicon achieves coverage of around 0.85 for obscene and profane content detection in an evaluation dataset. The experimental results imply that the developed lexicon is effective at identifying obscenity in Bengali social media content.
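The abstract reports lexicon coverage without defining it. One plausible reading, sketched below, is the fraction of obscene texts that contain at least one lexicon term. This definition, the whitespace tokenization, the function name `lexicon_coverage`, and the placeholder terms are all assumptions for illustration, not the paper's actual procedure.

```python
def lexicon_coverage(texts, lexicon):
    """Return the fraction of texts containing at least one lexicon term.

    Assumption: coverage is measured per text, and tokens are obtained by
    simple whitespace splitting. The paper may define both differently.
    """
    if not texts:
        raise ValueError("texts must not be empty")
    terms = {term.lower() for term in lexicon}
    # A text counts as covered when any of its tokens is in the lexicon.
    covered = sum(
        1 for text in texts
        if any(token in terms for token in text.lower().split())
    )
    return covered / len(texts)


# Placeholder tokens stand in for real lexicon entries.
sample_lexicon = {"badword1", "badword2"}
sample_texts = [
    "this has badword1 here",
    "clean looking text",
    "badword2 again",
    "another badword1",
]
coverage = lexicon_coverage(sample_texts, sample_lexicon)
print(f"coverage = {coverage:.2f}")
```

Under this reading, a coverage of around 0.85 would mean the lexicon flags roughly 85 of every 100 obscene texts in the evaluation set.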
Sentiment analysis research in low-resource languages such as Bengali remains underexplored due to the scarcity of annotated data and the lack of text processing tools. Therefore, in this work, we focus on generating resources and showing the applicability of the cross-lingual sentiment analysis approach in Bengali. For benchmarking, we created and annotated a comprehensive corpus of around 12000 Bengali reviews. To address the lack of standard text-processing tools in Bengali, we leverage resources from English utilizing machine translation. We determine the performance of supervised machine learning (ML) classifiers on the machine-translated English corpus and compare it with their performance on the original Bengali corpus. In addition, we examine sentiment preservation in the machine-translated corpus utilizing Cohen’s Kappa and Gwet’s AC1. To circumvent the laborious data labeling process, we explore lexicon-based methods and study the applicability of utilizing cross-domain labeled data from the resource-rich language. We find that supervised ML classifiers show comparable performances on the Bengali and machine-translated English corpora. By utilizing labeled data, they achieve 15%-20% higher F1 scores compared to both lexicon-based and transfer learning-based methods. We also observe that machine translation does not alter the sentiment polarity of the review in most cases. Our experimental results demonstrate that the machine translation based cross-lingual approach can be an effective approach to sentiment classification in Bengali.
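As a rough illustration of how sentiment preservation can be quantified, the sketch below computes Cohen's Kappa and Gwet's AC1 between two label sequences, for example labels assigned to the original reviews and to their machine translations. It uses the standard two-rater formulas; the function name `agreement_stats` and the toy labels are hypothetical and do not come from the paper.

```python
from collections import Counter


def agreement_stats(labels_a, labels_b):
    """Return (Cohen's kappa, Gwet's AC1) for two equal-length label lists."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    categories = sorted(set(labels_a) | set(labels_b))
    k = len(categories)
    if k < 2:
        raise ValueError("at least two categories are required")

    # Observed agreement: share of items given the same label by both raters.
    p_obs = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    count_a, count_b = Counter(labels_a), Counter(labels_b)

    # Cohen's chance agreement: product of the two raters' marginals.
    pe_kappa = sum((count_a[c] / n) * (count_b[c] / n) for c in categories)

    # Gwet's chance agreement: based on the average marginal per category.
    pi = {c: (count_a[c] + count_b[c]) / (2 * n) for c in categories}
    pe_ac1 = sum(pi[c] * (1 - pi[c]) for c in categories) / (k - 1)

    kappa = (p_obs - pe_kappa) / (1 - pe_kappa)
    ac1 = (p_obs - pe_ac1) / (1 - pe_ac1)
    return kappa, ac1


# Toy labels: ten reviews, one of which changes polarity after translation.
original = ["pos", "pos", "neg", "neg", "pos", "neg", "pos", "neg", "pos", "pos"]
translated = ["pos", "pos", "neg", "neg", "pos", "neg", "pos", "neg", "pos", "neg"]
kappa, ac1 = agreement_stats(original, translated)
print(f"kappa = {kappa:.3f}, AC1 = {ac1:.3f}")
```

Reporting both statistics is a common choice because Cohen's Kappa can drop sharply when the class distribution is skewed, while AC1 is less sensitive to that effect.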