Discourse modes help in understanding the conventions and purposes of the various forms of language used in communication. In this study, we introduce a discourse mode annotated corpus for the low-resource Bangla (also referred to as Bengali) language. The corpus consists of sentence-level annotations of three discourse modes (narrative, descriptive, and informative) for text excerpted from a number of Bangla novels. We analyze the annotated corpus to expose various linguistic aspects of discourse modes, such as class distributions and average sentence lengths. To automatically determine the mode of discourse, we apply classical machine learning (CML) classifiers with n-gram based statistical features and a fine-tuned BERT (Bidirectional Encoder Representations from Transformers) based language model. We observe that the fine-tuned BERT-based approach yields more promising results than the n-gram based CML classifiers. Our discourse mode annotated dataset, the first of its kind in Bangla, and the accompanying evaluation provide baselines for automatic discourse mode identification in Bangla and can assist various downstream natural language processing tasks.
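To make the notion of n-gram based statistical features concrete, here is a minimal sketch of word n-gram counting. The helper name `word_ngrams`, the whitespace tokenization, and the choice of unigrams plus bigrams are assumptions for illustration; the abstract does not specify the exact feature setup, and the example sentence is an English placeholder.

```python
from collections import Counter


def word_ngrams(sentence, n_values=(1, 2)):
    """Count word n-grams in a sentence (hypothetical helper, whitespace tokenization)."""
    tokens = sentence.split()
    features = Counter()
    for n in n_values:
        # Slide a window of size n over the token list.
        for i in range(len(tokens) - n + 1):
            features[" ".join(tokens[i:i + n])] += 1
    return features


# Placeholder sentence; a real pipeline would use Bangla sentences from the corpus.
features = word_ngrams("the river was calm")
```

Counts like these could then be fed, typically after some weighting such as TF-IDF, to a classical classifier.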
Recognizing biomedical entities in text is significant for biomedical and health science research, as it benefits myriad downstream tasks, including entity linking, relation extraction, and entity resolution. While English and a few other widely used languages enjoy ample resources for automatic biomedical entity recognition, this is not the case for Bangla, a low-resource language. On that account, in this paper, we introduce BanglaBioMed, a Bangla biomedical named entity (NE) annotated dataset in standard IOB format, the first of its kind, consisting of over 12000 tokens annotated with biomedical entities. The corpus is created by collecting Bangla text from a list of health articles and annotating it with four distinct types of entities: Anatomy (AN), Chemical and Drugs (CD), Disease and Symptom (DS), and Medical Procedure (MP). We provide the details of the entire data collection and annotation procedure and illustrate various statistics of the created corpus. Our corpus is a much-needed addition to Bangla NLP resources that will facilitate biomedical NLP research in Bangla.
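As an illustration of the IOB format mentioned above, the following minimal sketch groups token-level tags into entity spans. Only the four tag names (AN, CD, DS, MP) come from the abstract; the helper `extract_entities`, the English placeholder tokens, and the policy for stray `I-` tags are assumptions made for this example.

```python
def extract_entities(tokens, tags):
    """Group IOB-tagged tokens into (entity text, entity type) pairs (hypothetical helper)."""
    entities = []
    current_tokens, current_type = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new entity; close any open one first.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # An I- tag of the same type continues the open entity.
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not continue the open entity
            # (assumption: such stray tags are ignored); close the open entity.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:
        entities.append((" ".join(current_tokens), current_type))
    return entities


# English placeholder tokens; the actual corpus contains Bangla text.
tokens = ["patient", "has", "high", "fever", "and", "takes", "paracetamol"]
tags = ["O", "O", "B-DS", "I-DS", "O", "O", "B-CD"]
entities = extract_entities(tokens, tags)
```

For the placeholder sentence, the sketch yields one Disease and Symptom span covering two tokens and one Chemical and Drugs span covering one token.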
A Hybrid Approach of Opinion Mining and Comparative Linguistic Analysis of Restaurant Reviews
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)
Existing research on sentiment analysis has mainly used data curated from limited geographical regions and demographics (e.g., USA, UK, China) due to commercial interest and the availability of review data. Since users’ attitudes and preferences can be affected by numerous sociocultural factors and demographic characteristics, it is necessary to have annotated review datasets belonging to various demographics. In this work, we first construct a review dataset, BanglaRestaurant, that contains over 2300 customer reviews of a number of Bangladeshi restaurants. Then, we present a hybrid methodology that yields improvement over the best performing lexicon-based and machine learning (ML) based classifiers without using any labeled data. Finally, we investigate how the demography (i.e., geography and nativeness in English) of users affects the linguistic characteristics of the reviews by contrasting two datasets, BanglaRestaurant and Yelp. The comparative results demonstrate the efficacy of the proposed hybrid approach. The data analysis reveals that demography plays an influential role in the linguistic aspects of reviews.
Bengali is a low-resource language that lacks tools and resources for detecting profane and obscene textual content. To date, no lexicon exists for detecting obscenity in Bengali social media text. This study introduces a Bengali obscene lexicon consisting of over 200 Bengali terms that can be considered filthy, slang, profane, or obscene. A semi-automatic methodology is presented for developing the profane lexicon that leverages an obscene corpus, word embedding, and part-of-speech (POS) taggers. The developed lexicon achieves coverage of around 0.85 for obscene and profane content detection in an evaluation dataset. The experimental results suggest that the developed lexicon is effective at identifying obscenity in Bengali social media content.
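The abstract does not define how coverage is computed. One plausible reading, sketched below as an assumption, is the fraction of obscene texts in an evaluation set that contain at least one lexicon term. The helper `lexicon_coverage`, the whitespace tokenization, and the placeholder terms are all hypothetical.

```python
def lexicon_coverage(texts, lexicon):
    """Fraction of texts containing at least one lexicon term (assumed definition)."""
    if not texts:
        return 0.0
    hits = sum(
        1 for text in texts
        if any(token in lexicon for token in text.split())
    )
    return hits / len(texts)


# Placeholder terms and texts; these stand in for Bengali lexicon entries and comments.
lexicon = {"badword1", "badword2"}
obscene_texts = [
    "this contains badword1 here",
    "another badword2 example",
    "obscene but not matched by the lexicon",
]
coverage = lexicon_coverage(obscene_texts, lexicon)
```

In this toy example, two of the three texts contain a lexicon term, so the lexicon covers two thirds of the set.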
Abusive text detection in low-resource languages such as Bengali is a challenging task due to the inadequacy of resources and tools. The ubiquity of transliterated Bengali comments in social media makes the task even more involved, as monolingual approaches cannot capture them. Unfortunately, no transliterated Bengali corpus is publicly available yet for abusive content analysis. Therefore, in this paper, we introduce an annotated corpus of 3000 transliterated Bengali comments categorized into two classes, abusive and non-abusive, with 1500 comments in each. For baseline evaluations, we employ several supervised machine learning (ML) and deep learning-based classifiers. We find that the support vector machine (SVM) shows the highest efficacy for identifying abusive content. We make the annotated corpus freely available to researchers to aid abusive content detection in Bengali social media data.
Sentiment analysis research in low-resource languages such as Bengali is still unexplored due to the scarcity of annotated data and the lack of text processing tools. Therefore, in this work, we focus on generating resources and showing the applicability of the cross-lingual sentiment analysis approach in Bengali. For benchmarking, we created and annotated a comprehensive corpus of around 12000 Bengali reviews. To address the lack of standard text-processing tools in Bengali, we leverage resources from English using machine translation. We determine the performance of supervised machine learning (ML) classifiers on the machine-translated English corpus and compare it with their performance on the original Bengali corpus. In addition, we examine sentiment preservation in the machine-translated corpus using Cohen’s Kappa and Gwet’s AC1. To circumvent the laborious data labeling process, we explore lexicon-based methods and study the applicability of cross-domain labeled data from the resource-rich language. We find that supervised ML classifiers show comparable performance on the Bengali and machine-translated English corpora. By utilizing labeled data, they achieve 15%-20% higher F1 scores compared to both lexicon-based and transfer learning-based methods. We also observe that machine translation does not alter the sentiment polarity of the review in most cases. Our experimental results demonstrate that the machine translation based cross-lingual approach can be an effective way to perform sentiment classification in Bengali.
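To illustrate how sentiment preservation could be quantified, the following minimal sketch computes Cohen's Kappa and Gwet's AC1 between two label sequences using their standard two-rater formulas. The function `agreement_scores`, the label names, and the toy data are assumptions for illustration; the abstract does not describe the exact setup, for example which labels were compared.

```python
from collections import Counter


def agreement_scores(labels_a, labels_b):
    """Return (Cohen's kappa, Gwet's AC1) for two equal-length label sequences."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    categories = sorted(set(labels_a) | set(labels_b))
    if len(categories) < 2:
        # Only one category ever used: agreement is trivially perfect.
        return 1.0, 1.0

    # Observed agreement: share of items with identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)

    # Cohen's kappa: chance agreement from the product of per-rater marginals.
    p_e_kappa = sum((count_a[c] / n) * (count_b[c] / n) for c in categories)
    kappa = 1.0 if p_e_kappa == 1 else (p_o - p_e_kappa) / (1 - p_e_kappa)

    # Gwet's AC1: chance agreement from the averaged marginals pi_c.
    pi = {c: (count_a[c] + count_b[c]) / (2 * n) for c in categories}
    p_e_ac1 = sum(p * (1 - p) for p in pi.values()) / (len(categories) - 1)
    ac1 = (p_o - p_e_ac1) / (1 - p_e_ac1)
    return kappa, ac1


# Toy data: hypothetical polarity labels for original and machine-translated reviews.
original = ["pos", "pos", "neg", "neg", "pos", "neg"]
translated = ["pos", "pos", "neg", "pos", "pos", "neg"]
kappa, ac1 = agreement_scores(original, translated)
```

Reporting both statistics is a common choice because Cohen's Kappa can drop sharply when one class dominates, while AC1 is less sensitive to skewed marginals.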