2023
BanLemma: A Word Formation Dependent Rule and Dictionary Based Bangla Lemmatizer
Sadia Afrin | Md. Shahad Mahmud Chowdhury | Md. Islam | Faisal Khan | Labib Chowdhury | Md. Mahtab | Nazifa Chowdhury | Massud Forkan | Neelima Kundu | Hakim Arif | Mohammad Mamun Or Rashid | Mohammad Amin | Nabeel Mohammed
Findings of the Association for Computational Linguistics: EMNLP 2023
Lemmatization holds significance in both natural language processing (NLP) and linguistics, as it effectively decreases data density and aids in comprehending contextual meaning. However, because Bangla is highly inflected and morphologically rich, lemmatizing Bangla text poses a complex challenge. In this study, we propose linguistic rules for lemmatization and utilize a dictionary along with the rules to design a lemmatizer specifically for Bangla. Our system aims to lemmatize words based on their part-of-speech class within a given sentence. Unlike previous rule-based approaches, we analyze the occurrence of suffix markers according to their morpho-syntactic values and then utilize sequences of suffix markers instead of entire suffixes. To develop our rules, we analyze a large corpus of Bangla text from various domains, sources, and time periods to observe the word formation of inflected words. The lemmatizer achieves an accuracy of 96.36% when tested against a test dataset manually annotated by trained linguists and demonstrates competitive performance on three previously published Bangla lemmatization datasets. We are making the code and datasets publicly available at https://github.com/eblict-gigatech/BanLemma in order to contribute to the further advancement of Bangla NLP.
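As a rough illustration of the idea described in the abstract above (part-of-speech dependent rules, a dictionary, and suffix markers stripped as a sequence rather than as one whole suffix), here is a minimal sketch. It is not the BanLemma implementation: the `DICTIONARY` and `MARKERS` tables and the `lemmatize` function are hypothetical toy stand-ins, and the real rule set is far larger.

```python
# Toy, hypothetical data: NOT the BanLemma dictionary or rule set.
DICTIONARY = {"বই", "ঘর", "ছেলে"}

# Hypothetical suffix markers, grouped by part-of-speech class.
MARKERS = {"NOUN": ["গুলো", "দের", "রা", "কে", "র"]}


def lemmatize(word: str, pos: str) -> str:
    """Strip suffix markers one at a time until a dictionary word is reached."""
    current = word
    while current not in DICTIONARY:
        for marker in MARKERS.get(pos, []):
            # Never strip a marker that would leave an empty string.
            if current.endswith(marker) and len(current) > len(marker):
                current = current[: -len(marker)]
                break
        else:
            # No marker matched: fall back to the original surface form.
            return word
    return current


if __name__ == "__main__":
    for token in ["বইগুলোর", "ছেলেদের", "ঘর"]:
        print(token, "->", lemmatize(token, "NOUN"))
```

Checking the dictionary before stripping is what keeps a word such as "ঘর" intact even though it ends in a character that also serves as a marker in this toy table.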
Vio-Lens: A Novel Dataset of Annotated Social Network Posts Leading to Different Forms of Communal Violence and its Evaluation
Sourav Saha | Jahedul Alam Junaed | Maryam Saleki | Arnab Sen Sharma | Mohammad Rashidujjaman Rifat | Mohamed Rahouti | Syed Ishtiaque Ahmed | Nabeel Mohammed | Mohammad Ruhul Amin
Proceedings of the First Workshop on Bangla Language Processing (BLP-2023)
This paper presents a computational approach for creating a dataset on communal violence in the context of Bangladesh and West Bengal, India, together with a benchmark evaluation. In recent years, social media has been used as a weapon by factions of different religions and backgrounds to incite hatred, resulting in physical communal violence and causing death and destruction. To prevent such abusive use of online platforms, we propose a framework for classifying online posts using an adaptive question-based approach. We collected more than 168,000 YouTube comments from a set of manually selected videos known for inciting violence in Bangladesh and West Bengal. Using unsupervised and, later, semi-supervised topic modeling methods on this unstructured data, we discovered the major word clusters to interpret the related topics of peace and violence. Topic words were later used to select 20,142 posts related to peace and violence, of which we annotated a total of 6,046 posts. Finally, we applied different modeling techniques based on linguistic features and sentence transformers to benchmark the labeled dataset, with the best-performing model reaching a macro F1 score of ~71%.
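For readers unfamiliar with the metric reported above, macro F1 is the unweighted mean of the per-class F1 scores, so every class counts equally regardless of how many posts it contains. The sketch below computes it in plain Python; the `macro_f1` function and the labels "a" and "b" are illustrative assumptions, not the paper's evaluation code or label set.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Hypothetical gold and predicted labels, for illustration only.
    gold = ["a", "a", "b", "b"]
    pred = ["a", "b", "b", "b"]
    print(round(macro_f1(gold, pred), 4))
```

In this toy case class "a" has F1 = 2/3 and class "b" has F1 = 4/5, so the macro average is 11/15.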
BLP-2023 Task 1: Violence Inciting Text Detection (VITD)
Sourav Saha | Jahedul Alam Junaed | Maryam Saleki | Mohamed Rahouti | Nabeel Mohammed | Mohammad Ruhul Amin
Proceedings of the First Workshop on Bangla Language Processing (BLP-2023)
We present a comprehensive technical description of the outcome of the BLP shared task on Violence Inciting Text Detection (VITD). In recent years, social media has become a tool for groups of various religions and backgrounds to spread hatred, leading to physical violence with devastating consequences. To address this challenge, the VITD shared task was initiated, aiming to classify the level of violence incitement in various texts. The competition garnered significant interest, with a total of 27 teams consisting of 88 participants successfully submitting their systems to the CodaLab leaderboard. During the post-workshop phase, we received 16 system papers on VITD from those participants. In this paper, we discuss the VITD baseline performance, present an error analysis of the submitted models, and provide a comprehensive summary of the computational techniques applied by the participating teams.
BenCoref: A Multi-Domain Dataset of Nominal Phrases and Pronominal Reference Annotations
Shadman Rohan | Mojammel Hossain | Mohammad Rashid | Nabeel Mohammed
Proceedings of the 17th Linguistic Annotation Workshop (LAW-XVII)
Coreference resolution is a well-studied problem in NLP. While widely studied for English and other resource-rich languages, coreference resolution in Bengali remains largely unexplored due to the absence of relevant datasets. Bengali, being a low-resource language, exhibits greater morphological richness compared to English. In this article, we introduce a new dataset, BenCoref, comprising coreference annotations for Bengali texts gathered from four distinct domains. This relatively small dataset contains 5,200 mention annotations forming 502 mention clusters within 48,569 tokens. We describe the process of creating this dataset and report the performance of multiple models trained using BenCoref. We anticipate that our work will shed some light on the variations in coreference phenomena across multiple domains in Bengali and encourage the development of additional resources for Bengali. Furthermore, we found poor cross-lingual performance in the zero-shot setting from English, highlighting the need for more language-specific resources for this task.