Mohammed Moshiul Hoque


2024

Sandalphon@DravidianLangTech-EACL2024: Hate and Offensive Language Detection in Telugu Code-mixed Text using Transliteration-Augmentation
Nafisa Tabassum | Mosabbir Khan | Shawly Ahsan | Jawad Hossain | Mohammed Moshiul Hoque
Proceedings of the Fourth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages

Hate and offensive language on online platforms pose significant challenges, necessitating automatic detection methods. The problem is especially complex for code-mixed text, which is very common on social media, owing to the cultural nuances of different languages. DravidianLangTech-EACL2024 organized a shared task on detecting hate and offensive language in Telugu. To complete this task, this study investigates the effectiveness of transliteration-augmented datasets for Telugu code-mixed text. We compare the performance of various machine learning (ML), deep learning (DL), and transformer-based models on both the original and augmented datasets. Experimental findings demonstrate the superiority of the transformer models, particularly Telugu-BERT, which achieved the highest F1-score of 0.77 on the augmented dataset and ranked 1st on the leaderboard. The study highlights the potential of transliteration-augmented datasets for improving model performance and suggests further exploration of diverse transliteration options to address real-world scenarios.
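To illustrate the augmentation idea, the sketch below pairs each native-script Telugu comment with a romanized copy under the same label. It is a minimal illustration using the indic-transliteration package; the paper's exact pipeline and choice of transliteration scheme are assumptions here.

```python
# Minimal sketch of transliteration augmentation (assumed pipeline,
# not the authors' exact setup). Requires: pip install indic-transliteration
from indic_transliteration import sanscript
from indic_transliteration.sanscript import transliterate

def augment_with_transliteration(texts, labels):
    """Add a romanized (ITRANS) copy of each Telugu text with the same label."""
    aug_texts, aug_labels = list(texts), list(labels)
    for text, label in zip(texts, labels):
        aug_texts.append(transliterate(text, sanscript.TELUGU, sanscript.ITRANS))
        aug_labels.append(label)
    return aug_texts, aug_labels

texts, labels = ["ఇది ఒక ఉదాహరణ"], ["non-offensive"]  # toy example
print(augment_with_transliteration(texts, labels))
```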

CUET_Binary_Hackers@DravidianLangTech EACL2024: Fake News Detection in Malayalam Language Leveraging Fine-tuned MuRIL BERT
Salman Farsi | Asrarul Eusha | Ariful Islam | Hasan Mesbaul Ali Taher | Jawad Hossain | Shawly Ahsan | Avishek Das | Mohammed Moshiul Hoque
Proceedings of the Fourth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages

Due to technological advancements, various methods have emerged for disseminating news to the masses. The pervasive reach of news, however, has given rise to a significant concern: the proliferation of fake news. In response to this challenge, a shared task at DravidianLangTech EACL2024 was initiated to detect fake news and classify its types in the Malayalam language. The shared task consisted of two sub-tasks: task 1 was a binary classification problem, determining whether a piece of news is fake or not, whereas task 2 was a multi-class classification problem, categorizing news into five distinct levels. Our approach explored various machine learning (RF, SVM, XGBoost, Ensemble), deep learning (BiLSTM, CNN), and transformer-based models (MuRIL, Indic-SBERT, m-BERT, XLM-R, Distil-BERT), emphasizing parameter tuning to enhance overall model performance. As a result, we introduce a fine-tuned MuRIL model that achieved an F1-score of 0.86 in task 1 and 0.5191 in task 2, securing our system the 3rd position in task 1 and the 1st position in task 2. The source code can be found at https://github.com/Salman1804102/DravidianLangTech-EACL-2024-FakeNews.
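For readers unfamiliar with this setup, a minimal fine-tuning sketch using the public google/muril-base-cased checkpoint from Hugging Face is shown below; the hyperparameters and toy data are illustrative, not the tuned values from the paper.

```python
# Sketch: fine-tuning MuRIL for binary fake-news detection with
# Hugging Face Transformers. Hyperparameters are illustrative only.
import torch
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("google/muril-base-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "google/muril-base-cased", num_labels=2)  # 0 = original, 1 = fake

class NewsDataset(torch.utils.data.Dataset):
    def __init__(self, texts, labels):
        self.enc = tokenizer(texts, truncation=True, padding=True, max_length=128)
        self.labels = labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item

train_ds = NewsDataset(["toy Malayalam news text"], [1])  # placeholder data
args = TrainingArguments(output_dir="muril-fakenews", num_train_epochs=3,
                         per_device_train_batch_size=16, learning_rate=2e-5)
Trainer(model=model, args=args, train_dataset=train_ds).train()
```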

Punny_Punctuators@DravidianLangTech-EACL2024: Transformer-based Approach for Detection and Classification of Fake News in Malayalam Social Media Text
Nafisa Tabassum | Sumaiya Aodhora | Rowshon Akter | Jawad Hossain | Shawly Ahsan | Mohammed Moshiul Hoque
Proceedings of the Fourth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages

The alarming rise of fake news on social media poses a significant threat to public discourse and decision-making. While automatic detection of fake news offers a promising solution, research in low-resource languages like Malayalam often falls behind due to limited data and tools. This paper presents the participation of team Punny_Punctuators in the Fake News Detection in Dravidian Languages shared task at DravidianLangTech@EACL 2024, addressing this gap. The shared task comprises two sub-tasks: (1) classifying social media texts as original or fake, and (2) categorizing fake news into five categories. We experimented with various machine learning (ML), deep learning (DL), and transformer-based models, as well as processing techniques such as transliteration. Malayalam-BERT achieved the best performance on both sub-tasks, earning us 2nd place with a macro F1-score of 0.87 on sub-task 1 and 11th place with a macro F1-score of 0.17 on sub-task 2. Our results highlight the potential of transformer models for fake news detection in low-resource languages and pave the way for further research in this crucial area.

CUET_NLP_GoodFellows@DravidianLangTech EACL2024: A Transformer-Based Approach for Detecting Fake News in Dravidian Languages
Md Osama | Kawsar Ahmed | Hasan Mesbaul Ali Taher | Jawad Hossain | Shawly Ahsan | Mohammed Moshiul Hoque
Proceedings of the Fourth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages

In this modern era, many people use Facebook and Twitter, leading to increased information sharing and communication. However, a considerable amount of information on these platforms is misleading or intentionally crafted to deceive users, which is often termed fake news. A shared task on fake news detection in Malayalam, organized by DravidianLangTech@EACL 2024, allowed us to address the challenge of distinguishing between original and fake news content in the Malayalam language. Our approach involves creating an intelligent framework to categorize text as either fake or original. We experimented with various machine learning models, including Logistic Regression, Decision Tree, Random Forest, Multinomial Naive Bayes, SVM, and SGD, and various deep learning models, including CNN, BiLSTM, and BiLSTM + Attention. We also explored Indic-BERT, MuRIL, XLM-R, and m-BERT as transformer-based approaches. Notably, our most successful model, m-BERT, achieved a macro F1-score of 0.85 and ranked 4th in the shared task. This research contributes to combating misinformation in social media news, offering an effective solution for classifying content accurately.

CUET_Binary_Hackers@DravidianLangTech EACL2024: Hate and Offensive Language Detection in Telugu Code-Mixed Text Using Sentence Similarity BERT
Salman Farsi | Asrarul Eusha | Jawad Hossain | Shawly Ahsan | Avishek Das | Mohammed Moshiul Hoque
Proceedings of the Fourth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages

With the continuous evolution of technology and widespread internet access, various social media platforms have gained immense popularity, attracting a vast number of active users globally. However, this surge in online activity has also led to a concerning trend: many individuals resort to posting hateful and offensive comments or posts that publicly target groups or individuals. In response to these challenges, we participated in this shared task. Our approach involved fine-tuning a pre-trained transformer model to discern whether a given text contains offensive content that propagates hatred. We conducted comprehensive experiments, exploring various machine learning (LR, SVM, and Ensemble), deep learning (CNN, BiLSTM, CNN+BiLSTM), and transformer-based models (Indic-SBERT, m-BERT, MuRIL, Distil-BERT, XLM-R), adhering to a meticulous fine-tuning methodology. Among the models evaluated, our fine-tuned L3Cube-Indic-Sentence-Similarity-BERT (Indic-SBERT) model demonstrated superior performance, achieving a macro-average F1-score of 0.7013. This notable result positioned us 6th in the task. The implementation details can be found in the GitHub repository.

CUET_Binary_Hackers@DravidianLangTech-EACL 2024: Sentiment Analysis using Transformer-Based Models in Code-Mixed and Transliterated Tamil and Tulu
Asrarul Eusha | Salman Farsi | Ariful Islam | Jawad Hossain | Shawly Ahsan | Mohammed Moshiul Hoque
Proceedings of the Fourth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages

Textual Sentiment Analysis (TSA) delves into people’s opinions, intuitions, and emotions regarding any entity. Natural Language Processing (NLP) serves as a technique to extract this subjective knowledge, determining whether an idea or comment leans positive, negative, neutral, or a mix thereof toward an entity. In recent years, TSA has garnered substantial attention from NLP researchers due to the vast availability of online comments and opinions. Despite extensive studies in this domain, sentiment analysis in low-resourced languages such as Tamil and Tulu still struggles with code-mixed and transliterated content. To address these challenges, this work focuses on sentiment analysis of code-mixed and transliterated Tamil and Tulu social media comments. It explores four machine learning (ML) approaches (LR, SVM, XGBoost, Ensemble), four deep learning (DL) methods (BiLSTM and CNN with FastText and Word2Vec), and four transformer-based models (m-BERT, MuRIL, L3Cube-IndicSBERT, and Distilm-BERT) for both languages. For Tamil, L3Cube-IndicSBERT and the ensemble approach outperformed the others, while m-BERT demonstrated superior performance for Tulu. The presented models achieved the 3rd and 1st ranks, attaining macro F1-scores of 0.227 and 0.584 for Tamil and Tulu, respectively.

Binary_Beasts@DravidianLangTech-EACL 2024: Multimodal Abusive Language Detection in Tamil based on Integrated Approach of Machine Learning and Deep Learning Techniques
Md. Rahman | Abu Raihan | Tanzim Rahman | Shawly Ahsan | Jawad Hossain | Avishek Das | Mohammed Moshiul Hoque
Proceedings of the Fourth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages

Detecting abusive language on social media is a challenging task that needs to be solved effectively. This research addresses the formidable challenge of detecting abusive language in Tamil through a comprehensive multimodal approach incorporating textual, acoustic, and visual inputs. This study utilized ConvLSTM, 3D-CNN, and a hybrid 3D-CNN with BiLSTM to extract video features. Several models, such as BiLSTM, LR, and CNN, were explored for processing audio data, whereas MNB, LR, and LSTM methods were explored for textual content. To further enhance overall performance, this work introduced a weighted late fusion model amalgamating predictions from all modalities, which was then applied to the test dataset. The ConvLSTM+BiLSTM+MNB combination yielded the highest macro F1-score of 71.43%. This methodology achieved 1st rank for multimodal abusive language detection in the shared task.
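A weighted late fusion of this kind can be expressed compactly: each modality contributes a class-probability matrix, and a normalized weight vector averages them before the argmax. The sketch below is a generic illustration with placeholder weights, not the tuned values from the paper.

```python
# Sketch: weighted late fusion of per-modality class probabilities.
# Weights are placeholders; the paper's tuned values are not given here.
import numpy as np

def late_fuse(probs_by_modality, weights):
    """Weighted average of (n_samples, n_classes) probability arrays."""
    weights = np.asarray(weights, dtype=float)
    weights /= weights.sum()                  # normalize to sum to 1
    stacked = np.stack(probs_by_modality)     # (n_modalities, n, c)
    fused = np.tensordot(weights, stacked, axes=1)
    return fused.argmax(axis=1)

p_video = np.array([[0.7, 0.3], [0.4, 0.6]])   # e.g. ConvLSTM outputs
p_audio = np.array([[0.6, 0.4], [0.3, 0.7]])   # e.g. audio BiLSTM outputs
p_text  = np.array([[0.8, 0.2], [0.2, 0.8]])   # e.g. MNB outputs
print(late_fuse([p_video, p_audio, p_text], weights=[0.4, 0.3, 0.3]))
```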

CUET_DUO@DravidianLangTech EACL2024: Fake News Classification Using Malayalam-BERT
Tanzim Rahman | Abu Raihan | Md. Rahman | Jawad Hossain | Shawly Ahsan | Avishek Das | Mohammed Moshiul Hoque
Proceedings of the Fourth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages

Distinguishing between fake and original news on social media demands vigilant procedures. This paper addresses the shared task ‘Fake News Detection in Dravidian Languages - DravidianLangTech@EACL 2024’. With a focus on the Malayalam language, the task is crucial for identifying social media posts as either fake or original news. The participating teams contributed immensely to this task through varied strategies, employing methods ranging from conventional machine-learning techniques to advanced transformer-based models. Notably, the findings of this work highlight the effectiveness of the Malayalam-BERT model, which achieved an impressive macro F1-score of 0.88 in distinguishing between fake and original news in Malayalam social media content, earning a commendable 1st rank among the participants.

CUETSentimentSillies@DravidianLangTech-EACL2024: Transformer-based Approach for Sentiment Analysis in Tamil and Tulu Code-Mixed Texts
Zannatul Tripty | Md. Nafis | Antu Chowdhury | Jawad Hossain | Shawly Ahsan | Avishek Das | Mohammed Moshiul Hoque
Proceedings of the Fourth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages

Sentiment analysis (SA) of social media reviews has become a challenging research agenda in recent years due to the exponential growth of textual content. Although several effective solutions are available for SA in high-resourced languages, it remains a critical problem for low-resourced languages. This work introduces an automatic system for analyzing sentiment in Tamil and Tulu code-mixed texts. Several ML (DT, RF, MNB), DL (CNN, BiLSTM, CNN+BiLSTM), and transformer-based models (Indic-BERT, XLM-RoBERTa, m-BERT) were investigated for the SA task using Tamil and Tulu code-mixed textual data. Experimental outcomes reveal that the transformer-based models XLM-R and m-BERT surpassed the others for Tamil and Tulu, respectively. The proposed XLM-R and m-BERT models attained macro F1-scores of 0.258 (Tamil) and 0.468 (Tulu) on the test datasets, securing the 2nd and 5th positions, respectively, in the shared task.

CUETSentimentSillies@DravidianLangTech EACL2024: Transformer-based Approach for Detecting and Categorizing Fake News in Malayalam Language
Zannatul Tripty | Md. Nafis | Antu Chowdhury | Jawad Hossain | Shawly Ahsan | Mohammed Moshiul Hoque
Proceedings of the Fourth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages

Fake news misleads people and may lead to real-world miscommunication and harm. Removing misinformation encourages critical thinking and democratic discourse and prevents hatred, fear, and misunderstanding. Developing a system to identify and remove fake news is therefore essential for reliable, accurate, and clear information. To this end, a shared task was organized to detect fake news in Malayalam. This paper presents the system we developed for the shared task of detecting and classifying fake news in Malayalam. The approach combines machine learning models (LR, DT, RF, MNB), deep learning models (CNN, BiLSTM, CNN+BiLSTM), and transformer-based models (Indic-BERT, XLM-R, Malayalam-BERT, m-BERT) for both subtasks. The experimental results demonstrate that the transformer-based models, specifically m-BERT and Malayalam-BERT, outperformed the others. The m-BERT model achieved superior performance in subtask 1 with a macro F1-score of 0.84, and Malayalam-BERT outperformed the other models in subtask 2 with a macro F1-score of 0.496, securing us the 5th and 2nd positions in subtask 1 and subtask 2, respectively.

A Multimodal Framework to Detect Target Aware Aggression in Memes
Shawly Ahsan | Eftekhar Hossain | Omar Sharif | Avishek Das | Mohammed Moshiul Hoque | M. Dewan
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

Internet memes have gained immense traction as a medium for individuals to convey emotions, thoughts, and perspectives on social media. While memes often serve as sources of humor and entertainment, they can also propagate offensive, incendiary, or harmful content, deliberately targeting specific individuals or communities. Identifying such memes is challenging because of their satirical and cryptic characteristics. Most contemporary research on memes’ detrimental facets is skewed towards high-resource languages, often sidelining the unique challenges tied to low-resource languages, such as Bengali. To facilitate this research in low-resource languages, this paper presents a novel dataset MIMOSA (MultIMOdal aggreSsion dAtaset) in Bengali. MIMOSA encompasses 4,848 annotated memes across five aggression target categories: Political, Gender, Religious, Others, and non-aggressive. We also propose MAF (Multimodal Attentive Fusion), a simple yet effective approach that uses multimodal context to detect the aggression targets. MAF captures the selective modality-specific features of the input meme and jointly evaluates them with individual modality features. Experiments on MIMOSA exhibit that the proposed method outperforms several state-of-the-art rivaling approaches. Our code and data are available at https://github.com/shawlyahsan/Bengali-Aggression-Memes.

Align before Attend: Aligning Visual and Textual Features for Multimodal Hateful Content Detection
Eftekhar Hossain | Omar Sharif | Mohammed Moshiul Hoque | Sarah Masud Preum
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop

Multimodal hateful content detection is a challenging task that requires complex reasoning across visual and textual modalities. Therefore, creating a meaningful multimodal representation that effectively captures the interplay between visual and textual features through intermediate fusion is critical. Conventional fusion techniques are unable to attend to the modality-specific features effectively. Moreover, most studies exclusively concentrated on English and overlooked other low-resource languages. This paper proposes a context-aware attention framework for multimodal hateful content detection and assesses it for both English and non-English languages. The proposed approach incorporates an attention layer to meaningfully align the visual and textual features. This alignment enables selective focus on modality-specific features before fusing them. We evaluate the proposed approach on two benchmark hateful meme datasets, viz. MUTE (Bengali code-mixed) and MultiOFF (English). Evaluation results demonstrate our proposed approach’s effectiveness with F1-scores of 69.7% and 70.3% for the MUTE and MultiOFF datasets. The scores show approximately 2.5% and 3.2% performance improvement over the state-of-the-art systems on these datasets. Our implementation is available at https://github.com/eftekhar-hossain/Bengali-Hateful-Memes.
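The alignment step can be pictured as standard cross-attention with textual tokens as queries over visual region features, followed by fusion of the aligned and original streams. The toy PyTorch sketch below conveys that idea only; the authors' actual implementation is in the linked repository, and all dimensions here are illustrative.

```python
# Toy sketch of attention-based alignment before fusion (PyTorch).
# Not the authors' implementation; shapes and dimensions are illustrative.
import torch
import torch.nn as nn

class AlignThenFuse(nn.Module):
    def __init__(self, dim=256, n_classes=2):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.classifier = nn.Linear(2 * dim, n_classes)

    def forward(self, text_feats, img_feats):
        # text_feats: (B, T, dim) token features; img_feats: (B, R, dim) regions
        aligned, _ = self.attn(query=text_feats, key=img_feats, value=img_feats)
        # pool both streams, then fuse by concatenation
        fused = torch.cat([aligned.mean(dim=1), text_feats.mean(dim=1)], dim=-1)
        return self.classifier(fused)

model = AlignThenFuse()
logits = model(torch.randn(2, 20, 256), torch.randn(2, 49, 256))
print(logits.shape)  # torch.Size([2, 2])
```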

CUET_NLP_Manning@LT-EDI 2024: Transformer-based Approach on Caste and Migration Hate Speech Detection
Md Alam | Hasan Mesbaul Ali Taher | Jawad Hossain | Shawly Ahsan | Mohammed Moshiul Hoque
Proceedings of the Fourth Workshop on Language Technology for Equality, Diversity, Inclusion

The widespread use of online communication has caused a significant increase in the spread of hate speech on social media, including hate based on caste and migration status. Despite several nations’ efforts to bring equality among their citizens, numerous crimes occur based on caste alone, and migration-based hostility occurs both in India and in developed countries. A shared task was arranged to address this issue in a low-resourced language, Tamil. This paper aims to improve the detection of hate speech and hostility based on caste and migration status on social media. To achieve this, we investigated several Machine Learning (ML), Deep Learning (DL), and transformer-based models, including m-BERT, XLM-R, and Tamil-BERT. Experimental results revealed the highest macro F1-score of 0.80 with the m-BERT model, which ranked us 3rd in the shared task.

CUET_DUO@StressIdent_LT-EDI@EACL2024: Stress Identification Using Tamil-Telugu BERT
Abu Raihan | Tanzim Rahman | Md. Rahman | Jawad Hossain | Shawly Ahsan | Avishek Das | Mohammed Moshiul Hoque
Proceedings of the Fourth Workshop on Language Technology for Equality, Diversity, Inclusion

The pervasive impact of stress on individuals necessitates proactive identification and intervention measures, especially in social media interaction. This study focuses on the shared task “Stress Identification in Dravidian Languages,” which emphasizes Tamil and Telugu code-mixed texts. The primary objective of the task is to classify social media messages into two categories: stressed and non-stressed. We employed various methodologies, from traditional machine-learning techniques to state-of-the-art transformer-based models. Notably, the Tamil-BERT and Telugu-BERT models exhibited strong performance, achieving macro F1-scores of 0.71 and 0.72 and securing the 15th position for the Tamil code-mixed task and the 9th position for the Telugu code-mixed task, respectively. These findings underscore the effectiveness of these models in recognizing stress signals within social media content composed in Tamil and Telugu.

Deciphering Hate: Identifying Hateful Memes and Their Targets
Eftekhar Hossain | Omar Sharif | Mohammed Moshiul Hoque | Sarah Masud Preum
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Internet memes have become a powerful means for individuals to express emotions, thoughts, and perspectives on social media. While often considered as a source of humor and entertainment, memes can also disseminate hateful content targeting individuals or communities. Most existing research focuses on the negative aspects of memes in high-resource languages, overlooking the distinctive challenges associated with low-resource languages like Bengali (also known as Bangla). Furthermore, while previous work on Bengali memes has focused on detecting hateful memes, there has been no work on detecting their targeted entities. To bridge this gap and facilitate research in this arena, we introduce a novel multimodal dataset for Bengali, BHM (Bengali Hateful Memes). The dataset consists of 7,148 memes with Bengali as well as code-mixed captions, tailored for two tasks: (i) detecting hateful memes, and (ii) detecting the social entities they target (i.e., Individual, Organization, Community, and Society). To solve these tasks, we propose DORA (Dual cO-attention fRAmework), a multimodal deep neural network that systematically extracts the significant modality features from the memes and jointly evaluates them with the modality-specific features to understand the context better. Our experiments show that DORA is generalizable on other low-resource hateful meme datasets and outperforms several state-of-the-art rivaling baselines.

SemanticCuetSync at AraFinNLP2024: Classification of Cross-Dialect Intent in the Banking Domain using Transformers
Ashraful Paran | Symom Shohan | Md. Hossain | Jawad Hossain | Shawly Ahsan | Mohammed Moshiul Hoque
Proceedings of The Second Arabic Natural Language Processing Conference

Intent detection is a crucial aspect of natural language understanding (NLU), focusing on identifying the primary objective underlying user input. In this work, we present a transformer-based method that excels at determining the intent of Arabic text within the banking domain. We explored several machine learning (ML), deep learning (DL), and transformer-based models on an Arabic banking dataset for intent detection. Our findings underscore the challenges that traditional ML and DL models face in understanding the nuances of various Arabic dialects, leading to subpar performance in intent detection. However, the transformer-based methods, designed to tackle such complexities, significantly outperformed the other models in classifying intent across different Arabic dialects. Notably, the AraBERTv2 model achieved the highest micro F1-score of 82.08% on the ArBanking77 dataset, a testament to its effectiveness in this context. This achievement, which contributed to our work being ranked 5th in the AraFinNLP2024 shared task, highlights the importance of developing models that can effectively handle the intricacies of Arabic language processing and intent detection.

SemanticCuetSync at ArAIEval Shared Task: Detecting Propagandistic Spans with Persuasion Techniques Identification using Pre-trained Transformers
Symom Shohan | Md. Hossain | Ashraful Paran | Shawly Ahsan | Jawad Hossain | Mohammed Moshiul Hoque
Proceedings of The Second Arabic Natural Language Processing Conference

Detecting propagandistic spans and identifying persuasion techniques are crucial for promoting informed decision-making, safeguarding democratic processes, and fostering a media environment characterized by integrity and transparency. Various machine learning (Logistic Regression, Random Forest, and Multinomial Naive Bayes), deep learning (CNN, CNN+LSTM, CNN+BiLSTM), and transformer-based (AraBERTv2, AraBERT-NER, CamelBERT, BERT-Base-Arabic) models were exploited to perform the task. The evaluation results indicate that CamelBERT achieved the highest micro F1-score (24.09%), outperforming CNN+LSTM and AraBERTv2. The study found that most models struggle to detect propagandistic spans when multiple spans are present within the same article. Overall, our model’s performance secured 6th place in ArAIEval Shared Task-1.
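Span detection of this kind is commonly cast as BIO token classification. The sketch below shows that framing with the public aubmindlab/bert-base-arabertv2 checkpoint; the label scheme and classification head here are assumptions for illustration, not the exact configuration used in the paper.

```python
# Sketch: propagandistic span detection framed as BIO token classification.
# The label set (O / B-PROP / I-PROP) is an assumed scheme for illustration.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

name = "aubmindlab/bert-base-arabertv2"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForTokenClassification.from_pretrained(name, num_labels=3)
# labels: 0 = O, 1 = B-PROP, 2 = I-PROP

enc = tokenizer("هذا مثال قصير", return_tensors="pt")
with torch.no_grad():
    pred = model(**enc).logits.argmax(dim=-1)
print(pred)  # per-token label ids (head is untrained here, so arbitrary)
```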

SemanticCUETSync at SemEval-2024 Task 1: Finetuning Sentence Transformer to Find Semantic Textual Relatedness
Md. Sajjad Hossain | Ashraful Islam Paran | Symom Hossain Shohan | Jawad Hossain | Mohammed Moshiul Hoque
Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024)

Semantic textual relatedness is crucial to Natural Language Processing (NLP). Methodologies often exhibit superior performance in high-resource languages such as English compared to low-resource ones like Marathi, Telugu, and Spanish. This study leverages various machine learning (ML) approaches, including Support Vector Regression (SVR) and Random Forest, deep learning (DL) techniques such as Siamese Neural Networks, and transformer-based models such as MiniLM-L6-v2, Marathi-sbert, Telugu-sentence-bert-nli, and Roberta-bne-sentiment-analysis-es, to assess semantic relatedness across English, Marathi, Telugu, and Spanish. The transformer-based methods notably outperformed the other models, achieving Spearman correlation coefficients of 0.822 (English), 0.870 (Marathi), 0.820 (Telugu), and 0.677 (Spanish). These results earned our work rankings of 22nd (English), 11th (Marathi), 11th (Telugu), and 14th (Spanish), respectively.
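The evaluation pipeline can be reduced to three steps: encode each sentence pair, score it by cosine similarity, and correlate the scores with the gold ratings via Spearman's coefficient. A minimal sketch with the sentence-transformers library follows; the toy pairs and gold scores are stand-ins for the task data.

```python
# Sketch: scoring semantic relatedness with a sentence transformer and
# evaluating with Spearman correlation. Data here is a toy stand-in.
from sentence_transformers import SentenceTransformer, util
from scipy.stats import spearmanr

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
pairs = [("A man is playing guitar.", "Someone strums a guitar."),
         ("It is raining outside.", "The stock market fell today.")]
gold = [0.9, 0.1]  # toy gold relatedness scores

emb1 = model.encode([a for a, _ in pairs], convert_to_tensor=True)
emb2 = model.encode([b for _, b in pairs], convert_to_tensor=True)
pred = util.cos_sim(emb1, emb2).diagonal().tolist()  # pairwise cosine scores

print(spearmanr(gold, pred).correlation)
```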

2023

Score_IsAll_You_Need at BLP-2023 Task 1: A Hierarchical Classification Approach to Detect Violence Inciting Text using Transformers
Kawsar Ahmed | Md Osama | Md. Sirajul Islam | Md Taosiful Islam | Avishek Das | Mohammed Moshiul Hoque
Proceedings of the First Workshop on Bangla Language Processing (BLP-2023)

Violence-inciting text detection has become critical due to its significance in social media monitoring, online security, and the prevention of violent content. Developing an automatic text classification model for identifying violence in a language with limited resources, like Bangla, poses significant challenges due to the scarcity of resources and complex morphological structures. This work presents a transformer-based method that classifies Bangla texts into three violence classes: direct, passive, and non-violence. We leveraged transformer models, including BanglaBERT, XLM-R, and m-BERT, to develop a hierarchical classification model for the downstream task. In the first step, BanglaBERT is employed to identify the presence of violence in the text. In the next step, the model classifies texts that incite violence as either direct or passive. The developed system scored 72.37 and ranked 14th among the participants.
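The hierarchical scheme amounts to a two-stage cascade: a binary violence/non-violence decision followed by a direct/passive refinement. The sketch below shows only that control flow, with trivial keyword rules standing in for the fine-tuned BanglaBERT classifiers.

```python
# Sketch of the two-stage hierarchical scheme. The stand-in classifiers
# below are toy keyword rules, not the fine-tuned BanglaBERT models.
def hierarchical_predict(text, stage1, stage2):
    """stage1/stage2 are callables returning a label for a text."""
    if stage1(text) == "non-violence":
        return "non-violence"
    return stage2(text)  # "direct" or "passive"

stage1 = lambda t: "violence" if "attack" in t else "non-violence"
stage2 = lambda t: "direct" if "will" in t else "passive"
print(hierarchical_predict("they will attack the rally", stage1, stage2))
```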

NLP_CUET at BLP-2023 Task 1: Fine-grained Categorization of Violence Inciting Text using Transformer-based Approach
Jawad Hossain | Hasan Mesbaul Ali Taher | Avishek Das | Mohammed Moshiul Hoque
Proceedings of the First Workshop on Bangla Language Processing (BLP-2023)

The amount of online textual content has increased significantly in recent years through social media posts, online chatting, web portals, and other digital platforms, owing to the significant increase in internet users and their unprompted access via digital devices. Unfortunately, the misappropriation of textual communication via the Internet has led to violence-inciting texts. Despite the availability of various forms of violence-inciting material, text-based content is often used to carry out violent acts. Thus, developing a system to detect violence-inciting text has become vital, though creating such a system in a low-resourced language like Bangla is challenging. Therefore, a shared task was arranged to detect violence-inciting text in Bangla. This paper presents a hybrid approach (GAN+Bangla-ELECTRA) to classify violence-inciting text in Bangla into three classes: direct, passive, and non-violence. We investigated a variety of deep learning (CNN, BiLSTM, BiLSTM+Attention), machine learning (LR, DT, MNB, SVM, RF, SGD), transformer-based (BERT, ELECTRA), and GAN-based models. Evaluation results demonstrate that the GAN+Bangla-ELECTRA model gained the highest macro F1-score (74.59), earning us 3rd position in BLP-2023 Task 1.

2022

M-BAD: A Multilabel Dataset for Detecting Aggressive Texts and Their Targets
Omar Sharif | Eftekhar Hossain | Mohammed Moshiul Hoque
Proceedings of the Workshop on Combating Online Hostile Posts in Regional Languages during Emergency Situations

Recently, the detection and categorization of undesired (e.g., aggressive, abusive, offensive, hateful) content on online platforms has grabbed the attention of researchers because of its detrimental impact on society. Several attempts have been made to mitigate the usage and propagation of such content. However, most past studies were conducted primarily for English, leaving low-resource languages like Bengali out of focus. Therefore, to facilitate research in this arena, this paper introduces a novel multilabel Bengali dataset (named M-BAD) containing 15650 texts for detecting aggressive texts and their targets. Each text of M-BAD went through rigorous two-level annotation. At the primary level, each text is labelled as either aggressive or non-aggressive. At the secondary level, the aggressive texts are further annotated with five fine-grained target classes: religion, politics, verbal, gender, and race. Baseline experiments were carried out with different machine learning (ML), deep learning (DL), and transformer models, where Bangla-BERT acquired the highest weighted F1-score in both the detection (0.92) and target identification (0.83) tasks. Error analysis of the models exhibits the difficulty of identifying context-dependent aggression, and this work argues that further research is required to address these issues.

MemoSen: A Multimodal Dataset for Sentiment Analysis of Memes
Eftekhar Hossain | Omar Sharif | Mohammed Moshiul Hoque
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Posting and sharing memes have become a powerful means of expressing opinions on social media in recent years. Analysis of sentiment from memes has gained much attention from researchers due to its substantial implications in various domains like finance and politics. Past studies on sentiment analysis of memes have primarily been conducted in English, with low-resource languages gaining little or no attention. However, due to the proliferation of social media usage in recent years, sentiment analysis of memes is also a crucial research issue in low-resource languages. The scarcity of benchmark datasets is a significant barrier to performing multimodal sentiment analysis research in resource-constrained languages like Bengali. This paper presents a novel multimodal dataset (named MemoSen) for Bengali containing 4417 memes with three annotated labels: positive, negative, and neutral. A detailed annotation guideline is provided to facilitate further resource development in this domain. Additionally, a set of experiments is carried out on MemoSen by constructing twelve unimodal (i.e., visual, textual) and ten multimodal (image+text) models. The evaluation exhibits that the integration of multimodal information significantly improves (by about 1.2%) meme sentiment classification compared to the unimodal counterparts, elucidating the novel aspects of multimodality.

MUTE: A Multimodal Dataset for Detecting Hateful Memes
Eftekhar Hossain | Omar Sharif | Mohammed Moshiul Hoque
Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing: Student Research Workshop

The exponential surge of social media has enabled information propagation at an unprecedented rate. However, it has also led to the generation of a vast amount of malign content, such as hateful memes. To eradicate the detrimental impact of this content, the hateful meme detection problem has grabbed the attention of researchers over the last few years. However, most past studies were conducted primarily for English memes, while memes in resource-constrained languages (e.g., Bengali) remain under-studied. Moreover, current research considers memes with captions written in monolingual (either English or Bengali) form; memes may instead have code-mixed captions (English+Bangla), and existing models cannot provide accurate inference in such cases. Therefore, to facilitate research in this arena, this paper introduces a multimodal hate speech dataset (named MUTE) consisting of 4158 memes with Bengali and code-mixed captions. A detailed annotation guideline is provided to aid dataset creation in other resource-constrained languages. Additionally, extensive experiments have been carried out on MUTE considering only visual, only textual, and both modalities. The results demonstrate that joint evaluation of visual and textual features significantly improves (≈ 3%) hateful meme classification compared to unimodal evaluation.

CUET-NLP@DravidianLangTech-ACL2022: Investigating Deep Learning Techniques to Detect Multimodal Troll Memes
Md Hasan | Nusratul Jannat | Eftekhar Hossain | Omar Sharif | Mohammed Moshiul Hoque
Proceedings of the Second Workshop on Speech and Language Technologies for Dravidian Languages

With the substantial rise of internet usage, social media has become a powerful communication medium for conveying information, opinions, and feelings on various issues. Recently, memes have become a popular way of sharing information on social media. Usually, memes are visuals with text incorporated into them, and they can quickly disseminate hatred and offensive content. Detecting or classifying memes is challenging due to their region-specific interpretation and multimodal nature. This work presents a meme classification technique for Tamil developed by the CUET NLP team for the shared task (DravidianLangTech-ACL2022). Several computational models were investigated to perform the classification task, exploring visual and textual features with VGG16, ResNet50, VGG19, CNN, and CNN+LSTM models. Multimodal features were extracted by combining image (VGG16) and text (CNN, LSTM+CNN) characteristics. Results demonstrate that the textual strategy with CNN+LSTM achieved the highest weighted F1-score (0.52) and recall (0.57), while CNN-Text+VGG16 outperformed the other models on multimodal meme detection with the highest F1-score of 0.49. The LSTM+CNN model allowed the team to achieve 4th place in the shared task.

CUET-NLP@DravidianLangTech-ACL2022: Exploiting Textual Features to Classify Sentiment of Multimodal Movie Reviews
Nasehatul Mustakim | Nusratul Jannat | Md Hasan | Eftekhar Hossain | Omar Sharif | Mohammed Moshiul Hoque
Proceedings of the Second Workshop on Speech and Language Technologies for Dravidian Languages

With the proliferation of internet usage, massive growth of consumer-generated content on social media has been witnessed in recent years, providing people’s opinions on diverse issues. Through social media, users can convey their emotions and thoughts in distinctive forms such as text, image, audio, video, and emoji, which has advanced the multimodality of the content users share on social networking sites. This paper presents a technique for classifying multimodal sentiment using the text modality into five categories: highly positive, positive, neutral, negative, and highly negative. A shared task was organized to develop models that can identify the sentiments expressed in the videos of movie reviewers in both Malayalam and Tamil. This work applied several machine learning techniques (LR, DT, MNB, SVM) and deep learning techniques (BiLSTM, CNN+BiLSTM) to accomplish the task. Results demonstrate that the proposed model with the decision tree (DT) outperformed the other methods and won the competition by acquiring the highest macro F1-score of 0.24.

CUET-NLP@TamilNLP-ACL2022: Multi-Class Textual Emotion Detection from Social Media using Transformer
Nasehatul Mustakim | Rabeya Rabu | Golam Md. Mursalin | Eftekhar Hossain | Omar Sharif | Mohammed Moshiul Hoque
Proceedings of the Second Workshop on Speech and Language Technologies for Dravidian Languages

Recently, emotion analysis has gained increased attention from NLP researchers due to its various applications in opinion mining, e-commerce, comprehensive search, healthcare, personalized recommendations, and online education. Developing an intelligent emotion analysis model is challenging in resource-constrained languages like Tamil. Therefore, a shared task was organized to identify the underlying emotion of a given comment expressed in Tamil. This paper presents our approach to classifying textual emotion in Tamil into 11 classes: ambiguous, anger, anticipation, disgust, fear, joy, love, neutral, sadness, surprise, and trust. We investigated various machine learning (LR, DT, MNB, SVM), deep learning (CNN, LSTM, BiLSTM), and transformer-based models (Multilingual-BERT, XLM-R). Results reveal that the XLM-R model outdoes all other models by acquiring the highest macro F1-score (0.33).

COMBATANT@TamilNLP-ACL2022: Fine-grained Categorization of Abusive Comments using Logistic Regression
Alamgir Hossain | Mahathir Bishal | Eftekhar Hossain | Omar Sharif | Mohammed Moshiul Hoque
Proceedings of the Second Workshop on Speech and Language Technologies for Dravidian Languages

With the widespread usage of social media and effortless internet access, millions of posts and comments are generated every minute. Unfortunately, with this substantial rise, the usage of abusive language has increased significantly on these platforms, leading to hazards such as cyber-bullying, vulgarity, online harassment, and abuse. Detecting and mitigating the usage of abusive language has therefore become a crucial issue. This work presents our system developed as part of the shared task to detect abusive language in Tamil. We employed three machine learning models (LR, DT, SVM), two deep learning models (CNN+BiLSTM, CNN+BiLSTM with FastText), and a transformer-based model (Indic-BERT). The experimental results show that the Logistic Regression (LR) and CNN+BiLSTM models outperformed the others: both LR and CNN+BiLSTM with FastText achieved a weighted F1-score of 0.39, but LR obtained a higher recall (0.44) than CNN+BiLSTM (0.36). This earned us 2nd rank in the shared task competition.

2021

Emotion Classification in a Resource Constrained Language Using Transformer-based Approach
Avishek Das | Omar Sharif | Mohammed Moshiul Hoque | Iqbal H. Sarker
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop

Although research on emotion classification has progressed significantly in high-resource languages, it is still in its infancy for resource-constrained languages like Bengali. The unavailability of necessary language processing tools and the deficiency of benchmark corpora make the emotion classification task in Bengali even more challenging and complicated. This work proposes a transformer-based technique to classify Bengali text into one of six basic emotions: anger, fear, disgust, sadness, joy, and surprise. A Bengali emotion corpus consisting of 6243 texts was developed for the classification task. Experiments were carried out using various machine learning (LR, RF, MNB, SVM), deep neural network (CNN, BiLSTM, CNN+BiLSTM), and transformer-based (Bangla-BERT, m-BERT, XLM-R) approaches. Experimental outcomes indicate that XLM-R outdoes all other techniques, achieving the highest weighted F1-score of 69.73% on the test data.

NLP-CUET@LT-EDI-EACL2021: Multilingual Code-Mixed Hope Speech Detection using Cross-lingual Representation Learner
Eftekhar Hossain | Omar Sharif | Mohammed Moshiul Hoque
Proceedings of the First Workshop on Language Technology for Equality, Diversity and Inclusion

In recent years, several systems have been developed to regulate the spread of negativity and eliminate aggressive, offensive, or abusive content from online platforms. Nevertheless, only a limited number of studies have been carried out to identify positive, encouraging, and supportive content. In this work, our goal is to identify whether a social media post/comment contains hope speech or not. To serve this purpose, we propose three distinct models to identify hope speech in English, Tamil, and Malayalam. We employed various machine learning (SVM, LR, ensemble), deep learning (CNN+BiLSTM), and transformer-based (m-BERT, Indic-BERT, XLNet, XLM-R) methods. Results indicate that XLM-R outdoes all other techniques, gaining weighted F1-scores of 0.93, 0.60, and 0.85 for English, Tamil, and Malayalam, respectively. Our team achieved 1st, 2nd, and 1st rank in these three tasks, respectively.

NLP-CUET@DravidianLangTech-EACL2021: Offensive Language Detection from Multilingual Code-Mixed Text using Transformers
Omar Sharif | Eftekhar Hossain | Mohammed Moshiul Hoque
Proceedings of the First Workshop on Speech and Language Technologies for Dravidian Languages

The increasing accessibility of the internet has facilitated social media usage and encouraged individuals to express their opinions liberally. Nevertheless, it also creates a place for content polluters to disseminate offensive posts or content. Most such offensive posts are written in a cross-lingual manner and can easily evade online surveillance systems. This paper presents an automated system that can identify offensive text in multilingual code-mixed data. In the task, datasets were provided in three languages (Tamil, Malayalam, and Kannada code-mixed with English), and participants were asked to implement separate models for each language. To accomplish the tasks, we employed machine learning (LR, SVM), deep learning (LSTM, LSTM+Attention), and transformer-based (m-BERT, Indic-BERT, XLM-R) methods. Results show that XLM-R outperforms the other techniques for Tamil and Malayalam, while m-BERT achieves the highest score for Kannada. The proposed models gained weighted F1-scores of 0.76 (Tamil), 0.93 (Malayalam), and 0.71 (Kannada), ranking 3rd, 5th, and 4th, respectively.

NLP-CUET@DravidianLangTech-EACL2021: Investigating Visual and Textual Features to Identify Trolls from Multimodal Social Media Memes
Eftekhar Hossain | Omar Sharif | Mohammed Moshiul Hoque
Proceedings of the First Workshop on Speech and Language Technologies for Dravidian Languages

In the past few years, memes have become a new mode of communication on the Internet. As memes are images with embedded text, they can quickly spread hate, offence, and violence. Classifying memes is very challenging because of their multimodal nature and region-specific interpretation. A shared task was organized to develop models that can identify trolls from multimodal social media memes. This work presents the computational model we developed as part of our participation in the task. Training data comes in two forms: an image with embedded Tamil code-mixed text and an associated caption. We investigated visual and textual features using CNN, VGG16, Inception, m-BERT, XLM-R, and XLNet algorithms. Multimodal features were extracted by combining image (CNN, ResNet50, Inception) and text (Bi-LSTM) features via an early fusion approach. Results indicate that the textual approach with XLNet achieved the highest weighted F1-score of 0.58, which enabled our model to secure 3rd rank in this task.
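Early fusion here means concatenating the pooled image-backbone features with the Bi-LSTM text features before a shared classifier. The toy PyTorch sketch below illustrates that wiring; all dimensions are illustrative rather than the paper's configuration.

```python
# Toy sketch of early fusion: image-backbone features concatenated with
# Bi-LSTM text features before a shared classifier. Dimensions illustrative.
import torch
import torch.nn as nn

class EarlyFusionTrollClassifier(nn.Module):
    def __init__(self, vocab_size=5000, img_dim=2048, n_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, 128)
        self.bilstm = nn.LSTM(128, 128, bidirectional=True, batch_first=True)
        self.classifier = nn.Linear(img_dim + 256, n_classes)

    def forward(self, token_ids, img_feats):
        # img_feats: pooled backbone output, e.g. ResNet50 (B, 2048)
        _, (h, _) = self.bilstm(self.embed(token_ids))
        text_feats = torch.cat([h[-2], h[-1]], dim=-1)  # both directions (B, 256)
        return self.classifier(torch.cat([img_feats, text_feats], dim=-1))

model = EarlyFusionTrollClassifier()
print(model(torch.randint(0, 5000, (2, 30)), torch.randn(2, 2048)).shape)
```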

2020

Towards Bengali Word Embedding: Corpus Creation, Intrinsic and Extrinsic Evaluations
Md. Rajib Hossain | Mohammed Moshiul Hoque
Proceedings of the 17th International Conference on Natural Language Processing (ICON)

Distributional word vector representation, or word embedding, has become an essential ingredient in many natural language processing (NLP) tasks such as machine translation, document classification, information retrieval, and question answering. Investigating embedding models helps to reduce the feature space and improves the capture of textual semantic and syntactic relations. This paper presents three embedding techniques (Word2Vec, GloVe, and FastText) with different hyperparameters, implemented on a Bengali corpus consisting of 180 million words. The performance of the embedding techniques is evaluated both extrinsically and intrinsically. Extrinsic performance is evaluated via text classification, which achieved a maximum accuracy of 96.48%. Intrinsic performance is evaluated via word similarity (semantic, syntactic, and relatedness) and analogy tasks. The maximum Pearson correlation reached 60.66% for semantic similarity and 71.64% for syntactic similarity, whereas relatedness obtained 79.80%. The semantic word analogy task achieved 44.00% accuracy, while the syntactic word analogy task obtained 36.00%.
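As a pointer for reproducing the embedding training, the sketch below trains Word2Vec and FastText with gensim on a tokenized corpus and runs a small intrinsic similarity check; the hyperparameters shown are illustrative, not the swept values from the paper, and the two toy sentences stand in for the 180-million-word corpus.

```python
# Sketch: training Word2Vec and FastText on a tokenized Bengali corpus
# with gensim, plus a small intrinsic similarity check. Illustrative only.
from gensim.models import Word2Vec, FastText

corpus = [["আমি", "ভাত", "খাই"], ["সে", "বই", "পড়ে"]]  # toy tokenized sentences

w2v = Word2Vec(corpus, vector_size=300, window=5, min_count=1, sg=1)
ft = FastText(corpus, vector_size=300, window=5, min_count=1)

# Intrinsic check: cosine similarity between two in-vocabulary words
print(w2v.wv.similarity("আমি", "সে"))
print(ft.wv.similarity("আমি", "সে"))
```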

TechTexC: Classification of Technical Texts using Convolution and Bidirectional Long Short Term Memory Network
Omar Sharif | Eftekhar Hossain | Mohammed Moshiul Hoque
Proceedings of the 17th International Conference on Natural Language Processing (ICON): TechDOfication 2020 Shared Task

This paper describes a technical text classification system (called ‘TechTexC’) and the results it achieved as part of participation in the TechDOfication 2020 shared task. The shared task consists of two sub-tasks: (i) identifying the coarse-grained technical domain of a given text in a specified language, and (ii) classifying a computer-science text into fine-grained sub-domains. TechTexC performs the classification using three techniques: a convolutional neural network (CNN), a bidirectional long short-term memory (BiLSTM) network, and a combined CNN with BiLSTM. Results show that the combined CNN with BiLSTM model outperforms the other techniques on task-1 sub-tasks (a, b, c, and g) and task-2a. This combined model obtained F1-scores of 82.63 (sub-task a), 81.95 (sub-task b), 82.39 (sub-task c), 84.37 (sub-task g), and 67.44 (task-2a) on the development set. On the test set, it achieved the highest accuracy for sub-tasks 1a (70.76%), 1b (79.97%), 1c (65.45%), 1g (49.23%), and 2a (70.14%).