Dhiman Goswami


2024

pdf bib
Native Language Identification in Texts: A Survey
Dhiman Goswami | Sharanya Thilagan | Kai North | Shervin Malmasi | Marcos Zampieri
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

We present the first comprehensive survey of Native Language Identification (NLI) applied to texts. NLI is the task of automatically identifying an author’s native language (L1) based on their second language (L2) production. NLI is an important task with practical applications in second language teaching and NLP. The task has been widely studied for both text and speech, particularly for L2 English due to the availability of suitable corpora. Speech-based NLI relies heavily on accent modeled by pronunciation patterns and prosodic cues while text-based NLI relies primarily on modeling spelling errors and grammatical patterns that reveal properties of an individuals’ L1 influencing L2 production. We survey over one hundred papers on the topic including the papers associated with the NLI and INLI shared tasks. We describe several text representations and computational techniques used in text-based NLI. Finally, we present a comprehensive account of publicly available datasets used for the task thus far.

pdf bib
MasonPerplexity at Multimodal Hate Speech Event Detection 2024: Hate Speech and Target Detection Using Transformer Ensembles
Amrita Ganguly | Al Nahian Bin Emran | Sadiya Sayara Chowdhury Puspo | Md Nishat Raihan | Dhiman Goswami | Marcos Zampieri
Proceedings of the 7th Workshop on Challenges and Applications of Automated Extraction of Socio-political Events from Text (CASE 2024)

The automatic identification of offensive language such as hate speech is important to keep discussions civil in online communities. Identifying hate speech in multimodal content is a particularly challenging task because offensiveness can be manifested in either words or images or a juxtaposition of the two. This paper presents the MasonPerplexity submission for the Shared Task on Multimodal Hate Speech Event Detection at CASE 2024 at EACL 2024. The task is divided into two sub-tasks: sub-task A focuses on the identification of hate speech and sub-task B focuses on the identification of targets in text-embedded images during political events. We use an XLM-roBERTa-large model for sub-task A and an ensemble approach combining XLM-roBERTa-base, BERTweet-large, and BERT-base for sub-task B. Our approach obtained 0.8347 F1-score in sub-task A and 0.6741 F1-score in sub-task B ranking 3rd on both sub-tasks.

pdf bib
MasonPerplexity at ClimateActivism 2024: Integrating Advanced Ensemble Techniques and Data Augmentation for Climate Activism Stance and Hate Event Identification
Al Nahian Bin Emran | Amrita Ganguly | Sadiya Sayara Chowdhury Puspo | Dhiman Goswami | Md Nishat Raihan
Proceedings of the 7th Workshop on Challenges and Applications of Automated Extraction of Socio-political Events from Text (CASE 2024)

The task of identifying public opinions on social media, particularly regarding climate activism and the detection of hate events, has emerged as a critical area of research in our rapidly changing world. With a growing number of people voicing either to support or oppose to climate-related issues - understanding these diverse viewpoints has become increasingly vital. Our team, MasonPerplexity, participates in a significant research initiative focused on this subject. We extensively test various models and methods, discovering that our most effective results are achieved through ensemble modeling, enhanced by data augmentation techniques like back-translation. In the specific components of this research task, our team achieved notable positions, ranking 5th, 1st, and 6th in the respective sub-tasks, thereby illustrating the effectiveness of our approach in this important field of study.

pdf bib
MasonTigers@LT-EDI-2024: An Ensemble Approach Towards Detecting Homophobia and Transphobia in Social Media Comments
Dhiman Goswami | Sadiya Sayara Chowdhury Puspo | Md Nishat Raihan | Al Nahian Bin Emran
Proceedings of the Fourth Workshop on Language Technology for Equality, Diversity, Inclusion

In this paper, we describe our approaches and results for Task 2 of the LT-EDI 2024 Workshop, aimed at detecting homophobia and/or transphobia across ten languages. Our methodologies include monolingual transformers and ensemble methods, capitalizing on the strengths of each to enhance the performance of the models. The ensemble models worked well, placing our team, MasonTigers, in the top five for eight of the ten languages, as measured by the macro F1 score. Our work emphasizes the efficacy of ensemble methods in multilingual scenarios, addressing the complexities of language-specific tasks.

pdf bib
GMU at MLSP 2024: Multilingual Lexical Simplification with Transformer Models
Dhiman Goswami | Kai North | Marcos Zampieri
Proceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2024)

This paper presents GMU’s submission to the Multilingual Lexical Simplification Pipeline (MLSP) shared task at the BEA workshop 2024. The task includes Lexical Complexity Prediction (LCP) and Lexical Simplification (LS) sub-tasks across 10 languages. Our submissions achieved rankings ranging from 1st to 5th in LCP and from 1st to 3rd in LS. Our best performing approach for LCP is a weighted ensemble based on Pearson correlation of language specific transformer models trained on all languages combined. For LS, GPT4-turbo zero-shot prompting achieved the best performance.

pdf bib
EmoMix-3L: A Code-Mixed Dataset for Bangla-English-Hindi for Emotion Detection
Nishat Raihan | Dhiman Goswami | Antara Mahmud | Antonios Anastasopoulos | Marcos Zampieri
Proceedings of the 7th Workshop on Indian Language Data: Resources and Evaluation

Code-mixing is a well-studied linguistic phenomenon that occurs when two or more languages are mixed in text or speech. Several studies have been conducted on building datasets and performing downstream NLP tasks on code-mixed data. Although it is not uncommon to observe code-mixing of three or more languages, most available datasets in this domain contain code-mixed data from only two languages. In this paper, we introduce EmoMix-3L, a novel multi-label emotion detection dataset containing code-mixed data from three different languages. We experiment with several models on EmoMix-3L and we report that MuRIL outperforms other models on this dataset.

pdf bib
MasonTigers at SemEval-2024 Task 9: Solving Puzzles with an Ensemble of Chain-of-Thought Prompts
Nishat Raihan | Dhiman Goswami | Al Nahian Bin Emran | Sadiya Sayara Chowdhury Puspo | Amrita Ganguly | Marcos Zampieri
Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024)

Our paper presents team MasonTigers submission to the SemEval-2024 Task 9 - which provides a dataset of puzzles for testing natural language understanding. We employ large language models (LLMs) to solve this task through several prompting techniques. Zero-shot and few-shot prompting generate reasonably good results when tested with proprietary LLMs, compared to the open-source models. We obtain further improved results with chain-of-thought prompting, an iterative prompting method that breaks down the reasoning process step-by-step. We obtain our best results by utilizing an ensemble of chain-of-thought prompts, placing 2nd in the word puzzle subtask and 13th in the sentence puzzle subtask. The strong performance of prompted LLMs demonstrates their capability for complex reasoning when provided with a decomposition of the thought process. Our work sheds light on how step-wise explanatory prompts can unlock more of the knowledge encoded in the parameters of large models.

pdf bib
MasonTigers at SemEval-2024 Task 8: Performance Analysis of Transformer-based Models on Machine-Generated Text Detection
Sadiya Sayara Chowdhury Puspo | Nishat Raihan | Dhiman Goswami | Al Nahian Bin Emran | Amrita Ganguly | Özlem Uzuner
Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024)

This paper presents the MasonTigers entryto the SemEval-2024 Task 8 - Multigenerator, Multidomain, and Multilingual BlackBox Machine-Generated Text Detection. Thetask encompasses Binary Human-Written vs.Machine-Generated Text Classification (TrackA), Multi-Way Machine-Generated Text Classification (Track B), and Human-Machine MixedText Detection (Track C). Our best performing approaches utilize mainly the ensemble ofdiscriminator transformer models along withsentence transformer and statistical machinelearning approaches in specific cases. Moreover, Zero shot prompting and fine-tuning ofFLAN-T5 are used for Track A and B.

pdf bib
MasonTigers at SemEval-2024 Task 1: An Ensemble Approach for Semantic Textual Relatedness
Dhiman Goswami | Sadiya Sayara Chowdhury Puspo | Nishat Raihan | Al Nahian Bin Emran | Amrita Ganguly | Marcos Zampieri
Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024)

This paper presents the MasonTigers’ entry to the SemEval-2024 Task 1 - Semantic Textual Relatedness. The task encompasses supervised (Track A), unsupervised (Track B), and cross-lingual (Track C) approaches to semantic textual relatedness across 14 languages. MasonTigers stands out as one of the two teams who participated in all languages across the three tracks. Our approaches achieved rankings ranging from 11th to 21st in Track A, from 1st to 8th in Track B, and from 5th to 12th in Track C. Adhering to the task-specific constraints, our best performing approaches utilize an ensemble of statistical machine learning approaches combined with language-specific BERT based models and sentence transformers.

2023

pdf bib
nlpBDpatriots at BLP-2023 Task 1: Two-Step Classification for Violence Inciting Text Detection in Bangla - Leveraging Back-Translation and Multilinguality
Md Nishat Raihan | Dhiman Goswami | Sadiya Sayara Chowdhury Puspo | Marcos Zampieri
Proceedings of the First Workshop on Bangla Language Processing (BLP-2023)

In this paper, we discuss the nlpBDpatriots entry to the shared task on Violence Inciting Text Detection (VITD) organized as part of the first workshop on Bangla Language Processing (BLP) co-located with EMNLP. The aim of this task is to identify and classify the violent threats, that provoke further unlawful violent acts. Our best-performing approach for the task is two-step classification using back translation and multilinguality which ranked 6th out of 27 teams with a macro F1 score of 0.74.

pdf bib
nlpBDpatriots at BLP-2023 Task 2: A Transfer Learning Approach towards Bangla Sentiment Analysis
Dhiman Goswami | Md Nishat Raihan | Sadiya Sayara Chowdhury Puspo | Marcos Zampieri
Proceedings of the First Workshop on Bangla Language Processing (BLP-2023)

In this paper, we discuss the entry of nlpBDpatriots to some sophisticated approaches for classifying Bangla Sentiment Analysis. This is a shared task of the first workshop on Bangla Language Processing (BLP) organized under EMNLP. The main objective of this task is to identify the sentiment polarity of social media content. There are 30 groups of NLP enthusiasts who participate in this shared task and our best-performing approach for the task is transfer learning with data augmentation. Our group ranked 12th position in this competition with this methodology securing a micro F1 score of 0.71.

pdf bib
OffMix-3L: A Novel Code-Mixed Test Dataset in Bangla-English-Hindi for Offensive Language Identification
Dhiman Goswami | Md Nishat Raihan | Antara Mahmud | Antonios Anastasopoulos | Marcos Zampieri
Proceedings of the 11th International Workshop on Natural Language Processing for Social Media

pdf bib
SentMix-3L: A Novel Code-Mixed Test Dataset in Bangla-English-Hindi for Sentiment Analysis
Md Nishat Raihan | Dhiman Goswami | Antara Mahmud | Antonios Anastasopoulos | Marcos Zampieri
Proceedings of the First Workshop in South East Asian Language Processing