Tanmoy Chakraborty


2022

When did you become so smart, oh wise one?! Sarcasm Explanation in Multi-modal Multi-party Dialogues
Shivani Kumar | Atharva Kulkarni | Md Shad Akhtar | Tanmoy Chakraborty
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Indirect speech such as sarcasm achieves a constellation of discourse goals in human communication. While the indirectness of figurative language allows speakers to achieve certain pragmatic goals, it is challenging for AI agents to comprehend such idiosyncrasies of human communication. Though sarcasm identification has been a well-explored topic in dialogue analysis, for conversational systems to truly grasp a conversation’s innate meaning and generate appropriate responses, simply detecting sarcasm is not enough; it is vital to explain its underlying sarcastic connotation to capture its true essence. In this work, we study the discourse structure of sarcastic conversations and propose a novel task – Sarcasm Explanation in Dialogue (SED). Set in a multimodal and code-mixed setting, the task aims to generate natural language explanations of satirical conversations. To this end, we curate WITS, a new dataset to support our task. We propose MAF (Modality Aware Fusion), a multimodal context-aware attention and global information fusion module to capture multimodality and use it to benchmark WITS. The proposed attention module surpasses the traditional multimodal fusion baselines and reports the best performance on almost all metrics. Lastly, we carry out detailed analysis both quantitatively and qualitatively.

Can Unsupervised Knowledge Transfer from Social Discussions Help Argument Mining?
Subhabrata Dutta | Jeevesh Juneja | Dipankar Das | Tanmoy Chakraborty
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Identifying argument components from unstructured texts and predicting the relationships expressed among them are two primary steps of argument mining. The intrinsic complexity of these tasks demands powerful learning models. While pretrained Transformer-based Language Models (LM) have been shown to provide state-of-the-art results over different NLP tasks, the scarcity of manually annotated data and the highly domain-dependent nature of argumentation restrict the capabilities of such models. In this work, we propose a novel transfer learning strategy to overcome these challenges. We utilize argumentation-rich social discussions from the ChangeMyView subreddit as a source of unsupervised, argumentative discourse-aware knowledge by finetuning pretrained LMs on a selectively masked language modeling task. Furthermore, we introduce a novel prompt-based strategy for inter-component relation prediction that complements our proposed finetuning method while leveraging the discourse context. Exhaustive experiments show the generalization capability of our method on these two tasks over within-domain as well as out-of-domain datasets, outperforming several existing strong baselines.

Proceedings of the Workshop on Combating Online Hostile Posts in Regional Languages during Emergency Situations
Tanmoy Chakraborty | Md. Shad Akhtar | Kai Shu | H. Russell Bernard | Maria Liakata | Preslav Nakov | Shivam Sharma | Chhavi Sharma | Shivani Kumar | Yash Kumar Atri | Sarah Masud | Sunil Saumya | Megha Sundriyal | Karan Goyal | Anam Fatima | Aseem Srivastava
Proceedings of the Workshop on Combating Online Hostile Posts in Regional Languages during Emergency Situations

Findings of the CONSTRAINT 2022 Shared Task on Detecting the Hero, the Villain, and the Victim in Memes
Shivam Sharma | Tharun Suresh | Atharva Kulkarni | Himanshi Mathur | Preslav Nakov | Md. Shad Akhtar | Tanmoy Chakraborty
Proceedings of the Workshop on Combating Online Hostile Posts in Regional Languages during Emergency Situations

We present the findings of the shared task at the CONSTRAINT 2022 Workshop: Hero, Villain, and Victim: Dissecting harmful memes for Semantic role labeling of entities. The task aims to delve deeper into the domain of meme comprehension by deciphering the connotations behind the entities present in a meme. In more nuanced terms, the shared task focuses on determining the victimizing, glorifying, and vilifying intentions embedded in meme entities to explicate their connotations. To this end, we curate HVVMemes, a novel meme dataset of about 7000 memes spanning the domains of COVID-19 and US Politics, each containing entities and their associated roles: hero, villain, victim, or none. The shared task attracted 105 participants, but eventually only 6 submissions were made. Most of the successful submissions relied on fine-tuning pre-trained language and multimodal models along with ensembles. The best submission achieved an F1-score of 58.67.

Document Retrieval and Claim Verification to Mitigate COVID-19 Misinformation
Megha Sundriyal | Ganeshan Malhotra | Md Shad Akhtar | Shubhashis Sengupta | Andrew Fano | Tanmoy Chakraborty
Proceedings of the Workshop on Combating Online Hostile Posts in Regional Languages during Emergency Situations

During the COVID-19 pandemic, the spread of misinformation on online social media has grown exponentially. Unverified bogus claims on these platforms regularly mislead people, leading them to believe in half-baked truths. The current vogue is to employ manual fact-checkers to verify claims to combat this avalanche of misinformation. However, establishing such claims’ veracity is becoming increasingly challenging, partly due to the plethora of information available, which is difficult to process manually. Thus, it becomes imperative to verify claims automatically without human intervention. To cope with this issue, we propose an automated claim verification solution encompassing two steps – document retrieval and veracity prediction. For the retrieval module, we employ a hybrid search-based system with BM25 as a base retriever and experiment with recent state-of-the-art transformer-based models for re-ranking. Furthermore, we use a BART-based textual entailment architecture to authenticate the retrieved documents in the later step. We report experimental findings, demonstrating that our retrieval module outperforms the best baseline system by 10.32 NDCG@100 points. We also provide a demonstration to assess the efficacy and impact of our suggested solution. As a byproduct of this study, we present an open-source, easily deployable, and user-friendly Python API that the community can adopt.
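
As an illustration of the base-retrieval step mentioned in the abstract above, the following is a minimal sketch of Okapi BM25 scoring in plain Python. It is not the authors' system: the function names (`bm25_scores`, `retrieve`), the parameter values `k1=1.5` and `b=0.75`, the whitespace tokenization, and the toy documents are all assumptions made for illustration, and the transformer-based re-ranking and BART-based entailment steps are omitted.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document against a tokenized query with Okapi BM25.

    k1 and b are common default-style values chosen here as assumptions.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Smoothed IDF that stays non-negative for very common terms.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def retrieve(query, docs, top_k=2):
    """Return the indices of the top_k highest-scoring documents."""
    scores = bm25_scores(query, docs)
    ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return ranked[:top_k]


# Hypothetical toy corpus and claim, tokenized by whitespace.
DOCS = [
    "vaccines reduce severe covid illness".split(),
    "garlic does not cure covid".split(),
    "the stock market fell today".split(),
]
QUERY = "does garlic cure covid".split()

if __name__ == "__main__":
    print(retrieve(QUERY, DOCS))
```

In a pipeline of the kind the abstract describes, the candidates returned by such a first-stage retriever would then be passed to a stronger re-ranking model.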

2021

Detecting Harmful Memes and Their Targets
Shraman Pramanick | Dimitar Dimitrov | Rituparna Mukherjee | Shivam Sharma | Md. Shad Akhtar | Preslav Nakov | Tanmoy Chakraborty
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

HIT - A Hierarchically Fused Deep Attention Network for Robust Code-mixed Language Representation
Ayan Sengupta | Sourabh Kumar Bhattacharjee | Tanmoy Chakraborty | Md. Shad Akhtar
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

Fingerprinting Fine-tuned Language Models in the Wild
Nirav Diwan | Tanmoy Chakraborty | Zubair Shafiq
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

MOMENTA: A Multimodal Framework for Detecting Harmful Memes and Their Targets
Shraman Pramanick | Shivam Sharma | Dimitar Dimitrov | Md. Shad Akhtar | Preslav Nakov | Tanmoy Chakraborty
Findings of the Association for Computational Linguistics: EMNLP 2021

Internet memes have become powerful means to transmit political, psychological, and socio-cultural ideas. Although memes are typically humorous, recent days have witnessed an escalation of harmful memes used for trolling, cyberbullying, and abuse. Detecting such memes is challenging as they can be highly satirical and cryptic. Moreover, while previous work has focused on specific aspects of memes such as hate speech and propaganda, there has been little work on harm in general. Here, we aim to bridge this gap. In particular, we focus on two tasks: (i) detecting harmful memes, and (ii) identifying the social entities they target. We further extend the recently released HarMeme dataset, which covered COVID-19, with additional memes and a new topic: US politics. To solve these tasks, we propose MOMENTA (MultimOdal framework for detecting harmful MemEs aNd Their tArgets), a novel multimodal deep neural network that uses global and local perspectives to detect harmful memes. MOMENTA systematically analyzes the local and the global perspective of the input meme (in both modalities) and relates it to the background context. MOMENTA is interpretable and generalizable, and our experiments show that it outperforms several strong rivaling approaches.

Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
Kristina Toutanova | Anna Rumshisky | Luke Zettlemoyer | Dilek Hakkani-Tur | Iz Beltagy | Steven Bethard | Ryan Cotterell | Tanmoy Chakraborty | Yichao Zhou
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

LESA: Linguistic Encapsulation and Semantic Amalgamation Based Generalised Claim Detection from Online Content
Shreya Gupta | Parantak Singh | Megha Sundriyal | Md. Shad Akhtar | Tanmoy Chakraborty
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

The conceptualization of a claim lies at the core of argument mining. The segregation of claims is complex, owing to the divergence in textual syntax and context across different distributions. Another pressing issue is the unavailability of labeled unstructured text for experimentation. In this paper, we propose LESA, a framework that addresses the former issue by assembling a source-independent generalized model that captures syntactic features through part-of-speech and dependency embeddings, as well as contextual features through a fine-tuned language model. We resolve the latter issue by annotating a Twitter dataset which aims at providing a testing ground on a large unstructured dataset. Experimental results show that LESA improves upon the state-of-the-art performance across six benchmark claim datasets by an average of 3 claim-F1 points for in-domain experiments and by 2 claim-F1 points for general-domain experiments. On our dataset too, LESA outperforms existing baselines by 1 claim-F1 point on the in-domain experiments and 2 claim-F1 points on the general-domain experiments. We also release comprehensive data annotation guidelines compiled during the annotation phase, which were missing from the current literature.

2020

Corpora Evaluation and System Bias Detection in Multi-document Summarization
Alvin Dey | Tanya Chowdhury | Yash Kumar | Tanmoy Chakraborty
Findings of the Association for Computational Linguistics: EMNLP 2020

Multi-document summarization (MDS) is the task of reflecting key points from any set of documents into a concise text paragraph. In the past, it has been used to aggregate news, tweets, product reviews, etc. from various sources. Owing to no standard definition of the task, we encounter a plethora of datasets with varying levels of overlap and conflict between participating documents. There is also no standard regarding what constitutes summary information in MDS. Adding to the challenge is the fact that new systems report results on a set of chosen datasets, which might not correlate with their performance on the other datasets. In this paper, we study this heterogeneous task with the help of a few widely used MDS corpora and a suite of state-of-the-art models. We make an attempt to quantify the quality of a summarization corpus and prescribe a list of points to consider while proposing a new MDS corpus. Next, we analyze the reason behind the absence of an MDS system which achieves superior performance across all corpora. We then observe the extent to which system metrics are influenced, and bias is propagated due to corpus properties. The scripts to reproduce the experiments in this work are available at https://github.com/LCS2-IIITD/summarization_bias.git

SemEval-2020 Task 8: Memotion Analysis- the Visuo-Lingual Metaphor!
Chhavi Sharma | Deepesh Bhageria | William Scott | Srinivas PYKL | Amitava Das | Tanmoy Chakraborty | Viswanath Pulabaigari | Björn Gambäck
Proceedings of the Fourteenth Workshop on Semantic Evaluation

Information on social media comprises various modalities such as textual, visual and audio. NLP and Computer Vision communities often leverage only one prominent modality in isolation to study social media. However, computational processing of Internet memes needs a hybrid approach. The growing ubiquity of Internet memes on social media platforms such as Facebook, Instagram, and Twitter further suggests that we cannot ignore such multimodal content anymore. To the best of our knowledge, meme emotion analysis has received little attention. The objective of this proposal is to bring the attention of the research community towards the automatic processing of Internet memes. The Memotion Analysis task released approximately 10K memes with human-annotated labels, namely sentiment (positive, negative, neutral), type of emotion (sarcastic, funny, offensive, motivation), and their corresponding intensity. The challenge consisted of three subtasks: sentiment (positive, negative, and neutral) analysis of memes, overall emotion (humor, sarcasm, offensive, and motivational) classification of memes, and classifying intensity of meme emotion. The best performances achieved were F1 (macro average) scores of 0.35, 0.51 and 0.32 for the three subtasks, respectively.
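
For readers unfamiliar with the macro-averaged F1 reported in the abstract above, here is a minimal sketch of how such a score can be computed: F1 is calculated per class and the per-class values are averaged with equal weight. The function name `macro_f1` and the toy labels are assumptions for illustration; the official task scorer may differ in details such as which labels are included in the average.

```python
def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over every label seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    f1_values = []
    for label in labels:
        pairs = list(zip(gold, pred))
        tp = sum(1 for g, p in pairs if g == label and p == label)
        fp = sum(1 for g, p in pairs if g != label and p == label)
        fn = sum(1 for g, p in pairs if g == label and p != label)
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never matches.
        denom = 2 * tp + fp + fn
        f1_values.append(2 * tp / denom if denom else 0.0)
    return sum(f1_values) / len(f1_values)


# Hypothetical sentiment labels for four memes.
GOLD = ["positive", "negative", "neutral", "positive"]
PRED = ["positive", "neutral", "neutral", "negative"]

if __name__ == "__main__":
    print(round(macro_f1(GOLD, PRED), 4))
```

Because every class contributes equally, a system that ignores a rare class is penalized more heavily than it would be under accuracy or micro-averaged F1.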

SemEval-2020 Task 9: Overview of Sentiment Analysis of Code-Mixed Tweets
Parth Patwa | Gustavo Aguilar | Sudipta Kar | Suraj Pandey | Srinivas PYKL | Björn Gambäck | Tanmoy Chakraborty | Thamar Solorio | Amitava Das
Proceedings of the Fourteenth Workshop on Semantic Evaluation

In this paper, we present the results of the SemEval-2020 Task 9 on Sentiment Analysis of Code-Mixed Tweets (SentiMix 2020). We also release and describe our Hinglish (Hindi-English) and Spanglish (Spanish-English) corpora annotated with word-level language identification and sentence-level sentiment labels. These corpora comprise 20K and 19K examples, respectively. The sentiment labels are Positive, Negative, and Neutral. SentiMix attracted 89 submissions in total, including 61 teams that participated in the Hinglish contest and 28 submitted systems to the Spanglish competition. The best performance achieved was 75.0% F1 score for Hinglish and 80.6% F1 for Spanglish. We observe that BERT-like models and ensemble methods are the most common and successful approaches among the participants.

Nurse is Closer to Woman than Surgeon? Mitigating Gender-Biased Proximities in Word Embeddings
Vaibhav Kumar | Tenzin Singhay Bhotia | Vaibhav Kumar | Tanmoy Chakraborty
Transactions of the Association for Computational Linguistics, Volume 8

Word embeddings are the standard model for semantic and syntactic representations of words. Unfortunately, these models have been shown to exhibit undesirable word associations resulting from gender, racial, and religious biases. Existing post-processing methods for debiasing word embeddings are unable to mitigate gender bias hidden in the spatial arrangement of word vectors. In this paper, we propose RAN-Debias, a novel gender debiasing methodology that not only eliminates the bias present in a word vector but also alters the spatial distribution of its neighboring vectors, achieving a bias-free setting while maintaining minimal semantic offset. We also propose a new bias evaluation metric, Gender-based Illicit Proximity Estimate (GIPE), which measures the extent of undue proximity in word vectors resulting from the presence of gender-based predilections. Experiments based on a suite of evaluation metrics show that RAN-Debias significantly outperforms the state-of-the-art in reducing proximity bias (GIPE) by at least 42.02%. It also reduces direct bias, adding minimal semantic disturbance, and achieves the best performance in a downstream application task (coreference resolution).

2019

Clark Kent at SemEval-2019 Task 4: Stylometric Insights into Hyperpartisan News Detection
Viresh Gupta | Baani Leen Kaur Jolly | Ramneek Kaur | Tanmoy Chakraborty
Proceedings of the 13th International Workshop on Semantic Evaluation

In this paper, we present a news bias prediction system, which we developed as part of a SemEval 2019 task. We developed an XGBoost-based system which uses character- and word-level n-gram features represented using TF-IDF and a count-vector-based correlation matrix, and predicts whether an input news article is hyperpartisan. Our model was able to achieve a precision of 68.3% on the test set provided by the contest organizers. We also run our model on the BuzzFeed corpus and find that XGBoost with simple character-level n-gram embeddings performs well, with an accuracy of around 96%.
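
To make the feature representation named in the abstract above concrete, the following is a minimal sketch of character n-gram TF-IDF features in plain Python. It is an illustration under stated assumptions, not the authors' pipeline: the function names (`char_ngrams`, `tfidf_vectors`), the choice of trigrams, the smoothed IDF formula, and the toy headlines are all hypothetical, and the XGBoost classifier that would consume these features is left out.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return all overlapping character n-grams of the lowercased text."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def tfidf_vectors(texts, n=3):
    """Map each text to a sparse {n-gram: TF-IDF weight} dictionary."""
    counts = [Counter(char_ngrams(t, n)) for t in texts]
    n_docs = len(texts)
    # Document frequency: number of texts in which each n-gram occurs.
    df = Counter(gram for c in counts for gram in c)
    vectors = []
    for c in counts:
        total = sum(c.values())
        vectors.append({
            # Relative term frequency times a smoothed IDF (an assumed variant).
            gram: (cnt / total) * (math.log((1 + n_docs) / (1 + df[gram])) + 1)
            for gram, cnt in c.items()
        })
    return vectors


# Hypothetical toy headlines.
TEXTS = ["fake news", "real news"]

if __name__ == "__main__":
    for vec in tfidf_vectors(TEXTS):
        print(sorted(vec.items(), key=lambda kv: -kv[1])[:3])
```

N-grams shared by both texts (such as "new") receive a lower weight than n-grams unique to one text (such as "fak"), which is the property that makes such features useful for a downstream classifier.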

2016

Proceedings of TextGraphs-10: the Workshop on Graph-based Methods for Natural Language Processing
Tanmoy Chakraborty | Martin Riedl | V.G. Vinod Vydiswaran
Proceedings of TextGraphs-10: the Workshop on Graph-based Methods for Natural Language Processing

All Fingers are not Equal: Intensity of References in Scientific Articles
Tanmoy Chakraborty | Ramasuri Narayanam
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing

2012

Authorship Identification in Bengali Literature: a Comparative Analysis
Tanmoy Chakraborty
Proceedings of COLING 2012: Demonstration Papers

2011

Handling Multiword Expressions in Phrase-Based Statistical Machine Translation
Santanu Pal | Tanmoy Chakraborty | Sivaji Bandyopadhyay
Proceedings of Machine Translation Summit XIII: Papers

Semantic Clustering: an Attempt to Identify Multiword Expressions in Bengali
Tanmoy Chakraborty | Dipankar Das | Sivaji Bandyopadhyay
Proceedings of the Workshop on Multiword Expressions: from Parsing and Generation to the Real World

Shared Task System Description: Measuring the Compositionality of Bigrams using Statistical Methodologies
Tanmoy Chakraborty | Santanu Pal | Tapabrata Mondal | Tanik Saikh | Sivaji Bandyopadhyay
Proceedings of the Workshop on Distributional Semantics and Compositionality

2010

Automatic Extraction of Complex Predicates in Bengali
Dipankar Das | Santanu Pal | Tapabrata Mondal | Tanmoy Chakraborty | Sivaji Bandyopadhyay
Proceedings of the 2010 Workshop on Multiword Expressions: from Theory to Applications

Identification of Reduplication in Bengali Corpus and their Semantic Analysis: A Rule Based Approach
Tanmoy Chakraborty | Sivaji Bandyopadhyay
Proceedings of the 2010 Workshop on Multiword Expressions: from Theory to Applications