Abhik Jana

2025

This is not a Disimprovement: Improving Negation Reasoning in Large Language Models via Prompt Engineering
Joshua Jose Dias Barreto | Abhik Jana
Findings of the Association for Computational Linguistics: EMNLP 2025

Negation reasoning remains a challenge for large language models (LLMs), often causing incorrect interpretations of negated statements. In this study, we analyze various LLMs for their handling of negation and propose two genres of prompts (*Warning-based* and *Persona-based*), which improve overall absolute accuracy by up to 3.17% and distractor negation accuracy by up to 25.14% over most competitive baselines. Next, we assess the robustness of LLMs by reordering prompts while preserving meaning, observing instability linked to positional encoding schemes. Further, we introduce a negative token attention score (NTAS) to quantify attention to negation words. From the comprehensive analysis, we point out that within a specific LLM family, the performance of a model (measured using accuracy) correlates more with NTAS than with model size. The code is publicly available: [https://github.com/Joshua-Dias-Barreto/This-is-not-a-Disimprovement](https://github.com/Joshua-Dias-Barreto/This-is-not-a-Disimprovement)

TripCraft: A Benchmark for Spatio-Temporally Fine Grained Travel Planning
Soumyabrata Chaudhuri | Pranav Purkar | Ritwik Raghav | Shubhojit Mallick | Manish Gupta | Abhik Jana | Shreya Ghosh
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Recent advancements in probing Large Language Models (LLMs) have explored their latent potential as personalized travel planning agents, though this remains a rather nascent field. Existing benchmarks, such as TravelPlanner and TravelPlanner+, rely on semi-synthetic data and ignore several key components of travel planning, limiting their real-world applicability. Therefore, we introduce TripCraft, a spatio-temporally coherent travel planning dataset incorporating real-world constraints, including public transit schedules, public events, varied attraction categories, and user personas for enhanced personalization. Our dataset enables more detailed trip itinerary generation (including duration spent at each point of interest based on users’ persona, transit between two points of interest, etc.) while ensuring spatio-temporal consistency. Further, we propose novel evaluation metrics (temporal meal score, attraction score, spatial score, ordering score, and persona score) to assess LLM-generated plans across temporal, spatial, sequential, and personal dimensions, overcoming the limitations of commonsense and hard constraint metrics. Interestingly, our parameter-informed setting significantly enhances meal scheduling, improving performance from 61% to 80% in the 7-day scenario, a gain of 19 percentage points in our temporal meal score. Moreover, TripCraft serves as a high-quality benchmark for advancing personalized LLM-driven travel planning.

Text Takes Over: A Study of Modality Bias in Multimodal Intent Detection
Ankan Mullick | Saransh Sharma | Abhik Jana | Pawan Goyal
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

The rise of multimodal data, integrating text, audio, and visuals, has created new opportunities for studying multimodal tasks such as intent detection. This work investigates the effectiveness of Large Language Models (LLMs) and non-LLMs, including text-only and multimodal models, in the multimodal intent detection task. Our study reveals that Mistral-7B, a text-only LLM, outperforms most competitive multimodal models by approximately 9% on the MIntRec-1 dataset and 4% on the MIntRec2.0 dataset. This performance advantage comes from a strong textual bias in these datasets, where over 90% of the samples require textual input, either alone or in combination with other modalities, for correct classification. We also confirm the modality bias of these datasets via human evaluation. Next, we propose a framework to debias the datasets; upon debiasing, more than 70% of the samples in MIntRec-1 and more than 50% in MIntRec2.0 are removed, resulting in significant performance degradation across all models, with smaller multimodal fusion models being the most affected, showing an accuracy drop of over 50-60%. Further, we analyze the context-specific relevance of different modalities through empirical analysis. Our findings highlight the challenges posed by modality bias in multimodal intent datasets and emphasize the need for unbiased datasets to evaluate multimodal models effectively. We release both the code and the dataset used for this work at https://github.com/Text-Takes-Over-EMNLP-2025/MultiModal-Intent-EMNLP-2025.

Are LLMs Good for Semantic Role Labeling via Question Answering?: A Preliminary Analysis
Ritwik Raghav | Abhik Jana
The 14th International Joint Conference on Natural Language Processing and The 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics

Semantic role labeling (SRL) is a fundamental task in natural language processing that is crucial for achieving deep semantic understanding. Despite the success of large language models (LLMs) in several downstream NLP tasks, key tasks such as SRL remain a challenge for LLMs. Hence, in this study, we attempt to assess the efficacy of LLMs for the task of SRL via question answering. Toward that goal, we investigate the effectiveness of five different LLMs (Llama, Mistral, Qwen, OpenChat, Gemini) using zero-shot and few-shot prompting. Our findings indicate that few-shot prompting enhances the performance of all models. Although Gemini outperformed the others by a margin of 11%, Qwen and Llama are not too far behind. Additionally, we conduct a comprehensive error analysis to shed light on the cases where LLMs fail. This study offers valuable insights into the performance of LLMs for structured prediction and the effectiveness of simple prompting techniques in the question-answering framework for SRL.

2024

On Zero-Shot Counterspeech Generation by LLMs
Punyajoy Saha | Aalok Agrawal | Abhik Jana | Chris Biemann | Animesh Mukherjee
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

With the emergence of numerous Large Language Models (LLMs), the usage of such models in various Natural Language Processing (NLP) applications is increasing extensively. Counterspeech generation is one such key task, where efforts are made to develop generative models by fine-tuning LLMs with hatespeech-counterspeech pairs, but none of these attempts explores the intrinsic properties of large language models in zero-shot settings. In this work, we present a comprehensive analysis of the performance of four LLMs, namely GPT-2, DialoGPT, ChatGPT and FlanT5, in zero-shot settings for counterspeech generation, which is the first of its kind. For GPT-2 and DialoGPT, we further investigate the deviation in performance with respect to the sizes (small, medium, large) of the models. In addition, we propose three different prompting strategies for generating different types of counterspeech and analyse the impact of such strategies on the performance of the models. Our analysis shows that generation quality improves for two datasets (17%) as model size increases, but toxicity also increases (25%). Considering the type of model, GPT-2 and FlanT5 are significantly better in terms of counterspeech quality but also have high toxicity compared to DialoGPT. ChatGPT is much better at generating counterspeech than the other models across all metrics. In terms of prompting, we find that our proposed strategies help in improving counterspeech generation across all the models.

CrowdCounter: A benchmark type-specific multi-target counterspeech dataset
Punyajoy Saha | Abhilash Datta | Abhik Jana | Animesh Mukherjee
Proceedings of the 28th Conference on Computational Natural Language Learning

Counterspeech presents a viable alternative to banning or suspending users for hate speech while upholding freedom of expression. However, writing effective counterspeech is challenging for moderators/users. Hence, developing suggestion tools for writing counterspeech is the need of the hour. One critical challenge in developing such a tool is the lack of quality and diversity of the responses in the existing datasets. Hence, we introduce a new dataset, CrowdCounter, containing 3,425 hate speech-counterspeech pairs spanning six different counterspeech types (empathy, humor, questioning, warning, shaming, contradiction), which is the first of its kind. The design of our annotation platform itself encourages annotators to write type-specific, non-redundant and high-quality counterspeech. We evaluate two frameworks for generating counterspeech responses, vanilla and type-controlled prompts, across four large language models. In terms of metrics, we evaluate the responses using relevance, diversity and quality. We observe that Flan-T5 is the best of the evaluated models in the vanilla framework. Type-specific prompts enhance the relevance of the responses, although they might reduce the language quality. DialoGPT proves to be the best at following the instructions and generating the type-specific counterspeech accurately.

2022

Towards Bengali WordNet Enrichment using Knowledge Graph Completion Techniques
Sree Bhattacharyya | Abhik Jana
Proceedings of the Workshop on Resources and Technologies for Indigenous, Endangered and Lesser-resourced Languages in Eurasia within the 13th Language Resources and Evaluation Conference

WordNet serves as an essential knowledge source for various downstream Natural Language Processing (NLP) tasks. Since it is a human-curated resource, building it is cumbersome and time-consuming. Even though for languages like English the existing WordNet is reasonably rich in terms of coverage, for resource-poor languages like Bengali the WordNet is far from sufficient in its coverage of vocabulary and of the relations between words. In this paper, we investigate the usefulness of some of the existing knowledge graph completion algorithms to enrich Bengali WordNet automatically. We explore three such techniques, namely DistMult, ComplEx, and HolE, and analyze their effectiveness for adding more relations between existing nodes in the WordNet. We achieve a maximum Hits@1 of 0.412 and Hits@10 of 0.703, which look very promising for low-resource languages like Bengali.
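
For readers unfamiliar with the metric reported above, Hits@k is the fraction of test queries for which the correct entity appears among the top k ranked candidates. The following is a minimal sketch in Python, assuming each query is summarized by the 1-based rank of its correct entity; the `hits_at_k` helper and the sample ranks are illustrative and are not taken from the paper.

```python
from typing import Sequence


def hits_at_k(ranks: Sequence[int], k: int) -> float:
    """Fraction of queries whose correct entity is ranked within the top k.

    `ranks` holds the 1-based rank of the correct entity for each query.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    if not ranks:
        raise ValueError("ranks must not be empty")
    return sum(1 for r in ranks if r <= k) / len(ranks)


# Made-up ranks for five hypothetical link-prediction queries.
example_ranks = [1, 3, 12, 1, 7]
hits1 = hits_at_k(example_ranks, 1)    # two of the five are ranked first
hits10 = hits_at_k(example_ranks, 10)  # four of the five are in the top 10
```

Under this reading, a Hits@10 of 0.703 would mean that the correct entity was among the ten highest-ranked candidates for roughly 70% of the test queries.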

Enriching Hindi WordNet Using Knowledge Graph Completion Approach
Sushil Awale | Abhik Jana
Proceedings of the Workshop on Resources and Technologies for Indigenous, Endangered and Lesser-resourced Languages in Eurasia within the 13th Language Resources and Evaluation Conference

Even though the usefulness of WordNet in the Natural Language Processing domain is unquestionable, creating and maintaining a WordNet is a cumbersome job, and it is even more difficult for low-resource languages like Hindi. In this study, we aim to enrich the Hindi WordNet automatically by using state-of-the-art knowledge graph completion (KGC) approaches. First, we pose the automatic Hindi WordNet enrichment problem as a knowledge graph completion task, and therefore we modify the WordNet structure to make it appropriate for applying KGC approaches. Second, we attempt five KGC approaches of three different genres and compare their performance on the task. Our study shows that ConvE is the best KGC methodology for this specific task compared to the other KGC approaches.

LexGLUE: A Benchmark Dataset for Legal Language Understanding in English
Ilias Chalkidis | Abhik Jana | Dirk Hartung | Michael Bommarito | Ion Androutsopoulos | Daniel Katz | Nikolaos Aletras
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Laws and their interpretations, legal arguments and agreements are typically expressed in writing, leading to the production of vast corpora of legal text. Their analysis, which is at the center of legal practice, becomes increasingly elaborate as these collections grow in size. Natural language understanding (NLU) technologies can be a valuable tool to support legal practitioners in these endeavors. Their usefulness, however, largely depends on whether current state-of-the-art models can generalize across various tasks in the legal domain. To answer this currently open question, we introduce the Legal General Language Understanding Evaluation (LexGLUE) benchmark, a collection of datasets for evaluating model performance across a diverse set of legal NLU tasks in a standardized way. We also provide an evaluation and analysis of several generic and legal-oriented models demonstrating that the latter consistently offer performance improvements across multiple tasks.

2021

Error Analysis of using BART for Multi-Document Summarization: A Study for English and German Language
Timo Johner | Abhik Jana | Chris Biemann
Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa)

Recent research using pre-trained language models for the multi-document summarization task lacks a deep investigation of potential erroneous cases and of the models' possible application to other languages. In this work, we apply a pre-trained language model (BART) to the multi-document summarization (MDS) task, both with and without fine-tuning. We use two English datasets and one German dataset for this study. First, we reproduce the multi-document summaries for English by following one of the recent studies. Next, we show the applicability of the model to German by achieving state-of-the-art performance on German MDS. We perform an in-depth error analysis of the followed approach for both languages, which leads us to identify the most notable errors, from made-up facts to topic delimitation, and to quantify the amount of extractiveness.

An Investigation towards Differentially Private Sequence Tagging in a Federated Framework
Abhik Jana | Chris Biemann
Proceedings of the Third Workshop on Privacy in Natural Language Processing

To build machine learning-based applications for sensitive domains such as the medical and legal domains, where the digitized text contains private information, anonymization of text is required to preserve privacy. Sequence tagging, e.g. as done in Named Entity Recognition (NER), can help to detect private information. However, training sequence tagging models requires a sufficient amount of labeled data, and in privacy-sensitive domains such labeled data cannot be shared directly either. In this paper, we investigate the applicability of a privacy-preserving framework for sequence tagging tasks, specifically NER. To this end, we analyze a framework for the NER task that incorporates two levels of privacy protection. Firstly, we deploy a federated learning (FL) framework in which the labeled data are shared with neither the centralized server nor the peer clients. Secondly, we apply differential privacy (DP) while the models are being trained in each client instance. While both privacy measures are suitable for privacy-aware models, their combination results in unstable models. To our knowledge, this is the first study of its kind on privacy-aware sequence tagging models.

Proceedings of the Fifteenth Workshop on Graph-Based Methods for Natural Language Processing (TextGraphs-15)
Alexander Panchenko | Fragkiskos D. Malliaros | Varvara Logacheva | Abhik Jana | Dmitry Ustalov | Peter Jansen
Proceedings of the Fifteenth Workshop on Graph-Based Methods for Natural Language Processing (TextGraphs-15)

Sentiment Analysis For Bengali Using Transformer Based Models
Anirban Bhowmick | Abhik Jana
Proceedings of the 18th International Conference on Natural Language Processing (ICON)

Sentiment analysis is one of the key Natural Language Processing (NLP) tasks that has been attempted extensively by researchers for resource-rich languages like English. But for low-resource languages like Bengali, very few attempts have been made, for reasons including the lack of corpora to train machine learning models and the lack of gold standard datasets for evaluation. However, with the emergence of transformer models pre-trained in several languages, researchers are showing interest in investigating the applicability of these models to several NLP tasks, especially for low-resource languages. In this paper, we investigate the usefulness of two pre-trained transformer models, namely multilingual BERT and XLM-RoBERTa (with fine-tuning), for sentiment analysis in the Bengali language. We use three datasets for the Bengali language for evaluation and produce state-of-the-art performance, even reaching a maximum of 95% accuracy on a two-class sentiment classification task. We believe this work can serve as a good benchmark as far as sentiment analysis for the Bengali language is concerned.

2020

Using Distributional Thesaurus Embedding for Co-hyponymy Detection
Abhik Jana | Nikhil Reddy Varimalla | Pawan Goyal
Proceedings of the Twelfth Language Resources and Evaluation Conference

Discriminating lexical relations among distributionally similar words has always been a challenge for the natural language processing (NLP) community. In this paper, we investigate whether the network embedding of a distributional thesaurus can be effectively utilized to detect co-hyponymy relations. Through extensive experiments over three benchmark datasets, we show that the vector representation obtained by applying node2vec to a distributional thesaurus outperforms the state-of-the-art models for binary classification of co-hyponymy vs. hypernymy, as well as co-hyponymy vs. meronymy, by huge margins.

2019

On the Compositionality Prediction of Noun Phrases using Poincaré Embeddings
Abhik Jana | Dima Puzyrev | Alexander Panchenko | Pawan Goyal | Chris Biemann | Animesh Mukherjee
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

The compositionality degree of multiword expressions indicates to what extent the meaning of a phrase can be derived from the meaning of its constituents and their grammatical relations. Prediction of (non)-compositionality is a task that has been frequently addressed with distributional semantic models. We introduce a novel technique to blend hierarchical information with distributional information for predicting compositionality. In particular, we use hypernymy information of the multiword and its constituents encoded in the form of the recently introduced Poincaré embeddings in addition to the distributional information to detect compositionality for noun phrases. Using a weighted average of the distributional similarity and a Poincaré similarity function, we obtain consistent and substantial, statistically significant improvement across three gold standard datasets over state-of-the-art models based on distributional information only. Unlike traditional approaches that solely use an unsupervised setting, we have also framed the problem as a supervised task, obtaining comparable improvements. Further, we publicly release our Poincaré embeddings, which are trained on the output of handcrafted lexical-syntactic patterns on a large corpus.
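
The blending step described above can be sketched as follows. This is a rough illustration only: the distance is the standard Poincaré-ball distance, while the conversion from distance to similarity (`1 / (1 + d)`), the weight `alpha`, and all function names are assumptions made for this example and may differ from the paper's actual formulation.

```python
import math
from typing import Sequence


def poincare_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Standard distance between two points strictly inside the unit ball."""
    sq_u = sum(x * x for x in u)
    sq_v = sum(x * x for x in v)
    if sq_u >= 1.0 or sq_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_u) * (1.0 - sq_v)))


def poincare_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Assumed distance-to-similarity mapping into (0, 1]; not from the paper."""
    return 1.0 / (1.0 + poincare_distance(u, v))


def blended_score(dist_sim: float, poin_sim: float, alpha: float = 0.5) -> float:
    """Weighted average of the two similarities; alpha is a hypothetical weight."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    return alpha * dist_sim + (1.0 - alpha) * poin_sim


# Hypothetical 2-d embeddings of a phrase and one of its constituents.
phrase_vec = [0.0, 0.0]
constituent_vec = [0.5, 0.0]
score = blended_score(
    dist_sim=0.6,  # made-up distributional similarity
    poin_sim=poincare_similarity(phrase_vec, constituent_vec),
    alpha=0.7,
)
```

In a sketch like this, `alpha` would typically be tuned on held-out data; setting it to 1.0 recovers a purely distributional score.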

Incorporating Domain Knowledge into Medical NLI using Knowledge Graphs
Soumya Sharma | Bishal Santra | Abhik Jana | Santosh T.y.s.s | Niloy Ganguly | Pawan Goyal
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Recently, biomedical versions of embeddings obtained from language models, such as BioELMo, have shown state-of-the-art results for the textual inference task in the medical domain. In this paper, we explore how to incorporate structured domain knowledge, available in the form of a knowledge graph (UMLS), for the Medical NLI task. Specifically, we experiment with fusing embeddings obtained from the knowledge graph with the state-of-the-art approaches for the NLI task (ESIM model). We also experiment with fusing domain-specific sentiment information for the task. Experiments conducted on the MedNLI dataset clearly show that this strategy improves the baseline BioELMo architecture for the Medical NLI task.

2018

WikiRef: Wikilinks as a route to recommending appropriate references for scientific Wikipedia pages
Abhik Jana | Pranjal Kanojiya | Pawan Goyal | Animesh Mukherjee
Proceedings of the 27th International Conference on Computational Linguistics

The exponential increase in the usage of Wikipedia as a key source of scientific knowledge among researchers is making it absolutely necessary to metamorphose this knowledge repository into an integral and self-contained source of information for direct utilization. Unfortunately, the references that support the content of each Wikipedia entity page are far from complete. Why is the reference section ill-formed for most Wikipedia pages? Is this section edited as frequently as the other sections of a page? Can there be appropriate surrogates that can automatically enhance the reference section? In this paper, we propose a novel two-step approach – WikiRef – that (i) leverages the wikilinks present in a scientific Wikipedia target page and, thereby, (ii) recommends highly relevant references to be included in that target page, appropriately and automatically borrowed from the reference sections of the wikilinks. In the first step, we build a classifier to ascertain whether a wikilink is a potential source of references or not. In the following step, we recommend references to the target page from the reference sections of the wikilinks that are classified as potential sources of references in the first step. We perform an extensive evaluation of our approach on datasets from two different domains – Computer Science and Physics. For Computer Science, we achieve a notably good performance with a precision@1 of 0.44 for reference recommendation, as opposed to 0.38 obtained from the most competitive baseline. For the Physics dataset, we obtain a similar performance boost of 10% with respect to the most competitive baseline.

Network Features Based Co-hyponymy Detection
Abhik Jana | Pawan Goyal
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Can Network Embedding of Distributional Thesaurus Be Combined with Word Vectors for Better Representation?
Abhik Jana | Pawan Goyal
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)

Distributed representations of words learned from text have proved to be successful in various natural language processing tasks in recent times. While some methods represent words as vectors computed from text using a predictive model (Word2vec) or a dense count-based model (GloVe), others attempt to represent them in a distributional thesaurus network structure, where the neighborhood of a word is a set of words having adequate context overlap. Motivated by the recent surge of research in network embedding techniques (DeepWalk, LINE, node2vec, etc.), we turn a distributional thesaurus network into dense word vectors and investigate the usefulness of distributional thesaurus embedding in improving overall word representation. This is the first attempt to show that combining the proposed word representation obtained by distributional thesaurus embedding with state-of-the-art word representations improves performance by a significant margin when evaluated on NLP tasks like word similarity and relatedness, synonym detection, and analogy detection. Additionally, we show that even without using any handcrafted lexical resources, we can come up with representations whose performance in the word similarity and relatedness tasks is comparable to that of representations where a lexical resource has been used.