Manish Gupta - ACL Anthology

Manish Gupta

2026

Chat-Ghosting: Methods for Auto-Completion in Dialog Systems
Anubhab Mandal | Sandeep Mishra | Bishal Santra | Tushar Abhishek | Pawan Goyal | Manish Gupta
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

Ghosting, the ability to predict a user’s intended text input for inline query auto-completion, is an invaluable feature for modern search engines and chat interfaces, greatly enhancing user experience. By suggesting completions to incomplete queries (or prefixes), ghosting aids users with slow typing speeds, disabilities, or limited language proficiency. Ghosting is a challenging problem and has become more important with the ubiquitousness of chat-based systems like ChatGPT, Copilot, etc. Despite the increasing prominence of chat-based systems utilizing ghosting, this challenging problem of Chat-Ghosting has received little attention from the NLP/ML research community. There is a lack of standardized benchmarks and relative performance analysis of deep learning and non-deep learning methods. We address this through an open and thorough study of this problem using four publicly available dialog datasets: two human-human (DailyDialog and DSTC7-Ubuntu) and two human-bot (Open Assistant and ShareGPT). We experiment with various existing query auto-completion methods (using tries), n-gram methods and deep learning methods, with and without dialog context. We also propose a novel entropy-based dynamic early stopping strategy. Our analysis finds that statistical n-gram models and tries outperform deep learning based models in terms of both model performance and inference efficiency for seen prefixes. For unseen queries, neural models like T5 and Phi-2 lead to better results. Adding conversational context leads to significant improvements in ghosting quality, especially for Open-Assistant and ShareGPT. We make code and data publicly available at https://github.com/blitzprecision/Chat-Ghosting.

Router-Suggest: Dynamic Routing for Multimodal Auto-Completion in Visually-Grounded Dialogs
Sandeep Mishra | Devichand Budagam | Anubhab Mandal | Bishal Santra | Pawan Goyal | Manish Gupta
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 5: Industry Track)

Real-time multimodal auto-completion is essential for digital assistants, chatbots, design tools, and healthcare consultations, where user inputs rely on shared visual context. We introduce Multimodal Auto-Completion (MAC), a task that predicts upcoming characters in live chats using partially typed text and visual cues. Unlike traditional text-only auto-completion (TAC), MAC grounds predictions in multimodal context to better capture user intent. To enable this task, we adapt MMDialog and ImageChat to create benchmark datasets. We evaluate leading vision-language models (VLMs) against strong textual baselines, highlighting trade-offs in accuracy and efficiency. We present Router-Suggest, a router framework that dynamically selects between textual models and VLMs based on dialog context, along with a lightweight variant for resource-constrained environments. Router-Suggest achieves a 2.3x to 10x speedup over the best-performing VLM. A user study shows that VLMs significantly excel over textual models on user satisfaction, notably saving user typing effort and improving the quality of completions in multi-turn conversations. These findings underscore the need for multimodal context in auto-completions, leading to smarter, user-aware assistants.

Learning to Ask: Multi-Decoder Fine-Tuning for Multi-Hop Visual Question Generation with External Knowledge
Arpan Phukan | Manish Gupta | Asif Ekbal
Findings of the Association for Computational Linguistics: EACL 2026

2025

SCULPT: Systematic Tuning of Long Prompts
Shanu Kumar | Akhila Yesantarao Venkata | Shubhanshu Khandelwal | Bishal Santra | Parag Agrawal | Manish Gupta
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Prompt optimization is essential for effective utilization of large language models (LLMs) across diverse tasks. While existing optimization methods are effective in optimizing short prompts, they struggle with longer, more complex ones, often risking information loss and being sensitive to small perturbations. To address these challenges, we propose SCULPT (Systematic Tuning of Long Prompts), a framework that treats prompt optimization as a hierarchical tree refinement problem. SCULPT represents prompts as tree structures, enabling targeted modifications while preserving contextual integrity. It employs a Critic-Actor framework that generates reflections and applies actions to refine the prompt. Evaluations demonstrate SCULPT’s effectiveness on long prompts, its robustness to adversarial perturbations, and its ability to generate high-performing prompts even without any initial human-written prompt. Compared to existing state of the art methods, SCULPT consistently improves LLM performance by preserving essential task information while applying structured refinements. Both qualitative and quantitative analyses show that SCULPT produces more stable and interpretable prompt modifications, ensuring better generalization across tasks.

TripCraft: A Benchmark for Spatio-Temporally Fine Grained Travel Planning
Soumyabrata Chaudhuri | Pranav Purkar | Ritwik Raghav | Shubhojit Mallick | Manish Gupta | Abhik Jana | Shreya Ghosh
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Recent advancements in probing Large Language Models (LLMs) have explored their latent potential as personalized travel planning agents, though this remains a rather nascent field. Existing benchmarks, such as TravelPlanner and TravelPlanner+, rely on semi-synthetic data as well ignoring several key components of travel planning, limiting their real-world applicability. Therefore, we introduce TripCraft, a spatio-temporally coherent travel planning dataset incorporating real-world constraints, including public transit schedules, public events, varied attraction categories, and user personas for enhanced personalization. Our dataset enables more detailed trip itinerary generation (including duration spent at each point of interest based on users’ persona, transit between two points of interest, etc.) while ensuring spatio-temporal consistency. Further, we propose novel evaluation metrics (temporal meal score, attraction score, spatial score, ordering score, and persona score) to assess LLM-generated plans across temporal, spatial, sequential, and personal dimensions, overcoming the limitations of commonsense and hard constraint metrics. Interestingly, our parameter-informed setting significantly enhances meal scheduling, improving performance from 61% to 80% in the 7-day scenario- as quantified by a 19% gain in our temporal meal score. Moreover, TripCraft serves as a high-quality benchmark for advancing personalized LLM-driven travel planning.

CoCoA: Confidence- and Context-Aware Adaptive Decoding for Resolving Knowledge Conflicts in Large Language Models
Anant Khandelwal | Manish Gupta | Puneet Agrawal
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Faithful generation in large language models (LLMs) is challenged by knowledge conflicts between parametric memory and external context. Existing contrastive decoding methods tuned specifically to handle conflict often lack adaptability and can degrade performance in low conflict settings. We introduce CoCoA (Confidence- and Context-Aware Adaptive Decoding), a novel token-level algorithm for principled conflict resolution and enhanced faithfulness. CoCoA resolves conflict by utilizing confidence-aware measures (entropy gap and contextual peakedness) and the generalized divergence between the parametric and contextual distributions. Crucially, CoCoA maintains strong performance even in low conflict settings. Extensive experiments across multiple LLMs on diverse Question Answering (QA), Summarization, and Long-Form Question Answering (LFQA) benchmarks demonstrate CoCoA’s state-of-the-art performance over strong baselines like AdaCAD. It yields significant gains in QA accuracy, up to 9.2 points on average compared to the strong baseline AdaCAD, and improves factuality in summarization and LFQA by up to 2.5 points on average across key benchmarks. Additionally, it demonstrates superior sensitivity to conflict variations. CoCoA enables more informed, context-aware, and ultimately more faithful token generation.

Aligning Text/Speech Representations from Multimodal Models with MEG Brain Activity During Listening
Padakanti Srijith | Khushbu Pahwa | Radhika Mamidi | Bapi Raju Surampudi | Manish Gupta | Subba Reddy Oota
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Although speech language models are expected to align well with brain language processing during speech comprehension, recent studies have found that they fail to capture brain-relevant semantics beyond low-level features. Surprisingly, text-based language models exhibit stronger alignment with brain language regions, as they better capture brain-relevant semantics. However, no prior work has examined the alignment effectiveness of text/speech representations from multimodal models. This raises several key questions: Can speech embeddings from such multimodal models capture brain-relevant semantics through cross-modal interactions? Which modality can take advantage of this synergistic multimodal understanding to improve alignment with brain language processing? Can text/speech representations from such multimodal models outperform unimodal models? To address these questions, we systematically analyze multiple multimodal models, extracting both text- and speech-based representations to assess their alignment with MEG brain recordings during naturalistic story listening. We find that text embeddings from both multimodal and unimodal models significantly outperform speech embeddings from these models. Specifically, multimodal text embeddings exhibit a peak around 200 ms, suggesting that they benefit from speech embeddings, with heightened activity during this time period. However, speech embeddings from these multimodal models still show a similar alignment compared to their unimodal counterparts, suggesting that they do not gain meaningful semantic benefits over text-based representations. These results highlight an asymmetry in cross-modal knowledge transfer, where the text modality benefits more from speech information, but not vice versa.

USDC: A Dataset of ̲User ̲Stance and ̲Dogmatism in Long ̲Conversations
Mounika Marreddy | Subba Reddy Oota | Venkata Charan Chinni | Manish Gupta | Lucie Flek
Findings of the Association for Computational Linguistics: ACL 2025

Analyzing user opinion changes in long conversation threads is extremely critical for applications like enhanced personalization, market research, political campaigns, customer service, targeted advertising, and content moderation. Unfortunately, previous studies on stance and dogmatism in user conversations have focused on training models using datasets annotated at the post level, treating each post as independent and randomly sampling posts from conversation threads. Hence, first, we build a dataset for studying user opinion fluctuations in 764 long multi-user Reddit conversation threads, called USDC. USDC contains annotations for 2 tasks: i) User Stance classification, which involves labeling a user’s stance in a post within a conversation on a five-point scale; ii) User Dogmatism classification, which involves labeling a user’s overall opinion in the conversation on a four-point scale. Besides being time-consuming and costly, manual annotations for USDC are challenging because: 1) Conversation threads could be very long, increasing the chances of noisy annotations; and 2) Interpreting instances where a user changes their opinion within a conversation is difficult because often such transitions are subtle and not expressed explicitly. Hence, we leverage majority voting on zero-shot, one-shot, and few-shot annotations from Mistral Large and GPT-4 to automate the annotation process. Human annotations on 200 test conversations achieved inter-annotator agreement scores of 0.49 for stance and 0.50 for dogmatism with these LLM annotations, indicating a reasonable level of consistency between human and LLM annotations. USDC is then used to finetune and instruction-tune multiple deployable small language models like LLaMA, Falcon and Vicuna for the stance and dogmatism classification tasks. We make the code and dataset publicly available [https://github.com/mounikamarreddy/USDC].

LRPLAN: A Multi-Agent Collaboration of Large Language and Reasoning Models for Planning with Implicit & Explicit Constraints
T Karthikeyan | Om Dehlan | Mausam | Manish Gupta
Findings of the Association for Computational Linguistics: EMNLP 2025

Our goal is to build language model based multi-agent systems for complex planning problems involving multiple explicit and implicit constraints, some of which may be commonsense. Our initial investigations reveal that large language models (LLMs) are often unable to maintain consistency across the planning process, whereas large reasoning models (LRMs) struggle with handling implicit commonsense constraints. In response, we introduce LRPlan, a novel domain-independent, language-based multi-agent architecture where LLM and LRM-based agents collaborate at training time to abstract important patterns, heuristics and insights about the domain. At test time, they collaborate in implementing these learned patterns and insights for a new planning instance. We perform experiments on two datasets, TravelPlanner and TimeArena-Static, and use two LLM-LRM combinations from GPT and DeepSeek families. We find that LRPlan outperforms various multi-agent and single-agent baselines obtaining notably higher accuracy as well as cost efficiency. We make the code publiclyavailable.

Assess and Prompt: A Generative RL Framework for Improving Engagement in Online Mental Health Communities
Bhagesh Gaur | Karan Gupta | Aseem Srivastava | Manish Gupta | Md Shad Akhtar
Findings of the Association for Computational Linguistics: EMNLP 2025

Online Mental Health Communities (OMHCs) provide crucial peer and expert support, yet many posts remain unanswered due to missing support attributes that signal the need for help. We present a novel framework that identifies these gaps and prompts users to enrich their posts, thereby improving engagement. To support this, we introduce REDDME, a new dataset of 4,760 posts from mental health subreddits annotated for the span and intensity of three key support attributes: event what happened?, effect what did the user experience?, and requirement what support they need?. Next, we devise a hierarchical taxonomy, CueTaxo, of support attributes for controlled question generation. Further, we propose MH-COPILOT, a reinforcement learning-based system that integrates (a) contextual attribute-span identification, (b) support attribute intensity classification, (c) controlled question generation via a hierarchical taxonomy, and (d) a verifier for reward modeling. Our model dynamically assesses posts for the presence/absence of support attributes, and generates targeted prompts to elicit missing information. Empirical results across four notable language models demonstrate significant improvements in attribute elicitation and user engagement. A human evaluation further validates the model’s effectiveness in real-world OMHC settings.

Towards Multimodal Question Answering in Educational Domain
Himanshu Wadhwa | T Karthikeyan | Mausam | Manish Gupta
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics

The proliferation of educational videos on the Internet has changed the educational landscape by enabling students to learn complex concepts at their own pace. Our work outlines the vision of an automated tutor – a multimodal question answering (QA) system to answer questions from students watching a video. This can make doubt resolution faster and further improve learning experience. In this work, we take first steps towards building such a QA system. We curate and release a dataset named EduVidQA, with 3,158 videos and 18,474 QA-pairs. However, building and evaluating an educational QA system is challenging because (1) existing evaluation metrics do not correlate with human judgments, and (2) a student question could be answered in many different ways, training on a single gold answer could confuse the model and make it worse. We conclude with important research questions to develop this research area further.

IndicSentEval: How Effectively do Multilingual Transformer Models encode Linguistic Properties for Indic Languages?
Akhilesh Aravapalli | Mounika Marreddy | Radhika Mamidi | Manish Gupta | Subba Reddy Oota
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics

Transformer-based models have revolutionized the field of natural language processing. To understand why they perform so well and to assess their reliability, several studies have focused on questions such as: Which linguistic properties are encoded by these models, and to what extent? How robust are these models in encoding linguistic properties when faced with perturbations in the input text? However, these studies have mainly focused on BERT and the English language. In this paper, we investigate similar questions regarding encoding capability and robustness for 8 linguistic properties across 13 different perturbations in 6 Indic languages, using 9 multilingual Transformer models (7 universal and 2 Indic-specific). To conduct this study, we introduce a novel multilingual benchmark dataset, IndicSentEval, containing approximately ~47K sentences. Our probing analysis of surface, syntactic, and semantic properties reveals that, while almost all multilingual models demonstrate consistent encoding performance for English, surprisingly, they show mixed results for Indic languages. As expected, Indic-specific multilingual models capture linguistic properties in Indic languages better than universal models. Intriguingly, universal models broadly exhibit better robustness compared to Indic-specific models, particularly under perturbations such as dropping both nouns and verbs, dropping only verbs, or keeping only nouns. Overall, this study provides valuable insights into probing and perturbation-specific strengths and weaknesses of popular multilingual Transformer-based models for different Indic languages.

2024

ECIS-VQG: Generation of Entity-centric Information-seeking Questions from Videos
Arpan Phukan | Manish Gupta | Asif Ekbal
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Previous studies on question generation from videos have mostly focused on generating questions about common objects and attributes and hence are not entity-centric. In this work, we focus on the generation of entity-centric information-seeking questions from videos. Such a system could be useful for video-based learning, recommending “People Also Ask” questions, video-based chatbots, and fact-checking. Our work addresses three key challenges: identifying question-worthy information, linking it to entities, and effectively utilizing multimodal signals. Further, to the best of our knowledge, there does not exist a large-scale dataset for this task. Most video question generation datasets are on TV shows, movies, or human activities or lack entity-centric information-seeking questions. Hence, we contribute a diverse dataset of YouTube videos, VideoQuestions, consisting of 411 videos with 2265 manually annotated questions. We further propose a model architecture combining Transformers, rich context signals (titles, transcripts, captions, embeddings), and a combination of cross-entropy and contrastive loss function to encourage entity-centric question generation. Our best method yields BLEU, ROUGE, CIDEr, and METEOR scores of 71.3, 78.6, 7.31, and 81.9, respectively, demonstrating practical usability. We make the code and dataset publicly available.

Multilingual Bias Detection and Mitigation for Indian Languages
Ankita Maity | Anubhav Sharma | Rudra Dhar | Tushar Abhishek | Manish Gupta | Vasudeva Varma
Proceedings of the 7th Workshop on Indian Language Data: Resources and Evaluation

Lack of diverse perspectives causes neutrality bias in Wikipedia content leading to millions of worldwide readers getting exposed by potentially inaccurate information. Hence, neutrality bias detection and mitigation is a critical problem. Although previous studies have proposed effective solutions for English, no work exists for Indian languages. First, we contribute two large datasets, mWIKIBIAS and mWNC, covering 8 languages, for the bias detection and mitigation tasks respectively. Next, we investigate the effectiveness of popular multilingual Transformer-based models for the two tasks by modeling detection as a binary classification problem and mitigation as a style transfer problem. We make the code and data publicly available.

2023

Multimodal Persona Based Generation of Comic Dialogs
Harsh Agrawal | Aditya Mishra | Manish Gupta | Mausam
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

We focus on the novel problem of persona based dialogue generation for comic strips. Dialogs in comic strips is a unique and unexplored area where every strip contains utterances from various characters with each one building upon the previous utterances and the associated visual scene. Previous works like DialoGPT, PersonaGPT and other dialog generation models encode two-party dialogues and do not account for the visual information. To the best of our knowledge we are the first to propose the paradigm of multimodal persona based dialogue generation. We contribute a novel dataset, ComSet, consisting of 54K strips, harvested from 13 popular comics available online. Further, we propose a multimodal persona-based architecture, MPDialog, to generate dialogues for the next panel in the strip which decreases the perplexity score by ~10 points over strong dialogue generation baseline models. We demonstrate that there is still ample opportunity for improvement, highlighting the importance of building stronger dialogue systems that are able to generate persona-consistent dialogues and understand the context through various modalities.

How does the brain process syntactic structure while listening?
Subba Reddy Oota | Mounika Marreddy | Manish Gupta | Raju Bapi
Findings of the Association for Computational Linguistics: ACL 2023

Syntactic parsing is the task of assigning a syntactic structure to a sentence. There are two popular syntactic parsing methods: constituency and dependency parsing. Recent works have used syntactic embeddings based on constituency trees, incremental top-down parsing, and other word syntactic features for brain activity prediction given the text stimuli to study how the syntax structure is represented in the brain’s language network. However, the effectiveness of dependency parse trees or the relative predictive power of the various syntax parsers across brain areas, especially for the listening task, is yet unexplored. In this study, we investigate the predictive power of the brain encoding models in three settings: (i) individual performance of the constituency and dependency syntactic parsing based embedding methods, (ii) efficacy of these syntactic parsing based embedding methods when controlling for basic syntactic signals, (iii) relative effectiveness of each of the syntactic embedding methods when controlling for the other. Further, we explore the relative importance of syntactic information (from these syntactic embedding methods) versus semantic information using BERT embeddings. We find that constituency parsers help explain activations in the temporal lobe and middle-frontal gyrus, while dependency parsers better encode syntactic structure in the angular gyrus and posterior cingulate cortex. Although semantic signals from BERT are more effective compared to any of the syntactic features or embedding methods, syntactic embedding methods explain additional variance for a few brain regions.

Frugal Prompting for Dialog Models
Bishal Santra | Sakya Basak | Abhinandan De | Manish Gupta | Pawan Goyal
Findings of the Association for Computational Linguistics: EMNLP 2023

The use of large language models (LLMs) in natural language processing (NLP) tasks is rapidly increasing, leading to changes in how researchers approach problems in the field. To fully utilize these models’ abilities, a better understanding of their behavior for different input protocols is required. With LLMs, users can directly interact with the models through a text-based interface to define and solve various tasks. Hence, understanding the conversational abilities of these LLMs, which may not have been specifically trained for dialog modeling, is also important. This study examines different approaches for building dialog systems using LLMs by considering various aspects of the prompt. As part of prompt tuning, we experiment with various ways of providing instructions, exemplars, current query and additional context. The research also analyzes the representations of dialog history that have the optimal usable-information density. Based on the findings, the paper suggests more compact ways of providing dialog history information while ensuring good performance and reducing model’s inference-API costs. The research contributes to a better understanding of how LLMs can be effectively used for building interactive systems.

On Robustness of Finetuned Transformer-based NLP Models
Pavan Kalyan Reddy Neerudu | Subba Reddy Oota | Mounika Marreddy | Venkateswara Rao Kagita | Manish Gupta
Findings of the Association for Computational Linguistics: EMNLP 2023

Transformer-based pretrained models like BERT, GPT-2 and T5 have been finetuned for a large number of natural language processing (NLP) tasks, and have been shown to be very effective. However, while finetuning, what changes across layers in these models with respect to pretrained checkpoints is under-studied. Further, how robust are these models to perturbations in input text? Does the robustness vary depending on the NLP task for which the models have been finetuned? While there exists some work on studying robustness of BERT finetuned for a few NLP tasks, there is no rigorous study which compares this robustness across encoder only, decoder only and encoder-decoder models. In this paper, we characterize changes between pretrained and finetuned language model representations across layers using two metrics: CKA and STIR. Further, we study the robustness of three language models (BERT, GPT-2 and T5) with eight different text perturbations on classification tasks from General Language Understanding Evaluation (GLUE) benchmark, and generation tasks like summarization, free-form generation and question generation. GPT-2 representations are more robust than BERT and T5 across multiple types of input perturbation. Although models exhibit good robustness broadly, dropping nouns, verbs or changing characters are the most impactful. Overall, this study provides valuable insights into perturbation-specific weaknesses of popular Transformer-based models which should be kept in mind when passing inputs.

XF2T: Cross-lingual Fact-to-Text Generation for Low-Resource Languages
Shivprasad Sagare | Tushar Abhishek | Bhavyajeet Singh | Anubhav Sharma | Manish Gupta | Vasudeva Varma
Proceedings of the 16th International Natural Language Generation Conference

Multiple business scenarios require an automated generation of descriptive human-readable text from structured input data. This has resulted into substantial work on fact-to-text generation systems recently. Unfortunately, previous work on fact-to-text (F2T) generation has focused primarily on English mainly due to the high availability of relevant datasets. Only recently, the problem of cross-lingual fact-to-text (XF2T) was proposed for generation across multiple languages alongwith a dataset, XAlign for eight languages. However, there has been no rigorous work on the actual XF2T generation problem. We extend XAlign dataset with annotated data for four more languages: Punjabi, Malayalam, Assamese and Oriya. We conduct an extensive study using popular Transformer-based text generation models on our extended multi-lingual dataset, which we call XAlignV2. Further, we investigate the performance of different text generation strategies: multiple variations of pretraining, fact-aware embeddings and structure-aware input encoding. Our extensive experiments show that a multi-lingual mT5 model which uses fact-aware embeddings with structure-aware input encoding leads to best results (30.90 BLEU, 55.12 METEOR and 59.17 chrF++) across the twelve languages. We make our code, dataset and model publicly available, and hope that this will help advance further research in this critical area.

2022

Multi-view and Cross-view Brain Decoding
Subba Reddy Oota | Jashn Arora | Manish Gupta | Raju S. Bapi
Proceedings of the 29th International Conference on Computational Linguistics

Can we build multi-view decoders that can decode concepts from brain recordings corresponding to any view (picture, sentence, word cloud) of stimuli? Can we build a system that can use brain recordings to automatically describe what a subject is watching using keywords or sentences? How about a system that can automatically extract important keywords from sentences that a subject is reading? Previous brain decoding efforts have focused only on single view analysis and hence cannot help us build such systems. As a first step toward building such systems, inspired by Natural Language Processing literature on multi-lingual and cross-lingual modeling, we propose two novel brain decoding setups: (1) multi-view decoding (MVD) and (2) cross-view decoding (CVD). In MVD, the goal is to build an MV decoder that can take brain recordings for any view as input and predict the concept. In CVD, the goal is to train a model which takes brain recordings for one view as input and decodes a semantic vector representation of another view. Specifically, we study practically useful CVD tasks like image captioning, image tagging, keyword extraction, and sentence formation. Our extensive experiments lead to MVD models with ~0.68 average pairwise accuracy across view pairs, and also CVD models with ~0.8 average pairwise accuracy across tasks. Analysis of the contribution of different brain networks reveals exciting cognitive insights: (1) Models trained on picture or sentence view of stimuli are better MV decoders than a model trained on word cloud view. (2) Our extensive analysis across 9 broad regions, 11 language sub-regions and 16 visual sub-regions of the brain help us localize, for the first time, the parts of the brain involved in cross-view tasks like image captioning, image tagging, sentence formation and keyword extraction. We make the code publicly available.

Visio-Linguistic Brain Encoding
Subba Reddy Oota | Jashn Arora | Vijay Rowtula | Manish Gupta | Raju S. Bapi
Proceedings of the 29th International Conference on Computational Linguistics

Brain encoding aims at reconstructing fMRI brain activity given a stimulus. There exists a plethora of neural encoding models which study brain encoding for single mode stimuli: visual (pretrained CNNs) or text (pretrained language models). Few recent papers have also obtained separate visual and text representation models and performed late-fusion using simple heuristics. However, previous work has failed to explore the co-attentive multi-modal modeling for visual and text reasoning. In this paper, we systematically explore the efficacy of image and multi-modal Transformers for brain encoding. Extensive experiments on two popular datasets, BOLD5000 and Pereira, provide the following insights. (1) We find that VisualBERT, a multi-modal Transformer, significantly outperforms previously proposed single-mode CNNs, image Transformers as well as other previously proposed multi-modal models, thereby establishing new state-of-the-art. (2) The regions such as LPTG, LMTG, LIFG, and STS which have dual functionalities for language and vision, have higher correlation with multi-modal models which reinforces the fact that these models are good at mimicing the human brain behavior. (3) The supremacy of visio-linguistic models raises the question of whether the responses elicited in the visual regions are affected implicitly by linguistic processing even when passively viewing images. Future fMRI tasks can verify this computational insight in an appropriate experimental setting. We make our code publicly available.

VisToT: Vision-Augmented Table-to-Text Generation
Prajwal Gatti | Anand Mishra | Manish Gupta | Mithun Das Gupta
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Table-to-text generation has been widely studied in the Natural Language Processing community in the recent years. We give a new perspective to this problem by incorporating signals from both tables as well as associated images to generate relevant text. While tables contain a structured list of facts, images are a rich source of unstructured visual information. For example, in the tourism domain, images can be used to infer knowledge such as the type of landmark (e.g., church), its architecture (e.g., Ancient Roman), and composition (e.g., white marble). Therefore, in this paper, we introduce the novel task of Vision-augmented Table-To-Text Generation (VisToT, defined as follows: given a table and an associated image, produce a descriptive sentence conditioned on the multimodal input. For the task, we present a novel multimodal table-to-text dataset, WikiLandmarks, covering 73,084 unique world landmarks. Further, we also present a competitive architecture, namely, VT3 that generates accurate sentences conditioned on the image and table pairs. Through extensive analyses and experiments, we show that visual cues from images are helpful in (i) inferring missing information from incomplete or sparse tables, and (ii) strengthening the importance of useful information from noisy tables for natural language generation. We make the code and data publicly available.

Representation Learning for Conversational Data using Discourse Mutual Information Maximization
Bishal Santra | Sumegh Roychowdhury | Aishik Mandal | Vasu Gurram | Atharva Naik | Manish Gupta | Pawan Goyal
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Although many pretrained models exist for text or images, there have been relatively fewer attempts to train representations specifically for dialog understanding. Prior works usually relied on finetuned representations based on generic text representation models like BERT or GPT-2. But such language modeling pretraining objectives do not take the structural information of conversational text into consideration. Although generative dialog models can learn structural features too, we argue that the structure-unaware word-by-word generation is not suitable for effective conversation modeling. We empirically demonstrate that such representations do not perform consistently across various dialog understanding tasks. Hence, we propose a structure-aware Mutual Information based loss-function DMI (Discourse Mutual Information) for training dialog-representation models, that additionally captures the inherent uncertainty in response prediction. Extensive evaluation on nine diverse dialog modeling tasks shows that our proposed DMI-based models outperform strong baselines by significant margins.

Neural Language Taskonomy: Which NLP Tasks are the most Predictive of fMRI Brain Activity?
Subba Reddy Oota | Jashn Arora | Veeral Agarwal | Mounika Marreddy | Manish Gupta | Bapi Surampudi
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Several popular Transformer based language models have been found to be successful for text-driven brain encoding. However, existing literature leverages only pretrained text Transformer models and has not explored the efficacy of task-specific learned Transformer representations. In this work, we explore transfer learning from representations learned for ten popular natural language processing tasks (two syntactic and eight semantic) for predicting brain responses from two diverse datasets: Pereira (subjects reading sentences from paragraphs) and Narratives (subjects listening to the spoken stories). Encoding models based on task features are used to predict activity in different regions across the whole brain. Features from coreference resolution, NER, and shallow syntax parsing explain greater variance for the reading activity. On the other hand, for the listening activity, tasks such as paraphrase generation, summarization, and natural language inference show better encoding performance. Experiments across all 10 task representations provide the following cognitive insights: (i) language left hemisphere has higher predictive brain activity versus language right hemisphere, (ii) posterior medial cortex, temporo-parieto-occipital junction, dorsal frontal lobe have higher correlation versus early auditory and auditory association cortex, (iii) syntactic and semantic tasks display a good predictive performance across brain regions for reading and listening stimuli resp.

2020

Predicting Clickbait Strength in Online Social Media
Vijayasaradhi Indurthi | Bakhtiyar Syed | Manish Gupta | Vasudeva Varma
Proceedings of the 28th International Conference on Computational Linguistics

Hoping for a large number of clicks and potentially high social shares, journalists of various news media outlets publish sensationalist headlines on social media. These headlines lure the readers to click on them and satisfy the curiosity gap in their mind. Low quality material pointed to by clickbaits leads to time wastage and annoyance for users. Even for enterprises publishing clickbaits, it hurts more than it helps as it erodes user trust, attracts wrong visitors, and produces negative signals for ranking algorithms. Hence, identifying and flagging clickbait titles is very essential. Previous work on clickbaits has majorly focused on binary classification of clickbait titles. However not all clickbaits are equally clickbaity. It is not only essential to identify a click-bait, but also to identify the intensity of the clickbait based on the strength of the clickbait. In this work, we model clickbait strength prediction as a regression problem. While previous methods have relied on traditional machine learning or vanilla recurrent neural networks, we rigorously investigate the use of transformers for clickbait strength prediction. On a benchmark dataset with ∼39K posts, our methods outperform all the existing methods in the Clickbait Challenge.

AbuseAnalyzer: Abuse Detection, Severity and Target Prediction for Gab Posts
Mohit Chandra | Ashwin Pathak | Eesha Dutta | Paryul Jain | Manish Gupta | Manish Shrivastava | Ponnurangam Kumaraguru
Proceedings of the 28th International Conference on Computational Linguistics

While extensive popularity of online social media platforms has made information dissemination faster, it has also resulted in widespread online abuse of different types like hate speech, offensive language, sexist and racist opinions, etc. Detection and curtailment of such abusive content is critical for avoiding its psychological impact on victim communities, and thereby preventing hate crimes. Previous works have focused on classifying user posts into various forms of abusive behavior. But there has hardly been any focus on estimating the severity of abuse and the target. In this paper, we present a first of the kind dataset with 7,601 posts from Gab which looks at online abuse from the perspective of presence of abuse, severity and target of abusive behavior. We also propose a system to address these tasks, obtaining an accuracy of ∼80% for abuse presence, ∼82% for abuse target prediction, and ∼65% for abuse severity prediction.

Summaformers @ LaySumm 20, LongSumm 20
Sayar Ghosh Roy | Nikhil Pinnaparaju | Risubh Jain | Manish Gupta | Vasudeva Varma
Proceedings of the First Workshop on Scholarly Document Processing

Automatic text summarization has been widely studied as an important task in natural language processing. Traditionally, various feature engineering and machine learning based systems have been proposed for extractive as well as abstractive text summarization. Recently, deep learning based, specifically Transformer-based systems have been immensely popular. Summarization is a cognitively challenging task – extracting summary worthy sentences is laborious, and expressing semantics in brief when doing abstractive summarization is complicated. In this paper, we specifically look at the problem of summarizing scientific research papers from multiple domains. We differentiate between two types of summaries, namely, (a) LaySumm: A very short summary that captures the essence of the research paper in layman terms restricting overtly specific technical jargon and (b) LongSumm: A much longer detailed summary aimed at providing specific insights into various ideas touched upon in the paper. While leveraging latest Transformer-based models, our systems are simple, intuitive and based on how specific paper sections contribute to human summaries of the two types described above. Evaluations against gold standard summaries using ROUGE metrics prove the effectiveness of our approach. On blind test corpora, our system ranks first and third for the LongSumm and LaySumm tasks respectively.

2019

Multi-label Categorization of Accounts of Sexism using a Neural Framework
Pulkit Parikh | Harika Abburi | Pinkesh Badjatiya | Radhika Krishnan | Niyati Chhaya | Manish Gupta | Vasudeva Varma
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Sexism, an injustice that subjects women and girls to enormous suffering, manifests in blatant as well as subtle ways. In the wake of growing documentation of experiences of sexism on the web, the automatic categorization of accounts of sexism has the potential to assist social scientists and policy makers in utilizing such data to study and counter sexism better. The existing work on sexism classification, which is different from sexism detection, has certain limitations in terms of the categories of sexism used and/or whether they can co-occur. To the best of our knowledge, this is the first work on the multi-label classification of sexism of any kind(s), and we contribute the largest dataset for sexism categorization. We develop a neural solution for this multi-label classification that can combine sentence representations obtained using models such as BERT with distributional and linguistic word embeddings using a flexible, hierarchical architecture involving recurrent components and optional convolutional ones. Further, we leverage unlabeled accounts of sexism to infuse domain-specific elements into our framework. The best proposed method outperforms several deep learning as well as traditional machine learning baselines by an appreciable margin.

FERMI at SemEval-2019 Task 5: Using Sentence embeddings to Identify Hate Speech Against Immigrants and Women in Twitter
Vijayasaradhi Indurthi | Bakhtiyar Syed | Manish Shrivastava | Nikhil Chakravartula | Manish Gupta | Vasudeva Varma
Proceedings of the 13th International Workshop on Semantic Evaluation

This paper describes our system (Fermi) for Task 5 of SemEval-2019: HatEval: Multilingual Detection of Hate Speech Against Immigrants and Women on Twitter. We participated in the subtask A for English and ranked first in the evaluation on the test set. We evaluate the quality of multiple sentence embeddings and explore multiple training models to evaluate the performance of simple yet effective embedding-ML combination algorithms. Our team - Fermi’s model achieved an accuracy of 65.00% for English language in task A. Our models, which use pretrained Universal Encoder sentence embeddings for transforming the input and SVM (with RBF kernel) for classification, scored first position (among 68) in the leaderboard on the test set for Subtask A in English language. In this paper we provide a detailed description of the approach, as well as the results obtained in the task.

Fermi at SemEval-2019 Task 6: Identifying and Categorizing Offensive Language in Social Media using Sentence Embeddings
Vijayasaradhi Indurthi | Bakhtiyar Syed | Manish Shrivastava | Manish Gupta | Vasudeva Varma
Proceedings of the 13th International Workshop on Semantic Evaluation

This paper describes our system (Fermi) for Task 6: OffensEval: Identifying and Categorizing Offensive Language in Social Media of SemEval-2019. We participated in all the three sub-tasks within Task 6. We evaluate multiple sentence embeddings in conjunction with various supervised machine learning algorithms and evaluate the performance of simple yet effective embedding-ML combination algorithms. Our team Fermi’s model achieved an F1-score of 64.40%, 62.00% and 62.60% for sub-task A, B and C respectively on the official leaderboard. Our model for sub-task C which uses pre-trained ELMo embeddings for transforming the input and uses SVM (RBF kernel) for training, scored third position on the official leaderboard. Through the paper we provide a detailed description of the approach, as well as the results obtained for the task.

Fermi at SemEval-2019 Task 8: An elementary but effective approach to Question Discernment in Community QA Forums
Bakhtiyar Syed | Vijayasaradhi Indurthi | Manish Shrivastava | Manish Gupta | Vasudeva Varma
Proceedings of the 13th International Workshop on Semantic Evaluation

Online Community Question Answering Forums (cQA) have gained massive popularity within recent years. The rise in users for such forums have led to the increase in the need for automated evaluation for question comprehension and fact evaluation of the answers provided by various participants in the forum. Our team, Fermi, participated in sub-task A of Task 8 at SemEval 2019 - which tackles the first problem in the pipeline of factual evaluation in cQA forums, i.e., deciding whether a posed question asks for a factual information, an opinion/advice or is just socializing. This information is highly useful in segregating factual questions from non-factual ones which highly helps in organizing the questions into useful categories and trims down the problem space for the next task in the pipeline for fact evaluation among the available answers. Our system uses the embeddings obtained from Universal Sentence Encoder combined with XGBoost for the classification sub-task A. We also evaluate other combinations of embeddings and off-the-shelf machine learning algorithms to demonstrate the efficacy of the various representations and their combinations. Our results across the evaluation test set gave an accuracy of 84% and received the first position in the final standings judged by the organizers.

2018

Translation Quality Estimation for Indian Languages
Nisarg Jhaveri | Manish Gupta | Vasudeva Varma
Proceedings of the 21st Annual Conference of the European Association for Machine Translation

Translation Quality Estimation (QE) aims to estimate the quality of an automated machine translation (MT) output without any human intervention or reference translation. With the increasing use of MT systems in various cross-lingual applications, the need and applicability of QE systems is increasing. We study existing approaches and propose multiple neural network approaches for sentence-level QE, with a focus on MT outputs in Indian languages. For this, we also introduce five new datasets for four language pairs: two for English–Gujarati, and one each for English–Hindi, English–Telugu and English–Bengali, which includes one manually post-edited dataset for English– Gujarati. These Indian languages are spoken by around 689M speakers world-wide. We compare results obtained using our proposed models with multiple state-of-the-art systems including the winning system in the WMT17 shared task on QE and show that our proposed neural model which combines the discriminative power of carefully chosen features with Siamese Convolutional Neural Networks (CNNs) works best for all Indian language datasets.

A Workbench for Rapid Generation of Cross-Lingual Summaries
Nisarg Jhaveri | Manish Gupta | Vasudeva Varma
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Higher-order Relation Schema Induction using Tensor Factorization with Back-off and Aggregation
Madhav Nimishakavi | Manish Gupta | Partha Talukdar
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Relation Schema Induction (RSI) is the problem of identifying type signatures of arguments of relations from unlabeled text. Most of the previous work in this area have focused only on binary RSI, i.e., inducing only the subject and object type signatures per relation. However, in practice, many relations are high-order, i.e., they have more than two arguments and inducing type signatures of all arguments is necessary. For example, in the sports domain, inducing a schema win(WinningPlayer, OpponentPlayer, Tournament, Location) is more informative than inducing just win(WinningPlayer, OpponentPlayer). We refer to this problem as Higher-order Relation Schema Induction (HRSI). In this paper, we propose Tensor Factorization with Back-off and Aggregation (TFBA), a novel framework for the HRSI problem. To the best of our knowledge, this is the first attempt at inducing higher-order relation schemata from unlabeled text. Using the experimental analysis on three real world datasets we show how TFBA helps in dealing with sparsity and induce higher-order schemata.

2017

SSAS: Semantic Similarity for Abstractive Summarization
Raghuram Vadapalli | Litton J Kurisinkel | Manish Gupta | Vasudeva Varma
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

Ideally a metric evaluating an abstract system summary should represent the extent to which the system-generated summary approximates the semantic inference conceived by the reader using a human-written reference summary. Most of the previous approaches relied upon word or syntactic sub-sequence overlap to evaluate system-generated summaries. Such metrics cannot evaluate the summary at semantic inference level. Through this work we introduce the metric of Semantic Similarity for Abstractive Summarization (SSAS), which leverages natural language inference and paraphrasing techniques to frame a novel approach to evaluate system summaries at semantic inference level. SSAS is based upon a weighted composition of quantities representing the level of agreement, contradiction, independence, paraphrasing, and optionally ROUGE score between a system-generated and a human-written summary.

2015

IIIT-H at SemEval 2015: Twitter Sentiment Analysis – The Good, the Bad and the Neutral!
Ayushi Dalmia | Manish Gupta | Vasudeva Varma
Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015)

2010

Shallow Information Extraction from Medical Forum Data
Parikshit Sondhi | Manish Gupta | ChengXiang Zhai | Julia Hockenmaier
Coling 2010: Posters

Co-authors

Vijayasaradhi Indurthi 4

Manish Shrivastava 4

Bakhtiyar Syed 4

Tushar Abhishek 3

Nisarg Jhaveri 2

T Karthikeyan 2

Radhika Mamidi 2

Anubhab Mandal 2

Sandeep Mishra 2

Anubhav Sharma 2

Bapi Raju Surampudi 2

Harika Abburi 1

Veeral Agarwal 1

Harsh Agrawal 1

Parag Agrawal 1

Puneet Agrawal 1

Md. Shad Akhtar 1

Akhilesh Aravapalli 1

Pinkesh Badjatiya 1

Devichand Budagam 1

Nikhil Chakravartula 1

Mohit Chandra 1

Soumyabrata Chaudhuri 1

Niyati Chhaya 1

Venkata Charan Chinni 1

Ayushi Dalmia 1

Mithun Das Gupta 1

Abhinandan De 1

Prajwal Gatti 1

Sayar Ghosh Roy 1

Julia Hockenmaier 1

Litton J Kurisinkel 1

Venkateswara Rao Kagita 1

Shubhanshu Khandelwal 1

Anant Khandelwal 1

Radhika Krishnan 1

Ponnurangam Kumaraguru 1

Shubhojit Mallick 1

Aishik Mandal 1

Aditya Mishra 1

Pavan Kalyan Reddy Neerudu 1

Madhav Nimishakavi 1

Khushbu Pahwa 1

Pulkit Parikh 1

Ashwin Pathak 1

Nikhil Pinnaparaju 1

Pranav Purkar 1

Ritwik Raghav 1

Vijay Rowtula 1

Sumegh Roychowdhury 1

Shivprasad Sagare 1

Bhavyajeet Singh 1

Parikshit Sondhi 1

Padakanti Srijith 1

Aseem Srivastava 1

Bapi Surampudi 1

Partha Talukdar 1

Raghuram Vadapalli 1

Akhila Yesantarao Venkata 1

Himanshu Wadhwa 1

ChengXiang Zhai 1

Venues