Vineet Gandhi


2024

Major Entity Identification: A Generalizable Alternative to Coreference Resolution
Kawshik Manikantan Sundar | Shubham Toshniwal | Makarand Tapaswi | Vineet Gandhi
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

The limited generalization of coreference resolution (CR) models has been a major bottleneck in the task’s broad application. Prior work has identified annotation differences, especially for mention detection, as one of the main reasons for the generalization gap and proposed using additional annotated target domain data. Rather than relying on this additional annotation, we propose an alternative referential task, Major Entity Identification (MEI), where we: (a) assume the target entities to be specified in the input, and (b) limit the task to only the frequent entities. Through extensive experiments, we demonstrate that MEI models generalize well across domains on multiple datasets with supervised models and LLM-based few-shot prompting. Additionally, MEI fits the classification framework, which enables the use of robust and intuitive classification-based metrics. Finally, MEI is also of practical use as it allows a user to search for all mentions of a particular entity or a group of entities of interest.

Major Entity Identification: A Generalizable Alternative to Coreference Resolution
Kawshik S. Manikantan | Shubham Toshniwal | Makarand Tapaswi | Vineet Gandhi
Proceedings of The Seventh Workshop on Computational Models of Reference, Anaphora and Coreference

ParrotTTS: Text-to-speech synthesis exploiting disentangled self-supervised representations
Neil Shah | Saiteja Kosgi | Vishal Tambrahalli | Neha S | Anil Nelakanti | Vineet Gandhi
Findings of the Association for Computational Linguistics: EACL 2024

We present ParrotTTS, a modularized text-to-speech synthesis model leveraging disentangled self-supervised speech representations. It can train a multi-speaker variant effectively using transcripts from a single speaker. ParrotTTS adapts to a new language in a low-resource setup and generalizes to languages not seen while training the self-supervised backbone. Moreover, without training on bilingual or parallel examples, ParrotTTS can transfer voices across languages while preserving the speaker-specific characteristics, e.g., synthesizing fluent Hindi speech using a French speaker’s voice and accent. We present extensive results in monolingual and multi-lingual scenarios. ParrotTTS outperforms state-of-the-art multi-lingual text-to-speech (TTS) models while using only a fraction of the paired data that those models require. Speech samples from ParrotTTS and code can be found at https://parrot-tts.github.io/tts/

2022

Empathic Machines: Using Intermediate Features as Levers to Emulate Emotions in Text-To-Speech Systems
Saiteja Kosgi | Sarath Sivaprasad | Niranjan Pedanekar | Anil Nelakanti | Vineet Gandhi
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

We present a method to control the emotional prosody of Text-to-Speech (TTS) systems by using phoneme-level intermediate features (pitch, energy, and duration) as levers. As a key idea, we propose Differential Scaling (DS) to disentangle features relating to affective prosody from those arising due to acoustic conditions and speaker identity. With thorough experimental studies, we show that the proposed method improves over the prior art in accurately emulating the desired emotions while retaining the naturalness of speech. We extend the traditional evaluation, which uses individual sentences, toward a more complete evaluation of HCI systems. We present a novel experimental setup by replacing an actor with a TTS system in offline and live conversations. The emotion to be rendered is either predicted or manually assigned. The results show that the proposed method is strongly preferred over the state-of-the-art TTS system and adds the much-coveted “human touch” in machine dialogue. Audio samples from our experiments and the code are available at: https://emtts.github.io/tts-demo/

Comprehensive Multi-Modal Interactions for Referring Image Segmentation
Kanishk Jain | Vineet Gandhi
Findings of the Association for Computational Linguistics: ACL 2022

We investigate Referring Image Segmentation (RIS), which outputs a segmentation map corresponding to a given natural language description. Addressing RIS efficiently requires considering the interactions happening across visual and linguistic modalities and the interactions within each modality. Existing methods are limited because they either compute different forms of interactions sequentially (leading to error propagation) or ignore intra-modal interactions. We address this limitation by performing all three interactions simultaneously through a Synchronous Multi-Modal Fusion Module (SFM). Moreover, to produce refined segmentation masks, we propose a novel Hierarchical Cross-Modal Aggregation Module (HCAM), where linguistic features facilitate the exchange of contextual information across the visual hierarchy. We present thorough ablation studies and validate our approach’s performance on four benchmark datasets, showing considerable performance gains over the existing state-of-the-art (SOTA) methods.