Chayan Sarkar

2025

Are Relational Triple Extraction Frameworks Sufficient for Hyper-relational Facts ?
Pratik Saini | Chayan Sarkar | Tapas Nayak
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics

Hyper-relational fact extraction involves identifying relational triples along with additional contextual information—known as qualifiers—such as time, location, or quantity. These qualifiers enable models to represent complex real-world knowledge more accurately. While numerous end-to-end models have been developed for extracting relational triples, they are not designed to handle qualifiers directly. In this work, we propose a straightforward and effective approach to extend existing end-to-end triple extraction models to also capture qualifiers. Our method reformulates qualifiers as new relations by computing the Cartesian product between qualifiers and their associated relations. This transformation allows the model to extract qualifier information as additional triples, which can later be merged to form complete hyper-relational facts. We evaluate our approach using multiple end-to-end triple extraction models on the HyperRED dataset and demonstrate its effectiveness in extracting hyper-relational facts.
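The reformulation described in the abstract above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `::` separator, the function names, the dictionary layout, and the choice to attach each qualifier value to the head entity are all assumptions, and the example fact is made up rather than taken from HyperRED.

```python
from itertools import product

SEP = "::"  # assumed separator between a relation and a qualifier label


def build_composite_labels(relations, qualifiers, sep=SEP):
    """Cartesian product of relations and qualifiers, as new relation labels."""
    return [f"{r}{sep}{q}" for r, q in product(relations, qualifiers)]


def flatten_fact(fact, sep=SEP):
    """Turn one hyper-relational fact into plain triples.

    Assumption: each qualifier becomes a triple from the head entity to the
    qualifier value, labelled with the composite relation.
    """
    triples = [(fact["head"], fact["relation"], fact["tail"])]
    for q_label, q_value in fact["qualifiers"]:
        triples.append((fact["head"], f"{fact['relation']}{sep}{q_label}", q_value))
    return triples


def merge_triples(triples, sep=SEP):
    """Merge extracted triples back into hyper-relational facts.

    Simplification: facts are keyed by (head, base relation), so two facts
    sharing both would need extra disambiguation.
    """
    facts = {}
    for head, rel, tail in triples:
        if sep not in rel:
            facts[(head, rel)] = {
                "head": head, "relation": rel, "tail": tail, "qualifiers": []
            }
    for head, rel, value in triples:
        if sep in rel:
            base, q_label = rel.split(sep, 1)
            if (head, base) in facts:
                facts[(head, base)]["qualifiers"].append((q_label, value))
    return list(facts.values())


# Illustrative (made-up) fact, not drawn from the dataset.
example_fact = {
    "head": "Marie Curie",
    "relation": "award received",
    "tail": "Nobel Prize in Physics",
    "qualifiers": [("point in time", "1903")],
}
example_triples = flatten_fact(example_fact)
example_merged = merge_triples(example_triples)
```

Under these assumptions, any triple extraction model trained on the enlarged label set could be used unchanged, with `merge_triples` applied as a post-processing step.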

2023

tagE: Enabling an Embodied Agent to Understand Human Instructions
Chayan Sarkar | Avik Mitra | Pradip Pramanick | Tapas Nayak
Findings of the Association for Computational Linguistics: EMNLP 2023

Natural language serves as the primary mode of communication when an intelligent agent with a physical presence engages with human beings. While a plethora of research focuses on natural language understanding (NLU), encompassing endeavors such as sentiment analysis, intent prediction, question answering, and summarization, the scope of NLU directed at situations necessitating tangible actions by an embodied agent remains limited. The ambiguity and incompleteness inherent in natural language present challenges for intelligent agents striving to decipher human intention. To tackle this problem, we introduce a novel system known as task and argument grounding for Embodied agents (tagE). At its core, our system employs a neural network model designed to extract a series of tasks from complex task instructions expressed in natural language. Our proposed model adopts an encoder-decoder framework enriched with nested decoding to effectively extract tasks and their corresponding arguments from these intricate instructions. These extracted tasks are then mapped (or grounded) to the robot’s established collection of skills, while the arguments are grounded in objects present within the environment. To facilitate the training and evaluation of our system, we have curated a dataset featuring complex instructions. Our experiments show that the approach outperforms robust baseline models.

2022

Can Visual Context Improve Automatic Speech Recognition for an Embodied Agent?
Pradip Pramanick | Chayan Sarkar
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

The use of automatic speech recognition (ASR) systems is becoming omnipresent, ranging from personal assistants and chatbots to home and industrial automation systems. Modern robots are also equipped with ASR capabilities for interacting with humans, as speech is the most natural interaction modality. However, ASR in robots faces additional challenges compared to a personal assistant. Being an embodied agent, a robot must recognize the physical entities around it and therefore reliably recognize speech containing descriptions of such entities. However, current ASR systems are often unable to do so due to limitations in ASR training, such as generic datasets and open-vocabulary modeling. Also, adverse conditions during inference, such as noise, accents, and far-field speech, make the transcription inaccurate. In this work, we present a method to incorporate a robot’s visual information into an ASR system and improve the recognition of a spoken utterance containing a visible entity. Specifically, we propose a new decoder biasing technique to incorporate the visual context while ensuring the ASR output does not degrade for incorrect context. We achieve a 59% relative reduction in WER from an unmodified ASR system.
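For readers unfamiliar with the metric, the sketch below shows how word error rate (WER) and a relative WER reduction are typically computed. It uses the standard word-level edit distance definition; the function names and the example utterances are hypothetical, and the numbers in the example are illustrative only, not results from the paper.

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer, new_wer):
    """Fraction of the baseline WER removed by the new system."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical utterance: one substituted word out of five.
example_wer = word_error_rate("pick up the red mug", "pick up the bread mug")
```

A relative reduction is measured against the baseline's own error rate, so the same relative figure can correspond to very different absolute improvements depending on how high the baseline WER is.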
