2024
RECANTFormer: Referring Expression Comprehension with Varying Numbers of Targets
Bhathiya Hemanthage | Hakan Bilen | Phil Bartie | Christian Dondrup | Oliver Lemon
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
The Generalized Referring Expression Comprehension (GREC) task extends classic REC by generating image bounding boxes for objects referred to in natural language expressions, which may indicate zero, one, or multiple targets. This generalization enhances the practicality of REC models for diverse real-world applications. However, the presence of varying numbers of targets per sample makes GREC a more complex task, both in terms of training supervision and final prediction selection strategy. Addressing these challenges, we introduce RECANTFormer, a one-stage method for GREC that combines a decoder-free (encoder-only) transformer architecture with DETR-like Hungarian matching. Our approach consistently outperforms baselines by significant margins on three GREC datasets.
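As a rough illustration of how DETR-like Hungarian matching accommodates zero, one, or many targets, the Python sketch below matches predicted boxes to ground-truth boxes with scipy's linear_sum_assignment; the cost terms, weights, and shapes are illustrative assumptions, not RECANTFormer's exact formulation.

# Minimal sketch of DETR-style Hungarian matching for a varying number of targets.
# Cost terms and weights are illustrative, not the paper's exact loss.
import numpy as np
from scipy.optimize import linear_sum_assignment

def hungarian_match(pred_boxes, pred_scores, gt_boxes, box_w=5.0, cls_w=1.0):
    """pred_boxes: (N, 4); pred_scores: (N,) objectness in [0, 1]; gt_boxes: (M, 4).
    Returns matched (pred_idx, gt_idx); unmatched predictions are supervised as 'no target'."""
    if len(gt_boxes) == 0:  # zero-target expression: every prediction is 'no target'
        return np.empty(0, dtype=int), np.empty(0, dtype=int)
    box_cost = np.abs(pred_boxes[:, None, :] - gt_boxes[None, :, :]).sum(-1)  # (N, M) L1 cost
    cls_cost = -pred_scores[:, None]  # prefer confident predictions for real targets
    return linear_sum_assignment(box_w * box_cost + cls_w * cls_cost)

# Example: four query slots, an expression referring to two objects.
preds = np.array([[0.1, 0.1, 0.3, 0.3], [0.5, 0.5, 0.9, 0.9],
                  [0.0, 0.0, 1.0, 1.0], [0.4, 0.4, 0.6, 0.6]])
scores = np.array([0.9, 0.8, 0.2, 0.1])
gts = np.array([[0.12, 0.10, 0.28, 0.30], [0.50, 0.52, 0.88, 0.90]])
print(hungarian_match(preds, scores, gts))  # (array([0, 1]), array([0, 1]))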
Divide and Conquer: Rethinking Ambiguous Candidate Identification in Multimodal Dialogues with Pseudo-Labelling
Bhathiya Hemanthage | Christian Dondrup | Hakan Bilen | Oliver Lemon
Proceedings of the 25th Annual Meeting of the Special Interest Group on Discourse and Dialogue
Ambiguous Candidate Identification (ACI) in multimodal dialogue is the task of identifying all potential objects that a user’s utterance could be referring to in a visual scene, in cases where the reference cannot be uniquely determined. End-to-end models are the dominant approach for this task, but have limited real-world applicability due to unrealistic inference-time assumptions such as requiring predefined catalogues of items. Focusing on a more generalized and realistic ACI setup, we demonstrate that a modular approach, which first emphasizes language-only reasoning over dialogue context before performing vision-language fusion, significantly outperforms end-to-end trained baselines. To mitigate the lack of annotations for training the language-only module (student), we propose a pseudo-labelling strategy with a prompted Large Language Model (LLM) as the teacher.
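A minimal sketch of the teacher-student pseudo-labelling idea: a prompted LLM labels unannotated dialogue contexts, and its outputs supervise a smaller language-only student. The prompt wording and label format are assumptions for illustration, not the paper's exact setup; the teacher is passed in as a generic callable so no particular LLM API is assumed.

# Sketch of LLM-teacher pseudo-labelling for a language-only student module.
# Prompt template and label format are illustrative assumptions.
from typing import Callable, Dict, List

PROMPT = (
    "Dialogue context:\n{context}\n\n"
    "List, comma-separated, the object types the user's last utterance could be "
    "referring to."
)

def pseudo_label(contexts: List[str], teacher: Callable[[str], str]) -> List[Dict]:
    """Return (context, candidate_types) pairs produced by the prompted teacher LLM."""
    labelled = []
    for context in contexts:
        reply = teacher(PROMPT.format(context=context))
        types = [t.strip().lower() for t in reply.split(",") if t.strip()]
        labelled.append({"context": context, "candidate_types": types})
    return labelled

# The resulting pairs can then train the student before vision-language fusion
# maps the predicted types onto candidate objects in the scene.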
Generalized Visual-Language Grounding with Complex Language Context
Bhathiya Hemanthage
Proceedings of the 20th Workshop of Young Researchers' Roundtable on Spoken Dialogue Systems
My research focuses on Visual Dialogues and Generalized Visual-Language Grounding with Complex Language Context. Specifically, my research aims to utilize Large Language Models (LLMs) to build conversational agents capable of comprehending and responding to visual cues. Visual-Language Pre-trained (VLP) models, primarily utilizing transformer-based encoder-decoder architectures, are extensively employed across a range of visual-language tasks, such as visual question answering (VQA) and referring expression comprehension (REC). The effectiveness of these models stems from their robust visual-language integration capabilities. However, their performance is constrained in more complex applications like multimodal conversational agents, where intricate and extensive language contexts pose significant challenges. These tasks demand language-only reasoning before engaging in multimodal fusion. In response, my research investigates the application of LLMs with advanced comprehension and generation capabilities to enhance performance in complex multimodal tasks, particularly multimodal dialogues. In brief, my work in visual dialogues revolves around two major research questions: i) how to redefine visually grounded conversational agent architectures to benefit from LLMs, and ii) how to transfer the large body of knowledge encoded in LLMs to conversational systems.
2023
Multitask Multimodal Prompted Training for Interactive Embodied Task Completion
Georgios Pantazopoulos | Malvina Nikandrou | Amit Parekh | Bhathiya Hemanthage | Arash Eshghi | Ioannis Konstas | Verena Rieser | Oliver Lemon | Alessandro Suglia
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Interactive and embodied tasks pose at least two fundamental challenges to existing Vision & Language (VL) models: 1) grounding language in trajectories of actions and observations, and 2) referential disambiguation. To tackle these challenges, we propose an Embodied MultiModal Agent (EMMA): a unified encoder-decoder model that reasons over images and trajectories, and casts action prediction as multimodal text generation. By unifying all tasks as text generation, EMMA learns a language of actions which facilitates transfer across tasks. Unlike previous modular approaches with independently trained components, we use a single multitask model where each task contributes to goal completion. EMMA performs on par with similar models on several VL benchmarks and sets a new state-of-the-art performance (36.81% success rate) on Dialog-guided Task Completion (DTC), a benchmark to evaluate dialog-guided agents in the Alexa Arena.
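To illustrate what casting action prediction as text generation means in practice, the sketch below parses a generated token string back into an executable (verb, object) pair; the action vocabulary and format are hypothetical, not EMMA's actual action language.

# Illustrative action-as-text interface: the model emits a token string which is
# parsed back into an executable (verb, object) pair. Vocabulary and format are
# assumptions for illustration only.
import re
from typing import Optional, Tuple

ACTION_VERBS = {"goto", "pickup", "place", "toggle", "examine"}

def parse_action(generated: str) -> Optional[Tuple[str, str]]:
    """Parse generated text like 'pickup <coffee mug>' into ('pickup', 'coffee mug')."""
    match = re.match(r"\s*(\w+)\s*<([^>]+)>", generated)
    if match and match.group(1) in ACTION_VERBS:
        return match.group(1), match.group(2)
    return None  # not a well-formed action: re-generate or ask a clarification question

print(parse_action("pickup <coffee mug>"))  # ('pickup', 'coffee mug')
print(parse_action("tell me a joke"))       # None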
SimpleMTOD: A Simple Language Model for Multimodal Task-Oriented Dialogue with Symbolic Scene Representation
Bhathiya Hemanthage | Christian Dondrup | Phil Bartie | Oliver Lemon
Proceedings of the 15th International Conference on Computational Semantics
SimpleMTOD is a simple language model which recasts several sub-tasks in multimodal task-oriented dialogues as sequence prediction tasks. SimpleMTOD is built on a large-scale transformer-based auto-regressive architecture, which has already proven to be successful in uni-modal task-oriented dialogues, and effectively leverages transfer learning from pretrained GPT-2. In order to capture the semantics of visual scenes, we introduce both local and de-localized tokens for objects within a scene. De-localized tokens represent the type of an object rather than the specific object itself and so possess a consistent meaning across the dataset. SimpleMTOD achieves a state-of-the-art BLEU score (0.327) in the Response Generation sub-task of the SIMMC 2.0 test-std dataset while performing on par in other multimodal sub-tasks: Disambiguation, Coreference Resolution, and Dialog State Tracking. This is despite taking a minimalist approach for extracting visual (and non-visual) information. In addition, the model does not rely on task-specific architectural changes such as classification heads.
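A small sketch of the local vs. de-localized token distinction: a local token identifies a specific object within the current scene, while a de-localized token names its type and therefore keeps a consistent meaning across the dataset. The token strings are hypothetical, not SimpleMTOD's actual vocabulary.

# Local tokens are scene-specific identities; de-localized tokens carry
# dataset-wide type semantics. Token spellings are illustrative only.
from typing import Dict, List, Tuple

def object_tokens(scene_objects: List[Dict]) -> List[Tuple[str, str]]:
    """scene_objects: e.g. [{'index': 0, 'type': 'jacket'}, {'index': 7, 'type': 'hoodie'}]."""
    tokens = []
    for obj in scene_objects:
        local = f"<OBJ_{obj['index']}>"                # which object in this scene
        delocalized = f"<TYPE_{obj['type'].upper()}>"  # what kind of object, dataset-wide
        tokens.append((local, delocalized))
    return tokens

print(object_tokens([{"index": 0, "type": "jacket"}, {"index": 7, "type": "hoodie"}]))
# [('<OBJ_0>', '<TYPE_JACKET>'), ('<OBJ_7>', '<TYPE_HOODIE>')]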
2022
Demonstrating EMMA: Embodied MultiModal Agent for Language-guided Action Execution in 3D Simulated Environments
Alessandro Suglia | Bhathiya Hemanthage | Malvina Nikandrou | Georgios Pantazopoulos | Amit Parekh | Arash Eshghi | Claudio Greco | Ioannis Konstas | Oliver Lemon | Verena Rieser
Proceedings of the 23rd Annual Meeting of the Special Interest Group on Discourse and Dialogue
We demonstrate EMMA, an embodied multimodal agent which has been developed for the Alexa Prize SimBot challenge. The agent acts within a 3D simulated environment for household tasks. EMMA is a unified and multimodal generative model aimed at solving embodied tasks. In contrast to previous work, our approach treats multiple multimodal tasks as a single multimodal conditional text generation problem, where a model learns to output text given both language and visual input. Furthermore, we showcase that a single generative agent can solve tasks with visual inputs of varying length, such as answering questions about static images, or executing actions given a sequence of previous frames and dialogue utterances. The demo system will allow users to interact conversationally with EMMA in embodied dialogues in different 3D environments from the TEACh dataset.