Generalized Visual-Language Grounding with Complex Language Context

Bhathiya Hemanthage


Abstract
My research focuses on Visual Dialogues and Generalized Visual-Language Grounding with Complex Language Context. Specifically, my research aims to utilize Large Language Models (LLMs) to build conversational agents capable of comprehending and responding to visual cues. Visual-Language Pre-trained (VLP) models, primarily built on transformer-based encoder-decoder architectures, are extensively employed across a range of visual-language tasks, such as visual question answering (VQA) and referring expression comprehension (REC). The effectiveness of these models stems from their robust visual-language integration capabilities. However, their performance is constrained in more complex applications such as multimodal conversational agents, where intricate and extensive language contexts pose significant challenges. These tasks demand language-only reasoning before engaging in multimodal fusion. In response, my research investigates the application of Large Language Models (LLMs), with their advanced comprehension and generation capabilities, to enhance performance in complex multimodal tasks, particularly multimodal dialogues. In brief, my work on visual dialogues revolves around two major research questions: i) How can visually grounded conversational agent architectures be redefined to benefit from LLMs? ii) How can the large body of knowledge encoded in LLMs be transferred to conversational systems?
Anthology ID:
2024.yrrsds-1.21
Volume:
Proceedings of the 20th Workshop of Young Researchers' Roundtable on Spoken Dialogue Systems
Month:
September
Year:
2024
Address:
Kyoto, Japan
Editors:
Koji Inoue, Yahui Fu, Agnes Axelsson, Atsumoto Ohashi, Brielen Madureira, Yuki Zenimoto, Biswesh Mohapatra, Armand Stricker, Sopan Khosla
Venues:
YRRSDS | WS
Publisher:
Association for Computational Linguistics
Pages:
57–59
URL:
https://aclanthology.org/2024.yrrsds-1.21
Cite (ACL):
Bhathiya Hemanthage. 2024. Generalized Visual-Language Grounding with Complex Language Context. In Proceedings of the 20th Workshop of Young Researchers' Roundtable on Spoken Dialogue Systems, pages 57–59, Kyoto, Japan. Association for Computational Linguistics.
Cite (Informal):
Generalized Visual-Language Grounding with Complex Language Context (Hemanthage, YRRSDS-WS 2024)
PDF:
https://aclanthology.org/2024.yrrsds-1.21.pdf