In dialogue-to-image retrieval, the primary challenge is to fetch images from a pre-compiled database that accurately reflect the intent embedded in the dialogue history. Existing methods often overemphasize inter-modal alignment and neglect the nuanced nature of conversational context. Dialogue histories are frequently cluttered with redundant information and often lack direct image descriptions, which leaves a substantial gap between conversational content and visual representation. This study introduces VCU, a novel framework designed to improve the comprehension of dialogue history and cross-modal matching for image retrieval. VCU leverages large language models (LLMs) to perform a two-step extraction process. It generates precise image-related descriptions from dialogues, while also enhancing visual representation by utilizing object-list texts associated with images. Additionally, auxiliary query collections are constructed to balance the matching process, thereby reducing bias in similarity computations. Experimental results demonstrate that VCU significantly outperforms baseline methods in dialogue-to-image retrieval tasks, highlighting its potential for practical application and its effectiveness in bridging the gap between dialogue context and visual content.
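The abstract does not say how VCU scores a match, so the following is only a generic sketch of cross-modal retrieval by embedding similarity. It assumes the dialogue-derived description, each image, and each image's object-list text have already been embedded as plain vectors. The function names, the `text_weight` parameter, and the weighted-sum fusion are all hypothetical and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_images(query_vec, images, text_weight=0.5):
    """Rank image ids by a weighted sum of two similarities (assumed fusion rule).

    images maps an image id to (image_vec, object_text_vec).
    """
    scores = {}
    for image_id, (image_vec, object_text_vec) in images.items():
        visual = cosine(query_vec, image_vec)
        textual = cosine(query_vec, object_text_vec)
        scores[image_id] = (1.0 - text_weight) * visual + text_weight * textual
    return sorted(scores, key=scores.get, reverse=True)
```

For example, `rank_images([1.0, 0.0], {"a": ([1.0, 0.0], [1.0, 0.0]), "b": ([0.0, 1.0], [0.0, 1.0])})` places `"a"` first, since both of its vectors point in the same direction as the query.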
A Dialogue State Tracker (DST) is a core component of modular task-oriented dialogue systems. Tremendous research progress has been made in the past ten years to improve the performance of DSTs, especially on benchmark datasets. However, their generalization to novel and realistic scenarios beyond the held-out conversations is limited. In this paper, we design experimental studies to answer two questions: 1) How does the distribution of dialogue data affect the performance of DSTs? 2) What are effective ways to probe DSTs with counterfactuals? Our findings are that the performance variance of generative DSTs is not only due to the model structure itself, but can also be attributed to the distribution of cross-domain values. Evaluating iconic generative DST models on the MultiWOZ dataset with counterfactuals results in a significant drop of up to 34.64 percentage points in absolute joint goal accuracy (from 50.91% to 16.27%). We believe our experimental results can guide future work to better understand the intrinsic core of DST and to rethink which approach suits a specific task given the properties of the application.
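As a reading aid for the reported numbers, here is a minimal sketch of joint goal accuracy under the usual definition: a turn counts as correct only if the whole predicted dialogue state matches the gold state exactly. Representing a state as a slot-to-value dictionary, and the slot names in the usage note, are assumptions made for illustration.

```python
def joint_goal_accuracy(predicted_states, gold_states):
    """Fraction of turns whose predicted state equals the gold state exactly."""
    if len(predicted_states) != len(gold_states):
        raise ValueError("predicted and gold must cover the same turns")
    if not gold_states:
        return 0.0
    correct = sum(1 for p, g in zip(predicted_states, gold_states) if p == g)
    return correct / len(gold_states)


def absolute_drop(before_pct, after_pct):
    """Absolute difference between two accuracies, in percentage points."""
    return round(before_pct - after_pct, 2)
```

With the figures quoted above, `absolute_drop(50.91, 16.27)` gives 34.64, which confirms that the reported drop is an absolute difference in percentage points rather than a relative change. A single wrong slot value, such as predicting `{"hotel-stars": "4"}` against a gold `{"hotel-stars": "5"}`, makes the whole turn count as incorrect.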
A Dialogue State Tracker (DST) is a core component of a modular task-oriented dialogue system. Tremendous progress has been made in recent years. However, major challenges remain. The state-of-the-art accuracy for DST is below 50% for a multi-domain dialogue task. A learnable DST for any new domain requires a large amount of labeled in-domain data and training from scratch. In this paper, we propose a Meta-Reinforced Multi-Domain State Generator (MERET). Our first contribution is to improve DST accuracy. We enhance a neural DST generator with a reward manager, which uses policy gradient reinforcement learning (RL) to fine-tune the generator. With this change, we improve the joint accuracy of DST from 48.79% to 50.91% on the MultiWOZ corpus. Second, we explore training a DST meta-learning model with a few domains as source domains and a new domain as the target domain. We apply the model-agnostic meta-learning algorithm (MAML) to DST and use the resulting meta-learned model for adaptation to the new domain. Our experimental results show that this solution outperforms the traditional training approach with far less training data in the target domain.
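To make the MAML step concrete, the following toy sketch applies the algorithm to a single scalar parameter rather than a DST generator. Each "task" (standing in for a source domain) is assumed to have the loss `(theta - c) ** 2` for a task-specific target `c`, which lets the meta-gradient be written in closed form. The function name, the learning rates, and the quadratic loss are illustrative assumptions, not MERET's setup.

```python
def maml_scalar(task_targets, inner_lr=0.1, meta_lr=0.1, steps=200, theta=0.0):
    """MAML on a scalar parameter with per-task loss (theta - c) ** 2."""
    for _ in range(steps):
        meta_grad = 0.0
        for c in task_targets:
            # Inner loop: one gradient step on the task loss.
            adapted = theta - inner_lr * 2.0 * (theta - c)
            # Outer loop: gradient of the post-adaptation loss with respect to
            # the initial theta; d(adapted)/d(theta) = 1 - 2 * inner_lr.
            meta_grad += 2.0 * (adapted - c) * (1.0 - 2.0 * inner_lr)
        theta -= meta_lr * meta_grad / len(task_targets)
    return theta
```

In this toy setting the learned initialization settles at the mean of the task targets, the point from which one inner step reduces the average loss the most. A real DST model would replace the closed-form gradients with automatic differentiation over the generator's parameters.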
In this paper, we propose a meta-learning based semi-supervised explicit dialogue state tracker (SEDST) for neural dialogue generation, denoted as MEDST. Our main motivation is to further bridge the gap between the need for high-accuracy dialogue state trackers and the common reality that only scarce annotated data is available for most real-life dialogue tasks. Specifically, MEDST has two core steps: meta-training with ample unlabelled data in an automatic way, and meta-testing with a small amount of annotated data through supervised learning. In particular, we enhance SEDST via entropy regularization, and investigate semi-supervised learning frameworks based on model-agnostic meta-learning (MAML) that reduce the amount of intermediate state labelling required. We find that by leveraging un-annotated data in a meta-learning manner, the amount of dialogue state annotation can be reduced to below 10% while maintaining equivalent system performance. Experimental results show that MEDST substantially outperforms SEDST, by 18.7% in joint goal accuracy and 14.3% in entity match rate, on the KVRET corpus with 2% labelled data under semi-supervision.
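The abstract does not spell out how entropy regularization is applied in MEDST. As a hedged illustration, the sketch below shows the generic entropy-minimization form often used in semi-supervised learning: cross-entropy on labelled examples plus a weighted entropy penalty on predictions for unlabelled examples. The function names and the `weight` parameter are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def cross_entropy(probs, gold_index):
    """Negative log-likelihood of the gold class."""
    return -math.log(probs[gold_index])


def semi_supervised_loss(labelled, unlabelled, weight=0.1):
    """Supervised cross-entropy plus an entropy penalty on unlabelled predictions.

    labelled: list of (probs, gold_index); unlabelled: list of probs.
    """
    supervised = (
        sum(cross_entropy(p, y) for p, y in labelled) / len(labelled)
        if labelled else 0.0
    )
    regularizer = (
        sum(entropy(p) for p in unlabelled) / len(unlabelled)
        if unlabelled else 0.0
    )
    return supervised + weight * regularizer
```

The penalty pushes the model toward confident predictions on unlabelled dialogues: a near one-hot prediction such as `[0.99, 0.01]` contributes far less to the loss than a uniform `[0.5, 0.5]`.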