Query rewrite aims to convert an in-context query into a fully specified version in which ellipsis is completed and coreference is resolved according to the conversation history. Although much progress has been made, less effort has been devoted to real-world conversations that draw information from more than one modality. In this paper, we propose the task of multimodal conversational query rewrite (McQR), which performs query rewrite in a multimodal visual conversation setting. We collect a large-scale, manually annotated dataset named McQueen, which contains 15k visual conversations and over 80k queries, each associated with a fully specified rewrite. In addition, for entities appearing in the rewrite, we provide the corresponding image box annotation. We then use the McQueen dataset to benchmark a state-of-the-art method for the McQR task, which is based on a multimodal pre-trained model with a pointer generator. Extensive experiments demonstrate the effectiveness of our model on this task.
With the rapid and widespread adoption of voice assistants, continuous effort has been devoted to language understanding (LU) for conversational queries. In this paper, we first study the LU problem in the spatial domain, which is critical for location-based services provided by voice assistants but has not been investigated in depth in existing studies. Spatial domain queries have several unique properties that make them more challenging for language understanding than common conversational queries, including lexically similar but diverse intents and highly ambiguous words. Thus, a specially tailored LU framework for spatial domain queries is necessary. To this end, we extract and annotate a dataset based on real-life queries from a voice assistant service. We then propose a new multi-task framework that jointly learns the intent detection and entity linking tasks, with a newly devised hierarchical intent detection method and a triple-scoring mechanism for entity linking. A specially designed spatial GCN is also used to model spatial context information among entities. We conduct extensive experimental evaluations against state-of-the-art entity linking and intent detection methods, which demonstrate that our framework outperforms all baselines by a significant margin.