PaCE: Unified Multi-modal Dialogue Pre-training with Progressive and Compositional Experts

Perceiving multi-modal information and conducting dialogues with humans is a long-term goal of artificial intelligence. Pre-training is commonly regarded as an effective approach for multi-modal dialogue. However, due to the limited availability of multi-modal dialogue data, research on multi-modal dialogue pre-training remains scarce. Another intriguing challenge arises from the encompassing nature of multi-modal dialogue, which involves various modalities and tasks. Moreover, new forms of tasks may arise at unpredictable points in the future, so multi-modal dialogue models must be flexible enough to adapt to such scenarios. This paper proposes PaCE, a unified, structured, compositional multi-modal dialogue pre-training framework. It utilizes a combination of several fundamental experts to accommodate multiple dialogue-related tasks, and can be pre-trained with limited dialogue data plus extensive non-dialogue multi-modal data. Furthermore, we propose a progressive training method in which previously learned experts assist new experts, facilitating the expansion of their capabilities. Experimental results demonstrate that PaCE achieves state-of-the-art results on eight multi-modal dialogue benchmarks.


Introduction
Enabling seamless communication between humans and machines is a long-standing goal of artificial intelligence research. The recent emergence of ChatGPT has increased confidence in the potential for achieving this goal. Beyond the use of textual language as the sole interface between humans and machines, perceiving and utilizing multi-modal information, especially visual information, has become a crucial capability known as multi-modal dialogue (Shuster et al., 2020).

Figure 1: An example of multi-modal dialogue, which involves multiple tasks, including multi-modal intent classification, multi-modal state tracking, multi-modal dialog retrieval and response generation.
To facilitate the research on multi-modal dialogue, plenty of specific tasks and datasets have emerged in the community (Das et al., 2017; Shuster et al., 2018; Feng et al., 2022; Long et al., 2023). However, the overall quantity of data is still limited. Furthermore, multi-modal dialogue presents a greater challenge than the traditional text-only dialogue track (Hui et al., 2021; He et al., 2022), as it involves the integration of various modalities and more intricate task scenarios. As shown in Figure 1, the central tasks of multi-modal dialogue include multi-modal intent classification (Zang et al., 2021), multi-modal dialogue retrieval (Das et al., 2017; Zang et al., 2021), multi-modal dialogue state tracking (Liao et al., 2021), and multi-modal response generation (Kottur et al., 2021). Although pre-training has become the consensus approach for multi-task learning in machine learning (Devlin et al., 2018; Radford et al., 2019, 2021), pre-training models for multi-modal dialogue remain largely unexplored.

In this paper, we focus on building pre-trained models for multi-modal dialogue. A key challenge is to unify different modalities and task forms, and to make the best use of existing multi-modal dialogue and non-dialogue data. A recent popular trend for textual tasks is to build unified pre-trained foundation models through multi-task learning, e.g., T5 (Raffel et al., 2020). However, this approach mixes all tasks and learns them from scratch, making the learning process difficult to control and a complete black box. Although the Mixture-of-Experts (MoE) architecture (Fedus et al., 2021; Du et al., 2022) attempts to select independent experts for each input sample through token-level routing, its experts lack specific semantics, i.e., it is entirely unknown what each expert is responsible for.
We hope to find a new way to handle many multi-modal dialog tasks simultaneously and combine existing concrete skills to learn new tasks more efficiently.
To this end, we propose PaCE, a unified multi-modal dialogue pre-training framework with Progressive and Compositional Experts. First, we decompose complicated multi-modal dialogue into fundamental sub-capabilities that can be learned with specific data. Different from traditional MoE, each expert in PaCE is tailored to one specific fundamental sub-capability of multi-modal dialogue: CAPTION, CONTEXT, IMAGE, GROUNDING, and GENERATION. Second, we propose a progressive pre-training strategy that evolves the model by controlling the combination of experts in different pre-training phases. Specifically, in stage I, we train on multi-modal non-dialogue data to obtain the CAPTION, IMAGE, and GROUNDING experts. In stage II, we train the CONTEXT expert, guided by the CAPTION expert, on multi-modal dialogue data to learn the dependencies in context. Furthermore, a dialogue GENERATION expert is derived by adding a response generation task based on the previously learned experts. Third, for pre-training PaCE, we collect a multi-modal dialogue corpus with 1.4 million dialogues and a multi-modal non-dialogue corpus with 4 million samples. Once the pre-training of PaCE is finished, we can flexibly select different capability experts to solve a specific downstream task.
As illustrated in Figure 2, PaCE achieves state-of-the-art performance across a broad range of multi-modal dialogue benchmarks spanning four diverse downstream tasks, i.e., multi-modal intent classification, multi-modal dialogue retrieval, multi-modal state tracking, and multi-modal response generation. This demonstrates that PaCE not only possesses a flexible model architecture but also exhibits adaptable training methodologies, resulting in remarkable performance.

Related Work
Pre-trained Vision-Language Models The pre-training paradigm, with its successes in natural language processing (Devlin et al., 2018; Radford et al., 2019), has sparked a revolution in multi-modal learning. ViLBERT (Lu et al., 2019) was the first work to adapt a BERT-like architecture to visual-language modeling, allowing for learning joint representations of images and texts. ViLT constructed the vision module in the same way as the text module with a unified Transformer (Vaswani et al., 2017), eliminating the need for resource-intensive image feature extraction and significantly accelerating the model. CLIP (Radford et al., 2021) employed contrastive learning to directly align images with natural language texts, removing the constraints of predefined image categories. ALIGN (Jia et al., 2021) and Florence further generalized this idea on noisier but larger image-text pairs. These models have demonstrated the ability to learn strong image and text representations for cross-modal alignment tasks. In addition, a number of models (Cho et al., 2021; Wang et al., 2022; Yu et al., 2022; Alayrac et al., 2022) employed auto-regressive models to capture the association between images and texts, using a unified generation approach to construct the task in an end-to-end manner. Although pre-trained vision-language models have shown promising results, they mainly focus on caption texts, which are intrinsically different from human conversations (Kulhánek et al., 2021). To the best of our knowledge, the proposed PaCE model is the first multi-modal dialogue pre-training model.

Multi-Modal Dialogue Modeling
Numerous advanced works have been proposed along with the development of multi-modal dialogue datasets (Das et al., 2017; Mostafazadeh et al., 2017; Shuster et al., 2018; Zang et al., 2021; Kottur et al., 2021; Liao et al., 2021; Feng et al., 2022). Several dialogue modeling works (Qi et al., 2020; Lee et al., 2021) have been conducted to improve the performance of conversational agents in image-grounded dialogue. Zang et al. (2021) proposed a dual-encoder model that utilized object labels to encode image features so as to perform a dialogue-based image retrieval task. Afterward, researchers explored enriching the textual expressions of generated dialogue responses through associative vision scenes. For textual response tasks, a multi-modal dialogue generation model based on the Seq2Seq architecture was proposed and shown to be superior to the textual Seq2Seq model. Lee et al. (2022) proposed a joint multi-modal encoder-decoder model to incorporate visual inputs. However, the above models have demonstrated success only on specific sub-tasks with particular datasets, which cannot meet the requirements of a wide range of multi-modal dialogue tasks. To address this challenge, we propose a unified multi-modal dialogue pre-training model based on a divide-and-conquer strategy, which can combine different experts to complete a series of tasks.

Pre-training Data Construction
In this paper, we collect both multi-modal non-dialogue and multi-modal dialogue data for PaCE pre-training. The overall statistics of our collected pre-training corpora are shown in Table 1.

Multi-modal Non-dialogue Data (MultiNonDialog) Similar to previous work, we first collect four multi-modal non-dialogue datasets for image and text representation learning, including MSCOCO (Lin et al., 2014), VG (Krishna et al., 2017), SBU (Ordonez et al., 2011) and GCC (Sharma et al., 2018). In MultiNonDialog, each image is accompanied by one or more captions whose lengths are generally constrained to 20 tokens. Since GCC and SBU provide only image URLs, we collect the images via the given URLs where they are still accessible.
Multi-modal Dialogue Data (MultiDialog) We collect six existing multi-modal conversation corpora, ranging from online forum chatting logs (Das et al., 2017; Shuster et al., 2018; Zang et al., 2021; Feng et al., 2022) to customer service conversations (Liao et al., 2021; Kottur et al., 2021), and build a large-scale multi-modal dialogue corpus. To ensure that each conversation has at least one corresponding image, we eliminate the text-only conversations from the original datasets. In addition, to satisfy the requirements of the Stage II pre-training, we use the BLIP model (Li et al., 2022b) implemented by Li et al. (2022a) to generate an appropriate textual caption for each image. The captions are constrained to 20 tokens.

Pre-training Method
Given a set of n multi-modal dialogue samples {(U_i, R_i)}_{i=1}^n, where U_i and R_i represent the dialogue context and response, respectively. Compared to traditional textual dialogue, both can incorporate various types of information, including textual utterances and visual images: U_i = {u_i^{m,k}}_{k=1}^K and R_i = {r_i^{m,q}}_{q=1}^Q, where K and Q are the number of elements, and m ∈ {t, v} denotes the modality of U_i (or R_i); t indicates textual utterances, while v indicates visual images.

Figure 3: Three-stage training based on different combinations of experts. Multi-modal non-dialogue data is used mainly in the first stage (image-text matching), multi-modal dialogue data in the second (image-context matching) and third (response generation modeling) stages, and the caption of the input image guides the second stage.
We devise a divide-and-conquer pre-training strategy for multi-modal dialogue. Concretely, we decompose complicated multi-modal dialogue into five fundamental sub-capabilities and design five corresponding experts (i.e., CAPTION, CONTEXT, IMAGE, GROUNDING, and GENERATION experts). Then, we propose a progressive training strategy to evolve the model by controlling the combination of experts in different pre-training phases. Next, we describe the input representation learning module, the divide-and-conquer pre-training strategy, the pre-training objectives, and the fine-tuning process in detail.

Input Representation Learning
The proposed model is designed to handle input data from two modalities: visual representations and textual representations.

Visual Representations
The dialogue context and response can contain either visual or textual data. We use Vision Transformer (Dosovitskiy et al., 2020) to learn visual representations of images. Formally, we process a visual image v ∈ R^{H×W×C} by dividing it into N = HW/P^2 patches v_p ∈ R^{N×(P^2·C)}, where C is the number of channels, (H, W) is the resolution of the input image, and P is the patch resolution. This allows the model to extract meaningful features from the image by treating it as a set of small regions rather than a single large array of pixels. The image patches are then flattened into vectors and processed by a linear projection with a weight matrix W_V ∈ R^{(P^2·C)×E} and a position embedding matrix W_V^{pos} ∈ R^{(N+1)×E}; the position embedding adds information about the position of each patch in the image. Finally, we obtain the visual representations H_0^v after summing the patch embeddings and position embeddings.
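The patching and projection steps above can be sketched in a few lines of NumPy. This is an illustrative reconstruction, not the authors' code: the projection and position matrices are randomly initialized here (PaCE initializes from a pre-trained ViT), and the shapes follow the notation in the text.

```python
import numpy as np

def patch_embed(image, patch_size, embed_dim, rng=None):
    """Split an image into non-overlapping patches and linearly project
    them, as in ViT (Dosovitskiy et al., 2020). Weights are randomly
    initialized here purely for illustration."""
    rng = rng if rng is not None else np.random.default_rng(0)
    H, W, C = image.shape
    P = patch_size
    assert H % P == 0 and W % P == 0
    N = (H // P) * (W // P)                       # number of patches
    # Rearrange (H, W, C) -> (N, P*P*C): one flattened vector per patch
    patches = (image.reshape(H // P, P, W // P, P, C)
                    .transpose(0, 2, 1, 3, 4)
                    .reshape(N, P * P * C))
    W_v = rng.normal(0, 0.02, (P * P * C, embed_dim))  # linear projection
    pos = rng.normal(0, 0.02, (N, embed_dim))          # position embeddings
    return patches @ W_v + pos                         # H_0^v: (N, E)

H0_v = patch_embed(np.zeros((224, 224, 3)), patch_size=16, embed_dim=768)
print(H0_v.shape)  # (196, 768)
```

With a 224×224 image and P = 16, this yields N = (224/16)^2 = 196 patch embeddings of dimension E = 768, matching the standard ViT-Base configuration.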

Textual Representations
The input text t ∈ R L×|O| is embedded into a dense representation t ∈ R L×E by using a word embedding matrix W T ∈ R |O|×E and a position embedding matrix W pos T ∈ R (L+1)×E , where |O| is the size of the vocabulary, L is the length of text, and E is the dimension of embedding. It is noteworthy that we usually concatenate the context with the current utterance to form the final textual input. The textual representations can be denoted as H t 0 .
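Analogously, the textual embedding lookup can be sketched as follows. The token ids are toy values and the matrices are randomly initialized; reserving the first row of the position table for a [CLS] slot is an assumption made to match the (L+1)×E shape in the text.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, L, E = 30522, 8, 768                  # |O|, text length, embedding dim
W_t   = rng.normal(0, 0.02, (vocab, E))      # word embedding matrix W_T
W_pos = rng.normal(0, 0.02, (L + 1, E))      # position embeddings W_T^pos

token_ids = np.array([101, 2054, 2003, 1996, 3291, 1029, 102, 0])  # toy ids
H0_t = W_t[token_ids] + W_pos[1:]            # embedding lookup + positions
print(H0_t.shape)  # (8, 768)
```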

Divide-and-Conquer Pre-training Strategy
We devise a novel pre-training strategy in a divideand-conquer manner. Specifically, we first divide the complicated multi-model dialogue into several sub-problems, which can be learned in an easier way. The solutions to the sub-problems are then combined to give a solution to different downstream multi-modal dialogue tasks.
Multi-expert Architecture PaCE adopts an extension of the standard Transformer that learns multiple semantic experts instead of the single feed-forward network (FFN) of the original Transformer (Bao et al., 2021). Concretely, the experts share information from both textual and visual modalities through a multi-head self-attention mechanism (MSA), while each expert FFN has its own unique parameters to learn a different semantic representation. Formally, the unique information, obtained by switching experts in each block, can be formulated as:

H_l^{expert_k} = FFN^{expert_k}(LN(H'_l)) + H'_l,

where H_{l-1} (l ∈ [1, L]) represents the output representation of the (l-1)-th layer and L is the number of Transformer blocks. H_l^{expert_k} is the representation of the k-th expert. The input representation H'_l can be formulated as:

H'_l = MSA(LN(H_{l-1})) + H_{l-1}.

Here, MSA and LN are the standard multi-head self-attention and layer normalization, respectively.

Modality and Capability Experts As illustrated in Figure 3, we divide the complicated multi-modal dialogue task into five easier sub-problems, including CAPTION modeling, CONTEXT modeling, IMAGE modeling, GROUNDING, and GENERATION, and design a semantic expert to solve each sub-problem. These five experts fall into two categories: modality experts (the CAPTION and IMAGE experts) and capability experts (the GROUNDING, CONTEXT, and GENERATION experts) tailored for multi-modal dialogue. Ultimately, we activate the modality and capability experts in a hierarchical manner: the bottom (L − F) layers activate only the modality experts, while the top F layers activate the capability experts, where F is a pre-defined hyper-parameter.
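A minimal sketch of one such block, assuming the pre-LayerNorm residual form common to ViT-style models. The attention sub-layer is stubbed out as an identity for brevity, and the expert names and routing interface are illustrative, not the authors' API.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def expert_block(H_prev, experts, active):
    """One PaCE-style block: all experts share the self-attention output,
    but each active expert applies its own FFN with unique parameters."""
    # Shared self-attention, stubbed as identity: H'_l = MSA(LN(H)) + H
    H_attn = layer_norm(H_prev) + H_prev
    out = {}
    for name in active:
        W1, W2 = experts[name]                  # per-expert FFN parameters
        h = layer_norm(H_attn)
        out[name] = np.maximum(h @ W1, 0) @ W2 + H_attn  # FFN + residual
    return out

rng = np.random.default_rng(0)
E = 32
experts = {k: (rng.normal(0, 0.02, (E, 4 * E)), rng.normal(0, 0.02, (4 * E, E)))
           for k in ["caption", "image", "grounding", "context", "generation"]}
H = rng.normal(size=(10, E))
reps = expert_block(H, experts, active=["caption", "image", "grounding"])
print(sorted(reps))  # ['caption', 'grounding', 'image']
```

Switching the `active` list is what the hierarchical activation amounts to: the bottom (L − F) blocks would pass only modality experts, the top F blocks only capability experts.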

Experts Combination for Different Tasks
We propose a progressive cascade pre-training strategy that solves different multi-modal dialogue tasks by adaptively combining the solutions to the subproblems. We will introduce the details of progressive cascade pre-training in Section 4.3.

Pre-training Objectives
Our progressive cascade pre-training process consists of three stages, each with a tailored pre-training objective.
Stage I: Image-Text Matching In stage I, similar to ViLT, we use non-dialogue multi-modal data D_n to learn the fundamental inter-modal alignment. This stage involves only three experts: the CAPTION expert, the IMAGE expert, and the GROUNDING expert. As depicted in Figure 3(a), following word and patch embedding, the text and image are separately processed into text and image representations by the specialized CAPTION and IMAGE experts. These representations are then fused and fed into the GROUNDING expert, yielding a unified representation of the image and text. We then employ the representation of the '[CLS]' token from the expert output as the input to a binary classification network that predicts the alignment between the current text and image. The loss function of image-text matching is defined as:

L_itm = E_{(V,T)∼D_n} CE(y^{itm}, p^{itm}(V, T)),

where CE is the cross-entropy loss, y^{itm} is the alignment label, and p^{itm}(V, T) is the predicted alignment probability. In addition to L_itm, we also employ the MLM loss L_mlm in this stage to model the unique textual modality. Concretely, following BERT, we randomly select tokens in the text sequence and replace them with the [MASK] token. The model is trained to predict these masked tokens using the context of the remaining unmasked tokens and the visual clues, with a masking probability of 15%. The final output vectors of the masked tokens are fed into a classifier over the entire text vocabulary, with the training loss being the cross-entropy loss.
L_mlm = E_{(V,T̂)∼D_n∪D_d} CE(y^{mask}, p^{mask}(V, T̂)),

where T̂ is a masked text, V is the original image, and p^{mask}(V, T̂) denotes the model's predicted probability for the masked tokens. D_n and D_d represent multi-modal non-dialogue and dialogue data, respectively.
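The BERT-style masking used by the MLM objective can be sketched as follows. The `mask_id` value and the `-100` ignore-label convention are illustrative assumptions (borrowed from common MLM implementations), not details stated in the paper.

```python
import numpy as np

def mask_tokens(token_ids, mask_id, rng, p=0.15):
    """BERT-style masking: each token is selected with probability 15%
    and replaced by [MASK]; labels record the originals at masked
    positions so the cross-entropy loss only scores those positions."""
    token_ids = np.asarray(token_ids)
    select = rng.random(token_ids.shape) < p
    masked = np.where(select, mask_id, token_ids)
    labels = np.where(select, token_ids, -100)  # -100 = ignored by the loss
    return masked, labels

rng = np.random.default_rng(0)
orig_ids = [2054, 2003, 1996, 3291, 1029]
masked, labels = mask_tokens(orig_ids, mask_id=103, rng=rng)
```

For simplicity this sketch always substitutes [MASK]; full BERT-style masking also keeps or randomizes a fraction of the selected tokens.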
The joint loss in stage I can be formulated as:

L^I_stage = L_itm + L_mlm.

Stage II: Image-Context Matching In stage II, we use multi-modal dialogue data D_d to pre-train PaCE, aiming to model the dialogue context for multi-modal dialogue tasks. At this stage, the CONTEXT expert is activated in addition to the three experts from the first stage. Concretely, the dialogue context C is input to the CONTEXT expert, the images V are input to the IMAGE expert, and the corresponding image captions T are input to the CAPTION expert. The loss function of image-context matching is defined as:

L_icm = E_{(V,C)∼D_d} CE(y^{icm}, p^{icm}(V, C)),

where y^{icm} is the alignment label and p^{icm}(V, C) is the predicted alignment probability. In addition, we use the CAPTION expert learned in stage I as a teacher to facilitate the learning of the CONTEXT expert:

L_tca = ||H^t_{L−F} − H^c_{L−F}||^2_2,
where H^t_{L−F} and H^c_{L−F} are the outputs of the (L−F)-th layer of the CAPTION expert and the CONTEXT expert, respectively.
Besides, we also employ the MLM loss in stage II as defined in stage I, and the joint loss L^II_stage in stage II can be formulated as:

L^II_stage = L_icm + L_mlm + L_tca.

Stage III: Generation Modeling The third stage aims to enable the model to generate responses. The GENERATION expert is activated, and its input is composed of the outputs of the CONTEXT expert and the IMAGE expert. The loss function in stage III is defined as:

L^III_stage = L_gen = −Σ_n log p(C_n | C_<n, V).

Here, we model the generative capability by auto-regression, i.e., using the past dialogue history C_<n and associated images V to predict the current turn C_n of a dialogue.
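The three-stage objective schedule can be summarized in a small dispatch helper. The individual loss terms are assumed to be computed elsewhere; the function merely reflects the per-stage sums given above.

```python
def stage_loss(stage, losses):
    """Progressive cascade objectives per pre-training stage.
    `losses` maps loss names to scalar values computed elsewhere."""
    if stage == 1:   # image-text matching + masked LM on non-dialogue data
        return losses["itm"] + losses["mlm"]
    if stage == 2:   # image-context matching + MLM + caption-context teaching
        return losses["icm"] + losses["mlm"] + losses["tca"]
    if stage == 3:   # auto-regressive response generation
        return losses["gen"]
    raise ValueError(f"unknown stage: {stage}")

print(stage_loss(2, {"icm": 2, "mlm": 1, "tca": 1, "gen": 0}))  # 4
```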

Fine-Tuning on Downstream Tasks
Once the pre-training of PaCE is finished, we perform fine-tuning on specific downstream tasks. Thanks to our divide-and-conquer pre-training approach, we can flexibly select different capability experts to solve a specific downstream task. Specifically, for understanding tasks, including intent prediction and dialog retrieval, we activate the CONTEXT, IMAGE, and GROUNDING experts. For generation tasks, i.e., dialog state tracking and response generation, we activate the CONTEXT, IMAGE, and GENERATION experts.
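This expert-selection rule can be written as a simple lookup table. The task and expert names below are hypothetical identifiers chosen for illustration, not the authors' API; the groupings follow the selection rule described above.

```python
# Hypothetical mapping from downstream tasks to the experts activated
# at fine-tuning time.
TASK_EXPERTS = {
    "intent_prediction":   ["context", "image", "grounding"],
    "dialog_retrieval":    ["context", "image", "grounding"],
    "state_tracking":      ["context", "image", "generation"],
    "response_generation": ["context", "image", "generation"],
}

def experts_for(task):
    """Return the capability experts to activate for a downstream task."""
    return TASK_EXPERTS[task]

print(experts_for("response_generation"))  # ['context', 'image', 'generation']
```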

Downstream Datasets
To comprehensively evaluate our PaCE, we conduct extensive experiments on seven datasets belonging to four downstream tasks.

Multi-Modal Intent Prediction
For multimodal intent prediction, PhotoChat (Zang et al., 2021) and MMDialog (Feng et al., 2022) are selected as benchmark datasets. This task aims to identify the specific intent of the user in the multimodal context. More specifically, it predicts the probability of photo sharing in the upcoming conversation turn.

Multi-Modal Dialog Retrieval
For text-to-image retrieval, we select PhotoChat (Zang et al., 2021) as our benchmark dataset. It encompasses 12k dialogues, each accompanied by a user photo exchanged during the conversation. The goal of this task is to select the most appropriate photo given the dialogue context. For image-to-text retrieval, we select Image-Chat (Shuster et al., 2018) to evaluate our model, which consists of 202k dialogues over 202k images.
Multi-Modal Dialog State Tracking

The MMConv (Liao et al., 2021) and SIMMC2.0 (Kottur et al., 2021) datasets provide a good base for multi-modal dialog state tracking. The MMConv dataset contains 5.1k dialogues collected from multi-modal conversations between human-to-human role-playing pairs under real-life traveling scenarios. In contrast, the SIMMC2.0 corpus includes 11,000 task-oriented dialogues in the shopping domain that are grounded in immersive and photo-realistic contexts.

Multi-Modal Response Generation
Generating appropriate responses for satisfactory task completion is the ultimate goal of task-oriented dialogue agents. In this task, we selected MMConv (Liao et al., 2021) and SIMMC2.0 (Kottur et al., 2021) as our benchmark datasets.

Experimental Setting
We use the bert-base-uncased tokenizer to tokenize text inputs. We learn the textual embedding-related parameters from scratch, instead of fine-tuning them from pre-trained BERT. For all experiments, we use the AdamW optimizer (Loshchilov and Hutter, 2017) with a base learning rate of 10^-4 and weight decay of 10^-2. The learning rate is warmed up for 10% of the total training steps and decayed linearly to zero for the rest of training. We set the total number of Transformer layers L to 12, with the number of top layers F set to 3. We initialize the Transformer weights with the pre-trained ViT (Dosovitskiy et al., 2020). In the pre-training process, we utilize 200K steps, 25K steps, and 10K steps, respectively, for the three stages on 8 NVIDIA A100 GPUs with a batch size of 4,096. We compare PaCE with strong baselines, including T5 (Raffel et al., 2020), Divter (Feng et al., 2022), SCAN, TransResNet (Shuster et al., 2018), BART-large (Lewis et al., 2019) and SimpleTOD (Hosseini-Asl et al., 2020).

Evaluation Metrics
For intent prediction, we adopt the F1 score as the evaluation metric, following previous work (Zang et al., 2021). For multi-modal dialog retrieval, we use ranking-based evaluation metrics, i.e., recall at position k among n candidates (R@1, R@5 and R@10), in accordance with prior studies (Zang et al., 2021; Shuster et al., 2018). These metrics measure whether the ground-truth textual or visual output is ranked among the top k ∈ {1, 5, 10} positions among n candidate elements. For multi-modal dialogue state tracking, we report Categorical, Non-categorical and Overall scores as evaluation metrics following Liao et al. (2021). To measure the quality of response generation, we employ BLEU (Papineni et al., 2002) as the evaluation metric for SIMMC2.0. For MMConv, we report a combined score (Comb.), computed as (Inform + Success) × 0.5 + BLEU, as an overall evaluation measure as in Mehri et al. (2019).
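As an illustration, R@k over a set of queries can be computed as follows. This is a generic helper under the definition just stated, not the authors' evaluation code.

```python
def recall_at_k(ranked_candidates, gold, ks=(1, 5, 10)):
    """R@k over a list of queries: the fraction of queries whose
    gold item appears among the top-k ranked candidates."""
    out = {}
    for k in ks:
        hits = sum(1 for cands, g in zip(ranked_candidates, gold)
                   if g in cands[:k])
        out[f"R@{k}"] = hits / len(gold)
    return out

# Two toy queries: gold item ranked 2nd for the first, 1st for the second.
ranked = [["img3", "img1", "img7"], ["img2", "img9", "img4"]]
print(recall_at_k(ranked, gold=["img1", "img2"], ks=(1, 2)))
# {'R@1': 0.5, 'R@2': 1.0}
```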

Quantitative Comparison
As shown in Figure 2 and Table 2, PaCE demonstrates state-of-the-art performance across a wide range of multi-modal dialogue tasks. Specifically, we achieve significant improvements on the PhotoChat and MMConv datasets, with gains of 4.8 points in multi-modal dialog retrieval and 21.2 points in multi-modal dialog state tracking, respectively. It is worth noting that PaCE has a total parameter count of 338 million. In addition, since some experts may be idle during the execution of specific downstream tasks, the effective parameter count further decreases for specific downstream tasks. Below, we provide a detailed analysis of the results on each sub-task dataset.

Multi-Modal Intent Prediction
For the PhotoChat dataset, we report the performance of strong baselines as in (Zang et al., 2021), including ALBERT-base (Lan et al., 2019), BERT (Devlin et al., 2018), T5-base, and T5-3B (Raffel et al., 2020). For the MMDialog dataset, we adopt DE++, Divter (Feng et al., 2022), and ViLT as our baseline models. As shown in Table 3, although some baselines such as T5-3B are much larger than our model, PaCE still achieves the best performance on all evaluation metrics.
Multi-Modal Dialog Retrieval For PhotoChat, we compare PaCE with strong baselines reported in (Zang et al., 2021), including BM25 (Robertson et al., 2009), DE* (Zang et al., 2021), VSE++ (Faghri et al., 2017) and SCAN. We also adapt VLMo (Bao et al., 2021) and ViLT to perform multi-modal dialog retrieval. The results on PhotoChat are reported in Table 4; PaCE achieves substantially better performance than the best-performing baselines. For Image-Chat, we compare PaCE with TransResNet152 (Shuster et al., 2018), VLMo, and ViLT, and report baseline results in Table 5. PaCE achieves the best results for image-to-text dialog retrieval, with a 3.0 improvement in terms of Sum.

Multi-Modal Dialog State Tracking The results on MMConv and SIMMC2.0 are reported in Table 6 and Table 7, respectively. PaCE achieves the best results on most of the evaluation metrics. Notably, we observe that PaCE achieves competitive results at a smaller parameter scale than the previous SOTA on SIMMC2.0 slot F1.

Multi-Modal Response Generation
For the response generation task, we conduct experiments on the SIMMC2.0 and MMConv datasets. For MMConv, we adopt the strong baseline SimpleTOD (Hosseini-Asl et al., 2020) as implemented by Liao et al. (2021). We summarize the experimental results of SIMMC2.0 in Table 7; across both datasets, PaCE performs well on both discriminative and generative tasks.

Ablation Study
Effectiveness of Pre-training Objectives To evaluate the effectiveness of each stage of pre-training, we conduct an ablation study by removing Stage I pre-training (PaCE w/o L^I_stage), removing Stage II pre-training (PaCE w/o L^II_stage), removing Stage III pre-training (PaCE w/o L^III_stage), and removing both Stage II and Stage III (PaCE only L^I_stage). For a fair comparison, the experimental setup of the ablation study is consistent with that of the primary experiments, using the same hyper-parameters and downstream fine-tuning strategy. The ablation results on PhotoChat and Image-Chat are provided in Table 9. We observe that image-text matching (Stage I) and image-context matching (Stage II) play the most important roles in PaCE. This is within our expectation, since Stage I and Stage II are the basis of the subsequent generation modeling (Stage III). It is no surprise that combining all three stages achieves the best performance on the experimental datasets. We also investigate the impact of L_tca by removing it from Stage II pre-training (denoted as PaCE w/o L_tca). We observe that L_tca has a significant impact on the performance of PaCE in Stage II pre-training.
Effectiveness of Pre-training Data In addition, we conduct an ablation study to verify the impact of different pre-training data on the PhotoChat and Image-Chat datasets. We denote the models that use only MultiNonDialog and only MultiDialog for pre-training as PaCE (only MultiNonDialog) and PaCE (only MultiDialog), respectively. The ablation results on PhotoChat and Image-Chat are provided in Table 10. We observe that both the MultiNonDialog and MultiDialog pre-training corpora contribute substantial performance improvements to PaCE. This is within our expectation, since the MultiNonDialog data helps our model learn strong image-text representations and their alignment, while the MultiDialog data encourages PaCE to capture dialogue context information.

Table 10: Ablation test results on the multi-modal dialog retrieval task using different pre-training data.

Case Study
To evaluate PaCE qualitatively, we choose two exemplary conversations from PhotoChat and Image-Chat test sets, and illustrate the retrieved responses by PaCE in Figure 4 and Figure 5. Our PaCE model can retrieve highly relevant candidates to the conversation scenario. For the text-to-image (T2I) retrieval task, since the candidate images could be quite similar, it is challenging to retrieve the exact ground-truth image from the candidates. Although PaCE may not obtain the ground-truth image, we can still obtain the relevant candidate images.

Conclusion
In this paper, we proposed PaCE, a unified, structured, compositional multi-modal dialogue pre-training framework, which adopts a divide-and-conquer strategy. We first broke down the complicated multi-modal dialogue task into several sub-capabilities that could be learned more easily. Then, the solutions to the sub-capabilities were combined to obtain an effective and efficient solution to each downstream multi-modal dialogue task. Experimental results on eight benchmark datasets demonstrate that PaCE achieves new state-of-the-art performance.

Discussion
PaCE adopts a flexible model structure that decomposes complex multi-modal dialogues into basic sub-capabilities. As a result, it can be trained progressively on different data and exhibits excellent expandability, making it applicable to new tasks. An additional advantage is that this structure lends itself well to efforts to improve interpretability. However, we believe that many aspects of PaCE are still worth exploring. The first is the incorporation of additional modalities, and whether the self-attention layer can effectively handle a broader range of modalities in a unified representation. Another aspect worth exploring is the development of a more efficient approach for adapting multi-modal models to diverse downstream applications, eliminating the need to fine-tune all parameters of the model. Furthermore, given the substantial variations in the model networks employed for text generation and image generation in contemporary research, integrating multi-modal generation into a unified framework is a worthwhile endeavor.

Limitations
To better analyze the limitations of PaCE, we carry out an analysis of the errors made by PaCE on the PhotoChat and SIMMC2.0 test sets. The errors fall into the following categories. First, since there are many similar images in the datasets, PaCE fails to distinguish the gold image from similar candidates. This may be because we do not design an explicit fine-grained reasoning module to capture the details of images and texts. For example, when the context mentions "I and my dad both have a camera", our model can capture the entity "camera" but fails to reason that there should be two cameras. One possible solution is to introduce a deep reasoning and comprehension strategy to endow the model with stronger reasoning ability. Second, due to the lack of fine-grained structural understanding of the images, the sentences generated by PaCE have difficulty identifying the relative positions of entities. For example, PaCE may fail to recognize that the black pants are to the right of the yellow shirt. This issue is particularly severe in SIMMC2.0, as its pictures contain many entities and its responses contain many spatial descriptions of entities. One possible remedy is to extract the relative positions of objects mentioned in the conversation as auxiliary data to guide the model's generation.

Figure 5: Two cases on the Image-Chat test set. For each dialogue query, we show the top-5 ranked responses from top to bottom.

Acknowledgments
and JCYJ20200109113441941), and NSFC (no. 92270122). This work was supported by Alibaba Group through the Alibaba Innovative Research Program.