Multi-Source Probing for Open-Domain Conversational Understanding



Introduction
Conversational understanding and response generation are critical for the success of open-domain dialogue systems. Recently, pre-trained open-domain dialogue models, including DialoGPT (Zhang et al., 2020), BlenderBot (Roller et al., 2020), and Meena (Adiwardana et al., 2020), have achieved impressive progress in a wide range of conversational tasks. The pre-trained model has become a solid foundation for downstream fine-tuning, such as generating empathetic (Zhong et al., 2020) and persona-coherent (Wolf et al., 2019b) responses, delivering knowledge-grounded conversations (Zhao et al., 2020; Wu et al., 2021), and completing task goals (Wu et al., 2020; Peng et al., 2020). While these generative dialogue models can produce fluent responses, they still have many limitations in conversational understanding (Saleh et al., 2020; Li et al., 2016), leading to irrelevant, repetitive, and generic responses (Li et al., 2017a; Welleck et al., 2019; Cho and Saito, 2021).
Research on conversational understanding can provide a holistic evaluation of dialogue models (Parthasarathi et al., 2020), contributing to their deployment and design. However, the analysis of open-domain dialogue models on conversational understanding remains a controversial topic (Dinan et al., 2020; Tao et al., 2018; Ji et al., 2022). Some work (Sankar et al., 2019; Saleh et al., 2020) demonstrates that dialogue models have difficulty capturing the conversational dynamics in the dialogue history and struggle with conversational understanding tasks such as question answering, contradiction inference, and topic determination. In contrast, Parthasarathi et al. (2020) affirms the conversational understanding of open-domain dialogue models and indicates that recurrent dialogue models perform better than transformer-based models. These studies have limitations in probing methods and experimental settings (Ravichander et al., 2020), making them inapplicable to present-day large-scale open-domain dialogue models.
In this work, we propose a Multi-Source Probing (MSP) method to examine the conversational understanding ability of open-domain dialogue models. Specifically, MSP conducts dialogue comprehension tasks in a generative manner, which is coherent with the pre-trained dialogue generation task and takes full advantage of model capabilities. In addition, considering that different tasks require various information from the dialogue context, MSP aggregates features from multiple sources to accomplish diverse tasks. We propose a multi-source cross-attention mechanism to extract local features and adopt a late fusion module to incorporate global features. With these components, MSP can evaluate generative dialogue models more accurately and comprehensively.
To demonstrate the validity and reliability of our method, we conduct comprehensive experiments comparing the conventional MLP-based probing approach with MSP. Furthermore, we set up a series of ablation experiments to verify the necessity of the multi-source attention mechanism and the late fusion module.
To verify that our method also applies to models of even larger scale, we extend MSP to provide insight into larger pre-trained dialogue models. We find that larger models have a stronger capability to extract the information encoded by the pre-trained encoder.
Our study reveals three critical findings:
• Different from the conclusion reached by the vanilla probing method, we find through MSP that encoder hidden states contain more information than original embeddings in pre-trained dialogue models, as reflected by the higher accuracy obtained on our probing tasks.
• Generative dialogue models with a single decoder have a worse understanding of the conversation than encoder-decoder-based models, as the uni-directional attention mechanism only encodes partial context (content before each token) information for tokens, leading to asymmetric representations of dialogue history and current utterance.
• Dialogue models can capture the dialogue structure in conversational understanding. Larger dialogue models have a better understanding of conversational information and achieve higher accuracy on probing tasks.
2 Related Work

Open-domain Conversational Models
Recently, open-domain conversation systems have been largely advanced by the growth of dialogue corpora and the development of large-scale pre-training (Devlin et al., 2018; Radford et al., 2019; Brown et al., 2020). Pre-trained open-domain dialogue models, such as DialoGPT (Zhang et al., 2020), BlenderBot (Roller et al., 2020), and Meena (Adiwardana et al., 2020), demonstrate outstanding conversation skills, including empathetic (Zhong et al., 2020) and persona-coherent (Wolf et al., 2019b) response generation, delivering knowledge-grounded conversations (Zhao et al., 2020; Wu et al., 2021), and completing task goals (Wu et al., 2020; Peng et al., 2020). These dialogue models are capable of fluent response generation, but they still have many conversational understanding limitations (Saleh et al., 2020; Li et al., 2016) that result in irrelevant, repetitive, and generic responses (Serban et al., 2017; Li et al., 2017a; Welleck et al., 2019; Cho and Saito, 2021). Several studies (Sankar et al., 2019; Bao et al., 2020) point out that generative dialogue models do not always properly exploit the existing dialogue history and are yet unable to understand the context well enough to provide coherent and engaging conversations. Some work (Komeili et al., 2022; Zhou et al., 2018) introduces external knowledge to enhance conversational understanding, which facilitates the generation of relevant and coherent responses.

Probing Method
With the growing demand for natural language understanding, the probing method has been widely employed in machine translation (Belinkov et al., 2017, 2018; Dalvi et al., 2017; Yawei and Fan, 2021) and knowledge attribution (Alishahi et al., 2017; Beloucif and Biemann, 2021) to assess the linguistic properties of sentence representations learned by models.
Although several studies have been proposed to probe the conversational understanding capability of open-domain dialogue models (Dinan et al., 2020; Tao et al., 2018; Ji et al., 2022), this research area is still controversial. According to certain research (Sankar et al., 2019; Saleh et al., 2020; Das et al., 2020), conversational comprehension tasks including question answering, contradiction inference, and topic determination pose challenges for dialogue models in terms of capturing the conversational dynamics in the dialogue history. In contrast, Parthasarathi et al. (2020) validates the conversational comprehension of open-domain dialogue models and demonstrates that recurrent dialogue models outperform transformer-based models. In addition, previous work (Saleh et al., 2020; Alt et al., 2020; Ravichander et al., 2020; Parthasarathi et al., 2020; Richardson et al., 2020) usually adopted a single probing method to perform model-level analysis of dialogue systems, and thus lacked an exhaustive comparison of different probing methods. These studies have reached opposite conclusions for two major reasons. First, these methods usually adopt a shallow Multi-Layer Perceptron (MLP) as the classifier, which cannot fully utilize the information encoded in intermediate representations to conduct probing tasks. Second, the experimental settings are also insufficient (Ravichander et al., 2020), including the probing tasks and probed model scales.

Vanilla Probing Approach
For the vanilla probing method, we first use pretrained generative dialogue models to extract encoder hidden states and word embeddings corresponding to the probing task texts and then feed the extracted representations to a two-layer Multi-Layer Perceptron (MLP) classifier.
In this way, we calculate the accuracy of probing tasks, which is empirically assumed to reflect the ability of the corresponding model to capture information that is beneficial for the goal of dialogue understanding. Here only the parameters of the classifier are trainable during the training of probing tasks, with encoder parameters kept fixed. The vanilla probing method is shown in Figure 1(a).
First, we extract inner states from the encoder to obtain word-level representations from both word embeddings and encoder states over the entire probing text. We denote the n-th utterance as u_n = [w_1^(n), ..., w_{c_n}^(n)], where c_n is the number of words in u_n and w_i^(n) is the i-th word, whose word embedding we also write as w_i^(n). Since the probing tasks rely on high-level reasoning over mixed, synthesized information, we require utterance-level representations for probing. To obtain R_history and R_current, we independently average the representations corresponding to the historical and current utterances. Then we concatenate the two averaged representations to obtain the final feature R_probing for the probing tasks, where ⊙ denotes the concatenation operation:

R_probing = R_history ⊙ R_current
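The pooling-and-concatenation step above can be sketched as follows (a minimal illustration; the function name and tensor shapes are our own assumptions, not the paper's implementation):

```python
import numpy as np

def build_probing_feature(token_reps: np.ndarray, history_len: int) -> np.ndarray:
    """Build the probing feature R_probing from word-level representations.

    token_reps: (seq_len, hidden) word embeddings or encoder hidden states
    history_len: number of tokens belonging to the dialogue history
    """
    # Average-pool the history tokens and current-utterance tokens independently.
    r_history = token_reps[:history_len].mean(axis=0)   # (hidden,)
    r_current = token_reps[history_len:].mean(axis=0)   # (hidden,)
    # Concatenate the two pooled vectors (the ⊙ operation in the text).
    return np.concatenate([r_history, r_current])       # (2 * hidden,)
```

The resulting 2×hidden vector is what the frozen-encoder MLP classifier consumes.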

Multi-Source Probing Approach
The prior MLP-based approach has been considered fairly intuitive for detecting whether the dialogue model captures pertinent information in the encoder states. However, it is difficult for these approaches to effectively utilize the information encoded by dialogue models, due to the divergent objectives of downstream probing tasks and dialogue generation during pre-training (Liu et al., 2021; Schick and Schütze, 2021).
To address this issue, we propose a Multi-Source Probing (MSP) method to probe the dialogue comprehension abilities of open-domain dialogue models. MSP conducts probing tasks in a generative manner, consistent with the pre-training task, to take full advantage of model capabilities.
As various probing tasks require information from different aspects of the dialogue, which may differ greatly from the dialogue generation task, we propose a multi-source attention mechanism to aggregate features from multiple sources to accomplish diverse tasks.
Moreover, MLP-based probing methods fail to exploit the understanding capabilities potentially encoded in the decoder parameters, while MSP reuses this otherwise discarded information and provides better modeling of contextual information. The overview of MSP is presented in Figure 1(c).

Continuous Prompt Learning
Continuous (or soft) prompt learning is a central part of our MSP method, as shown in Figure 1(b). The generative approach to probing classification is more consistent with the pre-training objective, allowing the model to be more adaptable.
Specifically, we set the Prompt-Template as a sequence of distinct soft tokens for the decoder input. Each soft token has a unique word embedding that is continually adjusted and updated during the training stage. Consistent with the vanilla probing setting, the transformer decoder is fine-tuned in MSP, while the encoder parameters are fixed.
Since our probing tasks are classification-based, a verbalizer (Hu et al., 2022) is constructed to project the original probability distribution over the whole vocabulary onto the set of label words given by the probing task.
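As a rough sketch of how a verbalizer narrows the vocabulary distribution down to the label set (names and signature are illustrative assumptions; the Soft Verbalizer used in the paper's implementation learns a continuous vector per class rather than indexing fixed label-word ids):

```python
import numpy as np

def verbalize(vocab_logits: np.ndarray, label_word_ids: list) -> np.ndarray:
    """Project raw logits over the whole vocabulary onto the probing label set.

    vocab_logits: (vocab_size,) language-model-head logits at the prediction position
    label_word_ids: vocabulary ids of the label words, one per class
    """
    class_logits = vocab_logits[label_word_ids]      # (num_classes,)
    exp = np.exp(class_logits - class_logits.max())  # numerically stable softmax
    return exp / exp.sum()                           # distribution over classes
```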

Multi-Source Attention
Different downstream probing tasks have various focuses on the location of the required information (Sankar et al., 2019), while the general attention module of dialogue models tends to fail in locating and extracting information from multiple aspects and sources in a fine-grained way.
Therefore, we propose a multi-source attention mechanism that applies multiple cross-attention masks, each corresponding to a different source, throughout the decoding process. This prevents the key information in the dialogue context from being overshadowed when the significance of information is unevenly distributed, and generates more reasonable attention that extracts relevant local features from different sources for probing classification.
As shown in Figure 2, the multi-source attention module allocates different cross-attention patterns over the encoded representations for the soft prompt tokens. We adopt three types of attention masks to operate the multi-source cross-attention, called history-source, current-source, and integrated-source. The history-source and current-source attention functions attend only to the relevant portion of the dialogue, while the integrated-source cross-attention mask gathers information from the full dialogue context.
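The three masks can be sketched as binary vectors over the encoder positions (a hypothetical layout that assumes the history tokens precede the current utterance in the encoded sequence):

```python
import numpy as np

def build_source_masks(history_len: int, src_len: int) -> dict:
    """Cross-attention masks for the three sources (1 = attendable, 0 = blocked).

    Positions [0, history_len) hold the dialogue history; the remaining
    positions hold the current utterance.
    """
    history = np.zeros(src_len)
    history[:history_len] = 1.0     # history-source: history tokens only
    current = np.zeros(src_len)
    current[history_len:] = 1.0     # current-source: current utterance only
    integrated = np.ones(src_len)   # integrated-source: the full context
    return {"history": history, "current": current, "integrated": integrated}
```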
In the Prompt-Template, every three consecutive tokens are grouped into one combined unit, which is designed to perform one round of coverage during the probing process. We assign the three soft tokens in the last combined unit as the prediction position, where the averaged logit is passed to the subsequent verbalizer function; we then calculate the loss for the corresponding classes. This step fuses the information captured by the soft-token representations that independently focus on the history, current, and entire sections of the input text.
The decoder f_θ updates the hidden state H_i conditioned on the past decoder states H_{<i} with self-attention, the soft token embedding S_i, and the encoded representations E_k with cross-attention, where k takes values in K = {history, current, integrated}:

H_i = f_θ(H_{<i}, S_i, E_k), k ∈ K
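A single-head sketch of cross-attention restricted by one source mask (illustrative only; the actual models use multi-head attention with learned query, key, and value projections):

```python
import numpy as np

def masked_cross_attention(query: np.ndarray, enc_states: np.ndarray,
                           source_mask: np.ndarray) -> np.ndarray:
    """Cross-attention over the encoder states, restricted to one source k.

    query: (d,) decoder state for one soft token
    enc_states: (src_len, d) encoder representations E_k
    source_mask: (src_len,) 1 for attendable positions, 0 otherwise
    """
    scores = enc_states @ query / np.sqrt(enc_states.shape[-1])  # (src_len,)
    scores = np.where(source_mask > 0, scores, -np.inf)          # block other sources
    weights = np.exp(scores - scores.max())
    weights = weights / weights.sum()                            # attention weights
    return weights @ enc_states                                  # (d,) context vector
```

With a current-source mask, for example, all attention mass falls on the current-utterance positions regardless of how salient the history tokens are.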

Late Fusion
Previous studies (Vig, 2019;Vig and Belinkov, 2019) have shown that the attention mechanism is sensitive to local features, while global features of conversations should also be considered to produce a comprehensive representation.
Thus we introduce the late fusion module as a merging strategy that generates a comprehensive representation from the encoder hidden states, which is combined with the probing decoder states after passing through an MLP layer with dropout. Late fusion captures the global information encoded by the language model, complementing the probing decoder that extracts precise local features, and thus allows higher sensitivity to various features of the linguistic information.
Here we implement late fusion by averaging the encoder hidden states E_k that focus on the desired section of the text, where k ∈ K = {history, current, integrated}. This averaging yields the representation L_k, and the final representation A_k is obtained by combining the decoder hidden states H_k with the output of the late fusion module. The probability P_M(y|x) of probing label y is then computed by applying the language model head V, which projects the raw logits to the label space, to the representations averaged over the n combined units in the soft token sequence.
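A minimal sketch of the late fusion step, assuming an additive combination of the decoder state with the MLP-projected pooled encoder states (the exact combination operator is not specified here, so the addition below is an assumption):

```python
import numpy as np

def late_fusion(enc_states, source_mask, dec_state, mlp):
    """Combine global and local features for one source k.

    enc_states: (src_len, d) encoder hidden states E_k
    source_mask: (src_len,) marks the desired section of the text
    dec_state: (d,) probing-decoder hidden state H_k (local features)
    mlp: callable projecting the pooled global feature L_k
    """
    mask = source_mask[:, None]
    l_k = (enc_states * mask).sum(axis=0) / mask.sum()  # L_k: masked mean pooling
    return dec_state + mlp(l_k)                         # A_k: fused representation
```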

Experiment
4.1 Probing Tasks

TREC (Li and Roth, 2002) is a question classification dataset consisting of questions labeled with relevant answer types. The task aims to determine the information category the question is requesting. DialogueNLI (Welleck et al., 2019) is a natural language inference task consisting of dialogue turns with entailment, contradiction, and neutral labels. MultiWOZ (Eric et al., 2020) is a multi-domain, goal-directed conversational dataset for exploring natural language comprehension. The Schema-Guided Dialog dataset (SGD) (Rastogi et al., 2020) is an intent-tracking task that requires reasoning over multiple turns of dialogue. SNIPS (Coucke et al., 2018) is an intent classification task with crowd-sourced, single-turn queries labeled for intent. ScenarioSA (Zhang et al., 2019) is a sentiment classification task with turn-level sentiment labels and inputs from multi-turn, open-ended dialogues. DailyDialog Topic uses dialogues from the DailyDialog dataset (Li et al., 2017b) to create a probing task where the goal is to infer the topic of conversations (Saleh et al., 2020).

Probing Methods
Vanilla MLP-based Probing adopts a two-layer Multi-Layer Perceptron classifier (Saleh et al., 2020; Parthasarathi et al., 2020), which takes as input the word embeddings and encoder states corresponding to the conversation context of the pre-trained dialogue models. Prompt-based Probing applies prompt learning as the principle for performing probing tasks. During training, only the embeddings of soft prompt tokens and the verbalizer parameters are fine-tuned, while the pre-trained encoder and decoder parameters are fixed (Liu et al., 2021). Multi-Source Probing is the proposed approach, characterized by its multi-source attention and late fusion modules. The parameters of the verbalizer, prompt token embeddings, and decoder are fine-tuned during training.

Open-domain Dialogue Models
We first train three widespread generative dialogue models for 20 epochs on the DailyDialog dataset (Li et al., 2017b), using the maximum-likelihood objective (Sutskever et al., 2014). Specifically, we train a Transformer from scratch and fine-tune BlenderBot-small and DialoGPT-small from pre-trained parameters on DailyDialog for a fair comparison, as adopted in prior work (Saleh et al., 2020).
Besides, we adopt BlenderBot-medium [400M] and DialoGPT-medium [345M] without fine-tuning for the inspection of larger pre-trained models. Transformer (Vaswani et al., 2017) is a typical language model with multiple attention layers. We implement it in an encoder-decoder structure, with a 2-layer encoder and a 2-layer decoder. BlenderBot (Roller et al., 2020) is a dialogue model with an encoder-decoder architecture, first pre-trained on 1.5B Reddit comment threads (Baumgartner et al., 2020) and later fine-tuned on the Blended Skill Talk (BST) dataset (Smith et al., 2020).
DialoGPT (Zhang et al., 2020) is a dialogue response generation model for multi-turn conversations with a single decoder, which is pre-trained on large-scale Reddit data (Baumgartner et al., 2020).

Analysis
We will detail experimental results in this section, including the analysis of our main experiments on the performance of different probing methods, the ablation study of Multi-Source Probing architectures, and evaluations of different dialogue models and experimental settings.

Main Results
The main results are presented in Table 1, where each probing task is evaluated by calculating an average accuracy score. We analyze the results from the following perspectives.

Comparison between methods: For all probing tasks and dialogue models, MSP achieves the best performance, indicating that our method more effectively leverages the relevant information encoded in the intermediate representations to conduct probing tasks. Besides, the majority of encoder-state results outperform word-embedding results, indicating that encoder states contain more semantic features than word embeddings. Thus, to some extent, dialogue models learn semantic information from conversations, which is required in conversational understanding tasks.
For the MLP-based probing method, we observe a phenomenon on the Transformer model similar to that reported in prior work (Saleh et al., 2020): the performance of encoder states is not superior in many tasks. This observation demonstrates that the vanilla probing method has limitations in utilizing encoder states to conduct conversational understanding tasks, due to the gap between downstream classification tasks and the pre-trained dialogue generation task (Liu et al., 2021; Schick and Schütze, 2021). The prompt-based probing approach obtains the worst performance among the three probing methods. Although it performs the probing tasks in a generative manner, this approach cannot effectively extract the relevant features required in conversational understanding tasks, leading to undesirable results.
Comparison between models: As the parameters of dialogue models increase, the performance on dialogue understanding tasks also improves under the MSP method. However, DialoGPT does not outperform BlenderBot on some tasks, and the performance of encoder states is not always significantly better than that of word embeddings. This is probably because DialoGPT adopts a single-decoder structure with a uni-directional attention mechanism, which encodes only partial context (content before each token) for tokens, leading to asymmetric representations of dialogue history and current utterances. By contrast, the encoder-decoder-based BlenderBot applies a bi-directional attention mechanism to encode bi-directional information (content before and after each token) for tokens, achieving more consistent and superior performance on conversational understanding tasks.
We also evaluated the comprehension ability of pre-trained language models such as BERT, BART, and T5 through MSP; the details are attached in Appendix A.2 (Table 6).

Ablation Study
Here we set up a series of ablation experiments to investigate the validity and necessity of the components of our Multi-Source Probing approach, which is composed of two main parts: the multi-source attention mechanism and the late fusion module. The multi-source attention mechanism is designed to pay fair attention to both the historical and current turns of the conversational context. The late fusion module is motivated by the fact that many probing tasks need to capture global features while maintaining high sensitivity to local features, so we add a late fusion layer after the decoder to fuse global and local features in the final representation. Due to the length limit, the results are presented in Table 4 in the Appendix.

Ablation Setting
Several sets of ablation experiments are designed to verify the necessity and effectiveness of different modules: MSP w/o LF is an ablation setting where the late fusion module is removed from the complete MSP method.
MSP w/o MS is an ablation setting where the multi-source attention and late fusion modules are removed. This approach has the same architecture as the prompt-based probing method, except that the decoder is fine-tuned during training.

Effect of Late Fusion Module
In our experiments, we run ablations on the effectiveness of late fusion. Through this analysis, we discover that the late-fusion architecture significantly improves the probing model's ability to extract the representational information encoded by the pre-trained language model. Furthermore, it not only incorporates the information provided by the encoder well but also provides a nuanced view of shallow-level representational information for our probing architecture, making the structure more hierarchical and efficient at integrating encoded information from different depths.
We find from our experimental results that when the Multi-Source Probing method lacks a late-fusion layer, the probing results of encoder states on several tasks are no better than those of word embeddings, which indicates that discarding late fusion also discards some local features needed for classification. The ground-truth labels for these tasks are often determined by a combination of certain keywords in the historical utterances and obvious prompt words in the current utterance, so simply using the representations generated by the multi-layer transformer decoder tends to draw out only global features and ignore local ones. This step of the experiment thus verifies the significant role of the late fusion module in synthesizing local features.

Effect of Multi-Source Attention
We further developed ablation experiments to explore the impact of discarding the multi-source attention module of our MSP method. A notable finding is that the accuracy on probing tasks drops by more than 10% when the MSP method does not utilize the multi-source attention module, demonstrating the need for specialized designs that focus on different parts of the input text when evaluating the conversational understanding of a language model. We also experimentally found that a single cross-attention module is not effective in accomplishing our probing purpose; thus the multi-source attention mechanism is a central component of our probing approach.

Extension Experiments
Considering the undesirable performance of the prompt-based probing approach shown in Table 1, we adopt the MSP and MLP methods in the extension experiments.

Impact of Model Scale
Table 2 shows the performance of large-scale pre-trained conversational models on probing tasks. As can be seen, as model parameters increase, the performance on dialogue comprehension tasks also increases. The pre-trained dialogue models demonstrate a strong capability of conversational understanding even without fine-tuning on downstream dialogue corpora. For the MLP-based probing method, there is no obvious difference between the performance of encoder states and word embeddings in many tasks. By contrast, our approach is applicable to models of different scales, from the 2-layer Transformer trained from scratch to the large-scale pre-trained BlenderBot and DialoGPT.

Impact of Word Embedding
During the training process of dialogue generation, word embeddings can learn and encode linguistic knowledge of conversations (Ravichander et al., 2020), as Word2Vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014) do. Thus, we substitute the trained word embeddings with randomly initialized ones and conduct probing experiments to investigate the impact of word embeddings.
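The substitution can be sketched as re-initializing the embedding matrix while keeping its shape (the std = 0.02 scale is a common transformer initialization convention we assume here, not a detail taken from the experimental setup):

```python
import numpy as np

def randomize_embeddings(embedding_matrix: np.ndarray, seed: int = 0) -> np.ndarray:
    """Return a randomly re-initialized embedding matrix of the same shape,
    discarding everything learned during dialogue pre-training."""
    rng = np.random.default_rng(seed)
    # std = 0.02 follows the common transformer initialization convention.
    return rng.normal(loc=0.0, scale=0.02, size=embedding_matrix.shape)
```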
The performance with random embeddings is shown in Table 3. As we can see, there is a significant gap between the performance of random word embeddings and the original ones, indicating that the trained word embeddings encode semantic information required for conversational understanding. In addition, the performance with random embeddings is above 90% on some tasks, such as SNIPS and MWOZ, while below 70% on others. This shows that different tasks require different degrees of conversational semantics.

Impact of Dialogue Structure
To examine whether dialogue models leverage dialogue structure in conversational understanding, we shuffle the order of the input tokens within the dialogue history and the current utterance respectively. The results are shown in Table 3. We note that the MLP method weakens word-order features through its average pooling operation, while MSP offers superior modeling of word order.
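The shuffling protocol can be sketched as follows (a hypothetical helper; it assumes history tokens precede current-utterance tokens in the input sequence):

```python
import random

def shuffle_within_segments(tokens, history_len, seed=0):
    """Shuffle token order within the dialogue history and within the current
    utterance independently, leaving the segment boundary intact."""
    rng = random.Random(seed)
    history = list(tokens[:history_len])
    current = list(tokens[history_len:])
    rng.shuffle(history)
    rng.shuffle(current)
    return history + current
```

Shuffling each segment separately destroys word order while preserving which tokens belong to the history versus the current utterance.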
We observe that the performance of both word embeddings and encoder states decreases substantially with shuffled input. Furthermore, the performance gap is even larger for encoder states, indicating that dialogue models capture the context and flow of the dialogue for conversational understanding, rather than just processing individual words or utterances in isolation.

Conclusion
In this paper, we propose a Multi-Source Probing (MSP) method to probe the dialogue comprehension abilities of open-domain dialogue models. It conducts probing tasks in a generative manner that is consistent with the pre-training task of dialogue models. Besides, we propose the multi-source attention mechanism to aggregate features from multiple sources and the late fusion module to capture global features for downstream tasks. Our experimental results indicate the validity and reliability of the MSP method, which can also offer insight into the impact of model scale, embedding quality, and dialogue structure on the conversational understanding capability of dialogue models under particular experimental settings. This research underscores the importance of a comprehensive probing framework for dialogue models and paves the way for future studies aimed at enhancing their understanding capabilities.

Limitations
Although the Multi-Source Probing (MSP) method can precisely detect the conversational understanding of open-domain dialogue models of different scales, it still faces two limitations. First, we focus on evaluating three widespread dialogue models in our experiments due to the limitation of computational resources. Dialogue models of different structures and scales could be probed with MSP in future work. Second, we adopt several representative classification tasks as our probing tasks, following previous work (Saleh et al., 2020). These tasks require different dialogue comprehension skills and have different degrees of difficulty, as analyzed in Section 5.3.2. In future work, a wide range of tasks of different complexity in different domains could be conducted based on MSP to construct a benchmark of conversational understanding.

A Discussion
A.1 Eliminate the Effect of Parameter Scales

We have already found a positive correlation between parameter size and probing performance, so here we further explore the gap between MLP and MSP performance at the same parameter size, to address the concern about the effect of the number of parameters on the probing results.
We extended the original two-layer MLP to a parameter scale consistent with MSP, and the results are attached in Table 5. Although the parameter size of the MLP is increased to a level comparable to that of MSP, the results are still sub-optimal because it does not operate in a manner consistent with the goals of the dialogue model's pre-training phase. These supplementary results demonstrate the effectiveness of the MSP structure.

A.2 Generalization on Pre-trained Language Models
We also evaluated MSP on state-of-the-art pre-trained language models. According to the results for BERT, BART, and T5 in Table 6, we conclude that MSP still outperforms MLP even on pre-trained language models. Among the three models of comparable parameter size, BART has the most outstanding ability for dialogue understanding. BERT performs best on the topic classification task, and T5 performs very well on intent detection, which is consistent with the characteristics and pre-training goals of the individual models.
We found that BlenderBot-small outperformed all three general pre-trained models in terms of accuracy on the MWOZ, SGD, and Topic tasks, while DialoGPT-small performs best on the TREC and SNIPS tasks. Another point to note is that all three models included in the supplemental experiments have over 25% more parameters than the corresponding BlenderBot-small and DialoGPT-small.

A.3 Non-linear Probing Finetuning
Non-linear probing is widely adopted in previous work (Belinkov et al., 2017; Belinkov and Glass, 2017; Conneau et al., 2018). In fact, non-linear probing and linear probing behave similarly on our probing tasks: even classifiers with a shallow linear structure can fit these tasks well. To this end, we add a linear probing experiment, in which the original/random embeddings of the dialogue model are fed to a single linear layer. The results in Table 7 show that the linear probing method can also achieve more than 95% accuracy with random embeddings on the SNIPS task. This also supports the earlier conclusion that linear probing is not an exclusive probing skill and that non-linear probing has stronger probing performance in many aspects. We also add the experimental results of MSP on BERT (see Table 6), where we can see that BERT does not outperform pre-trained open-domain dialogue models of comparable size on probing tasks for language understanding.

B Dataset Examples
Examples of the probing datasets are shown in Table 9.

C Implementation Details
We implemented the above models with PyTorch (Paszke et al., 2017), OpenPrompt (Ding et al., 2021), and the Huggingface library (Wolf et al., 2019a). When implementing the Multi-Source Probing method upon DialoGPT, we introduced randomly initialized cross-attention parameters together with the decoder for fine-tuning, since DialoGPT is not an encoder-decoder-based model. We utilized a Soft Verbalizer (Hambardzumyan et al., 2021; Hu et al., 2022), where a continuous vector is designed for each class label, to generate the probability distribution over the class label space by calculating the dot product between the output of the language model and the class vector. The class vectors are initialized with the pre-trained token embeddings and are fine-tuned through training. All models in this paper are optimized with Adam (Kingma and Ba, 2014), with the learning rate and dropout rate optimized through grid search. The number of soft tokens is empirically set to 12 in our Multi-Source Probing method. The learning rate is searched within {5 × 10^-5, 1 × 10^-4, 2 × 10^-4, 3 × 10^-4, 5 × 10^-4} and the dropout rate within {0.1, 0.2, 0.3, 0.4, 0.5}. The batch sizes for Transformer, BlenderBot, and DialoGPT are 128, 64, and 16 respectively. Accuracy and standard deviation figures in this paper are calculated from the results of 5 replicate experiments. We conducted our probing experiments on NVIDIA V100 Tensor Core GPUs; the average run-time for each probing task is about 5 hours.

Table 2 :
The performance of MLP and MSP on two large-scale pre-trained dialogue models, BlenderBot-medium and DialoGPT-medium. Best results are marked in bold, and data that passed the significance test (t-test, p-value < 0.05) are superscripted with an asterisk *. The numbers in square brackets represent the standard deviation.

Table 3 :
The performance with random embeddings and with shuffled word order in the dialogue context. Data that passed the significance test (t-test, p-value < 0.05) are superscripted with an asterisk *.

Table 4 :
The performance of different dialogue models on probing tasks in the ablation experiments. Here we introduce three ablation settings: 1) MSP: the complete Multi-Source Probing method; 2) MSP w/o LF: MSP without the late fusion module; 3) MSP w/o MS: MSP without the multi-source attention mechanism and the late fusion module. Best results are marked in bold, and data that passed the significance test (t-test, p-value < 0.05) are superscripted with an asterisk *.

Table 5 :
The performance of the MSP and MLP-Deep methods on probing tasks. Best results are marked in bold, and data that passed the significance test (t-test, p-value < 0.05) are superscripted with an asterisk *.

Table 6 :
The performance of the MSP and MLP methods with the state-of-the-art pre-trained language models BERT, BART, and T5 on several probing tasks. Best results are marked in bold, and data that passed the significance test (t-test, p-value < 0.05) are superscripted with an asterisk *.

Table 7 :
The performance of the MLP and Linear-Probing methods with original and random embeddings on several probing tasks. Data that passed the significance test (t-test, p-value < 0.05) are superscripted with an asterisk *.

Table 8 :
Training results of dialogue models.
[User1]: Happy birthday, Jim! Here is a present for you. [User2]: Oh, great! I love it! [User1]: I'm very glad to hear that. [User2]: Come here, let me introduce some friends to you.
Label: relationship

Table 9: Examples from probing tasks.