Do Encoder Representations of Generative Dialogue Models Have a Sufficient Summary of the Information About the Task?

Predicting the next utterance in dialogue is contingent on encoding the user's input text to generate an appropriate and relevant response in data-driven approaches. Although the semantic and syntactic quality of the generated language is evaluated, more often than not the encoded representation of the input is not evaluated. As the encoder representation is essential for predicting an appropriate response, evaluating it is a challenging yet important problem. In this work, we show that evaluating the generated text through human or automatic metrics is not sufficient to appropriately evaluate the soundness of a dialogue model's language understanding and, to that end, propose a set of probe tasks to evaluate the encoder representation of different language encoders commonly used in dialogue models. From our experiments, we observe that some of the probe tasks are easier and some are harder even for sophisticated model architectures to learn. Further, we observe that RNN-based architectures score lower on automatic text generation metrics than the Transformer model but perform better than the Transformer on the probe tasks, indicating that RNNs might preserve task information better than Transformers.


Introduction
The task of dialogue modeling requires learning through interaction, often with humans. The model is expected to understand the input text in order to interact, and the interaction can be meaningful only when the language understanding gets better. Approaches for solving the dialogue task include information retrieval based approaches, like selecting a response from a set of canned responses (Lowe et al., 2015a), keeping track of very specific information marked a priori as informative slot-value pairs (Guo et al., 2018; Asri et al., 2017), or generating the next response token by token (Vinyals and Le, 2015; Lowe et al., 2015a; Serban et al., 2015; Li et al., 2016, 2017; Parthasarathi and Pineau, 2018). The evaluation of the different approaches has mostly relied on the output of the model: the slot predicted, the response selected or generated.
The issues in evaluation showcased by Liu et al. (2016), namely automatic evaluation metrics being uncorrelated with human judgement, are still an open problem. Attempts to mimic human scores for a better evaluation metric (Lowe et al., 2017) and other metrics that aim to correlate with human judgement (Sinha et al., 2020; Tao et al., 2018) evaluate the quality of the generated text but do not evaluate the language understanding component of a model. The language understanding component of an agent more often than not goes unnoticed, with only token-level evaluation metrics on the generated text.
To that end, we propose evaluating the encoder representation of dialogue models through probe tasks constructed from two commonly used dialogue data sets: MultiWoZ (Budzianowski et al., 2018) and PersonaChat (Zhang et al., 2018). Concretely, we use the representation learnt by the encoders while training on dialogue generation tasks to solve a set of dialogue-related classification tasks, as a proxy for probing the information encoded in the encoder representation. We study the performance of language encoders on 17 different probe tasks with varying degrees of difficulty: binary classification, multi-label classification and multi-label prediction. Examples include predicting whether the current dialogue has a single task or multiple tasks, identifying the number of tasks, identifying the tasks themselves, and detecting the presence of a specific piece of information provided by the user, among many others. The probe tasks provide a way to quantify the understanding of a model and help identify biases, if any, in the task of dialogue prediction. We observed that the performance of the models on the probe tasks fluctuates little across seed values, thus allowing us to analyse the encoder representation with minimal variance. Further, the experiments on probe tasks help in understanding deeper differences between recurrent neural network (RNN) and Transformer encoders that were previously not evident from token-level evaluation methods.
Our contributions in the paper are:
• Showcasing the significantly high variance in human evaluation of dialogues.
• Proposing a list of probe tasks -2 semantic, 13 information specific and 3 downstream as an alternate evaluation of dialogue systems.
• Finding that the representation learnt by recurrent neural network based models is better at solving the probe tasks than the one learnt by Transformer-based models.

Related Work
Evaluating dialogue models has been an important topic of study. While many of the metrics have focussed on evaluating the generated text through n-gram overlap based heuristics such as BLEU (Papineni et al., 2002), ROUGE (Lin, 2004) and METEOR (Lavie and Agarwal, 2007), there have also been learned metrics like ADEM (Lowe et al., 2017), MAudE (Sinha et al., 2020) and RUBER (Tao et al., 2018), among other metrics (Celikyilmaz et al., 2020). Though language generation has been an important component of study, there are not many studies that benchmark the soundness of information encoding by dialogue systems. Probe tasks in language generation (Conneau et al., 2018; Belinkov and Glass, 2019; Elazar et al., 2020) have been used to understand the information encoded in continuous embeddings of sentences. Such probe tasks are set up as classification tasks that are solved with the model-learnt representation. As it is easier to control the biases in probe tasks than in downstream tasks, research in language generation has analysed models on probe tasks ranging from using the encoder representation to identify words in the input (WordCont) to measuring the encoder's sensitivity to shifts in bigrams (Conneau et al., 2018; Belinkov and Glass, 2019).
Analysis using probe tasks has also been done in reinforcement learning (RL). Anand et al. (2019) learn state representations for an RL agent in an unsupervised setting and introduce a set of probe tasks to evaluate the representations learnt by agents. This includes using an annotated data set with markers for the position of the agent, the current score, items in the inventory and the target's location, among others. The authors train a shallow linear classifier to identify specific entities in the embedded input, which serves as a metric for the representational soundness of the learning algorithm.
Applications of computer vision like caption generation for images (Vinyals et al., 2015) or videos (Donahue et al., 2015) use attention-based models to parse over the hidden states of a convolutional neural network (ConvNet) (LeCun et al., 1998). The attention over the ConvNet features is visualized to observe the words corresponding to different parts of the image. Visualizing the attention has been one of the qualitative probe tasks for text generation conditioned on images (Xu et al., 2015).

Dialogue probe Tasks
Like other tasks, the dialogue task requires a learning agent to have sufficient understanding of the context to generate a response; at times, models have been shown to lack even basic understanding, leading to incorrect response prediction. Although dialogue models are evaluated on the grammar, semantics and relevance of the generated text, seldom has that been extended to evaluate the language encoding capacity of these models. The tasks proposed and discussed in this paper are shown in Table 1.

Basic Probe Tasks
The basic probe tasks evaluate whether the encoder representation can be used to predict the existence of a mid-frequency token in the context (WordCont) (Belinkov and Glass, 2019), or test whether the encoding of the context provides information on how long the dialogue has been going on (UtteranceLoc) (Sinha et al., 2020). For the UtteranceLoc task, the conversation is split into 5 different temporal blocks and a classifier trained on the encoded context embedding is used to predict the appropriate label.
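As a concrete illustration, the temporal-block labelling for UtteranceLoc could be sketched as below; the function name and the equal-width binning are our assumptions, not released code from the paper.

```python
# Hypothetical sketch of UtteranceLoc label construction: the conversation
# is split into 5 temporal blocks and each context is labelled with the
# block its last utterance falls in. Names and binning are illustrative.
def utterance_loc_label(utterance_index, dialogue_length, n_blocks=5):
    """Return the temporal block (0 .. n_blocks-1) of an utterance."""
    frac = utterance_index / dialogue_length  # fraction of dialogue elapsed
    return min(int(frac * n_blocks), n_blocks - 1)
```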

Information Specific Probe Tasks
We construct 12 information specific probe tasks to evaluate whether specific information is retained in the encoder representation of the input text. The information specific tasks have different levels of difficulty. For example, IsMultiTopic is a binary classification task, NumAllTopics is a multi-label classification task, while AllTopics is a multi-label prediction task.
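To make the difficulty levels concrete, a possible construction of these three labels from a dialogue's annotated topic list is sketched below; the annotation format and function name are assumptions, not the MultiWoZ schema.

```python
# Illustrative construction of three information specific labels from an
# assumed list of annotated topics for a dialogue.
def topic_labels(topics):
    unique = sorted(set(topics))
    return {
        "IsMultiTopic": int(len(unique) > 1),  # binary classification
        "NumAllTopics": len(unique),           # multi-label classification
        "AllTopics": unique,                   # multi-label prediction
    }
```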

Downstream probe Tasks
Further, we evaluate the language understanding of dialogue models through their performance on relevant downstream tasks. Towards evaluating the model's understanding of the user utterance, the downstream probe tasks verify whether the encoder representation allows predicting the user dialogue act. Dialogue state tracking measures the performance of a model on such tasks (Henderson et al., 2014), but it is seldom evaluated on generative dialogue models. Neelakantan et al. (2019) use entity, value and action information to train on the dialogue generation task, but the performance of a generative dialogue model without explicit training on the downstream tasks is not compared. Towards that, we propose the ActionSelect, EntitySlots and EntityValues probe tasks. The details of the tasks are shown in Table 1.

Data sets
With the probe tasks, we study different dialogue encoder architectures trained on next utterance generation on the MultiWoZ 2.0 (Budzianowski et al., 2018) and PersonaChat (Zhang et al., 2018) data sets. To comprehensively compare several model selection criteria, we experimented with selecting models based on BLEU (Papineni et al., 2002), ROUGE-F1 (Lin, 2004), METEOR (Lavie and Agarwal, 2007) and vector-based (average BERT embedding) metrics. We present the results with BLEU as the selection criterion in the paper. Further, in the Appendix we compare the evolution of the performance of different models on the probe tasks over the entire training.
The classification tasks for probing the encoder representation are constructed for every generated response that requires information from the dialogue history thus far. We split the probe tasks into Train/Test/Valid corresponding to the splits the tasks are constructed from. First, we train the dialogue models on end-to-end dialogue generation and use the encoder representation to train and test on the probe tasks. To that end, we store the encoder parameters after every epoch during dialogue generation training and compute the results of the probe tasks after every epoch.

Models
We train 5 commonly used encoder architectures on the task of next utterance generation on the two data sets.
LSTM ENCODER-DECODER The architecture (Vinyals and Le, 2015) has an LSTM cell that encodes the input context in the forward direction only. For a sequence of words in the input context (w^i_1, w^i_2, ..., w^i_T), the LSTM encoder generates hidden states {h_t}, t = 1, ..., T. The decoder LSTM's hidden state is initialized with h_T and the decoder outputs one token at each step of decoding. For the experiments, we used a two-layer LSTM cell, where the first layer applies the recurrent operation on the input to the model and the layer above recurs on the outputs of the layer below. The encoder's final hidden state (from the 2nd layer) is passed as an input to the decoder. We train the model with the cross entropy loss shown in Equation 1:

L(θ) = − Σ_t y_t log p(ŷ_t),   (1)

where y_t is the t-th ground-truth token distribution in the output sequence, ŷ_t is the model-generated token and p is the model-learnt distribution over the tokens. We train the model with the Adam optimizer (Kingma and Ba, 2014) and teacher forcing (Williams and Zipser, 1989).
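The token-level loss in Equation 1 can be sketched numerically as follows; the function is an illustrative numpy stand-in rather than the training code, and `p_dists` plays the role of the decoder's per-step softmax outputs.

```python
# Illustrative numpy sketch of the sequence cross entropy loss (Equation 1).
# `target_ids` are ground-truth token indices; `p_dists` stands in for the
# decoder's softmax distribution over the vocabulary at each step.
import numpy as np

def sequence_cross_entropy(target_ids, p_dists):
    """Mean negative log-probability assigned to the ground-truth tokens."""
    return -float(np.mean([np.log(p[t]) for t, p in zip(target_ids, p_dists)]))
```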

LSTM ENCODER-ATTENTION DECODER
The architecture is similar to the LSTM Encoder-Decoder, with the exception of an attention module in the decoder. The attention module (Bahdanau et al., 2014) linearly combines the encoder hidden states {h_t}, t = 1, ..., T, as an input to the decoder LSTM at every step of decoding, instead of using only the last encoder hidden state.
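The attention step can be sketched as follows; we simplify the learned alignment network of Bahdanau et al. (2014) to dot-product scoring for brevity, so treat the scoring function as a placeholder rather than the models' exact module.

```python
# Simplified attention sketch: score each encoder state against the current
# decoder state, softmax over time steps, and return the weighted sum.
import numpy as np

def attention_context(encoder_states, decoder_state):
    """Linearly combine encoder hidden states {h_t} with softmaxed scores."""
    scores = np.asarray(encoder_states) @ np.asarray(decoder_state)  # (T,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                      # softmax over time steps
    return weights @ np.asarray(encoder_states)   # context vector, shape (d,)
```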

HIERARCHICAL RECURRENT ENCODER-DECODER
The model has encoding done by two encoder modules acting at different levels (Sordoni et al., 2015): a sentence encoder that encodes the sentences and feeds its output into a context encoder. Both encoders are LSTMs. The decoder is an attention decoder.

BI-LSTM ENCODER-ATTENTION DECODER
The encoder is a concatenation of two LSTMs that read the input in the forward and backward directions (Schuster and Paliwal, 1997). The hidden state is computed as the summation of the hidden states of the two encoders. The decoding is done with an attention decoder.
TRANSFORMER ARCHITECTURE This state-of-the-art architecture (Vaswani et al., 2017; Rush, 2018) is a transductive model that has multiple layers of attention to predict the output. We used the architecture in an encoder-decoder style by splitting half the layers for encoding and the remainder for decoding. We perform the probe tasks on the encoder hidden state computed as an average over word token attention.
The sizes of the models used in the experiments are detailed in Table 7 in the Appendix. For the probe tasks, we select the untrained model, the model with the best BLEU score on validation, and the model from the last training epoch. We use the packages pytorch (Paszke et al., 2017) and scikit-learn (Pedregosa et al., 2011) for our experiments.

Motivation for Dialogue Probe Tasks
The texts generated by the models are largely dependent on the choice of seed values, and a slight variation could result in a model generating a very different response. Although the automatic metrics have greater agreement on the score across seed values, we see that human participants do not agree on the consistency of the generated responses. We pose and evaluate an alternate hypothesis where we expect the participants to identify two responses as similar when they are selected from different runs of the same model with different seed values that have similar BLEU scores. For the study, we sample 2000 context-response pairs from the MultiWoZ data set, taken from two different runs of the model with the lower variance in BLEU score (Table 3), Bi-LSTM Attention. We ask the participants to select the response that is more relevant to the given context, similar to Li et al. (2015). The annotators can select either of the responses or a Tie. For every context-response pair, we collect feedback from 3 different participants (the distributions corresponding to the 3 different human responses are shown with the legends Human-Exp1, Human-Exp2 and Human-Exp3 in Figure 1). Usually, human evaluation is done on 100-500 responses. To understand the variance in this set-up and the lack of information at the token generation level, we sample 50000 sets of 200 human responses from the collected 2000 responses and compute the fraction of times there was a tie. We observed that the distribution over the fraction of times the human participants selected a Tie was centered around 35% (Figure 1), with all of the probability mass within 50%. This shows that (a) the same model produces significantly different responses with different seed values, and (b) attributing the performance of a model to the choice of seed value creates confusion in the evaluation, because the two seeds had similar BLEU scores. The results show that evaluating only the text generated by a model is not suggestive of the information encoding capacity of the encoder representation. Also, the dependence of the model-generated text on the seed value raises a valid concern: whether model parameters initialized with a specific seed value merely mimic the token generation of a model that actually encodes sufficient information from the context. The lack of clarity leads to inconclusiveness of studies that rely on human evaluation to show whether dialogue models have sufficient information encoded to solve the task effectively.
The human evaluation study was approved by an IRB.

Probe Tasks
We train the models on the two dialogue data sets on next utterance generation. To understand the evolution on the probe tasks, we compare 3 different parameter configurations of every model: Untrained, Last epoch, and BestBLEU. We use the Logistic Regression classifier implementation from scikit-learn (Pedregosa et al., 2011) with default parameters, except max_iter set to 250, for all the probe tasks. The evaluation metric is the F1-score, with micro averaging in multi-class prediction tasks.
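The probe set-up above amounts to a frozen-encoder linear evaluation; a minimal sketch, assuming the encoder representations have already been extracted into feature matrices:

```python
# Minimal sketch of the linear probe: scikit-learn LogisticRegression with
# default parameters except max_iter=250, scored with micro-averaged F1.
# The feature matrices stand in for precomputed encoder representations.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

def probe_f1(X_train, y_train, X_test, y_test):
    clf = LogisticRegression(max_iter=250)
    clf.fit(X_train, y_train)
    return f1_score(y_test, clf.predict(X_test), average="micro")
```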

PROBE TASKS ON PERSONACHAT
The models are evaluated on three probe tasks (Table 4): two basic and one information specific. UtteranceLoc and WordCont measure whether the encoded context suggests semantic awareness of the model, while PersonalInfo measures the amount of knowledge the model has about its persona from the encoding of the conversation history. In other words, it evaluates the extent to which the persona can be identified from the context encoding with a linear classifier. A better performance on these tasks indicates that the context encoding preserves information on the persona and the temporal order of the dialogue.
The PersonalInfo task is not specific to identifying personal information alone; it acts as an indicator of the information embedded in dialogues that goes unnoticed in the encoding. It was surprising to see that no model scored a reasonable F1. Although the Transformer model scored higher on BLEU (Table 3), the performance of the Transformer on the PersonalInfo task decreased throughout the training epochs (Table 4).
The tasks UtteranceLoc and WordCont evaluate whether encoder representations are indicative of how far along the conversation the model is, and whether they identify mid-frequency words in the target response, respectively. The Bi-LSTM model performed best on UtteranceLoc, while the Transformer model was not in the top 3.
We observe that the inductive biases of the RNN-based models enable random projections that are informative even without training. This correlates with independent observations in Tallec et al. (2019), which argues that random projections of temporal information hold non-negligible information. Similar observations can be made from the untrained Transformer model's performance on the probe tasks.
The RNN encoders project the context onto a smaller manifold through their recurrent multiplications, which regularizes the representation to observe structures, whereas the Transformer network's attention operations project the context onto a larger manifold that prevents loss in encoding, making the representation useful for the end task (Figure 2). This explains the RNN-based encoders performing well on UtteranceLoc while the Transformer model performs well on WordCont. The difference between the two classes of models is much more evident on the probe tasks in the MultiWoZ data set.
PROBE TASKS ON MULTIWOZ In the majority of the information specific tasks and in the downstream tasks (Table 5), we observed that RNN-based models perform significantly better than the Transformer model. Interestingly, we observed a pattern with the Transformer on the two data sets: the model's performance on the probe tasks decreased from the beginning of training till the end on all of the tasks, while for the rest of the models there was learning involved.
The encoder representations of the encoded contexts, downsampled with PCA to 2 components (Figure 2), show that the ranges of the two axes are different for the RNN-based and Transformer models. The context encodings of the Transformer lie on a much larger manifold. The attention layers help in spreading the data on a large manifold, so the model can retain almost all of the generation-task-related information it was trained on. This can be observed in the higher BLEU score the model achieves in language generation. But the reverse, generalizing from small data, is hard to come by, because the model does not have sufficient direct information to cluster beyond the surface-level signal of predicting the right tokens. This helps the Transformer model perform well on the token prediction task in language modelling, while abstracting information and generalizing appear to be difficult, as observed from its performance on the probe tasks.
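The projection used for Figure 2 can be reproduced in a few lines; the random matrix in the test below only stands in for real encoder hidden states.

```python
# Sketch of the Figure 2 projection: reduce (n_samples, d) encoder hidden
# states to 2 principal components and inspect the per-axis spread.
import numpy as np
from sklearn.decomposition import PCA

def project_2d(hidden_states):
    """Project hidden states onto their first 2 principal components."""
    return PCA(n_components=2).fit_transform(np.asarray(hidden_states))
```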
The RNN-based models have inductive biases that squish the input through tanh or sigmoid operations. From the visualizations and other results, we hypothesize that this aids the model in learning a regularized representation in a low-data set-up. But this can potentially be unhelpful when the input is a large set of samples with rich structure, as that requires a model to spread out aggressively. The Transformer architecture can thrive in such a set-up, as validated by the performance of large Transformer models like GPT (Radford et al., 2018), GPT-2 (Radford et al., 2019), GPT-3 (Brown et al., 2020), BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019), whereas the results on the probe tasks show that RNN-based models are adept at learning unsupervised structures for a better understanding of the input. We also note that the performance on probe tasks can serve as a pseudo-metric to measure the capacity of a model to generalize to unobserved structures in the inputs in a low-data scenario.

Discussion
Systematic evaluation of language understanding through probe tasks is important to analyze the correlation between input and output in complex language understanding tasks. We observed that most of the data sets collected for dialogue generation tasks (Lowe et al., 2015b; Ritter et al., 2011) do not provide tasks to sanity-check language understanding through probing encoder representations. The absence of probe tasks leads to drawing imperfect correlations, like the one between token-level accuracy and a model's encoding of dialogue information from the context. At this point one may wonder: why not train the model with all the probe tasks as auxiliary tasks for improved performance? Although that is a possibility, such a set-up does not evaluate a model's ability to generalize its understanding to unseen dialogues. One could potentially train a model with a fraction of the probe tasks as auxiliary tasks and evaluate on the rest; we leave that for future work.
It is also interesting to draw parallels to Unit Testing in software engineering (Koomen and Pol, 1999), where the smallest software components of a system are tested for their design and logical accuracy. The difference between a deterministic application software and a stochastic decision-making ML module is that the behavior of the ML system is data-driven, while for a software system it is driven by logic. Despite the difference, unit testing and probe tasks could share a common ground towards ensuring a better representation of the encoded contexts.

DIALOGUE MODELS As an alternative to token-level evaluation, comparison of different model architectures can be meaningfully made with an aggregate metric on the probe tasks in three groups of difficulty: easy (average untrained SEQ2SEQ F1 > 0.50), medium (0.25 < untrained F1 ≤ 0.50), and hard (untrained F1 ≤ 0.25). Such an analysis, as shown in Table 6, allows better inspection of a model's language understanding and a fairer comparison between the models. We can see from Table 6 that the models have difficulty in learning to solve the hard probe tasks from the encoder representations. The results can be used to build novel inductive biases for neural architectures that address one or a group of aspects in the language understanding of dialogue prediction models.

DIALOGUE DATA SETS The challenges in dialogue modeling have been evolving majorly because of increasingly complex data sets. But data sets on chit-chat dialogues often have little to no auxiliary tasks to evaluate the dialogue management abilities of a model. This limits practitioners to validating models only on the text generation task, which, in this paper, is shown to have little to no correlation with a model's ability to understand the encoded summary of the natural language context.
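The aggregation behind Table 6 can be sketched as below, using the difficulty thresholds from the text; the task names and scores in the usage are made up for illustration.

```python
# Bucket probe tasks by the untrained baseline F1 (easy/medium/hard) and
# average a model's probe F1 per bucket, as in the Table 6 aggregation.
def difficulty_bucket(untrained_f1):
    if untrained_f1 > 0.50:
        return "easy"
    if untrained_f1 > 0.25:
        return "medium"
    return "hard"

def aggregate_f1(task_scores, untrained_scores):
    """Average a model's probe F1 per difficulty bucket."""
    buckets = {}
    for task, score in task_scores.items():
        bucket = difficulty_bucket(untrained_scores[task])
        buckets.setdefault(bucket, []).append(score)
    return {b: sum(v) / len(v) for b, v in buckets.items()}
```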

Conclusion
We propose a set of probe tasks to evaluate the encoder representation of end-to-end generative dialogue models. We observed that mimicking surface-level token prediction does not reveal much about a model's ability to understand a natural language context. The results on the probe tasks showed that RNN-based models perform better than the Transformer model in encoding information from the context. We also found some probe tasks that all of the models find difficult to solve; this invites novel architectures that can handle the language understanding aspects of dialogue generation. Although language generation is required for a dialogue model, the performance in token/response prediction alone cannot be a proxy for the model's ability to understand a conversation. Hence, systematically identifying issues in language understanding through probe tasks can help in building better models and collecting challenging data sets.

Figure 1: The mean of the distribution of ties in three different experiments was centered around 35%, showing that subjective scores on responses by humans are not sufficient to evaluate a model.
Figure 2: Downsampled encoder hidden states on the MultiWoZ data set with PCA show that the Transformer model has a high capacity to encode a large data set, unlike the SEQ2SEQ models.

Table 1 :
The difficulty levels of the different tasks are measured with the average performance of an untrained encoder. There is a natural grading in the selection of tasks that expects better language understanding to solve. + indicates the task is present in both the MultiWoZ and PersonaChat data sets. * indicates the task is only in PersonaChat. If no indicator is present, the task is evaluated only on the MultiWoZ data set.
The features of the data sets are shown in Table 2.

Table 2 :
Distribution of the dialogues in the data sets.

Table 3 :
BLEU scores of the models from runs with different seeds on the PersonaChat and MultiWoZ data sets (higher is better; we measure case-insensitive BLEU-2).

Table 4 :
Performance of the different models on the probe tasks on the PersonaChat data set. The performance is measured as F1 score (higher is better).

Table 5 :
F1 scores of generative dialogue models on probe tasks on the MultiWoZ dialogue data set (higher is better). SEQ2SEQ models perform significantly better than the Transformer model on the probe tasks, despite falling behind in BLEU score. The Transformer model's performance decreased from the initial to the last epoch on the majority of the tasks, while the SEQ2SEQ models show a learning curve.

Table 6 :
Aggregate F1 scores of the models on performance in probe tasks on MultiWoZ data set.