Schema-Guided User Satisfaction Modeling for Task-Oriented Dialogues

User Satisfaction Modeling (USM) is a popular choice for task-oriented dialogue system evaluation, where user satisfaction typically depends on whether the user's task goals were fulfilled by the system. Task-oriented dialogue systems use a task schema, a set of task attributes, to encode the user's task goals. Existing studies on USM neglect to explicitly model the fulfillment of the user's task goals using the task schema. In this paper, we propose SG-USM, a novel schema-guided user satisfaction modeling framework. It explicitly models the degree to which the user's preferences regarding the task attributes are fulfilled by the system in order to predict the user's satisfaction level. SG-USM employs a pre-trained language model for encoding the dialogue context and task attributes. Furthermore, it employs a fulfillment representation layer for learning how many task attributes have been fulfilled in the dialogue, and an importance predictor component for calculating the importance of task attributes. Finally, it predicts user satisfaction based on task attribute fulfillment and task attribute importance. Experimental results on benchmark datasets (i.e. MWOZ, SGD, ReDial, and JDDC) show that SG-USM consistently outperforms competitive existing methods. Our extensive analysis demonstrates that SG-USM improves the interpretability of user satisfaction modeling, has good scalability as it can effectively deal with unseen tasks, and can work effectively in low-resource settings by leveraging unlabeled data. Code is available at https://github.com/amzn/user-satisfaction-modeling.


Introduction
Task-oriented dialogue systems have emerged to help users solve specific tasks efficiently (Hosseini-Asl et al., 2020). Evaluation is a crucial part of the development process of such systems. Many of the standard automatic evaluation metrics, e.g. BLEU (Papineni et al., 2002) and ROUGE (Lin, 2004), have been shown to be ineffective for task-oriented dialogue evaluation (Deriu et al., 2021; Liu et al., 2016). As a consequence, User Satisfaction Modeling (USM) (Sun et al., 2021; Kachuee et al., 2021; Bodigutla et al., 2020; Song et al., 2019) has gained momentum as the core evaluation metric for task-oriented dialogue systems. USM estimates the overall satisfaction of a user's interaction with the system. In task-oriented dialogue systems, whether a user is satisfied largely depends on how well the user's task goals were fulfilled. Each task typically has an associated task schema, which is a set of task attributes (e.g. location, check-in and check-out dates, etc. for a hotel booking task), and for the user to be satisfied, the system is expected to fulfill the user's preferences about these task attributes. Figure 1 shows an example of USM for task-oriented dialogues.
Effective USM models should have the following abilities: (1) Interpretability, by giving insights into which aspects of the task the system performs well on. For instance, this can help the system recover from an error and optimize toward an individual aspect to avoid dissatisfaction. (2) Scalability in dealing with unseen tasks, e.g. the model does not need to be retrained when integrating new tasks. (3) Cost-efficiency, for performing well in low-resource settings where it is often hard to collect and expensive to annotate task-specific data.
Previous work in USM follows two main lines of research. First, several methods use user behavior or system actions to model user satisfaction. In this setting, it is assumed that user satisfaction is reflected by user behaviors or system actions in task-oriented dialogue systems, such as click, pause, request, and inform (Deng et al., 2022; Guo et al., 2020). A second approach is to analyze semantic information in the user's natural language feedback to estimate user satisfaction, such as sentiment analysis (Sun et al., 2021; Song et al., 2019) or response quality assessment (Bodigutla et al., 2020; Zeng et al., 2020). However, neither line of work accounts for interpretability, scalability, or cost-efficiency.

Figure 1: A task-oriented dialogue system has a predefined schema for each task, composed of a set of task attributes. In a dialogue, the user's task goal is encoded by task attribute and value pairs. The user is satisfied with the service when the system succeeds in fulfilling the user's preferences about the task attributes. The figure shows a restaurant-task example with attributes "City" (the city in which the restaurant is located) and "Price_Range" (the price range for the restaurant), a user request ("I'd like to look for a diner in Vacaville. I am searching for one that is intermediate priced."), and the system response ("Japanese Restaurant is a lovely diner around there.").
In this paper, we propose a novel approach to USM, referred to as Schema-Guided User Satisfaction Modeling (SG-USM). We hypothesize that user satisfaction should be predicted by the fulfillment degree of the user's task goals, which are typically represented by a set of task attribute and value pairs. Therefore, we explicitly formalize this by predicting how many task attributes fulfill the user's preferences and how important these attributes are. When more important attributes are fulfilled, task-oriented dialogue systems should achieve better user satisfaction.
Specifically, SG-USM comprises a pre-trained text encoder to represent the dialogue context and task attributes, a task attribute fulfillment representation layer to represent fulfillment based on the relation between the dialogue context and task attributes, a task attribute importance predictor to calculate importance based on task attribute popularity in labeled and unlabeled dialogue corpora, and a user satisfaction predictor which uses task attribute fulfillment and task attribute importance to predict user satisfaction. SG-USM uses task attribute fulfillment and task attribute importance to explicitly model the fulfillment degree of the user's task goals (interpretability). It uses a task-agnostic text encoder to create representations of task attributes from their descriptions, regardless of whether the tasks have been seen before (scalability). Finally, it uses unlabeled dialogues in low-resource settings (cost-efficiency).
Experimental results on popular task-oriented benchmark datasets show that SG-USM substantially and consistently outperforms existing methods on user satisfaction modeling. Extensive analysis also reveals the significance of explicitly modeling the fulfillment degree of the user's task goals, the ability to deal with unseen tasks, and the effectiveness of utilizing unlabeled dialogues.

Related Work
Task-oriented Dialogue Systems. Unlike chitchat dialogue systems that aim at conversing with users without specific goals, task-oriented dialogue systems assist users in accomplishing certain tasks (Feng et al., 2021; Eric et al., 2020). Task-oriented dialogue systems can be divided into module-based methods (Feng et al., 2022b; Su et al., 2022; Heck et al., 2020; Chen et al., 2020a; Wu et al., 2019a; Lei et al., 2018; Liu and Lane, 2016) and end-to-end methods (Feng et al., 2022a; Qin et al., 2020; Yang et al., 2020; Madotto et al., 2018; Yao et al., 2014). To measure the effectiveness of task-oriented dialogue systems, evaluation is a crucial part of the development process. Several approaches have been proposed, including automatic evaluation metrics (Rastogi et al., 2020; Mrkšić et al., 2017), human evaluation (Feng et al., 2022a; Goo et al., 2018), and user satisfaction modeling (Sun et al., 2021; Mehrotra et al., 2019). Automatic evaluation metrics, such as BLEU (Papineni et al., 2002), make a strong assumption for dialogue systems: that valid responses have significant word overlap with the ground truth responses. However, there is significant diversity in the space of valid responses to a given context (Liu et al., 2016). Human evaluation is considered to reflect the overall performance of the system in a real-world scenario, but it is intrusive, time-intensive, and does not scale (Deriu et al., 2021). Recently, user satisfaction modeling has been proposed as the main evaluation metric for task-oriented dialogue systems, as it can address the issues listed above.

User Satisfaction Modeling. User satisfaction in task-oriented dialogue systems is related to whether, or to what degree, the user's task goals are fulfilled by the system. Some researchers study user satisfaction from temporal user behaviors, such as click, pause, etc. (Deng et al., 2022; Guo et al., 2020; Mehrotra et al., 2019; Wu et al., 2019b; Su et al., 2018; Mehrotra et al., 2017). Other related studies view dialogue action recognition, e.g. of request and inform acts, as an important preceding step to USM (Deng et al., 2022; Kim and Lipani, 2022). However, the user behavior or system actions are sometimes hidden in the user's natural language feedback and the system's natural language response (Hashemi et al., 2018). To cope with this problem, a number of methods have been developed from the perspective of sentiment analysis (Sun et al., 2021; Song et al., 2019; Engelbrecht et al., 2009) and response quality assessment (Bodigutla et al., 2020; Zeng et al., 2020). However, existing methods cannot explicitly predict user satisfaction with fine-grained explanations, deal with unseen tasks, or alleviate the low-resource learning problem. Our work addresses these issues.

Schema-guided User Satisfaction Modeling
Our SG-USM approach formalizes user satisfaction modeling by representing the user's task goals as a set of task attributes, as shown in Figure 1.
The goal is to explicitly model the degree to which task attributes are fulfilled, taking into account the importance of the attributes. As shown in Figure 2, SG-USM consists of a text encoder, a task attribute fulfillment representation layer, a task attribute importance predictor, and a user satisfaction predictor. Specifically, the text encoder transforms the dialogue context and task attributes into dialogue embeddings and task attribute embeddings using BERT (Devlin et al., 2019). The task attribute fulfillment representation layer models relations between the dialogue embeddings and the task attribute embeddings with an attention mechanism to create task attribute fulfillment representations. Further, the task attribute importance predictor models task attribute popularity in labeled and unlabeled dialogues with a ranking model to obtain task attribute importance weights. Finally, the user satisfaction predictor predicts the user satisfaction score on the basis of the task attribute fulfillment representations and task attribute importance weights using a multilayer perceptron.

Text Encoder
The text encoder takes the dialogue context (user and system utterances) and the descriptions of task attributes as input and uses BERT to obtain dialogue and task attribute embeddings, respectively. Given the limitation on BERT's maximum input sequence length, we encode the dialogue context turn by turn. Specifically, the BERT encoder takes as input a sequence of tokens of length L, denoted as X = (x_1, ..., x_L). The first token x_1 is [CLS], followed by the tokens of the user utterance and the tokens of the system utterance in one dialogue turn, separated by [SEP]. The representation of [CLS] is used as the embedding of the dialogue turn. Given a dialogue with N dialogue turns, the output dialogue embeddings are the concatenation of all dialogue turn embeddings D = [d_1; d_2; ...; d_N].
To obtain task attribute embeddings, the input is a sequence of tokens of length K, denoted as Y = (y_1, ..., y_K). The sequence starts with [CLS], followed by the tokens of the task attribute description. The representation of [CLS] is used as the embedding of the task attribute. The set of task attribute embeddings is denoted as T = {t_1, t_2, ..., t_M}, where M is the number of task attributes.

Task Attribute Fulfillment Representation Layer
The task attribute fulfillment representation layer takes the dialogue and task attribute embeddings as input and calculates dialogue-attended task attribute fulfillment representations, which capture whether each task attribute is fulfilled in the dialogue context. Specifically, the layer constructs an attention vector via a bilinear interaction, indicating the relevance between the dialogue and task attribute embeddings. Given the dialogue embeddings D and the i-th task attribute embedding t_i, it calculates the relevance as:

A_i = softmax(D W_a t_i),

where W_a is the bilinear interaction matrix to be learned, and A_i represents the attention weights of the dialogue turns with respect to the i-th task attribute. The dialogue-attended i-th task attribute fulfillment representation is then calculated as the attention-weighted sum of the dialogue turn embeddings:

t_i^a = A_i^T D.

The dialogue-attended task attribute fulfillment representations for all task attributes are denoted as T^a = {t_1^a, t_2^a, ..., t_M^a}, where M is the number of task attributes.
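The bilinear attention step above can be sketched in NumPy as follows. This is a minimal illustration, not the paper's implementation: the shapes, the random initialization, and the variable names are assumptions for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fulfillment_representations(D, T, W_a):
    """D: (N, h) dialogue-turn embeddings; T: (M, h) attribute embeddings;
    W_a: (h, h) learned bilinear matrix. Returns (M, h) attended reps."""
    # A[i] holds the attention weights of the N turns w.r.t. attribute i:
    # A_i = softmax(D W_a t_i); softmax is taken over the turn axis.
    A = softmax(D @ W_a @ T.T, axis=0).T   # (M, N), each row sums to 1
    # t_i^a = A_i^T D: attention-weighted sum of turn embeddings.
    return A @ D                           # (M, h)

rng = np.random.default_rng(0)
N, M, h = 4, 3, 8                          # turns, attributes, hidden size
D = rng.normal(size=(N, h))
T = rng.normal(size=(M, h))
W_a = rng.normal(size=(h, h))
T_a = fulfillment_representations(D, T, W_a)
print(T_a.shape)  # (3, 8)
```

In SG-USM, W_a would be learned jointly with the rest of the model; here it is random purely to exercise the shapes.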

Task Attribute Importance Predictor
The task attribute importance predictor also takes the dialogue and task attribute embeddings as input and calculates attribute importance scores. The importance scores are obtained by considering both the presence frequency and the presence position of task attributes in the dialogue.
First, we use Maximal Marginal Relevance (MMR) (Carbonell and Goldstein, 1998) to select the top relevant task attributes for the dialogue context. The selected task attributes are then used to calculate the task attribute presence frequency in the dialogue. MMR takes the j-th dialogue turn embedding d_j and the task attribute embeddings T as input, and picks the top K relevant task attributes for the j-th dialogue turn:

MMR = argmax_{t_i ∈ T \ U} [ λ cos(t_i, d_j) − (1 − λ) max_{t_k ∈ U} cos(t_i, t_k) ],

where U is the subset of attributes already selected as top relevant task attributes, and cos(·,·) is the cosine similarity between embeddings. λ trades off the similarity of the selected task attributes to the dialogue turn against the diversity among the selected task attributes. The task attribute presence frequency vector for the j-th dialogue turn is F_j = [f_j^1, ..., f_j^M], where f_j^i is 1 if the i-th task attribute is selected for the j-th turn and 0 otherwise, and M is the number of task attributes.

However, the task attribute presence frequency vector does not reward task attributes that appear at the beginning of the dialogue. The premise of the task attribute importance score is that task attributes appearing near the end of the dialogue should be penalized, with the graded importance value reduced logarithmically in proportion to the position of the dialogue turn. A common and effective discounting method is to divide by the natural log of the position:

F_j' = F_j / log(1 + j).

The task attribute importance predictor then computes the importance scores from the sum of the discounted task attribute presence frequencies over all dialogues. Given a dialogue corpus (including both labeled and unlabeled dialogues) with Z dialogues C = {D_1, D_2, ..., D_Z}, the task attribute importance scores are calculated as:

S = softmax( Σ_{l=1}^{Z} Σ_{j=1}^{Num(D_l)} F_j^l ),

where Num(D_l) is the number of dialogue turns in dialogue D_l, and F_j^l is the discounted task attribute presence frequency of the j-th dialogue turn in dialogue D_l.
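The MMR selection and discounted-frequency scoring can be sketched as below. This is an illustrative reconstruction under stated assumptions: the greedy MMR loop, the log(1 + j) discount (to keep the first turn's divisor nonzero), and the softmax normalization are our reading of the description above, not verified implementation details, and all names and sizes are made up for the example.

```python
import numpy as np

def cos(a, b):
    # Cosine similarity between two embedding vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def mmr_select(d_j, T, K=2, lam=0.7):
    """Greedily pick top-K attribute indices for turn embedding d_j,
    trading relevance to the turn against diversity among picks."""
    U = []                                  # indices already selected
    while len(U) < K:
        best, best_score = None, -np.inf
        for i in range(len(T)):
            if i in U:
                continue
            diversity = max((cos(T[i], T[k]) for k in U), default=0.0)
            score = lam * cos(T[i], d_j) - (1 - lam) * diversity
            if score > best_score:
                best, best_score = i, score
        U.append(best)
    return U

def importance_scores(dialogues, T, K=2):
    """dialogues: list of (N_l, h) turn-embedding arrays (labeled + unlabeled).
    Returns a softmax-normalized importance weight per attribute."""
    total = np.zeros(len(T))
    for D in dialogues:
        for j, d_j in enumerate(D, start=1):
            F_j = np.zeros(len(T))
            F_j[mmr_select(d_j, T, K)] = 1.0
            total += F_j / np.log(1.0 + j)  # discount later turns
    e = np.exp(total - total.max())
    return e / e.sum()

rng = np.random.default_rng(1)
T = rng.normal(size=(4, 8))                 # 4 attributes, hidden size 8
dialogues = [rng.normal(size=(3, 8)), rng.normal(size=(5, 8))]
S = importance_scores(dialogues, T)
print(S.round(3))
```

Because the corpus-level sums need no satisfaction labels, unlabeled dialogues can feed this component directly, which is what SG-USM(L&U) exploits.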

User Satisfaction Predictor
Given the dialogue-attended task attribute fulfillment representations T^a and the task attribute importance scores S, the user satisfaction labels are obtained by aggregating the task attribute fulfillment representations according to the task attribute importance scores. This way, user satisfaction is explicitly modeled by the fulfillment of the task attributes and their individual importance.
Specifically, an aggregation layer integrates the dialogue-attended task attribute fulfillment representations weighted by the task attribute importance scores:

t^agg = Σ_{i=1}^{M} S_i t_i^a.

Then a Multilayer Perceptron (MLP) (Hastie et al., 2009) with softmax normalization is employed to calculate the probability distribution over user satisfaction classes:

p = softmax(MLP(t^agg)).
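Putting the two signals together, the prediction head can be sketched as a weighted sum followed by a small MLP. The layer sizes, weights, and number of satisfaction classes here are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

def predict_satisfaction(T_a, S, W1, b1, W2, b2):
    """T_a: (M, h) fulfillment reps; S: (M,) importance weights summing to 1.
    A 2-layer MLP with softmax maps the aggregate to class probabilities."""
    t_agg = S @ T_a                         # importance-weighted sum, (h,)
    hidden = np.maximum(0.0, t_agg @ W1 + b1)  # ReLU hidden layer
    logits = hidden @ W2 + b2
    e = np.exp(logits - logits.max())
    return e / e.sum()                      # softmax over satisfaction classes

rng = np.random.default_rng(2)
M, h, hid, C = 3, 8, 16, 5                  # e.g. 5 satisfaction levels
T_a = rng.normal(size=(M, h))
S = np.full(M, 1.0 / M)                     # uniform importance for the demo
W1, b1 = rng.normal(size=(h, hid)), np.zeros(hid)
W2, b2 = rng.normal(size=(hid, C)), np.zeros(C)
p = predict_satisfaction(T_a, S, W1, b1, W2, b2)
print(p.shape)  # (5,)
```

Because the class probabilities are an importance-weighted function of per-attribute fulfillment, one can inspect which attribute contributed most to a low satisfaction score, which is the interpretability property claimed above.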

Training
We train SG-USM in an end-to-end fashion by minimizing the cross-entropy loss between the predicted user satisfaction probabilities and the ground truth:

L = − Σ_c y_c log p_c,

where y is the one-hot ground-truth user satisfaction label. Pre-trained BERT encoders are used for encoding representations of utterances and schema descriptions, respectively. The encoders are fine-tuned during the training process.
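For a one-hot label, the cross-entropy loss reduces to the negative log-probability of the true class, as this small sketch shows (the probability vector is invented for the example):

```python
import numpy as np

def cross_entropy(p, y):
    """p: predicted class probabilities; y: ground-truth class index.
    With a one-hot target, -sum(y_c * log p_c) is just -log p[y]."""
    return -np.log(p[y])

p = np.array([0.1, 0.2, 0.5, 0.1, 0.1])
loss = cross_entropy(p, 2)
print(round(loss, 4))  # -log(0.5) ≈ 0.6931
```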
Experimental Setup

Datasets
We conduct experiments using four benchmark datasets containing task-oriented dialogues in different domains and languages (English and Chinese): MultiWOZ 2.1 (MWOZ) (Eric et al., 2020), Schema Guided Dialogue (SGD) (Rastogi et al., 2020), ReDial (Li et al., 2018), and JDDC (Chen et al., 2020b). MWOZ and SGD are English multi-domain task-oriented dialogue datasets covering hotel, restaurant, flight, and other domains. These datasets contain domain-slot pairs, where the slot information corresponds to the task attributes.
ReDial is an English conversational recommendation dataset for movie recommendation. Its task attributes are obtained from the Movie type on Schema.org. JDDC is a Chinese customer service dialogue dataset in e-commerce. Its task attributes are obtained from the Product type on Schema.org.cn, which provides schemas in Chinese. Specifically, we use the subsets of these datasets with user satisfaction annotations for evaluation, as provided by Sun et al. (2021). We also use the subsets of these datasets without user satisfaction annotations to investigate the semi-supervised learning abilities of SG-USM. Table 1 displays the statistics of the datasets used in the experiments.

Baselines and SG-USM Variants
We compare our SG-USM approach with competitive baselines as well as state-of-the-art methods in user satisfaction modeling.
HiGRU (Jiao et al., 2019) uses a hierarchical structure to encode each turn in the dialogue with a word-level gated recurrent unit (GRU) (Dey and Salem, 2017) and a sentence-level GRU. It uses the last hidden states of the sentence-level GRU to represent the dialogue. HAN (Yang et al., 2016) applies a two-level attention mechanism in the hierarchical structure of HiGRU to represent dialogues.
Transformer (Vaswani et al., 2017) is a simple baseline that takes the dialogue context as input and uses the standard Transformer encoder to obtain the dialogue representations.
BERT (Devlin et al., 2019) concatenates the last 512 tokens of the dialogue context into a long sequence, with a [SEP] token separating dialogue turns. It uses the [CLS] token of a pre-trained BERT model to represent dialogues. USDA (Deng et al., 2022) employs BERT to encode the whole dialogue context and incorporates the sequential dynamics of dialogue acts for user satisfaction modeling.
We also report the performance of two simpler SG-USM variants: SG-USM(L) only uses the dialogues with ground-truth user satisfaction labels to train the model. SG-USM(L&U) uses both labeled and unlabeled dialogues in the training process; it feeds the dialogues without user satisfaction annotations into the task attribute importance predictor module to obtain more general and accurate task attribute importance scores.
For a fair comparison with previous work and without loss of generality, we adopt BERT as the backbone encoder for all methods that use pretrained language models.

Training
We use BERT-Base uncased, which has 12 hidden layers of 768 units and 12 self-attention heads, to encode the utterances and schema descriptions. We apply a two-layer MLP with a hidden size of 768 on top of the text encoders. ReLU is used as the activation function, and the dropout probability is 0.1. Adam (Kingma and Ba, 2014) is used for optimization with an initial learning rate of 1e-4. We train for up to 20 epochs with a batch size of 16, and select the best checkpoints based on the F1 score on the validation set.

Overall Performance
Table 2 shows the results of SG-USM on the MWOZ, SGD, ReDial, and JDDC datasets. Overall, we observe that SG-USM substantially and consistently outperforms all other methods across the four datasets with a noticeable margin. Specifically, SG-USM(L) improves the performance of user satisfaction modeling by explicitly modeling the degree to which the task attributes are fulfilled. SG-USM(L&U) further aids user satisfaction modeling by predicting task attribute importance based on both labeled and unlabeled dialogues. The success of SG-USM appears to stem from its architecture design, which combines the task attribute fulfillment representation layer and the task attribute importance predictor. In addition, SG-USM can effectively leverage unlabeled dialogues to reduce the cost of user satisfaction annotation.

Ablation Study
We also conduct an ablation study on SG-USM to assess the contribution of its two main components: task attribute importance and task attribute fulfillment.

Effect of Task Attribute Importance
To investigate the effectiveness of task attribute importance in user satisfaction modeling, we eliminate the task attribute importance predictor and run the model on MWOZ, SGD, ReDial, and JDDC. As shown in Figure 3, the performance of SG-USM-w/oImp decreases substantially compared with SG-USM. This indicates that task attribute importance is essential for user satisfaction modeling. We conjecture that this is because user satisfaction relates to the importance of the fulfilled task attributes.

Effect of Task Attribute Fulfillment
To investigate the effectiveness of task attribute fulfillment in user satisfaction modeling, we compare SG-USM with SG-USM-w/oFul, which eliminates the task attribute fulfillment representation.
Figure 3 shows the results on MWOZ, SGD, ReDial, and JDDC in terms of F1. We observe that without the task attribute fulfillment representation, performance deteriorates considerably. Thus, the task attribute fulfillment representation is necessary for user satisfaction modeling.

Case Study
We also perform a qualitative analysis of the results of SG-USM and the best baseline, USDA, on the SGD dataset to delve deeper into the differences between the two models.
We first find that SG-USM makes accurate inferences about user satisfaction by explicitly modeling the fulfillment degree of task attributes. For example, in the first case in Figure 4, the user wants to find a restaurant for ten people. SG-USM correctly predicts the neutral label by inferring that the fourth most important task attribute, "People", is not fulfilled. In the second case, the user wants to book a train ticket before 8 am. SG-USM yields the correct dissatisfied label by inferring that the third most important task attribute, "Time", is not fulfilled. From our analysis, we conclude that SG-USM achieves better accuracy due to its ability to explicitly model how many task attributes are fulfilled and how important the fulfilled task attributes are. In contrast, USDA does not model the fulfillment degree of task attributes, so it cannot properly infer the overall user satisfaction.

Dealing with Unseen Task Attributes
We further analyze the zero-shot capabilities of SG-USM and the best baseline, USDA. The SGD, MWOZ, and ReDial datasets are English dialogue datasets that contain different task attributes. Therefore, we train models on SGD and test them on MWOZ and ReDial to evaluate zero-shot learning ability. Table 3 presents the Accuracy, Precision, Recall, and F1 of SG-USM and USDA on MWOZ and ReDial. We observe that SG-USM performs significantly better than the baseline USDA on both datasets. This indicates that the task-agnostic attribute encoder of SG-USM is effective. We argue that it can learn shared knowledge between task attributes and create more accurate semantic representations for unseen task attributes, improving performance in zero-shot learning settings.

Effect of the Unlabeled Dialogues
To analyze the effect of unlabeled dialogues on SG-USM, we vary the number of unlabeled dialogues used during training. Figure 5 shows the Accuracy and F1 of SG-USM when using 1 to 4 thousand unlabeled dialogues for training on MWOZ, SGD, ReDial, and JDDC. We see that SG-USM achieves higher performance with more unlabeled dialogues, indicating that it can effectively utilize unlabeled dialogues to improve user satisfaction modeling. We reason that with a larger corpus, the model can more accurately estimate the importance of task attributes.

Conclusion
User satisfaction modeling is an important yet challenging problem for task-oriented dialogue system evaluation. For this purpose, we proposed to explicitly model the degree to which the user's task goals are fulfilled. Our novel method, SG-USM, models user satisfaction as a function of the degree to which the attributes of the user's task goals are fulfilled, taking into account the importance of the attributes. Extensive experiments show that SG-USM significantly outperforms state-of-the-art methods in user satisfaction modeling on various benchmark datasets, i.e. MWOZ, SGD, ReDial, and JDDC. Our extensive analysis also validates the benefit of explicitly modeling the fulfillment degree of a user's task goal based on the fulfillment of its constituent task attributes. In future work, it is worth exploring the reasons for user dissatisfaction to better evaluate and improve task-oriented dialogue systems.

Limitations
Our approach builds on a task schema that characterizes a task-oriented dialogue system's domain.For example, the schema captures various attributes of the task.For some domains, when a schema is not pre-defined, it first needs to be extracted, e.g., from a corpus of dialogues.In this paper, we used BERT as our LM to be comparable with related work, but more advanced models could further improve the performance.A limitation of our task attribute importance scoring method is that it currently produces a static set of weights, reflecting the domain.In the future, the importance weights may be personalized to the current user's needs instead.

Figure 3: Performance of SG-USM when ablating the task attribute importance and task attribute fulfillment components across datasets.

Figure 4: Case study of SG-USM and USDA on the SGD dataset. The yellow ★ represents the importance of task attributes. Task attributes in green are fulfilled; task attributes in red are not fulfilled.

Figure 5: Performance of SG-USM trained with different numbers of unlabeled dialogues on the MWOZ, SGD, ReDial, and JDDC datasets.

Table 1: Statistics of the task-oriented dialogue datasets.