Turn-Level User Satisfaction Estimation in E-commerce Customer Service

User satisfaction estimation in dialogue-based customer service is critical not only for helping developers find system defects, but also for enabling timely human intervention for dissatisfied customers. In this paper, we investigate the problem of user satisfaction estimation in E-commerce customer service. To apply the estimator to online services for timely human intervention, we need to estimate the satisfaction score at each turn. However, in practice we can only collect satisfaction labels for whole dialogue sessions via user feedback. To this end, we formalize turn-level satisfaction estimation as a reinforcement learning problem, in which the model can be optimized with only session-level satisfaction labels. We conduct experiments on a dataset collected from a commercial customer service system and compare our model with supervised learning models. Extensive experiments show that the proposed method outperforms all the baseline models.


Introduction
Task-oriented dialogue systems have been widely studied in recent years (Gao et al., 2019), and many have been deployed in real-world applications, such as intelligent assistants and customer service systems in industry. However, due to the limitations of model capability, a system may fail to understand the intent of users or to complete the task, which makes it common for users to become dissatisfied with the system (Kiseleva et al., 2016b; Lopatovska et al., 2019).
In this paper, we focus on the problem of user satisfaction estimation (Chowdhury et al., 2016; Kiseleva et al., 2016a) in E-commerce customer service, where users may ask about E-commerce transactions, claim a refund, or make a complaint to the customer service. An actual E-commerce customer service system may serve thousands of users simultaneously, many of whom may feel dissatisfied to varying degrees. It is imperative to offer manual service to users who are exhibiting signs of dissatisfaction. Nevertheless, manual service resources are usually limited. Therefore, estimating user satisfaction can help assign manual service priority by sorting the ongoing dialogues by their satisfaction scores.
Ideally, the satisfaction scores should be estimated and sorted in a timely, turn-level manner. Take Figure 1 as an example. In the first two turns, the system responses are consistent with the user utterances, so the satisfaction score up to the second turn should be high, and the user should not be allocated human service. In the third turn, however, the system asks an irrelevant question instead of responding to the special situation the user encounters, so the satisfaction score up to the third turn should be lower than that up to the second turn. After the fourth turn, the satisfaction score should drop even further, since the system still responds improperly. Whether the user is offered human service in the third and fourth turns is determined by the rank of the satisfaction score among all ongoing dialogues.
However, in practice we can only collect satisfaction labels for whole dialogue sessions through user feedback (Park et al., 2020), because asking users to provide turn-level feedback would lead to a poor user experience. Consequently, most existing works only tackle session-level satisfaction prediction, where the satisfaction label can only be predicted after the whole session finishes, lacking the ability to adjust the satisfaction score as the dialogue proceeds.
To address this problem, we formalize turn-level user satisfaction estimation as a reinforcement learning problem. With carefully designed actions and a reward function, we can optimize the turn-level satisfaction estimator with only session-level satisfaction labels.
To summarize, we utilize reinforcement learning to achieve turn-level satisfaction estimation in E-commerce customer service when only session-level labels are available. Extensive experiments verify the effectiveness of our method.

Related Work
User satisfaction estimation for dialogue systems has been an important research topic over the past decades. Most existing work has focused on session-level user satisfaction estimation (Jiang et al., 2015; Hashemi et al., 2018; Park et al., 2020). Walker et al. (1997) first proposed the PARADISE framework, which estimates user satisfaction in spoken dialogue systems through a task success measure and dialogue-based cost measures. Yang et al. (2010) extended the PARADISE framework with an item-based collaborative filtering model. Some works on user satisfaction estimation focus on extracting useful features from user-system interaction (Kiseleva et al., 2016a; Sandbank et al., 2018). Others model a dialogue as a sequence of dialogue actions (Jiang et al., 2015) or utterances (Hashemi et al., 2018; Choi et al., 2019). However, these methods can predict user satisfaction only after the dialogue is completed, so they cannot be adopted in an E-commerce customer service scenario where timely satisfaction estimation is preferred.
While some works also address turn-level online satisfaction estimation, they require turn-level human annotations (Ultes et al., 2017; Bodigutla et al., 2020). These methods are not scalable in terms of annotation cost, given the large volumes of user data in E-commerce. Choi et al. (2019) used elaborate rules to generate turn-level satisfaction labels and trained the model in a supervised manner, but rules do not generalize well to the rapid growth of new data in a commercial system. Recently, Kachuee et al. (2020) suggested a self-supervised contrastive learning approach that uses unlabeled data and transfers to user satisfaction prediction with labeled data, but the required amount of labeled data is still very large.
In our work, we propose to leverage reinforcement learning to achieve turn-level user satisfaction estimation. Requiring only session-level labels, our model is more suitable for industrial E-commerce customer service than existing methods.

User Satisfaction Estimation
We formally define the task in our work as follows: the $t$-th turn of a dialogue, denoted by $T_t$, consists of a user request $T_t^u$ and a system response $T_t^s$. Each dialogue $d$ contains a number of turns, namely $d = (T_1, T_2, \ldots, T_T)$, and we estimate the satisfaction score $sc_t$ of a user at each turn $T_t$ ($t = 1, 2, \ldots, T$).
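For concreteness, this structure can be sketched as a simple data type. This is a minimal illustrative sketch; the class and field names are our own, not from the paper:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Turn:
    user_request: str     # T_t^u
    system_response: str  # T_t^s

@dataclass
class Dialogue:
    turns: List[Turn]   # d = (T_1, ..., T_T)
    session_label: int  # 1 = satisfying, 0 = dissatisfying (session level only)

# The estimator must output a score sc_t after every turn t,
# even though supervision exists only at the session level.
```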
We now describe the proposed method in detail. It consists of three components: a dialogue encoder, a satisfaction score estimator, and a reinforcement learning module. Figure 2 shows an overview of the proposed method.

Dialogue Encoder
Following Choi et al. (2019), we extract features from each turn, such as the turn index and input channel, and model a dialogue as a sequence of features. Suppose there are $m$ features, and denote the one-hot vector of the $j$-th feature in turn $T_t$ as $f_t^j$. The feature vector of the $t$-th turn is then the concatenation $f_t = [f_t^1; f_t^2; \ldots; f_t^m]$. For a better understanding of the natural language, we use BERT (Devlin et al., 2019) to encode the pair of user and system utterances at each turn, and include the resulting encoding as part of the input features $f_t$.
Then, we use gated recurrent units (GRU) (Chung et al., 2014) to obtain the hidden state $h_t$ of the dialogue history up to the $t$-th turn: $h_t = \mathrm{GRU}(f_t, h_{t-1})$.
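A minimal PyTorch sketch of this encoder follows. The paper does not specify the BERT checkpoint or the feature dimensions passed to the GRU, so `bert-base-uncased` and the interface below are placeholders:

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class DialogueEncoder(nn.Module):
    """Encodes each turn as [categorical features ; BERT embedding] and runs a GRU."""
    def __init__(self, feature_dim: int, hidden_dim: int = 32):
        super().__init__()
        # Assumption: any BERT checkpoint; the paper does not name one.
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
        bert_dim = self.bert.config.hidden_size
        self.gru = nn.GRU(feature_dim + bert_dim, hidden_dim, batch_first=True)

    def encode_turn(self, user_utt: str, sys_utt: str,
                    feat_onehots: torch.Tensor) -> torch.Tensor:
        # BERT over the (user, system) utterance pair; [CLS] vector as text feature.
        enc = self.tokenizer(user_utt, sys_utt, return_tensors="pt", truncation=True)
        cls = self.bert(**enc).last_hidden_state[:, 0]               # (1, bert_dim)
        return torch.cat([feat_onehots.unsqueeze(0), cls], dim=-1)   # f_t

    def forward(self, turn_features: torch.Tensor) -> torch.Tensor:
        # turn_features: (1, T, feature_dim + bert_dim); returns h_1, ..., h_T.
        h, _ = self.gru(turn_features)
        return h
```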

Satisfaction Score Estimator
For satisfaction score estimation, our insight is that the degree of a user's dissatisfaction accumulates when he/she encounters successive improper system responses (the satisfaction score is negative and decreases over time), and can be relieved by a satisfactory reply (the satisfaction score increases). It is therefore natural to predict the increment of the user satisfaction score: this is in line with the intuition that users who experience more dissatisfactory turns are more likely to give up interacting with the system, and the predicted increments can be regarded as the actions in reinforcement learning (see Section 3.3 for details). Formally, having encoded the dialogue, we first predict the increment $\Delta sc_t$ of the user satisfaction score with a multilayer perceptron (MLP): $\Delta sc_t = \mathrm{MLP}(h_t)$. We then sum up the increments to obtain the user satisfaction score up to the $t$-th turn: $sc_{1:t} = \sum_{i=1}^{t} \Delta sc_i$.
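The increment-and-sum design can be sketched as follows (class and method names are our own; the MLP sizes follow Appendix A):

```python
import torch
import torch.nn as nn

class ScoreEstimator(nn.Module):
    """Predicts the per-turn increment Δsc_t and accumulates it into sc_{1:t}."""
    def __init__(self, hidden_dim: int = 32):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim, 32), nn.ReLU(), nn.Linear(32, 1)
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (1, T, hidden_dim) GRU states; returns cumulative scores (1, T).
        increments = self.mlp(h).squeeze(-1)    # Δsc_t for every turn
        return torch.cumsum(increments, dim=1)  # sc_{1:t} = Δsc_1 + ... + Δsc_t
```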

Reinforcement Learning Module
To optimize the satisfaction score estimator, we sample a pair consisting of a satisfying dialogue (one where the user is satisfied with the system at the session level) and a dissatisfying dialogue, and compare the two predicted satisfaction scores. Our key insight is that, although it is hard to directly assign each turn an absolute satisfaction value, the predicted satisfaction score of the satisfying dialogue should be higher than that of the dissatisfying dialogue. We model the satisfaction score estimator as an agent that assigns an increment of the satisfaction score to each turn given the dialogue context, and the above fact is used to design the reward signal in the reinforcement learning setting. Formally, the training set $\mathcal{D}$ is split into satisfying dialogues $\mathcal{D}^+$ and dissatisfying dialogues $\mathcal{D}^-$. In each episode of reinforcement learning, we randomly sample a satisfying dialogue $d \in \mathcal{D}^+$ with $T$ turns and a dissatisfying dialogue $d' \in \mathcal{D}^-$ with $T'$ turns. The satisfaction score estimator, regarded as the agent, predicts the increment of the satisfaction score at each turn of $d$ and then of $d'$. Thus, the length of an episode is $T + T'$.
For the first turn of the satisfying dialogue (i.e., the 1st time step), the state is initialized with the features of that turn. The remaining states of the satisfying dialogue (the 2nd ∼ $T$-th time steps) are updated from the features of the current turn and the GRU hidden state encoding the features of the history turns. Similarly, for the first turn of the dissatisfying dialogue (the $(T+1)$-th time step), the state is re-initialized with the features of that turn, and the remaining states (the $(T+2)$-th ∼ $(T+T')$-th time steps) are again updated from the features of the current turn and the GRU hidden state encoding the history turns of the dissatisfying dialogue. Formally, the state is $s_t = (f_t, h_{t-1})$, with the hidden state re-initialized at the $(T+1)$-th step. The action $a_t = \Delta sc_t$ is sampled from the policy $\pi(a_t|s_t) = \mathcal{N}(\mathrm{MLP}(\mathrm{GRU}(s_t)), \sigma^2)$, where $\sigma$ is a hyper-parameter. The reward $r_t$ is 0 at every time step except the $T$-th and the $(T+T')$-th; at these two steps the reward is 1 if the agent predicts $sc_{1:T} > sc_{T+1:T+T'}$, and -1 otherwise.
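A sketch of the episode construction and reward, reusing the hypothetical classes sketched above (function and argument names are our own):

```python
import random
import torch

def run_episode(encoder, estimator, sat_feats, dis_feats, sigma=0.1):
    """One RL episode: score a satisfying dialogue d and a dissatisfying
    dialogue d', rewarding the agent only at the two terminal steps."""
    d = random.choice(sat_feats)      # d  in D+, per-turn feature tensor, T turns
    d_bar = random.choice(dis_feats)  # d' in D-, T' turns

    log_probs, final_scores = [], []
    for feats in (d, d_bar):  # GRU state is re-initialized between dialogues
        h = encoder(feats)                        # (1, T, hidden), GRU states
        means = estimator.mlp(h).squeeze(-1)      # policy mean for each Δsc_t
        dist = torch.distributions.Normal(means, sigma)
        actions = dist.sample()                   # a_t = Δsc_t ~ N(mean, σ²)
        log_probs.append(dist.log_prob(actions).sum())
        final_scores.append(actions.sum())        # sc over the whole dialogue

    # r = +1 if the satisfying dialogue scores higher, else -1;
    # all intermediate rewards are 0.
    reward = 1.0 if final_scores[0] > final_scores[1] else -1.0
    return log_probs, reward
```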
Let the expected return be $J(\pi_\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{t=1}^{T+T'} \gamma^{t-1} r_t\right]$, where the policy is parameterized by $\theta$ and $\gamma$ denotes the discount rate. Following the REINFORCE (Williams, 1992) algorithm, the gradient of the expected return can be calculated as $\nabla_\theta J(\pi_\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{t} \nabla_\theta \log \pi_\theta(a_t|s_t)\, G_t\right]$, where $G_t = \sum_{t' \geq t} \gamma^{t'-t} r_{t'}$ is the return from step $t$.
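A minimal sketch of the resulting update, under the reward structure above (a sketch, not the authors' exact implementation):

```python
import torch

def reinforce_update(optimizer, log_probs, reward):
    """One REINFORCE step.  With γ = 1 and rewards only at steps T and T+T',
    the return seen by the first dialogue's actions is G_t = 2r (both rewards
    lie ahead of them), while the second dialogue's actions see G_t = r."""
    loss = -(2.0 * log_probs[0] + log_probs[1]) * reward  # ascend on J(π_θ)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```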

Experimental Setting

Dataset
The dataset in this experiment is sampled from a commercial customer service system, where users communicate with the intelligent assistant about E-commerce transactions, such as claiming a refund or requesting a receipt. Users are allowed to request manual service during the dialogue if they feel dissatisfied with the automatic system. The dataset contains 1294 dialogue sessions in total, 840 and 454 of which are labeled as satisfying and dissatisfying, respectively.

Evaluation Metric
We aim to deploy our satisfaction estimator in online services, where thousands of dialogues are handled simultaneously. As manual service resources are limited, we need to sort the ongoing dialogues by the satisfaction scores estimated by our model and allocate manual service resources to the least satisfied users.
To evaluate the model in this scenario, we use the Area Under the Receiver Operating Characteristic Curve (AUC) (Fawcett, 2006) as the evaluation metric. In our scenario, AUC equals the probability that the satisfaction score of a randomly sampled satisfying dialogue is higher than the score of a randomly sampled dissatisfying dialogue.
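This pairwise-probability interpretation of AUC can be checked directly; a small sketch with made-up scores:

```python
import itertools
from sklearn.metrics import roc_auc_score

labels = [1, 1, 0, 1, 0]                 # 1 = satisfying session, 0 = dissatisfying
scores = [0.8, 0.3, -0.5, 0.1, -0.2]     # the model's final scores (illustrative)

auc = roc_auc_score(labels, scores)

# AUC equals the fraction of (satisfying, dissatisfying) pairs ranked correctly.
pos = [s for s, y in zip(scores, labels) if y == 1]
neg = [s for s, y in zip(scores, labels) if y == 0]
pairs = [(p > n) + 0.5 * (p == n) for p, n in itertools.product(pos, neg)]
assert abs(auc - sum(pairs) / len(pairs)) < 1e-9
```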

Baseline
We compare our model with supervised learning baselines. We train the baseline models using session-level labels with supervised learning, then treat the sub-dialogue (i.e., the first $n$ turns of the dialogue history) as a whole dialogue session to estimate turn-level user satisfaction during evaluation. We also add an augmented variant of supervised learning: we augment the training set with turn-level labels by directly copying the session-level label of each dialogue to its sub-dialogues as training signals.

Turn-Level Satisfaction Estimation
To investigate how well the model can estimate user satisfaction in a timely manner, we first compare the AUC of each model under different numbers of remaining turns $n$, where we predict the satisfaction score $n$ turns before the end of each dialogue (i.e., we predict $sc_{1:T-n}$ for a dialogue with $T$ turns). In this way, we can test whether a model is capable of estimating the user's satisfaction tendency before a dialogue finishes or fails.

Figure 3 shows the AUC of satisfaction estimation with respect to the number of remaining turns. Our proposed method outperforms all other methods for every number of remaining turns, and its improvement over the other methods increases as the number of remaining turns grows. The reason is that the distribution of incomplete dialogues differs from that of complete ones. Since the supervised learning model only learns to score complete dialogues during training, it cannot properly score incomplete ones at test time. In contrast, since the reinforcement learning model learns to make turn-level estimations during training, its performance is much better than that of the supervised learning model when the number of remaining turns is large. Augmenting the training data with sub-dialogues benefits the supervised learning process, but the performance is still worse than that of reinforcement learning.

To verify the effectiveness of each feature in dialogue encoding, we conduct an ablation study. We remove one feature in each experiment, and the model makes satisfaction estimations with access to the complete dialogues in the test set.
The results of the ablation study are shown in Table 1. The model with all features has the best performance, indicating that every feature is useful for satisfaction estimation.
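To make the remaining-turns protocol above concrete, here is a minimal sketch of the evaluation loop (the `model.score` call is a hypothetical wrapper around the encoder and estimator sketched earlier):

```python
from sklearn.metrics import roc_auc_score

def auc_at_remaining_turns(model, dialogues, labels, n: int):
    """AUC when each dialogue is truncated n turns before its end
    (i.e., the model scores sc_{1:T-n})."""
    scores = []
    for d in dialogues:
        T = len(d.turns)
        truncated = d.turns[: max(T - n, 1)]   # keep at least one turn
        scores.append(model.score(truncated))  # hypothetical scoring call
    return roc_auc_score(labels, scores)
```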

Model Behaviour Analysis
To understand the behaviour of our proposed model, we plot the distribution of the satisfaction scores predicted by our model up to each specific turn. As shown in Figure 4, in the first few turns the absolute value of the satisfaction score is usually small, as users usually express their demands at the beginning with no clear satisfaction tendency. As a dialogue continues, it exhibits more clues about satisfaction or dissatisfaction. Therefore, the predicted satisfaction scores go up (or down) in the satisfying (or dissatisfying) dialogues, as depicted by the orange (or blue) plots. This verifies our method's ability to distinguish dissatisfying dialogues from satisfying ones.

Conclusion
We present a reinforcement learning method to estimate turn-level satisfaction scores with only session-level labels. We verify that our model can effectively estimate the satisfaction scores of customer service dialogues. In future work, we will explore algorithms for retraining the customer service system with the help of the user satisfaction estimator.

Acknowledgments
This work was partly supported by the NSFC projects (Key project with No. 61936010 and regular project with No. 61876096). This work was also supported by the Guoqiang Institute of

A Implementation Details
The dataset is split into a training set (70%), validation set (15%), and test set (15%). In all experiments, the dimension of the GRU output vector is 32. Each MLP is a two-layer neural network with hidden size 32 and ReLU activation. We use Adam as the optimizer with a learning rate of 0.0001. The batch size is 4, and the discount rate for reinforcement learning is 1. The extracted features for each dialogue turn are listed in Table 2.
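These settings correspond to roughly the following setup; `DialogueEncoder` and `ScoreEstimator` are the hypothetical classes sketched in Section 3, and `feature_dim=34` is the sum of the one-hot dimensions in Table 2 below (10+8+6+10):

```python
import torch

encoder = DialogueEncoder(feature_dim=34, hidden_dim=32)  # Table 2 one-hot dims
estimator = ScoreEstimator(hidden_dim=32)  # two-layer MLP, hidden size 32, ReLU
optimizer = torch.optim.Adam(
    list(encoder.parameters()) + list(estimator.parameters()), lr=1e-4
)
BATCH_SIZE = 4  # episodes per parameter update
GAMMA = 1.0     # discount rate γ
```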

Table 2: Extracted features for each dialogue turn.

Turn index: The index of the current turn in a dialogue session. Each turn consists of a pair of user and system utterances. The dimension is 10 (1, 2, ..., 9, ≥10).

Frequency: The number of times the (exactly) same question has been asked by other users on the system within one month. We manually divide the frequency range into 8 disjoint intervals, so the dimension is 8.

Input channel: The channel through which the user inputs each turn (e.g., keyboard and shortcut button). The dimension is 6.

User intent: The detected user intent of each turn (e.g., making a complaint and claiming a refund). The dimension is 10.

B Case Study
To better understand the turn-level satisfaction estimation behaviour of our model, we conduct a case study. We sample two dialogue cases from the test set and display their contents as well as the satisfaction increment $\Delta sc_t$ estimated by our model at each turn. It is worth noting that in this E-commerce customer service, the system may respond in rich text format, including tables, images, and links. In such cases, the system response is represented by the title of the knowledge (e.g., Knowledge: Why I'm not eligible for the quick refund?).

Table 3 shows a dialogue case where the user is dissatisfied. At the first turn, the user selects the order. Since it is common for users to select an order in the first turn, the absolute value of the estimated satisfaction increment is small, suggesting that our model finds no clear satisfaction or dissatisfaction tendency. In the second turn, the user raises a question about the quick refund. Since this is a common question and the system responds with relevant knowledge, our model predicts a positive satisfaction increment (i.e., the user is likely to become more satisfied). However, in the third turn, the user asks for manual service, which usually indicates dissatisfaction with the content of the last response. Our model therefore predicts a negative satisfaction increment with a large absolute value, showing that the user might be quite dissatisfied with the automatic system. At the fourth turn, the user continues asking for manual service, and our model again predicts a negative satisfaction increment with a large absolute value.

Table 4 illustrates a dialogue case where the user is satisfied. At the first turn, the user also selects the order, so the absolute value of the predicted satisfaction increment is small. In the following turns, the user consecutively clicks the knowledge recommendation links and shortcut buttons in the user interface. This is a good sign, because the user can conveniently obtain the desired information through simple clicks, without typing questions through the keyboard. Hence, our model keeps predicting positive satisfaction increments, reflecting the belief that the user is satisfied.
The above cases illustrate that our proposed model makes reasonable turn-level satisfaction estimations in various situations, verifying the effectiveness and interpretability of our reinforcement learning method.