A Role-Selected Sharing Network for Joint Machine-Human Chatting Handoff and Service Satisfaction Analysis

Chatbots are increasingly thriving across domains; however, owing to unexpected discourse complexity and training-data sparseness, their potential failures raise serious concerns. Recently, Machine-Human Chatting Handoff (MHCH), which predicts chatbot failure and enables human-algorithm collaboration to enhance chatbot quality, has attracted increasing attention from industry and academia. In this study, we propose a novel model, the Role-Selected Sharing Network (RSSN), which integrates dialogue satisfaction estimation and handoff prediction in one multi-task learning framework. Unlike prior efforts in dialogue mining, our model uses local user satisfaction as a bridge so that the global satisfaction detector and the handoff predictor can effectively exchange critical information. Specifically, we decouple the relation and interaction between the two tasks by the role information after the shared encoder. Extensive experiments on two public datasets demonstrate the effectiveness of our model.


Introduction
Chatbots, as one of the recent palpable AI excitements, have been widely adopted to reduce the cost of customer service (Qiu et al., 2017; Ram et al., 2018; Zhou et al., 2020). However, due to the complexity of human conversation, an automatic chatbot can hardly meet all users' needs, and its potential failures invite skepticism. AI-enabled customer service, for instance, may trigger unexpected business losses because of chatbot failures (Radziwill and Benton, 2017; Rajendran et al., 2019). Moreover, for chatbot adoption in sensitive areas, such as healthcare (Chung and Park, 2019) and criminal justice (Wang et al., 2020a), any subtle statistical miscalculation may trigger serious health and legal consequences.

(Figure 1: an exemplar customer-service dialogue; the customer's utter 3, "What a business! It has been a week!", receives a templated shipping reply from the chatbot.)

To address this problem, scholars have recently proposed new dialogue mining tasks to auto-assess dialogue satisfaction, a.k.a. Service Satisfaction Analysis (SSA) at the dialogue level (Song et al., 2019), and to predict potential chatbot failure via Machine-Human Chatting Handoff (MHCH) at the utterance level (Huang et al., 2018; Liu et al., 2021). In an MHCH context, the algorithm can transfer an ongoing auto-dialogue to a human agent when the current utterance is confusing. Figure 1 depicts an exemplar dialogue of online customer service. In this dialogue, the chatbot gives an unsatisfactory answer about shipping, thus causing the customer's complaint (local dissatisfaction at utter 2 and utter 3). Ideally, the chatbot should detect the negative (local) emotion (utter 3) and try to appease the complaint, but this problem remains unresolved. If the chatbot continues, the customer may cancel the deal and give a negative rating (global dialogue dissatisfaction). With MHCH (which detects the risks of utter 2 and utter 3), the dialogue can be transferred to the human agent, who is better at handling, compensating, and comforting the customer, and can enhance customer satisfaction. This example illustrates the cross-impact between handoff and (local+global) dialogue satisfaction. Intuitively, the MHCH and SSA tasks can be compatible and complementary given a dialogue discourse, i.e., the local satisfaction is related to the quality of the conversation (Bodigutla et al., 2019a, 2020), which can support the handoff judgment and ultimately affect the overall satisfaction.
On the one hand, the handoff labels of utterances are highly pertinent to local satisfaction; e.g., one can utilize handoff information alone to enhance local satisfaction prediction, which ultimately contributes to the overall satisfaction estimation. On the other hand, the overall satisfaction is obtained by combining local satisfactions, which reflect the quality of answer generation, language understanding, and emotion perception, and subsequently help to facilitate the handoff judgment. In recent years, researchers (Bodigutla et al., 2019a,b; Ultes, 2019; Bodigutla et al., 2020) have explored the joint evaluation of turn- and dialogue-level qualities in spoken dialogue systems. For general dialogue systems, to improve the efficiency of dialogue management, Qin et al. (2020) propose a co-interactive relation layer to explicitly examine the cross-impact and model the interaction between sentiment classification and dialogue act recognition, which are relevant tasks at the same (utterance) level. However, MHCH (utterance-level) and SSA (dialogue-level) target satisfaction at different levels. More importantly, the handoff labels of utterances are more comprehensive and pertinent to local satisfaction than sentiment polarities. Meanwhile, customer utterances have significant impacts on the overall satisfaction (Song et al., 2019), which suggests that role information can be critical for knowledge transfer between these two tasks.
To address the aforementioned issues, we propose an innovative Role-Selected Sharing Network (RSSN) for handoff prediction and dialogue satisfaction estimation, which utilizes role information to selectively characterize the complex relations and interactions between the two tasks. To the best of our knowledge, this is the first investigation that leverages a multi-task learning approach to integrate MHCH and SSA. In practice, we first adopt a shared encoder to obtain the shared representations of utterances. Inspired by the co-attention mechanism (Xiong et al., 2016; Qin et al., 2020), the shared representations are then fed into the role-selected sharing module, which consists of two directional interactions, MHCH to SSA and SSA to MHCH, and produces a fusion of the MHCH and SSA representations. We propose the role-selected sharing module based on the hypothesis that role information can benefit the tasks' performances. The satisfaction distributions of utterances from different roles (agent and customer) are different, and their effects on the tasks also differ. Specifically, the satisfaction of the agent is non-negative: the utterances from the agent can enrich the context of the customer's utterances and only indirectly affect the satisfaction polarity. Thus, directly employing the local satisfaction of the agent in the interaction with handoff may introduce noise. In the proposed role-selected sharing module, we therefore adopt local satisfaction based on the role information: only the local satisfaction of the customer is used to interact with the handoff information. By this means, we can control the knowledge transfer between both tasks and make our framework more explainable. The final integrated outputs are then fed into separate decoders for handoff and satisfaction predictions.
To summarize, our contributions are mainly as follows: (1) We introduce a novel multi-task learning framework for combining machine-human chatting handoff and service satisfaction analysis.
(2) We propose a Role-Selected Sharing Network for handoff prediction and satisfaction rating estimation, which utilizes role information to control knowledge transfer between both tasks and enhances model performance and explainability. (3) The experimental results demonstrate that our model outperforms a series of strong baselines, consisting of the state-of-the-art (SOTA) models on each task and multi-task learning models for both tasks. To assist other scholars in reproducing the experimental outcomes, we release the code and the annotated dataset 1 .

Related Work
Due to the complexity of human conversation, current automatic chatbots are not mature enough and still fail to meet users' expectations (Brandtzaeg and Følstad, 2018; Jain et al., 2018; Chaves and Gerosa, 2020). Besides exploring novel dialogue models, dialogue quality estimation, service satisfaction analysis, and human intervention are vital strategies to enhance chatbot performance.
Dialogue Quality and Service Satisfaction Analysis. Interaction Quality (IQ) (Schmitt et al., 2012) and Response Quality (RQ) (Bodigutla et al., 2019b) are dialogue quality evaluation metrics for spoken dialogue systems. Automated models to estimate IQ (Ultes et al., 2014; El Asri et al., 2014) and RQ (Bodigutla et al., 2019a,b, 2020) utilize various features derived from the dialogue content and the output of spoken language understanding components. For chat-oriented dialogue systems, Higashinaka et al. (2015a,b) introduce the Dialogue Breakdown Detection task to detect a system's inappropriate utterances that lead to dialogue breakdowns. To efficiently analyze dialogue satisfaction, Song et al. (2019) introduce the task of service satisfaction analysis (SSA) based on multi-turn customer service dialogues. Their proposed CAMIL model predicts the sentiment of all customer utterances and aggregates those sentiments into an overall service satisfaction polarity. Nevertheless, the sentiment of customer utterances is only one of the factors that influence service satisfaction.
Machine-Human Chatting Handoff. Another way to further enhance a chatbot's performance is to combine chatbots with human agents. Recently, there have been several works on human-machine cooperation for chatbots. Huang et al. (2018) propose a crowd-powered conversational assistant architecture, namely Evorus, which integrates crowds with multiple chatbots and a voting system. Rajendran et al. (2019) utilize a reinforcement learning framework to transfer conversations to human agents once new user behaviors are encountered. Different from them, Liu et al. (2021) mainly focus on detecting transferable utterances, which are one of the keys to improving user satisfaction. They propose the DAMI network, which utilizes difficulty-assisted encoding and matching inference mechanisms to predict transferable utterances.
Multi-task learning in dialogue systems. For satisfaction estimation, Bodigutla et al. (2020) propose to jointly predict turn-level RQ labels and dialogue-level ratings. They utilize features from the spoken dialogue system and a BiLSTM-based (Hochreiter and Schmidhuber, 1997) model to automatically weight each turn's contribution towards the rating. Ma et al. (2018) propose a joint framework that unifies two highly pertinent tasks. Both tasks are trained jointly using weight sharing to extract the common and task-invariant features, while each task can still learn its task-specific features. To learn the correlation between two tasks, Qin et al. (2020) propose DCR-Net, which adopts a stacked co-interactive relation layer to incorporate mutual knowledge explicitly. However, this model ignores contextual information and isolates the two types of information when performing the interaction.

Methodology

Figure 2 shows the overall architecture of RSSN, which consists of three parts: the Shared Utterance and Matching Encoder, the Role-Selected Interaction Layer, and the Decoder for MHCH and SSA. In this section, we describe them in detail.

Given a dialogue D = [u_1, ..., u_L] consisting of L utterances, MHCH aims to assign each utterance a handoff label y^h_t ∈ Ψ, where Ψ = {transferable, normal}. Transferable indicates the dialogue should be transferred to the human agent, whereas normal indicates there is no need to transfer. The satisfaction polarity of dialogue D is noted as y^s, where y^s ∈ Ω and Ω = {well satisfied, met, unsatisfied}. Note that we perform multi-task learning with the supervision of handoff labels and the dialogue's satisfaction only. The local satisfaction distributions of utterances are only latent estimates, which help to predict the dialogue's satisfaction.

Shared Utterance and Matching Encoder
The shared encoder consists of a bidirectional LSTM (BiLSTM) to learn the utterance representation and a masked matching layer to capture the contextual matching information.
Suppose u_t = [w_1, ..., w_{|u_t|}] represents the sequence of words in the t-th utterance. These words are mapped into corresponding word embeddings E_{u_t} ∈ R^{n×|u_t|}, where n is the word embedding dimension. By adopting semantic composition models over word embeddings, we can learn the utterance representation. In this work, we adopt a BiLSTM model and concatenate the hidden states of the forward and backward LSTMs to learn the context-sensitive utterance representation v_t ∈ R^{2k}, where k is the number of hidden units of an LSTM cell. Formally, we have v_t = BiLSTM(E_{u_t}).
In a dialogue, the preceding utterances of each utterance provide helpful context for estimating local satisfaction. Thus, within a dialogue, utterances have a high probability of inter-dependency with respect to their context clues. To encapsulate the contextual matching and information flow in the dialogue, we feed the utterance representations into a unidirectional matching mechanism that scores each utterance against only its preceding utterances.

After masking out the future information of the present utterance, the matching features of dialogue D form a lower-triangular matrix with the diagonal values removed. We then concatenate the matching features with the utterance representation to get ṽ_t = [v_t; v'_t], where v'_t denotes the matching feature vector of u_t. Finally, we obtain the initial shared utterance representations for the MHCH and SSA tasks, denoted H and S respectively.
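The encoder described above can be sketched as follows. The dot-product matching score is an assumption, since this excerpt omits the exact matching function; only the strictly lower-triangular (past-only) masking and the concatenation ṽ_t = [v_t; v'_t] follow the text.

```python
import numpy as np

def masked_matching(V):
    """Unidirectional matching over utterance vectors V (L x d).

    Each utterance attends only to strictly preceding utterances,
    so the matching matrix is lower triangular with the diagonal
    removed; the first utterance gets a zero context vector.
    Dot-product scoring is an assumption of this sketch.
    """
    L, d = V.shape
    scores = V @ V.T                    # pairwise matching scores
    ctx = np.zeros_like(V)              # v'_t, the matched context
    for t in range(1, L):
        s = scores[t, :t]               # only positions j < t survive
        w = np.exp(s - s.max())
        w /= w.sum()                    # softmax over the past
        ctx[t] = w @ V[:t]
    return np.concatenate([V, ctx], axis=1)   # v~_t = [v_t ; v'_t]
```

The past-only restriction is what later yields the lower-triangular matching matrix with the diagonal removed.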

Role-Selected Interaction Layer
In customer service dialogues, different participant roles exhibit different characteristics (Song et al., 2019). Besides, we conjecture that MHCH and SSA have different impacts on each other: the two tasks indirectly establish a connection through factors such as dialogue quality, satisfaction, and sentiment. At the same time, role information plays an important part in both tasks. On the one hand, the utterances from the agent can enrich the context of customer utterances and indirectly affect the satisfaction polarity, whereas customer utterances tend to have a more direct impact on the dominating satisfaction polarity. On the other hand, the utterances of any participant can trigger a machine-human chatting handoff. Thus, we propose the Role-Selected Interaction Layer, which contains two interaction directions, SSA to MHCH and MHCH to SSA, to model the relations and interactions between the two tasks separately.
We first apply two Dense layers over the handoff information and the satisfaction information respectively to make them more task-specific, denoted as H = Dense(H) and S = Dense(S), where H ∈ R^{L×d} and S ∈ R^{L×d}, and d is the number of hidden units of the Dense layer.
SSA to MHCH. Co-attention is an effective and widely used method to capture mutual knowledge among correlated tasks (Xiong et al., 2016; Qin et al., 2020). Inspired by the basic co-attention mechanism, we design the interaction mechanism separately according to the characteristics of the tasks, so that task-relevant knowledge can be transferred mutually between them. Specifically, the SSA to MHCH module produces comprehensive handoff representations incorporating the local satisfaction information. Since agent utterances only indirectly affect the satisfaction polarity, directly employing the local satisfaction of the agent in the interaction with handoff may introduce noise. Consequently, we only adopt the local satisfaction information of the customer to interact with the handoff information. The process is a masked co-attention producing M ∈ R^{L×d}, where Mask_c denotes that we mask out (setting to −∞) all future positions and agent utterances.
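The masked co-attention can be sketched as follows. The scaled dot-product scoring and the residual fusion M = A S + H are assumptions of this sketch, since the excerpt omits the exact equations; the role-and-future masking follows the description of Mask_c.

```python
import numpy as np

def ssa_to_mhch(H, S, is_customer):
    """Sketch of the SSA -> MHCH interaction. Handoff representations
    H (L x d) attend over satisfaction representations S (L x d);
    Mask_c keeps only non-future *customer* positions (others are set
    to -inf before the softmax)."""
    L, d = H.shape
    scores = (H @ S.T) / np.sqrt(d)
    past = np.tril(np.ones((L, L)))                 # no future positions
    cust = np.tile(np.asarray(is_customer, float), (L, 1))
    scores = np.where(past * cust > 0, scores, -np.inf)
    A = np.zeros((L, L))
    for t in range(L):
        finite = np.isfinite(scores[t])
        if finite.any():                            # no customer yet -> zero row
            e = np.exp(scores[t, finite] - scores[t, finite].max())
            A[t, finite] = e / e.sum()
    return A @ S + H                                # fused representation M
```

When no customer utterance has appeared yet, the attention row is all zeros and the handoff representation passes through unchanged.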
MHCH to SSA. As shown in Figure 3, we observe that the dialogue satisfaction rating is related to the handoff position. Intuitively, a handoff can be triggered by the customer's local unsatisfied attitude, and a later handoff means the user is still unsatisfied near the end of the conversation. A prior study (Song et al., 2019) also found that user satisfaction at the dialogue level is usually determined by the attitudes of the last few utterances. We can therefore expect that a handoff in the later period of the conversation may result in a lower satisfaction rating. Thus, we adjust the interactive attention by positional weights, where ⊙ is the element-wise product and I_p(·) denotes a zero-masking identity matrix used to mask out future information. The positional weights form Γ = [β_1; ...; β_L], where Γ ∈ R^{L×L}. This mechanism gives more weight to later handoff information.
We apply the positional weights to the interaction, where Mask denotes that we mask out future information (setting to −∞), and LayerNorm denotes layer normalization (Ba et al., 2016).
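A minimal sketch of the positional weights Γ is given below. The linear form (j+1)/(t+1) is purely an assumption; the excerpt only states the qualitative behavior (later positions in the visible context weigh more, and future positions are zero-masked).

```python
import numpy as np

def positional_weights(L):
    """Positional weights Gamma (L x L): within the visible (non-future)
    context of utterance t, later positions receive larger weight; future
    positions are zero-masked, matching the I_p(.) identity masking."""
    G = np.zeros((L, L))
    for t in range(L):
        for j in range(t + 1):
            G[t, j] = (j + 1) / (t + 1)   # assumed linear ramp over the past
    return G
```

Any monotonically increasing, future-masked weighting would realize the stated "later handoffs weigh more" behavior.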

Decoder for MHCH and SSA
After the role-selected interaction layer, we can get the outputs M = [m 1 , ..., m L ] and Q = [q 1 , ..., q L ]. Then we adopt separate decoders to predict handoff and satisfaction rating.
In terms of machine-human chatting handoff, the tendency of handoff also depends on the dialogue context. Thus, we feed the outputs of the interaction layer into an LSTM to connect the sequential information flow in the dialogue, where h_t ∈ R^k is the hidden state for u_t. Since there are no dependencies among labels, we simply use a softmax classifier for handoff prediction, where W ∈ R^{|Ψ|×k} and b ∈ R^{|Ψ|}, and ŷ^h_t ∈ R^{|Ψ|} is the predicted handoff probability distribution of u_t.
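The handoff classifier itself is a plain per-utterance softmax; a sketch (the LSTM hidden states h are taken as given):

```python
import numpy as np

def handoff_decoder(h, W, b):
    """Softmax handoff classifier over LSTM hidden states h (L x k).
    W (|Psi| x k) and b (|Psi|,) are the classifier parameters;
    |Psi| = 2 for {transferable, normal}."""
    logits = h @ W.T + b
    e = np.exp(logits - logits.max(axis=1, keepdims=True))  # stable softmax
    return e / e.sum(axis=1, keepdims=True)                 # y^h_t per utterance
```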
For service satisfaction analysis, we first apply a transformer block (Vaswani et al., 2017) to further model the long-range context of the dialogue. Formally, we have Q̃ = Transformer(Q), where Q̃ = {q̃_1, ..., q̃_L | q̃_t ∈ R^k}.
Then we utilize a softmax function for estimating the local satisfaction distribution z_t ∈ R^{|Ω|} of u_t, where W_ξ ∈ R^{|Ω|×k} and b_ξ ∈ R^{|Ω|}. Since only a fraction of customer utterances contribute to the final satisfaction rating, we introduce an attention strategy that enables our model to attend to customer utterances of different importance when merging the local satisfaction distributions. Formally, we measure the importance of each customer utterance with attention weights α ∈ R^L, where W_µ ∈ R^{z×k}, b_µ ∈ R^z, and g ∈ R^z are trainable parameters, z is the number of attention units, and Mask_c denotes the masking function used to reserve customer utterances. g can be perceived as a high-level representation of a fixed query "Which is the critical utterance?". Finally, we obtain the overall satisfaction distribution ŷ^s ∈ R^{|Ω|} as the weighted sum of the local customer satisfaction distributions, where α_t is the t-th weight in α, corresponding to utterance u_t.
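The customer-only attention merge can be sketched as follows. The tanh scoring with query g follows the description; treating Mask_c as a -inf mask before the softmax is an assumption of this sketch.

```python
import numpy as np

def overall_satisfaction(Q_hat, z, g, W_mu, b_mu, is_customer):
    """Merge local satisfaction distributions z (L x |Omega|) into the
    overall distribution y^s, attending only to customer utterances.
    Q_hat (L x k) stands for the transformer outputs q~_t."""
    scores = np.tanh(Q_hat @ W_mu.T + b_mu) @ g                 # (L,)
    scores = np.where(np.asarray(is_customer, bool), scores, -np.inf)
    e = np.exp(scores - scores[np.isfinite(scores)].max())      # agent -> 0
    alpha = e / e.sum()
    return alpha, alpha @ z                                     # y^s in R^{|Omega|}
```

Agent positions get exactly zero weight, so only customer utterances shape ŷ^s, as described.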

Joint Training
The objective functions of MHCH and SSA are the cross-entropy losses over the handoff labels and the dialogue satisfaction labels, respectively. Finally, we minimize the joint cross-entropy loss L, where η ∈ R+ denotes the trade-off parameter, δ denotes the L2 regularization weight, and Θ denotes all trainable parameters of the model. We use backpropagation to compute the gradients of the parameters and update them with the Adam (Kingma and Ba, 2015) optimizer.
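A sketch of the joint objective, assuming the common additive form L = L_MHCH + η·L_SSA + δ·||Θ||² implied by the described trade-off parameter η and L2 weight δ (the exact equations are omitted in this excerpt):

```python
import numpy as np

def joint_loss(yh_true, yh_pred, ys_true, ys_pred, eta, delta, params):
    """Joint cross-entropy objective for one dialogue.

    yh_* : (L x |Psi|) one-hot handoff labels and predicted distributions.
    ys_* : (|Omega|,) one-hot satisfaction label and predicted distribution.
    """
    eps = 1e-12                                   # numerical stability
    l_mhch = -np.mean(np.sum(yh_true * np.log(yh_pred + eps), axis=1))
    l_ssa = -np.sum(ys_true * np.log(ys_pred + eps))
    l2 = sum(float(np.sum(p ** 2)) for p in params)   # ||Theta||^2
    return l_mhch + eta * l_ssa + delta * l2
```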

Dataset and Experimental Settings
Our experiments are conducted on two publicly available Chinese customer service dialogue datasets, namely Clothes and Makeup 2 , collected by Song et al. (2019) from Taobao 3 . Both datasets have service satisfaction ratings from customer feedback and annotated sentiment labels of utterances. Note that the sentiment labels do not participate in our training process and are only used for testing. Meanwhile, we also annotate the transferable/normal labels for both datasets according to the existing specifications (Liu et al., 2021). Two annotators with professional linguistic knowledge participated in the annotation task.
A summary of statistics, including the Kappa values (Snow et al., 2008), for both datasets is given in Table 1. Clothes is a corpus of 10K dialogues in the clothes domain with an imbalanced satisfaction distribution at the dialogue level. Makeup is a corpus of 3,540 dialogues in the makeup domain with a balanced satisfaction distribution at the dialogue level. Note that we do not adopt the original word segmentation. Figure 3 shows the relative handoff position distributions under different satisfaction ratings, where we take explicit-request, negative-emotion, and unsatisfactory-answer handoffs into consideration. It indicates that a handoff in the later phase of the conversation is more likely to accompany a lower service satisfaction rating.
Except for the BERT-based models, all texts are tokenized by a popular Chinese word segmentation utility called jieba 4 . The datasets are partitioned for training, validation, and test with an 80/10/10 split. For the BERT-based methods, we fine-tune the pre-trained model. For the other methods, we apply pre-trained word vectors initially trained on the Clothes and Makeup corpora using CBOW (Mikolov et al., 2013). The dimension of the word embeddings is set to 200. Other trainable model parameters are initialized by sampling values from the Glorot uniform initializer (Glorot and Bengio, 2010). The sizes of the hidden state k, Dense units d, attention units z, and batch size are selected from {32, 64, 128, 256, 512}. The dropout (Srivastava et al., 2014) rate and the loss weight η are selected from (0, 1) by grid search. Finally, we train the models with initial learning rates of 1.5 × 10^-3 for regular baselines and 2 × 10^-5 for BERT-based models. All the methods run on a server configured with a Tesla V100, 32 CPUs, and 32 GB memory.

Baselines
We compare our model with 14 strong dialogue classification baseline models, which come from MHCH, SSA, and other similar tasks.
For DAMI, we adopt the open-sourced code 5 to obtain the results. For DialogueRNN, we adapt the open-sourced code 6 to MHCH, keeping the core component unchanged. For the SSA baselines HAN, MILNET, HMN, and CAMIL, we adopt the reported results from Song et al. (2019). We re-implement the other models. For BERT+LSTM, we adopt the Chinese BERT-base model 7 .

Comparative Study
Following Song et al. (2019), we adopt Macro F1 (Mac. F1) and Accuracy (Acc.) for evaluating the SSA task. For evaluating the MHCH task, we adopt F1, Macro F1 (Mac. F1), and Golden Transfer within Tolerance (GT-T) (Liu et al., 2021). GT-T accounts for the tolerance property of the MHCH task via a tolerance range T, which allows a "biased" prediction within it. The adjustment coefficient λ of GT-T penalizes early or delayed handoff. Likewise, we set λ to 0 and set T to range from 1 to 3, corresponding to GT-I, GT-II, and GT-III. The results of the comparisons are shown in Table 2.

5 https://github.com/WeijiaLau/MHCH-DAMI
6 https://github.com/senticnet/conv-emotion
7 https://github.com/google-research/bert
We can observe that: (1) The proposed method outperforms all state-of-the-art models specific to a single task in terms of all metrics on both datasets. This indicates that our model can effectively capture useful information in both tasks by utilizing role and positional information to explicitly control the interaction between the two tasks. Hence, the performance on the two tasks is boosted mutually.
(2) By integrating MHCH with SSA, the multi-task learning models obtain further improvements. Specifically, we find that the MHCH task has a positive influence on detecting unsatisfied dialogues. Overall, DCR-Net and our model perform better than standalone models on the US F1 of satisfaction prediction. Intuitively, this is mainly because the interaction with handoff reflects local dissatisfaction more comprehensively than sentiment polarity analysis alone, which helps the joint model better identify dissatisfied dialogues for the SSA task.

Figure 4: An example dialogue with predictions and attention distributions. C_i/A_i denotes a Customer/Chatbot utterance, followed by its true labels. The sentiment labels of Customer utterances are given along with the handoff labels. The other columns are the predictions of our model, CAMIL, and DAMI, respectively. The satisfaction ratings of the ground truth and the predictions are in the last row of the table. N/T denotes Normal/Transferable.

Ablation
We perform several ablation tests of our model on the two datasets, and the results are recorded in Table 3. The results demonstrate the effectiveness of the different components of our model.

w/o Interact: We modify the full version of our model by only sharing the parameters of the Utterance and Matching Encoder. The performance degradation demonstrates the effectiveness of modeling the relations between the two tasks with interaction.

w/o Select: We remove the Role-Select mechanism to ignore the role information during the interaction process. The performance degradation indicates that straightforward interaction may bring noisy information into both tasks.

w/o Position: We remove the positional weights in the MHCH to SSA sub-module. It performs well but worse than the Full Model, since the position information provides prior knowledge for controlling the context interaction.

Average, Voting, and Last: Average takes the average of the local satisfaction distributions of customer utterances for classification. Voting directly maps the majority local satisfaction polarity of customer utterances into the satisfaction prediction. Last takes the last customer utterance's satisfaction distribution as the classification result. All three are sub-optimal choices and perform worse than the Full Model. This is because the local satisfaction distributions contribute unequally to the overall satisfaction polarity, and the majority satisfaction polarity does not directly correlate with the overall satisfaction.

Case Study
Figure 4 illustrates our prediction results with an example dialogue, which is translated from Chinese. In this case, three utterances (A_4, C_5, and C_6) are labeled as transferable, and two of them (C_5 and C_6) are labeled as "negative emotion". Among them, A_4 is an unsatisfactory response, which arouses the customer's negative emotions. DAMI only predicts C_5 and C_6 as transferable utterances, whereas our model successfully detects all the transferable utterances. By mapping the local satisfaction distributions of utterances to utterance sentiments, our model is able to predict reasonable sentiment polarities for customer utterances (a detailed analysis is in Subsection 4.6). Considering the context, the customer describes his/her skin problem at C_3 and asks for a recommendation. However, the chatbot does not give any recommendation and returns an irrelevant answer at A_4. We provide the attention distributions of the utterances on the right side of the example dialogue. α^s_5 and α^s_6 are the SSA to MHCH attention distributions of C_5 and C_6; α^m_5 and α^m_6 are the MHCH to SSA attention distributions of C_5 and C_6. We can observe that the attention distributions concentrate on A_4 rather than other utterances, because A_4 is the main cause of the negative emotion and dissatisfaction. This again demonstrates that our model can capture the mutual influence between local satisfaction and handoff, which is useful for prediction. In terms of the final satisfaction rating, although CAMIL correctly predicts the sentiments of the customer utterances, it gives a wrong prediction of the satisfaction rating. Our model correctly predicts the satisfaction rating as Unsatisfied by considering the negative emotions and their cause, the unsatisfactory response. Note that CAMIL predicts utterance sentiments with the supervision of the dialogue's satisfaction labels only during the training process.
Similarly, our satisfaction prediction is based on the estimation of local satisfaction distributions while the utterance sentiment or satisfaction labels are unobserved. To compare and analyze the performance of utterance-level sentiment classification, we map these distributions into sentiments of utterances as the sentiment prediction results according to the distribution polarities, i.e., unsatisfied → negative (NG), met → neutral (NE), well-satisfied → positive (PO).
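The polarity mapping described above is a simple argmax over the local distribution; a sketch (the ordering of Ω in the vector is an assumption):

```python
def to_sentiment(local_dist):
    """Map a local satisfaction distribution over
    (unsatisfied, met, well-satisfied) -- an assumed ordering -- to a
    sentiment label: unsatisfied -> NG, met -> NE, well-satisfied -> PO."""
    labels = ("NG", "NE", "PO")
    return labels[max(range(3), key=lambda i: local_dist[i])]
```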

Results on Sentiment Classification
In Table 4, we compare the sentiment prediction results of MILNET, CAMIL, and our model. On the Clothes dataset, our RSSN performs better than the other baselines, while it performs worse than CAMIL on the Makeup dataset. It is worth noting that our model achieves the best performance on both the Clothes and Makeup datasets in terms of the NG F1 metric. This indicates that the MHCH task is sensitive to negative emotion and contributes more to negative emotion recognition than separate SSA models do. From Table 2, we can also see that our model performs better than separate SSA models in terms of US F1, which is consistent with the findings on sentiment classification.

Conclusions and Future works
In this paper, we propose an innovative multi-task framework for service satisfaction analysis and machine-human chatting handoff, which explicitly establishes the mutual interrelation between the two tasks. Specifically, we propose a Role-Selected Sharing Network for joint handoff prediction and satisfaction estimation, utilizing role and positional information to control knowledge transfer between the tasks. Extensive experiments and analyses reveal that explicitly modeling the interrelation between the two tasks boosts the performance of both. However, our model has not been calibrated to account for user preferences and biases, which we plan to address in future work. Moreover, we will further explore how to adjust the handoff priority with the assistance of personalized information.