Restatement and Question Generation for Counsellor Chatbot

Amidst rising mental health needs in society, virtual agents are increasingly deployed in counselling. In order to give pertinent advice, counsellors must first gain an understanding of the issues at hand by eliciting sharing from the counsellee. It is thus important for the counsellor chatbot to encourage the user to open up and talk. One way to sustain the conversation flow is to acknowledge the counsellee’s key points by restating them, or probing them further with questions. This paper applies models from two closely related NLP tasks — summarization and question generation — to restatement and question generation in the counselling context. We conducted experiments on a manually annotated dataset of Cantonese post-reply pairs on topics related to loneliness, academic anxiety and test anxiety. We obtained the best performance in both restatement and question generation by fine-tuning BertSum, a state-of-the-art summarization model, with the in-domain manual dataset augmented with a large-scale, automatically mined open-domain dataset.


Introduction
Advances in dialog modeling have facilitated chatbot use in many domains (Li et al., 2016; Zhou et al., 2020; Wang et al., 2020a). Chatbots are now also increasingly deployed for mental health assistance, including counselling (Fitzpatrick et al., 2017).
Dialogs in counselling share some common characteristics with those in other domains. Advice generation, for example, can be implemented with a Q&A model that retrieves counselling materials from a knowledge base (Liu et al., 2013; Huang et al., 2015). Empathetic language, i.e., words that reflect the feelings of one's interlocutors, is conducive to establishing rapport with the counsellee. Research in empathetic response generation has led to systems that can recognize the emotional state of the user and generate responses tailored to that state (Lubis et al., 2018; Lin et al., 2019). The counsellor must also encourage the counsellee to open up and talk in order to gain an adequate understanding of the issues at hand. A common strategy to sustain the conversation flow is to use "encouragers" (Ivey and Ivey, 2003), such as backchannel phrases, restatements and questions. A good restatement acknowledges main points from the counsellee by paraphrasing or summarizing them. A helpful question elicits elaboration on a key point and invites collaborative problem solving. Table 1 shows some examples.
This paper focuses on automatic generation of restatements and questions for counselling dialogs. Specifically, it addresses two research questions:

• Text summarization and question generation are NLP tasks that are potentially relevant to the counselling domain. Can we adapt models designed for these tasks to produce high-quality restatements and questions for a counsellor chatbot?

• Dialog data for domain-specific tasks such as counselling is often limited. Can we leverage open-domain dialog data to improve restatement and question generation?
Our experiments compare a number of summarization, question generation and dialog models for the single-turn reply generation task. We obtained the strongest model by fine-tuning BertSum (Liu and Lapata, 2019), a state-of-the-art summarization model, with an in-domain, manually annotated dataset augmented with a large-scale, automatically mined open-domain dataset.
After summarizing previous work (Section 2) and presenting our dataset (Section 3), we describe our approach for restatement and question generation (Section 4). We then report experimental results (Section 5) and conclude (Section 6).

Previous work
While chatbot response generation has exploited models from machine translation (Ritter et al., 2011) and question answering (Liu et al., 2013), there has been less effort in leveraging those from other NLP tasks such as text summarization and question generation. This section reviews research in these two fields.

Text summarization
Text summarization models, which condense an input text into a shorter version, can generate short summaries or headlines (Rush et al., 2015). Pre-trained language models such as BERT (Devlin et al., 2019) have been shown to boost the quality of summarization, among many other NLP tasks. Among the best-performing models is BertSum, which uses a document-level BERT-based encoder to express the semantics of the input document and obtain sentence representations (Liu and Lapata, 2019). Its fine-tuning schedule adopts different optimizers for the encoder and the decoder, and has been shown to improve performance by alleviating the mismatch between them. Compared to replies in open-domain dialogs, a human counsellor's replies tend to be shorter and to reflect the points made by the counsellee. Summarization models can therefore potentially be helpful for generating restatements in the counselling domain. Generic summarization models, however, likely need to be fine-tuned, since restatements are not identical to summaries. In Table 1(c), for instance, the perspective changes from first person to second person ('I'll get a headache' → 'You'll get a headache'); empathetic words are also inserted to reflect the counsellee's emotion ('You worry ...'). To our knowledge, this is the first reported evaluation of applying a summarization model to counselling dialog generation.

Question generation
A question generation model composes a question from an input text. Neural question generation algorithms have recently attained state-of-the-art performance. For example, a sequence-to-sequence model with an attention mechanism has been proposed by Du et al. (2017). Answer separation techniques have further improved question quality (Kim et al., 2019).
Question generation is slightly different in the dialog context: the answer should generally not be found in the input text, i.e., the previous utterances, so that the question does not seem redundant. Question generation models have been deployed to engage users in a conversation (Mostafazadeh et al., 2016), but that research focused on images. Template-based approaches, as exemplified by ELIZA (Weizenbaum, 1983), can also transform the user's statements into questions. These templates are labor-intensive to construct, however, and may not provide sufficient coverage.

Table 2: Statistics of the manual dataset

type               # pairs    post length    reply length
Post-restatement    12,634           40.1             7.9
Post-question        9,036           36.8            11.1

Data
Our data consists of post-reply pairs, a term that will be used henceforth to refer to both post-restatement and post-question pairs. This section describes the construction process of two datasets, which contain in-domain, manually crafted (Section 3.1) and open-domain, automatically mined (Section 3.2) post-reply pairs, respectively.

Manual dataset
We recruited 10 undergraduate students to collect Cantonese social media posts with content concerning loneliness, academic anxiety and test anxiety. For each of the 6,294 posts collected, human annotators marked a text span as the "target phrase", and composed a restatement and/or a question for that phrase. As shown in Table 2, the dataset contains 12,634 post-restatement pairs and 9,036 post-question pairs. There are on average 2.2 gold restatements and 1.6 gold questions per post.

Automatically mined dataset
This dataset was automatically mined from the LCCC dataset (Wang et al., 2020b), which consists of 6.8 million Mandarin dialogs, and from 89K post-reply pairs crawled from Cantonese discussion forums in Hong Kong. We used two methods to generate post-reply pairs:

Extraction. To produce post-restatement pairs, we identified the longest common substring of the post and the reply in each post-reply pair in the open-domain corpora above. We extracted all pairs whose longest common substring contains at least four characters, and used the repeated string in the post as the restatement. To extract post-question pairs, we identified post-reply pairs whose reply starts with a short question, defined as a question mark preceded by no more than 10 characters.
Matching. We identified all posts that contain a text span that matches a target phrase in the manual dataset (Section 3.1). We then reused the restatement and/or question for that target phrase to form a new post-restatement and/or post-question pair.
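The extraction heuristics above can be sketched as follows. This is a minimal illustration, not the authors' code: interpreting "longest common string" as the longest common contiguous substring, matching at the character level, and accepting both full-width and half-width question marks are all assumptions.

```python
def longest_common_substring(a: str, b: str) -> str:
    """Longest common contiguous substring of a and b (character-level DP)."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i, ca in enumerate(a, 1):
        cur = [0] * (len(b) + 1)
        for j, cb in enumerate(b, 1):
            if ca == cb:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best_len:
                    best_len, best_end = cur[j], i
        prev = cur
    return a[best_end - best_len:best_end]

def extract_restatement(post: str, reply: str, min_len: int = 4):
    """Keep the pair only if post and reply share a substring of at least
    min_len characters; the repeated string in the post is the restatement."""
    s = longest_common_substring(post, reply)
    return s if len(s) >= min_len else None

def extract_question(reply: str, max_prefix: int = 10):
    """A reply 'starts with a short question' if its first question mark is
    preceded by no more than max_prefix characters."""
    for i, ch in enumerate(reply):
        if ch in "？?":
            return reply[:i + 1] if i <= max_prefix else None
    return None
```

For example, `extract_question("點解會咁？我都唔知")` keeps the leading question "點解會咁？", while a reply whose first question mark appears only deep into the text is discarded.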

Approach
We first construct and evaluate models for restatement generation and for question generation separately (Section 4.1). We then combine these models to interleave restatements and questions in a counselling dialog (Section 4.2).

Restatement and Question Generation
We focus on generation-based rather than retrieval-based models, in order to tailor restatements and questions specifically to the content in the post. For each of the following approaches, we trained a restatement generation model by fine-tuning the pre-trained model with post-restatement pairs in the manual dataset (Section 3.1); we then separately trained a question generation model in a similar fashion.
DialoGPT We used GPT2 for Chinese chitchat 1, a dialog model that is based on DialoGPT and trained on GPT2-Chinese (Du, 2019). We fine-tuned the pre-trained model with our post-reply pairs (Section 3.1). 2

mT5 Competitive question generation models can be built by fine-tuning the Google T5 model (Pan et al., 2021). Adopting a similar approach with mT5 (Xue et al., 2021), a multilingual variant of T5, we fine-tuned the mT5-base model with our post-reply pairs. 3

BertSum BertSum is a state-of-the-art summarization model (Liu and Lapata, 2019). We used its abstractive variant, which adopts a standard encoder-decoder framework: the encoder is the pre-trained BERT and the decoder is a 6-layer Transformer with random initialization. We fine-tuned its pre-trained bert-base-chinese model with our post-reply pairs. 4

Global Encoding The Global Encoding framework, which has shown competitive results in text summarization, seeks to improve the representations of the source-side information by using global information of the source context (Lin et al., 2018). As above, we fine-tuned the pre-trained model with our post-reply pairs. 5

Oracle Retrieval To gauge the maximum performance of a retrieval-based paradigm, this algorithm selects the highest-scoring reply in the training set in terms of ROUGE-L.
We further fine-tuned the DialoGPT, mT5, BertSum and Global Encoding models with the automatically mined dataset (Section 3.2). The resulting models are denoted as DialoGPT+, mT5+, BertSum+, and Global Encoding+.

Interleaving restatements and questions
A conversation becomes monotonous and even irritating if the counsellor repeatedly gives restatements or asks questions. Using DialoGPT and BertSum+, the two strongest models for question generation (Table 5), we investigated the following methods to choose between a restatement candidate and a question candidate as the reply.
BertSum+ R+Q This model is trained with the same settings as BertSum+ (Section 4.1), except that it is fine-tuned with both post-restatement and post-question pairs.
BertSum+ (threshold) This algorithm responds with a question when the BertSum+ model for questions surpasses a confidence threshold; otherwise, it responds with a restatement. The tuning of the threshold will be described in Section 5.3.

4 We used two Adam optimizers with β1 = 0.9 and β2 = 0.999 for the encoder and the decoder, respectively, with learning rates lrE = 0.002 and lrD = 0.1. All models were trained for 200,000 steps. Model checkpoints were saved and evaluated on the validation set every 2,500 steps, and we selected the best checkpoint based on its loss on the validation set.
5 We used Adam with learning rate 0.0003 and learning rate decay parameter 0.5. We fine-tuned the model for 30 epochs with batch size 64.
BertSum+ (random) This algorithm randomly chooses either the BertSum+ model for restatements or the BertSum+ model for questions.
BertSum+ (ceiling) Designed to measure the maximum performance of BertSum+, this algorithm identifies the subset of posts for which BertSum+ generates the highest-scoring questions in terms of ROUGE-L. It replies to these posts with the generated questions, and to the remainder with restatements.
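The threshold and random policies can be sketched as follows. This is an illustrative reconstruction: how the question model's confidence score is derived (e.g. the decoder's length-normalized log-likelihood) is an assumption, as the paper does not specify it.

```python
import random

def choose_reply(post, gen_restatement, gen_question, threshold,
                 policy="threshold", question_rate=0.271, rng=random):
    """Pick a restatement or a question as the reply to a post.

    gen_restatement / gen_question are callables returning (text, confidence);
    the confidence score's definition is an assumption, not given in the paper.
    """
    if policy == "random":
        # BertSum+ (random): reply with a question for a fixed fraction of posts.
        if rng.random() < question_rate:
            return gen_question(post)[0]
        return gen_restatement(post)[0]
    # BertSum+ (threshold): ask a question only when the question model is confident.
    q_text, q_conf = gen_question(post)
    if q_conf >= threshold:
        return q_text
    return gen_restatement(post)[0]
```

Raising the threshold shifts the policy toward restatements; lowering it yields more questions, which is how the question frequency is controlled in Section 5.3.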

Experimental results
All results are based on 5-fold cross-validation on the manual dataset (Section 3.1). Following previous research, our evaluation metrics include BLEU (Papineni et al., 2002) and ROUGE (Lin, 2004). In addition, we report results with METEOR (Banerjee and Lavie, 2005) and BertScore (Zhang et al., 2019).

Restatement generation
Table 4 shows the results for restatement generation. When fine-tuned on the manual dataset only, DialoGPT yielded a ROUGE-L score of 0.5525, outperforming Global Encoding (0.4031), mT5 (0.4960) and BertSum (0.4938). When augmented with the automatically mined post-restatement pairs, BertSum+ achieved the best ROUGE-L score (0.7142). It also outperformed the other models in terms of BLEU, METEOR and BertScore. In terms of ROUGE-L, it even surpassed Oracle Retrieval (0.6932), which means that the restatements generated by the model were superior to the best available in the training set.
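For reference, ROUGE-L scores candidates by their longest common subsequence with the reference. A minimal implementation is sketched below; the paper presumably used a standard toolkit, and scoring at the character level (natural for unsegmented Cantonese text) is an assumption.

```python
def lcs_len(a: str, b: str) -> int:
    """Length of the longest common subsequence of two character sequences."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ca == cb else max(prev[j], cur[-1]))
        prev = cur
    return prev[-1]

def rouge_l(candidate: str, reference: str, beta: float = 1.0) -> float:
    """ROUGE-L F-measure over characters (character-level tokenization for
    unsegmented Cantonese text is an assumption)."""
    lcs = lcs_len(candidate, reference)
    if lcs == 0:
        return 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    return (1 + beta ** 2) * precision * recall / (recall + beta ** 2 * precision)
```

With multiple gold replies per post (Section 3.1), the score for a post is typically the maximum over the references.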

Question generation
Generally, automatically generated questions have lower ROUGE scores than restatements (Table 5). DialoGPT achieved a ROUGE-L score of only 0.4160, compared to 0.5525 for restatements; it nonetheless outperformed both Global Encoding (0.3766) and BertSum (0.3602). When augmented with the automatically mined dataset, BertSum+ again showed significant gains in performance. It achieved the highest ROUGE-L score (0.4665), followed by Global Encoding+ (0.3990) and DialoGPT+ (0.3848). Although mT5 is designed for question generation, its output scored lower than the other models in ROUGE-L, both when trained without (0.3699) and with the automatically mined data (0.3472).

Interleaving restatements and questions
Since it is more challenging to generate questions than restatements, a fair comparison between the algorithms requires a constant question frequency, i.e., the proportion of posts in the evaluation data to which the chatbot offers a question as response. The BertSum+ R+Q model generated questions 27.1% of the time and restatements 72.9% of the time. 6 We therefore set the confidence threshold for the BertSum+ (threshold) model such that its question frequency would also be 27.1%. We likewise configured the BertSum+ (random) model to randomly choose 27.1% of the posts to reply with questions.

6 The output is considered a question if it achieves a higher ROUGE-L score with the gold output in the post-question pair than with that in the post-restatement pair (Section 3.1).
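One way to set the threshold to match a target question frequency is to take a quantile of the question model's confidence scores on held-out posts. This quantile-based tuning is an illustrative assumption; the paper only states that the threshold was chosen so that the question frequency would be 27.1%.

```python
def calibrate_threshold(confidences, target_rate):
    """Pick a confidence threshold so that roughly target_rate of posts get a
    question: the (1 - target_rate) quantile of the question model's
    confidence scores on held-out posts. Quantile-based tuning is an
    assumption, not the paper's stated procedure."""
    ranked = sorted(confidences, reverse=True)
    k = max(1, round(target_rate * len(ranked)))
    return ranked[k - 1]
```

Posts whose question confidence meets or exceeds the returned value then receive questions at approximately the target rate.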
As shown in Table 6, BertSum+ (threshold) achieved the best performance at 0.7013 ROUGE-L, higher than its random counterpart (0.6730), BertSum+ R+Q (0.6702), as well as DialoGPT (ceiling) (0.5604). It suffered a degradation of only 0.04 in comparison to BertSum+ (ceiling), which picks the optimal posts for question generation. This result suggests the effectiveness of selecting the reply type with a confidence threshold.
One advantage of BertSum+ (threshold) over BertSum+ R+Q is the ease with which question frequency can be adjusted to suit different conversation styles. Figure 1 plots its ROUGE-L score at various question frequencies. Since question generation is more difficult, the score decreases as questions are selected as the reply to a larger proportion of posts. BertSum+ (threshold) outperformed both its random counterpart and DialoGPT (ceiling) at all question frequencies.

Table 6: Model performance on response generation of either restatement or question (the + superscript means the training set includes the automatically generated data)

Figure 1: ROUGE-L score of BertSum+ (threshold), BertSum+ (ceiling), BertSum+ (random) and DialoGPT (ceiling) at various question frequencies.

Conclusion
Restatements and questions are common conversation strategies in counselling. This paper has investigated automatic generation of these two reply types by exploiting models from two closely related NLP tasks, summarization and question generation. We obtained the best generation performance for both reply types by fine-tuning BertSum, a state-of-the-art summarization model, with an in-domain, manually annotated dataset augmented with a large-scale, automatically mined open-domain dataset. We then showed that restatements and questions can be interleaved with a confidence score threshold.
To the best of our knowledge, this is the first reported application of summarization models to chatbot response generation in the counselling domain. We hope that the proposed techniques can improve the quality of counsellor chatbots for the public. Further research is needed to take into account the progress of the counselling session when selecting a reply (Althoff et al., 2016; Zhang and Danescu-Niculescu-Mizil, 2020), and to measure correlation with counselling outcomes.