Retrieve, Discriminate and Rewrite: A Simple and Effective Framework for Obtaining Affective Response in Retrieval-Based Chatbots

Obtaining affective response is a key step in building empathetic dialogue systems. This task has been studied extensively in generation-based chatbots, but the related research in retrieval-based chatbots is still in the early stage. Existing works in retrieval-based chatbots are based on the Retrieve-and-Rerank framework, which has a common problem of satisfying the affect label at the expense of response quality. To address this problem, we propose a simple and effective Retrieve-Discriminate-Rewrite framework. The framework replaces the reranking mechanism with a new discriminate-and-rewrite mechanism, which predicts the affect label of the retrieved high-quality response via a discrimination module and further rewrites the affect-unsatisfied response via a rewriting module. This can not only guarantee the quality of the response, but also satisfy the given affect label. In addition, another challenge for this line of research is the lack of an off-the-shelf affective response dataset. To address this problem and test our proposed framework, we annotate a Sentimental Douban Conversation Corpus based on the original Douban Conversation Corpus. Experimental results show that our proposed framework is effective and outperforms competitive baselines.


Introduction
Expressing affect is a key factor in building human-like dialogue systems, which can significantly promote affective communication and enhance user satisfaction (Prendinger and Ishizuka, 2005; Partala and Surakka, 2004) during human-computer interactions. This problem has been studied extensively in generation-based chatbots (Song et al., 2019; Shen and Feng, 2020), and is usually defined as obtaining an affective response given an affect label and the context of a conversation (Yuan et al., 2020). However, the related research in retrieval-based chatbots is still in the early stage (Qiu et al., 2020). Retrieval-based chatbots have an advantage over generation-based chatbots in obtaining diverse and informative responses, and are also widely used. Therefore, research on obtaining affective response in retrieval-based chatbots is meaningful. In existing studies, affect is regarded as the term that subsumes emotion, feeling and sentiment (Fleckenstein, 1991). In this paper, we focus on sentiment and study how to obtain a response with a specific polarity (positive or negative) in retrieval-based chatbots. Different from generation-based chatbots that can generate new responses, retrieval-based chatbots must obtain responses based on the candidates retrieved from a response repository. Therefore, under the objective of obtaining affective response, how to effectively use the candidates is an important issue. Existing works in retrieval-based chatbots are based on the Retrieve-and-Rerank framework (Lubis et al., 2019; Qiu et al., 2020), which applies a reranking mechanism to the retrieved candidates. Specifically, the framework first obtains the candidates via a retrieval module, then adjusts the ranking or matching score according to the given affect label, and finally outputs a response that is appropriate in both affect and content.
However, the Retrieve-and-Rerank framework is not sufficient, since it satisfies the given affect label at the expense of response quality (Qiu et al., 2020). This means that high-quality but affect-unsatisfied responses will be discarded, which directly reduces the core advantage of retrieval-based chatbots. For example, in Figure 1(a), when the affect label is not considered, high-quality candidate 2 should be the best one, but since the affect label is given, only candidate 3 with ordinary quality can be selected.
To guarantee content and affect at the same time, we propose a simple and effective Retrieve-Discriminate-Rewrite framework. The framework replaces the reranking mechanism with a new discriminate-and-rewrite mechanism, which preferentially selects a high-quality candidate response and rewrites the response whose affect is discriminated to be unsatisfied. For example, in Figure 1(b), our new framework preferentially selects high-quality candidate 2, then discriminates that the affect of candidate 2 is unsatisfied, and finally makes the affect of candidate 2 satisfied with a small amount of modification. This shows that the framework can not only guarantee the quality of the response, but also satisfy the given affect label.
In addition, another challenge for this line of research is the lack of an off-the-shelf affective dataset. Such a dataset can not only be used in our framework, but is also necessary for existing methods which employ the reranking mechanism. To address this problem and test our framework, we annotate a Sentimental Douban Conversation Corpus based on the original Douban Conversation Corpus, which is widely used by many previous works in retrieval-based chatbots. We conduct experiments on this dataset, and experimental results show that our framework with a simple architecture is effective and outperforms competitive baselines.
The contributions of this work are summarized as follows:

• We propose a Retrieve-Discriminate-Rewrite framework for obtaining affective response in retrieval-based chatbots, which solves the problem of low-quality responses in the Retrieve-and-Rerank framework.
• We annotate and publish an affective response dataset, which solves the problem of the lack of necessary dataset in this line of research.
• Experimental results on the dataset show that our framework is effective and outperforms competitive baselines.

Related Work
Existing works for obtaining affective response in dialogue systems can be categorized into two branches. The first category is the generation-based method, which generates an affective response for a given conversation context based on the Seq2Seq model (Shang et al., 2015; Sordoni et al., 2015). The generation-based method has the advantage of generating new responses and has been studied extensively (Song et al., 2019; Shen and Feng, 2020). The second category is the retrieval-based method, which obtains an affective response for a given conversation context based on the candidates retrieved from the response repository. The retrieval-based method has the advantage of obtaining diverse and informative responses (Song et al., 2018), and is still competitive compared to the generation-based method. This paper focuses on the second category. Different from the generation-based method, the related research on the retrieval-based method for obtaining affective response is still in the early stage. Lubis et al. (2019) proposed a reranking strategy for positive emotion elicitation, whose method can also be applied to obtain affective response. Qiu et al. (2020) presented an emotion-aware matching network, which incorporated emotional factors and realized emotional control. From the perspective of using candidates, these methods are all based on the Retrieve-and-Rerank framework. Although these methods can already obtain affective responses, Qiu et al. (2020) observed that these methods prefer responses that satisfy the given affect label, even if they are not relevant to the context, which reduces the core advantage of retrieval-based chatbots. How to balance rich information and the given affect label is still being explored, and in this paper we focus on this problem.

Figure 2: Overview of the Retrieve-Discriminate-Rewrite framework. We use numbers to show the processing flow of our framework. The framework includes three components: retrieval module, discrimination module and rewriting module. The retrieval module is used to retrieve a high-quality response, the discrimination module is used to discriminate the polarity of the response, and the rewriting module is used to correct the polarity of the response from unsatisfied to satisfied. These modules work together to obtain affective responses.
Another branch of research touched on in our work is style transfer in natural language processing. The rewriting mechanism in our framework modifies the polarity of the response, which has been studied in some style transfer works. Some existing works (Shen et al., 2017; Fu et al., 2018; Prabhumoye et al., 2018; Xu et al., 2018) focus on how to get a style-independent sentence representation and then generate a sentence with the target style. These works have certain effectiveness, but they usually lack fine-grained control and cause poor content preservation, which is inconsistent with our goal of fine-grained control of polarities. Meanwhile, some existing works (Sudhakar et al., 2019) focus on how to remove style-related words in the sentence and then generate a sentence with the target style. Inspired by these works, we also regard the polarity rewriting as a similar two-stage process. But different from these works, our polarity rewriting involves a large number of situations from neutral expression to affective expression, not just the transfer between different affective expressions. Failing to handle neutral expressions leads to poor performance in our task, and our handling of this problem is different from these works. In addition, some other existing works (Lample et al., 2019; Dai et al., 2019) have realized style transfer from other perspectives. These works aim at more general style transfer issues and also lack fine-grained control of polarities, which does not match the goal of our work.

Overview
In this work, our goal is to obtain an affective response given an affect label and the context of a dialogue in retrieval-based chatbots. In particular, the affect label we focus on is sentiment polarity (positive or negative).
The problem can be formulated as follows: given a conversation context C = {u_1, u_2, ..., u_N} with N utterances, a response repository P = {r_1, r_2, ..., r_M} related to the context C, and a target polarity s, the objective is to obtain a response Y based on the candidates retrieved from the response repository P, which is not only coherent with the context C but also matches the target polarity s.
For this problem, our Retrieve-Discriminate-Rewrite framework is shown in Figure 2. The framework consists of three components: (1) Retrieval Module: This module is compatible with existing retrieval-based chatbots and provides a high-quality response for the subsequent modules.
(2) Discrimination Module: This module receives the retrieved high-quality response from the retrieval module, which can discriminate the polarity of the retrieved high-quality response and output the response with satisfied polarity.
(3) Rewriting Module: This module receives the response with unsatisfied polarity from the discrimination module, which can correct the polarity of the response from unsatisfied to satisfied.
In the following sections, we will describe these components in detail, and introduce how the framework uses them to obtain affective response.
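The processing flow described above can be sketched in a few lines of code. This is a minimal illustration of the control flow only; the `retrieve`, `classify` and `rewrite` callables are hypothetical stand-ins for the paper's retrieval module, discrimination module and rewriting module, not the actual implementations.

```python
# Minimal sketch of the Retrieve-Discriminate-Rewrite control flow.
# `retrieve`, `classify` and `rewrite` are hypothetical stand-ins for the
# retrieval module, the discrimination module's classifier and the
# two-stage rewriter, respectively.

def obtain_affective_response(context, repository, target_polarity,
                              retrieve, classify, rewrite):
    """Return a response that is high-quality and matches the target polarity."""
    # 1) Retrieval module: select the best candidate by content alone.
    response = retrieve(context, repository)
    # 2) Discrimination module: predict the candidate's polarity.
    if classify(response) == target_polarity:
        return response  # polarity already satisfied, output directly
    # 3) Rewriting module: correct the polarity with a small modification.
    return rewrite(response, target_polarity)
```

Because the best candidate is chosen before the affect label is consulted, a high-quality but affect-unsatisfied response is rewritten rather than discarded, which is the key difference from reranking.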

Retrieval Module
In our framework, the retrieval module is used to be compatible with existing retrieval-based methods. To verify that our framework is universal, we select the following retrieval-based methods to obtain high-quality responses in our framework, and we will conduct experiments based on these methods separately.
GTM This is the Ground Truth model, which always outputs correct responses. We use this ideal model to study the performance of our framework when the retrieval result is perfect.

SMN (Wu et al., 2017) This is a classic work in retrieval-based chatbots, which proposed a sequential matching network to match a response with each utterance on multiple levels of granularity and accumulate the obtained matching vectors with an RNN for the final matching score.
MSN (Yuan et al., 2019) This is a recent work in retrieval-based chatbots, which proposed a multi-hop selector network to alleviate the side effect of using unnecessary context utterances. It is one of the most effective methods recently.

Discrimination Module
In our framework, the discrimination module is used to receive the retrieved high-quality response from the retrieval module, and discriminate the polarity of the response. For the response with satisfied polarity, the module outputs it directly.
Since the module handles a classification task, we can utilize many existing classifiers. In this work, we choose the pre-trained BERT model as our classifier, which has achieved state-of-the-art performance across a variety of NLP tasks.
For the pre-trained BERT model, given a response R = {w_1, w_2, w_3, ..., w_n}, the input can be expressed as: [CLS], w_1, w_2, ..., w_n, [SEP]. Following the usual practice, we use the hidden representation of the [CLS] token to represent the response, and then feed it into a softmax layer for classification.
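As a minimal sketch of this classification head, the encoder itself is abstracted away: `cls_hidden` is an illustrative stand-in for the [CLS] representation produced by BERT, and `W`, `b` are the softmax layer's parameters (all names are assumptions for illustration, not the paper's code).

```python
import numpy as np

# Sketch of the classification head: the [CLS] hidden representation is
# mapped through a linear layer and softmax to polarity probabilities.
# `cls_hidden` stands in for the encoder output; W and b are the head's
# weight matrix and bias.

def softmax(z):
    z = z - z.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def classify_polarity(cls_hidden, W, b,
                      labels=("negative", "neutral", "positive")):
    """Map the [CLS] representation to a polarity label and its probabilities."""
    probs = softmax(W @ cls_hidden + b)
    return labels[int(np.argmax(probs))], probs
```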

Rewriting Module
In our framework, the rewriting module is used to receive the response with unsatisfied polarity from the discrimination module, and correct the polarity of response from unsatisfied to satisfied.
Inspired by previous works in style transfer (Sudhakar et al., 2019), we regard the polarity rewriting of the response as a two-stage process: Delete and Generate. The first Delete stage employs a pretrained sentiment classification model to delete the affective expressions in the response, and the second Generate stage adopts two transformer-based generators to produce a response with satisfied polarity. We introduce the two stages in the following sections.

Delete
In this stage, our goal is to identify and delete the affective expressions in affective responses. For neutral responses, we do nothing at this stage.
Our approach is based on a pretrained sentiment classification model to automatically identify word-level affective expressions. For a sentiment classification model, the affective expressions in a sentence are the key to recognizing the polarity of the sentence. Therefore, an intuitive idea is to measure the importance of different words to sentence sentiment classification, and the most important words should be the key affective expressions.

Specifically, we design a word ranking mechanism for identifying word-level affective expressions in the response. We calculate the importance score I_{w_i} for each word w_i in the response R. The method is to remove the word w_i from the response and compare the target polarity prediction score before and after the deletion, denoted S_e(R_[w_i]) and S_e(R_[w/o w_i]) respectively. The importance score I_{w_i} for each word w_i can be formally defined as follows:

I_{w_i} = S_e(R_[w_i]) - S_e(R_[w/o w_i])

We calculate the importance score for each word and choose the top λ% of words as affective expressions. Then, we delete these affective expressions and send the modified response to the next stage.
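The ranking-and-deletion procedure can be sketched as follows. `target_score` is a hypothetical stand-in for S_e(·), the classifier's target-polarity prediction score (here any callable scoring a word list), and `lam` expresses λ% as a fraction; both names are illustrative.

```python
import math

# Sketch of the Delete stage's word-ranking mechanism.
# I_{w_i} = S_e(R with w_i) - S_e(R without w_i): a word is important
# if removing it lowers the target polarity score.

def importance_scores(words, target_score):
    """Importance of each word: score drop when that word is removed."""
    full = target_score(words)
    return [full - target_score(words[:i] + words[i + 1:])
            for i in range(len(words))]

def delete_affective(words, target_score, lam=0.25):
    """Delete the top lam fraction of words by importance score."""
    scores = importance_scores(words, target_score)
    k = max(1, math.ceil(lam * len(words)))
    top = set(sorted(range(len(words)),
                     key=lambda i: scores[i], reverse=True)[:k])
    return [w for i, w in enumerate(words) if i not in top]
```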

Figure 3: Illustration of the two-stage training process with examples. The Delete stage removes affective expressions (e.g., "That restaurant is awesome." → "That restaurant is _____."); the neutral expression generator completes incomplete neutral responses (e.g., "I went there with my _____." → "I went there with my friends."); and the affective expression generator turns a neutral response into one with the target polarity (e.g., "That restaurant is near the lake." → "That restaurant is awesome.").

Generate
In this stage, our goal is to generate a response with a specific polarity. Different from existing works in style transfer (Sudhakar et al., 2019), our polarity rewriting involves a large number of cases from neutral expression to affective expression, not just the transfer between different affective expressions. An obvious problem is that affective responses contain affective expressions that can be deleted, but neutral responses have no affective expressions to delete. After the Delete stage, although both become neutral, the sentence distributions of the two are obviously different. If only affective responses participate in generation training, the generator will perform poorly on neutral responses.
To address this problem, we propose two generators: neutral expression generator and affective expression generator. We introduce the two generators in the following sections.
Neutral Expression Generator The neutral expression generator is used to complete an incomplete neutral response to a complete neutral response. In the training phase, this generator completes the incomplete neutral responses from the Delete stage, which can provide additional training data for the affective expression generator. Thus, the affective expression generator will receive two types of neutral responses at the same time in generation training, which solves the above problem of inconsistent distribution. The architecture of the generator is consistent with Generative Pre-trained Transformer (GPT) (Radford et al., 2018).
Affective Expression Generator The affective expression generator is used to generate a response with satisfied polarity from an incomplete or complete neutral response. In the training phase, the incomplete neutral response is obtained after the Delete stage, and the complete neutral response is provided by the neutral expression generator. The architecture of the generator is also consistent with GPT, and we add different special symbols to input for different target polarities to distinguish.
Training and Testing To train the two generators, an affective corpus is required, which contains positive, negative and neutral sentences. The training process consists of two stages, which is shown in Figure 3. In training stage 1, we use neutral sentences to train the neutral expression generator. The input is a processed sentence with λ% words deleted randomly, and the target is the original neutral sentence. In training stage 2, we use affective sentences to train the affective expression generator. The input is an affective sentence, which is processed into an incomplete neutral response and a complete neutral response, and the target is the original affective sentence. In the testing stage, the input is an affective or neutral sentence, and the target is a sentence with specified polarity.
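The construction of training pairs for the two stages can be sketched as follows. The function names and the `delete_affective` callable (standing in for the Delete stage) are illustrative assumptions, not the paper's released code.

```python
import random

# Sketch of training-pair construction for the two generators.
# Stage 1: neutral sentences with lam of their words randomly deleted;
#          the target is the original neutral sentence.
# Stage 2: affective sentences processed by the Delete stage;
#          the target is the original affective sentence.

def neutral_generator_pair(neutral_words, lam=0.25, rng=None):
    """Stage 1 pair: (incomplete neutral source, original neutral target)."""
    if rng is None:
        rng = random.Random()
    n = len(neutral_words)
    k = max(1, round(lam * n))
    drop = set(rng.sample(range(n), k))  # random deletion, not importance-based
    source = [w for i, w in enumerate(neutral_words) if i not in drop]
    return source, neutral_words

def affective_generator_pair(affective_words, delete_affective):
    """Stage 2 pair: (Delete-stage output, original affective target)."""
    return delete_affective(affective_words), affective_words
```

In training, the affective expression generator additionally receives the complete neutral responses produced by the stage-1 generator, so that both incomplete and complete neutral inputs are covered.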

Sentimental Douban Conversation Corpus
In this paper, to solve the problem of no off-the-shelf dataset, we annotate the Douban Conversation Corpus (Wu et al., 2017) in terms of sentiment polarity to support the research of obtaining affective response in retrieval-based chatbots.

Douban Conversation Corpus
This dataset contains open domain multi-turn conversations in Chinese, and it is constructed from Douban group, which is a popular social networking service in China. For each dialogue in the training and validation sets, the last turn is taken as a positive response, and another randomly sampled response is taken as a negative response. For each dialogue in the test set, there are 10 candidate responses which are collected by an index system and annotated manually.
The data statistics are shown in Table 1.

Sentiment Annotation
As mentioned previously, there is no off-the-shelf dataset to support this task. Such a dataset can not only be used in our framework, but is also necessary for existing methods which employ the reranking mechanism. To address this problem, we annotate the Douban Conversation Corpus in terms of sentiment polarity, and obtain a new Sentimental Douban Conversation Corpus. Specifically, we give annotation guidelines and examples to three human annotators, who then manually annotate sentiment labels for 1,400 dialogues with a total of 10,712 utterances. We extract 1,000 utterances as a shared annotation part for all annotators, and divide the remaining utterances into 3 parts as the independent annotation part of each annotator. We measure pairwise inter-annotator agreement among the three annotators on the shared part using Cohen's kappa, and their scores are 0.81, 0.79 and 0.80. For the remaining 498,600 dialogues, we train a classifier using the manual annotation data to annotate them automatically. In this work, the manual annotation data is used to train baseline models and our discrimination module, and the automatic annotation data is used to train our rewriting module. Our classifier is a fine-tuned RoBERTa-large model whose pre-training parameters are derived from Chinese RoBERTa (Cui et al., 2020), and it achieves an accuracy of 82.79% and a macro-F1 of 79.18% on the divided test set of the manual annotation data. A summary of statistics for sentiment annotation is shown in Table 2.
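For reference, the pairwise Cohen's kappa used above can be computed as in the following plain-Python sketch (an illustrative implementation, not the exact tooling used for annotation):

```python
from collections import Counter

# Cohen's kappa between two annotators' label sequences:
# kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and
# p_e is the agreement expected by chance from each annotator's label
# distribution.

def cohen_kappa(a, b):
    assert len(a) == len(b) and len(a) > 0
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[l] * cb[l] for l in set(ca) | set(cb)) / (n * n)
    return (p_o - p_e) / (1 - p_e)
```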

Baselines
As mentioned previously, the related research in retrieval-based chatbots is still in the early stage, thus there are very few closely-related baselines. In this paper, we choose two suitable baselines:

Base (w/o. control) This is a basic baseline, which directly outputs the best response matched by the retrieval model without considering the target polarity. This baseline represents the ability of the standard retrieval model to obtain affective responses. Note that this baseline only selects responses based on relevance, so it is a very strong baseline in terms of response content quality.
Reranking (Lubis et al., 2019) This baseline is a reranking strategy, which first uses the retrieval model to perform semantic matching on response candidates, and then reranks them according to whether a response satisfies the target polarity. In our experiments, we use the same classifier as our discrimination module. If there are responses satisfying the target polarity, we output the one with the highest semantic matching score. Otherwise, we directly output the best one without considering the target polarity.
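The selection rule of the Reranking baseline, with the fallback described above, can be sketched as follows. `candidates` is assumed to be a list of (response, matching score) pairs and `classify` a stand-in polarity classifier; both names are illustrative.

```python
# Sketch of the Reranking baseline: prefer the highest-scoring candidate
# that satisfies the target polarity; if none exists, fall back to the
# overall best candidate regardless of polarity.

def rerank(candidates, target_polarity, classify):
    satisfied = [(r, s) for r, s in candidates if classify(r) == target_polarity]
    pool = satisfied if satisfied else candidates  # fallback when none satisfy
    return max(pool, key=lambda rs: rs[1])[0]
```

This rule makes explicit why reranking can sacrifice quality: when any polarity-satisfying candidate exists, it is preferred even if a much better-matching candidate does not satisfy the polarity.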

Evaluation Metrics
In this section, we introduce the metrics to evaluate the performance of our proposed framework.
Inspired by related works in generation-based chatbots (Song et al., 2019; Shen and Feng, 2020), we perform human evaluation to analyze the quality of the responses in terms of content (Con.), fluency (Flu.) and polarity accuracy (Acc.). First, we randomly sample 100 dialogues from the test set. For each dialogue, we require both positive and negative responses. We present the triples of (context, response, polarity) to three human annotators without order, and they evaluate responses on content, fluency and polarity accuracy independently. Content is measured by a 5-scale rating, which is determined by whether a response is coherent and meaningful for the context. Fluency is measured by a 5-scale rating, which is determined by whether a response is fluent and grammatical. Polarity accuracy is measured by a 2-scale rating, which is determined by whether a response satisfies the target polarity. Note that we do not use automatic metrics because they are not applicable to this task; a detailed explanation can be found in Appendix A.

Table 3: Experimental results on the Sentimental Douban Conversation Corpus. The results are divided into different groups according to different retrieval-based methods, and the comparison of different models is performed within the group. "Con.", "Flu." and "Acc." denote content, fluency and polarity accuracy, respectively.

Experimental Settings
The architecture and training process of our framework have been introduced in the previous sections. Further training details and hyperparameter values can be found in Appendix B. Our dataset and the implementation for our model are released at https://github.com/luxinxyz/RDR/.

Overall Results
We compare our proposed framework with the baseline methods, and the experimental results are shown in Table 3. We divide the results based on different retrieval-based models into different groups, and the comparison of the experimental results in each group is fair. From the perspective of content, Base (w/o. control) is the baseline which only considers content without considering polarity, so its content score is the highest among the three methods. Our framework is second only to Base (w/o. control) and is significantly better than Reranking, which preliminarily illustrates the advantages of our framework in content. From the perspective of fluency, our framework is slightly weaker than Base (w/o. control) and Reranking because of the modification of the response, but it is also close to the full score. From the perspective of polarity accuracy, our framework is the best among the three methods, which shows the advantages of our framework in polarity accuracy.
Based on the above results, we can see that our framework can obtain affective response better than the baseline methods, especially on the basis of ensuring polarity accuracy, effectively avoiding the low-quality response problem of the reranking mechanism. In addition, the reproduced results of the retrieval-based methods can be found in Appendix C, and more detailed response examples can be found in Appendix E.

Impact of Affective Candidate Size
We analyze the impact of affective candidate size to further explain the problem of the Retrieve-and-Rerank framework and the advantage of our Retrieve-Discriminate-Rewrite framework. Specifically, we control the ratio of affective responses in the response repository by discarding them to simulate retrieval-based dialogue systems with different levels of affective information, and then plot the performance trends of different methods on content, polarity accuracy and the mean of both after normalization. All experiments are under the MSN settings, and the results are shown in Figure 4. Under normal circumstances, with the increase of affective candidates (i.e., the decrease of the discard ratio) in dialogue systems, the content score should gradually increase, as it does for Base (w/o. control) and our framework. However, the content score of Reranking gradually decreases, which confirms the low-quality response problem of the Retrieve-and-Rerank framework we mentioned in the introduction. From the perspective of polarity accuracy, our framework can always maintain a high level. Finally, considering the overall results of content and polarity accuracy, our framework is better than the other two methods, which proves the effectiveness of our framework.

Impact of Discrimination Module
We analyze the impact of our discrimination module on the final performance. Specifically, we replace the classifier of our discrimination module from BERT with different architectures, such as CNN and BiLSTM, and then explore the relationship between the performance of our discrimination module and the final performance. All experiments are under the GTM settings, and we use macro-F1 to evaluate the classifier performance. As presented in Table 4, the best-performing classifier corresponds to the best final performance, which illustrates the importance of a good discrimination module in our framework.

Analysis of Rewriting Module
We analyze the rewriting module in our framework. Specifically, we reproduce a style transfer model named DeleteRetri to compare with our proposed rewriting module. We choose this model because it also includes the processes of deletion and generation, but has no special design for neutral responses. To verify the ability to process neutral responses, we evaluate the polarity accuracy when the input of these models is affective (Acc-A.) and neutral (Acc-N.) respectively. All experiments are under the GTM settings, and the results are shown in Table 5. From the table, we observe that the content and fluency of the two models are similar, but the polarity accuracy of our rewriter is significantly better. DeleteRetri's performance on neutral input is significantly lower than on affective input, while our rewriter does not have such a problem, which shows the effectiveness of our improvement. We also compare with other style transfer models, and the results can be found in Appendix D.

Conclusion
In this paper, we propose a Retrieve-Discriminate-Rewrite framework for obtaining affective response in retrieval-based chatbots, which solves the problem of low-quality responses in the Retrieve-and-Rerank framework. Our framework contains three components: retrieval module, discrimination module and rewriting module, which can preferentially select a high-quality candidate response and rewrite the response whose affect is discriminated to be unsatisfied. Considering the lack of a necessary dataset in this field, we further annotate and publish a Sentimental Douban Conversation Corpus. The empirical studies show that our framework outperforms competitive baselines, and extensive analyses further prove the effectiveness of our framework.

Ethical Considerations
In this section, we address relevant ethical considerations that were not explicitly discussed in the main body of our paper.
Intended Use The reported technique is intended for building affective chatbots used in daily life. We anticipate that this technique will significantly promote affective communication and enhance user satisfaction during human-computer interactions, which is an enhancement to existing chatbots.
Potential Misuse In some cases, our proposed model may produce effects similar to mental health support. This may mislead users into believing that the model has professional psychotherapeutic capabilities, leading to misuse. In fact, this model is not developed from the perspective of professional psychology applications. Applying it to professional-level mental health support is extremely risky, and in extreme cases it may cause harm to users. We reiterate that the reported technique is only intended for building affective chatbots used in daily life.

Failure Modes
The main failure mode is that the model may learn some bad expressions in the training data which are harmful to users. Based on the consideration of compatibility with existing works, we performed additional annotations on a widely used dataset and trained the model based on it. This dataset is an early classic dataset which does not represent current norms and practices, so there is indeed a possibility of harmful responses (though actually very few), which may involve offensive speech, hate speech, etc. To reduce this risk, one idea is to clean the harmful responses in the dataset, and another is to detect the harmfulness of the results output by the model. Both of these can be achieved based on some recent works on offensive speech detection (Ranasinghe and Zampieri, 2020) or hate speech detection (Vidgen et al., 2021). In addition, to provide an intuitive reference for users of the model, we conducted an empirical evaluation of the harmfulness of the model. Specifically, we randomly sampled 1,000 responses output by the model and asked three human annotators to evaluate whether the responses might make users uncomfortable. The evaluation results show that only 19 responses made users slightly uncomfortable, and no responses made users seriously uncomfortable. This result is not enough to completely eliminate concerns, but in a sense, it shows the actual performance of the model trained on the dataset.

A Additional Description of Evaluation Metrics
Note that we do not use automatic metrics to evaluate affective responses, because existing automatic metrics are not suitable. From the content perspective, automatic metrics such as MAP, MRR, P@1 and R_n@k for retrieval evaluation are not suitable for responses that are not in the response repository, and automatic metrics such as Perplexity, BLEU scores and Embedding scores for generation evaluation are not suitable for responses retrieved from the response repository. Therefore, although we can use automatic metrics to test our framework, we cannot form a valid comparison with the baseline models. From the polarity accuracy perspective, we cannot use a sentiment classifier to evaluate affective responses, because the baseline methods and our framework already rely on sentiment classifiers; in particular, Reranking completely relies on the results of the sentiment classifier. For the above reasons, we only perform human evaluation to analyze the quality of affective responses.

B Training Details
For our retrieval module, we use the open-source codes 1,2 provided by the authors to reproduce SMN (Wu et al., 2017) and MSN (Yuan et al., 2019) with the same settings as the original papers. For our discrimination module, we use the manual annotation data of the dataset for training. We initialize the module with the pre-training parameters provided by Chinese RoBERTa (Cui et al., 2020), and our implementation is based on the PyTorch implementation of BERT-large 3 . We use Adam (Kingma and Ba, 2015) as the optimizer with a learning rate of 1e-5 and a batch size of 16, and use the linear learning rate decay schedule with a warmup proportion of 0.1. We set the maximum number of epochs to 5 and select the model with the best performance on the validation set. The average runtime is 3 hours on a Tesla V100 32GB GPU machine.
For our rewriting module, in the Delete stage, we use the manual annotation data of the dataset for training, and the settings are similar to the discrimination module. In the Generate stage, we use the automatic annotation data of the dataset for training. For our two generators, we use the same structure as the Generative Pre-trained Transformer (GPT) (Radford et al., 2018), and our implementation is based on the PyTorch implementation of GPT 4 . For both models, we use Adam (Kingma and Ba, 2015) as the optimizer with a learning rate of 1e-4 and a batch size of 256, and use the linear learning rate decay schedule with a warmup proportion of 0.1. For the neutral expression generator, we train the model for 60 epochs on 8 Tesla V100 16GB GPU machines, which takes about 40 hours. For the affective expression generator, we set the maximum number of epochs to 10 and select the model with the best performance on the validation set. The average runtime for the generator is 20 hours on a Tesla A100 40GB GPU machine. Based on the model performance on the validation set, λ is set to 25 in the Delete and Generate stages. In the testing phase, we use Nucleus Sampling (Holtzman et al., 2020) with a threshold of 0.9 and a temperature of 0.7 to decode responses.
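As a minimal sketch of this decoding step, the following is an illustrative re-implementation of temperature-scaled nucleus (top-p) sampling over raw next-token logits; it is not the exact decoding code used in the experiments.

```python
import numpy as np

# Nucleus (top-p) sampling with temperature: keep the smallest set of
# highest-probability tokens whose cumulative mass reaches p, renormalize,
# and sample from that set.

def nucleus_sample(logits, p=0.9, temperature=0.7, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    logits = np.asarray(logits, dtype=float) / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]          # tokens by descending probability
    cum = np.cumsum(probs[order])
    keep = order[:int(np.searchsorted(cum, p)) + 1]  # smallest set with mass >= p
    kept = probs[keep] / probs[keep].sum()   # renormalize over the nucleus
    return int(rng.choice(keep, p=kept))
```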

C Retrieval Model Performance
We reproduce some retrieval-based methods as the retrieval module of our proposed framework, and we use automatic metrics such as MAP, MRR, P@1 and R_n@k for retrieval evaluation to evaluate these methods. The results of the retrieval-based methods are shown in Table 6. The results marked by † are from the paper of SMN, and the results marked by ‡ are from the paper of MSN. From the table, we observe that our results reach the level of the original papers. In the above experiments, we directly use these models as our retrieval module.

D Style Transfer Experiment
In addition to DeleteRetri, we also reproduce other style transfer models (Shen et al., 2017; Wang et al., 2019; Yi et al., 2020; Luo et al., 2019) to compare with our proposed rewriting module. All experiments are under the GTM settings, and the results are shown in Table 7. From the table, we observe that the content and fluency of these models are similar, but the polarity accuracy of our rewriting module is the best among these models. The main reason for this result is that our model adopts a rewriting architecture that can achieve fine-grained control of affect, which is more suitable for this task than the other models.

E Sample Affective Responses
We show some examples obtained from the baselines and our framework under the GTM settings. As presented in Table 8, Base (w/o. control) outputs high-quality but polarity-unsatisfied responses in most cases, and Reranking outputs polarity-satisfied but poorly relevant responses in most cases, while our framework can always output polarity-satisfied high-quality responses via the discriminate-and-rewrite mechanism. These cases show that our proposed framework is effective.