MCP: Self-supervised Pre-training for Personalized Chatbots with Multi-level Contrastive Sampling

Personalized chatbots focus on endowing chatbots with a consistent personality so that they behave like real users and can further act as personal assistants. Previous studies have explored generating implicit user profiles from the user's dialogue history for building personalized chatbots. However, these studies only use the response generation loss to train the entire model, which makes them prone to the problem of data sparsity. Besides, they overemphasize the quality of the final generated response while ignoring the correlations and fusion between the utterances in the user's dialogue history, leading to rough data representations and performance degradation. To tackle these problems, we propose MCP, a self-supervised learning framework for capturing better representations from users' dialogue history for personalized chatbots. Specifically, we apply contrastive sampling methods to leverage the supervised signals hidden in user dialogue history and generate pre-training samples for enhancing the model. We design three pre-training tasks based on three types of contrastive pairs from user dialogue history, namely response pairs, sequence augmentation pairs, and user pairs. We pre-train the utterance encoder and the history encoder towards the contrastive objectives and use these pre-trained encoders to generate user profiles during personalized response generation. Experimental results on two real-world datasets show that our proposed model MCP significantly improves over existing methods.


Introduction
In recent years, open-domain chatbots have achieved impressive progress in the natural language processing (NLP) field (Li et al., 2016b; Zhou et al., 2018a; Chan et al., 2019; Liu et al., 2020; Zhu et al., 2021b). Given an input utterance, the chatbot responds with an appropriate response. However, for the same input utterance, chatbots usually provide similar responses to all users. This one-size-fits-all strategy cannot satisfy the varying needs of users and tends to generate safe but meaningless responses, such as "I don't know" (Li et al., 2016a; Zhu et al., 2020). To solve this problem, some researchers have begun to endow chatbots with personality and develop personalized chatbots (Zhang et al., 2018; Ma et al., 2021b; Zhong et al., 2022). When equipped with personal information (either given by a predefined profile or learned from dialogue history), personalized chatbots can generate more user-specific and informative responses.
Although the existing methods for personalized chatbots are capable of generating some personalized responses, two major shortcomings hurt user profile building and response generation. First, like non-personalized chatbots, they only use the single response generation loss to train the entire model. It has been found that such an optimization approach is prone to suffering in data-sparse scenarios, where users may have limited dialogue history in real-world applications (Song et al., 2019b). Second, they overemphasize the quality of the final generated response, while the correlations and fusion between the user's historical responses are not captured in the data representations. Such rough data representations lead to inaccurate user profiles and hurt the generalizability of the model. Recent studies on pre-trained language models have shown that effective data representation is a key factor in improving the performance of various downstream tasks (Kong et al., 2020). Therefore, there is a need to rethink the learning paradigm for developing more effective personalized chatbots.
To address the above issues, we propose to apply a self-supervised learning framework to personalized chatbots. We design three novel pre-training tasks to leverage the correlations and self-supervised signals brought by dialogue history.
Self-supervised learning has achieved great success in various NLP (Devlin et al., 2019; Bansal et al., 2021) and Information Retrieval (Guo et al., 2022; Ma et al., 2021a; Zhu et al., 2021a; Zhou et al., 2021) tasks. It aims to let the model learn the intrinsic structure of the raw data without additional supervised signals. In fact, user dialogue history contains massive supervised signals for capturing personalized information. For example, if a user gave two responses to the same user within a short time, these two responses tend to be semantically relevant, so the model should build similar representations for them. Such supervised signals hidden in dialogue history can help the model build more accurate data representations, and we can construct large numbers of training samples for pre-training the model in a self-supervised manner. Thus, the self-supervised learning framework is well suited to solving the two major issues: rough data representations and data sparsity.
Based on the above observations, we propose MCP, a novel self-supervised learning approach for personalized chatbots, which is a two-stage neural framework with three pre-training tasks based on the dialogue history. To build the user profile from the user's dialogue history, we design two encoders for learning the single-utterance representation and the user profile representation, i.e., an utterance encoder and a history encoder. The utterance encoder aims to learn the semantics and representations of user responses, while the history encoder focuses on encoding the user's historical response sequence and building the user profile representation. We design a two-stage framework for applying self-supervised learning to personalized chatbots. In the first stage, we design three pre-training tasks to generate supervised signals from the dialogue history and pre-train the utterance encoder and the history encoder towards these objectives. We construct three types of contrastive pairs from the user's dialogue history as the pre-training samples, namely response pairs, sequence augmentation pairs, and user pairs. Via such a pre-training method, our model can learn better data representations from the pre-training data and further adapt them to personalized chatbot scenarios. In the second stage, we use the parameters of the pre-trained encoders to initialize the encoders in the personalized encoder-decoder model and use the utterance encoder and the history encoder to build the user's profile representation. The user profiles are fed into the encoder-decoder model to drive the personalized response generation process. To verify the effectiveness of our model, we conduct extensive experiments on two large-scale datasets, i.e., Weibo and Reddit. Experimental results show that MCP achieves state-of-the-art performance compared to a number of competitive methods.
Our contributions can be summarized as follows: (1) We propose a two-stage framework for applying self-supervised learning to train personalized chatbots for better data representation. To the best of our knowledge, this is the first time that pre-training is leveraged on a user's dialogue history for personalized chatbots. (2) We design three pre-training tasks based on the correlations and supervised signals hidden in the user's dialogue history. We construct response pairs, sequence augmentation pairs, and user pairs from different views of the dialogue history for better pre-training of the encoder models. (3) Experimental results on two real-world large-scale datasets show the effectiveness of our proposed model.

Related Work
Open-domain Generation-based Chatbots. Here we briefly introduce some generation-based methods, as they are most relevant to ours. In some early studies, dialogue response generation was considered a statistical machine translation problem (Ritter et al., 2011). Nowadays, sequence-to-sequence (Seq2Seq) models are the mainstream method due to the rapid development of deep learning and neural networks (Serban et al., 2016). Many studies have tried to endow Seq2Seq-based methods with human attributes, such as: (1) introducing additional commonsense knowledge to generate more reasonable responses (Zhou et al., 2018b); (2) tracking the speakers' emotions to make more suitable replies (Zhou et al., 2018a); (3) responding with more diverse sentences to avoid a boring user experience (Li et al., 2016a); and (4) generating responses with personas to make the dialogue more consistent (Qian et al., 2018).
Personalized Chatbots. Building personalized chatbots has attracted public interest in recent years. With a more stable personality, chatbots can have more consistent and informative dialogues with users. Existing methods can be summarized as: (1) Learning with user ID information (Li et al., 2016b; Al-Rfou et al., 2016; Bak and Oh, 2019; Chan et al., 2019). These methods embed user IDs into vectors and use them to guide the dialogue. However, it is impractical to maintain a huge user ID embedding table for real-world large-scale datasets. (2) Learning with explicit user profiles (Qian et al., 2018; Zhang et al., 2018; Olabiyi et al., 2019; Song et al., 2019a). This group of methods aims at using descriptive information (such as persona sentences or attribute tables) to generate personalized responses, but such information is hard to collect and cannot keep track of changes in user interests. (3) Learning with implicit user profiles (Ma et al., 2021b; Zhong et al., 2022). These methods extract personalized information automatically from a user's dialogue history to generate personalized responses.
However, user dialogue history is limited and persona-sparse, making it challenging to learn robust representations of the user's implicit profile. Therefore, we aim to tackle this problem and enhance the representations to generate personalized responses.

Methodology
In this section, we first provide the problem statement and an overview of our model. Then, we elaborate on the details of each component.

Problem Statement and Overview
Consider a set of users U = {u_1, u_2, ..., u_l}. For a specific user u_i, their dialogue history is denoted as H_i = {(q_1^i, r_1^i, t_1^i), ..., (q_n^i, r_n^i, t_n^i)}, where n represents the length of the user's dialogue history and t_k^i represents the response time of r_k^i. Note that q_j^i is a query issued by another user, while r_j^i is the response given by the user u_i. We only use the historical responses, which reflect the personal information of the user u_i, so the history reduces to the response sequence {(r_1^i, t_1^i), ..., (r_n^i, t_n^i)}. With the above notations, our task is defined as generating a personalized response r for the user u_i to reply to a new query q, using the personalized information extracted from the user's dialogue history H_i.
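The notation above can be sketched as a small data structure; `Turn` and `response_history` are illustrative names introduced here, not from the paper:

```python
from collections import namedtuple

# One history entry for user u_i: a query q issued by another user,
# u_i's response r, and the response timestamp t.
Turn = namedtuple("Turn", ["query", "response", "timestamp"])

def response_history(history):
    """Keep only u_i's own responses with their timestamps, since only
    these reflect the user's personal information."""
    return [(turn.response, turn.timestamp) for turn in history]

history = [
    Turn("How was the match?", "We lost again, typical.", 100),
    Turn("Any book tips?", "Try some classic sci-fi.", 250),
]
```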
The overview of our MCP model is shown in Figure 1. For the user dialogue history, we design an utterance encoder to independently represent each utterance as a vector, and a history encoder to model the interactions among utterances and capture their sequential information. The obtained history representation (containing personal information) and the input query are fed into a Transformer-based Seq2Seq structure to generate a personalized response.
As the main contribution of this work, we design three self-supervised pre-training tasks for the utterance encoder and the history encoder. By sampling contrastive samples at the utterance, dialogue history, and user levels, the two encoders are trained to capture more personalized information (such as user interests) from the user dialogue history and create more robust representations. We elaborate on the details of the three pre-training tasks and our model in the following sections.

Utterance-level Contrastive Learning
The first contrastive pre-training task is based on our observation of conversations between two people: consecutive utterances within a short period are usually about one topic. Intuitively, the representations of two utterances under similar topics in the dialogue history should be closer to each other than to other utterances. Following this assumption, we extract contrastive samples from the dataset by the rule: the two utterances should be replies to the same user, and their issue times should be close. Formally, given a user u, we select two triplets (q_i, r_i, t_i) and (q_j, r_j, t_j). Both queries q_i and q_j are issued by the same user u', and the response time interval satisfies |t_i − t_j| ≤ t̂, where t̂ is a time threshold. As a result, r_i and r_j constitute a pair of contrastive samples. We then apply a contrastive learning objective to pull their representations close and push apart the representations of other utterances.
Specifically, we use the utterance encoder to represent the responses r_i and r_j as

    r_i = Mean(UtterEnc(r_i)),   r_j = Mean(UtterEnc(r_j)),

where Mean(·) is the mean pooling operation. The general contrastive learning objective aims at reducing the distance between positive pairs and increasing the distance between negative pairs (Gutmann and Hyvärinen, 2010). In our case, for a positive response representation pair (r_i, r_j) in a mini-batch of N pairs, we consider the responses r⁻ in the other N − 1 pairs as the negative set R⁻. Hence, the loss function can be defined as

    L_utt = − log [ exp(CosSim(r_i, r_j)) / ( exp(CosSim(r_i, r_j)) + Σ_{r⁻ ∈ R⁻} exp(CosSim(r_i, r⁻)) ) ],

where CosSim(·) denotes the cosine similarity.
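The sampling rule and the in-batch contrastive loss described above can be sketched as follows. This is a minimal NumPy version under assumptions of this sketch: the function names are hypothetical, responses are already encoded as vectors, and no temperature term is used (the paper does not specify one).

```python
import numpy as np

def cos_sim(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def sample_response_pairs(triplets, t_hat):
    """From one user's (query_user, response_vec, time) triplets, pick
    response pairs that answer the same user within the time threshold."""
    pairs = []
    for i in range(len(triplets)):
        for j in range(i + 1, len(triplets)):
            u_i, r_i, t_i = triplets[i]
            u_j, r_j, t_j = triplets[j]
            if u_i == u_j and abs(t_i - t_j) <= t_hat:
                pairs.append((r_i, r_j))
    return pairs

def utterance_contrastive_loss(pairs):
    """In-batch contrastive loss over N positive pairs of representations;
    for each anchor, the positives of the other N-1 pairs serve as the
    negative set R^-."""
    n = len(pairs)
    total = 0.0
    for i, (anchor, positive) in enumerate(pairs):
        pos = np.exp(cos_sim(anchor, positive))
        neg = sum(np.exp(cos_sim(anchor, pairs[j][1])) for j in range(n) if j != i)
        total += -np.log(pos / (pos + neg))
    return total / n
```

In practice the representations would come from the utterance encoder and gradients would flow through it; here plain vectors are enough to illustrate the objective.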

History-level Contrastive Learning
In personalized response generation, the main task is to capture personalized information (e.g., the user's interests) from the dialogue history. To enhance the history encoder and obtain a more robust history sequence representation, we apply sequence augmentation to construct different views of the dialogue history. By comparing the original sequence with the augmented one, the history encoder is encouraged to highlight the information that best reflects the user's personality. Specifically, we design the following four strategies: (1) Session masking. When chatting with others, a user often shows the same interests to different people. This suggests that after masking all responses to a single person, the rest of the user's dialogue history can still represent the user's intrinsic interests. Based on this assumption, for each user, we select the responses addressed to one particular user and mask them in the dialogue history sequence. The altered sequence and the original sequence constitute a positive pair. With this task, we can improve the generalizability of the user profiling model.
(2) Sequence random masking. Generally, a user's long-term interests are stable. This inspires us to randomly mask k% of the responses in the sequence to prevent our model from overly focusing on certain responses. The new sequence and the original one constitute a positive pair: the unmasked part of the dialogue sequence does not affect the user's long-term interests. Furthermore, this task enhances the robustness of our model.
(3) Sequence re-ordering. Although the order of the user dialogue history has an effect on the user profile, simply re-ordering two responses from the dialogue history should not affect the user's long-term and stable interests. Thus, we randomly choose pairs of responses addressed to different people and swap their positions in the sequence. The augmented sequence and the original one constitute a positive pair. This task highlights the user's long-term and stable interests.
(4) Short-interval sequence masking. Inspired by research on session-based social recommendation (Zhang et al., 2020b), a pair of consecutive responses with a tight time interval implies more about the user's interests. Thus, we choose pairs of consecutive responses whose time interval satisfies Δt ≤ t̃, where t̃ denotes a time threshold. After simply masking the latter response in each chosen pair, the user profile should remain unchanged.
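The four augmentation strategies can be sketched as pure functions over a toy history. The tuple layout `(response_text, partner_id, timestamp)` and the `[MASK]` placeholder are assumptions of this sketch, not the paper's implementation:

```python
import random

MASK = "[MASK]"  # hypothetical mask token

def session_masking(history, partner):
    """(1) Mask every response addressed to one chosen partner."""
    return [(MASK, p, t) if p == partner else (r, p, t) for r, p, t in history]

def random_masking(history, k_percent, rng):
    """(2) Randomly mask k% of the responses in the sequence."""
    n_mask = max(1, int(len(history) * k_percent / 100))
    idx = set(rng.sample(range(len(history)), n_mask))
    return [(MASK, p, t) if i in idx else (r, p, t)
            for i, (r, p, t) in enumerate(history)]

def reordering(history, i, j):
    """(3) Swap two responses addressed to different partners."""
    assert history[i][1] != history[j][1]
    out = list(history)
    out[i], out[j] = out[j], out[i]
    return out

def short_interval_masking(history, t_tilde):
    """(4) Mask the latter of two consecutive responses whose time gap
    is at most the threshold t_tilde."""
    out = list(history)
    for i in range(1, len(out)):
        if out[i][2] - out[i - 1][2] <= t_tilde:
            r, p, t = out[i]
            out[i] = (MASK, p, t)
    return out
```

Each function returns an augmented view that, paired with the original sequence, would form one positive pair for the history-level objective.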
For a specific user dialogue history sequence [r_1, ..., r_n], we augment it following the aforementioned four strategies and obtain a new sequence [r_1^aug, ..., r_n^aug]. These two sequences are treated as a positive pair and encoded by the history encoder as

    U = HisEnc([r_1; ...; r_n]),   U_aug = HisEnc([r_1^aug; ...; r_n^aug]),

where [;] is the concatenation operation. As the obtained vector contains the user's personalized information, we call it the user profile vector. Thereafter, we get positive augmented pairs (U, U_aug). Similarly, the negative set S⁻ contains the N − 1 augmented sequences sampled from the mini-batch of size N, and the loss is as follows:

    L_his = − log [ exp(CosSim(U, U_aug)) / ( exp(CosSim(U, U_aug)) + Σ_{Ũ ∈ S⁻} exp(CosSim(U, Ũ)) ) ].  (1)

User-level Contrastive Learning
We also notice that user relationships affect the similarity of user profiles. For two users u_i and u_j, the sets of people they respond to are denoted as U'_i and U'_j, respectively. Following research on recommender systems for social chat forums (Ma, 2014), the larger the common group of users two users chat with, the more topics and interests they share. Thus, we only choose pairs that satisfy |U'_i ∩ U'_j| ≥ ŝ, where ŝ denotes a threshold.
After sampling among users, we get a positive pair of similar user profiles (U, U_pos) and a negative set U⁻ consisting of the other user profiles in the mini-batch. The loss for this task is as follows:

    L_user = − log [ exp(CosSim(U, U_pos)) / ( exp(CosSim(U, U_pos)) + Σ_{Ũ ∈ U⁻} exp(CosSim(U, Ũ)) ) ].  (2)
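The user-pair sampling rule above can be sketched as follows, assuming (as our reconstruction does) that the threshold ŝ applies to the size of the intersection of the two users' partner sets:

```python
def similar_user_pairs(respondees, s_hat):
    """Pick user pairs whose sets of chat partners overlap by at least
    s_hat partners (the paper sets s_hat = 1, i.e. at least one common
    partner). `respondees` maps each user id to the set of users they
    have responded to."""
    users = sorted(respondees)
    pairs = []
    for i, u in enumerate(users):
        for v in users[i + 1:]:
            if len(respondees[u] & respondees[v]) >= s_hat:
                pairs.append((u, v))
    return pairs
```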

Personalized Response Generation
We apply an encoder-decoder architecture based on the Transformer (Vaswani et al., 2017) to generate personalized responses. First, the Transformer encoder encodes the user profile U and the current query q together. Then, the decoding module decodes the encoder outputs, followed by a linear layer that projects them into a vocabulary-sized space. The process can be defined as

    h = Encoder([U; q]),  (3)
    ŷ = Linear(Decoder(h)),  (4)

where ŷ ∈ R^v and v is the size of the vocabulary. Then, ŷ is normalized by a softmax layer to obtain the generation probability: ŷ_prob = softmax(ŷ). (5)
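The output head described above can be illustrated with toy sizes: a decoder state is projected by a linear layer into vocabulary-size logits and then normalized into a probability distribution. All dimensions and weights below are made-up values for illustration:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a vector of logits."""
    z = np.exp(x - x.max())
    return z / z.sum()

rng = np.random.default_rng(0)
hidden, vocab = 8, 5                                  # toy sizes
W, b = rng.normal(size=(vocab, hidden)), np.zeros(vocab)  # linear layer
decoder_state = rng.normal(size=hidden)               # stand-in decoder output

y_hat = W @ decoder_state + b   # logits in R^v
y_prob = softmax(y_hat)         # generation probability over the vocabulary
```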

Training and Optimization
At the pre-training stage, our goal is to optimize the sum of the losses of the proposed tasks at the three levels:

    L_pre = L_utt + L_his + L_user.

At the fine-tuning stage, the model is trained to maximize the generation probability of the ground-truth response y = (y_1, ..., y_T), which can be optimized by the cross-entropy loss:

    L_gen = − Σ_{t=1}^{T} log ŷ_prob(y_t).


Experiments

Datasets
Following previous studies (Ma et al., 2021b), we conduct experiments on two datasets from Chinese Weibo (Qian et al., 2021) and English Reddit (Zhang et al., 2020a). Both datasets are collected from open-domain social media platforms, where users can post about various interests and respond to other users they are interested in. Each utterance in a dialogue has its own timestamp and user ID. A training sample contains three parts: (1) a query; (2) the corresponding response; and (3) a dialogue history containing several responses issued before the current response.
Weibo Dataset. Following (Ma et al., 2021b), we use a subset of the PChatbotW dataset (Qian et al., 2021) containing 300K users. The dataset is collected from the Weibo online platform. We pair each response with its corresponding query to form a query-response pair. Moreover, we compare the response timestamps and treat the replies issued before the current response as the dialogue history. We also follow the cleaning method introduced in (Qian et al., 2021). Specifically, we remove hashtags, URLs, emojis, and swear words from sentences. Then, we remove sentences that contain more than 100 words, fewer than 5 words, or multiple languages.
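A cleaning pass of this kind can be sketched as below. The regular expressions and the exact filter rules are assumptions of this sketch; the authors' actual patterns may differ:

```python
import re

URL = re.compile(r"https?://\S+")   # hypothetical URL pattern
HASHTAG = re.compile(r"#\S+")       # hypothetical hashtag pattern

def clean_and_filter(text, min_words=5, max_words=100):
    """Strip URLs and hashtags, then keep the sentence only if its word
    count falls within [min_words, max_words]. Returns (keep, cleaned)."""
    text = HASHTAG.sub("", URL.sub("", text))
    words = text.split()
    return (min_words <= len(words) <= max_words), " ".join(words)
```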
Reddit Dataset. To measure model performance in a different language, we also use an English dataset collected from the social platform Reddit. Given the tree structure of Reddit threads, we treat each pair of a parent node and its child node as a query-response pair. We apply the same cleaning process used for the Weibo dataset. The resulting dataset comprises 315.34K users.

Baseline Methods
We compare our model with four groups of relevant and typical baseline methods: (1) Non-personalized response generation models: Seq2SeqWA (Bahdanau et al., 2015) applies an attention module (Luong et al., 2015) to a GRU-based Seq2Seq model. MMI (Li et al., 2016a) optimizes the maximum mutual information loss to improve the diversity of generated responses.
(2) Personalized models using user IDs: Speaker (Li et al., 2016b) improves the Seq2SeqWA model by incorporating user ID embeddings into the input of the decoder. PersonaWAE (Chan et al., 2019) generates personalized responses with a Wasserstein autoencoder. The user ID embeddings are used to build a Gaussian mixture distribution for personalization.
(3) Personalized models using explicit user profiles: GPMN (Zhang et al., 2018) designs a memory module to encode and store the persona profile to enhance the Seq2Seq model. PerCVAE (Zhao et al., 2017) applies a conditional variational autoencoder to improve the diversity of responses.
(4) Personalized models using implicit user profiles: VHRED-P (Serban et al., 2017) is a multi-turn response generation method that models dependencies in the user's dialogue context. We replace the dialogue context with the user's historical post-response pairs to achieve personalization. ReCoSa-P (Zhang et al., 2019) is also a multi-turn response generation model that measures the context with an attention mechanism. Similarly, we simulate the context through the historical post-response pairs. DHAP (Ma et al., 2021b) constructs a general user profile from the user's response history and establishes a key-value memory network to build a dynamic query-aware user profile. The model then generates personalized responses with a personalized encoder. This is the state-of-the-art method in personalized response generation.

Implementation Details
For each user, we use the first 80% of their dialogue history to construct the training set, the next 10% for the validation set, and the last 10% for the test set. For each response, we use the corresponding query as the current query and the previous responses as the dialogue history. The thresholds t̂, t̃, and ŝ are set to the 25% quantile and the 5% quantile of the time intervals between consecutive responses on the datasets, and 1, respectively. The history length is set to 20, and the hyper-parameter k is set to 30. We initialize the utterance encoder with the parameters of the bert-base-chinese and bert-base-uncased checkpoints, respectively.
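The per-user chronological split can be sketched as follows (the tuple layout and function name are illustrative assumptions):

```python
def chronological_split(history):
    """Per-user split sketch: sort (response, timestamp) pairs by time,
    then take the earliest 80% for training, the next 10% for validation,
    and the last 10% for testing."""
    ordered = sorted(history, key=lambda turn: turn[-1])
    n = len(ordered)
    n_train, n_valid = int(n * 0.8), int(n * 0.1)
    return (ordered[:n_train],
            ordered[n_train:n_train + n_valid],
            ordered[n_train + n_valid:])
```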
The number of Transformer layers is 6. The hidden size and the number of attention heads are set to 768 and 8, respectively. We use beam search with a beam width of 12 to decode words.

Evaluation Metrics
Automatic Evaluation. We introduce three groups of automatic evaluation metrics to measure the performance of our model.
(1) Overlap-based Metrics. We use BLEU-1/2 (Papineni et al., 2002) and ROUGE-L (Lin and Och, 2004) to measure the n-gram overlap between the generated response and the ground-truth response. (2) Embedding Similarity. We employ embedding-based metrics (Chan et al., 2019) to measure the semantic similarity between the generated response and the ground-truth one. (3) Personalization. Following previous studies (Ma et al., 2021b), we employ Persona-F1 (Lian et al., 2019) to measure the unigram F1 between the generated responses and the historical responses, and Persona Coverage (Song et al., 2019a) to calculate an IDF-weighted word overlap score between the generated responses and the dialogue response history.
Human Evaluation. Considering the diversity of utterances in the real world, a response different from the ground truth may also be appropriate. Therefore, we randomly select 100 test triplets (i.e., query, response, and user dialogue history) and manually evaluate the quality of the generated responses in terms of three aspects: readability, informativeness, and personalization, as suggested by Chan et al. (2019). The detailed evaluation criteria are given in Appendix A.
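The unigram F1 idea behind Persona-F1 can be sketched as below. This is a simplified version with naive whitespace tokenization, not the exact metric implementation used in the cited work:

```python
def unigram_f1(generated, reference_words):
    """Unigram F1 between a generated response and a set of reference
    (persona) words: harmonic mean of word-level precision and recall."""
    gen = generated.split()
    overlap = sum(1 for w in gen if w in reference_words)
    if overlap == 0:
        return 0.0
    precision = overlap / len(gen)
    recall = overlap / len(reference_words)
    return 2 * precision * recall / (precision + recall)
```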

Experimental Results
Automatic Evaluation. The metric-based evaluation results are shown in Table 1. We observe that our MCP method achieves the best performance on both datasets in terms of most metrics. The improvement is statistically significant (t-test with p-value < 0.05), which demonstrates the effectiveness of applying self-supervised tasks to pre-train the encoders. We also have the following observations: (1) In general, the models using dialogue history perform better than those using explicit user profiles or user IDs. This indicates that the user's dialogue history contains sufficient personalized information and is a more suitable basis for building personalized chatbots. Furthermore, using any kind of personalized information can improve the generation performance.
Human Evaluation. Table 2 shows the results of the human evaluation on the Weibo dataset. The Fleiss' Kappa is around 0.51, which indicates moderate agreement among the annotators. Specifically, MCP outperforms DHAP by 2.65%/2.87%/5.36% respectively on the three aspects. All these results demonstrate the superiority of MCP in generating more fluent, informative, and personalized responses, and the potential of contrastive pre-training tasks for personalized response generation.
Influence of History Length. Since the personalized information is extracted from the dialogue history, the length of the history influences the final performance. To explore this influence, we conduct an experiment with MCP and two other strong baselines (DHAP and ReCoSa-P) using various numbers of historical responses. The results are shown in Figure 2. We have the following observations: (1) In general, the performance of all three methods increases as more historical responses are used. This is consistent with our assumption that more dialogue history provides more sufficient information. (2) MCP performs better across different lengths of dialogue history, indicating its robustness and generalizability.
(3) When no dialogue history is provided, ReCoSa-P performs much better than both MCP and DHAP. The potential reason is that ReCoSa-P has a more complex structure for encoding the user's input query and generating the response.
Ablation Study. To investigate the effect of the two encoders in MCP, we conduct an ablation study as follows: (1) We remove the utterance encoder (w/o Utter.Enc), and the utterance embedding is replaced by the mean pooling over all word embeddings. (2) We remove the history encoder (w/o His.Enc), and the user profile representation is replaced by the mean pooling over all utterance embeddings. The results are shown in the upper part of Table 3. We can see that both encoders are important for our model, since removing either of them leads to performance degradation. Concretely, the utterance encoder has a greater influence on the final performance. This is because the utterance encoder works at a lower level and is designed to capture fine-grained information from each utterance. If it is removed from the model, the poor utterance representations will also affect the history encoder. In contrast, the history encoder aims at aggregating information from the historical utterances. It has relatively less influence when valuable information has already been refined into the utterance representations.
We further study the effectiveness of our three proposed pre-training tasks; the results are shown in the lower part of Table 3. We denote the model without pre-training at the utterance/history/user level as w/o Pre.Utter./His./User., respectively. It is clear that all three tasks are beneficial to our model. This demonstrates the great potential of contrastive learning for optimizing the representations of utterances and dialogue history.
Effect of Self-supervised Learning. To investigate the effect of our proposed self-supervised learning, we measure the representation differences before and after pre-training.
(1) Utterance Representation Changes. The purpose of the self-supervised pre-training tasks is to reduce the distance between responses sharing the same interests. Thus, we randomly select a user from the Reddit dataset, extract their historical responses, and map the response embeddings to a two-dimensional space through PCA with scaling. As shown in Figure 3, topic-related responses are marked with the same color and categorized into four groups. In the initialized distribution, responses with the same color are dispersed, but they are clearly closer after self-supervised pre-training. Moreover, the boundaries between the four groups are wide and clear, which shows that the utterance encoder is able to sense the topic-related information in each response.
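"PCA with scaling" for this kind of visualization can be sketched as follows: standardize each embedding dimension, then project onto the top-2 principal directions obtained via SVD. The exact preprocessing used by the authors is not specified, so this is one plausible reading:

```python
import numpy as np

def pca_2d(embeddings):
    """Standardize each dimension ('scaling'), then keep the top-2
    principal directions of the centered data via SVD."""
    X = np.asarray(embeddings, dtype=float)
    X = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-8)
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:2].T  # 2-D coordinates, one row per embedding

# toy data standing in for utterance representations
points = pca_2d(np.random.default_rng(0).normal(size=(20, 16)))
```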
(2) History Representation Changes. The history encoder is designed to reduce the distance between the representations of similar users. The ability to distinguish users affects the personalization of the generated responses. Thus, we randomly select 1000 users from the Weibo dataset and calculate the distance between the representations of every pair of users. Figure 4 shows that the initialized model cannot distinguish users well. After training with the self-supervised tasks, the differences between users are magnified, and the distance between similar users is reduced. An interesting finding is that the group of dissimilar users is much larger than the group of similar ones. This may benefit from our user-level contrastive learning task, which samples similar users and reduces the distance between their representations.

Conclusion
In this paper, we proposed the MCP model for personalized response generation. Different from previous personalized work, we first proposed a personalized response generator that consists of an utterance encoder and a history encoder. Next, we designed three self-supervised tasks at three levels to pre-train the two encoders. Experimental results on two real-world datasets confirm the effectiveness of our model in generating informative and personalized responses.

Limitation
Though our method achieves promising results on two real-world datasets, there are still limitations: (1) The architecture of our method is relatively simple compared with the baseline methods. More advanced structures for modeling the dialogue history may bring further improvements. (2) We have noticed that pre-trained language models have been applied to dialogue generation (e.g., DialoGPT (Zhang et al., 2020a)). However, our method is currently based on a standard Transformer without pre-training on large-scale text datasets. Therefore, how to combine our proposed pre-training tasks with existing pre-trained language models remains to be explored.

Ethical Statement
The datasets used in this paper are publicly available online. This paper does not involve any new data collection or release, so there are no privacy issues. The research does not pose any ethical issues. We hired three well-educated, part-time annotators to conduct the human evaluation, paid 100 CNY/hour.

Figure 2: Experiments with different lengths of user dialogue history on the Weibo dataset.

Figure 3: The response representation distribution after PCA with scaling (before and after self-supervised pre-training) of user #60 on the Reddit dataset.

Figure 4: Distribution of similarity between users on the Weibo dataset.

Table 1: The results of automatic evaluations on the Weibo and Reddit datasets. We categorize the baselines into four groups: (1) non-personalized; (2) using user IDs; (3) using explicit user profiles; and (4) using dialogue history. The best results are in bold. "†" indicates that our model achieves a significant improvement over the state-of-the-art result in a paired t-test with p-value < 0.05.

Table 2: Human evaluation results on the Weibo dataset.

Table 3: Ablation study results on the Weibo dataset.