BoB: BERT Over BERT for Training Persona-based Dialogue Models from Limited Personalized Data

Maintaining a consistent persona is essential for dialogue agents. Although tremendous advancements have been made, the limited scale of annotated personalized dialogue datasets remains a barrier to training robust and consistent persona-based dialogue models. This work shows how this challenge can be addressed by disentangling persona-based dialogue generation into two sub-tasks with a novel BERT-over-BERT (BoB) model. Specifically, the model consists of a BERT-based encoder and two BERT-based decoders, where one decoder is for response generation and the other is for consistency understanding. In particular, to learn the ability of consistency understanding from large-scale non-dialogue inference data, we train the second decoder in an unlikelihood manner. Under different limited-data settings, both automatic and human evaluations demonstrate that the proposed model outperforms strong baselines in response quality and persona consistency.


Introduction
Various approaches have been explored to introduce explicit personas in dialogue models (Qian et al., 2018; Zheng et al., 2020). A persona can be defined as a composite of elements of identity, such as profiles and background personal facts. In persona-based dialogues, the generated responses are conditioned not only on the dialogue context but also on some predefined personas, so the presented personality can be more consistent. Existing persona-based dialogue models heavily rely on persona-related dialogue data (Golovanov et al., 2019), such as PersonaChat (Zhang et al., 2018). This kind of crowd-sourced dataset covers rich persona features, namely "persona-dense". Nevertheless, the scale of such crowd-sourced datasets is limited by their expensive annotation costs: two annotators are asked to act the part of a given persona and chat naturally to get to know each other during the conversation. On the other hand, conversations in daily life are not always persona-related. According to Twitter content analysis, less than 10% of messages on Twitter reveal personal anecdotes or activities at home or work, and even fewer contain personally identifiable information (Naaman et al., 2010; Humphreys et al., 2014). As a result, large-scale data collected from social media contain only a limited amount of persona-related dialogues, i.e., they are "persona-sparse". The limited scale of crowd-sourced data and the persona-sparsity in large-scale data present one common challenge: a model trained on limited personalized data cannot sufficiently understand persona consistency. As shown in Figure 1, a 12-layer GPT2 (Radford et al., 2019) finetuned on the PersonaChat dataset still shows a lack of consistency.

Figure 1: Persona: I've a son who is in junior high school. Query: You have any children? GPT-2: No kids. I work at home depot so I'm busy.

* Wei-Nan Zhang is the corresponding author.
After rethinking the essence of persona-based dialogue generation, we find that it requires the dialogue agent to have the capabilities to 1) understand persona-response consistency and 2) generate a persona-related response given the dialogue context. Obviously, an ideal dataset that satisfies both requirements is difficult to annotate. However, once we disentangle persona-based dialogue generation into two sub-tasks, consistency understanding and dialogue generation, it is easy to find abundant data resources for each of them. For consistency understanding, we may leverage large-scale non-dialogue inference data, such as SNLI (Bowman et al., 2015) and MNLI (Williams et al., 2018), as training data. As for dialogue generation, we already have various large-scale persona-sparse datasets.
Inspired by the aforementioned motivation, in this work, we explore learning a consistent persona-based dialogue model from limited personalized dialogues, with the assistance of large-scale non-dialogue inference data. Specifically, the proposed model consists of an encoder E, an auto-regressive decoder D_1 for response generation, and a bidirectional decoder D_2 for consistency understanding. Given personas P and a dialogue query Q, E and D_1 jointly work in an encoder-decoder manner to capture a typical query-to-response mapping F_G(R_1|Q, P) and generate a coarse response representation R_1. Then R_1 and the personas P are fed into the bidirectional decoder D_2, which maps R_1 to the final response representation R_2: F_U(R_2|R_1, P). Since the consistency understanding part F_U(R_2|R_1, P) is independent of the dialogue query Q, it can be learned on non-dialogue inference datasets. Here an unlikelihood training objective (Welleck et al., 2019a) is applied to make contradicted cases in the inference data less likely, so that D_2 can acquire the ability of consistency understanding.
We initialize all modules from BERT (Devlin et al., 2019) and name the proposed model BERT-over-BERT (BoB). To verify the effectiveness of our model, we experiment on two limited data scenarios: 1) a persona-dense scenario (Zhang et al., 2018) with low-resource settings (Zhao et al., 2019), and 2) a persona-sparse scenario (Zheng et al., 2019). Both automatic and human evaluations indicate that our model generalizes well under different settings and outperforms strong baselines on most metrics, especially on persona consistency.
Contributions in this work are three-fold:
• We disentangled the task of persona-based dialogue generation into two sub-tasks: consistency understanding and dialogue generation.
• A BERT-based generative framework, BoB, was proposed for training persona-based dialogue models from limited data.
• An unlikelihood training method with nondialogue inference data was introduced to enhance persona consistency understanding.

Related Work
Persona-based Dialogues Recent studies on persona-based dialogue generation focus on a data-driven manner. They learn persona-related features directly from personalized dialogue datasets, either with implicit persona embeddings (Li et al., 2016b) or with explicit profiles (Qian et al., 2018) and personal facts (Mazaré et al., 2018). Following this research line, more sophisticated neural models have emerged, such as mutual-persona modeling and multi-stage persona-based dialogue generation (Song et al., 2020a). Meanwhile, various pre-training methods have also been applied in this field. Golovanov et al. (2019) show that fine-tuning a pre-trained GPT on the persona-dense dataset can improve the quality of generated responses. Zheng et al. (2020) propose an attention-routing mechanism in a GPT-based model to control the flow of persona information. Lin et al. (2020) explore how to leverage the BERT model for dialogue generation. Different large-scale pretrained chatbots (Madotto et al., 2020) also show their effectiveness on persona-based dialogues.
Disentangled Representation The concept of "disentangling" can be defined as transformations that only change some properties of the underlying model while leaving all other properties invariant (Higgins et al., 2018). The variational autoencoder (Kingma and Welling, 2013) could be regarded as a disentangled representation learning framework, and various methods are built within it (Kim and Mnih, 2018; Locatello et al., 2019).
Unlikelihood Training Likelihood training maximizes the probability of the target sequence, while unlikelihood training corrects known biases by minimizing the probability of negative candidates (Welleck et al., 2019a). Closely related to our work, Li et al. (2020) first explored unlikelihood training for addressing logical contradictions in dialogue. They obtain contradicted dialogues from PersonaChat according to DNLI (Welleck et al., 2019b), a PersonaChat-oriented dialogue inference dataset. Then unlikelihood training is applied to reduce the probability of contradicted responses. Different from Li et al. (2020), with carefully designed decoders, our model can learn from large-scale non-dialogue inference datasets, making it generalizable to different scenarios, such as persona-dense and persona-sparse datasets, as will be seen in our experiments.

Figure 2: (1) The framework of the proposed BoB model, including an encoder (BERT E), a response generation decoder (BERT D_1), and a consistency understanding decoder (BERT D_2). The italics denote the inputs and outputs of each submodule. (2) Transformer attention masks for generation (D_1) and understanding (D_2); a dark square means no attention. (3) Training objectives and the utilized data. NLL denotes negative log-likelihood.

Overview
In this work, our goal is to learn a persona-based dialogue model from limited personalized data. To address the challenges of consistency understanding brought by limited data, we leverage large-scale non-dialogue inference data in our model.
Formally, let Q = q_1, q_2, ..., q_n denote the dialogue query, R = r_1, r_2, ..., r_m denote the target response, and P denote the personas. In addition, let N denote the non-dialogue inference data, where each example consists of a premise, a hypothesis, and their label. The premise and hypothesis are both natural sentences. Note that in the following sections, we use fonts to distinguish between sentences (P, Q, R) and their vector representations (P, Q, R_1, R_2).
The task of the proposed model M is to generate a persona-consistent response R̂ = r̂_1, r̂_2, ..., r̂_m based on both persona P and query Q, i.e., R̂ = M(Q, P). As shown in Figure 2, the proposed model M consists of three BERT-based submodules: an encoder E, a response decoder D_1, and a consistency understanding decoder D_2. More concretely, E encodes the embeddings of persona and query, i.e., P and Q, into hidden states H. D_1 performs cross-attention on H in a typical encoder-decoder manner and generates a coarse representation R_1. D_2 learns consistency understanding from the non-dialogue inference data N and further converts P and R_1 into the final representations R_2. At last, a consistent response R̂ can be generated from R_2.
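The dataflow above can be sketched at the shape level. The following NumPy toy is an illustration only: `encode`, `decode_d1`, and `decode_d2` are stand-ins for the BERT-based E, D_1, and D_2, and the single-head softmax "attention" here omits the learned projections, multiple heads, residuals, and layer norm of the real modules.

```python
import numpy as np

def encode(P_emb, Q_emb):
    # E: bidirectionally encode [persona; query] embeddings into hidden states H
    # (a real BERT encoder would apply N transformer layers here)
    return np.concatenate([P_emb, Q_emb], axis=0)

def attend(q, k, v):
    # plain softmax attention, standing in for multi-head attention
    s = q @ k.T / np.sqrt(q.shape[-1])
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

def decode_d1(H, R_prev_emb):
    # D1: cross-attention over H, producing the coarse representation R1
    return attend(R_prev_emb, H, H)

def decode_d2(P_emb, R1):
    # D2: attends over persona P and fuses it with R1 into R2
    return attend(R1, P_emb, P_emb) + R1

d = 8
P_emb, Q_emb = np.random.rand(4, d), np.random.rand(5, d)
R_prev = np.random.rand(6, d)  # embeddings of the (partial) target response

H = encode(P_emb, Q_emb)       # (9, d)
R1 = decode_d1(H, R_prev)      # (6, d)
R2 = decode_d2(P_emb, R1)      # (6, d)
print(H.shape, R1.shape, R2.shape)
```

The point of the sketch is only the wiring: Q participates in producing R_1, but the R_1-to-R_2 mapping in D_2 never sees Q, which is what makes the disentangling in the next section possible.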

Disentangling
For response generation, a typical persona-based dialogue model needs persona P and dialogue query Q to generate a response. For consistency understanding, a model needs persona P, response R, and the consistency labels between P and R. However, if we entangle generation and understanding, it is not easy to obtain sufficient annotated data that satisfy the format of {P, Q, R, Label}.
Instead, in our model, we design the decoder D_2 to disentangle generation and understanding, where D_2 maps R_1, rather than Q, to R_2. The key to "disentangling" is that we can obtain R_1 without the participation of Q, since R_1 is a representation of the response R. As a result, the mapping from R_1 to R_2 can be independent of Q. In this way, it becomes possible to 1) learn persona-based dialogue generation from {P, Q, R}, i.e., the personalized data, and 2) learn consistency understanding from {P, R, Label}. Moreover, considering the limited amount of such annotated data, we can approximate {P, R, Label} with the abundant non-dialogue inference data N = {Premise, Hypothesis, Label}, where P and R correspond to the Premise and the Hypothesis.
Given data P and R, if D_2 understands persona consistency, it should maximize the likelihood of generating R when R does not contradict P. Otherwise, it should minimize the likelihood of generating R. Motivated by this observation, we apply unlikelihood training to D_2 to make it understand consistency. The detailed training objectives are provided in Sec 3.4.

Encoder
The encoder E works like a standard BERT model, which bidirectionally encodes the input embeddings into a sequence of hidden vectors, on which the downstream tasks are performed.
In our model, the input consists of the persona P and the dialogue query Q. For the persona, whether P consists of personal facts (e.g., "I have two dogs") or profiles (e.g., "location: Seattle"), we can always convert it into a sequence of words. A special token is placed between the persona sequence and the dialogue query, so the input is formatted as:

input = [CLS] P [SEP] Q.

Then the embedding layer converts the input into representations. Following usual practice, the input representations are the sum of the corresponding token, type, and position embeddings, where the type embedding is 0 for the persona and 1 for the query. P and Q can also get their independent representations. The resulting representations are P and Q, which can be jointly denoted as emb = e^p_1, e^p_2, ..., e^q_l, where l is the maximum length of the input.
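The input packing can be illustrated with a minimal sketch. The helper below is ours, not the paper's code: tokenization is naive whitespace splitting, and the `[CLS]`/`[SEP]` markers follow standard BERT conventions; only the segment-type assignment (0 for persona, 1 for query) is taken from the text above.

```python
# Sketch of packing persona and query into one BERT-style input sequence.
def build_input(persona, query, cls="[CLS]", sep="[SEP]"):
    p_toks = persona.split()  # naive whitespace tokenization, for illustration
    q_toks = query.split()
    tokens = [cls] + p_toks + [sep] + q_toks
    # type embedding index: 0 for the persona segment (incl. CLS/SEP), 1 for query
    type_ids = [0] * (len(p_toks) + 2) + [1] * len(q_toks)
    position_ids = list(range(len(tokens)))
    return tokens, type_ids, position_ids

tokens, type_ids, position_ids = build_input("i have two dogs",
                                             "do you like pets ?")
print(tokens)
print(type_ids)
```

The final input representation would then be the element-wise sum of the token, type, and position embeddings looked up from these three id sequences.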
Once we have the input representations, the encoder E performs multi-head attention (Vaswani et al., 2017) on emb to transform the embeddings into a sequence of hidden vectors H. The multi-head attention is denoted as MultiHead(query, key, value), where scaled dot-product attention is performed on the query, key, and value. There are N identical layers in E; for each layer:

h_{i+1} = FNN(MultiHead(h_i, h_i, h_i)),

where h_0 = emb, and FNN is a fully connected feed-forward network containing two linear transformations with a ReLU activation in between. h_N is the final output of encoder E, i.e., H.
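The per-layer computation can be sketched as follows. This is a single-head simplification: real BERT layers use multiple heads with learned projection matrices, plus residual connections and layer normalization, all omitted here for brevity, and the weight matrices are random placeholders.

```python
import numpy as np

def scaled_dot_attention(q, k, v):
    # Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def fnn(x, w1, w2):
    # two linear transformations with a ReLU activation in between
    return np.maximum(x @ w1, 0) @ w2

rng = np.random.default_rng(0)
h = rng.normal(size=(5, 8))                       # h_i: 5 positions, dim 8
w1, w2 = rng.normal(size=(8, 16)), rng.normal(size=(16, 8))

# one (simplified) encoder layer: h_{i+1} = FNN(Attention(h_i, h_i, h_i))
h_next = fnn(scaled_dot_attention(h, h, h), w1, w2)
print(h_next.shape)
```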

Response Generation Decoder
The response generation decoder D_1 is initialized from BERT to inherit its robust language model, but it works in an auto-regressive decoder manner. First, a cross-attention is inserted between E and D_1 to pass the context information. Second, a left-to-right mask is applied to D_1 to preserve the auto-regressive generation property.
As cross-attention does not exist in the BERT model, it is randomly initialized and updated during training. In the cross-attention, the query comes from the previous layer of D_1, and the key and value come from H:

r_1^{i+1} = MultiHead(r_1^i, H, H).

This attention is similar to the typical encoder-decoder attention mechanism in sequence-to-sequence models (Bahdanau et al., 2015), which attends to all positions in the context representations H according to the variations of r_1. In training, r_1^0 is initialized from the embeddings of the target response. At each generation step, future tokens in the target response should not be considered. Therefore, as shown in Figure 2, a left-to-right mask is applied to D_1 to ensure that the predictions depend only on the known outputs.
D_1 also has N identical layers, and the output of the last layer, r_1^N, i.e., R_1, is further fed to D_2.
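The left-to-right mask applied in D_1 is the standard causal (lower-triangular) attention mask; a minimal construction might look like this:

```python
import numpy as np

# Left-to-right (causal) mask: position i may only attend to positions <= i,
# so each prediction depends only on the already-known outputs.
def causal_mask(m):
    # True where attention is allowed, False where it is blocked
    return np.tril(np.ones((m, m), dtype=bool))

mask = causal_mask(4)
print(mask.astype(int))
# [[1 0 0 0]
#  [1 1 0 0]
#  [1 1 1 0]
#  [1 1 1 1]]
```

In practice the blocked positions are filled with a large negative value before the softmax, so their attention weights become (numerically) zero.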

Consistency Understanding Decoder
Like E and D_1, the consistency understanding decoder D_2 is also initialized from BERT, which provides D_2 with a good semantic representation for understanding tasks.
In each layer of D_2, the multi-head attention is performed twice, once over the persona representations P and once over the coarse representations R_1:

r_2^{i+1} = MultiHead(MultiHead(r_2^i, P, P), R_1, R_1).

The resulting r_2^{i+1} in each layer thus fuses information from both P and R_1. The output of the last layer of D_2 is the final representation R_2. With an output layer, e.g., linear layers, on top of R_2, we can obtain the generated response R̂.
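One D_2 layer can be sketched as below. Again this is a single-head simplification with plain softmax attention standing in for the real multi-head version; residuals, layer norm, and learned projections are omitted.

```python
import numpy as np

def attend(q, k, v):
    # plain softmax attention (stand-in for MultiHead)
    s = q @ k.T / np.sqrt(q.shape[-1])
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

def d2_layer(r2, P, R1):
    # attention is performed twice per layer:
    r2 = attend(r2, P, P)    # first over the persona representations P
    r2 = attend(r2, R1, R1)  # then over the coarse representations R1
    return r2                # output fuses information from both P and R1

rng = np.random.default_rng(1)
P, R1 = rng.normal(size=(4, 8)), rng.normal(size=(6, 8))
out = d2_layer(R1.copy(), P, R1)
print(out.shape)
```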

Training Objectives
We employ negative log-likelihood (NLL) loss and unlikelihood loss for dialogue generation and consistency understanding. A brief illustration is shown in the last column of Figure 2 and detailed descriptions will be provided in this section.
Response Generation In our model, the widely adopted negative log-likelihood loss is applied during training. E and D_1 read the persona P and the dialogue query Q to predict the target response R, which yields the raw representations R_1:

L_{D_1}^{NLL} = -Σ_i log p_θ(r_i | P, Q, R_{<i}).
The generation part of D_2 is also trained with NLL. D_2 reads the persona embeddings P and the raw representations R_1 to predict the target response R:

L_{D_2}^{NLL} = -Σ_i log p_γ(r_i | P, R_1, R_{<i}).
Unlikelihood Training Given a large-scale non-dialogue inference dataset, we collect positive data D+ from the entailed category and negative data D− from the contradicted category, where each example is a premise-hypothesis pair (P̃, R̃), and their representations in our model are denoted as P̃ and R̃. For data from D+, we still apply the NLL loss:

L_{D_2^+}^{NLL} = -Σ_i log p_γ(r̃_i | P̃, R̃_{<i}).

For data from D−, we apply the unlikelihood objective to minimize the likelihood of contradictions:

L_{D_2^-}^{UL} = -Σ_i log(1 - p_γ(r̃_i | P̃, R̃_{<i})),

which penalizes every token in the contradicted target. Therefore, the loss L_{D_2^-}^{UL} makes generating contradicted responses less likely.
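The difference between the two objectives is easiest to see on a handful of token probabilities. The sketch below assumes we already have the model's probability for each target token; the functions are illustrative, not the paper's implementation.

```python
import numpy as np

def nll_loss(token_probs):
    # standard NLL for entailed pairs (D+): push token probabilities up
    return -np.sum(np.log(token_probs))

def unlikelihood_loss(token_probs):
    # token-level unlikelihood for contradicted pairs (D-):
    # penalizes every token of the contradicted target
    return -np.sum(np.log(1.0 - token_probs))

# model probabilities assigned to three tokens of some target sequence
probs = np.array([0.6, 0.7, 0.5])
print(round(nll_loss(probs), 4))           # 1.5606
print(round(unlikelihood_loss(probs), 4))  # 2.8134
```

Note the opposite gradients: raising a token's probability lowers the NLL term but raises the unlikelihood term, which is exactly what drives contradicted continuations to become less probable.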
Training Procedure The training steps can be summarized as follows: 1) Response Generation. Given P, Q, and R from the personalized dialogue data, we calculate the response generation loss L_1 = L_{D_1}^{NLL} + α L_{D_2}^{NLL}; 2) Consistency Understanding. Given D+ and D− from the non-dialogue inference data, we calculate the unlikelihood loss L_2 = β(L_{D_2^+}^{NLL} + L_{D_2^-}^{UL}); 3) Optimization. Sum up L_1 and L_2 and update parameters with back-propagation. We initialize our model from the publicly available BERT base model, with 12 layers and a hidden size of 768. We employ an Adam optimizer with a learning rate varying from 5e-6 to 5e-5. Empirically, we set α to 5e-3 and β to 0.1. The training of the proposed model was done on an Nvidia Tesla V100 32G GPU. For other details, please refer to the released project.
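One optimization step might be wired up as below. All model calls are placeholders (we pass in precomputed scalar losses), and the placement of β over the summed D+/D− terms is an assumption on our part, since the text gives only the weights α and β, not the exact grouping.

```python
# Sketch of combining the two losses for one back-propagation step.
alpha, beta = 5e-3, 0.1  # weights from the text

def training_step(gen_losses, ul_losses):
    # gen_losses: (L_D1_NLL, L_D2_NLL) computed on personalized dialogue data
    # ul_losses:  (L_D2+_NLL, L_D2-_UL) computed on non-dialogue inference data
    l_d1, l_d2 = gen_losses
    L1 = l_d1 + alpha * l_d2     # response generation loss
    L2 = beta * sum(ul_losses)   # consistency understanding loss
                                 # (beta over the sum is our assumption)
    return L1 + L2               # total loss to back-propagate

total = training_step((2.0, 1.0), (0.5, 0.3))
print(round(total, 4))  # 2.085
```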

Datasets
To evaluate the performance of the proposed model, we carried out persona-based dialogue generation experiments in a persona-dense scenario and a persona-sparse scenario with two publicly available datasets: • PersonaChat (Zhang et al., 2018) is a crowd-sourced dataset covering rich persona features. The dialogues in this dataset are grounded on specific personal facts. Here we use the ConvAI2 PersonaChat (Dinan et al., 2019), so the results are comparable to existing methods.
• PersonalDialog (Zheng et al., 2019) is a large-scale persona-sparse dataset, which is collected from Chinese social media Weibo. This dataset provides persona profiles and dialogues, but the majority of the dialogues are not persona-related. Two testsets are provided: a random testset, which is identically distributed as the training data, and a biased testset, which is manually selected to cover persona-related features.
We summarize the key statistics of the two personalized dialogue datasets in Table 1. As aforementioned, we leverage non-dialogue inference data to address the consistency understanding issue brought by limited personalized data. Here we use the non-dialogue inference dataset MNLI (Williams et al., 2018) and its Chinese version CMNLI as our auxiliary data. Moreover, to better compare models' performance on persona consistency, we leverage two dialogue inference datasets, DNLI (Welleck et al., 2019b) and KvPI (Song et al., 2020b), for evaluation. The statistics of these inference datasets are summarized in Table 2.

Compared Methods
The following models, including both non-pretrained and pretrained ones, are compared in the experiments.
Baselines. The vanilla Transformer (Vaswani et al., 2017) is employed as the baseline for the experiments on both PersonaChat and PersonalDialog. Personas are concatenated to the dialogue queries. Non-Pretrained Models. Meta-learning has recently been explored for addressing the limited personalized data issue. CMAML (Song et al., 2020c) is a meta-learning based method that learns from few-shot personas by customizing the model structures. Besides the meta-learning methods, GDR (Song et al., 2020a) introduces inference ability on PersonaChat with a generate-refine framework. However, these two models are elaborately designed for the persona-dense dataset and are not applicable to the persona-sparse scenario. Thus we only employ them in the experiments on PersonaChat.
Pre-training Models. In the ConvAI2 challenge (Dinan et al., 2019), which uses PersonaChat as the competition dataset, LIC (Golovanov et al., 2019) is the best performing model. Thus we compare against this model in the experiments on both PersonaChat and PersonalDialog. AttentionRouting (Zheng et al., 2020) is a pre-training method specially designed for the persona-sparse dataset, and it is also the latest model on PersonalDialog. We also finetune a GPT2 (Radford et al., 2019) for a thorough comparison on PersonaChat.

Evaluation Metrics
We focus on two main aspects of the persona-based dialogues: response quality and persona consistency. To compare different models, we employ both automatic metrics and human evaluations.
For persona consistency, we employ two metrics. The first is the Consistency Score (C.Score) (Madotto et al., 2019), which leverages a referee model to predict consistency and can be defined as:

C.Score(r) = Σ_i NLI(r, p_i),

where NLI(r, p_i) denotes the referee model's judgment of response r against persona sentence p_i.
The second metric is Delta Perplexity (∆P), which evaluates consistency from the model's internal distributions. Li et al. (2020) first calculate the perplexity of entailed (p.Ent) and contradicted (p.Ctd) dialogues in the inference dataset. A dialogue model with good understanding ability should assign lower perplexity to the entailed dialogues and higher perplexity to the contradictions. From this intuition, ∆P can be defined as:

∆P = p.Ctd − p.Ent,

where a larger ∆P means the model has a better ability to distinguish entailment from contradiction. In our experiments, we obtain entailed and contradicted {persona, query, response} tuples from the dialogue inference datasets DNLI and KvPI.
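The two automatic consistency metrics can be sketched as follows. The NLI referee is replaced here by precomputed labels, and perplexity is computed from per-token negative log-likelihoods in the standard way; both helper functions are our illustrative stand-ins.

```python
import numpy as np

def c_score(nli_labels):
    # C.Score(r) = sum_i NLI(r, p_i), where nli_labels holds the referee
    # model's prediction for the response against each persona sentence
    return sum(nli_labels)

def perplexity(token_nll):
    # PPL = exp(mean per-token negative log-likelihood)
    return float(np.exp(np.mean(token_nll)))

def delta_p(entailed_nll, contradicted_nll):
    # Delta P = p.Ctd - p.Ent; larger is better at telling the two apart
    return perplexity(contradicted_nll) - perplexity(entailed_nll)

print(c_score([1, 0, 1]))  # 2
print(round(delta_p([0.5, 0.6], [1.2, 1.4]), 4))
```

A model that assigns noticeably higher NLL (hence perplexity) to contradicted tuples than to entailed ones yields a large positive ∆P, matching the intuition in the text.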
Human Evaluations We recruited two teams (one for English and another for Chinese), each consisting of five professional annotators, from a third-party company. These annotators are proficient in language tasks but know nothing about the models. We sample 100 {persona, query, response} tuples for the evaluation of each model under every setting.
Human annotators are asked to evaluate dialogue quality on three conventional criteria: fluency (Flue.), informativeness (Info.), and relevance (Relv.). Each criterion is rated on a five-point scale, where 1, 3, and 5 indicate unacceptable, moderate, and perfect performance, respectively. The annotators are also instructed to label the consistency (Per.C.) between persona and response, where 1 means persona-related and consistent, 0 means irrelevant, and -1 means contradicted.

Persona-Dense Results
Full PersonaChat We first report the experimental results on the full PersonaChat in Table 3. Our method achieves better performance consistently across all automatic and human evaluation metrics, which shows the effectiveness of our model. Among all the metrics, our model obtains significant improvements on PPL and ∆P. The lowest testset PPL means our model has learned a good language model that fits this dataset. Moreover, the highest ∆P shows that our model can more effectively distinguish entailment from contradiction than the baselines, which indicates that our model has a better understanding of persona consistency.
Less Personalized Data Given that our model outperforms the baselines by a large margin on the full PersonaChat dataset, we further test it by simulating a low-resource scenario (Zhao et al., 2019), where we gradually reduce the number of training examples by repeatedly halving the training set. We report the results under the low-resource settings in Table 4.
As we can see, our model outperforms most of the baselines' best results even when using only 1/8 of the training data. The performance gains largely benefit from the powerful language model of the BERT backbone. Furthermore, due to the disentangling of generation and understanding, our model presents a stable performance on ∆P regardless of the size of the training set. This is in line with our expectations because the proposed model learns consistency understanding from the non-dialogue inference data rather than the persona-dense dialogue data. We observe that the method also improves fluency and informativeness. This is mainly due to the introduction of the non-dialogue inference data in the training procedure, which potentially enriches the dialogue language model.

Validations on Persona-Sparse
We further validate our model in a persona-sparse scenario. To get a more intuitive understanding of "sparsity", we recruited the same annotation team to annotate whether the dataset response is persona-related in the sampled random and biased test data. The results show that only 1% of the responses are persona-related in the random test data and 28% in the biased test data. We calculate the Fleiss' Kappa among the five annotators and obtain a kappa of 0.774, which indicates substantial agreement (Landis and Koch, 1977). We report the evaluation results on both the random and biased testsets in Table 5.
On the random test set, the experimental results show that our model has some advantages over other methods, but no method consistently outperforms the others. One possible reason is that the task degenerates into ordinary dialogue generation on the random test set, so our model's advantages cannot be effectively leveraged. In contrast, on the biased test set, our model achieves the best performance on most metrics. The good performance on the C.Score and Per.C. metrics indicates that our model can be effectively trained from a dataset with limited personalized dialogues.

Analysis and Ablation Study
In addition to the good performance of the BoB model, we are also curious about Q1: what is the key to the BoB model's understanding ability? Q2: can pre-trained models understand persona consistency just by finetuning on personalized dialogues? And Q3: does the extremely low PPL come from the initialization with BERT or from the architecture of the proposed BoB model? To answer these questions, we ablate the BoB model in three ways: 1) w/o UL removes the unlikelihood objective; 2) E+D_1 removes the unlikelihood objective and the second decoder D_2; 3) E removes the unlikelihood objective and both decoders and thus degenerates into a vanilla BERT model. We report the ablation results on PersonalDialog in Table 5 and on the full PersonaChat in Table 6. From these results: Answer to Q1: The key to our model's understanding ability is the unlikelihood training. During training, our model assigns large perplexity to contradictions. During generation, non-contradicted responses are more likely to be generated since they incur much smaller losses. Table 7 shows an example. As presented in the results, after removing the unlikelihood objective, all ablated models suffer significant performance degradations in consistency-related metrics, such as Per.C. and ∆P.
Table 7: A case from PersonaChat.
Persona: I've a son who is in junior high school.
Query: You have any children?
GPT2: No kids. I work at home depot so I'm busy.
Ours: Yes, I have a son in the 8th grade.

Answer to Q2: Pretrained models barely understand consistency from personalized dialogues. According to the poor performance on ∆P, the three BERT-based ablated models can hardly distinguish contradiction from entailment. Although their Per.C. metric still looks good, it may come from merely mimicking and copying words rather than understanding. A similar phenomenon also occurs with the pre-trained GPT2, as shown in Table 3. It is this phenomenon that motivated us to introduce unlikelihood training into the BoB model.
Answer to Q3: D_2 in the BoB architecture contributes most to the low PPL. As shown in both datasets' ablation results, the PPL degrades the most after removing D_2. We can also see an apparent gap in PPL between the models with D_2 and the vanilla BERT. Nevertheless, the BERT model still offers a good initialization that helps the BoB model achieve the best performance on different metrics.

Reproducibility
The implementation for the BoB model is released at https://github.com/songhaoyu/BoB.

Conclusions
In this work, we propose a novel BERT-based dialogue model that learns from limited personalized data by disentangling response generation and consistency understanding. Unlikelihood training with non-dialogue inference data is introduced to enhance the model's understanding ability. Experiments on two publicly available datasets demonstrate that our model can be trained with limited personalized dialogue data while still obtaining significant improvements over strong methods.