Transferable Persona-Grounded Dialogues via Grounded Minimal Edits

Grounded dialogue models generate responses that are grounded on certain concepts. Limited by the distribution of grounded dialogue data, models trained on such data face transferability challenges in terms of the data distribution and the type of grounded concepts. To address the challenges, we propose the grounded minimal editing framework, which minimally edits existing responses to be grounded on the given concept. Focusing on personas, we propose Grounded Minimal Editor (GME), which learns to edit by disentangling and recombining persona-related and persona-agnostic parts of the response. To evaluate persona-grounded minimal editing, we present the PersonaMinEdit dataset, and experimental results show that GME outperforms competitive baselines by a large margin. To evaluate the transferability, we experiment on the test set of BlendedSkillTalk and show that GME can edit dialogue models’ responses to largely improve their persona consistency while preserving the use of knowledge and empathy.


Introduction
Grounding dialogue agents on external information is important for building engaging conversational AI systems (Huang et al., 2020). Along this track, various datasets and models have been proposed to ground dialogues on personas, knowledge (Dinan et al., 2019), emotions (Zhou et al., 2018a), and images.
Generally, grounded dialogue modeling trains a dialogue model on a dataset $D$ that consists of triples $(c, r, g)$, where $c$ is the dialogue history, $r$ is the response, and $g$ is the grounded concept. The model is generally optimized using maximum likelihood estimation (MLE), i.e.,
$$\arg\max_{\theta} \; \mathbb{E}_{(c,r,g)\sim D} \log P_{\theta}(r \mid c, g). \quad (1)$$
Despite its effectiveness, this formulation faces two challenges regarding transferability. On one hand, grounded dialogue datasets are usually collected under a guided setting, e.g., annotators are usually encouraged to embed persona or knowledge (Dinan et al., 2019) into responses, which leads to a distributional gap between the conversations in a grounded dialogue dataset and natural conversations. As a result, models trained with Eq. (1) may generate unnatural responses and are vulnerable to the distributional shift of the dialogue history. On the other hand, at inference time, models trained with Eq. (1) cannot be grounded on unseen types of concepts $g'$ other than $g$. An example of such a grounding gap is that a model trained on PERSONACHAT with Eq. (1) cannot be grounded on world knowledge.
To address the above transferability challenges, we propose a grounded minimal editing framework for grounded dialogue modeling. Instead of learning a grounded response generator as is done in Eq. (1), we propose to learn a grounded minimal editor that operates on existing responses. Specifically, suppose we have an original response $r^o$ that is coherent with the dialogue history $c$ but is not grounded on the concept $g$. Our goal is to minimally edit $r^o$ such that it is grounded on the concept $g$ and coherent with the dialogue history $c$. Original responses can be generated by dialogue models trained on natural conversation data and grounded on other concepts $g'$, or even produced by humans; thus, they do not suffer from the distributional gap and grounding gap. Moreover, minimal editing guarantees that the distribution of the edited responses is similar to that of the original responses, which do not suffer from the two gaps. Note that collecting paired responses before and after editing is resource-consuming; thus, our goal is to learn the editing without paired data.
In this paper, we explore persona-grounded minimal editing, as demonstrated in Figure 1. We propose Grounded Minimal Editor (GME), which is trained on persona-grounded dialogue data. Specifically, response templates are sampled by corrupting persona-related spans and sentences based on gradient-based attribution and word overlap. By denoising the templates, GME disentangles and recombines persona-related and persona-agnostic expressions. Since the personas of original responses are not observed at inference, we train a classifier for template generation at inference.
Two research questions are investigated in this paper: Q1) Is the proposed GME model effective for grounded minimal editing? Q2) Does our framework address the transferability challenges (more specifically, the distributional gap and the grounding gap)? For Q1, we build PERSONAMINEDIT, a new dataset derived from PERSONACHAT with multiple human references for the edited response. Automatic and human evaluations show that GME outperforms competitive baselines and behaves most similarly to human references. For Q2, we evaluate GME on the test set of BLENDEDSKILLTALK (Smith et al., 2020), whose data distribution and grounded concepts are different from those of PERSONACHAT, which requires GME to be transferable. We observe that GME improves the persona consistency of responses generated by pretrained Blender-90M models (Roller et al., 2020), while preserving the use of knowledge and empathy. Results also show that GME-edited responses largely outperform TransferTransfo, which is trained in the canonical way as in Eq. (1).
Our contributions include:
• We propose a framework named grounded minimal editing to address the transferability challenges of grounded dialogue modeling.
• We propose Grounded Minimal Editor (GME) and present the PERSONAMINEDIT dataset to evaluate GME's effectiveness for persona-grounded minimal editing.
• Experimental results show that GME largely outperforms strong baselines on the PERSONAMINEDIT dataset. GME is also transferable to edit other models' outputs and improve the persona consistency while preserving their use of knowledge and empathy.

Related Work
Recent work leveraged grounded information in dialogue agents to chat engagingly, e.g., using knowledge (Zhou et al., 2018b), emotions (Zhou et al., 2018a), personas, and images. For persona grounding (Li et al., 2016), transfer learning methods (Golovanov et al., 2019) and latent variable models (Song et al., 2019; Chan et al., 2019) have shown promising results. Further, the persona consistency issue (Kim et al., 2020; Nie et al., 2020) and persona-augmented empathetic agents (Zhong et al., 2020) have also been explored. As discussed in Section 1, existing methods generally adopt the MLE objective in Eq. (1) and suffer from two transferability challenges, i.e., the distributional gap and the grounding gap, which are addressed by the proposed grounded minimal editing framework.
The idea of editing existing responses has been explored, e.g., the deliberation network (Xia et al., 2017), two-pass response generation (Song et al., 2020), and retrieval-augmented dialogue modeling (Pandey et al., 2018; Wu et al., 2019b; Gu et al., 2019; Cai et al., 2019). This paper differs from these works in two respects. 1) Regarding the formulation, we emphasize minimal editing, while previous works do not. As analyzed in Section 1, minimal editing is an important component for addressing the transferability challenges. 2) Regarding the training algorithm, previous works derive templates from self-generated or retrieved texts, while our model derives templates from the observed responses.
In the large persona space, the persona sentences at test time are never seen during training. Further, when generating masked templates, the personas of the original responses are unobserved in our study.

Formulation
We provide a formulation of the proposed framework. Grounded dialogue modeling uses a dataset $D$ that consists of triples $(c, r, g)$, where $c$, $r$, and $g$ are the dialogue history, the response, and the grounded concept, which are shown in grey in the left part of Figure 2. To formulate the term "minimal", we need to add unobserved variables into the graphical model, denoted as $u$ in Figure 2, which covers all unobserved variables. The graph states that $r = f(c, g, u)$. As shown in the right part of Figure 2, we observe $(c, r^o, g^e)$ at inference time, where $r^o$ and $g^e$ stand for the original response and the grounded concept for editing. The graph states that the original response $r^o = f(c, g^o, u)$, where $g^o$ represents the concept the original response is grounded on, and that both $g^o$ and $u$ are unobserved. The edited response is defined as $r^e = f(c, g^e, u)$, which replaces $g^o$ with $g^e$ and keeps $c$ and $u$ intact. Our formulation follows the idea of counterfactual reasoning (Peters et al., 2017), and it guarantees that 1) the content irrelevant to the grounded concept is preserved, and that 2) the edited response is coherent with the dialogue history. Since it is costly to collect paired $(r^o, r^e)$ for training, the grounded minimal editor should be trained on the grounded dialogue data $(c, r, g) \sim D$ as in Eq. (1). As the first attempt toward the proposed framework, we focus on persona-grounded minimal editing in the experiments. Thus, in the remaining part of this paper, we set the grounded concepts $g$, $g^o$, $g^e$ to be the personas $p$, $p^o$, $p^e$.

Overview
We propose Grounded Minimal Editor (GME), a pipeline model for grounded minimal editing. At inference, GME first creates a response template $t$ by masking persona-related spans in the original response $r^o$ and then recombines the template $t$, the persona $p^e$, and the dialogue history $c$ into an edited response $r^e$. We design the template to approximate the unobserved variables $u$ in Section 3, which distinguishes GME from previous retrieval-based dialogue models. With some abuse of notation, we use $t$ to denote the template for both training and inference. During training, two modules are learned: 1) a generator used for the recombination described above and 2) a mask classifier that helps create the response template at inference. Note that GME can also be applied to other grounded concepts besides personas. The full process is presented in Algorithm 1.

Recombination Module
The recombination module learns to recombine the response template, the persona, and the dialogue history as the edited response. During training, we create templates from the training responses, as detailed below.
Span mask. The span mask serves as the placeholder of persona-related spans. For each response-persona pair, we define three sets of tokens: GRADIENT, OVERLAP, and STOPWORDS. GRADIENT contains persona-related tokens that are determined using gradient-based attribution (Simonyan et al., 2014). We pretrain a response-to-persona model and compute the $L_2$ norm of the gradient of the persona's cross-entropy loss w.r.t. each response token's embeddings. A token is placed into the GRADIENT set if the $L_2$ norm is greater than $\delta = 3$. OVERLAP contains response tokens whose lemma overlaps with the lemmas of the persona tokens, which are likely to be related to the persona. STOPWORDS contains stopwords and punctuation marks specified by NLTK (Bird, 2006). We mask a token if it is in GRADIENT or OVERLAP but not in STOPWORDS. We call sentences that contain masks after this step persona-related sentences. For each persona-related sentence, we further mask 15% of its tokens to improve the robustness. Since the number of tokens varies at the same syntactic position, we merge consecutive masks so that all masks are at the span level.
Algorithm 1: Training and inference of GME.
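The span-mask rule above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `build_template`, the `[MASK]` placeholder string, the representation of GRADIENT and OVERLAP as sets of token positions, and the use of independent per-token draws for the extra 15% masking are all assumptions made for the example.

```python
import random

MASK = "[MASK]"  # hypothetical placeholder string for a masked span


def build_template(tokens, gradient_idx, overlap_idx, stopwords,
                   extra_rate=0.15, rng=None):
    """Build a span-level template for one sentence.

    tokens:       list of response tokens
    gradient_idx: positions flagged by gradient-based attribution (assumed given)
    overlap_idx:  positions whose lemma overlaps with the persona (assumed given)
    stopwords:    set of stopwords and punctuation marks
    """
    rng = rng or random.Random()
    # Mask a token if it is in GRADIENT or OVERLAP but not in STOPWORDS.
    masked = [
        (i in gradient_idx or i in overlap_idx) and tok.lower() not in stopwords
        for i, tok in enumerate(tokens)
    ]
    # Only persona-related sentences (those with at least one mask) get
    # additional random masks. Independent draws are an assumption here.
    if any(masked):
        masked = [m or rng.random() < extra_rate for m in masked]
    # Merge consecutive masks so that every mask covers a whole span.
    template = []
    for tok, is_masked in zip(tokens, masked):
        if not is_masked:
            template.append(tok)
        elif not template or template[-1] != MASK:
            template.append(MASK)
    return template
```

Setting `extra_rate=0.0` disables the robustness masking, which makes the deterministic part of the rule easy to inspect on a single sentence.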

Sentence deletion
The above span mask is effective for correcting persona contradictions in the original response. However, the span mask cannot handle the situation where we want to add new persona information into the response (examples are given in Figure 1 and Appendix E). To model this pattern, we randomly delete persona-related sentences. Suppose the response contains $l$ persona-related sentences. The number of sentences to keep, $0 \le n \le l-1$, follows $P(n) \propto \exp(-n/\tau)$, where $\tau$ is a hyperparameter. By infilling persona-related sentences, the model learns to merge persona into the response.
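As an illustration of the sentence-deletion step, the sketch below normalizes the exponential weights over n = 0, ..., l−1 and then samples which persona-related sentences survive. The function names are hypothetical, and choosing the kept sentences uniformly at random is an assumption, since the text only specifies the distribution of their number.

```python
import math
import random


def keep_count_distribution(l, tau):
    """Return [P(0), ..., P(l-1)] with P(n) proportional to exp(-n / tau)."""
    weights = [math.exp(-n / tau) for n in range(l)]
    total = sum(weights)
    return [w / total for w in weights]


def sample_kept_sentences(persona_related_idx, tau, rng=None):
    """Sample which persona-related sentences to keep (the rest are deleted)."""
    rng = rng or random.Random()
    l = len(persona_related_idx)
    if l == 0:
        return []
    probs = keep_count_distribution(l, tau)
    # n ranges over 0..l-1, so at least one sentence is always deleted.
    n = rng.choices(range(l), weights=probs, k=1)[0]
    # Assumption: the n kept sentences are chosen uniformly at random.
    return sorted(rng.sample(list(persona_related_idx), n))
```

Smaller values of τ concentrate the mass on n = 0, so more persona-related sentences are deleted and must be infilled by the model.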
An example of the training template is shown in Figure 3. During training, the recombination module $P_\theta$ is optimized by Eq. (2), where $T(t \mid r, p)$ denotes the distribution of the template as detailed above. As shown in Figure 3, we use GPT-2 as the backbone to parameterize $P_\theta$, and recombination is cast as a language modeling task by concatenating the input texts. We apply label smoothing (0.1), and we use greedy decoding at inference. Token type embeddings are used to distinguish each type of text and each speaker.
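One plausible form of the training objective, assuming the recombination module is trained as a denoising language model over sampled templates and leaving label smoothing aside, is the following. This is a reading consistent with the definitions of $P_\theta$ and $T(t \mid r, p)$ above, not a formula taken from the source.

```latex
\max_{\theta} \;
\mathbb{E}_{(c, r, p) \sim D} \;
\mathbb{E}_{t \sim T(t \mid r, p)}
\log P_{\theta}(r \mid c, t, p)
```

Under this reading, the only stochastic part beyond the data is the template sampler, so each training response can yield a different template at every epoch.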

Mask Generator
Since the persona of the original response before editing, i.e., $p^o$, is unobserved at inference, we train a mask generator $P_\phi$ to predict whether a token $r_i$ should be masked. The mask generator is trained with a class-balanced objective in which $m_i = 1$ if $r_i$ is in GRADIENT or OVERLAP but not in STOPWORDS, and $m_i = 0$ otherwise; $f_i$ is the corpus-level frequency of $m_i$, which is used to balance the number of positive and negative samples. At inference, we mask a word if 1) $P_\phi$ labels it as masked with a confidence greater than a threshold (0.5 in the main experiment, 0.75 in the transferability experiment) and 2) it does not appear in the persona $p^e$ or the dialogue history $c$. We merge consecutive masks to get span masks. This process is denoted as $T^{test}_\phi(t \mid c, r^o, p^e)$ in Algorithm 1.
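The inference-time masking rule can be sketched as follows. The function name, the `[MASK]` placeholder, and the assumption that the classifier's confidence is available as one probability per token are illustrative choices, not details given in the text.

```python
MASK = "[MASK]"  # hypothetical placeholder string for a masked span


def inference_template(tokens, mask_probs, persona_tokens, history_tokens,
                       threshold=0.5):
    """Create a span-level template from per-token mask probabilities.

    mask_probs[i] is assumed to be the classifier's confidence that
    tokens[i] should be masked.
    """
    # Words that appear in the editing persona or the dialogue history are kept.
    protected = set(persona_tokens) | set(history_tokens)
    template = []
    for tok, prob in zip(tokens, mask_probs):
        if prob > threshold and tok not in protected:
            # Merge consecutive masks into a single span mask.
            if not template or template[-1] != MASK:
                template.append(MASK)
        else:
            template.append(tok)
    return template
```

Raising the threshold (0.75 in the transferability experiment versus 0.5 in the main experiment) masks fewer tokens, so more of the original response is carried over unchanged.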

Data Collection
We present a new dataset PERSONAMINEDIT to evaluate persona-grounded minimal editing. Validation and test data are collected in two steps:

Editing persona selection
We first construct inference samples $(c, r^o, p^e)$, where the dialogue history $c$ and original response $r^o$ are from PERSONACHAT, and we select the editing persona $p^e$ based on two criteria: 1) editing difficulty and 2) conversation consistency. We bias our data to the hard cases that require correction of persona contradictions. Specifically, we use the heuristics provided by Welleck et al. (2019) to select personas that are contradictory to the original response. To ensure conversation consistency, we filter out personas that are contradictory to the speaker's responses in the dialogue history. Finally, we also ensure that the persona sentences within each persona are not contradictory to each other.

Response editing
For each constructed triple $(c, r^o, p^e)$, we collect references for the edited responses $r^e$ on Amazon Mechanical Turk. Specifically, $r^e$ should satisfy three requirements: 1) consistency with the editing persona $p^e$, 2) minimal editing, and 3) coherence with the dialogue history $c$. We reject annotations that do not add words to the original response. Three human references are collected for each triple, and duplicate references are re-annotated. The inter-annotator BLEU (i.e., the BLEU of each reference given the other two references) is 73.8 on the validation set and 71.4 on the test set. The annotation instructions we used are detailed in Appendix A.
Training data in PERSONAMINEDIT is derived from the training data of PERSONACHAT, and personas are aligned with responses following Welleck et al. (2019). We also remove training samples whose persona appears in the editing personas in the validation and test data to ensure that the persona does not leak from training to testing.

Data Statistics
We study the behavior of human references to understand the human intuition of minimal editing. In Table 1, we report the number of words added (add) and removed (rm), and the length difference ($\Delta L$) between the edited and original responses. We also report the minimum edit distance (MED) between the edited and original responses ($d(r^e, r^o)$), and that between the edited response and the editing persona ($d(r^e, p^e)$). We observe that the edited responses are generally local modifications of the original responses. On average, the edited responses are longer than the original ones, which can be explained by the observation that humans sometimes add persona information into the response when no persona contradiction exists.
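A minimal sketch of how such edit statistics could be computed for one pair of responses is given below. It assumes whitespace tokenization, counts added and removed words as multiset differences, and uses word-level Levenshtein distance for MED; the paper's exact definitions may differ, and the function names are hypothetical.

```python
from collections import Counter


def edit_distance(a, b):
    """Word-level Levenshtein distance between token lists a and b."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # delete x
                            curr[j - 1] + 1,          # insert y
                            prev[j - 1] + (x != y)))  # substitute or match
        prev = curr
    return prev[-1]


def edit_statistics(original, edited):
    """Return add, rm, length difference, and MED for one response pair."""
    o, e = original.split(), edited.split()
    count_o, count_e = Counter(o), Counter(e)
    return {
        "add": sum((count_e - count_o).values()),  # words only in the edit
        "rm": sum((count_o - count_e).values()),   # words only in the original
        "delta_len": len(e) - len(o),
        "med": edit_distance(e, o),
    }
```

For example, editing "do you think 40 is too old to go back to school ?" into "do you think seventy two is too old to go back to school ?" adds two words, removes one, lengthens the response by one token, and has a word-level edit distance of two.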

Experiment on PERSONAMINEDIT
We use PERSONAMINEDIT to evaluate persona-grounded minimal editing (Q1 in Section 1).

Baselines
We modify state-of-the-art models for unsupervised text style transfer and counterfactual story generation as the baselines for grounded minimal editing.
No edit. This baseline does not make any edits to the original response.
CycleGAN. CycleGAN (Zhu et al., 2017) has been applied to unsupervised text style transfer. For our task, we replace the style classifier with a response-to-persona model. We use the Gumbel-softmax straight-through gradient estimator (Jang et al., 2017) for optimization.
DeLorean-FT. DeLorean (Qin et al., 2020) iteratively modifies GPT-2's logits via gradients from a content-preserving loss. For our task, we replace GPT-2 with TransferTransfo and set the mixture rate $\gamma_{mix} \in \{0.75, 0.80, 0.85\}$, where larger (smaller) $\gamma_{mix}$ is biased towards persona consistency (minimality of editing).
We observe that CycleGAN is sensitive to hyperparameters and unstable to train, probably due to the biased gradient estimation given the large persona space. Thus, we do not include other methods that require gradient backpropagation from classifiers (Zhou et al., 2020; Madaan et al., 2020).

Automatic Evaluation
For automatic evaluation, we run each experiment with five random seeds. More details are presented in Appendix C.
BLEU. We compute the BLEU-4 score (Papineni et al., 2002) based on the collected multiple human references, using the Moses script multi-bleu.perl. From Table 2 and Table 5, we observe that higher BLEU indicates less editing.
P-Score. We define P-Score to evaluate persona consistency. Specifically, we finetune a BERT model on the DNLI dataset (Welleck et al., 2019) to predict the relation $C(r, p_j)$ (entailment, neutral, or contradiction) between a response $r$ and a persona sentence $p_j$. We then map entailment, neutral, and contradiction to +0.5, 0, and −0.5, and define the P-Score of a sample from the mapped relations $C(r^e, p^e_j)$, where $r^e$ is the edited response and $p^e_j$ is a persona sentence in $p^e$. We finally report the P-Score averaged over all samples.
Average. We observe that BLEU and P-Score show a trade-off between minimal editing and persona consistency. We report their arithmetic mean as the overall performance since BLEU and P-Score have similar scales and variances.
Table 2 shows that CycleGAN and UNMT have high BLEU but negative P-Scores. Figure 4 shows that most of their outputs are contradictory to the editing personas, indicating that their edits are not focused on persona-related expressions. These results show that methods designed for binary style labels are not effective for persona-grounded minimal editing, where the persona space is much larger than the label space. Larger $\gamma_{mix}$ for DeLorean-FT leads to lower BLEU and higher P-Score, showing that larger (smaller) $\gamma_{mix}$ is biased towards persona consistency (minimality of editing). However, results show that the overall performance cannot be improved by hyperparameter tuning.
Table 2: Automatic evaluation. We report the average of 5 random seeds, and standard deviations are shown in parentheses. Details of P-Score are in Figure 4.
GME achieves a 31.9% relative improvement on the Average score over the best performing baseline (from 34.2 to 45.1). Figure 4 shows that most of GME's outputs entail the given personas. Table 4 shows the results for 1) removing dialogue histories from the data and 2) removing sentence deletion from GME. We observe that the dialogue history only has a slight contribution, showing that the response template contains an adequate amount of information of the original response. Sentence deletion contributes largely to the performance, especially for the persona consistency.
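The metrics used in this section can be sketched as below. The label-to-score mapping follows the text; aggregating the per-sentence scores of a sample by summation is an assumption (the defining equation is not reproduced here), as are the function names. The NLI labels are assumed to come from an external classifier.

```python
# Mapping from NLI relation to score, as described in the text.
LABEL_TO_SCORE = {"entailment": 0.5, "neutral": 0.0, "contradiction": -0.5}


def p_score(labels):
    """P-Score of one sample from the NLI labels of its persona sentences.

    Assumption: per-sentence scores are summed over the persona sentences.
    """
    return sum(LABEL_TO_SCORE[label] for label in labels)


def corpus_p_score(all_labels):
    """P-Score averaged over all samples."""
    return sum(p_score(labels) for labels in all_labels) / len(all_labels)


def average_metric(bleu, p):
    """Arithmetic mean of BLEU and P-Score (both on the same scale)."""
    return (bleu + p) / 2


def relative_improvement(baseline, ours):
    """Relative gain of `ours` over `baseline`."""
    return (ours - baseline) / baseline
```

As a check on the reported numbers, `relative_improvement(34.2, 45.1)` is about 0.319, matching the 31.9% relative improvement on the Average score.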

Human Evaluation
We randomly sample 150 test samples for human evaluation. Given two edited responses A and B, three annotators are hired to vote prefer A, none, or prefer B. We instruct annotators to vote none if neither A nor B satisfies both minimal editing and persona consistency. See detailed guidelines in Appendix B and supplementary materials.
Table 3 shows that human annotators generally prefer GME to the baselines. The free-marginal κ for each row is 0.66, 0.66, and 0.51 (substantial, substantial, and moderate agreement). The strongest baseline, DeLorean-FT, is only preferred in 8.4% of cases. We observe that in most cases where DeLorean-FT wins, the original response is syntactically similar to the persona.
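For reference, the agreement statistic can be computed as in the sketch below, assuming the free-marginal κ reported here is Randolph's free-marginal multirater kappa, in which chance agreement is fixed at 1/k for k categories. The function name and input layout are illustrative.

```python
def free_marginal_kappa(ratings, num_categories):
    """Free-marginal multirater kappa.

    ratings: one list per item, holding the category chosen by each rater
             (at least two raters per item).
    num_categories: number of categories k available to the raters.
    """
    agreement = 0.0
    for item in ratings:
        n = len(item)
        counts = {}
        for label in item:
            counts[label] = counts.get(label, 0) + 1
        # Fraction of rater pairs that agree on this item.
        agreement += sum(c * (c - 1) for c in counts.values()) / (n * (n - 1))
    p_observed = agreement / len(ratings)
    p_expected = 1.0 / num_categories  # chance agreement with free marginals
    return (p_observed - p_expected) / (1.0 - p_expected)
```

In the pairwise preference setting above there are three categories (prefer A, none, prefer B) and three raters per item.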

Behavioral Analysis
Using the metrics defined in Section 5.2, we provide a behavioral analysis of the models. Results are shown in Table 5. CycleGAN and UNMT have small add, rm, and $d(r^e, r^o)$, which shows that they make few changes to the original response. For DeLorean-FT, larger mixture rates $\gamma_{mix}$ have larger add, rm, and $d(r^e, r^o)$, which is consistent with the observation in Section 6.2. The large $d(r^e, r^o)$ of DeLorean-FT also shows that this model performs poorly at minimal editing. GME has the most similar behavior to human references. Based on the observations in Sections 6.2-6.4, we conclude that GME is effective in making minimal edits that are targeted at persona-related expressions. By checking the outputs, we observe that GME and human references add persona information into the response in some cases, which may explain why GME and human references have positive $\Delta L$ (i.e., their predictions are longer than the original responses).

Experimental Setup
In BLENDEDSKILLTALK, each dialogue session is grounded on two persona sentences and an optional knowledge topic, and the distribution of responses is biased towards the mixture of displaying persona, using knowledge, and being empathetic. Two types of existing responses are considered:
• Responses generated by a persona-agnostic Blender-90M (Roller et al., 2020), which is trained on BLENDEDSKILLTALK with the persona sentences removed.
We compare the above two Blender-90M variants and GME-edited responses with TransferTransfo, a pretrained dialogue model finetuned on PERSONACHAT. Note that GME is not finetuned on BLENDEDSKILLTALK. Also, conversations in PERSONACHAT, on which GME and TransferTransfo are trained, barely display knowledge and empathy.

Automatic Evaluation
We report BLEU and F1 (Miller et al., 2017) computed with the human references. For persona consistency, we report the P-Score defined in Section 6.2. To evaluate fluency, we report the word-level NLL evaluated by GPT-2 (Radford et al., 2018). The automatic evaluation uses the full 5482 test samples of BLENDEDSKILLTALK. Table 6 shows that P-Scores are largely improved after GME editing (from 9.2 to 33.0, and from 0.8 to 29.4). BLEU, F1, and NLL remain comparable to those before editing. Although TransferTransfo has the highest persona consistency, it has much poorer BLEU, F1, and NLL than GME. These results show that grounded minimal editing addresses the transferability issue faced by TransferTransfo.
Table 6: Automatic and human evaluation for the transferability to the test set of BLENDEDSKILLTALK. NLL is computed using GPT-2. Free-marginal κ for knowledge, empathy, persona, and grammaticality is 0.92, 0.70, 0.85, and 0.78 (almost perfect, substantial, almost perfect, and substantial agreement).

Human Evaluation
We randomly sample 100 test samples for human evaluation. Three annotators evaluate whether a response shows knowledge, empathy, persona consistency, and grammaticality. See detailed guidelines in Appendix B and supplementary materials.
Results are shown in Table 6. Free-marginal κ for knowledge, empathy, persona, and grammaticality is 0.92, 0.70, 0.85, and 0.78 (almost perfect, substantial, almost perfect, and substantial agreement), respectively. Results show that persona consistency is largely improved after GME editing, while the use of knowledge and empathy remains comparable to that before editing. TransferTransfo has the highest persona consistency, but it has much lower knowledge and empathy than the responses edited by GME. For example, only 4.0% of TransferTransfo's responses show empathy, while the ratios are 29.0% and 28.0% for the GME-edited responses. We also notice a slight grammaticality drop after GME editing. However, the GME-edited responses still achieve competitive or higher grammaticality scores compared to TransferTransfo. In practice, the grammaticality scores can be easily improved using reranking approaches. In summary, GME largely improves the persona consistency of existing responses while preserving their use of knowledge and empathy, which addresses the transferability challenges faced by grounded dialogue models trained on PERSONACHAT, e.g., TransferTransfo.

Discussion
As mentioned in Section 2, the term "minimal" distinguishes our work from two-pass generation (Xia et al., 2017) and retrieval-augmented dialogue models (Cai et al., 2019). Generally, their objective can be formulated as $P(r \mid c, r')$, where $r'$ is a response either generated by the model itself or retrieved from the dataset. However, these works do not require $r$ and $r'$ to be a minimal editing pair. By contrast, we formulate $r^e$ and $r^o$ to be a minimal editing pair. To encourage minimal editing, we construct response templates from the observed responses themselves, while these works derive templates from the $r'$ defined above.
GME itself is also trained on a grounded dialogue dataset that has a biased distribution. Thus, as we mentioned at the beginning of Section 7, we also need to evaluate the transferability of GME. Section 7 shows that GME editing only slightly changes the distribution of the responses generated by the Blender-90M variants, while the distribution of TransferTransfo's responses is further away from the human references. This observation suggests that minimally editing out-of-domain responses is easier than generating them.
While we focus on the persona, other types of grounding, e.g., knowledge and image, remain to be explored. Many of GME's failure cases (see Appendix E) contain grammatical errors or fail to correct contradictions, which could be addressed by improving the quality of response templates or incorporating stronger language model priors.

Conclusions
We propose a framework named grounded minimal editing to address the transferability challenges of grounded dialogue modeling, which include the distributional gap and the grounding gap. Our Grounded Minimal Editor (GME) model achieves minimal editing by disentangling and recombining persona-related and persona-irrelevant expressions. For evaluation, we present the PERSONAMINEDIT dataset with multiple human references. Experimental results show the effectiveness of GME for persona-grounded minimal editing. GME is also transferable to edit responses generated by pretrained dialogue models and improve their persona consistency while preserving their use of knowledge and empathy.

A Annotation Guideline (Simplified)
The following guideline is provided to AMT crowdworkers when collecting reference responses: "We aim at building human-like agents that have their own personal background. We need your help to correct some responses that are irrelevant to or contradictory to the speaker's personal background. In each sample, you first see the dialogue history between the two speakers (Speaker1 and Speaker2), the original response by Speaker2, and the personal background of Speaker2. These background sentences are probably irrelevant to or contradictory to Speaker2's response. Your task is to minimally edit the response such that it shows the background. Two requirements should be satisfied: 1) The edited response should show the personal background. 2) By "minimally edit" we mean that the edited response should maintain the contents in the response that are not contradictory to the background sentences. We have pasted the original response into the answer blank, and please edit it directly."

B Human Evaluation Guidelines
We provide our human evaluation guidelines in software.zip, and we will make them public. We briefly summarize the guidelines here, and more details can be found in software.zip.

B.1 Grounded Minimal Editing
To make our task more comprehensible to human participants, we reformulate our task as a response correction task. We define two types of mistakes made by a response: 1) contradict and 2) ignore. We first ask the participants to identify the type of mistake, which encourages the participants to reason over persona-grounded dialogues. We specify two requirements to be satisfied by a good correction: 1) the mistakes are corrected, and 2) minimal changes, i.e., all words that are not contradictory to the expected personal background should be maintained, and if more than four of them are not maintained, then this requirement is not satisfied. Given two corrections A and B, participants vote prefer A, none, or prefer B. We ask them to choose none if neither A nor B is a good correction.

B.2 Transferability
For each response, we instruct the participants to answer four questions:
• Knowledge (0 or 1). Does this response include world knowledge? 0: no; 1: yes. World knowledge includes facts and commonsense (see software.zip for details).
• Empathy (0 or 1). Do you think this response is showing empathy? 0: no; 1: yes. Showing empathy means being aware of or being sensitive to the feelings or experience of the person being talked to (see software.zip for details).
• Background occurrences. There are two personal background sentences. How many of them are reflected by this response (examples omitted here)? Since we only care about whether at least one of the personas is shown, persona consistency is 0 if the answer to this question is 0, and 1 if the answer is 1 or 2.

C Experimental Details
Models are evaluated on the validation set every 500 steps, based on the Average metric. The batch size is 32. We use Adam (Kingma and Ba, 2015) with an initial learning rate of $5 \times 10^{-5}$ and gradient clipping at 1.0. The learning rate decays by half when the Average metric does not improve for two validations, and training terminates after three decays. We detokenize the BPE tokens into English words for evaluation. More details for the Reproducibility Checklist are in software.zip.
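The learning-rate schedule described above can be sketched as a small state machine that is updated after each validation. The class name is hypothetical, and two details are assumptions: the non-improvement counter resets after each decay, and training stops as soon as the third decay has been applied.

```python
class HalvingSchedule:
    """Halve the learning rate on plateaus and stop after a fixed number of decays."""

    def __init__(self, lr=5e-5, patience=2, max_decays=3):
        self.lr = lr
        self.patience = patience      # validations without improvement per decay
        self.max_decays = max_decays  # training stops after this many decays
        self.best = float("-inf")
        self.bad_validations = 0
        self.decays = 0

    def step(self, metric):
        """Update with a validation metric; return False when training should stop."""
        if metric > self.best:
            self.best = metric
            self.bad_validations = 0
            return True
        self.bad_validations += 1
        if self.bad_validations >= self.patience:
            self.bad_validations = 0  # assumption: counter resets after a decay
            self.lr *= 0.5
            self.decays += 1
            return self.decays < self.max_decays
        return True
```

In a training loop, `step` would be called with the validation Average metric every 500 optimization steps, and the returned flag decides whether to continue.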

D Model and Baseline Details
DeLorean-FT and our GME model use GPT-2 (Radford et al., 2018) as the backbone, initialized from the Huggingface Transformers (Wolf et al., 2020) checkpoint gpt2 (DialoGPT-small for DialoGPT). UNMT and CycleGAN are Transformers with DistilGPT-2 encoder and decoder, initialized with distilgpt2. The auxiliary response-to-persona module in CycleGAN is implemented as a two-layer Transformer, initialized by the first two layers of DistilGPT-2.

E Data Samples and System Outputs
We provide several failure cases of GME in Table 7. Tables 8-11 present some data samples and system outputs.

Human reference 3: that sounds fun , i am a teacher at the public school but i work as a car cleaner in part time .
UNMT (0.1): that sounds fun , i am a teacher at the public school .
CycleGAN: that sounds fun , i am a teacher at the public school .
DeLorean-FT, $\gamma_{mix}$ = 0.75: i work at a place that cleans cars .
DeLorean-FT, $\gamma_{mix}$ = 0.80: i work at a place that cleans cars .
DeLorean-FT, $\gamma_{mix}$ = 0.85: i work at a place that cleans cars .
GME (ours): that sounds fun , i am a car mechanic at the place . i work at a place .

Original response: Speaker 1: do you think 40 is too old to go back to school ?
Editing persona: i am seventy two years old .
Human reference 1: do you think seventy two years old is too old to go back to school ?
Human reference 2: do you think seventy two is too old to go back to school ?
Human reference 3: do you think seventy years old is too old to go back to school ?
UNMT (0.1): do you think 40 is too old to go back to school ?
CycleGAN: do you think 40 is too old to go back to school ? i am seventy seventy twelve years old .
DeLorean-FT, $\gamma_{mix}$ = 0.75: do you skateboard ? i am a seventy two year old .
DeLorean-FT, $\gamma_{mix}$ = 0.80: i am a seventy two year old man .
DeLorean-FT, $\gamma_{mix}$ = 0.85: i am a seventy two year old man .
GME (ours): do you think i am too old to go back to school ? i am seventy two years old .

Original response: Speaker 1: probably a smart decision , too many people on the planet .
Editing persona: i recently started to work online .
Human reference 1: probably a smart decision , i recently started to work online because too many people on the planet .
Human reference 2: probably a smart decision , too many people on the planet that is why i recently started to work online .
Human reference 3: probably a smart decision , too many people on the planet . i recently started to work online .
UNMT (0.1): probably a smart decision , too many people on the planet .
CycleGAN: probably a smart decision , too many people on the planet .
DeLorean-FT, $\gamma_{mix}$ = 0.75: i am a computer science major . i am currently working online .
DeLorean-FT, $\gamma_{mix}$ = 0.80: i am a computer science major . i am currently working online .
DeLorean-FT, $\gamma_{mix}$ = 0.85: i am a computer science . i am currently working online .
GME (ours): probably a smart decision , too many people on the planet . i am working online now .