An Investigation of Suitability of Pre-Trained Language Models for Dialogue Generation – Avoiding Discrepancies

Pre-trained language models have been widely used for response generation in open-domain dialogue. Existing approaches fall into 4 frameworks: Transformer-ED, Transformer-Dec, Transformer-MLM and Transformer-AR. In this study, we experimentally compare them using both large- and small-scale data. The comparison reveals that the decoder-only architecture is better than the stacked encoder-decoder, and that left-to-right and bi-directional attention each have their own advantages. We further define two concepts of model discrepancy, which provide a new explanation of model performance. As discrepancies may hinder performance, we propose two solutions to reduce them, which successfully improve model performance.


Introduction
It has been shown (Wolf et al., 2019) that leveraging a pre-trained Language Model (LM) based on the Transformer can achieve excellent performance in dialogue generation. Different approaches have been proposed recently, which can be categorized into 4 frameworks: Transformer-ED (Zheng et al., 2019), an encoder-decoder Transformer; Transformer-Dec (Wolf et al., 2019; Lin et al., 2020); Transformer-MLM (Dong et al., 2019); and Transformer-AR (Bao et al., 2019; Shuster et al., 2019). The latter three all utilize a decoder-only architecture. Besides, Trans-Dec uses left-to-right attention on both the source and target sides, while Trans-MLM and Trans-AR employ bi-directional attention on the source side to encode the dialogue history. Due to this difference, Trans-Dec only utilizes left-to-right pre-trained models, e.g. GPT-2 (Radford et al., 2019), while Trans-MLM/AR are based on pre-trained models applying bi-directional attention (on the source side), e.g. BERT (Devlin et al., 2018). The difference between Trans-MLM and Trans-AR is that Trans-MLM uses a masked language modeling objective while Trans-AR uses an auto-regressive objective.
Recent studies have explored pre-training dialogue models on large-scale Reddit/Twitter data (Adiwardana et al., 2020; Roller et al., 2020). It is then straightforward to fine-tune the models for a specific dialogue task. However, in practice, there may not always be enough data for such pre-training, and in some cases we still need to exploit a pre-trained LM. For example, some studies do further pre-training for dialogue based on a pre-trained LM (Zhang et al., 2019; Dong et al., 2019; Bao et al., 2019; Shuster et al., 2019), and some studies doing multi-task learning (e.g. on dialogue and question answering) can only fine-tune based on a pre-trained LM (Lin et al., 2020; Zeng and Nie, 2021). A critical question is then how to best exploit a pre-trained LM for dialogue generation. On this question, there are contradictory beliefs in the literature: some researchers believe that Trans-Dec is appropriate because it uses a left-to-right language model that corresponds well to the dialogue generation task (Zhang et al., 2019; Lin et al., 2020), while others (Dong et al., 2019; Bao et al., 2019) show that Trans-MLM/AR fine-tuning BERT can also achieve state-of-the-art performance.
In this study, we aim to address the above question. To do so, we first compare the 4 frameworks under the same setting on 3 datasets, each with large- and small-scale training data. Our results on large-scale datasets show that Trans-ED, which applies the stacked encoder-decoder architecture, does not produce competitive results against the others, which use a decoder-only architecture. Trans-Dec/AR generate the most appropriate responses. However, according to automatic metrics, Trans-Dec generates the most diverse responses, while Trans-AR produces responses most similar to the ground truth. This may be because uni-directional attention has no constraint from the right-side context and is thus more flexible, while bi-directional attention on the source side can better model the dialogue context. In contrast, the results on small-scale datasets reveal an important aspect, namely the discrepancies that may occur between the pre-training and fine-tuning processes. We then try to explain the performances of the 4 frameworks with respect to these discrepancies. The concept of model discrepancy was briefly mentioned in Yang et al. (2019), meaning that the model has been trained in one way but is used in a different way; however, the problem has not been investigated in depth. In this work, we go further in this direction and define two discrepancies: the pretrain-finetune discrepancy, meaning differences in architecture and loss function between pre-training and fine-tuning, and the finetune-generation discrepancy, meaning that the way the model is used in generation (inference/test) differs from the way it has been trained. All 4 frameworks except Trans-Dec have some pretrain-finetune discrepancy. For example, Trans-AR relies on BERT, pre-trained with bi-directional attention, but has to limit it to left-to-right attention on the target side during fine-tuning.
Only Trans-MLM has a finetune-generation discrepancy, because of its MLM objective: during training, the model input contains random masks, while in the generation process the input does not.
Discrepancies might affect model performance, since models with such discrepancies cannot best exploit the pre-trained model or fully employ the fine-tuned model. Our experiments on small-scale datasets show that the performance of Trans-AR, which has a larger pretrain-finetune discrepancy, drops more sharply than that of Trans-MLM. Trans-Dec/MLM, which have a small pretrain-finetune discrepancy, have a clear advantage over the other frameworks according to human evaluation. It becomes clear that discrepancies hinder the performance of a dialogue model. To alleviate the problems, we propose 2 approaches to respectively reduce the pretrain-finetune and finetune-generation discrepancies of Trans-MLM, aiming to improve its performance. Our experiments show that both methods bring some improvement. In particular, by eliminating the finetune-generation discrepancy of Trans-MLM, our approach significantly outperforms previous methods in most automatic metrics, and achieves comparable performance in human evaluation to Trans-Dec, which uses a much larger dataset for pre-training. These results confirm that discrepancies are indeed an important factor influencing the effectiveness of leveraging a pre-trained LM for a sequence-to-sequence task, and should be alleviated.
The contributions of this work are as follows: • We compare the four commonly used frameworks that utilize pre-trained language models for open-domain dialogue generation on 3 public datasets, each in large and small scale, and we analyze each framework based on the experimental results.
• We introduce the concept of pretrain-finetune discrepancy and finetune-generation discrepancy, and we examine the discrepancies of each framework.
• We propose two methods to reduce the discrepancies, yielding improved performance. This is the first investigation that explicitly shows the phenomenon of model discrepancy and its impact on performance.

Pre-training Based Frameworks
We start with a brief description of the 4 frameworks for dialogue generation based on pre-trained models. More details are provided in Appendix A. We examine the pretrain-finetune discrepancy of each framework. Figure 1 and Table 1 provide an overview.

Trans-ED
Trans-ED, as discussed in this paper, is the encoder-decoder architecture used by the ConvAI2 (Dinan et al., 2019) champion. The decoder of Trans-ED is stacked upon the encoder outputs, while in the other, decoder-only frameworks, all hidden states of the source side are utilized in the decoding part. The framework shares parameters between the encoder and the decoder and initializes them with GPT (Radford et al., 2018). In this case, the pretrain-finetune discrepancy comes from the bi-directional attention in the encoder, since GPT is a left-to-right language model. This framework is not commonly used for fine-tuning on a dialogue task; in practice, more efficient variants of Trans-ED have recently been used for extremely large-scale dialogue pre-training from scratch.

Trans-Dec
Trans-Dec is a left-to-right decoder-only architecture that utilizes GPT-2 (Radford et al., 2019). Thus, there is no pretrain-finetune discrepancy in terms of architecture and loss function. This framework is widely applied for fine-tuning on dialogue tasks. However, it encodes the dialogue history using only left-to-right attention, which limits the scope of the context, resulting in partial context modeling.

Trans-MLM and AR
These two frameworks have an identical decoder-only architecture that employs different self-attention masks for the source and target sides: bi-directional attention on the source side to encode the dialogue history, and left-to-right attention on the target side. The only difference between them is the objective function: Trans-MLM masks some tokens on the target side and tries to predict them, while Trans-AR uses an auto-regressive objective that predicts the next tokens successively. Both frameworks often exploit BERT, a bi-directional architecture pre-trained with the MLM objective. Thus, the pretrain-finetune discrepancy of Trans-MLM/AR comes from the left-to-right attention on the target side. Additionally, Trans-AR applies the auto-regressive objective, which differs from the MLM objective used in pre-training.
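The mixed attention pattern shared by Trans-MLM/AR can be sketched as a boolean mask over a concatenated (source, target) sequence. This is an illustration rather than the frameworks' actual implementation; in particular, the convention that source positions never attend to the target is our assumption here.

```python
import numpy as np

def seq2seq_attention_mask(src_len, tgt_len):
    """Boolean attention mask for one (source, target) pair.

    allowed[i, j] is True when position i may attend to position j.
    Source positions attend bi-directionally over the source only;
    target positions see the whole source plus a causal
    (left-to-right) view of the target.
    """
    n = src_len + tgt_len
    allowed = np.zeros((n, n), dtype=bool)
    # Source rows: bi-directional attention within the source.
    allowed[:src_len, :src_len] = True
    # Target rows: full view of the source ...
    allowed[src_len:, :src_len] = True
    # ... plus a lower-triangular (causal) view of the target.
    allowed[src_len:, src_len:] = np.tril(
        np.ones((tgt_len, tgt_len), dtype=bool))
    return allowed
```

Setting the source block to lower-triangular instead would recover the fully causal pattern of Trans-Dec.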

Applications of the Frameworks
The four frameworks we described have been widely applied to dialogue generation (e.g. Roller et al., 2020; Bao et al., 2020b).
In general, these studies show that all 4 frameworks can produce good results, and that increasing the model size and training data is an effective way to further improve performance. However, behind this success story, the question of the suitability of a framework is obscured. To investigate it, we do not follow the current trend of increasing model size and training data. Instead, we are interested in the behaviors of the different frameworks on the same datasets, and in understanding the reasons behind them.

Datasets
We use all three large-scale unlabeled dialogue datasets in Shuster et al. (2019). Some important characteristics of the datasets are summarized in Table 2. We are interested in the behaviors of the models in two cases: 1) further pre-training on large dialogue data based on a pre-trained LM; and 2) fine-tuning on a small dialogue corpus based on a pre-trained LM. Our large datasets contain a few million samples, and the small datasets consist of 100K samples. Although the datasets are smaller than those used in several previous studies, we believe that a comparison of different models on the same data, and the contrast between large and small datasets, can reveal interesting trends, which we will explain with respect to discrepancies. Specifically, we choose the following 3 datasets. Twitter Dialogue Corpus is collected from Twitter and consists of 2.6M (message, response) pairs. We filtered out samples with a history longer than 72 words (to limit computation) or shorter than 6 words (not enough information). Samples whose response is longer than 36 words or shorter than 6 words are also removed. As a result, 2M samples are kept. Reddit Conversational Corpus (Dziri et al., 2019) is a 3-turn conversational dataset collected from 95 selected subreddits. Ubuntu Dialogue Corpus V2.0 (Lowe et al., 2017) contains two-person conversations extracted from the Ubuntu chat logs of technical support for various Ubuntu-related problems.
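The length filter described for the Twitter corpus can be sketched as follows. The whitespace word count and the helper name `keep_sample` are our illustrative assumptions, not the original preprocessing code.

```python
def keep_sample(history, response,
                min_hist=6, max_hist=72, min_resp=6, max_resp=36):
    """Length filter described for the Twitter corpus.

    Drops samples whose history is longer than 72 words (to limit
    computation) or shorter than 6 words (not enough information),
    and responses outside [6, 36] words.
    """
    h, r = len(history.split()), len(response.split())
    return min_hist <= h <= max_hist and min_resp <= r <= max_resp
```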

Implementation Details
We use open-source implementations for all four frameworks; only minor adaptations (e.g. for data loading) have been made. The pre-trained language models used by these frameworks in previous studies have comparable numbers of parameters (∼110M), while the pre-training data are on different scales: Trans-ED < Trans-MLM/AR < Trans-Dec. We assume that this difference is trivial when millions of dialogue samples are available. In this study, we use the same data for all frameworks. More implementation details of each framework and the full comparison among pre-trained LMs are given in Appendix C.
We also equip all frameworks with an identical decoding script to avoid extra factors affecting generation quality: it uses beam search with a beam size of 4, prevents duplicated uni-grams, and sets a minimum response length that encourages diverse generation, as in Roller et al. (2020). The minimum response length is set such that the average length of generated responses matches the average target length of the dataset. Generation results are evaluated after applying an identical word tokenization method. With two P100 GPU devices, the maximum input length is set to 128, and we fine-tune all models for 6 epochs, applying early stopping based on performance on the validation set. Our methods (PF-free and FG-free, which will be described in Section 4.1) add no parameters and no runtime compared with Trans-MLM.
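The two decoding constraints can be illustrated on a single beam's next-token logits. This is a sketch only: `constrain_logits` is a hypothetical helper, and a real beam search additionally tracks per-beam scores and hypotheses.

```python
import numpy as np

NEG_INF = -1e9

def constrain_logits(logits, generated, eos_id, min_len):
    """Apply the two decoding constraints to one beam's logits.

    - minimum response length: the EOS token is banned until the
      response reaches `min_len` tokens, encouraging longer and
      more diverse generations;
    - no duplicated uni-grams: tokens already generated are banned.
    """
    logits = logits.copy()
    if len(generated) < min_len:
        logits[eos_id] = NEG_INF      # too short: cannot end yet
    for tok in generated:             # block repeated uni-grams
        logits[tok] = NEG_INF
    return logits
```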

Evaluation
Automatic Metrics We compare the similarity between generated responses and ground-truth responses using: BLEU (Papineni et al., 2002), evaluating how many n-grams (n=1,2,3) overlap; and CIDEr (Vedantam et al., 2015), utilizing TF-IDF weighting for each n-gram. Besides, we evaluate response diversity using Distinct (denoted Dist) (Li et al., 2016), the proportion of unique n-grams (n=1,2) in the entire set of generated responses.

Table 3: Evaluation results on the large-scale (upper half) and small-scale (lower half) Twitter dataset. PF-free denotes the method with reduced pretrain-finetune discrepancy of Trans-MLM. FG-free denotes the method that eliminates the finetune-generation discrepancy of Trans-MLM. Two-sided t-tests compare each method with the one without the () sign, which is usually the best performer. Scores are denoted with * (p < 0.05) or ** (p < 0.01) for statistically significant differences, and / for insignificant differences.

Human Evaluation Furthermore, we ask human evaluators to rate a response in {0, 1, 2}, where 2 represents a coherent and informative response. Details are given in Appendix D. We also do a pair-wise evaluation to compare two models and indicate which one is better. To reduce time cost, we only perform human evaluations on the Twitter and Reddit datasets, which are closer to daily dialogue. However, during evaluation, we observe that ∼65% of Reddit samples are professional discussions that are difficult to understand; the percentage is ∼30% for Twitter. These test samples are discarded, and in the end the test set for each dataset consists of 200 random samples. The inter-rater agreement in Cohen's kappa (Cohen, 1960) is 0.44 for Twitter and 0.42 for Reddit, indicating moderate agreement.
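The Distinct metric described above is straightforward to compute; a minimal sketch, assuming responses are pre-tokenized lists of tokens:

```python
def distinct_n(responses, n):
    """Distinct-n (Li et al., 2016): proportion of unique n-grams
    among all n-grams in the full set of generated responses.
    """
    all_ngrams = []
    for tokens in responses:
        all_ngrams += [tuple(tokens[i:i + n])
                       for i in range(len(tokens) - n + 1)]
    if not all_ngrams:
        return 0.0
    return len(set(all_ngrams)) / len(all_ngrams)
```

Note that the pooling is over the whole test set, so a model that repeats the same safe response across inputs is penalized even if each individual response is fluent.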
In addition to the 4 frameworks, we also include two general RNN-based baselines, SEQ2SEQ-MMI (Li et al., 2016) and HRED-MMI (Serban et al., 2016), to show how pre-trained models perform against them.

Architecture Analysis
We first examine architecture appropriateness in the large-scale data setting, since when data are limited, the pretrain-finetune discrepancy and the size of pre-training data may strongly influence the results. Appendix E shows some generation samples. Our global observation is that Trans-Dec and Trans-AR are the best choices for the large-scale data setting, e.g. further dialogue pre-training based on a pre-trained LM.
Left-to-Right Only vs. Bi-Direction on the Source Human evaluation results on response appropriateness (Tables 6 and 7) show that Trans-Dec and Trans-AR generate the most appropriate responses. According to automatic metrics, Trans-AR, applying bi-directional attention on the source side, obtains the highest BLEU and CIDEr scores on all three million-scale datasets. We believe that bi-directional attention helps the model better encode the dialogue history. In contrast, Trans-Dec generates the most diverse responses. We attribute this to the left-to-right attention, which introduces fewer constraints than bi-directional attention and thus allows higher flexibility in generation.
Trans-MLM vs. AR With large data, Trans-AR substantially outperforms Trans-MLM in terms of both automatic and human evaluation. When the finetune-generation discrepancy of Trans-MLM is eliminated, i.e. FG-free (which we will introduce in Section 4.2), the performance improves but still shows a small gap to Trans-AR, especially in automatic metrics. This may be because the MLM objective only masks a certain percentage of tokens (40%), while the AR objective predicts all tokens on the target side during training; thus, the AR objective is more training-efficient. A similar observation about the efficiency of MLM has been reported in Clark et al. (2020). However, when training data are limited, we will show that it is better to use the MLM objective, which has a smaller pretrain-finetune discrepancy.
Trans-ED vs. Decoder-Only With large dialogue data, we assume the size of pre-training data and the pretrain-finetune discrepancy have only a small influence on performance. However, even compared with Trans-MLM (FG-free)/AR, Trans-ED generates much less diverse and less appropriate responses. We also observe slower convergence when training the model. We believe that this result is at least partly due to the main difference in architecture: an explicit encoder in Trans-ED might be redundant (Liu et al., 2018).

Discrepancy Impact
In Section 2, we discussed the pretrain-finetune discrepancy of each framework. When a large training dataset is available, the impact of the pretrain-finetune discrepancy is less severe, since the model can be gradually adapted to the given task. However, if the training data are limited, the discrepancy problems may surface. Evaluation results, especially human evaluation, show that performance drops more with small data when the framework has a larger discrepancy. For example, comparing Trans-MLM (FG-free) and Trans-AR, the latter having an additional pretrain-finetune discrepancy due to its auto-regressive objective, we see that the performance of Trans-AR drops more when trained on a small dataset. Trans-MLM (FG-free) and Trans-Dec, which have small pretrain-finetune discrepancies, have a clear advantage over the other frameworks according to human evaluation.
These results suggest that with a small dataset one should reduce the pretrain-finetune discrepancy to best exploit a pre-trained LM. In the next section, we propose 2 methods to reduce the pretrain-finetune and finetune-generation discrepancies of Trans-MLM.

Pretrain-Finetune Discrepancy
The discrepancy of Trans-MLM comes from the left-to-right attention on the target side, which has not been pre-trained in BERT. Therefore, this discrepancy cannot be eliminated during fine-tuning for a generation task. However, we can alleviate it by using bi-directional attention on the target side as well. Specifically, at inference time, to generate a new token g_t, [MASK] is fed into the t-th position, denoted g_t-M. Previously generated tokens g_<t can be viewed as a special type of dialogue history, and thus we can apply bi-directional attention over them.
However, in this case, the corresponding training process has an efficiency problem: only one token can be masked in each training sample; otherwise, there will be a conflict in the self-attention mask (Appendix B). This leads to much lower training efficiency: the loss on the validation set only decreases from 6.27 to 5.39 after four epochs, while Trans-MLM, masking 40% of the target tokens, reduces it to 4.35. To avoid this situation, we cannot always update previous hidden states with bi-directional attention during generation. Therefore, we set a time-step interval for bi-directional attention on the target side: within an interval we apply left-to-right attention, and at the end of an interval we apply bi-directional attention. The corresponding training method allows us to mask multiple target tokens at the same time to guarantee training efficiency. Figure 2 illustrates the generation process of our method with an interval of 3. Before time step 3, left-to-right attention is used (e.g. t=2). At time step 3, bi-directional attention is allowed. Then left-to-right attention is used again (e.g. t=5) until the end of the next interval cycle (t=6). Accordingly, the training process is: given a target response, we first randomly select among all possible attention patterns (3 in the figure, because t=3 and t=5 share the same pattern), e.g. the case of t=3 or t=5 in Figure 2, where we apply bi-directional attention only on y_{0,1,2}; then, in the part with left-to-right attention, we randomly mask several tokens. We can mask multiple tokens because this part applies left-to-right attention, so the masks at other positions do not influence the prediction at a given mask. We call this method PF-free, meaning that the pretrain-finetune discrepancy is reduced.
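One plausible reading of the interval scheme, expressed as a target-side attention mask at generation step t; the exact pattern follows the paper's Figure 2, and the function below is our illustrative sketch under that reading.

```python
import numpy as np

def pf_free_target_mask(t, interval):
    """Target-side attention mask at generation step t (sketch).

    Positions in completed intervals (before `boundary`) attend to
    each other bi-directionally, as if they were extra dialogue
    history; positions inside the current interval attend
    left-to-right only.
    """
    n = t + 1
    boundary = (t // interval) * interval
    allowed = np.tril(np.ones((n, n), dtype=bool))  # causal default
    allowed[:boundary, :boundary] = True            # bi-directional part
    return allowed
```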

Finetune-Generation Discrepancy
A model having a finetune-generation discrepancy means that the way it is used in generation (inference/test) differs from the way it has been trained. Only Trans-MLM has a finetune-generation discrepancy, because of its MLM objective, as shown in Figure 3: during training, there is a masked token, y_1-M, before y_2-M, while in inference there is no masked token before g_2-M when its token is generated. To deal with this problem, we propose that at training time, rather than replacing tokens with [MASK] as in vanilla MLM, we keep all original input tokens unchanged and prepend [MASK] tokens in the input sequence, as illustrated. A prepended [MASK] token uses the same position embedding as its corresponding token. Then, every position after y_1-M attends to y_1 instead of the [MASK] token, and thus the finetune-generation discrepancy of MLM is eliminated. We call the modified model FG-free. A similar method has been explored in Bao et al. (2020a), where an extra pseudo mask is introduced in addition to [MASK] and prepended before the original token, in order to handle the factorization steps of their partially auto-regressive language model.
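The FG-free input construction can be sketched as follows. The exact placement of the prepended [MASK] (immediately before its token here) and the helper name are our assumptions; what matters is that the original token stays visible and the [MASK] reuses its token's position id.

```python
def fg_free_inputs(target_ids, predict_positions, mask_id):
    """Build FG-free training inputs for the target side (sketch).

    Instead of replacing a token with [MASK] as in vanilla MLM, the
    original token is kept and a [MASK] carrying the same position
    id is prepended, so later positions can attend to the real
    token and inference no longer misses masked predecessors.

    Returns (input_ids, position_ids); `predict_positions` are the
    target positions whose tokens the model must predict.
    """
    input_ids, position_ids = [], []
    for pos, tok in enumerate(target_ids):
        if pos in predict_positions:
            # prepended [MASK] shares the position id of its token
            input_ids.append(mask_id)
            position_ids.append(pos)
        input_ids.append(tok)       # original token is kept visible
        position_ids.append(pos)
    return input_ids, position_ids
```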

Experimental Results
The results with the PF-free, FG-free and PF&FG-free models on the small-scale datasets are reported in the previous tables together with the other models. We can see that each of the proposed methods brings some improvement. PF-free improves most automatic metrics over Trans-MLM, but the response appropriateness in human evaluation is not improved. We observe that PF-free can generate responses that lack fluency, which also affects PF&FG-free (Appendix E). In general, our exploration shows that left-to-right attention on the target side is necessary for a generative task.
We examine our FG-free method on both large- and small-scale data. It always brings statistically significant improvement over Trans-MLM in all automatic metrics, and generates more appropriate responses. On small-scale datasets, it outperforms all other frameworks in similarity metrics and achieves comparable performance in response appropriateness to Trans-Dec, which has leveraged much more pre-training data.
This set of experimental results confirms the usefulness of reducing discrepancies in the model. It demonstrates that model discrepancies are indeed important problems to address when a pre-trained LM is used for dialogue generation, and that they have been under-explored.

Conclusion
In this paper, we examined the 4 frameworks for open-domain dialogue based on pre-trained models. We compared their performances on several datasets with the same setting. The comparison revealed that Trans-Dec and Trans-AR are both good choices when large-scale data are available, e.g. further dialogue pre-training. When data are limited, e.g. fine-tuning on small dialogue tasks, Trans-Dec is the most appropriate.
Furthermore, we defined the concepts of pretrain-finetune and finetune-generation discrepancy, and examined the 4 frameworks with respect to them. We have shown that the performances of the 4 frameworks can be largely explained by their respective discrepancies, which hinder their performances. This becomes clearer when the dataset is small.
To further show that reducing the discrepancies can improve the performance, we designed PF-free and FG-free correction methods to reduce the discrepancies on Trans-MLM, and tested the corrected Trans-MLM models on the datasets. Our results confirmed that once discrepancies are eliminated, Trans-MLM can produce better results.
This study is the first investigation of the 4 widely used frameworks based on pre-trained LMs in terms of architectural appropriateness and discrepancies. We believe that this question is important for understanding how a pre-trained model can be used in dialogue generation. It deserves more investigation in the future.

A Multi-Layer Transformer
In this section, we provide some background on the Transformer. The four frameworks we discussed all consist of 12 Transformer blocks. Figure 4(a) shows the general architecture of a Transformer layer, whose most important component is the masked multi-head self-attention. The setting of the attention masks is the largest difference between Trans-Dec and Trans-AR, and it is also the most critical part in implementing our PF-free and FG-free methods.
The input representation H^0 ∈ R^{n×d_h}, where n is the input length and d_h = 768 is the hidden dimension, is the sum of the token embedding, position embedding, and type embedding at each position. H^0 is then encoded into the hidden representations of the i-th layer, H^i = [h^i_1, ..., h^i_n], by

H^i = Trans_i(H^{i-1}),

where Trans_i denotes the i-th Transformer block shown in Figure 4(a). The core component of a Transformer block is the masked multi-head attention, whose outputs C^i = [c^i_1, ..., c^i_n] are computed via C^i = Concat(head_1, ..., head_h), with

head_j = softmax(Q_j K_j^T / √d_k + M) V_j,

where Q_j, K_j, V_j are linear projections of the previous layer's hidden states, d_k is the per-head dimension, and M is the self-attention mask matrix whose entries are 0 where attention is allowed and −∞ where it is blocked.
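The per-head computation can be sketched in numpy for a single head, omitting the learned Q/K/V projections and multi-head concatenation:

```python
import numpy as np

def masked_attention(Q, K, V, allowed):
    """Single-head masked attention: softmax(Q K^T / sqrt(d_k) + M) V,
    where the additive mask M is 0 where `allowed` is True and a
    large negative number where attention is blocked.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    scores = np.where(allowed, scores, -1e9)   # additive mask
    # numerically stable softmax over the key dimension
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V
```

Swapping in the masks from Section 2 (causal, seq2seq, or PF-free) changes only the `allowed` matrix, which is what distinguishes the frameworks at this level.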

B Illustration of Attention Conflict
If bi-directional attention is applied at each generation step, only one token on the target side can be masked per training sample; otherwise there are attention conflicts, i.e. different self-attention mask matrices are required for different masked tokens, while only one mask matrix can be provided per training sample. Figure 5 illustrates the mask-conflict problem. Assume y_1 and y_3 are masked and need to be predicted at the same time. As the figure shows, two different masks are required for predicting y_1 and y_3, which cannot both be applied in a single training step, making it impossible to mask more than one token per step.

C Implementation Details
For the 4 frameworks, we used open-source implementations. Only minor adaptations to our data and task were made (e.g. we rewrote the data loaders to load our experimental datasets and modified the training objective by keeping only the response reconstruction loss). For response generation, we equipped all frameworks with an identical decoding script. We did not modify other parts and used the default settings for hyper-parameters, e.g. the optimizer and learning rate. Some generation examples are given in Appendix E. Although some models (e.g. Trans-ED) produced poor performance on small datasets, all models can generate coherent and fluent responses with large-scale training data, which is consistent with the performances reported in previous papers.

Model           Pre-trained LM                        Pre-training data
Trans-ED        GPT (Radford et al., 2018)            BooksCorpus
Trans-Dec       GPT-2 small (Radford et al., 2019)    WebText
Trans-MLM/AR    BERT base (Devlin et al., 2018)       BooksCorpus, English Wikipedia

Language Models The pre-trained language models used by these frameworks have comparable numbers of parameters, as listed in Table 9, while the pre-training data are on different scales, as listed in the table above.

Trans-ED We use the implementation of the ConvAI2 champion. The model was designed for persona-conditioned dialogue generation. The framework is based on the GPT architecture and uses GPT for parameter initialization. However, the authors only provide a model checkpoint that has been fine-tuned on large-scale dialogue data including Reddit. To examine the ability to utilize a pre-trained LM, we did not use this checkpoint but initialized the model with GPT parameters. We also did not apply post-processing to the generation results (to be consistent with the other experiments).

Table 9: The number of parameters of each tested approach and the average runtime (minutes) per million training samples. The runtime is measured on a 1080Ti GPU device, with the batch size set to fill the GPU memory. Note that the runtime is influenced by code implementation in addition to model structure.

Trans-Dec
We use GPT-2 small for parameter initialization and fine-tune it on the dialogue datasets. PF-free and FG-free are also implemented based on this code. We set the bi-directional attention interval of PF-free to 5; since the average length of ground-truth responses in the datasets is ∼15, this setting is generally appropriate.

D Human Evaluation Details
Some existing studies considered response fluency, coherence, and informativeness. We make the manual evaluation simpler and ask the human evaluators to rate a response in {0, 1, 2}. A score of 0 represents an unacceptable response, which might be flawed in fluency or logic, or be incoherent; special cases include copying the dialogue history completely as the output, and bland responses such as "i do not know what you are talking about , but it 's a good point .". A score of 1 represents an acceptable response that is nevertheless generic or not perfectly coherent with the dialogue history. A score of 2 represents a coherent and informative response.
Dialogue History: one week before election day , early voting is nearly twice as high as <num>
Gold Response: i hope it 's <person> out in full force .
SEQ2SEQ-MMI: i am not voting for a third party candidate . (1)
Trans-ED: i am not sure what you are talking about , but it 's a good thing that <person> is going to win . (1)
Trans-Dec: that 's not true . early voting in nyc is times higher than the national average (2)
Trans-MLM: it 's not even close to <num> % of the vote . i am sure you are right , but there is a huge difference between early voting and <person> voter suppression in ca (1)
Trans-AR: it 's not that high , but i am sure there will be a lot of voter fraud . (2)
FG-free: it 's not even close to <num> yet , but i am sure the polls will show up . (2)
Trans-ED (K): i am not sure what you are talking about . it 's the only one who is going to do with this (0)
Trans-Dec (K): it 's not that early voting is a bad thing , but it shouldnot be the case . (0)
Trans-MLM (K): it 's a little too late . i am not sure if that is the case , but we will see what happens next (2)
Trans-AR (K): it 's not even close to <num> % of voter turnout . i am sure you are right about that (0)
PF-free (K): <num> % of people voted for <person> , so i am going to vote (2)
FG-free (K): it 's not even close to <num> % of voter turnout , but i am sure there are more . (2)
PF&FG-free (K): it 's not even <num> % of the vote . i am going to be a voter (2)
he 's such a good guy . i love him so much , but it doesnot make sense to me that this is the only way we can be together in our lives (2)
Trans-AR: i think about this man every day too . he 's so good at it ! (2)
FG-free: he 's such a good guy . i am so happy for him and his family (2)