GTM: A Generative Triple-wise Model for Conversational Question Generation

Generating appealing questions in open-domain conversations is an effective way to improve human-machine interactions and lead the topic in a broader or deeper direction. To avoid dull or deviated questions, some researchers tried to utilize the answer, i.e., the "future" information, to guide question generation. However, they separate a post-question-answer (PQA) triple into two parts, post-question (PQ) and question-answer (QA) pairs, which may hurt the overall coherence. Besides, the QA relationship is modeled as a one-to-one mapping, which is not reasonable in open-domain conversations. To tackle these problems, we propose a generative triple-wise model with hierarchical variations for open-domain conversational question generation (CQG). Latent variables in three hierarchies are used to represent the shared background of a triple and the one-to-many semantic mappings in both PQ and QA pairs. Experimental results on a large-scale CQG dataset show that our method significantly improves the quality of questions in terms of fluency, coherence, and diversity over competitive baselines.


Introduction
Questioning in open-domain dialogue systems is indispensable, since a good system should be able to interact well with users by not only responding but also asking (Li et al., 2017). Besides, raising questions is a proactive way to guide users deeper and further into conversations (Yu et al., 2016). Therefore, the ultimate goal of open-domain conversational question generation (CQG) is to enhance the interactiveness and maintain the continuity of a conversation.

Table 1: An example of the CQG task, which is talking about a person's eating activity. There are one-to-many mappings in both PQ and QA pairs. The content of each meaningful and relevant question (Q1.1 to Q2.2) is decided by its post and answer. Q3 (dull) and Q4 (deviated) are generated given only the post.

CQG differs fundamentally from traditional question generation (TQG) (Zhou et al., 2019; Kim et al., 2019; Li et al., 2019), which generates a question given a sentence/paragraph/passage and a specified answer within it. In CQG, by contrast, an answer always follows the to-be-generated question and is unavailable during inference (Wang et al., 2019). At the same time, each utterance in the open-domain scenario is casual and can be followed by several appropriate sentences, i.e., a one-to-many mapping (Chen et al., 2019).
Early work on CQG mainly took a given post as the input (Hu et al., 2018), and the generated questions were often dull or deviated (Q3 and Q4 in Table 1). Based on the observation that an answer has strong relevance to its question and post, Wang et al. (2019) tried to integrate the answer into the question generation process. They applied a reinforcement learning framework that first generated a question given the post, and then used a pre-trained matching model to estimate the relevance score (reward) between the answer and the generated question. This method separates a post-question-answer (PQA) triple into post-question (PQ) and question-answer (QA) pairs rather than considering the triple as a whole and modeling its overall coherence. Furthermore, the training of the matching model only utilizes the one-to-one relation of each QA pair and neglects the one-to-many mapping.
An open-domain PQA often takes place under a background that can be inferred from all utterances in the triple and help enhance the overall coherence. When it comes to the semantic relationship in each triple, the content of a specific question is under the control of its post and answer (Lee et al., 2020). Meanwhile, either a post or an answer could correspond to several meaningful questions. As shown in Table 1, the triple is about a person's eating activity (the background of the entire conversation). There are one-to-many mappings in both PQ and QA pairs that construct different meaningful combinations, such as P-Q1.1-A1, P-Q1.2-A1, P-Q2.1-A2 and P-Q2.2-A2. An answer connects tightly to both its post and question, and in turn helps decide the expression of a question.
On these grounds, we propose a generative triple-wise model (GTM) for CQG. Specifically, we first introduce a triple-level variable to capture the shared background among PQA. Then, two separate variables conditioned on the triple-level variable are used to represent the latent spaces for question and answer, and the question variable also depends on the answer variable. During training, the latent variables are constrained to reconstruct both the original question and answer according to the hierarchical structure we define, ensuring that the triple-wise relationship flows through the latent variables without loss. For question generation, we sample the triple-level and answer variables given a post, then obtain the question variable conditioned on them, and finally generate a question based on the post, triple-level, and question variables. Experimental results on a large-scale CQG dataset show that GTM can generate more fluent, coherent, and intriguing questions for open-domain conversations.
The main contribution is threefold: • To generate coherent and informative questions in the CQG task, we propose a generative triple-wise model that models the semantic relationship of a triple at three levels: PQA, PQ, and QA.

Figure 1: The graphical representation of GTM for the training process. z_t is used to capture the shared background among PQA, while z_q and z_a are used to model the diversity in PQ and QA pairs. Solid arrows illustrate the generation of q, a (not used in inference), and qt, while dashed arrows denote posterior distributions of latent variables.
• Our variational hierarchical structure can not only utilize the "future" information (answer), but also capture one-to-many mappings in PQ and QA, which matches the open-domain scenario well.
• Experimental results on a large-scale CQG corpus show that our method significantly outperforms the state-of-the-art baselines in both automatic and human evaluations.

Proposed Model
Given a post as the input, the goal of CQG is to generate the corresponding question. Following previous work (Wang et al., 2019), we leverage the question type qt to control the generated question, and take advantage of the answer information a to improve coherence. In the training set, each conversation is represented as {p, q, qt, a}, consisting of a post p, a question q with its question type qt, and an answer $a = \{a_i\}_{i=1}^{|a|}$.

Overview
The graphical model of GTM for the training process is shown in Figure 1. θ, ϕ, and φ denote the parameters of the generation, prior, and recognition networks, respectively. We integrate answer generation to assist question generation with hierarchical latent variables. First, a triple-level variable z_t is introduced to capture the shared background and is inferred from the PQA utterances. Then the answer latent variable z_a and the question latent variable z_q are sampled from Gaussian distributions conditioned on both the post and z_t. To ensure that the question is controlled by the answer, z_q is also dependent on z_a.

Figure 2: The architecture of GTM. ⊕ denotes the concatenation operation. In the training process, latent variables obtained from recognition networks and the real question type qt are used for decoding. Red dashed arrows refer to the inference process, in which we get latent variables from prior networks, and the predicted question type qt is fed into the question decoder. The answer decoder is only utilized during training to assist the triple-wise modeling.

Input Representation
We use a bidirectional GRU (Cho et al., 2014) as the encoder to capture the semantic representation of each utterance. Take the post p as an example. Each word in p is first encoded into its embedding vector. The GRU then computes forward hidden states $\overrightarrow{h}^p_i = \mathrm{GRU}(\overrightarrow{h}^p_{i-1}, e^p_i)$ and backward hidden states $\overleftarrow{h}^p_i = \mathrm{GRU}(\overleftarrow{h}^p_{i+1}, e^p_i)$, where $e^p_i$ is the embedding vector of word $p_i$. We finally obtain the post representation by concatenating the last hidden states of the two directions: $h^{enc}_p = [\overrightarrow{h}^p_{|p|}; \overleftarrow{h}^p_1]$. Similarly, we obtain representations of the question q and the answer a, denoted as $h^{enc}_q$ and $h^{enc}_a$, respectively. The question type qt is represented by a real-valued, low-dimensional vector $v_{qt}$ that is updated during training and regarded as a linguistic feature benefiting the training of latent variables. We use the actual question type qt during training to provide the information of interrogative words, which is the most important feature for distinguishing question types.

Triple-level Latent Variable
To capture the shared background of the entire triple, we introduce a triple-level latent variable z_t that is inferred from the PQA utterances and is in turn responsible for generating the whole triple. Inspired by Park et al. (2018), we use a standard Gaussian distribution as the prior of z_t: $p(z_t) = \mathcal{N}(z_t; \mathbf{0}, \mathbf{I})$, where $\mathbf{I}$ is the identity matrix.
For the inference of z_t on the training set, we treat the three utterance representations $h^{enc}_p$, $h^{enc}_q$, and $h^{enc}_a$ as a sequence, and use a bidirectional GRU that takes one representation as the input of each time step. The triple representation $h_t$ is obtained by concatenating the last hidden states of both directions. Then, z_t is sampled from $q_\varphi(z_t \mid p, q, a) = \mathcal{N}(z_t; \mu_t, \sigma_t^2 \mathbf{I})$, with $\mu_t = \mathrm{MLP}_\mu(h_t)$ and $\sigma_t = \mathrm{softplus}(\mathrm{MLP}_\sigma(h_t))$, where MLP(·) is a feed-forward network, and the softplus function is a smooth approximation to ReLU that ensures positiveness (Park et al., 2018; Serban et al., 2017).
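As a concrete illustration, the reparameterized sampling of z_t can be sketched in a few lines of numpy. The weight matrices below are random toy stand-ins for the trained MLPs (their sizes match the dimensions reported in our experiments, but nothing here is the actual model):

```python
import numpy as np

rng = np.random.default_rng(0)

def softplus(x):
    # Smooth approximation to ReLU; guarantees a strictly positive std.
    return np.log1p(np.exp(x))

def sample_gaussian(h_t, W_mu, W_sigma, rng):
    """Reparameterized sample z = mu + sigma * eps, with eps ~ N(0, I)."""
    mu = h_t @ W_mu                      # toy stand-in for MLP_mu(h_t)
    sigma = softplus(h_t @ W_sigma)      # toy stand-in for softplus(MLP_sigma(h_t))
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps, mu, sigma

h_t = rng.standard_normal((1, 600))          # triple representation (bi-GRU concat)
W_mu = rng.standard_normal((600, 100)) * 0.01
W_sigma = rng.standard_normal((600, 100)) * 0.01

z_t, mu_t, sigma_t = sample_gaussian(h_t, W_mu, W_sigma, rng)
```

Because the noise eps is drawn outside the deterministic path, gradients can flow through mu and sigma, which is exactly what the reparameterization trick in Section "Training and Inference" relies on.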

One-to-many Mappings
After obtaining z_t, we use a GRU f to get a vector $h^{ctx}_p$ for connecting p and q/a. $h^{ctx}_p$ is then transformed into $h^{ctx}_q$ and $h^{ctx}_a$, which are used in the prior and recognition networks for z_q and z_a: $h^{ctx}_q = \mathrm{MLP}_q(h^{ctx}_p)$ and $h^{ctx}_a = \mathrm{MLP}_a(h^{ctx}_p)$. To model the one-to-many mappings in PQ and QA pairs under the control of z_t, we design two utterance-level variables, z_q and z_a, to represent the latent spaces of question and answer. We define the prior and posterior distributions of z_a as $p_\phi(z_a \mid p, z_t) = \mathcal{N}(z_a; \mu'_a, \sigma'^2_a \mathbf{I})$ and $q_\varphi(z_a \mid p, a, z_t) = \mathcal{N}(z_a; \mu_a, \sigma_a^2 \mathbf{I})$, where $\mu'_a$, $\sigma'_a$, $\mu_a$, and $\sigma_a$, the parameters of the two Gaussian distributions, are calculated as $[\mu'_a; \sigma'_a] = \mathrm{MLP}^{prior}_a([h^{ctx}_a; z_t])$ and $[\mu_a; \sigma_a] = \mathrm{MLP}^{post}_a([h^{ctx}_a; z_t; h^{enc}_a])$, with softplus applied to the standard deviations. To make sure the content of the question is also decided by the answer, and to improve their relatedness, we import z_a into the z_q space. The prior and posterior distributions of z_q are computed analogously: $p_\phi(z_q \mid p, z_t, z_a) = \mathcal{N}(z_q; \mu'_q, \sigma'^2_q \mathbf{I})$ and $q_\varphi(z_q \mid p, q, z_t, z_a) = \mathcal{N}(z_q; \mu_q, \sigma_q^2 \mathbf{I})$, where $[\mu'_q; \sigma'_q] = \mathrm{MLP}^{prior}_q([h^{ctx}_q; z_t; z_a])$ and $[\mu_q; \sigma_q] = \mathrm{MLP}^{post}_q([h^{ctx}_q; z_t; z_a; h^{enc}_q])$.
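The resulting sampling chain z_t → z_a → z_q can be sketched as follows. This is a toy numpy illustration of the conditional structure only: the random linear maps stand in for the prior networks, and the vector dimensions are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
softplus = lambda x: np.log1p(np.exp(x))

def prior_sample(cond, dim, rng):
    """Toy prior network: one random linear map each for mean and pre-softplus std."""
    W_mu = rng.standard_normal((cond.size, dim)) * 0.01
    W_sd = rng.standard_normal((cond.size, dim)) * 0.01
    mu, sd = cond @ W_mu, softplus(cond @ W_sd)
    return mu + sd * rng.standard_normal(dim)

h_ctx = rng.standard_normal(600)     # post context vector (stand-in for h_ctx)
z_t = rng.standard_normal(100)       # z_t ~ N(0, I) at the top level

# z_a is conditioned on the post context and z_t; z_q additionally on z_a.
z_a = prior_sample(np.concatenate([h_ctx, z_t]), 100, rng)
z_q = prior_sample(np.concatenate([h_ctx, z_t, z_a]), 100, rng)
```

The key design choice is visible in the last line: the sample of z_q sees z_a in its conditioning vector, so the question's latent space is explicitly steered by the answer's.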

Question Generation Network
Following the work of Zhao et al. (2017) and Wang et al. (2019), a question type prediction network MLP_qt is introduced to approximate $p_\theta(qt \mid z_q, z_t, p)$ during training and to produce the question type qt during inference. As shown in Figure 2, there are two decoders in our model: one for answer generation, an auxiliary task that only exists in the training process, and one for the desired question generation. The question decoder employs a variant of GRU that takes the concatenation of z_q, z_t, $h^{ctx}_q$, and qt as its initial state, i.e., $s_0 = [z_q; z_t; h^{ctx}_q; v_{qt}]$. For each time step j, it calculates the context vector $c_j$ following Bahdanau et al. (2015), updates its state as $s_j = \mathrm{GRU}(s_{j-1}, [e_{j-1}; c_j])$, and computes the probability distribution $p_\theta(q \mid z_q, z_t, p, qt)$ over all words in the vocabulary as $\mathrm{softmax}(W_o s_j + b_o)$, where $e_{j-1}$ is the embedding vector of the (j−1)-th question word. Similarly, the answer decoder receives the concatenation of z_a, z_t, and $h^{ctx}_a$ as its initial state to approximate $p_\theta(a \mid z_a, z_t, p)$.
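A minimal sketch of the decoder's initial state and a single output projection is below. The weights are random placeholders, the question-type embedding size of 20 is an assumption (the paper does not state it), and the vocabulary is shrunk to a toy size of 1,000 (the experiments use 40,000):

```python
import numpy as np

rng = np.random.default_rng(2)

def softmax(x):
    # Numerically stable softmax over vocabulary logits.
    e = np.exp(x - x.max())
    return e / e.sum()

# Toy initial state: concatenation [z_q; z_t; h_ctx_q; v_qt], as in the model.
z_q, z_t = rng.standard_normal(100), rng.standard_normal(100)
h_ctx_q = rng.standard_normal(600)
v_qt = rng.standard_normal(20)          # assumed question-type embedding size
s_0 = np.concatenate([z_q, z_t, h_ctx_q, v_qt])

# One decoding step: project the state to vocabulary logits (placeholder weights;
# the real decoder also feeds the previous word embedding and attention context).
W_out = rng.standard_normal((s_0.size, 1000)) * 0.01
p_vocab = softmax(s_0 @ W_out)
```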

Training and Inference
Our model GTM is trained to maximize the log-likelihood of the joint probability p(p, q, a, qt). However, this optimization objective is not directly tractable. Inspired by Serban et al. (2017) and Park et al. (2018), we instead maximize the following objective, based on the evidence lower bound, during training:

$\mathcal{L}(\theta, \phi, \varphi) = \mathbb{E}_{q_\varphi}[\log p_\theta(q \mid z_q, z_t, p, qt)]$
$\quad + \mathbb{E}_{q_\varphi}[\log p_\theta(a \mid z_a, z_t, p)]$
$\quad - \mathrm{KL}(q_\varphi(z_t \mid p, q, a) \,\|\, p(z_t))$
$\quad - \mathrm{KL}(q_\varphi(z_a \mid p, a, z_t) \,\|\, p_\phi(z_a \mid p, z_t))$
$\quad - \mathrm{KL}(q_\varphi(z_q \mid p, q, z_t, z_a) \,\|\, p_\phi(z_q \mid p, z_t, z_a))$
$\quad + \mathbb{E}_{q_\varphi}[\log p_\theta(qt \mid z_q, z_t, p)]$

The objective consists of two parts: the variational lower bound (the first five lines) and the question type prediction accuracy (the last line). The variational lower bound includes the reconstruction terms and the KL divergence terms over the three hierarchical latent variables. The gradients through the prior and recognition networks can be estimated using the reparameterization trick (Kingma and Welling, 2014).
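Each KL divergence term between the recognition and prior networks has a closed form, since both are diagonal Gaussians. A sketch of that closed form:

```python
import numpy as np

def kl_diag_gaussians(mu_q, sd_q, mu_p, sd_p):
    """KL( N(mu_q, diag(sd_q^2)) || N(mu_p, diag(sd_p^2)) ), summed over dimensions."""
    return np.sum(
        np.log(sd_p / sd_q)
        + (sd_q ** 2 + (mu_q - mu_p) ** 2) / (2 * sd_p ** 2)
        - 0.5
    )

# Sanity check: the KL of a distribution against itself is zero, e.g. the
# posterior of z_t against the standard-normal prior when they coincide.
zero_kl = kl_diag_gaussians(np.zeros(100), np.ones(100),
                            np.zeros(100), np.ones(100))
```

During training this quantity is what the KL multiplier λ (Section "Hyper-parameter Settings") scales, and it is also the quantity monitored in Figure 3 to detect degeneration.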
During inference, latent variables obtained via the prior networks and the predicted question type qt are fed to the question decoder, corresponding to the red dashed arrows in Figure 2. The inference process is as follows: (1) sample the triple-level latent variable $z_t \sim q_\varphi(z_t \mid p)$; (2) sample the answer variable $z_a \sim p_\phi(z_a \mid p, z_t)$; (3) sample the question variable $z_q \sim p_\phi(z_q \mid p, z_t, z_a)$; (4) predict the question type qt with MLP_qt; (5) generate the question with the question decoder conditioned on p, $z_t$, $z_q$, and qt.

Experiments
In this section, we conduct experiments to evaluate our proposed method. We first introduce the empirical settings, including the dataset, hyper-parameters, baselines, and evaluation measures. Then we report our results under both automatic and human evaluations. Finally, we present some cases generated by different models and further analyze our method.

Hyper-parameter Settings
We keep the 40,000 most frequent words as the vocabulary, and the sentence padding length is set to 30. The dimensions of the GRU layers, word embeddings, and latent variables are 300, 300, and 100, respectively. The prior networks and MLPs have one hidden layer with size 300 and tanh non-linearity, while the recognition networks for both triple-level and utterance-level variables have two hidden layers. We apply a dropout ratio of 0.2 during training. The mini-batch size is 64. For optimization, we use Adam (Kingma and Ba, 2015) with a learning rate of 1e-4. To alleviate the degeneration problem of the variational framework (Park et al., 2018), we apply KL annealing, word drop (Bowman et al., 2016), and a bag-of-words (BOW) loss. The KL multiplier λ gradually increases from 0 to 1, and the word drop probability is 0.25. We implement our model in PyTorch and train it on Titan Xp GPUs.
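KL annealing and word drop can be sketched as below. A linear annealing schedule is assumed here, and the step count is a placeholder; the paper specifies neither the schedule shape nor its length:

```python
import random

def kl_weight(step, total_anneal_steps):
    """Linear KL annealing: the multiplier lambda grows from 0 to 1, then stays at 1."""
    return min(1.0, step / total_anneal_steps)

def word_drop(tokens, p=0.25, unk="<unk>", rng=random.Random(0)):
    # Randomly replace decoder-input tokens with <unk> so the decoder cannot
    # rely purely on teacher forcing and must draw information from the latents.
    return [unk if rng.random() < p else t for t in tokens]

schedule = [kl_weight(s, 10000) for s in (0, 5000, 10000, 20000)]
dropped = word_drop("what did you eat".split())
```

Both tricks push information into the latent variables: annealing keeps the KL penalty weak early in training, and word drop weakens the decoder's shortcut through the ground-truth prefix.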

Baselines
We compare our method with four groups of representative models: (1) S2S-Attn: A simple Seq2Seq model with attention mechanism (Shang et al., 2015). (2) CVAE & kgCVAE: The CVAE model integrates an extra BOW loss to generate diverse questions. The kgCVAE is a knowledge-guided CVAE that utilizes linguistic cues (question types in our experiments) to learn meaningful latent variables (Zhao et al., 2017). (3) STD & HTD: The STD uses a soft typed decoder that estimates a distribution over word types, and the HTD uses a hard typed decoder that specifies the type of each word explicitly with Gumbel-softmax. (4) RL-CVAE: A reinforcement learning method that regards the coherence score (computed by a one-to-one matching network) of a pair of generated question and answer as the reward function (Wang et al., 2019). RL-CVAE is the first work to utilize the future information, i.e., the answer, and is also the state-of-the-art model for CQG.
Additionally, we conduct an ablation study to better analyze our method: (5) GTM-z_t: GTM without the triple-level latent variable, i.e., z_t is not included in the prior and posterior distributions of either z_q or z_a. (6) GTM-a: the variant of GTM that does not take the answer into account; the answer decoder and z_a are removed from the loss function and from the prior and posterior distributions of z_q, and z_t no longer captures semantics from the answer. (7) GTM-z_q/z_a: the GTM variant in which the distributions of z_q are not conditioned on z_a, i.e., the fact that the content of a question is also controlled by its answer is not modeled explicitly by latent variables.
In our model, we use an MLP to predict question types during inference, which differs from conditional training (CT) methods (Li et al., 2016b; Shen and Feng, 2020) that provide the controllable feature, i.e., question types, in advance for inference. Therefore, we do not consider CT-based models as comparable baselines.

Evaluation Measures
To better evaluate our results, we use both quantitative metrics and human judgements in our experiments.

Automatic Metrics
For automatic evaluation, we mainly choose four kinds of metrics: (1) BLEU Scores: BLEU (Papineni et al., 2002) calculates the n-gram overlap of generated questions against ground-truth questions. We use BLEU-1 and BLEU-2 here and normalize them to a 0-1 scale.
(2) Embedding Metrics: Average, Greedy, and Extrema are embedding-based metrics that measure the semantic similarity between the words in generated questions and ground-truth questions (Serban et al., 2017; Liu et al., 2016). We use word2vec embeddings trained on the Google News Corpus in this part. Please refer to Serban et al. (2017) for more details.
(3) Dist-1 & Dist-2: Following Li et al. (2016a), we apply Distinct to report the degree of diversity. Dist-1/2 is defined as the ratio of unique uni/bi-grams over all uni/bi-grams in the generated questions. (4) RUBER Scores: the Referenced metric and Unreferenced metric Blended Evaluation Routine (Tao et al., 2018) has shown a high correlation with human annotation in open-domain conversation evaluation. There are two versions: RubG, based on geometric averaging, and RubA, based on arithmetic averaging. Embedding metrics and BLEU scores measure the similarity between generated and ground-truth questions, RubG/A reflects the semantic coherence of PQ pairs (Wang et al., 2019), and Dist-1/2 evaluates the diversity of questions.
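For reference, the Dist-n metric as defined above can be computed as:

```python
from collections import Counter

def distinct_n(sentences, n):
    """Dist-n: ratio of unique n-grams to total n-grams over all generated questions."""
    ngrams = Counter()
    for sent in sentences:
        toks = sent.split()
        for i in range(len(toks) - n + 1):
            ngrams[tuple(toks[i:i + n])] += 1
    total = sum(ngrams.values())
    return len(ngrams) / total if total else 0.0

qs = ["what did you eat", "what did you cook"]
# 8 unigram tokens in total, 5 of them unique ("what", "did", "you", "eat", "cook").
dist1 = distinct_n(qs, 1)
```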

Human Evaluation Settings
Inspired by Wang et al. (2019) and Shen et al. (2019), we use the following three criteria for human evaluation: (1) Fluency measures whether the generated question is reasonable in logic and grammatically correct. (2) Coherence denotes whether the generated question is semantically consistent with the given post. Incoherent questions include dull cases. (3) Willingness measures whether a user is willing to answer the question. This criterion is meant to judge how likely the generated questions are to elicit further interactions.
We randomly sample 500 examples from the test set and generate questions using the models mentioned above. Then, we send each post and the corresponding 10 generated questions to three human annotators in random order, and require them to evaluate whether each question satisfies the criteria defined above. All annotators are postgraduate students and are not involved in other parts of our experiments.

Experimental Results
Now we demonstrate our experimental results on both automatic evaluation and human evaluation.

Automatic Evaluation Results
The automatic results are shown in Table 2. The top part shows the results of all baseline models, and we can see that GTM outperforms the other methods on all metrics (significance tests (Koehn, 2004), p-value < 0.05), which indicates that our proposed model improves the overall quality of generated questions. Specifically, Dist-2 and RubA improve by 2.43% and 1.90%, respectively, over the state-of-the-art RL-CVAE model. First, higher embedding metrics and BLEU scores show that questions generated by our model are similar to the ground truths in both topics and contents.
Second, taking the answer into account and using it to decide the expression of a question improves the consistency of PQ pairs as evaluated by RUBER scores. Third, higher Distinct values illustrate that the one-to-many mappings in PQ and QA pairs make the generated questions more diverse.
The bottom part of Table 2 shows the results of our ablation study, which demonstrates that taking advantage of the answer information, modeling the shared background of the entire triple, and considering one-to-many mappings in both PQ and QA pairs all help enhance the performance of our hierarchical variational model in terms of relevance, coherence, and diversity.

Human Evaluation Results

As shown in Table 3, GTM alleviates the problem of generating dull and deviated questions compared with other models (significance tests (Koehn, 2004), p-value < 0.05). Both our proposed model and the state-of-the-art RL-CVAE utilize the answer information, and their results suggest that answers assist the question generation process. Besides, GTM produces more relevant and intriguing questions, which indicates the effectiveness of modeling the shared background and the one-to-many mappings in the CQG task. The inter-annotator agreement is calculated with Fleiss' kappa (Fleiss and Cohen, 1973). Fleiss' kappa for Fluency, Coherence, and Willingness is 0.493, 0.446, and 0.512, respectively, indicating "Moderate Agreement" for all three criteria.

Question-Answer Coherence Evaluation
Automatic metrics in Section "Automatic Metrics" are designed to compare generated questions with ground-truth ones (RUBER also takes the post into consideration), but ignore answers in the evaluation process. To measure the semantic coherence between generated questions and answers, we apply two methods (Wang et al., 2019): (1) Cosine Similarity: We use the pre-trained InferSent model (Conneau et al., 2017) to obtain sentence embeddings and calculate the cosine similarity between the embeddings of generated questions and answers. (2) Matching Score: We use the GRU-MatchPyramid model (Wang et al., 2019), which adds the MatchPyramid network (Pang et al., 2016) on top of a bidirectional GRU, to calculate semantic coherence. As shown in Table 4, questions generated by GTM are more coherent with answers. Owing to the triple-level latent variable that captures the shared background, the one-to-many mappings in PQ and QA pairs, and the explicit dependency between z_q and z_a, GTM improves the relevance of QA pairs.
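The cosine-similarity part of this evaluation reduces to the following computation, where the vectors stand in for InferSent sentence embeddings of a generated question and its answer:

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine similarity between two sentence-embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy embeddings: parallel vectors score 1, orthogonal vectors score 0.
same = cosine_similarity(np.array([1.0, 2.0]), np.array([2.0, 4.0]))
orth = cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
```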

Case Study
In Table 5, we list the generated results of two posts from the test set to compare the performance of different models.
In the first case, both the post and the answer mention two topics, "donation" and "song", so a good question should consider their relation. Besides, since the answer here begins with "because", "why" and "what (reason)" questions are reasonable. In the second case, the post only talks about "pen", while the answer refers to "ink", which means there is a topic transition the question needs to cover. The second case shows the effectiveness of an answer that not only decides the expression of a question but also improves the overall coherence of a triple. Questions generated by GTM are more relevant to both posts and answers, and are more likely to attract people to answer them. In contrast, other baselines may generate dull or deviated questions; even the RL-CVAE model, which considers the answer information, only copies topic words from answers (e.g., the question in case two) and fails to ensure PQA coherence.

Further Analysis of GTM
Variational models suffer from the notorious degeneration problem, where the decoders ignore latent variables and reduce to vanilla Seq2Seq models (Park et al., 2018; Wang et al., 2019). Generally, the KL divergence measures the amount of information encoded in a latent variable. In the extreme case where the KL divergence of a latent variable z equals zero, the model completely ignores z, i.e., it degenerates. Figure 3 shows that the total KL divergence of GTM stays around 2 after 18 epochs, indicating that the degeneration problem does not occur in our model and the latent variables play their corresponding roles.

Related Work
Research on open-domain dialogue systems has developed rapidly (Majumder et al., 2020; Shen et al., 2021), and our work mainly touches on two fields: open-domain conversational question generation (CQG) and context modeling in dialogue systems. We introduce these two fields below and point out the main differences between our method and previous ones.

CQG
Traditional question generation (TQG) has been widely studied in reading comprehension (Zhou et al., 2019; Kim et al., 2019), sentence transformation (Vanderwende, 2008), question answering (Li et al., 2019; Nema et al., 2019), visual question generation (Fan et al., 2018), and task-oriented dialogues (Li et al., 2017). In such tasks, finding information via a generated question is the major goal, and the answer is usually part of the input. Different from TQG, CQG aims to enhance the interactiveness and persistence of conversations. Meanwhile, the answer is "future" information, i.e., it is unavailable during inference. An early study on CQG used soft and hard typed decoders to capture the distribution of different word types in a question. Hu et al. (2018) added a target aspect to the input and proposed an extended Seq2Seq model to generate aspect-specific questions. Wang et al. (2019) devised two methods, based on either reinforcement learning or a generative adversarial network (GAN), to further enhance the semantic coherence between posts and questions under the guidance of answers.

Context Modeling in Dialogue Systems
Existing methods mainly focus on the historical context in multi-turn conversations, and hierarchical models occupy a vital position in this field. The hierarchical recurrent encoder-decoder (HRED) model introduced a context RNN to integrate historical information from utterance RNNs. To capture utterance-level variations, Serban et al. (2017) proposed Variational HRED (VHRED), which augments HRED with CVAEs. After that, VHCR (Park et al., 2018) added a conversation-level latent variable on top of VHRED, while CSRR (Shen et al., 2019) used latent variables in three hierarchies to model the complex dependency among utterances. To detect relevant utterances in context, Tian et al. (2017) applied cosine similarity, and later work used attention mechanisms; HRAN (Xing et al., 2018) combined attention results at both the word and utterance levels. Besides, future information has also been considered for context modeling. Shen et al. (2018) separated the context into history and future parts, and assumed that each part, conditioned on a latent variable, follows a Gaussian distribution. Future utterances have also been used in the discriminator of a GAN, similar to Wang et al. (2019).
The differences between our method and the aforementioned ones in Sections 4.1 and 4.2 are: (1) Rather than dividing PQA triples into two parts, i.e., PQ (history and current utterances) and QA (current and future utterances) pairs, we model the overall coherence by utilizing a latent variable to capture the shared background of a triple. (2) Instead of treating the relationship between question and answer as a text matching task that lacks consideration of diversity, we incorporate utterance-level latent variables to help model the one-to-many mappings in both PQ and QA pairs.

Conclusion
We propose a generative triple-wise model, named GTM, for generating appropriate questions in open-domain conversations. GTM simultaneously models the shared background of a triple and the one-to-many mappings in PQ and QA pairs with latent variables in three hierarchies. It is trained end-to-end in a single stage, without the pre-training required by the previous state-of-the-art model that also takes the answer into consideration. Experimental results on a large-scale CQG dataset show that GTM generates fluent, coherent, informative, and intriguing questions.