Iterative GNN-based Decoder for Question Generation

Natural question generation (QG) aims to generate questions from a passage such that the generated questions can be answered from that passage. Most state-of-the-art models condition on the previously generated text at each decoding step. However, (1) they ignore the rich structural information hidden in the previously generated text, and (2) they ignore the impact of copied words on the passage. We observe that the information in previously generated words serves as auxiliary information for subsequent generation. To address these problems, we design the Iterative Graph Network-based Decoder (IGND), which models the previous generation with a Graph Neural Network at each decoding step. Moreover, our graph model captures dependency relations in the passage that boost generation. Experimental results demonstrate that our model outperforms state-of-the-art models on sentence-level QG tasks on the SQuAD and MARCO datasets.


Introduction
Automatic Question Generation (QG) is the task of generating question-answer pairs from a declarative sentence. QG has many useful applications: (1) it improves the question answering task (Chen et al., 2017) by providing more training data (Tang et al., 2017; Yuan et al., 2017); (2) it generates practice exercises and assessments for educational purposes (Heilman and Smith, 2010); and (3) it helps dialog systems to kick-start and continue a conversation with human users (Mostafazadeh et al., 2016). In this study, we focus on sentence-level QG tasks.
Conventional QG methods (Mostow and Chen, 2009; Heilman and Smith, 2010; Dhole and Manning, 2020) rely on heuristic rules or hand-crafted templates and suffer from a significant lack of question diversity, sticking to a few simple and reliable syntactic transformation patterns. Recently, neural approaches to QG have achieved remarkable success by applying large-scale reading comprehension datasets and employing the encoder-decoder framework. Most existing works are based on the sequence-to-sequence (Seq2Seq) network, incorporating the attention mechanism and copy mode, as applied by (Zhou et al., 2018). Intuitively, connecting an answer to a passage lies at the heart of this task. (Song et al., 2018) leveraged multi-perspective matching methods and (Sun et al., 2018) proposed a position-aware model to put more emphasis on answer-surrounded context words. (Zhao et al., 2018) aggregated paragraph-level context to provide sufficient information for question generation. Later work employed the Graph2Seq architecture to capture the information in a passage.

Figure 1: An example (lower-cased) of using the structural information hidden in previously generated words and the impact of copied words, where the answer words are in blue and the copied words are in purple. The model copies the right word, develop, with high certainty (a 0.97 score).
Most models with state-of-the-art performance model the previously generated text at each decoding step. However, they ignore (1) the rich structural information hidden in the generated words and (2) the impact of copied words on the passage. We perceive that this information offers auxiliary signals for future generation. In Figure 1, the copied words donald davies help the model copy develop with high certainty (a 0.97 score). The copied words donald davies form the subject in the passage, and the answer message routing methodology is the object. After capturing the structural information from the generated words, the model pays more attention to the words related to the generated words and copies the predicate in the passage. However, the decoders in most QG models process the generated text as a sequence of words, ignoring the text structure. Therefore, it is hard for most QG models to capture the structural information in previously generated words. In addition, the information about which words have been copied changes at each step and must be updated iteratively, which most QG models cannot achieve.
To address these issues, in this paper, we design an Iterative Graph Network-based Decoder (IGND) to model the structural information in the previous generation at each decoding step using a Graph Neural Network. We observe that the words copied from a passage play a decisive role in the semantics of the whole question, so we model the copied-word information to capture structure and its impact on the passage. We introduce a role tag into the passage graph: all words have the role tag no-copy, except answer words, which have the tag answer. The IGND updates the role tags at each decoding step; for example, a node's tag changes to copied when its word is copied into the question at that step. The information is then aggregated by a novel bidirectional Gated Graph Neural Network (bi-GGNN). Moreover, we propose a relational-graph encoder, which employs a similar bi-GGNN to capture the dependency relations of a passage that boost generation.
We performed experiments on two reading comprehension datasets, SQuAD and MARCO, and obtained promising results. Our model achieves new state-of-the-art results on sentence-level QG tasks on both datasets, with BLEU-4 scores of 20.33 on SQuAD and 23.87 on MARCO. Our code is publicly available 1 .
Our main contributions are as follows: • We design an Iterative Graph Network-based Decoder (IGND) to capture the structural information in the generated text and model the copied words at each decoding step.
• We propose a relational-graph encoder to encode the dependency relations in the passages and establish the connections between an answer and a passage.

1 https://github.com/sion-zcfei/IGND
• The proposed model, which focuses on sentence-level QG tasks, achieves new state-of-the-art scores and outperforms existing methods on the standard SQuAD and MARCO benchmarks for QG.

Model Description
In this section, we define the question generation task and present the Graph-to-Sequence (Graph2Seq) model with our IGND. We design and discuss the details of each component as shown in Figure 2.

Problem Formulation
The question generation task generates natural language questions based on given sentences (Zhou et al., 2018). The generated questions must be answerable from the input data. We assume that a text passage is a sequence of word tokens X^p = {x^p_1, x^p_2, ..., x^p_N}, and a target answer is a sequence of word tokens X^a = {x^a_1, x^a_2, ..., x^a_L}. The natural question generation task generates the best natural language question, consisting of a sequence of word tokens Ŷ = {y_1, y_2, ..., y_T}, that maximizes the conditional likelihood:

Ŷ = arg max_Y P(Y | X^p, X^a).

Here N, L, and T are the lengths of the passage, the answer, and the question, respectively. We focus on the setting where the model learns from a set of (passage, answer, question) triples. Existing QG approaches (Zhou et al., 2018; Sun et al., 2018; Song et al., 2018; Zhao et al., 2018) share the same assumption.

Graph2Seq Model with Iterative Graph Network-based Decoder

Compared to RNNs, GNNs can efficiently exploit the rich hidden text structure, such as syntactic information. In addition, they can model the global relations among the sequence words to improve the representations. We construct a directed and weighted text graph G based on the dependency tree. In a passage graph, each passage word is treated as a node and the dependency relation between two words is treated as an edge. Our Graph2Seq model encodes the passage graph with dependency relations and decodes the question sequence with the IGND.

Figure 2: Overall architecture of the proposed model. In the Iterative Graph Network-based Decoder, the shade of a node's color indicates how high its copy score is, and zero-score nodes are white. Answer nodes and copied nodes are yellow and purple, respectively.
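As a concrete illustration of this graph construction, the sketch below builds a directed, edge-labeled graph from dependency triples. In practice the triples would come from a dependency parser such as spaCy; the tokens, relations, and helper names here are illustrative toy values, not the paper's implementation.

```python
# Build a directed, edge-labeled passage graph from dependency triples.
# A real system would obtain the triples from a dependency parser
# (e.g. spaCy); the triples below are hardcoded for illustration only.

def build_passage_graph(tokens, dep_triples):
    """tokens: list of words; dep_triples: (head_idx, relation, dep_idx)."""
    graph = {
        "nodes": list(range(len(tokens))),
        "out_edges": {i: [] for i in range(len(tokens))},
        "in_edges": {i: [] for i in range(len(tokens))},
    }
    for head, rel, dep in dep_triples:
        graph["out_edges"][head].append((dep, rel))   # head -> dependent
        graph["in_edges"][dep].append((head, rel))    # reverse view
    return graph

tokens = ["davies", "developed", "message", "routing"]
triples = [(1, "nsubj", 0), (1, "obj", 3), (3, "compound", 2)]
g = build_passage_graph(tokens, triples)
```

Keeping both incoming and outgoing adjacency makes the bidirectional aggregation described below a simple lookup per node.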

Relational Encoder
Answer information is crucial for generating high-quality, answer-relevant questions, and dependency relations connect the answer and passage words. To exploit these relations, we propose a relational embedding that aggregates global dependency relations for each word. Intuitively, the relational embedding indicates which words deserve more attention.
First, we adopt a bi-LSTM encoder to obtain the context hidden states H:

h_i = BiLSTM(e_i, h_{i-1}),

where e_i is the concatenation of the GloVe embedding of the word, the BERT embedding of the word, the answer position embedding, the named entity embedding, the part-of-speech embedding, and the word case embedding, as proposed by (Zhou et al., 2018), and h_i is the concatenation of the forward and backward hidden states of the i-th token in passage X^p. We obtain the answer hidden states H^a in a similar way. Then, we compute the answer-aware weighted context hidden states H^p:

s_{ij} = v_a^T tanh(W_a h^a_j + W_h h_i),
β_{ij} = exp(s_{ij}) / Σ_k exp(s_{ik}),
h^p_i = [h_i; Σ_j β_{ij} h^a_j],

where H = [h_1, h_2, ..., h_N] is the passage hidden states and v_a, W_a, W_h are trainable weight matrices.
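The answer-aware weighting can be sketched in numpy as follows. The additive scoring form mirrors the equations above, but the dimensions, random parameters, and the concatenation used to form H^p are illustrative assumptions rather than the paper's exact configuration.

```python
import numpy as np

# Toy answer-aware attention: each passage hidden state attends over
# the answer hidden states, and the attended answer summary is
# concatenated back onto the passage state. Shapes are toy values.
rng = np.random.default_rng(0)
d = 4                             # hidden size (toy)
N, L = 5, 2                       # passage / answer lengths
H = rng.normal(size=(N, d))       # passage hidden states
H_a = rng.normal(size=(L, d))     # answer hidden states
W_a = rng.normal(size=(d, d))
W_h = rng.normal(size=(d, d))
v_a = rng.normal(size=(d,))

A = H_a @ W_a.T                   # W_a h^a_j for all j      -> (L, d)
B = H @ W_h.T                     # W_h h_i for all i        -> (N, d)
scores = np.tanh(B[:, None, :] + A[None, :, :]) @ v_a        # (N, L)

# softmax over the answer tokens (numerically stabilized)
beta = np.exp(scores - scores.max(axis=1, keepdims=True))
beta /= beta.sum(axis=1, keepdims=True)

attended = beta @ H_a                                        # (N, d)
H_p = np.concatenate([H, attended], axis=1)                  # (N, 2d)
```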
To learn the graph embedding from the text passage graph, we adopt a novel bi-GGNN that fuses the intermediate node embeddings from the incoming and outgoing directions in every iteration. In the bi-GGNN, the passage embedding of each node is initialized from the answer-aware hidden states H^p, and the relational embeddings are initialized randomly. The graph parameters are shared at every hop of computation. At every node in the graph, we apply a mean aggregator over the neighboring node passage embeddings to obtain the aggregation vectors for the two directions:

a^k_{in,i} = MEAN({h^{k-1}_i} ∪ {h^{k-1}_j : j ∈ N_in(i)}),
a^k_{out,i} = MEAN({h^{k-1}_i} ∪ {h^{k-1}_j : j ∈ N_out(i)}).

Similarly, we obtain the relation embedding aggregation vectors:

ra^k_{in,i} = MEAN({r^{k-1}_i} ∪ {r_{ij} : j ∈ N_in(i)}),

and analogously for the outgoing direction, where r_{ij} is the relation embedding between nodes i and j.
We fuse the information aggregated in the two directions at each hop:

h^k_{agg,i} = Fuse(a^k_{in,i}, a^k_{out,i}),

where the fusion function is a gated sum of the two inputs:

Fuse(a, b) = z ⊙ a + (1 − z) ⊙ b,   z = σ(W_z [a; b; a ⊙ b; a − b] + b_z),

where ⊙ is the component-wise multiplication and σ is a sigmoid function. We use a GRU (Cho et al., 2014) to update the node embeddings and incorporate the aggregated information:

h^k_i = GRU(h^{k-1}_i, h^k_{agg,i}).

After n hops of computation, we obtain the final context embedding h^n_{c_i} and relational embedding h^n_{rela_i} for node i. The node embedding incorporating both text information and syntactic information is then calculated as:

h^n_i = [h^n_{c_i}; h^n_{rela_i}].

Furthermore, we obtain a graph-level embedding h^G with max pooling:

h^G = maxpool(W_g h^n_i),

where W_g is a trainable weight matrix.
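A single hop of this bidirectional aggregate-and-fuse step can be sketched as follows. The gate parameterization here (z computed from the concatenation [a; b] only) is a simplifying assumption, and all shapes and adjacency are toy values.

```python
import numpy as np

# One hop of bidirectional mean aggregation plus gated fusion:
# aggregate incoming and outgoing neighbors separately, then fuse the
# two direction vectors with z * a + (1 - z) * b.
rng = np.random.default_rng(1)
d = 4
num_nodes = 3
h = rng.normal(size=(num_nodes, d))       # node embeddings (toy)
out_edges = {0: [1], 1: [2], 2: []}       # toy adjacency lists
in_edges = {0: [], 1: [0], 2: [1]}
W_z = rng.normal(size=(d, 2 * d))         # gate parameters (assumption)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mean_aggregate(h, neigh, i):
    # include the node itself, matching the MEAN({h_i} ∪ {h_j}) form
    idx = [i] + neigh[i]
    return h[idx].mean(axis=0)

def fuse(a, b):
    z = sigmoid(W_z @ np.concatenate([a, b]))
    return z * a + (1.0 - z) * b          # component-wise gated sum

h_new = np.stack([
    fuse(mean_aggregate(h, in_edges, i), mean_aggregate(h, out_edges, i))
    for i in range(num_nodes)
])
```

Because z lies in (0, 1), each fused component is a convex combination of the two directional aggregates; in the full model the fused vector would then feed a GRU update rather than replace the state directly.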

Iterative Graph Network-based Decoder
We adopt an architecture similar to other QG models: an attention-based LSTM decoder with a copy mechanism (Sun et al., 2018). However, most existing decoders ignore the structural information hidden in previously generated words and the impact of copied words on the passage, which could serve as auxiliary information. To address this problem, we design the Iterative Graph Network-based Decoder (IGND). The decoder takes the graph-level embedding, passed through two separate fully-connected layers, as the initial hidden state s_0 and initial context vector c_0. We construct a decoder graph G^d similar to the passage graph G, adding role tag information to the node embeddings. Each node has a role tag among answer, copied, and no-copy, updated at each decoding step: nodes representing answer words keep the answer tag throughout decoding, nodes whose words have been copied into the question obtain the copied tag, and the other nodes obtain the no-copy tag:

tag(x_i) = 0 if x_i has been copied; 1 if x_i is a part of the answer; 2 if x_i has not been copied.

Intuitively, the role tag guides the model to incorporate the dependency relations to generate answer-relevant questions, as shown in Figure 1.
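The role-tag bookkeeping is straightforward to sketch. The 0/1/2 ids follow the scheme above; the function names are ours, for illustration only.

```python
# Role-tag bookkeeping for the decoder graph: every node starts as
# no-copy except answer nodes, and a node flips to "copied" once its
# word has been emitted. Ids follow the 0/1/2 scheme in the text.
COPIED, ANSWER, NO_COPY = 0, 1, 2

def init_role_tags(num_nodes, answer_idx):
    return [ANSWER if i in answer_idx else NO_COPY
            for i in range(num_nodes)]

def update_role_tags(tags, copied_idx):
    # answer nodes keep their tag for the whole decoding process
    return [COPIED if (i == copied_idx and t != ANSWER) else t
            for i, t in enumerate(tags)]

tags = init_role_tags(5, answer_idx={3, 4})
tags = update_role_tags(tags, copied_idx=1)   # word 1 copied at step t
```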
At each decoding step t, the embedding of each node h^t_{d_i} is reinitialized:

h^t_{d_i} = [h^n_i; r^t_i],

where h^n_i is the node embedding of the passage graph and r^t_i is the embedding of the role tag for node i at step t. We then apply a bi-GGNN with a mean aggregator to aggregate the node embeddings, similar to Section 2.2.1. After n hops of computation, we obtain the final node embeddings h^t_{d_i} in the decoder graph.
For each decoding step t, the LSTM reads the embedding of the previous word w_{t−1}, the previous attentional context vector c_{t−1}, and the previous hidden state s_{t−1} to compute its current hidden state:

s_t = LSTM(w_{t−1}, c_{t−1}, s_{t−1}).

At time step t, the attention weights and the context vector are calculated as:

α^t_i ∝ exp(score(s_t, h^t_{d_i})),   c_t = Σ_i α^t_i h^t_{d_i},

where the attention weights α incorporate the previous generation information.
The copy mode copies words directly from the source sequence. As the attention weights measure the relevance of each input word to the partial decoding state and incorporate the generated-word information, we treat α^t as the copy probability: P_copy = α^t.
Then, s_t and c_t are fed into a two-layer feedforward network to produce the vocabulary distribution P_vocab.
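Combining the two distributions follows the standard pointer-generator pattern. Below is a minimal numpy sketch, with a toy vocabulary and a hand-set gate value standing in for the learned p_g.

```python
import numpy as np

# Pointer-generator mixture: P(w) = p_g * P_vocab(w) + (1 - p_g) * P_copy(w).
# The copy distribution is built by scattering the attention weights
# onto the vocabulary ids of the source tokens. All values are toy.
vocab_size = 6
src_ids = np.array([2, 4, 4, 5])          # vocab id of each source token
alpha = np.array([0.1, 0.2, 0.3, 0.4])    # attention = copy weights
P_vocab = np.full(vocab_size, 1.0 / vocab_size)   # toy uniform softmax
p_g = 0.6                                 # generation gate (sigmoid output)

P_copy = np.zeros(vocab_size)
np.add.at(P_copy, src_ids, alpha)         # repeated source tokens accumulate

P_final = p_g * P_vocab + (1.0 - p_g) * P_copy
```

Since P_vocab and P_copy each sum to one, the mixture is itself a valid probability distribution for any gate value in [0, 1].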
The final probability distribution is the combination of the two modes:

P(y_t) = p_g P_vocab(y_t) + (1 − p_g) P_copy(y_t),

where p_g is computed from the context vector c_t, the decoder hidden state s_t, and the decoder input w_t:

p_g = σ(W_g [c_t; s_t; w_t]),

where W_g is a trainable weight matrix and σ is a sigmoid function. We train our model with the negative log-likelihood of the target sequence y:

L = − Σ_t log P(y_t | y_<t, X^p, X^a).

Experimental Settings

Dataset

SQuAD
The SQuAD (Rajpurkar et al., 2016) dataset contains 536 Wikipedia articles and more than 100K crowd-sourced questions about the articles. Answers to the questions are provided as spans of tokens in the articles. We use the sentence-level data split shared by (Zhou et al., 2018).

MARCO
The MS MARCO dataset (Nguyen et al., 2016) contains 100,000 queries with corresponding answers and passages. All questions are sampled from real anonymized user queries, and context passages are extracted from real web documents. We picked the subset of MS MARCO in which answers are sub-spans within the passages to construct a sentence-level dataset, which contains 46,109, 4,539, and 4,539 sentence-question-answer triples for training, validation, and test, respectively.

Baseline Methods and Metrics
For a fair comparison, we compare against the following recent works on the sentence-level QG dataset: NQG++ (Zhou et al., 2018): a feature-enriched Seq2Seq model.
MPQG (Song et al., 2018): uses different matching strategies to explicitly model the information between answer and context.
Answer-focused Position-aware model (Sun et al., 2018): generates an accurate interrogative word and focuses on important context words.
s2sa-at-mp-gsa (Zhao et al., 2018): employs a gated attention encoder and a maxout pointer decoder to deal with long text inputs.
ASs2s (Kim et al., 2019): proposes an answer separated Seq2Seq model by replacing the answer in the input sequence with some specific words.
To the Point Context: extracts answer-relevant relations in the sentence and encodes both the sentence and relations to capture answer-focused representations.
QG-pg (Jia et al., 2020): leverages the paraphrase information to the QG model.
Graph2seq+RL+BERT: a BERT-enhanced Graph2seq QG model with reinforcement learning.
QQP & QAP with BERT (Zhang and Bansal, 2019): combines the QG task and QA task with BERT.
Syn-QG (Dhole and Manning, 2020): a rule-based QG model that uses the PropBank argument descriptions and VerbNet state predicates to incorporate shallow semantic content. It is a SOTA model for the sentence-level QG task.
Recurrent BERT (Chan and Fan, 2020): employs the pre-trained BERT language model to tackle question generation tasks. It is also a SOTA model in sentence-level QG task.
We evaluate the performance of our models using BLEU (Papineni et al., 2002) and ROUGE-L (Lin, 2004), which are widely used in previous QG works.
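For intuition, a minimal sentence-level BLEU-4 (modified n-gram precisions combined with a brevity penalty) can be sketched as below. Reported results use the standard evaluation scripts; smoothing and corpus-level aggregation are omitted here for brevity.

```python
import math
from collections import Counter

# Minimal sentence-level BLEU-4: geometric mean of modified 1..4-gram
# precisions, scaled by a brevity penalty. For intuition only.

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n])
                   for i in range(len(tokens) - n + 1))

def bleu4(candidate, reference):
    precisions = []
    for n in range(1, 5):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        overlap = sum((cand & ref).values())   # clipped n-gram matches
        total = max(sum(cand.values()), 1)
        if overlap == 0:
            return 0.0                         # no smoothing in this sketch
        precisions.append(overlap / total)
    # brevity penalty: punish candidates shorter than the reference
    bp = min(1.0, math.exp(1.0 - len(reference) / len(candidate)))
    return bp * math.exp(sum(math.log(p) for p in precisions) / 4)

ref = "what did donald davies develop".split()
score = bleu4("what did donald davies create".split(), ref)
```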

Implementation Details
We fix the 300-dim GloVe vectors for the most frequent 70,000 words in the training set. We compute the 1024-dim BERT embeddings on the fly for each word in the text using a trainable weighted sum of all BERT layer outputs. The embedding sizes of the case, answer, copy, POS, and NER tags are set to 3, 3, 3, 12, and 8, respectively. We set the hidden state size of the BiLSTM to 150 so that the concatenated state size for both directions is 300. The size of all other hidden layers is set to 300. We apply a variational dropout rate of 0.4 after word embedding layers and 0.3 after RNN layers. The number of GNN hops in both the encoder and decoder is set to 4. We use Adam (Kingma and Ba, 2014) as the optimizer, with the learning rate set to 0.001.
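The trainable weighted sum over BERT layer outputs is a scalar mix: softmax-normalized per-layer weights applied to the stacked layer activations. The sketch below shows the forward computation with toy shapes; the per-layer scalars s and the global scale gamma would be learned during training.

```python
import numpy as np

# Scalar mix over BERT layer outputs: mixed = gamma * sum_l w_l * H_l,
# with w = softmax(s). Layer count and activations are toy stand-ins.
rng = np.random.default_rng(2)
num_layers, seq_len, dim = 25, 7, 1024   # e.g. 24 layers + embedding layer
layer_outputs = rng.normal(size=(num_layers, seq_len, dim))
s = np.zeros(num_layers)                 # trainable scalars (init 0)
gamma = 1.0                              # trainable global scale

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

w = softmax(s)                           # uniform at initialization
mixed = gamma * np.tensordot(w, layer_outputs, axes=1)   # (seq_len, dim)
```

With zero-initialized scalars the mix starts as a plain average over layers, and training then learns which layers matter most for the task.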

We reduce the learning rate by a factor of 0.5 if the validation BLEU-4 score stops improving for three epochs, and we stop training when no improvement is seen for 10 epochs. We clip the gradient at length 10. The batch size is set to 60 for both SQuAD and MARCO, and the beam search width is set to 5. All hyperparameters are tuned on the development set.

Main Results

Table 1: Results on the SQuAD sentence-level dataset (Zhou et al., 2018).

Models | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | ROUGE-L
NQG++ (Zhou et al., 2018) | 42.46 | 26.33 | 18.46 | 13.51 | -
MPQG (Song et al., 2018) | - | - | - | 14.71 | 42.60
Answer-focused Position-aware model (Sun et al., 2018) | 43.02 | 28.14 | 20.51 | 15.64 | -
s2sa-at-mp-gsa (Zhao et al., 2018) | 44 | - | - | - | -
Our Model | - | - | - | 20.33 | -

Table 2: Results on the MARCO sentence-level dataset.

Models | BLEU-4
s2sa-at-mp-gsa (Zhao et al., 2018) | 16.02
Answer-focused Position-aware model (Sun et al., 2018) | 19.45
QG with semantic matching (Ma et al., 2020) | 20.46
QG-pg (Jia et al., 2020) | 21.61
Graph2seq+RL+BERT | 22.59
Our Model | 23.87

Table 1 shows the experimental results on the SQuAD sentence-level dataset. For a fair comparison, we report results on the sentence-level dataset and exclude paragraph-level results (Dong et al., 2019; Bao et al., 2020; Qi et al., 2020). In terms of BLEU-4, which is regarded as the main evaluation metric for text generation, our model yields the best result, with 20.33, achieving new state-of-the-art results on SQuAD for sentence-level QG.

We also perform experiments on MARCO and achieve state-of-the-art results, as shown in Table 2. SQuAD and MARCO are built in different ways: the questions in SQuAD are generated by crowd-workers, whereas the questions in MARCO are sampled from real user queries. The experimental results on the two datasets validate the generalization and robustness of our model.

Human Evaluation
To further assess the quality of generated questions, we perform a human evaluation comparing our model with the strong baseline Graph2seq+RL+BERT. We randomly select 100 samples from SQuAD and ask three annotators to score the generated questions on three aspects: Fluency measures whether a question is grammatical and fluent; Relevancy measures whether the question is relevant to the input context; Answerability indicates whether the question can be answered by the given answer.
Scores range from 0 to 5. The evaluation results are shown in Table 3. Our model receives higher scores on all three metrics, indicating that our generated questions have higher quality in every aspect.

Ablation Study
As shown in Table 4, we perform an ablation study to systematically assess the impact of different model components (BERT, relational embedding, IGND) on the SQuAD test-set.
When we remove the relational embedding in the encoder, the BLEU-4 score of the original model drops from 20.33 to 19.43, which indicates the importance of the relational embedding. This is also verified by comparing the performance of the original w/o IGND model (19.01 BLEU-4) against other Graph2seq baselines that use syntactic information, such as Graph2seq+RL+BERT (18.30 BLEU-4).
In addition, we remove the IGND and use a normal attention-based mechanism; the BLEU-4 score drops from 20.33 to 19.01, as shown in Table 4, indicating that the IGND improves model performance. Moreover, the time cost of the original model is only 1.1 times that of the original w/o IGND model. Compared with the original model's 20.33, the original w/o relational embedding and IGND drops significantly (by almost 2 BLEU-4 points).
We find that the pre-trained BERT embeddings considerably impact performance, and that fine-tuning the BERT embeddings improves performance further, demonstrating the power of large-scale pre-trained language models.

Analysis of the Impact of Syntactic Information in Encoder
To understand the impact of the syntactic information, we calculate, for the words in the document that occur in the ground-truth question, their total attention scores at the end of the input attention layer, as shown in Figure 3. We compare three models: NQG++ (Zhou et al., 2018), which ignores syntactic information; Graph2seq+RL+BERT, which uses syntactic structure information but ignores dependency relations; and our model, which uses syntactic information including both structure and dependency relations. Our model has the highest average attention score, which indicates that syntactic information can improve the performance of the encoder.

Analysis of the impact of IGND
To understand the impact of the IGND, we calculate, for the words copied from the passage that occur in the ground-truth question, their probability scores at each decoding step, as shown in Figure 4.
We compare with NQG++ (Zhou et al., 2018) and Graph2Seq + RL + BERT . Our model with IGND has the highest average copy probability score, demonstrating that the IGND can improve the performance of copy mode by modeling previous information.

Sensitivity Analysis of Hyperparameters
We perform experiments on the Original model on the SQuAD to study the effect of the number of GNN hops. Figure 5 shows that our model is insensitive to the GNN hops and can achieve reasonably good results with various hops.

Case Study
We present some examples of questions generated by our model, including a pair of examples with the same input sentence, as shown in Table 5. We find that the questions generated by NQG++ and Graph2seq+RL+BERT do not have the correct semantics: they copy the wrong word from the passage or cannot find the right word, as shown by Graph2seq+RL+BERT in Example 2. In contrast, our model generates more answer-relevant questions than the NQG++ and Graph2seq+RL+BERT baselines. As we can see, incorporating dependency-relation information helps the model identify which words are relevant to the answer, and thus makes the generated questions more relevant and specific.

Related Work
Early works on QG (Mostow and Chen, 2009; Heilman and Smith, 2010) focused on rule-based approaches that rely on heuristic rules or hand-crafted templates, with low generalizability and scalability. Recent works adopted the attention-based sequence-to-sequence neural model for QG, taking the answer sentence as input and outputting the question (Du et al., 2017), which proved better than rule-based methods. (Zhou et al., 2018) proposed a feature-enriched encoder that concatenates word embeddings with lexical features as the encoder input, using answer positions to inform the model of the answer's location. To generate a question for a given answer, (Sun et al., 2018; Kim et al., 2019; Song et al., 2018) applied various techniques to encode answer location information into an annotation vector corresponding to the word positions, allowing for better-quality answer-focused questions. Other work presented a syntactic-feature-based method to represent words in the document and to decide which words to focus on while generating the question, or combined supervised and reinforcement learning in training to maximize rewards that measure question quality. Furthermore, recent concurrent work applied the large-scale language model pre-training strategy to QG and achieved new state-of-the-art performance (Chan and Fan, 2020).
Most existing QG approaches are unable to explicitly model the previously generated words. However, we perceive that previously generated words serve as auxiliary information for subsequent generation.

Conclusion
In this study, we designed the IGND for QG to alleviate the problem that existing decoders ignore the structural information and the copied words in the generated text at each decoding step. In addition, we proposed the relational-graph encoder to capture dependency-relation information and improve performance. For the sentence-level QG task on the SQuAD and MARCO datasets, our method outperforms existing methods by a significant margin and achieves new state-of-the-art results. Future directions include investigating more effective ways of utilizing previous generation information and exploiting Graph2Seq models with GNN-based decoders for question generation from structured data such as knowledge graphs or tables.