Expanding, Retrieving and Infilling: Diversifying Cross-Domain Question Generation with Flexible Templates

Sequence-to-sequence based models have recently shown promising results in generating high-quality questions. However, these models also have major drawbacks, such as a lack of diversity and poor sentence structures. In this paper, we focus on question generation over SQL databases and propose a novel expanding, retrieving, and infilling framework that incorporates flexible templates with a neural-based model to generate diverse expressions of questions under sentence-structure guidance. Furthermore, a new activation/deactivation mechanism is proposed for template-based sequence-to-sequence generation, which learns to discriminate template patterns from content patterns and thus further improves generation quality. We conduct experiments on two large-scale cross-domain datasets. The experiments show the superiority of our question generation method in producing more diverse questions while maintaining high quality and consistency, under both automatic and human evaluation.


Introduction
With a growing demand for natural language interfaces to databases, automatic question generation from structured query language (SQL) queries has been of special interest (Xu et al., 2018). Recently, diversity-aware question generation has shown its effectiveness in improving downstream applications such as semantic parsing and question answering tasks (Sultan et al., 2020). Although neural sequence-to-sequence generation has been a dominant approach and is able to produce meaningful descriptions for SQL queries, existing methods still suffer from a lack of diversity as well as poor sentence structures.
In neural-based approaches, conventional ways of generating diverse sentences focus on approximate decoding techniques such as beam search (Li et al., 2016a,b; Iyyer et al., 2018) and temperature sweep (Caccia et al., 2018). These decoding strategies generate diverse samples while sacrificing sentence quality. Variational auto-encoders (VAEs) have been used to generate varied sentences by applying additional information as latent variables (Hu et al., 2017; Shao et al., 2019; Ye et al., 2020). However, implicit latent representations provide limited controllability over sentence structure and can be difficult to adapt to a new domain. Paraphrase-based (Fader et al., 2013; Berant and Liang, 2014) and syntactic-based methods (Dhole and Manning, 2020) have also been studied. However, learning a paraphrasing model relies on a large number of domain-specific paraphrase pairs, which are difficult to obtain for target databases. Besides, syntactic-based approaches apply syntactic parsers or semantic rules to the natural language utterance, and thus are not applicable to SQL-to-question generation.
In rule-based generation systems, templates serve as essential prior knowledge that contains the structural information of sentences (Wang et al., 2015; Song and Zhao, 2016; Krishna and Iyyer, 2019). This ensures the generation contains fewer grammatical errors and performs better on extractive metrics (Wiseman et al., 2017; Puzikov and Gurevych, 2018). However, their template formats are mostly strict and the valid content for each chunk must be pre-defined, which makes a large set of templates difficult to obtain.
In this paper, we propose a novel method that incorporates template-based generation with a neural sequence-to-sequence model for diversity-aware question generation. Instead of applying strict templates, we use flexible templates that can be collected efficiently and at low expense. These flexible templates provide high-level guidance on sentence structure while also enabling sufficient flexibility for a neural-based model to fill chunks with content details. We present our method as a three-stage framework comprising expanding, retrieving, and infilling. In the expanding stage, we take advantage of existing large-scale cross-domain text-to-SQL datasets to extract and collect the flexible template set automatically. In the retrieving stage, given a SQL query, the best templates are retrieved from the collected template set by measuring the semantic distance between the SQL query and templates in a joint template-SQL semantic space. In the infilling stage, we treat each template as a masked sequence and explicitly force the generator to learn question generation under the constraint of the template. In order to help the generator discriminate template patterns from content patterns, a unique activation/deactivation mechanism is designed for the generator to learn when to switch between the template-copying state and the content-filling state.
We conduct experiments on two large-scale cross-domain text-to-SQL datasets. Compared to existing approaches, our method achieves the best diversity result for both datasets with both automatic evaluation and human evaluation, while maintaining competitive quality and high consistency with SQL queries. We further demonstrate that the designed modules each contribute to a performance gain through an ablation study.

Diverse Text Generation
In order to generate diverse expressions automatically, paraphrase-based methods (Qian et al., 2019; Fader et al., 2013; Berant and Liang, 2014; Dong et al., 2017; Su and Yan, 2017) have been studied. Wang et al. (2015) proposes to iteratively expand the template set and lexicons given a small number of template seeds and a large paraphrase corpus. Syntactic-based generation (Iyyer et al., 2018; Dhole and Manning, 2020) processes the given text with natural language processing techniques to produce high-quality and diverse sentences with pre-defined templates. However, these methods are not designed to deal with SQL queries and noisy table content. In recent years, neural network-based models have been widely used in text generation (Pan et al., 2019). Many studies attempt to diversify text generation by tuning latent variables of different properties, such as topic, style, and content (Fang et al., 2019; Ficler and Goldberg, 2017; Shen et al., 2019), while our method focuses on explicitly changing the sentence structure. In exemplar-based systems (Peng et al., 2019), the exemplar works as a soft constraint to guide the sequence-to-sequence generation and realize controllable diverse generation. Wiseman et al. (2018) proposes to learn a hidden semi-Markov model decoder for template-based generation from knowledge records. Most existing work requires either paraphrase pairs of the same input or reference sentences of similar content, or works effectively only in a single domain. Unlike existing works, our method only takes advantage of large-scale cross-domain SQL-to-text datasets to collect a large number of templates. We extract templates from the datasets directly to maintain the quality of the templates. In order to find proper templates for a given SQL query, we learn a joint semantic space by instance learning and retrieve the best templates with the closest semantic distance.

Question Generation
The question generation task relates to many applications such as question generation over knowledge base records (Wang et al., 2015), data-to-text generation (Wiseman et al., 2017), and question generation for question answering (QA) systems (Tang et al., 2017; Sun et al., 2019). The SQL-to-question task differs from the other tasks in that SQL queries typically include new entities across different databases, which makes cross-domain generation a significant challenge. Xu et al. (2018) explores the graph-structured information in a SQL query and proposes a graph-to-sequence approach for generation. Other work proposes to apply a copy mechanism and latent variables to map low-frequency entities from SQL queries to questions and to generate diverse questions in an uncontrolled way. Existing SQL-to-question approaches aim at generating high-quality questions, while the diversity of the generation is less explored. In this work, we focus on generating diversified questions with the guidance of templates from cross-domain datasets.

Problem Formulation
Given a SQL query as the input sequence, question generation over a database aims to generate a natural language question as an output sequence that accurately reflects the meaning of the given SQL query. In this work, we generate the question by introducing an intermediate template into the generation process. Therefore, by applying different templates, we can generate diverse expressions of questions. Let x = [x_1, x_2, ..., x_{|x|}] denote the given SQL query, y = [y_1, y_2, ..., y_{|y|}] denote the gold-standard question, and t = [t_1, t_2, ..., t_{|t|}] denote the corresponding template. Given a neural-based system with sets of learnable parameters θ*_t and θ*_q, the two-stage objective in this work can be formulated as follows:

θ*_t = argmax_{θ_t} P(t | x; θ_t),   θ*_q = argmax_{θ_q} P(y | x, t; θ_q)

Methodology
In this section, we introduce our framework, which learns question generation from SQL queries with the guidance of various templates, so as to increase the diversity of generation.

Framework Overview
We illustrate the framework that models the generation process in Figure 1. In brief, it includes three main stages: expanding, retrieving, and infilling. Dataset Expansion The purpose of this step is to acquire a training dataset consisting of <query, question, template> triplets. Previous methods usually require a large corpus containing paraphrased questions to learn template structures, whereas existing text-to-SQL datasets provide only <query, question> pairs. To tackle this challenge, we design a longest-common-subsequence (LCS) based algorithm to automatically extract a template for each <query, question> pair without requiring paraphrase pairs. Details of the algorithm are introduced in Section 4.2.
Template Retrieval After obtaining the expanded training set, all templates are gathered to form a large template set for generating diverse questions. To improve the quality and rationality of the generation, it is essential to retrieve suitable templates for each query. A proper template should be consistent with the content information in the specific SQL query. For example, when the given query is SELECT Population WHERE ( City = New York), the template When is the <ph> of <ph> ? should not be selected. For that purpose, we propose a soft classifier to learn a joint SQL-template space. In this way, the semantic distance can be measured between the two modalities, so that we can select proper templates by the closest semantic distance. Besides, since templates paired with the same SQL pattern show higher inter-class similarity, we also apply a hard filter to exclude templates paired with different SQL patterns. Details of the soft classifier are introduced in Section 4.3.
Text Infilling With an encoder-decoder model based on the gated recurrent unit (GRU), we conduct question generation by text infilling. The query x and template t are encoded into vectors separately by bi-directional GRU encoders (Cho et al., 2014). Following the work of Gu et al. (2016), we leverage soft attention and a copy mechanism in the decoder construction. A Gaussian latent variable is adopted to capture query and template variations. In the decoding process, a template t is fed into the decoder as a supervision signal to generate questions dynamically and sequentially. We propose an activation/deactivation mechanism to enforce the decoder to differentiate between template patterns and content patterns instead of randomly masking slots. In this way, the decoder can learn when to switch between template reading and content filling during generation. Details of the activation/deactivation mechanism are illustrated in Section 4.4.

Flexible Template as LCS
Consider a SQL query as a combination of a SQL pattern (i.e., the query with its content words removed) and the table information, as follows: SELECT COUNT( PLAYER ) WHERE (STATE = 'Texas'), where PLAYER and 'Texas' are content from the table and SELECT COUNT(<ph>) WHERE (<ph> = <ph>) is a typical SQL pattern. Similarly, questions are composed of content words and template words. Since template words are often reused more frequently than content words, we design an effective method to extract most of the template words for each question in the training set. For the i-th question Q_i, we record its longest common subsequence (LCS) with each other question as a candidate template and construct a candidate-template dictionary d_i. The keys in d_i are the candidate sequences, and the corresponding values are the lengths of the sequences. After that, we choose the longest candidate from d_i as the template for the i-th question. The pseudo-code is described in Algorithm 1.
The candidate templates should satisfy the following rules: • Each template should appear more than 20 times.
• Each template should include at least one of the keywords: where, what, which, when, why, who, how, name, tell. When applying LCS, we mark the possible positions for content insertion between template words with the placeholder <ph>, and format the templates like this instance: Which <ph> has the largest <ph> ? By ignoring the lengths of content word sequences, the templates become more flexible and can adapt to more scenarios.
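The template extraction step above can be sketched as follows. This is a minimal illustration in the spirit of Algorithm 1, not a reproduction of it: the function names are our own, and the frequency rule (templates appearing more than 20 times) is assumed to be applied afterwards over the whole corpus.

```python
from itertools import groupby

def lcs_tokens(a, b):
    # Token-level longest common subsequence via dynamic programming;
    # returns, for each token of `a`, whether it belongs to the LCS.
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = dp[i][j] + 1 if a[i] == b[j] else max(dp[i][j + 1], dp[i + 1][j])
    keep = [False] * m
    i, j = m, n
    while i > 0 and j > 0:  # backtrack to mark kept (template) tokens
        if a[i - 1] == b[j - 1]:
            keep[i - 1] = True
            i, j = i - 1, j - 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return keep

KEYWORDS = {"where", "what", "which", "when", "why", "who", "how", "name", "tell"}

def extract_template(question, other_questions):
    """Pick the longest LCS with any other question; collapse skipped runs to <ph>."""
    q = question.lower().split()
    best_keep, best_len = None, -1
    for other in other_questions:
        keep = lcs_tokens(q, other.lower().split())
        if sum(keep) > best_len:
            best_keep, best_len = keep, sum(keep)
    # Each maximal run of content (non-LCS) tokens becomes a single <ph> slot,
    # so the template ignores the length of the content word sequence.
    out = []
    for is_kept, group in groupby(zip(q, best_keep), key=lambda p: p[1]):
        if is_kept:
            out.extend(tok for tok, _ in group)
        else:
            out.append("<ph>")
    # Keep only templates containing at least one question keyword (paper's rule).
    return " ".join(out) if KEYWORDS & set(out) else None
```

Because a run of content words of any length maps to one <ph>, the resulting templates are flexible in exactly the sense described above.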

Learning Joint SQL-Template Space
As a sub-sequence of a question, a template should be close to the query in the semantic space if it is from the corresponding question, and be far away from the query if it is from an irrelevant question. Based on this intuition, we propose a soft classifier to learn a joint SQL-template space.
Soft Classifier. Classification models have been widely used in visual and textual applications. In a retrieval task, a classification model can learn a feature embedding for the input, and the best matching counterpart can be found in a database by measuring the cosine distance between embeddings. Inspired by this, we consider every <SQL query, template> pair in the training set S as a distinct class, and learn the feature embeddings by instance-level classification. We represent each <SQL query, template, class> triplet by <x, t, n>. Considering SQL queries and templates as objects constructed with different syntax, we encode them with two separate GRU encoders:

e_x = GRU_x(x),   e_t = GRU_t(t)

where e_x and e_t are the query embedding and template embedding, respectively. In order to map SQL queries and templates to a joint feature space, we add a shared-weight fully-connected layer W_s with softmax as the final classifier. The predicted probabilities over all instances are calculated as follows:

P(n | x) = softmax(W_s e_x),   P(n | t) = softmax(W_s e_t)

We jointly train the encoders and the classification layer with a cross-entropy loss over both modalities:

L_cls = −log P(n | x) − log P(n | t)

Inference. To enable efficient retrieval of templates, we store the template embeddings in a dictionary. In the inference phase, we feed any SQL query into GRU_x to produce the query embedding, then remove improper templates with the hard filter. We calculate the cosine similarities between the query and the remaining templates, and sort them in descending order. The top-k nearest templates are selected for question generation.
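The inference procedure can be sketched as below, assuming embeddings have already been produced by the trained encoders. All function and variable names here are hypothetical; the hard filter is modeled as a simple equality check on SQL patterns.

```python
import math

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv + 1e-9)

def retrieve_templates(query_emb, template_embs, template_patterns, query_pattern, k=5):
    """Hard-filter templates by SQL pattern, then rank survivors by cosine
    similarity to the query embedding in the joint SQL-template space."""
    candidates = [i for i, p in enumerate(template_patterns) if p == query_pattern]
    ranked = sorted(candidates,
                    key=lambda i: cosine(query_emb, template_embs[i]),
                    reverse=True)  # most similar = nearest
    return ranked[:k]
```

The two-stage design means the soft classifier only has to discriminate among templates that already share the query's SQL pattern, which is where inter-class similarity is highest.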
Avoid Overfitting. Since each class includes only one instance, we cannot use the classification loss on the validation set V to detect overfitting. Instead, we detect overfitting by computing the average rank error R over the validation set:

R = (1 / |V|) Σ_{i ∈ V} |r_p^i − r_a^i|

where r_p and r_a are the predicted rank and actual rank, respectively. We stop training when R keeps increasing.
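Assuming R is the mean absolute difference between predicted and actual ranks over the validation set (the function name is our own), a minimal sketch:

```python
def avg_rank_error(predicted_ranks, actual_ranks):
    """Average absolute rank difference over the validation set.

    Serves as an early-stopping signal in place of a classification loss:
    training stops once this value keeps increasing.
    """
    assert len(predicted_ranks) == len(actual_ranks)
    n = len(predicted_ranks)
    return sum(abs(p - a) for p, a in zip(predicted_ranks, actual_ranks)) / n
```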

Decoding with A/D mechanism
The key idea of the proposed decoding method is that the generation process can be decomposed into a series of sub-generation tasks that are spaced by tokens in the template. During each sub-generation, the model can generate token sequences of variable length. Since the decoder generates text word by word, it must determine where to switch between a content-filling state and a template-copying state, namely the activation and deactivation states, respectively. To achieve this, we require the decoder to activate/deactivate (A/D) generation with the special switch symbols <A> and <D>. When the decoder generates a symbol <A>, it changes the template-copying state to the content-filling state; when the decoder generates a <D>, it terminates the content-filling state and switches back to the template-copying state. We also set a maximum length for each sub-generation to avoid generating unbounded sequences. In practice, before we train the question generator, we rewrite the question and template as follows (as an instance):

Template: <BEG> Which <A> has the largest <A> ? <END>
Question: <BEG> Which <A> one <D> has the largest <A> population among U.S. cities <D> ? <END>

We apply a pointer p to point to the current template token. With a simple GRU decoder, the state s_i and the generated token ỹ_i at the i-th step are determined as follows:

s_i = GRU(h^dec_{i−1}, [ỹ_{i−1}; t̃_p]),   ỹ_i ∼ P(y_i | s_i, x, t̃)

where h^dec_{i−1} is the hidden state of the (i−1)-th step, and t̃_p is the current token of the template-copy operation, which is updated after each copy. ỹ = [ỹ_1, ỹ_2, ..., ỹ_{|ỹ|}] is the generated sequence, and ỹ and t̃ are the rewritten question and template. We apply teacher forcing during training and feed the rewritten question ỹ as input to the decoder. During inference, we feed the rewritten template t̃ as input to the decoder. The training objective for the generator G_q can be factorized as follows:

L_gen = − Σ_i log P(ỹ_i | ỹ_{<i}, x, t̃)

In order to further diversify the expression of content, we introduce a latent variable z to the model.
z depends on both the SQL query x and the template t̃. Let Q_q be the posterior distribution of z given x and t̃. The evidence lower bound (ELBO) loss is then:

L_ELBO = −E_{z ∼ Q_q(z|x,t̃)}[log P(y | x, t̃, z)] + KL(Q_q(z | x, t̃) || N(0, I))

where Q_q(z | x, t̃) ∼ N(µ, σ). We apply the reparameterization trick, sampling ε ∼ N(0, I) and computing z = µ + σε, with µ and σ learnable deterministic functions.
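The reparameterization trick and the closed-form KL term can be sketched as follows. This is a generic, framework-free illustration (function names are our own); in practice these operations run inside the training graph so gradients flow through µ and σ.

```python
import math
import random

def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1) per dimension.

    Randomness enters only through eps, so mu and log_var stay
    differentiable when this is implemented in an autodiff framework.
    """
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]

def kl_to_standard_normal(mu, log_var):
    # Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over dims.
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, log_var))
```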
Note that the functions of the template and the latent variable do not overlap. A large set of templates ensures diverse sentence structures in the generated questions, while the latent variable produces varied expressions of the content in the slots. As part of the sentence, the <A> and <D> symbols join the back-propagation computation for optimizing the decoder's parameters. They act as additional signals that guide the decoder to discriminate template patterns from content patterns, and to learn when to terminate infilling and switch back to the template-copying state at each slot.
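The rewriting step that inserts the <A>/<D> switch symbols before training can be sketched as follows. The greedy alignment and the function name are our own simplifications; it assumes the template's concrete tokens appear in order in the question, which holds by construction since templates are LCS subsequences.

```python
def rewrite_with_ad(template, question):
    """Rewrite a flexible template and its question with <A>/<D> switch symbols.

    Template slots <ph> become <A>; in the question, each content span that
    fills a slot is wrapped as `<A> span <D>`.
    """
    t_toks, q_toks = template.split(), question.split()
    new_t, new_q = ["<BEG>"], ["<BEG>"]
    qi = 0
    for ti, tok in enumerate(t_toks):
        if tok == "<ph>":
            new_t.append("<A>")
            # The next concrete template token bounds the content span
            # (None means the span runs to the end of the question).
            nxt = next((t for t in t_toks[ti + 1:] if t != "<ph>"), None)
            span = []
            while qi < len(q_toks) and q_toks[qi] != nxt:
                span.append(q_toks[qi])
                qi += 1
            new_q += ["<A>"] + span + ["<D>"]
        else:
            new_t.append(tok)
            new_q.append(tok)
            qi += 1
    return " ".join(new_t + ["<END>"]), " ".join(new_q + ["<END>"])
```

On the running example, this reproduces the rewritten pair shown above: the template's two <ph> slots become <A>, and the question's fillers "one" and "population among U.S. cities" are bracketed by <A> ... <D>.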

Experiment
Dataset To validate our framework's capability in generating diverse and controllable questions, we conduct experiments on two large-scale cross-domain text-to-SQL datasets: WikiSQL (Zhong et al., 2017) and Spider. WikiSQL contains 80654 query-question pairs derived from 24241 different schemas, with both the validation set and the test set released. Spider contains 10181 query-question pairs in 138 domains, with the validation set published. We follow the provided split settings for training and testing.
Setup We first construct the template set from the training set using our expansion method. For each SQL query in the test set, we retrieve the k − 1 most relevant templates from the template set to generate k − 1 different questions. We also include one question with template <BEG> <A> <END> to evaluate the model's capability in generating a complete sentence. Thus k questions for each SQL query are provided for evaluation. For our experiments, we set k = 5.
Baselines We compare our model (ERI) to models based on the baseline approach QG with different diverse-sentence generation strategies. (1) Latent Variable (QGLV): a sequence-to-sequence network with a copy mechanism for question generation from SQL, where a latent variable is introduced to generate diverse questions. (2) Temperature Sweep (TEMPS): We apply temperature sweep (Caccia et al., 2018) for decoding in QG.
(3) Beam Search (BEAMS): We further combine QG with beam search (Li et al., 2016b) to generate diverse questions. In practice, we set the beam width to 5 and take the 5 highest-probability sentences for comparison. A Boltzmann temperature parameter α is applied to modulate the entropy of the generator; in practice, we set α = 0.7 as suggested and obtain 5 generated sentences for each query.
Evaluation Metrics We adopt the following automatic metrics to measure the quality of generated questions.
(1) maxBLEU: The max BLEU-4 score among the 5 generated questions. (2) Coverage (Shao et al., 2019): This metric measures the average proportion of the input query covered by the generated questions. (3) ParseAcc: We use the neural semantic parsers SQLova (Hwang et al., 2019) and Global-GNN (Bogin et al., 2019) to parse WikiSQL and Spider, respectively. The semantic parsers translate each generated question into a SQL query, and we calculate the exact-match accuracy with the input SQL query as the ground truth. A higher accuracy score means the generated questions are more natural and consistent with the given SQL queries.
We adopt the following automatic metrics to measure the diversity of generated questions. (1) self-BLEU (Zhu et al., 2018): The average BLEU-4 over all pairs among the k questions. (2) self-WER (Goyal and Durrett, 2020): The average word error rate over all pairs among the k questions. A lower self-BLEU score and a higher self-WER score indicate more diversity in the result. (3) Distinct-4 (Li et al., 2016a): The ratio of distinct 4-grams in the generated questions.
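The pairwise diversity metrics can be sketched as follows. These are simplified illustrations: averaging over ordered pairs is our own convention, and the reference implementations cited above may differ in tokenization and smoothing details.

```python
def word_error_rate(ref, hyp):
    # Token-level Levenshtein distance normalized by reference length.
    r, h = ref.split(), hyp.split()
    dp = list(range(len(h) + 1))
    for i, rt in enumerate(r, 1):
        prev, dp[0] = dp[0], i
        for j, ht in enumerate(h, 1):
            # prev holds dp[i-1][j-1]; dp[j] still holds dp[i-1][j].
            prev, dp[j] = dp[j], min(dp[j] + 1,          # deletion
                                     dp[j - 1] + 1,      # insertion
                                     prev + (rt != ht))  # substitution/match
    return dp[-1] / max(len(r), 1)

def self_wer(questions):
    # Average WER over all ordered pairs among the k generated questions.
    pairs = [(a, b) for a in questions for b in questions if a is not b]
    return sum(word_error_rate(a, b) for a, b in pairs) / len(pairs)

def distinct_n(questions, n=4):
    # Ratio of distinct n-grams to total n-grams across all questions.
    grams = [tuple(toks[i:i + n])
             for q in questions
             for toks in [q.split()]
             for i in range(len(toks) - n + 1)]
    return len(set(grams)) / max(len(grams), 1)
```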

Automatic Evaluation
The automatic evaluation results are reported in Tables 1 and 2. Our model ERI outperforms all baseline models on the three diversity metrics for both datasets, except self-BLEU on Spider. This shows the effectiveness of ERI in improving the diversity of its generation. Note that TEMPS performs less favorably in terms of quality and diversity, as the temperature parameter α is a sensitive factor for generation and needs further tuning. The expanded template set provides various valid sentence structures for expressing the same question, which significantly contributes to diverse expressions. But it also decreases word-level overlap, which leads to a relatively low maxBLEU score on WikiSQL. However, BLEU may not be a suitable measurement for diversity-aware generation (Su et al., 2020; Shao et al., 2019). As BLEU calculates the overlap of n-grams, it does not necessarily reflect the quality of template-based generation. We illustrate this with an example from our model in Table 4. Although the BLEU scores are low, the generated questions are fluent and consistent with the SQL query while having diversified structures.
To further measure question quality and consistency in the semantic parsing task, we calculate the ParseAcc score. Our model performs competitively with QGLV on WikiSQL and shows a substantial improvement on Spider. The ParseAcc scores with ground-truth questions are 81.60% and 65.96% on WikiSQL and Spider, respectively. Our model also outperforms the baselines in the coverage of field names and values in the SQL query, indicating that essential terms from the input are learnt and translated into questions. We also show the result of our model without the latent variable. In this setting, the diversity of generated questions depends solely on the selected templates. Without the latent variable, the proposed framework still outperforms the baselines on the diversity metrics while maintaining good quality, which also supports that the templates contribute the most to the diversity of generation.

Manual Evaluation
To evaluate the quality of the generation, we run a manual evaluation measuring quality and diversity for 800 SQL-question pairs from the WikiSQL test set, produced by the baseline models and our model (with k = 5). Each rater gives a score on a 5-point scale for each SQL-question pair regarding (1) Fluency: grammatical correctness, (2) Consistency: the semantic alignment with the corresponding SQL query, and (3) Diversity: the diverse expression of the generated questions.
We employ Fleiss' Kappa for inter-rater reliability measurement. The Kappa scores are 0.77, 0.60, and 0.75 for fluency, consistency, and diversity, respectively, which indicates good agreement on the scores. The results are presented in Table 3. Our model outperforms the baseline models in diversity and fluency, and achieves the best trade-off across the three measurements. Although QGLV and BEAMS show the best performance in generating high-quality sentences, they tend to create questions of fixed structures with only minor changes in expression. With our method, the templates provide substantial changes in sentence structure, which validates the benefit of the proposed template-extraction method. Examples from our model and the baselines are given in Table 8.

Table 4 example:
SQL query: SELECT COUNT ( rd # ) WHERE pick # < 5
Ground-truth: how many rounds exist for picks under 5 ?
Q1 (BLEU=2.79): what is the number of rd when the pick number is less than 5 ?
Q2 (BLEU=11.73): how many rounds have a pick # less than 5 ?
Q3 (BLEU=2.79): what is the total number of rd where the pick is less than 5 ?
Q4 (BLEU=2.02): tell me the total number of rd for pick less than 5
Q5 (BLEU=1.51): for the pick less than 5 , what was the total number of rd # ?

Ablation Study
For the ablation study, we present experimental results to verify performance in two aspects: (1) whether the generator in our model benefits from learning with the activation/deactivation mechanism; (2) whether our model maintains consistency by selecting suitable templates in the joint semantic space. Evaluation on Generator To analyze the ability to generate high-quality questions from given templates, we extract the templates from the test set and use the corresponding template to guide question generation. We measure generation quality by BLEU, NIST, ROUGE, and METEOR. In order to analyze the impact of the various modules in our generator, we evaluate the following versions of our framework: (1) ERI w/o T: a model that does not use templates for encoding or decoding.
(2) ERI w T: a model where templates are encoded but not used as input for decoding. (3) ERI w/o A/D: a model whose generator applies a decoding strategy similar to the seq2seq model in Zhu et al. (2019); it treats each slot as a segment and predicts each segment as an independent sentence. (4) ERI: the full model with all designed modules, including the A/D mechanism. The results are presented in Tables 5 and 6. Our full model outperforms the ablated versions on all four metrics. Using template information improves our model on both datasets. The segment-based infilling method gains further improvement, which shows the effectiveness of providing templates as hard constraints. By introducing the A/D mechanism into decoding, our model sees further boosts, which demonstrates that the A/D mechanism enhances learning to generate from the templates.
SQL-Template Consistency To validate the impact of our template retrieval method, we show the average rank error of the soft classifier during the training phase in Figure 2. For both datasets, the rank error decreases during training, which indicates that the soft classifier can capture the semantic relation between SQL queries and templates. To observe whether the soft classifier affects performance by selecting proper templates, we also compare our two-stage template retrieval with random strategies and the hard filter alone in terms of mean average precision (MAP). The results are presented in Table 7. Compared to the hard filter alone, the soft classifier improves MAP by 4.4%, which validates the effectiveness of the proposed template retrieval method. Visualization of Joint Space To visualize the similarity of templates, we map feature samples to a 2-dimensional space with t-Distributed Stochastic Neighbor Embedding (t-SNE) in Figure 3. Features from similar SQL-template pairs preserve closer distances, which shows the effectiveness of our instance-level classification in learning semantic meaning in the joint feature space.

Conclusion
In this paper, we present a novel framework for question generation over SQL databases that produces more diversified questions by manipulating templates. We expand the template set from cross-domain SQL-to-text datasets, and retrieve proper templates from the template set by measuring the distance between templates and the SQL query in a joint semantic space. We propose an activation/deactivation mechanism to make full use of templates in guiding the question generation process. Experimental results show that the presented model can generate varied questions while maintaining high quality. The model also improves the matching between templates and the content information of SQL queries.