Logic-Consistency Text Generation from Semantic Parses

Text generation from semantic parses aims to generate textual descriptions for formal representation inputs such as logic forms and SQL queries. This is challenging for two reasons: (1) the complex and intensive inner logic combined with the data scarcity constraint, and (2) the lack of automatic evaluation metrics for logic consistency. To address these two challenges, this paper first proposes SNOWBALL, a framework for logic-consistent text generation from semantic parses that employs an iterative training procedure, recursively augmenting the training set with quality control. Second, we propose a novel automatic metric, BLEC, for evaluating the logical consistency between the semantic parses and the generated texts. The experimental results on two benchmark datasets, Logic2Text and Spider, demonstrate that the SNOWBALL framework enhances logic consistency under both BLEC and human evaluation. Furthermore, our statistical analysis reveals that BLEC agrees with human evaluation more closely than general-purpose automatic metrics including BLEU, ROUGE, and BLEURT. Our data and code are available at https://github.com/Ciaranshu/relogic.


Introduction
Natural language generation (NLG) from semantic parses aims to generate a text description for a formal representation input such as logical forms, AMR, and SQL queries. It has drawn widespread attention because of its substantial contributions to the interpretability and usability of the latest natural language interfaces (Gatt and Krahmer, 2018; Hu et al., 2020; Mishra et al., 2019; Yu et al., 2019; Gardent et al., 2017; Wang, 2019; Koutrika et al., 2010a). Recently, pretrained large-scale language models like BERT (Devlin et al., 2018), T5 (Raffel et al., 2020), and GPT-3 (Brown et al., 2020) have raised the ability to generate natural language from formal texts to a promising level of fluency and coherence.
Figure 1: Our data augmentation procedure for the generator and evaluator in the SNOWBALL framework.
However, NLG from semantic parses still suffers from two crucial challenges: (1) data scarcity, caused by bias toward certain types of logic forms or by expensive labeling work (Iyer et al., 2017; Yaghmazadeh et al., 2017), which according to our empirical study leads to unsatisfactory fidelity in preserving the complex and intensive inner logic in the generated text; (2) general-purpose automatic metrics (Novikova et al., 2017a) such as BLEU (Papineni et al., 2002), ROUGE (Lin, 2004), and BLEURT (Sellam et al., 2020) are not ideal for explicitly measuring logic consistency (Harkous et al., 2020), because they tend to weight each word in the generated text evenly without fully attending to the logically critical keywords.
Figure 2: Our SNOWBALL framework employs an iterative training procedure over a generator and evaluator through data augmentation.
To address these two critical problems, we propose the SNOWBALL framework for high-fidelity text generation from semantic parses and the BLEC automatic evaluation metric for logic consistency.
Snowball Framework. Our SNOWBALL framework, as illustrated in Figure 2, trains two modules to ensure high-fidelity text generation: (1) a generator that maps the logical form to its textual description, and (2) an evaluator that assigns a logic consistency score to each pair of logical form and textual sentence. Rather than training the generator and evaluator independently, SNOWBALL trains them iteratively. To deal with the data scarcity issue, we propose a data augmentation procedure that covers valid logic variations with diverse natural language expressions to improve generalizability. To this end, during each iteration, various unseen logic pairs are automatically generated from rule-based enumerated logic forms and the corresponding text predicted by the generator. The evaluator is then used to select the highly reliable augmented logic pairs for the next training iteration.
BLEC Metric. To evaluate the logic consistency of the text generated by the model, we propose a rule-based automatic evaluation metric called Bidirectional Logic Evaluation of Consistency, or BLEC. It takes the logical form and the corresponding generated natural language text as input, then outputs a label indicating whether they represent consistent logic. Compared with the neural network evaluator, BLEC can be easily deployed to different datasets, as long as the parser (i.e., the grammar of the logical form) is given.
In our experiments, we examine the effectiveness of our proposed approaches on benchmark datasets for NLG from semantic parses, derived from the existing Text-to-SQL dataset Spider (Yu et al., 2018) and the Table-to-Text dataset Logic2Text. Our analysis shows that our BLEC metric has a substantially positive Pearson correlation with human annotations, agreeing with human judgments of logic consistency better than other automatic metrics. The BLEC results show that the SNOWBALL framework leads to a consistent improvement in logic consistency on both datasets compared to the single-pass training method based on BART (Lewis et al., 2020).
Our key contributions are threefold: (1) We propose a simple but effective training framework, SNOWBALL, that strengthens the logic faithfulness of generated text by covering diverse logic variations. (2) We propose a new logic evaluation metric, BLEC, that accurately measures logical consistency with a refined keyword matching mechanism. (3) Our experimental results demonstrate that SNOWBALL increases BLEC by up to 10.1% on the SQL-to-Text task and 1.2% on the Logic-to-Text task compared to the baseline. Moreover, our statistical analysis reveals that BLEC achieves a Pearson correlation coefficient of +0.66 with human labels, making it a much better automatic evaluation metric than not only the traditional BLEU and ROUGE metrics but also the more recent BLEURT metric.

Parses-to-Text
The source side of most data-to-text (D2T) datasets, such as E2E (Novikova et al., 2017b), LogicNLG, RotoWire (Wiseman et al., 2017), and ToTTo, is a flat ontology structure, which is not powerful enough to encode rich semantic relationships in the ontology. In addition, some datasets, such as WebNLG (Gardent et al., 2017), E2E, and RotoWire, cover a limited number of domains: E2E is restricted to the restaurant domain, and RotoWire to the basketball domain.
Moreover, some of them only have loose alignments between input and sentence, e.g., RotoWire.
Generating natural language descriptions for logic forms or parses, as a sub-task of D2T, has been studied in various datasets and tasks, such as CCG grammar to text (White, 2006) and UCG grammar to text (Gardent and Plainfossé, 1990). Many works leverage neural networks for generation on various tasks, for example, generating natural language from AMR (Song et al., 2018; Ribeiro et al., 2019; Damonte and Cohen, 2019), logic forms, and SQL parses (Xu et al., 2018; Koutrika et al., 2010b). In contrast to these works, our work focuses on logic-consistent generation from parses, so we mainly discuss and evaluate the model based on the logic shared between parses and questions.

High-fidelity Text Generation
For end-to-end neural text generation models, incorporating an auxiliary task during model training is an intuitive way to introduce logic regularization. For instance, the fidelity classification task proposed by Harkous et al. (2020), the auxiliary span extraction tasks of Kryscinski et al. (2020), the table-text optimal-transport matching and embedding similarity losses, and the content matching task have all proved effective. Nevertheless, to the best of our knowledge, we are the first to bridge the training procedures of the evaluator and the generator with an iterative training framework, SNOWBALL. Furthermore, we construct a new automatic metric and a new dataset dedicated to evaluating the logic consistency of text generation. The focus of our work differs from related high-fidelity text generation work (Chan et al., 2019; Nie et al., 2018; Tian et al., 2019) in that we attempt to present the panorama of the challenges of logic-consistent text generation instead of focusing on model-level modifications.

Snowball Framework
The SNOWBALL framework addresses the challenge of complex and intensive inner logic under the data scarcity constraint for high-fidelity text generation from semantic parses. As illustrated in Figure 2, SNOWBALL ensures logic consistency on three bases: (1) an iterative training procedure that synergistically enhances the generator and evaluator in an adversarial fashion; (2) data augmentation based on rule-based logic perturbations and neural text generation, covering diverse unseen logic variations for iterative training; (3) structure-aware encoding that boosts the sensitivity of the encoder to mild logic shifts.

Iterative Training
Rather than training the generator and evaluator independently, SNOWBALL trains the generator and the evaluator iteratively. As shown in Figure 2, the prerequisite of the snowball training procedure is the regular training procedure: (1) Generator 0 is trained on the benchmark NLG datasets with the normal end-to-end approach, yielding the trained Generator 1; (2) meanwhile, the logic forms in the seed data are converted into variations with given rules, and Generator 1 predicts the text for each mutated logic form to complete a logic pair; (3) the initial Evaluator 0 is then trained on those augmented logic pairs.
Then, during the SNOWBALL procedure, the generator and evaluator are collaboratively improved over several training iterations. During each iteration, a three-step adversarial interaction is conducted between the generator and evaluator. Step 1: The trained Evaluator i−1 is used to rerank the beam search results given by the decoder of the generator, which increases the quality of the augmented logic pairs, Augmented data i−1. Step 2: Generator i becomes better at retaining logic consistency by training on Augmented data i−1, which contains unseen logic variations not covered in the seed data. Step 3: The enhanced Generator i predicts increasingly realistic perturbed sentences from the perturbed logical forms, which brings more challenging negative samples to the training set of Evaluator i. The data augmentation in the first step is further described in Section 3.2.
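The three-step interaction above can be sketched as a short loop. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `snowball`, `perturb`, `train_generator`, and `train_evaluator` are hypothetical, the evaluator filters candidates with a fixed threshold instead of reranking beam search outputs, and negatives are built by pairing an original logical form with text generated from its perturbed variant.

```python
from typing import List, Tuple

Pair = Tuple[str, str]  # (logical form, text)


def snowball(seed: List[Pair], perturb, train_generator, train_evaluator,
             iterations: int = 4, threshold: float = 0.5):
    """Toy sketch of the iterative generator/evaluator loop (assumptions above)."""
    variants = [v for lf, _ in seed for v in perturb(lf)]

    def negatives(generator):
        # Text generated from a perturbed form, paired with the ORIGINAL form,
        # is assumed to be a logically inconsistent (negative) pair.
        return [(lf, generator(v)) for lf, _ in seed for v in perturb(lf)]

    generator = train_generator(seed)                        # regular training
    evaluator = train_evaluator(seed, negatives(generator))  # initial evaluator
    augmented: List[Pair] = []
    for _ in range(iterations):
        # Step 1: keep only generated pairs the evaluator scores as consistent.
        candidates = [(v, generator(v)) for v in variants]
        augmented = [p for p in candidates if evaluator(*p) >= threshold]
        # Step 2: retrain the generator on seed plus filtered augmented pairs.
        generator = train_generator(seed + augmented)
        # Step 3: retrain the evaluator with negatives from the new generator.
        evaluator = train_evaluator(seed + augmented, negatives(generator))
    return generator, evaluator, augmented


# Toy stand-ins so the sketch runs without any model.
def toy_train_generator(pairs):
    return lambda lf: f"text for {lf}"


def toy_train_evaluator(positives, negatives):
    return lambda lf, text: 1.0 if text == f"text for {lf}" else 0.0


def toy_perturb(lf):
    return [lf.replace("max", "min")] if "max" in lf else []


_, _, toy_augmented = snowball([("max(age)", "text for max(age)")],
                               toy_perturb, toy_train_generator,
                               toy_train_evaluator, iterations=2)
```

In a real setting the two `train_*` callables would fine-tune BART models, and the threshold filter would be replaced by whatever quality-control rule is actually used.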
To be specific, our generator and evaluator in SNOWBALL are described as follows.
Generator The generator maps the logical form to the corresponding natural language sentence. We choose the pre-trained BART model (Lewis et al., 2020), which follows the standard transformer encoder-decoder architecture (Vaswani et al., 2017) and is pre-trained as a denoising autoencoder on the task of corrupted text reconstruction. The input of the encoder is the structure-aware representation of the logic forms (Section 3.2), while the target output of the decoder is the aligned textual description for the input parses.
Evaluator The evaluator assigns a logic consistency score to pairs of logical forms and textual sentences, which is vital for assessing the performance of the logic-focused text generator. In contrast to other text generation tasks, generating sentences from logical forms requires the evaluator to be reasonably sensitive to subtle logic shifts in the model predictions. For instance, deleting a negation word such as 'not' is fatal for our task because it significantly compromises logic consistency. Therefore, we adopt a binary classification architecture similar to the BART-based natural language inference model (Lewis et al., 2020) as our evaluator to compute the consistency between the pairs of logical form and text [L, Q]. The input of the encoder is the concatenation of L and Q followed by an [EOS] token, and the logic score γ is computed as $\gamma = \sigma(\omega(h_{dn}))$, where $h_{dn}$ denotes the last hidden states of the decoder, ω denotes the max-pooling layer, and σ is the sigmoid activation function.

Data Augmentation
As the labeled training data for both the generator and evaluator is extremely limited, we propose a data augmentation procedure that enlarges the training set by covering variations of logic forms paired with diverse natural language expressions to improve generalizability. Specifically, starting from a seed dataset with human annotation, our data augmentation consists of three steps, as depicted in Figure 1.
Step 1: Logic perturbation Instead of modifying the natural language sentences, we choose to corrupt the logic consistency by perturbing logical forms, mainly for two reasons: (1) the regular structure of logical forms makes the logical corruption procedure comparatively controllable; (2) the perturbed logical forms can be easily validated with the corresponding parser and grammar checker. The perturbations of each given logical form can be enumerated exhaustively according to hand-tuned rules to cover the following logic inconsistencies:
• Logic shift: The logic shift indicates that the generated text is logically distinct from the input logical form, such as turning an affirmative sentence into a negative one. This can be attributed to the perturbation of aggregators, operators, logic conjunctions, etc.
• Phrase and number changes: A phrase change means that the generated sentence alters a phrase specified in the logical form, while a number change means that a numerical value in the logical form is perturbed.
• Entity insertion, deletion and swapping: Perturbation of entities is a common failure from which most natural language generation models suffer. This includes cases where the predicted sentence neglects entities mentioned in the logical form, inserts entities unrelated to the logical form, or misplaces them.
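The rule-based enumeration described above can be sketched as follows. The swap rules in `SWAPS` and the number rule (adding one) are illustrative assumptions, since the hand-tuned rules themselves are not listed here; entity perturbations and validation with a parser are omitted for brevity.

```python
import re

# Hypothetical swap rules for aggregators, operators, and conjunctions.
SWAPS = {
    "max": ["min", "avg"],
    "min": ["max", "avg"],
    "avg": ["max", "min"],
    ">": ["<", "="],
    "<": [">", "="],
    "and": ["or"],
    "or": ["and"],
}


def perturb(logical_form: str):
    """Enumerate single-token perturbations of a whitespace-friendly logical form."""
    tokens = re.findall(r"\w+|[^\w\s]", logical_form)
    variants = []
    for i, tok in enumerate(tokens):
        # Logic shift: replace an aggregator, operator, or conjunction.
        for repl in SWAPS.get(tok.lower(), []):
            variants.append(" ".join(tokens[:i] + [repl] + tokens[i + 1:]))
        # Number change: shift an integer literal by one (assumed rule).
        if tok.isdigit():
            variants.append(" ".join(tokens[:i] + [str(int(tok) + 1)] + tokens[i + 1:]))
    return variants


variants = perturb("SELECT max ( age ) FROM singer WHERE age > 20")
```

Each variant differs from the input in exactly one key token, which keeps the corruption controllable; a fuller version would also check every variant against the logical form's grammar before using it.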
Step 2: Inference from perturbed logic After logic perturbation, the generator can be used as an artificial annotator to generate the corresponding sentence for each logical form in a semi-supervised manner. Compared to rule-based or template-based methods, recent pre-trained seq-to-seq models empirically generate natural language sentences with better fluency and coherence. Although this method can easily create a considerable amount of labeled data while avoiding expensive human annotation, the model-based generator inevitably introduces unexpected noise during augmentation. Therefore, quality control for the data augmentation is one of the most crucial cornerstones for a satisfactory result.
Step 3: Dataset composition As shown in Figure [...] Seed text] would be suitable negative samples for the evaluator.

Structure-aware Encoding
Logical forms normally have equivalent structured representations that precisely express the complex relations between a set of objects. For instance, executable code written in Python or SQL can be parsed into an abstract syntax tree (AST) (Noonan, 1985) denoting the mutual relations among the constructs occurring in the source code, while knowledge bases may be converted into knowledge graphs that depict the relations between entities with directed edges. Compared to plain text inputs, structure-aware encoding, which captures not only the sequential information from texts but also the internal logic from structural representations, has recently proved more effective in several Graph-to-Text tasks (Song et al., 2018; Ribeiro et al., 2020). To make full use of the intrinsic knowledge of the pre-trained BART model, we follow an approach similar to that of Ribeiro et al. (2020) to linearize the structural representations of the SQL queries and logical forms respectively (Figure 3). Furthermore, logical forms from different domains or datasets may vary in keywords, so normalizing them into a unified form bridges the gaps between different logic NLG datasets and increases the generalization ability of our framework. Hence, the logical forms are first translated word by word into unified intermediate semi-textual forms according to a manually annotated dictionary. Then parentheses are inserted into the semi-textual forms to denote the hierarchy of the corresponding structured representations such as ASTs.
question: How many singer are not older than 20?
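As a rough illustration of the dictionary translation and parenthesis insertion described above, the sketch below linearizes a small hand-built query structure. The `KEYWORDS` dictionary, the nested-tuple input format, and the connective phrases are assumptions chosen to mirror the pre-processed SQL example reported with the generator outputs; the actual dictionary and parser are not specified in this text.

```python
# Hypothetical keyword dictionary (the manually annotated one is not given).
KEYWORDS = {
    "count": "the number of",
    "max": "the maximum of",
    "*": "all items",
    "=": "equal to",
}


def linearize_expr(node):
    """A node is a str (column, value, '*') or a (function, argument) tuple."""
    if isinstance(node, str):
        return f"( {KEYWORDS.get(node, node.lower())} )"
    func, arg = node
    return f"( {KEYWORDS[func]} {linearize_expr(arg)} )"


def linearize_query(query):
    """Turn a pre-parsed SELECT query (a dict) into a parenthesized semi-textual form."""
    parts = [" , ".join(linearize_expr(e) for e in query["select"])]
    parts.append(f"that belongs to ( {query['from'].lower()} )")
    if "where" in query:
        col, op, val = query["where"]
        parts.append(
            f", that have ( {linearize_expr(col)} {KEYWORDS[op]} {linearize_expr(val)} )"
        )
    if "group_by" in query:
        parts.append(f", grouped by {linearize_expr(query['group_by'])}")
    return " ".join(parts)


example = {
    "select": [("count", "*"), ("max", "Percentage")],
    "from": "countrylanguage",
    "where": ("LANGUAGE", "=", "Spanish"),
    "group_by": "CountryCode",
}
linearized = linearize_query(example)
```

The parentheses make the nesting of the underlying tree explicit in a flat string, so a pre-trained sequence model can consume it without architectural changes.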

BLEC for Logic Consistency Evaluation
Because general-purpose automatic metrics such as BLEU, ROUGE, and BLEURT are not ideal for explicitly measuring logic consistency, we propose BLEC, a new rule-based automatic evaluation metric called Bidirectional Logic Evaluation of Consistency. We apply a bidirectional evaluation to determine the logical consistency of pairs of logical forms and questions. The intuition behind this metric is that key tokens in the logical form, such as numbers, operators, and keywords, should always be matched with tokens that represent similar meanings in the question, and vice versa. In the example shown in Figure 4, BLEC first traverses the key tokens in the question, trying to find tokens with the same meaning in the logic form to match them. Then, in step two, the sample is marked as inconsistent because there is one token with no match from the question to the logical form. Formally, given a logical form $L = l_1, l_2, \ldots, l_n$ containing n word tokens and a question $Q = q_1, q_2, \ldots, q_m$ containing m word tokens, the proposed evaluation metric performs token-level matching on $l_i$ and $q_j$ to test the consistency. Specifically, the matching procedure contains two steps, i.e., matching from L to Q as well as matching from Q to L. In step one, each key token $l_i^{key}$ in L tries to match with the tokens in Q. In step two, each key token $q_j^{key}$ in Q tries to match with the tokens in L. If any key token finds no match in either step one or step two, the sample is marked as negative; otherwise it is marked as positive. The final score is the accuracy over all the samples: $\mathrm{BLEC} = \frac{1}{|S|} \sum_{(L, Q) \in S} \mathrm{match}(L, Q)$, where S denotes the dataset and match(·) is the matching function with binary output, i.e., 1 for positive and 0 for negative.
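The two-step matching can be sketched in a few lines. This is only a toy version under stated assumptions: the `LEXICON` of key tokens and their natural-language cues is hypothetical and far smaller than any practical rule set, tokenization is a simple regular expression, and numbers are matched by exact string equality.

```python
import re

# Hypothetical lexicon: logical-form key token -> words that express it.
LEXICON = {
    "max": {"maximum", "highest", "largest"},
    "min": {"minimum", "lowest", "smallest"},
    "avg": {"average", "mean"},
    "count": {"number", "many"},
}


def tokenize(text):
    return set(re.findall(r"[a-z_]+|\d+(?:\.\d+)?", text.lower()))


def match(logical_form, question):
    """Return 1 if every key token is matched in both directions, else 0."""
    l_tokens, q_tokens = tokenize(logical_form), tokenize(question)
    # Step one: each key token in L must have a counterpart in Q.
    for tok in l_tokens:
        if tok in LEXICON and not (LEXICON[tok] & q_tokens):
            return 0
        if tok[0].isdigit() and tok not in q_tokens:
            return 0
    # Step two: each key token in Q must have a counterpart in L.
    for tok in q_tokens:
        keys = {k for k, cues in LEXICON.items() if tok in cues}
        if keys and not (keys & l_tokens):
            return 0
        if tok[0].isdigit() and tok not in l_tokens:
            return 0
    return 1


def blec(samples):
    """Fraction of (logical form, question) pairs judged consistent."""
    return sum(match(l, q) for l, q in samples) / len(samples)


samples = [
    ("SELECT avg(age) FROM dogs", "What is the average age of dogs?"),
    ("SELECT avg(age) FROM dogs", "What is the highest age of dogs?"),
]
score = blec(samples)
```

The second pair fails in step one because `avg` has no cue in the question, and it would also fail in step two because "highest" has no `max` in the logical form; checking both directions is what catches inserted as well as dropped logic.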
Compared with the neural network evaluator requiring data-specific training, BLEC can be easily deployed to different datasets. In our experiments, we demonstrate that BLEC can be applied to two different datasets of text generation from two types of semantic parse input, and it shows a substantial agreement with human evaluation for evaluating logic consistency between the semantic parse input and the text output (Table 2).

Datasets
Text generation from semantic parses has different forms depending on the input formal representation. To demonstrate that our SNOWBALL and BLEC can be applied to different types of inputs, we study two tasks: (1) SQL2Text with the SQL query as the input and (2) Logic2Text with the logic forms as the input.
To this end, we make use of two existing publicly available datasets. For SQL2Text, we use the Spider dataset (Yu et al., 2018), a complex cross-domain semantic parsing and text-to-SQL dataset. Generating natural language from formal languages with abundant logic representations can be regarded as the inverse of semantic parsing. Therefore, we swap the input and output to obtain a dataset for text generation from SQL queries with complicated logic. As the test set of the Spider dataset remains undisclosed, 20% of the original Spider training set is converted into a development set, the remaining 80% is kept as the training set, and the original development set is used as the test set for our SQL2Text task. For Logic2Text, we use the existing Logic2Text dataset. We pick the SENT and LOGIC STR fields from the original Logic2Text to compose our own training data. We then rename SENT to TEXT and LOGIC STR to LOGIC as the keys of each sample's dictionary in our dataset.
In contrast, evaluating the logical consistency between logical form and text is closely related to sequence classification tasks such as fact verification and natural language inference (NLI). To the best of our knowledge, there is no existing dataset for evaluating the logical consistency between logical form and generated text. Therefore, we simplify the logic evaluation into a two-sequence binary classification problem and construct a dataset, with a development set and a test set, dedicated to our proposed evaluator. The dataset is built from the development and test sets of Spider and Logic2Text by three methods: (i) the [logical form, Text] pairs in the two datasets are regarded as positive samples; (ii) human-labeled negative samples are created by intentionally introducing logical inconsistency into the known [logical form, Text] pairs in the two datasets; (iii) manually scored [logical form, Text] predictions given by the trained generator on the two datasets contain both positive and negative samples. For the human-labeled negative samples, we attempt to cover the possible logic perturbations mentioned in Section 3.2 with minimal modification to the original [logical form, Text] pairs. For example, the consistent pair [SELECT avg(age) FROM dogs, What is the average age of dogs?] would be corrupted into [SELECT avg(age) FROM dogs, What is the oldest age of dogs?]. Table 1 summarizes the statistics of each dataset for the generator and the evaluator, respectively.

Baselines and Implementation Details
The baselines for assessing the performance of the SNOWBALL framework are the attention-based LSTM machine translation model and the single-pass trained models, i.e., the models trained before performing any SNOWBALL iteration. For instance, the BART-large generator trained in the second SNOWBALL iteration is compared to the identical BART-large generator at SNOWBALL iteration zero. The hyperparameter settings of the models trained on SQL2Text and Logic2Text mostly follow the default settings of the BART model from Huggingface (Lewis et al., 2020; Wolf et al., 2020). However, the settings of the evaluator and tokenizer differ slightly: the learning rate of the evaluator on SQL2Text is 2e-5 for BART-base and 5e-6 for BART-large, while the learning rate of the evaluator on Logic2Text is 1e-5 for both BART-base and BART-large.

Multitask Learning
Due to the lack of data for logic NLG, joint training on the SQL2Text and Logic2Text datasets may intuitively prevent the models from fitting the biases of their confined training data. Aside from the standard special separators used by the BART tokenizer, we further introduce [SQL] and [logic] tokens as control codes to indicate whether a sample comes from the SQL2Text or the Logic2Text dataset, similar to Keskar et al. (2019). For each sample fed into the BART model, the corresponding control token is concatenated to the front of the input logical form according to the source of that sample. Therefore, the distribution p(Q_SQL | L_SQL, [SQL]) of the SQL2Text models and p(Q_Logic | L_Logic, [logic]) of the Logic2Text models can each be learned during backpropagation that takes the control tokens into account, while training the generator and evaluator in the multi-task learning (MTL) fashion.
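Prepending the control code is a simple string operation before tokenization, as the sketch below shows. The source keys `"sql2text"` and `"logic2text"` and the helper name `add_control_token` are hypothetical; only the `[SQL]` and `[logic]` tokens come from the text.

```python
# Control codes from the text; the dictionary keys are illustrative names.
CONTROL_TOKENS = {"sql2text": "[SQL]", "logic2text": "[logic]"}


def add_control_token(logical_form: str, source: str) -> str:
    """Prefix the input with the control code of the dataset it came from."""
    return f"{CONTROL_TOKENS[source]} {logical_form}"


# A mixed multi-task batch: each input carries its own control code.
batch = [
    add_control_token("SELECT avg(age) FROM dogs", "sql2text"),
    add_control_token("eq { count { all_rows } ; 3 }", "logic2text"),
]
```

With a Hugging Face tokenizer, one would typically also register the two codes as additional tokens and resize the model's embedding matrix so that each code maps to a single vocabulary entry; whether the original setup did so is not stated here.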

Human Evaluation
To evaluate whether the sentences generated by the model are logically consistent, we randomly sample 90 questions each from the test set of Spider and the test set of Logic2Text to form a human evaluation set. The samples of each setting are divided into two parts and assigned to two different annotators. Each part contains 10 overlapping and 40 non-overlapping examples, which means each annotator labels 50 samples per setting. For the human evaluation, the annotators label each [logical form, text] pair as True or False based on two criteria: (1) the logic consistency between logical form and text; (2) the grammaticality of the text. After labeling, we estimate the accuracy of the model predictions by computing the proportion of True labels among the 80 non-overlapping examples. To verify the consistency of the annotators, we use the 10 overlapping examples to calculate Cohen's kappa score. The results are reported only if the kappa score is over 0.4, which implies that the estimated accuracy is valid. In Table 4, we only human-annotate the results given by the models trained without snowball iteration and with 4 snowball iterations. This demonstrates the correlation between human evaluation and the BLEC metric at these two time steps rather than directly evaluating the improvement of model performance.
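The agreement check and accuracy estimate described above amount to the following computation. The label lists are made-up illustrative data, not results from the study; labels are encoded as 1 (True) and 0 (False).

```python
def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    categories = set(labels_a) | set(labels_b)
    # Chance agreement: product of each annotator's marginal label frequencies.
    expected = sum((labels_a.count(c) / n) * (labels_b.count(c) / n)
                   for c in categories)
    if expected == 1.0:  # both annotators used a single identical label
        return 1.0
    return (observed - expected) / (1 - expected)


def estimated_accuracy(non_overlap_labels):
    """Share of True (1) labels among the non-overlapping examples."""
    return sum(non_overlap_labels) / len(non_overlap_labels)


# Illustrative labels for the 10 overlapping examples (not real data).
annotator_a = [1, 1, 0, 0, 1, 0, 1, 1, 0, 1]
annotator_b = [1, 1, 0, 0, 1, 0, 1, 0, 0, 1]
kappa = cohen_kappa(annotator_a, annotator_b)
report_results = kappa > 0.4  # threshold stated in the text
```

Kappa corrects raw agreement for the agreement expected by chance, which matters with binary labels where two annotators can agree often simply because one label dominates.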

Results and Analysis

Correlation Analysis on BLEC
To show that BLEC is consistent with human judgment, we compute the Pearson correlation between the BLEC score and the human evaluation result. We also include BLEU, ROUGE, and BLEURT for comparison. We apply these four automatic metrics (BLEU, ROUGE, BLEURT, BLEC) to a human-labeled dataset and compare the evaluation results. This dataset is constructed by extracting 50 samples from each of the different SNOWBALL iterations, 15 iterations in total. As shown in Table 2, the Pearson correlation between BLEC and human evaluation is 0.66, while BLEU, ROUGE, and BLEURT obtain scores below or around zero. This shows that the BLEC score is capable of testing the logical consistency between logic forms and questions.
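For reference, the Pearson correlation between a metric's per-sample outputs and human labels can be computed as below. The two score lists are illustrative placeholders, not data from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; undefined if either list is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


# Illustrative per-sample scores (1 = consistent, 0 = inconsistent).
metric_scores = [1, 0, 1, 1, 0]
human_labels = [1, 0, 1, 0, 0]
r = pearson(metric_scores, human_labels)
```

Because BLEC and the human labels are both binary, this coefficient reduces to the phi coefficient, whereas for BLEU, ROUGE, and BLEURT one side of the correlation is a continuous score.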

Effectiveness of Snowball Framework
Generator The experimental results of the generator in our SNOWBALL framework are shown in Table 4. We found that the SNOWBALL training framework empirically improves the logic consistency of the generator.

Table 3: Example outputs from our model with different settings.
Input SQL: SELECT count(*), max(Percentage) FROM countrylanguage WHERE LANGUAGE = "Spanish" GROUP BY CountryCode
Pre-processed SQL: ( the number of ( all items ) ) , ( the maximum of ( percentage ) ) that belongs to ( countrylanguage ) , that have ( ( language ) equal to ( spanish ) ) , grouped by ( countrycode )
Label: What is the total number of countries where Spanish is spoken by the largest percentage of people?
BART-base: the number and percentage of languages that are Spanish for each country code.
  +snowball (iteration = 4): the number of languages and maximum percentage of languages in each country code?
  +multi-tasking: How many languages does Spanish have in each country code?
    +snowball (iteration = 4): Show the number and maximum percentage for each country code.
BART-large: How many languages are there in each country and what is the percentage of the language spoken in that country?
  +snowball (iteration = 4): Find the number of languages and maximum percentage of Spanish for each country code.
  +multi-tasking: Find the number and percentage of speakers of Spanish in each country code.
    +snowball (iteration = 4): Find the number and highest percentage of speakers of Spanish for each country code

Evaluator The results of the evaluator in our SNOWBALL framework are illustrated in Figure 5. Empirically, the snowball framework is more effective for evaluators based on BART-base than for those based on BART-large, likely because the BART-large models have already obtained enough intrinsic knowledge to accurately judge the validity of the [Logical form, Text] pairs. The data augmentation procedure of the SNOWBALL framework may introduce unexpected noise to the evaluators, which may cause a catastrophic reduction in terms of AUC and other metrics. On the other hand, the snowball framework does enhance the performance of the evaluator based on the relatively weaker BART-base, improving the performance on SQL2Text by 10.9% and on Logic2Text by 3.1%. These results indicate that our proposed SNOWBALL framework is most suitable for tasks suffering from both domain data scarcity and the lack of external knowledge.
Table 3 shows example outputs from our model with different settings. In this case, the entity Spanish and the aggregator maximum are the touchstones for evaluating the logic consistency of each model. The prediction from the BART-large based generator trained under both the snowball and multi-tasking frameworks simultaneously is the only one that yields a seamless sentence from the input SQL. Furthermore, we also notice that multi-task learning significantly alleviates the artifacts within the generated text.
Compared to the vanilla generators, a generator trained solely with the snowball framework shows enhanced logic consistency but also produces less natural sentences. We may therefore argue that there is a trade-off between fluency and logic consistency in our proposed snowball framework. Model-level modifications may jointly enhance the fluency and logic consistency of the NLG, which we leave for future studies.

Conclusion
In this paper, we propose SNOWBALL, a neural network-based framework that augments the data by alternating between a generator and an evaluator. In addition, we propose BLEC, an automatic evaluation metric that evaluates the logic consistency between questions and logic forms by bidirectional matching. We also formulate two datasets, and the experimental results show the effectiveness of the proposed framework. This method is applicable to other Data-to-Text tasks, because domain-specific rules for perturbations can be derived for most structured data with pre-defined structures or grammar.

Ethics Statement
The datasets we use are built by selecting and processing data from two separate publicly available datasets. The data sources we use to construct our datasets are Spider and Logic2Text, two complex and cross-domain text-to-SQL datasets. In addition, three experts annotated about 500 samples beyond the original datasets. We acknowledge that some biases may still exist in our datasets, even though we have double-checked both the annotated data and the data from the original datasets. Authors with SQL expertise annotate and verify our datasets by 1) selecting about 500 representative samples from the original dataset, 2) changing the entities of the samples, 3) using three different labels to mark which type of change has been made to the sentences, and 4) double-checking the quality of the annotated data.
We conduct several experiments with different settings on our AWS server, with 8 Tesla V100 GPUs, to test the efficiency of our models. More specifically, our experiments are of two types. In the first type, we train both the generator and the evaluator using either SQL2Text or Logic2Text. In the second type, we use both SQL2Text and Logic2Text to train the generators, while the evaluators are trained on one of the two datasets in the first epoch and on both in the following epochs.