Investigating the Robustness of Natural Language Generation from Logical Forms via Counterfactual Samples

The aim of Logic2Text is to generate controllable and faithful texts conditioned on tables and logical forms, which not only requires a deep understanding of the tables and logical forms, but also demands symbolic reasoning over the tables according to the logical forms. State-of-the-art methods based on pre-trained models have achieved remarkable performance on the standard test dataset. However, we question whether these methods really learn how to perform logical reasoning, rather than just relying on spurious correlations between the headers of the tables and the operators of the logical forms. To verify this hypothesis, we manually construct a set of counterfactual samples, modifying the original logical forms to produce counterfactual logical forms with rarely co-occurring headers and operators, together with the corresponding counterfactual references. SOTA methods give much worse results on these counterfactual samples than on the original test dataset, which confirms our hypothesis. To deal with this problem, we first analyze the bias from a causal perspective, based on which we propose two approaches to reduce the model's reliance on the shortcut. The first incorporates the hierarchical structure of the logical forms into the model. The second exploits automatically generated counterfactual data for training. Automatic and manual evaluation on the original test dataset and the counterfactual dataset shows that our method effectively alleviates the spurious correlations. Our work points out a weakness of current methods and takes a further step toward developing Logic2Text models with real logical reasoning ability.


Introduction
Recently, generating logically consistent natural language from tables has attracted the attention of the research community (Wiseman et al., 2018; Lee, 2018; Liang et al., 2009; Shu et al., 2021; Chen et al., 2020c,b). Given a table, the Logic2Text task (Chen et al., 2020d) requires generating controllable and faithful texts conditioned on a logical form, which not only demands a deep understanding of the tables and logical forms, but also symbolic reasoning over the tables.
State-of-the-art methods based on pre-trained models (Radford et al., 2019; Shu et al., 2021) have achieved remarkable performance on the standard test dataset of Logic2Text. Figure 1 shows an original sample of the task.
However, we question whether these methods really learn how to perform logical reasoning, rather than just relying on spurious correlations [1] between table headers such as "attendance" and logical operators such as "argmax". Several previous studies have demonstrated that such shortcuts severely damage the robustness of models (Branco et al., 2021; Wang and Culotta, 2021a).
To verify our hypothesis, we manually construct a set of counterfactual samples from the development set and test set of Logic2Text, named LCD (Logical Counterfactual Data) [2]. We modify the original logical forms to generate counterfactual logical forms with rarely co-occurring table headers and logical operators, then annotate the corresponding counterfactual label sentences. Figure 1 compares an original sample of Logic2Text and the corresponding counterfactual sample. The GPT-2 based model makes a fluent and logically consistent prediction on the original sample. However, for the counterfactual sample, the model still employs the "argmax" logical operator to describe "attendance", which is inconsistent with the logical form. We suppose the reason is that the training dataset contains a large number of logical forms with "argmax { all_rows ; attendance }", which is used to describe the phrase "the highest attendance". A model trained on this biased dataset learns to exploit such spurious correlations to make predictions, thus failing to perform correct logical reasoning on the counterfactual samples. We evaluate the state-of-the-art methods on LCD, and they give much worse results compared with the results on the original test set.
In addition to the bias in the training dataset, previous works directly use linearized logical forms as inputs and thus fail to capture the hierarchical structure of the logical forms, which further encourages the models to learn spurious associations between the logical operators and the table headers.
To deal with this problem, we first leverage Causal Inference (Pearl et al., 2016) to analyze this bias. Based on the analysis, two approaches are proposed: 1) to overcome the limitation of linearized logical form inputs, we use different attention masks for different tokens in the logical forms to constrain each token to only interact with the tokens it should reason with; 2) to reduce the reliance on spurious correlations in the training dataset, we train the model on automatically generated counterfactual data, forcing the model to learn real logical reasoning.

[1] Spurious correlations, or shortcuts, refer to connections between two variables that are non-causal in statistics (Simon, 1954).

[2] Counterfactual samples are samples that change some variables of the factual samples while keeping the others unchanged (Pearl et al., 2016).
Automatic and manual experimental results on the standard test dataset of Logic2Text and on LCD demonstrate that our method is able to alleviate spurious correlations and improve logical consistency. Compared with the state-of-the-art baselines, the relative decreases are 22% and 14% smaller after applying our method to GPT-2 and T5, respectively. Our work mainly points out a weakness of current methods, which is easily ignored but important for robust and faithful generation. It takes a further step toward developing Logic2Text models with real logical reasoning ability.
Pilot Study on the Robustness of Logic2Text Models

Counterfactual Samples Construction
To quantify to what extent the bias affects the robustness of Logic2Text models, we manually construct a set of counterfactual samples from the development set and test set of Logic2Text, named LCD (Logical Counterfactual Data). Specifically, we take datapoints from the development set and test set of Logic2Text and modify the original logical forms to generate counterfactual logical forms with rarely co-occurring table headers and logical operators, then annotate the corresponding counterfactual label sentences. The tables are left unchanged when constructing counterfactual samples. Figure 2 shows how to construct a counterfactual sample from the original sample. The "argmax" logical operator in the original (left) sample is applied to the "attendance" table header to locate a row in the table. We replace the table header "attendance" with a counterfactual table header "date" to produce the counterfactual logical form. Then, based on the constructed counterfactual logical form, we annotate the corresponding label sentence. The reason we choose the "date" table header is that, after linearization, both the original logical form and the counterfactual logical form contain the logical operator "argmax" and the table header "attendance", which leaves a negative shortcut for the models to exploit.
In total, we construct 809 counterfactual samples, on which the current SOTA Logic2Text models are evaluated.
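The header substitution described above can be sketched in a few lines; the function name and the token-boundary rule are our own simplifications for illustration, not the authors' annotation tooling:

```python
import re

def make_counterfactual(logical_form: str, old_header: str, new_header: str) -> str:
    """Swap a table header inside a linearized logical form.

    The header is replaced only where it appears as an argument token,
    i.e. delimited by '{', ';' or '}', mirroring the manual construction
    described in the text.
    """
    pattern = r"(?<=[{;])\s*" + re.escape(old_header) + r"\s*(?=[;}])"
    return re.sub(pattern, f" {new_header} ", logical_form)

original = "argmax { all_rows ; attendance }"
print(make_counterfactual(original, "attendance", "date"))
# argmax { all_rows ; date }
```

The counterfactual label sentence still has to be (re-)annotated by hand, since only a human can decide how the new header should be verbalized.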

Models
We evaluate the GPT-2 (Radford et al., 2019) and T5 (Raffel et al., 2020) based Logic2Text models on the counterfactual dataset. These two models achieve the SOTA results on the standard test dataset of Logic2Text.
Formally, given a linearized logical form L and the caption of a table T, the input of the pre-trained models is denoted as X = [P ; T ; L], which is a concatenation of L, T, and a prefix prompt P. The prefix prompt P is "Describe the logical form: ". Given a set of training samples {(X^n, Y^n)}_{n=1}^{N}, where Y is the label sentence and N is the number of samples, GPT-2 or T5 based Logic2Text models are trained by maximizing the following objective function:

    L = Σ_{n=1}^{N} Σ_{i=1}^{|Y^n|} log P(Y_i^n | Y_{<i}^n, X^n)

where P indicates the probability distribution modeled by GPT-2 or T5, |Y| is the length of the label sentence, and Y_i is the i-th token of Y.

As shown in Table 1, compared with the performance on L2T, there is a serious decline for both T5 and GPT-2 on LCD with respect to BLEC* and BLEC. Specifically, for T5, the BLEC* score decreases from 71.61 to 41.78, a relative decrease of 42%. For GPT-2, the model achieves a BLEC* score of 61.17 on L2T, while only obtaining a BLEC* score of 28.18 on LCD, a relative decrease of over 50%. Since BLEC only checks the accuracy of operators and numbers, the decline of BLEC is relatively small. T5 performs slightly better than GPT-2 on LCD. We suppose the reason is that GPT-2 is an autoregressive language model, so when used as an encoder it can only see partial logical forms. The results on LCD verify our hypothesis that models trained on the biased dataset learn to exploit spurious correlations to perform logical reasoning, and are not robust when encountering counterfactual samples.
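The objective above can be made concrete with a toy sketch; the per-token probabilities below are stand-in numbers for illustration, not real GPT-2 or T5 outputs:

```python
import math

def sequence_log_likelihood(token_probs):
    """Log-likelihood of one label sentence Y given the input X = [P ; T ; L]:
    sum_i log P(Y_i | Y_<i, X).

    `token_probs` stands in for the model's probability of each gold token
    under teacher forcing.
    """
    return sum(math.log(p) for p in token_probs)

def training_objective(per_sample_probs):
    """The maximum-likelihood objective summed over all N training samples."""
    return sum(sequence_log_likelihood(tp) for tp in per_sample_probs)

probs = [0.9, 0.8, 0.95]  # P(Y_i | Y_<i, X) for a 3-token label sentence
print(round(sequence_log_likelihood(probs), 4))  # -0.3798
```

Training maximizes this quantity, which is equivalent to minimizing the usual token-level cross-entropy loss.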

Causal Analysis
To deal with this problem, we utilize Causal Inference (Pearl et al., 2016) to analyze this bias. The left graph of Figure 3 illustrates the overall training process of conventional Logic2Text models from a causal perspective, where vertices L and Y denote the linearized logical form and the label sentence, respectively. Vertex U represents an unobserved confounder in Logic2Text, which we suppose is the preference of the annotators in describing a table. There are also three edges in the left causal graph: U → L, U → Y, and L → Y. Edges U → L and U → Y denote the confounder's effects on L and Y, respectively. L → Y represents that the label sentence depends on the logical form L. This link is the objective that the model should learn.
Concretely, in the Logic2Text task, the confounder U can be interpreted as the preference of the annotators in describing a table. For example, given a table recording sports events, the annotators prefer to describe the game with the largest crowd attendance rather than the most recent game. As a consequence, this unobserved confounder has direct effects on the generation of the logical form (U → L) and on the generation of the label sentence (U → Y). These effects build a backdoor path from the logical form to the label sentence (L ← U → Y). The backdoor path induces the model to learn the shortcut between the logical form and the label sentence, rather than the process of reasoning the sentence from the real structure of the logical form (L → Y). For example, when a model is trained on a dataset in which a large number of logical forms contain "argmax { all_rows ; attendance }", the model can make correct predictions such as "the highest attendance". However, when tested on the logical form "argmax { all_rows ; date }", the model leverages this shortcut and still employs the "argmax" logical operator to describe "attendance", which is inconsistent with the logical form.
Formally, given a confounder U, there exists a logical operator o_i ∈ O and a table header h_j ∈ H such that p(o_i, h_j) is abnormally high, where O and H represent the sets of logical operators and table headers in the dataset, respectively, and p(o, h) denotes the probability that the header h should reason with operator o. In this case, the model actually learns the probability P(Y | U, L, T), rather than P(Y | L, T).

Methodology
Based on the above analysis, we make two modifications to the causal graph, as shown on the right part of Figure 3. First, we propose a structure-aware logical form encoder to build L → Z → Y. Second, we remove the backdoor path L ← U → Y by training the model on automatically generated counterfactual samples.

Structure-Aware Logical Form Encoder by Building L → Z → Y

To overcome the drawbacks of the linearized logical form inputs in previous works, a vertex Z is added to the causal graph, which represents the structure-aware feature of the logical form. Then, we replace the edge L → Y with the path L → Z → Y, as shown in the right part of Figure 3. In the implementation, this modification is achieved by using an attention mask matrix to constrain each token in the logical form to only interact with the tokens it should reason with. Specifically, let L denote a linearized logical form L = {w_1, w_2, ..., w_n}, where w_i ∈ V ∪ O. V and O are the vocabulary and the set of logical operators, respectively. Let M denote the attention mask matrix. M_{i,j} = 0 indicates that the attention value from word w_i to w_j is masked, and M_{i,j} = 1 the opposite. Take the logical form "hop { argmax { all_rows ; attendance } ; date }" as an example.
With this constraint, the model will not be able to learn the association between "argmax" and "date".
To decide the values in the attention mask matrix M, we need to convert the linearized logical form L into a logical graph L_g. To be more specific, for each token in the linearized logical form, there is a corresponding node in L_g. If a logical operator w_i and a table header w_j satisfy the pattern "w_i { A ; w_j ; B }" in the linearized logical form, where A and B denote valid logical subclauses, then w_i and w_j are connected in L_g.
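A minimal sketch of this conversion, building M directly from the token sequence: we assume a whitespace-tokenized linearized form and treat the token immediately before each "{" as the governing operator of that clause, a simplification of the edge() rule above:

```python
def build_mask(tokens):
    """Build a symmetric attention-mask matrix M from a linearized logical
    form, connecting each operator to the top-level tokens of its own
    arguments. M[i][j] = 1 keeps attention from token i to token j;
    M[i][j] = 0 masks it.
    """
    n = len(tokens)
    M = [[1 if i == j else 0 for j in range(n)] for i in range(n)]
    stack = []  # indices of operators whose '{' is still open
    for i, tok in enumerate(tokens):
        if tok == "{":
            stack.append(i - 1)          # the operator just before '{'
        elif tok == "}":
            stack.pop()
        elif tok != ";" and stack:
            op = stack[-1]               # innermost enclosing operator
            M[op][i] = M[i][op] = 1
    return M

tokens = "hop { argmax { all_rows ; attendance } ; date }".split()
M = build_mask(tokens)
op = {t: i for i, t in enumerate(tokens)}
print(M[op["argmax"]][op["attendance"]])  # 1: argmax attends to attendance
print(M[op["argmax"]][op["date"]])        # 0: argmax cannot see date
```

On the running example, "argmax" is connected to "all_rows" and "attendance" but not to "date", which belongs to the outer "hop" clause; this is exactly the constraint that blocks the "argmax"-"date" shortcut.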
Formally, let M_{i,j} = edge(w_i, w_j), where edge is a binary function indicating whether there is an edge between words w_i and w_j in L_g. Based on M, the attention matrix in each transformer layer of the pre-trained models is calculated as:

    Â = A ⊙ M

where A denotes the original attention values and ⊙ represents the element-wise product. Based on Â, the standard self-attention transformer computation is performed to calculate the representation of each token.

Data Synthesizing to Remove the Backdoor Path L ← U → Y

To constrain the model to rely only on the direct path L → Y when generating sentences, the second modification removes the backdoor path L ← U → Y in the left causal graph of Figure 3. This is implemented by training the model on automatically generated counterfactual samples. Specifically, given a selected table header in the logical form, we propose to replace it with another table header, randomly selected from the set of all table headers in the training dataset. Accordingly, the exactly matching tokens in the label sentence are replaced with the selected table header as well. This replacement constructs counterfactual samples containing rarely co-occurring headers and logical operators, which violates the preference of the annotators, thus removing the edges from the unobserved confounder U to L and Y. Take Figure 4 as an example. Given a linearized logical form and the corresponding label sentence, we locate the table header "attendance" and replace it with another randomly selected table header, "assist". We also replace "attendance" with "assist" in the label sentence. We filter out samples whose logical forms do not contain table headers that exactly match any tokens in the label sentence. Due to space limitations, the three strategies we use to select the table headers to be substituted are listed in Appendix B.
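The synthesis step can be sketched as follows; the function name, the whitespace-based matching rule, and the uniform sampling are simplifying assumptions for illustration, not the paper's exact procedure:

```python
import random

def synthesize_counterfactual(logical_form, sentence, header_pool, rng=random):
    """Sketch of the counterfactual data synthesis described above.

    Picks a header that literally appears in both the logical form and the
    label sentence, then substitutes a randomly chosen header from the
    training pool in both places. Returns None when no header matches the
    sentence (such samples are filtered out).
    """
    tokens = logical_form.replace("{", " ").replace("}", " ").replace(";", " ").split()
    candidates = [t for t in tokens if t in header_pool and t in sentence]
    if not candidates:
        return None  # filtered: no header exactly matches the sentence
    old = rng.choice(candidates)
    new = rng.choice([h for h in header_pool if h != old])
    return logical_form.replace(old, new), sentence.replace(old, new)

pool = ["attendance", "assist", "date", "score"]
out = synthesize_counterfactual(
    "argmax { all_rows ; attendance }",
    "the game with the highest attendance",
    pool,
)
print(out)
```

Replacing the header consistently on both sides keeps the pair logically aligned while breaking the header-operator co-occurrence statistics of the original data.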

Training on Counterfactual Data
Based on the automatically generated counterfactual data S̃, the model is trained on a mixture of the counterfactual data S̃ and the original training dataset S, maximizing the same maximum-likelihood objective as above, now summed over S ∪ S̃. It is worth noting that the label sentences of the automatically generated counterfactual dataset may not be natural sentences, since the randomly selected table headers may not fit the contexts of the original label sentences. As a result, adding more automatically generated counterfactual data improves the logical consistency of the generated texts, but may hurt their fluency. The trade-off is decided by the ratio r between the size of S̃ and the size of S, i.e., r = |S̃|/|S|. An experiment exploring the effect of r can be found in Subsection 5.2.
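A simple way to realize the mixture at a given ratio r; the sampling strategy (uniform without replacement) is our assumption, the paper only specifies the ratio:

```python
import random

def mix_training_data(original, counterfactual, r, rng=random):
    """Mix the original data S with synthetic counterfactual data S~ at
    ratio r = |S~| / |S|, subsampling the counterfactual pool as needed.
    """
    k = min(len(counterfactual), int(r * len(original)))
    return original + rng.sample(counterfactual, k)

S = [("lf%d" % i, "sent%d" % i) for i in range(100)]
S_tilde = [("cf%d" % i, "csent%d" % i) for i in range(300)]
mixed = mix_training_data(S, S_tilde, r=0.5)
print(len(mixed))  # 150
```

Small r keeps the text fluent; large r pushes the model toward logical consistency at the cost of naturalness, matching the trade-off discussed above.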

Experiments
In this section, we conduct experiments on both the biased Logic2Text dataset and the counterfactual dataset, LCD. We compare our method with other SOTA models as baselines; details about the baselines are given in Appendix C. We then discuss the experimental results.
Implementation Details In the experiments, the learning rate is set to 0.0003. We use beam search for decoding and the beam size is set to 2. The maximum length of the output sentence is set to 180. The batch size is set to 10 for inference and 2 for training. For the T5 backbone, we initialize the parameters from CodeT5 (Wang et al., 2021), which is pre-trained on programming languages; linearized logical forms have similar structures, with operators and table headers corresponding to functions and parameters, respectively. For SNOWBALL, we follow the settings of Shu et al. (2021) but reduce the batch size and beam size to 4 due to the limitation of GPU memory. We train all the models on one GeForce GTX 1080 Ti. The code is implemented with PyTorch and MindSpore.
Manual Evaluation In addition to the automatic evaluation, we manually check logical consistency by comparing the output sentences with the logical forms. Specifically, we randomly select 100 samples from L2T and LCD, then calculate the percentage of samples whose output sentence is logically consistent with the logical form.

Relative Decrease
LCD and L2T have comparable sizes, with 809 and 1092 samples, respectively. We use relative decrease to quantify the degree of decline in model performance when the test data is transferred from L2T to LCD, calculated as (L2T − LCD)/L2T, where L2T denotes the model performance on the standard Logic2Text test set and LCD denotes the performance on LCD.
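The metric is straightforward to compute; plugging in T5's BLEC* scores from the pilot study reproduces the reported 42% decrease:

```python
def relative_decrease(l2t, lcd):
    """Relative decrease (L2T - LCD) / L2T used throughout the paper."""
    return (l2t - lcd) / l2t

# T5's BLEC* scores from the pilot study: 71.61 on L2T, 41.78 on LCD.
print(round(relative_decrease(71.61, 41.78), 2))  # 0.42
```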

Main Results
As shown in Table 2, augmenting GPT-2 and T5 with our method obtains a large gain on both L2T and LCD with respect to BLEC* and BLEC.
Specifically, for T5, the BLEC* score increases to 83.42 on L2T and 59.58 on LCD, which outperforms all the baselines and achieves new state-of-the-art results.
For GPT-2 with our method, BLEC* increases by 20.65 on LCD and 10.53 on L2T, and the relative decrease is reduced from 54% to 32%. The results of both GPT-2 and T5 demonstrate that our modifications to the causal graph are effective. For BLEC, T5 with our method obtains the highest score of 89.93 on L2T, a slight improvement over vanilla T5 (88.0). For GPT-2 with our method, the BLEC score increases by 4.2 on L2T and 7.78 on LCD. The improvement in BLEC is relatively limited compared with BLEC*. We suppose the reason is that BLEC only checks the accuracy of operators and numbers, and thus cannot reveal errors in table headers. This problem is also demonstrated by the case analyses of SNOWBALL: although SNOWBALL gives a low decrease, when looking into cases generated by SNOWBALL, we find that it fails to choose the correct table headers.
A similar conclusion can be drawn from the human evaluation. Our T5-based method achieves the highest accuracy on L2T (84) and LCD (71), giving the lowest relative decrease (15%). Compared with vanilla T5, the manually checked logical accuracy of our method obtains a 30-point improvement on LCD, which further verifies its effectiveness.

Effect of the Size of Synthetic Data
As indicated in Section 4.2, adding more automatically generated counterfactual examples improves the logical consistency of the generated texts, but may hurt their fluency. The trade-off is decided by the ratio between the size of the counterfactual data S̃ and the size of the original training data S. We conduct development experiments to quantitatively explore the effect of this trade-off on logical consistency and text fluency. We use BLEC* and BLEC to evaluate logical consistency, and BLEU-4 (Papineni et al., 2002), computed with the standard script mteval-v13a.pl and noted as BLEU, to evaluate text fluency.
As shown in Table 3, as r = |S̃|/|S| increases, logical correctness increases and text fluency decreases, which verifies our hypothesis in Section 4.2. We suggest that future researchers fine-tune the hyperparameter r to obtain logically correct and semantically fluent generated texts.

Effect of the Complexity of Logical Form
We explore how the model performance is affected by the complexity of the logical form.The effects of the complexity are demonstrated from two aspects: 1) the maximum depth of the logical form in a tree form, and 2) the number of nodes in the logical form.The logical form with more nodes or deeper depth is regarded as more complex.
We take T5 as the backbone and report the mispredicted token rate (MTR) as the metric. Mispredicted tokens are tokens that occur in the logical form but are not generated in the text. We calculate the ratio of the number of mispredicted tokens to the length of the logical form. We plot the MTR results with respect to different numbers of nodes and depths in Figure 5.
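A minimal sketch of the MTR computation; exact whitespace matching between logical-form tokens and generated text is our simplification of the metric:

```python
def mispredicted_token_rate(lf_tokens, generated_text):
    """MTR: share of logical-form tokens that do not surface in the
    generated text, normalized by the logical form length.
    """
    words = set(generated_text.split())
    missed = [t for t in lf_tokens if t not in words]
    return len(missed) / len(lf_tokens)

lf = "argmax all_rows attendance".split()
text = "the game with the highest attendance"
print(round(mispredicted_token_rate(lf, text), 3))  # 0.667
```

In practice one would restrict the count to content tokens (headers and values), since operator tokens such as "argmax" are verbalized rather than copied.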
As shown in Figure 5(a), as the number of nodes increases, more tokens are mispredicted by the vanilla T5 model. In contrast, our method helps T5 maintain a relatively low error rate as the number of nodes increases. A similar conclusion can be drawn from Figure 5(b): our approach keeps the model stable as the logical depth grows. Besides, we observe that for logical forms with low depths, our method obtains a slightly higher MTR compared with the baseline. We suppose the reason is that simple logical forms are rarely affected by the unobserved confounder.

Ablation Study
We conduct ablation experiments on both T5 and GPT-2 by removing the attention mask (denoted as AM) and the automatically generated counterfactual training data (denoted as CF). The results are shown in Table 4.
When removing AM, for GPT-2, the BLEC* score decreases by 2.1 and 6.68 on L2T and LCD, respectively, and the BLEC score decreases by 2.98 on L2T and 7.65 on LCD. For T5, removing AM leads to decreases of 13.55 and 11.74 on L2T and LCD, respectively. These observations demonstrate the effect of AM.
When removing the automatically generated counterfactual training data, the performance of both GPT-2 and T5 decreases significantly. Specifically, for GPT-2, BLEC* decreases by 9.98 on L2T and 14.45 on LCD. For T5, BLEC* decreases by 17.49 on L2T and 19.78 on LCD, and the relative decrease grows from 28% to 40%. The drop is much worse on LCD than on L2T, since the data distribution of LCD differs from that of L2T, which shares its distribution with the training data.

Generation of Counterfactual Data
We conduct experiments to investigate the effects of different strategies for generating counterfactual data. Specifically, we replace the table header tokens in the logical form with 1) a random string (denoted as Random), 2) a randomly selected table header (denoted as Disturb), or 3) a mixture of the two (denoted as Mix). We list the details of the methods in Appendix B. The results are shown in Table 5. We observe that Disturb boosts performance more than Random on both L2T and LCD, which shows that meaningful, in-domain table headers generate more effective counterfactual samples. Besides, the Mix strategy gives a further slight improvement.

Case Study
We select a counterfactual sample from LCD to demonstrate, via the attention score distributions during decoding, why our method performs better than previous works. The linearized logical form of the sample is "eq { hop { argmax { all_rows ; score } ; attendance } ; 5032 }", and the label sentence is "the game with the highest score had 5032 spectators". Specifically, after decoding the "highest" or "largest" token corresponding to the "argmax" logical operator, we plot the attention values of the last transformer layer for each token of the logical form to explore which table header the model chooses, as shown in Figure 6. For vanilla GPT-2, the attention score of the token "attendance" is higher than that of "score", thus producing "highest attendance". For our method, the attention score of the token "score" is the largest, which is the table header the operator "argmax" should select.
Related Work

Text Generation from Tables

Table-to-text has been a popular area in recent years (Wiseman et al., 2018; Lee, 2018; Liang et al., 2009; Chen et al., 2021). As previous methods generate superficial and uncontrollable logic, Chen et al. (2020e) introduced Logic2Text as a controllable and faithful text generation task conditioned on a logical form. Since then, many works on Logic2Text have been proposed. In order to unify the studies of structural knowledge grounding, Xie et al. (2022) proposed the UNIFIEDSKG framework and unified 21 structural knowledge grounding tasks, including Logic2Text, into a text-to-text format. Zhang et al. (2021a) proposed a unified framework for logical knowledge-conditioned text generation in the few-shot setting. To address the data scarcity problem of Logic2Text, Shu et al. (2021) iteratively augmented the original dataset with a generator and proposed an evaluator for high-fidelity text generation.
However, they all ignored the spurious correlation in logical forms, which is investigated in our work.

Causal Inference For NLP
Causal Inference (Pearl et al., 2016; Kuang et al., 2020) is a powerful statistical modeling tool for explanatory analysis. In NLP, many methods have been proposed based on causal inference theory (Zhang et al., 2021b; Chen et al., 2020a; Zhang et al., 2021c; Hu and Li, 2021). Yang et al. (2021) and Wang and Culotta (2021b) exploit causal inference to reduce bias from the context in text classification tasks. For named entity recognition, Zeng et al. (2020) replaced the entities in sentences with counterfactual tokens to remove spurious correlations between the context and the entity token. Wang and Culotta (2021a) generated counterfactual samples by replacing causal terms with their antonyms for sentiment classification. Wu et al. (2020) proposed a counterfactual decoder to generate unbiased court views.
Our work proposes to improve the robustness of Logic2Text models with causality.

Conclusion
We investigate the robustness of current methods for Logic2Text via a set of manually constructed counterfactual samples. A significant decline on the counterfactual dataset verifies the existence of bias in the training dataset. We then leverage causal inference to analyze the bias, based on which two approaches are proposed to reduce the spurious correlations. Automatic and manual experimental results on both Logic2Text and the counterfactual data demonstrate that our method effectively alleviates the spurious correlations.

Limitations
Although our method achieves high logical consistency, we find that for some unseen headers the model cannot understand them and generates logically correct but disfluent sentences, which is related to the way the counterfactual samples are generated. Due to the limited number of high-quality logical forms, future work may explore more advanced counterfactual data generation methods that take the context into account.
Besides, our structure-aware logical form encoder is based on the attention mechanism, so it cannot be applied to models without attention. Fortunately, current attention-based models are widely used, not only because of their better performance but also because of their high interpretability.

B Strategies for Generating Counterfactual Data

Header Disturb Another straightforward idea is to select another header token from the set of all table headers to replace the header token in the logical form. However, such a method ignores the data type carried by the columns and thus produces unreasonable counterfactual samples. To solve this problem, we group all the headers by their data type into three categories: strings, numbers, and time. A header in the logical form is only replaced by another header with the same data type.
Mixing Replacement We take turns performing the above two replacement strategies.

C Details of Baselines
Pointer Generator Network (PGN) (See et al., 2017) can be employed to address the OOV problem. In addition to calculating the probability of each token in the existing vocabulary, P_vocab, PGN also calculates a generation probability p_gen while decoding; 1 − p_gen is the probability of copying tokens from the input sequence. The final distribution is calculated as:

    P(w) = p_gen · P_vocab(w) + (1 − p_gen) · Σ_{i: x_i = w} a_i

where a_i is the attention weight on the i-th input token x_i.

SNOWBALL To address the constraint of data scarcity, Shu et al. (2021) proposed the SNOWBALL framework for high-fidelity text generation, which employs an iterative training procedure over a generator and an evaluator through data augmentation.
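The pointer-generator mixture above can be sketched with plain dictionaries; the vocabulary, attention values, and token names are illustrative stand-ins:

```python
def pgn_distribution(p_gen, p_vocab, attention, input_tokens):
    """Pointer-generator final distribution:
    P(w) = p_gen * P_vocab(w) + (1 - p_gen) * sum_{i: x_i = w} a_i
    Copying mass lets OOV input tokens (absent from P_vocab) be emitted.
    """
    final = {w: p_gen * p for w, p in p_vocab.items()}
    for a, tok in zip(attention, input_tokens):
        final[tok] = final.get(tok, 0.0) + (1 - p_gen) * a
    return final

p_vocab = {"highest": 0.6, "game": 0.4}   # generation distribution
attention = [0.7, 0.3]                     # attention over the input tokens
dist = pgn_distribution(0.8, p_vocab, attention, ["attendance", "game"])
print(round(dist["attendance"], 2))  # 0.14: the OOV header gets copy mass
```

Note that "attendance" is not in the vocabulary at all, yet receives probability 0.2 × 0.7 = 0.14 purely from the copy term, which is exactly how PGN handles OOV table headers.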
T5 Raffel et al. (2020) proposed a pre-trained model specifically for text-to-text generation tasks. We initialize the parameters from CodeT5 (Wang et al., 2021), which is more suitable for formal-language-to-text generation.

D Effect On the Probability Of Copying Table Headers
Figure 8: Our method contributes to the improvements on L2T.
Our method reduces the reliance on spurious correlations, making the model learn the relationships between key tokens. We find that it increases the probability of copying headers from the logical forms, which also affects the performance on L2T. We show two samples in Figure 8.
In sample 1, the baseline omits the header token "nation" because "west germany" itself is an instance of "nation". In sample 2, "pga championship" and "tournaments" have a similar relationship. Our method, in contrast, prefers to keep these headers when possible.

E More Counterfactual Samples and Running Examples
We present 3 more counterfactual samples for reference; we list them in Tables 6 to 8. Table 9 shows two running examples, reproduced below.

Example 1 (hans-joachim stuck)
Baseline: in the 1975 season of hans-joachim stuck, among the years that he participated in, 5 of them had points less than 1.
Ours: for hans-joachim stuck, when the year is over 1975, there were 5 times that there were less than 1 points.

Example 2 (guam)
Reference: the largest population for the populated places in guam whose area is less than 10 km square is 5845.
Baseline: in guam, the highest population in area km square is 5845.
Ours: the highest population in guam with area km square less than 10 is 5845.

Figure 1: The GPT-2 based model makes a fluent and logically consistent prediction on the original Logic2Text sample, but generates a sentence logically inconsistent with the logical form of the counterfactual sample.

Figure 2: Construction of counterfactual samples. Left: original sample. Right: counterfactual sample. We construct the counterfactual logical form by replacing the header "attendance" with the counterfactual header "date", leaving a negative shortcut for the model to exploit. Then we annotate the label sentence based on the counterfactual logical form.

Figure 3: Causal graph for the task of Logic2Text. Left: the original method. Right: our proposed modifications.

Figure 4: Automatically generated counterfactual sample for training.
Figure 5: Mispredicted token rate with respect to (a) the number of nodes and (b) the depth of the logical form.


Figure 6: Attention values during decoding. The baseline pays more attention to "attendance", as we expected, which verifies our hypothesis.

Figure 7: Sample attention mask matrix. The attention of each token to tokens with no directly connected edge is masked.

Table 2: Results of different methods on L2T and LCD with respect to BLEC, BLEC*, and human evaluation. Dec denotes the relative decrease on LCD compared with L2T, calculated as (L2T − LCD)/L2T.

Table 3: The effect of the proportion of synthetic training data on the quality of the generated text. r = |S̃|/|S|, where |S̃| denotes the size of the counterfactual data and |S| the size of the original training data. r = ∞ indicates that all of the training data are synthetic counterfactual samples.

Table 4: Results of the ablation study. We explore the effect of the attention mask (AM) and of training on counterfactual data (CF).

Table 5: Results of different methods of counterfactual data generation on L2T and LCD, where "-" denotes training on the standard Logic2Text data.
The concerned pairs of logical operators and table headers in the original and counterfactual samples are highlighted in blue and red, respectively. The counterfactual samples are modified from the original samples, with parts of the logical form replaced by rarely co-occurring logical operators and table headers, and the corresponding label sentences re-edited. To demonstrate the effectiveness of our method, we give two more running examples in Table 9. The results illustrate that the baseline model is prone to generating text inconsistent with the logical forms when fed counterfactual samples, while our approach yields more robust logical reasoning.

Table 9: Two running examples demonstrating the effectiveness of our method.