De-Confounded Variational Encoder-Decoder for Logical Table-to-Text Generation

Logical table-to-text generation aims to automatically generate fluent and logically faithful text from tables. The task remains challenging because deep learning models often generate linguistically fluent but logically inconsistent text. The underlying reason may be that such models capture surface-level spurious correlations rather than the causal relationships between the table \boldsymbol{x} and the sentence \boldsymbol{y}. Specifically, in the training stage, a model can achieve a low empirical loss without understanding \boldsymbol{x}, relying on spurious statistical cues instead. In this paper, we propose a de-confounded variational encoder-decoder (DCVED) based on causal intervention, which learns the objective p(\boldsymbol{y}|\textrm{do}(\boldsymbol{x})). Firstly, we propose to use variational inference to estimate the confounders in the latent space, combined with causal intervention based on Pearl's do-calculus to alleviate the spurious correlations. Secondly, to make the latent confounder meaningful, we propose a back-prediction process that predicts the entities that are not used in the sentence but are linguistically similar to the selected ones. Finally, since our variational model can generate multiple candidates, we train a table-text selector to find the best candidate sentence for the given table. An extensive set of experiments shows that our model outperforms the baselines and achieves new state-of-the-art performance on two logical table-to-text datasets in terms of logical fidelity.


Introduction
Data-to-text generation refers to the task of generating descriptive text from non-linguistic inputs. Depending on the type of input, this task can be defined more specifically, such as abstract meaning representation to text (Zhao et al., 2020; Bai et al., 2020a), infobox with key-value pairs to text (Bai et al., 2020b), graph-to-text, and table-to-text (Parikh et al., 2020) generation.
Among these tasks, we focus on logical table-to-text generation, which aims to generate fluent and logically faithful text from tables (Chen et al., 2020a). The ability of logical inference is a kind of high-level intelligence, which is nontrivial for text generation systems in practice. The task remains challenging because the reference sentences often convey logically inferred information that is not explicitly presented in the table. As a consequence, data-driven models often generate linguistically fluent but logically inconsistent text. Recent progress on this task mainly lies in the use of pretrained language models (LMs) like GPT-2 (Radford et al., 2018), which were shown to perform much better than non-pretrained models (Chen et al., 2020a,e).
However, it is still arguable whether pretrained LMs can correctly capture the logic, as pretrained LMs like BERT may use spurious statistical cues for inference (Niven and Kao, 2019). The substantial difficulty of this task does not lie in whether to use pretrained models or not. Instead, the difficulty arises because surface-level spurious correlations are easier to capture than the causal relationship between the table and the text. For example, we have observed that a model cooperating with GPT-2 generated the sentence "The album was released in the United States 2 time" for a given table, although the country where the album was released twice is actually "the United Kingdom". In the training stage, a model may achieve a low training loss by exploiting surface-level correlations without actually focusing on the selected entities. As a result, in the inference stage, the model may produce incorrect facts.
In this paper, we view logical table-to-text generation from the perspective of causal inference and propose a de-confounded variational encoder-decoder (DCVED). Firstly, given a table-sentence pair (x, y), we assume that confounders z_c exist in the latent space and contribute to the surface-level correlations (e.g., between "the United States" and "the United Kingdom"). We estimate z_c in the latent space via variational inference, and combine it with causal intervention based on Pearl's do-calculus (Pearl, 2010) to learn the objective p(y|do(x)) instead of p(y|x). Secondly, to make the latent confounder meaningful, we propose a back-prediction process that encourages the latent confounder z_c to predict the entities that are not used in the sentence but are linguistically similar to the selected ones. We also treat the selected entities as the mediators in our de-confounded architecture. Finally, since our variational model can generate multiple candidates, we train a table-text selector to find the best text for the table. An extensive set of experiments shows that our model achieves new state-of-the-art performance on two logical table-to-text datasets in terms of logical fidelity.
The main contributions of this work can be summarized as follows:
• We propose to use variational inference to estimate the confounders in the latent space, combined with back-prediction to make the latent variables meaningful.
• We propose a generate-then-select paradigm jointly considering surface-level and logical fidelity, which can be considered an alternative to reinforcement learning.
• Experiments show that our model achieves new state-of-the-art performance on two logical table-to-text datasets, with or without pretrained LMs.
Related Work

Table-to-Text Generation. The task of table-to-text generation belongs to data-to-text generation, where a key feature is the structured input data. Lebret et al. (2016) used a seq2seq neural model with a field-infusing strategy that obtains field-position-aware and field-word-aware cell embeddings to generate sentences from Wikipedia tables. A follow-up work proposed to update the cell memory of the LSTM with a field gate to help the LSTM identify the boundaries between different cells (Liu et al., 2018). Transformer-based (Vaswani et al., 2017) models were also proposed, improving the ability to capture long-term dependencies between cells (Ma et al., 2019; Chen et al., 2020a). It is worth mentioning that the copy mechanism (Luong et al., 2015) is an important component for dealing with out-of-vocabulary (OOV) words (Lebret et al., 2016; Gehrmann et al., 2018; Chen et al., 2020a) when not using pretrained language models.

Logical Table-to-Text Generation. While usually fluent, existing methods often hallucinate phrases that contradict the facts in the table.
To benchmark models' ability to generate logically consistent sentences, recent work proposed a dataset collected from the open domain (Chen et al., 2020a), on which models that ignore logical consistency score low. Follow-up work further proposed another dataset that involves logical forms as additional supervision (Chen et al., 2020e), covering common logic types paired with the underlying logical forms.
Causal Inference. Machine learning models often suffer from spurious statistical correlations brought by unmeasured or latent confounders (Keith et al., 2020). To eliminate the confounding bias, one approach is to apply causal intervention based on Pearl's do-calculus (Pearl, 2010). However, choosing proper confounders remains an open problem, and the language of the text itself can be a confounder (Keith et al., 2020). It is worth noting that high-quality observations of the mediators can also reduce the confounding bias, as they reduce the model's reliance on the confounders (Chen et al., 2020d).

Background
Before introducing our models, we briefly review the framework of the VAE (Kingma and Welling, 2014), a generative model that allows generating high-dimensional samples from a continuous space. In the probabilistic framework, the probability of data x can be computed by:

p(x) = ∫ p_θ(x|z) p(z) dz,   (1)

which is approximated by maximizing the evidence lower bound (ELBO):

log p(x) ≥ E_{q_ϕ(z|x)}[log p_θ(x|z)] − KL(q_ϕ(z|x) || p(z)),   (2)

where p_θ(x|z) denotes the decoder with parameters θ, q_ϕ(z|x) is obtained by an encoder with parameters ϕ, and p(z) is a prior distribution, for example, a Gaussian distribution. KL(·||·) denotes the Kullback-Leibler (KL) divergence between two distributions. When applied to seq2seq generation where the input and the output are denoted by x and y respectively, the conditional variational autoencoder (CVAE), often known as the variational encoder-decoder (VED), is used with the following approximation:

log p(y|x) ≥ E_{q_ϕ(z|x,y)}[log p_θ(y|x, z)] − KL(q_ϕ(z|x,y) || p(z|x)).   (3)

In the vanilla CVAE formulation, such as the one adopted in (Jain et al., 2017), the prior distribution p(z|x) is approximated by p(z), which is independent of x and fixed to a zero-mean unit-variance Gaussian distribution N(0, I). However, this formulation has been shown to induce a strong model bias (Tomczak and Welling, 2018) and to empirically perform worse than non-variational models (Wang et al., 2017) in multi-modal situations. Throughout this paper, the symbols x, y, z_m, and z_c denote the input table, the output sentence, the hidden mediator, and the hidden confounder, respectively. We assume m and c to be the proxy variables of z_m and z_c, respectively, which are relatively easy to observe.
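As a minimal illustration of the ELBO above, the KL term between a diagonal-Gaussian posterior and the standard-normal prior has a closed form. The sketch below (function names are ours, not the paper's) computes it:

```python
import numpy as np

def gaussian_kl(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ),
    summed over the latent dimensions."""
    return 0.5 * float(np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var))

def elbo(recon_log_lik, mu, log_var):
    """Evidence lower bound: reconstruction term minus the KL term."""
    return recon_log_lik - gaussian_kl(mu, log_var)

# When the posterior equals the prior, the KL term vanishes and the
# ELBO reduces to the reconstruction log-likelihood:
print(gaussian_kl(np.zeros(4), np.zeros(4)))  # 0.0
```

This is exactly the term that the vanilla CVAE fixes against N(0, I), which is the source of the model bias discussed above.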

De-Confounded VED
From a human perspective, multiple sentences can properly describe a given table, varying with different concerns, logical types, or linguistic realizations. Therefore, given the input data x and the output sentence y, we can assume that a latent variable z exists, leading to a conditional generation process p(y|x, z) where z contributes to the diversity. This suggests a CVAE framework with Equation 3. However, as discussed above, the vanilla CVAE introduces a model bias (Tomczak and Welling, 2018). In this subsection, we re-think the CVAE from the perspective of causal inference. We assume that a directed acyclic graph (DAG) exists, which includes a mediator z_m and a confounder z_c as shown in Figure 1(a). The mediator is determined by x and has causal effects on y, while the confounder has causal effects on both x and y.
When only considering z_m, we can compute the probability distribution p(y|x) by:

p(y|x) = Σ_{z_m} p(y|x, z_m) p_φ(z_m|x),   (4)

where φ denotes the parameters of a mediator predictor. An example of z_m is an entity (e.g., United Kingdom) that is selected from the table x and appears verbatim in y. The vanilla CVAE would constrain z_m in the continuous space, and further approximate the prior distribution p(z_m|x) by p(z_m), which produces biased information.
However, this does not mean that removing the approximation between p(z_m|x) and p(z_m) is enough. We observe that models often rely on spurious statistical cues for prediction, resulting in linguistically similar but inconsistent expressions in the generated sentences (e.g., using "The United States" instead of "The United Kingdom"). The model may minimize the training loss by relying on the surface-level correlations between the selected entity and a high-frequency entity. In this case, the high-frequency entity belongs to the confounder z_c. In the inference stage, the model may infer contradicting facts due to a high posterior probability q(z_c|x).
To eliminate the spurious correlations, we apply causal intervention by learning the objective p(y|do(x)) instead of p(y|x), which forces the input to be the observed data x and removes all the arrows pointing to x, as shown in Figure 1(b). When only considering z_c, we can compute the intervened probability distribution by:

p(y|do(x)) = Σ_{z_c} p(y|x, z_c) p(z_c|do(x)) = Σ_{z_c} p(y|x, z_c) p(z_c),   (5)

where z_c is no longer determined by x, making p(z_c|do(x)) = p(z_c). When applying variational inference to z_c, we have:

log p(y|do(x)) ≥ E_{q_ϕ(z_c|y)}[log p_θ(y|x, z_c)] − KL(q_ϕ(z_c|y) || p(z_c)).   (6)

It can be seen that the confounder z_c is more suitable than the mediator z_m for variational inference, as cutting off the link z_c → x naturally reduces p(z_c|do(x)) to p(z_c).
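The difference between the observational p(y|x) and the interventional p(y|do(x)) can be seen on a toy discrete version of the causal graph. The example below is our own, with made-up probabilities (not the paper's model); it only illustrates why cutting z_c → x replaces the posterior p(z_c|x) with the prior p(z_c):

```python
import numpy as np

# Toy discrete causal graph z_c -> x, z_c -> y, x -> y; all variables binary.
p_zc = np.array([0.5, 0.5])                  # p(z_c)
p_x_given_zc = np.array([[0.9, 0.1],         # p(x | z_c): rows z_c, cols x
                         [0.1, 0.9]])
p_y1_given_x_zc = np.array([[0.2, 0.6],      # p(y=1 | x, z_c): rows x, cols z_c
                            [0.4, 0.8]])

def p_y1_given_x(x):
    """Observational p(y=1|x): z_c is weighted by its posterior p(z_c|x)."""
    joint = p_zc * p_x_given_zc[:, x]        # p(z_c, x)
    posterior = joint / joint.sum()          # p(z_c | x)
    return float(np.sum(posterior * p_y1_given_x_zc[x]))

def p_y1_do_x(x):
    """Interventional p(y=1|do(x)): the arrow z_c -> x is cut,
    so z_c keeps its prior, i.e. p(z_c|do(x)) = p(z_c)."""
    return float(np.sum(p_zc * p_y1_given_x_zc[x]))

# The two quantities differ because z_c correlates with x:
print(p_y1_given_x(1), p_y1_do_x(1))  # approximately 0.76 vs 0.6
```

In the paper's setting the sum over a discrete z_c is replaced by the variational bound of Equation 6, but the role of p(z_c) is the same.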
When jointly considering z_m and z_c, we have:

log p(y|do(x)) ≥ E_{z_m ∼ p_φ(z_m|x), z_c ∼ q_ϕ(z_c|y)}[log p_θ(y|x, z_m, z_c)] − KL(q_ϕ(z_c|y) || p(z_c)),   (7)

according to the intervened causal graph in Figure 1(b). The symbols ϕ, φ, and θ denote the parameters of the three probability modeling networks, respectively. It is worth noting that we do not apply variational inference to z_m because finding a proper prior distribution p(z_m|x) is a nontrivial problem in itself. Instead, our framework is easy to implement.

Making Latent Variables Meaningful
However, there is no guarantee that z_m and z_c represent the real mediators and confounders in Equation 7. If we have no other observed variables, the confounder z_c would mainly represent a covariate that is naturally independent of x and has causal effects on y.
Therefore, we further involve proxy variables m and c for z_m and z_c, respectively; the full causal graph is shown in Figure 1. Proxy variables are proxies of hidden or unmeasured variables (Miao et al., 2018). In practice, the mediators and the confounders are often too complex to be directly observed. For example, we may not be able to directly measure one's socioeconomic status, but we can likely obtain a proxy through the zip code or job type (Louizos et al., 2017).
To make the latent variables z_m and z_c meaningful, we add two additional networks, and the learning objective is to maximize:

E_{z_m ∼ p_φ(z_m|x), z_c ∼ q_ϕ(z_c|y)}[log p_θ(y|x, z_m, z_c) + log p_Φ(m|z_m) + log p_Ψ(c|z_c)] − KL(q_ϕ(z_c|y) || p(z_c)),   (8)

where Φ and Ψ denote the parameters of the two additional networks.

Back-Prediction from the Confounder. As shown in Figure 1(a), the confounder z_c inferred from y also has a causal effect on x; otherwise, the confounder would collapse into a covariate. The spurious correlations we have observed are that models often generate linguistically similar but logically inconsistent outputs, for example, generating "the United States" instead of "the United Kingdom" because the two entities are linguistically similar to each other. Therefore, we take the proxy confounders c to be the entities in the given table that are not mentioned in the sentence, and we keep only the high-frequency entities in the training set (appearing ≥ 5 times). Let c = {c_{i,j}} ∈ R^{N_c × L_c}, where c_{i,j} denotes the j-th token of the i-th entity, and N_c and L_c denote the number of entities and the maximum entity length, respectively. The log-probability log p_Ψ(c|z_c) is computed by:

log p_Ψ(c|z_c) = Σ_{i=1}^{N_c} Σ_{j=1}^{L_c} log p_Ψ(c_{i,j} | c_{i,<j}, z_c),   (9)

where c_{i,<j} denotes the tokens preceding the j-th token in the i-th entity. We then minimize the cross-entropy between p_Ψ(c|z_c) and p(c).
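The double sum above amounts to a teacher-forced token-level log-likelihood over the proxy entities. A minimal sketch, assuming the per-token vocabulary distributions have already been produced by the back-prediction network p_Ψ:

```python
import numpy as np

def entity_log_prob(token_probs, entities):
    """log p(c|z_c) = sum_i sum_j log p(c_{i,j} | c_{i,<j}, z_c).

    token_probs[i][j]: predicted vocabulary distribution for the j-th
    token of the i-th entity (teacher-forced, so it already conditions
    on the preceding gold tokens c_{i,<j} and on z_c).
    entities[i][j]: gold token id c_{i,j}.
    """
    total = 0.0
    for probs_i, entity_i in zip(token_probs, entities):
        for p_j, c_ij in zip(probs_i, entity_i):
            total += float(np.log(p_j[c_ij]))
    return total

# Two toy "entities" of lengths 2 and 1 over a 3-token vocabulary:
probs = [
    [np.array([0.7, 0.2, 0.1]), np.array([0.1, 0.8, 0.1])],
    [np.array([0.25, 0.25, 0.5])],
]
gold = [[0, 1], [2]]
print(entity_log_prob(probs, gold))  # log(0.7) + log(0.8) + log(0.5)
```

Maximizing this quantity (equivalently, minimizing the token-level cross-entropy) is what ties z_c to the not-mentioned entities; the mediator term log p_Φ(m|z_m) has the same form.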
Supervision for the Mediator. In the logical table-to-text generation task, from the human perspective, the proper mediators may be the selected entities, the logical types, or the logical forms (Chen et al., 2020e). In this paper, we only consider the selected entities, as they are relatively easy to extract, while the logical types or forms are labor-intensive to annotate. We represent the selected entities by m = {m_{i,j}} ∈ R^{N_m × L_m}, where m_{i,j} denotes the j-th token of the i-th entity, and N_m and L_m denote the number of entities and the maximum entity length, respectively. The log-probability log p_Φ(m|z_m) is computed by:

log p_Φ(m|z_m) = Σ_{i=1}^{N_m} Σ_{j=1}^{L_m} log p_Φ(m_{i,j} | m_{i,<j}, z_m),   (10)

where m_{i,<j} denotes the tokens preceding the j-th token in the i-th entity.

Encoders and Decoders Implementation
We now describe the implementations of p_θ(y|x, z_m, z_c), p_φ(z_m|x), and q_ϕ(z_c|y). We assume that the seq2seq model for p_θ(y|x, z_m, z_c) consists of an encoder Enc(·) and a decoder Dec(·), and that a target-oriented encoder T-Enc(·) is used for q_ϕ(z_c|y).
Firstly, we implement p_φ(z_m|x) and q_ϕ(z_c|y). Let H_x be the hidden states of x encoded via H_x = Enc(x), and E_y be the embeddings of y before being fed to the decoder Dec(·). We use a fully-connected neural network (FCNN) to project H_x, followed by average pooling, to obtain z_m. We use the target-oriented encoder to encode E_y and obtain H_y = T-Enc(E_y), and apply mean pooling to H_y to obtain h_y. To model q_ϕ(z_c|y), which is approximated by a Gaussian distribution, we use two FCNNs to process h_y and obtain the mean vector µ_y and the log variance log σ_y^2, which gives:

q_ϕ(z_c|y) = N(µ_y, diag(σ_y^2)).   (11)

To implement p_θ(y|x, z_m, z_c), our model builds on either a non-pretrained model, "Field-Infusing+Trans" (Chen et al., 2020a), or a pretrained model, "GPT-TabGen" (Chen et al., 2020a). Specifically, "Field-Infusing+Trans" uses an infusing field embedding network to produce header-word-aware and cell-position-aware embeddings E_p, which are concatenated with token embeddings to obtain the infused embeddings E = {e_i} ∈ R^{L_t × d}, where e_i denotes the embedding of the i-th token in the table x, and L_t and d denote the number of tokens and the dimension, respectively. The decoder then decodes y token by token: y_t = Dec(H_x, y_{<t}, z_m, z_c). The latent variables z_m and z_c are concatenated into one latent variable and projected by an FCNN to get a vector z_{m,c} with the same dimension as H_x. We then add z_{m,c} to E_y at each decoding step. When combined with "GPT-TabGen", the difference from "Field-Infusing+Trans" is that we use GPT-2 as the encoder and decoder, and use table linearization to indicate the cell positions instead of the field-infusing method. More details about table linearization can be found in (Chen et al., 2020a). The vector z_{m,c} is fed to the last Transformer layer of GPT-2 instead of the first layer, which has less impact on the pretrained GPT-2.
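Sampling z_c from the Gaussian posterior in Equation 11 is typically done with the reparameterization trick so that gradients flow through µ_y and log σ_y^2. A sketch with single-layer stand-ins for the two FCNNs (the weight matrices and dimensions here are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_z_c(h_y, w_mu, w_logvar):
    """Reparameterized sample from q(z_c|y) = N(mu_y, diag(sigma_y^2)).

    h_y is the mean-pooled target encoding; w_mu and w_logvar are
    hypothetical single-layer stand-ins for the two FCNNs in the paper.
    """
    mu = h_y @ w_mu                      # mean vector mu_y
    log_var = h_y @ w_logvar             # log variance log sigma_y^2
    eps = rng.standard_normal(mu.shape)  # eps ~ N(0, I)
    return mu + np.exp(0.5 * log_var) * eps, mu, log_var

h_y = rng.standard_normal(8)             # pooled encoding (dim 8, made up)
z_c, mu, log_var = sample_z_c(h_y,
                              rng.standard_normal((8, 4)),
                              rng.standard_normal((8, 4)))
print(z_c.shape)  # (4,)
```

At inference time the posterior is unavailable (y is unknown), so z_c is drawn from the prior p(z_c) = N(0, I) instead.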

Generate-then-Select Paradigm
By sampling multiple latent variables z_c ∼ p(z_c), our model can generate multiple candidate sentences Ŷ = (ŷ_1, ŷ_2, ..., ŷ_{N_c}) for the table x, where N_c is the number of generated sentences. We propose to find the best sentence with a trained selector. A generator optimized with MLE may focus more on token-level matching than on sentence-level consistency, while the selector focuses on improving the sentence-level scores; the paradigm can therefore be considered an alternative to reinforcement learning. The selector scores each candidate sentence by s_i = S_χ(ŷ_i, x), where χ denotes the parameters of the selector network. Note that we do not design a selector s_i = S_χ(ŷ_i, y) because the reference sentence y is not available in practice.
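The generate-then-select inference loop can be sketched as follows. This is a minimal illustration: `generate` and `score` are hypothetical stand-ins for the DCVED decoder and the selector S_χ, not real implementations:

```python
import random

random.seed(0)

def generate(table, z_c):
    """Hypothetical stand-in for decoding one sentence given a sampled z_c."""
    return f"sentence about {table} (style {z_c:.2f})"

def score(sentence, table):
    """Hypothetical stand-in for the selector S_chi(y_hat, x)."""
    return random.random()

def generate_then_select(table, n_candidates=10):
    # Sample z_c ~ p(z_c) = N(0, I), decode one candidate per sample,
    # then keep the candidate that the selector scores highest.
    candidates = [generate(table, random.gauss(0, 1))
                  for _ in range(n_candidates)]
    return max(candidates, key=lambda y: score(y, table))

best = generate_then_select("table_42")
print(best)
```

The key point is that selection requires only (ŷ_i, x) pairs, so the same loop runs unchanged at test time when no reference is available.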
Recent work has provided several selectors, including parsing-based and NLI-based models (Chen et al., 2020c). We could directly use these selectors, but we aim to develop a more general selector jointly considering surface-level and logical fidelity. We use a mix of BLEU-3 (Papineni et al., 2002) and NLI-Acc (Chen et al., 2020a) scores to supervise the selector. In the training stage of the selector, we can get the gold score of each generated candidate with the reference sentence y by s*_i = S*(ŷ_i, y). We then use BERT to encode x and ŷ_i, followed by average pooling layers, to produce h_s and h_s^i. Finally, we score the table-sentence pair represented by (h_s, h_s^i) as follows:

s_i = sigmoid(W_s [h_s ⊕ h_s^i ⊕ (h_s ⊙ h_s^i)]),   (12)

where ⊕ and ⊙ denote the concatenation and element-wise multiplication operations, respectively, and W_s denotes the parameters of the scoring network. The score S_χ(ŷ_i, x) is between 0 and 1, and better sentences should be closer to 1; the score of the gold reference is set to 1. We then use a margin-based triplet loss for the generated sentences in two ways: comparing with the gold sentence, and comparing between arbitrary pairs of generated sentences. Given N_c generated candidate sentences, we rank them according to the mix of BLEU-3 and NLI-Acc scores. The ranked sentences are denoted by Ŷ_r = (ŷ_r^1, ŷ_r^2, ..., ŷ_r^{N_c}), where ŷ_r^1 has the highest score. The loss is then computed as:

L_sel = Σ_i max(0, γ_1 − S_χ(y, x) + S_χ(ŷ_r^i, x)) + Σ_{i<j} max(0, γ_2 − S_χ(ŷ_r^i, x) + S_χ(ŷ_r^j, x)),   (13)

where γ_1 and γ_2 are hyperparameters representing margin values, and i and j are the ranked indexes. At the inference stage, we select the sentence with the highest score.
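The two-way margin loss above can be sketched directly from candidate scores. This is our own minimal reading of the described comparisons (gold vs. candidates, and better-ranked vs. worse-ranked candidates), not the authors' exact implementation:

```python
def selector_loss(gold_score, ranked_scores, gamma1=0.2, gamma2=0.2):
    """Margin-based triplet loss for the candidate selector.

    gold_score is S(y, x) for the reference (its target is 1), and
    ranked_scores holds S(y_hat_r^i, x) with index 0 ranked best by the
    mixed BLEU-3 / NLI-Acc gold scores.
    """
    loss = 0.0
    # 1) every candidate should score at least gamma1 below the gold sentence
    for s in ranked_scores:
        loss += max(0.0, gamma1 - (gold_score - s))
    # 2) a better-ranked candidate should beat a worse-ranked one by gamma2
    n = len(ranked_scores)
    for i in range(n):
        for j in range(i + 1, n):
            loss += max(0.0, gamma2 - (ranked_scores[i] - ranked_scores[j]))
    return loss

print(selector_loss(1.0, [0.7, 0.4, 0.1]))  # 0.0: all margins satisfied
```

With γ_1 = γ_2 = 0.2 (the values used in the experiments), the loss is zero exactly when every pair is separated by at least the margin.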

[Table 1: Statistics of the two datasets, including vocabulary size and number of tables.]

Datasets
We conduct experiments on two datasets: LogicNLG (Chen et al., 2020a) and Logic2Text (Chen et al., 2020e). LogicNLG is constructed from the positive statements of the TabFact dataset (Chen et al., 2020c), which contains rich logical inferences in the annotated statements. Logic2Text is a smaller dataset that provides annotations of logical forms. Since the annotations of logical forms are labor-intensive, we only use the table-sentence pairs, following the task formulation of LogicNLG. The statistics of the two datasets are shown in Table 1.

Evaluation and Settings
The models are evaluated on surface-level consistency and logical fidelity. In terms of surface-level consistency, we evaluate models with sentence-level BLEU scores (Papineni et al., 2002) based on 1-3 gram matching. In terms of logical fidelity, we follow recent work and apply two metrics, SP-Acc and NLI-Acc, based on semantic parsing and a pretrained NLI model, respectively (Chen et al., 2020a). The metrics are computed with the officially released code.

Compared Models. We compare our models with both non-pretrained and pretrained models. The non-pretrained models include "Field-Gating" (Liu et al., 2018) and "Field-Infusing" (Lebret et al., 2016) with an LSTM decoder or a Transformer decoder, which are strong baselines among non-pretrained models. The pretrained models include "BERT-TabGen" and "GPT-TabGen" with the base size (Chen et al., 2020a). Moreover, for the LogicNLG dataset, we compare with a two-phase approach denoted by "GPT-Coarse-to-Fine", which first generates a template and then generates the final sentence conditioned on the template (Chen et al., 2020a). For the variational models, we compare with the vanilla CVAE that approximates the prior distribution p(z|x) by p(z).
Hyperparameters. For the non-pretrained models, we set the dimension of the LSTM or Transformer to 256. Our model is based on "Field-Infusing+Trans", which includes 3-layer Transformers in the encoder and decoder, respectively. The posterior network q_ϕ(z_c|y) contains a two-layer Transformer. For the pretrained models, we use the base versions of BERT and GPT-2, which have an embedding size of 768. The KL loss is minimized with the annealing trick, where the KL weight is set to 0 for 2 epochs and grows to 1.0 over another 5 epochs. The learning rate is initialized to 0.0001 and 0.000002 for non-pretrained and pretrained models, respectively. Each model is trained for 15 epochs. A special setting for our model is that we generate 10 candidate sentences for each table, and report the average performance and the best performance based on the selector, respectively. We set the hyperparameters γ_1 = 0.2 and γ_2 = 0.2 for the selector.

Results

Tables 2 and 3 present the performance of our model as well as the compared models on surface-level consistency and logical fidelity. As shown, even without the selector, our model DCVED already outperforms the baseline models "Field-Infusing" and "GPT-TabGen" on both the LogicNLG and Logic2Text datasets. Specifically, when compared with "Field-Infusing", our model increases the BLEU-3, SP-Acc, and NLI-Acc scores by 1.4, 3.7, and 3.9 points, respectively, on the LogicNLG dataset, and by 0.2, 2.4, and 2.8 points on the Logic2Text dataset. When cooperating with GPT-2, our model outperforms "GPT-TabGen" by 1.6, 2.2, and 5.2 points of BLEU-3, SP-Acc, and NLI-Acc scores, respectively, on the LogicNLG dataset, and by 0.2, 1.3, and 5.4 points on the Logic2Text dataset. Moreover, we also report variants of our model that select the best sentence according to the BLEU-3 and NLI-Acc scores, respectively. As shown, a higher BLEU-3 score does not lead to a higher NLI-Acc score.
Similarly, a higher NLI-Acc score does not yield a higher BLEU-3 score. The findings indicate that selecting candidates only by BLEU-3 or only by NLI-Acc is not enough. Instead, our trained selector comprehensively considers the BLEU-3 and NLI-Acc scores.

Ablation Study
To analyze which mechanisms drive the improvements, we present an ablation study in Table 4. We show ablated models with different combinations of z_c, z_m, c, and m. All these models are based on "Field-Infusing". Moreover, the vanilla CVAE is also compared, which can be considered a baseline making both z_m and z_c independent of x. As shown, both the mediators and the confounders are influential. The full model achieves the best SP-Acc and NLI-Acc scores, with slightly lower BLEU-3 scores than the ablated model DCVED (z_c, z_m, m). Eliminating c from the full model leads to a drop in NLI-Acc of 0.6 and 0.4 points on LogicNLG and Logic2Text, respectively. Further eliminating z_m and m leads to a drop in NLI-Acc of 0.9 and 2.9 points on LogicNLG and Logic2Text, respectively. An interesting finding is that DCVED (z_c, c) performs worse than DCVED (z_c) on SP-Acc. The reason may be that predicting c from z_c without considering the mediator z_m can also introduce a bias, similar to the CVAE. However, all the ablated models perform better than the CVAE on SP-Acc and NLI-Acc.

Human Evaluation
Following recent work (Chen et al., 2020a), we also perform a human evaluation of fluency and logical fidelity. We randomly select 200 tables from the LogicNLG dataset and generate one sentence per table for each model. We then present the generated sentences to four raters without telling them which model generated each sentence. The raters are all postgraduate students majoring in computer science. We ask the raters to complete two binary-decision tasks: 1) whether a generated sentence is fluent; and 2) whether the facts in a generated sentence are supported by the given table.
We report the averaged results in Table 5, from which we can see that our model "DCVED + GPT-TabGen" mainly increases the logical fidelity over the baseline model "GPT-TabGen", from 19.1% to 25.8%. When combined with the trained selector and the oracle NLI selector, our model further increases the logical fidelity to 30.8% and 37.1%, respectively. It is worth noting that the NLI selector can be represented by the scorer P_NLI(ŷ, x), which does not require the ground-truth sentence y to be available (Chen et al., 2020a). This means that the setting of using the oracle NLI selector is practical.

Case Study
To directly observe the effect of our model, we present a case study in Figure 2, where several GPT-2-based models generate sentences describing two tables in the LogicNLG test set. The underlined red words represent facts contradicting the tables. One baseline sentence uses superlative logic but does not mention a specific year. Instead, our model produces two logically consistent sentences with superlative and comparative logic.

Limitations
Although our model can improve the logical fidelity to a certain degree, all the models still receive low logical-fidelity scores in the human evaluation, which reflects the difficulty of the task. In particular, we find that models do not perform well on certain types of tables: 1) tables containing comparisons between large numbers, e.g., 18,013 and 29,001; and 2) tables with mixed logic that requires multi-hop reasoning, e.g., models generating "there were 3 nations that won 2 gold medals" while the correct number of nations is 4.
To deal with these problems, we believe two directions of work may be workable: 1) enhancing the mediators; for example, the logical forms (Chen et al., 2020e) can be utilized as the mediator, although, as mentioned in Section 4.2, annotating logical forms is labor-intensive; and 2) large-scale knowledge-grounded pre-training, which may be a more promising way. This line of work utilizes existing knowledge graphs or data crawled from Wikipedia (Chen et al., 2020b) to help models better encode and represent non-linguistic inputs, such as the numbers, times, or scores in tables.

Conclusion
In this paper, we propose a de-confounded variational encoder-decoder for logical table-to-text generation. Firstly, we assume that two latent variables exist in the continuous space, representing the mediator and the confounder, respectively, and we apply causal intervention to reduce the spurious correlations. Secondly, to make the latent variables meaningful, we use the selected entities to supervise the mediator, and the entities that are not selected but are linguistically similar to supervise the confounder. Finally, since our model can generate multiple candidates for a table, we train a selector guided by both surface-level and logical fidelity to select the best sentence. The experiments show that our model yields competitive results compared with recent state-of-the-art models.