PLOG: Table-to-Logic Pretraining for Logical Table-to-Text Generation

Logical table-to-text generation is the task of generating logically faithful sentences from tables, which requires models to derive logical-level facts from table records via logical inference. It poses a new challenge for the logical-level content planning of table-to-text models. However, directly learning logical inference knowledge from table-text pairs is very difficult for neural models because of the ambiguity of natural language and the scarcity of parallel data. Hence, even large-scale pretrained language models present low logical fidelity on logical table-to-text generation. In this work, we propose a Pretrained Logical Form Generator (PLOG) framework to improve generation fidelity. Specifically, PLOG is first pretrained on a table-to-logical-form generation (table-to-logic) task, then finetuned on downstream table-to-text tasks. Because logical forms are formally defined with unambiguous semantics, we can collect a large number of accurate logical forms from tables without human annotation. In addition, PLOG can learn logical inference from table-logic pairs much more reliably than from table-text pairs. To evaluate our model, we further collect a controlled logical table-to-text dataset, CONTLOG, based on an existing dataset. On two benchmarks, LOGICNLG and CONTLOG, PLOG outperforms strong baselines by a large margin on logical fidelity, demonstrating the effectiveness of table-to-logic pretraining.


Introduction
Table-to-text generation is a sub-task of data-to-text generation, aiming to generate natural language descriptions from structured tables. There are two main steps in table-to-text generation: content planning (selecting table contents and determining the plan to describe them) and surface realization (verbalizing the plan into fluent natural language). Traditional table-to-text systems adopt a pipeline architecture to complete the two procedures with separate modules (Kukich, 1983; McKeown, 1985). Recent work has shown the advantage of using a neural encoder-decoder model to directly generate sentences from tables, which presents a strong capability to produce fluent and natural text (Wiseman et al., 2017; Nie et al., 2018; Puduppully et al., 2019b). Researchers have also attempted to finetune pretrained language models such as BART (Lewis et al., 2020) and T5 (Raffel et al., 2020) on downstream table-to-text tasks and achieved remarkable success on a broad range of benchmarks (Xie et al., 2022; Kale and Rastogi, 2020).
Previous studies have mainly focused on surface-level realization, i.e., simply restating surface-level facts in natural language (Wiseman et al., 2017; Liu et al., 2018; Puduppully et al., 2019a,b). Recently, logical table-to-text generation (Chen et al., 2020a), i.e., generating textual descriptions that require logical reasoning over surface-level facts in the table, has attracted increasing attention. Logical table-to-text generation poses a new challenge of logical-level content planning, requiring models to perform logical inference to derive facts from surface-level table records. End-to-end neural models often suffer from low logical fidelity on this task, i.e., the generated sentences are not logically entailed by the tables despite their reasonable fluency (Chen et al., 2020a, 2021). We attribute this to the fact that the ambiguity of natural language target sentences hinders neural models from learning accurate logical inference from table-text pairs. In addition, the amount of such table-text pairs is limited because of the labor-intensive human annotation required for logic-focused descriptions, which also limits the performance of neural models.
To achieve high fidelity of logical-level generation, Chen et al. (2020b) proposed to annotate explicit logical forms that guide the generation. However, follow-up works (Liu et al., 2021a; Shu et al., 2021; Xie et al., 2022) mostly focus on converting the logical forms into texts rather than tables into texts.
In this study, we propose a Pre-trained LOgical Form Generator (PLOG) model to achieve more faithful logical table-to-text generation. Specifically, PLOG is first pretrained on a large-scale synthetic corpus of table-to-logical-form generation (table-to-logic) to learn how to generate accurate logical forms from tables, then finetuned on downstream table-to-text tasks to transfer the logical inference knowledge learned during pretraining to text generation. Our insights are three-fold. (i) Unlike natural language sentences, logical forms are formally defined with unambiguous semantics; hence it is much easier and more reliable for models to acquire logical inference knowledge by learning to generate logical forms. (ii) It is viable to collect large-scale logical form corpora via rule-based sampling over tables without human annotation effort. (iii) Via pretraining on large amounts of table-to-logic data, the proposed model can better understand the table and organize the logical-level content planning, leading to faithful table-to-text generation. Here, we treat logical forms as intermediate meaning representations of logical-level texts, but we do not need them when performing the downstream task. To collect the pretraining data, we propose an execution-guided sampling approach that samples accurate logical forms from tables automatically.
We formulate the pretraining task in the same sequence-to-sequence (seq2seq) generation paradigm to achieve smooth transfer learning to the downstream table-to-text task. We adopt several strong pretrained language models, BART and T5, as the backbone models. Because the previous benchmark for logical table-to-text, LOGICNLG, lacks control features, leading to uncontrollable content selection and poor logical fidelity, we collect a new CONTrolled LOGical Natural Language Generation (CONTLOG) dataset as a complementary testbed for controlled logical table-to-text generation. Specifically, we re-organize the LOGIC2TEXT dataset by detecting highlighted cells based on their annotated logical forms. Figure 1 presents examples of the table-to-logic pretraining task and the (controlled) logical table-to-text task.
On the two benchmarks, LOGICNLG and CONTLOG, PLOG outperforms strong baselines such as T5 by a large margin on logical fidelity, demonstrating the effectiveness of table-to-logic pretraining. Human evaluation and analysis experiments further demonstrate that our approach can considerably promote the fidelity of logical table-to-text generation.

Related Work

Table-to-text generation has been studied on a variety of datasets, starting from surface-level benchmarks (Lebret et al., 2016). LOGICNLG is the first dataset to focus on logical table-to-text generation, with Wikipedia tables and human-annotated logical descriptions. Chen et al. (2021) proposed a de-confounded variational encoder-decoder model to encourage the model to generate non-surface-level predictions; however, the logical reasoning process is not explicitly considered, leading to low fidelity scores in human evaluation. Chen et al. (2020b) proposed to annotate logical forms to guide the generation and released the LOGIC2TEXT dataset. In this work, we focus on direct logical table-to-text generation without any explicit logical forms. Another related line of datasets comprises ToTTo (Parikh et al., 2020) and HiTab (Cheng et al., 2021), which incorporate highlighted cells to promote controllable generation. The CONTLOG dataset we propose is similar to the task settings of these datasets but differs in that we focus on logical-level generation, whereas only a small portion of examples in ToTTo and HiTab involve logical reasoning. ROTOWIRE (Wiseman et al., 2017) and Numeric-NLG (Suadaa et al., 2021) are other related datasets involving numerical reasoning over tables. Table pretraining (Eisenschlos et al., 2020; Liu et al., 2021b; Dong et al., 2022; Iida et al., 2021) has been popular for table understanding tasks such as Table Question Answering (TableQA) (Zhong et al., 2017; Pasupat and Liang, 2015) and Table Fact Verification (TableFV) (Chen et al., 2019). With large-scale pretraining corpora, table pretraining models can learn a better joint understanding of tabular and textual data through well-defined pretraining objectives. Most table pretraining works are based on table-text corpora, while TAPEX (Liu et al., 2021b) learns from synthetic SQL programs, which is the closest to our work. Specifically, TAPEX is first pretrained on a table-based SQL execution task, where the input is a table and a SQL program, and the output is the answer to the SQL query. Then, the pretrained model can be finetuned on TableQA and TableFV tasks where the input is a table associated with a textual query/statement, and the output is the answer. However, our work differs from TAPEX in that we focus on table-to-text generation, where the input is a structured table and the output is a textual statement of the table contents. Our task requires deriving a complete logical-level fact from the table without the guidance of any query. In addition, our pretraining task requires generating a self-contained logical form from the table, while TAPEX aims to learn the neural execution of an existing SQL program. Similarly, FLAP (Anonymous, 2021) proposes to enhance the numerical reasoning ability of table-to-text models with an artificial pretraining task, a synthetic QA task similar to TAPEX pretraining.
Another line of related work adopts pretraining techniques to solve text-to-SQL parsing (Yu et al., 2021; Shi et al., 2021), also involving collecting synthetic SQL data and pretraining models on SQL generation tasks. However, text-to-SQL still requires an explicit natural language query as the input, which is different from our task. Although table pretraining is popular in table understanding tasks, it has not been well explored in table-to-text generation. Previous works on table-to-text tend to directly utilize pretrained language models by flattening structured tables into sequences (Gong et al., 2020; Kale and Rastogi, 2020; Xie et al., 2022). A recent work (Andrejczuk et al., 2022) incorporates structural positional embeddings of tables into T5 (Raffel et al., 2020) and performs intermediate pretraining in a similar way to TAPAS (Eisenschlos et al., 2020). Similarly, PLOG can also be seen as intermediate pretraining of language models for table-to-text generation.

Downstream Tasks
In this work, we focus on logical table-to-text generation. The previous benchmark LOGICNLG aims at generating sentences from a full table without control features, which causes uncontrollable content selection and hinders faithful generation (Chen et al., 2020b). Therefore, we propose a new controlled logical table-to-text dataset, CONTLOG, as a complementary testbed to LOGICNLG. Inspired by previous studies on controlled table-to-text generation (Parikh et al., 2020; Cheng et al., 2021), we incorporate highlighted cells as additional supervision signals in CONTLOG (Figure 1) to narrow down the scope of content selection, such that models can focus more on planning and generation.

CONTLOG Dataset Construction
We reuse the LOGIC2TEXT dataset to build CONTLOG. In LOGIC2TEXT, there is an annotated logical form for each target sentence, which conveys the accurate logical semantics of the sentence. Hence, we execute the logical forms on the context tables and extract the table cells involved in the execution process; these cells are also the ones most relevant to the target sentence. Although built upon LOGIC2TEXT, CONTLOG does not contain logical forms because we focus on direct table-to-text generation. Figure 1 shows an example of CONTLOG.
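The cell-extraction step described above can be sketched as follows: execute the annotated logical form on the table while recording every cell the execution inspects. This is an illustrative simplification with an invented table format and a single hand-written function (`filter_greater`), not the authors' implementation.

```python
# Hypothetical sketch of CONTLOG-style highlighted-cell extraction:
# execute (part of) a logical form on a table and record every (row, col)
# index the execution touches. Table format and function are illustrative.

def filter_greater(table, col, value, touched):
    """Keep rows whose `col` value exceeds `value`, recording touched cells."""
    j = table["header"].index(col)
    rows = []
    for i, row in enumerate(table["rows"]):
        touched.add((i, j))          # this cell was inspected by the executor
        if float(row[j]) > value:
            rows.append(row)
    return rows

def extract_highlighted_cells(table, value):
    touched = set()
    filter_greater(table, "points", value, touched)
    return sorted(touched)

table = {"header": ["player", "points"],
         "rows": [["alice", "12"], ["bob", "7"]]}
print(extract_highlighted_cells(table, 10))  # every "points" cell is inspected
```

In the real pipeline the executor would cover the full function schema, and the recorded indices become the highlighted cells H of a CONTLOG example.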

Task Formulation
The input of LOGICNLG is a table T, where R_T and R_C denote the numbers of rows and columns, respectively, and T_ij is the cell value at row i and column j. Each column j also has a column header Col_j. The output is a sentence y. The task objective is to learn a model P(y|T) that generates a sentence ŷ that is both fluent and logically entailed by the table. In CONTLOG, an additional set of highlighted cells H = {(i, j)} is included as part of the input, where i and j denote the row index and column index of a highlighted cell. The objective thus becomes P(y|T; H).
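The CONTLOG input format above can be made concrete with a small container type. The field names and example values here are assumptions for illustration, not the dataset's actual schema.

```python
# Minimal sketch of a CONTLOG example as described above (names assumed):
# a table T with headers Col_j, cell values T_ij, highlighted indices H,
# and a reference sentence y.
from dataclasses import dataclass

@dataclass
class ContlogExample:
    headers: list       # Col_j for each column
    cells: list         # T[i][j], row-major
    highlighted: set    # H = {(i, j)}
    target: str         # reference sentence y

ex = ContlogExample(
    headers=["player", "date", "points"],
    cells=[["alice", "august 5, 1972", "12"],
           ["bob", "august 9, 1972", "7"]],
    highlighted={(0, 2), (1, 2)},
    target="alice scored more points than bob.",
)
# A model for P(y | T; H) conditions on the cells selected by `highlighted`.
sub_cells = [ex.cells[i][j] for (i, j) in sorted(ex.highlighted)]
print(sub_cells)  # ['12', '7']
```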

Table-to-Logic Pretraining
Logical table-to-text generation is difficult mainly because of the ambiguity of natural language sentences. For example, the sentence "Alice was the first player that achieved champion in 2010" has two possible meanings: (1) Alice won the first championship of 2010; (2) Alice became the first champion in history, and this achievement happened in 2010. This ambiguity prevents end-to-end neural models from inferring unambiguous logical facts from the table, especially when parallel data are scarce.
To achieve faithful logical table-to-text generation, we propose a table-to-logic pretraining task that involves generating a logical form from an input table. In this task, the model needs to mine logical-level facts from tables and organize the facts into formally defined meaning representations, i.e., logical forms. Each logical form can be regarded as an abstract content plan of a logical-level description. Therefore, we expect a model to learn logical-level content planning from the pretraining task. We then finetune the model on the downstream table-to-text tasks to generalize the content planning to natural language generation. We formulate the pretraining and downstream tasks in the same seq2seq generation paradigm to realize successful transfer learning.

Pretraining Task Formulation
The input of the pretraining task is the same (sub-)table as introduced in Section 3.2, while the target is a logical form instead of a sentence. We follow the same schema as LOGIC2TEXT to define the logical forms used in our task. Each logical form z is a composition of several logical functions. Each function f_i(arg_1, ...) accepts several arguments relevant to the table T. z can be parsed into a tree and executed bottom-up by a logical form executor. In this process, the execution result of f_i may be fed to its parent function as an argument. The root function always outputs a Boolean value (true or false) that indicates the factual correctness of z. We select this schema for several reasons. (1) It is originally designed to represent logical-level textual statements in LOGIC2TEXT, so its definition is close to our downstream tasks; a similar schema has also been used for TableFV tasks (Ou and Liu, 2022; Chen et al., 2019). (2) It covers seven of the most commonly used logic types: count, unique, comparative, superlative, ordinal, aggregation, and majority. (3) The logical forms can be executed on the tables to verify their exact correctness, allowing accurate evaluation of the pretraining task. A detailed description of the logic schema is provided in Appendix B.
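The bottom-up execution described above can be sketched with a toy executor. Each tree node is a (function, args) tuple; child results feed the parent, and the root returns a Boolean. The function names mirror the paper's example, but the implementation is an illustrative simplification, not the authors' executor.

```python
# Toy executor for the logical-form schema sketched above. A node is a
# tuple (function_name, *args); leaves are columns/entities/numbers.
# Execution proceeds bottom-up, and the root returns a Boolean.

TABLE = {"header": ["date", "attendance"],
         "rows": [["1972-08-05", "4500"], ["1972-08-09", "3000"],
                  ["1972-08-12", "5200"]]}

def execute(node, table):
    fn, *args = node
    # evaluate an argument: recurse for sub-trees, pass leaves through
    ev = lambda a: execute(a, table) if isinstance(a, tuple) else a
    if fn == "all_rows":
        return table["rows"]
    if fn == "filter_greater":
        rows, col, val = ev(args[0]), args[1], float(args[2])
        j = table["header"].index(col)
        return [r for r in rows if float(r[j]) > val]
    if fn == "count":
        return len(ev(args[0]))
    if fn == "eq":
        return ev(args[0]) == ev(args[1])
    raise ValueError(f"unknown function: {fn}")

lf = ("eq", 2, ("count", ("filter_greater", ("all_rows",),
                          "attendance", "4000")))
print(execute(lf, TABLE))  # True: two rows have attendance above 4000
```

A full executor would cover all functions of the seven logic types; the structure (recursive bottom-up evaluation with a Boolean root) stays the same.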

Evaluation Metric of Table-to-Logic
We adopt the execution accuracy of generated logical forms as the evaluation metric for our pretraining task, similar to the setting in text-to-SQL tasks (Zhong et al., 2017). Specifically, a logical form is counted as correct if it can be successfully executed on the input table and returns the Boolean value True, indicating that the table entails it.
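The metric can be sketched end to end as follows. To stay self-contained, a trivial stand-in executor is used (here a logical form is just a Python expression over the table); the scoring rule itself — count a prediction only if it executes without error and returns True — matches the definition above.

```python
# Sketch of the Execution Accuracy metric: a predicted logical form counts
# as correct only if it parses, executes on its table without error, and
# returns True. `execute` is a placeholder for a real logical-form executor.

def execute(logical_form, table):
    # stand-in executor: treat the logical form as a Python expression
    return eval(logical_form, {"table": table})

def execution_accuracy(predictions, tables):
    correct = 0
    for lf, table in zip(predictions, tables):
        try:
            result = execute(lf, table)
        except Exception:        # unparseable / non-executable form
            continue
        correct += (result is True)
    return correct / len(predictions)

tables = [[1, 2, 3], [1, 2, 3], [1, 2, 3]]
preds = ["max(table) == 3",      # executes and returns True
         "min(table) == 2",      # executes but returns False
         "max(tble) == 3"]       # execution error (typo)
print(execution_accuracy(preds, tables))  # 0.3333333333333333
```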

Pretraining Data Collection
To perform table-to-logic pretraining, we must collect enough paired data of tables and associated logical forms. The formal definition of logical forms allows us to automatically collect a large number of logical forms from tables via rule-based sampling. Here, we propose instantiating existing logical form templates to sample logical forms, similarly to how prior studies collect SQL programs (Zhong et al., 2020; Liu et al., 2021b). Specifically, we extract abstract templates from the logic schema we use. Then we adopt an execution-guided sampling method to instantiate the templates based on the context tables. Our approach has two merits: (1) By utilizing the pre-defined templates, we can control the distribution and diversity of collected logical forms. (2) With execution-guided sampling, the correctness of the collected logical forms is guaranteed.
Templatization We first extract the templates based on our logic schema. We define them as trees with typed placeholder nodes that need to be instantiated into specific functions or entities. The placeholders include two entity types: Column represents a column header, and Object represents either a textual entity or a numerical value. In addition, we categorize similar functions into smaller groups to obtain several function placeholders, which reduces the number of templates and simplifies the instantiation work. For example, FILTER represents the set of row-filtering functions. Table 7 shows the complete list of these function placeholders. The remaining functions that cannot be categorized do not need instantiation. Finally, we obtain 35 templates, an average of 5 for each logic type. More examples of the templates are provided in Appendix C.

Instantiation
We propose an execution-guided bottom-up sampling strategy to instantiate the template trees. An example of template instantiation is depicted in Figure 2. We design rules to instantiate different placeholder nodes via sampling. For example, we uniformly sample a column from the table to instantiate a Column placeholder (e.g., date in Figure 2). For a function placeholder such as FILTER, we sample a specific function from the category it represents (e.g., filter_greater in Figure 2). For each instantiated function node, we execute it, obtain the execution result, and feed the result to the parent function as an argument. Hence, the arguments of higher-level functions are guaranteed to be valid. The process proceeds bottom-up until the root function node is executed. We provide the detailed sampling rules in Appendix C. For each table, we conduct multiple trials of sampling. At each trial, we randomly select a template according to its distribution in LOGIC2TEXT and perform the instantiation. Because of the randomness in selecting functions and entities, we can obtain different results from multiple trials. A trial may sometimes fail because of execution errors, but each successful trial results in a correct logical form. We can run as many sampling trials as needed to obtain a large-scale table-to-logic corpus.
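The execution-guided instantiation loop can be sketched for the single count template of Figure 2. The sampling rules here (uniform choices, two FILTER variants) are illustrative assumptions; the real procedure covers all 35 templates and the full rule set in Appendix C.

```python
# Illustrative sketch of execution-guided, bottom-up template instantiation,
# restricted to the template in Figure 2. Sampling rules are assumptions.
import random

TABLE = {"header": ["date", "attendance"],
         "rows": [["1", "4500"], ["2", "3000"], ["3", "5200"]]}

def instantiate_count_template(table):
    # bottom-up: sample a Column, an Object, and a FILTER function, execute
    # each node, and feed results upward so higher-level arguments are valid
    col = random.choice(table["header"])                  # Column placeholder
    j = table["header"].index(col)
    values = [r[j] for r in table["rows"]]
    obj = random.choice(values)                           # Object placeholder
    fn = random.choice(["filter_eq", "filter_greater"])   # FILTER placeholder
    if fn == "filter_eq":
        rows = [r for r in table["rows"] if r[j] == obj]
    else:
        rows = [r for r in table["rows"] if float(r[j]) > float(obj)]
    if not rows:
        raise RuntimeError("failed trial")                # resample next time
    n = len(rows)                                         # execution result
    return f"eq {{ {n} ; count {{ {fn} {{ all_rows ; {col} ; {obj} }} }} }}"

random.seed(0)
for _ in range(3):   # multiple trials per table; failed trials are skipped
    try:
        print(instantiate_count_template(TABLE))
    except RuntimeError:
        pass
```

Because the count n is computed by actually executing the sampled filter, every successful trial yields a logical form that is true on the table by construction.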

The PLOG Model
In this section, we introduce our proposed model PLOG and how we conduct the seq2seq generation for the pretraining and downstream tasks.
Backbone Model We utilize the same backbone model for both the pretraining and downstream tasks to achieve knowledge transfer between them. Following prior work on controlled generation (Parikh et al., 2020), we input only the relevant part of the table rather than the full table. This avoids the over-length issue with pretrained models and the negative impact of irrelevant table information.

Numerical Pre-Computation Numerical reasoning is difficult for neural language models, especially aggregation operations (e.g., the average of numerical values) and numerical ranking (e.g., the nth-maximum value of a column). Therefore, we conduct a pre-processing step that pre-computes some potentially useful numerical values. Similar approaches have been proposed to improve fidelity in table summarization (Suadaa et al., 2021) and text-to-SQL tasks (Zhao et al., 2022). First, we compute each numerical cell's rank in its column (or within the scope of highlighted cells) and append this rank to the linearized cell representation. Hence, each table cell T_ij is serialized into a token sequence containing its value, column header, and ranks, where r+_ij indicates the rank of T_ij in column j in decreasing order and r-_ij is the rank in increasing order. Special tokens with angle brackets are used to indicate the structure of the input. In addition, we compute the average and sum of each numerical column in the input (sub-)table and append two aggregation cell strings c_sum and c_avg to the flattened table sequence: c_sum_j = <sum_cell> sum_value <col_header> Col_j </col_header> </sum_cell> and c_avg_j = <avg_cell> avg_value <col_header> Col_j </col_header> </avg_cell>.
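The pre-computation step can be sketched for a single numeric column. The `<cell>` and `<rank+>`/`<rank->` token names are assumptions for illustration (the paper's exact cell tokens are not shown here); the `<sum_cell>`/`<avg_cell>` strings follow the definition above.

```python
# Sketch of numerical pre-computation for one column: each cell gets its
# decreasing rank r+ and increasing rank r-, and <sum_cell>/<avg_cell>
# aggregation strings are appended. Cell token names are assumed.

def precompute_column(col_header, values):
    dec = sorted(values, reverse=True)   # for r+ (decreasing-order rank)
    inc = sorted(values)                 # for r- (increasing-order rank)
    cells = []
    for v in values:
        r_plus, r_minus = dec.index(v) + 1, inc.index(v) + 1
        cells.append(
            f"<cell> {v} <col_header> {col_header} </col_header> "
            f"<rank+> {r_plus} </rank+> <rank-> {r_minus} </rank-> </cell>")
    agg = (f"<sum_cell> {sum(values)} <col_header> {col_header} "
           f"</col_header> </sum_cell> "
           f"<avg_cell> {sum(values) / len(values)} <col_header> {col_header} "
           f"</col_header> </avg_cell>")
    return cells, agg

cells, agg = precompute_column("attendance", [4500, 3000, 5200])
print(cells[0])
print(agg)
```

With these ranks and aggregates in the input, superlative, ordinal, and aggregation facts can be read off directly instead of being computed by the language model.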
Finally, the input (sub-)table is serialized as a flat sequence of cell strings followed by the aggregation cells.

Model Output We linearize each logical form z into a string via a pre-order traversal of the logic tree, following Chen et al. (2020b). Special punctuation marks such as semicolons and braces are used to indicate the structural relationships between functions. For example, the logical form instance in Figure 2 is linearized into eq { 5 ; count { filter_greater { all_rows ; date ; August 5, 1972 } } }. For the downstream task, the output is instead a sentence. After pretraining a PLOG model, we directly finetune it on the downstream table-to-text tasks by changing the target from logical forms to sentences.
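The pre-order linearization can be reproduced with a few lines. The tree representation (nested tuples) is an assumption; the output string format matches the example above.

```python
# Pre-order linearization of a logic tree into the string format shown
# above: semicolons separate sibling arguments, braces mark nesting.

def linearize(node):
    if not isinstance(node, tuple):          # leaf: entity / number / column
        return str(node)
    fn, *args = node
    if not args:                             # zero-argument node, e.g. all_rows
        return fn
    return f"{fn} {{ {' ; '.join(linearize(a) for a in args)} }}"

lf = ("eq", 5, ("count", ("filter_greater",
                          ("all_rows",), "date", "August 5, 1972")))
print(linearize(lf))
# eq { 5 ; count { filter_greater { all_rows ; date ; August 5, 1972 } } }
```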

Evaluation
Metrics Following prior works (Chen et al., 2020a, 2021) on LOGICNLG, we evaluate our models with both surface-level matching metrics and logical fidelity scores. Surface-level metrics include BLEU-1/2/3, which are based on n-gram matching between the model generations and gold references. In terms of fidelity scores, prior works adopt SP-Acc and NLI-Acc. For SP-Acc, a sentence is first parsed into a logical program, and fidelity is evaluated as the execution accuracy of the program. NLI-Acc is based on TableBERT, a table-entailment model pretrained on the TabFact dataset (Chen et al., 2019), which predicts whether a table supports a sentence.
However, these two fidelity metrics are not sufficient to verify fidelity: we empirically find that the parsing algorithm for SP-Acc often generates irrelevant logical programs for the sentences, which renders the evaluation inaccurate. In addition, the TableBERT model used for NLI-Acc only achieves 65.1% accuracy on the TabFact dataset, and we find it overly positive about its predictions. To this end, we add two state-of-the-art table-entailment models for evaluation: TAPEX-large (Liu et al., 2021b) and TAPAS-large (Eisenschlos et al., 2020), which achieve 84.2% and 81.0% test accuracy on TabFact, respectively. We name the two metrics TAPEX-Acc and TAPAS-Acc. We still evaluate SP-Acc and NLI-Acc to compare our method with previous studies. For CONTLOG, we adopt the evaluation metrics of LOGIC2TEXT: BLEU-4 and ROUGE-1/2/4/L to evaluate surface-level matching, and TAPEX-Acc and TAPAS-Acc to evaluate fidelity.

Models for Comparison
For LOGICNLG, we compare our method with the following models: GPT-TabGen (sm) and GPT-Coarse-to-Fine (sm) (Chen et al., 2020a), two baselines based on pretrained GPT-2, and DCVED+GPT-TabGen (Chen et al., 2021), a de-confounded variational model with GPT-TabGen (sm) as the backbone. We also include pretrained BART-large, T5-base, and T5-large as baseline models for both LOGICNLG and CONTLOG, for which we adopt our data pre-processing method introduced in Section 5. Our models are named PLOG (BART-large), PLOG (T5-base), and PLOG (T5-large) when using the different backbones. We adopt the same input serialization strategy with numerical pre-computation for the BART, T5, and PLOG models.
Training Details We conduct our main experiments with Transformers (Wolf et al., 2020) and PyTorch (Paszke et al., 2019). During training, the parameters of the embedding layers are frozen. During inference, we adopt beam search with beam size 4 for all experiments. We set the maximum lengths to 500 and 200 for source and target sequences, respectively. Each experiment was run only once because of the time cost. On LOGICNLG, model selection is based on the BLEU-3 score on the validation set; on CONTLOG, it is based on the validation BLEU-4 score. The selection of pretraining checkpoints is based on the execution accuracy of generated logical forms on the validation set of the pretraining task. We provide detailed hyperparameters in Appendix A.

Automatic Evaluation
LOGICNLG Table 2 presents the results on LOGICNLG. The BART and T5 models with our preprocessing strategies outperform all previous models based on GPT-2 in terms of both surface-level metrics and logical fidelity scores. The PLOG models mostly outperform their base models on BLEU scores while significantly improving the logical fidelity scores on all metrics. For example, PLOG (T5-large) improves TAPEX-Acc and TAPAS-Acc over T5-large by an average of 10% accuracy. However, PLOG (T5-base) achieves lower BLEU scores, possibly because of the uncontrolled task setting of LOGICNLG. LOGICNLG does not provide highlighted cells, so the potential space for content selection is usually very large. This makes models likely to generate faithful sentences that describe different facts/contents from the gold references, causing low BLEU scores. Moreover, BLEU is based on local n-gram matching, which cannot capture the global faithfulness of generated sentences. Therefore, such surface-level metrics may not correlate well with fidelity metrics.
CONTLOG The results on CONTLOG are shown in Table 3. PLOG models outperform their base counterparts consistently on both surface-level and logical-level metrics. This suggests that adding highlighted cells to narrow down the scope of content selection enables more reliable evaluation. In addition, the consistent improvements with different backbone models demonstrate the general effectiveness of our approach.

Human Evaluation
To further investigate whether the models can generate faithful sentences, we perform a human evaluation on the outputs of the BART, T5, and PLOG models. Specifically, we randomly sample 200 examples from the test set of each dataset. We hire three human annotators to rate each sentence with a score in the discrete range between 0 and 3, according to the criteria adopted in Chen et al. (2020a). Non-sense (0): the sentence does not make sense, and people cannot understand its meaning. Wrong (1): the sentence is overall fluent, but the logic it describes is false. Partially correct (2): the sentence describes multiple facts; at least one of them is correct, but it still contains factual errors. Correct (3): the sentence is of high quality in both fluency and logical correctness. The model names are hidden from the annotators, and we collect their individual results to compute two scores for each model: (1) the average of their scores on each sampled set; (2) the fidelity accuracy, i.e., the proportion of sentences scored as correct. The evaluation is only based on the context tables, because a generated sentence may not describe the same fact as the references do but still present high quality in terms of fidelity and fluency.
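The two summary scores can be computed as follows; the ratings here are invented purely to show the computation.

```python
# Sketch of the two human-evaluation summary scores: average rating over
# the sampled set and fidelity accuracy (fraction rated Correct = 3).

def summarize(ratings):
    """ratings: one score per sentence, each in {0, 1, 2, 3}."""
    avg = sum(ratings) / len(ratings)
    fidelity_acc = sum(r == 3 for r in ratings) / len(ratings)
    return avg, fidelity_acc

avg, acc = summarize([3, 3, 2, 1, 3, 0, 3, 2])  # invented example ratings
print(avg, acc)  # 2.125 0.5
```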
As shown in Table 4, PLOG (T5-base) outperforms T5-base by a large margin on CONTLOG while it does not achieve superior results on LOGICNLG, which is inconsistent with the automatic scores. However, PLOG (T5-large) and PLOG (BART-large) achieve significant improvements over their base models on both datasets, consistent with the automatic metrics.

Table-to-Logic Results
We report the execution accuracy of our pretrained models on the table-to-logic pretraining task in Table 5. As shown, PLOG (T5-base) and PLOG (T5-large) present over 90% accuracy in generating correct logical forms, demonstrating that table-to-logic pretraining indeed improves the model's ability to derive accurate logical facts. However, PLOG (BART-large) achieves much lower accuracy. We analyzed the error cases of BART-large and found that over 90% of the errors are caused by logical form parsing errors, i.e., the generated logic string cannot be parsed into a structurally correct logical form tree because of misspelled function names and mismatched brackets. BART-large thus performs much worse than T5-base and T5-large at learning the structure of logic strings. We suppose that incorporating grammar-guided decoding methods (Wang et al., 2018) may alleviate this problem, which we leave to future work. Surprisingly, this does not affect the performance of PLOG (BART-large) on downstream tasks, showing that the model still acquired beneficial knowledge through the pretraining.

Analysis on Different Logic Types
In CONTLOG, each target sentence belongs to a predefined logic type inherited from LOGIC2TEXT, allowing us to analyze the performance of the models on different logical reasoning types. In Figure 3, we can observe that the PLOG models generally improve over their base models on most logic types, especially on superlative and ordinal. However, we still observe a considerable number of incorrect generations from all models, suggesting room for improvement in the future.

Conclusion
We proposed a table-to-logic pretraining task to enhance the fidelity of logical table-to-text generation. In addition, we constructed a controlled logical table-to-text generation task by re-purposing an existing dataset. To realize pretraining on large-scale corpora, we proposed an execution-guided sampling scheme to extract accurate logical forms from tables automatically. With table-to-logic pretraining, our table-to-text models significantly improve logical fidelity. Our work shows a novel way to utilize formal language to promote table-to-text generation, and it may be extended to other related areas such as table representation learning.

Limitations
The first limitation of our approach is that it is initialized from pretrained language models such as T5 to inherit the language generation knowledge learned from large-scale text corpora. This requires the input of PLOG to be a text sequence, which may limit the structural encoding of table inputs and logical form outputs. Although it is possible to design and pretrain a new model from scratch, the computational cost would be too large. The second limitation stems from the same choice. Because we adopt pretrained language models to perform table-to-logic and table-to-text generation, we have to serialize the input (sub-)tables to fit them into the language model encoder. Therefore, the maximum sequence length of the encoder limits the size of the input table. To address this, we only input relevant columns or highlighted cells instead of the full table to reduce the input sequence length. However, some potentially useful contextual information in the full table is omitted, which may limit model performance. The third limitation lies in the logical form schema we adopt, which is restricted to the domain of current logical table-to-text datasets. When applying our method to new downstream datasets with unseen logic types, e.g., median or proportion, the current schema should be extended to support the new logic. However, the schema is easy to extend by defining new logical operations as executable functions on tables.

Ethics Statement
This work presents PLOG, a pretrained language model for the research community to study logical table-to-text generation. In addition, we propose a new dataset, CONTLOG, for the research of controlled logical table-to-text generation. Our dataset contains Wikipedia tables, annotations (target sentences and meta information such as logic types), and highlighted table cell information. We reuse the tables and annotations of LOGIC2TEXT, a public dataset under the MIT license, and obtain the highlighted cell information with an automatic method without human annotation. We also use LOGICNLG, another public dataset, for experiments; it is also under the MIT license. All datasets are in English. In the human evaluation, we hire human annotators to evaluate the performance of our models. We recruit 3 graduate students in electrical engineering, computer science, and English majors (1 female and 2 males). Each student is paid $7.8 per hour (above the average local payment for similar jobs).

A Experimental Setting Details
The following are the hyperparameters for different model configurations. During finetuning, each pair of a base model and the corresponding PLOG model shares the same hyperparameters for a fair comparison, while these hyperparameters are tuned only with the base model. T5-base and PLOG (T5-base): hyperparameters are the same for both datasets.
The following is the additional information of each pretrained model.
• BART-large: 406M parameters, with 24 layers, 1024 hidden states, and 16 attention heads.

Pretraining Details We pretrain our models on the collected table-to-logic data and evaluate their execution accuracy on the validation set of the pretraining corpora at an interval of a certain number of steps. We take the best pretraining checkpoints and finetune them on the downstream tasks. Figure 4 presents the validation results during pretraining. We can observe that the models achieve higher accuracy when trained for more epochs. The pretraining is very time-consuming because of the large-scale pretraining data and models. For example, it takes approximately 17 hours to train PLOG (T5-base) for one epoch on the CONTLOG pretraining data, while it takes 5 days to train one epoch of PLOG (T5-large). Each experiment was done on a single NVIDIA V100 GPU. We suppose the time cost can be reduced by using more GPU resources.
For the definitions and examples of these logic types, please refer to the appendix of Chen et al. (2020b). In this section, we provide a complete list of the logical functions in Table 6, which we use to define our logical form schema.

C Details of Pretraining Data Collection
Here, we provide more details of the pretraining data collection procedure, including examples of abstract templates and the complete rules for logical form sampling.

Templates We provide in Table 8 some examples of our logical form templates, all based on the example table in Figure 1. Table 7 lists the function-type placeholders.

Instantiation The main rules we design for instantiating a logical form template by sampling from a table are as follows.

1. For a placeholder of type Column, we randomly sample a column header from the current input (sub-)table.

2. For a placeholder of type Object, the instantiation depends on the parent function node of the placeholder. If the function node is only or belongs to the category FILTER or MAJORITY, the placeholder is instantiated as a value sampled from a certain column of the current input (sub-)table. Otherwise, if the function node is eq, the placeholder is instantiated as the execution result of its sibling node; this guarantees the correctness of equality judgements.

3. The instantiation of a function-type placeholder depends on its function category, as listed in Table 7. If the placeholder belongs to the category COMPARE or MAJORITY, we choose the specific function name based on the actual relationship among its arguments. For example, the arguments of COMPARE functions are two objects whose relationship (equal, greater, less, etc.) can be pre-computed, so we can determine the actual function from this relationship. If the placeholder belongs to any other category, the function is uniformly sampled from the function set.
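As a rough illustration of these rules, the following sketch instantiates a toy count-template bottom-up over a small table. The table, function names in the serialized string, and helper functions are hypothetical stand-ins, not the paper's actual implementation; the key point is the eq case of rule 2, which reuses the sibling's execution result so the equality holds by construction.

```python
import random

# Toy table: a list of rows, each mapping column headers to values.
TABLE = [
    {"nation": "canada", "gold": 3},
    {"nation": "mexico", "gold": 2},
    {"nation": "norway", "gold": 3},
]

def sample_column(table, rng):
    # Rule 1: a Column placeholder is a randomly sampled header.
    return rng.choice(sorted(table[0].keys()))

def sample_object(table, column, parent_fn, sibling_result, rng):
    # Rule 2: for FILTER/MAJORITY parents, sample a real cell value so the
    # instantiated condition is satisfiable on the table; for an `eq`
    # parent, reuse the sibling's execution result so the equality
    # judgement is correct by construction.
    if parent_fn == "eq":
        return sibling_result
    return rng.choice([row[column] for row in table])

def instantiate_count_template(table, rng):
    # Template: eq { count { filter_eq { all_rows ; Column ; Object } } ; Object }
    col = sample_column(table, rng)
    val = sample_object(table, col, "filter_eq", None, rng)
    count = sum(1 for row in table if row[col] == val)   # execute bottom-up
    rhs = sample_object(table, col, "eq", count, rng)    # rule 2, eq case
    logic = f"eq {{ count {{ filter_eq {{ all_rows ; {col} ; {val} }} }} ; {rhs} }}"
    return logic, count == rhs

rng = random.Random(0)
logic_str, holds = instantiate_count_template(TABLE, rng)
```

Because the right-hand side of eq is filled with the executed count, `holds` is True for any sampled column and value.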

D Case Study
We further conduct a case study with qualitative examples of model generations. As presented in Figure 5, the PLOG models can generate logically correct sentences involving complex reasoning, while the base models often fail to describe correct facts for the table.

Figure 2: An example of instantiating a logical form template. The colored nodes in the template indicate nodes that do not need instantiation, while the white-background nodes are typed placeholders. The dotted arrows indicate instantiation. We instantiate these white nodes with a bottom-up, execution-guided sampling approach, which finally yields a logical form instance. Column and Object indicate a column header and an object (entity/number), respectively. FILTER indicates the category of row-filtering functions, and all_rows is a special entity representing the entire table.

Figure 3: The human evaluation results of different models on the different logic types of CONTLOG. The y-axis indicates the number of samples scored as correct. Full indicates the number of samples of each logic type among the 200 human evaluation samples.

Figure 4: Validation results of table-to-logic pretraining with T5-base and T5-large as the backbones. The results of LOGICNLG pretraining and CONTLOG pretraining are shown at different intervals for better illustration. The results within the first 160k steps are not computed.
Figure 1: Examples of the tasks and the training procedure of our proposed PLOG model. Task (a) is the table-to-logic pretraining task we propose; task (b) is the downstream logical table-to-text task we target. The yellow-colored table cells are annotated as control features for the CONTLOG task, while for LOGICNLG such highlighted cells are not available. We collect table-to-logic datasets for CONTLOG and LOGICNLG separately, perform intermediate pretraining of pretrained language models on the collected data, and then finetune the models on the downstream tasks.

have attempted to annotate logical forms to guide the text generation and proposed the LOGIC2TEXT dataset. With logical forms as mediators conveying accurate logical-level facts, models can focus on surface realization from the associated logical forms and achieve high fidelity. However, annotating accurate logical forms for textual descriptions requires intensive human effort. Moreover, generating from a self-contained logical form is actually a different task from table-to-text generation. Prior studies on this dataset

Table-to-Text Generation Early table-to-text generation tasks are limited to surface-level generation, with little focus on logical inference

Table 1: Statistics of the downstream tasks and their corresponding table-to-logic pretraining data.
Table Source and Data Collection We collect pretraining data separately for the two datasets, LOGICNLG and CONTLOG. For each dataset, we use the tables in its training data as the source tables to avoid potential data leakage. In addition, we remove sampled logical forms that appear in LOGIC2TEXT, since they are semantically equivalent to some of the target sentences in CONTLOG. To evaluate the performance of table-to-logic models and enable the selection of pretrained models, we also split the collected logical forms into train/val/test sets. The statistics of the pretraining data and their corresponding downstream datasets are shown in Table 1. Although we could sample more logical forms with more trials, we find the current pretraining data sufficient to obtain the desired experimental results.
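The de-duplication and splitting step described above can be sketched as follows; the logical-form strings, function name, and split fractions are made-up placeholders, not the paper's actual pipeline.

```python
import random

def build_pretraining_splits(sampled, logic2text_forms, seed=0,
                             val_frac=0.1, test_frac=0.1):
    """Drop sampled logical forms already present in LOGIC2TEXT, then
    split the remainder into train/val/test sets (a simplified stand-in
    for the actual data-collection pipeline)."""
    seen = set(logic2text_forms)
    kept = [lf for lf in sampled if lf not in seen]
    rng = random.Random(seed)
    rng.shuffle(kept)
    n_val = int(len(kept) * val_frac)
    n_test = int(len(kept) * test_frac)
    return {
        "val": kept[:n_val],
        "test": kept[n_val:n_val + n_test],
        "train": kept[n_val + n_test:],
    }

# Hypothetical sampled forms; two of them collide with LOGIC2TEXT.
splits = build_pretraining_splits(
    sampled=[f"lf_{i}" for i in range(100)],
    logic2text_forms={"lf_3", "lf_7"},
)
```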
We transfer the knowledge learned in the pretraining task to the table-to-text downstream task. Theoretically, any text generation model applies to our task, such as GPT-2 (Radford et al., 2019), BART, and T5. We test different backbone models, including BART-large, T5-base, and T5-large.
cells in row-wise order. For CONTLOG, we only concatenate the highlighted table cells as the input, as suggested by prior work on controlled table-to-text generation.
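The two input formats (all cells in row-wise order for LOGICNLG, only the highlighted cells for CONTLOG) can be sketched as below; the separator tokens and function names are illustrative assumptions, not necessarily the linearization used in the paper.

```python
def linearize_full_table(title, header, rows):
    # LOGICNLG-style input: concatenate all cells in row-wise order.
    cells = " ".join(
        f"{header[j]} : {row[j]} ;" for row in rows for j in range(len(header))
    )
    return f"title : {title} . {cells}"

def linearize_highlighted(title, header, rows, highlighted):
    # CONTLOG-style input: concatenate only the highlighted (row, col) cells.
    cells = " ".join(
        f"{header[c]} : {rows[r][c]} ;" for r, c in highlighted
    )
    return f"title : {title} . {cells}"

src = linearize_highlighted(
    "medal table", ["nation", "gold"],
    [["canada", "3"], ["mexico", "2"]],
    highlighted=[(0, 0), (0, 1)],
)
```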

Table 2: The experimental results of different models on the test split of LOGICNLG. For the previous models, we compute the TAPEX-Acc and TAPAS-Acc only for the two models that have released official outputs. We compare each pair of base and PLOG models and mark the better scores in bold.

Table 3: The experimental results of different models on the test split of CONTLOG. We compare each pair of base and PLOG models and mark the better scores in bold.

Table 4: The human evaluation results of different models. AVG is the average score and ACC is the accuracy of logical fidelity. The average inter-annotator agreement, measured by Fleiss' Kappa, is 0.82.

Table 5: Experimental results of different PLOG models on the validation and test sets of table-to-logic generation. The scores are reported as Execution Accuracy.

table without considering gold references, because the generated sentences may
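For reference, Execution Accuracy can be sketched as the fraction of generated logical forms that execute to True on their source tables. The executor interface and the toy "logical forms" below are hypothetical stand-ins for illustration, not the paper's implementation.

```python
def execution_accuracy(predictions, tables, execute):
    """Fraction of generated logical forms whose execution on the
    source table yields True (a paraphrase of the metric)."""
    if not predictions:
        return 0.0
    correct = sum(bool(execute(lf, tbl)) for lf, tbl in zip(predictions, tables))
    return correct / len(predictions)

# Toy executor for demonstration: a "logical form" is a Python predicate
# over the table, so `execute` simply calls it.
preds = [lambda t: len(t) == 2, lambda t: t[0]["gold"] > 5]
tabs = [[{"gold": 1}, {"gold": 2}]] * 2
acc = execution_accuracy(preds, tabs, lambda lf, tbl: lf(tbl))
```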

Table 6: Function definitions of the logic schema, borrowed from (Chen et al., 2020b). Each row gives the function name, its argument types, its return type, and its semantics.

filter_eq/not_eq | view, header string, object | view | returns the subview whose values under the header column are equal/not equal to argument 3
filter_greater/less | view, header string, object | view | returns the subview whose values under the header column are greater/less than argument 3
filter_greater_eq/less_eq | view, header string, object | view | returns the subview whose values under the header column are greater than/less than or equal to argument 3
filter_all | view, header string | view | returns the view itself, for the case of describing the whole table
all_eq/not_eq | view, header string, object | bool | returns whether all the values under the header column are equal/not equal to argument 3
all_greater/less | view, header string, object | bool | returns whether all the values under the header column are greater/less than argument 3
all_greater_eq/less_eq | view, header string, object | bool | returns whether all the values under the header column are greater than/less than or equal to argument 3
most_eq/not_eq | view, header string, object | bool | returns whether most of the values under the header column are equal/not equal to argument 3
most_greater/less | view, header string, object | bool | returns whether most of the values under the header column are greater/less than argument 3
most_greater_eq/less_eq | view, header string, object | bool | returns whether most of the values under the header column are greater than/less than or equal to argument 3
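To make these semantics concrete, here is a sketch of a few of the functions over a toy "view" (a list of rows mapping headers to values). This is an illustrative re-implementation under the assumption that "most" means more than half, not the authors' code.

```python
# A "view" is a list of rows; each row maps header strings to values.
def filter_eq(view, header, obj):
    # Subview whose values under `header` are equal to `obj`.
    return [row for row in view if row[header] == obj]

def filter_greater(view, header, obj):
    # Subview whose values under `header` are greater than `obj`.
    return [row for row in view if row[header] > obj]

def all_eq(view, header, obj):
    # Whether all values under `header` are equal to `obj`.
    return all(row[header] == obj for row in view)

def most_greater(view, header, obj):
    # Whether most (assumed: more than half) of the values under
    # `header` are greater than `obj`.
    return sum(row[header] > obj for row in view) > len(view) / 2

all_rows = [
    {"nation": "canada", "gold": 3},
    {"nation": "mexico", "gold": 2},
    {"nation": "norway", "gold": 3},
]
```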

Table 7: The categorized functions for template abstraction. Functions in the same category share the same argument definitions.