SORTIE : Dependency-Aware Symbolic Reasoning for Logical Data-to-text Generation

Logical data-to-text generation is a representative task for measuring the capabilities of both language generation and complex reasoning. Despite the introduction of reasoning skills in generation, existing works still rely on neural language models to output the final table description. However, due to the inefficacy of neural language models in complex reasoning, these methods inevitably have difficulty working out key entities in the description and might produce unfaithful descriptions. To alleviate these issues, we propose a dependency-aware symbolic reasoning framework that reasons out each entity in the table description with our designed table-compatible programming language. To figure out the dependency relationship among entities, we devise an entity scheduling mechanism that determines the order of programme synthesis such that the reasoning of an entity only relies on other "resolved" entities. Experiments on three datasets and three backbones show that ours outperforms previous methods not only in surface-level fidelity but also in logical fidelity. Notably, the proposed framework enhances GPT-2, BART and T5 with an absolute improvement of 5.7% ∼ 11.5% on SP-Acc.


Introduction
Generating logically consistent sentences is an integral part of human intelligence and has recently attracted broad research interest in natural language processing (Chen et al., 2020a,c; Wei et al., 2022; Creswell et al., 2022; Kazemi et al., 2022). One of the most prominent attempts to investigate this capability in neural models is logical data-to-text generation (Chen et al., 2020a), which requires conducting intricate reasoning over tables. To bridge contemporary deep learning methods and symbolic AI, we draw inspiration from recent neural symbolic literature (Gao et al., 2022) that decouples complex reasoning from language generation. Specifically, we delegate the inference of entities mentioned in the table description to a programme interpreter. The interpreter executes our generated python-like programme, thus working out the entities correctly and alleviating hallucination.
However, synthesizing such a programme to infer entities is not trivial due to two major challenges. First, though there are some domain-specific programming languages for natural text (Chen et al., 2020b; Gupta et al., 2019), we need to design a table-compatible and easy-to-execute programming language to support reasoning over entities in the table. Second, the entities to infer are not independent but exhibit a complex dependency relationship, and the inference of one might rely on the others. For instance, as shown in Figure 1, we cannot count the appearances of 16150 unless we work out the value 16150 first. Thus, figuring out the synthesis order of the entities is fundamental to the reasoning process. To make matters worse, there is no human annotation of programmes or synthesis order for the entities.
To mitigate the aforementioned problems, we propose SORTIE (SymbOlic Reasoning with enTIty schEduling), a framework that reasons out each named entity in the table description with dependency-aware symbolic reasoning. Specifically, (1) we introduce a table-compatible programming language that defines the grammar and operators for reasoning over tabular data and delegates the reasoning of each entity to the execution of a programme; (2) we devise a new pipeline that predicts the dependency relationship between entities and synchronously synthesizes the programmes to work out each entity; (3) we heuristically search pseudo labels for both the programmes and the synthesis order of entities, and further adjust the sample weights of the pseudo labels with a self-adaptive training algorithm to alleviate the spurious correlation issue.
To summarize, our contributions are three-fold: (1) To the best of our knowledge, we are the first to model the dependency relationship between entities in the table description, and we propose a new pipeline that synchronously predicts the order of entities and reasons them out one by one. (2) We successfully apply symbolic reasoning to logical data-to-text generation. To support the reasoning of entities, we design a table-compatible python-like programming language that is more feature-rich and table-friendly than previous ones. (3) We empirically validate the efficacy of SORTIE on three benchmarks for logical data-to-text generation: LogicNLG (Chen et al., 2020a), Logic2Text (Chen et al., 2020c), and SciGen (Moosavi et al., 2021). When applied to GPT-2, BART or T5, our method substantially enhances SP-Acc, a crucial measurement of logical fidelity, with an absolute improvement of 5.7% ∼ 11.5%.

Data-to-text Generation
Early data-to-text generation mainly focuses on surface-level descriptions of table contents (Lebret et al., 2016; Liu et al., 2018; Ma et al., 2019; Wang et al., 2020). However, despite their generation fluency, neural generation models struggle to perform rich inference based on the facts in a table (Chen et al., 2020a,c). To make up for that, logical table-to-text generation is proposed as a new task with the aim of generating logically consistent descriptions from open-domain tables (Chen et al., 2020a,c).
In recent years, to endow neural models with complex reasoning ability, DCVED (Chen et al., 2021) applies causal intervention to reduce the spurious correlation in entities. PLOG (Liu et al., 2022) and TABT5 (Andrejczuk et al., 2022) introduce table-to-logical-form generation or table denoising as self-supervision tasks in the pre-training stage. Similarly, REASTAP (Zhao et al., 2022) introduces 7 pre-training tasks to mimic 7 types of human reasoning skills. It is worth noting that this line of research is orthogonal to ours, since it primarily concentrates on constructing training instances that reflect the desired reasoning skills. Similar to the programming language in our proposal, Saha et al. (2022) introduce a logic string as an intermediate step to guide generation. However, the surface realization from the logic string to the final description is very prone to hallucination, as it is done purely by neural language models.

Symbolic Reasoning
The idea of symbolic reasoning has garnered considerable attention in numerous natural language processing and computer vision tasks. Andreas et al. (2016) propose parsing questions into linguistic substructures and constructing question-specific deep networks from smaller modules that each tackle one subtask. Following this work, numerous efforts have been made to directly predict instance-specific network layouts in an end-to-end manner (Hu et al., 2017), to alleviate the requirement for intermediate supervision on semantic parsers (Hu et al., 2018; Mao et al., 2019), to infer the answer with a purely symbolic executor (Yi et al., 2018), and to conduct visual co-reference resolution (Kottur et al., 2018). Very recently, Gupta et al. (2019) and Chen et al. (2020b) concurrently proposed neural symbolic approaches to answer questions in machine reading comprehension, demonstrating advantages in numerical reasoning and interpretability. Compared with these tasks, which only need to derive a single entity or value, logical table-to-text generation requires generating a complete natural language sentence containing multiple entities and logical types.

Problem Formulation and Overview
Given a table T, we decompose the generation process into two steps: first, generating a template Ỹ in which the named entities are replaced with placeholders; and second, reasoning out the entities [e_1, e_2, ⋯, e_n] to fill in the placeholders to form a complete description Y. We focus on the second step in this work while following Chen et al. (2020a) for the first step.

Road Map
We first introduce our designed table-compatible programming language in § 3.2. The architecture of our model, composed of three main components, is illustrated in § 3.3. Finally, the learning algorithm that deals with the scarcity of human annotation is described in § 3.4.

Table-compatible programming language
To reason out the named entities faithfully from the table, we introduce a programming language composed of a series of specially designed operators and named entities as operands.We list our operators in Table 1.
Based on the type of output, we roughly sort all the operators into three categories: value operators, list operators, and boolean operators. Borrowed from Chen et al. (2020b), the value operators are designed to select a value from the table (SELECT, MAX and MIN) and to perform simple arithmetic (SUM, DIFF, DIV and COUNT). Apart from that, since the data in a table is laid out in columns, with each column containing homogeneous information, we design list operators (FILTER and UNIQUE) and index operators (ARGMAX, ARGMIN, ARGWHERE) that act on a single column to obtain a new list or the indices of a list, respectively. Finally, we also include boolean operators (EQ, GE, LE, GEQ, LEQ) as an integral part of FILTER and ARGWHERE.
Compared with the domain-specific languages in Chen et al. (2020b) and Gupta et al. (2019), the major novelty of ours lies in its compatibility with structured tabular data, for example, the list operators that accurately pick one or more specific values from a table according to a requirement. We note that a concurrent work (Zhou et al., 2022) also puts forward a table-supported programming language. Different from ours, it only operates on linearized tables in natural language form and does not support raw structured tables. Generally speaking, we extrapolate the traditional symbolic reasoning operators in reading comprehension to a more complex scenario. At the same time, our operators keep compositionality, the ability to generate complex programmes by compositionally applying the operators. We leave more detailed discussion of the connections to other domain-specific languages to Appendix A.
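To make the operator semantics concrete, the following is a minimal Python sketch of a few operators from Table 1 applied to a toy table. The columnar table layout (a dict of column lists) and the exact operator signatures are our illustrative assumptions, not the paper's actual implementation.

```python
# A minimal sketch of a few table-compatible operators; the table is a dict
# mapping column names to lists of homogeneous cell values.

def SELECT(table, column, index):
    """Value operator: pick the cell at `index` from `column`."""
    return table[column][index]

def FILTER(table, column, predicate):
    """List operator: row indices of `column` whose value satisfies `predicate`."""
    return [i for i, v in enumerate(table[column]) if predicate(v)]

def ARGMAX(table, column):
    """Index operator: index of the largest value in `column`."""
    return max(range(len(table[column])), key=lambda i: table[column][i])

def COUNT(values):
    """Value operator: number of elements in a list."""
    return len(values)

def EQ(x):
    """Boolean operator, used as the condition inside FILTER / ARGWHERE."""
    return lambda v: v == x

table = {"team": ["A", "B", "C"], "attendance": [16150, 12000, 16150]}

# Compositional programme (mirroring the Figure 1 example): first work out the
# peak attendance, then count how many times it appears.
peak = SELECT(table, "attendance", ARGMAX(table, "attendance"))   # 16150
times = COUNT(FILTER(table, "attendance", EQ(peak)))              # 2
```

Note how `times` can only be computed after `peak` is resolved, which is exactly the dependency relationship the entity scheduling mechanism must respect.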

Main Components
The overall workflow of the proposed method is illustrated in Figure 2. In a nutshell, it is composed of three parts: (1) encoding, (2) entity scheduling, and (3) programme synthesis and execution, which we elaborate on below.
Encoding. Given a table T, we first linearize the table into a natural language form following Chen et al. (2020a). Then we concatenate the linearized table with the template into a single sequence and transform it into a dense representation H^enc with a pre-trained language model (PLM), where l is the total length of the linearized table and the template. During the training phase, the template is obtained by substituting the entities in the golden description with placeholders. At inference, the template is generated by the same PLM.
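The linearization step can be sketched as below. The exact verbalization used by Chen et al. (2020a) may differ; this cell-by-cell scheme is one common, illustrative choice.

```python
# Sketch of table linearization: each cell is verbalized as
# "row <r> , <column> is <value> ." and the pieces are concatenated.
def linearize(table, title):
    cells = []
    for col, values in table.items():
        for row, v in enumerate(values):
            cells.append(f"row {row + 1} , {col} is {v} .")
    return f"Given the table titled {title} . " + " ".join(cells)

print(linearize({"team": ["A", "B"], "score": [3, 1]}, "example"))
```

The resulting string is what gets concatenated with the template and fed to the PLM encoder.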
Entity Scheduling. As mentioned before, entities within a description are not isolated; there exists a latent dependency relationship among them. If the entities are reasoned out in chronological order (i.e., from left to right), the programmer may struggle to synthesize a suitable programme when faced with entities whose dependencies are not yet solved. To this end, we devise an entity scheduling mechanism that dynamically selects a to-be-solved placeholder that depends only on currently known entities.
In detail, we employ a 1-layer GRU to realize scheduling. At the t-th step, given the entity reasoned out at the last step, we concatenate its word embedding with the dense representation of the corresponding placeholder as input. The former provides the semantics of the last entity, while the dense representation of the placeholder carries the contextual and positional information in the template, which is helpful for reasoning out the next placeholder. The input is used to update the inner hidden state h^s_{t−1} of the GRU. Then, we calculate the probability of selecting a placeholder in the template Ỹ according to the similarity between h^s_t and the embeddings of the placeholders:

Pr(P_i) = softmax_i(f_sim(h^s_t, H^enc_{P_i})),

where H^enc_{P_i} is the slice of H^enc corresponding to the i-th placeholder in the template, and f_sim(•, •) is a similarity function implemented as the dot product. Pr(P_i) is the probability of selecting the i-th placeholder in the template to solve at the t-th step. We choose the placeholder with the highest probability and use its dense representation to initialize the hidden state of the programmer, which will be illustrated later. To deal with the non-differentiability of selecting a single placeholder, we apply gumbel-softmax (Jang et al., 2016) during training. For more details and the specific implementation, please refer to Appendix B.

Weak Supervision
Since human annotations of entity scheduling and entity programmes are absent, we initiate the learning of the proposed model with weak supervision.

Weak Supervision on Programme Synthesis. Heuristically tagging pseudo labels is a common practice to address the paucity of human annotation. Following previous works (Min et al., 2019; Chen et al., 2020b), we collect a group of the most common programmes as a heuristic set H. For every entity e_i that appears in the description, we exhaustively enumerate the heuristic set H and find a subset, S_{p_i}, of programmes that derive the target entity; these serve as programme candidates for e_i. More details about S_{p_i} can be found in Appendix C.1.

Weak Supervision on Entity Scheduling. To deal with the paucity of annotation on entity dependency relationships, again, we explore constructing supervision signals with pseudo programmes.
Algorithm 1 The proposed learning algorithm.
 …
 5: Replace the entities in Y to obtain a template Ỹ.
 6: For all the placeholders P_1, P_2, … do
 …
 8:     if a topological order T exists and |D| ≤ β then
 9:         D = D ∪ {(T, Ỹ, P, T)}.
10:     end if
11: end for
12: Calculate the likelihood for each suite of programmes and scheduling order in D.
 …
16: Optimize p_θ and q_ϕ on D according to Eq. 3.
17: end for
18: Return: programme synthesis model p_θ and entity scheduling model q_ϕ.

Specifically, we define an n-tuple of programme candidates (p_1, p_2, ⋯, p_n) as a suite of programmes P, where p_i is a programme candidate for the i-th placeholder. In fact, it is an element of the Cartesian product S_{p_1} × S_{p_2} × ⋯ × S_{p_n}. For any P, we construct a dependency graph with an edge pointing from entity e_i to entity e_j if the reasoning of e_j depends on e_i. If the dependency graph is a directed acyclic graph (DAG), then we use its topological order T as a possible candidate for an entity scheduling order.
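The dependency-graph check can be sketched with Python's standard-library topological sorter. The dict-based programme representation and the `args` field are illustrative assumptions; the paper's programmes are operator sequences, but the DAG test is the same.

```python
# Sketch: derive a candidate scheduling order from a suite of programmes.
# Each programme is a toy dict whose "args" may mention other entities;
# an entity depends on every entity its programme references.
from graphlib import CycleError, TopologicalSorter

def schedule(suite):
    """suite: {entity: programme}. Returns a topological order over the
    entities, or None if the dependency graph contains a cycle (not a DAG)."""
    graph = {e: {a for a in p["args"] if a in suite} for e, p in suite.items()}
    try:
        return list(TopologicalSorter(graph).static_order())
    except CycleError:
        return None

suite = {
    "16150": {"op": "MAX", "args": ["attendance"]},
    "2":     {"op": "COUNT", "args": ["attendance", "16150"]},  # needs 16150
}
print(schedule(suite))  # "16150" precedes "2"
```

Any suite whose graph is cyclic yields no topological order and is discarded, matching the DAG condition in Algorithm 1.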

Self-adaptive Training
Although we can obtain more than one suite of programmes for an entity and many possible scheduling orders through weak supervision, usually only one is correct while the others are spurious solutions (Min et al., 2019). Inspired by Huang et al. (2020), we employ a self-adaptive learning algorithm to eliminate the influence of spurious correlations during training.
Given all suites of programmes and corresponding scheduling orders D = {(P_i, T_i)}_{i=1}^m for a template, where m is the number of programme suites with a legal topological order, we maintain a soft pseudo label w_i for each suite, initialized to 1/m. For each iteration, we calculate and normalize the likelihood of each suite with the programmer to obtain [ŵ_1, ŵ_2, ⋯, ŵ_m], then update the pseudo labels by w_i ← α × w_i + (1 − α) × ŵ_i, where α is a hyper-parameter serving as the momentum of the exponential-moving-average scheme. The learning objective of the programmer and entity scheduling is then defined as:

L(θ, ϕ) = − Σ_{i=1}^{m} w_i [log p_θ(P_i) + log q_ϕ(T_i)],

where p_θ and q_ϕ represent the programme synthesis and entity scheduling models, with trainable parameters θ and ϕ respectively.
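The pseudo-label update can be sketched numerically as follows. This is a minimal illustration of the exponential-moving-average scheme, assuming the model supplies per-suite log-likelihoods; the function names are ours.

```python
# Sketch of the self-adaptive pseudo-label update: soft labels start uniform
# (1/m) and are blended with the model's normalized likelihoods via an
# exponential moving average with momentum alpha.
import math

def update_pseudo_labels(w, log_likelihoods, alpha=0.9):
    """w: current soft labels over the m candidate suites;
    log_likelihoods: model log-likelihood of each suite."""
    z = max(log_likelihoods)                          # for numerical stability
    exp = [math.exp(l - z) for l in log_likelihoods]
    total = sum(exp)
    w_hat = [e / total for e in exp]                  # normalized likelihoods
    return [alpha * wi + (1 - alpha) * whi for wi, whi in zip(w, w_hat)]

m = 3
w = [1 / m] * m                                   # uniform at the beginning
w = update_pseudo_labels(w, [-0.1, -5.0, -5.0])   # first suite dominates
```

Over iterations, suites the programmer consistently prefers accumulate weight, while spurious suites are gradually down-weighted rather than being trusted outright as in maximum marginal likelihood.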
A high-level learning algorithm is summarized in Algorithm 1. We leave the specific implementation of the training strategy to Appendix C.3 due to space constraints.

Datasets
We conduct experiments on three benchmark datasets for logical table-to-text generation: LogicNLG (Chen et al., 2020a), Logic2Text (Chen et al., 2020c) and SciGen (Moosavi et al., 2021). The test set of SciGen was split by the data owners into the "Computation and Language" (C&L) domain and the "Other" domain, which primarily contains examples from "Machine Learning" (ML) papers. More details about these three datasets can be found in Appendix D.

Evaluation Metrics
Automatic Evaluation. We evaluate the surface-level and logical fidelity of all models, as described in previous works (Chen et al., 2020a, 2021). For surface-level fidelity, we calculate multi-reference BLEU-n (abbrv. B-n, n = 1, 2, 3). In terms of logical fidelity, we employ SP-Acc and NLI-Acc following previous works (Chen et al., 2020a, 2021). The former measures logical consistency through a semantic parser, while the latter evaluates the degree of entailment. More specific implementations of the automatic evaluation metrics are provided in Appendix E.1.
Human Evaluation. We conduct the human evaluation by randomly selecting 300 samples from the test sets of LogicNLG, Logic2Text and SciGen respectively, and hiring 6 well-educated native speakers to conduct qualitative analysis on the descriptions produced by our model and all competitive baselines. The annotators assess the descriptions' quality on two criteria: Language Fluency and Factual Correctness. Each annotator assigns a score from {0, 1, 2} (representing "bad", "fair" and "good" respectively) to each description for each aspect, and Fleiss' Kappa (Fleiss, 1971) is used to gauge the level of agreement between annotators. We leave more details about the setup of the human evaluation to Appendix E.2.

Baseline Models
The following models are selected as baselines: (1) GPT-Coarse-to-Fine: A template-based model that first generates a global logical structure of the description with all entities and numbers replaced by "[ENT]", and then conducts surface realization based on the logical structure (Chen et al., 2020a).
(2) DCVED: A variational auto-encoder model that employs a confounder to represent the spurious entities and a mediator to represent the precisely picked entities (Chen et al., 2021).

Main Results
Table 2 and Table 3 show the performance of our model on LogicNLG, Logic2Text and SciGen. From the tables, we can observe that ours substantially outperforms previous methods, especially on SP-Acc and NLI-Acc, which proves the effectiveness of the proposed method. The comparison with PLOG and REASTAP, two representative methods that learn reasoning skills through pre-training, suggests that symbolic reasoning together with our table-compatible programming language helps promote faithfulness.
Human Evaluation. The human evaluation results show that ours performs best in terms of factual correctness, which is consistent with the automatic evaluation results. All kappa values are above 0.6, demonstrating substantial agreement between the annotators.

Ablation Study
Apart from the main experiments, to better understand how each component and mechanism contributes to surface-level and logical fidelity, we conduct an ablation study with the following variants: (1) -symbolic: the programmer and the discrete symbolic reasoning are removed; (2) -scheduling: the entity scheduling mechanism is removed and entities are reasoned out in left-to-right order; (3) -self: the self-adaptive training is replaced with maximum marginal likelihood over the pseudo programmes and topological orders.
The ablation results are shown in Table 5. We can observe that: (1) Both symbolic reasoning and topological entity decoding are vital to the performance of our approach, since removing either causes an evident drop in fidelity.
(2) The surface-level fidelity is less sensitive to different variants, and the chief advantage of our approach lies in improving logical fidelity.

Effect of the Pseudo Label Quantity
To see how the proposed learning algorithm behaves with respect to the number of (P, T) pairs, we vary the maximum threshold for the pseudo labels (the β in Algorithm 1). The performance of the variant -self, which amounts to maximum marginal likelihood (MML), is also included for comparison, and the results are shown in Figure 3.
It is obvious that the fidelity under MML optimization deteriorates as the size of the pseudo label set increases. We conjecture this is because there is usually only one correct topological order and programme for each entity; more candidates inevitably introduce noise and mislead the model into assigning high probabilities to spurious solutions. Notably, our method is immune to spurious solutions, thus exhibiting a different tendency and remaining competitive.

Effect of Entity Scheduling
To take a closer look at how the complexity of the inter-dependency relationship influences the precision of entity reasoning and how the entity scheduling mechanism takes effect, we bin all the test cases of LogicNLG (Chen et al., 2020a) into three buckets according to the length of the longest directed path l_dep in the dependency graph. The results are shown in Figure 4. We can see that with entity scheduling, the precision fluctuates only slightly across different l_dep and does not show an obvious drop in performance. In comparison, when scheduling is removed and the entities are inferred in left-to-right chronological order, reasoning performance declines, possibly because the model cannot handle more complicated dependency scenarios and has to work out all entities without considering their dependencies. Take the case in Figure 1 as an example.

Conclusion
We propose a neural symbolic approach for logical data-to-text generation. With a table-compatible programming language, our approach automatically synthesizes a programme to reason out each entity. Specifically, to handle the inter-dependency between entities, we propose an entity scheduling mechanism that dynamically predicts the reasoning order of entities such that the entity to be reasoned out at each iteration has minimal dependency on "unseen" entities. In addition, to deal with the paucity of human annotations of both programmes and scheduling order, we put forward a weak supervision method and a self-adaptive learning algorithm that mitigates the spurious correlation issue. Evaluation results on three benchmarks show that our model significantly outperforms state-of-the-art approaches and considerably boosts the performance of pre-trained language models in terms of logical fidelity.

Ethical Considerations
This paper does not pose any ethical problems. First, logical data-to-text generation is a well-established task in natural language processing, and several papers on this task have been published at ACL venues. Second, the datasets used in this paper have been used in previous papers.

Limitations
This paper presents a dependency-aware symbolic reasoning approach for logical data-to-text generation. All technologies built upon large-scale PLMs more or less inherit their potential harms (Bender et al., 2021). Besides, we acknowledge some specific limitations of our method: 1. Data-to-text generation is essentially a one-to-many problem, since there is more than one plausible and logically consistent description for a given table. Our approach has little control over the diversity and the logical form of the generated template. It is also possible that our approach only generates trivial or naive descriptions if trivial data dominate the training dataset.
2. Our work mostly focuses on the named entities in the description, but logical consistency is not all about entities.The syntactic structure or other semantic information also has an influence on generation fidelity, and we leave the symbolic reasoning for more complex logical structures or formats as our future work.
3. Our table-compatible programming language is mainly designed for simple flat tables, and extra operators are necessary before it can be applied to all tables, especially hierarchical tables whose headers exhibit a multi-level structure (Cheng et al., 2022).
4. Currently, it is difficult to directly integrate GPT-3 (Brown et al., 2020) into our framework.

A.1 Connection to NMN

… (Chen et al., 2020b) to implement operations with parameter-free modules that can be directly executed through an "interpreter".

A.2 Connection to NeRd
NeRd (Chen et al., 2020b) is a neural symbolic model that integrates discrete reasoning into reading comprehension. In addition to value operations (e.g., DIFF/SUM, COUNT, MAX/MIN, ARGMAX/ARGMIN, which are identical to the value operators in our programming language), NeRd also designs operations for picking spans or numbers from the passage and question, including SPAN, VALUE and KEY-VALUE. Similar to NMN, NeRd is incompatible with tabular data. Although we can flatten a structured table into an unstructured natural language form, this simple strategy loses feasibility when operating on raw table data. For example, we can directly fetch a column of data from a raw table by using the name of the column, but with NeRd, we must repeatedly invoke the SPAN operation and predict the start and end indices of each span.

A.3 Connection to UniRPG
We notice that a contemporaneous work, UniRPG (Zhou et al., 2022), also proposes a collection of domain-specific operations to carry out discrete reasoning on tables. Although our value operators (e.g., SUM, DIFF, and DIV) and those of UniRPG have some similarities, there still exist crucial differences. In what follows, we first briefly describe the operations in UniRPG and then detail how our table-compatible language differs from it.
A Brief Review of Operations in UniRPG. In general, the operations in UniRPG can be roughly categorized into atomic operations and higher-order operations. For atomic operations, aside from the SPAN and VALUE introduced in NeRd, UniRPG also introduces CELL and CELL_VALUE, which possess similar functionality to SPAN and VALUE respectively but operate on the linearized table. For higher-order operations, UniRPG enriches the original set of arithmetic operations in NeRd by introducing: 1. MULTI_SPANS, which returns all the extracted text fragments or numbers; 2. TIMES and DIV, which compute the product and quotient of two numbers; 3. AVG, which returns the average value of the argument numbers; 4. CHANGE_R, which outputs the rate of change between two numbers.
Differences from UniRPG. The core difference is that UniRPG supports complex reasoning on tabular data simply by linearizing a structured table into unstructured natural language, whereas ours operates directly on the structured table and is thus able to capture the structural relationships among cell values. To put it more plainly, tabular data is essentially relational data, and cell values in the same row (or column) share some common feature or attribute. In light of this, our designed programming language can easily fetch a column of cell values and analyze it further (with MAX/MIN, COUNT and so on), or associate two cell values in a row (by combining SELECT and ARGWHERE). On the other hand, when an operation on a whole column is necessary, UniRPG relies entirely on the programmer to predict the start and end indices of all cells in the column of interest and to pick them out one by one with the CELL operation. Another difference lies in searching for a specific list that satisfies some conditions. A number of operations take a list of cells as one of their arguments. For instance, in order to count the games with the largest number of attendees, we need to pass a list of games meeting the requirement to the COUNT operation. In these cases, UniRPG infers the list by independently deriving its elements with the CELL operation. Such an approach is hard to scale to situations where many more objects meet the requirements and need to be retrieved, since it demands that the programmer understand the intricate semantics of the natural language (e.g., the meaning of "largest") to precisely predict the start and end indices of each object. In contrast, our model can directly operate on the raw table and shrink the scope by recursively invoking FILTER and the boolean operations, which shifts the responsibility for predicting indices from the programmer to symbolic execution.

B.1 Template Generation

We replace all the detected named entities in the table description Y with a special placeholder "[ENT]" to obtain a template Ỹ, and finetune a PLM on (T, Ỹ) pairs, following previous work (Chen et al., 2020a).

B.2 Entity Scheduling
At the t-th step, given the last reasoned entity e_{λ_{t−1}}, we obtain its semantic representation h^ent_{λ_{t−1}} by looking up and averaging the word embeddings of all sub-tokens in the entity. Then, we update the inner hidden state of the GRU by

h^s_t = GRU(f_mlp([h^ent_{λ_{t−1}}; H^enc_{P_{λ_{t−1}}}]), h^s_{t−1}),

where f_mlp(•) is a multi-layer perceptron network and [•; •] denotes the concatenation of two vectors. At inference, the next placeholder is chosen with the argmax operation. However, argmax is not differentiable and hinders gradient propagation from the subsequent programme synthesis back to the scheduling and encoding parts. To solve this problem, at the training phase we apply gumbel-softmax (Jang et al., 2016) to sample the next placeholder, where τ is the temperature.
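The gumbel-softmax sampling step can be sketched as follows. This is a plain-Python illustration of the relaxation (Jang et al., 2016); a real implementation would operate on differentiable tensors (e.g., in an autograd framework), which is what makes the trick useful for training.

```python
# Sketch of Gumbel-softmax sampling over placeholder scores: perturb the
# logits with Gumbel noise, then apply a temperature-scaled softmax to get a
# differentiable, nearly one-hot distribution over placeholders.
import math
import random

def gumbel_softmax(logits, tau=1.0):
    # Gumbel(0, 1) noise: -log(-log(U)), U ~ Uniform(0, 1)
    gumbels = [-math.log(-math.log(random.random() + 1e-20)) for _ in logits]
    scores = [(l + g) / tau for l, g in zip(logits, gumbels)]
    z = max(scores)                       # subtract max for numerical stability
    exps = [math.exp(s - z) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]      # soft one-hot over placeholders

probs = gumbel_softmax([2.0, 0.5, 0.1], tau=0.5)
```

Lower temperatures τ push the output closer to a hard one-hot selection while keeping the sampling step differentiable with respect to the logits.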

B.3 Program Synthesis and Execution
At the t-th time step, we first update the hidden state h^p_{t−1} of the 1-layer GRU responsible for programme synthesis with the embedding of the last generated operator/operand:

h^p_t = GRU(f_emb(op_{t−1}), h^p_{t−1}),

where f_emb(•) is an embedding function that converts a programme operator/operand to its embedding, and op_{t−1} is the operator/operand generated at the (t−1)-th step. The definition of f_emb is divided into three cases:

• If op_t is from the resolved entities, then f_emb(op_t) = E^ent 1_{ω(op_t)}, where ω(•) returns the index of op_t in the resolved entities [e_{λ_1}, ⋯, e_{λ_{l_e}}] and 1_ω is a one-hot vector with a one at index ω and zeros otherwise. E^ent is the embedding matrix of the resolved entities, built from their semantic representations with a trainable parameter W^ent, where l_e denotes the number of entities resolved so far;

• If op_t is from the linearized table or the template, then f_emb(op_t) = E^enc 1_{ω(op_t)}, where ω(•) returns the index of op_t in the linearized table with the template. E^enc serves as the embedding matrix of the table and template, obtained from H^enc with a trainable parameter W^enc, where H^enc is the dense representation of the linearized table with the template as defined in § 3.3;

• If op_t is from the reserved operators, then f_emb(op_t) = E^res 1_{ω(op_t)}, where ω(•) returns the index of op_t in the reserved operators, and E^res is the embedding matrix of the reserved operators, implemented as a trainable parameter.
After that, h^p_t attends over the operator/operand embeddings [E^ent; E^enc; E^res] and the dense representations H^enc to obtain a context-aware representation h̃^p_t:

h̃^p_t = W^att [h^p_t; f_o−att(h^p_t); f_h−att(h^p_t)],

where W^att is a trainable parameter and f_o−att(•) returns the attended representation of the operator/operand embeddings, i.e., an attention-weighted sum of the columns of [E^ent; E^enc; E^res] with weights given by their similarity to h^p_t. The attended representation of the dense representations H^enc, f_h−att(h^p_t), is defined in a similar way. The following step is to predict the next token op_t using h̃^p_t. We first compute the similarity score between h̃^p_t and each column in [E^ent; E^enc; E^res], where [•; •; •] means concatenating the three matrices along the column axis, and then choose op_t corresponding to the index with the highest similarity score.
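The final prediction step reduces to scoring every candidate token embedding against the context-aware state and taking the argmax. A minimal sketch, with toy 2-dimensional embeddings of our own choosing:

```python
# Sketch of next-token prediction by dot-product similarity: score each
# candidate operator/operand embedding against the context-aware state and
# return the highest-scoring token.
def pick_next(h, embeddings):
    """h: context-aware state (list of floats);
    embeddings: {token: vector} for entities, table cells and operators."""
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    return max(embeddings, key=lambda tok: dot(h, embeddings[tok]))

emb = {"FILTER": [1.0, 0.0], "COUNT": [0.0, 1.0], "16150": [0.7, 0.7]}
print(pick_next([0.9, 0.1], emb))  # "FILTER"
```

Because the candidate set mixes resolved entities, table/template tokens and reserved operators, the same scoring rule lets the programmer emit operands and operators uniformly.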
Finally, we execute the generated programme [op_1, ⋯, op_{l_p}] on the table T to reason out the entity e_{λ_{l_e+1}}.

C.1 Programme Heuristic set
When searching for the possible programme candidates of an entity, we exhaustively search within a heuristic set H listed below, which includes the most common and typical "programme templates" in tabular reasoning. Specifically, we fill <list_name> and <value> with all the possible column names and cell values in the table to instantiate each "programme template" into a real programme. If the execution result of a programme is the correct entity, we add the instantiated programme to the candidate set S_{p_i} for the entity e_i. The templates are by no means complete or able to cover all possible situations, but we find them sufficient in our experiments.
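The instantiate-and-check search can be sketched as below. The three toy templates and their executor format are our illustrative assumptions; the paper's heuristic set H is larger, but the fill-slots-then-compare loop is the same idea.

```python
# Sketch of pseudo-label search: instantiate every programme template with
# every column name (and, where needed, cell value), execute it, and keep the
# programmes whose result equals the target entity.
def instantiate(table, target):
    # templates taking only a column name
    H0 = {"MAX": lambda col: max(table[col]),
          "MIN": lambda col: min(table[col])}
    # templates taking a column name and a value
    H1 = {"COUNT_EQ": lambda col, v: sum(1 for c in table[col] if c == v)}
    cands = []
    values = {v for col in table.values() for v in col}
    for name, execute in H0.items():
        for col in table:
            if execute(col) == target:
                cands.append((name, col))
    for name, execute in H1.items():
        for col in table:
            for v in values:
                if execute(col, v) == target:
                    cands.append((name, col, v))
    return cands

table = {"attendance": [16150, 12000, 16150]}
print(instantiate(table, 2))  # [("COUNT_EQ", "attendance", 16150)]
```

Note that the candidate for the entity 2 references the cell value 16150, which is exactly how entity-to-entity dependencies (and hence the scheduling signal) arise from the pseudo programmes.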

C.3 Self-adaptive Training
We calculate ŵ for a pair of programme suite and scheduling order (P, T), where P = (p_{λ_1}, p_{λ_2}, ⋯, p_{λ_n}) is a suite of programmes and λ_i is the index, in left-to-right chronological order, of the i-th entity in the topological order.
We first calculate the log-likelihood of each (P, T) pair in a case, which factorizes into two parts: the likelihood of the programme suite and the likelihood of the scheduling order. Finally, we normalize the likelihood among all possible (P, T) pairs in a case. We also train the PLM to predict the template Ỹ given the table T during training, optimizing it with maximum likelihood estimation.
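The normalization step amounts to a softmax over the per-pair log-likelihoods; a numerically stable sketch (the exact weighting in the paper may additionally involve the temperature-like hyperparameter β from Appendix F, which is omitted here):

```python
import math

def normalize_weights(log_likelihoods):
    """Turn per-(P, T) log-likelihoods into normalized weights that sum
    to 1 over all candidate (programme suite, scheduling order) pairs
    of one case, using the log-sum-exp trick for stability."""
    m = max(log_likelihoods)
    exps = [math.exp(ll - m) for ll in log_likelihoods]
    z = sum(exps)
    return [e / z for e in exps]

# Three candidate (P, T) pairs with decreasing log-likelihood.
w = normalize_weights([-1.0, -2.0, -3.0])
```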

D Dataset Statistics
We conduct experiments on the following three benchmarks for logical data-to-text generation: LogicNLG (Chen et al., 2020a). This dataset is constructed based on TabFact (Chen et al., 2019) by taking the statements entailed by the tabular knowledge as the target text. Tables in this dataset are crawled from Wikipedia and cover a wide range of topics.
Logic2Text (Chen et al., 2020c). This dataset is collected by employing AMT workers to label the statement of each table. Specifically, the workers are encouraged to choose diversified logic types and write descriptions in a creative tone rather than using template-like terminology. Although the data owners also provide logic forms, we only use the table-description pairs, following the setting of prior work (Chen et al., 2021).
SciGen (Moosavi et al., 2021). This dataset is established by collecting tables from scientific articles along with their corresponding descriptions. The tables in SciGen mostly contain numerical values, and arithmetic reasoning is required to synthesize the descriptions. The test set was split by the data owners into the "Computation and Language" (C&L) domain and the "Other" domain, which primarily contains examples from machine learning (ML) papers. The table-description pairs in the training and development sets are taken from C&L articles. We choose the medium-size variant in our experiments.
To facilitate reproducibility, we adopt the datasets shared by the data owners and conduct preprocessing strictly following the released code. The statistics of these three datasets can be found in Table 6.

E More Details about Evaluation Metrics
E.1 Automatic Evaluation
We evaluate the surface-level fidelity and the logical fidelity of all models, as described in previous works (Chen et al., 2020a, 2021). All automatic evaluation metrics are calculated using the official code released at https://github.com/wenhuchen/LogicNLG.

E.2 Human Evaluation
According to Chen et al. (2020a,c), automatic evaluation scores are not sufficient for a precise evaluation of factual and logical correctness. We therefore conduct a human evaluation: we randomly select 300 samples from the test sets of LogicNLG, Logic2Text and SciGen respectively, and hire 6 undergraduates from the department of linguistics in our school to conduct a qualitative analysis of the descriptions produced by our model and all competitive baselines. We pay 20 cents per case. To obscure their sources, the generated descriptions are shuffled at random. Annotators assess each description on two criteria: (1) Language Fluency: whether the description is fluent and free of grammatical errors; and (2) Factual Correctness: whether the description is factually supported by the table. Each annotator assigns a score from {0, 1, 2} (representing "bad", "fair" and "good" respectively) to each description for each aspect. Each description thus receives two scores, and Fleiss' Kappa (Fleiss, 1971) is used to gauge the agreement among annotators.
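For reference, Fleiss' kappa can be computed from per-item category counts as follows; this is a generic textbook implementation of the statistic, not the paper's evaluation script:

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of per-item category counts.

    `ratings[i][j]` is the number of annotators assigning category j to
    item i; every item must be rated by the same number of annotators n.
    """
    N = len(ratings)            # number of items
    n = sum(ratings[0])         # annotators per item
    k = len(ratings[0])         # number of categories
    # Mean observed per-item agreement P-bar
    P_bar = sum(
        (sum(c * c for c in row) - n) / (n * (n - 1)) for row in ratings
    ) / N
    # Chance agreement P_e from the marginal category proportions
    p = [sum(row[j] for row in ratings) / (N * n) for j in range(k)]
    P_e = sum(x * x for x in p)
    return (P_bar - P_e) / (1 - P_e)

# Perfect agreement: two items, two annotators, two categories -> kappa = 1.
kappa = fleiss_kappa([[2, 0], [0, 2]])
```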

F More Implementation Details about Experiments and Hyperparameters
For template generation, we perform experiments on three backbones: GPT-2 (117M), BART-large (406M), and T5-large (770M); theoretically, any pre-trained language model could serve as the backbone. We employ beam search with a beam size of 5. For entity scheduling and programme synthesis, the hidden-state dimensions of the two 1-layer unidirectional GRUs are both 512. The temperature for Gumbel-softmax is τ = 1.0 and is kept unchanged throughout training. The f_mlp in entity scheduling is a 2-layer MLP whose hidden sizes are both 512. We apply greedy search when decoding programme tokens. For self-adaptive learning, we set α and β to 0.9 and 5 respectively, and the pseudo labels are kept fixed for the first M_0 = 500 training steps. All models are trained with the Adam optimizer with β_1 = 0.9 and β_2 = 0.999. We sweep the learning rate over [5e-6, 1e-5, 2e-5, 4e-5, 6e-6, 8e-5]; the best-found learning rate is 1e-5. We sweep the batch size over [16, 32, 64, 128, 256]; the best-found batch size is 32. We set the weight decay to 1e-2 and sweep the warm-up steps over [500, 1000, 2000, 4000]; the best-found value is 1000. Early stopping on the validation set is adopted as a regularization strategy. All models are trained on an 8×RTX 3090 Ti machine for 5 hours. We report performance averaged over three repeated runs.
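The Gumbel-softmax used for entity scheduling can be sketched as below; this is the standard reparameterization trick with the fixed τ = 1.0 mentioned above, not the paper's exact training code:

```python
import math, random

def gumbel_softmax(logits, tau=1.0, rng=random):
    """Sample a soft one-hot vector by adding Gumbel(0, 1) noise to each
    logit and applying a temperature-tau softmax."""
    noisy = []
    for l in logits:
        u = max(rng.random(), 1e-12)         # guard against log(0)
        g = -math.log(-math.log(u))          # Gumbel(0, 1) sample
        noisy.append((l + g) / tau)
    m = max(noisy)                           # stable softmax
    exps = [math.exp(x - m) for x in noisy]
    z = sum(exps)
    return [e / z for e in exps]

sample = gumbel_softmax([1.0, 2.0, 0.5], tau=1.0)
```

As τ → 0 the samples approach hard one-hot vectors while remaining differentiable with respect to the logits, which is what lets the scheduling decision be trained end-to-end.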

G.1 More Analysis about Effects of Entity Scheduling
To better understand how the entity scheduling mechanism promotes the precision of entity reasoning, we bin all test cases of LogicNLG (Chen et al., 2020a) into four bins according to the number of entities in the description. The results are shown in Figure 5. We observe a trend similar to Figure 4: as the number of entities increases, the variant -scheduling exhibits evident deterioration. We conjecture that a table description with more entities is more likely to have complicated dependency relationships among entities, and is thus more difficult to reason out. With entity scheduling, however, the precision is barely affected by the number of entities. Note that the templates used in this experiment are derived from the golden descriptions rather than generated by the PLM. To investigate whether entity scheduling leads to serious latency at inference, we measure the decoding time of SORTIE in comparison with the variants -scheduling and -symbolic and the baseline Coarse-to-Fine, all with a BART-large backbone. The experimental results are shown in Table 7.
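The "length of the longest directed path" used for binning in Figure 4 can be computed over the entity dependency DAG with a standard topological pass; a generic sketch (the graph encoding below is an assumption, not the paper's data structure):

```python
from collections import deque

def longest_path(num_entities, edges):
    """Length in edges of the longest directed path in an entity
    dependency DAG, where (u, v) in `edges` means entity v depends on
    entity u. Processes nodes in topological (Kahn) order."""
    succ = {i: [] for i in range(num_entities)}
    indeg = [0] * num_entities
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1
    dist = [0] * num_entities  # longest path ending at each node
    queue = deque(i for i in range(num_entities) if indeg[i] == 0)
    while queue:
        u = queue.popleft()
        for v in succ[u]:
            dist[v] = max(dist[v], dist[u] + 1)
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    return max(dist)

# Chain e0 -> e1 -> e2 plus an independent e3: longest path has 2 edges.
depth = longest_path(4, [(0, 1), (1, 2)])
```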

H Case Study
To gain an intuitive insight into the strengths of SORTIE, we show the predicted programmes and the topological orders of several cases from LogicNLG in Table 8, Table 9 and Table 10.
Template example (from Figure 2): "… are [ENT1] games with the attendance of [ENT2], the highest attendance in 1992-93 Vancouver Canucks season."

Figure 2: Working flow of SORTIE. The table and template are first encoded into dense representations, and the entity scheduling mechanism dynamically picks the placeholders (i.e., "[ENT2]" and "[ENT1]") that will be resolved by programme synthesis and execution.

Figure 3: SP-Acc vs. the maximum number of pseudo labels on LogicNLG.

Figure 4: Precision vs. the length of the longest directed path in the dependency graph.

Figure 5: Entity reasoning precision vs. the number of entities.

Table 1: The operators used in our symbolic reasoning.
(3) PLOG: Proposed by Liu et al. (2022), this model is first pre-trained on a table-to-logic generation task and then fine-tuned on downstream table-to-text tasks. (4) REASTAP: Zhao et al. (2022) propose 7 table reasoning skills and construct training examples for each, learning the skills by pre-training on generative table QA tasks.

The human evaluation results are shown in Table 4. Although our model performs comparably to the baselines in terms of language fluency, it attains a significant improvement in factual correctness.

Table 2: Automatic evaluation results on the test sets of LogicNLG and Logic2Text. From top to bottom, the models in the three blocks use GPT-2-small, T5-large and BART-large as backbones respectively. Numbers in bold are the best results.

Table 3: Automatic evaluation results on the two test splits of SciGen. From top to bottom, the models in the two blocks use T5-large and BART-large as the backbone respectively. Numbers in bold are the best results.

Table 4: Human evaluation results on LogicNLG, Logic2Text and SciGen. Numbers in bold indicate the best performance.

Table 5: Ablation experiment results on the test set of LogicNLG.
or other LLMs into SORTIE to substitute the PLM backbones. The reason is that an LLM cannot be used for encoding, since we have no access to the dense representations inside an LLM. It might be plausible to use an LLM only to generate the template and another PLM for encoding, but we leave this exploration to future work.
Following Chen et al. (2020a), our table encoding flattens a structured table into an unstructured paragraph, i.e., "table linearization". Supposing T_{i,j} is the value of the table cell in the i-th row and j-th column, we transform the original table T into a paragraph by horizontally scanning each cell, T_{1,1} → … → T_{1,C_T} → … → T_{R_T,C_T}, where R_T and C_T are the numbers of rows and columns in the table respectively. For example, the table in Figure 2 is linearized as "Given the table titled 1992-1993 Vancouver Canucks Season, in row 1, the Date is May 2, the Visitor is Los Angeles, the Score is 2-5, the Series is 1-0, the Attendance is 16150; in row 5, the Date is May 5, the Visitor is …"
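The linearization above can be sketched as a small formatting function; this follows the templated wording of the example, though the exact separators in the released code may differ:

```python
def linearize(title, header, rows):
    """Flatten a table into the templated paragraph used for encoding,
    scanning cells row by row (T_{1,1} -> ... -> T_{R,C})."""
    parts = [f"Given the table titled {title},"]
    for i, row in enumerate(rows, start=1):
        cells = ", ".join(
            f"the {col} is {val}" for col, val in zip(header, row)
        )
        parts.append(f"in row {i}, {cells};")
    return " ".join(parts)

text = linearize(
    "1992-93 Vancouver Canucks Season",
    ["Date", "Visitor", "Score", "Series", "Attendance"],
    [["May 2", "Los Angeles", "2-5", "1-0", "16150"]],
)
```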

Table 6 :
Table 6: Statistics of the three datasets.
For logical fidelity, SP-Acc evaluates whether the logic forms, parsed from the generated description with a semantic parser, are consistent with the table's facts, while NLI-Acc evaluates the entailment score between the table and the generated description based on a pre-trained Table-BERT (Chen et al., 2019).

Table 7: Average inference time (ms) of SORTIE and three other variants or baselines. CTF = Coarse-to-Fine.

From Table 7, we can see that SORTIE has latency comparable to -symbolic and -scheduling; in other words, programme synthesis and entity scheduling do not enhance generation fidelity at the sacrifice of decoding speed. Besides, SORTIE costs much less time than Coarse-to-Fine, since the latter requires a PLM to first generate a template and then a complete description, which results in low efficiency.

Table 8: A table from LogicNLG with the caption New Zealand Open (badminton).
We can see that our model is able to compositionally assemble simple operators into a complicated programme sequence. When executed, the programme emits appropriate and faithful entities to fill in the placeholders, which might account for the impressive fidelity.