Reasoning Like Program Executors

Reasoning over natural language is a long-standing goal for the research community. However, studies have shown that existing language models are inadequate at reasoning. To address the issue, we present POET, a novel reasoning pre-training paradigm. Through pre-training language models with programs and their execution results, POET empowers language models to harvest the reasoning knowledge possessed by program executors via a data-driven approach. POET is conceptually simple and can be instantiated by different kinds of program executors. In this paper, we showcase two simple instances, POET-Math and POET-Logic, in addition to a complex instance, POET-SQL. Experimental results on six benchmarks demonstrate that POET can significantly boost model performance in natural language reasoning, such as numerical reasoning, logical reasoning, and multi-hop reasoning. POET opens a new avenue for reasoning-enhancement pre-training, and we hope our analysis will shed light on future research on reasoning like program executors.


Introduction
Recent breakthroughs in pre-training illustrate the power of pre-trained Language Models (LMs) on a wide range of Natural Language (NL) tasks. Pre-training on self-supervised tasks, such as masked language modeling (Devlin et al., 2019; He et al., 2021) using large amounts of NL sentences, boosts the language understanding of models by a large margin (Wang et al., 2018a). However, existing pre-training paradigms have primarily focused on language modeling and paid little attention to advanced reasoning capabilities (Table 1). As a result, though reaching near-human performance on several tasks, pre-trained LMs are still far behind expectations in reasoning-required scenarios (Rae et al., 2021), such as numerical reasoning (Wallace et al., 2019; Ravichander et al., 2019) and logical reasoning (Yu et al., 2020; Liu et al., 2020).
To alleviate the deficiency, reconciling NL understanding in LMs and reasoning in symbolic representations, i.e., neuro-symbolic reasoning, has been a major area of interest (Besold et al., 2017; Zhang et al., 2021). With a hybrid architecture, i.e., symbolic reasoners attached to LMs, neuro-symbolic reasoning shines in a variety of reasoning tasks (Chen et al., 2020c; Tu et al., 2020; Wolfson et al., 2020). However, the reasoning mechanism remains in the symbolic reasoner and is not internalized into LMs, making it difficult to reuse on unseen tasks. Meanwhile, neural models are notorious for relying on correlations among concrete tokens of a representation system and are usually assumed to struggle with the abstract rules of a symbolic reasoner (Helwe et al., 2021; Sinha et al., 2021). This drives us to explore whether symbolic reasoning can be internalized by language models and, especially: Can neural language models advance reasoning abilities by imitating symbolic reasoners?

Table 1: Demonstration of five representative reasoning types, listing the type, an example question, the representative dataset, and the corresponding task (e.g., Quantitative; Hypothesis: "Teva earns $7 billion a year."; Premise: "After the deal closes, Teva will generate sales of about $7 billion a year, the company said."; dataset EQUATE (Ravichander et al., 2019); task Natural Language Inference (NLI)). [DOC] and [TAB] indicate the start of a passage and a semi-structured table, respectively. Here we regard Question, Conclusion, and Hypothesis as sentence, and Passage, Fact, Context, and Premise as natural context in Figure 1.
Motivated by this, we conceive a new pre-training paradigm, POET (Program Executor), to investigate the learnability of language models from symbolic reasoning and their transferability across distinct representation systems. As illustrated in Figure 1, with a program (e.g., a SQL query) and its program context (e.g., a database) as input, the model receives automatic supervision from an established program executor (e.g., MySQL) and learns to produce the correct execution result. By imitating program execution procedures, we believe LMs could potentially learn the reasoning knowledge that humans adopted to create the associated program executor and tackle NL sentences with the learned reasoning capability. This reveals the key hypothesis of POET: program executors are crystallized knowledge of formal reasoning, and such knowledge can be grasped by language models and transferred to NL reasoning via pre-training. In other words, pre-training over natural language might be a contingent condition for LMs to acquire better reasoning capabilities over natural language.

This contingency assumption brings POET another great merit in data quality: while it is typically difficult to obtain large amounts of clean natural language sentences containing clear evidence of reasoning, synthesized programs can be made arbitrarily complicated yet remain readily available at any scale, thanks to the artificial and compositional nature of programming languages. These merits greatly facilitate the construction of high-quality corpora, addressing most of the unresolved shortcomings in previous reasoning-enhancement pre-training. In other words, POET differs from existing pre-training paradigms that rely on noisy NL data.

In summary, our contribution is three-fold:
• We propose POET, a new pre-training paradigm for boosting the reasoning capabilities of language models by imitating program executors. Along with this paradigm, we present three exemplary across-program POET instantiations for various reasoning capabilities.
• We show with quantitative experiments that the reasoning ability our models obtain from POET pre-training is transferable to broader natural language scenarios. On six reasoning-focused downstream tasks, POET enables general-purpose language models to achieve competitive performance.
• We carry out comprehensive analytical studies, summarize insightful open questions, and provide insights for future work. We hope these insights will shed light on more research on reasoning like program executors.

Related Work
Since we focus on reasoning over natural language, our work is closely related to the following lines of work.

Reasoning Skills The literature covers reasoning skills including numerical reasoning (Dua et al., 2019; Li et al., 2022a), multi-hop reasoning (Yang et al., 2018), reasoning in hybrid contexts (Chen et al., 2020b; Zhu et al., 2021), and logical reasoning (Liu et al., 2020; Yu et al., 2020). Our work concentrates on improving the above reasoning skills, leaving other reasoning abilities such as commonsense reasoning (Zellers et al., 2018; Talmor et al., 2019; Bhagavatula et al., 2020) for future work.
Reasoning via Specialized Models Early works typically design specialized models and augment LMs with them for different types of questions (Dua et al., 2019; Andor et al., 2019; Hu et al., 2019; Ding et al., 2019). Taking Hu et al. (2019) as an example, they first predicted the answer type of a given question (e.g., "how many"), and then adopted the corresponding module (e.g., a count module) to predict the answer. Although these methods work well on a specific dataset, it is challenging for them to scale to complex reasoning scenarios (Chen et al., 2020c). In contrast, our work follows the line of reasoning via pre-training, which enjoys better scalability.
Reasoning via Pre-training This line of work focuses on the continued pre-training of LMs using large-scale data that involves reasoning. The pre-training data are generally NL text, either crawled from the Web with distant supervision (Deng et al., 2021), generated by a model-based generator (Asai and Hajishirzi, 2020), or synthesized via human-designed templates (Geva et al., 2020; Yoran et al., 2022; Campagna et al., 2020; Wang et al., 2022). However, large-scale, high-quality textual data involving reasoning is difficult to collect (Deng et al., 2021). Meanwhile, as the complexity of desired reasoning operations increases, synthesizing high-quality (e.g., fluent) NL sentences becomes more challenging. Different from the above pre-training methods relying on NL data, our pre-training is performed on programs. These programs can be synthesized at any scale with high quality, and thus are much easier to collect.
Reasoning in Giant Language Models Recent works demonstrate that with proper prompting (e.g., chain-of-thought prompting), giant language models (e.g., GPT-3) can perform well on reasoning tasks (Wei et al., 2022; Kojima et al., 2022; Li et al., 2022b). For example, Wei et al. (2022) find that giant language models can perform complex reasoning step by step with few-shot examples. Although these prompting strategies do not need further fine-tuning, their basic assumption is similar to that of POET, i.e., it is difficult to obtain large amounts of clean sentences involving complex reasoning. However, these prompting strategies do not work well for non-giant language models, while POET is applicable to language models ranging from millions of parameters (e.g., BART) to billions (e.g., T5-11B). It is also interesting to investigate how these prompting strategies and POET can be combined.

Learning to Execute Programs Another related line of work trains neural networks to execute programs, covering question answering (Andreas et al., 2016), reading comprehension (Gupta et al., 2019; Khot et al., 2021), knowledge base question answering (Ren et al., 2021), and 3D rendering (Tian et al., 2019). These works mainly focus on learning a neural network to represent the program executor, while ours focuses on transferring the knowledge of the program executor to downstream tasks via pre-training.

Reasoning Like Program Executors
Reasoning is the process where deduction and induction are sensibly applied to draw conclusions from premises or facts (Scriven, 1976). As a supreme feature of intelligence, humans apply reasoning across modalities. Taking numerical reasoning as an example, humans can tell how many chocolates were consumed from a math word problem description, or from a real-world event where a mother gets off work and finds the choco-can empty, with the guilty-looking kids standing beside it, brownish stains on their faces. Through detachment of information from its superficial modality and symbolic abstraction, humans manage to unify input formats and condense their numerical reasoning knowledge into one executable symbolic system: this is the origin of an arithmetic program executor. If a model can master these reasoning skills by imitating program executors, we believe in the possibility of transferring those reasoning skills to different modalities. In our case, we expect language models to transfer reasoning to NL-related tasks. Given this motivation, we discuss the fundamental components of POET in the rest of this section and present its instantiations later.
Program refers to a finite sequence of symbols that can be understood and executed by machines.
For example, a program can be a logical form (e.g., Prolog), a piece of code (e.g., Python), or a math expression. Compared with NL sentences, programs are more formal: each well-established program follows a specific set of grammar rules and can thus be synthesized systematically. The generalizability of the POET framework makes no assumption about a particular language; it derives from the grammar rules that a program follows. In POET, as long as a program returns meaningful output reflecting its computational procedure, it is an acceptable program.
Program Context is the environment in which a program runs, holding numerous variables accessible to the program. These variables serve as pivot points that anchor the program context to the program. In the same sense, the question and the passage in reading comprehension hold a similar relationship. This suggests a natural analogy between program-to-program-context and sentence-to-natural-context in Figure 1.

POET with Singleton Executors
We first instantiate POET with two singleton (i.e., a single type of reasoning capability) executors and then move on to POET with integrated executors.

Learning from Math Calculators
POET-Math (Left in Figure 3) aims at injecting numerical reasoning skills into LMs via learning from math calculators. Specifically, POET-Math is designed to boost the basic arithmetic skills (i.e., addition and subtraction) of LMs on downstream tasks. This arithmetic skill aligns with the requirements of questions centered on addition/subtraction between two numbers, such as "What is the difference in casualty numbers between Bavarian and Austrian?".
Pre-training Task Given several floating-point variables as the program context and a math expression only involving addition/subtraction as the program, the pre-training task of POET-Math is to calculate the math expression. Taking the leftmost example in Figure 3, receiving the concatenation of the program and the program context as input, POET-Math is trained to output the number 180.7. Considering that the output can be an arbitrary number, the encoder-decoder model (Lewis et al., 2020) is more suitable for this pre-training task.
Pre-training Corpus Each example in the corpus contains a math expression with up to 2 operators and 3 variables, and a program context that contains at most 30 floating-point variables. The addition and subtraction operators are denoted by + and -, respectively. The values of variables vary from 0.0 to 1000.0. By random generation, we synthesize 4 million examples as the pre-training corpus for POET-Math.
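As an illustrative sketch of the random generation described above (function and variable names are ours, not part of the released pipeline):

```python
import random

def synthesize_math_example(max_vars=30, max_ops=2):
    """Synthesize one POET-Math pre-training example: a program context of
    floating-point variables and an addition/subtraction expression over them."""
    n_vars = random.randint(3, max_vars)
    context = {f"x{i}": round(random.uniform(0.0, 1000.0), 1) for i in range(n_vars)}
    n_ops = random.randint(1, max_ops)  # up to 2 operators, i.e. up to 3 variables
    operands = random.sample(list(context), n_ops + 1)
    program = operands[0]
    for name in operands[1:]:
        program += random.choice([" + ", " - "]) + name
    # The executor (here, plain arithmetic) provides the supervision signal.
    answer = round(eval(program, {}, context), 1)
    return program, context, answer
```

The model then receives the concatenation of `program` and a serialization of `context`, and is trained to generate `answer`.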

Learning from Logic Solvers
POET-Logic (Mid in Figure 3) aims at injecting logical reasoning skills into LMs via learning from logic solvers: given several premise statements as the program context and a conclusion statement as the program, the model is trained to output whether the premises imply the conclusion.

Pre-training Corpus Each example in the corpus contains several premise statements and a conclusion statement. Initially, the statement collection for each example is empty. To produce it, we first allocate 5 Boolean variables (e.g., p and q in Figure 3) and randomly sample at most 8 pairs from their pairwise combinations. For each sampled pair (p, q), we randomly select a statement from the set {p → q, p → ¬q, ¬p → ¬q, ¬p → q} and add it to the collection. Once the statement collection is prepared, we randomly select a statement as the conclusion statement (i.e., program) and take the rest as the premise statements (i.e., program context). Last, we employ Z3 (De Moura and Bjørner, 2008), the well-known satisfiability modulo theories solver, as our program executor to obtain the implied result. Finally, we synthesize 1 million examples as the pre-training corpus for POET-Logic, of which nearly 16% correspond to True.
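The corpus uses Z3 as the executor; as a hypothetical stand-in, with only 5 Boolean variables the same entailment label can be computed by brute-force truth-table enumeration (the encoding of statements below is ours, for illustration):

```python
from itertools import product

def implies(a, b):
    """Material implication over booleans."""
    return (not a) or b

def entails(premises, conclusion, n_vars=5):
    """Return True iff every assignment satisfying all premise implications
    also satisfies the conclusion. Each statement is a triple
    (i, j, (neg_i, neg_j)) encoding [not] p_i -> [not] p_j."""
    def holds(stmt, assign):
        i, j, (neg_i, neg_j) = stmt
        a = assign[i] != neg_i  # apply optional negation to the antecedent
        b = assign[j] != neg_j  # and to the consequent
        return implies(a, b)

    for assign in product([False, True], repeat=n_vars):
        if all(holds(p, assign) for p in premises) and not holds(conclusion, assign):
            return False  # counterexample found
    return True
```

For example, from p0 → p1 and p1 → p2 the conclusion p0 → p2 is entailed, while p2 → p0 is not.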

Preliminary Observation
We perform experiments on DROP and LogiQA to verify whether our method improves the reasoning capability required by each dataset. As observed in Figure 4, POET-Math boosts the numerical reasoning performance on DROP, and POET-Logic likewise improves the logical reasoning performance on LogiQA.

POET with Integrated Executors
POET-Math and POET-Logic each focus on one specific reasoning skill, making the pre-training task heavily dependent on the downstream task. Different from them, POET-SQL is proposed to allow LMs to master different reasoning skills simultaneously. In our implementation, POET-SQL is pre-trained with an integrated SQL executor, since we believe that SQL queries are complex enough to encompass a wide variety of computational procedures (Table 2).
Pre-training Task Given a SQL query as the program and a database as the program context, the pre-training task of POET-SQL is to mimic query result generation. As shown on the right side of Figure 5, given the concatenation of the program and the program context, the model is pre-trained to output the query result. Since encoder-decoder LMs can generate arbitrary tokens, they are well suited for the task. On the other hand, encoder-only LMs have insufficient expressiveness to produce out-of-context query results. To allow them to benefit from SQL execution, we tailor the task into a query result selection task for encoder-only LMs, which only utilizes query results that can be found in the database. Specifically, the task requires encoder-only LMs to perform IO sequence tagging to find the query results in the database, as shown on the left side of Figure 5. Note that the tag I is for tokens in the query results (e.g., Athens), while O is for other tokens.
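A minimal sketch of deriving IO tags for the query result selection task (assuming whitespace tokenization and token-level matching, which simplifies the actual setup):

```python
def io_tags(flattened_db_tokens, result_tokens):
    """Tag each database token I if it appears in the query result, else O.
    Simplified: marks every occurrence of each result token, rather than
    aligning full result spans."""
    result_set = set(result_tokens)
    return ["I" if tok in result_set else "O" for tok in flattened_db_tokens]
```

For instance, if the query result is "Athens", only the matching database token receives the I tag; all other tokens are tagged O.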
Pre-training Corpus Each example in the corpus contains a SQL query, a database, and a query result. Notably, following Liu et al. (2022), each database is flattened into a sequence when it is fed into LMs. Meanwhile, to avoid databases being too large to fit into memory, we randomly drop rows of large databases until their flattened sequences contain fewer than 450 tokens. For the query result generation task, we follow the same corpus construction strategy as described in Liu et al. (2022).
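A rough sketch of the flattening and row-dropping procedure (the exact linearization follows Liu et al. (2022); whitespace tokens approximate the real tokenizer here):

```python
import random

def flatten_db(headers, rows, max_tokens=450):
    """Linearize a table and randomly drop rows until the flattened
    sequence fits within max_tokens (counted as whitespace tokens)."""
    rows = list(rows)

    def render(rs):
        parts = ["col :", " | ".join(headers)]
        for i, row in enumerate(rs, 1):
            parts += [f"row {i} :", " | ".join(map(str, row))]
        return " ".join(parts)

    seq = render(rows)
    while len(seq.split()) >= max_tokens and rows:
        rows.pop(random.randrange(len(rows)))  # drop a random row
        seq = render(rows)
    return seq
```

The flattened sequence is then concatenated with the SQL query to form the model input.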

Experiments and Analysis
To verify the effectiveness of POET-SQL in boosting the reasoning capabilities of LMs, we first apply our method to several backbone models, including encoder-only and encoder-decoder models. Then we conduct experiments on five typical reasoning benchmark datasets and compare POET-SQL with previous methods. Last, we perform a detailed model analysis to provide more insights.

Dataset Setup
We perform experiments on several datasets, including DROP, HotpotQA, TAT-QA, and EQUATE. Table 1 shows examples of these datasets and their corresponding reasoning types. Furthermore, SVAMP (Patel et al., 2021), a challenging diagnostic dataset for probing numerical reasoning, is employed to test the generalization capability of our models fine-tuned on DROP; we evaluate models on its addition and subtraction subsets. More details about the datasets can be found in Appendix § B.
Backbone Model RoBERTa (Liu et al., 2019) is selected as the backbone among encoder-only LMs, while BART (Lewis et al., 2020) is chosen as the backbone among encoder-decoder LMs. We denote the RoBERTa model and the BART model trained under POET as POET-SQL RoBERTa and POET-SQL BART, respectively. For POET-SQL BART, we treat all datasets as generative tasks and fine-tune models to generate answers. As for POET-SQL RoBERTa, the fine-tuning strategies differ slightly across datasets; more implementation details can be found in Appendix § C. Notably, on all datasets, our models are evaluated with the official evaluation metrics EM and F1.

Experimental Results
Comparing to Vanilla LMs Table 3 presents an apples-to-apples performance comparison between POET-SQL models and their associated vanilla LMs. Across all instances, we observe significant performance increments on downstream NL reasoning tasks. Specifically, POET-SQL equips popular encoder-only and encoder-decoder models with an integrated package of reasoning skills, effectively improving their performance on five benchmark datasets. As a highlighted example, POET-SQL BART obtains 11.5% (DROP) and 21.1% (SVAMP) improvements in EM compared with the vanilla BART. Since POET pre-training is carried out purely on program contexts, whereas all downstream tasks are on natural contexts, this verifies our hypothesis that reasoning capability is transferable from program executors to NL scenarios.

Comparing to Previous Methods POET-SQL also outperforms previous reasoning-enhancement pre-training methods by a clear margin, demonstrating the effectiveness of our proposed program execution pre-training. For example, compared with PReasM initialized from T5-Large, POET-SQL BART initialized from BART-Large exceeds it by 8.3%. Furthermore, POET that learns from a mix of program executors (i.e., POET-Math+SQL BART) achieves slightly better performance than the single program executor.

Pre-training Analysis
We show part of the analysis results below due to limited space; more analysis can be found in Appendix § A and § D. We use two metrics to characterize the behavior of LMs on the SQL execution task: execution accuracy and perplexity, where execution accuracy always increases as perplexity decreases. We report perplexity because it is smoother than execution accuracy, which is either 100% or 0% on a single example. In line with our expectation, pre-training on DROP leads to observably lower perplexity for SQL execution learning on both the train and dev sets. This bidirectional enhancement suggests some relative independence between reasoning mechanisms and their symbolic representations.

Necessity of Program Execution
Can POET boost the reasoning abilities of giant pre-trained language models? Yes.
Recent work suggests that giant LMs excel at reasoning (Brown et al., 2020), so we are curious whether POET is effective for them. Following the same procedure as in § 6, we apply POET-SQL to T5-11B, one of the largest publicly available LMs. As shown in Table 6, albeit not as pronounced as in the cases of smaller LMs, POET still succeeds in boosting the numerical reasoning abilities of giant LMs.

Conclusion & Future Work
We introduce POET, a new pre-training paradigm for boosting the reasoning capability of language models via imitating program executors. Experimental results on six datasets demonstrate that POET can significantly boost existing language models on several reasoning skills, including numerical, logical, and multi-hop reasoning. Our best language model under POET reaches highly competitive performance with previous specialized models. In the future, we hope our work can inspire more transfer of reasoning knowledge from program executors to models. We will also investigate the causes of the reasoning transfer with more insightful experiments, since we still do not know how it occurs.

Limitations
The first limitation of our approach is the relatively strong coupling between the reasoning skills learned in the pre-training task and the reasoning skills required by the downstream task. In other words, POET expects the reasoning abilities of the program executor to overlap with the downstream reasoning requirements for the execution learning to be transferable. This expectation also applies to POET-SQL, although it allows LMs to master different reasoning skills at the same time. For example, when ablating all programs involving math operations from the pre-training corpus of POET-SQL, it shows poor performance on DROP. The second limitation is that POET still employs instantiated program templates rather than probabilistic context-free grammars to synthesize programs. The latter usually offers a more diverse range of programs that may contribute to the generalization of the pre-trained language models, but is often more complex.

A Program Context Analysis
POET emphasizes the importance of the program context for reasoning transferability, owing to the analogy between program-to-program-context and sentence-to-natural-context drawn in Figure 1. To investigate it, we explore the effect of different program context design choices on reasoning transferability by conducting experiments on well-designed POET-Math variants.

A.1 The Necessity of Program Context
To verify the necessity of the program context, we experiment with POET-Math without program context, i.e., a variable-free POET-Math variant whose program context is empty. Taking the example of POET-Math in Figure 3, the program is transformed into 152.0 + 99.0 - 70.3. The experimental results are shown in Table 7. One can see that there is a clear gap between the two variants, supporting the necessity of the program context.

B.1 Dataset Details

EQUATE A benchmark for evaluating quantitative reasoning in natural language inference (Ravichander et al., 2019). Models are fine-tuned on MNLI to perform zero-shot natural language inference tasks over quantitative statements described in (premise, hypothesis) pairs to reach final entailment decisions.
LogiQA A multi-choice reading comprehension dataset that evaluates logical reasoning ability, whose questions are designed by domain experts (Liu et al., 2020). It contains four types of logical reasoning: categorical, disjunctive, conjunctive, and conditional reasoning.
SVAMP A challenging math word problem dataset (Patel et al., 2021). It is designed specifically to expose models that leverage spurious patterns to perform arithmetic operations without true understanding of context. We only keep addition and subtraction problems, in accordance with our pre-training coverage.

B.2 Baseline Setup
We summarize the baseline methods in short below, and refer readers to their papers for more details.

C Implementation Details
C.1 POET-SQL RoBERTa on Different Datasets

On DROP, we cast the span selection task as a sequence tagging problem following Segal et al. (2020). On TAT-QA, we in-place substitute the RoBERTa-Large encoder in TAGOP (Zhu et al., 2021) with our POET-SQL RoBERTa to verify its effectiveness, and keep the rest of the components unchanged. On HotpotQA, we train two classifiers independently to predict the start and end positions of the answer span, as done in Devlin et al. (2019). On EQUATE, we train a classifier to perform sequence classification on concatenated premise-hypothesis pairs. Notably, we follow the official setup to train LMs on the MNLI dataset (Williams et al., 2018) and evaluate their zero-shot performance on EQUATE. On SVAMP, the encoder-only model is not applicable since the answers are out-of-context.

C.2 Pre-training Details
By default, we apply AdamW as the pre-training optimizer with the default scheduling parameters in fairseq. The coefficient of weight decay is set to 0.05 to alleviate over-fitting of pre-trained models. Additionally, we employ fp16 to accelerate pre-training.

POET-Math
The pre-training procedure lasts for 10,000 steps with a batch size of 512. After warming up over the first 2,000 steps, the learning rate reaches its peak of 3×10−5.

POET-Logic
The pre-training procedure lasts for 5,000 steps with a batch size of 512. After warming up over the first 1,000 steps, the learning rate reaches its peak of 3×10−5.
POET-SQL For POET-SQL BART and POET-SQL RoBERTa, the pre-training procedure lasts for 50,000 steps with a batch size of 512. After warming up over the first 5,000 steps, the learning rate reaches its peak of 3×10−5. To save memory, each example in the pre-training corpus contains at most 512 tokens. For POET-SQL T5, the pre-training procedure lasts for 20,000 steps with a batch size of 512. After warming up over the first 2,000 steps, the learning rate reaches its peak of 1×10−5. The maximum input length of each example is truncated to 384 tokens to increase the batch size.
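The warmup schedules above can be sketched as follows (linear warmup to the peak; the post-warmup decay, governed by fairseq's default scheduler, is omitted here for simplicity):

```python
def learning_rate(step, peak_lr=3e-5, warmup_steps=2000):
    """Linear warmup from 0 to peak_lr over warmup_steps; held constant
    afterwards in this sketch (the actual run uses fairseq's default decay)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr
```

For POET-Math this gives, e.g., a learning rate of 1.5×10−5 halfway through the 2,000-step warmup and 3×10−5 at its end.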

C.3 Fine-tuning Details
We implement our models based on transformers (Wolf et al., 2020), fairseq (Ott et al., 2019), and DeepSpeed.

Passage Retrieval in HotpotQA Since the total length of the original passages in HotpotQA is too long to fit into memory, we train a classifier to retrieve the top-3 passages, as done in previous work (Deng et al., 2021). Specifically, a RoBERTa-Large model is fine-tuned to discriminate whether an input passage is required to answer the question. The Hits@3 score of the classifier on HotpotQA is 97.2%.

Numerical Design in DROP and SVAMP As noticed by previous works, sub-word tokenization methods such as byte pair encoding (Sennrich et al., 2015) potentially undermine the arithmetic ability of models. Instead, the character-level number representation is argued to be a more effective alleviation (Wallace et al., 2019). Additionally, the reverse decoding of numbers is proposed as a better way of modelling arithmetic carry (Geva et al., 2020). Therefore, we employ these design strategies on DROP and SVAMP.
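An illustrative sketch of the character-level, reversed number representation (our own encoding, for illustration only):

```python
def encode_number(ans):
    """Represent a numeric answer as space-separated characters in reverse,
    so the decoder emits low-order digits first, easing carry modelling."""
    return " ".join(reversed(str(ans)))

def decode_number(tokens):
    """Invert the encoding back to the original number string."""
    return "".join(reversed(tokens.split()))
```

For example, the answer 123 is decoded digit-by-digit as "3 2 1", mirroring how carries propagate from the lowest digit upward.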

C.4 Fine-tuning Hyperparameters
By default, we apply AdamW as the fine-tuning optimizer with default scheduling parameters on all datasets. To ensure statistical significance, all fine-tuning procedures are run with three random seeds, except for T5-11B and POET-SQL T5 due to the limited computation budget.
DROP POET-SQL RoBERTa and RoBERTa-Large are trained with the subset of questions marked as "span" from the DROP dataset. Since a gold answer may occur multiple times in the passage, we optimize the sum of the negative log probabilities of all possibly-correct IO sequences in which each gold answer is included at least once, as done in Segal et al. (2020). The fine-tuning procedure runs up to 25,000 steps with a batch size of 64 and a learning rate of 7.5×10−6. As for BART-Large (and POET-SQL BART, POET-Math, the same below) and T5-11B (and POET-SQL T5, the same below), they are trained with the whole DROP dataset. For BART-Large, the fine-tuning procedure runs up to 20,000 steps with a batch size of 128 and a learning rate of 3×10−5. For T5-11B, due to the computational budget, the fine-tuning procedure only lasts for 10,000 steps with a batch size of 32 and a learning rate of 1×10−5.
TAT-QA In the experiment on TAT-QA, we employ the official implementation and the default hyperparameters provided in TAGOP. The fine-tuning procedure runs up to 50 epochs with a batch size of 48. For the modules introduced in TAGOP, the learning rate is set to 5×10−4, while for RoBERTa-Large (and POET-SQL RoBERTa), the learning rate is set to 1.5×10−5.

HotpotQA The fine-tuning procedure runs up to 30,000 steps with a batch size of 64. The learning rate is 1×10−5. Overlong inputs are truncated to 512 tokens for RoBERTa-Large (and POET-SQL RoBERTa), T5-11B (and POET-SQL T5), and BART-Large (and POET-SQL BART).

EQUATE The fine-tuning procedure runs up to 20,000 steps on MNLI with a batch size of 128 for both RoBERTa-Large (and POET-SQL RoBERTa) and BART-Large (and POET-SQL BART), with a learning rate of 1×10−5. After fine-tuning, models are directly evaluated on EQUATE.

LogiQA In the experiment on LogiQA, we employ the open-source implementation and the default hyperparameters provided in ReClor (Yu et al., 2020) to fine-tune RoBERTa-Large (and POET-SQL RoBERTa). The fine-tuning procedure runs up to 10 epochs with a batch size of 24. The learning rate is set to 1×10−5.

D Fine-grained Analysis
DROP In Table 9 we report model F1 scores by question type on DROP. Comparing the three POET pre-trained models with their vanilla versions, we observe that: (i) POET-SQL BART outperforms the vanilla BART-Large by a wide margin on all types of questions, i.e., number (15.3%), date (9.8%), and span (around 5%). (ii) POET-SQL RoBERTa only deals with span selection questions, and obtains 1.9% and 3.2% gains on span and spans questions, respectively. (iii) For the giant POET-SQL T5, we also observe a 2% improvement on number questions, 2.2% on span, and 0.8% on spans questions. These model-agnostic performance boosts on DROP reveal the extra numerical reasoning knowledge models learned from the SQL program executor.
EQUATE Table 10 presents the performance breakdown by subsets of EQUATE (Ravichander et al., 2019), where we compare POET-SQL BART and POET-SQL RoBERTa with their vanilla versions and previous baselines. For both models, we observe around 10% accuracy improvement on the NR ST subset, where numerical comparison and quantifiers are especially emphasized. Stable performance improvement is also observed for both pre-trained models on the RTE-Q subset, where arithmetic and ranges are the primary focus. Interestingly, POET-SQL RoBERTa alone demonstrates improvement on the RedditNLI subset (emphasizing approximation and verbal quantitative reasoning). Performance on the other subsets is approximately comparable between POET pre-trained models and vanilla models, suggesting that POET does not harm the intrinsic abilities of language models.

TAT-QA Since POET-SQL pre-training is only performed on table-like texts (i.e., the flattened sequences of databases), it is highly non-trivial for our model to generalize to such a hybrid scenario containing both tables and passages, again illustrating the transferability of reasoning capabilities.

E NL Understanding Performance
Dataset Details We fine-tune POET-SQL RoBERTa on (i) SQuAD v1.0 (Rajpurkar et al., 2016): one of the most classical single-span selection RC benchmarks measuring understanding over natural language context; (ii) MNLI (Williams et al., 2018): a large-scale NLI dataset measuring cross-domain and cross-genre generalization of NLU (our model is evaluated on the matched setting for simplicity); and (iii) QuoRef (Dasigi et al., 2019): a Wikipedia-based multi-span selection RC benchmark with a special emphasis on coreference resolution. All dataset statistics are shown in Table 12.
Implementation Details (i) On SQuAD, we cast the span selection task as a sequence tagging problem following Segal et al. (2020). (ii) On MNLI-matched, we train both models to perform sequence classification on concatenated premise-hypothesis pairs. (iii) On Quoref, we cast the span(s) selection task as an IO sequence tagging problem following Segal et al. (2020).
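The span-selection-as-tagging reduction above can be sketched as follows. The IO scheme here (tag each in-answer token `I`, everything else `O`) is our minimal illustration, not the exact implementation of Segal et al. (2020).

```python
def spans_to_io_tags(tokens, answer_spans):
    """Convert token-level answer spans (inclusive start/end indices)
    into IO tags: 'I' inside any answer span, 'O' elsewhere."""
    tags = ["O"] * len(tokens)
    for start, end in answer_spans:
        for i in range(start, end + 1):
            tags[i] = "I"
    return tags

def io_tags_to_spans(tokens, tags):
    """Decode contiguous runs of 'I' tags back into answer strings."""
    spans, current = [], []
    for tok, tag in zip(tokens, tags):
        if tag == "I":
            current.append(tok)
        elif current:
            spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans
```

Because decoding simply collects every `I` run, the same model head naturally supports both single-span (SQuAD) and multi-span (Quoref) prediction.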

Figure 1 :
Figure 1: Given a program context and a program as input, POET pre-trains LMs to output the execution result. After fine-tuning on downstream tasks, POET can boost LMs in reasoning-required scenarios. Explanations of program context, program, program executor and execution result can be found in § 3. More examples of natural context and sentence are in Table 1.

Figure 3 :
Figure 3: The illustration of POET-Math and POET-Logic. During pre-training, the concatenation of the program and program context is fed into the language model, and the model is expected to output the execution result.
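A minimal sketch of how a POET-Math pre-training instance of this shape might be synthesized. The variable names, value ranges, and the `" | "` separator between program and context are our illustrative assumptions, not the paper's exact format; here the Python interpreter plays the role of the program executor.

```python
import random

def make_poet_math_example(rng, n_vars=3):
    """Synthesize one POET-Math-style instance: an arithmetic expression
    over named variables (the program), variable assignments (the program
    context), and the execution result as the target."""
    names = ["a", "b", "c", "d", "e"][:n_vars]
    values = {n: rng.randint(0, 100) for n in names}
    ops = rng.choices(["+", "-"], k=n_vars - 1)
    program = names[0]
    for op, name in zip(ops, names[1:]):
        program += f" {op} {name}"
    context = " ; ".join(f"{n} = {v}" for n, v in values.items())
    result = eval(program, {}, values)  # the executor: plain Python evaluation
    return {"input": f"{program} | {context}", "target": str(result)}
```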

Figure 5 :
Figure 5: The illustration of the POET-SQL pre-training tasks: query result selection for encoder-only LMs and query result generation for encoder-decoder LMs.
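One way to produce such (SQL + flattened table, execution result) pairs is to let a SQL engine act as the executor. The sketch below uses SQLite with a hypothetical two-column table; the `[TAB]`/`[ROW]` flattening markers and the fixed schema are our assumptions for illustration.

```python
import sqlite3

def execute_and_flatten(rows, sql):
    """Execute `sql` against an in-memory table, returning the program
    plus flattened table (the model input) and the execution result
    (the generation/selection target)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (name TEXT, sales INTEGER)")
    conn.executemany("INSERT INTO t VALUES (?, ?)", rows)
    result = conn.execute(sql).fetchall()
    conn.close()
    flat = " [ROW] ".join(f"{name} | {sales}" for name, sales in rows)
    target = ", ".join(str(v) for row in result for v in row)
    return {"input": f"{sql} [TAB] {flat}", "target": target}
```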

Figure 6 :
Figure 6: The EM performance [%] on the pre-training dev set (Left) and the downstream DROP dev set (Right) with different pre-training steps and scales. POET-SQL (x%) denotes the model trained with x% of the pre-training examples, while 100% corresponds to the model trained on the whole pre-training corpus of POET-SQL, which contains 5 million examples.
Figure 8: Train and dev perplexity of vanilla BART and of BART pre-trained on DROP (BART+DROP), measured on the pre-training corpus of POET-SQL.
[Table 1 examples, continued]
- Question: What is the difference in casualty numbers between Bavarian and Austrian? Passage: [DOC] The popular uprising included large areas of . . .
- Question: One employee supervises another who gets more salary than himself. Fact: [DOC] David, Jack and Mark are colleagues in a company. David supervises Jack, and Jack supervises Mark. David gets more . . .
- Question: At which university does the biographer of John Clare teach English Literature? Passage: [DOC] John Clare : John Clare was an English poet . . . [DOC] CMS College Kottayam : The CMS College is one . . .
- Question: What was the percentage change in gaming between 2018 and 2019? Context: [TAB] Server products and cloud services | 32,622 | 26,129 . . . [DOC] Our commercial cloud revenue, which includes Office . . .

Table 2 :
The six typical SQL programs that require reasoning. Listed are the type and the example SQL programs. [COL] and [VAL] represent the table column and the table cell value, respectively.
Seeded by WIKISQL (Zhong et al., 2017), 5 million examples are synthesized for pre-training. For the query result selection task, the pre-training corpus is constructed in a similar way as above, except that only the examples whose query results are suitable for encoder-only LMs are retained. This filtering results in a corpus containing nearly 2 million examples.
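The filtering step could be sketched as keeping only examples whose execution result already appears verbatim in the flattened input, so an encoder-only model can recover the answer by span selection rather than generation. This substring criterion is a simplification of whatever exact rule the corpus construction uses.

```python
def suitable_for_encoder_only(example):
    """Keep an example only if every value of its execution result occurs
    verbatim in the flattened input, i.e., the answer is selectable as
    a span rather than needing to be generated."""
    values = example["target"].split(", ")
    return all(v in example["input"] for v in values)

def filter_corpus(examples):
    """Retain the span-selectable subset for query result selection."""
    return [ex for ex in examples if suitable_for_encoder_only(ex)]
```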

Table 3 :
The main experimental results of different backbone models on the test sets and dev sets (♡) of datasets with or without our POET-SQL. The results of POET are significantly better than those of the vanilla LMs (p < 0.05). Note that the performance of RoBERTa and POET-SQL RoBERTa is reported on the subset of DROP where the answer is span(s).

Table 4 :
The comparison of our models with baselines on the test sets and dev sets (♡) of different datasets. The LMs used by all baselines are of Large size, except for ReasonBERT. Bold numbers indicate the best results.

Table 5 :
The EM and F 1 of POET-SQL BART on the DROP dev set with respect to different naturalness of program and program context.
An intuitive hypothesis is that the effectiveness of POET should be positively associated with the naturalness of the program and program context. However, Table 5 provides counter-evidence to this intuitive hypothesis, since tuning the naturalness of the program or program context does not significantly impact the effectiveness of POET. For example, the unnatural program only leads to a slight decrease in DROP EM, from 77.7% to 76.9%. This also indicates that the model learns certain abstract reasoning capabilities rather than lexical associations. If reasoning ability can be transferred from program execution to NL reasoning tasks in POET, then the reversed process of POET may also work, i.e., models pre-trained with NL reasoning would have better learnability on program execution. To test this speculation, we compare the behavioral difference of vanilla BART and BART pre-trained on DROP in terms of learning SQL execution in Figure 8. There are two indicators that can be used

Table 6 :
The experimental results of T5-11B and POET-SQL T5 on test sets and dev sets (♡) of different datasets.

Table 9 :
Breakdown of model F1 score by answer types on the dev set of DROP. Some works only report overall span-type performance (marked by *), where single-span performance is non-separable from multi-span performance. Bold and underlined numbers indicate the best and second-best results, respectively.

Table 10 :
The EM performance of different models on all subsets of the EQUATE benchmark.Bold and underlined numbers indicate the best and second-best results, respectively.

Table 11 :
The EM performance of TAGOP (POET-SQL RoBERTa ) with respect to answer types and sources on the dev set of TAT-QA.

Table 12 :
POET on NL understanding experiment dataset statistics.

Table 11 shows the detailed experimental results of TAGOP (POET-SQL RoBERTa). Considering that the pre-training of POET-SQL RoBERTa is only performed on table-like texts (i.e., the flattened sequence of databases), it is highly non-trivial for our model to generalize to such a hybrid scenario containing both tables and passages, again illustrating the transferability of reasoning capabilities.