Coupling Large Language Models with Logic Programming for Robust and General Reasoning from Text

While large language models (LLMs), such as GPT-3, appear to be robust and general, their reasoning ability is not at a level to compete with the best models trained for specific natural language reasoning problems. In this study, we observe that a large language model can serve as a highly effective few-shot semantic parser. It can convert natural language sentences into a logical form that serves as input for answer set programs, a logic-based declarative knowledge representation formalism. The combination results in a robust and general system that can handle multiple question-answering tasks without requiring retraining for each new task. It only needs a few examples to guide the LLM's adaptation to a specific task, along with reusable ASP knowledge modules that can be applied to multiple tasks. We demonstrate that this method achieves state-of-the-art performance on several NLP benchmarks, including bAbI, StepGame, CLUTRR, and gSCAN. Additionally, it successfully tackles robot planning tasks that an LLM alone fails to solve.


Introduction
A typical way to handle a question-answering task is to train a neural network model on large training data and test it on similar data. Such models handle linguistic variability and ambiguity well but often learn statistical features and correlations rather than true reasoning (Ruder, 2021), which makes them brittle, poor at generalization, and difficult to interpret.
Alternatively, transformer-based large language models (LLMs) have recently shown wide success on many downstream tasks, demonstrating general reasoning capability on diverse tasks without being retrained. However, when we restrict our attention to individual NLP reasoning benchmarks, they usually do not perform as well as state-of-the-art models, despite various efforts to improve accuracy through prompt engineering (Wei et al., 2022; Zhou et al., 2022).
Similarly, LLMs have gained attention for plan generation for robots due to the rich semantic knowledge they have acquired about the world (Ahn et al., 2022; Huang et al., 2022; Zeng et al., 2022). However, LLMs are known to perform shallow reasoning and cannot find complex plans (Valmeekam et al., 2022).
In another context, Nye et al. (2021) note that while LLMs are good at System-1 thinking, their outputs are often inconsistent and incoherent. This is because LLMs are trained to predict subsequent words in a sequence and do not appear to have a deep understanding of concepts such as cause and effect, logic, and probability, which are important for reasoning.
Nevertheless, we note that the rich semantic knowledge that LLMs possess makes them effective general-purpose few-shot semantic parsers that can convert linguistically variable natural language sentences into atomic facts that serve as input to logic programs. We also note that the fully declarative nature of answer set programs (Lifschitz, 2008; Brewka et al., 2011) makes them a good pair with LLM semantic parsers, providing interpretable and explainable reasoning over the parsed results of the LLMs using background knowledge. Combining large language models and answer set programs leads to an attractive dual-process, neuro-symbolic reasoning system that works across multiple QA tasks without retraining for individual tasks.
We tested this idea on several NLP benchmarks, bAbI (Weston et al., 2016), StepGame (Shi et al., 2022), CLUTRR (Sinha et al., 2019), and gSCAN (Ruis et al., 2020), by applying the same dual-system model, and achieved state-of-the-art performance on all of them. Furthermore, the high accuracy and transparency allow us to easily identify the source of errors, making our system a useful dataset validation tool as well. In particular, we found a significant number of errors in the original CLUTRR dataset that are hard to detect manually.
While the new version of GPT-3 (Brown et al., 2020) (text-davinci-003) shows improvement over its predecessors, we observe that it also retains critical limitations. In the process, we develop prompting methods for semantic parsing that overcome some of them.
The implementation of our method is publicly available online at https://github.com/azreasoners/LLM-ASP.

Preliminaries

Semantic Parsing and LLMs
Semantic parsing involves converting a natural language query or statement into a structured representation that a computer can understand and manipulate. Statistical methods have grown in popularity (Zelle and Mooney, 1996; Miller et al., 1996; Zettlemoyer and Collins, 2005; Wong and Mooney, 2007), and encoder-decoder models in particular have been widely used (Dong and Lapata, 2016; Jia and Liang, 2016; Kočiskỳ et al., 2016). However, these statistical methods require annotated input-output pairs. Furthermore, machine learning models often fail to compositionally generalize to unseen data (Lake and Baroni, 2018).
More recently, pre-trained language models have been applied to semantic parsing tasks (Liu et al., 2021), such as generating SQL queries, SPARQL queries, logical forms, or programs from natural language, together with fine-tuning or prompt-tuning of pre-trained models such as BART, RoBERTa, and GPT-2 (Chen et al., 2020a; Shin et al., 2021; Schucher et al., 2022). With larger pre-trained networks, such as GPT-3, prompting appears to yield a reasonable semantic parser without the need for fine-tuning (Shin et al., 2021; Drozdov et al., 2022).
Another line of related work applies pre-trained language models to relation extraction, the task of extracting semantic relationships from a text given two or more entities (Liu et al., 2021). Wang et al. (2022) do zero-shot relation extraction with pre-trained language models from the BERT family and GPT-2 variants. Zhou and Chen (2022) fine-tune BERT and RoBERTa models for the extraction of sentence-level relations. Chen et al. (2022) apply prompt-tuning to RoBERTa_LARGE for relation extraction. Similar to ours, Agrawal et al. (2022) use a few-shot prompt with GPT-3 for the extraction of clinical relations.

Dual-System Model
There is increasing interest in combining neural and symbolic systems (Marcus, 2018; Lamb et al., 2020; Sarker et al., 2021). Such dual-system models have achieved new state-of-the-art results in visual question answering (Goldman et al., 2018; Sampat and Lee, 2018; Yi et al., 2019; Chen et al., 2020b; Ding et al., 2021). In the case of textual problems, to make LLMs generate more consistent and coherent sentences, Nye et al. (2021) suggest that generation be decomposed into two parts: candidate sentence generation by an LLM (System 1 thinking) and a logical pruning process (System 2 thinking) implemented via a separate symbolic module. They demonstrate that this neuro-symbolic, dual-process model requires less training data, achieves higher accuracy, and exhibits better generalization. However, the main limitation of their work is that the symbolic module is manually constructed in Python code for the specific task at hand, requiring substantial effort. Additionally, their Python symbolic module is not readily reusable or composable. Furthermore, their main results primarily focus on the problem of consistent text generation, rather than evaluating the method on the datasets and comparing it with existing models. This is because writing world models in Python is not a scalable approach.
In our work, we follow the idea presented in (Nye et al., 2021) but adopt logic programming for the System 2 process. We argue that this combination is much more appealing than the approach in (Nye et al., 2021), as it can achieve the promised results without the limitations mentioned above.

Answer Set Programming
Answer Set Programming (ASP) (Lifschitz, 2008; Brewka et al., 2011) is a declarative logic programming paradigm that has been shown to be effective in knowledge-intensive applications. It is based on the stable model (a.k.a. answer set) semantics of logic programs (Gelfond and Lifschitz, 1988), which can express causal reasoning, default reasoning, aggregates, and various other constraints. There are several efficient solvers, such as CLINGO, DLV, and WASP. We use CLINGO v5.6.0 as the answer set solver. For the language of CLINGO, we refer the reader to the textbook (Lifschitz, 2019) or the CLINGO manual. It is also known that classical logic-based action formalisms, such as the situation calculus (McCarthy and Hayes, 1969; Reiter, 2001) and the event calculus (Shanahan, 1995), can be formulated as answer set programs. For example, the following is one of the axioms in Discrete Event Calculus stating the commonsense law of inertia, saying that fluent F holds at the next time point if there is no action affecting it.

% (DEC5)
holds_at(F,T+1) :- timepoint(T), fluent(F), holds_at(F,T),
    -released_at(F,T+1), not terminated(F,T).
Such a rule is universal and applies to almost all objects.
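As a concrete illustration, the following self-contained sketch runs the inertia axiom with the clingo Python API (pip install clingo). The scenario and the extra rule propagating -released_at are simplified for this example; they are not part of the full DEC module.

import clingo

PROGRAM = """
timepoint(0..3).
fluent(at(mary, kitchen)).

% (DEC5) commonsense law of inertia
holds_at(F,T+1) :- timepoint(T), fluent(F), holds_at(F,T),
    -released_at(F,T+1), not terminated(F,T).

% simplified encoding: inertia is also applied to -released_at itself
-released_at(F, 0) :- fluent(F).
-released_at(F, T+1) :- timepoint(T), fluent(F), -released_at(F, T).

holds_at(at(mary, kitchen), 0).     % Mary starts in the kitchen
terminated(at(mary, kitchen), 2).   % she leaves at time 2
"""

ctl = clingo.Control()
ctl.add("base", [], PROGRAM)
ctl.ground([("base", [])])
ctl.solve(on_model=lambda m: print(m))
# at(mary,kitchen) holds at times 0, 1, and 2 but, by DEC5, not at time 3.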
Answer set programs are also known to be elaboration tolerant (McCarthy, 1998). There has been work on modularizing knowledge bases in ASP, such as the module theorem (Oikarinen and Janhunen, 2006; Babb and Lee, 2012) and knowledge modules (Baral et al., 2006). While ASP has been widely applied to many reasoning problems, it has not been considered as much for reasoning with natural language text because its input is expected to be strictly in a logical form, giving little flexibility in accepting diverse forms of natural language input.

Our Method
We refer to our framework as [LLM]+ASP, where [LLM] denotes a large pre-trained network such as GPT-3, which we use as a semantic parser to generate input to the ASP reasoner. Specifically, we assume data instances of the form ⟨S, q, a⟩, where S is a context story in natural language, q is a natural language query associated with S, and a is the answer. We use an LLM to convert a problem description (that is, context S and query q) into atomic facts, which are then fed into the ASP solver along with background knowledge encoded as ASP rules. The output of the ASP solver is interpreted as the prediction for the given data instance. Figure 1 illustrates the inference flow in the context of StepGame. The pipeline is simple but general enough to be applied to various tasks without the need for retraining. It only requires replacing the few-shot prompts to the LLM and the ASP background knowledge with those suitable for the new tasks.
By combining LLMs and ASP in this manner, we enable robust symbolic reasoning that can handle diverse and unprocessed textual input. The ASP knowledge modules remain unaffected by the diverse forms of input text that express the same facts. Our method does not rely on training datasets. Instead, a few examples that turn natural language sentences into atomic facts are sufficient to build a semantic parser, thanks to the learned representations in LLMs. Furthermore, ASP knowledge modules can be reused across different tasks.
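The end-to-end flow can be sketched in a few lines of Python with the clingo package. This is a minimal sketch under stated assumptions: llm_parse stands in for any few-shot LLM call (our actual prompts are in Appendix C), the facts are hardcoded so the sketch runs without an API key, and the toy module contains only two of the nine offsets; the rules that read off the final answer are discussed in the Knowledge Modules section.

import clingo

def llm_parse(sentence, prompt):
    # Stand-in for a few-shot LLM request that returns one atomic fact,
    # e.g. 'is(c, top_right, d).' for "C is to the top right of D."
    raise NotImplementedError("call an LLM of your choice here")

# Facts as an LLM would return them for a two-sentence story and a query:
FACTS = "is(a, top_right, c). is(c, top_right, b). query(a, b)."

# A toy location module: fix B at the origin and chain offsets along is/3 facts.
MODULE = """
offset(top_right, 1, 1). offset(down_left, -1, -1).
location(B, 0, 0) :- query(A, B).
location(A, X+Dx, Y+Dy) :- is(A, R, B), location(B, X, Y), offset(R, Dx, Dy).
"""

ctl = clingo.Control()
ctl.add("base", [], FACTS + MODULE)
ctl.ground([("base", [])])
ctl.solve(on_model=lambda m: print(m))
# The model contains location(a,2,2): A is two steps to the top right of B.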

Prompts for Fact Extraction
We use GPT-3 to extract atomic facts from the story and the query. Most of the time, giving several examples yields accurate semantic parsing. The following is an example prompt for bAbI.
Please parse the following statements into facts. The available keywords are: pickup, drop, and go.

Sentence: Max journeyed to the bathroom.
Semantic Parse: go(Max, bathroom).

Sentence: Mary grabbed the football there.
Semantic Parse: pickup(Mary, football).

...

We find that GPT-3 is highly tolerant of linguistic variability. For example, in StepGame, GPT-3 can turn each of the sentences below into the same atomic fact top_right("C","D").

C is to the top right of D.
C is to the right and above D at an angle of about 45 degrees.
C is at a 45 degree angle to D, in the upper righthand corner.
C is directly north east of D.
C is above D at 2 o'clock.
In the experiments to follow, we find that the following strategies work well for fact extraction.
1. In general, if the information in a story (or query) can be extracted independently, parsing each sentence separately (using the same prompt multiple times) typically works better than parsing the whole story.
2. There is certain commonsense knowledge that GPT-3 cannot pick up from the examples in the prompt alone. In this case, detailing the missing knowledge in the prompt can work. For example, in StepGame, clock numbers are used to denote cardinal directions, but GPT-3 couldn't translate them correctly even with a few examples in the prompt. It works after enumerating all cases ("12 denotes top, 1 and 2 denote top_right, 3 denotes right, ...") in the prompt.
3. Semantic parsing tends to work better if we instruct GPT-3 to use a predicate name that better reflects the intended meaning of the sentence. For example, "A is there and B is at the 5 position of a clock face" is better turned into down_right(B,A) than top_left(A,B), although, logically speaking, the relations are symmetric.
The complete set of prompts for semantic parsing is given in Appendix C.
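As an illustration of strategies 1 and 2, the following Python sketch parses a story sentence by sentence. The example pair mirrors the StepGame prompt in Appendix C.2, while the full clock enumeration (the text above elides it after "3 denotes right") and the llm callback are our own assumptions for this sketch.

def parse_story(story, llm):
    # Strategy 1: parse each sentence separately with the same few-shot prompt.
    # Strategy 2: spell out commonsense knowledge (here, clock directions)
    # that GPT-3 does not reliably generalize from examples alone.
    header = (
        "Please parse each sentence into a fact.\n"
        "If a clock face is mentioned: 12 denotes top, 1 and 2 denote "
        "top_right, 3 denotes right, 4 and 5 denote down_right, 6 denotes "
        "down, 7 and 8 denote down_left, 9 denotes left, 10 and 11 denote "
        "top_left.\n\n"
        'Sentence: C is to the top right of D.\n'
        'Semantic Parse: top_right("C", "D").\n\n'
    )
    facts = []
    for sentence in story.split(". "):  # naive sentence splitting for the sketch
        prompt = header + "Sentence: " + sentence.strip() + "\nSemantic Parse:"
        facts.append(llm(prompt).strip())
    return facts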

Knowledge Modules
Instead of constructing a minimal world model for each task in Python code (Nye et al., 2021), we use ASP knowledge modules. While some knowledge can be lengthy to describe in English, it can be concisely expressed in ASP. For example, queries in StepGame ask for the spatial relation R of object A to object B; the location module answers them with two rules (shown in Appendix E.5) that fix B's location to (0,0), compute A's location by chaining offsets, and read off the relation from the signs of A's coordinates. The second of those rules contains six conditional literals, among which Dx=-1:X<0 says that "Dx must be -1 if X<0." For example, if A's location (X,Y) is (-3,0), then (Dx,Dy) is (-1,0) and the answer R is left. Similar rules also apply to bAbI task 17, which queries the relative positions of objects. In rules such as is(A,R,B), the relation R is a variable and can be substituted by any binary relation. Such a higher-order style of representation turns out to be quite general and applicable to many tasks that query a relation or its arguments. Figure 2 shows the knowledge modules used in this paper, where DEC denotes the Discrete Event Calculus axioms from (Mueller, 2006; Lee and Palla, 2012). This section explained the main rules in the location module; the complete ASP knowledge modules are given in Appendix E.
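To make this concrete, the following sketch runs the two answer rules (quoted in Appendix E.5) on a three-hop example. The nine offset facts and the chaining rule are our own rendering of the module described above, so treat this as an illustration rather than the exact module used in our experiments.

import clingo

LOCATION = """
% nine predefined offsets (only offset(left,-1,0) is quoted in the text;
% the rest are our reconstruction from the relation names)
offset(left,-1,0). offset(right,1,0). offset(top,0,1). offset(down,0,-1).
offset(top_left,-1,1). offset(top_right,1,1).
offset(down_left,-1,-1). offset(down_right,1,-1). offset(overlap,0,0).

% chaining rule: A's location is B's location plus the offset of relation R
location(A, X+Dx, Y+Dy) :- is(A, R, B), location(B, X, Y), offset(R, Dx, Dy).

% the two answer rules of the location module (Appendix E.5)
location(B, 0, 0) :- query(A, B).
answer(R) :- query(A, B), location(A, X, Y), offset(R, Dx, Dy),
             Dx=-1 : X<0; Dx=0 : X=0; Dx=1 : X>0;
             Dy=-1 : Y<0; Dy=0 : Y=0; Dy=1 : Y>0.
"""

FACTS = "is(a,left,c). is(c,left,d). is(d,left,b). query(a,b)."  # A is 3 hops left of B

ctl = clingo.Control()
ctl.add("base", [], LOCATION + FACTS)
ctl.ground([("base", [])])
ctl.solve(on_model=lambda m: print(m))
# A's computed location is (-3,0), so the model contains answer(left).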

Experiments
We apply the method in the previous section to four datasets. As a reminder, our approach involves few-shot in-context learning and does not require training. We use the same pipeline as shown in Figure 1, but with different prompts and knowledge modules for each dataset. For more detailed information about the experimental settings, please refer to the appendix.

bAbI
The bAbI dataset (Weston et al., 2016) is a collection of 20 QA tasks that have been widely applied to test various natural language reasoning problems, such as deduction, path-finding, spatial reasoning, and counting. State-of-the-art models, such as the self-attentive associative-based two-memory model (STM) (Le et al., 2020) and Query-Reduction Networks (QRN) (Seo et al., 2017), achieve close to 100% accuracy after training with 10k instances, while QRN's accuracy drops to 90% with 1k training instances.
We first designed two GPT-3 baselines, one with few-shot prompts (containing a few example questions and answers) and the other with Chain-of-Thought (CoT) prompts (Wei et al., 2022), which state the relevant information to derive the answer.
We also apply GPT-3+ASP. For example, we use GPT-3 to turn "the kitchen is south of the bathroom" into an atomic fact is(kitchen, southOf, bathroom) by giving a few examples of the same kind. Regarding knowledge modules, Tasks 1-3, 6-9, 10-14, and 19 are about events over time and use the DEC knowledge module. Tasks 4, 17, and 19 require various domain knowledge modules, such as the location and action knowledge modules. The remaining tasks do not require domain knowledge and rely only on simple rules to extract answers from parsed facts.
Table 1 compares our method with the two GPT-3 baselines, as well as two state-of-the-art methods on the bAbI dataset, STM and QRN. Interestingly, the new GPT-3 (text-davinci-003) with basic few-shot prompting achieves 80.34% accuracy, while CoT improves it to 86.18%. GPT-3(d3)+ASP achieves state-of-the-art performance on bAbI with 99.99% average accuracy across all tasks, producing only two answers that disagree with the labels in the dataset. It turns out that the two questions are malformed since the answers are ambiguous, and our model's answers can be considered correct.

StepGame
Although bAbI has been extensively tested, it has several problems. Shi et al. (2022) note data leakage between the train and test sets, where named entities are fixed and only a small number of relations are used. Palm et al. (2018) point out that models do not need multi-hop reasoning to solve the bAbI dataset. To address these issues, Shi et al. (2022) propose the StepGame dataset. It is a contextual QA dataset in which the system is required to interpret a story S about spatial relationships among several entities and answer a query q about the relative position of two of those entities, as illustrated in Figure 1. Unlike the bAbI dataset, StepGame uses a large number of named entities and requires multi-hop reasoning with as many as 10 reasoning steps.
In the basic form of the StepGame dataset, each story consists of k sentences that describe k spatial relationships between k + 1 entities in a chain-like shape.In this paper, we evaluate the StepGame dataset with noise, where the original chain is extended with noise statements by branching out with new entities and relations.
Similarly to bAbI, we designed two GPT-3 baselines and applied our method to the StepGame dataset. More details on the prompts are available in Appendix C.2.
For each k ∈ {1, . . ., 10}, the StepGame dataset with noise consists of 30,000 training samples, 1,000 validation samples, and 10,000 test samples. To save on the API cost for GPT-3, we only evaluated the two GPT-3 baselines on the first 100 test samples and evaluated our method on the first 1,000 test samples for each k ∈ {1, . . ., 10}. Table 2 compares the accuracy of our method with the two GPT-3 baselines and the current methods, i.e., RN (Santoro et al., 2017), RRN (Palm et al., 2018), UT (Dehghani et al., 2018), STM (Le et al., 2020), and SynSup (Mirzaee and Kordjamshidi, 2022). Surprisingly, the GPT-3 baselines achieve accuracy comparable to the other models (except for SynSup) for large values of k. CoT does not always help and decreases the accuracy for large k. This may be because there is a higher chance of making a mistake in a long chain of thought. GPT-3(d2)+ASP outperforms all state-of-the-art methods and the GPT-3 baselines by a large margin for k = 4, . . ., 10. Although SynSup achieves higher accuracy for k = 1, 2, 3, this is misleading due to errors in the dataset. As we analyze below, about 10.7% of the labels in the data are wrong. The SynSup training makes the model learn to make the same mistakes on the test dataset, which is why its performance looks better than ours.
The modular design of GPT-3+ASP enables us to analyze the reasons behind its wrong predictions. We collected the first 100 data instances for each k ∈ {1, . . ., 10} and manually analyzed the predictions on them. Among the 1,000 predictions of GPT-3(d2)+ASP, 108 disagree with the dataset labels, and we found that 107 of those are due to errors in the labels. For example, given the story and question "J and Y are horizontal and J is to the right of Y. What is the relation of the agent Y with the agent J?", the label in the dataset is "right" while the correct relation should be "left". Recall that our method is interpretable, so we could easily identify the source of errors.

CLUTRR
CLUTRR (Sinha et al., 2019) is a contextual QA dataset that requires inferring family relationships from a story. Sentences in CLUTRR are generated using 6k template narratives written by Amazon Mechanical Turk crowd-workers and are thus more realistic and complex than those in bAbI and StepGame. CLUTRR consists of two subtasks: systematic generalization, which evaluates stories containing unseen combinations of logical rules (Minervini et al., 2020; Bergen et al., 2021), and robust reasoning, which evaluates stories with noise. Table 3 compares our method with RN (Santoro et al., 2017), MAC (Hudson and Manning, 2018), BiLSTM-attention (Sinha et al., 2019), and GSM (Tian et al., 2021) on the original CLUTRR dataset, namely CLUTRR 1.0, in four categories of data instances: clean, supporting, irrelevant, and disconnected (Sinha et al., 2019). Except for our method, all other models are trained on the corresponding category of CLUTRR training data. Although our method achieves similar or higher accuracies in all categories, they are still much lower than we expected.
We found that this low accuracy is due to clear errors in CLUTRR, originating mostly from errors in the template narratives or in generated family graphs that violate common sense. The authors of CLUTRR recently published the CLUTRR 1.3 code to partially resolve this issue. With the new code, we created a new dataset, namely CLUTRR 1.3, consisting of 400 data instances, 100 for each of the four categories. The last row in Table 3 shows that our method actually performs well on the realistic sentences in CLUTRR. Indeed, with our method (using text-davinci-003) on the CLUTRR 1.3 dataset, 363 out of 400 predictions are correct, 16 are still wrong due to data mistakes (e.g., the label says "Maryann has an uncle Bruno" while the noise sentence added to the story is "Maryann told her son Bruno to give the dog a bath"), and 21 are wrong due to GPT-3's parsing mistakes (e.g., GPT-3 turned the sentence "Watt and Celestine asked their mother, if they could go play in the pool" into mother("Watt", "Celestine")). Since the sentences in CLUTRR 1.3 are more realistic than those in bAbI and StepGame, GPT-3 makes more mistakes even after reasonable efforts at prompt engineering. More details on data errors and GPT-3 errors are available in Appendix F.2 and Appendix D. We also evaluated our method on a simpler and cleaner variant of the CLUTRR dataset, namely CLUTRR-S, which was used as a benchmark problem for the state-of-the-art neuro-symbolic approach DeepProbLog (Manhaeve et al., 2021). Table 4 compares the accuracy of our method and DeepProbLog on all four categories of test data. GPT-3(d3)+ASP achieves 100% accuracy, outperforming DeepProbLog without the need for training.
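For intuition, the following minimal sketch shows how kinship relations compose in ASP. The convention relation(A, B), read "A has <relation> B," follows our examples (e.g., son(A,B) for "A has a son B"), but the particular rules and facts below are illustrative assumptions rather than the actual family knowledge module.

import clingo

FAMILY = """
% relation(A, B) reads "A has <relation> B"
child(A, B) :- son(A, B).
child(A, B) :- daughter(A, B).
grandson(A, C) :- child(A, B), son(B, C).
granddaughter(A, C) :- child(A, B), daughter(B, C).
"""

# Facts as extracted from a story (invented for this sketch):
FACTS = "daughter(michelle, theresa). son(theresa, ronald)."

ctl = clingo.Control()
ctl.add("base", [], FAMILY + FACTS)
ctl.ground([("base", [])])
ctl.solve(on_model=lambda m: print(m))
# The model contains grandson(michelle, ronald).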
Remark: Due to its modular structure, our method can serve as a dataset validation tool to detect errors in a dataset. We detected 107 wrong data instances among the first 1,000 instances in StepGame and 16 wrong data instances among the 400 instances in CLUTRR 1.3.

gSCAN
The gSCAN dataset (Ruis et al., 2020) poses a task in which an agent must execute action sequences to achieve a goal (specified by a command in a natural language sentence) in a grid-based visual navigation environment. The dataset consists of two tasks, and we evaluate our method on the data splits from the compositional generalization task. There is one shared training set, one test set (split A) randomly sampled from the same distribution as the training set, and seven test sets (splits B to H) consisting only of held-out data instances (i.e., not appearing in the training set) constructed in different ways.
In the gSCAN dataset, each data instance is a tuple ⟨G, q, a⟩, where G is the grid configuration (in JSON format) describing the size of the grid, the location and direction of the agent, and the location and features of each object in the grid; q is a query (e.g., "pull a yellow small cylinder hesitantly"); and a is the answer in the form of a sequence of actions (e.g., "turn right, walk, stay, pull, stay, pull, stay"). For each data instance, we (i) use a Python script to extract atomic facts (e.g., pos(agent,(2,3))) from the grid configuration G; (ii) extract atomic facts (e.g., query(pull), queryDesc(yellow), while(hesitantly)) from the query q using GPT-3, with the prompt details given in Appendix C.4; and (iii) predict the sequence of actions for the query using ASP. Table 5 compares the accuracy of our method and the state-of-the-art methods, i.e., GECA (Ruis et al., 2020), DualSys (Nye et al., 2021), and Vilbert+CMA (Qiu et al., 2021), on the gSCAN test dataset in eight splits. To save on API cost for GPT-3, we only evaluated the first 1,000 data instances of each split. With text-davinci-002, our method GPT-3+ASP achieves 100% accuracy. With text-curie-001, the accuracy is slightly lower, making 17 errors in split A. The errors are of two kinds: the language model fails to extract adverbs in the correct format for 11 data instances (e.g., GPT-3 responded queryDesc(while spinning) instead of while(spinning)) and doesn't ground the last word in a query for 6 data instances (e.g., for the query "walk to a small square", GPT-3 missed the atomic fact queryDesc(square)).
Once the parsed results are correct, ASP does not make a mistake in producing plans.
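To illustrate step (i), the following Python sketch converts a gSCAN-style grid configuration into atomic facts. The JSON field names below are simplified assumptions for illustration; the actual gSCAN format differs in detail.

import json

def grid_to_facts(grid_json):
    # Turn a grid configuration (JSON) into ASP facts such as pos(agent,(2,3)).
    g = json.loads(grid_json)
    facts = ["gridSize({}).".format(g["size"])]
    ax, ay = g["agent"]["position"]
    facts.append("pos(agent, ({},{})).".format(ax, ay))
    facts.append("dir(agent, {}).".format(g["agent"]["direction"]))
    for i, obj in enumerate(g["objects"]):
        x, y = obj["position"]
        facts.append("pos(object({}), ({},{})).".format(i, x, y))
        for feature in (obj["color"], obj["shape"], obj["size"]):
            facts.append("feature(object({}), {}).".format(i, feature))
    return "\n".join(facts)

example = ('{"size": 6, "agent": {"position": [2, 3], "direction": "east"}, '
           '"objects": [{"position": [0, 1], "color": "yellow", '
           '"shape": "cylinder", "size": "small"}]}')
print(grid_to_facts(example))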

Findings
The following summarizes the findings of the experimental evaluation.
• Our experiments confirm that LLMs like GPT-3 are still not good at multi-step reasoning, despite the various prompts we tried. Chain-of-Thought is less likely to improve accuracy when a long chain of thought is required.
• On the other hand, LLMs are surprisingly good at turning a wide variety of expressions into a "canonical form" of extracted information. This in turn allows the ASP knowledge modules to be isolated from linguistic variability in the input.
• Even for generating simple atomic facts, larger models tend to perform better.For example, in StepGame and gSCAN, text-curie-001 performs significantly worse compared to text-davinci-002 (Tables 2 and 5).
• The total amount of knowledge that needs to be encoded for all of the above datasets is not too large. This is partly because GPT-3 "normalizes" various forms of input sentences for ASP to process and because knowledge modules can be reused across different datasets.
• The modular design of our approach makes it possible to locate the root cause of each failed prediction and improve upon it. There are three sources of errors: semantic parsing in LLMs, symbolic constraints, and the dataset itself. The first two can be resolved by improving the prompts and updating the constraints, respectively.
• Our framework can serve as a few-shot dataset justifier and corrector. Among all predictions by our method that do not align with the labels, almost all (with only a few exceptions discussed in the paper) are due to errors in the datasets.

Conclusion
Symbolic logic programming was previously considered limited in its ability to reason from text due to its inability to handle diverse and ambiguous linguistic expressions. However, combining it with a large language model that has learned distributed representations helps alleviate this problem. The method not only achieves high accuracy but also produces interpretable results, as the source of errors can be identified. It is also general; by using pre-trained networks with few-shot prompts and reusable knowledge modules, adapting to a new domain does not require extensive training.
The knowledge modules used in our experiments are reusable. For the above experiments, the modules are relatively simple to write, as are the prompts for parsing natural language with LLMs. However, acquiring this kind of knowledge on a massive scale is an important line of research (Liu and Singh, 2004; Bosselut et al., 2019; Hwang et al., 2021) that should be combined with our approach. In addition, it is possible to use LLMs' code generation capability (Chen et al., 2021) to generate logic program rules, which we leave for future work.
One may think that logic rules are too rigid. However, there are many formalisms with weighted or probabilistic rules that can be defeated (Richardson and Domingos, 2006; Fierens et al., 2013; Lee and Wang, 2018). They could be used in more realistic settings, but they were not needed for the benchmark problems above.

Ethical Considerations
All datasets used in this paper are publicly available. For the CLUTRR dataset, gender information is essential to tell whether, e.g., A is B's uncle or niece. We used GPT-3 to predict the genders of the persons in each story. Since each story is systematically generated using sampled common first names and sampled sentence templates, it does not reveal any identity. As mentioned, the original CLUTRR dataset had some errors, and we carefully describe the code and settings used to generate the CLUTRR 1.3 dataset in Appendix B.1.

Limitations
The current work requires that knowledge modules be written by hand. Commonly used axioms, such as the commonsense law of inertia expressed by the event calculus, can be reused easily, but there are vast amounts of other commonsense knowledge that are not easy to obtain. LLMs could be used to supply this information, but we have not tried this. Knowledge graphs, such as ConceptNet (Liu and Singh, 2004), COMET (Bosselut et al., 2019), and ATOMIC (Hwang et al., 2021), could be utilized to populate ASP rules. As with code models, we expect that LLMs could generate ASP code, which we leave for future work. Also, when using large language models, despite various efforts, it is sometimes not understandable why they do not behave as expected.

Appendix
Section A presents another experiment with robot planning. Section B discusses more details about how we generated the CLUTRR dataset and the experimental results on CLUTRR 1.0. Section C presents GPT-3 prompts for semantic parsing. Section D enumerates the errors made by GPT-3 in semantic parsing. Section E presents the ASP knowledge modules we used for the experiments. Section F enumerates the errors in the datasets.
For bAbI, the prompts for the baseline few-shot prompting can be found in the directory bAbI_baseline/example_prompts, while the prompts for chain-of-thought can be found in bAbI_baseline/COT_prompts_v3.
For StepGame, the prompts for the baseline few-shot prompting and chain-of-thought can be found in the directory stepGame/prompts. The following table records the cost of the GPT-3 queries used in the GPT-3 baselines and our method, where Eng. denotes the engine of GPT-3, and c1, d2, d3 denote text-curie-001, text-davinci-002, and text-davinci-003, respectively.

A Robot Planning
Recently, there has been increasing interest in using LLMs to find a sequence of executable actions for robots, aiming to achieve high-level goals expressed in natural language, as in SayCan (Ahn et al., 2022) and Inner Monologue (Huang et al., 2022). However, the actions generated by LLMs tend to be loosely connected and do not take into account the intermediate state changes that occur during their execution. We experiment with the Pick&Place domain, where a robot is tasked with achieving a goal, such as "stack the blocks," on a table with colored blocks and bowls. We noticed that the successful plans demonstrated by SayCan are restricted to simple one-step look-ahead plans that do not take into account intermediate state changes.
We randomly sampled 40 data instances of the form ⟨Si, Sg, L⟩ in the Pick&Place domain with 4 to 7 blocks and 3 to 7 bowls, possibly stacked together, requiring 3 to 10 pick_and_place actions by the robot to change the initial state Si to the goal state Sg. Here, the label L is the sequence of instructions to achieve the goal (e.g., "1. Move the violet block onto the blue block. 2. ..."). Among the 40 data instances, 20 contain only blocks, which can be placed on the table, while the other 20 contain both blocks and bowls and assume all blocks must be on the bowls.
The baseline for this dataset follows the method in SayCan's open-source virtual tabletop environment, where GPT-3 is used as the large language model to directly find the sequence of actions from Si to Sg. The baseline fails to find successful plans for all 40 randomly sampled data instances. This result confirms the claim by (Valmeekam et al., 2022) that large language models are not suitable as planners.
We also applied our method to this task. We let GPT-3 turn the states Si and Sg into atomic facts of the form on(A, B, 0) and on(A, B), respectively.
Then, an ASP program for the Pick&Place domain is used to find an optimal plan. We found that while GPT-3 has 0% accuracy in predicting the whole plan, it has 100% accuracy in fact extraction under the provided format. When we apply symbolic reasoning to these extracted atomic facts with an ASP program, we achieve 100% accuracy on the predicted plans. Details of the prompts are available in Appendix C.5.
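To illustrate the reasoning side, below is a condensed sketch of ASP planning for a two-block instance. The rules are a simplified reconstruction written for this example (fixed horizon, blocks only), not the Pick&Place module from Appendix E.

import clingo

PLAN = """
#const h = 3.                       % planning horizon (fixed for this sketch)
time(0..h). step(0..h-1).

block(red). block(green).
place(table). place(red). place(green).

% initial and goal states, as extracted by GPT-3 into on/3 and on/2 facts
on(red, table, 0). on(green, red, 0).
goal(on(green, table)). goal(on(red, green)).

% a block is clear if nothing is on it; choose at most one move per step
covered(B, T) :- on(_, B, T).
clear(B, T) :- block(B), time(T), not covered(B, T).
{ move(B, L, T) : block(B), place(L), B != L } 1 :- step(T).
:- move(B, L, T), not clear(B, T).            % cannot move a covered block
:- move(B, L, T), block(L), not clear(L, T).  % target block must be clear

% effects and inertia
on(B, L, T+1) :- move(B, L, T).
moved(B, T) :- move(B, _, T).
on(B, L, T+1) :- on(B, L, T), step(T), not moved(B, T).

% every goal must hold at the horizon
:- goal(on(B, L)), not on(B, L, h).
#show move/3.
"""

ctl = clingo.Control()
ctl.add("base", [], PLAN)
ctl.ground([("base", [])])
ctl.solve(on_model=lambda m: print(m))
# e.g. move(green,table,0) move(red,green,1)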
B CLUTRR

Table 7 compares the accuracy of our method and the state-of-the-art methods, i.e., BiLSTM-Attention (Sinha et al., 2019) and GSM (with a BiLSTM encoder) (Tian et al., 2021), on the (original) CLUTRR test dataset. Except for our method, all other models are trained on a specific split of the CLUTRR training dataset.

Table 8 compares the accuracy of our method and the state-of-the-art method DeepProbLog (Manhaeve et al., 2021) on the CLUTRR-S test dataset. With GPT-3(d2)+ASP on the CLUTRR-S dataset, 550 out of 563 predictions are correct and 13 are wrong. All errors occur because the entities in a relation are swapped. For example, we use "son(A,B)" to represent "A has a son B," while GPT-3 text-davinci-002 responded with "son(Robert,Ryan)" for the sentence "Robert is Ryan's son." On the other hand, text-davinci-003 performed better, with only a single error and 562 out of 563 predictions being correct.

C Prompts for Semantic Parsing
Below, we present the general prompting strategies that we summarized and applied in this paper, followed by some examples.
1. If the information in a story (or query) can be extracted independently, parsing each sentence separately (using the same prompt multiple times) typically works better than parsing the whole story. Since one usually caches all GPT-3 responses to save cost by avoiding duplicate GPT-3 requests for the same prompt, parsing each sentence separately also yields better usage of cached responses. Below are some examples.
• In most bAbI tasks (except for tasks 11 and 13), the sentences in a story (including the query sentence) are independent of each other. We parse each sentence separately using GPT-3, as in Appendix C.1.

• In the StepGame dataset, each sentence in a story describes the spatial relation between 2 objects. There are 4 sentences in a story when k = 1 and about 20 sentences when k = 10. If we ask GPT-3 to extract all the atomic facts from the whole story, it always misses some atoms or predicts wrong atoms. Since every sentence is independent of the others, as shown in Figure 1, we use the prompt in Appendix C.2 multiple times for each data instance, where each time [INPUT] is replaced with one sentence from the story or the query. This yields much higher accuracy, as shown in Section 4.3.

However, if some sentences in a story are dependent, splitting them may lead to unexpected results in the GPT-3 response. Below are some examples.
• In bAbI tasks #11 and #13, a story may contain two consecutive sentences such as "Mary went back to the bathroom. After that she went to the bedroom." There is a dependency between the sentences: understanding that "she" in the second sentence refers to "Mary" in the first. For this reason, task #11 stories are parsed as a whole, and similarly for task #13.

• In the CLUTRR dataset, a story may contain sentences with coreferences, such as "Shirley enjoys playing cards with her brother. His name is Henry.", where the latter sentence depends on the former, and the family relation can be correctly extracted only from both sentences together. Thus, for the CLUTRR datasets (i.e., CLUTRR 1.0, CLUTRR 1.3, and CLUTRR-S), we extract the family relations and gender relations from the whole story.
2. There is certain commonsense knowledge that GPT-3 is not aware of, and describing the missing knowledge in the prompt works better than adding examples only. This happens when GPT-3 cannot generalize such knowledge well from a few examples.

C.2 StepGame

Sentence: What is the relation of the agent X to the agent K?
Semantic Parse: query("X", "K").
Sentence: H is positioned in the front right corner of M.
Semantic Parse: top_right("H", "M").

Sentence: F is on the left side of and below Q.
Semantic Parse: down_left("F", "Q").

Sentence: Y and I are parallel, and Y is on top of I.
Semantic Parse: top("Y", "I").

Sentence: V is over there with T above.
Semantic Parse: top("T", "V").

Sentence: V is slightly off center to the top left and G is slightly off center to the bottom right.
Semantic Parse: top_left("V", "G").

Sentence: The objects S and A are over there. The object S is lower and slightly to the left of the object A.
Semantic Parse: down_left("S", "A").

Sentence: D is diagonally below Z to the right at a 45 degree angle.
Semantic Parse: down_right("D", "Z").

Sentence: O is there and C is at the 5 position of a clock face.
Semantic Parse: down_right("C", "O").

Sentence: If H is the center of a clock face, B is located between 10 and 11.
Semantic Parse: top_left("B", "H").

C.3 CLUTRR
For the CLUTRR dataset, there are two prompts, which extract the family relations and the genders from a story, respectively. All example stories in both prompts are from the training data "data_06b8f2a1/2.2,2.3_train.csv" in the original CLUTRR dataset. Below is the prompt to extract family relations from a story, where [Input] at the end is replaced with the story in each test data instance; each example story in the prompt is followed by its semantic parse, consisting of atomic facts such as sister("Verdie", "Amanda"). daughter("Henry", "Amanda"). grandfather("Amanda", "Kyle").
Story: [Michelle] was excited for today, its her daughter's, [Theresa], spring break. She will finally get to see her.
[Kristen] loved to care for her newborn child [Ronald].
Story: [Vernon] was present in the delivery room when his daughter [Raquel] was born, but when his daughter [Constance] was born he was too sick.
[Vernon] and his daughter [Margaret] went to the movies.
Story: [Raquel] and her brother [Casey] took her grandmother [Karen] to the store to buy a new dress.
[Karen] and her husband [Kyle] just celebrated 10 years of marriage.
[Allen]'s brother [Arthur] came home from school, so she baked some extra for him, too.

Story: [Input]
Semantic Parse:

We also use a variant of the above prompt to extract the gender of each person in a story. The prompt context is a bit simpler, as there are only two genders. The examples are the same, while the Semantic Parse result is simply replaced with the atomic facts about gender information. Below is the prompt to extract the gender of each person in a story, where [Input] is replaced with the story in each test data instance.
Given a story, extract atomic facts of the form male("Person") or female("Person") for every person that appears in the sentences.
Story: [Verdie] waved good bye to her dad [Henry] for the day and went next door with her sister [Amanda].
Story: [Michelle] was excited for today, its her daughter's, [Theresa], spring break. She will finally get to see her.
[Kristen] loved to care for her newborn child [Ronald].
Story: [Vernon] was present in the delivery room when his daughter [Raquel] was born, but when his daughter [Constance] was born he was too sick.
[Vernon] and his daughter [Margaret] went to the movies.
Story: [Shirley] and [Edward] are siblings and best friends. They do everything together.
[Henry] walked his daughters [Amanda] and [Michelle] to school.
Story: [Raquel] and her brother [Casey] took her grandmother [Karen] to the store to buy a new dress.
[Karen] and her husband [Kyle] just celebrated 10 years of marriage.
[Allen]'s brother [Arthur] came home from school, so she baked some extra for him, too.
Story: [Karen] was spending the weekend with her grandson, [Eddie].
[Eddie]'s sister [Michelle] was supposed to come too, but she was busy and couldn't make it.
[Theresa] took her daughter, [Michelle], out to High Tea yesterday afternoon.

Story: [Input]
Semantic Parse:

Note that although the sentences in the CLUTRR-S dataset are much simpler than those in the CLUTRR dataset, we don't achieve 100% accuracy in the GPT-3 responses with the above long prompt. This is partially because the above prompt violates prompting strategy 3 in Section 3, as the order of names in a binary relation in the example sentences mostly follows "relationOf(A,B)" instead of "relation(B,A)".

C.4 gSCAN
For the gSCAN dataset, there is only one prompt, shown below, to extract the command in each data instance. All example sequences are from the training data. The [Input] at the end of the prompt is replaced with the command in each test data instance.
Please parse each sequence of words into facts.

C.5 Pick&Place
For the Pick&Place dataset, there are two prompts, shown below, to extract the atomic facts from the initial state and the goal state, respectively.
Turn each sentence into an atomic fact of the form on(A, B, 0).
Sentence: The red block is on the yellow bowl.Semantic Parse: on("red block", "yellow bowl", 0).

Sentence: [INPUT]
Semantic Parse:

Turn each sentence into an atomic fact of the form on(A, B).
Sentence: The red block is on the yellow bowl.
Semantic Parse: on("red block", "yellow bowl").
Sentence: The violet block is on the blue block.
Semantic Parse: on("violet block", "blue block").

Sentence: [INPUT]
Semantic Parse:

For each sentence in the initial or goal state, we replace [INPUT] in the corresponding prompt above with the sentence and request GPT-3 to extract a single atomic fact. The union of the atomic facts extracted from all sentences is then used by the symbolic reasoning module to find an optimal plan. For the GPT-3 baseline, we use the following prompt to let GPT-3 directly find a plan, where [INPUT] at the end of the prompt is replaced with the initial and goal states of the queried data instance.
Find a shortest plan to move blocks from an initial state to a goal state. Note that you cannot move a block if anything is on it. You cannot move a block onto a target block or bowl if anything is on the target block or bowl. At most two blocks can be placed in the same bowl, with one on top of the other.

# Initial State:
Nothing is on the green bowl.
The violet block is on the blue bowl.
The blue block is on the violet bowl.
The green block is on the blue block.

# Goal State:
The violet block is on the green bowl.
The green block is on the violet block.
The blue block is on the blue bowl.
Nothing is on the violet bowl.

Plan:
1. Move the violet block onto the green bowl.
2. Move the green block onto the violet block.
3. Move the blue block onto the blue bowl.

# Initial State:
Nothing is on the blue bowl.
The yellow block is on the green bowl.
The green block is on the violet bowl.
The violet block is on the green block.
The blue block is on the yellow bowl.
The red block is on the blue block.

# Goal State:
The yellow block is on the blue bowl.
The green block is on the yellow block.
The red block is on the green bowl.
Nothing is on the violet bowl.
The blue block is on the yellow bowl.
The violet block is on the blue block.

Plan:
1. Move the yellow block onto the blue bowl.
2. Move the red block onto the green bowl.
3. Move the violet block onto the blue block.
4. Move the green block onto the yellow block.

[INPUT]
Plan:

D GPT-3 Errors in Semantic Parsing
In this section, we group and record the errors in the GPT-3 responses in tables, where each row records a 3-tuple ⟨dataset, sentence(s), GPT-3 response⟩.
We list the following kinds of errors: argument misorder, wrong relation, and missing or incorrect co-reference.

D.1 Argument misorder
A common mistake in the GPT-3 responses is that the relation and arguments of an atom are correctly extracted, but the order of the arguments is incorrect. Such mistakes can be greatly alleviated by proper few-shot prompting, where the order of the arguments in the example target atoms follows their order in the stories. There are only 3 errors in CLUTRR 1.3 due to argument misorder. The first 2 mistakes are actually due to missing periods at the end of the sentences: if we simply add the periods back, the GPT-3 responses become correct.

D.2 Wrong relation
Sometimes the arguments are correct, but the relations extracted by GPT-3 are incorrect or cannot be recognized by the ASP program. These kinds of mistakes may be resolved by restricting the space of possible relations. For example, the mistakes in the first four rows can be resolved by simply adding the sentence "Use spouse("Person", "Person") if two persons are couples." to the prompt.

E ASP Knowledge Modules

The following are excerpts from the DEC and location knowledge modules.

% -released_at/2
% 1. -released_at(F, T) means commonsense law of inertia (CLI) can be
%    applied to fluent F at T
% 2. CLI is also applied to this literal itself
-released_at(F, 0) :- fluent(F).

% holds_at/2
% initial states of fluents -- only location of items needs to be guessed
{holds_at(at(I, L), 0): location(L)} = 1 :- item(I).
holds_at(on(B, L), 0) :- on(B, L, 0).

% happens/2
% for each timepoint, at most 1 event happens

% derive the location of every object
% the search space of X or Y coordinate is within -100 and 100
% (to avoid infinite loop in clingo when data has error)
nums(-100..100).

E.5 Domain-Specific Modules
In this section, we list all domain-specific rules for each task. Some rules serve as an interface that turns the atoms in GPT-3 responses into the general format used in the ASP modules. These rules are not necessary and could be removed if we let GPT-3 directly return the general atoms, e.g., "query(at(A, where))" instead of "whereAgent(A)". To save the cost of GPT-3 requests, we did not reproduce the experiments using new GPT-3 prompts with atoms in the general format.
For example, the location module contains rules for spatial reasoning in a 2D grid space and is used for bAbI, StepGame, and gSCAN. The main rule in the location module computes the location (Xa,Ya) of object A from the location (Xb,Yb) of object B by adding the offsets (Dx,Dy) defined by the spatial relation R between A and B. The location module also includes 9 predefined offsets, e.g., offset(left,-1,0), that can be used to model multi-hop spatial relations between objects or the effects of a robot's movement in a 2D space. For example, queries in StepGame ask for the spatial relation R of object A to B. Using the location module, one can fix B's location to (0,0) and compute the spatial relation R based on the location of A as follows.

location(B, 0, 0) :- query(A, B).
answer(R) :- query(A, B), location(A, X, Y), offset(R, Dx, Dy),
             Dx=-1 : X<0; Dx=0 : X=0; Dx=1 : X>0;
             Dy=-1 : Y<0; Dy=0 : Y=0; Dy=1 : Y>0.

Figure 2: The knowledge modules at the bottom are used in each task on the top.

Figure 4: A simple plan predicted by GPT-3+ASP in the Pick&Place domain.



Table 9: Knowledge modules used for each of the tasks. Note that the DEC axioms, action, and location modules are each used in at least two datasets. Some domains aren't listed, as they are small and domain-specific.