Don’t Generate, Discriminate: A Proposal for Grounding Language Models to Real-World Environments

A key missing capacity of current language models (LMs) is grounding to real-world environments. Most existing work on grounded language understanding uses LMs to directly generate plans that can be executed in the environment to achieve the desired effects, thereby casting the burden of ensuring grammaticality, faithfulness, and controllability all on the LMs. We propose Pangu, a generic framework for grounded language understanding that capitalizes on the discriminative ability of LMs instead of their generative ability. Pangu consists of a symbolic agent and a neural LM working in a concerted fashion: the agent explores the environment to incrementally construct valid plans, and the LM evaluates the plausibility of the candidate plans to guide the search process. A case study on the challenging problem of knowledge base question answering (KBQA), which features a massive environment, demonstrates the remarkable effectiveness and flexibility of Pangu: a BERT-base LM is sufficient for setting a new record on standard KBQA datasets, and larger LMs bring further substantial gains. Pangu also enables, for the first time, effective few-shot in-context learning for KBQA with large LMs such as Codex.


Introduction
Language models (LMs) such as BERT (Devlin et al., 2019), GPT-3 (Brown et al., 2020), and Codex (Chen et al., 2021a) have demonstrated an extraordinary ability to understand and generate both natural language (Minaee et al., 2021; Liang et al., 2022) and generic programs such as Python (Jain et al., 2022; Austin et al., 2021). The recent release of ChatGPT is elevating this paradigm to a new level. It seems to point us towards a future where natural language serves as a universal device, powered by LMs, for automated problem solving and interacting with the (computing) world.

Footnote 1: Code and data will be released at https://github.com/dki-lab/Pangu. Footnote 2: chat.openai.com

Figure 1: A schematic illustration of the proposed framework, Pangu, where a symbolic agent interacts with the target environment to propose candidate plans, and a neural LM evaluates the plausibility of each plan based on the input utterance. The agent searches in the environment to incrementally construct the plans, and the LM guides the search process.
However, a key missing piece in realizing this future is the connection between LMs and real-world environments, including both digital environments (e.g., databases, knowledge bases, Excel spreadsheets, software, and websites) and physical environments (e.g., instruction-following robots in household settings (Shridhar et al., 2020)). Such environments are where many real problems lie. For example, a biologist may need to find all the species of a certain butterfly genus and their geographic distribution from a biology knowledge base, a local grocery store owner may want to visualize the historical sales of different item categories in Excel to decide what and how much to restock before the holiday season, and a physician may need to find patients in a large database of electronic medical records who exhibited a rare combination of symptoms to inform the current diagnosis. How can LMs enable solving all these problems, which involve seeking information or taking actions in a specific environment, with natural language?
Each environment is a unique context for interpreting natural language requests from users. Grounding, i.e., linking (natural language) concepts to contexts (Chandu et al., 2021), therefore becomes the fundamental problem. More precisely, we need to produce a plan that can be executed in an environment to achieve the desired effects of the corresponding language request. When a plan is described in a formal language (e.g., SQL for relational databases (Yu et al., 2018) or APIs for web services (Su et al., 2017; Andreas et al., 2020)), it is also called a program. The unique challenge of such grounded language understanding problems stems from 1) the vast heterogeneity of environments and their planning languages (e.g., SQL, GraphQL/REST APIs, λ-calculus, and robot planning languages), and 2) the vast, oftentimes infinite, number of possible instantiations (or states) of each environment. Some environments can also be dynamic (e.g., a database that is constantly updated or a physical environment with moving objects).
Most existing methods for grounded language understanding follow the popular sequence-to-sequence framework (Sutskever et al., 2014; Cho et al., 2014) and generate the plans/programs in a left-to-right autoregressive fashion (Wang et al., 2021; Xie et al., 2022; Ye et al., 2022; Liu et al., 2022a). A core thesis of this paper is that directly generating plans may not be the optimal way of using LMs for grounded language understanding. It requires LMs to have intimate knowledge about each specific planning language and specific environment to ensure the grammaticality (i.e., conforming to the grammar of the planning language) and faithfulness (i.e., executability in the environment) of the generated plans, neither of which may be part of an LM's pre-training. The infinite and dynamic environment states also reduce the potential effectiveness of pre-training for improving faithfulness, even if one manages to perform such pre-training. Furthermore, autoregressive generation with a neural LM lacks fine-grained control over planning; it is cumbersome, though not impossible, to factor preferences, business logic, and other values and constraints into the plan generation process. The focus of recent work is to alleviate (some of) these three limitations by augmenting the autoregressive generation paradigm with environment-specific pre-training (Yu et al., 2021; Deng et al., 2021) or constrained decoding (Scholak et al., 2021; Shin et al., 2021). However, the fundamental challenges still largely remain. Mathematically, an LM is simply a joint distribution p(x_1, x_2, ..., x_n) that factors as a product of conditional distributions ∏_{i=1}^{n} p(x_i | x_1, ..., x_{i−1}). Existing work leverages the conditional distribution formulation to generate or sample the target plan. It thus casts the burden of ensuring grammaticality, faithfulness, and controllability all on the LM itself.
The main proposal of this paper is to disentangle LMs from these responsibilities and let LMs be what they are originally defined as-a model that assigns a probability to a sequence of tokens. In other words, we advocate for using the joint distribution formulation of LMs to evaluate the plausibility of (utterance, candidate plan) pairs instead of directly generating the plan.
To this end, we propose Pangu, a generic framework for grounded language understanding that capitalizes on the discriminative ability of LMs instead of their generative ability (Figure 1). Pangu consists of a symbolic agent and a neural LM working in a concerted way. The symbolic agent operates in the environment to propose candidate plans, which are guaranteed by design to be both grammatical and faithful. If the environment is tiny, it may be possible to enumerate all candidate plans up front and let the LM select the best one. For most realistic environments, however, due to the size of the search space or partial observability, it is necessary for the agent to search in the environment and incrementally extend or refine the plans. The LM plays a key role in this search process: it evaluates the candidate (partial) plans at each search step and guides the agent towards promising search directions for the next step. The LM also determines when the search ends, i.e., when no further extension of the current best plan would score higher. Finally, it is much easier to control the search process of a symbolic agent than the generation process of a neural LM. For example, one can easily define a list of disallowed actions for a certain search state and prevent the agent from proposing such plans.
As a case study, we instantiate the proposed framework for complex question answering over knowledge bases (KBQA). KBQA provides an ideal testbed for grounded language understanding because of its massive environment. We show that simply using BERT-base with Pangu is sufficient for achieving a new state of the art on standard KBQA benchmarks, and using larger LMs further improves the performance by a large margin. Pangu also enables, for the first time, prompting large language models (e.g., Codex) for few-shot KBQA with competitive performance. It provides unprecedented uniformity in using LMs: one can easily plug encoder-only LMs (e.g., BERT), encoder-decoder LMs (e.g., T5 (Raffel et al., 2020)), or decoder-only LMs (e.g., Codex) into Pangu. These results highlight the remarkable effectiveness and flexibility of Pangu and validate the proposal of using LMs for discrimination instead of generation for grounded language understanding.

Generation for Grounded Language Understanding
The Seq2Seq framework (Sutskever et al., 2014; Bahdanau et al., 2015) has been the de facto choice for grounded language understanding, where the LM directly generates a plan over the environment given an input utterance. However, due to the lack of grounding during pre-training, generating valid plans with an LM is challenging. Recent studies endeavor to alleviate this issue via inference-time augmentation or pre-training augmentation. Two families of inference-time augmentation have been proposed so far, namely input augmentation and constrained decoding. For input augmentation, the environment, or a representative subset of it, is fed to the LM's encoder together with the utterance (Hwang et al., 2019; Xie et al., 2022). Such methods rely on the LM itself to understand the environment and output a plan, and are thus data-inefficient and offer no guarantee of grammaticality or faithfulness. In contrast, several recent methods use constrained decoding to regulate the decoder's behavior, which can guarantee grammaticality (Scholak et al., 2021; Shu et al., 2022) or even faithfulness (Liang et al., 2017). However, such methods still cast the burden of generating grammatical and faithful plans on the LM itself. In our proposal, the LM is only used to discriminate candidate plans proposed by an agent that explores the environment. Different from inference-time augmentation, pre-training augmentation methods seek to reduce the gap between LMs' pre-training settings and grounded language understanding tasks by performing environment-specific pre-training (Yu et al., 2021; Deng et al., 2021). However, these methods need a large set of environment-specific data for pre-training (e.g., an aligned corpus of table schemata and utterances (Yu et al., 2021)) and do not generalize to different environments (e.g., LMs with table-specific pre-training are not applicable to knowledge bases). By contrast, Pangu requires no pre-training and can be easily plugged into different environments.

Few-Shot Grounded Language Understanding with LLMs

The use of Large Language Models (LLMs) such as GPT-3 (Brown et al., 2020) and Codex (Chen et al., 2021a) for grounded language understanding tasks has attracted increasing interest recently.
LLMs have demonstrated strong few-shot learning capabilities in a variety of environments, from writing programs to query and manipulate structured and unstructured data sources (Austin et al., 2021; Rajkumar et al., 2022; Cheng et al., 2022), to interacting with mobile UIs and online websites (Gur et al., 2022; Nakano et al., 2021), to generating procedural plans and guiding embodied agents in virtual environments (Singh et al., 2022; Ahn et al., 2022; Shah et al., 2022; Song et al., 2022). Most existing work still capitalizes on the generative ability of LLMs. A common strategy for encouraging an LLM to produce valid plans is to directly describe the environment as context to the LLM (i.e., input augmentation), which is difficult for more complex environments like KBs. In contrast, Pangu shields the LLM from the complexity of the environment and lets the LLM focus on evaluating the plausibility of candidate plans proposed by an agent. One interesting related work is Ahn et al. (2022), where an LLM is used to score atomic action (skill) proposals, which are guaranteed to conform to affordance constraints, from an embodied agent. Our framework shares a similar spirit of using LMs for discrimination, but we support more complex plans through a search process in the environment guided by an LM.

Bottom-Up Semantic Parsing
Our instantiation of Pangu on KBQA is closely connected to bottom-up semantic parsing, particularly SmBoP (Rubin and Berant, 2021), a text-to-SQL model that iteratively constructs a complex plan from a set of sub-components. At each step of parsing, SmBoP enumerates candidate parse trees (i.e., plans) from all valid combinations of trees from the previous step and scores them to obtain the top-ranked ones. Pangu similarly constructs a complex plan incrementally from smaller sub-plans, but makes the following three departures. First, SmBoP requires all ingredients (i.e., column headers, table names, and DB values) at the beginning of parsing. This assumption does not generally hold for more complex or partially observable environments, where ingredients need to be discovered through search. In our implementation, only topic entities are required at the beginning, which can be readily obtained using an entity linker. Second, our scoring function is based on a straightforward application of LMs, while SmBoP uses a more intricate architecture with extra parameters. Finally, SmBoP parses for a fixed number of steps T, while Pangu adaptively terminates the search process based on the scores from the LM. Also related is an array of earlier KBQA methods that adopt an enumerate-and-rank approach (Yih et al., 2015; Chen et al., 2019; Lan and Jiang, 2020). Because they try to enumerate all candidate plans up front, the maximum complexity of the plans has to stay small. Our adaptive search process allows for the flexible construction of more complex plans.

The Pangu Framework: Overview
An overview of the Pangu framework is presented in Algorithm 1. An overarching assumption of Pangu is that a complex plan can be incrementally constructed by an agent through its exploration of an environment. Such an agent can be a robot doing household tasks in a physical environment (Shridhar et al., 2020), or a virtual agent that orchestrates API calls of different web services (Andreas et al., 2020) or traverses a database/knowledge base (KB) (Yu et al., 2018). Starting from a set of initial plans P_0 (possibly empty), at each step, the agent interacts with the environment E to extend the current plans into a new set of candidate plans (line 4). The candidate plans are guaranteed to be valid (i.e., both grammatical and faithful). An LM then scores the candidate plans, and the top K (the beam size) plans are retained for further exploration in the next step (line 5). The same procedure loops until a termination check is passed (line 6); the best plan is then returned. Pangu mainly shines in that a parameter-free symbolic agent explores the environment to propose valid plans and shields the LM from having to handle the large search space for plan generation. Instead, the LM only focuses on evaluating the plausibility of the proposed plans, i.e., to what extent a candidate plan matches the intent of the input utterance. An LM can easily be fine-tuned to excel at this assignment, or, in the case of LLMs such as Codex, it comes with such ability out of the box, which enables few-shot in-context learning. Pangu is a generic framework and can potentially accommodate many grounded language understanding tasks by instantiating the various functions in Algorithm 1 accordingly.
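The loop of Algorithm 1 can be sketched in a few lines. The `propose` and `score` callbacks below are hypothetical stand-ins for the symbolic agent's plan-extension step and the LM's plausibility score; this is an illustration of the search procedure under those assumptions, not the released implementation.

```python
# A minimal sketch of Algorithm 1 (hypothetical interfaces, not the actual code):
# `propose(plans, env)` stands in for the agent step that extends current plans
# into valid candidates; `score(utterance, plan)` stands in for the LM's score.

def pangu_search(utterance, env, propose, score, initial_plans, beam_size=5):
    """Beam search where the agent proposes valid plans and the LM ranks them."""
    beam = list(initial_plans)
    best_plan, best_score = None, float("-inf")
    while True:
        candidates = propose(beam, env)  # assumed grammatical and faithful by design
        if not candidates:
            break
        scored = sorted(candidates, key=lambda c: score(utterance, c), reverse=True)
        top_score = score(utterance, scored[0])
        if top_score <= best_score:      # termination: no candidate improves the best
            break
        best_plan, best_score = scored[0], top_score
        beam = scored[:beam_size]        # keep the top-K plans for the next step
    return best_plan
```

The termination test mirrors the check described in the Termination Check section: the search stops as soon as the best candidate at the current step no longer outscores the best plan found so far.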

KBQA: Preliminaries
Without loss of generality, we use KBs as our target environment and the knowledge base question answering (KBQA) task as a concrete example for ease of discussion. It is an ideal testbed because of the massive environment provided by modern KBs (e.g., FREEBASE (Bollacker et al., 2008) contains 45 million entities and 3 billion facts covering over 100 domains), which makes grounding particularly challenging. Given a KB K ⊂ E × R × (E ∪ L ∪ C), where C is a set of classes, E a set of entities, L a set of literals, and R a set of binary relations, the task of KBQA is to find a set of answer entities in the KB for an input utterance. KBQA is typically modeled as semantic parsing, where the utterance is mapped to an executable plan/program in a certain formal language (e.g., SPARQL, λ-calculus, or S-expressions) whose denotation is the answer. We use S-expressions (Gu et al., 2021) for their compactness. An example is shown in Figure 2.

Candidate Plan Enumeration
To handle the large search space, the agent casts the task as a step-wise decision-making problem.
For example, an instruction-following robot may decompose a plan into a sequence of subplans (e.g., making a cup of coffee entails first finding a cup, then picking up the cup, etc.; Song et al. (2022)). Similarly, a program for KBQA can be decomposed into a nested sequence of subprograms (Rubin and Berant, 2021) (Figure 2). The length of a plan is defined as the number of atomic subplans it contains.

Figure 2: (a) An illustration of how an agent collaborates with an LM to incrementally produce a complex target plan over a KB using beam search (beam size = 1 in this example). At each step, the agent enumerates a set of valid plans based on the current plans and the environment. An LM then scores the candidate plans and returns the top-ranked ones. The search process terminates when there is no candidate plan that scores higher than the current best plan (e.g., 4a-c are all worse than 3c). (b) Using different LMs (left: BERT, right: Codex) to evaluate the plausibility of plan 2a. This resembles using LMs for semantic matching between the utterance and the plan.
For KBQA (and similar semantic parsing tasks), P_0 can be a set of entity proposals (e.g., {Java}) obtained using off-the-shelf entity linkers. At step t, the agent considers P_{t−1}, the length-(t−1) plans, and decides how to further extend them into C_t, the valid plans of length t, based on the environment. This often involves executing the current plans in the environment. Consider the example in Figure 2: at t = 1, the agent finds all the relations connected to Java and enumerates all the valid length-1 plans. The LM scores the candidate plans and prunes all but the top-ranked plan because the beam size is 1. At t = 2, the agent executes plan 1c to get its denotation (i.e., a set of entities) in the KB, based on which the agent further discovers the relations and classes (e.g., ComputerEmulator, ComputerSoftware, and ReadBy) connected to those entities to form valid length-2 plans. All the plans produced in this process are guaranteed to be valid. See Appendix A for a more detailed discussion of this process.
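The step-wise enumeration can be illustrated on a toy KB. The triples, `execute`, and `extend` below are simplified stand-ins (real plans are S-expressions and real KBs hold billions of facts); the point is that candidates are derived only from relations actually reachable from the current plans' denotations, so every proposal is faithful by construction.

```python
# A toy illustration (not the actual S-expression machinery) of how the agent
# enumerates valid length-t plans: it executes current plans against the KB and
# proposes only relations that actually leave the plans' denotations.

TOY_KB = {  # (subject, relation, object) triples of a hypothetical miniature KB
    ("Java", "emulated_by", "JPC"),
    ("Java", "written_in", "C"),
    ("JPC", "type", "ComputerEmulator"),
}

def execute(plan, kb):
    """Denotation of a plan: entities reached by following its relation chain."""
    entities = {plan[0]}                 # plan = (topic entity, rel_1, ..., rel_t)
    for rel in plan[1:]:
        entities = {o for (s, r, o) in kb if s in entities and r == rel}
    return entities

def extend(plans, kb):
    """Valid length-t plans: extend each plan with relations leaving its denotation."""
    candidates = []
    for plan in plans:
        frontier = execute(plan, kb)
        for (s, r, o) in kb:
            if s in frontier:
                candidates.append(plan + (r,))
    return candidates
```

Because `extend` only proposes relations observed in the KB for the current frontier, an ungrammatical or unexecutable plan can never enter the candidate set.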

LM-Based Scoring
After the agent enumerates a set of candidate plans, an LM assists with its decision making by evaluating the plausibility of each candidate plan. The interface for evaluating a plan with an LM resembles sentence-pair classification: given a pair (u, c), where u is the utterance and c ∈ C_t is a candidate plan, the LM acts as a scoring function S(u, c) → R that indicates to what extent the candidate plan matches the intent of the utterance. Evaluating the plausibility of each candidate thus largely boils down to semantic matching based on linguistic cues. For example, an LM should be able to tell that 2a in Figure 2 is a positive candidate due to the keyword ComputerEmulator in its surface form.
We follow the common practice of using LMs for semantic matching. For encoder-only LMs like BERT, we directly obtain a score from the representation of the [CLS] token (Figure 2(b)). For encoder-decoder LMs like T5, we follow Zhuang et al. (2022) and feed both the utterance and the candidate plan to the encoder, using the decoding probability of a token unused during pre-training as a proxy for the matching score. For decoder-only LMs like Codex, we model the score as the probability of generating the candidate plan conditioned on the utterance, i.e., P(c|u).
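The decoder-only variant of this scoring interface can be sketched as follows. The `token_logprob` callback is a hypothetical stand-in for a real LM's per-token log-probabilities (decoder-only APIs such as Codex expose these); summing them over the plan tokens gives log P(c|u).

```python
import math

# Sketch of the decoder-only scoring interface S(u, c) = log P(c | u).
# `token_logprob(context, token)` is an assumed callback standing in for a real
# LM that returns the log-probability of `token` given the context so far.

def score_plan(utterance, plan_tokens, token_logprob):
    """Sum the conditional log-probabilities of the plan tokens given the utterance."""
    context = list(utterance.split())
    total = 0.0
    for tok in plan_tokens:
        total += token_logprob(context, tok)
        context.append(tok)          # condition later tokens on earlier ones
    return total
```

Under any reasonable `token_logprob`, a plan whose tokens echo the utterance's cues (e.g., ComputerEmulator for an emulator question) receives a higher score than an unrelated plan.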
Intuitively, a good scoring function should respect the following partial order:

S(u, c_1) > S(u, c_2), for all c_1 ∈ G_t and c_2 ∈ C_t \ G_t,
S(u, c_1) > S(u, c_2), for all c_1 ∈ G_t and c_2 ∈ G_{t−1},

where G_t is the set of gold (sub-)plans at step t (i.e., the length-t sub-plans of the target plan). In other words, a gold sub-plan should be scored higher than 1) any negative (i.e., non-gold) plan at the same step (e.g., 2a should be scored higher than 2c), because negative plans contain information irrelevant to u, and 2) any gold sub-plan of length < t (e.g., 2a should be scored higher than 1c), because shorter sub-plans are less complete. In addition, the gold target plan should be scored higher than any other plan.

Termination Check
Assuming the LM can assign reasonable scores to candidate plans following the above partial order, we can naturally define the termination condition in Algorithm 1: the search terminates if the highest score of the candidate plans at step t is lower than the highest score at step t − 1, which ideally indicates that no candidate plan of length ≥ t is more complete than the plans at step t − 1. It is worth noting that this may not be the only way of checking termination for other grounded language understanding tasks. For example, an instruction-following robot may check the environment state (e.g., whether a cup of coffee has been successfully made) for that purpose.

Learning
We discuss the learning procedure for both fine-tuning LMs (e.g., BERT and T5) and in-context learning with LLMs (e.g., Codex). For both settings, we use pairs of utterances and gold plans for supervision.

Fine-Tuning
Given a gold plan of length T, we first derive its gold sub-plans G_t for each step t ≤ T (e.g., 1c for step 1 and 2a for step 2 in Figure 2). Fine-tuning proceeds with beam search similar to the test-time behavior, but with bottom-up teacher forcing (Williams and Zipser, 1989; Rubin and Berant, 2021), i.e., the gold plans of the current step are always inserted into the beam. At each step of beam search, we obtain the probability of each candidate plan c ∈ C_t with a softmax over the scores:

p(c) = softmax{S(u, c)}_{c ∈ C_t ∪ G_{t−1}}.

G_{t−1} is also included here to encourage the LM to explicitly learn the partial order by minimizing the following loss:

L = −(1/Z) Σ_{c ∈ C_t ∪ G_{t−1}} p̂(c) log p(c),

where Z is the total number of summed items, and the target label p̂(c) equals 1 if c ∈ G_t and 0 otherwise. Note that, for step T + 1, we let G_{T+1} = G_T; this additional step enforces the third condition of the partial order (the target plan outscoring all other plans). Our objective is essentially a listwise learning-to-rank objective based on cross entropy (Cao et al., 2007).
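The objective can be written out in plain Python. This is a sketch of the listwise cross-entropy described above, with `is_gold` playing the role of the binary target labels; batching, teacher forcing, and the LM itself are omitted.

```python
import math

# A plain-Python sketch of the listwise learning-to-rank objective: softmax over
# the scores of the candidates in C_t ∪ G_{t-1}, then cross-entropy against
# binary gold labels (1 for gold sub-plans in G_t, 0 otherwise), normalized by
# the number of summed items.

def listwise_loss(scores, is_gold):
    """scores: list of S(u, c); is_gold: parallel list of 0/1 gold labels."""
    m = max(scores)                                  # stabilize the softmax
    exp = [math.exp(s - m) for s in scores]
    z = sum(exp)
    probs = [e / z for e in exp]
    n = len(scores)
    return -sum(y * math.log(p) for y, p in zip(is_gold, probs)) / n
```

Minimizing this loss pushes the scores of gold (sub-)plans above those of every other candidate in the list, which is exactly the partial order the scorer is meant to learn.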

In-Context Learning
For in-context learning, we directly use pairs of utterances and gold plans as in-context examples to the LLM, with a simple task instruction in the prompt: "Please translate the following questions to lisp like programs." The LLM is therefore expected to capture the desired partial order by observing the in-context examples. For concrete examples of prompts, please refer to Appendix B.
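A minimal prompt builder might look like the following. The exact prompt layout is given in Appendix B; the format below is an illustrative assumption that reuses the task instruction quoted above.

```python
# An illustrative prompt builder for in-context learning. The layout (Question/
# Program lines) is an assumption for illustration; see Appendix B of the paper
# for the actual prompts.

INSTRUCTION = "Please translate the following questions to lisp like programs."

def build_prompt(examples, question):
    """examples: list of (utterance, gold_plan) pairs retrieved for this question."""
    lines = [INSTRUCTION, ""]
    for utterance, plan in examples:
        lines.append(f"Question: {utterance}")
        lines.append(f"Program: {plan}")
        lines.append("")
    lines.append(f"Question: {question}")
    lines.append("Program:")        # the LLM scores candidate completions here
    return "\n".join(lines)
```

Note that in Pangu the LLM does not freely complete this prompt; the prompt conditions the LLM so that P(c|u) can be read off for each agent-proposed candidate c.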

Datasets
We experiment with three standard KBQA datasets of different scales and natures (Table 1). GRAILQA (Gu et al., 2021) is a large-scale dataset that evaluates three levels of generalization, namely i.i.d., compositional (novel compositions of seen constructs), and zero-shot (totally novel domains). It also features diverse questions of different complexity (e.g., programs may involve up to 4 relations) and aggregation functions (e.g., comparatives, superlatives, and counting). GRAPHQ (Su et al., 2016) is a moderate-scale dataset that focuses on non-i.i.d. generalization in KBQA. Due to the small size of its training set and the non-i.i.d. setting, GRAPHQ is a particularly challenging benchmark. In our experiments, we use the processed version of GRAPHQ released in prior work.

Baselines
We mainly compare Pangu with multiple state-of-the-art baselines that use LMs as a generative model for KBQA. Different LMs and decoding strategies are used in the baseline models. ArcaneQA is an encoder-decoder model built on top of a BERT encoder. It leverages constrained decoding and incrementally synthesizes a sequence of subprograms, where the constraints come from both the grammar and the execution of existing subprograms, to enforce grammaticality and faithfulness. TIARA (Shu et al., 2022) first uses BERT to retrieve a set of schema items, which are then fed, together with the question, to T5 for plan generation. It also applies constrained decoding, but only for grammaticality. DecAF similarly retrieves a relevant subgraph from the KB using DPR (Karpukhin et al., 2020), and then inputs the retrieved items to FiD (Izacard and Grave, 2021), a T5 model fine-tuned for question answering. RnG-KBQA (Ye et al., 2022) first uses BERT to rank a set of enumerated candidate programs (up to a limited complexity), and then uses T5 to edit the top programs into more complex programs. UnifiedSKG (Xie et al., 2022) also retrieves a subgraph from the KB as input to T5. The setting of UnifiedSKG differs from the other baselines: it assumes the gold schema items are always included in the retrieved subgraph and restricts the number of negative schema items in the subgraph (i.e., at most 20 schema items for GRAILQA). This makes the comparison less fair to the other methods, but we include it anyway because it is a representative way of autoregressive plan generation with a large LM.
There is no constrained decoding in the last three baseline methods, so neither grammaticality nor faithfulness can be guaranteed. These methods rely on labeled training data to learn how to produce valid programs; their sample efficiency suffers as a result (§5.3). A summary of the baselines can be found in Table 2. Compared with the baselines, Pangu requires no extra parameters, no modification to the LM, and no combination of multiple LMs (for retrieval/encoding/decoding). Pangu also provides unprecedented uniformity and flexibility in using LMs: one can easily plug LMs of different natures (encoder-only, encoder-decoder, or decoder-only) into our framework, and they work in a plug-and-play fashion.

Implementation Details
We instantiate Pangu for KBQA. For the fine-tuning experiments, we use BERT-base, T5-base, T5-large, and T5-3B, fine-tuned on the full training set of each dataset. For the in-context learning experiments, we use Codex. We first randomly sample 100 training examples from each dataset as the pool for dynamic retrieval (hence a 100-shot setting). During inference, for each test example, we retrieve 10 in-context examples from the pool using BM25-based utterance similarity.
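The BM25-based retrieval of in-context examples can be sketched with a minimal Okapi BM25 scorer. The tokenization (lowercased whitespace split) and the parameters k1 and b are assumptions for illustration; the paper's exact retrieval setup may differ.

```python
import math

# A minimal Okapi BM25 retriever for picking in-context examples by utterance
# similarity. Tokenization and parameters are illustrative assumptions.

def bm25_retrieve(pool, query, k=10, k1=1.5, b=0.75):
    """pool: list of utterance strings; returns the top-k most similar to query."""
    docs = [u.lower().split() for u in pool]
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    df = {}                                   # document frequency per term
    for d in docs:
        for term in set(d):
            df[term] = df.get(term, 0) + 1
    def score(d):
        s = 0.0
        for q in set(query.lower().split()):
            f = d.count(q)                    # term frequency in this document
            if f == 0:
                continue
            idf = math.log((n - df[q] + 0.5) / (df[q] + 0.5) + 1)
            s += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(d) / avgdl))
        return s
    ranked = sorted(range(n), key=lambda i: score(docs[i]), reverse=True)
    return [pool[i] for i in ranked[:k]]
```

At inference time, the top-k retrieved (utterance, gold plan) pairs would be inserted into the prompt as in-context examples for the test question.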
We use entity linking results from off-the-shelf entity linkers. For GRAILQA, we use the entity linking results from TIARA; for WEBQSP, we get them from ELQ, which is also used by our baseline models; and for GRAPHQ, we get them from ArcaneQA. The entity proposals for the input utterance form the initial plans (P_0) for our search process. We use beam size 5 for the fine-tuning experiments and beam size 1 for Codex. We run the T5-3B experiments on a single NVIDIA A100 80GB card, while all other fine-tuning experiments use 4× NVIDIA A6000 48GB cards.

[Table fragment: baseline results. UDEPLAMBDA 17.7; PARA4QA (Dong et al., 2017) 20.4; SPARQA (Sun et al., 2020) 21.5; BERT+Ranking (Gu et al., 2021) 25.0 (27.0); ArcaneQA 31 (truncated)]

Main Results
Fine-tuning results. The main results are shown in Table 3. Using a BERT-base LM, Pangu already achieves a new state of the art on GRAILQA and GRAPHQ, and only trails DecAF, which uses a 3B-parameter LM, on WEBQSP. On GRAILQA, Pangu with BERT-base outperforms the prior art DecAF by 5.3% in EM and 1.2% in F1. On GRAPHQ, Pangu with BERT-base dramatically improves the state-of-the-art F1 from 31.8% to 48.2%. This is strong evidence that Pangu is a better protocol for using LMs for grounded language understanding. Pangu's strong generalizability with limited training data is also confirmed by its performance on the zero-shot portion of GRAILQA: Pangu with BERT-base improves the previous best zero-shot F1 by 2.2%. Because GRAILQA has more training data, which gives models a better transfer learning opportunity for zero-shot generalization, our gain there is not as conspicuous as on GRAPHQ, but it is still decent. Our method also shows great flexibility in accommodating different LMs and a reliable return on model size: using increasingly larger LMs yields monotonically better results across the board, with T5-3B setting the new state of the art on all datasets. One interesting observation is that Pangu slightly underperforms on the i.i.d. subset of GRAILQA. It turns out that, because the discriminative task is much easier for LMs to learn than the generative task, Pangu converges very fast (at most two epochs for both BERT and T5) and thus takes fewer training steps, which reduces overfitting to the i.i.d. setting in exchange for better non-i.i.d. generalization. The strong performance on WEBQSP, an i.i.d. dataset, further supports this observation.
In-context learning results. For the first time, we show the feasibility of effective few-shot KBQA with LLMs. On GRAILQA, Pangu with Codex achieves an overall F1 of 53.3%. Though there is still a gap to the fine-tuning results, this is impressive given that only 10 in-context examples (from a pool of 100 training examples) are used, especially considering the massive meaning space of the KB. On GRAPHQ, Pangu with Codex even outperforms ArcaneQA. This finding is consistent with our fine-tuning experiments: Pangu is particularly strong at generalizing to new environments with limited training data. On WEBQSP, Pangu trails the fine-tuning methods, which is expected given WEBQSP's i.i.d. nature; fine-tuning methods can better memorize the patterns in the training data.

Decomposition by Question Complexity
We present a fine-grained analysis of Pangu with T5-3B and Codex on questions of different complexity, measured by the number of relations in the gold program, in Table 4. For GRAILQA, we report the performance on its dev set because the test set is hidden. Unsurprisingly, performance generally decreases as question complexity increases. Pangu performs competitively across all complexity levels. Note that there are only two questions in GRAILQA's dev set with 4 relations, so the results there may not be indicative. On GRAPHQ, Pangu significantly outperforms ArcaneQA: the F1 of Pangu with T5-3B is almost three times higher than ArcaneQA's on questions with 2 and 3 relations. Interestingly, Pangu with Codex (100-shot) also outperforms ArcaneQA considerably on questions with 2 and 3 relations. These findings suggest the superiority of Pangu in generalizing to more complex programs, but also leave room for further improvement, especially for in-context learning.

Sample Efficiency Analysis
Intuitively, using LMs for discrimination instead of generation makes the task easier for LMs and thus improves their sample efficiency. Our sample efficiency experiments in Figure 3 confirm this hypothesis. We randomly sample 1, 10, 100, and 1,000 training examples from GRAILQA's training data and report results on 500 random dev examples. We compare Pangu with ArcaneQA and UnifiedSKG using the same LMs. We use oracle entity linking for a more direct comparison with UnifiedSKG (though UnifiedSKG still has an unfair advantage, as previously mentioned).
In addition, we include Pangu with Codex and use the downsampled training set as the pool for retrieval. We also adapt UnifiedSKG to work with in-context learning with Codex, so we can directly compare discrimination vs. generation under in-context learning. First, we observe that when both use T5-base, UnifiedSKG significantly underperforms Pangu. The main reason is that most of its predicted plans are ungrammatical or unfaithful in the low-data regimes. ArcaneQA uses constrained decoding to alleviate this issue, but still consistently underperforms Pangu when both use BERT-base. We also observe that T5-base performs worse than BERT-base under the 1-shot and 10-shot settings, which can be explained by the added complexity of T5 making it harder to train under extreme low-data settings.
For in-context learning using Codex, Pangu achieves an EM of over 50% with only one training instance, which significantly outperforms all other settings, including UnifiedSKG+Codex. This may point to interesting opportunities for practical KBQA under extreme low-data settings.

Conclusions
In this paper, we proposed to capitalize on the discriminative ability of language models (LMs) instead of their generative ability for grounded language understanding. Building on this idea, we developed Pangu, a generic framework for grounded language understanding that consists of a symbolic agent and a neural LM working in a concerted fashion, creating a better separation between the realm of the neural and that of the symbolic. As a case study, we instantiated the framework for knowledge base question answering (KBQA), which features a massive environment and a particularly challenging setting for grounding. Extensive experiments convincingly demonstrated the remarkable effectiveness and flexibility of the proposed framework. This work opens the door to developing versatile and sample-efficient grounded language understanding systems that fully capitalize on the language understanding ability of LMs while avoiding their limitations. It also sheds light on developing better neuro-symbolic systems in general.

Limitations
Despite the strong performance of Pangu, we identify several limitations for future improvement. The first major limitation lies in efficiency. Because Pangu requires an LM to iteratively score candidate plans, it tends to be more resource-consuming in terms of both time and compute. Compared with ArcaneQA, which efficiently handles complex questions in KBQA, Pangu is about twice as slow for both training and inference and consumes about twice as much GPU memory when both use BERT-base. Specifically, to predict a plan of L tokens, generation-based methods require L forward passes of the LM. For Pangu, the number of forward passes is proportional to the number of candidate plans, which can range widely. To alleviate this issue, algorithms with complexity better than O(N), where N is the number of candidate plans, can be developed to find the top-K candidates.
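To make the cost comparison concrete, the toy sketch below counts LM forward passes under the two paradigms and illustrates one inexpensive top-K routine: Python's `heapq.nlargest` selects the K best of N candidates in O(N log K) time without a full sort. The plans and scores are made up for illustration; they are not outputs of the actual system.

```python
import heapq

def generation_forward_passes(plan_tokens):
    # Autoregressive generation: one LM forward pass per generated token.
    return len(plan_tokens)

def discrimination_forward_passes(candidate_plans):
    # Discrimination: one LM scoring pass per candidate plan.
    return len(candidate_plans)

# Hypothetical candidate plans with made-up plausibility scores (log-probs).
scored = {
    "(JOIN height Yao_Ming)": -0.3,
    "(AND Person Yao_Ming)": -2.1,
    "(COUNT Yao_Ming)": -4.0,
}

# Keep only the top-K candidates without sorting all N: O(N log K).
top_k = heapq.nlargest(2, scored, key=scored.get)
```

Partial selection like this reduces the sorting cost, but the number of LM scoring passes is still linear in N, which is why sublinear candidate pruning remains an open direction.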
Second, though Pangu has shown promising results with Codex, the true potential of enabling few-shot grounded language understanding with Pangu has yet to be realized. We only experiment with a straightforward scoring function and have not systematically explored different prompt designs. In the future, we plan to try different prompt designs, retrievers, and scoring functions, including recent techniques such as chain-of-thought prompting.
Last but not least, though orthogonal to the general framework of our proposal, our current instantiation assumes gold plans for training. However, gold plans can be expensive to collect for some environments. Exploring fine-tuning LMs with weak supervision is an interesting direction. In addition to proposing candidate plans to the LM, the agent may also respond to the LM with rewards based on its decisions (Liang et al., 2017).

Notation in Table 5: R: relation; T: type; E: entity; E: a set of entities; N: integer.

A Candidate Enumeration
Our candidate enumeration for KBQA strictly follows the definitions of the functions in Table 5. Specifically, given the set of current plans P_t, to construct the candidate set C_{t+1}, for each plan p_i in P_t, the agent executes it and, by exploring the KB, collects the types and relations reachable from its execution. For each type t, the agent enumerates (AND t p_i) as a candidate. For each relation r, the agent enumerates (JOIN r p_i) as a candidate. If the execution of p_i is a numerical value, four candidates with comparatives are also included (LT/LE/GT/GE r p_i). In addition, candidate plans with superlatives can be enumerated as (ARGMAX/ARGMIN p_i r), and (COUNT p_i) can always be added to C_{t+1}. After checking each p_i independently, the agent then checks each pair of plans p_i and p_j from P_t; if the executions of p_i and p_j overlap, (AND p_i p_j) is also included as a candidate plan. The candidate enumeration process is fully transparent to the LM and can be easily controlled based on different needs.
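The enumeration step above can be sketched as follows. This is a minimal, illustrative sketch, not the actual Pangu implementation: `MockKB` and its methods are hypothetical stand-ins for the real Freebase executor.

```python
# Minimal sketch of one candidate-enumeration step.
# MockKB is a toy stand-in for the real KB executor (illustrative only).

def enumerate_candidates(current_plans, kb):
    """Expand the current plans P_t into the candidate set C_{t+1}."""
    candidates = []
    for p in current_plans:
        result = kb.execute(p)                      # execute p_i against the KB
        for t in kb.reachable_types(result):
            candidates.append(f"(AND {t} {p})")     # one candidate per reachable type
        for r in kb.reachable_relations(result):
            candidates.append(f"(JOIN {r} {p})")    # one candidate per reachable relation
            if kb.is_numerical(result):             # comparatives for numerical executions
                for op in ("LT", "LE", "GT", "GE"):
                    candidates.append(f"({op} {r} {p})")
            candidates.append(f"(ARGMAX {p} {r})")  # superlatives
            candidates.append(f"(ARGMIN {p} {r})")
        candidates.append(f"(COUNT {p})")           # COUNT is always applicable
    # intersect pairs of plans whose executions overlap
    for i, p_i in enumerate(current_plans):
        for p_j in current_plans[i + 1:]:
            if kb.execute(p_i) & kb.execute(p_j):
                candidates.append(f"(AND {p_i} {p_j})")
    return candidates

class MockKB:
    """Toy KB: every plan reaches type Person and relation height."""
    def execute(self, plan):
        return {"ent1", "ent2"}
    def reachable_types(self, result):
        return ["Person"]
    def reachable_relations(self, result):
        return ["height"]
    def is_numerical(self, result):
        return False
```

For example, `enumerate_candidates(["e0"], MockKB())` yields (AND Person e0), (JOIN height e0), the two superlatives, and (COUNT e0); with two overlapping plans, their intersection (AND e0 e1) is also proposed.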

B Examples of Prompts
We show two examples of prompts with 10 in-context examples retrieved from the pool of 100 training examples in Figure 4 and Figure 5.