Awakening Latent Grounding from Pretrained Language Models for Semantic Parsing

In recent years, pretrained language models (PLMs) have achieved success on several downstream tasks, showing their power in modeling language. To better understand and leverage what PLMs have learned, several techniques have emerged to explore syntactic structures entailed by PLMs. However, few efforts have been made to explore the grounding capabilities of PLMs, which are also essential. In this paper, we highlight the ability of PLMs to discover which token should be grounded to which concept, when combined with our proposed erasing-then-awakening approach. Empirical studies on four datasets demonstrate that our approach can awaken latent grounding which is understandable to human experts, even though it is not exposed to such labels during training. More importantly, our approach shows great potential to benefit downstream semantic parsing models. Taking text-to-SQL as a case study, we successfully couple our approach with two off-the-shelf parsers, obtaining an absolute improvement of up to 9.8%.


Introduction
Recent breakthroughs in Pretrained Language Models (PLMs) such as BERT (Devlin et al., 2019) and GPT-3 (Brown et al., 2020) have demonstrated the effectiveness of self-supervised learning for a range of downstream tasks. Without being guided by structural information during training, PLMs show the potential for learning implicit syntactic structures and language semantics, which can be transferred to other tasks. To better understand and leverage what PLMs have learned, several lines of work have emerged to probe or induce syntactic structures from PLMs. According to prior studies (Rogers et al., 2020), most existing work focuses on syntactic structures such as part of speech (Liu et al., 2019), constituency trees (Wu et al., 2020) and dependency trees (Hewitt and Manning, 2019; Jawahar et al., 2019), paying much less attention to language semantics (Tenney et al., 2019). However, as is well known, semantic information is essential for high-level tasks like machine reading comprehension (Wang and Jiang, 2019).

Figure 1: Typical scenarios for grounding; here the linguistic tokens "george washington" can be grounded into different real-world concepts.
Regarding language semantics, an important branch is grounding, which has been overlooked by most previous work. Broadly speaking, grounding means "connecting linguistic symbols to real-world perception or actions" (Roy, 2005). It is generally thought to be important for a variety of tasks, such as video description (Zhou et al., 2019), visual question answering (Zhu et al., 2016) and semantic parsing (Guo et al., 2019). In this paper, we focus on single-modal scenarios, where grounding refers more specifically to mapping linguistic tokens to a real-world concept described in natural language. As shown in Figure 1, "george washington" can be grounded to either a cell value in a structured table or an entity in a knowledge base.
In single-modal scenarios, grounding is especially important for semantic parsing, the task of translating a natural language sentence into a corresponding executable logic form. Grounding is essential in earlier work, which largely conceptualized semantic parsing as grounding an utterance to a task-specific meaning representation (Zelle and Mooney, 1996; Zettlemoyer and Collins, 2005; Liang et al., 2013; Cheng et al., 2017). In modern approaches based on the encoder-decoder architecture, grounding also plays an important role, and considerable work has demonstrated its positive effect (Guo et al., 2019; Dong et al., 2019; Wang et al., 2020b). Despite this success, existing grounding methods have mainly relied on heavy manual effort such as high-quality lexicons (Reddy et al., 2016) or ad-hoc heuristic rules such as n-gram matching (Guo et al., 2019), and thus suffer from poor flexibility. To explore more flexible methods, researchers recently tried a data-driven approach: they collected grounding annotations as supervision to train grounding models (Li et al., 2020a; Lei et al., 2020; Shi et al., 2020). However, this modeling flexibility requires expensive grounding annotations, which most of the time are not available.
To alleviate the above issues, we present a novel approach, Erasing-then-Awakening (ETA). It is inspired by recent advances in interpretable machine learning (Samek et al., 2017), where the importance of individual pixels can be quantified with respect to a classification decision. Similarly, our approach first quantifies the contribution of each word with respect to each concept, by erasing it and probing the variation in concept prediction decisions (elaborated later). Then it employs these contributions as pseudo labels to awaken latent grounding from PLMs. In contrast to prior work, our approach only needs supervision for concept prediction, which can be easily derived from downstream tasks (e.g., text-to-SQL), instead of full grounding supervision. Empirical studies on four datasets demonstrate that our approach can awaken latent grounding which is understandable to human experts. This is highly non-trivial because our approach is not exposed to any human-annotated grounding label during training. More importantly, we find that the grounding can be easily coupled with downstream models to boost their performance, with an absolute improvement of up to 9.8%. In summary, our contributions are three-fold: 1. To the best of our knowledge, we are the first to highlight and demonstrate the possibility of awakening latent grounding from PLMs.
2. We propose a novel weakly supervised approach, erasing-then-awakening, to awaken latent grounding from PLMs. Empirical studies on four datasets demonstrate that the awakened latent grounding is understandable to human experts.
3. Taking text-to-SQL as a case study, we demonstrate that the awakened grounding can be coupled with off-the-shelf parsers, yielding an absolute improvement of up to 9.8%.
Method: Erasing-then-Awakening

In the task of grounding, we are given a question x = x_1, ..., x_N and a concept set C = {c_1, ..., c_K}, where each concept consists of several tokens. The goal of grounding is to find the tokens (also known as mentions) in x that are relevant to concepts in C. Generally, the grounding procedure learns to create an N × K matrix, which we call the latent grounding. In some cases, a set of pairs is needed, each of which explicitly indicates that a token is grounded to a concept; we refer to such pairs as grounding pairs below. As illustrated in Figure 2, our model consists of a PLM module, a concept prediction (CP) module and a grounding module. In this section, we first present the training procedure of ETA, which at a high level involves three steps: (1) train an auxiliary concept prediction module; (2) erase tokens in a question and take the resulting concept prediction confidence differences as a pseudo alignment; (3) awaken latent grounding from PLMs by applying the pseudo alignment as supervision. Then we introduce the procedure for producing grounding pairs at inference time.

Training a Concept Prediction Module
Given x and C, the goal of the concept prediction module is to identify whether each concept c_k ∈ C is mentioned in the question x. Although this does not seem directly related to grounding, it is a prerequisite for the erasing mechanism, which will be elaborated later. The supervision l_k ∈ {0, 1} for c_k is the weak supervision ETA relies on, and it can be readily obtained from downstream task signals. Taking text-to-SQL as an illustration, each database schema item (i.e., table, column or cell value) appearing in an annotated SQL query can be considered mentioned in the question (l_k = 1), with the others serving as negative examples (l_k = 0).
Once the supervision is prepared, the CP module is trained to conduct binary classification over the representation of each concept. As done in previous work (Hwang et al., 2019), we first concatenate the question and all concepts into a sequence as input to the PLM module. As illustrated in Figure 2, the input sequence starts with [CLS], with the question and each concept separated by [SEP]. Then the sequence is fed into the PLM module to produce a deep contextual representation at each position. Denoting by q_1, q_2, ..., q_N and e_1, e_2, ..., e_K the token and concept representations, they can be obtained as:

q_1, ..., q_N, e_1, ..., e_K = PLM([CLS], x_1, ..., x_N, [SEP], c_1, [SEP], ..., c_K, [SEP]),

where q_n and e_k correspond to the representations at the position of the n-th question token and the first token of c_k, respectively. Finally, each concept representation e_k is passed to a classifier to predict whether it is mentioned in x:

p_k = sigmoid(W_l e_k),

where W_l is a learnable parameter and p_k is the probability that c_k is mentioned in the question, referred to as the concept prediction confidence below.
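The input construction and the classification head above can be sketched in a few lines of framework-free Python. Function names are ours for illustration; in the real model the representations come from BERT, not toy vectors:

```python
import math

def build_cp_input(question_tokens, concepts):
    """Build the [CLS] question [SEP] c_1 [SEP] ... c_K [SEP] sequence and
    record the position of each concept's first token, from which the
    concept representation e_k is later read off the PLM output."""
    seq = ["[CLS]"] + question_tokens + ["[SEP]"]
    concept_positions = []
    for concept in concepts:
        concept_positions.append(len(seq))  # first token of this concept
        seq += list(concept) + ["[SEP]"]
    return seq, concept_positions

def concept_confidence(e_k, w_l):
    """p_k = sigmoid(W_l . e_k): probability that concept c_k is mentioned."""
    score = sum(w * e for w, e in zip(w_l, e_k))
    return 1.0 / (1.0 + math.exp(-score))
```

Reading e_k from the first token of each concept mirrors how the CP module indexes into the PLM output.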

Erasing Question Tokens
Once the concept prediction module has converged, we apply an erasing mechanism to assist the subsequent awakening phase. It follows a similar idea from interpretable document classification (Arras et al., 2016), where a word is considered important for document classification if removing it and classifying the modified document results in a strong decrease of the classification score. In our case, a token is considered highly relevant to certain concepts if there is a large drop in those concepts' prediction confidences after erasing the token. This is why we need the above-mentioned concept prediction module to provide reasonable concept prediction confidences. Concretely, as shown in Figure 2, the erasing mechanism erases the input tokens sequentially and feeds each erased input into the PLM module and the subsequent CP module. For example, with x_n substituted by the special token [UNK], we obtain the erased input [CLS], x_1, ..., x_{n−1}, [UNK], x_{n+1}, ..., c_K. Denoting by p̃_{n,k} the concept prediction confidence for c_k after erasing x_n, we believe the difference between p̃_{n,k} and p_k reveals c_k's relevance to x_n from the PLM's view. The confidence difference Δ_{n,k} is obtained by Δ_{n,k} = l_k · max(0, p_k − p̃_{n,k}). Repeating the above procedure over the question tokens, Δ ∈ R^{N×K} is filled completely.
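The erasing pass can be sketched as follows; here `predict_confidence` is our illustrative stand-in for the trained PLM + CP module:

```python
def erase_and_diff(tokens, num_concepts, labels, predict_confidence):
    """Fill the N x K pseudo-alignment matrix Delta.

    predict_confidence(tokens) returns [p_1, ..., p_K] for a (possibly
    erased) question; labels[k] is the weak 0/1 mention signal l_k.
    Delta[n][k] = l_k * max(0, p_k - p~_{n,k}).
    """
    base = predict_confidence(tokens)
    delta = []
    for n in range(len(tokens)):
        erased = tokens[:n] + ["[UNK]"] + tokens[n + 1:]
        p_erased = predict_confidence(erased)
        delta.append([labels[k] * max(0.0, base[k] - p_erased[k])
                      for k in range(num_concepts)])
    return delta
```

With a toy predictor that is confident about a concept only while the word "capital" is visible, erasing "capital" yields a large Δ entry while erasing other tokens yields zero.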

Awakening Latent Grounding
As mentioned above, we believe Δ reflects the relevance between each token and each concept from a PLM's view. Therefore, we could directly use Δ as ETA's output. However, according to our preliminary study, this method performs poorly and cannot produce high-quality alignment (more experimental results can be found in §3.3). Instead of directly using Δ, we employ it to "awaken" the latent grounding. To be specific, we introduce a grounding module on top of the representations of the PLM module and train it using Δ as pseudo labels (i.e., the pseudo alignment). The grounding module first obtains a grounding score g_{n,k} between each question token x_n and each concept c_k based on their deep contextual representations q_n and e_k as:

g_{n,k} = (W_q q_n)^T (W_e e_k) / √d,

where W_e and W_q are learnable parameters and d is the dimension of e_k. Then it normalizes the grounding scores into the latent grounding α as:

α_{n,k} = exp(g_{n,k}) / Σ_{k'=1}^{K} exp(g_{n,k'}).

Finally, the grounding module is trained to maximize the Δ-weighted log-likelihood Σ_n Σ_k Δ_{n,k} · log α_{n,k}.
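The scoring, normalization and weighted objective can be sketched in pure Python; for brevity the learnable projections W_q and W_e are folded into the toy representations, so this is a sketch of the computation, not the full parameterization:

```python
import math

def latent_grounding(q_reps, e_reps, scale):
    """alpha[n][k] = softmax over concepts of the scaled dot product g_{n,k}."""
    alpha = []
    for q in q_reps:
        scores = [sum(qi * ei for qi, ei in zip(q, e)) / scale for e in e_reps]
        m = max(scores)                       # shift for numerical stability
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        alpha.append([x / z for x in exps])
    return alpha

def awakening_loss(alpha, delta):
    """Negative Delta-weighted log-likelihood, to be minimized:
    L = - sum_n sum_k Delta[n][k] * log(alpha[n][k])."""
    return -sum(d * math.log(a)
                for a_row, d_row in zip(alpha, delta)
                for a, d in zip(a_row, d_row))
```

Because Δ is non-negative and zero for unmentioned concepts, only token-concept cells flagged by the erasing phase contribute to the gradient.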

Producing Grounding Pair
Repeating erasing and awakening iteratively for several epochs until the grounding module converges, we can readily produce grounding pairs. Formally, we aim to obtain a set of pairs, where each pair ⟨x_n, c_k⟩ indicates that x_n is grounded to c_k. Noting that c_k may contain several tokens, we keep all probabilities in α_{·,k} which exceed τ/|c_k|, where τ is a threshold and |c_k| is the number of tokens in c_k. Also, taking into account that x_n should be grounded to only one concept, we keep only the highest probability over α_{n,·}. Finally, each pair ⟨x_n, c_k⟩ is considered a grounding pair if α_{n,k} is kept and p_k ≥ 0.5, and is discarded otherwise.
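The inference-time filtering can be sketched as follows (a minimal sketch of the rules above, with our own function name):

```python
def grounding_pairs(alpha, concepts, p, tau=0.2):
    """Turn the latent grounding alpha (N x K) into explicit pairs <x_n, c_k>.

    Keep token n's best concept k only if (a) alpha[n][k] exceeds
    tau / |c_k|, and (b) the concept prediction confidence p[k] >= 0.5.
    """
    pairs = []
    for n, row in enumerate(alpha):
        k_best = max(range(len(row)), key=row.__getitem__)  # one concept per token
        if row[k_best] > tau / len(concepts[k_best]) and p[k_best] >= 0.5:
            pairs.append((n, k_best))
    return pairs
```

Dividing τ by |c_k| lowers the bar for multi-token concepts, whose probability mass is spread over more tokens.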

Experiments
In this section, we conduct experiments to evaluate if the latent grounding awakened by ETA is understandable to human experts. Here we accomplish the evaluation by comparing the grounding pairs produced by ETA with human annotations.

Experimental Setup
Datasets We select two representative grounding tasks where human annotations are available: schema linking and entity linking. Schema linking grounds questions into database schemas, while entity linking grounds questions into entities of knowledge bases. For schema linking, we select SPIDER-L (Lei et al., 2020) and SQUALL (Shi et al., 2020) as our evaluation benchmarks. As mentioned in §2.1, the supervision for our model is obtained from SQL queries. For entity linking, we select WebQSP_EL and GraphQ_EL (Sorokin and Gurevych, 2018). The supervision for our model is obtained from SPARQL queries in a similar way.
Evaluation For schema linking, as done in previous work (Lei et al., 2020), we report the micro-average precision, recall and F1-score for both columns (Col_P, Col_R, Col_F) and tables (Tab_P, Tab_R, Tab_F). For entity linking, we report the weak matching precision, recall and F1-score for entities (Ent_P, Ent_R, Ent_F). Weak matching is a commonly used metric in previous work (Sorokin and Gurevych, 2018); it considers a prediction correct whenever the correct entity is identified and the predicted mention boundary overlaps with the ground-truth boundary. More details can be found in §A.
Baselines For schema linking, we consider four strong baselines. (1) N-gram Matching enumerates all n-gram (n ≤ 5) phrases in a natural language question and links them to database schemas by fuzzy string matching. (2) SIM computes the dot-product similarity between each question token and each schema item using their PLM representations without fine-tuning, to explore the grounding capacity of unawakened PLMs. (3) CONTRAST learns by comparing the aggregated grounding scores of mentioned schemas with those of unmentioned ones in a contrastive learning style, following prior work. Concretely, in training, CONTRAST is trained to accomplish the same concept prediction task as our approach. With an architecture similar to the Receiver module in that work, it first computes the similarity score between each token and each concept, and then uses max pooling to aggregate the similarity scores of a concept over an utterance into a concept prediction score. Finally, a margin-based loss encourages the baseline to give higher concept prediction scores to mentioned concepts than to unmentioned ones. (4) Fully supervised models, i.e., SLSQL_L (Lei et al., 2020) and ALIGN_L (Shi et al., 2020), which are trained with human-annotated schema linking labels. For entity linking, we compare ETA with three powerful methods. (1) Heuristic picks the most frequent entity among the candidates found by string matching over Wikidata.
(2) VCG (Sorokin and Gurevych, 2018) aggregates and mixes contexts of different granularities to perform entity linking. (3) ELQ (Li et al., 2020a) uses a bi-encoder to perform entity linking in one pass, achieving state-of-the-art performance on WebQSP EL and GraphQ EL . VCG and ELQ utilize entity linking supervision in training, while ETA does not.
Implementation For schema linking we follow the procedure in §2.4 to produce grounding pairs for evaluation, while for entity linking we further merge adjacent grounding pairs to produce span-level grounding pairs. We implement ETA in PyTorch (Paszke et al., 2019). With respect to PLMs, we use the uncased BERT-base (BERT) and BERT-large (BERT_L) models from the Transformers library (Wolf et al., 2020); our approach is theoretically applicable to different PLMs, but we chose BERT as a representative and leave the exploration of other PLMs for future work. As for the optimizer, we employ AdamW (Loshchilov and Hutter, 2019). More details (e.g., learning rates) of each experiment can be found in §C.1.

Experimental Results

Table 1 shows the experimental results on the schema linking task. As shown, our method outperforms all weakly supervised and heuristic-based methods by a large margin. For example, on SPIDER-L, ETA + BERT achieves an absolute improvement of 7.2% Col_F and 2.8% Tab_F over the best baseline CONTRAST. The same conclusion can be drawn from the experimental results on the entity linking task shown in Table 2. For instance, ETA + BERT obtains an Ent_F of up to 74.5% on WebQSP_EL, which is a satisfying performance for downstream tasks. All results above demonstrate the superiority of our approach in awakening latent grounding from PLMs. As for why PLMs work well on both schema linking and entity linking, it may be because both tasks require text-based semantic matching (e.g., synonyms), at which PLMs excel.

Furthermore, it is very surprising that although not trained with fine-grained grounding supervision, our model is comparable with, or only slightly worse than, the fully supervised models across datasets. For instance, on SPIDER-L, our model exceeds the fully supervised baseline SLSQL_L by 0.9 points on Tab_F. On SQUALL, our model performs slightly worse than the fully supervised baseline ALIGN_L. This is highly non-trivial since CONTRAST, the best weakly supervised baseline on SPIDER-L, is far from the fully supervised model on SQUALL, while our model shows only a small drop. Besides, on WebQSP_EL and GraphQ_EL, although our model is inferior to the state-of-the-art model ELQ, it achieves performance comparable with the fully supervised baseline VCG. These results provide strong evidence that PLMs do have very good grounding capabilities, and that our approach can awaken them.

Table 3: Examples of the error types on SQUALL.
Missed Grounding — Q: "How many points did arnaud demare receive?" GOLD: points → "UCI world tour points"; PRED: (none)
Technically Correct (21.0%) — Q: "Total population of millbrook first nation?" GOLD: population → "Population"; PRED: population → "Population", nation → "Community"
Partially Correct (15.8%) — Q: "Who was the first winning captain?" GOLD: the first → "Year", winning captain → "Winning Captain"; PRED: first → "Year", winning captain → "Winning Captain"
Wrong Grounding (10.1%) — Q: "Were the matinee and evening performances held earlier than the 8th anniversary?" GOLD: earlier → "Date"; PRED: matinee → "Performance", earlier → "Date"

Model Analysis
In this section, we try to answer four interesting research questions (RQ1-RQ4) via a thorough analysis.

RQ1 A known concern in the probing literature is that probes with extra trainable parameters may learn the task themselves rather than reveal knowledge in the probed model (Hewitt and Liang, 2019). Similarly, since our approach depends on extra modules (e.g., the grounding module), it faces the same dilemma: how can we know whether the latent grounding is learnt from PLMs or from the extra modules? Therefore, we apply our approach to a randomly initialized Transformer encoder (Vaswani et al., 2017), to probe the grounding capability of a model that has not been pretrained. To make the comparison fair, the encoder has the same architecture as BERT. However, it only reaches 40% Col_F on SQUALL, not even as good as the N-gram baseline. Considering that it contains the same extra modules as ETA + BERT, the huge gap between the two supports the view that the latent grounding is mainly learnt from PLMs. Meanwhile, one concern shared by our reviewers is the risk of supervision exposure during training of the concept prediction module; in other words, our approach may "steal" some supervision through the concept prediction module to achieve good grounding performance. However, the above experiment demonstrates that a non-pretrained model is far from strong grounding capability even with the same concept prediction module. We hope this finding alleviates the concern.
RQ2 As mentioned in §2.3, the pseudo alignment Δ can also be employed directly as the model prediction. Therefore, we conduct experiments to verify whether our proposed awakening phase is necessary. As shown in Figure 3, even with various normalization methods (e.g., Softmax), Δ does not produce satisfactory alignment. In contrast, our model consistently performs well. To investigate further, we conduct a careful analysis of Δ and are surprised to find that its values are generally small and not as clearly separated from each other as we would expect. Therefore, we believe the success of our approach stems from the fact that it encourages the grounding module to capture these subtle differences and strengthen them.

Figure 4: The illustration of the solution to couple ETA with downstream text-to-SQL parsers.
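The effect is easy to see numerically: softmax over near-identical small Δ values is almost uniform, so directly normalizing Δ cannot single out the right concept. The numbers below are toy values for illustration only:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

# Confidence drops for one token against three concepts: the raw
# differences are tiny, so the normalized distribution is nearly flat
# and gives almost no preference for the correct concept.
delta_row = [0.020, 0.012, 0.015]
flat = softmax(delta_row)
spread = max(flat) - min(flat)  # close to zero: an almost uniform distribution
```

A trained grounding module, by contrast, can amplify these subtle differences into a sharply peaked α.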

RQ3
We apply our approach on BERT-large (BERT L ) and conduct experiments on SPIDER-L.
The results show BERT L brings an improvement of 2.5% Col F and 0.5% Tab F , suggesting the possibility of awakening better latent grounding from larger PLMs. Nevertheless, the improvement may also come from more parameters, so the conclusion needs further investigation.

RQ4
We manually examine 20% of our model's errors on the SQUALL dataset and summarize four main error types: (1) missed grounding -where our model did not ground any token to a concept, (2) technically correct -where our model was technically correct but the annotation was missing, (3) partially correct -where our model did not find all tokens of a concept, (4) wrong grounding -where the model produced incorrect grounding. As shown in Table 3, only a small fraction of errors are wrong grounding, indicating that the main challenge of our approach is recall rather than precision.

Case Study: Text-to-SQL
The ETA model is proposed for general-purpose use and is intended to enhance different downstream semantic parsing models. To verify this, we take the text-to-SQL task as a case study. In this section, we first present a general solution for coupling ETA with different text-to-SQL parsers. Then, we conduct experiments on two off-the-shelf parsers to verify the effectiveness of ETA.

Coupling with Text-to-SQL Parsers
Inspired by Lei et al. (2020), we present a general solution to couple ETA with downstream parsers in Figure 4. As shown, we first obtain a schema-aware representation for each question token by fusing the token representation with its related schema representation according to the latent grounding α ∈ R^{N×K} (the gray matrix in Figure 4). Specifically, given a token representation q_n and all schema representations e_1, e_2, ..., e_K, the schema-aware representation q̃_n for q_n can be computed as:

q̃_n = g(q_n, Σ_{k=1}^{K} α_{n,k} e_k),

where g is a fusion function (e.g., concatenation). Then we feed every q̃_n into a question encoder to generate hidden states, which are attended by a decoder to decode the SQL query. By contributing to the schema-aware representation, ETA is able to prompt the decoder to predict appropriate schemas during decoding. Notably, the encoder and decoder are not limited to specific modules, and we follow the original papers' settings in subsequent experiments.
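The fusion step can be sketched as follows. We use concatenation as the fusion function here; this is one plausible choice for illustration, not necessarily the exact operator used by the base parsers:

```python
def schema_aware_reps(q_reps, e_reps, alpha):
    """q~_n = [q_n ; sum_k alpha[n][k] * e_k]: append to each token
    representation a grounding-weighted mixture of schema representations."""
    fused = []
    for n, q in enumerate(q_reps):
        ctx = [sum(alpha[n][k] * e_reps[k][d] for k in range(len(e_reps)))
               for d in range(len(e_reps[0]))]
        fused.append(list(q) + ctx)
    return fused
```

When α is a one-hot grounding matrix (as in our coupling experiments), the appended context is exactly the representation of the grounded schema item.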

Experimental Setup
Datasets and Evaluation We conduct experiments on two text-to-SQL benchmarks: WikiTableQuestions (WTQ) (Pasupat and Liang, 2015) and Spider (Yu et al., 2018).

[Table: reference Spider results, e.g., (Guo et al., 2019): 63.9 / 55.0; BRIDGE + BERT_L (Lin et al., 2020): 70.0 / 65.0; RATSQL + BERT_L (Wang et al., 2020a).]

Baselines On WTQ, our baselines include ALIGN_P and ALIGN, where the former is a vanilla attention-based sequence-to-sequence model and the latter enhances ALIGN_P with an additional schema linking task (Shi et al., 2020). Similarly, on Spider, our main baselines are SLSQL_P and its schema-linking-enhanced version SLSQL (Lei et al., 2020). SLSQL_P consists of a question encoder and a two-step SQL decoder: in the first decoding step, a coarse SQL query (i.e., without aggregation functions) is generated; the coarse SQL is then used to synthesize the final SQL in the second decoding step. Here we also report the performance of SLSQL + BERT (Oracle), where the learnable schema linking module is replaced with human annotations at inference; it represents the maximum potential benefit of schema linking for the text-to-SQL task. Meanwhile, for a comprehensive comparison, we also compare our model with state-of-the-art models on the Spider benchmark; we refer readers to their papers for details.
Implementation As for our approach, on WTQ we employ ALIGN_P as our base parser, while on Spider we select SLSQL_P. For both parsers, we follow the same hyperparameters as described in the original papers to reduce other factors that may affect performance. More implementation details can be found in §C.2.

Experimental Results

Coupling ETA with the base parser brings an absolute improvement of 7.1% on the Ex.Set metric. As the PLM becomes larger (e.g., BERT_L), the improvement becomes more significant, up to 9.8%. Compared with state-of-the-art methods, our model ETA + BERT_L also obtains competitive performance, which is impressive given that it is based on a simple parser.

More interestingly, on both datasets, our model achieves similar or even better performance compared to methods that employ extra grounding supervision. For instance, on Spider, our ETA + BERT outperforms SLSQL + BERT by 3.7%. Taking into account that SLSQL utilizes additional supervision, this performance gain is very surprising. We attribute the gain to two possible reasons: (1) the PLMs already contain latent grounding which is understandable to human experts; (2) compared with training under strong schema linking supervision, training with weak supervision alleviates the issue of exposure bias and thus enhances the generalization ability of ETA. Table 6 presents the model predictions of ETA + BERT_L on three real cases. As observed, ETA has learned grounding for adjectives (e.g., oldest → age), entities (e.g., where → hometown) and semantic matching (e.g., registered → student enrolment). Meanwhile, grounding pairs provide a useful guide for better understanding the model predictions. Figure 5 visualizes the latent grounding for Q2 in Table 6 (the question "Where is the youngest teacher from?"), and more visualizations can be found in §D.

Related Work
The work most related to ours is the line of inducing or probing knowledge in pretrained language models. According to the knowledge category, there are mainly two kinds of methods: one focuses on syntactic knowledge and the other on semantic knowledge. Under the category of syntactic knowledge, several studies showed that BERT embeddings encode syntactic information in a structural form that can be recovered (Lin et al., 2019b; Warstadt and Bowman, 2020; Hewitt and Manning, 2019; Wu et al., 2020). However, recent work also showed that BERT does not rely on syntactic information for downstream task performance, casting doubt on the role of syntactic knowledge (Ettinger, 2020; Glavas and Vulic, 2020). As for semantic knowledge, although it is less explored than syntactic knowledge, previous work showed that BERT contains some semantic information, such as entity types (Ettinger, 2020), semantic roles (Tenney et al., 2019) and factual knowledge (Petroni et al., 2019). Different from the above work, we focus on the grounding capability, an under-explored branch of language semantics.
Our work is also closely related to entity linking and schema linking, which can be viewed as sub-areas of grounding in specific scenarios. Given an utterance, entity linking aims at finding all mentioned entities using a knowledge base as the candidate pool (Tan et al., 2017; Chen et al., 2018; Li et al., 2020a), while schema linking tries to find all mentioned schemas related to a specific database (Dong et al., 2019; Lei et al., 2020; Shi et al., 2020). Previous work generally either employed full supervision to train linking models (Li et al., 2020a; Lei et al., 2020; Shi et al., 2020), or treated linking as a minor pre-processing step (Yu et al., 2018a; Guo et al., 2019; Lin et al., 2019a) and used heuristic rules to obtain the result. Our work differs in that we optimize the linking model with weak supervision from downstream signals, which is flexible and practical. Similarly, Dong et al. (2019) utilized downstream supervision to train their linking model. Compared with their policy-gradient training, our method is more efficient since it directly learns the grounding module using the pseudo alignment as supervision.

Conclusion & Future Work
In summary, we propose a novel weakly supervised approach to awaken latent grounding from pretrained language models via erasing. With only downstream signals, our approach can induce latent grounding from pretrained language models that is understandable to human experts. More importantly, we demonstrate that our approach can be applied to off-the-shelf text-to-SQL parsers and significantly improve their performance. For future work, we plan to extend our approach to more downstream tasks such as visual question answering. We also plan to utilize our approach to improve the error locator module in existing interactive semantic parsing systems (Li et al., 2020b).

A.1 Schema Linking
Let Ω_col = {(c, q)_i | 1 ≤ i ≤ N} be the set of N gold (column, question token) tuples, and Ω̂_col = {(ĉ, q̂)_j | 1 ≤ j ≤ M} be the set of M predicted (column, question token) tuples. We define the precision (Col_P), recall (Col_R) and F1-score (Col_F) as:

Col_P = |Γ_col| / |Ω̂_col|,  Col_R = |Γ_col| / |Ω_col|,  Col_F = 2 · Col_P · Col_R / (Col_P + Col_R),

where Γ_col = Ω_col ∩ Ω̂_col. The definitions of Tab_P, Tab_R and Tab_F are similar. Note that the results reported in Table 8 of Shi et al. (2020) use different evaluation metrics; here we re-evaluate their model with the above metrics for a fair comparison.
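These definitions translate directly into code (a minimal sketch over tuple sets; the function name is ours):

```python
def linking_prf(gold_pairs, pred_pairs):
    """Micro precision/recall/F1 over (concept, question-token) tuples:
    P = |gold ∩ pred| / |pred|, R = |gold ∩ pred| / |gold|."""
    gold, pred = set(gold_pairs), set(pred_pairs)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

The same function serves for both columns and tables, since only the tuple contents change.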

A.2 Entity Linking
Let Ω = {(e, [q_s, q_e])_i | 1 ≤ i ≤ N} be the gold entity-mention set and Ω̂ = {(ê, [q̂_s, q̂_e])_j | 1 ≤ j ≤ M} be the predicted entity-mention set, where e is the entity and [q_s, q_e] are the mention boundaries in the question q. In the weak matching setting, a prediction is correct only if the ground-truth entity is identified and the predicted mention boundaries overlap with the ground-truth boundaries. Therefore, the true-positive prediction set is defined as:

Γ = {(e, [q̂_s, q̂_e]) | (e, [q_s, q_e]) ∈ Ω, (e, [q̂_s, q̂_e]) ∈ Ω̂, [q_s, q_e] ∩ [q̂_s, q̂_e] ≠ ∅}.
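The weak-matching criterion can be sketched as follows; two closed integer spans [s, t] overlap exactly when max of the starts is at most min of the ends:

```python
def weak_match_tp(gold, pred):
    """Collect predicted (entity, (start, end)) pairs that weakly match:
    same entity as some gold pair, with overlapping mention spans."""
    tps = []
    for e_hat, (s_hat, t_hat) in pred:
        for e, (s, t) in gold:
            if e == e_hat and max(s, s_hat) <= min(t, t_hat):  # spans overlap
                tps.append((e_hat, (s_hat, t_hat)))
                break
    return tps
```

Precision and recall then follow from |Γ| / |Ω̂| and |Γ| / |Ω| as in §A.1.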

B Dataset Statistic
All details of datasets used in this paper are shown in Table 7.

C Implementation Details
For all experiments, we employ the AdamW optimizer and the default learning rate schedule strategy provided by Transformers library (Wolf et al., 2020).

C.1 Experiments on Grounding
SQUALL We use uncased BERT-base as the encoder. The learning rate is 3 × 10 −5 . The training epoch is 50 with a batch size of 16. The dropout rate and the threshold τ are set to 0.3 and 0.2 respectively. The training process lasts 6 hours on a single 16GB Tesla P100 GPU.
SPIDER-L We implement two versions: uncased BERT-base and uncased BERT-large. For both versions, the learning rate is 5 × 10 −5 and the training epoch is 50. For BERT-base (BERT-large) version, the batch size and gradient accumulation step are set to 12 (6) and 6 (4). The dropout rate and the threshold τ are set to 0.3 and 0.2 respectively. As for training time, BERT-base (BERT-large) version is trained on a 24GB Tesla P40 and it takes about 16 (48) hours to finish the training process.
WebQSP_EL & GraphQ_EL Due to the large number of entity candidates, we first use the candidate retrieval method proposed by Sorokin and Gurevych (2018) to reduce the number of candidates. Even after that, we cannot feed all candidates along with the question due to the maximum encoding length of BERT. Therefore, we divide the candidates into multiple chunks and feed each chunk (along with the question) into BERT sequentially. In implementation, we use uncased BERT-base as the encoder. The learning rate is 1 × 10^−5. The training epoch is 50 with a batch size of 16. The dropout rate and the threshold τ are set to 0.3 and 0.3, respectively. The training procedure finishes within 10 hours on a single Tesla M40 GPU.

C.2 Experiments on Text-to-SQL
For experiments on the text-to-SQL task, we employ the official code released along with Shi et al. (2020) (on WTQ) and Lei et al. (2020) (on Spider). When coupling ETA with these models, we first produce a one-hot grounding matrix derived from grounding pairs and then feed it into them as described in §4.
WTQ We use uncased BERT-base as the encoder. The training epoch is 50 with a batch size of 8. The learning rate is 1 × 10 −5 for the BERT module and 1 × 10 −3 for other modules. The dropout rate is set to 0.2. The training process finishes within 16 hours on a single 16GB Tesla P100 GPU.
Meanwhile, we follow previous work (Shi et al., 2020) in employing 5-fold cross-validation; experimental results on all five WTQ splits using ETA + BERT are shown in Table 9.
Spider We implement two versions: uncased BERT-base and uncased BERT-large. For BERT-base (BERT-large), the learning rate is 1.25 × 10^−5 (6.25 × 10^−6) for the BERT module and 1 × 10^−4 (5 × 10^−5) for other modules. The batch size and gradient accumulation step are set to 10 (6) and 5 (4) for the BERT-base (BERT-large) version. The dropout rate is set to 0.3. As for training time, the BERT-base (BERT-large) version is trained on a 24GB Tesla P40 and takes about 36 (56) hours to finish the training process.