Topic Transferable Table Question Answering

Weakly-supervised table question answering (TableQA) models have achieved state-of-the-art performance by using a pre-trained BERT transformer to jointly encode a question and a table and produce a structured query for the question. However, in practical settings, TableQA systems are deployed over table corpora having topic and word distributions quite distinct from BERT's pretraining corpus. In this work we simulate the practical topic-shift scenario by designing novel challenge benchmarks WikiSQL-TS and WikiTable-TS, consisting of train-dev-test splits in five distinct topic groups, based on the popular WikiSQL and WikiTableQuestions datasets. We empirically show that, despite pre-training on large open-domain text, performance of models degrades significantly when they are evaluated on unseen topics. In response, we propose T3QA (Topic Transferable Table Question Answering), a pragmatic adaptation framework for TableQA comprising: (1) topic-specific vocabulary injection into BERT, (2) a novel text-to-text transformer generator (such as T5, GPT2) based natural-language question generation pipeline focused on generating topic-specific training data, and (3) a logical form re-ranker. We show that T3QA provides a reasonably good baseline for our topic-shift benchmarks. We believe our topic-split benchmarks will lead to robust TableQA solutions that are better suited for practical deployment.


Introduction
Documents, particularly in enterprise settings, often contain valuable tabular information (e.g., financial, sales/marketing, HR). Natural language question answering systems over a table (or TableQA) have the additional complexity of understanding the tabular structure, including row/column headers, compared to the more widely-studied passage-based reading comprehension (RC) problem. Further, TableQA may involve complex questions with multi-cell or aggregate answers.

* Equal contribution by first two authors. 1 The source code and new dataset splits are available at https://github.com/IBM/T3QA

Figure 1: Topic-sensitive representations are important to infer that, in the context of the topic politics, the query span "ran in the election" should be linked to the "Candidate" column in the table. (Question: Who ran in the election for labour party? Answer: Mark Chiverton)
Most TableQA systems take a semantic parsing approach that utilizes language encoders to produce an intermediate logical form from the natural language question, which is then executed over the tabular data to get the answer. While some systems (Zhong et al., 2017) were fully supervised, needing pairs of questions and logical forms as training data, more recent systems (Pasupat and Liang, 2015; Krishnamurthy et al., 2017; Dasigi et al., 2019) rely only on the answer as weak supervision and search for a correct logical form. The current best TableQA systems (Herzig et al., 2020; Yin et al., 2020a) capitalize on advances in language modeling, such as BERT, and extend it to encode table representations as well. They are shown to produce excellent results on popular benchmarks such as WikiSQL (Zhong et al., 2017) and WikiTableQuestions (WikiTQ) (Pasupat and Liang, 2015).
With increasing prevalence of text analytics as a centrally-trained service that serves diverse customers, practical QA systems will encounter tables and questions from topics which they may not have necessarily seen during training. It is critical that the language understanding and parsing capabilities of these QA models arising from their training regime are sufficiently robust to answer questions over tables from such unseen topics.
As we show later in this paper, the existing approaches degrade significantly when exposed to questions from topics not seen during training (i.e., topic shift).2 To examine this phenomenon, we first instrument and dissect the performance of these recent systems under topic shift. In particular, we experiment with TaBERT (Yin et al., 2020b), a weakly supervised TableQA model that encodes the table and question using a BERT encoder and outputs a logical form using an LSTM decoder. In the example shown in Figure 1, topic shift may cause poor generalization for specific terminology or token usage across unseen topics.
We introduce a novel experimental protocol to highlight the difficulties of topic shift in the context of two well-known Wikipedia-based TableQA datasets: WikiSQL (Zhong et al., 2017) and WikiTableQuestions (Pasupat and Liang, 2015). Despite recent transformer-based TableQA models being pre-trained with open-domain data, including Wikipedia itself, we observe a performance drop of 5-6% when test instances arise from topics not seen during training.
To address this challenge, we next propose a novel T3QA framework for TableQA training that leads to greater cross-topic robustness. Our approach uses only unlabeled documents with tables from the never-seen topic (which we interchangeably call the target topic), without any hand-created (question, logical form) pairs in the target topic. Specifically, we first extend the vocabulary of BERT for the new topic. Next, we use a powerful text-to-text transfer transformer module to generate synthetic questions for the target topic. A pragmatic question generator first samples SQL queries of various types from the target topic tables and transcribes them to natural language questions, which are then used to finetune the TableQA model on the target topic. Finally, T3QA improves the performance of the TableQA model with a post-hoc logical form re-ranker, aided by entity linking. The proposed improvements are applicable to any semantic-parsing-style TableQA system with transformer encoders and are shown to confer generally cumulative improvements in our experiments. To the best of our knowledge, this is the first paper to tackle the TableQA problem in such a zero-shot setting with respect to target topics.

2 Topic shift may be regarded as a case of domain shift studied in the ML community. However, here we refrain from referring to the proposed topic-driven splits as "domains" due to the open-domain nature of these datasets and the pretraining data used to build these models.
The main contributions of this work are:
• This is the first work to address the phenomenon of topic shift in Table Question Answering systems.
• We create a novel experimental protocol on two existing TableQA datasets (WikiSQL-TS and WikiTQ-TS) to study the effects of topic shift.
• We propose new methods that use unlabeled text and tables from the target topic to create TableQA models which are more robust to topic shift.

Related work
Most TableQA systems take a semantic parsing view (Pasupat and Liang, 2015; Zhong et al., 2017) for question understanding and produce a logical form of the natural language question. Fully-supervised approaches, such as that of Zhong et al. (2017), need pairs of questions and logical forms for training. However, obtaining logical form annotations for questions at scale is expensive. A simpler, cheaper alternative is to collect only question-answer pairs as weak supervision (Pasupat and Liang, 2015; Krishnamurthy et al., 2017; Dasigi et al., 2019). Such systems search for the correct logical forms under syntactic and semantic constraints that produce the correct answer. Weak supervision is challenging, owing to the large search space that includes many possible spurious logical forms (Guu et al., 2017) that may produce the target answer but are not an accurate logical transformation of the natural question. Recent TableQA systems (Herzig et al., 2020; Yin et al., 2020a; Glass et al., 2021) extend BERT to encode the entire table, including headers, rows and columns. They aim to learn a table-embedding representation that can capture correlations between question keywords and target cells of the table. TAPAS (Herzig et al., 2020) and RCI (Glass et al., 2021) are designed to answer a question by predicting the correct cells in the table in a truly end-to-end manner. TaBERT (Yin et al., 2020a) is a powerful encoder developed specifically for the TableQA task. TaBERT jointly encodes a natural language question and the table, implicitly creating (i) entity links between question tokens and table content, and (ii) relationships between table cells, derived from the table structure. To generate the structured query, the encoding obtained from TaBERT is coupled with a memory augmented semantic parsing approach (MAPO) (Liang et al., 2018).
Question generation (QG) (Sultan et al., 2020; Shakeri et al., 2020) has been widely explored for the reading comprehension (RC) task to reduce the burden of annotating large volumes of Q-A pairs given a context paragraph. Recently, Puri et al. (2020) used GPT-2 (Radford et al., 2019) to generate synthetic data for RC, showing that synthetic data alone is sufficient to obtain state-of-the-art results on the SQuAD 1.1 dataset. To the best of our knowledge, our approach is the first to generate questions specifically for TableQA with the assistance of a logical query and large pre-trained multitask transformers.
Domain adaptation approaches in QA (Lee et al., 2019;Ganin et al., 2016) have so far mostly used adversarial learning with an aim to identify domain agnostic features, including in RC applications (Wang et al., 2019;Cao et al., 2020). However, for the TableQA systems using BERT-style language models with vast pre-training, topic shifts remain an unexplored problem.

T3QA framework
To our knowledge, this is the first work to explore TableQA in an unseen-topic setting. Consequently, no public topic-sliced TableQA dataset is available. We introduce a topic-shift benchmark by creating new splits in existing popular TableQA datasets: WikiSQL (Zhong et al., 2017) and WikiTQ (Pasupat and Liang, 2015). The benchmark creation process is described in Section 3.1. Then, we introduce the proposed framework (illustrated in Figure 2) to help TableQA systems cope with topic shift. Section 3.2 describes the topic-specific vocabulary extension for BERT, followed by question generation in the target topic in Section 3.3 and reranking of logical forms in Section 3.4.

TableQA topic-shift benchmark
To create a topic-shift TableQA benchmark out of existing datasets, topics have to be assigned to every instance. Once topics are assigned, we create train-test splits with topic shift, i.e., train instances and test instances come from non-overlapping sets of topics. TableQA instances are triplets of the form {table, question, answer}. For the datasets WikiSQL and WikiTQ, these tables are taken from Wikipedia articles. WikiSQL has 24,241 tables taken from 15,258 articles and WikiTQ has 2,108 tables from 2,104 articles.
The Wikipedia category graph (WCG) is a dense graph organized in a taxonomy-like structure. The Wikipedia articles corresponding to tables in WikiSQL and WikiTQ are connected to 16,000+ categories in the WCG on average. Among the Wikipedia Category:Main topic articles, these articles were connected to 38+ of the 42 main categories in the WCG.
We use category information from Wikipedia articles to identify topics for each of the article and then transfer those topics to the corresponding tables. The main steps are listed below; details can be found in Appendix B.
• We identify 42 main Wikipedia categories.
• For each table, we locate the Wikipedia article containing it.
• From the page, we follow category ancestor links until we reach one or more main categories.
• In case of multiple candidates, we choose one based on the traversed path length and the number of paths between the candidate and the article.

We cannot take an arbitrary subset of topics for the train split and the rest for the test split to create a topic-shift protocol, because many topics are strongly related to others. For example, the topic Entertainment is more strongly related to Music than to Law. To avoid this problem, we cluster these Wikipedia main topics into groups such that similar topics fall in the same group. Using a clustering procedure described in Appendix B, we arrive at 5 high-level topic groups: Politics, Culture, Sports, People and Miscellaneous. Table 1 gives the membership of each topic group and the number of instances in the WikiSQL and WikiTQ datasets per topic. For ease of discussion, we will call the five topic groups topics from now on. For both datasets, we create five leave-one-out topic-shift experiment protocols, wherein each topic in turn becomes the test set, called the target topic, and the remaining four form the training set, called the source topic(s).
In our protocol, for training, apart from the instances from the source topics, we also provide tables and documents from the target topic. Documents are the text crawled from the target topic's Wikipedia articles. Collecting unlabeled tables and text data for a target topic is inexpensive. We name these datasets WikiSQL-TS (WikiSQL with topic shift) and WikiTQ-TS.

Topic specific BERT vocabulary extension
Subword segmentation in BERT carries a risk of fragmenting named entities and, more generally, words unseen in the pretraining corpus. Vocabulary extension ensures that topic-specific words are encoded in their entirety rather than being split into subwords. Our goal is to finetune BERT with an extended vocabulary on the topic-specific target corpus to learn topic-sensitive contextual representations. We therefore add frequent topic-specific words to encourage the BERT encoder to learn better topic-sensitive representations, which is crucial for better query understanding and query-table entity linking.
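The selection of words to inject can be sketched as follows. This is a minimal illustration, not the paper's exact implementation: the function name, tokenization regex, and tie-breaking are our own choices; the frequency threshold of 15 and the cap of 1000 placeholder slots are taken from the experimental section.

```python
from collections import Counter
import re

def select_topic_vocab(corpus_texts, base_vocab, min_freq=15, max_new=1000):
    """Pick frequent whole words from the target-topic corpus that are
    missing from the base BERT vocabulary (and would otherwise be split
    into word-pieces). BERT's ~1000 [unusedX] placeholder slots bound
    how many words can be injected."""
    counts = Counter()
    for text in corpus_texts:
        counts.update(re.findall(r"[a-z0-9']+", text.lower()))
    candidates = [(w, c) for w, c in counts.items()
                  if c >= min_freq and w not in base_vocab]
    # Most frequent words first; ties broken alphabetically for determinism.
    candidates.sort(key=lambda wc: (-wc[1], wc[0]))
    return [w for w, _ in candidates[:max_new]]
```

The returned words would then be mapped onto the tokenizer's placeholder entries before MLM fine-tuning on the target-topic corpus.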

Table-question generation
In our proposed topic-shift experiment protocol, along with the training set from the source topics, unlabeled tables and free text from the target topic are provided in the training phase. We propose to use tables from the target topic to generate synthetic question-answer pairs and to use these augmented instances for training the TableQA model. Unlike question generation from text, a great deal of additional control is available when generating questions from tables. Similar to Guo et al. (2018), we first sample SQL queries from a given table, and then use a text-to-text transformer (T5) (Raffel et al., 2020) based sequence-to-sequence model to transcribe each SQL query into a natural language question.

SQL sampling
For generating synthetic SQL queries from a given table T, we have designed a focused and controllable SQL query generation mechanism, presented in Algorithm 1. Our approach is similar to Zhong et al. (2017), but unlike existing approaches, we use guidance from the target query syntax to offer much more control over the type of natural language questions being generated. We also use additional context, such as the table header and target answer cell, to help the model generate more meaningful questions suitable for T3QA. We sample the query type (simple retrieval vs. aggregations) and the associated WHERE clauses from a distribution that matches the prior probability distribution of the training data, if that is available. Sampling the query type and the number of WHERE clauses is important to mitigate the risk of learning a biased model that cannot generalize to more complex queries with more than 2 WHERE clauses, as reported by Guo et al. (2018). The generated SQL queries are checked for various aspects of semantic quality, beyond the mere syntactic correctness of typical rule-based generation. WikiSQL has a known limitation: even an incorrect SQL query can produce the same answer as the gold SQL query. To avoid such cases, we make two important checks: (1) the WHERE clauses in a generated SQL query must all be necessary to produce the correct answer, i.e., dropping any WHERE clause should not produce the expected answer; and (2) a generated SQL query with an aggregation must have at least 2 rows to aggregate over, so that dropping the aggregation will not produce the expected answer. These quality checks ensure that the generated synthetic SQL queries are fit to be used in the TableQA training pipeline.
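The two quality checks can be sketched over a toy table represented as a list of row dicts. This is an illustrative reconstruction under our own simplifications (equality-only WHERE clauses, a reduced set of aggregations); function names are ours, not from the paper's code.

```python
def run_query(rows, select_col, where, agg=None):
    """Execute a tiny WikiSQL-style query: filter by equality WHERE
    clauses, then project the selected column and optionally aggregate."""
    matched = [r for r in rows if all(r[c] == v for c, v in where)]
    values = [r[select_col] for r in matched]
    if agg == "count":
        return len(values)
    if agg == "sum":
        return sum(values)
    return values  # plain SELECT

def passes_quality_checks(rows, select_col, where, agg=None):
    """Reject sampled queries whose answer survives dropping a WHERE
    clause, or whose aggregation covers fewer than 2 rows."""
    answer = run_query(rows, select_col, where, agg)
    # Check 1: every WHERE clause must be necessary for the answer.
    for i in range(len(where)):
        reduced = where[:i] + where[i + 1:]
        if run_query(rows, select_col, reduced, agg) == answer:
            return False
    # Check 2: an aggregation must cover at least 2 rows.
    if agg is not None:
        matched = [r for r in rows if all(r[c] == v for c, v in where)]
        if len(matched) < 2:
            return False
    return True
```

A query failing either check is discarded before transcription to natural language.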

T5 transfer learning for QG
For question generation in the TableQA setup, it is more intuitive to create SQL queries first and then use the structure of the SQL query to translate it into a natural language question. Previously, Guo et al. (2018) and Benmalek et al. (2019) used LSTM-based sequence-to-sequence models for direct question generation from tables. However, we hypothesize that using answers and column headers in addition to SQL queries, with the help of transformer-based models, can be more effective.
For our question generation module we use the unified text-to-text transformer (T5) (Raffel et al., 2020), which is popular for its constrained text generation capabilities across multiple tasks such as translation and summarization. To leverage this capability of T5 for generating natural language questions from SQL queries, we encode a SQL query in a specific text format. We also pass the answer of the SQL query and the column headers of the table to T5, as we observe that using these two sets of extra information along with the SQL query helps in generating better questions, especially with "Wh" words. As illustrated in Figure 3, the generated SQL query with answer and column headers is encoded into a specific sequence before being passed to the T5 model. Special separator tokens are used to demarcate different parts of the input sequence: [S] specifies the main column and operation, [W] demarcates elements in a WHERE clause, [A] marks the answer, and [C] and [CS] mark the beginning of the set of column headers and the separation between them, respectively.
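The linearization above can be sketched as a small helper. The separator tokens are from the paper; the exact ordering and punctuation of the serialized string are our guess from Figure 3, and the function name is our own.

```python
def encode_sql_for_t5(select_col, op, where, answer, headers):
    """Linearize a sampled SQL query, its answer, and the table headers
    into a single T5 input string: [S] main column and operation,
    [W] WHERE-clause elements, [A] answer, [C]/[CS] header list."""
    parts = [f"[S] {op} {select_col}" if op else f"[S] {select_col}"]
    for col, val in where:
        parts.append(f"[W] {col} = {val}")
    parts.append(f"[A] {answer}")
    parts.append("[C] " + " [CS] ".join(headers))
    return " ".join(parts)
```

The resulting string is the encoder input; the target during fine-tuning is the corresponding gold natural language question.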
In this example, one can observe that although the SQL query does not have any term relating to day or date, our QG module was able to add "What day". Furthermore, ill-formed and unnatural questions generated by the T5 model are filtered out using a pretrained GPT-2 model (Radford et al., 2019). We remove questions with the highest perplexity scores before passing the rest to the TableQA training module.

[Table 3 examples: "What camera has the highest Mpix with an aspect ratio of 2:1, a height less than 1536, and a width smaller than 2048?"; SELECT AVG(Score) WHERE Player = Lee Westwood → "What is Lee Westwood's average score?" / "What is the average score with lee westwood as the player?"]
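The perplexity-based filtering step can be sketched as follows, assuming per-token log-probabilities have already been computed with a language model (GPT-2 in the paper). The drop fraction and function names here are illustrative choices; the paper does not state the exact cutoff.

```python
import math

def filter_by_perplexity(questions, log_probs_per_token, drop_frac=0.2):
    """Keep the most fluent generated questions. Perplexity is the
    exponential of the mean negative token log-probability; the
    highest-perplexity fraction of questions is discarded."""
    scored = []
    for q, lps in zip(questions, log_probs_per_token):
        ppl = math.exp(-sum(lps) / len(lps))
        scored.append((ppl, q))
    scored.sort(key=lambda x: x[0])  # low perplexity = more natural
    keep = max(1, int(round(len(scored) * (1 - drop_frac))))
    return [q for _, q in scored[:keep]]
```

Only the surviving questions are paired with their sampled SQL answers and passed to TableQA fine-tuning.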
For training the QG module, we use SQL queries and questions provided with the WikiSQL dataset. In our experiments, only query+question pairs from the source topics are used to train the question generation module and synthetic questions are gener-ated for the target topic.
We are able to produce high-quality questions using this T5 transcription. Table 2 shows a few examples of questions generated from ground truth SQL, and Table 3 from sampled SQL queries. Observe that the model is able to generate lookup questions, questions with multiple conditions, and aggregate questions of high quality. It is interesting to see that, for the first example in Table 2, the T5 model included the term car in the question even though it was not available in the SQL query, probably taking the cue from chassis. Some questions created from sampled SQLs for WikiTQ tables are provided in Appendix C.

Reranking logical forms
We analysed the logical forms predicted by the TaBERT model on WikiSQL-TS and observed that the top logical forms often do not have the correct column headers and cell values. In fact, in WikiSQL-TS there is a 15-20% greater chance of finding a correct prediction among the top-5 predicted logical forms than in the top 1.
We propose to use a classifier, GBoost (Friedman, 2002), to rerank the predicted top-5 logical forms. Given a logical form and a table-question pair, we create a set of features on which the classifier is trained to give a higher score to the correct logical form.

A logical form-question pair that yields the correct prediction is labelled as positive, and wrong predictions as negative. We use the logical forms predicted for the source topic dev set to train this classifier; at inference time, while predicting for the target topic, the logical form scored highest by the classifier is selected.

Features for logical form reranker
Two sets of features are extracted for the reranker: (1) entity-linking based features, and (2) logical-form based features. Entity-linking based features capture matches between query fragments and table elements. Our entity linking system, based on string matching, also finds partial matches. Partial matches happen when only a part of a column name or cell value appears in the question. Another scenario is when a token in the question partially matches multiple entities in the table. We create three features each for cell values and column headers.
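One plausible instantiation of the three string-matching features per element type is sketched below. The paper only states that three features are computed separately for cell values and column headers; the exact definitions here (exact match, partial match, ambiguous token) are our reconstruction, and all names are our own.

```python
def linking_features(question_tokens, names):
    """Entity-linking features between a question and one set of table
    elements (cell values or column headers): counts of exact matches,
    partial matches (only part of a multi-word name appears), and
    ambiguous question tokens that partially match several names."""
    q = set(t.lower() for t in question_tokens)
    exact = partial = 0
    hits_per_token = {}
    for name in names:
        toks = set(name.lower().split())
        overlap = toks & q
        if toks and toks <= q:
            exact += 1          # every word of the name is in the question
        elif overlap:
            partial += 1        # only some words of the name are present
        for t in overlap:
            hits_per_token[t] = hits_per_token.get(t, 0) + 1
    ambiguous = sum(1 for c in hits_per_token.values() if c > 1)
    return (exact, partial, ambiguous)
```

Concatenating these counts for cell values and for headers, along with logical-form features, yields the feature vector fed to the gradient-boosted reranker.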

Experiments and Analysis
Here we describe key details of the experimental setup, the models compared and evaluation techniques. We also provide a thorough analysis of the results to highlight the key takeaways.

Setup
We consider WikiSQL-TS and WikiTQ-TS for our experiments, with topic assignments as described in Section 3.1. The larger WikiSQL-TS dataset consists of tables, questions and corresponding ground truth SQL queries, whereas WikiTQ-TS contains only natural language questions and answers. The five topics are 1) Politics, 2) Culture, 3) Sports, 4) People and 5) Miscellaneous. Table 1 captures some interesting statistics about the topic-split benchmark created from WikiSQL. All experiments are conducted in a leave-one-out (LOO) fashion where the target topic examples are withheld. For example, if the target topic is Politics, then the model is trained using the train and dev sets of Culture, Sports, People, Misc and evaluated on the test set of Politics. Further, a composite dev set is curated by adding an equal number of synthetically generated questions from the target topic to the dev set of the source topics.
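The leave-one-out protocol can be summarized in a few lines. This is a minimal sketch with our own function and variable names, assuming each instance has already been tagged with its topic group.

```python
def make_loo_splits(instances, topic_groups):
    """Build the leave-one-out topic-shift protocols: for each target
    topic, test = that topic's instances, train = all other topics'
    instances. `instances` is a list of (topic, example) pairs."""
    splits = {}
    for target in topic_groups:
        splits[target] = {
            "train": [ex for t, ex in instances if t != target],
            "test": [ex for t, ex in instances if t == target],
        }
    return splits
```

In the actual protocol the train portion is further divided into train and dev, and the composite dev set additionally mixes in generated target-topic questions.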

Models
We perform all experiments using a variant of the TaBERT+MAPO3 architecture, with the underlying BERT model initialized with bert-base-uncased. TaBERT+MAPO uses standard BERT as the table-question encoder and MAPO (Liang et al., 2018) as the base semantic parser. TaBERT_t+MAPO additionally uses the topic-specific vocabulary extension and pretraining described in Section 3.2. To ensure that the target topic is not leaked through the T5 model, we trained five topic-specific T5 models, one for each leave-one-out group, by considering SQL-question pairs from the source topics only. As WikiTQ-TS does not have ground truth SQL queries included in the dataset, we use T5 trained on WikiSQL-TS to generate synthetic questions. We use a batch size of 10 with a learning rate of 10^-3.

Implementation details: We build upon the existing code base for TaBERT+MAPO released by Yin et al. (2020b) and use BERT-base as the encoder for tables and questions. We use a topic-specific vocabulary (explained in Section 3.2) for BERT's tokenizer and train it with the MLM (masked language model) objective for 3 epochs, with a p=0.15 chance of masking a topic-specific high-frequency token (occurring more than 15 times in the target topic corpus). We optimize BERT parameters using the Adam optimizer with a learning rate of 5×10^-5. All numbers reported are from the test fold, fixing system parameters and selecting the model with the best performance on the corresponding composite dev set. Further details and the dataset are provided in the supplementary material.
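The selective MLM masking described above can be sketched as follows. We read the text as restricting mask candidates to topic-specific high-frequency tokens; whether standard random masking is also applied to other tokens is not stated, so this is an assumption, and the function name is ours.

```python
import random

def mask_positions(tokens, topic_vocab, p=0.15, seed=0):
    """Choose MLM mask positions: each occurrence of a topic-specific
    high-frequency token is masked with probability p; other tokens
    are left untouched."""
    rng = random.Random(seed)
    return [i for i, tok in enumerate(tokens)
            if tok in topic_vocab and rng.random() < p]
```

The returned indices would be replaced by [MASK] (or corrupted per the usual BERT 80/10/10 scheme) before computing the MLM loss.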

Results and Analysis
WikiSQL-TS: TaBERT_t+MAPO improves over TaBERT+MAPO for four out of five test topics, by an average of 1.66%, showing the advantage of vocabulary extension (Table 4). In addition to supplying the topic-specific sense of vocabulary, fine-tuning also avoids introducing word-pieces that adversely affect topic-specific language understanding. For instance, for the topic Culture the whole word 'rockstar' is added to the vocabulary rather than the word-pieces 'rocks', '##tar'. We implement vocabulary extension using the 1000 placeholders in BERT's vocabulary, accommodating high-frequency words from the target topic corpus. Further, TaBERT+MAPO+QG significantly outperforms TaBERT+MAPO and also TaBERT_t+MAPO when finetuned with target topic samples obtained from QG (after careful filtering). In WikiSQL-TS, QG also improves the performance of TaBERT_t+MAPO, even though relevant vocabulary was already added to BERT, suggesting additional benefits of QG in the T3QA framework. While vocabulary extension ensures topical tokens are encoded, QG improves implicit linking between question and table header tokens within the joint encoding of question and table. The largest improvements, of 10.53% and 7.74%, are obtained for People and Culture respectively. Moreover, TaBERT+MAPO+QG surpasses the in-topic performance of 64.07% and 67% with 66.27% and 69.88% (details in Appendix D), showing that unseen-topic performance can be substantially improved with only auxiliary text and tables from documents, without explicitly annotated (table, question, answer) tuples.
As mentioned, Misc is a topic chimera with mixed statistics, hence explicit injection of frequent vocabulary does not significantly improve TaBERT_t+MAPO over TaBERT+MAPO. However, TaBERT+MAPO+QG outperforms TaBERT_t+MAPO by 5.4% due to QG, suggesting that the improvements from the two methods are disjoint. Further, question generation, though conditioned on the table and topic-specific text, is not supplied with the topic vocabulary. We also observe that the composite dev set with 50% real questions and 50% questions generated on tables from the target topic improves performance. Tables 4 & 5 take advantage of ground truth SQL queries to further dissect the performance along question types and number of WHERE clauses.

Number of WHERE clauses: As described previously, the performance of TaBERT+MAPO is substantially affected by the number of WHERE clauses in the ground truth logical form (also observed by Guo et al. (2018)); see Appendix A. Table 5 shows that the performance improvement from the reranker is significantly higher for more than one WHERE clause. This might be because TaBERT+MAPO prefers to decode shorter logical forms, whereas the reranker prioritizes logical forms containing more of the entities linked from the question.

WikiSQL question types: Table 4 breaks down the performance of TaBERT+MAPO+QG based on question-type labels, obtained from the dataset ground truth for analysis only. The improvement, viewed through the lens of question types, is more significant for SELECT-style queries, with an average gain of 9.76%. Aggregate (count, min/max, sum, avg) questions are more challenging to generate, as the answer is not present in the table. Consequently, the performance improvement with QG is less significant for these question types.
WikiTQ-TS: WikiTQ-TS is a smaller dataset and contains more complex questions (negations, implicit nested queries) compared to WikiSQL-TS. Correspondingly, there is also less topic-specific text to pretrain the TaBERT encoder. Despite these limitations, we observe in Table 7 that TaBERT_t with vocabulary extension and pretraining shows overall improvement. We resort to using synthetic questions generated by the QG model of WikiSQL-TS, due to the unavailability of ground truth SQL queries in WikiTQ; hence, the generated questions are often different in structure from the ground truth questions. Samples of real and generated questions are in Table 8 of Appendix C. Despite this difference in question distribution, we see that TaBERT+QG consistently performs better than the baseline. We provide an analysis of the perplexity scores from TaBERT and TaBERT_t on the generated questions in Appendix G. Ultimately, the proposed T3QA framework significantly improves performance in all target topics.

Conclusion
This paper introduces the problem of TableQA for unseen topics. We propose novel topic-split benchmarks over WikiSQL and WikiTQ and highlight the drop in performance of TaBERT+MAPO, even when TaBERT is pretrained on a large open-domain corpus. We show that significant gains in performance can be achieved by (i) extending the vocabulary of BERT with topic-specific tokens, (ii) fine-tuning the model with our proposed constrained question generation, which transcribes SQL into natural language, and (iii) re-ranking logical forms based on features associated with entity linking and logical form structure. We believe that the proposed benchmark can be used by the community for building and evaluating robust TableQA models for practical settings.

A Accuracy vs. number of WHERE clauses

We analyse the accuracy of the TaBERT model on WikiSQL in terms of the number of WHERE clauses, whose distribution is skewed as shown in Fig. 4(a). In Fig. 4(b), we observe that accuracy decreases when the ground truth SQL has a larger number of WHERE clauses. Interestingly, we observe in Fig. 4(c) and (d) that even when the model achieves 30% to 40% accuracy for 2-4 WHERE clauses, the predicted logical form often still contains only one WHERE clause. This shows that, for many questions, wrong or incomplete logical forms can produce correct answers.

B Topic-shift benchmark details
Continuing from Section 3.1, this section provides more details about the creation of the topic-shift benchmark datasets. Each Wikipedia article is tagged with a set of categories, and each category is further tagged with a set of parent categories, and so on. The whole set of Wikipedia categories is organized in a taxonomy-like structure called the Wikipedia Category Graph (WCG) (Zesch and Gurevych, 2007). These categories range from specific topics such as "Major League Soccer awards" to general topics such as "Human Nature". To have categories of similar granularity, we use the 42 categories listed in Wikipedia Category:Main topic articles4 as topics.
To assign a unique category to a Wikipedia article, we proceed as follows:
• For each table T, we extract the Wikipedia article A which contains table T.
• We start with the categories of A and traverse the hierarchical categories until we reach one (or more) of the 42 categories listed in Wikipedia Category:Main topic articles.
• If multiple main topic categories can be reached from A, we take the category which is reached via the shortest path (in terms of number of hierarchical categories traversed from A) and assign that as the category for table T .
• If there are multiple main topic categories which can be reached with the same shortest-path length, we use the number of different paths between the main topic category and A as the tie-breaker to assign the topic for A.

Now we describe the method used to cluster categories into topics. For every article, we identify the five categories closest to the article in the Wikipedia Category Graph. We then compute the Jaccard similarity between two topics as the ratio of the number of articles common to both topics (in the first-5 lists) to the total number of articles assigned to either topic. Using this similarity, we apply spectral co-clustering (Dhillon, 2001) to form five topic groups.
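The shortest-path traversal with its tie-breaker can be sketched as a breadth-first search over the category graph. This is an illustrative reconstruction: the graph representation (a child-to-parents dict) and all names are our own.

```python
from collections import deque

def assign_topic(start_category, parents, main_topics):
    """Assign a main-topic category to an article: take the main topic
    reachable via the shortest chain of parent categories, breaking
    ties by the number of distinct shortest paths to each candidate."""
    dist = {start_category: 0}
    npaths = {start_category: 1}
    queue = deque([start_category])
    while queue:
        cat = queue.popleft()
        for parent in parents.get(cat, []):
            if parent not in dist:
                dist[parent] = dist[cat] + 1
                npaths[parent] = npaths[cat]
                queue.append(parent)
            elif dist[parent] == dist[cat] + 1:
                npaths[parent] += npaths[cat]  # another shortest path
    reached = [t for t in main_topics if t in dist]
    if not reached:
        return None
    # Prefer shortest distance, then the candidate with most shortest paths.
    return min(reached, key=lambda t: (dist[t], -npaths[t]))
```

In practice the start category would be each category tag of article A, and the result is transferred to the tables contained in A.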
To verify the coherence of the five topic groups, we performed a vocabulary-overlap exercise. For questions in WikiTQ, we find the 100 most frequent words in the test set of each topic. Then we measure how many of these frequent words appear in the train set of each of the topics. Table 9 shows that word overlap is large within clusters.

C Gold and generated questions for WikiTQ-TS

Table 8: Gold questions in the dataset vs. questions generated by our QG model, for a WikiTQ table on the 1982 British Formula One season.

Gold questions in the dataset:
- how many v8 engines competed in the 1982 British formula one season?
- how many entrants have names that contain the word "team"?
- name an entrant with no cosworth engines.
- how many drivers use v8 engines?
- what is the total number of drivers listed?
- who is the only driver to use a v12 engine?
- Are there any other engines listed besides cosworth or brm?
- Which is the only driver whose vehicle used a brm 202 v12 engine?
- What is the last chassis listed?

Generated questions:
- which constructor has an Entrant of Colin Bennett racing, and a no smaller than 7.0?
- what is the average number that has a constructor of Ensign?
- who is the driver with a no of 14.0?
- what is the average number of FW07 chassis?
- what engine has a driver of Steveo'rourke?
- name the most number for Chassis being N180b.
- what is the lowest number that has a constructor of Ensign?
- what is the total number that has a constructor of Williams?
- what is the largest number for teamensign?

D Performance when topics are seen
We further analyse the performance of the model with seen-topic training (when the topic-specific train set is available) against unseen-topic training (when the topic-specific train set is not used during training). In Table 10, we present results for both training setups. Table 11 shows the absolute values corresponding to Table 6 in the paper. The performance of both models is lower for questions with more WHERE clauses.

E Additional Experiments
Table 12: WikiSQL-TS performance for TaBERT_t+MAPO+QG+Reranker and TaBERT+MAPO (separated by '/') across the number of WHERE clauses in the ground truth logical forms.

F Training details
We train all TaBERT+MAPO variants for 10 epochs on 4 Tesla V100 GPUs using mixed-precision training5. For training TaBERT+MAPO, we set the batch size to 10 and the number of explore samples to 10; other hyperparameters are kept the same as in Yin et al. (2020a). We build upon the codebase6 released by Yin et al. (2020a). The hyperparameters, where not mentioned explicitly, are the same as in the original code. We include all the data splits and predictions from our best model as supplementary material with the paper. These will be released publicly upon acceptance. For each of the 5 topics, we evaluated 6 variations of the model, and performed a search over 4 sets of hyperparameters, primarily on the composition of generated vs. real questions.

G TaBERT vs. TaBERT t perplexity of generated questions for WikiTQ-TS
We compute perplexity scores over a subset of 50 generated questions used in the experiments, using both the TaBERT and TaBERT_t language models. Note that TaBERT is pretrained on a large open-domain corpus, whereas TaBERT_t is further fine-tuned on topic-specific documents closely related to the tables of the target topic. As shown in Table 13, the average perplexity score from TaBERT_t is larger than that from TaBERT. This indicates that the generated questions are not well aligned to the topic in the case of WikiTQ-TS, due to the lack of training examples specific to this dataset, as mentioned in Section 4.3. Future work on topic-specific question generation may address this issue.

We suspect that this might be the reason why TaBERT_t+QG does not outperform TaBERT+QG.