HiddenTables & PyQTax: A Cooperative Game and Dataset For TableQA to Ensure Scale and Data Privacy Across a Myriad of Taxonomies

A myriad of different Large Language Models (LLMs) face a common challenge in contextually analyzing table question-answering tasks. These challenges are engendered by (1) finite context windows for large tables, (2) multi-faceted discrepancies between tokenization patterns and cell boundaries, and (3) various limitations stemming from data confidentiality in the process of using external models such as gpt-3.5-turbo. We propose a cooperative game dubbed "HiddenTables" as a potential resolution to this challenge. In essence, "HiddenTables" is played between the code-generating LLM "Solver" and the "Oracle", which evaluates the ability of the LLM agents to solve TableQA tasks. This game is based on natural language schemas and, importantly, ensures the security of the underlying data. We provide evidential experiments on a diverse set of tables that demonstrate an LLM's collective inability to generalize and perform on complex queries, handle compositional dependencies, and align natural language to programmatic commands when concrete table schemas are provided. Unlike encoder-based models, we have pushed the boundaries of "HiddenTables" to not be limited by the number of rows; therefore, we exhibit improved efficiency in prompt and completion tokens. Our infrastructure has spawned a new dataset "PyQTax" that spans 116,671 question-table-answer triplets and provides additional fine-grained breakdowns and labels for varying question taxonomies. Therefore, in tandem with our academic contributions regarding LLMs' deficiency in TableQA, PyQTax encompasses 116,671 question-table-answer-python quadruplets of varying degrees and taxonomies for promising future academic experiments.


Introduction
Encoder-based approaches in contextually analyzing table question-answering tasks for language models typically prioritize and highlight the methods' achievement in accuracy (Herzig et al., 2020; Liu et al., 2022). However, in many cases, a prerequisite for these approaches to achieve such accuracy is the exposition of the tabular content in its entirety and the indulgent ingestion of tokens (Herzig et al., 2020; Yin et al., 2020; Yu et al., 2021; Liu et al., 2022). Such liberal dispositions towards privacy and efficiency can be deemed impractical in the tangible deployment of language models within institutions. Moreover, the necessity to expose the underlying data begs the question of whether the model actually understands the question in order to provide an accurate answer. In essence, our endeavor is also an intellectual pursuit to answer the "Chinese room argument" with regards to language models (Cole, 2023). Therefore, we propose an alternative approach for table question-answering tasks: a cooperative game dubbed "HiddenTables". HiddenTables is comprised of two agents: an "Oracle" and a "Solver", in which the latter generates code to answer user queries relying solely on the Oracle's instructions and relaying of schema. In other words, the game is played without the Solver knowing the tabular content. The Solver's code is then evaluated by the secure Oracle, which relays the answer to the user or asks follow-up questions to the Solver. Figure 1 summarizes the environmental set-up that our method enables between a user and gpt-3.5-turbo.
Therefore, this paper sets forth a general system architecture that can be employed across a myriad of taxonomies and tabular formats. We find that the accuracy of gpt-3.5-turbo decreases within our cooperative game, albeit with fewer tokens and tightened privacy. In summary, HiddenTables and its pertinent experiments have brought forth the following contributions to the academic community:

• We have devised a construct that can complement an encoder-based approach in table question-answering tasks for language models, as a less costly and more secure alternative with significantly decreased risk in data exploitation.

Figure 1: Overview of our system apparatus for HiddenTables. The setup requires two agents, an Oracle and a Solver, which may or may not be on the same device. For our purposes, the Solver is a gpt-3.5-turbo LLM agent that handles generation off-site, and therefore potentially poses a risk of adversarial attacks. We outline the conversation between our agents, a message-passing channel that transfers solution code along with follow-up questions, without exposing any information from the datalake. Finally, the Oracle will provide the answer to the user.
• Leveraging the code-generation capabilities of language models allows for a full chain-of-thought exposition via programmatic commands, enabling further interpretability into the answer retrieval process than prior encoder or sequence-to-sequence models provided.

• Our cooperative game is a robust demonstration that the accuracy of gpt-3.5-turbo decreases rapidly when language models are not given the entirety of the data, yet improves with consecutive rounds of feedback.

• Therefore, our study contributes not only to the institutional adoption process of language models but also to the critical question of the general intelligence capabilities of language models with regards to the "Chinese room argument".

Related Work
Since the advent of Transformer-based attention models, pre-trained language models have shown remarkable success in learning and encoding the semantics of tabular content (Vaswani et al., 2017). Methods employing encoder-based architectures rely on Masked Language Modeling (MLM) to learn semantics and dense representations of tabular content. Yet they are pre-trained on natural language text tokenized by byte-pair encoding or WordPiece (Devlin et al., 2019; Sennrich et al., 2016), which can misalign with tabular structure. TaPas (Herzig et al., 2020) extends BERT with table-aware embeddings, while TaPEx (Liu et al., 2022) relies on a BART encoder-decoder backbone (Lewis et al., 2019) to encode tables and generate answers in an autoregressive fashion. However, HiddenTables relies solely on the generative power of autoregressive decoders (Brown et al., 2020) and instruction-aligned models trained with reinforcement learning from human feedback (RLHF) (Ouyang et al., 2022) to generate solutions based on prompts (Liu et al., 2023) rather than fine-tuning. Furthermore, prior work shows that language models can more effectively solve problems when decomposing them into steps or a chain of thought (Wei et al., 2023; Nye et al., 2021). HiddenTables is inspired by using chain of thought through code, as demonstrated by (Liang et al., 2022) for robotic programs, action plan generation in robotics (Ahn et al., 2022), web browsing (Nakano et al., 2022), tool APIs (Schick et al., 2023), automated workflows (Zeng et al., 2023), or the generation of valid programs for arithmetic computation (Gao et al., 2022). ReAct (Yao et al., 2023) explores how LLMs can improve their chain-of-thought reasoning via intermediate actions and interactions with external sources. Furthermore, BINDER (Cheng et al., 2023) demonstrated a neural-symbolic approach to mapping questions to a program, building upon the work in (Rajkumar et al., 2022) for semantic parsing and code generation. Also, previous literature has explored how LLMs can interact with themselves through intermediate followups (Press et al., 2023), chained LLM prompts (Wu et al., 2022), or cascades (Dohan et al., 2022). (Reynolds and McDonell, 2021) proposed how LLMs can be encouraged to generate their own prompts for solving tasks. Finally, MemPrompt (Madaan et al., 2022) demonstrated that memories of errors and user feedback can be incorporated as part of the conversation to help prevent repetitive mistakes.

Methodology
Our proposed framework is inspired by the "Chinese room argument": to what extent could language models truly comprehend natural language and align language to the correct solution when only given the table schema? In HiddenTables, two agents exist: the Oracle and the Solver. The clear delineation between these two agents' respective roles not only allows the user to test the model's holistic ability to comprehend tabular content but also enables the preservation of privacy with regards to the underlying data on-premise. In this context, our proposed apparatus allows the two agents to engage in a conversation, in which the Oracle may ask questions and the Solver will generate code that could solve the Oracle's question. Next, the Oracle will evaluate and follow up, which enables the Solver to correct any mistakes or misunderstandings. This game is played for a maximum of seven rounds to prevent infinite cycles between the agents. Throughout this process, no data entries are exposed to the Solver: the Solver must produce executable code relying solely on the schema and the set of instructions.
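The round structure described above can be sketched as a simple control loop. The agent interfaces below (`make_prompt`, `generate_code`, `execute`) are hypothetical stand-ins for illustration, not the paper's implementation:

```python
MAX_ROUNDS = 7  # prevents infinite cycles between the agents

def play_hidden_tables(oracle, solver, query):
    """One game: the Solver sees only the schema and feedback, never the data."""
    prompt = oracle.make_prompt(query)           # role, instructions, schema, question
    for _ in range(MAX_ROUNDS):
        code = solver.generate_code(prompt)      # LLM completion, off-premise
        answer, feedback = oracle.execute(code)  # secure, on-premise execution
        if answer is not None:
            return answer                        # state of "successful retrieval"
        prompt = feedback                        # sanitized error / follow-up
    return None                                  # state of "failure"
```

The loop terminates either on the first round that yields an answer or after seven unsuccessful rounds, matching the game's halting rule.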

The Oracle
The Oracle takes the user query and crafts an appropriate prompt for the Solver, which is structured as a role, instruction, relevant schema, and the question (RISQ). It will not expose any individual data entries in the table. This allows the Oracle to protect highly confidential information in a firewalled system from any adversaries. This prompt is then sent to the Solver, as fully outlined in Figure 2. Furthermore, we include a discussion on the prompt burden (§3.8) juxtaposed against holistic encoder methods (Table 1).

Table 1: Number of tokens required to be analyzed by gpt-3.5-turbo if a holistic table encoding approach was adopted, as in (Herzig et al., 2020; Liu et al., 2022). Query, Table and Answer totals are provided per dataset and in aggregate; over 116,703 samples, the aggregate totals are 1,725,897 query tokens, 67,114,262 table tokens, and 985,727 answer tokens, for 69,825,866 tokens overall. Note that the largest table dimensions encountered were 1,956 rows, 44 columns, and 11,600 entries.
Our system seeks to minimize token usage through schemas only, thereby bounding the number of tokens used to the number of columns, instead of to the number of entries.
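A minimal sketch of RISQ prompt assembly illustrates why the prompt is bounded by the column count: only column names and dtypes are serialized, never row data. The function name and example schema are assumptions for illustration; the role and instruction strings paraphrase Figure 2:

```python
def build_risq_prompt(question, schema):
    """Assemble a Role, Instructions, Schema, Question (RISQ) prompt.

    Only column names and dtypes are serialized -- no row data -- so the
    prompt size is bounded by the number of columns, not the number of rows.
    """
    role = ("You are an AI Assistant that can answer questions from tables "
            "by writing python pandas code.")
    instructions = ("1) You must write python code to operate on a pandas "
                    "dataframe named df\n"
                    "2) Use reset_index() after any groupby operation "
                    "involving aggregation")
    cols = ", ".join(f"{name} ({dtype})" for name, dtype in schema)
    return (f"{role}\nFollow these instructions:\n{instructions}\n"
            f"The table df has the following columns: {cols}\n"
            f"Question: {question}")

# Hypothetical schema: the Oracle relays only this metadata, never cell values.
prompt = build_risq_prompt(
    "How many people born before 1990 have been subpoenaed?",
    [("name", "object"), ("born", "int64"), ("subpoenaed", "bool")],
)
```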
The Oracle also maintains the datalake in the Secure Interpreter, which executes the code produced by the Solver (§3.2). Moreover, the Secure Interpreter ensures that any request to expose the dataset via code injection is rejected and that it only returns the answer to the user's query. We provide more details on the Oracle's followups in §3.3.

The Solver
The Solver is a code-generating LLM agent that accepts the Oracle's instructions, question, and tabular schema. It then strives to translate and align the prompt into a sequence of executable operators that can be applied to the hidden table. In prior literature, the main choice of query language was SQL (Zhong et al., 2017); however, within our construct, the Solver does not need to be restricted to any specific programming language. HiddenTables opts to use Python as the Solver's language of choice, as it is dynamically typed, easily readable, and procedure-oriented. Therefore, it is convenient to view the chain of thought through iterative commands. Finally, byproducts of our generative experiments have yielded an amalgamation of verified Python programs grounded to each question-table-answer triplet and linked to varying taxonomies; we introduce this new dataset as PyQTax (§3.9).
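For the query in Figure 1, the Solver might emit pandas code along these lines. The column names are assumptions drawn only from the relayed schema; the sample dataframe below merely stands in for the Oracle's hidden table, which the Solver never sees:

```python
import pandas as pd

# The Oracle binds `df` to the hidden table on-premise; this sample df is an
# illustrative stand-in with assumed columns: born (int64), subpoenaed (bool).
df = pd.DataFrame({"born": [1985, 1992, 1978],
                   "subpoenaed": [True, False, True]})

# Filter, then aggregate -- each step exposes the chain of thought.
mask = (df["born"] < 1990) & df["subpoenaed"]
answer = int(mask.sum())
```

The explicit filter-then-aggregate sequence is what makes the retrieval process interpretable: each intermediate command can be read and audited.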

The Conversation
We now outline the communication channel between the two agents. Foremost, the Oracle sends the instructions to the Solver. The instructions are an itemized list that dictates the format of the Solver's response. The instructions and rationale are outlined in Figure 2.
Next, the Solver responds with what it deems to be the best sequence of commands to answer the query. This is sent to the Oracle as free text along with embedded code, including artifacts pertaining to explanations and chain of thought. Consequently, the Oracle sets up a secure environment, locally fire-walled with its dataset. As mentioned above, this environment ensures that any arbitrary execution of code is non-destructive and any exposure of the underlying tabular data is disabled.
As a result of this conversation, there are two states that will be defined in detail: a state of "successful retrieval" or one of "failure". A state of successful retrieval is defined as one in which an answer has been generated from executing the Solver's code in the Oracle's secure environment. This answer could be a text entry from the table, an aggregated value such as a sum, or a list of table entries. In contrast, a state of failure is defined as an error message (such as Value or Index errors), a NULL answer that provides no identifiable result (an empty dataframe), the absence of any executable code, or the Solver's comment that it cannot answer. For each type of failure, the Oracle handles the state differently. Firstly, errors can be sanitized to remove any data references and fed back to the Solver, as prior literature regarding self-correcting code has discussed (Madaan et al., 2022). Secondly, empty dataframes can be conveniently identified, with the Solver being informed that the generated code produced no valid results. Thirdly, if the Solver is conservative in answering the question and provides no executable code, the Oracle reassures the Solver that the question can be answered from the table provided. Within this context, new failures can be re-prompted to the Solver for correction by the Oracle.
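The three failure handlers above can be sketched as a small dispatch function. The `result` dictionary and its keys are illustrative assumptions, not the paper's implementation:

```python
def feedback_for_failure(result):
    """Map an execution outcome to the Oracle's follow-up message.

    `result` is a hypothetical dict describing what happened when the
    Solver's code ran inside the secure environment.
    """
    if result.get("error"):
        # 1) Sanitize the traceback so no cell values leak, then feed it back.
        sanitized = result["error"].splitlines()[-1]
        return f"Your code raised: {sanitized}. Please fix it."
    if result.get("empty"):
        # 2) Empty dataframes: tell the Solver no valid rows matched.
        return "The generated code produced no valid results. Please revise."
    if result.get("refused"):
        # 3) The Solver declined: reassure it the table can answer the question.
        return "The question can be answered from the table provided."
    return None  # successful retrieval -- no follow-up needed
```

Keeping only the final traceback line is one simple sanitization choice; a production system would scrub any literal values embedded in the message as well.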
With this apparatus to correct initial failures while retaining the original context throughout the conversation session, we allow the Oracle and Solver to interact a maximum of seven times before the conversation is halted and the final verdict for the query is designated as a failure. We have discovered that failures are common when the answer is within an extractive span in a single table cell (free text) or if the answer resides in a generic column such as 'comments' or 'notes' that complicates contextual inference.

Minor Roles
The User The user's query initiates our game of HiddenTables.
Datalake The Oracle has read-access to a datalake, which stores the tables and entries in a secured environment on-premise.
Firewall This boundary, denoted in Figure 1, separates the on-premise and off-premise environments the agents operate in. This setup can enable guided entry into the on-premise environment.

Benefits of Demarcating the Roles
Demarcating the boundaries between the Oracle and Solver ensures that the underlying dataset is protected. This is beneficial because, firstly, for many institutions that handle sensitive or confidential data such as personally identifiable information, the Oracle can prevent any off-premise entities from accessing the data while still helping generate answers. Secondly, this demarcation ensures that code is executed in a regulated and structured manner, regardless of the user's location or device. Thirdly, an additional layer of control is created, while still allowing third-party API providers to operate on the data.

Question, Table, and Answer Token Counts
Table 1 outlines each set's total token count for gpt-3.5-turbo if sent to the model. The number of tokens was determined by OpenAI's fast BPE encoder tiktoken. The dominating term for token counts is in the table entries themselves: 96.1% of the outstanding burden is located here. However, previous encoder-based methods were limited by the model's sequence length and the memory constraint of computing multi-headed attention between every cell (Liu et al., 2022). In contrast, our construct is comparably linear in its token usage. If we define the number of rows as r and the number of columns as c, then the total token count for a table is O(rc), which grows multiplicatively as either term increases. However, in HiddenTables, since the only input required for solving a table query is bounded by the number of columns, O(c), token growth is linear. A table could add arbitrarily many rows, yet our task will still include the same number of columns c.
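A back-of-envelope comparison makes the O(rc) versus O(c) distinction concrete. The per-cell and per-column token constants below are illustrative assumptions, not measured values:

```python
# Assumed average token costs (illustrative constants only).
TOKENS_PER_CELL = 3
TOKENS_PER_COLUMN = 3

def holistic_tokens(rows, cols):
    """O(rc): a holistic encoding serializes every table entry."""
    return rows * cols * TOKENS_PER_CELL

def hidden_tables_tokens(cols):
    """O(c): HiddenTables serializes the schema only."""
    return cols * TOKENS_PER_COLUMN

# Doubling the rows doubles the holistic burden but leaves ours unchanged.
assert holistic_tokens(2000, 44) == 2 * holistic_tokens(1000, 44)
```

For the largest table in the corpus (1,956 rows, 44 columns), the holistic cost scales with all 86,064 cells, while the schema-only cost depends on the 44 columns alone.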

Privacy
Another by-product of this setup is privacy. Since row entries are omitted and safeguarded by the Oracle, the Solver must form a general solution from the schema only. More importantly, the Oracle can be configured with additional safety prompts and code policies to ensure that any adversarial attacks by the Solver are properly handled. However, this system may need additional safeguards against side-channel attacks to obfuscate successful retrievals from failures (Kocher, 1996).

Prompt Burden
Given the replacement of table entries with our RISQ system prompt, we analyzed the distribution of usage tokens for all three of our datasets. Of the 116,661 samples accepted by the Solver and responded to (without error), the average prompt burden was 279 tokens with a standard deviation of 19 tokens. The minimum, median, and maximum prompt usage was 243, 275, and 630 tokens, respectively. Overall, the total number of tokens used in the Solver's system prompt was 32,546,634. This is only 48.5% of the burden incurred by using the entire table. As mentioned in §3.6, our construct is efficient for large tables with many rows, as the token burden remains constant for each new row of data. Our Solver generated an average of 115 tokens per answer, with a standard deviation of 61 tokens.

PyQTax
HiddenTables has produced PyQTax, which aligns 116,671 question-table-answer triplets with the verified Python programs generated by the Solver, together with fine-grained taxonomy labels for each question.

We follow the same procedure for the number of table rows and columns, relying on the interquartile range to delineate small, average, and large tables. For WikiSQL, our quartiles for rows are Q1 = 7, Q3 = 18. For WikiTQ, our quartiles for rows are Q1 = 10, Q3 = 25. For SQA, our quartiles for rows are

Table 5: Ablation results for the cumulative accuracy gains per additional conversation round. Each round includes the cumulative total of correct solutions, even if the conversation ended prematurely. Incremental gains in accuracy level off after the third conversation round, as a consequence of a dwindling pool of remaining unsolved problems. Furthermore, issues from parsing persist in the later conversation rounds as the Solver struggles to find the right formats or forgets the original task.

The column quartiles were Q1 = 5, Q3 = 7 for all three datasets. Our experiments show that demarcations for columns show the largest differentials in performance, favoring small tables, while our Solver is consistent across any number of rows. It is difficult to generalize the performance regarding table entries since size is obfuscated by either the number of rows or columns.

Conversation Length & Cumulative Accuracy
For all datasets, we show the necessary number of attempts to write fully executable code. Our experiments show that while the probability of a successful retrieval decreases with more rounds, a considerable number of samples are solved correctly in each round. As reported in Table 5, HiddenTables sees significant cumulative increases in the Solver's accuracy when paired with an Oracle agent for the first three conversation rounds. Afterwards, additional rounds yield diminishing accretive benefits.

WikiSQL
SQL Query Difficulty Following a similar analysis by (Yu et al., 2018; Liu et al., 2022), we break down our WikiSQL results by difficulty, yielding insights into how well the Solver can assemble the required steps based on how many SQL elements appear in the original query. For our analysis, we used SQLGlot to create an abstract syntax tree that reflects the query's complexity. The number of nodes in an abstract syntax tree (AST) corresponds to the query's difficulty; the resulting breakdowns are reported in Table 2.
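The intuition behind the difficulty metric can be shown with a dependency-free sketch. The paper counts nodes in a full SQLGlot AST; counting SQL elements with a regular expression below is a simplified stand-in for that idea, not the paper's method:

```python
import re

# Simplified proxy for AST size: count recognizable SQL elements.
SQL_ELEMENTS = r"\b(SELECT|WHERE|GROUP BY|ORDER BY|HAVING|JOIN|COUNT|SUM|AVG|MIN|MAX)\b"

def query_difficulty(sql):
    """Crude difficulty score: the more elements, the more steps the Solver
    must assemble. The paper instead counts SQLGlot AST nodes."""
    return len(re.findall(SQL_ELEMENTS, sql, flags=re.IGNORECASE))

assert query_difficulty("SELECT name FROM t") == 1
```

Under this proxy, a bare projection scores 1, while a query with filtering, aggregation, and grouping scores correspondingly higher, mirroring the Easy/Medium/Hard buckets in Table 2.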
Operator Difficulty We also evaluate in Table 2 the accuracy of our approach by SQL aggregator, which includes SELECT, MAX, MIN, COUNT, SUM, and AVG operations. WikiSQL is relatively simple, as reflected by 71.8% of questions being SELECT, with COUNT as the next most prominent operator at 9.1%. The top-performing operators are SELECT and SUM. In contrast, HiddenTables exposes gpt-3.5-turbo's deficiency in fetching extrema within a column with MIN/MAX or in simple counting. AVG underperforms, as a significant number of tables include a grand-total entry.

WikiTableQuestions
Operator Difficulty We tag each question in WikiTQ as a Select, Filter, Aggregate, Superlative, Arithmetic, Comparative, Group, or Other operator, as inspired by (Liu et al., 2022), to further understand the limitations of gpt-3.5-turbo. Table 3 enumerates the operator types and the performance breakdown by split.
To quickly tag each question, we used a 7-shot approach with one example per type of question, then leveraged gpt-3.5-turbo to generate the best category for the question. This provides insight into how the model handles each question at inference time, as the same assumptions in categorizing the question influence the generated code.
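The 7-shot tagging prompt can be assembled as below. The category names follow the paper; the example questions and prompt layout are invented placeholders, not the authors' actual shots:

```python
# One hypothetical exemplar per category (Other is the fallback, so it gets no shot).
EXAMPLES = [
    ("Select", "What is the name of the player in row one?"),
    ("Filter", "Which games were played in 2010?"),
    ("Aggregate", "How many medals were won in total?"),
    ("Superlative", "Who scored the most points?"),
    ("Arithmetic", "What is the difference between the two scores?"),
    ("Comparative", "Did team A win more games than team B?"),
    ("Group", "How many players are there per country?"),
]

def tagging_prompt(question):
    """Build a 7-shot prompt asking the model to emit one category label."""
    shots = "\n".join(f"Q: {q}\nCategory: {c}" for c, q in EXAMPLES)
    return f"{shots}\nQ: {question}\nCategory:"
```

The completion returned by the model would then be parsed and reconciled against the eight prescribed labels, defaulting to Other when it matches none.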

SQA
Dependency Difficulty As a conversational dataset, SQA allows the profiling of gpt-3.5-turbo's performance on follow-up questions. In Table 4, we denote the accuracy across several facets. We profile the overall accuracy for each sample and denote the accuracy for the sequence. For intermediate questions Q_i, we showcase the accuracy of the i-th question in the conversation. As expected, the Solver tends to struggle more on highly compositional questions than on initial sequence questions.
Operator Difficulty Since SQA builds off compositional questions from WikiTQ, there is significant overlap between the two. Therefore, we reuse our generated 7-shot question taxonomies for all SQA samples found in the WikiTQ set. If a sample is not found, its category defaults to N/A (2,874 samples).

Privacy & Efficiency vs. Accuracy: Tradeoff
HiddenTables has demonstrated that achieving full privacy and efficiency in the context of table question-answering comes at a cost: withholding illustrative examples and the holistic table degrades accuracy.
Privacy is a crucial concern when working with sensitive data, especially in highly regulated industries. By generating code derived only from the question and the schema of a table, rather than the whole table, data exposure can be limited. Therefore, the Oracle, via the Secure Interpreter, only accesses the relevant portions of the data on-premise, mitigating the risk of any data leaks. HiddenTables compensates for the substantial increase in difficulty from blindly solving TableQA by implementing the pair-programming iterative approach between the Solver and the Oracle, as outlined in The Conversation (§3.3). This iterative approach to problem solving yields a +6.7% increase for WikiSQL, a +8.2% increase for WikiTQ, and a +11.4% increase for SQA.
Efficiency is another consideration for large knowledge bases or computationally intensive tasks. First, generating code allows systems to focus computational resources on subsets of the data internally, rather than processing the entire set as a multi-span extraction or aggregation problem. This results in fewer tokens required during the inference step of an LLM, yielding lower latency and faster response times. Our approach used 48.5% of the total tokens that would be required if table contents were included. This proportion will decrease further as table sizes increase in either rows or columns.
HiddenTables comes with a drawback in terms of accuracy. When relying solely on the schema, the problem shifts from a multi-span extraction task to a semantic parsing and code generation task. This added complexity requires LLMs to interpret and comprehend the question alongside the table structure. As a result, we see that HiddenTables's final accuracy is below that of TaPEx (Liu et al., 2022). By forcing LLMs to align their interpretation of queries to structure, errors in understanding the format of the data dominate most failure cases. While additional conversation rounds mitigate this risk, other errors, such as relying on extraction within a full-text column, still prove difficult.

Conclusion
In this work, we introduced a novel approach to evaluating the generalizability of LLMs across 3 table question-answering datasets. By creating a cooperative game that withholds the underlying data from the model, HiddenTables challenges the Solver to make educated guesses via programmatic commands and operators in order to reach a state of successful retrieval. We have shown that this construct enables computationally efficient large-scale testing of LLMs on massive datasets while ensuring the security of the tabular data. Also, our study provides insights that this task is considerably more difficult than for traditional holistic models, yet lends itself to potentially large-scale industrial applications. We have also quantified this efficiency by showcasing the number of generated tokens in contrast with those of conventional models. We also contribute PyQTax, a dataset aligning generated Python code to table questions and various taxonomies for 116,703 samples. Overall, our work provides a promising direction for future research in the field of table question-answering and has devised a novel construct for the deployment process of language models.

Limitations
While our work presents a novel approach to evaluating the generalizability of LLMs on table question-answering datasets, it is imperative to discuss several limitations of our system. Foremost, our approach requires a Solver to generate code and answer the user query, which may be infeasible. Additionally, our system's reliance on programmatic commands and operators may result in a lack of flexibility when it comes to answering certain types of queries.
Next, while HiddenTables protects the information in the tables by withholding the underlying data from the LLM, it may not address the issue of data privacy in cases where the table schema itself contains sensitive information. Moreover, our system's reliance on an Oracle to evaluate the Solver's code may not be scalable when there is a high volume of user queries.
Lastly, while our results demonstrate effectiveness on English-language datasets, scalability to other languages with more complex morphologies and diacritics is an area that requires further investigation. Additionally, questions are tailored to each dataset, where WikiSQL questions reiterate column names to align language to table retrieval. The discrepancy between experimental questions and real-life user queries can be substantial and warrants further investigation. In summary, while our system presents a promising direction for future research in table question-answering, these limitations must be acknowledged to enable its wider adoption.

Victor Zhong, Caiming Xiong, and Richard Socher. 2017. Seq2SQL: Generating structured queries from natural language using reinforcement learning.

A Few-Shot Categorization of Questions
To provide better clarity into the generalizability of gpt-3.5-turbo, we break down WikiTQ into seven categories of questions. By using the same LLM as the Solver, we gain insight into how gpt-3.5-turbo recognizes and understands what kind of operations should be performed for a given question, based on semantics. To label each question, we select a representative example for each question category and provide these as a 7-shot prompt to the model. We include the candidate question and a directive to label it, then parse and reconcile the generated category with the prescribed eight (Other is a fallback category). For SQA, there is overlap with WikiTQ, and therefore we reuse the same labels when applicable. See Table 17 for an example of each question type, plus the semantic span that correlates with the category.

B Implementation: Secure Interpreter
To execute code generated by the Solver, we provide the Oracle a Secure Interpreter that can directly interact with the data on-premise. This means that our setup, in order to preserve privacy, executes locally. The Solver's generated output is checked for any malicious code, in case of a potential attack through code injection or external requests. First, the interpreter is fire-walled to have no external connections, as the data is already on-premise. Second, the interpreter does not allow any additional packages to be imported. The generated code is inspected for import *, import * as *, and from * import * statements, which are replaced with an empty string. The namespace of the interpreter is pre-installed with verified packages. Finally, to avoid malicious code intended to erase or corrupt data, all operations are performed on a copy of the table. If a copy is not feasible, the database only allows read operations. Any write or in-place operations on the source data are strictly denied. Intermediate artifacts are allowed to be manipulated during execution.
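The sanitize-and-copy mechanics can be sketched as follows. This is a minimal illustration of the two safeguards named above (import stripping and copy-on-execute); the real interpreter is additionally fire-walled, and the function/variable names here are assumptions:

```python
import re

# Strip any import statement the Solver tries to inject.
FORBIDDEN = re.compile(r"^\s*(import\s+\S+.*|from\s+\S+\s+import\s+.*)$",
                       re.MULTILINE)

def run_solver_code(code, df):
    """Sketch of the Secure Interpreter: strip imports, execute on a copy.

    Only a pre-verified namespace (here, pandas and a copy of the table)
    is visible to the generated code; the source table is never mutable.
    """
    sanitized = FORBIDDEN.sub("", code)
    scope = {"df": df.copy(), "pd": __import__("pandas")}  # verified namespace
    exec(sanitized, scope)          # in-place edits only ever hit the copy
    return scope.get("answer")
```

A usage sketch: `run_solver_code("import os\nanswer = int(df['x'].sum())", df)` drops the `import os` line before execution, and any destructive operation the code performs affects only the copied dataframe.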

C Instructions for RISQ
We outline the instructions and their rationale in parentheses.
We outline common errors in Table 6. The largest failure case was gpt-3.5-turbo not providing any executable code. This usually occurs when a question does not align with any column names. Furthermore, IndexError occurs exclusively when attempting to directly access a table value that is strictly out of bounds for the index, which is expected if the Solver does not know how many records are contained in the table after a Filter, Comparative, or Superlative operator. The next most common issue was an AttributeError, often triggered by gpt-3.5-turbo being unable to infer the correct type of variable the code operates on. For instance, the most common objection of the interpreter was "Can only use .str accessor with string values!", indicating a failure to correctly apply string methods to a pandas dataframe. ValueError arose from boolean indexing over NA / NaN values, for which a fix is to include .str.contains(*, na=False).
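The ValueError fix mentioned above can be demonstrated in a few lines; the column name and values below are invented for illustration:

```python
import pandas as pd

df = pd.DataFrame({"notes": ["subpoenaed", None, "pending"]})

# Naive boolean indexing on a column containing NaN raises a ValueError,
# because str.contains returns NaN for missing values. Passing na=False
# treats missing entries as non-matches instead.
hits = df[df["notes"].str.contains("subpoena", na=False)]
```

Without `na=False`, the mask contains a NaN for the missing entry and pandas refuses to index with it; with the flag, the row is simply excluded.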
Finally, KeyError is fairly straightforward -the Solver produced code that accesses a column not available in the transformed tables, either through hallucination or as a byproduct of aggregation.

Figure 2: Outline of our Role, Instructions, Schema, and Question (RISQ) prompt template that the Oracle generates for the Solver. Each instruction was curated to align the Solver's code with our tables. For instance, all string comparisons are case-insensitive and Unicode-normalized. For each prompt component we outline the token complexity, which is bounded by the number of columns O(c) in the schema.


The original purpose of WikiSQL was to translate natural language into SQL, and we have repurposed this task to write Python code. WikiSQL is comprised of simple questions, selecting and filtering table entries (71.8%), that align well with the table schema. Aggregation operations comprise only 28.2% of the questions. It consists of 80,654 total examples over 24,241 tables. However, 2% of the set's answers are incorrect according to (Herzig et al., 2020; Liu et al., 2022).

5.1 Table Size

We break down our analysis based upon the interquartile range on table tokens: small tables represent the lower quartile (≈ 25%), average tables

Table 2: We provide breakdowns of each WikiSQL split by complexity of the required operations to produce the answer and by each aggregator. The best performing taxonomies are Easy and Medium difficulty questions, SELECT, and tables with a small number of columns. Medium-style questions comprise 69% of the overall set, with Hard at 24%. SELECT is the dominant operator at 71.8% of questions. TaPEx achieved a denotation accuracy of 89.5% on WikiSQL-Weak.

the middle 50%, and large tables the upper quartile (≈ 75%). This enables outlier categorization into the pertinent buckets that guide the amount of content any model processes to produce an answer. For WikiSQL, the first and third quartiles are 247 and 607 tokens. For WikiTQ, the quartiles are 288 and 805. For SQA, the quartiles are 248 and 492.

Table 3: We provide breakdowns of each WikiTQ split by the type of operation, table size by entries, rows, and columns, and the number of conversation rounds required by the Solver. WikiTQ provides insight into how language models can handle complex QA challenges. We employ few-shot categorization to label each question (§5.4). The best performing taxonomies are Aggregate and Comparative for operators, small tables with limited entries, and tables solved in Round 1. Note that the Solver is consistent in performance regarding row size. TaPEx achieved a denotation accuracy of 57.0% and 57.5% on the Dev and Test sets, respectively.
Table 4: The best performing taxonomies are Q1, the Filter, Superlative, and Other operators, and conversation rounds 1 & 2. The Arithmetic and Select operators are the most deficient, as compositional errors propagate downstream. TaPEx achieved an SQA test accuracy of 74.5%.
1. You must write python code and operate on a pandas dataframe named df. (Aligns the Solver to the starting variable to operate on.)

Table 10: Cross-taxonomy accuracy for all WikiSQL sets by difficulty and operator against the number of table rows. There are no discernible trends, highlighting that HiddenTables is not dependent on the number of rows for performance. Therefore, the trends in Table 7 are exclusively driven by the number of columns.

Table 11: Cross-taxonomy accuracy for all WikiTQ sets by operator against the number of table rows. Generally, performance for most partitions is greatest on average-sized tables.

Table 12: Cross-taxonomy accuracy for all SQA sets by question sequence and operator against the number of rows. Filter increases in performance, perhaps being agnostic to the number of items, while Aggregate shows increased sensitivity to the inclusion of outliers.

H Examining the Effect of the Number of Columns on Performance

Table 13: Cross-taxonomy accuracy for all WikiSQL sets by difficulty and operator against the number of table columns. As difficulty increases, the number of table columns has more influence on performance, yet for simple questions it shows no differentiation. No discernible trend can be inferred for SQL operator.

Table 14: Cross-taxonomy accuracy for all WikiTQ sets by operator against the number of columns. Performance increases with more columns, suggesting that question complexity plays a greater role than operator.

Table 15: Cross-taxonomy accuracy for all SQA sets by question sequence and operator against the number of columns. There is no discernible influence of columns on the performance of HiddenTables.