HiTab: A Hierarchical Table Dataset for Question Answering and Natural Language Generation

Tables are often created with hierarchies, but existing works on table reasoning mainly focus on flat tables and neglect hierarchical tables. Hierarchical tables challenge existing methods with hierarchical indexing, as well as implicit relationships of calculation and semantics. This work presents HiTab, a free and open dataset to study question answering (QA) and natural language generation (NLG) over hierarchical tables. HiTab is a cross-domain dataset constructed from a wealth of statistical reports (analyses) and Wikipedia pages, and has unique characteristics: (1) nearly all tables are hierarchical; (2) both target sentences for NLG and questions for QA are revised from original, meaningful, and diverse descriptive sentences authored by the analysts and professionals who produced the reports; and (3) to reveal complex numerical reasoning in statistical analyses, we provide fine-grained annotations of entity and quantity alignment. HiTab provides 10,686 QA pairs and descriptive sentences with well-annotated quantity and entity alignment on 3,597 tables with broad coverage of table hierarchies and numerical reasoning types. Targeting hierarchical structure, we devise a novel hierarchy-aware logical form for symbolic reasoning over tables, which shows high effectiveness. Targeting complex numerical reasoning, we propose partially supervised training given annotations of entity and quantity alignment, which helps models largely reduce spurious predictions in the QA task. In the NLG task, we find that entity and quantity alignment also helps NLG models generate better results in a conditional generation setting. Experiment results of state-of-the-art baselines suggest that this dataset presents a strong challenge and a valuable benchmark for future research.


Introduction
In recent years, there has been a flurry of work on reasoning over semi-structured tables, e.g., answering questions over tables (Yu et al., 2018; Pasupat and Liang, 2015) and generating fluent and faithful text from tables (Lebret et al., 2016; Parikh et al., 2020). But these works mainly focus on simple flat tables and neglect complex tables, e.g., hierarchical tables. A table is regarded as hierarchical if its header exhibits a multi-level structure (Lim and Ng, 1999; Chen and Cafarella, 2014). Hierarchical tables are widely used, especially in data products, statistical reports, and research papers in government, finance, and science-related domains.

* Equal contributions. Work done during Zhoujun and Zhiruo's internship at Microsoft Research Asia.
† Corresponding author.
1 https://www.nsf.gov/statistics/2019/nsf19319/
[Figure 1 example sentence: "Teaching assistantships were most commonly reported as the primary mechanism of support for master's students (11%)."]

Hierarchical tables challenge QA and NLG due to: (1) Hierarchical indexing. Hierarchical headers, such as D2:G3 and A4:A25 in Figure 1, are informative and intuitive for readers, but make cell selection much more compositional than in flat tables, requiring multi-level and bi-dimensional indexing. For example, to select the cell E5 ("66.6"), one needs to specify two top header cells, "Master's" and "Percent", and two left header cells, "All full-time" and "Self-support".
(2) Implicit calculation relationships among quantities. In hierarchical tables, it is common to insert aggregated rows and columns without explicit indications, e.g., total (columns B, D, F and rows 4, 6, 7, 20) and proportion (columns C, E, G), which challenges precise numerical inference.
(3) Implicit semantic relationships among entities. There are various cross-row, cross-column, and cross-level entity relationships, but they lack explicit indications, e.g., "source" and "mechanism" in A2 describe A6:A19 and A20:A25 respectively, and D2 ("Master's") and F2 ("Doctoral") can be jointly described by a virtual entity, "Degree". How to identify semantic relationships and link entities correctly is also a challenge.
In this paper, we aim to build a dataset for hierarchical table QA and NLG. But without sufficient data analysts, it is hard to ensure that questions and descriptions are meaningful and diverse (Gururangan et al., 2018; Poliak et al., 2018). Fortunately, large amounts of statistical reports are publicly available from a variety of organizations (StatCan; NSF; Census; CDC; BLS; IMF), containing rich hierarchical tables and textual descriptions. Take Statistics Canada (StatCan) for example: it provides 6,039 reports in 27 domains authored by over 1,000 professionals. Importantly, since both tables and sentences are authored by domain experts, sentences are natural and reflective of real understandings of tables.
To this end, we propose a new dataset, HiTab, for QA and NLG on hierarchical tables. (1) All sentence descriptions of hierarchical tables are carefully extracted and revised by human annotators.
(2) Prior work shows that annotations of fine-grained, lexical-level entity linking significantly help table QA (Lei et al., 2020), motivating us to align entities in text with table cells. In addition to entities, we believe aligning quantities (Ibrahim et al., 2019), especially composite quantities (computed from multiple cells), is also important for table reasoning, so we annotate the underlying numerical relationships between quantities in text and table cells, as Table 1 shows. (3) Since real sentences in statistical reports are natural, diverse, and reflective of real understandings of tables, we devise a process to construct QA pairs based on existing sentence descriptions instead of asking annotators to propose questions from scratch.
HiTab presents a strong challenge to state-of-the-art baselines. For the QA task, MAPO (Liang et al., 2018) only achieves 29.2% accuracy due to the ineffectiveness of its logical form, which is customized for flat tables. To leverage the hierarchy for table reasoning, we devise a hierarchy-aware logical form for table QA, which shows high effectiveness. We propose partially supervised training given annotations of linked mentions and formulas, which helps models largely reduce spurious predictions and achieve 45.1% accuracy. For the NLG task, models also have difficulties understanding deep hierarchies and generating complex analytical texts. We explore controlled generation (Parikh et al., 2020), showing that conditioning on both aligned cells and calculation types helps models generate meaningful texts.

Dataset Construction and Analysis
We design an annotation process with six steps. To handle the annotation complexity, we recruit 18 students or graduates (13 females and 5 males) in computer science, finance, and English majors from top universities, and provide them with comprehensive online training, documents, and Q&A sessions. Annotation took 2,400 working hours in total. We discuss ethical considerations in Section 8.

Hierarchical Table Collection
We select two representative organizations that are rich in statistical reports, Statistics Canada (StatCan) and the National Science Foundation (NSF). Unlike Census, CDC, BLS, and IMF, which only provide PDF reports, where table hierarchies are hard to extract precisely (Schreiber et al., 2017), StatCan and NSF also provide reports in HTML, from which cell information such as text and formats can be extracted precisely using HTML tags.
First, we crawl English HTML statistical reports published in the recent five years from StatCan (1,083 reports in 27 well-categorized domains) and NSF (208 reports from 11 organizations in the science foundation domain), and merge them to obtain a combination of various domains. In addition, ToTTo contains a small proportion (5.03%) of hierarchical tables, so we include them to cover more domains from Wikipedia. To keep the balance between statistical reports and Wikipedia pages, we include 1,851 randomly sampled tables (50% of our dataset) from ToTTo. Next, we transform HTML tables into spreadsheet tables using a preprocessing script. Since spreadsheet formulas are easy to write, execute, and check, the spreadsheet is naturally a great annotation tool for aligning quantities and answering questions. To enable correct formula execution, we normalize quantities in data cells by excluding surrounding superscripts, internal commas, etc. Extremely small or large tables are filtered out (Appendix A.1 gives more details).
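The quantity normalization step above can be sketched as follows. This is a minimal illustration, not the authors' released preprocessing script; the handled cases (trailing footnote superscripts, internal commas, percent signs) are assumptions based on the description:

```python
import re

def normalize_quantity(cell_text):
    """Normalize a data-cell string so spreadsheet formulas can execute on it.

    A hypothetical sketch: strip internal thousands separators, trailing
    footnote superscripts, and percent signs, then parse the number.
    """
    s = cell_text.strip()
    s = s.replace(",", "")  # internal commas: "1,083" -> "1083"
    # trailing Unicode superscripts used as footnote markers
    s = re.sub(r"[\u00b9\u00b2\u00b3\u2070\u2074-\u2079]+$", "", s)
    s = s.rstrip("%* ").strip()  # percent signs and asterisk footnotes
    try:
        return float(s)
    except ValueError:
        return None  # not a numeric cell (e.g., a header string)

print(normalize_quantity("1,083"))  # -> 1083.0
print(normalize_quantity("66.6%"))  # -> 66.6
```

In practice such a function would run over every data cell before writing the table to a spreadsheet, so that formulas like =G23-G24 execute on clean numbers.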

Sentence Extraction and Revision
In this step, annotators manually go through statistical reports and extract sentence descriptions for each table. Sentences consisting of multiple semantically independent sub-sentences are carefully split into multiple ones. Annotators are instructed to eliminate redundancy and ambiguity in sentences through revisions including decontextualization and phrase deletion (Parikh et al., 2020). Fortunately, most sentences in statistical reports are clean and fully supported by table data, so few revisions are needed to get high-quality text.

Entity and Quantity Alignment
In this phase, annotators are instructed to align mentions in text with corresponding cells in tables. Alignment has two parts, entity alignment and quantity alignment, as shown in Table 1. For entity alignment, we record the mappings from entity mentions in text to corresponding cells. Single-cell quantity mentions can be linked in the same way as entity mentions, but composite quantity mentions are calculated from two or more cells through operators like max/sum/div/diff (Table 2). The spreadsheet formula is powerful and easy to use for tabular data calculation, so we use formulas to record the calculation process of composite quantities in text, e.g., '10 points higher' (=G23-G24). Although quantities are often rounded in descriptions, we neglect rounding and refer to the precise quantities in table cells.
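As an illustration of how such alignments might be recorded and verified, the sketch below stores entity and quantity links for one sentence and executes a tiny subset of spreadsheet formulas (subtraction of two cell references only). The schema, cell addresses, and values are hypothetical, not HiTab's released annotation format:

```python
# Hypothetical annotation record for one sentence; cell addresses and
# values are illustrative only.
annotation = {
    "sentence": "the rate was 10 points higher for X than for Y",
    "entity_links": {"X": "A23", "Y": "A24"},
    "quantity_links": {"10 points higher": "=G23-G24"},
}

cells = {"G23": 45.0, "G24": 35.0}  # toy data region

def execute_formula(formula, cells):
    """Evaluate a tiny formula subset: '=REF-REF' (difference) only."""
    assert formula.startswith("=")
    left, right = formula[1:].split("-")
    return cells[left] - cells[right]

result = execute_formula(annotation["quantity_links"]["10 points higher"], cells)
print(result)  # -> 10.0
```

Executing the recorded formula against the table is exactly what makes the annotations checkable: a mismatch between the formula result and the quantity mention signals a labeling error.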

Converting Sentences to QA Pairs
Existing QA datasets instruct annotators to propose questions from scratch, but it's hard to guarantee the meaningfulness and diversity of proposed questions. In HiTab, we simply revise declarative sentences into QA pairs. For each sentence, annotators need to identify a target key part to question about (according to the underlying logic), then convert it to the QA form. All questions are answered by formulas that reflect the numerical inference process. For example, the 'XLOOKUP' operator is frequently used to retrieve the header cells of superlatives, as shown in Table 1. To keep sentences as natural as they are, we do not encourage unnecessary sentence modification during the conversion.
If an annotator finds multiple ways to question a sentence, they choose the one that best reflects its overall meaning.

Regular Inspections and the Final Review
We ask the two most experienced annotators to perform regular inspections and the final review.
(1) In the labeling process, they regularly sample annotations (about 10%) from all annotators to give timely feedback on labeling issues. (2) Finally, they review all annotations and fix labeling errors. To assist the final review, we also write a script to automatically identify spelling and formula issues. To double-check labeling quality before the final review, we study inter-annotator agreement by collecting and comparing annotations from two annotators on 50 randomly sampled tables. Fleiss' Kappa is 0.89 for quantity alignment and 0.82 for entity alignment, both regarded as "almost perfect agreement" (Landis and Koch, 1977), and BLEU-4 after sentence revision is 64.5, which also indicates high agreement. We further show that annotation artifacts are substantially avoided in our dataset in Appendix A.2.
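For reference, Fleiss' Kappa, the agreement statistic reported above, can be computed as follows; the sample data is made up for illustration and is not HiTab's actual annotation data:

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa from an item-by-category count matrix.

    ratings[i][k] = number of annotators assigning item i to category k.
    A textbook implementation to illustrate the agreement statistic.
    """
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    n_cats = len(ratings[0])
    # observed agreement, averaged over items
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in ratings
    ) / n_items
    # expected (chance) agreement from marginal category proportions
    p_e = sum(
        (sum(row[k] for row in ratings) / (n_items * n_raters)) ** 2
        for k in range(n_cats)
    )
    return (p_bar - p_e) / (1 - p_e)

# two annotators label 4 items with one of two categories, always agreeing
perfect = [[2, 0], [0, 2], [2, 0], [0, 2]]
print(fleiss_kappa(perfect))  # -> 1.0
```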

Hierarchy Extraction
We follow existing work (Lim and Ng, 1999; Chen and Cafarella, 2014) and use a tree structure to model hierarchical headers. Since cell formats such as merging, indentation, and bold fonts are commonly used to present hierarchies, we adapt heuristics from prior work to extract top and left hierarchical trees, with high accuracy: of 100 randomly sampled tables in HiTab, 94% are precisely extracted. Figure 8 in the Appendix shows an illustration.

Dataset Statistics and Comparison
Table 3 shows a comprehensive comparison of related datasets. HiTab is not among the largest, but (1) it is the first dataset to study QA and NLG over hierarchical tables (which account for 98.1% of tables in HiTab) in depth; (2) it is annotated with fine-grained entity and quantity alignment; (3) compared with TAT-QA, FinQA, and NumericNLG, which are single-domain, HiTab has wide coverage of domains from statistical reports and Wikipedia, wider even than ToTTo or WTQ, which only involve Wikipedia tables; and (4) the number of real descriptions per table (5.0) in statistical reports (HiTab) is much higher than 1.4 in Wikipedia (ToTTo) and 3.8 in scientific papers, contributing more analytical aspects per table.

Hierarchical Table QA
Table QA is essential for table understanding, document retrieval, ad-hoc search, etc. Hierarchical tables are quite common in these scenarios, such as webpages and reports, while current Table QA tasks and methods focus on simple flat tables.

Problem Statement
Hierarchical Table QA is defined as follows: given a hierarchical table t and a question x in natural language, output the answer y. The question-answer pair should be fully supported by the table. Table QA is usually formulated as a semantic parsing problem (Pasupat and Liang, 2015; Liang et al., 2017), where a parser converts the question into a logical form and an executor executes it to produce the answer. However, existing logical forms for Table QA (Pasupat and Liang, 2015; Liang et al., 2017; Yin et al., 2020) are customized for flat or database tables. The three challenges mentioned in Section 1 (hierarchical indexing, implicit calculation relationships, and implicit semantic relationships) make QA more difficult on hierarchical tables.

Hierarchy-aware Logical Forms
To this end, we propose a hierarchy-aware logical form that exploits table hierarchies to mitigate these challenges. Specifically, we define region as the operating object, and propose two functions for hierarchical region selection.
Definitions Given the tree hierarchies of tables extracted in Section 2.6, we define a header as a header cell (e.g., A7 ("Federal") in Figure 1), and a level as a level in the left/top tree (e.g., A5, A6, A20 are on the same level). Existing logical forms on tables treat rows as operating objects and columns as attributes, and thus cannot perform arithmetic operations on cells in the same row. However, a row in a hierarchical table is not necessarily a subject or record, so operations can be applied to cells in the same row. Motivated by this, we define a region as our operating object, which is a data region in the table indexed by both left and top headers (e.g., B6:C19 is a rectangular region indexed by A6 and B2). The logical form execution process is divided into two phases: region selection and region operation.

Region Selection
We design two functions, (filter_tree h) and (filter_level l), to perform region selection, where h is a header and l is a level. Functions can be applied sequentially: each subsequent function applies to the region returned by the previous one. (filter_tree h) selects a sub-tree region according to a header cell h: if h is a leaf header (e.g., A8), the selected region is the row/column indexed by h (row 8); if h is a non-leaf header (e.g., A7), the selected region is the rows/columns indexed by both h and its children headers (rows 7-16). (filter_level l) selects a sub-tree from the input tree according to a level l and returns the sub-region indexed by the headers on level l. These two functions mitigate the aforementioned three challenges: (1) hierarchical indexing is achieved by applying the two functions sequentially; (2) with filter_level, data with different calculation types (e.g., rows 4-5) will not be co-selected, and thus not incorrectly operated on together; (3) level-wise semantics can be captured by aggregating header cell semantics (e.g., embeddings) on a level. Some logical form execution examples are shown in Appendix C.2.
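A toy sketch of the two region-selection functions, assuming each header node stores the table rows (or columns) it indexes. The class layout and the virtual root are illustrative assumptions, not the paper's actual executor; the example mirrors Figure 1, where "Federal" (A7) indexes rows 7-16 and a leaf such as A8 indexes row 8:

```python
class Header:
    """A node in the left (or top) header tree of a hierarchical table."""
    def __init__(self, name, rows, children=()):
        self.name, self.rows, self.children = name, list(rows), list(children)

def filter_tree(region_rows, header):
    """Select rows indexed by `header` and all of its descendants."""
    rows = set(header.rows)
    for child in header.children:
        rows |= filter_tree(region_rows, child)
    return rows & set(region_rows)

def filter_level(region_rows, root, level):
    """Select rows indexed by headers at depth `level` of the tree."""
    frontier = [root]
    for _ in range(level):
        frontier = [c for h in frontier for c in h.children]
    rows = set()
    for h in frontier:
        rows |= set(h.rows)
    return rows & set(region_rows)

# toy tree: a virtual root over "Federal" (rows 7-16), which has a leaf
# child "Fellowships" (row 8)
fellow = Header("Fellowships", [8])
federal = Header("Federal", range(7, 17), [fellow])
root = Header("Source", range(6, 20), [federal])

print(sorted(filter_tree(range(4, 26), fellow)))   # leaf header: its own row
print(sorted(filter_tree(range(4, 26), federal)))  # non-leaf: rows 7 through 16
```

Sequential application (first filter a level, then a sub-tree, then intersect with a column selection from the top tree) is what gives the compositional, bi-dimensional indexing described above.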
Region Operation Operators are applied on the selected region to produce the answer. We define 19 operators, mostly following MAPO (Liang et al., 2018), and further include some operators (e.g., difference rate) for hierarchical tables. Complete logical form functions are shown in Appendix C.1.

Baselines
We present baselines in two branches: one is logical form-based semantic parsing, and the other is end-to-end table parsing without logical forms. Neural Symbolic Machine (Liang et al., 2017) is a powerful semantic parsing framework consisting of a programmer that generates programs from natural language and saves intermediate results, and a computer that executes programs. We replace the LSTM encoder with BERT (Devlin et al., 2018) and implement a Lisp interpreter for our logical forms as the executor. Tables are linearized by placing headers in level order, as detailed in Appendix C.4. TaPas (Herzig et al., 2020) is a state-of-the-art end-to-end table parsing model that does not generate logical forms. Its power to select cells and reason over tables is gained from pretraining on millions of tables. To fit TaPas's input format, we convert hierarchical tables into flat ones following WTQ (Pasupat and Liang, 2015): we unmerge cells spanning multiple rows/columns in the left/top headers and duplicate their contents into the unmerged cells. The first top header row is specified as the column names.
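The flattening step used to fit hierarchical tables into TaPas can be sketched as below. This is a simplified illustration of unmerging spanning cells with a made-up header grid, not TaPas's own preprocessing code:

```python
def flatten_table(grid, spans):
    """Unmerge spanning header cells by copying contents into covered cells.

    grid:  2-D list of cell strings.
    spans: list of (r1, c1, r2, c2) merged areas whose content lives at the
           top-left cell of the area.
    """
    flat = [row[:] for row in grid]
    for r1, c1, r2, c2 in spans:
        for r in range(r1, r2 + 1):
            for c in range(c1, c2 + 1):
                flat[r][c] = grid[r1][c1]
    return flat

# toy top header: "Master's" spans two columns, "Doctoral" spans two columns
grid = [["", "Master's", "", "Doctoral", ""],
        ["", "Number", "Percent", "Number", "Percent"]]
spans = [(0, 1, 0, 2), (0, 3, 0, 4)]
print(flatten_table(grid, spans)[0])
# -> ['', "Master's", "Master's", 'Doctoral', 'Doctoral']
```

After unmerging, the first header row can be used directly as flat column names, at the cost of losing the explicit hierarchy.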

Weak Supervision
In weak supervision, the model is trained with QA pairs, without golden logical forms. For NSM, we compare three widely studied learning paradigms: MML, RL, and MAPO. Since these methods require consistent programs for learning or warm start, we randomly search 15,000 programs per sample before training; the pruning rules are shown in Appendix C.3. On average, 6.12 consistent programs are found per sample.
For TaPas, we use the pre-trained version and follow its weakly supervised training process on WTQ.

Partial Supervision
Given labeled entity links, quantity links, and calculations (from the formulas), we further explore guiding training in a partially supervised way. These three annotations indicate the selected headers, region, and operators in QA. For NSM, we exploit them to prune spurious programs, i.e., incorrect programs that accidentally produce correct answers, in two ways. (1) When searching for consistent programs, besides producing correct answers, programs are required to satisfy at least two of the constraints. In this way, the average number of consistent programs is reduced from 6.12 to 2.13 per sample. (2) During training, each satisfied constraint adds 0.2 to the original binary 0/1 reward, and sampled programs with reward r ≥ 1.4 are added to the program buffer. For TaPas, we additionally provide answer coordinates and calculation types in training, following its WikiSQL setting.
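The reward shaping for NSM described above can be sketched as follows. The function name is hypothetical, and rounding to one decimal is only an implementation convenience, not part of the paper's method:

```python
def shaped_reward(answer_correct, constraints_satisfied):
    """Partially supervised reward: 1 for a correct answer plus 0.2 per
    satisfied alignment constraint (selected headers, region, operators).

    With the threshold r >= 1.4, a program must both be correct and
    satisfy at least two constraints to enter the program buffer.
    """
    r = round((1.0 if answer_correct else 0.0) + 0.2 * constraints_satisfied, 1)
    keep_in_buffer = r >= 1.4
    return r, keep_in_buffer

print(shaped_reward(True, 2))   # correct + two constraints: kept
print(shaped_reward(True, 1))   # correct but only one constraint: pruned
```

The shaping thus penalizes exactly the spurious programs: those that happen to hit the right answer while selecting the wrong headers, region, or operators.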

Evaluation Metrics
We use Execution Accuracy (EA) as our metric following Pasupat and Liang (2015), measuring the percentage of samples with correct answers. We also report the Spurious Program Rate to study the percentage of incorrect logical forms that produce the correct answer. Since we do not have golden logical forms, we manually annotate logical forms for 150 random samples in the dev set for evaluation.

Implementations
We split the 3,597 tables into train (70%), dev (15%), and test (15%) sets with no overlap. We download pre-trained models from Hugging Face. For NSM, we use 'bert-base-uncased' and fine-tune for 20K steps on HiTab; the beam size is 5 for both training and inference. To test MAPO with its original logical form, we flatten tables as we do for TaPas. For TaPas, we adopt the PyTorch (Paszke et al., 2019) version from Hugging Face, use 'tapas-base', and fine-tune for 40 epochs on HiTab. All experiments are conducted on a server with four V100 GPUs.

Results
Table 4 summarizes our evaluation results.

Weak Supervision
First, MAPO with our hierarchy-aware logical form outperforms MAPO with its original logical form by a large margin of 11.5%, indicating the necessity of designing a logical form that leverages hierarchies. Second, MAPO achieves the best EA (40.7%) with the lowest spurious rate (19%). But over 50% of questions are still answered incorrectly, showing that QA on HiTab is challenging.
Third, though TaPas benefits from pretraining on tables, it performs worse than the best logical form-based method without table pretraining.

Partial Supervision From Table 4, we can conclude the effectiveness of partial supervision in two aspects. First, it improves EA: the model learns how to deal with more cases given high-quality programs. Second, it largely lowers %Spurious: the model learns to generate correct programs instead of relying on tricks. MML, whose performance highly depends on the quality of searched programs, benefits the most (36.7% to 45.1%), indicating that partial supervision improves the quality of consistent programs by pruning spurious ones. However, TaPas does not gain much improvement from partial supervision, which we discuss next.

Error Analysis For TaPas, 98.7% of success cases are cell selections, which means TaPas benefits little from partial supervision. This may be caused by: (1) TaPas does not support some common operators on hierarchical tables, such as difference; (2) its coarse-to-fine cell selection strategy first selects columns and then cells, but on hierarchical tables, cells in different columns may also need to be aggregated together.
For MAPO under partial supervision, we analyze 100 error cases. Error cases fall into four categories: (1) entity missing (23%): the header to filter is not mentioned in question, where a common case is omitted Total; model failure, including (2) failing to select correct regions (38%) and (3) failing to generate correct operations (20%); (4) out of coverage (19%): question types unsolvable with the logical form, which is explained in Appendix C.1.
Spurious programs occur mostly in two patterns. In cell selection, there may exist multiple data cells with the correct answer (e.g., G9 and G16 in Figure 1), while only one is golden. In superlatives, the model can produce the target answer by operating on different regions (e.g., in both region B21:B25 and B23:B25, B23 is the largest).

Level-wise Analysis In Figure 3, we present level-wise accuracy of HiTab QA with MAPO and our hierarchy-aware logical form. Level here stands for the sum of left and top header levels. As shown, QA accuracy degrades as the table level increases and the table structure becomes more complex, except for level = 2, i.e., tables with no hierarchies. The reason level = 2 performs relatively worse might be that only 1.9% of tables in HiTab have no hierarchies. We also present an annotated table example from our dataset in Appendix C.5 to illustrate in detail the challenges that hierarchical tables bring, as mentioned in Section 1.

Hierarchical Table-to-Text
Some recent works propose controlled generation to enable more specific and logical generation: (1) LogicNLG generates a sentence conditioned on a logical form guiding symbolic operations over given cells, but writing correct logical forms as conditions is challenging for common users, who are more experienced at writing natural language directly, restricting its application in real scenarios; (2) ToTTo generates a sentence given a table with a set of highlighted cells. In ToTTo's formulation, the condition of cell selection is much easier to specify than a logical form, but it neglects symbolic operations, which are critical for generating analytical sentences involving numerical reasoning in HiTab. We place HiTab as a middle ground between ToTTo and LogicNLG, making the task more controllable than ToTTo and closer to real applications than LogicNLG.
In our setting, given a table, the model generates a sentence conditioned on a group of selected cells (similar to ToTTo) and operators (much easier to specify than logical forms). Although we use two strong conditions to guide symbolic operations over cells, a considerable amount of content planning is still left to the model, such as retrieving contextual cells in a hierarchical table given selected cells, identifying how operators are applied to given cells, and composing sentences in a faithful and logical manner.
We now define our task as: given a hierarchical table T, highlighted cells C, and specified operators O, the goal is to generate a faithful description y.

With Highlighted Cells
An entity or quantity in text can be supported by table cells if it is directly stated in cell contents or can be logically inferred from them. Different from prior work that only takes data cells as highlighted cells (Parikh et al., 2020), we also take header cells as highlighted cells, which is usually the case for superlative ARG-type operations on a specific header level in hierarchical tables, e.g., "Teaching assistantships" is retrieved by ARGMAX in Figure 1. In our dataset, highlighted cells are extracted from the annotations of entity and quantity alignment.

With Operators
Highlighted cells indicate the target for text generation, but are not sufficient, especially for analytical descriptions involving cell operations in HiTab. So we propose to use operators as an extra control. This contributes to text clarity and meaningfulness in two ways. (1) It clarifies the numerical reasoning intent on cells. For example, given the same set of data cells, applying SUM, AVERAGE, or COUNT conveys different meanings and thus should yield different texts.
(2) Operation results on highlighted cells can be used as additional input sources. Existing seq2seq models are not powerful enough to do arithmetic operations (Thawani et al., 2021), e.g., adding up a group of numbers, and it greatly limits their ability to generate correct numbers in sentences. Explicitly pre-computing the calculation results is a promising alternative way to mitigate this gap in seq2seq models. Operators are extracted from annotations of formulas shown in Table 2.

Sub Table Selection and Serialization
Sub Table Selection Under the controls of selected cells and operators, we devise a heuristic to retrieve all contextual cells as a sub-table.
(1) We start with the highlighted cells extracted from our entity and quantity alignment, then use the extracted table hierarchy to group the selected cells into the top header, the left header, and the data region. (2) Based on the extracted table hierarchy, we use the set of highlighted top and left header cells to include the data cells they index, and use the set of highlighted data cells to include their corresponding header cells.
(3) We also include the parent header cells in the table hierarchy to construct a full set of headers. In the end, we take the union of these sets as the result of sub-table selection.
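The three-step heuristic can be sketched on a toy table as below, where each data cell is indexed by one top and one left header and each header knows its parent. The maps and cell addresses are illustrative (loosely following Figure 1), not HiTab's data format:

```python
# toy index maps: data cell -> its left / top header, header -> its parent
left_of = {"E5": "A5", "G23": "A23"}
top_of = {"E5": "E3", "G23": "G3"}
parent = {"A5": "A4", "E3": "D2", "A23": "A20", "G3": "F2"}

def select_sub_table(highlighted):
    """Expand highlighted cells into a sub-table of contextual cells."""
    selected = set(highlighted)
    # step (2): data cells pull in the header cells that index them
    for cell in list(selected):
        selected |= {m[cell] for m in (left_of, top_of) if cell in m}
    # step (3): headers pull in their ancestors to form full header paths
    changed = True
    while changed:
        changed = False
        for cell in list(selected):
            if cell in parent and parent[cell] not in selected:
                selected.add(parent[cell])
                changed = True
    return selected

print(sorted(select_sub_table({"E5"})))
# -> ['A4', 'A5', 'D2', 'E3', 'E5']
```

Starting from a single highlighted data cell, the result is the cell itself plus the complete left and top header paths needed to interpret it.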
Serialization On each sub-table, we do a row-turn traversal of the linked cells and concatenate their cell strings using [SEP] tokens. Operator tokens and calculation results are also concatenated onto the input sequence. We also experimented with other serialization methods, such as header-data pairing and template-based methods, but none proved superior to simple concatenation. Appendix B.1 gives an illustration.
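The serialization can be sketched as follows; the cell contents, coordinates, and operator token below are illustrative, and the row-major ordering is an assumption about the traversal:

```python
def serialize(sub_table_cells, operators, results):
    """Concatenate cell strings in row-major order with [SEP] tokens, then
    append operator tokens and pre-computed calculation results."""
    # each cell is a (row, col, text) triple; sort by (row, col)
    ordered = [text for (_, _, text) in sorted(sub_table_cells)]
    pieces = ordered + operators + [str(r) for r in results]
    return " [SEP] ".join(pieces)

cells = [(2, 4, "Master's"), (3, 4, "Percent"),
         (5, 4, "66.6"), (5, 0, "Self-support")]
print(serialize(cells, ["DIV"], []))
# -> "Master's [SEP] Percent [SEP] Self-support [SEP] 66.6 [SEP] DIV"
```

Appending pre-computed calculation results to `results` is what spares the seq2seq model from doing arithmetic itself, as discussed above.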

Experiments
We conduct experiments by fine-tuning four state-of-the-art text generation methods on HiTab.
Pointer Generator (See et al., 2017) An LSTM-based seq2seq model with a copy mechanism. While originally designed for text summarization, it is also used in data-to-text (Gehrmann et al., 2018).
BERT-to-BERT (Rothe et al., 2020) A transformer encoder-decoder model (Vaswani et al., 2017) initialized with BERT (Devlin et al., 2018).
BART (Lewis et al., 2019) A pre-trained denoising autoencoder with a standard Transformer-based architecture that shows effectiveness in NLG.
T5 (Raffel et al., 2019) A transformer-based pre-trained model that converts all textual language problems into a text-to-text format and proves to be effective.

Evaluation Metrics
We use two automatic metrics, BLEU and PARENT. BLEU (Papineni et al., 2002) is broadly used to evaluate text generation. PARENT (Dhingra et al., 2019) is proposed specifically for data-to-text evaluation and additionally aligns n-grams from the reference and generated texts to the source table.

Experiment Setup
Samples are split into train (70%), dev (15%), and test (15%) sets just the same as the QA task. The maximum length of input/output sequence is set to 512/64. Implementation details of all baselines are given in Appendix B.2.

Experiment Result and Analysis
As shown in Table 5, first, from an overall point of view, neither metric scores high. This demonstrates the difficulty of HiTab, which could be caused by the hierarchical structure as well as statements with logical and numerical complexity. Second, by comparing the two controlled scenarios (cell highlights only vs. both cell highlights and operators), we see that adding operators to the conditions greatly helps models generate descriptions with higher scores, showing the effectiveness of our augmented conditional generation setting. Third, results in the two controlled scenarios are quite consistent across baselines: replacing the traditional LSTM with transformers shows a large increase, and leveraging seq2seq-style pretraining yields a rise of +6.5 BLEU and +11.3 PARENT. Lastly, between the pretrained transformers, T5 reports higher scores than BART, probably because T5 is more extensively tuned during pre-training.
Further, to study the generation difficulty with respect to table hierarchy, we evaluate samples at different hierarchical depths, i.e., the table's maximum depth in its top and left header trees. For depths 2, 3, and 4+, BLEU scores 31.7, 26.5, and 21.3, and PARENT scores 40.9, 36.5, and 31.6, respectively. The reason could be that, as the header hierarchy grows deeper, data indexing becomes increasingly compositional, making it harder for baseline models to configure entity relationships and compose logical sentences.

Figure 4: A meaningful but challenging case in HiTab.

Table 6: Cross-domain results of our best methods.
QA (Test Accuracy): MAPO w. partial supervision, 32.6
NLG (BLEU / PARENT): T5 w. cell & calculation, 16.9 / 28.8

Related Work
Table-to-Text Existing datasets are restricted to flat tables or specific subjects (Liang et al., 2009; Chen and Mooney, 2008; Wiseman et al., 2017; Novikova et al., 2016; Banik et al., 2013; Lebret et al., 2016; Moosavi et al., 2021). The most related table-to-text dataset to HiTab is ToTTo (Parikh et al., 2020), in which complex tables are also included. There are two main differences between HiTab and ToTTo: (1) in ToTTo, hierarchical tables only account for a small proportion (5%), and there is no indication or usage of table hierarchies; (2) in addition to cell highlights, HiTab conditions on operators that reflect symbolic operations on cells.
Table QA Existing Table QA datasets mainly study database tables (Wang et al., 2015; Yu et al., 2018; Zhong et al., 2017) and flat web tables (Pasupat and Liang, 2015; Sun et al., 2016). Recently, there are some datasets on domain-specific table QA (Chen et al., 2021; Zhu et al., 2021) and joint QA over tables and texts (Chen et al., 2020b; Zhu et al., 2021), but hierarchical tables still have not been studied in depth. CFGNN (Zhang, 2020) and GraSSLM (Zhang et al., 2020) use graph neural networks to encode tables for QA, but their tables are database tables and relational web tables, respectively, without hierarchies. Other work includes some hierarchical tables but focuses only on table search.

Discussion
HiTab also presents cross-domain and complicated-calculation challenges. (1) To explore cross-domain generalizability, we randomly split train/dev/test by domain three times and present the average results of our best methods in Table 6. We find decreases in all metrics for both QA and NLG.
(2) Figure 4 shows a case that challenges existing methods: performing complicated calculations requires to jointly consider quantity relationships, header semantics, and hierarchies.

Conclusion
We present a new dataset, HiTab, that simultaneously supports QA and NLG on hierarchical tables, where tables are collected from statistical reports and Wikipedia in various domains. Importantly, we provide fine-grained annotations on entity and quantity alignment. In experiments, we introduce strong baselines and conduct detailed analyses of the QA and NLG tasks on HiTab. Results suggest that HiTab can serve as a challenging and valuable benchmark for future research on complex tables.

Ethical Considerations
This work presents HiTab, a free and open English dataset for the research community to study table question answering and table-to-text generation over hierarchical tables. Our dataset contains well-processed tables, annotations (QA pairs, target text, and bidirectional mappings between entities and quantities in text and the corresponding cells in tables), recognized table hierarchies, and source code. Data in HiTab are collected from two public organizations, StatCan and NSF. Both allow sharing and redistribution of their public reports, so there is no privacy issue. We collect tables and accompanying descriptive sentences from StatCan and NSF. We also include hierarchical tables in Wikipedia from ToTTo, which is a public dataset under the MIT license, so there is no risk in using it. During labeling, annotators checked whether there were any names that uniquely identify individual people or any offensive content; they found no such sensitive information in our dataset. We recruited 18 students or graduates in computer science, finance, and English majors from top universities (13 female and 5 male). Each annotator was paid $7.8 per hour (above the average local payment for similar jobs), for a total of 2,400 hours. We finally obtain 3,597 tables and 10,672 well-annotated sentences. The data collection was approved by the ethics review board of an anonymous IT company. The details of our data collection and dataset characteristics are introduced in Section 2.

A.1 Table Filtering

We filter tables using these constraints: (1) the numbers of rows and columns are more than 2 and less than 64; (2) cell strings have no more than one non-ASCII character and at most 20 tokens; (3) hierarchies are successfully parsed via the method in Section 2.6; (4) hierarchies have no more than four levels on one side. Finally, 85% of tables meet all constraints.

A.2 Annotation Artifacts
Annotation artifacts are common in large-scale NLP datasets and may introduce unwanted statistical correlations that make the task easier (Gururangan et al., 2018). In HiTab, annotation artifacts may come from homogeneous question patterns. To address this issue, we ask annotators to revise questions from high-quality descriptions in statistical reports across 28 domains to guarantee diversity and naturalness, and encourage them to choose the best way to raise a question reflecting the overall meaning of the description. To further check whether and where artifacts may exist in our dataset, we conduct two experiments on QA and additionally count the ratio of answers occurring in questions: • Use the table as the only input, without the question, to see if there is a potential pattern between table and answer. We train BERT+MAPO for 10,000 steps and TaPas for 10 epochs. Neither method converges under this setting, reaching only 4.0% and 2.6% accuracy on the test set.
The poor performance indicates that models cannot learn the answers by exploiting artifacts between table and answer, and thus must learn to jointly reason over the question and the table.
• Randomly shuffle the rows and columns of each table. Experiments show similar performance (±1%) between our original tables and the shuffled tables. This shows that the correlation between answers and cell positions is negligible, so models cannot use specific positions, e.g., the cell at the first row and first column, as a shortcut prediction.
• The ratio of answers occurring in questions is only 5.3%, so a model that only learns to retrieve from the question cannot achieve high performance.
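The shuffling and answer-in-question checks above can be sketched as follows; this is an illustrative reimplementation, not the authors' released code, and the table is assumed to be a simple list of rows.

```python
import random

def shuffle_table(rows, seed=0):
    """Randomly permute data rows and columns to test whether models
    rely on absolute cell positions as a shortcut."""
    rng = random.Random(seed)
    rows = [list(r) for r in rows]
    rng.shuffle(rows)                      # permute rows
    cols = list(range(len(rows[0])))
    rng.shuffle(cols)                      # permute columns
    return [[r[c] for c in cols] for r in rows]

def answer_in_question_ratio(pairs):
    """Fraction of QA pairs whose answer string appears verbatim
    in the question (a potential retrieval shortcut)."""
    hits = sum(1 for q, a in pairs if str(a).lower() in q.lower())
    return hits / len(pairs)
```

Because the same column permutation is applied to every row, shuffling preserves cell contents and row integrity while destroying absolute positions.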

A.3 Domain Distribution
The sample distribution across all 29 domains of HiTab is shown in Figure 5.

A.4 Annotation Interface
The annotation interface is shown in Figure 6. Since spreadsheet formulas are easy to write, execute, and check, the spreadsheet is naturally a great annotation tool. Annotators can conveniently use Excel formulas for cell linking and calculation in entity alignment and in answering questions.

B Hierarchical Table-to-Text

B.1 Illustration on Controlled Generation in Hierarchical Table-to-Text

Please find the illustration in Figure 7.
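The annotation formulas mentioned in Section A.4 (e.g., an answer written as `=B4-C4`) implicitly link text to table cells. A minimal, hypothetical sketch of recovering those links from a formula string, assuming standard A1-style Excel references:

```python
import re

# Matches A1-style references, with or without absolute '$' markers.
CELL_REF = re.compile(r"\$?([A-Z]{1,3})\$?([0-9]+)")

def col_to_index(col):
    """Convert an Excel column label (A, B, ..., AA) to a 0-based index."""
    idx = 0
    for ch in col:
        idx = idx * 26 + (ord(ch) - ord("A") + 1)
    return idx - 1

def linked_cells(formula):
    """Extract 0-based (row, col) coordinates referenced by a formula,
    e.g. '=B4-C4' links the annotation to cells (3, 1) and (3, 2)."""
    return [(int(r) - 1, col_to_index(c)) for c, r in CELL_REF.findall(formula)]
```

Because the formula both names the linked cells and encodes the calculation, annotations like this yield the quantity alignment and the arithmetic relationship at once.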

B.2 Baseline Implementation Details
We perform optimized tuning for baselines using the following settings. Pointer Generator (See et al., 2017) An LSTM-based seq2seq model with a copy mechanism. The model uses a two-layer bi-directional LSTM encoder with 300-dim word embeddings and 300 hidden units. We fine-tune with batch size 2, learning rate 0.05, and beam size 5. BERT-to-BERT (Rothe et al., 2020) A transformer encoder-decoder model (Vaswani et al., 2017) where the encoder and decoder are both initialized with BERT (Devlin et al., 2018) by loading the 'bert-base-uncased' checkpoint provided by the huggingface/transformers repository. We fine-tune with batch size 2 and learning rate 3e-5.

Example target text (from Figure 7): For doctoral students, the proportion of support from research assistantships is 10 points higher than that from teaching assistantships.

We use a beam size of 5 to search decoded outputs (sequence lengths range from 8 to 60 tokens).

C Hierarchical Table QA

C.1 Logical Form Function List

We list our logical form functions in Table 7.
Union selection is required for comparative and arithmetic operations. It is achieved by allowing a variable number of headers in filter_tree, where the number is one or two in practice.
In our implementation, a function by default takes the region selected by the previous function as its input region, which prunes the search space. We use grammars to filter left headers before top headers, and a filter_level call is necessary after filtering one direction of the tree even when only the leaf level is available (Figure 8; LEFT 1 denotes the first level of the left header tree). We also deactivate order-relation functions (e.g., the eq function) and the order argument k in argmax/argmin, because questions of these types are rare and activating them would largely increase the number of spurious programs during search.
After deactivation, the logical form coverage is 78.3% within 300 iterations of random exploration. Some typical question types that cannot be covered are: (1) scale conversion, e.g., 0.984 to 98.4%; (2) operating on data indexed by different levels of headers, e.g., proportion of total; (3) complex composite operations, e.g., Figure 4.
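A minimal sketch of the filter_tree/filter_level semantics described above, assuming a precomputed map from header nodes to the data rows they index and a map from rows to their depth in the left header tree (all names hypothetical):

```python
def filter_tree(region, header_to_rows, headers):
    """Narrow the selected row region to rows indexed under the given
    header node(s) of the left header tree. Allowing more than one
    header (a union) enables comparative and arithmetic operations."""
    selected = set()
    for h in headers:
        selected |= header_to_rows[h]
    return region & selected

def filter_level(region, row_level, level):
    """Keep only rows whose header sits at the given tree depth,
    e.g. leaf rows versus aggregate rows."""
    return {r for r in region if row_level[r] == level}
```

Chaining these functions, each consuming the previous selection, mirrors how the logical form narrows a hierarchical table step by step.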

C.2 Examples of Logical Form Execution
Taking the table in Figure 8 as input, we demonstrate three question types with complete logical forms in Table 8.

C.3 Pruning Rules in Searching
We use trigger words and POS tags to gate some functions during random exploration, inspired by Zhang et al. (2017) and Liang et al. (2018). A function may be selected only when its triggers appear in the question. Triggers are listed in Table 9.
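The trigger-based gating can be sketched as follows; the trigger lists here are illustrative stand-ins for the full inventory in Table 9:

```python
TRIGGERS = {
    # illustrative trigger words only; see Table 9 for the actual lists
    "argmax": ["most", "highest", "largest", "top"],
    "argmin": ["least", "lowest", "smallest"],
    "difference": ["difference", "more than", "gap"],
    "sum": ["total", "combined", "sum"],
}

def allowed_functions(question):
    """A function may be sampled during random exploration only if one
    of its trigger words appears in the question, pruning the space of
    candidate programs."""
    q = question.lower()
    return {f for f, words in TRIGGERS.items() if any(w in q for w in words)}
```

Gating functions this way reduces spurious programs: an argmax can only be explored when the question actually signals a superlative.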

C.4 Table Linearization
We linearize the question and table according to Figure 8.
The input is the concatenation of the question and the table. A special token stands for level zero of the left header tree. Each header is linearized as name | type, where name is the tokenized header string and type is the entity type parsed by Stanford CoreNLP ("string", "number", or "datetime" in our case). Headers with the same name gather token embeddings by mean pooling.
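A minimal sketch of this linearization; the `[SEP]` delimiter and the ordering of top before left headers are assumptions for illustration, not the paper's exact format:

```python
def linearize(question, top_headers, left_headers):
    """Concatenate the question with table headers, each rendered as
    'name | type' as described above. Headers are (name, type) pairs,
    with type one of 'string', 'number', or 'datetime'."""
    parts = [question]
    for name, typ in top_headers + left_headers:
        parts.append(f"{name} | {typ}")
    return " [SEP] ".join(parts)
```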

C.5 Illustration on Challenges in Hierarchical Table
We present an annotated example in Figure 9 to show the challenges of hierarchical tables introduced in Section 1.
To precisely answer the question in the figure, the model first needs to hierarchically index the grey region with "field in science" and "doctoral", which requires understanding both the textual and the spatial semantics of the hierarchical table, since the textual headers relate to the region spatially (as a tree). Second, from the phrase "most enrolled", it should further index "All" (column G) rather than "Percent" (column H) and infer an argmax operation, which calls for the ability to distinguish between different calculation relationships.
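The two steps above, indexing the region under the queried header path and then applying argmax over the correct value column, can be sketched as follows (a toy illustration; all names are hypothetical):

```python
def answer_most_enrolled(table, row_headers, target_rows, value_col):
    """Given the rows already indexed under the queried header path
    (e.g. 'field in science' -> 'doctoral'), answer a 'most enrolled'
    question by taking argmax over the count column ('All'), not the
    ratio column ('Percent')."""
    best = max(target_rows, key=lambda r: table[r][value_col])
    return row_headers[best]
```

The key design point mirrored here is that header indexing and operator inference are separate decisions: picking the wrong value column (a ratio instead of a count) yields a well-formed but wrong program.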