Table-To-Text generation and pre-training with TabT5

Encoder-only transformer models have been successfully applied to different table understanding tasks, as in TAPAS (Herzig et al., 2020). A major limitation of these architectures is that they are constrained to classification-like tasks such as cell selection or entailment detection. We present TABT5, an encoder-decoder model that generates natural language text based on tables and textual inputs. TABT5 overcomes the encoder-only limitation by incorporating a decoder component and leverages the input structure with table-specific embeddings and pre-training. TABT5 achieves new state-of-the-art results on several domains, including spreadsheet formula prediction with a 15% increase in sequence accuracy, QA with a 2.5% increase in sequence accuracy, and data-to-text generation with a 2.5% increase in BLEU.


Introduction
Large language models (LLMs) such as BERT (Devlin et al., 2019) or T5 (Raffel et al., 2020) have shown impressive abilities to encode and generate fluent and coherent natural language text (Lan et al., 2020; Gururangan et al., 2020; Conneau et al., 2020). However, their representation and generation capabilities are limited when it comes to structured or semi-structured domains like tables. This is mainly due to two reasons: (i) LLMs are only pre-trained on large amounts of unstructured data (e.g., documents, news); (ii) their underlying model architecture lacks a way to fully leverage this structural information.
Yet, structured and semi-structured data is ubiquitous on the web (e.g., web tables, database tables, PDF tables, and spreadsheets store rich numerical information and provide concise summaries of data), and is widely studied in academia (Chen et al., 2021a; Cheng et al., 2022; Parikh et al., 2020; Wang et al., 2021) and industry (e.g., formula prediction in Excel and Google Sheets, or extracting data from tables in Text-to-Speech Assistants).
Recently, several solutions propose to alleviate the aforementioned issue by introducing pre-training or intermediate training strategies for tables. For instance, Herzig et al. (2020) propose to use a Masked Language Model (MLM) as pre-training objective to improve the contextual representation of BERT (Devlin et al., 2019) over table inputs. To train their model, they introduce additional input embeddings that help the model understand the table structure. These pre-trained models are designed and evaluated on datasets where the answers contain only table cells or aggregations of multiple cells, and not full sentences. In this paper, we tackle a set of distinct, complex tasks such as question answering and formula prediction that require full generation capabilities. In particular, our contributions are as follows:
• We present an encoder-decoder based model TABT5 (Table-and-Text-to-Text Transfer Transformer) that can be applied to data-to-text generation tasks by relying on special embeddings of the input structure.
• We introduce different pre-training strategies that leverage web data containing tables.
• We evaluate our approach on four different table and text datasets in English and obtain state-of-the-art performance on several domains.

Problem Definition
The objective of our model is to learn a conditional sequence generator $P(y|x)$ where $x$ is endowed with extra two-dimensional structure. To encode this structure, each instance of $x$ is represented as a variable-length sequence of tuples $(u_i, t_i, c_i, r_i)_{i=1}^{N}$ representing the components of $x$. In each component, $u_i$ is a natural language utterance, $t_i$ is the discrete type of the component (e.g., Question, Document Title, Table Caption, Table Header, or Table Cell), and $c_i$ and $r_i$ represent the two-dimensional column and row coordinates of this component. This approach is general enough to represent the information layout in web documents, and in particular in tables, where each table cell and each piece of metadata maps to a single component.

[Figure 1 panels: Original Input Table; Token Embeddings; TabT5; Denoising Target. Random cells are replaced by a sentinel <X>.]

Related Work
(Parikh et al., 2020; Wang et al., 2021; Iida et al., 2021; Eisenschlos et al., 2021; Herzig et al., 2020). Another design choice is to use structural positional encoding, in addition to 1-D encoding, to represent two-dimensional information such as the row and column positions of tokens (Herzig et al., 2020; Eisenschlos et al., 2021; Iida et al., 2021; Wang et al., 2021). An alternative is the use of structure-aware attention, in contrast to a standard self-attention mechanism, to better leverage the table structure (Mueller et al., 2019; Yang et al., 2022). All of these models are encoder-only. Concurrent with our work, Shi et al. (2022) propose a similar method to adapt T5 to tabular data; however, their pretraining approach relies on existing annotated datasets and focuses solely on QA applications.

Table Pre-training. Most pretraining methods follow the Masked Language Modeling (MLM) scheme, where some percentage of input tokens are randomly masked and then predicted in an encoder-only setup (Herzig et al., 2020; Eisenschlos et al., 2021; Yang et al., 2022). Some approaches (Wang et al., 2021; Yin et al., 2020; Iida et al., 2021) apply the masking at the cell level, where the full content of a given cell is masked and then predicted. Our work differs in training the encoder and decoder jointly by using a denoising scheme similar to the one used in T5 (Raffel et al., 2020).

Table QA. Given an input table, the task consists of producing an answer to a natural language question. We focus on WIKISQL (Zhong et al., 2017), and learn an encoder-decoder model with row/column embeddings in the weakly supervised setting without logical forms. Herzig et al. (2020) use a similar approach with a BERT encoder-only model, while Liu et al. (2022) use a BART encoder-decoder model without extra embeddings.

Formula prediction. The task is to predict a formula conditioned on headers and other contextual information, without an explicit natural language question. Chen et al. (2021a) propose to use a BERT-based architecture to compute an input header and a cell data vector that are fed to a two-step LSTM decoder. The decoder proposes a formula sketch and refines it with cell ranges. Cheng et al. (2022) propose a similar approach where the representation of the target cell output by a table encoder (Wang et al., 2021) is an input to a two-step LSTM-based decoder. Our approach is simpler, as a single model is used to solve the task end to end.

Data-to-Text. The task consists of generating a natural language description given structured data input. Parikh et al. (2020) employ an encoder-decoder model where the encoder and decoder are both initialized with BERT (Devlin et al., 2019). Kale and Rastogi (2020) use a T5 model. In both, tables are linearized with row/column separator tokens. Our work differs as we use row/column embeddings, and we employ two pretraining schemes.

TABT5 Model
TABT5 uses the T5 pre-trained language model as a baseline architecture. We linearize the table into a sequence of words, split words into word pieces (tokens) and concatenate the question and table tokens to create the input sequence. We include in the model row and column embeddings to encode table structure (Herzig et al., 2020). We add them on top of the token embeddings as model inputs and optimize them during training (Figure 1). The target sequence is a free-form answer. This can be an answer to a question for question-answering tasks, a table summary when no question is specified, or a formula for the formula prediction tasks.
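
The linearization described above can be illustrated with a minimal sketch. It is not the authors' implementation: the `Component` class, the function names, the whitespace split standing in for word pieces, and the convention that index 0 marks text outside the table (and row 0 marks headers) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Component:
    """One component (u_i, t_i, c_i, r_i) of the input (hypothetical container)."""
    text: str    # u_i: natural language utterance
    kind: str    # t_i: e.g. "question", "header", "cell"
    column: int  # c_i: 0 for text outside the table (assumed convention)
    row: int     # r_i: 0 for headers and text outside the table (assumed convention)


def table_to_components(question: str, header: List[str],
                        rows: List[List[str]]) -> List[Component]:
    """Linearize row by row: question first, then headers, then cells."""
    components = [Component(question, "question", 0, 0)]
    for c, name in enumerate(header, start=1):
        components.append(Component(name, "header", c, 0))
    for r, row in enumerate(rows, start=1):
        for c, value in enumerate(row, start=1):
            components.append(Component(value, "cell", c, r))
    return components


def to_model_inputs(components: List[Component]) -> Tuple[List[str], List[int], List[int]]:
    """Split each component into tokens; every token inherits its component's ids."""
    tokens, column_ids, row_ids = [], [], []
    for comp in components:
        # A whitespace split stands in for the word-piece tokenizer.
        for token in comp.text.split():
            tokens.append(token)
            column_ids.append(comp.column)
            row_ids.append(comp.row)
    return tokens, column_ids, row_ids
```

In a model, the two id sequences would index learned row and column embedding tables whose vectors are added to the token embeddings at each position.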

Pre-training
As a starting point, we use publicly available T5 checkpoints released by Raffel et al. (2020). Next, we pre-train TABT5 on Wikipedia tables. We use the pre-training dataset proposed by Herzig et al. (2020), which contains 6.2M tables (3.3M of class Infobox and 2.9M of class WikiTable). We also extract related passages that caption the table. We define two pre-training strategies described below.

Denoising
We design a denoising strategy for table-like data, following the method used in T5 (Raffel et al., 2020), by training the model to predict a target sequence containing the missing or corrupted tokens in the input table. The target consists of all of the dropped-out spans of tokens, delimited by the sentinel token used in the input sequence (Figure 1). We replace 15% of table cells and columns in the input with a mask token. This helps the model capture relationships between the neighbouring cells and between the related text.
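
A minimal sketch of the cell-level part of this scheme follows. It makes several assumptions for illustration: only individual cells are masked (not whole columns), a single `<X>` marker is reused for every masked cell as Figure 1 suggests (T5 itself uses numbered sentinels), and the function name `mask_cells` and its parameters are hypothetical.

```python
import random
from typing import List, Tuple

SENTINEL = "<X>"  # assumed single marker, following the Figure 1 caption


def mask_cells(rows: List[List[str]], rate: float = 0.15,
               seed: int = 0) -> Tuple[List[List[str]], str]:
    """Replace a random subset of cells by SENTINEL and build the decoder target."""
    rng = random.Random(seed)
    positions = [(r, c) for r, row in enumerate(rows) for c in range(len(row))]
    n_masked = max(1, round(rate * len(positions)))
    # Sorting keeps the target in the same row-by-row order as the input.
    masked = sorted(rng.sample(positions, n_masked))
    corrupted = [list(row) for row in rows]  # copy so the input table is untouched
    target_parts = []
    for r, c in masked:
        target_parts.append(f"{SENTINEL} {rows[r][c]}")
        corrupted[r][c] = SENTINEL
    return corrupted, " ".join(target_parts)
```

The corrupted table is then linearized as usual and fed to the encoder, while the returned string serves as the decoder target.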

ToTTification
We define another pre-training strategy using the same Wikipedia tables (Section 5.1), inspired by ToTTo (Parikh et al., 2020), to be used after denoising. For each table, we retrieve the statements that are in the same page as the table or link to the table page. We only keep statements that have an entity (Wikipedia URL, number or date) that matches the table, 4M in total. These statements become our target text. We add the matching entities in those statements as a (comma-separated) plain text component of the input to guide the generation.
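
The statement filter can be sketched as follows. This is a simplified illustration rather than the authors' pipeline: it matches only numbers (not Wikipedia URLs or dates) by exact string comparison with cell values, and the names `matching_entities` and `make_example` are hypothetical.

```python
import re
from typing import List, Optional, Tuple

NUMBER = re.compile(r"\d+(?:\.\d+)?")  # integers and simple decimals only


def matching_entities(statement: str, rows: List[List[str]]) -> List[str]:
    """Return numbers in the statement that also occur as a table cell value."""
    cell_values = {cell.strip() for row in rows for cell in row}
    found: List[str] = []
    for match in NUMBER.findall(statement):
        if match in cell_values and match not in found:
            found.append(match)
    return found


def make_example(statement: str, rows: List[List[str]]) -> Optional[Tuple[str, str]]:
    """Build (comma-separated entity hint, target text), or None if nothing matches."""
    entities = matching_entities(statement, rows)
    if not entities:
        return None  # statements without a matching entity are discarded
    return ", ".join(entities), statement
```

The entity hint would be added to the input as an extra plain-text component next to the linearized table, and the statement becomes the generation target.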

Experiments
In this section, we discuss the experiments we performed to show the effectiveness of our method.

Datasets
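WIKISQL (Zhong et al., 2017) is a Table-QA dataset containing 80,654 instances. To create the dataset, crowd workers paraphrase a template-based question into natural language. Two other groups of crowd workers then verify and correct the quality of the proposed paraphrases. We follow the approach of Herzig et al. (2020) and generate the reference answer from the reference SQL provided, using our own SQL implementation.

ENRON (Chen et al., 2021a) is a dataset for evaluating the formula prediction task. It contains over 17K spreadsheets extracted from the Enron email corpus, with 218,798 instances. It focuses on formulas with referenced cells in a rectangular neighbourhood region of the target cell and the headers. We preprocess the data as described in Appendix C.

TOTTO (Parikh et al., 2020) is a Table-to-Text dataset containing 120,761 instances. It consists of tables paired with table-grounded sentences as natural language descriptions. Parikh et al. apply several heuristics to sample tables and candidate sentences from Wikipedia pages. They use crowd worker annotators to highlight the corresponding table cells and revise natural language descriptions.

FINQA (Chen et al., 2021b) is a dataset containing 8,281 financial Table-QA pairs, along with their numerical reasoning processes in the form of a sequence of mathematical operations.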
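
Deriving a reference answer by executing the reference SQL can be sketched as below. The paper uses its own SQL implementation; this sketch swaps in Python's standard `sqlite3` module instead, and the table name `t` and the function `reference_answer` are hypothetical.

```python
import sqlite3
from typing import List


def reference_answer(header: List[str], rows: List[list], sql: str) -> List[str]:
    """Load the table into an in-memory SQLite database and run a one-column query."""
    conn = sqlite3.connect(":memory:")
    try:
        # Quote column names so headers with spaces still work in this sketch.
        columns = ", ".join(f'"{name}"' for name in header)
        conn.execute(f"CREATE TABLE t ({columns})")
        placeholders = ", ".join("?" for _ in header)
        conn.executemany(f"INSERT INTO t VALUES ({placeholders})", rows)
        # Each result row is a 1-tuple; the answer is the list of its values.
        return [str(value) for (value,) in conn.execute(sql)]
    finally:
        conn.close()
```

The returned strings play the role of the weak supervision signal: the model sees the question and the table and is trained to generate the answer text, never the SQL.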

Results
We discuss the experimental setup in Appendix B. For TOTTO, we report the results in Table 1. We follow Parikh et al.'s official script to compute BLEU and PARENT as the evaluation metrics. The Non-Overlap dev set features examples that are out-of-domain from the training set. For the test set, we provide results from one run, as this is a laborious manual process requiring a submission of test files to an external source. Note that Parikh et al. do not provide development set results in their paper and Kale and Rastogi do not provide test set results for the base model. We observe that TABT5 outperforms SOTA models and its performance is improved further by using the TOTTIFY pre-training. Note that the base model performs slightly better than the large model. We

For FINQA, we included the top 5 retrieved passages as part of the input for TABT5. Additionally, we implemented a special tokenization scheme breaking all numbers into single tokens, following Chowdhery et al. (2022). We report the results in Table 4. The dataset was included as part of the SUKI workshop (Chen et al., 2022), where TABT5 reached third place.

Ablation. We perform the ablation study only on the base variant due to computational costs. For each experiment, we report two ablation runs: (i) -DENOISING indicates that we remove the denoising pre-training, and (ii) -EMBEDDINGS indicates runs without row and column embeddings. We observe that the performance of TABT5 deteriorates when removing either denoising or embeddings. This shows they are crucial for all tasks. We also show results for ToTTification, which improves the performance for TOTTO. This pre-training method was defined to imitate the TOTTO task. Thus, it is not surprising that it improves the performance on that task. However, we observe a decrease in performance for other tasks when using this method of pre-training compared to the denoising method.
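
The number tokenization used for FINQA can be illustrated with a small sketch. It assumes that "breaking all numbers into single tokens" means splitting every digit run into individual digits; the function name `split_digits` and the whitespace pre-tokenization are hypothetical simplifications of a real subword tokenizer.

```python
import re
from typing import List


def split_digits(text: str) -> List[str]:
    """Whitespace-tokenize, then emit every digit as its own piece."""
    pieces: List[str] = []
    for word in text.split():
        # Non-digit chunks stay whole; each digit becomes a separate piece.
        for chunk in re.findall(r"\d|[^\d]+", word):
            pieces.append(chunk)
    return pieces
```

For example, `split_digits("revenue was 1,250.5 million")` yields the pieces `revenue`, `was`, `1`, `,`, `2`, `5`, `0`, `.`, `5`, `million`, so that numbers of any length share the same small digit vocabulary.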
Error analysis. We manually annotate a random sample of 80 errors made by TABT5 on the TOTTO dataset, and classify them in Figure 2.
We find that 35% of the TABT5 outputs are identical to T5's outputs, that 55% are correct (paraphrases), and that 72.5% overall are acceptable (correct content with some details missing). We classify the remaining errors into grammatical errors, hallucinations and wrong answers (see Appendix D for the error definitions and examples).
The grammatical errors mostly involve wrongly inserting determiners in front of named entities (e.g., residing in *the* Watauga County). Similarly, about 50% of hallucination cases are not severe: they are fluent and convey a similar meaning to the ground truth, but are factually incorrect because the wrong entity is used (e.g., Washington Wizards instead of Washington Huskies, even when such entities are not in the input). The other 50% of cases are far from the ground truth meaning. We speculate that most of the less severe hallucination cases are directly connected to the table denoising pretraining scheme employed. That is, the model is biased towards generating related named entities even when those are not present in the input (i.e., masked during pretraining).
In general, we see that TABT5 is able to produce more fluent sentences than the baselines, either removing superfluous information or using the correct verbs in context (e.g., served as speaker instead of was a speaker).
The results suggest a need for better metrics for data-to-text generation tasks that capture these similarities.

Conclusions and Discussion
We introduced TABT5, a new T5-based encoder-decoder model that improves the encoding ability of tables for pre-trained sequence-to-sequence language models. TABT5 achieves new SOTA results on spreadsheet formula prediction, question answering with text or complex mathematical formulas, and data-to-text generation. This work opens up different paths for future work. We plan to explore different datasets for pre-training. Raffel et al. show that, for unstructured data, it is beneficial to train on datasets bigger than Wikipedia. Thus, we plan to use larger and task-specific datasets for pre-training (e.g., tables scraped from the Web or from sheets). Finally, we want to extend this work to multiple languages, especially low-resource ones.

Limitations & Ethical Considerations
As with existing work on generative architectures based on large language models, there are potential risks and harms associated with using the output for downstream applications (Bender and Koller, 2020; Brown et al., 2020). Beyond the original pre-trained checkpoint from T5, we also used tables from Wikipedia for intermediate pre-training, which may contain additional undesirable biases.

A Hyperparameters Selection
We run denoising for 1M steps and ToTTify pre-training for 100k steps on top of denoising. We set each fine-tuning task for 50k training steps. We run the evaluation on ENRON and WIKISQL using the default T5 hyper-parameters with an input sequence length of 1024 and an output sequence length of 256. For the TOTTO dataset, we follow the approach of Kale and Rastogi (2020) and keep the learning rate constant and equal to $1 \times 10^{-4}$; the input and output sequence lengths are 512 and 256 respectively, and the batch size is 256. Additionally, we observe that TABT5 in the small and base variants overfits quickly. Thus, we increase the dropout rate to 0.2 when using pretraining.

B Experimental Setup
We apply the standard T5 tokenizer and start pre-training from publicly available T5 checkpoints. Row and column embeddings are randomly initialized. We run pre-training and fine-tuning on a setup of 16 Cloud TPU v3 cores with a maximum sequence length of 1024. Pre-training takes around 3, 8 and 13 days for the small, base and large models, respectively. Fine-tuning takes around 2-3 hours for each task. For each dataset, we run five independent runs and report the median and standard deviation.
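
The reporting convention of median and standard deviation over five runs can be sketched as below. The function name `summarize_runs` is hypothetical, and the use of the sample standard deviation (dividing by n - 1) is an assumption, since the text does not say which estimator is used.

```python
import statistics
from typing import Sequence


def summarize_runs(scores: Sequence[float]) -> str:
    """Format the scores of independent runs as 'median ± standard deviation'."""
    median = statistics.median(scores)
    spread = statistics.stdev(scores)  # sample standard deviation (assumed choice)
    return f"{median:.2f} ± {spread:.2f}"
```

With five made-up scores such as `[95.4, 95.6, 95.5, 95.7, 95.3]`, this produces a string in the same format as the entries of the result tables.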

C ENRON results.
In this section, we present the results on the ENRON dataset that contains all original data (i.e., all formulas in the tables). We find these results interesting as ENRON contains real data collected by the company. Thus, we believe this scenario is realistic. We present the results in Table 5. We observe that our results are extremely high because in the ENRON dataset over 70% of tables contain a target formula in the input table. Following the previous approaches, we make the task harder. In the experimental section of the paper, we preprocess the dataset by removing all formulas from the input table cells. Additionally, we remove examples containing (i) erroneous formulas, and (ii) ranges from different tables in both input tables and target formulas.

Model          Top-1
T5-base        93.05 ± 0.98
TABT5-small    95.39 ± 0.17
TABT5-base     95.59 ± 0.08
+TOTTIFY       95.50 ± 0.05
-DENOISING     93.92 ± 0.16
-EMBEDDINGS    95.00 ± 0.16

Table 5: Formula prediction results on ENRON. In this experiment, the model has to produce the target formula having access to the formulas used in the surrounding cells. Results are higher with respect to Table 3, as the model is allowed to "copy" already used formulae or parts of them.
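
The preprocessing described in this appendix can be sketched as follows. The heuristics are assumptions made for illustration only: a formula is any cell starting with `=`, an erroneous formula is one containing a spreadsheet error value, and a range from a different table is detected through a `Sheet!A1`-style reference. The names `ERROR_MARKERS`, `is_valid_formula` and `preprocess` are hypothetical.

```python
from typing import List, Optional, Tuple

ERROR_MARKERS = ("#REF!", "#NAME?", "#VALUE!", "#DIV/0!")  # assumed error values


def is_valid_formula(formula: str) -> bool:
    """Heuristic check for a well-formed formula that stays within one table."""
    if not formula.startswith("="):
        return False
    if any(marker in formula for marker in ERROR_MARKERS):
        return False
    # "Sheet2!A1" style references point outside the current table.
    return "!" not in formula


def preprocess(rows: List[List[str]],
               target: str) -> Optional[Tuple[List[List[str]], str]]:
    """Return (formula-free input table, target), or None if the example is dropped."""
    input_formulas = [cell for row in rows for cell in row if cell.startswith("=")]
    if not all(is_valid_formula(f) for f in input_formulas + [target]):
        return None
    # Blank out formulas so the model cannot copy them from neighbouring cells.
    stripped = [["" if cell.startswith("=") else cell for cell in row] for row in rows]
    return stripped, target
```

Skipping the blanking step corresponds to the easier setting reported in Table 5, where formulas in surrounding cells remain visible to the model.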

D Error analysis
We manually annotate 80 errors made by TABT5 on the TOTTO dataset, and classify them according to the definitions in Table 6.

Figure 1: TABT5 linearizes the input table row by row and adds column and row embeddings to encode the 2-dimensional coordinates of each cell. The model is pretrained by randomly replacing 15% of cells by a <X> marker and training the decoder to predict the hidden output in sequence.

Figure 2: We manually annotate 80 errors made by TABT5. We find that 55% of predictions are paraphrases and 72.5% are acceptable. The classification of error types is given in Table 6.

• We introduce different pre-training strategies that leverage web data containing tables.
• We evaluate our approach on four different table and text datasets in English and obtain state-of-the-art performance on several domains.
is a Table-to-Text dataset containing 120,761 instances. It consists of tables paired with table-grounded sentences as natural language descriptions. Parikh et al. apply several heuristics to sample tables and candidate sentences from Wikipedia pages. They use crowd worker annotators to highlight the corresponding table cells and revise natural language descriptions. FINQA (Chen et al., 2021b) is a dataset containing 8,281 financial Table

Table 1: Text generation results for TOTTO on development (dev) and test sets. The Non-Overlap set features examples that are out-of-domain from the training set. TABT5 provides improvements over existing approaches and TOTTIFY pretraining provides additional gains.

Table 2: Table-QA results on WIKISQL in the weakly supervised setting without logical forms. TABT5 provides gains over existing approaches even in a small model variant. The large model gives the best results.

Table 3: Formula prediction results on ENRON. The T5-base baseline brings substantial improvements over existing approaches. TABT5 provides further gains, with the large model variant obtaining the best results.

Table 4: Results on the FINQA challenge, with an ensemble over 5 model execution outputs. TABT5 brings substantial improvements over the baselines.