Towards Table-to-Text Generation with Numerical Reasoning

Recent neural text generation models have shown significant improvement in generating descriptive text from structured data such as tables. One of the remaining important challenges is generating more analytical descriptions that can be inferred from facts in a data source. Template-based generators and pointer-generators are among the potential alternatives for table-to-text generation. In this paper, we propose a framework consisting of a pre-trained model and a copy mechanism. The pre-trained models are fine-tuned to produce fluent text enriched with numerical reasoning; however, the generated text still lacks fidelity to the table contents. The copy mechanism is therefore incorporated in the fine-tuning step by using general placeholders to avoid producing hallucinated phrases that are not supported by the table while preserving high fluency. In summary, our contributions are (1) a new dataset for numerical table-to-text generation using pairs of a table and a paragraph of a table description with richer inference from scientific papers, and (2) a table-to-text generation framework enriched with numerical reasoning.


Introduction
Recent data-to-text generation studies have shown significant improvement in generating faithful text aligned with data sources. A copy mechanism has been widely explored to improve faithfulness in various ways. Wiseman et al. (2017) used joint probabilities to let models choose between copying records from data sources or generating from a vocabulary. Puduppully et al. (2019) improved on a similar approach by modeling entity representations as the unit of copying. This approach has proven effective in generating descriptive text that explicitly mentions facts from sources. However, as introduced by Chen et al. (2020a), humans have the ability to produce more analytical text with richer inference, including numerical reasoning. Making inferences beyond texts is still an open question due to the limitation of language models in handling numeric operations. In this study, we further encourage research by elaborating numerical tables to initialize the ability to inject reasoning while maintaining high fluency. Our contributions are summarized as follows.
• We introduce a new dataset for table-to-text generation focusing on numerical reasoning. The dataset consists of textual descriptions of numerical tables from scientific papers. Our dataset is publicly available at https://github.com/titech-nlp/numeric-nlg.
• We adopt template-guided text generation (Kale and Rastogi, 2020a) for a table-to-text generation task and propose injecting pre-executed numerical operations into the template to guide numerical-reasoning-based text generation. We compare different types of templates for table representations in pre-trained models.
• We propose a copy mechanism for pre-trained models that uses general placeholders covering table contents and the results of pre-executed numerical operations to avoid fact hallucination.
• We conduct experiments with current state-of-the-art neural generation models and a simple template-based system to demonstrate the challenges and opportunities for future research on text generation with numerical reasoning.

Related Work
The power of tables in presenting data efficiently has encouraged research exploring tables as data sources in natural language tasks, such as table-to-text generation (Liang et al., 2009; Wiseman et al., 2017; Lebret et al., 2016; Parikh et al., 2020), table question answering (Pasupat and Liang, 2015; Wang et al., 2018), and table-based fact verification (Chen et al., 2020b; Gupta et al., 2020). Recent research on the table-to-text generation task has started to generate text with more reasoning. Murakami et al. (2017) explored stock prices to generate market comments, adding generalization tags of possible arithmetic operations to cover mathematical reasoning. Nie et al. (2018) proposed operation-guided attention by exploring the results of pre-executed numerical operations. The dataset closest to ours is LOGICNLG by Chen et al. (2020a), who first introduced logical text generation using open-domain tables with unknown schemas. Different from our target text for generation, which consists of several sentences in a paragraph, they proposed a task of generating only one sentence from selected table contents.
Data Cleansing and Annotation

Extracted table descriptions can be noisy since they may contain only table numbers without any sentences describing table facts. We hired experts in the computer science field to clean and annotate the extracted descriptions in the following steps:
• Examine tables and their corresponding descriptions, and recommend only the descriptions that have at least one sentence representing numerical facts in the table.
• Categorize each sentence of the recommended description into three fact-checking classes: data description, supporting description, and not-related-to-table description.
• Annotate the targeted header of each table as a content plan for its description. For the example in Figure 1, "Our full model" is selected as the target header.
We used the same split of training, validation, and test sets as the source table dataset (Suadaa et al., 2021). Similar in motivation to LOGICNLG in generating text that can be logically entailed by facts in tables, numericNLG consists of collections of paragraphs that are naturally produced by human experts in scientific papers, paired with their corresponding numerical tables. Our dataset has fewer tables than LOGICNLG, focusing on numerical-reasoning text in the scientific domain.

Table Representation
Due to ROTOWIRE's limited schemas, Wiseman et al. (2017) viewed a table input as a set of records (entity, value, type), where the entity and the type are the extracted row and column names, respectively. Because of the unlimited table schemas in our dataset, to capture the original structure of real-world tables, this paper uses a representation consisting of captions, row headers, column headers, cell values, and metrics, called a data table. Using only descriptive facts from the data table as input representations is sufficient to generate descriptive texts that explicitly mention facts in the table. However, since we intend to produce more analytical text with numerical reasoning, we propose adding inferred facts to the input representation by computing a set of arithmetic operations on the data table beforehand, defined as a pre-executed operation table.
Data Table We view T as a set of cells with their corresponding row header (rh), column header (ch), numerical value (val), and metric type (m), defined as a data table (T_D). A data table for the example in Figure 1 has the metric types (precision, recall, f1). Since our tables are annotated with a targeted header as a content plan for table descriptions, we mark cells corresponding to the targeted header with a target flag (tgt) to highlight the marked cells in text generation. We set tgt = 1 for targeted cells and tgt = 0 for non-targeted cells. In this study, we preprocess the header name by concatenating the row and column headers (h = [rh; ch]) and keep information about the header category by extracting the overlapping tokens of the row and column headers as th. As a result, we define T_D = (h_ij, th_ij, val_ij, m_ij, tgt_ij), where 1 ≤ i ≤ n_r, 1 ≤ j ≤ n_c; n_r and n_c are the numbers of rows and columns, respectively.
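To make the cell representation concrete, here is a minimal Python sketch of a T_D cell, with h and th derived as described above. The class, field types, and example values are our own illustrative assumptions, not the paper's implementation.

```python
from dataclasses import dataclass

@dataclass
class Cell:
    rh: str    # row header
    ch: str    # column header
    val: float # numerical cell value
    m: str     # metric type, e.g. "f1"
    tgt: int   # 1 if the cell belongs to the targeted header, else 0

    @property
    def h(self) -> str:
        # concatenated header h = [rh; ch]
        return f"{self.rh} {self.ch}"

    @property
    def th(self) -> str:
        # header category: tokens shared by the row and column headers
        shared = set(self.rh.lower().split()) & set(self.ch.lower().split())
        return " ".join(sorted(shared))

cell = Cell(rh="our full model", ch="model f1", val=89.3, m="f1", tgt=1)
print(cell.h)   # "our full model model f1"
print(cell.th)  # "model"
```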

Pre-executed Operation Table
We provide a table of pre-executed cell operations (T_OP) by performing mathematical operations only on targeted cells to limit the computation. In this study, we cover the maximum, minimum, and difference operations. Examples of a preprocessed table, data table, and pre-executed operation table are shown in Figure 2.
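The pre-execution step can be sketched as follows. This is an illustrative assumption of how T_OP might be built: the paper does not specify which cell pairs the difference operation covers, so here it is taken between the best and runner-up cells of each metric.

```python
# cells: (header, metric, value) tuples for targeted cells (tgt = 1)
def pre_execute(cells):
    ops = []
    for metric in {m for _, m, _ in cells}:
        group = [(h, m, v) for h, m, v in cells if m == metric]
        h_max, _, v_max = max(group, key=lambda c: c[2])
        h_min, _, v_min = min(group, key=lambda c: c[2])
        ops.append(("max", h_max, metric, v_max))
        ops.append(("min", h_min, metric, v_min))
        # difference between the best and runner-up for the same metric
        if len(group) >= 2:
            ranked = sorted(group, key=lambda c: c[2], reverse=True)
            (h1, _, v1), (h2, _, v2) = ranked[0], ranked[1]
            ops.append(("diff", (h1, h2), metric, round(v1 - v2, 4)))
    return ops

cells = [("Our full model", "f1", 89.3), ("Baseline", "f1", 88.1)]
print(pre_execute(cells))
```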
Linearized Table To support transfer learning of pre-trained transformers for our table-to-text generation task, we prepare a linearized table P_T as an input representation so that it is similar to the representations the encoder has seen during pre-training. T is converted into a flat string P_T = w_1, ..., w_|P_T|, similar to that used in much prior work (Wang et al., 2020; Chen et al., 2020a; Kale and Rastogi, 2020b), where w_i denotes the i-th word in paragraph P_T of length |P_T|. In this study, we adopt the template-based input representation introduced by Kale and Rastogi (2020a) to handle the representation gap between structured data T and a natural language utterance P_T, where P_T is generated using a manually defined template. We propose not only covering data table T_D in the template but also injecting the pre-executed numerical operations of table T through T_OP to guide numerical-reasoning-based text generation. We consider four different methods for converting T into sequences, the last two being our contributions. The first is a naive linearization of cell values; this naive representation omits the relation between rows and columns. Note that <table_id> is extracted from the caption to support table mentioning in generating table descriptions.

Data-based Template (T_D temp)
T is transformed into a natural language sentence by scanning each row of T_D with tgt = 1 to fill a manually defined template. This representation covers the semantics of data in the original table.

Reasoning-based Template (T_OP temp)
Mathematical operation arguments and results from T_OP are injected in this representation to cover the numerical reasoning of data in the original table. We define h_op and val_op as the header and the value of an operation result, respectively, where op = {max, min, diff}. Specific to the difference operation, h_diff1 and h_diff2 refer to the first and second header arguments, respectively. Then, T is represented by concatenating the templatized representation for each row of T_OP: <table_id> shows <caption>. <h_max> has the largest <m_max> (<val_max>) of <th_max>.

Data and Reasoning-based Template (T_D + T_OP temp)
T is converted by combining templatized sentences of T D and T OP . This representation covers both data and their numerical reasoning.
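The two template types above can be filled from simple records as in the following sketch. The record formats, function names, and the data-template wording are illustrative assumptions; the reasoning-template wording mirrors the fragment quoted above.

```python
def data_template(table_id, caption, rows):
    # rows: (header, metric, value) records for cells with tgt = 1
    sents = [f"{table_id} shows {caption}."]
    for h, m, v in rows:
        sents.append(f"The {m} of {h} is {v}.")
    return " ".join(sents)

def reasoning_template(table_id, caption, max_ops):
    # max_ops: (h_max, m_max, val_max, th_max) records from T_OP
    sents = [f"{table_id} shows {caption}."]
    for h, m, v, th in max_ops:
        sents.append(f"{h} has the largest {m} ({v}) of {th}.")
    return " ".join(sents)

print(reasoning_template(
    "table 2", "mention detection results",
    [("our full model", "f1", 89.3, "models")]))
```

Concatenating the outputs of both functions gives the combined T_D + T_OP representation.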

Generation Models
The task is to generate text by translating table representation P_T into table description Y = y_1, y_2, ..., y_n. We apply a series of generation models to solve the proposed task. While our focus is primarily on pre-trained models, since they have been most widely used for limited-data settings like ours, we also include a template-based generator and a pointer-generator network as baselines.

Non-pre-trained Models
Template-based Generator We design a domain-specific template-based generator covering the two types of sentences in table descriptions: table-referring sentences and data description sentences. Since our task focuses on numerical-reasoning descriptions, we define templatized sentences using the maximum records in table T_OP: <table_id> shows <caption>. We can see that <h_max> outperforms the other <th_max> with <val_max> of <m_max>.
Pointer-Generator Pointer-generator (See et al., 2017) is a sequence-to-sequence model with attention and a copy mechanism. This model copes with the out-of-vocabulary problem in data-to-text generation by jointly copying from source texts and generating from a vocabulary.

Pre-trained Models
Fine-tuned GPT2 GPT2 (Radford et al., 2019) is a pre-trained language model with a decoder-only transformer architecture. We fine-tuned the GPT2 model by using table representation P_T as a prefix of our input. Specifically, we fed the concatenation of table representation P_T and table description Y to the model and generated Y. In the inference phase, we used only P_T as the input to generate Ŷ starting after the last token of P_T.
Fine-tuned T5 T5 (Raffel et al., 2020) is a pre-trained transformer model with an encoder-decoder architecture that solves natural language tasks by converting them into a text-to-text format. We fine-tuned the T5 model on our dataset by adding a "summarize" prefix to table representation P_T, producing output Ŷ.
Copy Mechanism Pre-trained language models have proven their effectiveness in handling the open-vocabulary problem through subword tokenization. Supported by the attention layers of the transformer architecture, the models learn to attend to source inputs while generating target texts in subword units. However, pre-trained generators often produce texts that are not aligned with table sources. In this study, we propose strengthening their copying ability by incorporating a copy mechanism into the pre-trained models. Although a copy mechanism based on the pointer-generator (See et al., 2017) has been used with pre-trained models (Chen et al., 2020c) and is well known in the community, it cannot maintain the global logical structure of sentences with richer inference. We instead employ a simpler copy mechanism based on placeholders (Murakami et al., 2017) with more specific tags than in Chen et al. (2020a). We further propose a ranking-based placeholder alignment algorithm, as illustrated in Figure 3. First, we align entities and numbers in Y with the data table T_D and the pre-executed arithmetic operation results T_OP by using string matching. The alignment starts from the first row and proceeds to the last row of T_OP. If no matched token is found, it continues to the rows of T_D. We give T_OP a higher rank than T_D in the alignment since we focus on logical text generation. Then, we replace the matched tokens with the corresponding placeholders in a templatized description Y_temp. As depicted in Figure 3, since "our full model" in sentence Y is matched with the header result of the maximum operation, we replace it with the <header_max> placeholder. During the fine-tuning phase, instead of directly generating Y, the models learn to produce a templatized description Y_temp that includes placeholders as well as words.
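As a minimal sketch, the fine-tuning-time alignment can be viewed as ranked string replacement, with T_OP entries tried before T_D entries. The entry format and placeholder names are illustrative assumptions, not our exact implementation.

```python
def templatize(description, op_entries, data_entries):
    # each entry: (surface_string, placeholder), e.g.
    # ("our full model", "<header_max>")
    # op_entries come first, giving T_OP a higher rank than T_D
    out = description
    for surface, placeholder in op_entries + data_entries:
        out = out.replace(surface, placeholder)
    return out

y = "our full model achieves the best f1 of 89.3 on the test set."
op = [("our full model", "<header_max>"), ("89.3", "<value_max>")]
data = [("f1", "<metric>")]
print(templatize(y, op, data))
# "<header_max> achieves the best <metric> of <value_max> on the test set."
```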
In the inference phase, we design a ranking algorithm with a placeholder memory to select the best replacement tokens for the placeholders of a predicted templatized description Ŷ_temp in producing a generated description Ŷ. We define a set of values in the same row of the source tables as a content set and prioritize replacing placeholders in one sentence with values from the same content set, ensuring sentence coherence. A content set of T_D is a tuple of header, metric, and value. For T_OP, a content set consists of the header, metric, and value of the operation result. Specific to the difference operation, we add the headers of the first and second arguments to the content set since the header arguments are important for capturing entity comparison in a sentence.
We utilize a placeholder memory to temporarily save prioritized placeholder candidates from the same content set as the one previously chosen. (Details of the placeholders and their definitions are given in Tables 7 and 8 in the appendix.) For example, as shown in Figure 3, after replacing the <header_max> placeholder with the header result from the first row of maximum records of T_OP in Step 1, the related placeholders from the same content set (<metric_max> and <value_max>) are added to the placeholder memory as higher-ranked candidates in the search space. The placeholder memory is reset to empty for the following sentence of Ŷ_temp, and the alignment starts again from the next content set of the table sources.
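The inference-time filling with a placeholder memory can be sketched as below for a single sentence. Content sets are simplified to placeholder-to-string dicts, and the per-sentence reset is omitted for brevity; this is an illustrative assumption, not our exact implementation.

```python
def fill(templatized_sentence, content_sets):
    # content_sets: list of dicts mapping placeholder -> surface string
    memory = None  # the content set chosen so far in this sentence
    out = []
    for w in templatized_sentence.split():
        ph = w.rstrip(".,")
        if ph.startswith("<") and ph.endswith(">"):
            # prefer the remembered content set if it can fill ph
            sets = [memory] if memory and ph in memory else content_sets
            for cs in sets:
                if ph in cs:
                    out.append(w.replace(ph, cs[ph]))
                    memory = cs  # prioritize this set for later slots
                    break
            else:
                out.append(w)  # no candidate found; keep the placeholder
        else:
            out.append(w)
    return " ".join(out)

sets = [{"<header_max>": "our full model", "<value_max>": "89.3"},
        {"<header_max>": "baseline", "<value_max>": "88.1"}]
print(fill("<header_max> reaches <value_max>.", sets))
# "our full model reaches 89.3."
```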

Experiments
We conducted experiments on the proposed dataset to evaluate the performance of the text generation models and verify the effectiveness of the approach of using different table representations.

Automatic Evaluation Metrics
We used BLEU (Papineni et al., 2002), ROUGE-L (Lin, 2004), and METEOR (Banerjee and Lavie, 2005) to evaluate the informativeness of generated texts. We computed the BERTSCORE (Zhang et al., 2020) to assess the similarity between the generated texts and the ground-truth table descriptions by using contextualized token embeddings of pre-trained BERT (Devlin et al., 2019), which have been shown to be effective for paraphrase detection. Considering both references and table contents, we also used the PARENT metric, proposed by Dhingra et al. (2019). In our experiments, we modified the PARENT calculation by adding noun phrases of table captions as table contents and used only targeted table contents for table sources.

Implementation Details
We trained a pointer-generator model using the Adagrad optimizer with a batch size of 8 and a learning rate of 0.15. For fine-tuning the GPT2 model, we used the Adam optimizer with a weight decay of 3 × 10^-5. Following Raffel et al. (2020), the T5 model was fine-tuned with a constant learning rate of 0.001. We trained all models for a maximum of ten epochs with early stopping based on the loss on the validation set (patience of 3). At decoding time, the generated text was produced through a beam search of size 5.

Results

Table 2 shows our experimental results. The fine-tuned T5 models performed better than the others in terms of BLEU, ROUGE-L, METEOR, and BERTSCORE. The slightly lower PARENT score of the best fine-tuned T5 model compared with the template-based generator implies that the fine-tuned T5 model was also comparable in terms of generating related table descriptions. The pointer-generator model had the lowest scores since our dataset consists of limited table collections with a broad vocabulary and challenging target texts.

Effect of table representation
Comparing the performance between table representation types in the pre-trained models, we can see a different tendency between GPT2 and T5. The more similar the table representation was to the target text, the higher GPT2 scored. Since GPT2 has only a decoder, the inputs including reasoning-based templates (T_OP and T_D + T_OP), which are more similar to our target with numerical reasoning, performed the best for several metrics with more than a 1-point improvement. In T5, with its encoder-decoder architecture, on the contrary, there was only a slight margin between different table representations. This indicates that the encoder part of T5 can capture table contexts from various input templates. For variants without a copy mechanism, T5 with only the data representation (T_D) outperformed the other, longer representation types for all metrics. Because of the gap between the encoder and decoder, T5 still had difficulty aligning the information of longer inputs and outputs.
Effect of copy mechanism

The worst scores of the fine-tuned GPT2+copy models indicate that our proposed copy mechanism failed to learn the templatized target patterns in the fine-tuning step. The decoder-only GPT2 could not handle the sparse distributions of target texts with placeholders. Conversely, the copy-based fine-tuned T5 models achieved a better BLEU score thanks to the ability of their encoder-decoder architecture to handle output texts with placeholders. Table 3 shows table descriptions generated by the template-based, pointer-generator, and fine-tuned pre-trained models (GPT2 and T5), using data- and reasoning-based templates for our table example in Figure 2. We marked sentences related to table captions in green, correct facts based on table contents in blue, and incorrect facts in red. In this study, since we had a limited training set with a broader vocabulary, the pointer-generator model tended to produce repetitive words and failed to generate well-formed descriptions. The pre-trained models, GPT2 and T5, generated more natural descriptions. While several pieces of text generated by GPT2 included numerical facts, they used numbers that were not extracted from table contents. The T5 models produced descriptions that were more related to table contents than GPT2.

Qualitative Analysis
Considering our lengthy output examples in Table 3, unlike the fine-tuned GPT2 model, which generated longer sentences, the fine-tuned T5 model generated shorter sentences than the references. The length gap between the references and the outputs of the fine-tuned T5 model affected the F1-based metrics ROUGE-L, METEOR, BERTSCORE, and PARENT. Note that BLEU is a precision-based metric that can handle shorter outputs through a brevity penalty (Papineni et al., 2002). Therefore, we assume that BLEU better represents the performance of the fine-tuned T5 model than the other metrics.
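For reference, the brevity penalty the argument relies on is BP = 1 when the candidate is at least as long as the reference, and exp(1 - r/c) otherwise (Papineni et al., 2002):

```python
import math

def brevity_penalty(candidate_len, reference_len):
    # no penalty when the candidate is at least as long as the reference
    if candidate_len >= reference_len:
        return 1.0
    # shorter candidates are penalized exponentially
    return math.exp(1.0 - reference_len / candidate_len)

print(brevity_penalty(100, 80))             # 1.0
print(round(brevity_penalty(50, 100), 3))   # 0.368
```

This is why a model that systematically generates shorter outputs is penalized by BLEU only through this single factor, while F1-based metrics penalize the missing content directly.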

Human Evaluation
We conducted a human evaluation to better assess the quality of the generated text. We compared our copy-based fine-tuned T5 model with the template-based, pointer-generator, fine-tuned GPT2, and fine-tuned T5 models. We did not compare it against the copy-based fine-tuned GPT2 since GPT2 failed to incorporate our proposed copy mechanism. For each model, we used the table representation that scored best on the majority of metrics in the experimental results in Table 2.
In the first study, we evaluated the correctness of the generated text on the basis of facts in tables. We randomly selected 30 tables in the test set and elicited responses from three graduate students per table. Following Wiseman et al. (2017), the raters were asked to count how many facts in the descriptions were supported by numerical data in the tables and how many were contradicted. Since our task covers numerical-reasoning text, we distinguished descriptive numerical facts from inferred numerical facts. We also measured the level of relevance of the generated text to the table captions by using a four-point Likert scale (highly relevant, relevant, somewhat relevant, and irrelevant).
The results are shown in Table 4. The pointer-generator failed to reflect facts due to the wide variety of our table schemas. While the fine-tuned GPT2 model generated sentences with a larger number of descriptive and inferred facts than the others on average, most of the facts were contradictory. The fine-tuned T5 model generated fewer sentences than GPT2, with the average number of inferred facts being larger than that of descriptive facts. Our model based on the fine-tuned T5 model with a copy mechanism reduced the ratio of contradictory facts for both descriptive and inferred facts.
Following earlier work (Puduppully et al., 2019), we also evaluated text fluency in terms of grammaticality, coherence, and conciseness by using best-worst scaling (BWS) (Louviere and Woodworth, 1991;Louviere et al., 2015). We divided the outputs of the five models into ten pairs of descriptions. We presented workers with two descriptions and asked them to decide which one is best for each fluency category.
The score of each model was calculated by using the MaxDiff approach (Orme, 2009): the number of times a description was chosen as the best minus the number of times it was chosen as the worst. Scores range from −100 (absolutely worst) to 100 (absolutely best). We elicited judgments with Amazon Mechanical Turk for the 30 descriptions, rated by 3 participants. The results are shown in Table 5. Most of the pre-trained models achieved better scores than the others. The fine-tuned GPT2 model achieved the highest score in terms of grammaticality and coherence. The fine-tuned T5 model achieved the highest score in terms of conciseness. Adding a copy mechanism to the T5 slightly decreased the grammaticality and conciseness but improved the coherence.
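The MaxDiff-style score can be computed as a straightforward count, as in the sketch below (function name and normalization by the number of judgments are illustrative):

```python
def maxdiff_score(best_count, worst_count, n_judgments):
    # (#times chosen best - #times chosen worst) scaled to [-100, 100]
    return 100.0 * (best_count - worst_count) / n_judgments

# e.g. a model judged best 12 times and worst 3 times over 30 pairings
print(maxdiff_score(12, 3, 30))  # 30.0
```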

Conclusion
We proposed numericNLG, a new dataset for table-to-text generation using tables and their corresponding descriptions from scientific papers, focusing on numerical-reasoning texts. Even though our proposed dataset is not a large-scale table collection, we provide pairs of a table and its rich inference description that are naturally written by experts in scientific papers, supporting further research on table-to-text generation with numerical reasoning.
We conducted experiments with fine-tuned pre-trained models by using several types of table linearization as input representations, comparing them with a template-based generator and a pointer-generator. The experiments showed that transfer learning of pre-trained language models leads to an improvement in our setting, resulting in more fluent text that still lacks fidelity to table contents. We then proposed incorporating a copy mechanism by using general placeholders to avoid the production of hallucinated phrases that are not supported by tables while preserving high fluency. Even though our proposed copy mechanism failed to learn to generate better outputs with decoder-only pre-trained models, we showed that a copy-based pre-trained model with an encoder-decoder architecture leads to a better BLEU score and improves correctness.

A Table Representation
An example of table representation for Figure 2 is shown in Table 6.
Table 9 shows table descriptions generated by the fine-tuned GPT2 and fine-tuned T5 models with and without a copy mechanism, using different types of table representations for our table example in Figure 2.

D Human Evaluation
Figures 4 and 5 show the user interface for evaluating correctness and relevance and for evaluating grammaticality, coherence, and conciseness, respectively.
Model and generated text:
• Fine-tuned GPT2 (naive): we compare our results using the model with that of the other model, which shows the performance on both models. we can conclude that the models are more reliable in predicting the usefulness of the model and more reliable than the other models. we suggest that the more accurate the model, the higher the mention detection results on both models. the model with the highest mentions detection yields a better model, and also shows the best performance on both models.
• Fine-tuned GPT2 (TD temp): and finally we have seen that our model does not do the right thing by simply using the word "we." as a noun we do not do the right thing by using the verb "we." as a noun our word "we" does not even have a verb, "we do." as a verb we do not even have a verb, "we do.
• Fine-tuned GPT2 (TOP temp): the table shows the recall results for our full model. the f1 performance was better than that in our full model, indicating that we did not have any other models in the dataset. the rec. performance was comparable to that for the full model with a f1+ performance of 82.7%.
• Fine-tuned T5 (naive): table 2 presents the overall mention detection results on the test set of ontonotes. we can see that our full model outperforms all the baselines in terms of recall f1 score.
• Fine-tuned T5 (TD temp): table 2 shows the mention detection performance on the test set of ontonotes. our full model outperforms all the baselines in terms of recall and f1 score.
• Fine-tuned T5 (TOP temp): table 2 shows the overall mention detection results on the test set of ontonotes. our full model outperforms the state-of-the-art in both precision and recall.
• Fine-tuned T5 (naive) + Copy: table 2 shows the overall model results on ontonotes. we can see that our full model outperforms all baselines, which demonstrates the effectiveness of our approach.
• Fine-tuned T5 (TD temp) + Copy: table 2 shows the overall mention detection results on the test set of ontonotes dataset. our full model outperforms the state-of-the-art by a large margin, with an absolute difference of 0.8% over the state of the art.
• Fine-tuned T5 (TOP temp) + Copy: table 2 shows the overall mention detection results on the test set of ontonotes. our model outperforms the state-of-the-art (lee et al., 2018) and is comparable to the state-of-the-art (lee et al., 2018).