QTSumm: Query-Focused Summarization over Tabular Data

People primarily consult tables to conduct data analysis or answer specific questions. Text generation systems that can provide accurate table summaries tailored to users' information needs can facilitate more efficient access to relevant data insights. Motivated by this, we define a new query-focused table summarization task, where text generation models have to perform human-like reasoning and analysis over the given table to generate a tailored summary. We introduce a new benchmark named QTSumm for this task, which contains 7,111 human-annotated query-summary pairs over 2,934 tables covering diverse topics. We investigate a set of strong baselines on QTSumm, including text generation, table-to-text generation, and large language models. Experimental results and manual analysis reveal that the new task presents significant challenges in table-to-text generation for future research. Moreover, we propose a new approach named ReFactor, to retrieve and reason over query-relevant information from tabular data to generate several natural language facts. Experimental results demonstrate that ReFactor can bring improvements to baselines by concatenating the generated facts to the model input. Our data and code are publicly available at https://github.com/yale-nlp/QTSumm.


Introduction
In the era of data-driven decision-making, tabular data plays a crucial role in facilitating data analysis, serving as a concise and structured representation of information. For example, sports coaches may examine tables containing various statistics to develop game strategies and make team adjustments. However, effectively accessing and comprehending the information contained within a large and complex table can be time-consuming for users (Hurst, 2000; Pasupat and Liang, 2015; Pujara et al., 2021; Nan et al., 2022a). Text generation systems that can accurately summarize a provided table according to users' information needs have the potential to greatly enhance data analysis and expedite the process of obtaining data insights.

Figure 1: An example of QTSUMM. Given the numerous data points in the table, different users may be interested in various aspects for their own information-seeking or decision-making purposes. The system needs to perform human-like reasoning and analysis over relevant table regions to generate a tailored table summary.
Existing work and datasets on table-to-text generation (Parikh et al., 2020; Chen et al., 2020a; Cheng et al., 2022b; Lebret et al., 2016; Moosavi et al., 2021; Suadaa et al., 2021) have mainly focused on converting tabular data into coherent statements, aiming to present the structured data in a human-readable format. However, these approaches overlook the fundamental goal of addressing users' information-seeking purposes. Table-to-text generation systems should adopt a more flexible and interactive approach that allows people to obtain a user-customized summary tailored to their information needs (Dang, 2006; Xu and Lapata, 2020; Zhong et al., 2021; Xu and Lapata, 2022; Zhou et al., 2023), as illustrated in Figure 1. While table question answering (QA) (Pasupat and Liang, 2015; Iyyer et al., 2017; Zhong et al., 2018; Chen et al., 2020c; Nan et al., 2022b) has made significant progress in answering fact-based questions, these approaches primarily focus on extracting relevant facts or entities from the table and composing short-form answers. Nevertheless, in real-world scenarios, users often have more complex and diverse information needs that extend beyond simple fact retrieval. They expect models to perform human-like reasoning and provide trustworthy explanations or analyses that accompany the extracted insights.
With comprehensive consideration of the real-world information needs of users when consulting tabular data, we propose a new task, query-focused table summarization. In this task, the model is required to generate a user-customized summary given the table and a user query. To enable research in this area, we construct a human-annotated table-to-text generation dataset named QTSUMM that contains 7,111 query-summary pairs over 2,934 Wikipedia tables covering diverse topics. Table 1 compares QTSUMM with previous table-to-text generation datasets. To the best of our knowledge, QTSUMM is the first dataset that tackles the task of generating user-customized table summaries based on real-world scenarios.
We provide a comprehensive evaluation of current state-of-the-art models, including text generation (Lewis et al., 2020; Raffel et al., 2020; Chung et al., 2022), table-to-text generation (Liu et al., 2022b; Zhao et al., 2022b; Jiang et al., 2022), and large language models (Touvron et al., 2023a,b; Zheng et al., 2023; Jiang et al., 2023a; Xu et al., 2023; OpenAI, 2023). Our results and analysis from different perspectives reveal that existing models struggle to solve this new task, highlighting the challenges they face when performing human-like reasoning and analysis to generate summaries tailored to users' information needs.
To improve text generation systems on QTSUMM, we propose REFACTOR. Given a user query, REFACTOR can retrieve and reason over query-relevant facts from the source table to generate multiple data insights as natural language sentences. Our results illustrate that directly concatenating REFACTOR's generations to the original input sequence brings effective improvements to state-of-the-art baseline systems.
We conclude our main contributions as follows:
• We propose a new query-focused table summarization task, and construct a large-scale benchmark, QTSUMM, comprising 7,111 query-summary pairs collected in real-world scenarios. Strict quality control measures are employed to ensure the high quality of the dataset.
• We conduct a systematic study of state-of-the-art models on QTSUMM, and illustrate that they are still far behind expert performance, motivating future research on this new table-to-text task.
• We present REFACTOR for the efficient retrieval and reasoning of query-relevant facts from tables. It demonstrates significant improvements over state-of-the-art text generation baselines.

Related Work

Table-to-Text Generation Existing work on table-to-text generation typically casts the problem as either a single-sentence generation task (Chen et al., 2020a; Parikh et al., 2020; Cheng et al., 2022b; Liu et al., 2022a) or a generic summarization task (Lebret et al., 2016; Moosavi et al., 2021; Suadaa et al., 2021). In the single-sentence generation task, the focus is on generating fluent and faithful descriptions using provided table regions as a control for text generation. Nevertheless, using table regions to control text generation does not align with real-world scenarios, where people refer to tabular data for information-seeking purposes. The generic table summarization tasks aim to create concise and informative summaries based on the content of a given domain-specific table (i.e., sports or scientific). In contrast, the tables in QTSUMM cover diverse topics; given the numerous data points in a table, various users may be interested in different aspects for their own information-seeking purposes, making it challenging to create a generic summary that encompasses all the salient information within the table. Therefore, in this paper, we propose and investigate a new task setting related to query-focused summarization. FeTaQA (Nan et al., 2022b) is a table QA dataset that collects queries by rewriting ToTTo's (Parikh et al., 2020) statements into questions and uses the same statements as the answers. In comparison with FeTaQA, the queries in QTSUMM were annotated under real-world scenarios, making them more natural and better reflecting users' actual information needs.

Reasoning Over Tabular Data Enhancing the table reasoning capabilities of models is essential for a variety of table-related tasks, such as table question answering (Pasupat and Liang, 2015; Iyyer et al., 2017; Zhong et al., 2018; Zhao et al., 2023d), table fact verification (Chen et al., 2020b), and table-to-text generation (Chen et al., 2020a; Cheng et al., 2022b). One prevalent approach is pre-training models with table-text joint reasoning data (Herzig et al., 2020; Liu et al., 2022b; Zhao et al., 2022b; Liu et al., 2022a; Jiang et al., 2022; Dong et al., 2022; Cheng et al., 2022a; Xie et al., 2022). Nevertheless, these models generate text in an end-to-end manner, resulting in reduced explainability and difficulties in handling more complex reasoning, such as arithmetic calculation. Therefore, we propose REFACTOR, which retrieves and generates query-relevant facts from tables as intermediate results for the model input (Zhou et al., 2022; Zhao et al., 2023b), mitigating the implicit reasoning processes of text generation models.
Query-Focused Summarization Initially formulated as a document summarization task, QFS aims to generate summaries from documents that are tailored to specific user queries (Dang, 2006). Despite its potential real-world applications, QFS remains a challenging task due to the lack of large-scale training data. Existing works have attempted to address this issue by leveraging distant NLP resources, including question answering (Xu and Lapata, 2020), paraphrase identification (Su et al., 2020), and generic summarization (Xu and Lapata, 2022; Zhou et al., 2023).

Problem Formulation

Given a table T and a user query q, the query-focused table summarization task is to generate a paragraph-long summary y = (y_1, ..., y_n) that fulfills the information need expressed by q:

\hat{y} = \arg\max_{y} \prod_{i=1}^{n} P(y_i \mid y_{<i}, T, q; \theta)

where θ denotes the parameters of a neural text generation model, and y_i denotes the i-th token in the generated summary.

Data Collection Principles
At a high level, the goal of the data collection process is to obtain high-quality user queries and corresponding paragraph-long summaries grounded in the tabular data. We outline our key criteria for designing a benchmark to thoroughly evaluate the table-to-text summarization capabilities of models.
To achieve this, we first design three principles for annotating a good query-summary pair:
• Comprehensiveness: The tailored summary should provide enough details and analysis of the source table to respond to the user query, fulfilling the user's information need.
• Attributability & Faithfulness: The query should be answerable using only information from the source table. The summary should be grounded in the source table, and not contain any unfaithful or nonsensical text.
• Fluency: Both the user query and its corresponding table summary should be coherent and fluent.

QTSUMM Annotation Pipeline
To ensure that QTSUMM annotation fulfills the aforementioned principles, we carefully design an annotation pipeline consisting of the following steps:
Source Table Collection QTSUMM uses tables from the LOGICNLG (Chen et al., 2020a) and TOTTO (Parikh et al., 2020) datasets as source tables, as these tables are crawled from Wikipedia and cover diverse domains and topics. We filter out tables that are 1) too large or too small, 2) composed of only string-type columns, or 3) hierarchically structured (e.g., containing more than one table header). Then we randomly sample 2,000 candidate tables each from LOGICNLG and TOTTO for the query-summary annotation.
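To make the filtering step concrete, the following is a minimal Python sketch of these criteria; the dict-based table representation and the exact size thresholds are illustrative assumptions, not the values used to build QTSUMM.

```python
def is_numeric_column(column):
    """True if every cell in the column parses as a number."""
    for cell in column:
        try:
            float(str(cell).replace(",", ""))
        except ValueError:
            return False
    return True

def keep_table(table, min_rows=4, max_rows=30, min_cols=3, max_cols=10):
    """table: {"header": [...], "rows": [[...], ...], "num_header_rows": int}."""
    rows, header = table["rows"], table["header"]
    # 1) Drop tables that are too large or too small (thresholds are illustrative).
    if not (min_rows <= len(rows) <= max_rows and min_cols <= len(header) <= max_cols):
        return False
    # 2) Drop tables with only string-type columns (no number/date-like column).
    columns = [[row[j] for row in rows] for j in range(len(header))]
    if not any(is_numeric_column(col) for col in columns):
        return False
    # 3) Drop tables with hierarchical structures (more than one header row).
    return table.get("num_header_rows", 1) <= 1
```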
User Query Annotation Given a table, the annotators are required to read its content and determine whether the table is informative and intelligible to common web users. Then they are asked to come up with two or three queries, assuming they are users seeking certain information from the table. We require each query to be answerable using information only from the table. Moreover, since we expect paragraph-long summaries as query responses, we avoid queries that can be answered in a short sentence (e.g., "Which country held the 2022 FIFA World Cup?").
Query-Focused Summary Annotation Given a table and user query, we ask another annotator to use only information from the source table to write a paragraph-long summary that satisfies the user's information need. We encourage annotators to produce sophisticated summaries that 1) contain as much information from the table as possible, and 2) involve multiple types of reasoning over several relevant table regions. To further encourage high-quality annotations, we adopt the "two channel collection" design (Chen et al., 2020b), in which annotators are paid 60% more if their summaries are manually verified to exhibit adequate complexity. We also require the annotators to record the row indices of the relevant table regions referenced in the written summary, allowing future researchers to quantify how well the summaries are grounded in the table.

Multi-Round Validation
We conduct a multi-round validation protocol to ensure that the annotated data fulfills the aforementioned annotation principles. We first assign the query annotators to validate each summary against its corresponding query, and to fix any mistakes. Then we check 1) whether a query-summary pair contains adequate information and complex aggregation by examining the length of the summary, and 2) whether the information in the summary is essential for responding to the user query. We manually revise pairs that do not meet these standards.

Annotation Quality Control
Table 2 describes the basic statistics of QTSUMM.
In addition to the multi-round validation, we carefully design several quality control approaches, comprising expert annotation and numerous annotation de-biasing designs, to ensure the high quality of QTSUMM annotations.
Expert Annotators To help improve the annotation process, five experts with professional experience in text summarization tasks were invited to conduct the internal annotation. They were asked to provide feedback on the task instructions and the user experience of the annotation interface, based on which we iteratively modified the annotation guideline and interface design. For the external annotation stage, we enrolled 17 graduate students majoring in STEM fields (10 female, 7 male).
We do not use crowdsourcing platforms such as Mechanical Turk, as our preliminary study indicated that MTurk annotators fail to produce high-quality query-summary data. Before starting the official annotation process, each annotator was given a two-hour training session to learn the annotation requirements and interface.
Annotation De-biasing We observed several kinds of annotation bias during our internal annotation, and proposed the following countermeasures for annotation de-biasing:
Source Table Diversity: During internal annotation, we found that many tables in LOGICNLG have similar content. For example, there are around 200 tables describing the results of football games, with identical table headers. To ensure the diversity of source tables, we keep only one table for each unique table header.
Query Diversity: When annotating queries, annotators may prefer simpler ones, resulting in low query diversity. Therefore, we frequently monitor the diversity of queries for each annotator. Annotators are also encouraged to craft queries that are creative or require complex reasoning in summarization; such queries receive double payment to compensate for the extra time.
Supporting Fact Position: We found that annotators prefer to raise queries about the first few rows of each table. To counter this bias in supporting fact positions, we randomly highlight certain rows for each table in the annotation interface, and require the annotators to write queries whose summaries cover at least two rows of the highlighted regions.
We also report the human evaluation scores and inter-evaluator agreements over 200 sampled query-summary pairs. QTSUMM has high annotation quality and inter-annotator agreement (Table 3).

QTSUMM Evaluation
We develop a comprehensive approach for evaluating QTSUMM, incorporating both automated and human evaluation. We adopt the following popular automated evaluation metrics:
BLEU (Papineni et al., 2002) computes the geometric average of the precision over the output text's n-grams. We use SacreBLEU (Post, 2018), which produces comparable and reproducible BLEU scores.
ROUGE (Lin and Hovy, 2003) measures the word overlap between the candidate and reference summaries. We report the F1 score of ROUGE-L (longest common subsequence).
METEOR (Banerjee and Lavie, 2005) is based on a generalized concept of unigram matching between the machine-produced translation and human-produced reference translations.
BERTScore (Zhang et al., 2020) computes the similarity between the reference and generated summary using contextual word embeddings.
TAPAS-Acc (Herzig et al., 2020; Liu et al., 2022a) is a reference-free metric that uses TAPAS (Herzig et al., 2020), fine-tuned on the TabFact dataset (Chen et al., 2020b), as the backbone to evaluate the faithfulness of generation.
AutoACU (Liu et al., 2023a) is an interpretable, reference-based summarization evaluation system that exhibits better alignment with human judgments. A2CU first extracts atomic content units (ACUs) from the generated summary and then evaluates them against the reference. A3CU is an accelerated version of A2CU that directly computes the similarity between two texts without extracting ACUs, while targeting the same evaluation goal. We use the F1 score of A3CU for evaluation.
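For reference, the sketch below shows how three of the reference-based metrics above can be computed with the publicly available sacrebleu, rouge-score, and bert-score packages; this is an illustrative re-implementation, not the paper's exact evaluation script.

```python
import sacrebleu
from rouge_score import rouge_scorer
from bert_score import score as bert_score

def automated_metrics(candidates, references):
    """candidates: list[str]; references: list[str], one reference per candidate."""
    # SacreBLEU expects a list of reference streams (here: a single stream).
    bleu = sacrebleu.corpus_bleu(candidates, [references]).score

    # ROUGE-L F1, averaged over the dataset.
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    rouge_l = sum(
        scorer.score(ref, cand)["rougeL"].fmeasure
        for ref, cand in zip(references, candidates)
    ) / len(candidates)

    # BERTScore F1 with the default English model.
    _, _, f1 = bert_score(candidates, references, lang="en")
    return {"BLEU": bleu, "ROUGE-L": rouge_l, "BERTScore": f1.mean().item()}
```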
For human evaluation, the summaries from different models were evaluated by experts on the three criteria (i.e., comprehensiveness, faithfulness, and fluency) discussed in Section 3.2. Each summary was scored from 1 (worst) to 5 (best) on each criterion, with the final score averaged across evaluators.

REFACTOR
QTSUMM requires models to perform human-like reasoning to generate summaries that provide comprehensive and precise analysis of the source table, fulfilling the user's information need. However, existing end-to-end text generation models rely on error-prone implicit reasoning processes, leading to diminished explainability and challenges in addressing user queries that necessitate complex human-like reasoning (Zhou et al., 2022; Zhao et al., 2023b). To address this, we present REFACTOR, which retrieves and reasons over query-relevant information from tabular data to generate several NL data insights (i.e., facts) as explicit reasoning results. As shown in Figure 3, the generated facts are concatenated to the model input to mitigate the implicit reasoning issues, enhancing the comprehensiveness and faithfulness of the generated summary. We next discuss the implementation of REFACTOR.

Fact Generation
Given the user query and source table, REFACTOR generates several candidate facts by executing various forms of human-like reasoning over the table. Specifically, we define six types of table reasoning operations (e.g., numerical operation, counting, and conjunction) that are necessary for the QTSUMM task, as shown in Table 7 in the Appendix.

(Figure 3 example: for the query "Which company earns the highest profit in the Oil and Gas industry, and how does it compare to the most profitable company overall?", a baseline relying on error-prone implicit reasoning produces an unfaithful summary, whereas explicit and faithful reasoning by REFACTOR yields: "Within the Oil and Gas industry, Sinopec Group earns the highest profit: $6,205 million. However, compared to the most profitable company overall, Apple, the profit earned by Sinopec Group is much lower. In fact, Apple earns $51,306 million more profit than Sinopec Group.")
For each reasoning operation, the fact generator (adapted from Zhao et al. (2022b)) takes a table and a query as input, and produces multiple facts based on a fact template. Each fact template includes several placeholders that are filled with information retrieved from the table. Specifically, the column col and cell value val are indexed to specify the column and cell names, respectively. Some templates also require that the selected column or cell value be of date or number type.
OPERATOR corresponds to operators that are instantiated according to the specific reasoning operation, and CONDITION:i can be 1) a cell value from the i-th column, or 2) a number/temporal comparison statement if the i-th column is of date or number type. After substituting all the placeholders in the provided template, the fact generator programmatically returns the executed_results and forms one fact. Once the facts for a {table, query} pair are collected from different fact generators, we pass them to the fact ranking process.
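To make the template-filling procedure concrete, below is a simplified Python sketch of one fact generator instantiating the "difference" template from Table 7 ("The difference between val:1 and val:2 in col is executed_results."); the dict-based table representation and the assumption that the first column names each row are illustrative simplifications, not the paper's exact implementation.

```python
from itertools import combinations

def parse_number(cell):
    """Best-effort numeric parsing of a table cell; returns None on failure."""
    try:
        return float(str(cell).replace(",", ""))
    except ValueError:
        return None

def generate_difference_facts(table):
    """table: {"header": [...], "rows": [[...], ...]}; the first column is
    assumed to name each row (e.g., a country or a company)."""
    facts = []
    subjects = [row[0] for row in table["rows"]]
    for j, col in enumerate(table["header"]):
        values = [parse_number(row[j]) for row in table["rows"]]
        if any(v is None for v in values):
            continue  # this template requires a number-type column
        # Fill the template once for every pair of rows in the column.
        for (s1, v1), (s2, v2) in combinations(zip(subjects, values), 2):
            facts.append({
                "operation": "numerical_operation",
                "text": f"The difference between {s1} and {s2} in {col} is {abs(v1 - v2):g}.",
            })
    return facts
```

On a medal table, for instance, this generator would produce facts such as "The difference between China and Canada in Gold is 16."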

Fact Ranking
Given the query and source table, each fact generator is used to generate several query-relevant facts, resulting in a large number of candidate facts in total. Therefore, we rank the generated facts to select the most relevant ones.
We use a QA encoding model (Reimers and Gurevych, 2019) to obtain embeddings of the query and each generated fact. Then, we select the top-n generated facts with the highest cosine similarity to the query embedding. In practice, we set n = max((row num × column num) / 2, 5), and ensure that the number of selected facts from each type of reasoning operation does not exceed 3. The selected facts, which are handy and readily available for end-to-end text generation systems, are then concatenated into the model input.
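A minimal sketch of this ranking step is shown below, assuming a Sentence-Transformers QA encoder; the checkpoint name, the dict structure of candidate facts, and the integer division in the top-n formula are illustrative assumptions.

```python
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("multi-qa-mpnet-base-dot-v1")  # assumed checkpoint

def rank_facts(query, facts, n_rows, n_cols, per_operation_cap=3):
    """facts: list of {"operation": str, "text": str} produced by the generators."""
    # n = max((row num x column num) / 2, 5), as described above.
    n = max(n_rows * n_cols // 2, 5)
    query_emb = encoder.encode(query, convert_to_tensor=True)
    fact_embs = encoder.encode([f["text"] for f in facts], convert_to_tensor=True)
    sims = util.cos_sim(query_emb, fact_embs)[0]
    # Greedily take the most similar facts, capping each reasoning operation at 3.
    selected, counts = [], {}
    for idx in sims.argsort(descending=True).tolist():
        op = facts[idx]["operation"]
        if counts.get(op, 0) >= per_operation_cap:
            continue
        selected.append(facts[idx]["text"])
        counts[op] = counts.get(op, 0) + 1
        if len(selected) == n:
            break
    return selected
```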

Baseline Systems
We evaluate the following three types of state-of-the-art baseline systems on QTSUMM:

Text Generation Models
BART (Lewis et al., 2020) is a pre-trained denoising autoencoder with a transformer-based architecture that shows effectiveness in NLG tasks.
T5 (Raffel et al., 2020) demonstrates effectiveness in NLG tasks by treating all NL problems as text-to-text tasks during the pre-training stage.
Flan-T5 (Chung et al., 2022) enhances T5 by scaling instruction fine-tuning and demonstrates better human-like reasoning abilities than T5.

Table-to-Text Generation Models

TAPEX (Liu et al., 2022b) continues pre-training the BART model using a large-scale corpus of synthetic SQL query execution data, showing better table understanding and reasoning abilities.

Large Language Models
Llama-2 (Touvron et al., 2023a,b) is an open-source large language model trained on large-scale, publicly available datasets.
Vicuna (Zheng et al., 2023) is tuned from Llama-1 with instruction-following data, exhibiting better instruction-following capabilities.
Mistral (Jiang et al., 2023a) is a 7-billion-parameter LLM that outperforms Llama-2-13B across most popular evaluation benchmarks.
Lemur (Xu et al., 2023) is tuned from Llama-2 with instruction-following data, exhibiting competitive natural language and coding capabilities.
GPT (Brown et al., 2020; OpenAI, 2023) is a powerful large language model capable of generating human-like text and performing a wide range of NLP tasks in a few-shot setting.

Experimental Setup
The specifics of input data serialization and LLM prompting examples are discussed in Appendix A. All experiments were conducted on a cluster of 8 NVIDIA RTX A6000 48GB GPUs. We selected the large version of all fine-tuned baseline models, whose weights are publicly available on HuggingFace. For each fine-tuning experiment, we ran 15 epochs with a batch size of 128. The best fine-tuning checkpoints were selected according to the validation loss. The experiments for open-source LLMs were conducted using the vLLM framework (Kwon et al., 2023). We used gpt-3.5-turbo-0613 for GPT-3.5 and gpt-4-0613 for GPT-4 via the OpenAI APIs. For the LLM hyperparameters, we set the temperature to 1.0, top-p to 1.0, and the maximum output length to 256.
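For reference, the sketch below shows how the GPT baselines can be queried with the hyperparameters above via the OpenAI Python client; the prompt itself (see Appendix A) is abbreviated to a placeholder, and this is an illustrative reconstruction rather than the exact experiment script.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def query_gpt(prompt, model="gpt-4-0613"):
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=1.0,   # hyperparameters from the experimental setup
        top_p=1.0,
        max_tokens=256,
    )
    return response.choices[0].message.content
```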

Main Results
We draw the following conclusions based on the automated and human evaluation results (Tables 4 & 6).

Importance of table structure understanding
Table-to-text generation models pre-trained with table-text data outperform their text generation backbones, demonstrating the importance of considering table structure for the QTSUMM task.
Importance of reasoning and analysis Among text generation models, Flan-T5, which enhances T5 through scaled instruction fine-tuning, outperforms T5. Moreover, LLMs with improved reasoning capabilities (i.e., Llama-2-70B and GPT-4) also achieve better performance. These findings indicate the significance of reasoning and analytical skills in handling the QTSUMM task.
Mismatch between automated and human evaluation Despite receiving low scores on popular automated evaluation metrics such as BLEU and ROUGE, GPT-* models exhibit better performance than state-of-the-art fine-tuned models in human evaluation. This finding underscores the need for future research to develop automated evaluation metrics for the QTSUMM task that better align with human judgments (Zhang and Bansal, 2021; Liu et al., 2023a; Jiang et al., 2023b).
Effectiveness of REFACTOR As assessed by human evaluation, baseline systems employing REFACTOR typically yield better performance, especially at the faithfulness level. This suggests the efficacy of REFACTOR in enhancing the reasoning process of text generation.

Error Analysis
For a deeper understanding of the query-focused table summarization task on QTSUMM, we conduct an error analysis to illustrate existing challenges.
We identify four common mistakes that current text generation models are likely to make (i.e., hallucination, factual incorrectness, user intent misunderstanding, and repetition), providing detailed examples and explanations for each type of common mistake in Table 8 in the Appendix.

REFACTOR Analysis
We also undertake a human evaluation to examine the efficacy of REFACTOR in generating query-relevant facts from tabular data. Specifically, we randomly sample 200 examples from the QTSUMM validation set, and ask two human evaluators to judge each fact generated by REFACTOR, determining its relevance to the query. 56.4% of the generated facts (528 out of 937) are labeled as "relevant", suggesting adequate coverage by REFACTOR. To delve deeper, we also conduct a case study of the failure cases, specifically those examples where fewer than two facts were annotated as "relevant". We identified three kinds of common failure cases: (1) difficulty in parsing cell values via rule-based methods, (2) complex user queries that make ranking related facts difficult, and (3) unsupported reasoning operations. We provide detailed examples and explanations in Table 5.

Conclusion
This paper defines a new query-focused table summarization task and constructs a large-scale benchmark, QTSUMM. We investigate a set of strong baselines, including text generation, table-to-text generation, and large language models. Experimental results and manual analysis reveal that the new task presents significant challenges in table-to-text generation. Moreover, we propose a novel approach named REFACTOR to retrieve and reason over query-relevant information from tables, improving the faithfulness of the generated summaries.

Limitations and Future Work
The baseline systems provided have a restricted maximum number of input tokens (e.g., 1024 for all examined fine-tuned models), which prevents them from generating summaries for large tables that, when converted into a sequence, exceed this limit. To handle large tables (e.g., with more than 300 table cells), future work can apply neural models (Herzig et al., 2020; Liu et al., 2022b) to first filter out query-irrelevant rows or columns. Moreover, this paper demonstrates the effectiveness of using intermediate results obtained from explicit reasoning operations to mitigate implicit reasoning issues. However, the proposed REFACTOR utilizes a template-based method to generate facts. Although such a template-based approach can ensure the factual correctness of the generated facts, as discussed in Section 5.5, it might not cover all crucial facts for some complex user queries. We believe the following directions warrant further exploration:
(1) Complex query decomposition. Our case study reveals that the TAPEX-based fact ranking module struggles with comprehending complex questions. To address this, future research could investigate LLM chain-of-thought methods to break down complex questions into more understandable and actionable sub-questions.
(2) Tool usage. The predefined, template-based execution modules in the REFACTOR fact generation phase have their limitations. Recent studies (Schick et al., 2023; Lu et al., 2023; Paranjape et al., 2023; Gou et al., 2023; Qiao et al., 2023) highlight the impressive abilities of LLMs in making and utilizing tools for problem-solving. It would be intriguing to explore whether LLMs can produce executable programs from scratch to derive query-relevant insights.
(3) Explainable automated evaluation. In Section 5.3, a discrepancy between automated and human evaluation results is observed. Such discrepancies are concerning, as developers might opt for suboptimal systems for real-world applications if they rely solely on automatic metrics to compare and rank different text generation systems. Therefore, a more reliable and explainable automated evaluation system is required (Zhang and Bansal, 2021; Liu et al., 2023a,b; Jiang et al., 2023b).

Ethical Consideration
The source tables in QTSUMM were collected from the LOGICNLG (Chen et al., 2020a) and TOTTO (Parikh et al., 2020) datasets, which are publicly available under the MIT license and CC BY-SA 3.0 license, respectively. They both permit us to compose, modify, publish, and distribute additional annotations upon the original dataset.
For the external annotation of QTSUMM, we hired 17 graduate students majoring in STEM fields. We regard 1) creating three queries for one table and validating the corresponding summaries annotated by others, and 2) composing a query-focused summary response, as a unit task, and we paid around $1.5 for each unit task. For creative annotation rewards, we paid an additional $0.5 per query and $1.5 per summary. On average, an annotator can finish 7 unit tasks per hour after training and practicing, and the hourly rates range from $9 to $13 depending on working speed (above the local average wage for similar jobs). We recommended that annotators complete a maximum of 30 unit tasks per day in order to reduce pressure and maintain a comfortable pace. In total, annotating the QTSUMM dataset took approximately 1,400 working hours, and the whole annotation effort lasted about 40 days.

A Implementation Details
Input Data Serialization The input contains a user query and the corresponding table data. For text generation models and large language models (Sections 5.1.1 & 5.1.3), we followed recent work on table-to-text generation (Liu et al., 2022b; Xie et al., 2022; Zhao et al., 2023c,a) to flatten the table data as T = [HEADER]: h, [ROW]1: r1, ..., [ROW]n: rn, where h is the table header and ri is the i-th table row. For text generation models, [HEADER] and [ROW] are special tokens indicating the regions of table headers and rows, respectively; for LLMs, we set them to empty strings. We also separated headers or cells in different columns using a vertical bar |. In this way, the flattened table can be fed directly into text generation models. For table-to-text generation models (Section 5.1.2), we followed their original data processing methods to input the query and table data.
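The following is a minimal Python sketch of this linearization scheme; the dict-based table representation is an assumption for illustration.

```python
def flatten_table(table, for_llm=False):
    """table: {"header": [col, ...], "rows": [[cell, ...], ...]}.
    [HEADER]/[ROW] are special tokens for fine-tuned models and
    empty strings for LLMs, as described above."""
    header_tok = "" if for_llm else "[HEADER]"
    row_tok = "" if for_llm else "[ROW]"
    parts = [f"{header_tok}: " + " | ".join(table["header"])]
    for i, row in enumerate(table["rows"], start=1):
        parts.append(f"{row_tok}{i}: " + " | ".join(str(c) for c in row))
    return " ".join(parts)

# Example:
# flatten_table({"header": ["Country", "Gold"], "rows": [["China", 48], ["Canada", 32]]})
# -> "[HEADER]: Country | Gold [ROW]1: China | 48 [ROW]2: Canada | 32"
```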
Template: The difference between val:1 and val:2 in col is executed_results.
Example: The difference between China and Canada in Gold is 16.

Table 7: Six reasoning operations, along with fact templates and examples, defined for the fact generation process of REFACTOR. Variable names indicate permissible instantiations: col denotes a column name, val denotes a cell value, and executed_results denotes the execution result of the function. OPERATOR is instantiated according to the specific reasoning operation, e.g., for "Numerical Operation", OPERATOR is replaced with "sum" or "average"; CONDITION can be 1) a cell value from the i-th column, or 2) a number/temporal comparison statement (e.g., "later than 1967") if the i-th column is of number or date type.

Error Type: Repetition. Explanation: The system generates repetitive information. Analysis: The information that these buildings were the tallest in Portland, Oregon is repeated throughout the system output, while the system fails to distinguish them (i.e., until which year each of them was the tallest).

Table 8: Case study of common errors made by Flan-T5-large wo. REFACTOR. The colored text highlights problematic parts of the system output.

Figure 3: Enhancing fine-tuned models with the proposed REFACTOR. After generating and selecting the top-n query-relevant facts obtained through various reasoning operations (e.g., numerical comparison, counting), these facts are concatenated with the query and table data as the model input in both the fine-tuning and inference stages. REFACTOR can mitigate the error-prone implicit reasoning issues of end-to-end text generation systems. For LLMs in the zero- or few-shot setting, we provide the generated facts within the prompts (Figure 5 in Appendix A).

Figure 4: An example of an LLM zero-shot prompt prefix wo. REFACTOR for the QTSUMM task.

Figure 5: An example of an LLM zero-shot prompt prefix w. REFACTOR for the QTSUMM task.

Table 1: Comparison between QTSUMM and existing table-to-text generation datasets.

Table 3: Annotation quality and inter-annotator agreement over 200 samples of QTSUMM. Three internal evaluators were asked to rate the samples on a scale of 1 to 5. We report 1) the percent of samples with an average score ≥ 4, indicating the annotation quality of QTSUMM; and 2) the percent of agreement and Randolph's Kappa with 95% CI (Randolph, 2005) to show the inter-annotator agreement.

Table 4: Automated evaluation results on the QTSUMM test set, involving three types of baseline systems with and without REFACTOR. We used the chat or instruct version of each LLM. Within each experimental setting, we used A3CU (F-score) as the ranking indicator of model performance. Due to budget constraints, for all LLM w. REFACTOR experiments, we randomly selected 200 samples.

Table 5: Case study on REFACTOR's failure cases. For example, given the query "Analyze the correlation between the size of the geographical area of a Gmina type and its population?", the QA encoding model employed for fact ranking struggles to understand complex information needs such as the "correlation between A and B", and may consequently rank irrelevant facts higher.

Table 6: Human evaluation results (Likert scale scoring) of selected baselines on the test set. Five experts were enrolled to evaluate 50 predictions for each model.