A Table-to-Text Framework with Heterogeneous Multidominance Attention and Self-Evaluated Multi-Pass Deliberation



Introduction
Table-to-text, the task of producing a textual description from a table input, has been widely applied in different domains, such as weather forecasting (Liang et al., 2009; Mei et al., 2016), logical tabular reasoning (Chen et al., 2020a), and financial report generation (Lin et al., 2022). Automatic description generation can help mitigate the time-consuming process of writing table descriptions by hand.
Recent advances in pre-trained language models (PLMs) have demonstrated significant progress in natural language generation (NLG) (Yao et al., 2022, 2023; Fang et al., 2023). To effectively leverage the power of PLMs, several table-to-text works (Gong et al., 2020; Suadaa et al., 2021) serialized the table input via manually defined templates. Besides, to preserve the table's structural information, TableGPT (Gong et al., 2020) devises a table structure reconstruction task, and TASD (Chen et al., 2022b) proposes to learn the structure representation explicitly. However, to group the data into categories, the cells in a table (e.g., a pivot table) are often organized in a nested/hierarchical manner using headings and subheadings. The attention maps computed by these approaches may fail to capture such intricate hierarchical table structures.
On the other hand, the multi-pass generation paradigm (Niehues et al., 2016; Chen et al., 2022b) makes it possible to deliberate the generated texts from a global perspective. However, it is hard to decide when to terminate the multi-pass procedure and which outcome constitutes a faithful description of the table. Existing works evaluate and finalize the multi-pass generated texts with the help of reinforcement learning (RL) (Geng et al., 2018) or a customized rewriter-evaluator architecture (Li and Yao, 2021). These solutions, however, introduce extra workload due to the intractability of RL or the need for a separate evaluation module.
To this end, in this paper, we propose a PLM-based table-to-text approach with Self-evaluated multi-pass Generation and Heterogeneous Multidominance Attention (SG-HMA). Specifically, we first formulate an input table into a multidominance (MD) structure (Gračanin-Yuksek, 2013). In this way, the hierarchical relations of the table content are preserved, since a child node in the MD structure can have more than one parent node. Then, we devise a heterogeneous multidominance attention (HMA) mechanism to represent the table content with awareness of its complex hierarchical structure. Afterward, we deliberate the generated texts with a multi-pass paradigm and develop a contrastive loss that equips the model with self-evaluation, yielding more faithful table descriptions. The contributions of this work can be summarized as follows: • We propose to transform the tabular input into a multidominance structure and devise a heterogeneous multidominance attention to yield table representations on top of PLMs.
• We innovate a self-evaluated multi-pass generation framework for the table-to-text task with the help of contrastive learning.
• Extensive experiments on benchmark datasets validate the superiority of the proposed SG-HMA framework in generating descriptive texts for tabular inputs.
2 Related Work

Table-to-Text Generation
With the success of deep neural networks, the seq2seq method has been applied to various natural language generation (NLG) tasks. Based on this framework, researchers (Liu et al., 2018; Puduppully et al., 2019) developed data-to-text generation models. Since PLMs have shown great potential in transfer learning, fine-tuning PLMs on different downstream tasks has become a general and effective method in NLG (Kale and Rastogi, 2020; Chen et al., 2020b). Parikh et al. (2020) proposed a novel table-to-text dataset with a controlled generation task, applying BERT as a baseline. Gong et al. (2020) transformed the table into natural language text and designed two auxiliary tasks to address the incompatibility between text-to-text PLMs and table-to-text generation. To infer facts from tables, Chen et al. (2020a) introduced reinforcement learning in the training algorithm, and Suadaa et al. (2021) designed a reasoning-based template. Inspired by prompting, Li and Liang (2021) applied prefix-tuning to GPT2 for table-to-text generation and outperformed fine-tuning in low-data settings. Li et al. (2021) introduced table representation learning into the fine-tuning of PLMs, showing the potential of table representations in guiding text generation. TASD (Chen et al., 2022b) designed three multi-head attention layers within and among cells, ignoring the inherent hierarchical structure within a table. For the table-to-text task, no prior work takes the hierarchical structure of tables into account. We resolve the table hierarchy into an MD structure and devise an HMA to learn the table representation.

Contrastive Learning for Natural Language Generation
Contrastive learning has been widely applied to NLG tasks such as machine translation (Yang et al., 2019) and summarization (Cao and Wang, 2021). SimCTG (Su et al., 2022) introduces a contrastive training objective that calibrates the model's representation space. Xu et al. (2022) propose SeqCo, a contrastive objective that maps representations of a document and its summary into the same space.
Most generation models are trained without being exposed to incorrectly generated tokens. To address this exposure bias problem, Sun and Li (2021) apply margin-based losses to the generated text, and Liu and Liu (2021) train an evaluation model with contrastive learning. Several works (Lee et al., 2021; An et al., 2022; Su et al., 2022) use an N-pairs contrastive loss, and BRIO (Liu et al., 2022) uses a ranking loss based on sequence-level scores to learn a better sequence-level distance function between the document and the target. Therefore, contrastive learning can bridge the gap between training objectives and evaluation metrics (Chen et al., 2022a). We are the first to apply it to text deliberation, implementing a multi-pass generation process that can be self-evaluated.

Problem Formulation
In a typical table, the caption provides critical information about the entire table and its associated topics, the rows and columns describe the properties of their affiliated cells, and the cells provide detailed values within the table framework. A table organizes these components in a hierarchical structure, encoding deep semantic information that can be expressed in natural language. The table-to-text task aims at generating an appropriate summary s for each table t.
Specifically, given a structured table t, the model θ is expected to generate a descriptive sentence y in an auto-regressive way:

y_i = arg max_{y_i} P_θ(y_i | y_{<i}, t), i = 1, ..., |y|, (1)

where |y| is the number of words in sentence y.
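To make the auto-regressive formulation above concrete, the following is a minimal sketch that decodes a description token by token with a greedy arg max over an off-the-shelf GPT-2; the serialized table string is a made-up placeholder, not the paper's exact serialization.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

# Hypothetical serialized table N_t (placeholder text, not the real template).
serialized_table = "The caption is monolingual word similarity. The ws-240 of bilex is ..."
ids = tokenizer(serialized_table, return_tensors="pt").input_ids

with torch.no_grad():
    for _ in range(30):                          # generate at most 30 tokens
        logits = model(ids).logits               # [1, seq_len, vocab_size]
        next_id = logits[0, -1].argmax()         # y_i = arg max P(y_i | y_<i, t)
        ids = torch.cat([ids, next_id.view(1, 1)], dim=1)

print(tokenizer.decode(ids[0]))
```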

PLM as Generator
Fine-tuning a PLM on different downstream tasks has become a general and effective method in NLG. We first serialize the table into natural language text N_t that conforms to the standard input format of the PLM. Given the serialized table N_t and the reference s, in the training process, the last hidden state H with the decoder input s is obtained as:

H_{t,s} = PLM([N_t ; s]), (2)

where H will be used to predict the probability of the next token. Formally, the training objective of text generation, which maximizes the likelihood of the reference text, is given by:

L_mle = - Σ_{i=1}^{|s|} log P(s_i | s_{<i}, N_t), (3)

where s_i is the i-th token of the reference s.
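As a minimal sketch of Eqs. (2)-(3), assuming a GPT-2 backbone from Hugging Face transformers, the snippet below concatenates the serialized table N_t with the reference s and computes the MLE loss only over the reference tokens; the strings and variable names are illustrative, not the authors' code.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

n_t = "The caption is results on word similarity. ..."              # serialized table (placeholder)
s = "The table reports word similarity accuracy for four models."   # reference summary (placeholder)

table_ids = tokenizer(n_t, return_tensors="pt").input_ids
ref_ids = tokenizer(s, return_tensors="pt").input_ids
input_ids = torch.cat([table_ids, ref_ids], dim=1)                  # [N_t ; s]

# Mask the table part so the loss L_mle of Eq. (3) covers only the reference tokens.
labels = input_ids.clone()
labels[:, : table_ids.size(1)] = -100

loss = model(input_ids, labels=labels).loss                         # L_mle
loss.backward()                                                      # one fine-tuning step (optimizer omitted)
```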

Methodology
In this section, we introduce the proposed framework in detail. Firstly, as shown in Fig. 1, we resolve the input table into a multidominance structure D = (V, E). The heterogeneous node set V comprises V_a = {a_i}_{i=1}^{|V_a|}, V_r = {r_i}_{i=1}^{|V_r|}, V_c = {c_i}_{i=1}^{|V_c|}, and V_e = {e_i}_{i=1}^{|V_e|}, the sets of caption, row, column, and content nodes, respectively. Furthermore, we use h(r_i) and h(c_i) to signify the level at which row r_i ∈ V_r and column c_i ∈ V_c are located, respectively. We also denote the row and the column at the max level to which a content node e_i ∈ V_e belongs as ρ_R(e_i) and ρ_C(e_i), respectively. Based on the heterogeneous node sets, the set of edges E is defined according to the hierarchical structure, connecting each component node to the lower-level components it dominates (e.g., the caption to its top-level rows and columns, a parent row or column to its sub-rows or sub-columns, and ρ_R(e_i) and ρ_C(e_i) to the content node e_i).
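As a rough illustration (not the authors' implementation), the sketch below encodes the MD structure as heterogeneous nodes that may have several parents, so a content cell is dominated by both its row and its column heading; all texts and levels are invented.

```python
from dataclasses import dataclass, field

@dataclass
class MDNode:
    text: str
    kind: str                                    # "caption" | "row" | "column" | "content"
    level: int = 0                               # h(r_i) / h(c_i): level of the heading
    parents: list = field(default_factory=list)  # multidominance: one or more parent nodes

caption = MDNode("Monolingual word similarity", "caption")
row = MDNode("CLSP-SE", "row", level=1, parents=[caption])
col = MDNode("ws-240", "column", level=1, parents=[caption])
cell = MDNode("v11", "content", parents=[row, col])   # dominated by a row AND a column

edges = [(p, n) for n in (row, col, cell) for p in n.parents]   # the edge set E
```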
Heterogeneous Multidominance Attention. Multidominance is a tree-like structure organized in a top-down manner with non-uniform abstract meaning across levels. More importantly, MD distinguishes itself from the commonly known tree structure by inheritance non-linearity, i.e., a node may have more than one parent node.
To deal with MD, which exhibits both a directed hierarchical structure and inheritance non-linearity, we propose a novel heterogeneous multidominance attention (HMA) mechanism, where a flow aggregation layer is employed to adaptively aggregate the heterogeneous flows within the MD table structure. Specifically, we first obtain the initial representation of each table component node x by the PLM embedding layer Emb(•), i.e., E_x = Emb(x), where E_x ∈ R^{l_x × d}, l_x is the length of the text in node x, and d is the dimensionality. Then, we respectively concatenate the representations of the nodes in terms of their types (i.e., caption, row, column and content), e.g., U_A = E_{a_1} ∥ E_{a_2} ∥ ... ∥ E_{a_{|V_a|}}, where U_A ∈ R^{n_A × d} is the caption representation, n_A is the total length of all caption nodes, and ∥ is the concatenation operation along the first dimension. Analogously, the row, column and content representations U_R, U_C and U_E can be derived. Afterward, we devise a flow aggregation layer FA(• ⇝ •, •) to enable the table information to flow from high-level components (e.g., captions) to low-level components (e.g., rows and columns) while selectively accumulating beneficial knowledge for each node. Specifically, the row (column) representation is updated by aggregating the flow from the caption nodes, e.g., Û_R = FA(U_A ⇝ U_R, P_{A→R}), where P is an incidence prior suggesting the hierarchical connectivity between specific types of nodes based on D, as illustrated in Fig. 2. For example, considering the connectivity from caption nodes to row nodes, P_{A→R} ∈ R^{n_A × n_R} is a binary matrix whose entries are 1 for node pairs connected in E and 0 otherwise. The cell content representation is further acquired by aggregating the flows from the updated row and column representations into U_E. Particularly, to fulfill adaptive information aggregation, the flow aggregation layer FA is implemented with a sparse attention (Wang et al., 2019): with S and T denoting the source and target inputs of the flow, the attention scores between the queries T W_Q and the keys S W_K are masked via element-wise multiplication ⊙ with the prior P, and the masked attention weights aggregate the values S W_V, where W_Q, W_K and W_V are learnable parameters. Finally, the table representation Z_T is obtained by concatenating the learned caption, row, column and content representations, i.e., Z_T = U_A ∥ Û_R ∥ Û_C ∥ Û_E.
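Below is a simplified PyTorch sketch of the flow aggregation layer FA(S ⇝ T, P): queries come from the target nodes, keys and values from the source nodes, and the incidence prior masks the attention so that information only flows along MD edges. A masked softmax stands in for the sparse attention used in the paper, so this is an approximation under stated assumptions rather than the exact layer.

```python
import torch
import torch.nn as nn

class FlowAggregation(nn.Module):
    """FA(S -> T, P): target nodes T selectively aggregate information from source nodes S."""
    def __init__(self, d):
        super().__init__()
        self.w_q = nn.Linear(d, d, bias=False)
        self.w_k = nn.Linear(d, d, bias=False)
        self.w_v = nn.Linear(d, d, bias=False)
        self.d = d

    def forward(self, source, target, prior):
        # source: [n_S, d], target: [n_T, d], prior: [n_T, n_S] with 1 = connected in E
        scores = self.w_q(target) @ self.w_k(source).T / self.d ** 0.5
        scores = scores.masked_fill(prior == 0, float("-inf"))   # restrict flow to MD edges
        attn = torch.softmax(scores, dim=-1)
        return target + attn @ self.w_v(source)                  # residual update of the targets

d = 16
fa = FlowAggregation(d)
u_a, u_r = torch.randn(3, d), torch.randn(5, d)      # caption / row token representations (toy)
p = (torch.rand(5, 3) > 0.5).float()                 # toy prior, target x source orientation
p[:, 0] = 1.0                                        # keep every row connected to the caption
u_r_updated = fa(u_a, u_r, p)                        # rows aggregate caption information
```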
Self-Evaluated Multi-pass Generation

For the generation procedure, we first utilize the table representation to enhance the generation ability of the PLM via cross attention. To achieve the self-evaluation that governs the termination signal in the multi-pass deliberation, we develop a contrastive loss that ranks the similarity between the table and the generation samples in accordance with the evaluation metric. Moreover, the deliberation is performed by rewriting not only the output generation but also the candidates, which incorporates abundant samples into contrastive learning. The L_mle defined in Equation 3 is applied as the training objective for text generation.

Contrastive Self-Evaluation. As the table itself and its summary convey different aspects of the same semantic information, it is beneficial to align the table's hidden state for generation with more closely matched summaries. Hence, we propose to empower the model to self-evaluate the quality of candidates based on their compatibility with the table, using a carefully designed ranking-based contrastive loss. Specifically, inspired by the Sinkhorn divergence (SD) (Feydy et al., 2019), which interpolates between MMD (Gretton et al., 2006) and OT (Chen et al., 2019) to attain probability distribution comparison, we first define a similarity score σ(t, g) between the table t and a generation sample g based on the Sinkhorn divergence between their hidden-state distributions. Then, we generate the candidates {u_i^t}_{i=1}^{n_u} by beam search for table t, where n_u is the number of candidates. Afterward, we leverage the evaluation metrics (e.g., BLEU) to construct a partial order ≻ over the candidates, where higher metrics suggest higher rankings. Finally, we introduce a ranking loss (Zhong et al., 2020) to assign higher scores to better candidates:

L_ctr = Σ_{u_i ≻ u_j} max(0, σ(t, u_j) − σ(t, u_i) + ε),

where ε is the margin value.

Table 1: Performance comparisons of the automatic evaluation on fine-tuning, TableGPT, prefix-tuning, TASD and SG-HMA with different backbones. Except for fine-tuning, the other baselines are implemented according to their source codes. We implement fine-tuning and our method with three backbones to achieve a fair comparison.

Learning Objective and Deliberation. We jointly train the text generation task and the candidate self-evaluation task with a composite loss L_mul = L_mle + α L_ctr, where α is a hyperparameter serving as a scale factor. Regarding the multi-pass deliberation, for each pass we rewrite not only the generation result but also the candidates, so as to incorporate abundant samples into the contrastive evaluation. Then, according to the average score, we terminate the deliberation once the score starts to decline. The details of the training pipeline are shown in Algorithm 1.
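The following is a small sketch of the self-evaluation objective under the assumptions noted in the comments: candidates are ordered by BLEU, a margin ranking loss pushes the similarity scores σ(t, u) to respect that order, and the composite loss adds it to the generation loss with weight α; the numeric values are dummies.

```python
import torch

def ranking_contrastive_loss(scores_ranked, margin=0.01):
    # scores_ranked: similarity scores sigma(t, u_i), sorted so that index 0 is the
    # candidate ranked highest by the evaluation metric (e.g., BLEU).
    loss = torch.zeros(())
    n = scores_ranked.size(0)
    for i in range(n):
        for j in range(i + 1, n):                 # u_i ranks above u_j
            loss = loss + torch.clamp(scores_ranked[j] - scores_ranked[i] + margin, min=0)
    return loss

sigma = torch.tensor([0.8, 0.6, 0.7], requires_grad=True)  # scores of 3 beam candidates (dummy)
l_ctr = ranking_contrastive_loss(sigma)
l_mle = torch.tensor(2.3)                                   # generation loss from Eq. (3) (dummy)
alpha = 0.5                                                 # scale factor (hyperparameter)
l_mul = l_mle + alpha * l_ctr                               # composite training loss
```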

Table Description
Inference. In the inference stage, we generate multiple candidates in an autoregressive manner using the multi-pass deliberation paradigm. To attain a satisfactory deliberation, we further leverage the benefits of contrastive self-evaluation to discern the quality of candidates. Specifically, we terminate the deliberation if the quality of the newly generated candidate, as determined by the similarity score σ defined in the contrastive self-evaluation, begins to decline. A detailed description of the inference stage is given in Algorithm 2.
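Schematically, the inference-time loop can be sketched as below, where generate() and sigma() are placeholders for the PLM decoder and the learned similarity score; the actual procedure (Algorithm 2) also maintains multiple candidates per pass.

```python
def deliberate(table_text, generate, sigma, max_passes=5):
    """Multi-pass deliberation that stops once the self-evaluated score declines."""
    context = table_text
    best_y, best_score = None, float("-inf")
    for _ in range(max_passes):
        y = generate(context)                 # produce a candidate description
        score = sigma(table_text, y)          # self-evaluation against the table
        if score < best_score:                # quality starts to decline: stop
            break
        best_y, best_score = y, score
        context = context + " " + y           # feed the draft back for the next pass
    return best_y

# Toy usage with stand-in functions.
drafts = iter(["draft one", "a better draft", "an over-edited draft"])
result = deliberate("serialized table", lambda ctx: next(drafts),
                    lambda t, y: -abs(len(y) - 14))   # toy score preferring mid-length drafts
```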

Overall Performance
As shown in Table 1, SG-HMA achieves the best overall results on the numericNLG, Totto and E2E datasets. Furthermore, it is worth noting that SG-HMA maintains its leading position when changing its backbone, which demonstrates its backbone-agnostic nature. Moreover, we observed performance variations of different backbones across datasets. Notably, BART demonstrated the best performance on the numericNLG dataset owing to its design as a denoising model, allowing for more accurate summaries of complex, noisy tables. On the Totto dataset, which features tables of varying types and formats, T5 achieved the best results. This is because T5 is pre-trained with multiple tasks, giving it a strong generalization capability to process various data types with different formats. Lastly, the E2E dataset, which describes restaurant information in a relatively simple format, is more susceptible to overfitting with complex models; therefore, the GPT2 backbone may be slightly more effective.

In-depth Analysis
We conducted an in-depth analysis from multiple perspectives on the three public datasets with the best-performing backbones to gain further insights into our proposed method and verify the effectiveness of each component.
Ablation study. To further explore the effectiveness of our proposed modules, we conduct an ablation study. Specifically, we compare SG-HMA with several variants: 1) w/o HMA, which removes the heterogeneous multidominance attention, 2) w/o ctr, which removes the contrastive self-evaluation in the training stage, and 3) FA, which replaces our designed HMA with full attention.
As can be seen in Table 2, w/o ctr performs worse than SG-HMA under all metrics. This demonstrates the importance of the contrastive loss in guiding the PLM to generate a more reasonable probability distribution. Besides, SG-HMA outperforms w/o HMA and FA under all metrics. Since our HMA is a form of flow aggregation that propagates information in a top-down manner, it matches the inherent hierarchical structure of tables and thus performs best on all datasets.
Parameter sensitivity analysis. We study the impact of the contrastive loss weight on the model performance, varying the weight from 0.1 to 10. As shown in the first row of Figure 3, on the whole, the model performance initially improves as the weight increases. This is because a larger weight forces the PLM to generate a more reasonable probability distribution and to evaluate the deliberation result effectively. However, as the weight continues to increase, the model's performance decreases. This is because when the weight is too large, the model overly focuses on the contrastive loss and degenerates into a discriminative model, losing the ability to generate text.

Effectiveness of self-evaluation on deliberation.
As shown in Figure 4, we present the similarity score predicted by our model, along with five metrics computed against the golden table description, for the description candidates in each pass. We observe that the metrics perform best at the second pass on the numericNLG and Totto datasets, while the best performance is achieved at the third pass on the E2E dataset. Such an observation underscores the significance of identifying an appropriate termination point for the deliberation, as prolonged deliberation may result in a deterioration of the generated description. Furthermore, we observe a positive correlation between the model's predicted similarity score and the BLEU metric, owing to the contrastive self-evaluation performed with the BLEU metric during the training stage. As a result, our proposed model can serve as a reliable indicator of the quality of the generated table descriptions during the inference stage, facilitating the assessment of deliberation adequacy.

Fig. 5 presents an interesting case we observed when comparing the results on the numericNLG dataset. Specifically, the green highlights indicate text that appears in both the table and the generated result, the red highlights indicate summary statements and their corresponding content, and the blue highlights indicate parts showing the hierarchical semantic relationship. Compared to the fine-tuned BART model, the results generated by SG-HMA better restore the original content of the table. In terms of summarization ability, SG-HMA not only correctly identifies "CLSP-SE" as the best-performing model, but also acknowledges the poor performance on the "ws-240" dataset and captures the hierarchical relationship between "Chinese" and "ws-240". This demonstrates our model's ability to summarize tables by effectively mining hierarchical table structure information.

Human Evaluation
We randomly selected 40 samples from the three datasets and conducted a human evaluation from four aspects: fluency, coverage, relevance, and overall quality.

Limitations
Our proposed method exhibits marginal improvements over the prefix-tuning baseline when the input tabular data, such as the E2E dataset, are relatively simple. To further enhance the model's performance on simple tables, we aim to integrate prompt learning with our hierarchical table representation in future work. Besides, we only take BLEU as the metric ranking criterion in contrastive learning. In the future, we will consider all metrics to achieve a more balanced model.

Ethics Statement
We will abide by the laws, rules, and regulations of our community, school, work, and country.We will conduct ourselves with integrity, fidelity, and honesty.We will openly take responsibility for our actions and only make agreements, which we intend to keep.

A Multidominance Structure based Table Serialization
Formally, the table is serialized according to both its structure and its content with a template of the form "As (the caption is) a, the r_0^{h_r} of r_0^0 ..., ....", where x_i^j is the i-th x at level j, e^k_{r_i, c_j} is the cell e_k with row attribute r_i and column attribute c_j, and h_x is the max level of x.
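A rough sketch of such a template-based serialization for a flat, single-level table is shown below; the template wording and the handling of nested headings are simplifying assumptions, and the header and cell values are dummy placeholders.

```python
def serialize_table(caption, row_headers, col_headers, cells):
    """Flatten a simple table into a natural-language string N_t."""
    parts = [f"The caption is {caption}."]
    for i, r in enumerate(row_headers):
        for j, c in enumerate(col_headers):
            parts.append(f"The {c} of {r} is {cells[i][j]}.")
    return " ".join(parts)

n_t = serialize_table(
    "monolingual word similarity",
    ["bilex", "clsp-se"],
    ["ws-240", "col-2"],
    [["v11", "v12"], ["v21", "v22"]],   # dummy cell values
)
```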

B Extra Experimental Settings

B.1 Dataset Division
The dataset division of numericNLG and E2E follows the official method. For the Totto dataset, we filter 1,189 samples.

B.2 Evaluation Metrics
BLEU measures the precision of n-grams in a sentence against references. ROUGE-L measures the recall of the longest common subsequence between the source and the target. NIST improves on BLEU by weighting the penalty for incorrectly matched n-grams. METEOR evaluates generation via word-to-word matching. CIDEr is an automated consensus-based metric originally used for evaluating image descriptions.
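For instance, sentence-level BLEU, which the paper also uses as the ranking criterion for candidate summaries, can be computed with NLTK as follows (both sentences are invented examples):

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "the table shows monolingual word similarity results".split()
candidate = "the table reports word similarity results".split()

bleu = sentence_bleu([reference], candidate,
                     smoothing_function=SmoothingFunction().method1)
print(f"BLEU = {bleu:.3f}")
```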

B.3 Backbones
GPT2 is a pre-trained language model with a decoder-only transformer architecture. It was pre-trained on a large and diverse webtext dataset with the goal of maximizing the probability of generating high-quality text. BART is a denoising autoencoder for pre-training sequence-to-sequence models, with a standard transformer-based architecture. T5 is a pre-trained transformer model with an encoder-decoder architecture, providing a unified framework for converting all NLP tasks into text-to-text tasks.

B.4 Implementation Details
Regarding automatic evaluation, all results of deep models were obtained by conducting experiments on a Linux machine with an Nvidia P40 GPU. Furthermore, an Adam optimizer was utilized for LM fine-tuning, and training was run for 30 epochs on numericNLG, 20 epochs on Totto, and 5 epochs on E2E. A beam search algorithm was adopted when generating the text, with the beam width set to 4. The learning rate of the PLM was searched over {1e-5, 5e-5, 1e-4}; we selected 1e-4 for numericNLG and Totto and 1e-5 for E2E. We use BLEU as the evaluation metric to define the target ordering of the candidate summaries. We fine-tuned a GPT2 model with 124M parameters, a BART model with 400M parameters and a T5 model with 220M parameters.
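For reference, the reported configuration can be collected as a plain dictionary; this merely restates the values above and is not released code.

```python
train_config = {
    "optimizer": "Adam",
    "epochs": {"numericNLG": 30, "Totto": 20, "E2E": 5},
    "beam_width": 4,
    "learning_rate": {"numericNLG": 1e-4, "Totto": 1e-4, "E2E": 1e-5},
    "ranking_metric": "BLEU",
    "backbone_sizes": {"GPT2": "124M", "BART": "400M", "T5": "220M"},
}
```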

B.5 Baselines
We compare SG-HMA with the most relevant baselines as follows: • Fine-tuning. To leverage the rich semantic information in PLMs, fine-tuning as a transfer learning method has shown great potential in various downstream tasks, including machine translation, named entity recognition and summarization.
• TableGPT. TableGPT is the first attempt to apply table serialization to convert semi-structured data into natural language text, together with a multi-task learning paradigm to enhance the generation ability, showing the potential of leveraging table structure information.
• Prefix-tuning. A prefix is a sequence of continuous task-specific vectors and is the only module that needs to be optimized while the basic PLM parameters are kept fixed. Prefix-tuning is a state-of-the-art table-to-text method on the E2E dataset.
• Cont. Contrastive learning is an effective solution to the exposure bias problem in NLG tasks. This paper proposes a unified framework to break the bottlenecks from three aspects: contrastive example construction, contrastive loss choice and decoding strategy.
• TASD. TASD devises a three-layered multi-head attention network to leverage the table structure information and adopts a multi-pass generation paradigm.

Fluency
1. Poor Fluency: The text is difficult to understand.

2. Below Average Fluency: The text has some basic elements of fluency but still contains noticeable errors or inconsistencies.
3. Average Fluency: The text demonstrates a moderate level of fluency, with relatively few errors.
4. Above Average Fluency: The text demonstrates a high level of fluency, with minimal errors or disruptions.
5. Excellent Fluency: The text demonstrates exceptional fluency, with virtually no errors or disruptions.

Coverage
1. The generated text does not mention any key elements of the table.
2. The generated text provides limited information about some of the key elements of the table.
3. The generated text provides a moderately comprehensive description of the table.

4. The generated text provides a comprehensive description of the table.
5. The generated text offers an exceptional and comprehensive description of the table.
Relevance
1. The generated text does not mention any relevant information about the table, and/or the information provided is entirely fabricated or false.
2. The generated text includes some relevant information about the table, but the description is limited, and/or contains factual inaccuracies.

3. The generated text provides a moderately relevant description of the table, and/or contains minor errors.
4. The generated text offers a highly relevant description of the table, and/or contains mostly accurate and reliable information that can be reasonably verified.

5. The generated text provides an exceptionally relevant and authentic description of the table.

Overall Quality
1. The generated text is of poor quality overall, with numerous issues.
2. The generated text is below average in quality, with several issues affecting its overall effectiveness.
3. The generated text is average in quality, with some strengths but also some weaknesses.
4. The generated text is above average in quality, with clear strengths and good overall effectiveness.
5. The generated text is of excellent quality overall, with outstanding strengths and high effectiveness.

Figure 2: The prior for the table in Fig. 1, where a blue cell indicates 1 and a white cell indicates 0.

Figure 5: Case study on the numericNLG dataset, comparing the fine-tuned BART output with SG-HMA (discussed in the in-depth analysis).

Figure 1: Overview of SG-HMA. The original table (caption, first/second/third-level column headings, row headings and values) is resolved into the MD structure; HMA aggregation and cross attention over the PLM input yield the predicted probability.
Algorithm 1: Training Procedure. Data: a training dataset with a table set T, the corresponding reference set S, and a language model LM_Θ with initial parameters Θ. Result: a language model for table-to-text generation LM_Θ* with optimal parameters Θ*.

Algorithm 2: Inference Procedure. Data: a table t and a well-trained table-to-text generation model LM_Θ* with parameters Θ*. Result: a textual description of the table.

Table 2: Performance comparisons on BLEU (B), ROUGE-L (R), NIST (N), METEOR (M) and CIDEr (C) of SG-HMA and its variants with the best backbone.

Table 5: Performance on monolingual word similarity computation with seed lexicon size 6000.

Groundtruth: table 5 shows the results of monolingual word similarity computation on four datasets. from the table, we find that: (1) our models perform better than bilex on both chinese word similarity datasets. (2) clsp-wr model does not enhance english word similarity results but clsp-se model does.

Finetuned BART: table 5 shows the results of this experiment. we can see that clsp-wr achieves the median accuracy of 60.46% compared to bilex, which indicates that clone-based methods are more effective than conventional word similarity methods.

SG-HMA BART: table 5 shows the results of monolingual word similarity computation with seed lexicon size 6000. we observe that clsp-se outperforms other models on all datasets except for Chinese ws-240, where the accuracy of bilex is slightly worse.

Table 3: Human evaluation results. Flue means fluency, Cove means coverage, Rele means relevance, and Ovqa means overall quality.
Table 4 shows the size of each part of the dataset after division.

Table 4: Statistics of the training, validation, and test sets for the numericNLG, Totto and E2E datasets.