Improving Encoder by Auxiliary Supervision Tasks for Table-to-Text Generation

Table-to-text generation aims at automatically generating natural language text that helps people conveniently obtain the salient information in tables. Although neural models for table-to-text generation have achieved remarkable progress, some problems are still overlooked. Previous methods cannot deduce factual results from an entity's (player's or team's) performance and the relations between entities. To solve this issue, we first build an entity graph from the input tables and introduce a reasoning module to perform reasoning on the graph. Moreover, there are different relations (e.g., the numeric size relation and the importance relation) between records in different dimensions, and these relations may contribute to data-to-text generation. However, it is hard for a vanilla encoder to capture them. Consequently, we propose two auxiliary tasks, Number Ranking (NR) and Importance Ranking (IR), to supervise the encoder to capture these different relations. Experimental results on ROTOWIRE and RW-FG show that our method not only generalizes well but also outperforms previous methods on several metrics: BLEU, Content Selection, and Content Ordering.


Introduction
Table-to-text generation is an essential task for text generation from structured data. It aims at automatically producing descriptive natural language text that helps people obtain the salient information from tables. Over the past several years, neural text generation methods have made significant progress on this task. Lebret et al. (2016), Wiseman et al. (2017), and Bao et al. (2018) view the input table as a record sequence and model the task as machine translation. To generate text containing more salient and well-organized facts, later works (e.g., Moryossef et al., 2019) introduce explicit content selection and planning.
Figure 1 (a) contains basketball game statistical tables from ROTOWIRE (Wiseman et al., 2017), a benchmark of NBA basketball games. As can be seen, each entity (player or team) occupies one row in the corresponding table, and each row comprises several records of different types, which describe the entity's performance in different aspects. To generate a summary from these tables, it is necessary to reason out factual results from the entities' performance and the relationships between entities. For instance, when humans describe the tables in Figure 1 (a), they usually give factual results such as "The Boston Celtics dominated the visiting New York Knicks" or "Isaiah Thomas was huge for Boston...". These results need to be reasoned from the entities' performance and the relationships between entities. Therefore, it is necessary to endow the model with reasoning ability. However, previous methods do not explicitly model this ability.
Numerical tables, in which most records are numeric, are very common. For instance, 86.82% of the records and 86.49% of the column types in ROTOWIRE are numeric. We observe that there are different relations between records in different dimensions. For example, there are two kinds of relations in numerical tables. The first is the numerical size relation in the column dimension, i.e., among records of the same column type. The other is the relative importance relation in the row dimension, which refers to the relative importance of the different types of records in the same row to the entity they belong to. On the one hand, these relations may contribute to table-to-text generation. Take Figure 1 (a) as an example: I. Thomas's score is 29, which is higher than the other records in the column PTS, while his three rebounds are lower than most other records in the column REB. Therefore, humans are more likely to describe his scores rather than his rebounds when summarizing his performance. On the other hand, a vanilla encoder may not effectively capture the relations existing in different dimensions without any auxiliary supervision.
We employ a hierarchical encoder, which comprises a Record Encoder and a Reasoning Module, to encode the input tables at the record level and the row level. Specifically, inspired by Gong et al. (2019), the Record Encoder utilizes two cascaded self-attention modules to encode the table from the column and the row dimension, respectively. Moreover, to endow the model with reasoning ability, we first build an entity graph at the row level according to the relations between players and teams, and then introduce a reasoning module to perform reasoning on the graph. Furthermore, we utilize auxiliary tasks to help the encoder capture the different relations among records. More specifically, two auxiliary tasks named Number Ranking (NR) and Importance Ranking (IR) are proposed to supervise the learning of the different parts of the Record Encoder, respectively.
We conducted experiments on ROTOWIRE and RW-FG (Wang, 2019) to verify the effectiveness of the proposed approach. The experimental results demonstrate that it is necessary to endow the model with reasoning ability. Moreover, the proposed two auxiliary tasks improve the data-to-text model's performance without introducing extra parameters. Furthermore, the results show that our method not only generalizes well but also outperforms previous methods on the BLEU, Content Selection, and Content Ordering metrics.

Related Work
Recently, neural models have become the mainstream for table-to-text generation and have obtained impressive results. Early works regard table-to-text generation as a distinct machine translation task and view a structured table as a record sequence (Lebret et al., 2016; Wiseman et al., 2017; Bao et al., 2018). Most recent works are inspired by traditional data-to-text methods and introduce explicit content selection and planning to improve the results (Puduppully et al., 2019b; Moryossef et al., 2019; Trisedya et al., 2020; Bai et al., 2020); they obtain training labels by aligning the input tables with the related summaries. However, this alignment may introduce additional errors. Some works attempt to use additional knowledge to improve the quality of the generated text. Nie et al. (2018) utilize pre-executed symbolic operations on the input table in a sequence-to-sequence model to improve the fidelity of neural table-to-text generation. Chen et al. (2019) introduce background knowledge about the entities in the table to improve the results.
In addition to introducing external knowledge, some works learn better representations for the table by explicitly modeling its structure. One line of work proposes a structure-aware seq2seq architecture, which incorporates the field information as additional input to the table encoder. Other works (e.g., Bao et al., 2018) model the table's representation at the row and column levels and utilize a dual attention decoder to generate text. Gong et al. (2019) introduce historical data for each table and utilize a self-attention-based hierarchical encoder over three dimensions (row, column, and time) to enrich the table's representation. Furthermore, another line of work proposes three auxiliary supervision tasks (sequence labeling, text auto-encoding, and multi-label classification) to help the encoder capture a more accurate semantic representation of the tables. Gong et al. (2020) also explicitly model the relations between numeric records. They pretrain a multi-layer transformer encoder to obtain records' contextual numerical value representations, and when training the data-to-text model, they replace each record's token embedding with its contextual representation from the pre-trained model. Differently, our Number Ranking task is trained jointly with the data-to-text model and can actively supervise the model to capture the numeric size relation without introducing extra parameters.

Record Encoder
Each input instance consists of three different tables T1, T2, and T3, containing records about the home team's players, the visiting team's players, and the teams' overall performance, respectively. Each cell in a table is regarded as a record. Inspired by Gong et al. (2019), we utilize two self-attention modules to model each record's context from the column and the row dimension, respectively. After that, we obtain a fused representation for each record via the record fusion gate.
Record Embedding Following previous work (Wiseman et al., 2017), we represent each record r as a 4-tuple: the entity r.e (the name of the team or player, e.g., Carmelo Anthony), the type r.t (e.g., PTS), the value r.v, and the feature r.f (e.g., home or visiting), which indicates whether the player or team competes at home. We utilize a 1-layer MLP to encode the concatenated embeddings of each record's four types of information into a dense vector r^emb_{i,j}, where i, j index the record in the i-th row and j-th column of the table, [;] denotes vector concatenation, and W_e and b_e are trainable parameters.
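As a toy illustration of the 4-tuple record embedding, the following sketch uses hypothetical 2-dimensional embedding tables and an identity-weight MLP with a tanh activation; real embeddings and weights are learned, and the names and values here are purely illustrative:

```python
import math

# hypothetical toy embedding tables, 2 dimensions per field (real ones are learned)
EMB = {
    "entity":  {"Carmelo Anthony": [0.1, 0.2]},
    "type":    {"PTS": [0.3, 0.1]},
    "value":   {"30": [0.5, 0.4]},
    "feature": {"visiting": [0.0, 0.6]},
}

def embed_record(entity, rtype, value, feature):
    # concatenate the four field embeddings: [r.e; r.t; r.v; r.f]
    x = (EMB["entity"][entity] + EMB["type"][rtype]
         + EMB["value"][value] + EMB["feature"][feature])
    # 1-layer MLP; here W_e is the identity and b_e is zero, with tanh activation
    return [math.tanh(v) for v in x]

r_emb = embed_record("Carmelo Anthony", "PTS", "30", "visiting")
print(len(r_emb))  # 8: four 2-d field embeddings concatenated
```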
Column-wise Encoder To capture the numeric size relation between records, we adopt a self-attention module to model each record in the context of the other records in the same column and obtain the column-dimension representation vector r^col_{i,j}, where W^col_1, W^col_2, and W^col_3 are trainable parameters and R denotes the number of rows in the table.
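A minimal scalar sketch of self-attention over one column: each (toy, one-dimensional) record attends over all records in the same column. The learned projections W^col are omitted for clarity, and the plain dot product stands in for the scoring function:

```python
import math

def softmax(xs):
    # numerically stable softmax over a list of scores
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def column_self_attention(col):
    # col: toy scalar features for records in one column.  Each record's
    # contextual representation is an attention-weighted sum of the column.
    out = []
    for q in col:
        weights = softmax([q * k for k in col])
        out.append(sum(w * v for w, v in zip(weights, col)))
    return out

pts = [0.9, 0.5, 0.1]            # a normalized toy PTS column
ctx = column_self_attention(pts)
print([round(c, 3) for c in ctx])
```

Each output mixes information from the whole column, which is what lets the encoder relate a record's value to the others of the same type.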
Row-wise Encoder Considering that the size relation captured by the Column-wise Encoder (CE) may help the learning of the importance relation at the row level, we place the Column-wise Encoder and the Row-wise Encoder (RE) in series (as shown in Figure 2). In other words, the input of the RE is r^col_{i,j} rather than r^emb_{i,j}. We use another self-attention module, similar to the CE, to obtain the row-dimension representation r^row_{i,j} for each record.
Record Fusion Gate The record representations from the two dimensions differ in how well they reflect a record's information. Therefore, we utilize a fusion gate to combine the two representations adaptively (Gong et al., 2019). First, we concatenate the two dimension representations of a record and utilize an MLP to obtain a general representation r^gen_{i,j}. Then, we compare the column-dimension representation with r^gen_{i,j} to obtain its importance score, where W_f1 and W_f2 are trainable parameters. Likewise, we obtain the importance score s^row_{i,j} for the row-dimension representation r^row_{i,j}. Finally, the fused record representation r^f_{i,j} is obtained and will be used as the input of the text decoder.
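The fusion gate can be sketched with scalars: a general representation is built from both views, each view gets an importance score against it, and the fused record is their normalized weighted combination. The learned MLP and W_f projections are replaced here by simple stand-ins, so this is only an illustration of the gating idea:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def fuse(r_col, r_row):
    # toy scalar version of the record fusion gate
    r_gen = 0.5 * (r_col + r_row)      # stand-in for the MLP over [r_col; r_row]
    s_col = sigmoid(r_col * r_gen)     # importance score of the column view
    s_row = sigmoid(r_row * r_gen)     # importance score of the row view
    z = s_col + s_row                  # normalize the two scores
    return (s_col / z) * r_col + (s_row / z) * r_row

print(round(fuse(0.8, 0.2), 3))        # leans toward the stronger view
```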

Reasoning Module
As mentioned in Section 1, some factual results in the text require reasoning over the entities' performance and the relationships between them. Therefore, it is necessary to endow the model with reasoning ability. To achieve this, we first build an entity graph according to the entities' relationships in the input tables, as shown in Figure 1 (c). We then leverage Graph Neural Networks (GNNs) to perform reasoning. In the following, we describe the details of the reasoning process.
First, we obtain the initialized representation for each entity in the tables via the Entity Node Initialization (ENI) module. Considering that different records in the same row may not contribute equally, we combine them dynamically with an attention mechanism. We first compute a general representation vector e^gen_i for the entity e_i by mean-pooling over the records in the same row. Then we compare each record in the i-th row with e^gen_i and obtain the initialized entity representation e^0_i by a weighted sum.
After obtaining the initial representations of entities, we adopt graph neural networks to propagate entity node information to neighbors. Inspired by GAT (Velickovic et al., 2018), we use multi-head attention to measure the relatedness between the target entity node e_i and its neighbor nodes at layer l, where j ∈ N_i and N_i denotes the neighbor node set of the target entity e_i. The neighbor entities may include information that is not relevant to the target entity. Therefore, we modify the way information flows in GAT: we incorporate gate mechanisms into information aggregation to filter out noise from neighbor nodes and extract useful information, which we name GatedGAT. The representation e^l_i of e_i at layer l is calculated accordingly, where W^l is a learnable parameter. The entities' representations {e^L_i}^R_{i=1} at the last layer L are employed in the text decoder.
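A toy single-head, scalar sketch of one GatedGAT layer (not the paper's multi-head implementation): attention weights are computed over neighbors as in GAT, and a gate then decides how much of the aggregated neighbor message to mix into the node's own state. Entity names and feature values are hypothetical:

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_gat_layer(nodes, neighbors):
    # nodes: {name: scalar feature}; neighbors: {name: [neighbor names]}
    new = {}
    for i, ns in neighbors.items():
        scores = softmax([nodes[i] * nodes[j] for j in ns])  # attention over neighbors
        msg = sum(a * nodes[j] for a, j in zip(scores, ns))  # aggregated message
        g = sigmoid(nodes[i] * msg)                          # gate in (0, 1)
        new[i] = (1 - g) * nodes[i] + g * msg                # gated update
    return new

entities = {"Celtics": 1.0, "Thomas": 0.8, "Knicks": -0.5}
graph = {"Celtics": ["Thomas", "Knicks"],
         "Thomas": ["Celtics"],
         "Knicks": ["Celtics"]}
updated = gated_gat_layer(entities, graph)
print(sorted(updated))
```

The gate lets a node keep most of its own state when neighbor information is irrelevant, which is the motivation for GatedGAT over plain GAT aggregation.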

Decoder with Dual Attention
To make use of both record-level and row-level semantic information, we adopt a dual attention mechanism. Specifically, at decoding step t, the input of the LSTM unit is the embedding of the previously predicted word y_{t-1}. Given the decoder state d_t, we first calculate the row-level attention β_{t,i} based on the similarity between the decoder state d_t and the entities' representations. Then we compute the record-level attention α_{t,i} over all the record representations {r^f_{i,j}}, normalized among records in the same row. Finally, we fuse these two levels of attention to obtain the context representation. Given a reference output {y_i}^T_{i=1}, we use the cross-entropy loss as the objective function of table-to-text generation:
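The dual attention fusion can be sketched with scalars: a row-level distribution β weights the rows, a record-level distribution α is normalized within each row, and each record's final contribution to the context is β_i · α_{i,j}. All values below are toy numbers:

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def dual_attention(d, entities, records):
    # d: toy scalar decoder state; entities: one feature per row;
    # records: records[i][j] is the j-th record feature of row i.
    beta = softmax([d * e for e in entities])        # row-level attention
    ctx = 0.0
    for i, row in enumerate(records):
        alpha = softmax([d * r for r in row])        # record-level, per row
        # fuse: each record's effective weight is beta_i * alpha_{i,j}
        ctx += beta[i] * sum(a * r for a, r in zip(alpha, row))
    return ctx

c = dual_attention(1.0, [0.9, 0.1], [[0.5, 0.3], [0.2, 0.1]])
print(round(c, 3))
```

Because β and each per-row α are probability distributions, the fused weights β_i · α_{i,j} also sum to one over all records.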

Auxiliary Supervision Task
Liu et al. (2019) have shown that a single encoder without any auxiliary assistance may not effectively capture an accurate semantic representation. Inspired by this, we propose two auxiliary tasks, Number Ranking (NR) and Importance Ranking (IR), to help the Column-wise Encoder and the Row-wise Encoder capture the size relation and the relative importance relation among records, respectively.
Number Ranking In practice, many tables mainly comprise numeric records. Unlike text-type content, numerical content carries little semantic information beyond the size relation. The size relation means that the value of a record is larger or smaller than others, and it plays an essential role in record selection. For example, humans tend to focus on the highest scores or the fewest faults in a basketball game table. Therefore, it is necessary to incorporate the size relation into the record representation. To achieve this, we propose an auxiliary supervision task named Number Ranking (NR) to supervise the learning of the Column-wise Encoder. As shown at the top of Figure 2, we take a list of records in the column PTS to illustrate how it works. Specifically, we regard the PTS column of the table as an unordered set of records C = {r_1, r_2, ..., r_R}, and the goal is to generate a sequence of record pointers in descending order of value. We adopt Pointer Networks (Vinyals et al., 2015) to solve this problem, taking the output of the Column-wise Encoder r^col_i (we omit the column index) as its input. Let z = z_1, ..., z_R denote the sequence of ranked record indices; each z_k points to an input record and lies between 1 and R. As shown in Figure 2, we use an LSTM as the decoder, with MeanPooling({r_i}^R_{i=1}) as the initialization of its first hidden state. At each decoding step t, we calculate a distribution over the input records, where W_nr is a trainable parameter and p^n_{t,i} denotes the probability that the output points to record r_i at step t. We take the cross-entropy loss for this task.
Importance Ranking When people describe a player's performance in a basketball game, they tend to focus on his relatively important records and describe those first.
Consequently, we introduce the Importance Ranking (IR) task to supervise the Row-wise Encoder to capture the relative importance relations between records in the same row. This task's input is a sequence of records in the same row, and the output is a sequence of those records in descending order of importance. We employ a pointer network similar to the one used in the Number Ranking task to model this task. However, unlike records in the same column, records in the same row cannot be directly compared, as they represent different meanings. To address this issue, we take the rank of each record within its column as an importance indicator. The bottom left of Figure 2 shows an example of calculating the importance scores for the records in the last row of the table.
The input of the decoder is the output of the Row-wise Encoder {r^row_j}. The output is the input records sorted in ascending order of their importance scores, i.e., most important first, since a smaller column rank indicates greater importance. Let p^s_{t,j} denote the probability of pointing to record r_j at decoding step t; the loss function for this task is defined analogously to that of the NR task.
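As a concrete illustration, the pointer targets for both ranking tasks can be built from a toy table; the column names and values below are hypothetical, and the actual tasks are of course learned by pointer networks over encoder states rather than computed directly:

```python
def number_ranking_targets(values):
    # NR pointer targets: z_k is the index of the k-th largest record
    # in a column, i.e., the records in descending order of value.
    return sorted(range(len(values)), key=lambda i: -values[i])

def column_ranks(table):
    # table: {column_type: [values, one per row]}.  A record's importance
    # indicator is its rank within its own column (0 = largest value),
    # which makes records of different types comparable within a row.
    ranks = {}
    for col, vals in table.items():
        order = number_ranking_targets(vals)
        ranks[col] = [0] * len(vals)
        for rank, i in enumerate(order):
            ranks[col][i] = rank
    return ranks

def importance_order(table, row):
    # IR target for one row: column types sorted by that row's column rank,
    # most important (smallest rank) first.
    ranks = column_ranks(table)
    return sorted(table, key=lambda col: ranks[col][row])

toy = {"PTS": [29, 18, 3], "REB": [3, 7, 11], "AST": [6, 2, 4]}
print(number_ranking_targets(toy["PTS"]))  # [0, 1, 2]
print(importance_order(toy, 2))            # ['REB', 'AST', 'PTS']
```

Row 2's player scored only 3 points but grabbed 11 rebounds, so REB outranks PTS for him, matching the intuition that humans would describe his rebounding first.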

Loss Function and Training
The two auxiliary tasks are trained together with the table-to-text task, and the overall objective function consists of three parts: the table-to-text loss and the two auxiliary losses, weighted by tunable hyper-parameters λ_1 and λ_2.
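The combination is a straightforward weighted sum; a minimal sketch, with the λ values taken from the ROTOWIRE setting reported in the implementation details:

```python
def total_loss(l_text, l_nr, l_ir, lam1=0.9, lam2=0.25):
    # Overall objective: table-to-text cross-entropy plus the Number Ranking
    # and Importance Ranking losses, weighted by tunable hyper-parameters.
    return l_text + lam1 * l_nr + lam2 * l_ir

print(total_loss(2.0, 1.0, 0.4))  # 2.0 + 0.9*1.0 + 0.25*0.4 = 3.0
```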

Dataset and Evaluation Metrics
We conduct experiments on both the ROTOWIRE and RW-FG datasets. Both comprise pairs of NBA basketball game statistics and summaries. There are two main differences between ROTOWIRE and RW-FG; the first concerns the team statistics.
Following previous works, we use BLEU and three extractive evaluation metrics, Relation Generation (RG), Content Selection (CS), and Content Ordering (CO) (Wiseman et al., 2017), to evaluate the table-to-text results. More specifically, RG measures the content fidelity of the generated text, CS measures how well the generated text matches the reference in selecting which records to generate, and CO measures the ability on content planning. We refer the readers to Wiseman et al. (2017) for more detailed information on these extractive metrics.
We apply Accuracy (Acc) and the normalized Damerau-Levenshtein Distance (DLD) (Brill and Moore, 2000) to evaluate the two auxiliary supervision tasks. Accuracy measures the percentage of record sequences whose absolute positions are all correctly predicted (Logeswaran et al., 2018).
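For reference, a normalized DLD between a predicted and a gold ranking can be computed with the restricted (optimal string alignment) Damerau-Levenshtein distance; the normalization convention below (1 minus distance over the longer length, so 1.0 is a perfect ranking) is one common choice and is assumed here rather than taken from the paper:

```python
def dld(a, b):
    # restricted Damerau-Levenshtein (optimal string alignment) distance
    n, m = len(a), len(b)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[n][m]

def normalized_dld(pred, gold):
    # 1.0 means the predicted ranking matches the gold ranking exactly
    return 1.0 - dld(pred, gold) / max(len(pred), len(gold), 1)

print(normalized_dld([0, 2, 1, 3], [0, 1, 2, 3]))  # one transposition -> 0.75
```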

Implementation Details
To make a fair comparison, we follow the configurations of Puduppully et al. (2019a) and Gong et al. (2019). For the table-to-text model, we set the word embedding and LSTM decoder hidden sizes to 600. We set the number of GatedGAT layers to 2 and the number of heads to 2. We employ a two-layer LSTM decoder with input feeding during text generation.
We apply dropout at a rate of 0.3. For training the text decoder, we use truncated BPTT with a truncation size of 100, and we set the beam size to 5 during inference. For the two auxiliary tasks, we employ two one-layer LSTMs as decoders, each with a hidden size of 600, and use greedy search at inference time. We tune λ_1 between 0.8 and 1.0 and λ_2 between 0.2 and 0.4; finally, we set them to 0.9 and 0.25 on ROTOWIRE and to 1.0 and 0.4 on RW-FG. All experiments are conducted on an NVIDIA Tesla V100. The code of our model can be found at https://github.com/liang8qi/Data2TextWithAuxiliarySupervision.

Baselines
We compare our method with several strong baselines, including:
• TEMP (Wiseman et al., 2017) is a template-based method. We refer the readers to that paper for detailed information on the templates.
• CC (Wiseman et al., 2017) is a standard encoder-decoder system with a conditional copy mechanism.
• NCP (Puduppully et al., 2019a) and NCP + TR (Wang, 2019) explicitly model content selection and planning.
• DU and DUV (Gong et al., 2020): DU brings the sense of value comparison into content planning, and DUV further introduces content plan verification into DU.

Main Results
Automatic Evaluation Our results on the two test sets are summarized in Table 1. On ROTOWIRE, compared with previous neural models, our method achieves state-of-the-art results on Content Selection (CS), Content Ordering (CO), and BLEU. More specifically, compared with the previous best neural models, we obtain an improvement of more than 4 points on CS-P and achieve the best results on CS-R. This implies that our method can generate text containing more salient records. Compared with NCP, DU, and DUV, our method scores highest on CO, even without explicitly modeling content selection and planning. This indicates that our model can better organize the records when generating a summary for the input tables. We see two main reasons. The first is that our Reasoning Module can learn a better entity representation at the row level. The other is that our proposed auxiliary tasks supervise the Record Encoder to learn number-aware and relative-importance-aware record representations. As a result, the data-to-text model can make good content planning by considering the entity's performance and the relative importance of each record.
As shown in Table 1, the results on RW-FG follow a pattern similar to ROTOWIRE. We notice that all models perform better on RW-FG than on ROTOWIRE. We attribute this improvement to the purification of the data in RW-FG: Wang (2019) removes the sentences that are not supported by the input tables, which reduces the noise in the text and improves the dataset's quality. Because of this, more accurate content planning labels can be obtained to train the models that explicitly model content planning (NCP, NCP+TR), leading to better performance; this is why NCP outperforms ENT on RW-FG. However, the purification may also make the task easier, because some sentences that are not supported by the tables directly but can be obtained by reasoning may be removed as well. This may weaken the Reasoning Module of our model.
Nevertheless, we still outperform the compared baselines. Table 2 shows the performance of our model, trained jointly with the two auxiliary tasks, on the two auxiliary tasks themselves. We compare it with two baselines. The first, Original, denotes a method that simply outputs the input record sequence. In addition, we separately train our model on each of the two auxiliary tasks, denoted as Separate. Our model achieves performance comparable to Separate and much better than Original, even though it only uses greedy search at test time. The results indicate that the two auxiliary tasks can help the Record Encoder capture the size relation and the relative importance relation among records.
Ablation Study First, we examine the effect of changes in the model structure on the results. In Table 3, Our Model denotes our data-to-text model without the two auxiliary tasks. We change the connection mode between the Column-wise Encoder (CE) and the Row-wise Encoder (RE) from series to parallel (-Series). Moreover, we replace the Reasoning Module with a row-level encoder with the content selection gate proposed by Puduppully et al. (2019a) (-RM). According to the results, the serial connection and the Reasoning Module both contribute to the overall performance, since BLEU, CS, and CO drop significantly after removing them from the full model.
Furthermore, we investigate the impact of the two auxiliary tasks on table-to-text generation. Table 3 shows that both the Number Ranking (NR) and Importance Ranking (IR) tasks improve our basic model, which indicates that it is necessary to explicitly model the size relation and the relative importance relation between records. We notice that the model's performance degrades on CS-F1 and CO when only the IR task is introduced. We believe this is because modeling the relative importance relation in the row dimension depends heavily on the size relation in the column dimension, and the CE cannot accurately capture the size relation between records without direct supervision.
Finally, we compare the two auxiliary tasks with a method that adds feature vectors for the number ranking and relative importance directly to the Record Embedding. Specifically, we first introduce the embedding of the number ranking (+ NE) and then further add the embedding of the records' relative importance (+ IE). As shown in the third section of Table 3, NE only improves the model on RG. When IE is also incorporated, the model achieves better performance on almost all metrics. However, the improvement is not as significant as that of the auxiliary tasks. We believe that introducing auxiliary supervision tasks may be a more effective way to capture an accurate semantic representation than adding feature vectors directly.
Human Evaluation To examine whether human judgments corroborate the improvements in the automatic evaluation metrics, we conducted a human evaluation. Three graduate students with basketball background knowledge and good English reading ability were invited to conduct the evaluation. We compared our best-performing model against Gold, NCP, ENT, and HETD. Specifically, we randomly selected 30 games from the test set, and each game was rated by three workers. For each game, we arranged every 5-tuple of summaries into ten pairs. Given each pair, the participants were asked to choose which summary is better according to five criteria: Supporting (does the summary contain more supported facts?), Contradicting (does the summary contain more contradicting facts?), Grammaticality (is the summary fluent and grammatical?), Coherence (do the sentences in the summary follow a coherent discourse?), and Conciseness (does the summary contain less redundant information and fewer repetitions?). Following previous work (Puduppully et al., 2019a), we calculated a model's score for each criterion as the difference between the percentage of times the model was chosen as the best and the percentage of times it was chosen as the worst.
The results are summarized in Table 5. As can be seen, the gold texts have significant advantages in Contradicting, Grammaticality, Coherence, and Conciseness. Compared with the other neural methods, our method receives the highest scores in Coherence and Grammaticality, which implies that it can generate texts containing well-organized facts. Though the ENT model outperforms ours in Contradicting and Conciseness, our method can easily be applied to it, which we leave for future work.

Conclusion
In this work, we make two main contributions. First, we introduce a reasoning module into a hierarchical table encoder, which endows the model with reasoning ability. Second, we propose utilizing different auxiliary supervision tasks to help the encoder capture the different relations between records. In detail, the Number Ranking (NR) task supervises the Column-wise Encoder to model the numeric size relation between records in the same column, and the Importance Ranking (IR) task helps the Row-wise Encoder capture the relative importance between records in the same row. Experimental results on the ROTOWIRE and RW-FG datasets demonstrate the effectiveness of our method. Furthermore, we migrate our method to the NCP model and significantly improve its performance on ROTOWIRE, which indicates that our proposed method generalizes well.

B Impact of Different Ranking Directions
We also explore the impact of different settings for Number Ranking and Importance Ranking on the data-to-text model. The results are summarized in Table 7. We observe that, compared with the basic model, almost all the settings improve the data-to-text model on Content Selection (CS), Content Ordering (CO), and BLEU. This indicates that the proposed two tasks are effective and robust.