How Helpful is Inverse Reinforcement Learning for Table-to-Text Generation?

Existing approaches for the Table-to-Text task suffer from issues such as missing information, hallucination and repetition. Many approaches to this problem use Reinforcement Learning (RL), which maximizes a single manually defined reward, such as BLEU. In this work, we instead pose the Table-to-Text task as an Inverse Reinforcement Learning (IRL) problem. We explore using multiple interpretable unsupervised reward components that are combined linearly to form a composite reward function. The composite reward function and the description generator are learned jointly. We find that IRL outperforms strong RL baselines only marginally. We further study the generalization of learned IRL rewards in scenarios involving domain adaptation. Our experiments reveal significant challenges in using IRL for this task.


Introduction
Table-to-Text generation focuses on explaining tabular data in natural language. This is increasingly relevant due to the vast amounts of tabular data created in domains including e-commerce, healthcare and industry (for example, infoboxes in Wikipedia, tabular product descriptions on online shopping sites, etc.). Table-to-Text can make data easily accessible to non-experts and can automate pipelines such as the auto-generation of product descriptions. Traditional methods approached the general problem of converting structured data to text using slot-filling techniques (Kukich, 1983; Reiter and Dale, 2000; McKeown, 1992; Cawsey et al., 1997; Konstas and Lapata, 2013; Flanigan et al., 2016). While recent advances in data-to-text generation using neural networks (Sutskever et al., 2011; Mei et al., 2015; Gardent et al., 2017; Wiseman et al., 2017; Song et al., 2018; Zhao et al., 2020) have led to improved fluency, current systems still suffer from issues such as lack of coverage (where the generated text misses information present in the source), repetition (where the generated text repeats information) and hallucination (where the generated text asserts information not present in the source) (Lee et al., 2019). A significant reason for these issues is that models often lack explicit inductive biases to avoid these problems. Most extant approaches utilize Reinforcement Learning-based (RL) training, using a single reward (such as BLEU or a task-specific reward) that optimizes for a specific aspect. For example, Liu et al. (2019) and Nishino et al. (2020) use domain-specific rewards to improve the accuracy of medical report generation.

* Authors contributed equally.
However, defining a single reward that addresses all of the above-described issues is difficult. To use multiple reward components with RL, one has to manually find an optimal set of weights for each component, either through trial and error or through an expensive grid search that becomes infeasible as the number of reward components increases. Inverse Reinforcement Learning (Abbeel and Ng, 2004; Ratliff et al., 2006; Ziebart et al., 2008) is a natural approach for this task, since it can learn an underlying composite reward function from labeled examples incorporating multiple rewards. Motivated by existing applications of IRL in other domains and tasks (Finn et al., 2016; Fu et al., 2017; Shi et al., 2018), we explore its utility for Table-to-Text generation. We diverge from previous work on IRL in designing a set of intuitive and interpretable reward components that are linearly combined to obtain the reward function. Figure 1 illustrates the overall idea of this work.

[Figure 1: The Table-to-Text task under the Inverse Reinforcement Learning framework using multiple reward components.]

We learn a "Description Generator" (also referred to as the policy) to generate descriptions given a table. The IRL framework includes a "Reward Approximator" that leverages the "expert" (ground-truth) descriptions corresponding to the tables to jointly learn the underlying composite reward function combining multiple reward components such as "Recall", "Fluency", etc. This composite reward function quantifies the quality of the generated descriptions. We find that IRL performs on par with RL baselines. To investigate when IRL helps and when it does not, we conduct experiments evaluating the generalization capabilities of IRL in limited-data settings and identify challenges involved in IRL training. Our contributions are:

• We formulate a set of interpretable reward components and learn the composite linear reward function in a data-driven manner for Table-to-Text generation.
• We study the utility of IRL for Table-to-Text generation.

Method
The training data for this task consists of pairs of tables and corresponding natural language descriptions, as shown in Figure 1. A table T is a sequence of tuples of slot types (e.g., "Name") and slot values (e.g., "Asgar"), and D denotes the expert description. We formulate the Table-to-Text task as generating D from the source table T. In the rest of this section, we first explain how to formulate Table-to-Text under the IRL framework, followed by the formulation of the reward components and a brief description of the text generation network that is at the core of our method. Code and dataset splits for the paper are provided at https://github.com/issacqzh/IRL_Table2Text.

Table-to-Text as IRL
We pose Table-to-Text under the IRL framework, where we aim to jointly learn a policy for generating a description from the table and the underlying composite reward function. At the core of our approach is a neural description generator that we adapt from Wang et al. (2018). The description generator is first trained using maximum likelihood estimation (MLE) and then fine-tuned using Maximum Entropy Inverse Reinforcement Learning (MaxEnt IRL) (Ziebart et al., 2008). Under the MaxEnt IRL framework, we iteratively perform two steps: (1) approximate the underlying composite reward function by leveraging the expert descriptions and the current description-generation policy; (2) using the updated reward function, update the current policy for description generation with RL. In this work, we model the composite reward R_φ(D) as a linear combination of multiple reward components.
$$R_\phi(D) = \sum_{t=1}^{\tau} \phi^\top C_t \quad (1)$$

where φ is a weight vector, C_t is the vector of reward-component values at step t of a generated description, and τ denotes the total number of steps. Following the MaxEnt IRL paradigm, we assume the expert descriptions come from a log-linear distribution p_φ(D) over reward values. The objective of the reward approximator, J_r(φ), is to maximize the likelihood of the expert descriptions. The partition function of p_φ(D) is approximated using importance sampling from the learned description-generation policy. For the sake of brevity we skip the mathematical derivation here; please refer to Appendix A.1 for the detailed derivation. We draw N expert descriptions and M descriptions from the learned policy. The gradient of the objective J_r(φ) w.r.t. the reward-function parameters φ is then the difference between the expected expert reward and the expected reward obtained by the policy (Ziebart et al., 2008):

$$\nabla_\phi J_r(\phi) \approx \frac{1}{N} \sum_{i=1}^{N} \nabla_\phi R_\phi(D_i) - \frac{1}{\sum_{j} \beta_j} \sum_{j=1}^{M} \beta_j \nabla_\phi R_\phi(D_j) \quad (2)$$

where D_i and D_j are drawn from the training data and the learned policy respectively, and the β_j are importance-sampling weights.
The linear functional form of the reward simplifies each individual weight update to a simple difference between the expected expert value and the expected rolled-out value of that component under the policy. The weight update for component c is

$$\Delta\phi_c = \frac{1}{N} \sum_{i=1}^{N} c_i - \frac{1}{\sum_{j} \beta_j} \sum_{j=1}^{M} \beta_j c_j \quad (3)$$

where c_i is the total value of reward component c over all steps of the i-th expert description and c_j is the total value over all steps of the j-th generated description. To stabilize training when learning the description-generation policy, we mix a weighted MLE loss with the policy-gradient loss before backpropagation. Please refer to the supplementary material (Appendix A.5) for model training details.
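The per-component update above can be sketched as follows; the function name, the learning rate, and the final renormalization of the weights are illustrative assumptions rather than the paper's exact implementation:

```python
import numpy as np

def update_component_weights(phi, expert_totals, policy_totals, betas, lr=0.01):
    """One reward-weight update for the linear composite reward: move each
    component weight phi_c along the difference between the expected expert
    component value and the importance-weighted expected value under the policy.

    phi:           (K,)   current component weights
    expert_totals: (N, K) per-description component totals c_i (expert)
    policy_totals: (M, K) per-description component totals c_j (sampled)
    betas:         (M,)   importance-sampling weights for the samples
    """
    expert_mean = expert_totals.mean(axis=0)
    policy_mean = (betas[:, None] * policy_totals).sum(axis=0) / betas.sum()
    grad = expert_mean - policy_mean          # gradient of J_r w.r.t. phi_c
    phi = phi + lr * grad                     # gradient ascent step
    return phi / phi.sum()                    # renormalize to sum to 1 (assumption)
```

The renormalization reflects that the reported learned weights sum to roughly one; the paper does not state this step explicitly.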

Reward Components
We aim to find a reward function that can combine multiple characteristics present in a good description such as faithfulness to the table and fluency.
To encourage faithfulness, we use recall and reconstruction as reward components, while to characterize grammatical correctness and fluency we use repetition and perplexity. We also consider BLEU score as a reward component. BLEU is a supervised reward component as it requires ground-truth descriptions for its computation. However, all other reward components are unsupervised.
• Recall: The fraction of slot values in the table that are mentioned in the description.
• Reconstruction: We use QA models to extract answers from the description against a few "extractor" slot types (for example, "What is the name of the person in the description?" is used as the question for the slot type "Name ID") and score their overlap with the table's slot values. Detailed descriptions of the reconstruction and perplexity components are provided in Appendix A.3.
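As an illustration, the recall component can be sketched as follows; the case-insensitive substring matching is a simplifying assumption, since the paper does not specify how slot-value mentions are detected:

```python
def recall_reward(table, description):
    """Recall component: fraction of the table's slot values mentioned in
    the description. `table` is a list of (slot_type, slot_value) tuples;
    substring matching stands in for whatever matching rule the paper uses."""
    values = [value for _, value in table]
    if not values:
        return 0.0
    text = description.lower()
    mentioned = sum(1 for v in values if v.lower() in text)
    return mentioned / len(values)
```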

Experiments and Results
In this section we describe our experiments and their results in detail.

Data and Metrics
Wang et al. (2018) proposed a dataset of tables and their corresponding descriptions related to people and animals from Wikipedia. However, the originally released dataset is noisy (many descriptions have low precision/recall, most examples have very few distinct slot types, etc.). For our experiments, we filtered this dataset to obtain a smaller, high-quality dataset of 4623 examples using the following criteria: (1) recall (defined in §2.2) of 1.0; (2) high precision (fraction of entities in the description mentioned in the table), greater than 0.7; (3) number of distinct slot types greater than 6. We split the entire dataset as 80%, 10% and 10% for training, validation and testing respectively. Details of the dataset are provided in Appendix A.2. To aid reproducibility, we make the data splits we used publicly available.
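The three filtering criteria can be sketched as a simple predicate; `keep_example` is a hypothetical name, and the `recall`/`precision` arguments are assumed to be precomputed as defined above:

```python
def keep_example(table, recall, precision):
    """Return True if an example passes all three filtering criteria:
    perfect recall, precision above 0.7, and more than six distinct slot
    types. `table` is a list of (slot_type, slot_value) tuples."""
    distinct_slot_types = len({slot_type for slot_type, _ in table})
    return recall == 1.0 and precision > 0.7 and distinct_slot_types > 6
```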
For evaluation, we report BLEU (Papineni et al., 2002) and ROUGE (Lin, 2004) along with their harmonic mean (referred to as F1 from here on). Additionally, we report the mean reward values for Recall and Perplexity as proxies for the faithfulness and fluency of the generated descriptions, respectively.

Table 1 shows the performance of models trained using maximum likelihood estimation (MLE), RL and IRL. For RL and IRL we report results with various sets of reward components. When using multiple reward components with RL, we take the total reward to be the uniformly weighted sum of the components. We note that while the IRL variants achieve higher performance than the RL methods on all metrics, the gain is marginal.
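A minimal sketch of the F1 metric, the harmonic mean of BLEU and ROUGE:

```python
def f1_metric(bleu, rouge):
    """Harmonic mean of BLEU and ROUGE, the F1 score reported in the tables."""
    if bleu + rouge == 0:
        return 0.0
    return 2 * bleu * rouge / (bleu + rouge)
```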

Automatic Evaluation
In Table 1 we choose the best model for each setting based on performance on the validation split. For the best IRL (All) model, the learned weights for repetition, recall, BLEU, reconstruction and perplexity are 0.02, 0.12, 0.65, 0.05 and 0.15 respectively. However, we noticed that the weights of the IRL reward components failed to converge in our training runs. This is a consequence of the fact that reward components such as BLEU achieve their maximum value on the ground-truth description, and their value drops quickly as descriptions diverge from it. Thus the gap between the expert BLEU value and the value achieved by the model is always large, hindering the convergence of the weights in IRL (Eqn 3). This results in a peaked weight distribution in which the model favors the BLEU reward component excessively. We attempt to scale down the expert BLEU reward values using a multiplier, which we update dynamically with an adaptive binary search method (refer to Appendix A.4 for details) to induce convergence of the weights. We observe that the multiplier acts as a "regularizer", yielding more balanced weights across the reward components. For example, when we train IRL with BLEU, recall and perplexity without the multiplier, the learned component weights are 0.72, 0.15 and 0.13 respectively; with the multiplier, they are 0.45, 0.31 and 0.24. The model variants using the multiplier obtain the best F1 score, as seen in the second-to-last row of Table 1.
We also find that having more reward components does not help IRL improve significantly. We note that IRL using all reward components gets the best BLEU but suffers a marginal drop in ROUGE.

Domain adaptation
To evaluate if rewards learned using IRL generalize better to unseen data distributions, we evaluate it for scenarios involving domain adaptation. For this, we divide the dataset into disjoint subsets of categories involving people in sports, academia, art, etc.
(category details in Appendix A.2). Each category has a different table schema. We train RL and IRL models on one category and test them on a different category. Since training on a single category limits the amount of labelled data, we train with unsupervised rewards that do not rely on the ground truth. Table 2 shows the F1 results when using IRL and RL with recall, perplexity and reconstruction. For each training category, we show results on the test category with the highest absolute relative change in F1. We notice mixed results. For instance, when training on the "Sports" domain, IRL's performance is much worse than RL's. This may be because the slot types with high frequency in the "Sports" category differ significantly from those in all other categories; IRL may thus learn a reward function that overfits the domain and generalizes worse than a fixed reward function. However, in several cases IRL leads to large improvements (e.g., when training on Politics, Law, and Military), indicating the promise of this method in limited-data settings.

Discussion
We highlight some challenges with IRL training that potentially prevent IRL from performing significantly better than RL baselines. Further, we discuss qualitative differences between the RL and IRL models.

Challenges in IRL training
Importance of reward components: During training, the values of most reward components are close for expert and generated descriptions. However, the BLEU values of generated descriptions are much smaller than those of expert descriptions. This overshadows the contribution of the other reward components irrespective of the weights assigned to them. Since BLEU optimizes for n-gram overlap with the expert text, dropping this component is undesirable, as doing so leads to text degeneration. As described in Section 3.2, we use adaptive multipliers to alleviate its dominance; however, their effect is limited, and the method does not correspond to optimizing a fixed objective.

Unstable training: To stabilize training, we mix the weighted MLE (cross-entropy) loss and the policy-gradient objective. However, these losses can differ greatly in scale. Assigning a larger weight to the MLE loss diminishes the contribution of the reward components, while a larger weight on the policy gradient leads to degeneration. These observations indicate the need for future research on training paradigms and better-designed reward components to address these challenges.
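The loss mixing described above can be sketched as a convex combination; `mixed_loss` is a hypothetical name, and the 0.9/0.1 split follows the validation-tuned weights reported in Appendix A.5:

```python
def mixed_loss(mle_loss, policy_gradient_loss, pg_weight=0.9):
    """Convex mix of the MLE (cross-entropy) loss and the policy-gradient
    loss used to stabilize IRL fine-tuning. A larger pg_weight emphasizes
    the reward signal; a smaller one emphasizes likelihood matching."""
    return pg_weight * policy_gradient_loss + (1.0 - pg_weight) * mle_loss
```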

Qualitative analysis
Using only BLEU as the reward leads to generated descriptions that fit a general template resembling descriptions from the most common category ("Sports"). Including other reward components helps the model avoid this behavior. We still observe hallucination from both IRL and RL fine-tuned models. However, hallucinated information from IRL fine-tuned models often matches the overall theme (for example, the model generates incorrect football league names but gets the name of the club mentioned in the table correct). Appendix A.7 shows an example of a description generated by the IRL (All) model.

Conclusion
We present an approach using IRL for Table-to-Text generation with a set of interpretable reward components. While the approach outperforms RL, the improvements are marginal, and we identify several challenges. In particular, using metrics like BLEU as reward components is problematic, since they hinder weight convergence in IRL. Based on our study, the application of IRL to Table-to-Text generation would broadly benefit from better-calibrated reward components and improvements in training paradigms. We hope our exploration encourages the community to engage in these directions of future work.

A.1 Derivation of the reward approximator objective

We assume the reward of a description is the sum of the rewards at each step. Let q_θ(D) be the policy for description generation. Following MaxEnt IRL, the expert descriptions are modeled with the log-linear distribution

$$p_\phi(D) = \frac{\exp(R_\phi(D))}{Z} \quad (4)$$

where Z is the partition function, and we maximise the log-likelihood of the samples in the training set:

$$J_r(\phi) = \frac{1}{N} \sum_{i=1}^{N} \log p_\phi(D_i) \quad (5)$$

The gradient w.r.t. the reward parameters is given by

$$\nabla_\phi J_r(\phi) = \frac{1}{N} \sum_{i=1}^{N} \nabla_\phi R_\phi(D_i) - \mathbb{E}_{D \sim p_\phi}\left[\nabla_\phi R_\phi(D)\right] \quad (6)$$

The partition function requires enumerating all possible descriptions, which makes this intractable. We tackle this by approximating the partition function with descriptions sampled from the policy using importance sampling. The importance weight β_i for a generated description D_i is given by

$$\beta_i = \frac{\exp(R_\phi(D_i))}{q_\theta(D_i)} \quad (7)$$

The gradient is now approximated as

$$\nabla_\phi J_r(\phi) \approx \frac{1}{N} \sum_{i=1}^{N} \nabla_\phi R_\phi(D_i) - \frac{1}{\sum_{j} \beta_j} \sum_{j=1}^{M} \beta_j \nabla_\phi R_\phi(D_j) \quad (8)$$

where D_i and D_j are drawn from the training data and q_θ(D) respectively.
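The importance weights can be computed as in the following sketch; working in log space with a max shift is an implementation choice for numerical stability rather than something specified in the derivation, and the weights are returned normalized so that the β-weighted average in the approximated gradient becomes a simple weighted sum:

```python
import math

def importance_weights(rewards, policy_log_probs):
    """Normalized importance weights beta_j = exp(R_phi(D_j)) / q_theta(D_j)
    for descriptions sampled from the policy. `rewards` holds R_phi(D_j) and
    `policy_log_probs` holds log q_theta(D_j) for each sample."""
    log_betas = [r - lq for r, lq in zip(rewards, policy_log_probs)]
    shift = max(log_betas)                       # subtract max to avoid overflow
    betas = [math.exp(lb - shift) for lb in log_betas]
    total = sum(betas)
    return [b / total for b in betas]            # normalized weights, sum to 1
```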

A.2 Dataset statistics
We split the entire dataset as 80%, 10% and 10% for training, validation and testing respectively. Table 3 shows the statistics for our dataset. Table 4 shows the various disjoint category splits of our data.

A.3 Detailed description of some reward components
• Reconstruction: We use Question Answering models to extract answers from the description corresponding to a few slot types. For example, to extract the name from the description we ask the question "What is the name of the person?". The question corresponding to each slot type is pre-determined. We extract values for the four most common slot types in the dataset: "name", "place of birth", "place of death" and "country". We refer to these as "extractor slot types". The questions for these extractor slot types are "What is the name of the person in the description?", "What is the place of birth of the person in the description?", "What is the place of death of the person in the description?" and "Which country does the person in the description belong to?" respectively. Not all extractor slot types are present in every table of the dataset (for example, "place of death" is not present for a living sportsperson). Following a SQuAD-like (Rajpurkar et al., 2018) formalisation, for each slot type we train a BERT-based (Devlin et al., 2019) model to extract the answer from the description given the question. We calculate the overlap score of the predicted answer with the correct answer (the slot value from the table). The final reconstruction score is the arithmetic mean of the answer overlap scores over the extractor slot types present in the table.

• Perplexity: This is the negative perplexity of the description, which we further normalize as

$$\frac{\text{Perplexity} - \text{Perplexity}_{low}}{\text{Perplexity}_{high} - \text{Perplexity}_{low}} \quad (9)$$

where Perplexity_high and Perplexity_low are the maximum and minimum perplexities over the expert texts and the texts generated by the pretrained MLE model, respectively.
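The normalized perplexity component can be sketched as follows; the sign convention (lower perplexity yields a higher reward) follows the "negative perplexity" description, though the exact composition of negation and normalization is an assumption:

```python
def perplexity_reward(ppl, ppl_low, ppl_high):
    """Negative perplexity normalized by the min/max perplexities observed
    over expert texts and MLE-generated texts, so more fluent (lower
    perplexity) descriptions receive a higher reward."""
    return -(ppl - ppl_low) / (ppl_high - ppl_low)
```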

A.4 Learning Multiplier for BLEU
Let us assume that after the i-th iteration of IRL, we have the multiplier value m_i, and let b be the average BLEU score obtained by the model. For the (i+1)-th iteration we update the multiplier value via an adaptive binary search over m_i and b. If the resulting change in the BLEU weight is less than 0.00001, we instead increase the multiplier value by 0.1. The multiplier value is capped at a maximum of 1, and we start with an initial value of m_0 = 1.
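Only the fallback branch of this schedule is fully specified in the text, so the sketch below implements just that step (increase by 0.1 when the weight change is below 0.00001, capped at 1) and leaves the binary-search update as a placeholder:

```python
def update_multiplier(m, bleu_weight_change, step=0.1, threshold=1e-5):
    """Fallback branch of the adaptive multiplier schedule: if the learned
    BLEU weight barely moved between iterations, bump the multiplier by 0.1,
    capped at 1. The adaptive binary-search branch is not reproduced here."""
    if abs(bleu_weight_change) < threshold:
        return min(1.0, m + step)
    return m  # placeholder for the adaptive binary-search update
```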

A.5 Training details
Model parameters: We follow the same training scheme and model parameters as Wang et al. (2018).

Hyper-parameter tuning: We adapt the model and optimizer hyper-parameters from Wang et al. (2018). For choosing the weights of the cross-entropy loss and the policy-gradient loss, we tried combinations in the set {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9}, keeping the sum of the weights equal to 1. For the IRL reward-component weight updates, we sample 500 descriptions from the ground truth and 500 from the policy; we chose the size of 500 based on validation-set performance. Based on performance on the validation set, we chose 0.9 as the policy-gradient loss weight and 0.1 as the cross-entropy loss weight, which also helps bring both loss terms to the same scale.
Software and hardware specifications: All models are implemented using PyTorch 1.4.0 (Paszke et al., 2019) and related libraries such as numpy (Oliphant, 2006) and scipy (Virtanen et al., 2020). We run all experiments on a GeForce RTX 2080 GPU with 12 GB of memory. The system has 256 GB of RAM and 40 CPU cores.
Time for training and inference: One epoch of MLE training takes around 16 seconds, while an epoch of RL fine-tuning with all the reward components takes close to 150 seconds. The reward-component weight approximation stage of IRL is very fast and generally takes less than a second.
A.6 Validation set results

Table 5 shows the results on the validation set for the models in Table 1 of the main paper.