Does the Order of Training Samples Matter? Improving Neural Data-to-Text Generation with Curriculum Learning

Recent advancements in data-to-text generation largely take the form of neural end-to-end systems. One line of work improves text generation systems by changing the order of training samples, a process known as curriculum learning. Past research on sequence-to-sequence learning showed that curriculum learning helps to improve both performance and convergence speed. In this work, we apply the same idea to training samples consisting of structured data and text pairs, where at each update the curriculum framework selects training samples based on the model's competence. Specifically, we experiment with various difficulty metrics and put forward a soft edit distance metric for ranking training samples. On our benchmarks, this metric yields faster convergence, reducing training time by 38.7%, and boosts performance by 4.84 BLEU.


Introduction
Neural data-to-text generation has been the subject of much recent research. The task aims at transforming source-side structured data into target-side natural language descriptions (Reiter and Dale, 2000; Barzilay and Lapata, 2005). Training typically involves mini-batches of a fixed size that are randomly sampled from the training set and fed to the model at each training step. In this paper, we apply curriculum learning to this process, which has been explored in neural machine translation (Platanios et al., 2019; Zhou et al., 2020), and show how it can help in neural data-to-text generation.
The main idea in curriculum learning is to present the training data in a specific order, starting from easy examples and moving on to more difficult ones as the learner becomes more competent. Starting out with easier instances reduces the risk of getting stuck in local optima early in training, since the loss functions of neural models are typically highly non-convex (Bengio et al., 2009). This learning paradigm enables flexible batch configurations that consider the properties of the training material as well as the state of the learner. The idea brings two potential benefits: (1) it speeds up convergence and reduces the computational cost; (2) it boosts model performance, without having to change the model or add data.
With the release of large data-to-text datasets (e.g. Wikibio (Lebret et al., 2016), Totto (Parikh et al., 2020), E2E (Novikova et al., 2017)), neural data-to-text generation is now at a point where training speed and the order of samples may begin to make a real difference. We show the efficacy of curriculum learning with a general LSTM-based sequence-to-sequence model and define difficulty metrics that assess the training instances, using a successful competence function that estimates the model's capability during training. Such metrics have not yet been explored in neural data-to-text generation.
In this paper, we explore the effectiveness of various difficulty metrics and propose a soft edit distance metric, which leads to substantial improvements over other metrics. Crucially, we observe that difficulty metrics that consider data-text samples jointly lead to stronger improvements than metrics that consider text or data samples alone. In summary, this work makes the following contributions towards neural data-to-text generation: 1. We show that by simply changing the order of samples during training, neural models can be improved via the use of curriculum learning.
2. We explore various difficulty metrics at the level of the data, text, and data-text pairs, and propose an effective novel metric.

Related work
The idea of teaching algorithms in a manner similar to humans, incrementally from easy concepts to more difficult ones, dates back to incremental learning, which was discussed in light of theories of cognitive development relating to the processes of acquisition in young children (Elman, 1993; Krueger and Dayan, 2009; Plunkett and Marchman, 1993). Bengio et al. (2009) first demonstrated empirically that curriculum learning approaches can decrease training times and improve generalization; later approaches changed the mini-batch sampling strategy to also take model competence into account (Kocmi and Bojar, 2017; Zhou et al., 2020; Platanios et al., 2019; Liu et al., 2020; Zhang et al., 2018, 2019). Also related is the Damerau-Levenshtein distance (Damerau, 1964; Brill and Moore, 2000a), which was used as a content ordering metric in Wiseman et al. (2017) to measure the extent of alignment between data slots and text tokens.

Preliminaries of Curriculum Learning
We base our curriculum learning framework on two standard components: (1) model competence, i.e., how capable the current model is at time t, and (2) sample difficulty, which makes an independent judgement on each sample's difficulty. Specifically, we adopt the competence function c(t) for a model at time t as in Platanios et al. (2019); Liu et al. (2020):

c(t) = min(1, sqrt(t * (1 - c_0^2) / λ_t + c_0^2))    (1)

where c_0 is the initial competence and λ_t is a hyperparameter defining the length of the curriculum, set to 2.5 as in Liu et al. (2020). At each training step, the framework then decides which samples to add to each batch by comparing the competence score with the difficulty scores, as shown in Algorithm 1.
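The competence-based selection described above can be sketched as follows. This is a minimal illustration, not the paper's Algorithm 1: the function and variable names (`competence`, `sample_batch`) are our own, and we assume difficulty scores have been normalized to [0, 1] (e.g. via their empirical CDF) so they are comparable with the competence score.

```python
import math
import random

def competence(t, curriculum_len, c0=0.01):
    """Square-root competence schedule in the style of Platanios et al. (2019).
    t: current training step; curriculum_len: curriculum length in steps;
    c0: initial competence (assumed value, for illustration)."""
    return min(1.0, math.sqrt(t * (1 - c0 ** 2) / curriculum_len + c0 ** 2))

def sample_batch(samples, difficulties, t, curriculum_len, batch_size):
    """Draw a batch uniformly from the samples whose normalized difficulty
    does not exceed the model's current competence."""
    c = competence(t, curriculum_len)
    eligible = [s for s, d in zip(samples, difficulties) if d <= c]
    if not eligible:  # very early steps: fall back to the easiest samples
        eligible = [s for _, s in sorted(zip(difficulties, samples))[:batch_size]]
    return random.sample(eligible, min(batch_size, len(eligible)))
```

As training proceeds, c(t) grows toward 1, so progressively harder samples become eligible until the whole training set is in play.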

Difficulty Metrics
For ease of discussion, we denote a sequence by s, which can be either data, text, or their concatenation. For comparability, the difficulty metrics operate on tokens as tokenized by SpaCy 1 . We begin with length and word rarity, which were previously applied to text sentences by Kocmi and Bojar (2017) and Platanios et al. (2019).
Length. Length-based difficulty builds on the intuition that longer sequences are harder to encode, and that early errors may propagate during decoding, making longer sentences also harder to generate. It is defined as:

d_length(s) = |s|

Rarity. Word rarity of a sentence is defined via the product of its unigram probabilities (Platanios et al., 2019). This metric implicitly incorporates sentence length, since the scores of longer sentences are sums of more terms and are thus likely to be larger. The difficulty metric for word rarity of a sequence s is defined as:

d_rarity(s) = - Σ_{w ∈ s} log p(w)

Damerau-Levenshtein Distance. To consider data and text jointly, we measure the alignment between data slots and text using the Damerau-Levenshtein distance d_dld (Brill and Moore, 2000a). It is the minimum number of edit operations needed to transform the data (s_d) into the text (s_t) 2 , relying on only four operations: (a) substituting a word in s_d with a different word, (b) inserting a word into s_d, (c) deleting a word from s_d, and (d) transposing two adjacent words of s_d. The computation involves recursive calls that compute the distance between substrings s_d^i ∈ s_d and s_t^i ∈ s_t at the i-th comparison.
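The three baseline metrics above can be sketched as follows. This is an illustrative implementation over token lists (function names are ours); the Damerau-Levenshtein routine is the standard dynamic program with unit-cost substitution, insertion, deletion, and adjacent transposition.

```python
import math

def length_difficulty(tokens):
    """d_length(s) = |s|, the number of tokens."""
    return len(tokens)

def rarity_difficulty(tokens, unigram_prob):
    """d_rarity(s) = -sum_w log p(w); longer and rarer sequences score higher."""
    return -sum(math.log(unigram_prob[w]) for w in tokens)

def damerau_levenshtein(a, b):
    """Token-level Damerau-Levenshtein distance: substitution, insertion,
    deletion, and transposition of adjacent tokens each cost 1."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # delete
                          d[i][j - 1] + 1,         # insert
                          d[i - 1][j - 1] + cost)  # substitute / match
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transpose
    return d[len(a)][len(b)]
```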
Soft Data-to-Text Edit Distance. We now present the proposed soft edit distance (SED): (1) We include the basic add and delete edit operations, as in the Levenshtein distance (Levenshtein, 1966); these are the only two operations used for decoding sequences in the Levenshtein Transformer (Gu et al., 2019), since they correlate well with human text writing, where humans "can revise, replace, revoke or delete any part of their generated text". We call this variant the plain edit distance (PED).
(2) Next, we weight the indicator function 1(s_d^i, s_t^i) of each edit operation with the negative logarithmic unigram probability -log p(w) of each token w ∈ s_{t|d}^i, in order to incorporate the idea of word rarity into the edit distance metric. For the delete operation we use w ∈ s_d^i, and for the add operation we use w ∈ s_t^i. This differs from the earlier proposal by Brill and Moore (2000b), in which edits are weighted by token transition probabilities; that weighting is not suitable for our scenario because there is no natural order over the slot sequence in data samples.
The soft distance metric d_sed is in principle similar to calculating the logarithmic sum defined in the rarity function, but it instead incrementally compares all substrings and calculates their edit distances. In this way, d_sed combines information about length and rarity with the edit operations. We show this process in Figure 5.
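A minimal sketch of SED, under our reading of the description above: a Levenshtein-style dynamic program over data and text tokens with only add and delete operations, where each operation is weighted by the token's negative log unigram probability. The function name and the smoothing constant for unseen tokens are our own illustrative choices, not from the paper.

```python
import math

def soft_edit_distance(data_toks, text_toks, unigram_prob):
    """Weighted edit distance between a data slot sequence and a text:
    deleting a data token or adding a text token costs -log p(token),
    so rare tokens contribute more; matching tokens cost nothing."""
    def cost(w):
        # assumed smoothing: treat unseen tokens as very rare
        return -math.log(unigram_prob.get(w, 1e-6))

    n, m = len(data_toks), len(text_toks)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = d[i - 1][0] + cost(data_toks[i - 1])   # delete from data
    for j in range(1, m + 1):
        d[0][j] = d[0][j - 1] + cost(text_toks[j - 1])   # add from text
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if data_toks[i - 1] == text_toks[j - 1]:
                d[i][j] = d[i - 1][j - 1]                # match: free
            else:
                d[i][j] = min(d[i - 1][j] + cost(data_toks[i - 1]),
                              d[i][j - 1] + cost(text_toks[j - 1]))
    return d[n][m]
```

Setting every cost to 1 instead of -log p(w) recovers the unweighted PED variant.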
Note that length and rarity can be computed either on the concatenation of the input data and text sequences or on each sequence individually, whereas the Damerau-Levenshtein distance and the soft edit distance are always computed jointly on data and text.

Experiment Setting
Data. We conduct experiments on the E2E (Novikova et al., 2017) and WebNLG (Colin et al., 2016) datasets. E2E is a crowd-sourced dataset containing 50k instances in the restaurant domain; the inputs are dialogue acts consisting of three to eight slot-value pairs. WebNLG contains 25k instances describing entities belonging to 15 distinct DBpedia categories; the inputs contain up to seven RDF triples of the form (subject, relation, object).

Configurations. The LSTM-based model is implemented based on PyTorch (Paszke et al., 2019).
We use 200-dimensional token embeddings and the Adam optimizer with an initial learning rate of 0.0001. The batch size is kept at 28, and we decode with beam search with beam size 5. Performance scores are averaged over 5 runs with random initializations.
Settings. We first perform ablation studies (Table 2) on the impact of computing difficulty metrics on the data, the text, or both (joint). We also analyse the average bin size for each metric: a metric that gives the same score to many instances creates large bins, which means the order of samples within a bin remains random. Conversely, a metric that assigns many distinct difficulty scores yields a more complete ordering (and a smaller step size when moving from one level of difficulty to the next). We present the change in performance (BLEU) as training progresses in order to compare the various difficulty metrics on both datasets (see Figure 3).
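The bin-size statistic described above amounts to grouping samples by identical difficulty score. A small sketch (the function name is ours):

```python
from collections import Counter

def average_bin_size(difficulty_scores):
    """Average number of samples sharing one difficulty score.
    Many ties -> large bins -> the within-bin order stays random."""
    bins = Counter(difficulty_scores)
    return sum(bins.values()) / len(bins)
```

Coarse metrics such as raw length produce few distinct values and hence large average bins, while real-valued metrics such as rarity or SED tie far less often.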

Results & Analysis
In Table 3, we observe that soft edit distance (SED) yields the best performance, outperforming a model without curriculum learning by as much as 2.42 BLEU, and all other metrics by roughly 1 BLEU. In general, models perform better with joint and text metrics than with data metrics. This relates to the average bin size of the scores a difficulty function generates: a metric that distinguishes samples more finely has a smaller average bin size, so at every competence threshold the probability of drawing overly difficult samples is lower. Length and DLD have larger average bin sizes across their difficulty scores, which makes samples less distinguishable from one another; accordingly, they yield the smallest improvements over plain training. We show reordered samples in Table 1 for all difficulty metrics computed jointly on data and text: length (L), rarity (R), Damerau-Levenshtein distance (DLD), and the proposed soft edit distance (SED). We also justify the weighting of edit operations: PED, the "hard" variant of SED that leaves edit operations unweighted, falls behind SED by up to 2.81 BLEU. Moreover, examining the resulting sample orders, we observe that SED yields a more intuitive ordering than the other metrics.
Human Evaluation. For human evaluation, three annotators are instructed to evaluate 100 samples from the joint variant for (1) fluency (score 0-5, with 5 being fully fluent), (2) whether the text misses information contained in the source data, and (3) whether it includes wrong information. These scores are averaged and presented in Table 2.

On Training Speed. We define speed by the number of updates it takes to reach a performance plateau. In Figure 3, the speedup is measured by the difference between the vertical bars. We observe that curriculum learning reduces the number of updates needed to converge, by up to 38.7% relative to the same model without curriculum learning (on E2E). Further, curriculum learning yields slightly worse performance in the initial training steps, but the score then rises higher and flattens as training converges.

Conclusion
To conclude, we show that sample order does indeed matter when model competence is taken into account during training. Further, we demonstrate that the proposed metrics are effective in speeding up model convergence. Given that curriculum learning can be combined with virtually any neural architecture, we recommend its use for data-to-text generation. We believe this work also offers insights into the process of annotating data with text labels, where a reduced number of labels may suffice.
Acknowledgments
This work was supported by the project "Foundations of Perspicuous Software Systems". We sincerely thank the anonymous reviewers for their insightful comments that helped us to improve this paper.