Avoiding Overlap in Data Augmentation for AMR-to-Text Generation

Leveraging additional unlabeled data to boost model performance is common practice in machine learning and natural language processing. For generation tasks, if the additional data overlaps with the target-side evaluation data, then training on the additional data amounts to training on the answers to the test set. This leads to overly inflated scores compared to real-world testing scenarios, and to problems when comparing models. We study the AMR dataset and Gigaword, which is popularly used for improving AMR-to-text generators, and find significant overlap between Gigaword and a subset of the AMR dataset. We propose methods for excluding parts of Gigaword to remove this overlap, and show that our approach leads to a more realistic evaluation of the task of AMR-to-text generation. Going forward, we give simple best-practice recommendations for leveraging additional data in AMR-to-text generation.


Introduction
Deep learning has made remarkable progress in many areas of natural language processing, including language generation (Sutskever et al., 2014; Luong et al., 2015) and semantic parsing (Dong and Lapata, 2016). Nevertheless, neural models are usually data-hungry, and sophisticated use of data augmentation can often go a long way (Konstas et al., 2017; Du and Black, 2019; Wei and Zou, 2019). One common method of data augmentation is to leverage large amounts of out-of-domain data for semi-supervised learning. However, without proper examination of the data being used, the external data may overlap significantly with the test set, leading to unfair gains. This issue is a particular problem for natural language generation (NLG) tasks with data augmentation, because training on data that overlaps with the test set is akin to training on the answers. In this work, we study the task of AMR-to-text generation and scrutinize the datasets used for training and evaluation. Our contributions are two-fold: (1) we develop an examination procedure that confirms serious overlaps between one of the AMR datasets and Gigaword (Parker et al., 2011), and conduct experiments showing that some of the performance gains are indeed "unfair"; (2) we propose several strategies for collecting external training data, and empirically show that these strategies can mitigate the aforementioned unfair gains. As a best practice, we suggest that future work on AMR-to-text generation exclude, to be on the safe side, Gigaword articles written in or near the months covered by Proxy (strategy no-3Months, described in Section 5).

Related Work
Abstract Meaning Representation (AMR) (Banarescu et al., 2013) has gained growing interest as a semantic formalism. The first AMR-to-text generator was built with tree transducers (Flanigan et al., 2016). More recent work has heavily adopted neural models, explored different architectures, and commonly employed Gigaword data to boost results (Konstas et al., 2017; Song et al., 2018; Wang et al., 2020). The most common approach is to use JAMR (Flanigan et al., 2014) to bootstrap labels for the additional data and then add them to the training data.
Prior work on AMR generation has used automatic metrics such as BLEU (Papineni et al., 2002) and human evaluations (May and Priyadarshi, 2017). Recently, research on evaluation metrics for NLG has been increasing (Zhang et al.; Sellam et al., 2020, inter alia). However, we are not aware of prior work investigating the problem of test-set overlap when using data-augmentation methods for generation. Closest to our work is the practice in machine translation evaluation of excluding articles from the same time period as the test set (NIST, 2012).

Origin of AMR and Gigaword Overlap
In this section, we describe the reason for the overlap between the AMR dataset and Gigaword.
In standard LDC releases of AMR, for example LDC2015E86 and LDC2017T10, the dev and test sets consist of 5 datasets from different sources. Information about these datasets is listed in Table 1. Each sentence in the dev and test sets is associated with an ID. The sentences of the Proxy dataset, in particular, have IDs that can be traced back to Gigaword articles. Upon inspection, these sentences appear to be close edits of sentences in Gigaword. For example, the sentence with ID "PROXY LTW ENG 20070831 0072.1" originates from the Gigaword article with ID "LTW ENG 20070831". The date on which a Gigaword news article was written is embedded in the ID. Since Proxy makes up more than half of the test sentences, such overlap could have a high impact on the evaluation of AMR-to-text generators. In the next section, we describe our procedure for empirically examining the effect of the overlap between Proxy and Gigaword.
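To make the tracing concrete, the following is a minimal sketch of mapping a Proxy sentence ID back to its Gigaword article ID and publication date. The function name, the underscore delimiters, and the exact field layout are illustrative assumptions based on the example ID above; the rendering of delimiters may differ across releases.

```python
def parse_proxy_id(proxy_id):
    """Map a Proxy sentence ID to its source Gigaword article ID and date.

    Illustrative format: "PROXY_LTW_ENG_20070831_0072.1"
    -> article "LTW_ENG_20070831", written 2007-08-31.
    """
    parts = proxy_id.split("_")
    # Drop the "PROXY" prefix and the trailing sentence index ("0072.1").
    article_id = "_".join(parts[1:4])   # e.g. "LTW_ENG_20070831"
    date = parts[3]                     # e.g. "20070831"
    return article_id, (date[:4], date[4:6], date[6:8])

article, (year, month, day) = parse_proxy_id("PROXY_LTW_ENG_20070831_0072.1")
```

The embedded date is what makes the month-based exclusion strategies in Section 5 possible.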

Measuring Overlap
We use the following procedure to quantitatively examine the overlap between the Proxy and Gigaword datasets. For each Proxy sentence in the validation and test splits, we find the Gigaword article whose ID is associated with the Proxy sentence ID. We then tokenize the article and split it into sentences. We measure the overlap between the Proxy sentence and each of the Gigaword sentences with 3 different metrics: (1) the absolute count of common words, i.e., the number of distinct words that appear in both sentences; (2) BLEU score; and (3) ROUGE-L score.
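As a sketch of the first and third metrics, the distinct-word overlap count and an LCS-based ROUGE-L F1 can be computed in plain Python as below (BLEU is typically computed with a standard toolkit; the lowercasing and whitespace tokenization here are simplifying assumptions for illustration):

```python
def common_word_count(hyp, ref):
    """Metric 1: number of distinct words appearing in both sentences."""
    return len(set(hyp.lower().split()) & set(ref.lower().split()))

def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[len(a)][len(b)]

def rouge_l_f1(hyp, ref):
    """Metric 3: ROUGE-L F1 between two whitespace-tokenized sentences."""
    h, r = hyp.lower().split(), ref.lower().split()
    lcs = lcs_length(h, r)
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(h), lcs / len(r)
    return 2 * prec * rec / (prec + rec)
```

Scoring every Gigaword sentence from the matched article against the Proxy sentence and keeping the ranked top matches gives the per-metric rankings reported below.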

Exclusion Strategies
We propose and investigate 3 sampling strategies for constructing semi-supervised training datasets from Gigaword; the strategies differ in which Gigaword articles they exclude: no-ID excludes articles whose IDs appear in the Proxy dataset; no-Month excludes articles written in the same month as those excluded by no-ID; no-3Months excludes articles written in the same month as, or in the months neighboring, those excluded by no-ID. We use reservoir sampling (Vitter, 1985) to sample sentences from Gigaword. We first collect a set of 200k sentences without any exclusion as a baseline. We then filter out sentences that come from articles excluded by no-ID, and sample the same number of sentences as were filtered from the articles that no-ID retains. This yields a set of 200k sentences representing no-ID. We collect the sample sets for no-Month and no-3Months from the baseline set in a similar fashion.
We use the GGNN dual-encoder model of Ribeiro et al. (2019) to study the effects of the different exclusion strategies. For each exclusion strategy, we obtain 3 different samples using different random seeds and repeat the experiments. We keep most of the hyperparameters from the original paper, adjusting only the learning rate schedule to accommodate the larger training sets.

Table 3: Top-matching Gigaword sentences under each overlap metric, with their scores.

Count 1st  At least one of those bands appears to be splitting into at least two different groups.  13
Count 2nd  Even though the Bush White House has generally entrusted government agencies to officials ...  7
Count 3rd  The rentals violated U-Haul's rule requiring the tow vehicle to be at least 750 pounds heavier than the one being towed.  7
Bleu 1st   At least one of those bands appears to be splitting into at least two different groups.  0.70
Bleu 2nd   At least one of those inspections would have come at a particularly delicate time ...  0.20
Bleu 3rd   ... as well as other outside organizations, at least one of which then sold tickets to its own members.  0.19
Rouge 1st  At least one of those bands appears to be splitting into at least two different groups.  0.91
Rouge 2nd  For at least a few of those percentage points, we have to thank Sheehan.  0.44
Rouge 3rd  At least one Democratic member of the group questioned Giuliani's decision to quit.  0.4
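The sampling step described in Section 5 relies on reservoir sampling; a minimal sketch is below. The seed parameter is illustrative, standing in for the different random seeds used across repeated runs.

```python
import random

def reservoir_sample(stream, k, seed=0):
    """Uniformly sample k items from an iterable of unknown length
    (Vitter, 1985). Each item in the stream ends up in the reservoir
    with equal probability k/n, without loading the stream into memory."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)          # fill the reservoir first
        else:
            j = rng.randint(0, i)           # replace with probability k/(i+1)
            if j < k:
                reservoir[j] = item
    return reservoir
```

For the exclusion strategies, the stream would simply be restricted to sentences from articles that the given strategy retains before sampling.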

Overlap between Proxy and Gigaword
In this section, we measure the overlap between Proxy and Gigaword using word- and n-gram-overlap measures, and study the effect of the overlap on the final trained system. We list the mean and median overlap scores of the top-3 matching sentences for each overlap measure in Table 2. It is clear that the top-scoring sentences overlap substantially more than those ranked 2nd and 3rd. Examples for illustration are given in Table 3. All three metrics tend to find the same top matching sentence. Most of the time, the test sentence in Proxy is an extractive summary or a rephrasing of its top match in Gigaword, indicating a concerning overlap between Proxy and Gigaword.
To investigate the impact of semi-supervised training with these Gigaword sentences that are near-duplicates of the test set, we create several sets for semi-supervised training. We create a cheat set, called Top 1 (Cheat), using the sentences with the highest matching ROUGE scores. We are also interested in the impact of sentences from the same articles as these duplicates but with less overlap, so we create additional sets from the sentences with the top 2-4 overlap scores, the top 5-7 overlap scores, and so on. We trained the model with each of these sample sets for semi-supervised training; the results on LDC2017T10 are listed in Table 4. The cheat set improved the evaluation on Proxy by more than 7 points, but helped the other datasets by only about 1 point, if at all. As the matching scores decrease, the improvement on Proxy also goes down. This indicates that the overlapping sentences between Proxy and Gigaword give a significant unfair advantage, especially those with the highest overlap.
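The construction of these rank-band sets can be sketched as follows, assuming each Proxy sentence's candidate matches have already been ranked by descending overlap score (the function name and data layout here are hypothetical):

```python
def build_rank_band_sets(ranked_matches, bands=((1, 1), (2, 4), (5, 7))):
    """Build one training pool per rank band of overlap scores.

    ranked_matches maps a Proxy sentence ID to its Gigaword matches,
    sorted by descending overlap score. The band (1, 1) yields the
    Top 1 "cheat" set; (2, 4) covers ranks 2-4; (5, 7) covers 5-7.
    """
    pools = {band: [] for band in bands}
    for ranked in ranked_matches.values():
        for lo, hi in bands:
            pools[(lo, hi)].extend(ranked[lo - 1:hi])  # ranks are 1-based
    return pools
```

Each pool can then be used as the additional data in a separate semi-supervised training run.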

Exclusion Strategies for Gigaword
To find a good exclusion strategy for constructing semi-supervised datasets from Gigaword, we sampled semi-supervised training sets as described in Section 5 and ran experiments. The results on LDC2015E86 and LDC2017T10 are presented in Tables 5 and 6, respectively. The results on LDC2017T10 are generally better than those on LDC2015E86, since the training set of the former is larger than that of the latter. Without excluding anything (i.e., the baseline strategy), the results on Proxy are significantly better than with no additional semi-supervised data (by about 8 points on LDC2017T10 and 10 points on LDC2015E86). They are also slightly better than those obtained by training with the cheat set. This is because training on sample sets of size 200k yields a much better language model than the small cheat set does. On the other hand, training on the cheat set is almost as good as training on 200k additional sentences, since neural models are good at memorization. For LDC2017T10, filtering out articles covering Proxy test sentences decreases performance on Proxy by 1 point; excluding articles written in the same or nearby months further decreases results on Proxy by more than 0.5 points. For LDC2015E86, excluding articles written in the same month decreases results on Proxy by more than 1 point. Finally, we perform paired t-tests comparing the performance of systems trained on each sample set against the baseline (no filtering); see Table 7. For LDC2015E86, no-Month resulted in statistically significantly lower BLEU scores on the Proxy dataset; for LDC2017T10, both no-Month and no-3Months resulted in statistically significantly lower BLEU scores on Proxy. All strategies performed similarly on all other datasets. This shows that excluding certain overlapping articles in Gigaword has a significant impact on the evaluation on the Proxy dataset, but much less so on the rest.
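The paired t-test statistic used for these comparisons can be computed directly from the matched per-seed BLEU scores; a minimal stdlib sketch is below (obtaining a p-value would additionally require a t distribution table or a statistics package):

```python
import math
from statistics import mean, stdev

def paired_t_statistic(scores_a, scores_b):
    """Paired t-test statistic over matched score lists, e.g. BLEU from
    runs of two systems that share random seeds. With n pairs, the
    statistic has n - 1 degrees of freedom."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / math.sqrt(n))
```

Pairing by seed removes run-to-run variance that an unpaired test would conflate with the effect of the exclusion strategy.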

Conclusion and Recommendation
In this paper, we examined Gigaword, the dataset commonly used for improving AMR-to-text generation, and found sentences that nearly duplicate the test set of Proxy, one of the AMR datasets. We developed a procedure that uses word-overlap measures to find overlapping sentences, and identified several metrics that are good at finding duplicate sentences. We proposed 3 different strategies for excluding overlapping data from Gigaword, and validated that, without filtering certain articles, the evaluation results may be unfair. As a best practice, we suggest that future work on AMR-to-text generation exclude, to be on the safe side, Gigaword articles written in or near the months covered by Proxy (no-3Months). Additionally, we suggest that future work report results on each AMR dataset separately, so that techniques favoring one dataset can be detected.