SAPPHIRE: Approaches for Enhanced Concept-to-Text Generation

We motivate and propose a suite of simple but effective improvements for concept-to-text generation called SAPPHIRE: Set Augmentation and Post-hoc PHrase Infilling and REcombination. We demonstrate their effectiveness on generative commonsense reasoning, a.k.a. the CommonGen task, through experiments using both BART and T5 models. Through extensive automatic and human evaluation, we show that SAPPHIRE noticeably improves model performance. An in-depth qualitative analysis illustrates that SAPPHIRE effectively addresses many issues of the baseline model generations, including lack of commonsense, insufficient specificity, and poor fluency.


Introduction
There has been increasing interest in constrained text generation tasks which involve constructing natural language outputs under certain preconditions, such as particular words that must appear in the output sentences. A related area of work is data-to-text natural language generation (NLG), which requires generating natural language descriptions of structured or semi-structured data inputs. Many constrained text generation and NLG tasks share commonalities, one of which is their task formulation: a set of inputs must be converted into natural language sentences. This set of inputs can be, in many cases, thought of as concepts, e.g. higher-level words or structures that play an important role in the generated text.
With the increased popularity of Transformer-based models and their application to many NLP tasks, performance on many text generation tasks has improved considerably. Much progress in recent years has come from the investigation of model improvements, such as larger and more effectively pretrained language generation models. However, are there simple and effective approaches to improving performance on these tasks that can come from the data itself? Further, can we potentially use the outputs of these models themselves to further improve their task performance, a "self-introspection" of sorts?
In this paper, we show that the answer is yes. We propose a suite of simple but effective improvements for concept-to-text generation called SAPPHIRE: Set Augmentation and Post-hoc PHrase Infilling and REcombination. Specifically, SAPPHIRE is composed of two major approaches: 1) the augmentation of input concept sets (§4.1), and 2) the recombination of phrases extracted from baseline generations into more fluent and logical text (§4.2). These are mainly model-agnostic improvements that rely on the data itself and the model's own initial generations, respectively. We focus on generative commonsense reasoning, or CommonGen (Lin et al., 2020), which involves generating logical sentences describing an everyday scenario from a set of concepts, which in this case are individual words that must be represented in the output text in some form. CommonGen is a challenging instance of constrained text generation that assesses 1) relational reasoning abilities using commonsense knowledge, and 2) compositional generalization capabilities to piece together concept combinations. Further, CommonGen's task formulation and evaluation methodology are quite broadly applicable and encompassing, making it a good benchmark for general constrained text generation capability. This is also an opportune moment to investigate the task, as the commonsense ability of NLP models, particularly for generation, has received increasing community attention through works like COMET (Bosselut et al., 2019).
We perform experiments on varying sizes of two state-of-the-art Transformer-based language generation models: BART (Lewis et al., 2020) and T5 (Raffel et al., 2020). We first conduct an extensive correlation study (§3.1) and qualitative analysis (§3.2) of these models' generations after simply training on CommonGen. We find that performance is positively correlated with concept set size, motivating concept set augmentation. We also find that generations contain issues related to commonsense and fluency which can possibly be addressed by piecing the texts back together in different ways, motivating phrase recombination. Fleshing out the first intuition, we devise two methods to augment concepts from references during training, through extracted keywords (§4.1.1) and attention matrices (§4.1.2). For the phrase recombination intuition, we propose two realizations based on a new training stage (§4.2.1) and masked infilling (§4.2.2). Finally, through comprehensive evaluation (§6), we show how the SAPPHIRE suite drives up model performance across metrics, besides addressing the aforementioned baseline deficiencies in commonsense, specificity, and fluency.

CommonGen Dataset
The CommonGen dataset is split into train, dev, and test splits, covering a total of 35,141 concept sets and 79,051 sentences. The concept sets range from 3 to 5 keywords long. As the original test set is hidden, we split the provided dev set into a new dev and test split for the majority of our experiments while keeping the training split untouched. Note that we also evaluate our SAPPHIRE models on the original test set with help from the CommonGen authors (see §6.1). We will henceforth refer to these new splits as train_CG, dev_CG, and test_CG, and to the original dev and test splits as dev_O and test_O. The statistics of our new splits compared to the originals can be found in Table 1. We attempt to keep the relative sizes of the new dev and test splits, and the distribution of concept set sizes within each split, similar to the originals.

Models: T5 and BART
We perform experiments using pretrained language generators, specifically BART and T5 (both base and large versions). BART (Lewis et al., 2020) is a Transformer-based seq2seq model trained as a denoising autoencoder to reconstruct original text from noised text. T5 (Raffel et al., 2020) is another seq2seq Transformer with strong multitask pretraining. We use their HuggingFace codebases. We train two seeded instances of each model on train_CG, evaluate their performance on dev_O, and compare our numbers to those reported in Lin et al. (2020) to benchmark our implementations. These serve as the four baseline models for our ensuing experiments. We follow the hyperparameters from Lin et al. (2020), choose the epoch reaching the highest ROUGE-2 on the dev split, and use beam search for decoding. From Table 2, we see that our re-implemented models match or exceed the originally reported results on most metrics across the different models.

Evaluation Metrics
For our experiments, we use a gamut of automatic evaluation metrics. These include those used by Lin et al. (2020), such as BLEU (Papineni et al., 2002), CIDEr (Vedantam et al., 2015), SPICE (Anderson et al., 2016), and Coverage (Cov). Barring Cov, these metrics measure the similarity between generated text and human references. Cov measures the average percentage of input concepts covered by the generated text. We also introduce BERTScore, which measures token-by-token similarity of BERT (Devlin et al., 2019) embeddings. It also measures the similarity between the generated text and human references, but at a more semantic (rather than surface token) level. When reporting BERTScore, we multiply by 100. For all metrics, higher corresponds to better performance.
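As an illustration, the Coverage computation can be sketched as follows. This is a simplified, hypothetical version: the official metric matches lemmas, so that inflected forms like "dancing" cover "dance", whereas this sketch uses exact token matching.

```python
def coverage(concepts, generation):
    """Percentage of input concepts appearing in the generated text.

    Simplification: exact token match only; a faithful implementation
    would lemmatize both sides so inflections count as covered.
    """
    tokens = set(generation.lower().split())
    covered = sum(1 for c in concepts if c.lower() in tokens)
    return 100.0 * covered / len(concepts)

# "run" is not covered (and "runs" would not count under exact matching).
print(coverage(["dog", "ball", "run"], "the dog chases a ball"))
```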

Correlation Study
We begin by conducting an analysis of the four baselines implemented and discussed in §2.2, which we henceforth refer to as BART-base-BL, BART-large-BL, T5-base-BL, and T5-large-BL. One aspect of interest is whether the number of input concepts affects the quality of the generated text. We conduct a comprehensive correlation study of the performance of the four baselines on dev_O with respect to the number of input concepts.
As seen from Table 3, the majority of the metrics are positively correlated with concept set size across the models. ROUGE-L, CIDEr, and SPICE have small, mostly statistically insignificant correlations, suggesting that they are likely uncorrelated with concept set size. Coverage is strongly negatively correlated, showing that there is a higher probability of concepts missing from the generated text as concept set size increases.
There are two major takeaways from this. Firstly, increased concept set size results in greater overall performance. Secondly, models have difficulty with coverage given increased concept set size. This motivates our first set of improvements, which involves augmenting the concept sets with additional words in hopes of 1) increasing performance of the models and 2) improving their coverage, as we hope that training with more input concepts will help models learn to better cover them in the generated text. This is discussed more in §4.1.

Qualitative Analysis
We conduct a qualitative analysis of the baseline model outputs. We observe that several outputs are more like phrases than full coherent sentences, e.g. "body of water on a raft". Some generated texts are also missing important words, e.g. "A listening music and dancing in a dark room" is clearly missing a noun before "listening". A large portion of generated texts are quite generic and bland, e.g. "Someone sits and listens to someone talk", while more detailed and specific statements are present in the human references. This can be seen as an instance of the noted "dull response" problem faced by generation models (Du and Black, 2019; Li et al., 2016), where they prefer safe, short, and frequent responses independent of the input.
Another issue is the way sentences are pieced together. Certain phrases in the outputs are either joined in the wrong order or with incorrect connectors, leading to sentences that appear to lack commonsense. For example, "body of water on a raft" is illogical, and the phrases "body of water" and "a raft" are pieced together incorrectly. Example corrections include "body of water carrying a raft" and "a raft on a body of water". The first changes the preposition on joining them to the verb carrying, and the second pieces them together in the opposite order. A similar issue occurs with the {horse, carriage, draw} example in Table 4.
Some major takeaways are that many generations are: 1) phrases rather than full sentences and 2) poorly pieced together and lack fluency and logic compared to human references. This motivates our second set of improvements, which involves recombining extracted phrases from baseline generations into hopefully more fluent and logical sentences. This is discussed more in §4.2.

Concept Set Augmentation
The first set of improvements is concept set augmentation, which involves augmenting the input concept sets. We try augmenting with 1 to 5 additional words, and try train-time augmentation both with and without test-time augmentation. We observed that test-time augmentation produced inconsistent and less effective results, so we stick with train-time-only augmentation. During training, rather than feeding in the original concept sets as inputs, we instead feed in these augmented concept sets, which consist of more words. The expected outputs are the same human references. At test time, we simply feed in the original concept sets (without augmentation) as inputs.

Keyword-based Augmentation
The first type of augmentation we try is keyword-based, or Kw-aug. We augment the train_CG concept sets with keywords extracted from the human references using KeyBERT (Grootendorst, 2020). We calculate the average semantic similarity (using cosine similarity of BERT embeddings) of the candidate keywords to the original concept set. At each stage of augmentation, we add the remaining candidate with the highest similarity (we also tried using the least semantically similar keywords, but results were noticeably worse). Some augmentation examples can be found in Table 5. We train our BART and T5 models using these augmented sets, and call the resulting models BART-base-KW, BART-large-KW, T5-base-KW, and T5-large-KW.
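The selection step can be sketched as follows. This is an illustrative simplification, with toy 2-d vectors standing in for BERT embeddings; the `augment` helper and its greedy top-k selection paraphrase the idea rather than reproduce the exact KeyBERT pipeline.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def augment(concept_set, candidates, embed, k):
    """Add the k candidate keywords most similar to the mean
    embedding of the original concept set."""
    dims = len(embed[concept_set[0]])
    centroid = [sum(embed[c][i] for c in concept_set) / len(concept_set)
                for i in range(dims)]
    ranked = sorted(candidates, key=lambda w: cosine(embed[w], centroid),
                    reverse=True)
    return concept_set + ranked[:k]

# Toy 2-d "embeddings"; the paper uses BERT embeddings via KeyBERT.
emb = {"ski": [1.0, 0.1], "mountain": [0.9, 0.3],
       "snow": [0.95, 0.2], "banana": [0.0, 1.0]}
print(augment(["ski", "mountain"], ["snow", "banana"], emb, 1))
```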

Attention-based Augmentation
We also try attention-based augmentation, or Att-aug. We augment the train_CG concept sets with the words that have been most attended to, in aggregate, by the other words in the human references. We pass the reference texts through BERT and take the attention weights at the last layer. At each stage of augmentation, we add the remaining candidate word with the highest attention. Adding the least attended words would not be effective, as many are words with little meaning (e.g. simple articles such as "a" and "the"). Some augmentation examples can be found in Table 5. We train our BART and T5 models using these augmented sets, and call the resulting models BART-base-Att, BART-large-Att, T5-base-Att, and T5-large-Att.
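A minimal sketch of the aggregation is below, with a toy attention matrix standing in for BERT's last-layer weights; the explicit stop-word filter is an illustrative assumption about how low-content words could be excluded.

```python
def most_attended(tokens, attn, k, stop=()):
    """Pick the k tokens receiving the most attention from other tokens.

    attn[i][j] is the attention weight from token i to token j
    (each row sums to 1, as in a Transformer attention head).
    """
    scores = {}
    for j, tok in enumerate(tokens):
        if tok in stop:
            continue
        # Total attention received from the *other* positions.
        scores[tok] = sum(attn[i][j] for i in range(len(tokens)) if i != j)
    return sorted(scores, key=scores.get, reverse=True)[:k]

tokens = ["the", "dog", "chases", "a", "ball"]
# Toy weights standing in for BERT's last-layer attention.
attn = [
    [0.1, 0.4, 0.3, 0.1, 0.1],
    [0.1, 0.1, 0.5, 0.1, 0.2],
    [0.1, 0.4, 0.1, 0.1, 0.3],
    [0.1, 0.2, 0.2, 0.1, 0.4],
    [0.1, 0.3, 0.4, 0.1, 0.1],
]
print(most_attended(tokens, attn, 2, stop={"the", "a"}))
```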

Phrase Recombination
The second set of improvements is phrase recombination, which involves breaking down sentences into phrases and reconstructing them into new sentences that are hopefully more logical and coherent. For training, we use YAKE (Campos et al., 2018) to break down the train_CG human references into keyphrases of up to 2, 3, or 5 n-grams long, ensuring that extracted phrases have as little overlap as possible. During validation and testing, since we assume no access to ground-truth human references, we instead use YAKE to extract keyphrases from our baseline model generations.
We ignore extracted 1-grams as this approach focuses on phrases. We find words from the original concept set which are not covered by our extracted keyphrases and include them to ensure that coverage is maintained. Essentially, we form a new concept set which can also consist of phrases. Some examples can be found in Table 6.
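The input-set construction can be sketched as follows. This is a simplified version: coverage is checked here by exact token match, whereas inflected forms would need lemma matching, and the example phrases are hypothetical.

```python
def build_input_set(concepts, keyphrases):
    """Form the phrase-recombination input set: keep multi-word
    keyphrases (1-grams are dropped), then add back any original
    concept not covered by the kept phrases so coverage is maintained."""
    phrases = [p for p in keyphrases if len(p.split()) > 1]
    covered = set(" ".join(phrases).split())
    missing = [c for c in concepts if c not in covered]
    return phrases + missing

# "frisbee" is extracted only as a 1-gram, so it is dropped as a phrase
# but re-added as an uncovered concept; "throw" was never extracted.
print(build_input_set(["dog", "frisbee", "catch", "throw"],
                      ["dog catch", "frisbee"]))
```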

Phrase-to-text (P2T)
To piece the phrases back together, we try phrase-to-text (P2T) generation by training BART and T5 to generate full sentences given our new input sets, and call these models BART-base-P2T, BART-large-P2T, T5-base-P2T, and T5-large-P2T. During training, we choose a single random permutation of each training input set (consisting of extracted keyphrases from the human references), with the elements separated by <s>, and use the human references as the outputs. This is so the models learn to be order-agnostic, which is important because one desired property of phrase recombination is the ability to combine phrases in different orders, as motivated by the qualitative analysis in §3.2. During inference (test time), we feed in a single random permutation of each test-time input set, consisting of extracted keyphrases from the corresponding baseline model's outputs.
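The input serialization can be sketched as follows; the helper name and the example phrases are illustrative, not taken from our released code.

```python
import random

def p2t_input(input_set, seed=None):
    """Serialize a (phrase) input set for phrase-to-text training:
    one random permutation with elements joined by the <s> separator,
    so the model sees elements in arbitrary order and learns to be
    order-agnostic."""
    rng = random.Random(seed)
    perm = list(input_set)
    rng.shuffle(perm)
    return " <s> ".join(perm)

print(p2t_input(["rides a horse", "through the field", "man"], seed=0))
```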

Mask Infilling (MI)
This method interpolates text between test-time input set elements, with no training required. For example, given a test-time input set {c1, c2}, we feed in "<mask> c1 <mask> c2 <mask>" and "<mask> c2 <mask> c1 <mask>" to an MI model to fill the <mask> tokens with text. We use BART-base and BART-large for MI, and call the approaches BART-base-MI and BART-large-MI, respectively. We use BART-base-MI on input sets consisting of extracted keyphrases from BART-base-BL and T5-base-BL, and BART-large-MI on input sets consisting of extracted keyphrases from BART-large-BL and T5-large-BL. We also try MI on the original concept sets (with no phrases). One difficulty is determining the right input set permutation. Many input sets contain ≥5 elements (meaning ≥5! = 120 permutations), making exhaustive MI infeasible. The order of elements for infilling can result in vastly different outputs (see §6.3), as certain orders are more natural. Humans perform their own intuitive reordering of given inputs when writing, and the baselines and other approaches (e.g. Kw-aug, P2T) learn to be mainly order-agnostic.
We use perplexity (PPL) from GPT-2 (Radford et al., 2019) to pick the "best" permutations for MI. We feed up to 120 permutations of each input set (with elements separated by spaces) to GPT-2 to extract their PPL, and keep the 10 with lowest PPL per example. This is not a perfect approach, but is likely better than random sampling. For each example, we perform MI on these ten permutations, and select the output with lowest GPT-2 PPL.
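The permutation selection can be sketched as follows, with a stub scorer standing in for GPT-2 perplexity (the scorer and its toy values are assumptions for illustration).

```python
import itertools

def mi_template(perm):
    """Build the infilling input "<mask> c1 <mask> c2 ... <mask>"."""
    return "<mask> " + " <mask> ".join(perm) + " <mask>"

def best_permutations(input_set, score, keep=10, cap=120):
    """Rank up to `cap` permutations by a language-model score
    (lower = more natural, standing in for GPT-2 PPL) and keep the
    `keep` best, rendered as mask-infilling templates."""
    perms = list(itertools.islice(itertools.permutations(input_set), cap))
    perms.sort(key=lambda p: score(" ".join(p)))
    return [mi_template(p) for p in perms[:keep]]

def stub_ppl(text):
    # Toy stand-in for GPT-2 perplexity over the space-joined elements.
    return {"a raft a body of water": 12.0,
            "a body of water a raft": 30.0}[text]

print(best_permutations(["a raft", "a body of water"], stub_ppl, keep=1))
```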
We found that BART-large-MI outputs contain URLs, news agency names in brackets, etc. Hence, we post-process these outputs before output selection and evaluation. BART-base-MI does not exhibit this behavior. One possible explanation is that BART-large may have been pretrained on more social media and news data.

Model Training and Selection
For training the Kw-aug, Att-aug, and P2T models, we follow the baseline hyperparameters, barring the learning rate (LR), which is tuned per method. We train two seeds per model. See Appendix A for more.
For each model, we choose the epoch corresponding to the highest ROUGE-2 on the dev split, and use beam search for decoding. The dev and test splits differ by method. For Kw-aug and Att-aug models, the splits are simply dev_CG and test_CG (or test_O), as we do not perform test-time augmentation. For P2T, the splits are dev_CG and test_CG (or test_O), but with the input sets replaced by new ones that include keyphrases extracted from the corresponding baseline model's outputs.
The number of words to augment for Kw-aug and Att-aug (from 1 to 5) and the maximum n-gram length of extracted keyphrases for P2T (2, 3, or 5) are hyperparameters. While we train separate versions of each model corresponding to different values of these, the final chosen model per method and model combination (such as BART-base-KW) is the one corresponding to the hyperparameter value that performs best on the dev split when averaged over both seeds. For MI, which involves no training, we select the variation (MI on the original concept set, or on new input sets with keyphrases up to 2, 3, or 5 n-grams) per model that performs best on the dev split, and only perform infilling using extracted keyphrases from the first-seed baseline generations. These are the selected models whose test_CG and test_O results we report in §6.

Human Evaluation
We ask annotators to evaluate 48 test_CG examples from the human references, baseline outputs, and the outputs of the various methods (excluding MI) for BART-large and T5-base. We choose these two models as they cover both model types and sizes, and exclude MI as it performs noticeably worse on the automatic evaluation (see §6.1). See Appendix §C for more.
The annotators evaluate the fluency and commonsense of the texts on 1-5 scales. Fluency, also known as naturalness, is a measure of how human-like a text is. Commonsense is the plausibility of the events described. We do not evaluate coverage, as automatic metrics suffice; coverage is more objective than fluency and commonsense.

Results and Analysis
Automatic evaluation results on test_CG can be found in Tables 7, 8, 9, and 10, and results on test_O in Table 12. Human evaluation results on test_CG can be found in Table 13. Single-keyword augmentation performs best for Kw-aug across models. Two-word augmentation performs best for Att-aug, except for T5-base, where three-word augmentation performs best. Keyphrases up to 2-grams long perform best for P2T, except for T5-large, where 3-grams perform best. All models perform best with keyphrases up to 5-grams long for MI. These are the results reported here; graphs displaying other hyperparameter results on test_CG are in Appendix D. Table 14 contains qualitative examples, and more can be found in Appendix §E.

Automatic Evaluation
We see from Tables 7 to 10 that SAPPHIRE methods outperform the baselines on most or all metrics across the models on test_CG. The only exception is MI, which performs worse on everything other than coverage.
For BART-base, Kw-aug, Att-aug, and P2T all outperform the baseline across the metrics. For BART-large, Att-aug and P2T outperform the baseline most strongly, with noticeable increases on all metrics. For T5-base, all methods outperform the baseline, with Kw-aug performing best. Att-aug performs best for T5-large, though SAPPHIRE appears relatively less effective there. T5-large is the strongest baseline, and hence further improving its performance is possibly more difficult.
MI performs worse across most metrics except coverage, likely because it almost always keeps inputs intact in their exact form. This explains its high coverage, but the resulting inflexibility is possibly one reason for its low performance elsewhere. Further, as discussed in §4.2.2, MI is highly dependent on the input order. See §6.3 for more.

Table 11 contains statistical-significance p-values from Pitman's permutation tests (Pitman, 1937) for what we adjudged to be the best performing method(s) per model, compared to the corresponding baselines on test_CG. Most metrics across the methods are significant compared to the baselines.
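Pitman's permutation test on paired per-example scores can be sketched as follows. This is a Monte Carlo approximation with illustrative data; the exact test enumerates all sign assignments.

```python
import random

def paired_permutation_test(a, b, trials=10000, seed=0):
    """Approximate Pitman permutation test for paired per-example
    scores a (method) and b (baseline). Under the null, the two labels
    are exchangeable within each pair, so we randomly flip each
    difference's sign and count how often the absolute mean difference
    matches or beats the observed one."""
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs)) / len(diffs)
    hits = 0
    for _ in range(trials):
        s = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(s) / len(diffs) >= observed:
            hits += 1
    return hits / trials

# Clearly separated paired scores should give a small p-value.
p = paired_permutation_test([0.9, 0.8, 0.85, 0.95, 0.9, 0.88],
                            [0.5, 0.4, 0.45, 0.5, 0.55, 0.4])
print(p)
```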
From Table 12, we see that SAPPHIRE models outperform the corresponding baselines reported in Lin et al. (2020) on test_O. T5-large-KW and P2T outperform EKI-BART (Fan et al., 2020) and KG-BART on both SPICE and BLEU-4; these are two SOTA published CommonGen models that use external knowledge from corpora and KGs. As SPICE is used to rank the CommonGen leaderboard, T5-large-KW and P2T would place highly. SAPPHIRE models do lag behind the SOTA published RE-T5 (Wang et al., 2021), showing potential for further improvement. Further, the BART-large SAPPHIRE models perform worse than EKI-BART and KG-BART, but not by a substantial margin. We emphasize again that SAPPHIRE simply uses the data itself and the baseline generations, rather than external knowledge. Hence, SAPPHIRE's performance gains over the baselines, and certain SAPPHIRE models matching or outperforming SOTA models that leverage external information, are quite impressive.

Table 13 shows human evaluation results on test_CG for the human references and methods (excluding MI) using BART-large and T5-base. SAPPHIRE generally outperforms the baselines. BART-large-P2T scores noticeably higher on both fluency and commonsense. For T5-base, all three methods outperform the baseline across both metrics. Compared to humans, our best methods have comparable fluency, but still lag noticeably on commonsense, demonstrating that human-level generative commonsense reasoning is indeed challenging.

Qualitative Analysis
We see from Table 14 that many baseline outputs contain issues found in §3.2, e.g. incomplete or illogical sentences. Human references are fluent, logical, and sometimes more creative (e.g. example 5), which all methods still lack in comparison. For example 1, the baseline generation "hands sitting on a chair" misses the concept "toy", whereas our methods do not. Kw-aug and Att-aug output complete and logical sentences. For example 2, the baseline generation of "a camel rides a camel" is illogical. Our methods output more logical and specific sentences. For example 3, our methods generate more complete and coherent sentences than the baseline, which lacks a subject (it does not mention who is "walking"). For example 4, the baseline generation "bus sits on the tracks" is illogical, as buses park on roads. Our methods do not suffer from this and output more reasonable text. For example 5, the baseline generation "A lady sits in a sunglass." is completely illogical. Kw-aug, Att-aug, and P2T all output logical text. For example 6, the baseline output "Someone stands in front of someone holding a hand" is generic and bland. Kw-aug, Att-aug, and P2T all output more specific and detailed text rather than simply referring to "someone". Overall, SAPPHIRE generates text that is more complete, fluent, and logical, with greater coverage, addressing many baseline issues (§3.2).
However, SAPPHIRE methods are imperfect. P2T relies heavily on the original generation. For example 1, the baseline output "hands sitting on a chair" is extracted as a keyphrase and used in the P2T output "hands sitting on a chair with toys". While coverage improves, the text is still illogical. For example 2, P2T still misses the "walk" concept. While the Att-aug output of "A man is riding camel as he walks through the desert." is more logical than the baseline's, it is still not entirely logical, as the man cannot ride the camel and walk at the same time. MI outputs logical and fluent text for examples 2 and 3. For the other examples, the generated texts are illogical, not fluent, or incomplete. This is likely due to input permutation having a strong effect on output quality. For example, "wave" before "falls off a surf board" leads to the illogical output "A wave falls off a surf board.", whereas the reverse order results in the more logical "A man falls off a surf board and hits a wave." As discussed in §4.2.2, our method of selecting the best permutations is likely imperfect. Further, BART-MI usually does not inflect inputs and retains them in exact form, unlike the baselines and other methods, which learn to inflect words (e.g. singular to plural). We believe BART-MI has potential if these weaknesses can be addressed.


Related Work

Data-to-text NLG: A wide range of data-to-text NLG benchmarks have been proposed, e.g. for generating weather reports (Liang et al., 2009), game commentary (Jhamtani et al., 2018), and recipes (Kiddon et al., 2016). E2E-NLG (Dušek et al., 2018) and WebNLG (Gardent et al., 2017) are two benchmarks that involve generating text from meaning representation (MR) and triple sequences (see also Montella et al., 2020).

Commonsense Reasoning and Incorporation: Talmor et al. (2020) show that not all pretrained LMs can reason through commonsense tasks. Other works investigate commonsense injection into models; one popular way is through knowledge graphs (KGs).
One prominent such effort is COMET, which trains on KG edges to learn connections between words and phrases. COSMIC (Ghosal et al., 2020) uses COMET to inject commonsense. EKI-BART (Fan et al., 2020) and KG-BART show that external knowledge (from corpora and KGs) can improve performance on CommonGen. Distinctly, SAPPHIRE obviates reliance on external knowledge.

Conclusion and Future Work
In conclusion, we motivated and proposed several improvements for concept-to-text generation, which we call SAPPHIRE: Set Augmentation and Post-hoc PHrase Infilling and REcombination. We demonstrated their effectiveness on CommonGen through experiments on BART and T5. Extensive evaluation showed that SAPPHIRE improves model performance, addresses many issues of the baselines, and has potential for further exploration. Potential future work includes improving mask infilling performance and trying combinations of SAPPHIRE techniques, as they could be complementary. Better exploiting regularities of CommonGen-like tasks, e.g. invariance to input order, presents another avenue. SAPPHIRE methods can also be investigated for other data-to-text NLG tasks, e.g. WebNLG, and explored for applications such as improving the commonsense reasoning of personalized dialog agents, data augmentation for NLG (Feng et al., 2021), and constructing pseudo-references for long-context NLG (Gangal et al., 2021b).

Table 14 (continued):

(example 4, continued)
BART-large-MI: There are people waiting on benches outside bus stops to sit down. pic.twitter.
Human: The man sat on the bench waiting for the bus.

Concept Set {sunglass, wear, lady, sit} (example 5)
T5-base-BL: A lady sits in a sunglass.
T5-base-KW: A lady sits next to a man wearing sunglasses.
T5-base-Att: A lady sits wearing sunglasses.
T5-base-P2T: A lady sits next to a man wearing sunglasses.
BART-base-MI: A young lady sits in a sunglass to wear.
Human: The lady wants to wear sunglasses, sit, relax, and enjoy her afternoon.

Concept Set {hold, hand, stand, front} (example 6)
T5-large-BL: Someone stands in front of someone holding a hand.
T5-large-KW: Two men stand in front of each other holding hands.
T5-large-Att: A man stands in front of a woman holding a hand.
A Training Details

Training was done using single RTX 2080 Ti and Titan Xp GPUs, and Google Colab instances which alternately used a single V100, P100, or Tesla T4 GPU. The vast majority of the training was done on a single V100 per model. T5-base models trained in approx. 1 hour, BART-base models in approx. 45 minutes, T5-large models in approx. 4 hours, and BART-large models in approx. 1.5-2 hours.

B Full Re-implementation versus Reported Model Numbers
See Table 16 for a full comparison of our re-implemented CommonGen models against the original numbers reported in Lin et al. (2020).

C Human Evaluation Details
Human evaluation was done via paid crowdworkers on AMT, who were from Anglophone countries with lifetime approval rates > 97%. Each example was evaluated by 2 annotators. The time given for each AMT task instance, or HIT, was 8 minutes. Sufficient time to read instructions, as calibrated by the authors, was also considered in the maximum time limit for performing each HIT. Annotators were paid 98 cents per HIT. This rate ($7.35/hr) exceeds the minimum wage for the USA ($7.25/hr) and constitutes fair pay. We neither solicit, record, request, nor predict any personal information pertaining to the AMT crowdworkers. Specific instructions and a question snippet can be seen in Figure 1.

Table 15 (further qualitative examples):

Human: A boy danced around the room while listening to music.

Concept Set {cheer, team, crowd, goal}
T5-base-BL: the crowd cheered after the goal.
T5-base-KW: the crowd cheered after the goal by football team
T5-base-Att: the crowd cheered after the goal by the team.
T5-base-P2T: the crowd cheered as the team scored their first goal.
BART-base-MI: The team and the crowd cheered after the goal.
Human: The crowd cheered when their team scored a goal.

Concept Set {bag, put, apple, tree, pick}
T5-base-BL: A man puts a bag of apples on a tree.
T5-base-KW: A man puts a bag under a tree and picks an apple.
T5-base-Att: A man puts a bag under a tree and picks an apple.
T5-base-P2T: A man puts a bag of apples on a tree and picks them.
BART-base-MI: A man puts a bag of apple juice on a tree to pick it up
Human: I picked an apple from the tree and put it in my bag.

Concept Set {circle, ball, throw, turn, hold}
T5-large-BL: Someone turns and throws a ball in a circle.
T5-large-KW: A man holds a ball and turns to throw it into a circle.
T5-large-Att: A man holds a ball in a circle and throws it.
T5-large-P2T: A man holds a ball, turns and throws it into a circle.
BART-large-MI: He turns and throws a ball into the circle to hold it.
Human: A girl holds the ball tightly, then turns to the left and throws the ball into the net which is in the shape of a circle.

Concept Set {hair, sink, lay, wash}
T5-large-BL: A woman is washing her hair in a sink.
T5-large-KW: A woman lays down to wash her hair in a sink.
T5-large-Att: A man lays down to wash his hair in a sink.
T5-large-P2T: A woman is washing her hair in a sink.
BART-large-MI: A woman is washing her hair in the sink. She lay the sink down
Human: The woman laid back in the salon chair, letting the hairdresser wash her hair in the sink.

Concept Set {wash, dry, towel, face}
T5-large-BL: A man is washing his face with a towel.
T5-large-KW: A man washes his face with a towel and then dries it.
T5-large-Att: A man is washing his face with a towel and drying it.
T5-large-P2T: A man is washing his face with a towel and drying it off.
BART-large-MI: A man is washing his face with a towel to dry it.
Human: The woman will wash the baby's face and dry it with a towel.

Table 16 caption: comparison against Lin et al. (2020). Note that for our models, results are averaged over two seeds, and that the original authors did not experiment with BART-base. Bold indicates where we match or exceed the reported metric.

E Further Qualitative Examples
See Table 15 for further qualitative examples.

Figure captions (Appendix D graphs): results shown for BART-base and T5-base are first-seed only, and we only went above three augmented keywords/words for the base-size models. BL refers to the baseline results with no augmented keywords/words.