Retrieval Enhanced Model for Commonsense Generation

Commonsense generation is the challenging task of generating a plausible sentence that describes an everyday scenario using a set of provided concepts. It requires reasoning over commonsense knowledge and compositional generalization, which challenges even strong pre-trained language generation models. We propose a novel framework that uses retrieval methods to enhance both pre-training and fine-tuning for commonsense generation. We retrieve prototype sentence candidates by concept matching and use them as auxiliary input. For fine-tuning, we further boost performance with a trainable sentence retriever. We demonstrate experimentally on the large-scale CommonGen benchmark that our approach achieves new state-of-the-art results.


Introduction
The understanding of commonsense knowledge in human language has been acknowledged as a critical component of artificial intelligence systems. In recent years, many new tasks and datasets have been proposed to assess the commonsense reasoning ability of NLP models. SWAG (Zellers et al., 2018) is a task of inferring the upcoming event from a partial description using commonsense. CommonsenseQA (Talmor et al., 2019) is a commonsense question answering dataset built from ConceptNet. Recently, Lin et al. (2020) proposed CommonGen, a new challenge for evaluating a model's ability of generative commonsense reasoning. CommonGen requires the system to construct a plausible sentence based on several concepts related to an everyday scenario. Two examples for this task are shown in Table 1. The task is challenging because the system needs to organize the provided concepts into the most plausible scenario, avoid violating commonsense, and ensure the generated sentence is grammatically correct. Existing approaches fine-tune pre-trained encoder-decoder models for description construction with concatenated concepts as input. Fan et al. (2020) propose a retrieve-and-generate method for commonsense generation which uses a prototype candidate sentence as auxiliary input. However, their retriever is non-trainable and is only used in the fine-tuning process. In this work, we extend this idea and propose a novel framework for commonsense generation that uses retrieval to enhance both the pre-training and fine-tuning stages. Furthermore, we design a trainable prototype sentence retriever to further boost generation performance.
We conduct experiments on CommonGen (Lin et al., 2020). Our approach achieves new state-of-the-art results on CommonGen on several metrics, including BLEU, CIDEr and SPICE.

Method
We frame the CommonGen challenge as a sequence-to-sequence task and adopt T5 (Raffel et al., 2020), a powerful pre-trained encoder-decoder model, as our base model. Fan et al. (2020) find that concept-related sentences in external corpora can benefit relational reasoning for CommonGen. We extend this idea by proposing retrieval-enhanced T5 (RE-T5), which equips the original T5 with a trainable retriever for selecting prototype sentences based on the given concepts. Meanwhile, following Zhou et al. (2021), we design a pre-training task for CommonGen which continues to pre-train RE-T5 on pseudo concept sets extracted from external corpora. We also use a retriever in this pre-training stage. Formally, given a concept set $X = \{x_1, x_2, \dots, x_n\}$, where $x_i$ represents the $i$-th concept and $n$ is the number of concepts, our goal is to generate a natural language output of tokens $Y = \{y_1, y_2, \dots, y_m\}$, which describes a common scenario in our daily life, using all given concepts in $X$.

Retrieval
Since external corpora contain a large amount of scenario knowledge describing the relationships between concepts (Fan et al., 2020), we retrieve sentences related to the input concepts to help the model perform better commonsense reasoning. First, given an input concept set $X$, we extract from the external corpora all sentences that contain at least two concepts in $X$ as the candidate set $\mathcal{Z}$. Then, we design two retrieval models, a matching retriever and a trainable retriever, to further retrieve $k$ prototype sentences $Z = \{z_1, z_2, \dots, z_k\}$, $Z \subseteq \mathcal{Z}$, as auxiliary input context for RE-T5.
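The candidate extraction step above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes sentences are already tokenized and lemmatized so that concepts can be matched by exact token equality, and the names `count_concepts` and `build_candidate_set` are hypothetical.

```python
from typing import Iterable, List, Set


def count_concepts(sentence_tokens: Iterable[str], concepts: Set[str]) -> int:
    """Number of distinct input concepts that appear among the sentence tokens."""
    return len(concepts & set(sentence_tokens))


def build_candidate_set(
    concepts: Set[str], corpus: List[List[str]], min_match: int = 2
) -> List[List[str]]:
    """Keep every corpus sentence containing at least `min_match` input concepts."""
    return [sent for sent in corpus if count_concepts(sent, concepts) >= min_match]


# Toy corpus of lemmatized, lower-cased sentences (illustrative data only).
corpus = [
    ["two", "guy", "sit", "by", "the", "side", "of", "the", "road"],
    ["a", "dog", "run", "on", "the", "beach"],
    ["a", "man", "wear", "a", "shirt"],
]
concepts = {"trailer", "shirt", "side", "sit", "road"}
candidates = build_candidate_set(concepts, corpus)
```

Only the first toy sentence survives the filter, since it is the only one that contains two or more of the input concepts.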
Matching Retriever The matching retriever first orders the candidate sentences by the number of input concepts they contain. It then samples $k$ sentences, starting from the sentences that contain the most concepts, as the auxiliary input.
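One possible reading of the matching retriever is sketched below: candidates are grouped by how many input concepts they contain, and sentences are sampled at random within each group, visiting groups from the highest count downwards until $k$ sentences are collected. The whitespace tokenization, the random tie-breaking within a group, and the function name `matching_retrieve` are assumptions made for illustration.

```python
import random
from collections import defaultdict
from typing import Dict, List, Optional, Set


def matching_retrieve(
    concepts: Set[str],
    candidates: List[str],
    k: int = 3,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Sample k candidates, preferring those that contain the most concepts."""
    rng = rng or random.Random()
    # Group candidate sentences by the number of input concepts they contain.
    buckets: Dict[int, List[str]] = defaultdict(list)
    for sent in candidates:
        n_match = len(concepts & set(sent.lower().split()))
        buckets[n_match].append(sent)
    selected: List[str] = []
    # Visit groups from most to fewest matched concepts.
    for n_match in sorted(buckets, reverse=True):
        need = k - len(selected)
        if need <= 0:
            break
        group = buckets[n_match]
        # Random sampling inside a group introduces the randomness
        # that is absent from a purely deterministic ranking.
        selected.extend(rng.sample(group, min(need, len(group))))
    return selected
```

Because sampling only happens within groups of equal concept count, a sentence with more matched concepts is always chosen before one with fewer.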
Trainable Retriever In order to retrieve more useful sentences from the candidate set, we design a trainable retriever, which predicts a score for each candidate and then selects the top-$k$ sentences as additional context. The scorer is built on BERT (Devlin et al., 2019), a pre-trained language model commonly used for language understanding. Given a concept set $X$ and a candidate sentence $z_i$, our trainable retriever first concatenates them into a text input:
$$\text{[CLS]}\ x_1\ x_2\ \dots\ x_n\ \text{[SEP]}\ z_i\ \text{[SEP]}$$
where [CLS] and [SEP] are special symbols in BERT.
We pass this input into BERT, which generates an output vector for each input token. We feed the output vector corresponding to [CLS], which serves as the aggregated representation of the input sequence (denoted $c$), into a linear layer with sigmoid activation to obtain the binary classification output $y_c$:
$$y_c = \sigma(W_c c + b_c)$$
where $W_c$ is a projection matrix, $b_c$ is a bias, and $\sigma$ is the sigmoid function.
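The scoring head described above (a linear layer with sigmoid activation on the [CLS] vector) can be illustrated in plain Python. This sketch assumes the standard BERT sentence-pair layout for the text input and treats the [CLS] vector as a given list of floats; in practice a BERT tokenizer inserts the special symbols and the encoder produces the vector. The function names are hypothetical.

```python
import math
from typing import Sequence


def build_retriever_input(concepts: Sequence[str], sentence: str) -> str:
    """Concept set and candidate sentence in an assumed BERT sentence-pair layout."""
    return "[CLS] " + " ".join(concepts) + " [SEP] " + sentence + " [SEP]"


def score_from_cls(
    cls_vector: Sequence[float], weights: Sequence[float], bias: float
) -> float:
    """Sigmoid of a linear projection of the [CLS] vector: sigma(W_c c + b_c)."""
    logit = sum(w * c for w, c in zip(weights, cls_vector)) + bias
    # Numerically stable sigmoid for both large positive and negative logits.
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    e = math.exp(logit)
    return e / (1.0 + e)
```

Since the output is a single probability, $W_c$ reduces to one weight vector of the same dimension as $c$.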
To train this retriever, for each concept set in the CommonGen training set, we use its paired sentence as a positive example and randomly sample another sentence, also from the training set, as a negative example. We then adopt the cross-entropy loss for this binary classification. The $k$ candidate sentences with the highest scores are selected as the auxiliary input $Z$. We describe below how these two retrievers are used in the pre-training and fine-tuning stages.
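The training recipe for the scorer can be sketched as below, under a few stated assumptions: each training item is a (concept set, reference sentence) pair, the negative is drawn from a different item (which requires at least two items), and the loss is the usual per-example binary cross entropy. The helper names are hypothetical, and the BERT encoder itself is omitted.

```python
import math
import random
from typing import List, Sequence, Tuple

Example = Tuple[Tuple[str, ...], str, int]  # (concepts, sentence, label)


def build_training_pairs(
    data: Sequence[Tuple[Tuple[str, ...], str]], rng: random.Random
) -> List[Example]:
    """One positive and one randomly sampled negative per training item."""
    examples: List[Example] = []
    for idx, (concepts, sentence) in enumerate(data):
        examples.append((concepts, sentence, 1))
        # Assumption: the negative comes from a different training item.
        neg_idx = rng.choice([j for j in range(len(data)) if j != idx])
        examples.append((concepts, data[neg_idx][1], 0))
    return examples


def binary_cross_entropy(prob: float, label: int, eps: float = 1e-12) -> float:
    """Binary cross-entropy loss for a single predicted probability."""
    prob = min(max(prob, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))


def top_k(scored: Sequence[Tuple[str, float]], k: int = 3) -> List[str]:
    """Return the k sentences with the highest retriever scores."""
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    return [sentence for sentence, _ in ranked[:k]]
```

At retrieval time, `top_k` would be applied to the scores that the trained scorer assigns to every sentence in the candidate set.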

Pre-training
To enhance the model's commonsense reasoning ability, we design a pre-training task for RE-T5 that is similar to the original CommonGen task. In more detail, given a sentence from the external corpora, we first use spaCy (Honnibal et al., 2020) to tag the sentence with part-of-speech labels and extract verbs, nouns and proper nouns as pseudo concept phrases. We then keep only the phrases that appear in ConceptNet (Speer et al., 2017) and remove concept sets that appear in the CommonGen test set. We use the original sentence as the target sentence and construct a pre-training task in which RE-T5 generates this sentence given the pseudo concepts.
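The pseudo concept extraction can be sketched as follows. To keep the example self-contained, it operates on pre-tagged (lemma, POS) pairs rather than calling spaCy directly; with spaCy these pairs could be obtained from `token.lemma_` and `token.pos_`. Using lemmas as concepts, the small ConceptNet vocabulary set, and the function name `extract_pseudo_concepts` are assumptions for illustration.

```python
from typing import FrozenSet, List, Optional, Sequence, Set, Tuple

# Universal POS tags for verbs, nouns and proper nouns.
CONTENT_POS = {"VERB", "NOUN", "PROPN"}


def extract_pseudo_concepts(
    tagged_tokens: Sequence[Tuple[str, str]],
    conceptnet_vocab: Set[str],
    test_concept_sets: Set[FrozenSet[str]],
) -> Optional[List[str]]:
    """Return the pseudo concepts of a sentence, or None if it must be dropped."""
    concepts: List[str] = []
    for lemma, pos in tagged_tokens:
        lemma = lemma.lower()
        # Keep content words that are known to ConceptNet, without duplicates.
        if pos in CONTENT_POS and lemma in conceptnet_vocab and lemma not in concepts:
            concepts.append(lemma)
    # Drop empty sets and sets that coincide with a test-set concept set.
    if not concepts or frozenset(concepts) in test_concept_sets:
        return None
    return concepts


tagged = [("a", "DET"), ("man", "NOUN"), ("sit", "VERB"),
          ("on", "ADP"), ("the", "DET"), ("road", "NOUN")]
vocab = {"man", "sit", "road"}  # stand-in for the ConceptNet vocabulary
pseudo = extract_pseudo_concepts(tagged, vocab, {frozenset({"dog", "run"})})
```

The sentence that produced `tagged` would then serve as the generation target for the concept set in `pseudo`.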
Because of the way pseudo concepts are extracted, each concept set in the pre-training data has a large candidate set $\mathcal{Z}$ containing an excessive number of candidate sentences, which leads to a long inference time for the trainable retriever. Thus, for speed and also to introduce a degree of randomness into pre-training, we use the matching retriever to retrieve $k$ sentences as the auxiliary input $Z$.
After retrieval, RE-T5 takes the concatenation of input concepts and retrieved prototype sentences as input, and the original sentence as output.
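A minimal sketch of how a source and target pair for RE-T5 might be assembled is shown below. The source text does not specify the separator between concepts and prototype sentences, so the `" | "` delimiter and the function name `build_generator_example` are assumptions.

```python
from typing import Sequence, Tuple


def build_generator_example(
    concepts: Sequence[str],
    prototypes: Sequence[str],
    target: str,
    sep: str = " | ",
) -> Tuple[str, str]:
    """Concatenate concepts and retrieved prototypes into the model input."""
    source = sep.join([" ".join(concepts), *prototypes])
    return source, target


source, target = build_generator_example(
    ["trailer", "shirt", "side", "sit", "road"],
    ["Teenagers in matching shirts stand at the side of the road holding trash bags."],
    "a man in a shirt sits on the side of a trailer on the road.",
)
```

The same construction applies to both stages; only the retriever that supplies `prototypes` and the origin of `target` differ.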

Fine-tuning
At the fine-tuning stage, we use the trainable retriever to score the sentences in the candidate set $\mathcal{Z}$ and select the top-$k$ sentences as additional context $Z$. As in pre-training, RE-T5 takes the concatenation of the input concepts and the retrieved prototype sentences as input, and the target sentence as output.

Experimental Settings
Dataset CommonGen is a benchmark dataset designed to diagnose whether a model has the ability of generative commonsense reasoning (Lin et al., 2020). This dataset contains 32,651/993/1,497 concept sets for training/development/test, and the numbers of corresponding sentences are 67,389/4,018/7,644. We use BLEU (Papineni et al., 2002), ROUGE (Lin, 2004), METEOR (Banerjee and Lavie, 2005), CIDEr (Vedantam et al., 2015) and SPICE (Anderson et al., 2016) as evaluation metrics. Because SPICE correlates the most with human evaluation (Lin et al., 2020), we take SPICE as the primary metric.
External Corpora To be consistent with the distribution of the CommonGen dataset, we use VATEX, Activity (Krishna et al., 2017), SNLI (Bowman et al., 2015) and MNLI (Williams et al., 2018) as external corpora. We sample 500k sentences from these corpora to construct our pre-training dataset. These datasets are also used as the sentence pool for the retrieval module. For both pre-training and fine-tuning, sentences that appear in the CommonGen targets are not used as retrieval candidates.
Baselines We compare RE-T5 with several baseline systems. GPT-2, BERT-Gen, UniLM, BART, and T5 are pre-trained language models tested in (Lin et al., 2020). They are all fine-tuned on the CommonGen training set with concatenated concepts as input and the description sentence as output. EKI-BART (Fan et al., 2020) is a retrieve-and-generate framework for CommonGen that uses a simple retriever to enhance pre-trained BART (Lewis et al., 2020). KG-BART (Liu et al., 2021) augments BART with a knowledge graph on both the encoder and decoder sides and continues to pre-train BART with a masked concept token generation task. CALM (Zhou et al., 2021) designs several self-supervised strategies that encourage the model to focus on concept-centric information.
Table 3: An example of sentences retrieved by different retrievers and sentences generated based on them.
Concept Set: trailer shirt side sit road
T5: A man sits on the side of a trailer and a shirt.
Matching Retriever:
(1) Two guys in red shirts are sitting on chairs, by the side of the road, behind that open trailer.
(2) Two men, one wearing a straw cone hat, blue shirt, talking with a guy in a tan sunhat, red plaid shirt, both with baskets in front of them, sitting on the side of a dirt road.
(3) An older guy with a tan shirt and hat sitting on the side of a road with bricks all around him and a small green bowl on the side.
RE-T5 (matching retriever): a man in a tan shirt sits on the side of a road.
Trainable Retriever:
(1) Two guys in red shirts are sitting on chairs, by the side of the road, behind that open trailer.
(2) Teenagers in matching shirts stand at the side of the road holding trash bags.
(3) A man in a white shirt and black pants standing at the side or the road.
RE-T5 (trainable retriever): a man in a white shirt and black pants sits on the side of a trailer on the road.

Implementation Details
We adopt T5-base as the generation model and BERT-base as the trainable retriever in fine-tuning. We use Hugging Face Transformers (Wolf et al., 2020) for model implementation. For the pre-training phase, we use the AdamW optimizer (Loshchilov and Hutter, 2019) with an initial learning rate of 2e-6, weight decay of 0.01, Adam epsilon of 1e-6, and a warmup fraction of 0.01. The model is pre-trained for 3 epochs with a batch size of 16 and gradient accumulation over 4 batches. For fine-tuning, the models are optimized using AdamW with an initial learning rate of 5e-5, a batch size of 64, gradient accumulation over 3 batches and a warmup fraction of 0.01, and are trained for 20 epochs. The BERT-base scorer is optimized using AdamW with an initial learning rate of 2e-5 and a batch size of 64, and is trained for 3 epochs. For the number of retrieved sentences $k$, we experimentally choose 3. All experiments are conducted on 4 V100 GPUs with 32 GB memory.
Table 2 shows the results of different approaches on the CommonGen test set. RE-T5 outperforms all previous approaches by a large margin in all metrics and sets a new state of the art. RE-T5 combines the generation flexibility of pre-trained language models with the interpretability and modularity of a retrieval-based approach. Unlike EKI-BART (Fan et al., 2020) and KG-BART (Liu et al., 2021), RE-T5 achieves strong results without modifying the model architecture. It is worth noting that although the T5-base baseline does not perform as well as the BART (Lewis et al., 2020) baseline, our method still outperforms the two improved BART-based methods mentioned above. RE-T5 demonstrates that state-of-the-art performance requires neither model modification nor complex fusion of knowledge graphs; a simple and effective trainable retriever is sufficient.

Ablation Study
We conduct ablation experiments, as shown in Table 4. First, the RE-T5 model outperforms the backbone T5 model by a large margin in all metrics, with an improvement of 3.5 in the main metric SPICE. The second line of Table 4 shows that, although large-scale pre-trained language models have been shown to implicitly learn and store a substantial amount of world knowledge from massive text corpora (Petroni et al., 2019), sentences retrieved from external corpora can still explicitly expose scenario knowledge that describes the relationships between concepts. The third line indicates that further pre-training with data augmentation helps to improve the performance of the model. In addition, the last line demonstrates that a trainable scorer can capture more helpful knowledge for commonsense generation.
Example Analysis In the example in Table 3, the baseline T5 model generates a sentence without the concept "road", and the juxtaposition of "trailer" and "shirt" in this sentence is not in line with common sense. For both the matching retriever and the trainable retriever, the retrieved sentences remind the model not to forget the concept "road", in addition to providing the relationship between shirt and person. Because the matching retriever samples sentences according to the number of concepts they contain, it tends to retrieve longer sentences that cover as many concepts as possible. Such sentences may confuse the model and cause it to ignore some concepts; for example, the sentence generated by RE-T5 (matching retriever) is missing the concept "trailer". RE-T5 (trainable retriever) avoids these problems in this example and generates a sentence that is fluent and in line with common sense.

Conclusions
In this paper, we empirically investigated RE-T5, which uses a trainable retriever to retrieve sentences from external corpora in order to enhance the generative commonsense reasoning capability of pre-trained language models such as T5. The state-of-the-art result achieved by RE-T5 on the CommonGen benchmark demonstrates that a simple yet effective trainable retriever can be a useful addition to pre-trained language models for commonsense generation. For future work, we would like to explore extending this retrieval-based method to more tasks. In addition, we will try training a more advanced retrieval model to further improve the performance of commonsense generation.