TR at SemEval-2020 Task 4: Exploring the Limits of Language-model-based Common Sense Validation

In this paper, we present our submission for subtask A of the Common Sense Validation and Explanation (ComVE) shared task. We examine the ability of large-scale pre-trained language models to distinguish commonsense from non-commonsense statements. We also explore the utility of external resources that aim to supplement the world knowledge inherent in such language models, including commonsense knowledge graph embedding models, word concreteness ratings, and text-to-image generation models. We find that such resources provide insignificant gains to the performance of fine-tuned language models. We also provide a qualitative analysis of the limitations of the language model fine-tuned to this task.


Introduction
The task of assimilating general world knowledge from textual data for the purpose of commonsense reasoning and inference has been a long-standing challenge in natural language understanding (Davis, 1990; Schubert, 2002). A general approach to this problem that has gained recent popularity is the use of neural-network-based language models (LM) (Bengio et al., 2003). Such models, when trained on massive amounts of diverse text corpora, have been found to capture implicitly a remarkable amount of commonsense knowledge (Trinh and Le, 2018). Moreover, recent advances in training deep contextualized word representations using language-model-type objectives over large text corpora have substantially improved the state-of-the-art performance on a wide variety of natural language understanding tasks (Peters et al., 2018; Radford, 2018; Devlin et al., 2019; Raffel et al., 2019), including an even greater ability to capture commonsense knowledge (Zhou et al., 2019b; Porada et al., 2019).
Despite this success, a significant gap remains between the performance of such pre-trained LMs and human-level performance on commonsense reasoning tasks. In fact, there is evidence suggesting that current models fail to learn factual knowledge effectively (Poerner et al., 2019) and have difficulty with a variety of basic reasoning abilities necessary for commonsense inference (Talmor et al., 2019a; Kassner and Schütze, 2019). This motivates us to explore supplementary resources aimed at augmenting the world knowledge inherent in pre-trained LMs.
In this paper, we explore the ability of a state-of-the-art pre-trained LM to tackle the ComVE subtask A (Wang et al., 2020a). In particular, given a pair of sentences, we evaluate the model's ability to identify which sentence least agrees with common sense. We consider the model from both a language modeling and a supervised learning perspective. Moreover, we explore the utility of additional resources meant to augment the perceptual world knowledge of language models in solving this task. Finally, we provide a categorization of the types of sentences that our best-performing model struggles with.

Related Work
A variety of datasets have been created to examine a system's general commonsense knowledge and inference capability through such tasks as anaphora resolution (Levesque et al., 2012; Sakaguchi et al., 2019) and question answering (Talmor et al., 2019b; Huang et al., 2019; Zellers et al., 2018). Certain tasks have attempted to measure commonsense knowledge specifically pertaining to the physical (Bisk et al., 2020), temporal (Zhou et al., 2019a), and causal (Gordon et al., 2012) aspects of reasoning.

This work is licensed under a Creative Commons Attribution 4.0 International Licence. Licence details: http://creativecommons.org/licenses/by/4.0/.
Recent approaches to tackling these challenges have focused on methods to inject external knowledge into pre-trained LMs. The methods vary by the choice of knowledge sources and the training objective (Zhang et al., 2019; Lauscher et al., 2019; Levine et al., 2019; Peters et al., 2019; Xiong et al., 2019; He et al., 2019; Wang et al., 2020b). Complementary to such approaches, methods have been developed to automatically expand the coverage of existing commonsense knowledge bases (CKB) (Li et al., 2016; Saito et al., 2018; Bosselut et al., 2019; Zou, 2020). Due to the immense number of relations that would be needed to capture the wide variety of commonsense knowledge, He et al. (2020) proposed a method to "conceptualize" explicit relations into broader, more abstract concepts. In our work, we examine the benefit of using external resources such as CKB-enhanced word embeddings as simple, additional features to a model, rather than attempting to infuse the knowledge within the pre-trained LM. We also consider more perception-based resources, such as word concreteness ratings and text-to-image generation models.
System Descriptions

Masked language model
As a baseline system, we use the pre-trained RoBERTa LARGE model as a masked language model (MLM). For a given sentence s, we measure the probability of each token in s conditioned on all other tokens in the sentence. We then consider the average p_avg or minimum p_min token probability across the entire sentence. When evaluating the pair of sentences (s_1, s_2), the sentence with the higher probability is taken to be the commonsense statement. We denote this approach as MLM_avg and MLM_min when using p_avg and p_min, respectively, as the comparator.
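The scoring rule above can be sketched as follows. This is a minimal illustration, not the submitted system: `masked_prob` is a hypothetical callable standing in for the masked language model (it returns the probability of the original token at a masked position), and `toy_masked_prob` with its probability table is made-up data used only to make the sketch runnable.

```python
from statistics import mean
from typing import Callable, Sequence, Tuple

# Hypothetical interface (an assumption, not a real library API): given the
# token list and the index of the masked position, return the probability
# the model assigns to the original token at that position.
MaskedProb = Callable[[Sequence[str], int], float]


def sentence_scores(tokens: Sequence[str], masked_prob: MaskedProb) -> Tuple[float, float]:
    """Mask each token in turn and return (p_avg, p_min) for the sentence."""
    probs = [masked_prob(tokens, i) for i in range(len(tokens))]
    return mean(probs), min(probs)


def pick_commonsense(s1, s2, masked_prob: MaskedProb, use_min: bool = False) -> int:
    """Return 1 or 2: the sentence with the higher p_avg (or p_min) score."""
    idx = 1 if use_min else 0
    score1 = sentence_scores(s1, masked_prob)[idx]
    score2 = sentence_scores(s2, masked_prob)[idx]
    return 1 if score1 >= score2 else 2


# Toy stand-in for the MLM: made-up probabilities, for illustration only.
TOY_PROBS = {"orange": 0.05, "juice": 0.02, "milk": 0.60}


def toy_masked_prob(tokens: Sequence[str], i: int) -> float:
    return TOY_PROBS.get(tokens[i], 0.9)


s1 = "he poured orange juice on his cereal".split()
s2 = "he poured milk on his cereal".split()
choice_avg = pick_commonsense(s1, s2, toy_masked_prob)                # MLM_avg
choice_min = pick_commonsense(s1, s2, toy_masked_prob, use_min=True)  # MLM_min
```

With these toy numbers both comparators select the second sentence. In practice the callable would wrap a real masked LM, and tokens would be subword units rather than whitespace-split words.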

Feature-based model
The second system we consider uses a set of features that attempts to augment the MLM. In addition to using the difference Δp_avg (Δp_min) of the average (minimum) token probability scores between the pair of sentences, we build additional features that target specific differences between the two sentences and that leverage external knowledge resources.

Token difference perplexity
We consider specifically the subset of tokens that differ between s_1 and s_2. For example, in the pair of sentences (He poured orange juice on his cereal., He poured milk on his cereal.), we are interested primarily in the MLM probability for the words milk and orange juice. Since the difference can span multiple tokens in a given sentence, we consider the perplexity P_{{s_i} \ {s_j}} of the tokens remaining in sentence s_i after taking the setwise token difference with sentence s_j, where {s_k} is the set of tokens in s_k. We take as a feature the difference ΔP = P_{{s_1} \ {s_2}} − P_{{s_2} \ {s_1}} of the two perplexities.
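As a rough sketch of this feature, the snippet below computes the perplexity over the tokens of one sentence that do not occur in the other. It assumes the standard definition of perplexity (the exponential of the mean negative log-probability), which the text does not spell out, and it returns a neutral value of 1.0 when no tokens differ, which is also an assumption. The per-token probabilities are made-up values.

```python
import math
from typing import Sequence


def perplexity(probs: Sequence[float]) -> float:
    """exp(mean negative log-probability); assumed standard definition."""
    if not probs:
        return 1.0  # assumption: neutral value when no tokens differ
    return math.exp(-sum(math.log(p) for p in probs) / len(probs))


def diff_perplexity(tokens_i: Sequence[str], probs_i: Sequence[float],
                    tokens_j: Sequence[str]) -> float:
    """Perplexity over the tokens of s_i that are absent from s_j."""
    other = set(tokens_j)
    kept = [p for tok, p in zip(tokens_i, probs_i) if tok not in other]
    return perplexity(kept)


# Made-up per-token MLM probabilities, for illustration only.
s1 = "he poured orange juice on his cereal".split()
p1 = [0.9, 0.9, 0.05, 0.02, 0.9, 0.9, 0.9]
s2 = "he poured milk on his cereal".split()
p2 = [0.9, 0.9, 0.60, 0.9, 0.9, 0.9]

# Feature: difference of the two perplexities (s_1 minus s_2).
delta_P = diff_perplexity(s1, p1, s2) - diff_perplexity(s2, p2, s1)
```

Here the differing tokens are orange and juice in the first sentence and milk in the second, so a positive `delta_P` indicates that the first sentence's distinguishing tokens are the less expected ones.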

Concrete word similarity
Concreteness, the degree to which a concept refers to a perceptible entity, has been one of the most extensively studied variables in the field of psycholinguistics. For example, it has been observed that abstract and concrete concepts are represented and processed differently in the brain (Crutch and Warrington, 2004) and exhibit differences in terms of recall/memory (Walker and Hulme, 1999; Allen and Hulme, 2006; Romani et al., 2008; Miller and Roodenrys, 2009) and word association capability (Groot, 1989). Concrete concepts are an important aspect of commonsense knowledge and can, a priori, be particularly challenging for a model trained purely on textual data to learn about and represent effectively.
There have been multiple efforts to collect concreteness ratings for a large vocabulary (Paivio et al., 1968; Wilson and Division, 1997). Most recently, Brysbaert et al. (2014) collected a dataset of concreteness ratings for about 40K words and two-word phrases. We use this collection and consider all words and phrases as concrete if they have an average concreteness rating of at least 4.0, which reduces the list to about 9200 words and phrases.
The ConceptNet Numberbatch (Speer et al., 2017) word embedding set combines the commonsense knowledge captured by the ConceptNet knowledge graph with existing word embedding sets learned through distributional semantics, such as word2vec and GloVe. We leverage these embeddings in the following way. For a given sentence, we measure the average cosine distance d_conc over all pairs of concrete words within the sentence. The intuition is that a commonsense statement should, at a minimum, have concrete words that are more semantically related. Continuing with the example given in Section 3.2.1, the concrete words in the two sentences (He poured orange juice on his cereal., He poured milk on his cereal.) include juice and cereal in the first and milk and cereal in the second. One expects the words milk and cereal to be closer in embedding space than juice and cereal, not merely because the former two words appear together more often than the latter, but also because of the ConceptNet relation ReceivesAction(milk, eaten with cereal). We use as a feature the difference in d_conc values between the two sentences.
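The d_conc feature can be illustrated with the sketch below. The two-dimensional vectors and the concrete-word set are toy stand-ins (not Numberbatch embeddings or the Brysbaert et al. ratings), the sketch handles single-word entries only, and returning 0.0 when a sentence has fewer than two concrete words is an assumption.

```python
import math
from itertools import combinations
from typing import Dict, Sequence, Set


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """1 minus the cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def avg_concrete_distance(tokens: Sequence[str], concrete: Set[str],
                          emb: Dict[str, Sequence[float]]) -> float:
    """Average cosine distance over all pairs of distinct concrete words."""
    words = sorted({t for t in tokens if t in concrete and t in emb})
    pairs = list(combinations(words, 2))
    if not pairs:
        return 0.0  # assumption: no pair of concrete words to compare
    return sum(cosine_distance(emb[a], emb[b]) for a, b in pairs) / len(pairs)


# Toy embeddings and concrete-word list, for illustration only.
TOY_EMB = {"milk": [1.0, 0.2], "cereal": [0.9, 0.3], "juice": [0.1, 1.0]}
CONCRETE = {"milk", "cereal", "juice"}

s1 = "he poured orange juice on his cereal".split()
s2 = "he poured milk on his cereal".split()
d1 = avg_concrete_distance(s1, CONCRETE, TOY_EMB)
d2 = avg_concrete_distance(s2, CONCRETE, TOY_EMB)
feature = d1 - d2  # positive when the concrete words of s1 are less related
```

With these toy vectors milk lies closer to cereal than juice does, so the first sentence receives the larger distance.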

Text-to-image generation
In addition to purely text-based resources, we explore the utility of incorporating world knowledge and common sense through visual perception. Specifically, we consider a state-of-the-art text-to-image (TTI) generation model that has been trained over a large corpus of captioned images. Descriptions of real-world images are a form of text that may be under-represented in the corpora that have been used for LM pre-training (e.g. stories, news and encyclopedia articles). More importantly, images contain commonsense information about the physical world that is often not explicitly stated in text (e.g. the relative sizes of objects). The intuition is that a TTI model would find it more difficult to generate an image for sentences that defy the constraints of its world model compared to the sentences that obey common sense. For example, consider the pair of sentences (He picked up a cup of orange juice., He picked up a cup of an elephant.). A useful TTI model would be able to generate an image of a cup given the first sentence more easily than when given the second sentence. A recently proposed measure of generated image quality is the Semantic Object Accuracy (SOA) (Hinz et al., 2019). In essence, the objects that are mentioned in the sentence ought to be present in the image, as determined by an appropriate object detector. We therefore consider the following approach:
1. Given TTI model φ and sentence s, generate N images for the sentence: φ(s)_1, φ(s)_2, ..., φ(s)_N.
2. Compute the object-category-averaged SOA over the generated images,

\[ \text{SOA-C}(s) = \frac{1}{C} \sum_{c=1}^{C} \frac{1}{N} \sum_{i=1}^{N} I_Y\big(\phi(s)_i, c\big), \]

where C is the number of object classes present in the sentence, and I_Y is the indicator function for whether image φ(s)_i contains an object of class c as determined by the object detection model Y.
In the computation of SOA-C, we restrict C to the objects that are common to both sentences so that differences in the object class generation ability of the TTI model are factored out. The sentence with the higher value of SOA-C would be considered the more commonsense statement. We use as a feature the difference Δ_SOA-C in SOA-C scores between the two sentences.
There are, however, significant limitations to the above approach. Crucially, a TTI model is limited by its training corpus in the types of objects that it can reasonably generate. Currently, most TTI models are trained over the Microsoft Common Objects in Context (COCO) dataset (Lin et al., 2014), which covers a set of only 80 object categories. In the ComVE task, only about 10% of sentence pairs have a common COCO object in both sentences. Moreover, the generation capability of the model depends strongly on the object class. For example, cats and dogs can be generated with much better accuracy than cups and bottles. Nevertheless, we consider the usefulness of this feature for this small subset of sentence pairs.
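The SOA-C computation described above amounts to averaging, over object classes and generated images, an indicator of whether the detector found the class in the image. The sketch below assumes that image generation and object detection have already been run; each image is represented by the set of class names the detector reported. The detection sets are hypothetical and only loosely modelled on the dog example.

```python
from typing import Sequence, Set


def soa_c(detections: Sequence[Set[str]], classes: Set[str]) -> float:
    """Class-averaged fraction of generated images in which each class is detected.

    detections: one set of detected class names per generated image (N sets).
    classes:    the object classes to score (here, those common to both sentences).
    """
    if not classes or not detections:
        return 0.0  # assumption: the score is zero when nothing can be evaluated
    n = len(detections)
    per_class = [sum(1 for found in detections if c in found) / n
                 for c in sorted(classes)]
    return sum(per_class) / len(per_class)


# Hypothetical detector outputs for N = 3 images per sentence.
common_classes = {"dog"}
det_s1 = [{"dog", "cat"}, set(), {"cat"}]  # a dog is detected in one of three images
det_s2 = [set(), set(), set()]             # no dog is detected in any image

soa_1 = soa_c(det_s1, common_classes)
soa_2 = soa_c(det_s2, common_classes)
delta_soa_c = soa_1 - soa_2  # the feature: difference in SOA-C scores
```

With these inputs the scores are 1/3 and 0, so the first sentence would be preferred.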
In our experiments, we use the DM-GAN network pretrained over the COCO dataset as the TTI model φ (Zhu et al., 2019) and generate N = 3 images per sentence. For the object detection model Y, we use the YOLOv3 model trained over the same dataset (Redmon and Farhadi, 2018). Figure 1 shows two generated images for a sample sentence pair.

Figure 1: Sample generated images for The cat ran away from the dog (left) and The house ran away from the dog (right). In the left image, both a dog and a cat were identified by the object detector. The SOA-C scores for the two sentences were 1/3 and 0, respectively (N = 3, C = 1 (dog)).

GBDT Model
The above features are combined with a Gradient Boosted Decision Tree (GBDT) trained using the XGBoost library (Chen and Guestrin, 2016). Figure 2 illustrates the relative feature importance of the model using SHAP values (Lundberg and Lee, 2017; Lundberg et al., 2020). The most important features are the difference in p_min scores and the difference in perplexity scores of differing tokens. As can be seen, the difference in SOA-C scores was an ineffective feature. Even within the 10% of sentence pairs that contain a common COCO object, the difference in SOA-C scores exhibited a negligible correlation (< 0.05 Pearson coefficient) with the prediction label. Thus, for this particular task, this feature had no impact on the final test set predictions of the model.
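For reference, the kind of correlation check mentioned above can be sketched with a plain Pearson coefficient between a feature column and the labels. The feature values and labels below are made up for illustration and do not reproduce the reported figure. Returning 0.0 for a constant column is an assumption of this sketch.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: treat a constant column as uncorrelated
    return cov / math.sqrt(var_x * var_y)


# Made-up SOA-C differences and binary labels, for illustration only.
delta_soa = [0.0, 1 / 3, 0.0, -1 / 3, 0.0, 0.0]
labels = [0, 1, 1, 0, 0, 1]
r = pearson(delta_soa, labels)
```

A coefficient near zero on real data would suggest, as reported above, that the feature carries little linear signal about the label.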

Fine-tuned LM
The final approach we consider is fine-tuning a pre-trained LM for this particular task. We use again the RoBERTa LARGE pre-trained model for this purpose. Each sentence pair (s_1, s_2) is fed directly into the model in the same manner as any other type of downstream sentence-pair task. The output of the special [CLS] input token is fed into a multilayer perceptron for classification. We also consider a second GBDT model (referred to below as the GBDT-RoBERTa model) that uses the fine-tuned RoBERTa model's classification score as an additional feature alongside the features described above.

Table 1 summarizes the results of the various approaches and compares them with the performance of the top submission for the task. Overall, we find that the simple MLM-based approaches achieve a reasonable baseline performance on this task. The addition of targeted features in the GBDT model gives an improvement of several percentage points on top of the best baseline. The fine-tuned RoBERTa model shows a substantial performance gain over the GBDT model. Finally, the GBDT-RoBERTa model achieves a marginally higher score on the test set than the fine-tuned RoBERTa model alone.

To provide more insight into the errors the fine-tuned RoBERTa model makes, we categorize the incorrectly labeled sentence pairs depending on the key difference between the sentences. We consider the following broad types:

• Quantity: Statements that require an assessment of numeric values.
• Physical perception: Statements requiring perceptual knowledge (e.g. the relative size of objects) or general science knowledge (e.g. every person has a heart).
• Temporal perception: Statements dealing with the duration of events, temporal ordering/causation, or other time-related knowledge (e.g. when an event occurs).
• Definition: Statements that require knowledge of the definition of a particular word.
• Negation: One of the statements contains a negation.
• Data quality: Sentence pairs that were found to be grammatically awkward or where it was not immediately obvious which statement was the commonsense statement.

Table 2 gives a breakdown of the proportion of errors found within the development set. We find that the bulk of the errors required some aspect of physical or temporal world knowledge. We also compare several characteristics between the set of correct predictions S and the set of incorrect predictions S̄ within the development set. In terms of the negation category, the proportion of sentence pairs for which at least one sentence contains a "not" lemma was substantially larger in S̄ (12.8%) than in S (2.6%), indicating some difficulty in interpreting negation statements. In terms of concept concreteness, we find that both sets show a similar distribution in the average number of unique concrete terms (2.4 in S vs 2.1 in S̄). This suggests that the model had no particular difficulty in representing commonsense knowledge involving concrete terms. Finally, in terms of the lexical similarity of the sentence pair, we find that incorrect predictions exhibited a slightly higher average size of the setwise symmetric difference of tokens ({s_1} △ {s_2}) between the two sentences than correct predictions (3.6 in S vs 4.4 in S̄).
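The two lexical statistics used in this comparison can be sketched as below. The sketch splits on whitespace and matches the surface token "not", which is a simplification of the lemma-based check described above; the function names are hypothetical.

```python
from typing import Sequence


def symmetric_difference_size(tokens_1: Sequence[str], tokens_2: Sequence[str]) -> int:
    """Number of tokens that occur in exactly one of the two sentences."""
    return len(set(tokens_1) ^ set(tokens_2))


def has_negation(tokens: Sequence[str]) -> bool:
    """Simplified check: surface-token match instead of lemmatization."""
    return "not" in tokens


s1 = "trees can sometimes live in saltwater".split()
s2 = "trees can not live on the ground".split()

n_diff = symmetric_difference_size(s1, s2)
pair_has_negation = has_negation(s1) or has_negation(s2)
```

Averaging `n_diff` over the correct and the incorrect predictions separately gives the kind of comparison reported above.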

Category | Proportion | Example (s_1, s_2)
Physical | 38% | (Babies are born naked., Babies are born with clothes on.)
Temporal | 19% | (He lived without Sara for about a year., He lived without food for about a year.)
Quantity | 9% | (You have three fingers on one hand., You have five fingers on one hand.)
Negation | 13% | (Trees can sometimes live in saltwater., Trees can not live on the ground.)
Definition | 11% | (Pork comes from cows., Pork comes from pigs.)
Data Quality | 10% | (A bald man brushed his hair every day., A bald man washed his hair every day.)

Table 2: Proportion of question categories in the set of incorrect predictions S̄ for the RoBERTa fine-tuned model.

Conclusion and Future Work
In this paper, we evaluated the performance of a state-of-the-art pre-trained LM on the task of common sense validation. We also explored the usefulness of external resources meant to supplement the implicit commonsense knowledge of the LM. We found that a subset of these resources provides value to a system relying only on LM probabilities, but gives negligible improvement to an LM fine-tuned to the task. Further experiments are needed to evaluate the performance of pre-trained LMs on this task when they are explicitly adapted with external commonsense knowledge bases. Additionally, we leave as future work the study of TTI models with improved object coverage and generation quality that may eventually add value to the subspace of common sense validation that depends significantly on visual world knowledge.