On Commonsense Cues in BERT for Solving Commonsense Tasks

BERT has been used for solving commonsense tasks such as CommonsenseQA. While prior research has found that BERT does contain commonsense information to some extent, there has been work showing that pre-trained models can rely on spurious associations (e.g., data bias) rather than key cues when solving sentiment classification and other problems. We quantitatively investigate the presence of structural commonsense cues in BERT when solving commonsense tasks, and the importance of such cues for the model prediction. Using two different measures, we find that BERT does use relevant knowledge for solving the task, and that the presence of commonsense knowledge is positively correlated with model accuracy.

Recently, there has been some debate about whether commonsense knowledge can be learned by a language model trained on large corpora. While Davison et al. (2019), Bosselut et al. (2019) and Rajani et al. (2019) argue that pre-trained language models can directly identify commonsense facts, Lin et al. (2019) and Klein and Nabi (2019) believe that structured commonsense knowledge is not captured well.
Pre-trained language models have achieved empirical success when fine-tuned on specific commonsense tasks such as COSMOS QA (Huang et al., 2019), SWAG (Zellers et al., 2018), and CommonsenseQA (Talmor et al., 2019).

Figure 1: Two methods used to study structured commonsense knowledge in a pre-trained Transformer. The commonsense link is drawn from the target concept (answer concept) to the source concept (question concept).

One possible reason for the high performance is that there exist superficial cues or spurious associations in the dataset, which enable models to answer questions without understanding the task (Niven and Kao, 2019; Kaushik et al., 2020). For example, a model can pick the spurious cue word "meadow" as a feature for positive reviews simply because "meadow" occurs frequently in positive documents. It remains an interesting research question whether commonsense knowledge plays a central role among the statistical cues that BERT uses when solving commonsense tasks. In other words, we are interested in investigating whether BERT solves commonsense tasks using commonsense knowledge.
We provide quantitative answers mainly using the CommonsenseQA dataset, which asks a model to solve a multiple-choice problem. As shown in Figure 1, given a question and five candidate answers, a model should select one candidate answer as the output. Current state-of-the-art pre-trained language models solve the problem by representing the question jointly with each candidate answer (we call such a question-answer pair a sentence hereafter), using the pre-trained language model as the main encoder. Each sentence is scored based on a sentence-level hidden vector, and the candidate answer corresponding to the highest-scored sentence is taken as the output.
We investigate the presence of commonsense cues in the BERT representation of a sentence by examining commonsense links from the answer concept to its related contextualized question concept. Figure 2 shows an example, where the question concept is "bird" and the correct answer is the answer concept connected through an ATLOCATION link in the CONCEPTNET knowledge graph. Such related concepts are not explicitly used in a BERT model for making predictions, and therefore their strength in the BERT representation is not necessarily optimized during task fine-tuning. We call such cues structured commonsense, which is a source of knowledge that we can explicitly measure. We take two methods for measuring structured commonsense in BERT: directly measuring the attention weights (Clark et al., 2019), and measuring attribution scores by considering gradients (Mudrakarta et al., 2018).
We conduct two sets of experiments to quantitatively measure commonsense links in different situations. In the first set, we examine the presence of commonsense links directly in the BERT representation, both before and after fine-tuning (Section 5). In the second set, we investigate the correlation between commonsense links and model predictions (Section 6). While the former can serve as a probing task for understanding commonsense learned during pre-training, the latter can serve as a means for understanding whether a model learns to make better use of commonsense knowledge through supervised fine-tuning.
Results suggest that BERT does acquire commonsense knowledge from pre-training, just as it acquires syntactic and word sense information. In addition, through fine-tuning, BERT relies more on commonsense cues in making predictions. The evidence is quantitatively demonstrated by stronger commonsense links in the representation, and a salient correlation between model predictions and commonsense link strengths, despite the fact that neither the answer concept nor the related question concept in a commonsense link is directly connected to the output layer. Interestingly, results also indicate that the stronger the structured commonsense knowledge is, the more accurate the model is. In addition to the CommonsenseQA dataset, we observe similar phenomena on Wikipedia and OMCS, demonstrating the generality of our findings. To our knowledge, we are the first to investigate key statistical cues when BERT solves the CommonsenseQA task, providing several pieces of evidence that commonsense knowledge is indeed made use of. We release our code at https://github.com/Nealcly/commonsense.

Related Work
There has been much recent work exploring the underlying knowledge embedded in BERT representations. Peters et al. (2018) find that lower and higher layers in ELMo contain more syntactic and semantic information, respectively. Tenney et al. (2019) show that BERT represents the steps of a traditional NLP pipeline, with syntactic information captured earlier than semantic information. Our methods are closest to those of Clark et al. (2019), who focus on attention heads. The difference lies in that our primary goal is to investigate what information is learned and made use of when solving commonsense tasks; our investigation is therefore task-centered. There has also been work investigating data bias and spurious associations. Gururangan et al. (2018) and Poliak et al. (2018) show that classifiers achieve accuracies around 69% on SNLI (Bowman et al., 2015) using only partial input. Kaushik et al. (2020) demonstrate that BERT solves sentiment analysis and NLI by relying heavily on spurious associations. Our work is in line with these studies in investigating statistical cues. Different from the above investigations, we use probing methods to verify the presence and importance of the key feature, namely commonsense knowledge, in solving commonsense QA, rather than focusing on adversarial cases.
Commonsense reasoning is a challenging task in natural language processing. Traditional methods rely heavily on hand-crafted features (Rahman and Ng, 2012; Bailey et al., 2015) and external knowledge bases (Schüller, 2014). With recent advances in deep learning, pre-trained language models have become a powerful tool for such tasks. Trinh and Le (2018) use a pre-trained language model to score candidate sentences for Pronoun Disambiguation and the Winograd Schema Challenge (Levesque et al., 2012). Klein and Nabi (2020) use a sentence-level loss to enhance commonsense knowledge in BERT. Mao et al. (2019) demonstrate that pre-trained language models fine-tuned on SWAG (Zellers et al., 2018) are able to provide commonsense grounding for story generation. For commonsense question answering, pre-trained language models with fine-tuning give the state-of-the-art performance (Zellers et al., 2018; Huang et al., 2019; Talmor et al., 2019). Though the above work shows the usefulness of BERT on commonsense tasks, little work has been done investigating the mechanism by which BERT solves these tasks. Our work thus complements existing research in this respect.
There is also a line of work leveraging CONCEPTNET to enhance models' commonsense reasoning ability (e.g., Lin et al., 2019).

Task and Model
We review the main experimental dataset, CommonsenseQA (Section 3.1), before describing the structure of a state-of-the-art model (Section 3.2).

Dataset
CommonsenseQA (Talmor et al., 2019) is a multiple-choice question answering dataset constructed based on the knowledge graph CONCEPTNET (Speer et al., 2017), which is composed of a large set of triples of the form (source concept, relation, target concept), such as (BIRD, ATLOCATION, COUNTRYSIDE). Given the source concept BIRD and the relation type ATLOCATION, there are three related target concepts in CONCEPTNET: CAGE, WINDOWSILL and COUNTRYSIDE.
As shown in Figure 2, in the development of the CommonsenseQA dataset, crowd-workers are asked to generate a question and candidate answers based on a source concept and its three related target concepts in CONCEPTNET. Following Talmor et al. (2019), we call the source concept in the question the question concept, and the target concept in an answer the answer concept. Each question corresponds to only one correct answer concept among the three related CONCEPTNET target concepts. In addition, two more incorrect answer concepts are added, which are not connected to the question concept in CONCEPTNET, resulting in 5 candidate answers for each question. We define a commonsense link as the link from the answer concept to the question concept.
The CommonsenseQA dataset is designed to avoid salient bias in surface patterns. First, the lexical overlap between the correct answer and the question is similar to that between the question and the incorrect candidates. Second, commonsense links are not superficial patterns that can be learned from the training data: only 15.78% of the answer-question-concept pairs in the test examples also appear in the gold training examples, which suggests that strong commonsense links, if observed, come mainly from the pre-trained BERT model itself.
In order to analyze implicit structured commonsense knowledge, which is based on the link from the answer concept to the question concept, we filter out questions that do not contain an explicit mention of the question concept in its CONCEPTNET surface form (e.g., a paraphrase). The resulting dataset, CommonsenseQA*, contains 74 fewer instances.
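The filtering step can be sketched as a simple surface-form match. The field names (`question`, `question_concept`) and the case-insensitive matching rule are our own illustrative assumptions, not the paper's exact implementation.

```python
def has_explicit_mention(question, concept):
    """True if the concept's surface form appears verbatim
    (case-insensitive) in the question text."""
    return concept.lower().replace("_", " ") in question.lower()

def filter_explicit(instances):
    """Keep only instances whose question explicitly mentions the
    question concept in its CONCEPTNET surface form."""
    return [ex for ex in instances
            if has_explicit_mention(ex["question"], ex["question_concept"])]

examples = [
    {"question": "Where would a bird most likely live?", "question_concept": "bird"},
    {"question": "Where do feathered animals live?", "question_concept": "bird"},
]
kept = filter_explicit(examples)  # the paraphrased second question is dropped
```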

Model
We adopt the method of Talmor et al. (2019), using BERT (Devlin et al., 2019). In particular, given a question q and 5 candidate answers a_1, ..., a_5, we concatenate the question with each answer to obtain 5 question-answer pair sequences (i.e., sentences) s_1, ..., s_5, respectively. In each sentence, we place a special [CLS] symbol at the beginning, a [SEP] symbol between the question and the candidate answer, and a [SEP] symbol at the end.
BERT uses L stacked Transformer layers (Vaswani et al., 2017) to encode each sentence. The last-layer hidden state of the [CLS] token is used for linear scoring with softmax normalization, and the candidate among s_1, ..., s_5 with the highest score is chosen as the prediction. More details of our implementation are given in Appendix C.
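The scoring scheme above can be sketched as follows. The `score_fn` stand-in takes the place of BERT's [CLS]-based linear scorer, and the whitespace-level tokens replace WordPiece tokenization; both are simplifying assumptions.

```python
def build_sentence(question_tokens, answer_tokens):
    """[CLS] question [SEP] answer [SEP], as described above."""
    return ["[CLS]"] + question_tokens + ["[SEP]"] + answer_tokens + ["[SEP]"]

def predict(question_tokens, candidate_answers, score_fn):
    """Score each question-answer sentence with score_fn and return
    the 0-based index of the highest-scoring candidate."""
    sentences = [build_sentence(question_tokens, a) for a in candidate_answers]
    scores = [score_fn(s) for s in sentences]
    return max(range(len(scores)), key=scores.__getitem__)

# Toy usage with a length-based stand-in scorer (BERT scores [CLS] instead)
question = ["where", "do", "birds", "live"]
candidates = [["cage"], ["the", "countryside"]]
pred = predict(question, candidates, score_fn=len)
```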

Analysis Methods
As mentioned earlier, we analyze commonsense links using the attention weight (Clark et al., 2019) and the corresponding attribution score (Sundararajan et al., 2017; Mudrakarta et al., 2018). We report results from one random execution of each experiment. We additionally ran each experiment five times and found that the variation in results is small (Appendix B).

Attention Weights
Given a sentence, attention weights in a Transformer can be viewed as the relative importance weight between each token and the other tokens when producing the next-layer representation (Kovaleva et al., 2019; Vashishth et al., 2020). In particular, given a sequence of input vectors H = [h_1, h_2, ..., h_{|H|}], self-attention uses each vector as a query to retrieve all context vectors in H, yielding a matrix of attention weights α ∈ R^{|H|×|H|}.
The value of α is computed using the scaled dot-product of the query representation Q = W_Q H and the key representation K = W_K H, followed by softmax normalization:

α = softmax(Q^T K / √d_k),

where d_k is the dimension of the key vectors. α_{i,j} represents the attention strength from h_i to h_j. For multi-head attention, H is linearly projected T times to obtain T sets of queries, keys and values, where T is the number of heads. The attention operation of each head is performed in parallel, with the results being concatenated. We use α^{m,n} to denote the attention weights of the n-th attention head in the m-th layer. The attention weights α^{m,n} are used as a first measure of commonsense link strengths.
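A single attention head's weight matrix α can be sketched in NumPy as below; the row-vector convention (tokens as rows) differs superficially from the column convention in the text, but the computation is the same.

```python
import numpy as np

def attention_weights(H, W_Q, W_K):
    """One head's alpha = softmax(Q K^T / sqrt(d_k)); each row of the
    returned |H| x |H| matrix sums to 1."""
    Q, K = H @ W_Q, H @ W_K
    d_k = K.shape[-1]
    logits = Q @ K.T / np.sqrt(d_k)
    logits -= logits.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(logits)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
H = rng.normal(size=(5, 8))    # 5 tokens, hidden size 8
W_Q = rng.normal(size=(8, 4))  # d_k = 4
W_K = rng.normal(size=(8, 4))
alpha = attention_weights(H, W_Q, W_K)  # alpha[i, j]: strength from h_i to h_j
```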

Attribution Scores
Kobayashi et al. (2020) point out that analyzing only attention weights can be insufficient for investigating the behavior of attention heads. As a supplement, gradient-based feature attribution methods have been studied to interpret the contribution of each input feature to the model prediction in back-propagation (Baehrens et al., 2010;Mudrakarta et al., 2018;Hao et al., 2020). Analysis of both attention weights and the corresponding attribution scores allows us to more comprehensively understand commonsense links in BERT.
We employ an attribution technique called Integrated Gradients (Sundararajan et al., 2017). Intuitively, integrated gradients works by simulating the process of pruning a specific attention head (moving from the original attention weights α to a zero vector α′), and calculating the integrated gradients in back-propagation. The attribution score directly reflects how much a change of attention weights affects model outputs; a higher attribution score indicates a more important individual attention weight. Suppose that F(x) represents the BERT model output for CommonsenseQA given an input x. The attribution of attention head α^t, t ∈ [1, ..., T], in a Transformer layer is computed by comparing with a set of baseline weights α′:

Atr(α^t) = (α^t − α′^t) ⊙ ∫_{x=0}^{1} [∂F(α′ + x(α − α′)) / ∂α^t] dx,

where ⊙ is element-wise multiplication. The interpolated input α′ + x(α − α′) is closer to α′ when x is closer to 0, and closer to α when x is closer to 1; therefore, ∫_{x=0}^{1} [∂F(α′ + x(α − α′)) / ∂α^t] dx gives the gradient amortized over all values of x. Atr(α^t) ∈ R^{n×n} denotes the attribution score corresponding to the attention weights α^t, and Atr(α^t_{i,j}) represents the interaction from token h_i to h_j. We set the uninformative baseline α′ to the zero vector. Following Sundararajan et al. (2017), we approximate Atr(α^t) via a gradient summation function:

Atr̃(α^t) = (α^t − α′^t)/s ⊙ Σ_{k=1}^{s} ∂F(α′ + (k/s)(α − α′)) / ∂α^t,

where s is the number of approximation steps for computing integrated gradients. We set s to 20 based on empirical results.
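The step-summation approximation can be sketched as below. For a linear model the approximation is exact, which makes it easy to sanity-check: with a zero baseline the attribution of input a is the element-wise product a * w, and attributions sum to F(a) − F(baseline) (the completeness property).

```python
import numpy as np

def integrated_gradients(grad_F, alpha, baseline, steps=20):
    """Approximate IG: (alpha - baseline) times the mean of grad F
    evaluated at `steps` points interpolated between baseline and alpha."""
    delta = alpha - baseline
    total = np.zeros_like(alpha)
    for k in range(1, steps + 1):
        total += grad_F(baseline + (k / steps) * delta)
    return delta * total / steps

# Sanity check on a linear "model" F(a) = w . a, where IG is exact
w = np.array([1.0, -2.0, 3.0])
grad_F = lambda a: w           # gradient of a linear map is constant
a = np.array([0.5, 1.0, 2.0])
attr = integrated_gradients(grad_F, a, np.zeros_like(a))
```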

The Presence of Knowledge
We first conduct a set of experiments to investigate commonsense link strengths in BERT representations of question-answer pairs (i.e., sentences). Intuitively, if the link weight from the answer concept to the question concept is higher than those from the answer concept to other question words, then we have evidence of BERT using commonsense cues according to CONCEPTNET. As mentioned earlier, it is the representation of the [CLS] token, rather than the question concept, that is directly connected to the output layer for candidate scoring. Hence there is no direct supervision signal from the output layer to the link weight during fine-tuning, and better prediction does not necessarily indicate strong commonsense links.

Probing Task
Without loss of generality, we refer to both the attention weights in Section 4.1 and the attribution scores in Section 4.2 as link weights. We evaluate link weights by finding the most associated word (MAW), namely the question word that receives the maximum link weight from the answer concept among all question words. MAW is measured for each individual attention head in each layer. Denote the hidden states of the whole question, the question concept and the answer concept as [h_1, ..., h_{|q|}], [h_{b_s}, ..., h_{e_s}] and [h_{b_t}, ..., h_{e_t}], respectively. We take two different measures of MAW accuracy: the average accuracy over all attention heads, and the accuracy of the most-accurate head. Previous work probing syntactic information from attention heads takes the second measure (Clark et al., 2019; Htut et al., 2019). We additionally measure the average in order to comprehensively evaluate the prevalence of commonsense cues in BERT.
The average MAW accuracy is measured by:

Acc_avg = (1 / (12 × 12 × D)) Σ_{d=1}^{D} Σ_{m=1}^{12} Σ_{n=1}^{12} 1[MAW^{d}_{m,n} ∈ question concept].

The maximum MAW accuracy is measured by:

Acc_max = max_{m,n} (1/D) Σ_{d=1}^{D} 1[MAW^{d}_{m,n} ∈ question concept],

where D represents the number of instances for evaluation and 1[·] is the indicator function. In theory, if link weights for each attention head were randomly distributed, the average and maximum MAW accuracies should both equal the chance rate of the question concept receiving the maximum link weight among all question words, which reflects the fact that the representation contains no explicit correlation between the answer concept and its related question concept. In contrast, MAW accuracies significantly above this baseline indicate that commonsense knowledge is contained in the representation.
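The two MAW accuracies can be sketched as below, assuming the per-head link weights from the answer concept to each question token have already been extracted into an array (the array layout is our own choice).

```python
import numpy as np

def maw_accuracies(link_weights, concept_pos):
    """link_weights: (D, L, H, Q) link weights from the answer concept to
    each of Q question tokens, per instance, layer and head.
    concept_pos: per-instance sets of question-concept token positions.
    Returns (average accuracy over all L*H heads, best single head)."""
    D = link_weights.shape[0]
    maw = link_weights.argmax(axis=-1)          # MAW position per head
    hits = np.stack([np.isin(maw[d], list(concept_pos[d]))
                     for d in range(D)])        # (D, L, H) correctness
    per_head = hits.mean(axis=0)                # accuracy of each head
    return per_head.mean(), per_head.max()

rng = np.random.default_rng(1)
lw = rng.random((4, 2, 3, 6))  # 4 instances, 2 layers, 3 heads, 6 tokens
lw[:, 0, 0, 2] = 10.0          # head (0, 0) always points at position 2
avg_acc, max_acc = maw_accuracies(lw, [{2}] * 4)
```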

Results
The results for off-the-shelf BERT (BERT) and a BERT model fine-tuned on CommonsenseQA (BERT-FT) are shown in the first row of Table 1. First, looking at the original non-fine-tuned BERT, the maximum MAW accuracy of each layer significantly outperforms the random baseline (p ≤ 0.01 using t-test; similar for subsequent significance tests). This shows that commonsense links are a part of BERT's representation of a sentence in general, just as syntactic (Goldberg, 2019) and semantic (Liu et al., 2019a) knowledge. Second, BERT-FT outperforms BERT in terms of both the average MAW accuracy and the maximum MAW accuracy, with a relatively large boost in the average MAW accuracy, which shows that structured commonsense features are enhanced by supervised training on commonsense tasks. We explore the best-performing attention heads for each relation type in Table 2, finding that certain attention heads capture specific commonsense relations. No single attention head does well for all relation types, either with or without fine-tuning, which is similar to previous findings for syntactic heads (Raganato and Tiedemann, 2018; Clark et al., 2019).
To further differentiate commonsense cues from superficial associations, we measure the co-occurrence between each question word and the answer concept in 1 million English Wikipedia documents. For only 1.85% of answer concepts is the question concept among the highest co-occurring words, which partly shows that the strong commonsense links do not rely heavily on superficial patterns.

Additional Datasets
Since this set of experiments concerns the representation only, we use two additional unlabeled corpora besides CommonsenseQA. In particular, we extract sentences from Open Mind Common Sense (OMCS), the source corpus of CONCEPTNET, and from Wikipedia, keeping a sentence if it contains exactly one source-target concept pair, yielding two large-scale datasets. Detailed statistics are shown in Table 1. The results are consistent with those on the CommonsenseQA dataset, which shows the generalization ability of our methods.

Correlating Knowledge with the Task
We further conduct a set of experiments to measure the correlation between commonsense links and model predictions. The goal is to investigate how BERT makes use of commonsense knowledge when making a prediction in the CommonsenseQA task. In particular, we compare the link weights across the five answer candidates for the same question, and find the candidate that is most associated with the relevant question concept. This candidate is called the most associated target (MAT). Correlations are drawn between MATs and the model prediction for each question. Intuitively, the more the MATs correlate with the model predictions, the more evidence we have that the model makes use of commonsense cues in making predictions.
Both attention weights and the corresponding attribution scores are used, because we are now considering model predictions, for which gradients play a role and can be measured. For all experiments, the trend of attribution scores is consistent with that of attention weights.

Probing Tasks
Formally, given a question q and 5 candidate answers a_1, ..., a_5, we make comparisons across the five candidate sentences s_1, ..., s_5. In each candidate sentence, we calculate the link weight from the answer concept to the question concept according to CONCEPTNET. Denote the hidden states of the question concept and the answer concept as [h_{b_s}, ..., h_{e_s}] and [h_{b_t}, ..., h_{e_t}], respectively. The link weight of the answer-question-concept pair (α_a2q) is the average link weight between each answer concept token and each question concept token:

α_a2q = (1 / ((e_t − b_t + 1)(e_s − b_s + 1))) Σ_{i=b_t}^{e_t} Σ_{j=b_s}^{e_s} α_{i,j}.

Among the five candidates in each instance, we take the one with the highest α_a2q as the most associated target (MAT), denoted s_MAT ∈ [1, 5]. As a baseline for MAT, we further define the most associated sentence (MAS) as the candidate answer that has the maximum link weight from the answer concept to the [CLS] token among the five candidates. The motivation is that gradients are back-propagated from the [CLS] token rather than the question concept or the answer concept. By comparing MAT and MAS, we obtain useful information on whether MAT is an influencing factor in the model decision.
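Computing α_a2q and selecting the MAT can be sketched as follows; spans are given as inclusive (begin, end) token index pairs, matching the notation above.

```python
import numpy as np

def link_weight_a2q(alpha, q_span, a_span):
    """Average link weight from answer-concept tokens (rows b_t..e_t)
    to question-concept tokens (columns b_s..e_s), spans inclusive."""
    (bs, es), (bt, et) = q_span, a_span
    return alpha[bt:et + 1, bs:es + 1].mean()

def most_associated_target(alphas, q_spans, a_spans):
    """0-based index of the candidate sentence with the highest alpha_a2q."""
    weights = [link_weight_a2q(al, q, a)
               for al, q, a in zip(alphas, q_spans, a_spans)]
    return int(np.argmax(weights))

# Toy example: candidate 1 has a strong answer-to-question-concept link
alphas = [np.zeros((5, 5)), np.zeros((5, 5))]
alphas[1][3:5, 1:3] = 0.9
mat = most_associated_target(alphas, [(1, 2), (1, 2)], [(3, 4), (3, 4)])
```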
We measure the correlation between the MAT (s_MAT ∈ [1, 5]), the model prediction (s_model ∈ [1, 5]) and the gold-standard answer (s_golden ∈ [1, 5]) using two metrics: the overlapping rate between MATs and model predictions, and the accuracy of MATs.
The overlapping rate of MATs is defined as:

Overlap = (1/D) Σ_{d=1}^{D} 1[s^d_MAT = s^d_model].

The accuracy of MATs is defined as the percentage of MATs that equal the gold answer:

Acc_MAT = (1/D) Σ_{d=1}^{D} 1[s^d_MAT = s^d_golden].

Similar to MAW, MAT and MAS can be measured for each attention head, and we calculate the average and maximum values across different heads.
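The two metrics can be sketched directly from their definitions:

```python
def mat_metrics(s_mat, s_model, s_gold):
    """Overlapping rate of MATs with model predictions, and MAT accuracy
    against gold answers; inputs are parallel lists of candidate indices."""
    D = len(s_mat)
    overlap = sum(m == p for m, p in zip(s_mat, s_model)) / D
    accuracy = sum(m == g for m, g in zip(s_mat, s_gold)) / D
    return overlap, accuracy

# Toy example over D = 4 instances
overlap, acc = mat_metrics([1, 2, 3, 1], [1, 2, 0, 1], [1, 0, 3, 1])
```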

Commonsense Link and Model Output
We measure the MAT performance of BERT-FT, and of a BERT model that is fine-tuned for the output layer only (BERT-probing). The latter is a linear probing model (Liu et al., 2019a). Intuitively, if the probing model can solve the commonsense task accurately, then the original non-fine-tuned BERT likely encodes rich commonsense knowledge. Table 3 shows the relative strengths of MATs and MASs according to the 12 attention heads in the top Transformer layer. First, for both models, the overlapping rates of MATs are significantly (p ≤ 0.01) larger than those of MASs. This suggests that the link weight from the answer concept to the question concept is more closely related to the model prediction than the link weight from the answer concept to the [CLS] token, despite the fact that model output scores are calculated on the [CLS] token. The results give strong evidence that commonsense cues in BERT are relied on for the model decision. Second, when fine-tuned with training data, the model shows an even stronger correlation between MAT and the model prediction. This suggests that the model can learn to make use of commonsense cues for making predictions, which partly shows how a BERT model solves CommonsenseQA.

Figure 3 shows the overlapping rate between MAT and model prediction at each Transformer layer. Both the maximum and the average overlapping rates across the 12 layers are shown, with the random overlapping rate of 20% drawn as a reference. It can be seen from the figure that the maximum overlapping rate of BERT-probing is significantly larger than the random baseline, which shows that the model prediction is associated with the relevant structured commonsense cues. In addition, after fine-tuning, the BERT-FT model shows a tendency of weakened maximum MAT overlapping rate on lower Transformer layers and much strengthened MAT overlapping rate on higher layers, in particular the top layer.
The trend of MAT measured by attribution scores is consistent with that measured by attention weights. This suggests that the fine-tuned model relies more on the commonsense structure in the top layer for making predictions.
We compare the co-occurrence between question concepts and candidate answer concepts in 1 million English Wikipedia documents, and find that only 18.2% of gold answers have the highest co-occurrence with the question concept among the 5 answer candidates, which is even lower than the random baseline (20%). This shows that CommonsenseQA cannot be solved by relying solely on superficial patterns.

Table 4 shows the correlation between MAT accuracies and model prediction accuracies. Each row shows a different number of heads in the top layer for which the MAT corresponds to the correct answer candidate, together with the number of test instances for such cases and the model prediction accuracy on those instances. There is a clear trend in which increased MAT accuracies correspond to increased model prediction accuracies, which shows that making use of structured commonsense cues leads to better model predictions.

Figure 4 shows the MAT accuracies of each attention head in the top layer for the test instances with correct and incorrect model predictions, respectively. The MAT accuracies of correctly predicted instances are larger than those of incorrectly predicted instances by a large margin. This finding is consistent with Table 4, showing that structured commonsense cues are a key factor in BERT making the correct decision.

We further evaluate model performance after pruning specific heads. We sort all the attention heads in each layer according to their MAT performance by attribution scores, and then prune the heads in order. Following Michel et al. (2019), we replace each pruned head with zero vectors. Figure 5 shows the model performance on the development set. As the number of pruned heads increases, the model performance decreases, which conforms to intuition.
In addition, model performance drops much more rapidly when the attention heads with higher MAT performance are pruned first, which demonstrates that capturing commonsense features is crucial for strong model prediction.
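The zero-vector head pruning of Michel et al. (2019) can be sketched as below; head outputs are modeled here as a list of per-head arrays that would normally be concatenated before the output projection.

```python
import numpy as np

def prune_heads(head_outputs, pruned):
    """Zero out the outputs of pruned heads before concatenation,
    following the zero-vector replacement of Michel et al. (2019)."""
    kept = [np.zeros_like(h) if i in pruned else h
            for i, h in enumerate(head_outputs)]
    return np.concatenate(kept, axis=-1)

# 3 heads, each producing a (4 tokens x 16 dims) output; prune head 1
heads = [np.full((4, 16), float(i + 1)) for i in range(3)]
combined = prune_heads(heads, pruned={1})
```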

Commonsense Link and BERT Layer
We further investigate two specific questions about commonsense knowledge usage. First, which layer does BERT rely on most for making its decision? Second, does the commonsense knowledge that BERT uses come more from pre-training or from fine-tuning? We compare 12 model variants by connecting the output layer to each of the Transformer layers, respectively. Table 5 shows the model accuracies and the MAT overlapping rates. First, BERT-probing gives the best performance when the prediction is made on the top layer, and the accuracy generally decreases as the layer moves toward the bottom. This indicates that relevant commonsense knowledge is distributed more heavily towards higher layers during pre-training. Our experimental settings here are the same as those of the probing task for syntactic information by Liu et al. (2019a), who find that syntactic information is distributed more heavily towards lower BERT layers.
With fine-tuning, we observe stronger improvements in both model accuracy and MAT overlap on higher layers when comparing BERT-FT and BERT-probing. This demonstrates that commonsense knowledge in higher layers is more useful for the CommonsenseQA task. Interestingly, comparing layers 11 and 10, the model accuracies after fine-tuning are similar, but the MAT overlap of layer 11 is significantly larger. This suggests that the structured commonsense knowledge we probe accounts for only part of the overall knowledge useful for CommonsenseQA.

Conclusion
We conducted quantitative analysis to investigate how BERT solves the CommonsenseQA task, aiming to gain evidence on the key sources of information involved in the disambiguation process. Empirical results demonstrated that BERT encodes structured commonsense knowledge, and is able to leverage such cues on the downstream CommonsenseQA task. Our analysis further revealed that with fine-tuning, BERT learns to make better use of commonsense features in higher layers. These findings suggest that BERT can learn to make use of truly relevant commonsense cues rather than superficial patterns for CommonsenseQA.