Towards Zero-shot Commonsense Reasoning with Self-supervised Refinement of Language Models

Can we take existing language models and refine them for zero-shot commonsense reasoning? This paper presents an initial study exploring the feasibility of zero-shot commonsense reasoning for the Winograd Schema Challenge by formulating the task as self-supervised refinement of a pre-trained language model. In contrast to previous studies that rely on fine-tuning with annotated datasets, we seek to boost conceptualization via loss landscape refinement. To this end, we propose a novel self-supervised learning approach that refines the language model utilizing a set of linguistic perturbations of similar concept relationships. Empirical analysis of our conceptually simple framework demonstrates the viability of zero-shot commonsense reasoning on multiple benchmarks.


Introduction
Natural language processing has recently experienced unprecedented progress, boosting the performance of many applications to new levels. However, this gain in performance does not transfer equally to applications requiring commonsense reasoning capabilities, which largely remains an unsolved problem (Marcus, 2020; Kocijan et al., 2020). In order to assess the commonsense reasoning capabilities of automatic systems, several tasks have been devised. Among them is the popular Winograd Schema Challenge (WSC), which frames commonsense reasoning as a pronoun resolution task (Levesque et al., 2012). Although appearing evident and natural to the human mind, modern machine learning methods still struggle to solve this challenge. Lately, the research community has seen an abundance of methods proposing the utilization of language models (LMs) to tackle commonsense reasoning in a two-stage learning pipeline.¹ Starting from an initial self-supervised learned model, commonsense-enhanced LMs are obtained in a subsequent fine-tuning (ft) phase. Fine-tuning forces the LM to solve the downstream WSC task as a plain co-reference resolution task. However, such supervised approaches are prone to leveraging statistical data artifacts for reasoning, giving rise to the "Clever Hans" effect (Lapuschkin et al., 2019). As such, instead of truly acquiring reasoning capabilities, approaches become very good at faking them.

¹ The source code can be found at: https://github.com/SAP-samples/emnlp2021-contrastive-refinement/

[Figure 1: A WSC sentence pair and its synonym-substituted variants, e.g., "The trophy does not fit into the suitcase, because it is too big/small." and "The medal does not fit into the box, because it is too big/small."]
On the other hand, the lack of commonsense reasoning capabilities of LMs can be partially attributed to the training corpora themselves, as commonsense knowledge is often not incorporated into the training text due to its assumed triviality (Trichelair et al., 2018; Saba, 2018; Kavumba et al., 2019; Liu et al., 2020; Cui et al., 2020). We hypothesize that the self-supervised tasks currently used in the pre-training phase are insufficient to force the model to generalize commonsense concepts (Kejriwal and Shen, 2020). This shortcoming is easily unveiled by the susceptibility of LMs to semantic variations; in this regard, it has been shown that LMs are sensitive to linguistic perturbations (Abdou et al., 2020). A case in point is the WSC example in Fig. 1. It shows a pair of sentences subject to semantic variations that establish the same relationship between entities. This relationship can be defined as a joint concept triplet consisting of two nouns and a verb that determines the relationship between the nouns, e.g., (container, item, fit). Inappropriate sensitivity to such semantic variants leads to inadequate "conceptualization" and misconstruction of such triplets. To address this, we propose self-supervised refinement, which seeks to achieve generalization through a task-agnostic objective.
To this end, we tackle the problem of commonsense reasoning from a zero-shot learning perspective. Using zero-shot models to gauge the intrinsic incorporation of commonsense knowledge appears more valid than using fine-tuned models, since the exploitation of implicit biases is less likely to occur in this setup. Hence, the associated evaluations are more realistic and reliable (Elazar et al., 2021). Other zero-shot methods for commonsense reasoning either use large supervised datasets (e.g., Winogrande (Sakaguchi et al., 2019)) or very large LMs such as GPT-3 (Brown et al., 2020). In contrast, the proposed method takes a pre-trained LM as input, which undergoes a refinement step. During refinement, the LM is exposed to semantic variations, aiming at improved concept generalization by making the model more robust w.r.t. perturbations. Motivated by recent advancements in contrastive representation learning (Chen et al., 2020; He et al., 2020; Jean-Bastien et al., 2020; Klein and Nabi, 2020), we propose refining the LM in a self-supervised contrastive fashion. This entails refinement without the use of any labels and hence with no gradient update on the downstream datasets. Consequently, the supervision level is identical to that at test time of the Winograd Schema Challenge.
Our contributions are two-fold: (i) we introduce the task of zero-shot commonsense reasoning for WSC by reformulating the task as performing self-supervised refinement on a pre-trained language model; (ii) we propose a self-supervised refinement framework that leverages semantic perturbations to facilitate zero-shot commonsense reasoning.

Method
Preliminaries: Transformer-based LMs such as BERT and RoBERTa consist of a stack of encoder layers that process the input iteratively. Prior to entering the Transformer stack, the input is pre-processed by a tokenizer that turns the input sentence into a sequence of tokens. Besides tokens arising from the input sentence, there are also auxiliary tokens such as [CLS] and [SEP]. In BERT and RoBERTa, these tokens delimit the input from padding for fixed-length sequence processing. Furthermore, there are special tokens tailored to frame specific tasks. For example, [MASK] is used to mask out words for learning the masked language model. Instantiating a language model on the tokenized sequence yields a sequence of embedding vectors. To avoid clutter in the notation, and subsuming the fact that only fixed-length sequences are encoded, in the following x ∈ T refers to the tensor obtained by stacking the sequence of token embeddings.

Perturbation Generation Framework
Starting from a pre-trained LM (init-LM), we conduct a refinement step exposing the model to semantic variations of Winograd schemas. Given a sentence x and a specific semantic [perturbation token], the LM is trained to generate the embedding x̂ of the provided perturbation type. We enforce the generator to estimate the embedding obtained by the LM on the sentence with the actual semantic perturbation as the target. Intuitively speaking, an LM that generates perturbed representations from an unperturbed input is equipped with a generalized view of commonsense concepts. This builds upon the idea that the injection of noise into the input can flatten the loss landscape to promote generalization (Qin et al., 2019; Moosavi-Dezfooli et al., 2019).
To this end, we extend the set of auxiliary tokens with new tokens referred to as "perturbation tokens". In the course of training, a perturbation token is prepended to the input sentence, directly after the [CLS] token. In the following, we let P denote the set of semantic perturbations. Besides perturbations, P also includes the identity transformation [IDENTICAL], which implies no semantic change. Figure 1 shows an example of a perturbation induced by the perturbation token [SYNONYM], which entails replacing nouns of the input sentence with synonyms. Following the example from the figure, the LM seeks to map the representation of the (tokenized) sentence (a) in conjunction with [SYNONYM] to the representation of (b). To enforce consistency across commonsense concepts and semantic perturbations, we embed learning in a contrastive setting.

[Figure: Example perturbations, e.g., "The trophy does not fit into the suitcase, because it is too big." with [SYNONYM] → "The medal does not fit into the box, because it is too big." and with [TENSE] → "The trophy did not fit into the suitcase, because it was too big."; analogously for "The table does not fit through the doorway because it is too wide."]
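As an illustration, prepending a perturbation token can be sketched as follows. The whitespace tokenizer and the exact handling of special tokens are simplifying assumptions; the actual method uses the BERT/RoBERTa subword tokenizers with the perturbation tokens added to their vocabularies.

```python
# Sketch: splicing a perturbation token into a tokenized input sequence.
# The whitespace "tokenizer" below is a stand-in for a real subword tokenizer.

PERTURBATIONS = ["[IDENTICAL]", "[SYNONYM]", "[TENSE]", "[NUMBER]",
                 "[GENDER]", "[VOICE]", "[RELCLAUSE]", "[ADVERB]"]

def build_input(sentence: str, perturbation: str) -> list:
    """Prepend the perturbation token directly after [CLS]."""
    assert perturbation in PERTURBATIONS, "unknown perturbation token"
    tokens = sentence.split()  # placeholder for a real subword tokenizer
    return ["[CLS]", perturbation] + tokens + ["[SEP]"]

seq = build_input("The trophy does not fit into the suitcase", "[SYNONYM]")
# The perturbation token sits between [CLS] and the first sentence token.
```

With [IDENTICAL] in the set, the same machinery covers the unperturbed case, so every training input follows one uniform format.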

Self-supervised Refinement
The method's core idea is to construct an abstract, generic view of a commonsense concept by exploiting slightly different examples of the same concept (i.e., perturbations). This is achieved by joint optimization of an LM w.r.t. three different loss terms (reconstruction, contrastive, and diversity):

min_{θ1, θ2}  α L_R + β L_C + γ L_D   (1)

Here f denotes the LM, e.g., BERT or RoBERTa, parameterized by θ1, and q : T → P denotes a representation discriminator (MLP) parameterized by θ2. The functionality of the individual loss terms of Eq. 1 is explained in the following subsections. Additionally, Fig. 2 shows a schematic illustration of the proposed approach and each loss term. Optimization of Eq. 1 entails the computation of similarities between embeddings, employing a metric φ(x, x̂) : T × T → R. Here, we employ a variant of BERTScore (Zhang et al., 2020) as the similarity metric. BERTScore computes sentence similarities by matching tokens based on their cosine similarity; subsequently, the scores for the entire sequence are aggregated. Unlike the original BERTScore, we restrict token matching to each token's vicinity to accommodate the fact that perturbations typically induce changes only in a small neighborhood. To this end, we restrict token matching by applying a sliding window mechanism centered on each token.
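The windowed token-matching metric described above can be sketched as follows. The window half-width, the greedy max-cosine matching, and the symmetric averaging are our assumptions; the paper's exact aggregation may differ.

```python
# Minimal sketch of a windowed BERTScore-style similarity: each token
# embedding is matched only within +/- w positions of its own index.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def windowed_score(x, y, w=2):
    """Match each token embedding in x to its best counterpart in y,
    restricted to a sliding window of half-width w around its position."""
    scores = []
    for i, xi in enumerate(x):
        lo, hi = max(0, i - w), min(len(y), i + w + 1)
        scores.append(max(cosine(xi, yj) for yj in y[lo:hi]))
    return sum(scores) / len(scores)

def phi(x, y, w=2):
    # Symmetrize, loosely analogous to BERTScore's F-measure aggregation.
    return 0.5 * (windowed_score(x, y, w) + windowed_score(y, x, w))
```

Restricting matches to a local window reflects the observation that the perturbations alter only a small neighborhood of the sentence, so distant token matches would mostly add noise.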

Reconstruction loss
The reconstruction loss's objective is to regress embeddings by minimizing the distance between the ground-truth and the approximated "perturbation" embedding:

L_R = − Σ_{i=1}^{N} Σ_{p ∈ P} φ(x̂_i^p, x_i^p)   (2)

where x̂_i^p denotes the embedding generated from sample x_i in conjunction with perturbation token p, and x_i^p denotes the ground-truth embedding obtained by the LM on the actually perturbed sentence. Since φ is a similarity, minimizing its negative minimizes the distance.

Contrastive loss
The objective of the contrastive loss is to preserve the semantic expressivity of individual samples and to prevent collapse to a singular perturbation representation. This is achieved by pushing apart the embeddings of different samples of the same perturbation type:

L_C = Σ_{p ∈ P} Σ_{i ≠ j} φ(x̂_i^p, x̂_j^p)   (3)

Minimizing the pairwise similarity of generated embeddings that share a perturbation type pushes the corresponding samples apart.

Diversity loss
The diversity loss term aims to guarantee the discriminativeness of the perturbation embeddings arising from the same sample. As such, it requires the semantic perturbations of the same sample to be diverse, preventing the collapse of different perturbations to a single embedding. Maximizing diversity entails minimizing the cross-entropy w.r.t. perturbations:

L_D = − Σ_{i=1}^{N} Σ_{p ∈ P} log q(p | x̂_i^p)   (4)

Here q(·|·) denotes the likelihood of the classifier w.r.t. embeddings, N denotes the number of data samples, and α, β, γ ∈ R denote the hyperparameters balancing the terms in the loss function.
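A toy numeric sketch of how the three loss terms could be combined is given below. The exact functional forms are our reading of the descriptions above, and plain cosine similarity stands in for the windowed metric φ; the discriminator likelihoods are supplied as plain numbers rather than computed by an MLP.

```python
# Toy combination of the three loss terms: L_R pulls generated embeddings
# toward ground-truth perturbed embeddings, L_C pushes apart different
# samples of the same perturbation, L_D keeps the perturbations of one
# sample distinguishable via a cross-entropy on discriminator likelihoods.
import math

def sim(u, v):  # cosine similarity, standing in for the metric phi
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def total_loss(gen, gt, probs, alpha=1.0, beta=1.0, gamma=1.0):
    """gen[i][p], gt[i][p]: generated / ground-truth embeddings of sample i
    under perturbation p; probs[i][p]: likelihood q(p | gen[i][p])."""
    n, num_p = len(gen), len(gen[0])
    l_rec = -sum(sim(gen[i][p], gt[i][p])
                 for i in range(n) for p in range(num_p))
    l_con = sum(sim(gen[i][p], gen[j][p])
                for p in range(num_p)
                for i in range(n) for j in range(n) if i != j)
    l_div = -sum(math.log(probs[i][p])
                 for i in range(n) for p in range(num_p))
    return alpha * l_rec + beta * l_con + gamma * l_div
```

Note the tension built into the objective: the contrastive term rewards spreading embeddings apart, while the diversity term rewards keeping each sample's perturbations separable, which matches the equilibrium behavior discussed in the ablation study.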

Zero-shot Pronoun Disambiguation
For resolving the WSC, we leverage Transformer masked token prediction following (Kocijan et al., 2019). This entails replacing the [MASK] token with the possible candidates. Given an associated pair of sentences s_i^1, s_i^2 with i ∈ N, the difference between the sentence pair is the trigger word(s). With c_1, c_2 denoting the answer candidates, the LM yields probabilities for the candidates: p(c_1 | s_i^1) and p(c_2 | s_i^1). The answer prediction corresponds to the candidate with the higher likelihood. If a candidate consists of several tokens, the probability corresponds to the average of the log probabilities.
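The candidate-scoring rule can be sketched as follows. The fixed per-token log probabilities are purely illustrative stand-ins for querying a masked LM; only the averaging and comparison logic mirrors the description above.

```python
# Sketch of zero-shot candidate scoring: substitute each candidate for the
# [MASK] slot; a multi-token candidate is scored by the average of its
# per-token log probabilities, and the higher-scoring candidate wins.
import math

TOY_LOGPROBS = {  # hypothetical per-token log probabilities from an LM
    "trophy": math.log(0.30), "suitcase": math.log(0.05),
    "suit": math.log(0.10), "case": math.log(0.08),
}

def token_logprob(token: str) -> float:
    """Stand-in for querying the masked LM for one candidate token."""
    return TOY_LOGPROBS[token]

def score_candidate(tokens):
    """Average log probability over the candidate's (sub)tokens."""
    return sum(token_logprob(t) for t in tokens) / len(tokens)

def predict(cand1, cand2):
    return cand1 if score_candidate(cand1) > score_candidate(cand2) else cand2
```

Averaging rather than summing the log probabilities avoids systematically penalizing candidates that the tokenizer splits into more subword tokens.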

Setup
For training, we first refine the LM on perturbations from the enhanced-WSC corpus (Abdou et al., 2020), a perturbation-augmented version of the original WSC dataset. It consists of 285 sample sentences, with up to 10 semantic perturbations per sample. We make use of the following 7 perturbations: tense switch [TENSE], number switch [NUMBER], gender switch [GENDER], voice switch (active to passive or vice versa) [VOICE], relative clause insertion (a relative clause is inserted after the first referent) [RELCLAUSE], adverbial qualification (an adverb is inserted to qualify the main verb of each instance) [ADVERB], and synonym/name substitution [SYNONYM].

Architecture
The proposed approach is applicable to any Transformer architecture. Here, we adopted standard LMs such as BERT and RoBERTa for comparability, without aiming to optimize the results for any downstream dataset/benchmark. Specifically, we employ the Hugging Face (Wolf et al., 2019) implementations of the BERT large-uncased and RoBERTa large architectures. The LM is trained for 10 epochs for BERT and 5 for RoBERTa, using a batch size of 10 sentence samples. Each sample was associated with 4 perturbations, yielding an effective batch size of 40. For optimization, we used a typical AdamW setup with 500 warmup steps and a learning rate of 5.0e-5, with ε = 1.0e-8 for BERT and ε = 1.0e-5 for RoBERTa. For training BERT, we used α = 130, β = 0.5, γ = 2.5; for RoBERTa, α = 1.25, β = 7.25, γ = 6.255. For hyperparameter optimization of α, β, γ, we follow a standard greedy heuristic, leveraging a weighted-sum optimization scheme (Jakob and Blume, 2014). From an initial candidate solution set obtained by coarse-grid random search, the neighborhood of randomly selected candidates is explored on a fine grid.
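The coarse-to-fine heuristic can be sketched roughly as follows. The search ranges, budgets, neighborhood radius, and the toy objective are placeholders; a real run would score each (α, β, γ) candidate by validation accuracy.

```python
# Rough sketch of a greedy coarse-to-fine random search over the loss
# weights (alpha, beta, gamma). All ranges and budgets are illustrative.
import random

def coarse_to_fine_search(objective, coarse_n=50, fine_n=50,
                          radius=0.5, seed=0):
    rng = random.Random(seed)
    # Coarse stage: sample candidates uniformly from a wide range.
    cands = [tuple(rng.uniform(0.0, 10.0) for _ in range(3))
             for _ in range(coarse_n)]
    best = max(cands, key=objective)
    # Fine stage: greedily explore a small neighborhood of the best candidate.
    for _ in range(fine_n):
        cand = tuple(max(0.0, b + rng.uniform(-radius, radius)) for b in best)
        if objective(cand) > objective(best):
            best = cand
    return best

# Toy objective peaked at (2, 5, 3), standing in for validation accuracy.
toy = lambda abg: -((abg[0] - 2) ** 2 + (abg[1] - 5) ** 2 + (abg[2] - 3) ** 2)
```

The fine stage only ever accepts improvements, so it can refine but never degrade the best coarse candidate.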
The representation discriminator q is an MLP consisting of two fully-connected layers with BatchNorm, parametric ReLU (PReLU) activations, and 20% Dropout.

Results
Given the BERT and RoBERTa language models for comparison, the baselines constitute the respective init-LMs prior to undergoing refinement. We evaluated our method on nine different benchmarks; results are reported in Tab. 1. Accuracy gains are significant and consistent with RoBERTa across all benchmarks. On average, the proposed approach increases accuracy by (+0.8%) with BERT and by (+4.5%) with RoBERTa. The benchmarks and the results are discussed below. DPR (Rahman and Ng, 2012): a pronoun disambiguation benchmark resembling WSC-273, yet significantly larger; according to (Trichelair et al., 2018), it is less challenging due to inherent biases. Here the proposed approach outperforms the baseline for both BERT and RoBERTa by margins of (+2.85%) and (+6.56%), respectively. GAP (Webster et al., 2018): a gender-balanced co-reference corpus. The proposed approach outperforms the baseline with BERT and RoBERTa by (+0.08%) and (+0.26%). KnowRef: a co-reference corpus addressing gender and number bias. The proposed approach outperforms the baseline with BERT and RoBERTa by (+0.08%) and (+3.55%). PDP-60 (Davis et al., 2016): a pronoun disambiguation dataset. Our method outperforms the baseline with RoBERTa by (+5.0%), while showing a drop of (−1.67%) with BERT. WSC-273 (Levesque et al., 2012): a pronoun disambiguation benchmark known to be more challenging than PDP-60. Our method outperforms the baseline with RoBERTa by (+4.0%), with a drop of (−1.1%) with BERT.

Ablation Study
To assess each loss term's contribution, we evaluated performance after removing each component individually from the loss. It should be noted that L_C of Eq. 3 and L_D of Eq. 4 interact in a competitive fashion; only an equilibrium of these terms yields an optimal solution. Changes, such as eliminating a term, have detrimental effects, as they prevent reaching such an equilibrium, resulting in a significant drop in performance. See Tab. 2 for the ablation study on two benchmarks. Best performance is achieved in the presence of all loss terms.

Discussion and Conclusion
We introduced a method for the self-supervised refinement of LMs. Its conceptual simplicity facilitates generic integration into frameworks tackling commonsense reasoning. A first empirical analysis on multiple benchmarks indicates that the proposed approach consistently outperforms the baselines, i.e., standard pre-trained LMs, confirming its fundamental viability. We believe that the performance gain will be more pronounced when leveraging larger perturbation datasets for LM refinement. Hence, future work will focus on the generation of perturbations; this could specifically entail the consideration of sample-specific perturbations.