It’s All in the Heads: Using Attention Heads as a Baseline for Cross-Lingual Transfer in Commonsense Reasoning

Commonsense reasoning is one of the key problems in natural language processing, but the relative scarcity of labeled data holds back the progress for languages other than English. Pretrained cross-lingual models are a source of powerful language-agnostic representations, yet their inherent reasoning capabilities are still actively studied. In this work, we design a simple approach to commonsense reasoning which trains a linear classifier with weights of multi-head attention as features. To evaluate this approach, we create a multilingual Winograd Schema corpus by processing several datasets from prior work within a standardized pipeline and measure cross-lingual generalization ability in terms of out-of-sample performance. The method performs competitively with recent supervised and unsupervised approaches for commonsense reasoning, even when applied to other languages in a zero-shot manner. Also, we demonstrate that most of the performance is given by the same small subset of attention heads for all studied languages, which provides evidence of universal reasoning capabilities in multilingual encoders.


Introduction
Neural networks have achieved remarkable progress in numerous tasks involving natural language, such as machine translation (Bahdanau et al., 2014;Kaplan et al., 2020;Arivazhagan et al., 2019), language modeling (Brown et al., 2020), open-domain dialog systems (Adiwardana et al., 2020;Roller et al., 2020), and general-purpose language understanding (Devlin et al., 2019;He et al., 2021). However, the fundamental problem of commonsense reasoning has proven to be quite challenging for modern methods and arguably remains unsolved to this day.

* Equal contribution.

Figure 1: An example Winograd schema. "The town councilors refused to give the demonstrators a permit because they feared violence." Answer: The town councilors.

The tasks that aim to measure reasoning capabilities, such as the Winograd Schema Challenge (Levesque et al., 2012), are deliberately designed not to be easily solved by statistical approaches, which are a foundation of most deep learning methods. Instead, these tasks require implicit knowledge about the properties of real-world entities and their relations in order to resolve inherent ambiguities of natural language. Figure 1 illustrates the gist of this task: given a sentence and a pronoun (they), the goal is to choose the word that this pronoun refers to from two options (The town councilors or the demonstrators). While picking the right answer is straightforward for humans, the lack of explicit clues makes it hard for machine learning algorithms to perform better than majority vote or random choice.
Recently, large Transformer-based masked language models (MLMs) (Devlin et al., 2019) were shown to achieve impressive results on several benchmark datasets for commonsense reasoning (Sakaguchi et al., 2020;Kocijan et al., 2019;Klein and Nabi, 2020). However, the best-performing methods frequently involve finetuning the entire model on sufficiently large corpora with varying degrees of supervision; apart from providing initial parameter values, the pretrained language model itself is not used for predictions.
Moreover, these methods have mostly been evaluated on English-language datasets, despite the increasing interest in multilingual evaluation for NLP (Hu et al., 2020) and the existence of multilingual encoders (Conneau et al., 2020;Conneau and Lample, 2019). The XCOPA dataset (Ponti et al., 2020) was recently proposed as a benchmark for multilingual commonsense reasoning, yet its task is different from the pronoun resolution problem described above. Versions of the Winograd Schema Challenge exist in different languages, but each version comes with slight differences in task specification. This makes holistic cross-lingual evaluation of new commonsense reasoning approaches quite a difficult problem for researchers in the area.
In this work, we propose a simple supervised method for commonsense reasoning, which trains a linear classifier on the self-attention weights between the pronoun and two answer options. To evaluate our method and facilitate research in multilingual commonsense reasoning, we aggregate existing Winograd Schema datasets in English, French, Japanese, Russian, Portuguese, and Chinese languages, converting them to a single format with a strict task definition. Our approach performs comparably to supervised and unsupervised baselines in this setting with both multilingual BERT and XLM-R models as backbone encoders.
Moreover, we find that the same set of attention heads can be used to solve reasoning tasks in all languages, which hints at the emergence of language-independent linguistic functions in cross-lingual models and supports the conclusions made by prior work (Chi et al., 2020). Interestingly, when using an unsupervised attention-based method (Klein and Nabi, 2019), we observe that restricting the choice of heads to this set also improves the results of this baseline. This result suggests that the key to improved performance of such approaches might lie in the right choice of heads rather than the exact attention values.
To summarize, our contributions are as follows:

• We offer a simple supervised method that utilizes the self-attention heads of pretrained language models for commonsense reasoning.

• We compile a multilingual dataset of Winograd schemas in six languages, bringing all tasks to the same format¹. When evaluated on this dataset, our method performs competitively with strong baselines from prior work.
Related work

Winograd Schema challenges
The Winograd Schema Challenge (WSC) was proposed as a challenging yet practical benchmark for the evaluation of machine commonsense reasoning (Levesque et al., 2012). Since its introduction, several English-language benchmarks of varying difficulty and size have also been proposed: notable examples include the Definite Pronoun Resolution (Rahman and Ng, 2012) and Pronoun Disambiguation Problem (Morgenstern et al., 2016) datasets, as well as WinoGrande, which consists of 44k crowdsourced examples (Sakaguchi et al., 2020). A version of WSC is also included in the popular SuperGLUE language understanding benchmark (Wang et al., 2019a), where it is reformulated as a natural language inference problem.
Although the task definition of the Winograd Schema Challenge has been formalized to some degree, both succeeding datasets and methods proposed by users of these datasets have introduced various changes to the task specification and even the input format. In particular, prior work provides a thorough comparison of different ways to formalize the WSC task and shows that the same model can give widely varying results depending on the evaluation framework. We describe our efforts to convert different datasets to a single format in Section 4.

Language models applied to commonsense reasoning
Several works attempt to solve the Winograd Schema Challenge by utilizing pretrained language models. For example, Trinh and Le (2018) propose to rank possible answers with an ensemble of RNN language models by substituting the pronoun with each of the options. Recently, Klein and Nabi (2019) introduced the Maximum Attention Score (MAS) for commonsense reasoning. This method uses the outputs of multi-head attention from each layer and scores each candidate answer based on the number of heads for which this answer has the highest attention value. We use the first approach (adapted to masked language models as proposed by Salazar et al., 2020) and the second one as baselines in our experiments. In essence, our method can be compared to MAS, but as we demonstrate in Section 5, several algorithm design differences along with task supervision allow us to significantly improve commonsense reasoning performance.

Large pretrained Transformer models, such as BERT (Devlin et al., 2019), have also enabled rapid progress of supervised methods for WSC. One such method is given by Sakaguchi et al. (2020): the authors propose to concatenate the sentence and one of the options and to use the [CLS] token representation of the resulting sequence for binary classification. Also, Kocijan et al. (2019) propose a margin-based loss function which aims to increase the log-probability of the correct answer as a replacement for the masked pronoun. We evaluate these methods in our experiments without training on large in-domain datasets; as we show, both methods are prone to overfitting when applied to several hundreds of examples.

Cross-lingual encoder models
Multilingual representations have been a long-standing goal of the research community: they make it possible to serve fewer models for a wide range of languages and to improve results on low-resource languages. Ruder et al. (2019) give a detailed survey of different cross-lingual word embedding approaches, as well as the history of cross-lingual representations in general.
In this work, we are interested in the latest developments in multilingual Transformer masked language models (Devlin et al., 2019;Conneau and Lample, 2019;Conneau et al., 2020) that were driven by the advances in transfer learning for NLP (Howard and Ruder, 2018;Devlin et al., 2019). In particular, we use pretrained multilingual BERT (mBERT, Devlin et al., 2019) and XLM-RoBERTa (XLM-R, Conneau et al., 2020) for all our experiments.
Recently, there has been increasing interest in the evaluation of multilingual models: as a result, several benchmarks, including XTREME (Hu et al., 2020), XNLI (Conneau et al., 2018), and XCOPA (Ponti et al., 2020), were introduced. Although XCOPA is a commonsense reasoning dataset, it is meant to serve as a multilingual version of the COPA dataset (Roemmele et al., 2011), which poses a problem different from pronoun resolution. In this work, we aim to create a multilingual counterpart of the more widely used Winograd Schema Challenge, so that any future methods for commonsense reasoning can be easily evaluated on languages other than English.

Functions of Transformer heads
Previous works have demonstrated that it is possible to perform unsupervised zero-shot constituency parsing with the attention heads of pretrained cross-lingual models. In our work, we extend these findings to a conceptually different task of commonsense reasoning. This task has significant overlap with coreference resolution, which was shown to be encoded in specific heads of monolingual BERT (Clark et al., 2019;Tenney et al., 2019).
Motivated by similar results for monolingual models, several works have previously demonstrated that models such as multilingual BERT encode grammatical relations (Chi et al., 2020) and can perform zero-shot named entity recognition and POS tagging (Pires et al., 2019). Besides presenting evidence for universality in pronoun resolution, which was not studied before, our analysis relies on attention heads instead of extracting representations from intermediate layer outputs.

Common sense from attention
In this section, we first give a formal definition of the commonsense reasoning task, most commonly encountered in Winograd Schema Challenge and its successors. Then, we provide necessary background information about the Transformer architecture for transfer learning and describe our proposed solution for this task.

Exact task specification
It is known that commonsense reasoning performance can vary greatly due to changes in task formulation: for example, recent work reports improvements of up to 6 points when posing the task as multiple choice instead of binary classification. Thus, following the recommendations of this work and in order to create a unified dataset, we choose a definition of the Winograd Schema problem that is as strict as possible.
The definition is as follows: the system receives a sentence with a pronoun and has to choose the noun (or noun phrase) that this pronoun refers to. For this choice, the system has two options, both of which, along with the pronoun, are always included as substrings of the initial sentence. We intentionally do not restrict the choice of sentence representation or the framing of the task in order to evaluate a diverse range of solutions.
Although the requirements listed above are quite general and intuitive when working with WSC, some of the datasets we employ have samples that do not conform to them. For example, it might be the case that the pronoun occurs at several positions in the sentence without an explicit indication of the one to be resolved. We attempt to convert all such examples to standardized instances by hand and drop them only if this is not possible by simple means, since otherwise the right answer to the problem would be misspecified. We give a detailed description of our solution in Section 4.2.

Transformers for sentence representations
Our method heavily relies on the specifics of the Transformer architecture (Vaswani et al., 2017), which has attracted increased interest in NLP recently due to its generation (Raffel et al., 2020;Brown et al., 2020) and transfer learning (Devlin et al., 2019;Liu et al., 2019) capabilities. This architecture consists of several sequential layers, where each layer contains a feed-forward block and a self-attention block. Inside the self-attention block, there are multiple attention heads: each head first linearly projects the input sequence z = [z_1, ..., z_i, ..., z_n] into sequences of queries q_i, keys k_i, and values v_i, then computes the attention weights as softmax-normalized pairwise dot products between all queries and all keys:

α_ij = exp(q_i · k_j / √d) / Σ_l exp(q_i · k_l / √d),   (1)

where d is the dimensionality of the keys. These weights are then used to combine the values into a single vector for each input position, and the layer output is a linear combination of all attention head outputs.
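For concreteness, the attention weights of a single head can be computed as follows; the dimensions and the random projection matrices below are stand-ins for the learned parameters of a real model, not values from the paper.

```python
# Illustrative computation of per-head attention weights alpha[i, j],
# i.e. a softmax over scaled dot products between queries and keys.
import numpy as np

def attention_weights(z, Wq, Wk):
    """Return the (n, n) matrix of attention weights for one head."""
    q, k = z @ Wq, z @ Wk                          # project inputs to queries/keys
    scores = q @ k.T / np.sqrt(k.shape[-1])        # scaled pairwise dot products
    scores -= scores.max(axis=-1, keepdims=True)   # for numerical stability
    exp = np.exp(scores)
    return exp / exp.sum(axis=-1, keepdims=True)   # softmax over the key axis

rng = np.random.default_rng(0)
n, d_model, d_head = 5, 16, 8
z = rng.normal(size=(n, d_model))                  # toy input sequence
alpha = attention_weights(z,
                          rng.normal(size=(d_model, d_head)),
                          rng.normal(size=(d_model, d_head)))
print(alpha.shape)        # one attention distribution per input position
print(alpha.sum(axis=-1)) # each row is a probability distribution
```

Note that the rows, not the columns, sum to one: this asymmetry is what makes the attention direction a meaningful design choice later on.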

Our approach
The method proposed in this work uses the intermediate outputs of a Transformer masked language model with L layers and H heads in each layer. Given an instance of the Winograd Schema problem, we take the input sentence and mask the pronoun that needs to be resolved. After that, we feed the resulting sentence to the language model and obtain the activations of each self-attention layer as a tensor of shape L × H × T, where T is the number of tokens that constitute the candidate answer. Here, we can either take the attention from the pronoun to the candidate or vice versa.

After aggregating the attention outputs by computing the mean or the maximum over T, we have two matrices, one for each of the two possible answers, which are then flattened into vectors. Combining these vectors, we obtain an input for the binary classification task, with class 0 corresponding to the first answer being correct and class 1 to the second one. Given a dataset of such inputs, we can train a logistic regression to predict the class from the multi-head attention weights α.
There are several design choices which define the exact implementation of our method. We describe them below; for each design choice, we underline the best-performing option as found by the ablation study in Section 5.4.
Feature combination: With two feature vectors for candidate answers, we can either concatenate them or subtract the vector of the second candidate from the vector of the first one.
Pooling over tokens: As the candidates can have different lengths, we need to transform the attention outputs into feature vectors of the same size. This can be done by one of two simple forms of aggregation: mean- or max-pooling.
Attention direction: Observe from Equation 1 that in general, α_ij ≠ α_ji. To find the optimal configuration, we evaluate both options of attending either to the candidate or to the pronoun.
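Putting these pieces together, a minimal sketch of the classifier might look as follows. The attention tensors here are random stand-ins for actual MLM outputs (with a weak signal injected into one "head" so the toy classifier has something to find); all names, shapes, and the data itself are illustrative assumptions rather than the paper's actual code.

```python
# A toy end-to-end sketch: pool per-candidate attention tensors over tokens,
# combine the two candidates' features, and fit a logistic regression.
import numpy as np
from sklearn.linear_model import LogisticRegression

L_LAYERS, H_HEADS = 12, 12  # e.g. multilingual BERT

def candidate_features(attn, pooling="max"):
    """Pool an (L, H, T) attention tensor over the T candidate tokens
    and flatten it into an (L*H,)-dimensional feature vector."""
    pooled = attn.max(axis=-1) if pooling == "max" else attn.mean(axis=-1)
    return pooled.reshape(-1)

def example_features(attn_a, attn_b, combine="difference"):
    """Combine per-candidate features; the ablation favours the difference."""
    fa, fb = candidate_features(attn_a), candidate_features(attn_b)
    return fa - fb if combine == "difference" else np.concatenate([fa, fb])

# Synthetic training set standing in for real attention extracted from an MLM.
rng = np.random.default_rng(0)
X, y = [], []
for _ in range(200):
    label = int(rng.integers(2))       # 0: first candidate correct, 1: second
    a = rng.random((L_LAYERS, H_HEADS, 3))
    b = rng.random((L_LAYERS, H_HEADS, 2))
    # Inject a weak signal into one "head" of the correct candidate's tensor.
    (a if label == 0 else b)[5, 7] += 0.5
    X.append(example_features(a, b))
    y.append(label)

clf = LogisticRegression(max_iter=1000).fit(np.array(X), y)
acc = clf.score(np.array(X), y)        # training accuracy on the toy data
print(acc)
```

In a real pipeline the tensors `a` and `b` would be sliced out of the model's attention maps (rows for the masked pronoun position, columns for each candidate's token positions, or vice versa depending on the attention direction).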

Dataset
In this section, we describe our procedure of building a multilingual commonsense reasoning benchmark using Winograd Schema Challenge problems. We create this benchmark by combining several monolingual collections for six languages, each described in previously published works.
We intentionally do not use XCOPA (Ponti et al., 2020) as it is aimed at a different problem: instead of operating at the word level, the task of this dataset is to connect the premise and one of two hypotheses, both of which are complete sentences. Because direct application of attention-based reasoning to sentence-level tasks is a non-trivial research question, we leave it to future work.

Languages
For the English language, we work with the data from the original WSC task 2 (Levesque et al., 2012), as well as the SuperGLUE benchmark (Wang et al., 2019a) and the Definite Pronoun Resolution dataset (Rahman and Ng, 2012). For French and Japanese, we use datasets published by Amsili and Seminck (2017) and Shibata et al. (2015) respectively. We also include the corresponding part from the Russian SuperGLUE benchmark (Shavrina et al., 2020), a collection of Winograd Schemas in Chinese from the WSC website 3 , and the Portuguese version of WSC (Melo et al., 2019) into our multilingual benchmark.
In addition, we attempted to use Mandarinograd (Bernard and Han, 2020), a Mandarin Chinese version of WSC. However, this dataset contains questions instead of pronouns that need to be resolved. As such, we were unable to incorporate its contents without significantly changing the task.

Preprocessing and filtering
As the datasets for different languages were released in several different formats, in order to have a unified evaluation framework, we needed to convert them all to the same schema. Unfortunately, due to the differences in task formalization we were unable to convert certain examples without completely changing them; as a result, these examples had to be removed from the dataset. Still, our main priority was to maintain the same task format while keeping as many examples as possible; to this end, we fixed minor annotation inconsistencies by hand wherever possible.
Below we describe the steps of our pipeline. First, several examples had more than two candidate choices, i.e., more than one incorrect option was given. We convert these examples into several binary choice problems and report the original dataset sizes after executing this step. Next, the main issue we faced was that the right answer was not always included as a substring of the input sentence. Often this can be explained by missing articles, typos, or differences in word capitalization; we attempted to fix all such errors.
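The first step, expanding a multi-candidate example into binary problems, can be sketched as follows; the field layout and the alternation scheme for the correct answer's position are illustrative assumptions, not the paper's actual conversion code.

```python
# Expand an example with k > 2 candidates into k-1 binary-choice problems,
# each pairing the correct answer with one distractor.
def to_binary_problems(sentence, pronoun, correct, candidates):
    """Yield (sentence, pronoun, option_a, option_b, label) tuples,
    where label 0/1 marks which option is the correct answer."""
    problems = []
    distractors = [c for c in candidates if c != correct]
    for i, wrong in enumerate(distractors):
        # Alternate the position of the correct answer to avoid label bias.
        if i % 2 == 0:
            problems.append((sentence, pronoun, correct, wrong, 0))
        else:
            problems.append((sentence, pronoun, wrong, correct, 1))
    return problems

sent = "The trophy doesn't fit into the suitcase because it is too large."
probs = to_binary_problems(sent, "it", "the trophy",
                           ["the trophy", "the suitcase", "the lid"])
print(len(probs))  # 2 binary problems from 3 candidates
```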
The resulting dataset sizes are listed in Table 1; it can be seen that our conversion pipeline discards approximately 29% of the data. In the future, more of the discarded examples could be recovered and added to the benchmark.

Experiments
Below we describe the experimental setup used to evaluate cross-lingual transfer capabilities of different approaches to commonsense reasoning and report the results. Note that we also aim to study the universal reasoning properties of attention heads, and thus we do not evaluate our method on common monolingual Winograd Schema datasets.

Setup
Models We use multilingual BERT (Devlin et al., 2019) and XLM-R-Large (Conneau et al., 2020), as these models are frequently used in the multilingual evaluation literature. The first model has 12 layers with 12 attention heads each, whereas the second model is a 24-layer Transformer with 16 attention heads in each layer. We do not evaluate XLM-R-Base or multilingual translation encoders, because we take the two best-performing models according to the XTREME benchmark (Hu et al., 2020). For our method, we use the implementation of logistic regression from scikit-learn (Pedregosa et al., 2011) with default hyperparameters as a linear classifier over attention weights.
Evaluation For unsupervised methods, we directly apply each method to each language subset and report the classification accuracy. For supervised methods, we first choose a single language for training and generate random train-validation-test splits, leaving 10% of the data for each of the validation and test subsets. For each language, we create 5 such random splits and report the accuracy averaged over them.

Baselines
To compare our approach with currently popular methods, we also evaluate a set of well-performing approaches described in earlier works:
Unsupervised We use three entirely unsupervised baselines inspired by prior work. For the first approach, we replace the pronoun with a number of [MASK] tokens equal to the length of each candidate answer and compare the MLM probabilities. For the second approach, we replace the pronoun with each of the answers and rank the candidates by "pseudo-perplexity" (Salazar et al., 2020), inspired by the results of Trinh and Le (2018). Both baselines normalize the scores with respect to the candidate word length. The third unsupervised baseline is the Maximum Attention Score (MAS) described in Klein and Nabi (2019). Similarly to our method, this approach relies on attention weights for prediction; however, they are utilized differently, and the model is unable to discover an optimal subset of heads.
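The length-normalized ranking behind the first two baselines can be sketched as follows; `token_log_prob` is a placeholder for a real masked-LM scorer (which would mask each token in turn and read off its log-probability), so the toy scorer below exists only to make the example runnable.

```python
# Rank candidate answers by a length-normalized sum of per-token scores,
# in the spirit of pseudo-log-likelihood scoring (Salazar et al., 2020).
def pseudo_log_likelihood(tokens, token_log_prob):
    """Average per-token score, normalizing for candidate length."""
    return sum(token_log_prob(tokens, i) for i in range(len(tokens))) / len(tokens)

def rank_candidates(template, candidates, token_log_prob):
    """Substitute each candidate for the pronoun slot; return the best index."""
    scores = []
    for cand in candidates:
        tokens = template.replace("[PRONOUN]", cand).split()
        scores.append(pseudo_log_likelihood(tokens, token_log_prob))
    return max(range(len(candidates)), key=scores.__getitem__)

# Toy scorer that favors sentences containing "trophy", purely for illustration.
toy = lambda tokens, i: 0.0 if tokens[i] == "trophy" else -1.0
best = rank_candidates("the [PRONOUN] is too large", ["trophy", "suitcase"], toy)
print(best)  # index of the preferred candidate
```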
Supervised First, we evaluated the masked language model finetuning approach suggested by the authors of WinoGrande (Sakaguchi et al., 2020). However, in our experiments there are no additional large-scale datasets; we found that with the reference hyperparameters, the authors' implementation quickly overfits the training data for all languages in our relatively small benchmark, achieving less than 50% zero-shot accuracy on average.
In addition, we used the margin-based classification approach described by Kocijan et al. (2019). This method achieves competitive results and outperforms the unsupervised baselines in most setups, so we include it in our comparison.

Results
The results of our experiments for multilingual BERT and XLM-R-Large are shown in Tables 2 and 3 respectively. It can be seen that despite using only the attention weights as features, our method can outperform unsupervised approaches and performs competitively with a state-of-the-art supervised approach in several setups. Notably, the quality improves significantly when going from BERT to XLM-R: this is in line with previous work on the evaluation of cross-lingual encoders (Hu et al., 2020). At the same time, the quality of our method improves more significantly than that of the method suggested by Kocijan et al. (2019): this may be explained by a greater parameter count and a higher number of attention heads with more distinct specializations.

Ablation study
Here we compare the algorithm variants listed in Section 3.3. We train all models on the English part of the dataset and evaluate on all other languages, using validation subset performance as our target metric. As Table 4 demonstrates, each alternative choice leads to a drop in performance, with the most influential factor being feature concatenation instead of taking the difference, and the attention direction being the least important decision.

Analyzing the attention heads
In this section, we analyze the reasons behind the competitive generalization performance of our approach. In particular, we compare the subsets of heads learned on different languages and measure their impact on prediction quality.

Universal commonsense reasoning
For the first experiment, we rank the heads of models trained on all languages with XLM-R representations by the absolute value of their classifier weight (the results for mBERT are available in Appendix B). Then, we consider the top-5 heads which are ranked highest on average across all languages. These common heads are located in the higher layers of the model, which were previously shown to encode mainly semantic features (Raganato and Tiedemann, 2018;Jo and Myaeng, 2020); this intuitively corresponds to the tasks the model needs to solve for pronoun resolution. Figure 2 shows the average attention weights of these heads for each word in several example sentences.
After we locate the most important common heads, we train linear classifiers restricted to these heads as features for every language. To evaluate the importance of head choice, we also report the performance of linear classifiers trained on a fixed subset of 5 random heads. The results of this experiment can be seen in Table 5; we observe that using the same top-5 heads (only 1.3% of the total number) across all languages preserves or even improves the results. The only exception is Chinese, which might not have enough labeled data to extract a sufficient amount of task-specific information. This means that only a very small subset of attention weights is required to perform commonsense reasoning in all evaluated languages, which further supports previous results on the analysis of linguistic universals in cross-lingual models (Chi et al., 2020;Wang et al., 2019b). Moreover, restricting the subset of heads used in the MAS baseline to those selected by the classifiers significantly improves the quality of this unsupervised method as well, nearly closing the gap with the results obtained with supervision. This leads us to the conclusion that the initially poor performance of MAS might be caused by a suboptimal choice of attention heads; when the right heads are selected, their exact weights do not impact the predictions as significantly. Future unsupervised methods for commonsense reasoning can use this information to pay more attention to the choice of heads, which is currently a less explored subject.
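The head-selection procedure described above can be sketched on synthetic data as follows; the weight matrices below stand in for the per-language classifier weights, and the injected "important" head indices are purely illustrative.

```python
# Rank heads by the mean absolute classifier weight across languages,
# then retrain a restricted classifier on the top-5 heads only.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n_heads, n_langs = 384, 3            # e.g. XLM-R-Large: 24 layers * 16 heads
per_lang_weights = rng.normal(size=(n_langs, n_heads))
per_lang_weights[:, [10, 42, 100, 200, 300]] += 3.0  # shared "important" heads

# Average the absolute weight across languages and keep the top-5 heads.
importance = np.abs(per_lang_weights).mean(axis=0)
top5 = np.argsort(importance)[-5:]
print(sorted(top5.tolist()))

# Retrain a restricted classifier using only the selected heads as features.
X = rng.normal(size=(200, n_heads))
y = (X[:, top5].sum(axis=1) > 0).astype(int)  # toy labels driven by top heads
clf = LogisticRegression(max_iter=1000).fit(X[:, top5], y)
print(clf.score(X[:, top5], y))
```

The same selected indices could equally be used to restrict an unsupervised scorer such as MAS to the chosen heads.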

The impact of number of heads
In this experiment, we directly study the connection between the number of heads and the quality of predictions. Specifically, after training a model with the full set of attention heads, we order them by the absolute value of their weight. Then, we retrain the model while keeping only the top-N most important heads. Figure 3 displays the results of our study for the English language; results for other languages can be seen in Appendix C. From these results, we find that although the training accuracy monotonically increases with the number of used attention weights, the optimal number of heads for cross-lingual generalization is approximately 16. This number is optimal or near-optimal for other languages as well, which might mean that as the number of features grows, the model either simply overfits the data or starts relying on features that are not universal across languages.

Conclusion
In this work, we offer a simple supervised method to utilize pretrained language models for commonsense reasoning. It relies only on the outputs of self-attention and outperforms complete finetuning in a zero-shot scenario.
We also create a multilingual dataset of Winograd schemas that contains tasks from English, French, Japanese, Russian, Chinese, and Portuguese languages with the same specification. We want to encourage research on commonsense reasoning in languages other than English and release our benchmark to facilitate the development and analysis of new methods for this problem.
Lastly, we demonstrate that the reasoning capabilities of cross-lingual models are concentrated in a small subset of attention heads located in the higher layers of the model. Furthermore, this subset of heads is language-agnostic, which sheds light on another facet of the linguistic universals encoded in models such as multilingual BERT and XLM-R.