Contributions of Transformer Attention Heads in Multi- and Cross-lingual Tasks

This paper studies the relative importance of attention heads in Transformer-based models to aid their interpretability in cross-lingual and multi-lingual tasks. Prior research has found that only a few attention heads are important in each mono-lingual Natural Language Processing (NLP) task, and that pruning the remaining heads leads to comparable or improved model performance. However, the impact of pruning attention heads is not yet clear in cross-lingual and multi-lingual tasks. Through extensive experiments, we show that (1) pruning a number of attention heads in a multi-lingual Transformer-based model has, in general, positive effects on its performance in cross-lingual and multi-lingual tasks and (2) the attention heads to be pruned can be ranked using gradients and identified with a few trial experiments. Our experiments focus on sequence labeling tasks, with potential applicability to other cross-lingual and multi-lingual tasks. For comprehensiveness, we examine two pre-trained multi-lingual models, namely multi-lingual BERT (mBERT) and XLM-R, on three tasks across 9 languages each. We also discuss the validity of our findings and their extensibility to truly resource-scarce languages and other task settings.


Introduction
Prior research on mono-lingual Transformer-based (Vaswani et al., 2017) models reveals that a subset of their attention heads makes key contributions to each task, and the models perform comparably well (Voita et al., 2019; Michel et al., 2019) or even better (Kovaleva et al., 2019) with the remaining heads pruned. While multi-lingual Transformer-based models, e.g. mBERT (Devlin et al., 2019) and XLM-R (Conneau et al., 2020), are widely applied in cross-lingual and multi-lingual NLP tasks (Keung et al., 2019; Eskander et al., 2020), no attempt has been made to extend the findings of the aforementioned mono-lingual research to this context. In this paper, we explore the roles of attention heads in cross-lingual and multi-lingual tasks for two reasons. First, better understanding and interpretability of Transformer-based models leads to more efficient model designs and parameter tuning. Second, head pruning makes Transformer-based models more applicable to truly resource-scarce languages, provided it does not significantly harm model performance.
The biggest challenge we face when studying the roles of attention heads in cross-lingual and multi-lingual tasks is locating the heads to prune. Existing research has shown that each attention head is specialized to extract a collection of linguistic features, e.g., the middle layers of BERT mainly extract syntactic features (Vig and Belinkov, 2019; Hewitt and Manning, 2019) and the fourth head on the fifth layer of BERT greatly contributes to the coreference resolution task (Clark et al., 2019). Thus, we hypothesize that the important feature extractors for a task should be shared across languages and the remaining heads can be pruned. We evaluate two approaches to ranking attention heads, the first of which is layer-wise relevance propagation (LRP; Ding et al., 2017). Voita et al. (2019) interpreted the adaptation of LRP to Transformer-based models for machine translation. Motivated by Feng et al. (2018) and Serrano and Smith (2019), we design a second ranking method based on gradients, since the gradients on each attention head reflect its contribution to the predictions.
We study the effects of pruning attention heads on three sequence labeling tasks, namely part-of-speech tagging (POS), named entity recognition (NER), and slot filling (SF). We focus on sequence labeling tasks since they are more difficult to annotate than document- or sentence-level classification datasets and require more treatment in cross-lingual and multi-lingual research. We choose POS and NER datasets in 9 languages, where English (EN), Chinese (ZH), and Arabic (AR) are candidate source languages. The MultiAtis++ corpus is used in the SF evaluations with EN as the source language. We do not include syntactic chunking and semantic role labeling tasks due to the lack of manually written and annotated corpora. In these experiments, we rank attention heads based only on the source language(s) to ensure the extensibility of the learned knowledge to cross-lingual tasks and resource-poor languages. In our preliminary experiments comparing the gradient-based method and LRP on NER with mBERT, the average F-1 score improvements are 0.69 (cross-lingual) and 0.24 (multi-lingual) for LRP and 0.81 (cross-lingual) and 0.31 (multi-lingual) for the gradient-based method, though both methods rank attention heads similarly. Thus we choose the gradient-based method to rank attention heads in all our experiments.
Our evaluations confirm that only a subset of attention heads in each Transformer-based model makes key contributions to each cross-lingual or multi-lingual task and that these heads are shared across languages. Model performance generally drops when the highest-ranked or randomly selected heads are pruned, validating the head rankings generated by our gradient-based method. We also observe performance improvements on tasks with multiple source languages by pruning attention heads. Our findings potentially apply to truly resource-scarce languages, since we show that the models perform better with attention heads pruned when fewer training instances are available in the target languages.
The contributions of this paper are three-fold:
• We explore the roles of attention heads in multi-lingual Transformer-based models and find that pruning certain heads leads to comparable or better performance in cross-lingual and multi-lingual sequence labeling tasks.

Datasets
We use human-written and manually annotated datasets in our experiments to avoid noise from machine translation and automatic label projection. We choose POS and NER datasets in 9 languages, namely EN, ZH, AR, Hebrew (HE), Japanese (JA), Persian (FA), German (DE), Dutch (NL), and Urdu (UR). As Table 1 shows, these languages fall in diverse language families and the datasets differ greatly in size. EN, ZH, and AR are used as candidate source languages since they are resource-rich in many NLP tasks. Our POS datasets are all from Universal Dependencies (UD) v2.7. These datasets are labeled with a common label set containing 17 POS tags.
For NER, we use the NL, EN, and DE datasets from the CoNLL-2002 and CoNLL-2003 shared tasks (Tjong Kim Sang, 2002; Tjong Kim Sang and De Meulder, 2003). Additionally, we use the People's Daily dataset, iob2corpus, AQMAR (Mohit et al., 2012), ArmanPersoNERCorpus (Poostchi et al., 2016), MK-PUCIT (Kanwal et al., 2020), and a news-based NER dataset (Mordecai and Elhadad, 2012) for ZH, JA, AR, FA, UR, and HE, respectively. Since the NER datasets are individually constructed for each language, their label sets do not fully agree. As there are four NE types (PER, ORG, LOC, MISC) in the three source-language datasets, we merge the other NE types into the MISC class to allow cross-lingual evaluations.
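As a concrete illustration, this label harmonization can be implemented as a small mapping over BIO tags. The entity types outside the four shared classes used below (e.g. "FAC") are hypothetical examples, not the actual labels of the datasets above:

```python
def harmonize_ner_tag(tag, keep=("PER", "ORG", "LOC", "MISC")):
    """Merge NE types outside the four shared classes into MISC,
    preserving the BIO prefix (B-/I-)."""
    if tag == "O":
        return tag
    prefix, _, etype = tag.partition("-")
    return tag if etype in keep else f"{prefix}-MISC"

# A dataset-specific type such as the hypothetical "B-FAC" maps to "B-MISC".
```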
We evaluate SF models on MultiAtis++ with EN as the source language and Spanish (ES), Portuguese (PT), DE, French (FR), ZH, JA, Hindi (HI), and Turkish (TR) as target languages. There are 71 slot types in the TR dataset, 75 in the HI dataset, and 84 in each of the other datasets. We do not use the intent labels in our evaluations since we study only sequence labeling tasks; thus our results are not directly comparable with previously reported ones.

Methodology
Here, we introduce the gradient-based method we use to rank the attention heads in our experiments. Feng et al. (2018) claim that gradients measure the importance of features to predictions. Since each head functions similarly to a standalone feature extractor in a Transformer-based model, we use gradients to approximate the importance of the feature set extracted by each head and rank the heads accordingly. Michel et al. (2019) determine the importance of heads with the gradients accumulated at each head over a training epoch. Different from their approach, we fine-tune the model on the training set and rank the heads using gradients on the development set, to ensure that the head importance rankings are not significantly correlated with the training instances in one source language. Specifically, our method generates head rankings for each language in three steps: (1) We fine-tune a Transformer-based model on a mono-lingual task for three epochs.
(2) We re-run the fine-tuned model on the development partition of the dataset with back-propagation but no parameter updates to obtain gradients.
(3) We sum up the absolute gradients on each head, layer-wise normalize the accumulated gradients, and scale them into the range [0, 1] globally.
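Step (3) can be sketched as follows, assuming the absolute gradients have already been accumulated into a layers × heads matrix (the numbers below are illustrative, not from our experiments):

```python
import numpy as np

def head_importance(accumulated_grads):
    """Step (3): layer-wise normalize the accumulated absolute
    gradients, then min-max scale them into [0, 1] globally."""
    g = np.abs(np.asarray(accumulated_grads, dtype=float))
    g = g / g.sum(axis=1, keepdims=True)        # layer-wise normalization
    return (g - g.min()) / (g.max() - g.min())  # global scaling to [0, 1]

# Hypothetical 4-layer, 3-head model.
grads = [[0.2, 0.5, 0.3],
         [0.2, 0.2, 1.6],
         [0.4, 0.4, 0.2],
         [0.9, 0.05, 0.05]]
scores = head_importance(grads)
# The lowest-scored head is the first candidate to prune.
layer, head = np.unravel_index(scores.argmin(), scores.shape)
```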
We show Spearman's rank correlation coefficients (Spearman's ρ) between the head rankings of each language pair generated by our method on POS, NER, and SF in Figure 1. The highest-ranked heads largely overlap in all three tasks, while the rankings of unimportant heads vary more in mBERT than in XLM-R. After ranking the attention heads, we fine-tune the model with the lowest-ranked head in the source language pruned. We keep increasing the number of heads to prune until it reaches a preset limit or the performance starts to drop. We limit the number of trials to 12, since the models mostly show improved performance within 12 attempts.
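The trial procedure above can be sketched as the following loop; `evaluate` here is a stand-in for fine-tuning and scoring the model on the development set with a given set of heads pruned (with Huggingface Transformers, the actual removal of heads can be done via `model.prune_heads`):

```python
def prune_until_drop(ranked_heads, evaluate, max_prune=12):
    """Prune heads from least to most important, one more per trial,
    stopping at the trial limit or when the dev score drops.
    ranked_heads: (layer, head) pairs ordered least-important first.
    evaluate: maps a frozenset of pruned heads to a dev score."""
    best_score = evaluate(frozenset())
    best_pruned = frozenset()
    for k in range(1, max_prune + 1):
        pruned = frozenset(ranked_heads[:k])
        score = evaluate(pruned)
        if score < best_score:
            break
        best_score, best_pruned = score, pruned
    return best_pruned, best_score
```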

Experiments and Analysis
This section presents and analyzes experimental results on cross-lingual and multi-lingual POS, NER, and SF tasks. Training sets in target languages are not used to train the model under the cross-lingual setting. Our experiments are based on the Huggingface Transformers (Wolf et al., 2020) implementations of mBERT and XLM-R. Specifically, we use the pre-trained bert-base-multilingual-cased and xlm-roberta-base models for their comparable model sizes. The models are fine-tuned for 3 epochs with a learning rate of 5e-5 in all the experiments. We use the official dataset splits and load training instances with sequential data samplers, so the reported evaluation scores are robust to randomness. These results support that pruning heads generally has positive effects on model performance in cross-lingual and multi-lingual tasks, and that our method correctly ranks the heads.

Table 2: F-1 scores of mBERT and XLM-R on POS. SL and TL refer to source and target languages, and CrLing and MulLing stand for the cross-lingual and multi-lingual settings, respectively. Unpruned results are produced by the full models; pruned results are the best scores each model produces with up to 12 lowest-ranked heads pruned. The higher score in each pair of pruned and unpruned experiments is in bold.

POS
Consistent with Conneau et al. (2020), XLM-R usually outperforms mBERT, with exceptions in cross-lingual experiments involving the ZH and JA datasets. Word segmentation in ZH and JA differs from the other languages we choose, e.g. words are not separated by white spaces and unpaired adjacent word pieces often make up a new word. As XLM-R applies the SentencePiece tokenization method (Kudo and Richardson, 2018), it is more likely than mBERT to detect wrong word boundaries and make improper predictions in cross-lingual experiments involving the ZH or JA datasets. We note that the performance improvements are solid regardless of the source language selection and the severe differences in training data sizes across EN, ZH, and AR. This demonstrates the correctness of the head rankings our method generates and that the important attention heads for a task are almost language-invariant.

Table 3: F-1 scores of mBERT and XLM-R on NER. SL and TL refer to source and target languages, and CrLing and MulLing stand for the cross-lingual and multi-lingual settings, respectively. Unpruned results are produced by the full models; pruned results are the best scores each model produces with up to 12 lowest-ranked heads pruned.
We also examine to what extent the score improvements are affected by the relationships between source and target languages, e.g. language families, URIEL language distance scores (Littell et al., 2017), and the similarity of the head ranking matrices. There are three non-exclusive clusters of language families (containing more than one language) in our choice of languages, namely Indo-European (IE), Germanic, and Semitic languages. Average score improvements between models with and without head pruning are 0.40 (IE), 0.16 (Germanic), and 0.91 (Semitic) for mBERT and 0.19 (IE), 0.18 (Germanic), and 0.19 (Semitic) for XLM-R. In comparison, the overall average score improvements are 0.53 for mBERT and 0.97 for XLM-R. Despite the generally higher performance of models when the source and target languages are in the same family, the score improvements from pruning heads are not necessarily associated with language families. Additionally, we use Spearman's ρ to measure the correlations between improved F-1 scores and URIEL language distances. The correlation scores are 0.11 (cross-lingual) and 0.12 (multi-lingual) for mBERT, and -0.40 (cross-lingual) and 0.23 (multi-lingual) for XLM-R. Similarly, the Spearman's ρ between score improvements and the similarities in head ranking matrices shown in Figure 1 are -0.34 (cross-lingual) and 0.25 (multi-lingual) for mBERT, and -0.52 (cross-lingual) and -0.10 (multi-lingual) for XLM-R. This indicates that, except for the cross-lingual XLM-R model, which faces word segmentation issues in the ZH and JA experiments, pruning attention heads improves model performance regardless of the distances between source and target languages. Thus our findings are potentially applicable to all cross-lingual and multi-lingual POS tasks.
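For reference, Spearman's ρ between score improvements and language distances can be computed as below; the input numbers are hypothetical placeholders, not values from our experiments:

```python
def spearman_rho(x, y):
    """Spearman's rank correlation via the tie-free formula
    rho = 1 - 6 * sum(d_i ** 2) / (n * (n ** 2 - 1))."""
    n = len(x)
    rank = lambda v: {val: i for i, val in enumerate(sorted(v), start=1)}
    rx, ry = rank(x), rank(y)
    d2 = sum((rx[a] - ry[b]) ** 2 for a, b in zip(x, y))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical F-1 improvements and URIEL distances for five target languages.
improvements = [0.40, 0.16, 0.91, 0.53, 0.19]
distances = [0.31, 0.55, 0.72, 0.48, 0.60]
rho = spearman_rho(improvements, distances)
```

In practice, `scipy.stats.spearmanr` additionally handles ties and reports a p-value.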

NER
As Table 3 shows, pruning attention heads generally has positive effects on our cross-lingual and multi-lingual NER models. Even in the multi-lingual AR-UR experiment, where the full mBERT model achieves an F-1 score of 99.26, the score is raised to 99.31 by pruning heads. Scores are comparable with and without head pruning in the 19 cases where model performance is not improved. This also lends support to the specialized role of important attention heads and the consistency of head rankings across languages. In the NER experiments, performance drops mostly happen when the source and target languages are from different families. This is likely caused by the difference in named entity (NE) representations across language families. We show in Section 5.2 that the gap is largely bridged when a language from the same family as the target language is added to the source languages. Average score improvements are comparable on mBERT (0.81 under cross-lingual and 0.31 under multi-lingual settings) and XLM-R (1.08 under cross-lingual and 0.67 under multi-lingual settings) in the NER experiments. The results indicate that the performance improvements introduced by head pruning are not sensitive to the pre-training corpora of the models. The correlations between F-1 score improvements and URIEL language distances are small, with Spearman's ρ of -0.05 (cross-lingual) and -0.27 (multi-lingual) for mBERT and 0.10 (cross-lingual) and 0.12 (multi-lingual) for XLM-R. Similarities between head ranking matrices do not greatly affect score improvements either; the Spearman's ρ values are -0.08 (cross-lingual) and 0.06 (multi-lingual) for mBERT and 0.05 (cross-lingual) and 0.12 (multi-lingual) for XLM-R. The findings in the POS and NER experiments are consistent, supporting our hypothesis that the important heads for a task are shared by arbitrary source-target language selections.

Slot Filling
We report the SF evaluation results in Table 4. In 31 out of 34 pairs of experiments, pruning up to 12 heads results in performance improvements, while the scores are comparable in the other three cases. These results agree with those in the POS and NER experiments, showing that only a subset of heads in each model makes key contributions to cross-lingual or multi-lingual tasks.
We also evaluate the correlations between score changes and the closeness of source and target languages. In terms of URIEL language distances, the Spearman's ρ are 0.69 (cross-lingual) and 0.14 (multi-lingual) for mBERT and -0.59 (cross-lingual) and 0.14 (multi-lingual) for XLM-R. The coefficients between score improvements and similarities in head ranking matrices are -0.25 (cross-lingual) and -0.73 (multi-lingual) for mBERT and -0.70 (cross-lingual) and -0.14 (multi-lingual) for XLM-R. While these coefficients are generally higher in magnitude than those in the POS and NER evaluations, their p-values are also high (0.55 to 0.74), indicating that the correlations between the score changes and source-target language closeness are not statistically significant.

Table 5: F-1 score differences from the full mBERT model on NER (upper) and POS (lower) by pruning the highest-ranked (Max-Pruning) or random (Rand-Pruning) heads in the ranking matrices. The source language is EN. Blue and red cells indicate score drops and improvements, respectively.

Discussions
In this section, we perform case studies to confirm the validity of our head ranking method. We also illustrate the extensibility of the knowledge we learn from the main experiments to a wider range of settings, e.g. when the training dataset is limited in size or constructed over multiple source languages.

Correctness of Head Rankings
We evaluate the correctness of our head ranking method through comparisons between the results in Tables 2 and 3 and those produced by pruning (1) randomly sampled heads and (2) the highest-ranked heads. Specifically, we repeat the head-pruning experiments with mBERT on NER and POS using EN as the source language and display the score differences from the full models in Table 5. As in the main experiments, we pick the best score from pruning 1 to 12 heads in each experiment. A random seed of 42 is used for sampling attention heads to prune under the random sampling setting. In 14 out of 16 NER experiments, pruning the heads ranked highest by our method results in noticeable performance drops compared to the full model. Consistently, pruning the highest-ranked attention heads harms the performance of mBERT in 15 out of 16 POS experiments. Though the score changes are slightly positive for the cross-lingual EN-DE and multi-lingual EN-ZH NER tasks and in the cross-lingual EN-ZH POS experiment, the improvements introduced by pruning the lowest-ranked heads are more significant, as Tables 2 and 3 show. Pruning random attention heads also has mainly negative effects on the performance of mBERT. These results indicate that while pruning attention heads potentially boosts the performance of models, choosing the heads to prune in a principled way is important. Our gradient-based method properly ranks the heads by their priority to prune.
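The Rand-Pruning baseline can be reproduced deterministically with a fixed seed; the 12 x 12 layer/head grid below assumes the geometry of bert-base-multilingual-cased:

```python
import random

def sample_heads_to_prune(n_layers, n_heads, k, seed=42):
    """Sample k distinct (layer, head) pairs uniformly at random,
    as in the Rand-Pruning setting (seed 42)."""
    rng = random.Random(seed)
    grid = [(l, h) for l in range(n_layers) for h in range(n_heads)]
    return rng.sample(grid, k)

heads = sample_heads_to_prune(12, 12, 5)
```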

Multiple Source Languages
Training cross-lingual models on multiple source languages is a practical way to improve their performance, due to the enlarged training data size and supervision from source languages closer to the target language (Wu et al., 2020; Moon et al., 2019; Chen et al., 2019; Rahimi et al., 2019; Täckström, 2012). We also explore the effects of pruning attention heads under the multi-source setting. In this section, we experiment with mBERT on the EN, DE, AR, HE, and ZH datasets for both the NER and POS tasks. These languages fall into three mutually exclusive language families, enabling our analysis of the influence of training cross-lingual models with source languages belonging to the same family as the target language. Similar to related research, the model is fine-tuned on the concatenation of the training datasets in all the languages but the one on which the model is tested.
Since the head ranking matrices are not identical across languages, we design three heuristics to rank the heads in the multi-source experiments. The first method merges the head ranking matrices of all the source languages into one matrix and re-generates the rankings. The second method ranks the attention heads after summing up the head ranking matrices. We also examine the efficacy of pruning heads based on the head rankings from a single language. For this heuristic, we run experiments using the head ranking matrix from each language and report the highest score. We refer to the three heuristics as MD, SD, and EC, respectively. Table 6 displays the results. We note that in the NER evaluations, the performance of mBERT on all the languages but ZH is higher than in the single-source experiments. This supports our hypothesis that supervision from languages in the same family as the target language helps improve model performance. Different from NER, the evaluation results on POS are not much higher than the single-source evaluation scores, implying that syntactic features are more consistent across languages than the appearances of named entities. However, it is consistent on both tasks that pruning attention heads brings performance boosts to all the multi-source experiments. While the EC heuristic provides the largest improvement margin in 3 out of 5 experiments, it requires many more trial experiments. MD and SD perform comparably well in most cases, so they are also promising heuristics for ranking attention heads under the multi-source setting. The results support that pruning attention heads is beneficial to Transformer-based models in cross-lingual tasks even if the training dataset is already large and diverse in languages.
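One plausible implementation of the MD and SD heuristics is sketched below, treating each language's head ranking matrix as a layers x heads score matrix; the exact merging step of MD is our reading rather than a full specification, and the matrices are hypothetical:

```python
import numpy as np

def rank_matrix(scores):
    """Rank heads from 0 (least important) to n-1 (most important)."""
    order = np.argsort(scores, axis=None)
    ranks = np.empty(scores.size, dtype=int)
    ranks[order] = np.arange(scores.size)
    return ranks.reshape(scores.shape)

def sd_ranking(score_matrices):
    """SD: sum the per-language score matrices, then rank."""
    return rank_matrix(np.sum(score_matrices, axis=0))

def md_ranking(score_matrices):
    """MD (one reading): rank within each language, merge the rank
    matrices by summation, then re-rank."""
    ranks = np.stack([rank_matrix(m) for m in score_matrices])
    return rank_matrix(ranks.sum(axis=0))

# Two hypothetical 2-layer, 2-head score matrices.
mats = [np.array([[0.1, 0.9], [0.5, 0.3]]),
        np.array([[0.2, 0.8], [0.6, 0.45]])]
```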

Extension to Resource-poor Languages
While the languages we use in the main experiments are not truly resource-poor, we examine our findings when the training sets in the target languages are smaller. We design experiments under the multi-lingual setting with subsampled training datasets in the target languages. Specifically, we randomly divide the training set of each target language into 10 disjoint subsets and compare model performance, with and without head pruning, using 1 to 9 subsets. We do not use 0 or 10 subsets since they correspond to the cross-lingual and fully multi-lingual settings, respectively. We run the evaluations on the NER and POS tasks. These datasets vary greatly in size, allowing us to validate our findings on target-language datasets with as few as 80 training examples. The UR NER dataset is excluded from this case study since its training set is overly large. We note that the score differences with and without head pruning are, in the main experiments, consistent for all the choices of models and source languages. Thus, we only display the mBERT performance with EN as the source language on NER in Figure 2 and that on POS in Figure 3.
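The subsampling scheme can be sketched as a disjoint 10-way split; the shuffling seed is our assumption, as the paper does not specify one:

```python
import random

def split_into_subsets(examples, n_subsets=10, seed=0):
    """Randomly partition a training set into n disjoint subsets.
    Using the first k subsets simulates a resource-poor target
    language (k = 0: cross-lingual; k = 10: fully multi-lingual)."""
    idx = list(range(len(examples)))
    random.Random(seed).shuffle(idx)
    return [[examples[i] for i in idx[j::n_subsets]] for j in range(n_subsets)]

# e.g. an 800-instance training set yields ten 80-example subsets.
subsets = split_into_subsets(list(range(800)))
```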
The evaluation results are consistent with those in our main experiments, where the model with up to 12 attention heads pruned generally outperforms the full mBERT model. This further supports our hypothesis that pruning lower-ranked attention heads has positive effects on the performance of Transformer-based models in truly resource-scarce languages. It is also worth noting that pruning attention heads often causes the mBERT model to reach peak evaluation scores with less training data in the target language. For example, in the EN-JA NER experiments, the full model achieves the highest F-1 score when all the 800 training instances in the JA dataset are used while the model with heads pruned achieves a comparable score with 20% less data. This suggests that pruning attention heads makes deep Transformer-based models easier to train with less training data and thus more applicable to truly resource-poor languages.

Conclusion and Future Work
This paper studied the contributions of attention heads in Transformer-based models. Past research has shown that in mono-lingual tasks, pruning a large number of attention heads can achieve comparable or higher performance than the full models. We are the first to extend these findings to cross-lingual and multi-lingual sequence labeling tasks. Using a gradient-based method, we identified the heads to prune and showed that pruning attention heads generally has positive effects on the performance of mBERT and XLM-R. Additional case studies empirically demonstrated the validity of our findings and showed that they extend to a wider range of task settings. In addition to a better understanding of Transformer-based models under cross-lingual and multi-lingual settings, our findings can be applied to existing models to achieve better performance with reduced training data and resource consumption. Future work includes improving model interpretability in other cross-lingual and multi-lingual tasks, e.g. XNLI (Conneau et al., 2018) and other passage-level classification tasks.