What’s in Your Head? Emergent Behaviour in Multi-Task Transformer Models

The primary paradigm for multi-task training in natural language processing is to represent the input with a shared pre-trained language model, and add a small, thin network (head) per task. Given an input, a target head is the head that is selected for outputting the final prediction. In this work, we examine the behaviour of non-target heads, that is, the output of heads when given input that belongs to a different task than the one they were trained for. We find that non-target heads exhibit emergent behaviour, which may either explain the target task, or generalize beyond their original task. For example, in a numerical reasoning task, a span extraction head extracts from the input the arguments to a computation that results in a number generated by a target generative head. In addition, a summarization head that is trained with a target question answering head, outputs query-based summaries when given a question and a context from which the answer is to be extracted. This emergent behaviour suggests that multi-task training leads to non-trivial extrapolation of skills, which can be harnessed for interpretability and generalization.


Introduction
The typical framework for training a model in natural language processing to perform multiple tasks is to have a shared pre-trained language model (LM), and add a small, compact neural network, often termed head, on top of the LM, for each task (Clark et al., 2019; Liu et al., 2019b; Nishida et al., 2019; Hu and Singh, 2021). The heads are trained in a supervised manner, each on labelled data collected for the task it performs (Devlin et al., 2019). At inference time, the output is read out of a selected target head, while the outputs from the other heads are discarded (Figure 1).
What is the nature of predictions made by non-target heads given inputs directed to the target head? One extreme possibility is that the pre-trained LM identifies the underlying task, and constructs unrelated representations for each task. In this case, the output of the non-target head might be arbitrary, as the non-target head observes inputs considerably different from those it was trained on. Conversely, the pre-trained LM might create similar representations for all tasks, which can lead to meaningful interactions between the heads.
Figure 1: An illustration of multi-task training with a pre-trained LM. Given an input for one of the tasks, a shared representation is computed with a pre-trained LM (green). The target head outputs the prediction, while the other heads are ignored. In this work, we characterize the behaviour of the non-target head.
In this work, we test whether such interactions occur in multi-task transformer models, and whether non-target heads decode useful information given inputs directed to the target head. We show that multi-head training leads to a steering effect, where the target head guides the behaviour of the non-target head, steering it to exhibit emergent behaviour, which can explain the target head's predictions, or generalize beyond the task the non-target head was trained for.
We study the "steering effect" in three multi-head models (Figure 2). In a numerical reading comprehension task (Dua et al., 2019), the model is given a question and paragraph and either uses an extractive head to output an input span, or a generative head to generate a number using arithmetic operations over numbers in the input (Figure 2, left). Treating the extractive head as the non-target head, we observe that it tends to output the arguments to the arithmetic operation performed by the decoder, and that successful argument extraction is correlated with higher performance. Moreover, we perform interventions (Woodward, 2005; Elazar et al., 2021), where we modify the representation based on the output of the extractive head, and show this leads to predictable changes in the behaviour of the generative head. Thus, we can use the output of the non-target head to improve interpretability.
We observe a similar phenomenon in a multi-hop question answering (QA) model (Yang et al., 2018), where a non-target span extraction head outputs supporting evidence for the answer predicted by a classification head (Figure 2, center). This emerging interpretability is considerably different from methods that explicitly train models to output explanations (Perez et al., 2019; Schuff et al., 2020).
Beyond interpretability, we observe non-trivial extrapolation of skills when performing multi-task training on extractive summarization (Hermann et al., 2015) and multi-hop QA (Figure 2, right). Specifically, a head trained for extractive summarization outputs supporting evidence for the answer when given a question and a paragraph, showing that multi-task training steers its behaviour towards query-based summarization. We show this does not happen in the absence of multi-task training.
To summarize, we investigate the behaviour of non-target heads in three multi-task transformer models, and find that without any dedicated training, non-target heads provide explanations for the predictions of target heads, and exhibit capabilities beyond the ones they were trained for. This extrapolation of skills can be harnessed for many applications. For example, teaching models new skills could be done by training on combinations of tasks different from the target task. This would be useful when labeled data is not available or hard to collect. Also, training an additional head that extracts information from the input could be applied as a general practice for model interpretability.

Multi-Head Transformer Models
The prevailing method for training models to perform NLP tasks is to add parameter-thin heads on top of a pre-trained LM, and fine-tune the entire network on labeled examples (Devlin et al., 2019).
Given a text input with $n$ tokens $x = x_1, \dots, x_n$, the model first computes contextualized representations $H = h_1, \dots, h_n = \text{LM}_\theta(x)$ using the pre-trained LM parameterized by $\theta$. These representations are then fed into the output heads, with each head $o$ estimating the probability $p_{\psi_o}(y \mid H)$ of the true output $y$ given the encoded input $H$ and the parameters $\psi_o$ of $o$. The head that produces the final model output, termed the target head, is chosen either deterministically, based on the input task, or predicted by an output head classifier $p(o \mid x)$. Predictions made by non-target heads are typically ignored. When $p(o \mid x)$ is deterministic it can be viewed as an indicator function for the target head.
Training multi-head transformer models is done by marginalizing over the set of output heads $O$, and maximizing the probability
$$p(y \mid x) = \sum_{o \in O} p(o \mid x) \, p_{\psi_o}(y \mid H).$$
In the next sections, we will show that the predictions of one head $o$ interact with those of another head $o'$. We will denote by $o_t$ the target head, and by $o_s$ the steered head.
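The marginalization over output heads described above can be illustrated with a small sketch. This is not the authors' implementation: the function names and the dictionary-based interface are assumptions made for illustration, and each head's probability of the gold output is taken as given rather than computed by a network.

```python
import math


def marginal_likelihood(head_selector_probs, head_output_probs):
    """Compute p(y | x) = sum over heads o of p(o | x) * p_o(y | H).

    head_selector_probs: hypothetical dict, head name -> p(o | x).
    head_output_probs:   hypothetical dict, head name -> p_o(y | H),
                         the probability that head assigns to the gold output
                         (0.0 if y is outside the head's output space).
    """
    return sum(
        head_selector_probs[head] * head_output_probs[head]
        for head in head_selector_probs
    )


def negative_log_likelihood(head_selector_probs, head_output_probs):
    """Training loss for one example: -log p(y | x)."""
    return -math.log(marginal_likelihood(head_selector_probs, head_output_probs))


# A numeric answer can only be produced by the generative head.
selector = {"gen": 0.8, "mse": 0.2}
outputs = {"gen": 0.5, "mse": 0.0}
p_y = marginal_likelihood(selector, outputs)
loss = negative_log_likelihood(selector, outputs)
```

When the selector is deterministic (probability 1.0 for one head), the sum reduces to the target head's own probability, matching the indicator-function view in the text.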

Overview: Experiments & Findings
This section provides an overview of our experiments, which are discussed in detail in §4, §5, §6.
Given a model with a target head $o_t$ and a steered head $o_s$, our goal is to understand the behaviour of $o_s$ on inputs where $o_t$ provides the prediction. To this end, we focus on head combinations, where $o_s$ is expressive enough to explain the outputs of $o_t$, but unlike most prior work aiming to explain by examining model outputs (Perez et al., 2019; Schuff et al., 2020), $o_s$ is not explicitly trained for this purpose. Concretely, our analysis covers three settings, illustrated in Figure 2 and summarized in Table 1. The first setting (Figure 2 left, and §4) considers a model with generative and extractive heads, trained on the DROP dataset (Dua et al., 2019) for numerical reasoning over text. Surprisingly, we observe that the arguments for the arithmetic computation required for the generative head to generate its answer often emerge in the outputs of the extractive head. The second setting (Figure 2 middle, and §5) considers a model with a classification head outputting 'yes'/'no' answers, and a span extraction head, trained on the HOTPOTQA dataset (Yang et al., 2018) for multi-hop reasoning. The outputs of the extractive head once again provide explanations in the form of supporting facts from the input context. The last setting (Figure 2 right, and §6) considers a model with two extractive heads, one for span extraction and another for (sentence-level) extractive summarization. Each head is trained on a different dataset: HOTPOTQA for span extraction and CNN/DAILYMAIL for summarization (Hermann et al., 2015). We find that the summarization head tends to extract the supporting facts given inputs from HOTPOTQA, effectively acting as a query-based summarization model.
We now present the above settings. Table 1 summarizes the main results. We denote by $\text{FFNN}^{(l)}_{m \times n}$ a feed-forward neural network with $l$ layers that maps inputs of dimension $m$ to dimension $n$.

Setting 1: Emerging Computation Arguments in Span Extraction
We start by examining a combination of generative and extractive heads (Figure 2, left), and analyze the spans extracted from the input when the generative head is selected to output the final answer.

Experimental Setting
Model We take GENBERT (Geva et al., 2020), a BERT-base model fine-tuned for numerical reasoning, and use it to initialize a variant called MSEGENBERT, in which the single-span extraction head is replaced by a multi-span extraction (MSE) head introduced by Segal et al. (2020), which allows extracting multiple spans from the input. This is important for supporting extraction of more than one argument. MSEGENBERT has three output heads. The first is the multi-span head, $o_{mse}$, which takes $H \in \mathbb{R}^{d \times n}$ and uses the BIO scheme (Ramshaw and Marcus, 1995) to classify each token in the input as the beginning of (B), inside of (I), or outside of (O) an answer span. The second head is the generative head, $o_{gen}$, a standard transformer decoder (Vaswani et al., 2017) initialized by BERT-base, that is tied to the encoder and performs cross-attention over $H$ (Geva et al., 2020). Last, a classification head takes the representation $h_{CLS}$ of the CLS token and selects the target head ($o_{mse}$ or $o_{gen}$). Implementation details are in Appendix A.1.
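To make the BIO scheme used by the multi-span head concrete, the following is a minimal sketch of how per-token B/I/O labels can be decoded into spans. The function name and the handling of an I tag that does not follow a B tag (treated here as O) are assumptions for illustration, not details taken from the MSE head itself.

```python
def decode_bio_spans(tokens, tags):
    """Turn per-token BIO tags into a list of extracted spans (as strings).

    A span starts at a 'B' tag and continues over following 'I' tags.
    An 'I' tag with no open span is ignored (an assumption of this sketch).
    """
    spans = []
    current = []
    for token, tag in zip(tokens, tags):
        if tag == "B":
            if current:  # close the previous span before opening a new one
                spans.append(" ".join(current))
            current = [token]
        elif tag == "I" and current:
            current.append(token)
        else:  # 'O', or a stray 'I' outside any span
            if current:
                spans.append(" ".join(current))
            current = []
    if current:  # close a span that runs to the end of the input
        spans.append(" ".join(current))
    return spans


tokens = ["in", "1923", "and", "19", "22", "."]
tags = ["O", "B", "O", "B", "I", "O"]
extracted = decode_bio_spans(tokens, tags)
```

Because every token receives its own tag, the number of decoded spans is not fixed in advance, which is what allows more than one computation argument to be extracted.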
Data We fine-tune MSEGENBERT on DROP (Dua et al., 2019), a dataset for numerical reasoning over paragraphs, consisting of passage-question-answer triplets where answers are either spans from the input or numbers that are not in the input. Importantly, $o_{mse}$ is trained only on span questions, as its outputs are restricted to tokens from the input. Moreover, less than 5% of DROP examples have multiple spans as an answer.
To evaluate the outputs of $o_{mse}$ on questions where the answer is a number that is not in the input, we use crowdsourcing to annotate 400 such examples from the development set. Each example was annotated with the arguments to the computation that are in the passage and are required to compute the answer. Each example was annotated by one of 7 crowdworkers who were qualified for the task. We regularly reviewed annotations to give feedback and avoid bad annotations. An example annotation is provided in Table 3. On average, there are 1.95 arguments annotated per question.
Evaluation metrics Given a list $P$ of spans extracted by $o_{mse}$ and a list of annotated arguments $G$, we define the following metrics for evaluation. We check argument recall by computing the fraction of arguments in $G$ that are also in $P$: $|P \cap G| / |G|$. We can then compute average recall over the dataset, and the proportion of questions with a perfect recall of 1.0 (first column in Table 2). Similarly, we compute precision by computing the fraction of arguments in $P$ that are also in $G$: $|P \cap G| / |P|$, and then the average precision over the dataset. We also measure the number of extracted spans on out-of-distribution math word problems, and observe similar patterns (details are in Appendix B.2).
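The argument recall and precision defined above can be sketched as follows. The sketch assumes exact string matching between extracted spans and annotated arguments, which may differ from the matching rule actually used; the function names and the example values are illustrative only.

```python
def argument_recall(predicted_spans, gold_arguments):
    """|P ∩ G| / |G|: fraction of annotated arguments that were extracted."""
    predicted, gold = set(predicted_spans), set(gold_arguments)
    if not gold:
        return 0.0
    return len(predicted & gold) / len(gold)


def argument_precision(predicted_spans, gold_arguments):
    """|P ∩ G| / |P|: fraction of extracted spans that are annotated arguments."""
    predicted, gold = set(predicted_spans), set(gold_arguments)
    if not predicted:
        return 0.0
    return len(predicted & gold) / len(predicted)


# Hypothetical example: three extracted spans, two annotated arguments.
P = ["1923", "1937", "1922"]
G = ["1923", "1922"]
recall = argument_recall(P, G)        # both arguments are covered
precision = argument_precision(P, G)  # one extracted span is spurious
```

Averaging `recall` over all annotated questions, and counting the questions where it equals 1.0, gives the two recall figures described in the text.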

Results
Moreover, model performance, which depends on $o_{gen}$, is correlated with the recall of predicted spans, extracted by $o_{mse}$. The Spearman correlation between model $F_1$ and recall for MSEGENBERT is 0.351 (Table 2) and statistically significant (p-value $5.6 \times 10^{-13}$), showing that when the computation arguments are covered, performance is higher.
These findings illustrate that multi-task training leads to emergent behaviour in the extractive head, which outputs computation arguments for the output of the generative head. We now provide more fine-grained analysis.
Distribution of extracted spans On average, MSEGENBERT extracts 2.12 spans per example, which is similar to the 1.95 spans extracted by annotators. Moreover, the average ratio $|P| / |G|$ is 1.2, indicating good correlation at the single-example level. Table 4 shows example outputs of $o_{mse}$ vs. the annotated arguments for the same questions. The full distributions of the number of extracted spans by MSEGENBERT compared to the annotated spans are provided in Appendix B.1.
Parameter sharing across heads We conjecture that the steering effect occurs when the heads are strongly tied, with most of their parameters shared. To examine this, we increase the capacity of the FFNN in $o_{mse}$ from $l = 1$ layer to $l = 2, 4$ layers, and also experiment with a decoder whose parameters, unlike GENBERT, are not tied to the encoder.
We find (Table 2) that reducing the dependence between the heads also diminishes the steering effect. While the models still tend to extract computation arguments, with much higher recall and precision compared to MSEBERT, they output 1.2 spans on average, which is similar to the distribution they were trained on. This leads to higher precision, but much lower recall and fewer cases of perfect recall. Overall model performance is not affected by changing the capacity of the heads.

Influence of Extracted Spans on Generation
The outputs of $o_{mse}$ and $o_{gen}$ are correlated, but can we control the output of $o_{gen}$ by modifying the value of span tokens extracted by $o_{mse}$? To perform such an intervention, we change the cross-attention mechanism in MSEGENBERT's decoder. Typically, the keys and values are both the encoder representations $H$. To modify the values read by the decoder, we change the value matrix to $H_{i \leftrightarrow j}$, in which the positions of the representations $h_i$ and $h_j$ are swapped (illustrated in Figure 3). Thus, when the decoder attends to the $i$'th token, it will get the value of the $j$'th token and vice versa. We choose the tokens $i, j$ to swap based on the output of $o_{mse}$. Specifically, for every input token $x_k$ that is a digit, we compute the probability $p^B_k$ assigned by $o_{mse}$ that it is the beginning of an output span. Then, we choose the position $i = \arg\max_k p^B_k$, and the position $j$ as a random position of a digit token. As a baseline, we employ the same procedure, but swap the positions $i, j$ of the two digit tokens with the highest outside (O) probabilities.
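The value-swap intervention described above can be sketched in plain Python. This is a simplified, single-head, single-query illustration and not the MSEGENBERT decoder: representations are stored one row per token (the transpose of the $d \times n$ layout in the text), the helper names are hypothetical, and excluding $i$ when sampling $j$ is an assumption of the sketch.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def swap_rows(H, i, j):
    """Return a copy of H (one row per token) with rows i and j exchanged."""
    swapped = [row[:] for row in H]
    swapped[i], swapped[j] = swapped[j], swapped[i]
    return swapped


def cross_attention(query, keys, values):
    """Scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    weights = softmax(
        [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    )
    dim = len(values[0])
    return [sum(w * v[c] for w, v in zip(weights, values)) for c in range(dim)]


def choose_swap_positions(p_begin, digit_positions, rng):
    """i: digit token with the highest B probability; j: another random digit."""
    i = max(digit_positions, key=lambda k: p_begin[k])
    j = rng.choice([k for k in digit_positions if k != i])
    return i, j


H = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
H_swapped = swap_rows(H, 0, 2)
query = [1.0, 0.0]
original = cross_attention(query, H, H)             # keys and values are both H
intervened = cross_attention(query, H, H_swapped)   # keys unchanged, values swapped
i, j = choose_swap_positions([0.1, 0.9, 0.2, 0.4], [1, 3], random.Random(0))
```

Only the value matrix is modified, so the attention weights are identical before and after the intervention; what changes is the content read at positions $i$ and $j$.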
We focus on questions where MSEGENBERT predicted a numeric output. Table 5 shows in how many cases each intervention instance changed the model prediction (by $o_{gen}$). For 40.6% of the questions, the prediction was altered due to the intervention in the highest probability B token, compared to only 0.03% (2 cases) by the baseline intervention. This shows that selecting the token based on $o_{mse}$ affects whether this token will lead to a change.
More interestingly, we test whether we can predict the change in the output of $o_{gen}$ by looking at the two digits that were swapped. Let $d$ and $d'$ be the values of the digits swapped, and let $n$ and $n'$ be the numeric outputs generated by $o_{gen}$ before and after the swap. We check whether $|n - n'| = |d - d'| \cdot 10^c$ for some integer $c$. For example, if we swap the digits '7' and '9', we expect the output to change by 2, 20, 0.2, etc. We find that in 543 cases out of 2,460 (22.1%) the change in the model output is indeed predictable in the non-baseline intervention, which is much higher than random guessing, which would yield 10%. Last, we compare the model accuracy on predictable and unpredictable cases, when intervention is not applied to the examples. We observe that exact-match performance is 76% when the change is predictable, but only 69% when it is not. This suggests that interventions lead to predictable changes with higher probability when the model is correct.
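The predictability check $|n - n'| = |d - d'| \cdot 10^c$ can be sketched as below. The sketch uses exact rational arithmetic to avoid floating-point error on decimal outputs; the function name, the string-based interface for the numeric outputs, and the bounded range of exponents are assumptions made for illustration.

```python
from fractions import Fraction


def is_predictable_change(n_before, n_after, d_before, d_after, max_exponent=6):
    """Check whether |n - n'| equals |d - d'| * 10^c for some integer c.

    n_before, n_after: model outputs before/after the swap, given as strings
                       (e.g. "17", "1.9") so they convert exactly.
    d_before, d_after: the two swapped digit values (ints in 0..9).
    max_exponent:      hypothetical bound on |c| searched by this sketch.
    """
    output_diff = abs(Fraction(n_after) - Fraction(n_before))
    digit_diff = abs(d_after - d_before)
    if output_diff == 0 or digit_diff == 0:
        return False  # no change in output, or identical digits swapped
    ratio = output_diff / digit_diff
    return any(
        ratio == Fraction(10) ** c for c in range(-max_exponent, max_exponent + 1)
    )


# Swapping the digits 7 and 9 should shift the output by 2, 20, 0.2, ...
unit_shift = is_predictable_change("17", "19", 7, 9)
tens_shift = is_predictable_change("70", "90", 7, 9)
decimal_shift = is_predictable_change("1.7", "1.9", 7, 9)
unrelated = is_predictable_change("17", "20", 7, 9)
```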
Overall, our findings show that the spans extracted by $o_{mse}$ affect the output of $o_{gen}$, while spans that $o_{mse}$ marks as irrelevant do not affect the output. Moreover, the (relative) predictability of the output after swapping shows that the model performs the same computation, but with a different argument.

Setting 2: Emerging Supporting Facts in Span Extraction
We now consider a combination of an extractive head and a classification head (Figure 2, middle).

Experimental Setting
Model We use the BERT-base READER model introduced by Asai et al. (2020), which has two output heads. The first is a single-span extraction head, $o_{sse}$, which predicts for each token the probabilities of being the start and end position of the answer span. The second is a classification head, $o_{type}$, for the answer type: yes, no, span, or no-answer.
Examples in HOTPOTQA are annotated with supporting facts, which are sentences from the context that provide evidence for the final answer. We use the supporting facts to evaluate the outputs of $o_{sse}$ as explanations for questions where the gold answer is yes or no.
Evaluation metrics Let $F$ be the set of annotated supporting facts per question and $P$ be the top-$k$ output spans of $o_{sse}$, ordered by decreasing probability. We define Recall@k to be the proportion of supporting facts covered by the top-$k$ predicted spans, where a fact is considered covered if a predicted span is within the supporting fact sentence and is not a single stop word (see Table 7). We use $k = 5$ and report the fraction of questions where Recall@5 is 1 (Table 6, first column), to measure the cases where $o_{sse}$ covers all relevant sentences in the first few predicted spans.
Additionally, we introduce an InverseMRR metric, based on the MRR measure, as a proxy for precision. We take the rank $r$ of the first predicted span in $P$ that is not in a supporting fact from $F$, and use $1 - \frac{1}{r}$ as the measure (e.g., if the rank of the first non-overlapping span is 3, the reciprocal is 1/3 and the InverseMRR is 2/3). If the first predicted span is not in a supporting fact, InverseMRR is 0; if all spans for $k = 5$ overlap, InverseMRR is 1.
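Recall@k and InverseMRR as defined above can be sketched as follows. The representation is an assumption of the sketch: spans are `(start, end, text)` tuples with character offsets, supporting facts are `(start, end)` sentence offsets, and `STOP_WORDS` is a small placeholder list. Whether the stop-word filter also applies to InverseMRR is not stated, so it is applied to Recall@k only.

```python
STOP_WORDS = {"the", "a", "an", "of", "in", "is"}  # placeholder list


def is_inside(span, fact):
    """True if the span's character range lies within the fact sentence."""
    return fact[0] <= span[0] and span[1] <= fact[1]


def recall_at_k(spans, facts, k=5):
    """Proportion of supporting facts covered by the top-k predicted spans."""
    usable = [s for s in spans[:k] if s[2].lower() not in STOP_WORDS]
    covered = [f for f in facts if any(is_inside(s, f) for s in usable)]
    return len(covered) / len(facts)


def inverse_mrr(spans, facts, k=5):
    """1 - 1/r, where r is the rank of the first span outside all facts."""
    for rank, span in enumerate(spans[:k], start=1):
        if not any(is_inside(span, f) for f in facts):
            return 1.0 - 1.0 / rank
    return 1.0  # every top-k span overlaps a supporting fact


# Hypothetical example: two supporting-fact sentences, five ranked spans.
facts = [(0, 50), (100, 160)]
spans = [
    (10, 20, "Huernia"),
    (110, 120, "genus"),
    (60, 70, "flies"),   # first span outside any supporting fact (rank 3)
    (5, 8, "the"),
    (130, 140, "palm"),
]
recall = recall_at_k(spans, facts)
inv_mrr = inverse_mrr(spans, facts)
```

With the first non-overlapping span at rank 3, `inv_mrr` is 2/3, matching the worked example in the text.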

Results
Results are presented in Table 6. Comparing READER and $\text{READER}_{\text{only sse}}$, the Recall@5 and InverseMRR scores are substantially higher when using multi-task training, with an increase of 10.9% and 4.5%, respectively, showing again that multi-task training is the key factor for emerging explanations. Example questions with the spans extracted by READER are provided in Table 7.
As in §4, adding an additional layer to $o_{sse}$ ($\text{READER}_{l=2}$) decreases the frequency of questions with perfect Recall@5 (0.605 → 0.568). This shows again that reducing the dependency between the heads also reduces the steering effect. It is notable that the performance on HOTPOTQA is similar across the different models, with only a slight deterioration when training only the extraction head ($o_{sse}$). This is expected as $\text{READER}_{\text{only sse}}$ is not trained with yes/no questions, which make up a small fraction of HOTPOTQA.

Setting 3: Emerging Query-based Summaries
In §4 and §5, we considered models with output heads trained on examples from the same data distribution. Would the steering effect occur when output heads are trained on different datasets? We now consider a model trained to summarize text and answer multi-hop questions (Figure 2, right).

Experimental Setting
Model We create a model called READERSUM as follows: We take the READER model from §5, and add the classification head presented by Liu The sentences are ranked by their scores and the top-3 highest-scoring sentences are taken as the summary (top-3 because choosing the first 3 sentences of a document is a standard baseline in extractive summarization (Nallapati et al., 2017; Liu and Lapata, 2019)). Implementation details are in Appendix A.3.

Data
The QA heads ($o_{sse}$, $o_{type}$) are trained on HOTPOTQA, while the summarization head is trained on the CNN/DAILYMAIL dataset for extractive summarization (Hermann et al., 2015). We use the supporting facts from HOTPOTQA to evaluate the outputs of $o_{sum}$ as explanations for predictions of the QA heads.
Evaluation metrics Annotated supporting facts and the summary are defined by sentences from the input context. Therefore, given a set $T$ of sentences extracted by $o_{sum}$ ($|T| = 3$) and the set of supporting facts $F$, we compute the Recall@3 of $T$ against $F$.
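Since both the summary and the supporting facts are sets of sentences, Recall@3 reduces to a set overlap. The sketch below selects the three highest-scoring sentences and scores them against the supporting facts; sentences are identified by their index in the context, and the function names and example scores are hypothetical.

```python
def top_k_sentences(sentence_scores, k=3):
    """Indices of the k highest-scoring sentences (the extracted summary T)."""
    ranked = sorted(
        range(len(sentence_scores)),
        key=lambda idx: sentence_scores[idx],
        reverse=True,
    )
    return set(ranked[:k])


def recall_at_3(summary_sentences, supporting_facts):
    """Fraction of supporting-fact sentences that appear in the summary."""
    return len(summary_sentences & supporting_facts) / len(supporting_facts)


# Hypothetical scores for a five-sentence context; facts are sentences 1 and 4.
scores = [0.10, 0.80, 0.30, 0.90, 0.05]
summary = top_k_sentences(scores)
recall = recall_at_3(summary, {1, 4})
```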

Results
Results are summarized in Table 8. When given HOTPOTQA examples, READERSUM extracts summaries that cover a large fraction of the supporting facts (0.79 Recall@3). This is much higher compared to a model that is trained only on the extractive summarization task ($\text{READERSUM}_{\text{only sum}}$ with 0.69 Recall@3). Results on CNN/DAILYMAIL show that this behaviour in READERSUM does not stem from an overall improvement in extractive summarization, as READERSUM performance is slightly lower compared to $\text{READERSUM}_{\text{only sum}}$. To validate this against other baselines, both READERSUM and $\text{READERSUM}_{\text{only sum}}$ achieved substantially better Recall@3 scores compared to a baseline that extracts three random sentences from the context (RANDOM with 0.53 Recall@3), and summaries generated by taking the first three sentences of the context (LEAD3 with 0.6 Recall@3).
Overall, the results show multi-head training endows $o_{sum}$ with an emergent behaviour of query-based summarization, which we evaluate next. Example summaries extracted by READERSUM for HOTPOTQA are provided in Appendix C.

Influence of questions on predicted summaries
We run READERSUM on examples from HOTPOTQA while masking out the questions, so that $o_{sum}$ observes only the context sentences. As shown in Table 8 ($\text{READERSUM}_{\text{masked}}$), masking the question leads to a substantial decrease of 13 Recall@3 points in comparison to the same model without masking (0.79 → 0.66).
Since our model appends a [CLS] token to every sentence, including the question (which never appears in the summary), we can rank the question sentence based on the score of $o_{sum}$. Computing the rank distribution of question sentences, we see (Figure 4) that the distributions of $\text{READERSUM}_{\text{only sum}}$ and READERSUM are significantly different (Wilcoxon signed-rank test with p-value 0.001), and that questions are ranked higher in READERSUM. This shows that the summarization head puts higher emphasis on the question in the multi-head setup.
Overall, these results provide evidence that multi-head training pushes $o_{sum}$ to perform query-based summarization on inputs from HOTPOTQA.

Related Work
Transformer models with multiple output heads have been widely employed in previous works (Hu and Singh, 2021; Aghajanyan et al., 2021; Segal et al., 2020; Hu et al., 2019; Clark et al., 2019). To the best of our knowledge, this is the first work that analyzes the outputs of the non-target heads. Previous work used additional output heads to generate explanations for model predictions (Perez et al., 2019; Schuff et al., 2020). Specifically, recent work has explored utilization of summarization modules for explainable QA (Nishida et al., 2019; Deng et al., 2020). In the context of summarization, Xu and Lapata (2020) have leveraged QA resources for training query-based summarization models. Hierarchies between NLP tasks have also been explored in multi-task models not based on transformers (Søgaard and Goldberg, 2016; Hashimoto et al., 2017; Swayamdipta et al., 2018). Contrary to previous work, the models in this work were not trained to perform the desired behaviour. Instead, explanations and generalized behaviour emerged from training on multiple tasks.
A related line of research has focused on developing probes, which are supervised network modules that predict properties from model representations (Conneau et al., 2018; van Aken et al., 2019; Tenney et al., 2019; Liu et al., 2019a). A key challenge with probes is determining whether the information exists in the representation or is learned during probing (Hewitt and Liang, 2019; Tamkin et al., 2020; Talmor et al., 2020). Unlike probes, steered heads are trained in parallel to target heads rather than on a fixed model. Moreover, steered heads are not designed to decode specific properties from representations, but their behaviour naturally extends beyond their training objective.
Our findings also relate to explainability methods that highlight parts from the input via the model's attention (Wiegreffe and Pinter, 2019), and extract rationales through unsupervised training (Lei et al., 2016). The emerging explanations we observe are based on the predictions of a head rather than on internal representations.

Conclusions and Discussion
We show that training multiple heads on top of a pre-trained language model creates a steering effect, where the target head influences the behaviour of another head, steering it towards capabilities beyond its training objective. In three multi-task settings, we find that without any dedicated training, the steered head often outputs explanations for the model predictions. Moreover, modifying the input representation based on the outputs of the steered head can lead to predictable changes in the target head predictions.
Our findings provide evidence for extrapolation of skills as a consequence of multi-task training, opening the door to new research directions in interpretability and generalization. Future work could explore additional head combinations, in order to teach models new skills that can be cast as an extrapolation of existing tasks. In addition, the information decoding behaviour observed in this work can serve as basis for developing general interpretability methods for debugging model predictions.
A natural question that arises is what head combinations lead to a meaningful steering effect. We argue that there are two considerations involved in answering this question. First, the relation between the tasks the heads are trained on. The tasks should complement each other (e.g. summarization and question answering), or the outputs of one task should be expressive enough to explain the outputs of the other task, when applied to inputs of the other task. For example, extractive heads are particularly useful when the model's output is a function of multiple input spans. Another consideration is the inputs to the heads. We expect that training heads with similar inputs (in terms of length, language-style, etc.) will make the underlying language model construct similar representations, thus, increasing the probability of a steering effect between the heads.

B.2 Distribution of Extracted Spans on Math-Word-Problems
To further test the emergent behaviour of MSEGENBERT (§4), we compare the number of spans extracted on an out-of-distribution sample by MSEGENBERT and by $\text{MSEGENBERT}_{\text{only mse}}$, which was trained without the decoder head ($o_{gen}$). Specifically, we run the models on MAWPS (Koncel-Kedziorski et al., 2016), a collection of small-size math word problem datasets. The results, shown in Figure 6, demonstrate the generalized behaviour of $o_{mse}$, which learns to extract multiple spans when trained jointly with the decoder.

Question: Who is Bruce Spizer an expert on, known as the most influential act of the rock era? ("The Beatles")
Context: The Beatles were an English rock band formed in Liverpool in 1960. With members John Lennon, Paul McCartney, George Harrison and Ringo Starr, they became widely regarded as the foremost and most influential act of the rock era. Rooted in skiffle, beat and 1950s rock and roll, the Beatles later experimented with several musical styles, ranging from pop ballads and Indian music to psychedelia and hard rock, often incorporating classical elements and unconventional recording techniques in innovative ways. In 1963 their enormous popularity first emerged as "Beatlemania", and as the group's music grew in sophistication in subsequent years, led by primary songwriters Lennon and McCartney, they came to be perceived as an embodiment of the ideals shared by the counterculture of the 1960s. David "Bruce" Spizer (born July 2, 1955) is a tax attorney in New Orleans, Louisiana, who is also recognized as an expert on The Beatles. He has published eight books, and is frequently quoted as an authority on the history of the band and its recordings.

C Example Emergent Query-based Summaries
Examples are provided in Tables 9, 10, 11, and 12.
Question: Which Eminem album included vocals from a singer who had an album titled "Unapologetic"? ("The Marshall Mathers LP 2")
Context: "Numb" is a song by Barbadian singer Rihanna from her seventh studio album "Unapologetic" (2012). It features guest vocals by American rapper Eminem, making it the pair's third collaboration since the two official versions of "Love the Way You Lie". Following the album's release, "Numb" charted on multiple charts worldwide including in Canada, the United Kingdom and the United States. "The Monster" is a song by American rapper Eminem, featuring guest vocals from Barbadian singer Rihanna, taken from Eminem's album "The Marshall Mathers LP 2" (2013). The song was written by Eminem, Jon Bellion, and Bebe Rexha, with production handled by Frequency. "The Monster" marks the fourth collaboration between Eminem and Rihanna, following "Love the Way You Lie", its sequel "Love the Way You Lie (Part II)" (2010), and "Numb" (2012). "The Monster" was released on October 29, 2013, as the fourth single from the album. The song's lyrics present Rihanna coming to grips with her inner demons, while Eminem ponders the negative effects of his fame.
Example (2) input from HOTPOTQA and the predicted summary by READERSUM. The summary is marked in bold.
Question: Are both Dictyosperma, and Huernia described as a genus? ("yes")
Context: The genus Huernia (family Apocynaceae, subfamily Asclepiadoideae) consists of stem succulents from Eastern and Southern Africa, first described as a genus in 1810. The flowers are five-lobed, usually somewhat more funnel- or bell-shaped than in the closely related genus "Stapelia", and often striped vividly in contrasting colours or tones, some glossy, others matt and wrinkled depending on the species concerned. To pollinate, the flowers attract flies by emitting a scent similar to that of carrion. The genus is considered close to the genera "Stapelia" and "Hoodia". The name is in honour of Justin Heurnius (1587-1652) a Dutch missionary who is reputed to have been the first collector of South African Cape plants. His name was actually mis-spelt by the collector. Dictyosperma is a monotypic genus of flowering plant in the palm family found in the Mascarene Islands in the Indian Ocean (Mauritius, Reunion and Rodrigues). The sole species, Dictyosperma album, is widely cultivated in the tropics but has been farmed to near extinction in its native habitat. It is commonly called princess palm or hurricane palm, the latter owing to its ability to withstand strong winds by easily shedding leaves. It is closely related to, and resembles, palms in the "Archontophoenix" genus. The genus is named from two Greek words meaning "net" and "seed" and the epithet is Latin for "white", the common color of the crownshaft at the top of the trunk.