Language Modelling as a Multi-Task Problem

In this paper, we propose to study language modelling as a multi-task problem, bringing together three strands of research: multi-task learning, linguistics, and interpretability. Based on hypotheses derived from linguistic theory, we investigate whether language models adhere to learning principles of multi-task learning during training. To showcase the idea, we analyse the generalisation behaviour of language models as they learn the linguistic concept of Negative Polarity Items (NPIs). Our experiments demonstrate that a multi-task setting naturally emerges within the objective of the more general task of language modelling. We argue that this insight is valuable for multi-task learning, linguistics and interpretability research and can lead to exciting new findings in all three domains.


Introduction
Humans optimise their behaviour towards a multitude of objectives to reach their goals in day-to-day life. By learning many things at the same time and exploiting their commonalities, they acquire more general knowledge about the world, which in turn helps them to learn new things more quickly (Perkins et al., 1992; Schwartz et al., 2005; Cormier and Hagman, 2014; Luriia, 1976). This idea of finding more general solutions through the diversification of tasks has also found its way into the machine learning community, in the field of multi-task learning (MTL) (Caruana, 1993, 1997). In MTL, multiple tasks are optimised jointly, enabling the transfer of relevant information across tasks. MTL research has yielded fruitful results in both application (e.g. Collobert and Weston, 2008; Collobert et al., 2011; Donahue et al., 2014; Kaiser et al., 2017) and theory (e.g. Baxter, 2000; Maurer, 2006; Ando and Zhang, 2005; Argyriou et al., 2008).
However, deciding on a setup requires making many arbitrary choices. The researcher or engineer has to decide which tasks to train together (e.g. Bingel and Søgaard, 2017; Standley et al., 2020); at which hierarchy level to allow tasks to interact (e.g. Søgaard and Goldberg, 2016); which degree of parameter sharing to employ (Ruder, 2017); which distribution of training data to employ (e.g. Luong et al., 2016), and so on. Having to make so many arbitrary choices is inconvenient for modellers, but also stands in the way of understanding the learning principles of neural models in multi-task settings. The highly constructed learning scenarios make it difficult to see whether outcomes should be attributed to one of the many a priori decisions or to inherent properties of the learning process.
In this paper, we propose to study MTL not in a constructed, artificial scenario, but in a more natural setting. To do so, we consider the objective of language modelling and exploit the fact that it can be seen as a conglomerate of many different tasks. To give an example: rules of word ordering have to be learned at the same time as rules of feature agreement and the monotonicity properties of different linguistic environments. These different tasks all need to be learned to achieve the greater goal of producing acceptable sentences, and they have to be optimised in parallel when the language model (LM) is trained. Language modelling is in that sense a natural multi-task learning problem with a naturally given task hierarchy provided by linguistic theory (see also Figure 1).
Studying language modelling as a multi-task problem has several distinct advantages. From an MTL perspective, it gives us a complete hierarchy of relevant tasks that can freely interact throughout the learning process, unconstrained by prior assumptions. We can make theoretically informed decisions about these tasks, drawing on linguistic theory. We can also deduce from linguistics how these tasks relate to each other (or, in other words, how similar they are), which in MTL is considered to be one of the crucial factors for the learning outcomes (e.g. Thrun and O'Sullivan, 1996; Passos et al., 2012). MTL has not yet been studied from this dynamic and unconstrained perspective. Then, somewhat more delicately, the extent to which models can exploit similarities hypothesised by linguistic theory can play a role in confirming or refuting specific linguistic hypotheses. Lastly, when it comes to interpretability research, applying concepts from MTL can be valuable for better understanding the learning dynamics of models. By understanding how models find solutions, we can infer what these solutions are.
Outline. In the remainder of this paper, we will first provide some basic background about MTL (§ 2.1), the subset of linguistic tasks we focus on (Negative Polarity Items, where we consider their different licensing contexts as tasks, § 2.2) and discuss some related work in interpretability (§ 2.3). Then, in § 3 and § 4, respectively, we present our approach and empirical results that showcase our idea. In § 5, we discuss our results and framework in the light of the three fields mentioned before. We conclude in § 6.

Background
In this paper, we aim to bring together three strands of research: MTL, linguistics and interpretability research. As a proof of concept, we focus on one specific complex subset of linguistic tasks: licensing of Negative Polarity Items (NPIs). Below, we give a short overview of the most important characteristics of the three fields of interest.

Multi-task learning
In MTL, multiple tasks are learned together to enable information transfer from one task to another. If the transfer is successful, the benefits might be threefold: the model learns tasks with less training data (i.e. more efficiently; Collobert et al., 2011; Benton et al., 2017; Kaiser et al., 2017), up to a higher final accuracy (Collobert and Weston, 2008; Kaiser et al., 2017) and in a way that better generalises to new tasks (Baxter, 2000; Collobert and Weston, 2008). Caruana (1993, 1997) and Ruder (2017) propose several different, but related, processes that might enable positive transfer: related tasks can provide additional training examples for each other on the features they share (statistical data amplification), certain features might be easier to learn through one task than through another, but be useful for both of them (eavesdropping), and idiosyncratic features of single tasks can be averaged out, while more general features are reinforced (attention focusing).
However, positive transfer is not guaranteed: it is also possible that performance deteriorates due to interference between different tasks, resulting in negative transfer (Rosenstein et al., 2005; Pan and Yang, 2010). Whether transfer is positive depends on the task similarity and on whether the model is able to exploit this similarity (Rosenstein et al., 2005; Thrun and O'Sullivan, 1996; Passos et al., 2012).
The main goal of MTL so far has been to avoid negative and promote positive transfer by determining task similarity and regulating the interactions between tasks based on these similarities. Due to its pivotal role, much research effort has been spent on determining similarities of tasks and on regulating information transfer between them (for an overview, see Zhang and Yang, 2017; Ruder, 2017). The disadvantage of these approaches is that assuming fixed tasks and regulating transfer between them based on fixed task similarities puts large constraints on possible transfers between tasks, because it neglects the fact that learning processes are dynamic. From the perspective of the model, tasks, as well as their similarities, can change throughout the learning process. Here, we only use predefined tasks and their similarities to analyse the learning behaviour of the model, without constraining the learning process in any way.

Negative Polarity Items
We exemplify our idea by analysing the learning behaviour on a complex subset of linguistic tasks: the licensing of Negative Polarity Items (NPIs). The properties of NPI licensing make it an interesting and adequate subset of tasks to study: it has a high degree of complexity, occurs with an appropriate frequency in natural language and has frequently been investigated in neural models.
NPIs are characterised by the property that they can only occur within the scope of certain licensing contexts. For instance, in the example below, the NPI 'any' can occur in sentence (1)a., where it is in the scope of a negation, but not in sentence (1)b., where there is no licensor present.
(1) a. Bill didn't buy any books that day.
    b. * Bill did buy any books that day.
(2) a. Nobody has ever been there.
    b. * Somebody has ever been there.
Grasping the phenomenon of NPI licensing requires understanding of three different aspects:
1. The class of NPIs: there is a group of expressions that are restricted in their occurrence.
2. Licensing contexts: there exists a group of expressions that allow NPIs to occur.
3. Scope and structure: the licensing contexts have to stand in a certain structural relationship to the NPIs.
We focus on how LMs learn the second aspect by analysing how different types of licensing contexts interact and generalise throughout training. During learning, the LMs should be able to exploit the similarity of these contexts with respect to the other two aspects.

Interpretability
Interpretability research on LMs has shown that in pre-trained models, such as BERT (Devlin et al., 2019), hierarchical structure emerges throughout the layers and that this structure demonstrates parallels with linguistic theory (Peters et al., 2018;Tenney et al., 2019). However, the emergence of this structure has not been explicitly connected to MTL yet.
In recent years, research has shown that LMs are able to understand NPI licensing. Jumelet and Hupkes (2018) evaluate the performance of LMs on data sets containing NPI constructions extracted from large corpora, and Marvin and Linzen (2018), Wilcox et al. (2019) and Warstadt and Bowman (2020) test them on artificial data sets containing template-based NPI constructions. In our own experimental setup we will utilise the extensive template-based NPI corpus of Warstadt et al. (2019).
What these approaches have in common is their focus on the performance of pretrained LMs. Our MTL approach sheds light on an unexplored aspect of NPI understanding: the learning dynamics of the model during training.

Approach
We consider two different types of experiments. First, to understand to what extent models can understand and use the similarity between different licensing contexts (our tasks) during learning, we exploit the effect that the frequency of the different contexts has on learning. Second, we manipulate the LMs' training corpus to constrain their ability to leverage information from other licensing contexts during learning. In accordance with the MTL literature, we expect the LMs to learn tasks more data-efficiently and to a higher final accuracy if they can leverage information across contexts. Before we describe our experiments in more detail, we present our model architecture and training, the evaluation procedure of the licensing contexts, and the filter procedure we use to manipulate the training corpus.

Model
Following previous work in this area, we consider recurrent language models. We focus on unidirectional LSTM models and mirror the hyperparameter setup of Gulordava et al. (2018). We train the models on the corpus provided by the same authors (a subset of the English Wikipedia) or on modified versions of it for our second experiment (see § 4.2). To track the learning process, we save models every 100 batches of training (371 model checkpoints per epoch). For all experiments, we average performance across five random seeds.

Evaluation
To estimate the LMs' understanding of NPIs and their dependence on the different licensing contexts, we adapt the Cloze task of Warstadt et al. (2019), based on the implementation of Jumelet (2020). This task considers nine different types of licensing contexts (a list of the contexts, including examples, can be found in Table 1). For every such context, Warstadt et al. (2019) generated a large number of minimal pair sentences, containing correctly and incorrectly licensed NPIs. For instance, for the adverbs licensing context:
(3) a. A lady rarely ever thought that the children saw the boy.
    b. * A lady sometimes ever thought that the children saw the boy.
Following previous work, we quantify an LM's understanding of a particular type of licensing context by computing the percentage of minimal pairs in that context for which the model correctly assigns a higher probability to the NPI in the licensing context than in the non-licensing context. That is, in the example above, we would compare the probability the model assigns to the word ever in the contexts "A lady rarely" and "A lady sometimes" (see also Figure 2).
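The accuracy measure described above can be sketched in a few lines of Python. This is a minimal illustration, not the evaluation code used in the paper: the names `npi_accuracy` and `toy_prob` are hypothetical, and `toy_prob` is a hard-coded stand-in for the probability that a trained LM would assign to the NPI given a prefix.

```python
from typing import Callable, Sequence, Tuple

# One minimal pair: (licensing prefix, non-licensing prefix, NPI).
MinimalPair = Tuple[str, str, str]


def npi_accuracy(pairs: Sequence[MinimalPair],
                 npi_prob: Callable[[str, str], float]) -> float:
    """Percentage of pairs where the NPI is more probable after the
    licensing prefix than after the non-licensing prefix."""
    correct = sum(
        npi_prob(licensing, npi) > npi_prob(non_licensing, npi)
        for licensing, non_licensing, npi in pairs
    )
    return 100.0 * correct / len(pairs)


def toy_prob(prefix: str, npi: str) -> float:
    """Hypothetical stand-in for an LM: returns a higher probability
    when the prefix contains a word from a tiny hand-picked list."""
    toy_licensors = {"rarely", "nobody"}
    words = {w.lower() for w in prefix.split()}
    return 0.2 if words & toy_licensors else 0.01


pairs = [
    ("A lady rarely", "A lady sometimes", "ever"),
    ("Nobody has", "Somebody has", "ever"),
]
print(npi_accuracy(pairs, toy_prob))
```

In an actual evaluation, `npi_prob` would query the LM checkpoint under study, and the score would be computed separately for each of the nine licensing contexts.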

Identification of NPIs in training corpus
The Warstadt et al. (2019) corpus provides us with a task to evaluate nine different context types that license NPIs. To manipulate the training corpus for our experiments we also need to identify sentences in the training corpus of the model in which these contexts actually license NPIs. To do so, we need to locate these contexts, as well as establish that they in fact license an NPI in a particular sentence. We consider the nine Warstadt et al. context types, and the corresponding list of 30 expressions that are part of these contexts (e.g. the list of adverbs licensing NPIs). As for the NPIs, we consider an extensive list of 160 distinct NPIs, based on the collection provided by Hoeksema (2012). We then identify sentences in which an element of our NPI list is preceded by an element from our context list, ensuring that there is a dependency relation between them using the dependency parser of spaCy (Honnibal and Johnson, 2015). When there are multiple potential licensors in a sentence, we use the hierarchical distance between the licensor and the NPI in the parse tree as a heuristic to find the correct licensor. By testing this procedure on a manually labelled set of 200 randomly selected sentences with multiple licensors, we estimate that it identifies the correct licensor among multiple candidates in around 97% of cases. In Table 1, we report examples and frequencies of the different licensing contexts in the training corpus based on this filtering scheme.
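The distance heuristic for choosing among several candidate licensors can be illustrated as follows. This is a sketch under stated assumptions rather than the filtering code used for the paper: the dependency tree is written by hand as a list of head indices (with -1 marking the root) instead of being produced by spaCy, the word lists are tiny placeholders for the actual lists of NPIs and licensing expressions, and the exact dependency criterion applied in the paper is not reproduced. The function names are hypothetical.

```python
from typing import Dict, List, Set


def ancestors(heads: List[int], i: int) -> List[int]:
    """Path from token i up to the root (inclusive)."""
    path = [i]
    while heads[i] != -1:
        i = heads[i]
        path.append(i)
    return path


def tree_distance(heads: List[int], i: int, j: int) -> int:
    """Number of edges between tokens i and j in the parse tree."""
    depth_from_j = {node: d for d, node in enumerate(ancestors(heads, j))}
    for d_i, node in enumerate(ancestors(heads, i)):
        if node in depth_from_j:  # lowest common ancestor
            return d_i + depth_from_j[node]
    raise ValueError("tokens are not part of the same tree")


def find_licensors(tokens: List[str], heads: List[int],
                   npis: Set[str], licensors: Set[str]) -> Dict[int, int]:
    """Map each NPI position to the preceding candidate licensor that
    is closest to it in the tree."""
    result = {}
    for j, token in enumerate(tokens):
        if token.lower() not in npis:
            continue
        candidates = [i for i in range(j) if tokens[i].lower() in licensors]
        if candidates:
            result[j] = min(candidates,
                            key=lambda i: tree_distance(heads, i, j))
    return result


# Hand-written toy parse (an assumption, not parser output).
tokens = ["Nobody", "said", "that", "Bill", "did", "not", "buy", "any", "books"]
heads = [1, -1, 6, 6, 6, 6, 1, 8, 6]
NPIS = {"any", "ever"}           # placeholder for the full NPI list
LICENSORS = {"nobody", "not"}    # placeholder for the context expressions

print(find_licensors(tokens, heads, NPIS, LICENSORS))
```

In the toy sentence, both "Nobody" and "not" precede the NPI "any"; the heuristic selects "not" because it is fewer edges away in the tree.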

Experiments and results
As a first step, we assess whether the LMs can adequately represent all nine categories of the evaluation task. To do so, we train five models on the regular training corpus, and compute their final accuracy on our nine tasks. All models show adequate performance on most contexts (see Table 2), with the exception of the simple question context. Additionally, we observe that the models achieve their accuracy surprisingly fast: already after two epochs, there are no more substantial changes in empirical error (see Figure 3). In the rest of our experiments, we therefore focus only on these first two epochs.

Frequency vs data efficiency
While some licensing contexts are rather common (e.g. negation), others rarely appear as a licensor (e.g. adverbs). Therefore, throughout the learning process, the LMs encounter many instances of the more frequent contexts before they see an example of an infrequent context. If LMs are able to leverage information across contexts, then by the time a less frequent context is first encountered there should already be more established NPI understanding to bootstrap from. Consequently, the LMs should require fewer training examples to learn less frequent contexts than they need to learn more frequent contexts. In other words, the LM should be more data-efficient for these infrequent contexts. In our first experiment, we use this hypothesised relationship between frequency and data efficiency to assess whether LMs can exploit the similarities between different licensing contexts. To be able to compare across different contexts, we quantify the data efficiency of an LM for a particular context as the number of examples the LM needs to observe until it reaches 95% of its final accuracy for that context (see footnote 5). To make this measure more robust, we first apply a Savitzky-Golay noise filter to the learning curve (degree of polynomial = 1, window size = 25; Savitzky and Golay, 1964). We compute the data efficiency of the trained LMs for all nine contexts and compute the correlation between a context's frequency and the model's data efficiency with respect to that context. In Figure 4, we plot the average data efficiency of each context against the frequency of that context, as well as the linear fit that relates these two variables. The experiment demonstrates a strong relationship between the frequency of a context and the number of examples needed to learn it: r = .89, p < .05. Hence, the less frequent a licensing context is, the fewer examples the model needs to learn it, from which we conclude that the model is indeed able to leverage previously acquired knowledge.
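The data-efficiency measure and its correlation with frequency can be sketched as below. Several details are assumptions made for illustration: the smoothing uses a centred moving average, which coincides with a degree-1 Savitzky-Golay fit at the centre of a symmetric window but differs near the curve boundaries, where the window is simply truncated here; the final accuracy is taken to be the last smoothed value; and the learning curve is a toy example, not data from the paper. All function names are hypothetical.

```python
from math import sqrt
from typing import List, Sequence


def smooth(curve: Sequence[float], window: int = 25) -> List[float]:
    """Centred moving average; the window is truncated at the edges."""
    half = window // 2
    out = []
    for t in range(len(curve)):
        lo, hi = max(0, t - half), min(len(curve), t + half + 1)
        out.append(sum(curve[lo:hi]) / (hi - lo))
    return out


def data_efficiency(accuracies: Sequence[float],
                    examples_seen: Sequence[int],
                    fraction: float = 0.95,
                    window: int = 25) -> int:
    """Number of examples observed at the first checkpoint where the
    smoothed accuracy reaches `fraction` of its final value."""
    smoothed = smooth(accuracies, window)
    target = fraction * smoothed[-1]
    for t, acc in enumerate(smoothed):
        if acc >= target:
            return examples_seen[t]
    raise ValueError("threshold never reached")


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Toy learning curve over 100 checkpoints (illustrative values only).
toy_curve = [0.5] * 30 + [0.9] * 30 + [1.0] * 40
toy_examples = [100 * t for t in range(100)]
print(data_efficiency(toy_curve, toy_examples))
```

Given one such value per licensing context, `pearson_r` applied to the list of context frequencies and the list of data-efficiency values yields the kind of correlation reported above.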

Transfer from general knowledge
While the presented relationship between frequency and data efficiency demonstrates that LMs can leverage previously learned information to learn less frequent licensing contexts, it does not unequivocally show that they leverage information from other NPI contexts. After all, when a less frequent context is encountered, the LM has not only had the opportunity to acquire prior knowledge about NPIs, it has also simply seen more language in general. In other words, the LM may meanwhile also have acquired more general language knowledge, which may help it to learn a less frequent licensing context more quickly. In our second experiment, we separate transfer from general language knowledge from transfer from previously observed NPIs by training LMs on single-context corpora.
Footnote 5: The more data-efficient the model, the lower this number.
Single-context corpora. Single-context corpora contain NPIs licensed only by a single context. LMs trained on these corpora thus cannot transfer knowledge acquired from other licensing contexts, as these are not present in the training data. By comparing the data efficiency of contexts between LMs trained on all-context and single-context corpora, we can thus infer how much of the increase in data efficiency for less frequent contexts is due to leveraging information from other contexts.
To create our nine single-context corpora, we use the procedure described in § 3.3 to identify all sentences containing NPIs licensed by our nine contexts. For every context, we then create a corpus in which all sentences containing NPIs licensed by any of the other contexts are replaced by a neutral sentence of the same length, sampled from the rest of the corpus. Apart from these replacements, the ordering and composition of the corpus remain intact.
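The replacement procedure can be sketched as follows. This is an illustrative reading of the description, not the original preprocessing code: sentences are represented as token lists with one hypothetical context label per sentence (`None` for sentences without a licensed NPI), neutral replacements are sampled with replacement, and a missing same-length neutral sentence raises an error; all of these are assumptions.

```python
import random
from typing import Dict, List, Optional, Sequence


def build_single_context_corpus(sentences: Sequence[List[str]],
                                contexts: Sequence[Optional[str]],
                                keep: str,
                                seed: int = 0) -> List[List[str]]:
    """Keep neutral sentences and sentences licensed by `keep`; replace
    every other NPI sentence by a neutral sentence of the same length."""
    rng = random.Random(seed)

    # Index neutral sentences by their length in tokens.
    neutral_by_length: Dict[int, List[List[str]]] = {}
    for sentence, context in zip(sentences, contexts):
        if context is None:
            neutral_by_length.setdefault(len(sentence), []).append(sentence)

    corpus = []
    for sentence, context in zip(sentences, contexts):
        if context is None or context == keep:
            corpus.append(sentence)  # position in the corpus is preserved
        else:
            pool = neutral_by_length.get(len(sentence))
            if not pool:
                raise ValueError("no neutral sentence of matching length")
            corpus.append(rng.choice(pool))
    return corpus


# Toy corpus with hypothetical context labels.
sentences = [
    "Bill bought many books".split(),
    "Bill did not buy any books".split(),
    "Nobody has ever been there".split(),
    "Somebody has often been there".split(),
    "Bill did in fact buy books".split(),
]
contexts = [None, "negation", "determiner", None, None]

adverb_free = build_single_context_corpus(sentences, contexts, keep="negation")
print(adverb_free)
```

Calling the function once per licensing context would yield the nine corpora, each with the same number of sentences and the same sentence lengths as the original corpus.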
When we compare the learning of single-context with all-context models, we cannot rely on the data-efficiency metric used in Experiment 4.1. The data-efficiency measure is bound to how quickly the model reaches its final accuracy and accordingly benefits when its final accuracy decreases. As we expect the final accuracy to be lower in the single-context models, comparing only data efficiencies between models is likely to be uninformative. In this experiment, as explained below, we instead consider the area between the curves (AbC).
Area between Curves (AbC). The AbC incorporates both data efficiency and accuracy: for every context, we calculate the area between the all-contexts and single-context learning curves until the point in time where they both have reached 95% of their final accuracy. The larger this area is, the more impactful it is to remove all other NPI contexts, and the more the model leveraged from these contexts. The learning curves of all contexts, along with an illustration of the AbC measure, can be found in Figure 5.
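A possible way to compute the AbC is sketched below. The description leaves some details open, so the following choices are assumptions: both curves are sampled at the same checkpoints with unit spacing, each curve's final accuracy is its last value, the area is signed (positive when the all-contexts curve lies above the single-context curve) and it is approximated with the trapezoidal rule. The function names and the toy curves are hypothetical.

```python
from typing import Sequence


def first_reach(curve: Sequence[float], fraction: float = 0.95) -> int:
    """Index of the first checkpoint at which the curve reaches
    `fraction` of its own final accuracy."""
    target = fraction * curve[-1]
    return next(t for t, acc in enumerate(curve) if acc >= target)


def area_between_curves(all_contexts: Sequence[float],
                        single_context: Sequence[float],
                        fraction: float = 0.95) -> float:
    """Signed trapezoidal area between the two learning curves, from
    the start of training until both have reached `fraction` of their
    respective final accuracies."""
    end = max(first_reach(all_contexts, fraction),
              first_reach(single_context, fraction))
    diff = [a - s for a, s in zip(all_contexts, single_context)]
    return sum((diff[t] + diff[t + 1]) / 2 for t in range(end))


# Toy learning curves (illustrative values only).
all_curve = [0.5, 0.8, 1.0, 1.0, 1.0]
single_curve = [0.5, 0.6, 0.7, 0.8, 0.8]
print(area_between_curves(all_curve, single_curve))
```

In the toy example, the all-contexts curve reaches its threshold at the third checkpoint and the single-context curve at the fourth, so the area is accumulated up to the fourth checkpoint.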
As a first interesting observation, we see that for seven of the nine contexts, the all-contexts model learns faster and achieves higher final performance (see footnote 7). Both frequent and infrequent contexts thus benefit from information acquired from other licensing contexts, in terms of both data efficiency and final accuracy.
This positive transfer can also be seen in Figure 6, where we plot the AbC for all licensing contexts against their frequency. This plot also confirms the relationship found in our previous experiments: the less frequent a context is, the more it benefits from other NPIs (r = .76, p < .05).
Footnote 7: A one-sided Welch's test confirms that the calculated AbCs are overall different from zero: t = 2.61, p < .05.

Discussion
In this paper, we studied language modelling as a multi-task problem. We show that neural language models can find and exploit similarity between the different language construction rules that we deduced from linguistic theory and that their transfer behaviour mirrors the generalisation behaviour in traditionally constructed MTL settings. In this section, we now reflect on how our setup and results contribute to the three different areas that we mentioned in the introduction: MTL, linguistics and interpretability research.

Multi-task learning research
Studying LMs as multi-task learners, we observe several phenomena known from traditional MTL: when trained in parallel, similar (sub)tasks are learned more efficiently (compare Collobert et al., 2011; Kaiser et al., 2017), and with higher accuracy (Collobert and Weston, 2008; Kaiser et al., 2017), and this effect is stronger for less frequent tasks (Benton et al., 2017; Kaiser et al., 2017).
Our study differs in one crucial aspect from previous research on MTL: it looks at learning dynamics within one, larger, natural task instead of between tasks defined by the modeller. As a consequence, the learning process itself is not constrained through a priori decisions concerning task selection, or how tasks should be optimised together. In our scenario, contrary to traditional MTL, we use tasks and their hypothesised similarity only to analyse the learning process of the language model, not to inform its training. As such, our natural setting allows us to study traditional MTL phenomena, such as data amplification, eavesdropping, and attention focusing, independently of arbitrary decisions regarding task selection and optimisation. This knowledge can then be transferred to scenarios in which more control over the selection of tasks may be required.

Interpretability research
A second field to which we believe studying language models as multi-task learners can contribute is the field of interpretability. On a more basic level, our paper confirms previous findings in interpretability that LMs are able to adequately model NPIs (Jumelet and Hupkes, 2018; Wilcox et al., 2019; Marvin and Linzen, 2018). We add to this literature by explicitly showing, through their learning behaviour, that LMs connect different types of contexts with one another. Contrary to previous work, we tap the learning process itself as a source of information to better understand the inner workings of these models.
Traditional concepts from MTL, such as the earlier mentioned explanations of Caruana (1993) and Ruder (2017) (§ 2.1), are valuable for better understanding what models are learning and how. For instance, when we observe that the solution of models improves when more varied NPI material is presented (our single- versus all-context experiment), MTL can help to formulate concrete hypotheses about why this is the case. This, in turn, can help us improve our understanding of the solutions that are learned by the model. For instance, we find that the single-context models usually level off at a lower accuracy than the all-context model (see Figure 5). This cannot be explained merely by the amount of data, as we continue to add training examples in either case. The difference between models instead appears to be due to the variety of the training data. The idea of attention focusing (Caruana, 1993, 1997; Ruder, 2017) helps us to understand what is going on: by being trained on more varied NPI material, the model can better sort out which features are relevant and which ones are instead idiosyncrasies correlated with specific contexts. Such hypotheses can then help inform further experiments that investigate, for example, which features specifically are better learned through attention focusing.

Linguistics research
Finally, we believe that studying language models as multi-task learners can also contribute to the field of linguistics. In our study, we show that LMs can find and exploit similarities between linguistically defined concepts. Turning things around, this generalisation behaviour of models can also be seen as a confirmation of the linguistic task hierarchy that we assumed from the start. The language modelling objective is unconstrained by linguistic theory and therefore does not necessarily have to find the same solutions as linguistics. Similarity derived from the learning behaviour of language models might therefore be used as a tool to work on more disputed ideas in linguistics and to form new hypotheses in linguistic theory. While the linguistic insights that can be drawn from the current study are relatively limited, they do provide a proof of concept for future work: we show that domain knowledge and learning behaviour of neural models can be connected.

Conclusion
In the current study we explored the possibility of using multi-task learning as a framework to study learning behaviour within a task. To this end we considered LMs as multi-task learners and investigated how they learn the task cluster of NPI licensing. We find that LMs pick up on similarities that we assume from linguistic theory and exploit them to learn similar language constructions with less data and to a higher accuracy. Less frequent tasks in particular benefit from this effect.
These results resemble positive transfer in 'traditional' MTL. We outlined the possible benefits that our study may have for MTL research, interpretability and linguistics. From here there are many directions for future work: targeting less comprehensively researched areas in linguistics to add empirical data to otherwise usually theoretical linguistic discussions, investigating the change of internal representations in place of the behavioural measure used here to describe the learning process more precisely, or applying the approach to other high-level tasks in other modalities governed by other knowledge domains are just a few of these possibilities.