Massive Choice, Ample Tasks (MaChAmp): A Toolkit for Multi-task Learning in NLP

Transfer learning, particularly approaches that combine multi-task learning with pre-trained contextualized embeddings and fine-tuning, has advanced the field of Natural Language Processing tremendously in recent years. In this paper we present MaChAmp, a toolkit for easy fine-tuning of contextualized embeddings in multi-task settings. The benefits of MaChAmp are its flexible configuration options and its support for a variety of natural language processing tasks in a uniform toolkit, from text classification and sequence labeling to dependency parsing, masked language modeling, and text generation.


Introduction
Multi-task learning (MTL) (Caruana, 1993, 1997) has become part of the standard repertoire in natural language processing (NLP). It enables neural networks to learn tasks in parallel (Caruana, 1993) while leveraging the benefits of sharing parameters. The shift, or "tsunami" (Manning, 2015), of deep learning in NLP has facilitated the widespread use of MTL since the seminal work by Collobert et al. (2011), which has led to a multi-task learning "wave" (Ruder and Plank, 2018) in NLP. It has since been applied to a wide range of NLP tasks, developing into a viable alternative to classical pipeline approaches. This ranges from early adoption in recurrent neural network models (Lazaridou et al., 2015; Chrupała et al., 2015; Plank et al., 2016; Søgaard and Goldberg, 2016; Hashimoto et al., 2017) to the use of large pre-trained language models with multi-task objectives (Radford et al., 2019; Devlin et al., 2019). MTL comes in many flavors, based on the type of sharing, the weighting of losses, and the design and relations of tasks and layers. In general, though, outperforming single-task settings remains a challenge (Martínez Alonso and Plank, 2017; Clark et al., 2019). For an overview of MTL in NLP we refer to Ruder (2017).
As a separate line of research, the idea of language model pre-training and contextual embeddings (Howard and Ruder, 2018; Peters et al., 2018; Devlin et al., 2019) is to pre-train rich representations on large quantities of monolingual or multilingual text data. Taking these representations as a starting point has led to enormous improvements across a wide variety of NLP problems. Related to MTL, recent research efforts focus on fine-tuning contextualized embeddings on a variety of tasks with supervised objectives (Kondratyuk and Straka, 2019; Sanh et al., 2019; Hu et al., 2020).
We introduce MACHAMP, a flexible toolkit for multi-task learning and fine-tuning of NLP problems. The code is available at https://github.com/machamp-nlp/machamp (v0.2), and an instructional video at https://www.youtube.com/watch?v=DauTEdMhUDI. The main advantages of MACHAMP are:
• Ease of configuration, especially for dealing with multiple datasets and multi-task setups;
• Support of a wide range of NLP tasks, including a variety of sequence labeling approaches, text classification, dependency parsing, masked language modeling, and text generation (e.g., machine translation);
• Support of the initialization and fine-tuning of any contextualized embeddings from Hugging Face (Wolf et al., 2020).
As a result, the flexibility of MACHAMP supports up-to-date, general-purpose NLP (see Section 2.2). The backbone of MACHAMP is AllenNLP (Gardner et al., 2018), a PyTorch-based (Paszke et al., 2019) Python library containing modules for a variety of deep learning methods and NLP tasks. It is designed to be modular, high-level and flexible. It should be noted that, contemporary to MACHAMP, jiant (Pruksachatkun et al., 2020) was developed, and AllenNLP itself has included multi-task learning since release 2.0. MACHAMP distinguishes itself from these toolkits by its simple configuration and its support for a variety of multi-task settings.

Model
In this section we will discuss the model, its supported tasks, and possible configuration settings.

Model overview
An overview of the model is shown in Figure 1. MACHAMP takes a pre-trained contextualized model as initial encoder, and fine-tunes its layers by applying an inverse square root learning rate decay with linear warm-up (Howard and Ruder, 2018), according to a given set of downstream tasks. For the task-specific predictions, each task has its own decoder, which is trained for the corresponding task. The model defaults to the embedding-specific tokenizer in Hugging Face (Wolf et al., 2020). When multiple datasets are used for training, they are first separately split into batches so that each batch only contains instances from one dataset. The batches are then concatenated and shuffled before training. This means that small datasets will be underrepresented, which can be overcome by smoothing the dataset sampling (Section 3.2.2). During decoding, the loss function is only activated for tasks which are present in the current batch. By default, all tasks have an equal weight in the loss function. The loss weight can be tuned (Section 3.2.1).
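As a rough illustration of this batching strategy (a minimal sketch, not MACHAMP's actual implementation; the batch size and dataset contents below are hypothetical), each dataset is split into homogeneous batches and the resulting batch list is shuffled:

import random

def make_batches(datasets, batch_size):
    """datasets: dict mapping dataset name -> list of instances."""
    batches = []
    for name, instances in datasets.items():
        # Each batch contains instances from a single dataset only.
        for i in range(0, len(instances), batch_size):
            batches.append((name, instances[i:i + batch_size]))
    random.shuffle(batches)  # interleave the datasets across the epoch
    return batches

# Hypothetical toy data: one large and one small dataset.
batches = make_batches({"UD": list(range(100)), "RTE": list(range(10))}, batch_size=8)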

Supported task types
We here describe the tasks MACHAMP supports.

SEQ For traditional token-level sequence prediction tasks, like part-of-speech tagging. MACHAMP uses greedy decoding with a softmax output layer on the output of the contextual embeddings.

STRING2STRING An extension to SEQ, which learns a conversion for each input token to its label. Instead of predicting the labels directly, the model can now learn to predict the conversion. This strategy is commonly used for lemmatization (Chrupała, 2006; Kondratyuk and Straka, 2019), where it greatly reduces the label vocabulary. We use the transformation algorithm from UDPipe-Future (Straka, 2018), which was also used by Kondratyuk and Straka (2019).
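To give an intuition for the STRING2STRING idea, the following simplified sketch (not the exact UDPipe-Future algorithm; the word/lemma pairs are only examples) encodes each lemma as a rule of the form "strip N trailing characters, then append a suffix", which keeps the label vocabulary small:

def to_rule(form, lemma):
    # Longest common prefix between the word form and its lemma.
    lcp = 0
    while lcp < min(len(form), len(lemma)) and form[lcp] == lemma[lcp]:
        lcp += 1
    return (len(form) - lcp, lemma[lcp:])  # (chars to strip, suffix to append)

def apply_rule(form, rule):
    strip, suffix = rule
    return form[:len(form) - strip] + suffix

rule = to_rule("dogs", "dog")        # (1, ''): strip one character
assert apply_rule("cats", rule) == "cat"   # the same rule generalizes
rule = to_rule("better", "good")     # (6, 'good'): a full rewrite
assert apply_rule("better", rule) == "good"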
SEQ_BIO A variant of SEQ which exploits conditional random fields (Lafferty et al., 2001) as decoder, masked to enforce outputs that follow the BIO tagging scheme.
MULTISEQ An extension to SEQ which supports the prediction of multiple labels per token. Specifically, for some sequence labeling tasks it is unknown beforehand how many labels each token should get. We compute a probability score for each label, employing binary cross-entropy as loss, and output all labels whose score exceeds a certain threshold (see the sketch below). The threshold can be set in the dataset configuration file.

DEPENDENCY For dependency parsing, MACHAMP uses the deep biaffine parser (Dozat and Manning, 2017) as implemented by AllenNLP (Gardner et al., 2018), with the Chu-Liu/Edmonds algorithm (Chu, 1965; Edmonds, 1967) for decoding the tree.
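Returning to the MULTISEQ thresholding step, a minimal sketch (the label names, scores, and threshold value are hypothetical; the real decoder operates on the model's logits):

import math

def predict_multilabel(logits, threshold=0.5):
    """Return every label whose sigmoid probability exceeds the threshold."""
    return [label for label, logit in logits.items()
            if 1.0 / (1.0 + math.exp(-logit)) > threshold]

# One token with per-label scores; two labels pass the threshold.
print(predict_multilabel({"B-PER": 2.3, "B-LOC": -1.7, "B-MISC": 0.4}))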
MLM For masked language modeling, our implementation follows the original BERT settings (Devlin et al., 2019). The chance that a token is masked is 15%, of which 80% are masked with a [MASK] token, 10% with a random token, and 10% are left unchanged. Following Liu et al. (2019), we do not include the next sentence prediction task, for simplicity and efficiency. We use a cross-entropy loss and the language model heads from the defined Hugging Face embeddings (Wolf et al., 2020). MLM assumes raw text files as input, so no column_idx has to be defined (see Section 3.1).
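A minimal sketch of this masking scheme (the token IDs, vocabulary size, and mask ID are hypothetical, and special tokens are not treated separately; MACHAMP relies on the Hugging Face tokenizer and LM head for the actual implementation):

import random

def mask_tokens(token_ids, mask_id, vocab_size, mask_prob=0.15):
    """Return (masked inputs, targets); targets are -100 for unmasked positions."""
    inputs, targets = list(token_ids), []
    for i, tok in enumerate(token_ids):
        if random.random() < mask_prob:
            targets.append(tok)  # predict the original token here
            roll = random.random()
            if roll < 0.8:
                inputs[i] = mask_id                       # 80%: [MASK]
            elif roll < 0.9:
                inputs[i] = random.randrange(vocab_size)  # 10%: random token
            # remaining 10%: keep the original token unchanged
        else:
            targets.append(-100)  # ignored by the cross-entropy loss
    return inputs, targets

print(mask_tokens([12, 48, 7, 500], mask_id=103, vocab_size=30000))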
CLASSIFICATION For text classification, MACHAMP predicts a label for every text instance by using the embedding of the first token, which is commonly a special token (e.g., [CLS] or <s>). For tasks which model a relation between multiple sentences (e.g., textual entailment), a special token (e.g., [SEP]) is automatically inserted between the sentences to inform the model about the sentence boundaries.
SEQ2SEQ For text generation, MACHAMP employs the sequence-to-sequence (encoder-decoder) paradigm (Sutskever et al., 2014). We use a recurrent neural network decoder, which suits the auto-regressive nature of machine translation tasks (Cho et al., 2014), and an attention mechanism to avoid compressing the whole source sentence into a fixed-length vector (Bahdanau et al., 2015).

Usage
To use MACHAMP, one needs a configuration file, input data and a command to start the training or prediction. In this section we will describe each of these requirements.

Data format
MACHAMP supports two types of data formats for annotated data, which correspond to the level of annotation (Section 2.2). For token-level tasks, we will use the term "token-level file format", whereas for sentence-level tasks, we will use "sentence-level file format".
The token-level file format is similar to the tab-separated CoNLL format (Tjong Kim Sang and De Meulder, 2003). It assumes one token per line (on a column index word_idx), with each annotation layer following each token separated by a tab character (each on a column index column_idx) (Figure 2a). Token sequences (e.g., sentences) are delimited by an empty line. Comments are lines on top of the sequence (which have a different number of columns with respect to "token lines"). It should be noted that for dependency parsing, the format assumes the relation label to be on the column_idx and the head index on the following column. Furthermore, we also support the UD format by removing multi-word tokens and empty nodes using the UD-conversion-tools (Agić et al., 2016).
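For illustration, a hypothetical token-level file with the word form in column 0 (word_idx=0) and two annotation layers, UPOS tags and lemmas, in columns 1 and 2 (columns are tab-separated; the annotations are made up for this example):

smell	VERB	smell
ya	PRON	you
later	ADV	later
!	PUNCT	!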
The sentence-level file format (used for text classification and text generation) is similar (Figure 2b), and also supports multiple inputs having the same annotation layers. A list of one or more column indices can be defined (i.e., sent_idxs) to enable modeling the relation between any arbitrary number of sentences.
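The sentence-level example from Figure 2b (e.g., sentiment classification) is a single tab-separated line, with the sentence in column 0 (sent_idxs=[0]) and a single layer of annotation in the second column (column_idx=1):

smell ya later !	negative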

Configuration
The model requires two configuration files, one that specifies the datasets and tasks, and one for the hyperparameters. For the hyperparameters, a default option is provided (configs/params.json, see Section 4).

Dataset configuration
An example of a dataset configuration file is shown in Figure 3. On the first level, the dataset names are specified (i.e., "UD" and "RTE"), which should be unique identifiers. Each of these datasets needs at least a train_data_path, a validation_data_path, a word_idx or sent_idxs, and a list of tasks (corresponding to the layers of annotation, see Section 3.1).
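Along the lines of Figure 3, a dataset configuration for a UD treebank (token-level tasks) and RTE (sentence-level classification) could look roughly as follows. The file paths, task names, and column indices are placeholders for this sketch; the key names follow the conventions described in this section:

{
  "UD": {
    "train_data_path": "data/ewt.train.conllu",
    "validation_data_path": "data/ewt.dev.conllu",
    "word_idx": 1,
    "tasks": {
      "lemma": {"task_type": "string2string", "column_idx": 2},
      "upos": {"task_type": "seq", "column_idx": 3}
    }
  },
  "RTE": {
    "train_data_path": "data/RTE.train.tsv",
    "validation_data_path": "data/RTE.dev.tsv",
    "sent_idxs": [0, 1],
    "tasks": {
      "rte": {"task_type": "classification", "column_idx": 2}
    }
  }
}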
Loss weight In multi-task settings, not all tasks might be equally important, or some tasks might simply be harder to learn and should therefore receive more weight during training. This can be tuned by setting the loss_weight parameter on the task level (by default the value is 1.0 for all tasks).

Dataset embedding Ammar et al. (2016) have shown that embedding which language an instance belongs to can be beneficial for multilingual models. Later work (Stymne et al., 2018; Wagner et al., 2020) has also shown that more fine-grained distinctions on the dataset level can be beneficial when training on multiple datasets within the same language (family). These are called treebank embeddings in their work; we use the more general term "dataset embeddings", which would often roughly correspond to languages and/or domains/genres. In previous work, this embedding is usually concatenated to the word embedding before the encoding. However, with contextualized embeddings, the word embeddings themselves are commonly used as encoder; hence we concatenate the dataset embeddings in between the encoder and the decoder. This parameter is set on the dataset level with dataset_embed_idx, which specifies the column to read the dataset ID from. Setting dataset_embed_idx to -1 will use the dataset name as specified in the json file as ID.
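As a hypothetical fragment of the dataset configuration, a down-weighted task and a dataset embedding derived from the dataset name could be specified along these lines (the values and the exact placement of the keys are illustrative, following the description above):

"tasks": {
  "upos": {"task_type": "seq", "column_idx": 3, "loss_weight": 0.5}
},
"dataset_embed_idx": -1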
Max sentences In order to limit the maximum number of sentences that are used during training, max_sents can be used. This limit is applied before the sampling smoothing (Section 3.2.2), if both are enabled. It should be noted that the specified number of sentences is taken from the top of the dataset.

Hyperparameter configuration
Whereas most of the hyperparameters can simply be changed from the default configuration provided in configs/params.json, we would like to highlight two main settings.
Pre-trained embeddings The name or path of pre-trained Hugging Face embeddings can be set in the configuration file at the transformer_model key; transformer_dim might need to be adapted accordingly to reflect the embedding dimension.
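A hypothetical fragment of the hyperparameters file, assuming only the two keys discussed above (the surrounding structure of configs/params.json is omitted, and the embedding name is just an example):

"transformer_model": "bert-base-multilingual-cased",
"transformer_dim": 768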
Dataset sampling To avoid larger datasets from overwhelming the model, MACHAMP can resample multiple datasets according to a multinomial distribution, similar to Conneau and Lample (2019). MACHAMP performs the sampling on the batch level, and shuffles after each epoch (so it can see a larger variety of instances for downsampled datasets). The sampling probabilities are given by

q_i = p_i^α / Σ_{j=1..N} p_j^α,   with   p_i = n_i / Σ_{k=1..N} n_k,

where n_i is the number of training sentences in dataset i, p_i is the probability that a random sample is from dataset i, and α is a hyperparameter that can be set. Setting α=1.0 means using the original dataset sizes, whereas smaller values increase the relative proportion of smaller datasets (Figure 4). Smoothing can be enabled in the hyperparameters configuration file at the sampling_smoothing key.
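A small sketch of this smoothing (the dataset sizes are made up):

def sampling_probs(sizes, alpha=0.5):
    """Multinomial sampling probabilities with smoothing exponent alpha."""
    total = sum(sizes.values())
    p = {name: n / total for name, n in sizes.items()}
    norm = sum(v ** alpha for v in p.values())
    return {name: v ** alpha / norm for name, v in p.items()}

# With alpha=1.0 the original proportions are kept; alpha=0.5 upsamples "RTE".
print(sampling_probs({"UD": 12000, "RTE": 2000}, alpha=1.0))
print(sampling_probs({"UD": 12000, "RTE": 2000}, alpha=0.5))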

Training
Given the setup illustrated in the previous sections, a model can be trained via the following command. It assumes the configuration (Figure 3) is saved in configs/upos-lemma-rte.json.
python3 train.py --dataset_config configs/upos-lemma-rte.json

By default, the model and the logs will be written to logs/<JSONNAME>/<DATE>. The name of the directory can be set manually by providing --name <NAME>. Further, --device <ID> can be used to specify which GPU to use, otherwise the CPU will be used. As a default, train.py uses configs/params.json for the hyperparameters, but this can be overridden by using --parameters_config <CONFIG_FILE>.

Inference
Prediction can be done with:

python3 predict.py logs/<NAME>/<DATE>/model.tar.gz <INPUT_FILE> <OUTPUT_FILE>

It requires the path to the best model (serialized during training), stored as model.tar.gz in the logs directory as specified above. By default, the data is assumed to be in the same format as the training data (i.e., with the same number of column_idx columns), but --raw_text can be specified to read a data file containing raw text with one sentence per line. For models trained on multiple datasets (such as "UD" and "RTE" in Figure 3), --dataset <NAME> can be used to specify which dataset to use in order to predict all tasks within that dataset.

Hyperparameter Tuning
In this section we describe the procedure we used to determine robust default parameters for MACHAMP; note that the goal is not to beat the state-of-the-art, but to reach competitive performance for multiple tasks simultaneously. Compared to MACHAMP v0.1 (van der Goot et al., 2020), we removed parameters with negligible effects (word dropout, layer dropout, adaptive softmax, and layer attention). For the tuning of hyperparameters, we used the GLUE classification datasets (Wang et al., 2018; Warstadt et al., 2019; Socher et al., 2013; Dolan and Brockett, 2005; Cer et al., 2017; Williams et al., 2018; Rajpurkar et al., 2018; Bentivogli et al., 2009; Levesque et al., 2012) and the English Web Treebank (EWT 2.6) (Silveira et al., 2014), with multilingual BERT (mBERT, https://github.com/google-research/bert/blob/master/multilingual.md) as embeddings. We capped the dataset sizes to a maximum of 20,000 sentences for efficiency reasons. For each of these setups, we averaged the scores over all datasets/tasks and performed a grid search. The best hyperparameters across all datasets are reported in Table 1 and are the default values for MACHAMP.

Table 3: Average results over all development sets. Dataset embeddings and a separate decoder have not been tested on GLUE, because each dataset is annotated for a different task. * includes dataset smoothing.

Single task evaluation
As a starting point, we evaluate single task models to ensure our implementations are competitive with the state-of-the-art. We report scores on dependency parsing (EWT), the GLUE classification tasks, and machine translation (WMT14 DE-EN (Bojar et al., 2014), IWSLT15 EN-VI (Cettolo et al., 2014)), using mBERT as our embeddings. Table 2 reports our results on the test sets compared to previous work. For all UD tasks, we score slightly higher, whereas for GLUE tasks we score consistently lower compared to the references. This is mostly due to differences in fine-tuning strategies, as the implementations themselves are highly similar. Scores on the machine translation tasks show the largest drops, indicating that task-specific fine-tuning and pre-processing might be necessary.

Multi-dataset evaluation
We evaluate the effect of a variety of multi-dataset settings on all GLUE and UD treebanks (v2.7) on the test splits. It should be noted that the UD treebanks are all annotated for the same tasks, as opposed to GLUE. First, we jointly train on all datasets (ALL), then we attempt to improve performance on smaller sets by enabling the sampling smoothing (SMOOTHED, Section 3.2.2; we set α = 0.5). Furthermore, we attempt to improve performance by informing the decoder of the dataset through dataset embeddings (DATASET EMBED., Section 3.2.1) or by giving each dataset its own decoder (SEP. DECODER). Results (Table 3) show that multi-task learning is only beneficial for performance when training on the same set of tasks (i.e., UD); dataset smoothing is helpful, while dataset embeddings and separate decoders do not improve upon smoothing on average. For analysis purposes, we group the UD treebanks based on training size, and also evaluate UD treebanks which have no training split (zero-shot). For the zero-shot experiments, we select a proxy parser based on word overlap of the first 10 sentences of the target test data and the source training data. Results on the UD data (Table 4) show that multi-task learning is mostly beneficial for medium-sized datasets (<1k and <10k). For these datasets, the combination of smoothing and dataset embeddings is the most promising setting. Perhaps surprisingly, the zero-shot datasets have a higher LAS compared to the small (<1k) datasets, and using a separate decoder based on the proxy treebank is the best setting; this is mainly because for many small datasets there is no other in-language training treebank. For the GLUE tasks (Table 5, Appendix), multi-task learning is only beneficial for the RTE data. This is to be expected, as the tasks are different in this setup, and the training data is generally larger. Dataset smoothing here prevents the model from dropping too much in performance, as it outperforms ALL for 7 out of 9 tasks.

Conclusion
We introduced MACHAMP, a powerful toolkit for multi-task learning supporting a wide range of NLP tasks. We also provide initial experiments demonstrating the usefulness of some of its options. We learned that multi-task learning is mostly beneficial for setups in which multiple datasets are annotated for the same set of tasks, and that dataset embeddings can still be useful when employing contextualized embeddings. However, the current experiments are just scratching the surface of MACHAMP's capabilities, as a wide variety of tasks and multi-task settings is supported.

Acknowledgments
We would like to thank Anouck Braggaar, Max Müller-Eberstein and Kristian Nørgaard Jensen for testing development versions. Furthermore, we thank Rik van Noord for his participation in the video, and for providing an early use-case for MACHAMP (van Noord et al., 2020). This research was supported by an Amazon Research Award, an STSM in the Multi3Generation COST action (CA18231), a visit supported by COSBI, grant 9063-00077B (Danmarks Frie Forskningsfond), and Nvidia corporation for sponsoring Titan GPUs. We thank the NLPL laboratory and the HPC team at ITU for the computational resources used in this work.

Table 5: The scores (accuracy) per dataset on the GLUE tasks (dev) for a variety of multi-task settings (ordered by size, indicated in number of sentences in training data).