Less Is More: Domain Adaptation with Lottery Ticket for Reading Comprehension

In this paper, we propose a simple few-shot domain adaptation paradigm for reading comprehension. We first identify the lottery subnetwork structure within the Transformer-based source domain model via gradual magnitude pruning. Then, we fine-tune only the lottery subnetwork, a small fraction of the whole parameters, on the annotated target domain data for adaptation. To obtain more adaptable subnetworks, we introduce self-attention attribution to weigh parameters, beyond simply pruning the smallest-magnitude parameters, which can be seen as softly combining structured pruning and unstructured magnitude pruning. Experimental results show that our method outperforms full model fine-tuning adaptation on four out of five domains when only a small amount of annotated data is available for adaptation. Moreover, introducing self-attention attribution reserves more parameters for important attention heads in the lottery subnetwork and improves the target domain model performance. Our further analyses reveal that, besides exploiting fewer parameters, the choice of subnetworks is critical to the effectiveness.


Introduction
Reading comprehension (Rajpurkar et al., 2016, 2018) obtains great attention from both research and industry for its practical value. State-of-the-art systems based on pre-trained language models (Devlin et al., 2019; Dong et al., 2019; Joshi et al., 2020) have achieved remarkable performance on the task. Despite pre-training, they still rely on large amounts of annotated data (Rajpurkar et al., 2018; Trischler et al., 2017; Kwiatkowski et al., 2019) to reach the desired task performance. Manually collecting such high-quality datasets is costly and time-consuming, especially for cases that require specific domain knowledge. This hinders us from applying data-driven solutions directly to scenarios or domains without sufficient annotated data. In this case, domain adaptation (Golub et al., 2017; Shakeri et al., 2020) is used to obtain reasonable target domain performance.
Unsupervised domain adaptation (Cao et al., 2020) exploits unlabeled context passages for adaptation. However, these methods have difficulty adapting to the desiderata of questions and question-context reasoning in the target domain. In this paper, we focus on supervised domain adaptation for reading comprehension in the few-shot setting. We aim to transfer a model trained on a large amount of source domain data to a target domain with only limited annotated data; it is generally feasible to annotate a small amount of question answering pairs.
Typical reading comprehension models based on pre-trained language models contain at least hundreds of millions of parameters; e.g., BERT-base has 110M. Previous works (Voita et al., 2019a; Michel et al., 2019) show that dense neural networks are over-parameterized and that a considerable fraction of a trained model's parameters can be pruned with marginal or even no loss in performance. Meanwhile, "The Lottery Ticket Hypothesis" (Frankle and Carbin, 2019) argues that over-parameterized neural networks contain sparse subnetworks at initialization which, when trained in isolation, rival the original network in task performance. On the other hand, our preliminary analysis (Figure 2) using an effective attribution method (Hao et al., 2020) shows that important attention heads are highly correlated across various domains.
In view of the over-parameterized source domain model and our preliminary findings on attention head dynamics, we assume that fine-tuning a small fraction of deliberately selected parameters is both more efficient and more effective for few-shot domain adaptation. Specifically, we first prune the source domain model gradually via magnitude pruning. In addition, we introduce self-attention attribution (Hao et al., 2020) to reserve more parameters for important heads. The connections corresponding to the parameters that survive pruning depict the exact sparse structure of the lottery subnetwork. Then, we fine-tune only the lottery subnetwork, which contains far fewer parameters, on the annotated target domain data for adaptation. The remaining parameters are frozen and not updated, but they still contribute to the predictions by participating in the forward computation.
Experimental results show that our method, exploiting small lottery subnetworks for few-shot domain adaptation, outperforms full model fine-tuning on four out of five diverse domains across a range of training-set sizes. Further analyses reveal several intriguing findings. First, introducing attention head importance yields better lottery subnetworks in highly sparse regimes in the source domain, and improves the adaptation performance regardless of the sparsity. Second, better source domain lottery subnetworks lead to improved domain adaptation performance. Finally, in addition to using fewer parameters, the choice of subnetwork structure is critical to the effectiveness.

The Transformer
Transformer (Vaswani et al., 2017) is a widely used model architecture that relies heavily on the attention mechanism. A Transformer-based model consists of L stacked identical Transformer blocks. The model first embeds the inputs and then encodes them through the L Transformer blocks: H_l = Transformer_l(H_{l−1}), l ∈ [1, L]. Each Transformer block consists of two sub-layers, a multi-head self-attention mechanism and a feed-forward network. A residual connection (He et al., 2016) followed by layer normalization (Ba et al., 2016) is employed around each of the two sub-layers.
The core component of a Transformer block is multi-head self-attention. For the l-th layer, the previous layer's output H_{l−1} is linearly projected to a triple of queries Q, keys K and values V using parameter matrices W_Q^l, W_K^l, W_V^l, respectively. The attention of the i-th head is then computed via:

A_i = softmax(Q_i K_i^T / √d_k),  head_i = A_i V_i,  (1)

where d_k is the size of the per-head hidden states. At last, the output of multi-head self-attention concatenates all heads and applies an output projection, MHSA(H_{l−1}) = [head_1, · · · , head_h] W_O^l, where h is the number of heads and [·] means concatenation.
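As an illustration, the per-head attention computation above can be sketched in a few lines of NumPy. This is a toy sketch with random weights, not the actual BERT implementation; the function name and shapes are our own choices:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(H, W_Q, W_K, W_V, W_O, h):
    """Minimal multi-head self-attention over one sequence.

    H: (n, d) hidden states; W_Q/W_K/W_V/W_O: (d, d) projections;
    h: number of heads. Returns the (n, d) outputs and the per-head
    attention maps A of shape (h, n, n).
    """
    n, d = H.shape
    d_head = d // h
    Q, K, V = H @ W_Q, H @ W_K, H @ W_V
    # split the projections into heads: (h, n, d_head)
    split = lambda X: X.reshape(n, h, d_head).transpose(1, 0, 2)
    Qh, Kh, Vh = split(Q), split(K), split(V)
    # scaled dot-product attention per head (Equation 1)
    A = softmax(Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head))
    heads = A @ Vh                                      # (h, n, d_head)
    out = heads.transpose(1, 0, 2).reshape(n, d) @ W_O  # concat + project
    return out, A

rng = np.random.default_rng(0)
n, d, h = 4, 8, 2
H = rng.normal(size=(n, d))
Ws = [rng.normal(size=(d, d)) * 0.1 for _ in range(4)]
out, A = multi_head_self_attention(H, *Ws, h=h)
```

Each row of every attention map A_i is a distribution over the n tokens, which is what the attribution method in the next section operates on.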

Self-Attention Head Importance
Many works (Clark et al., 2019;Kovaleva et al., 2019) have tried to interpret Transformer models' behaviors. Recently, Hao et al. (2020) propose a self-attention attribution (ATTATTR) method by running an integrated gradients (Sundararajan et al., 2017) procedure over all the attention links. A higher attribution score indicates greater contribution to the model prediction.
Concretely, given an input x of n tokens, the attribution score of each attention link within the i-th head is computed as:

Attr_i(A) = A_i ⊙ ∫_0^1 [∂F(x, αA) / ∂A_i] dα,  (2)

where ⊙ is element-wise multiplication, the attention map A_i is computed as in Equation 1, A = [A_1, · · · , A_h], and ∂F(x, αA)/∂A_i is the gradient of the model F(·) along A_i with the manipulated attention weight matrix αA. The importance score of the i-th attention head can then be estimated as the maximum attribution score within the head, averaged over examples: I_i = E_x[max(Attr_i(A))].
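In practice the integral in Equation 2 is approximated by a Riemann sum over scaled attention maps. Below is a minimal sketch of that procedure; to keep it self-contained we use a toy scoring function with a hand-coded gradient in place of a real model's autograd pass, so `grad_fn` and the function name are our own illustrative choices:

```python
import numpy as np

def attention_attribution(A, grad_fn, steps=20):
    """Sketch of ATTATTR via integrated gradients.

    A: (h, n, n) attention maps; grad_fn(alpha * A) returns dF/dA
    evaluated at the scaled attention maps. Returns per-link
    attributions (h, n, n) and per-head importance scores (h,)
    taken as the max attribution within each head.
    """
    # midpoint-rule approximation of the path integral from 0 to A
    alphas = (np.arange(steps) + 0.5) / steps
    avg_grad = np.mean([grad_fn(a * A) for a in alphas], axis=0)
    attr = A * avg_grad                      # Equation 2, per attention link
    importance = attr.reshape(A.shape[0], -1).max(axis=1)
    return attr, importance

# Toy "model": F(A) = sum(W * A**2), so dF/dA = 2 * W * A (hand-coded,
# standing in for an autograd pass through the real reading model).
rng = np.random.default_rng(1)
h, n = 2, 3
A = np.abs(rng.normal(size=(h, n, n)))
W = rng.normal(size=(h, n, n))
attr, imp = attention_attribution(A, grad_fn=lambda X: 2 * W * X)
```

For this toy quadratic model the integral can be solved in closed form as W ⊙ A², which the Riemann sum recovers; with a real Transformer the only change is computing `grad_fn` with automatic differentiation.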

The Lottery Ticket Hypothesis
The Lottery Ticket Hypothesis (Frankle and Carbin, 2019) suggests that we can find small and sparse subnetworks that rival the original network in performance, when trained in isolation from "lucky" initializations, often referred to as "winning lottery tickets". The connections of the winning lottery tickets are initialized to be particularly effective for training. Magnitude pruning (Han et al., 2015) is an effective method widely used to identify the winning lottery ticket by pruning the smallest magnitude weights.

Reading Comprehension Task and Domain Variance
In this work, we focus on extractive reading comprehension, which aims to extract a continuous span from the text context c as the answer a to a question q. It has been a prevalent format since SQuAD v1.1 (Rajpurkar et al., 2016) and is widely adopted by several other reading comprehension datasets (Joshi et al., 2017; Trischler et al., 2017; Yang et al., 2018; Kwiatkowski et al., 2019) in various domains. The differences between the domains are mainly derived from: a) the styles and sources of the context passages, including Wikipedia, news articles, science articles, Web snippets, and Tweets; b) the types of questions being asked, e.g., factoid, conversational, entity-centric, multi-hop reasoning, and search queries; and c) the methodology under which the questions were collected, e.g., manually written by crowdworkers or domain experts, or automatically mined from the Web or search logs.
Our preliminary experiments explore the dynamics of important self-attention heads across different domains. We fine-tune BERT-base on each domain's dataset independently to obtain domain-specific models. Then we employ ATTATTR (Section 2.2) to get the importance scores of attention heads using Equation 2. We take three representative datasets, SQuAD v1.1 (Rajpurkar et al., 2016), NQ (Kwiatkowski et al., 2019) and NewsQA (Trischler et al., 2017), that differ in the sources of the context passages and the question types. The heatmap of head importance on SQuAD v1.1 and the correlation of importance scores between each pair of the three datasets are shown in Figure 2. Given the same BERT initialization, we can see that, despite the domain differences, the important heads are highly correlated. These preliminary results uncover the value of exploiting important heads for efficient domain adaptation.

Method
In this section, we describe our few-shot domain adaptation method for machine reading comprehension in detail. In the source domain, we have a model trained on a large-scale annotated dataset. We fine-tune BERT-base (Devlin et al., 2019), a representative Transformer-based pre-trained language model with a tremendous number of parameters, as our source domain model. In the target domain, only limited annotated data, 1k examples at most, can be used for domain adaptation. The mismatch between the small amount of data and the large number of parameters makes it challenging to adapt all source domain model parameters to the target domain. Thus, we exploit a small fraction of deliberately selected parameters for domain adaptation by first identifying and then fine-tuning the lottery subnetwork.

Identifying the Lottery Network
Neural networks are over-parameterized (Allen-Zhu et al., 2019): a great fraction of the parameters are redundant and can be pruned with minimal or even no compromise in task performance. The Lottery Ticket Hypothesis (Frankle and Carbin, 2019) further suggests the existence of sparse subnetworks, trained from "lucky" initializations, that match the performance of the full model.

Algorithm 1 Identifying the Lottery Subnetwork with Self-Attention Head Importance
Require:
1: Source domain model F(x; M ⊙ θ_0)
2: Initial pruning mask M = 1^{|θ_0|}
3: Target sparsity s, pruning frequency ∇t and steps N
4: Importance factor λ
5: for n ← 1 to N do
6:   Estimate attention head importance Î_n with Eq. 2
7:   Trim magnitudes with the normalized importance scores, θ_{(n−1)∇t} ← AttrMagnitude(θ_{(n−1)∇t}, Î_n)
8:   s_n ← s − s(1 − n/N)^2   ▷ sparsity of step n
9:   Prune the lowest-magnitude parameters in groups from θ_{(n−1)∇t} to sparsity s_n
10:  Update the pruning mask M
11:  Train the model for ∇t steps, producing F(x; M ⊙ θ_{n∇t})
12: end for
13: Train the model until the stopping criterion is met, producing F(x; M ⊙ θ_T)
14: return Lottery subnetwork M
Magnitude pruning Magnitude pruning is a simple and effective unstructured pruning method that prunes the smallest-magnitude parameters (Han et al., 2015); it is also used to find winning lottery tickets (Frankle and Carbin, 2019), though several tricks are required for complicated architectures (Morcos et al., 2019). In our work, we employ a simple gradual pruning algorithm without iteratively rewinding parameters. It prunes a portion of the parameters each time and gradually increases the sparsity of the model. Training between pruning steps allows the model to recover from the pruning-induced task performance degradation. We follow Zhu and Gupta (2018) but use a square sparsity scheduling for magnitude pruning. The connections corresponding to the parameters that survive pruning depict the exact sparse structure of the lottery network. For the Transformer-based source domain model, we only prune the parameter matrices of the linear projections and feed-forward networks, and keep the rest intact.
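The gradual pruning loop with the square sparsity schedule can be sketched as follows. This is a simplified single-matrix illustration without the training steps or the importance scaling; the helper names are ours, not from any released code:

```python
import numpy as np

def square_sparsity_schedule(target_s, n, N):
    """Sparsity at pruning step n of N (cf. Algorithm 1):
    ramps quickly at first, then flattens toward target_s."""
    return target_s - target_s * (1.0 - n / N) ** 2

def prune_to_sparsity(weights, mask, sparsity):
    """Unstructured magnitude pruning: zero out the mask for the
    lowest-|w| entries until `sparsity` of all entries are pruned.
    Already-pruned entries have magnitude 0, so they stay pruned."""
    w = np.abs(weights * mask).ravel()
    k = int(round(sparsity * w.size))
    if k > 0:
        threshold = np.sort(w)[k - 1]
        mask = mask * (np.abs(weights * mask) > threshold)
    return mask

rng = np.random.default_rng(2)
W = rng.normal(size=(20, 20))
mask = np.ones_like(W)
N, target = 5, 0.8
for n in range(1, N + 1):
    s_n = square_sparsity_schedule(target, n, N)
    mask = prune_to_sparsity(W, mask, s_n)
    # ...train for ∇t steps here so the model recovers before the next prune...
```

Because the schedule reaches `target_s` exactly at n = N, the final mask has the requested sparsity; intermediate training steps (elided above) are what let the network recover between prunes.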
Pruning Strategy Pruning can be performed in two different ways: locally and globally. In local pruning, parameter magnitudes are compared within each parameter matrix separately, such that every parameter matrix has the same fraction of pruned parameters. In global pruning, all parameters are pooled together prior to pruning, allowing the pruning fraction to vary across parameter matrices and layers.
Since magnitude pruning relies on an intrinsic metric, component importance may be overwhelmed by parameter magnitudes in global pruning: more parameters may be pruned from important components simply because of their relatively lower magnitudes. We observe that the magnitudes of Transformer parameter matrices are distributed similarly across layers, but quite differently across parameter matrices. Therefore, we propose a "divide-and-conquer" group pruning strategy, which divides the parameter matrices into groups according to their mean magnitudes and prunes locally across groups and globally within each group.
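A minimal sketch of this group pruning strategy, assuming two groups and simple mean-magnitude bucketing (the exact grouping criterion and helper name are our illustrative assumptions):

```python
import numpy as np

def group_prune(matrices, sparsity, n_groups=2):
    """'Divide-and-conquer' group pruning sketch: bucket parameter
    matrices by their mean |w| into n_groups, then prune each group
    to `sparsity` by pooling all of that group's weights together
    (global within a group, local across groups)."""
    means = np.array([np.abs(m).mean() for m in matrices])
    order = np.argsort(means)
    buckets = np.array_split(order, n_groups)  # low- to high-magnitude groups
    masks = [None] * len(matrices)
    for bucket in buckets:
        pooled = np.concatenate([np.abs(matrices[i]).ravel() for i in bucket])
        k = int(round(sparsity * pooled.size))
        threshold = np.sort(pooled)[k - 1] if k > 0 else -np.inf
        for i in bucket:
            masks[i] = (np.abs(matrices[i]) > threshold).astype(float)
    return masks

rng = np.random.default_rng(3)
# two small-magnitude and two large-magnitude matrices
mats = [rng.normal(scale=s, size=(10, 10)) for s in (0.1, 0.1, 1.0, 1.0)]
masks = group_prune(mats, sparsity=0.5, n_groups=2)
```

With plain global pruning the two small-scale matrices would absorb almost all of the pruning; grouping guarantees each magnitude regime is pruned at the same rate while still letting the fraction vary between matrices inside a group.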
Pruning with Self-Attention Head Importance Magnitude pruning is effective, but magnitude alone is insufficient to determine parameter importance. Meanwhile, in Section 2.4, we find that attention heads are not equally important to the model predictions, and that the important heads are highly correlated across various domains.
Thus, we introduce self-attention attribution (ATTATTR; Hao et al., 2020) into magnitude pruning to identify more adaptable subnetworks of identical size. In each pruning step, we first estimate the importance scores I of all attention heads using Equation 2. Then we scale the importance scores with MinMax(λ, I) normalization, where λ is the importance factor that negatively indicates the intensity of the importance intervention. At last, we scale the parameter magnitudes accordingly, which may reverse the rankings previously determined by the magnitudes alone. Note that the parameters of an attention head are scattered across four parameter matrices. We apply the same importance scores to each parameter matrix, so the slices of the same head are scaled identically within a layer.
In conclusion, under the same pruning budget we reserve more parameters for important heads, which are highly correlated across domains, due to their high self-attention attribution scores, and vice versa. That is, we obtain lottery subnetworks that are potentially more adaptable to target domains. Our lottery subnetwork identification method is shown in Algorithm 1.
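One plausible reading of the MinMax(λ, I) scaling step is sketched below: head importance scores are min-max normalized into [λ, 1], and each head's slice of a projection matrix is rescaled by its score before magnitudes are compared, so less important heads look smaller to the pruner. The helper name and the column-slice layout are our assumptions, not from any released code:

```python
import numpy as np

def importance_scaled_magnitudes(W, head_importance, lam=0.2):
    """Sketch of importance-scaled magnitude pruning.

    W: (d, d) projection matrix whose columns are split evenly
    among the heads; head_importance: (h,) raw importance scores;
    lam: importance factor. Scores are min-max normalized to
    [lam, 1]; lam = 1 recovers plain magnitude pruning."""
    h = head_importance.size
    lo, hi = head_importance.min(), head_importance.max()
    scale = lam + (1 - lam) * (head_importance - lo) / (hi - lo)
    d_head = W.shape[1] // h
    scaled = W.copy()
    for i in range(h):
        # every slice belonging to head i is scaled identically
        scaled[:, i * d_head:(i + 1) * d_head] *= scale[i]
    return scaled

rng = np.random.default_rng(4)
W = rng.normal(size=(8, 8))
imp = np.array([0.1, 0.9])          # head 1 far more important than head 0
W_scaled = importance_scaled_magnitudes(W, imp, lam=0.2)
```

The least important head is shrunk by the factor λ while the most important head is left untouched, which is how a small λ lets head importance override raw magnitudes in the pruning ranking.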

Adapting the Lottery Subnetwork
In Section 3.1, we identified the sparse structure of the lottery subnetwork for adaptation. When adapting to the target domain, we start from the original source domain model parameters and only update the lottery subnetwork parameters with the limited annotated data, 1k examples at most. In this way, we adapt from an intact source domain model without the potential performance loss induced by pruning. Note that the pruned parameters are frozen and not updated, but they still participate in the forward computation.
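Freezing the parameters outside the subnetwork while keeping them in the forward pass amounts to masking the gradient update, as in this toy SGD step:

```python
import numpy as np

def masked_sgd_step(theta, grad, mask, lr=0.1):
    """Adaptation step sketch: only subnetwork parameters (mask == 1)
    are updated; frozen parameters (mask == 0) keep their
    source-domain values but still take part in the forward pass."""
    return theta - lr * (mask * grad)

theta = np.array([1.0, -2.0, 0.5, 3.0])   # source-domain parameters
mask = np.array([1.0, 0.0, 1.0, 0.0])     # lottery subnetwork membership
grad = np.array([0.2, 0.4, -0.6, 0.8])    # target-domain gradient
theta_new = masked_sgd_step(theta, grad, mask)
```

Only the two subnetwork entries move; the frozen entries retain their source-domain values exactly, which is what lets the adapted model inherit the full model's capacity at no extra training cost.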

Datasets
We simulate few-shot domain adaptation scenarios by sampling subsets from larger training sets. We use SQuAD v1.1 (Rajpurkar et al., 2016) as the resource-rich source domain and five diverse datasets, listed in Table 1, as the target domains, including QuAC (Choi et al., 2018), which contains conversational questions in the context of multi-turn information-seeking dialogues; for QuAC, we filter out yes/no questions and unanswerable questions.

Baselines
We compare our method, ALTER (Adaptable Lottery), against the following baselines:

Zero-Shot We apply the source domain model to the target domain without adaptation.
Fine-tuning We fine-tune the full source domain model on the target domain data.
EWC Elastic Weight Consolidation (Kirkpatrick et al., 2017) is a regularization algorithm that constrains parameters to stay close to their original values and prevents large deviations.
Layer Freeze We only fine-tune the top layers of the source domain model on the target domain data and freeze the rest.
Adapter Houlsby et al. (2019) propose adapters for parameter-efficient transfer by adding only a few trainable parameters. We add adapters within the Transformer blocks and only update the adapters.

Implementation Details
We experiment with BERT-base-uncased (Devlin et al., 2019), a Transformer-based pre-trained model with roughly 110M parameters. Fine-tuning the embedding layer in the target domain yields no consistent differences, so we freeze it; the reported sparsity percentages are relative to the model without the embedding layer, i.e., 84M parameters. We set the maximum sequence length to 384 with a document stride of 128. Adam (Kingma and Ba, 2015) with linear learning rate decay is used for optimization. The source domain model is BERT-base fine-tuned on SQuAD v1.1 with a learning rate of 3e-5 and batch size 12 for 2 epochs. In the target domain, we search for the best learning rate in [3e-5, 6e-5] and the number of epochs in [2, 3]. Attention head importance is estimated with 200 source domain examples, using model predictions instead of the gold answers. The importance factor λ is set to 0.2, which performs best.

Table 2 shows the exact match (EM) and F1 scores on five target domains with 1024 training examples. We use magnitude pruning together with self-attention head importance to identify the lottery subnetworks, which contain 21M parameters, i.e., approximately 25% of all parameters. We fine-tune the top 3 layers in the LayerFreeze baseline and set the adapter size to 128. Experimental results show that ALTER outperforms the full model fine-tuning baseline and the EWC-regularized baseline on four out of five target domains. LayerFreeze and Adapter use roughly the same number of parameters as our method, yet both perform worse than the fine-tuning baseline in most cases, which indicates that the structure accommodating the parameters matters. ALTER of this size performs worse than the fine-tuning baseline on NQ, but is competitive when using 42M parameters.

Domain Adaptation Results
In Figure 3, we plot the F1 score of ALTER against all baselines on four domains across a range of few-shot settings. EWC performs competitively with the fine-tuning baseline and occasionally yields slightly better results. Our method is orthogonal to EWC and could be combined with it, which we leave to future work. As in Table 2, LayerFreeze and Adapter are less competitive, except for TriviaQA in Figure 3b; however, Adapter consistently performs more robustly than the other parameter-efficient methods. We can clearly see that ALTER obtains superior performance in three domains with 64 to 1024 examples. Results on NQ are shown in Figure 3d: ALTER matches the fine-tuning baseline with only half of the parameters. Besides, we present our method with the best-performing lottery subnetworks, and the optimal sizes differ across domains. We find that keeping 20%–30% of the parameters is satisfactory; the only exception is NQ, which requires 50%. In conclusion, ALTER is shown to be both effective and efficient for few-shot domain adaptation.

Analyses
Does structure-aware pruning deliver better lottery subnetworks? In Figure 4, we show the F1 scores of lottery subnetworks identified with and without attention head importance in the source domain. Since local pruning and global pruning perform competitively, we only present the results using local pruning and our group pruning (Section 3.1). At low sparsity (more than 30% of the weights remaining), the two pruning methods perform equally well, and head importance has little effect on the F1 score. At high sparsity, however, pruning with head importance maintains the subnetwork's performance within 90% of the full model with only 20% of the parameters remaining. Meanwhile, group pruning works better with structure-aware importance determination.
Next, we investigate to what extent we should exploit attention head importance scores for pruning. A smaller importance factor λ in Algorithm 1 means that the parameter magnitudes can be altered more dramatically, i.e., the importance of a parameter is determined more by its attention head importance. In Figure 4, we find that setting λ to 0.2 consistently leads to better lottery subnetworks across different sizes.

Does better source domain performance lead to more efficient adaptation in the target domain? To answer this question, we present the difference in F1 score between lottery subnetworks identified with and without self-attention head importance in Figure 5. It shows consistent improvements with different numbers of target domain examples. The improvement tends to be magnified at higher sparsity, which is in line with the trends in Figure 4.
What about other alternatives for lottery subnetwork identification? We have investigated several heuristic methods to explore the choice of subnetwork structures for domain adaptation:

RANDOM chooses parameters to constitute subnetworks randomly.
MAGNITUDE selects the highest-magnitude parameters in one shot.
SALVAGE reuses the pruned redundant parameters, operating conversely to our method.
ATTRHEAD prunes whole attention heads with structured pruning, and applies unstructured magnitude pruning in the feed-forward layers.
In Table 3, the sizes of the subnetworks are identical. The methods in the second group work without structural importance priors. They perform similarly and, surprisingly, outperform the full-model fine-tuning baseline, which shows that adapting all parameters to the target domain is not optimal given few examples. We put the structure-aware methods in the third group. Comparing SALVAGE and ALTER, we find that using important parameters instead of the redundant ones is more effective. Results on ATTRHEAD show that high-magnitude parameters in less important heads are also useful.

Related Work
Domain Adaptation and Generalization in MRC Previous domain adaptation works (Nishida et al., 2020) are mainly unsupervised and require plenty of unlabeled text. Most of them are devoted to generating synthetic questions (Golub et al., 2017). Adversarial training (Cao et al., 2020), self-training (Rennie et al., 2020) and several filtering methods (Shakeri et al., 2020; Rennie et al., 2020) have been explored in this direction. But they have inherent difficulty accommodating the question and reasoning types desired in the target domain.
Several works have explored domain generalization in reading comprehension. Talmor and Berant (2019), Khashabi et al. (2020) and Lourie et al. (2021) improve generalization by training on multiple datasets. Su et al. (2020) introduce Adapters (Houlsby et al., 2019) to accommodate each domain. These methods require a considerable amount of annotated data to work, whereas we focus on more efficient few-shot domain adaptation. Ram et al. (2021) explore few-shot question answering via pre-training, which is orthogonal to our work.
Analyzing and Pruning Transformer Analyses of the Transformer (Clark et al., 2019; Mareček and Rosa, 2019; Voita et al., 2019b; Brunner et al., 2020; Hao et al., 2020) mainly focus on understanding the multi-head self-attention mechanism. Michel et al. (2019) and Voita et al. (2019a,b) show that most self-attention heads can be pruned with marginal performance loss. Structured pruning of more components has also been explored (McCarley et al., 2019; Fan et al., 2020). We are thus inspired to treat self-attention heads unequally for domain adaptation. Unstructured magnitude pruning (Han et al., 2015) with additional tricks (Zhu and Gupta, 2018; Frankle et al., 2020) can remove more parameters (Gordon et al., 2020). In this work, we exploit both structured and unstructured pruning to find sparse structures.
Lottery Ticket in NLP The Lottery Ticket Hypothesis (Frankle and Carbin, 2019) has been largely researched in computer vision. Recent works (Yu et al., 2020; Prasanna et al., 2020; Chen et al., 2020) in NLP explore the existence of lottery subnetworks at pre-trained initialization and after training on downstream tasks. In our work, we identify and fine-tune lottery subnetworks for domain adaptation.

Conclusions
In this work, we propose ALTER, a simple and effective domain adaptation paradigm for few-shot reading comprehension. We exploit a small fraction of the parameters of the over-parameterized source domain model to adapt to the target domain by first identifying and then fine-tuning the lottery subnetwork. We introduce self-attention attribution, an interpretation method for Transformers, to identify better subnetworks and improve the target domain performance. Further exploration using several heuristic methods to reveal subnetwork structures finds that, besides using fewer parameters, the choice of subnetwork structure is critical to the effectiveness.