Unsupervised Domain Adaptation for Event Detection using Domain-specific Adapters

Due to the multi-dimensional variation of textual data, detecting event triggers in new domains can be considerably more challenging. This motivates research on domain adaptation methods for the event detection task, especially in the most practical unsupervised setting. Recently, large transformer-based language models, e.g., BERT, have become essential for achieving top performance on event detection. However, their unwieldy nature also prevents effective adaptation across domains. To this end, this work proposes a Domain-specific Adapter-based Adaptation (DAA) framework to improve the adaptability of BERT-based models for event detection across domains. By explicitly representing data from different domains with separate adapter modules in each layer of BERT, DAA introduces a novel joint representation learning mechanism and a Wasserstein distance-based technique for data selection in adversarial learning to substantially boost performance on target domains. Extensive experiments and analysis over different datasets (i.e., LitBank, TimeBank, and ACE-05) demonstrate the effectiveness of our approach.


Introduction
Event detection (ED) is an important component of the overall event extraction pipeline, which plays a crucial role in any natural language understanding system. The goal of ED is to identify event triggers in a given text and classify them into one of several pre-defined types. Formally, according to the ACE-05 annotation guideline, each event trigger is a phrase (usually a single verb or nominalization) that evokes an event in the context of the associated event mention. For example, the word "fired" is the trigger word for an event of type Attack in the following sentence: "The police fired tear gas and water cannons in street battles with activists." Tackling the ED problem involves both locating event triggers and categorizing them into specific event types, and can therefore be quite challenging due to the intricate dependencies among triggers, events, and contexts in linguistic data. The complication is further amplified by the domain shift problem when texts are collected from multiple different domains.
The majority of prior approaches to ED relied on the basic supervised learning assumption that training and testing data follow the same distribution. Several works further evaluated their methods in the cross-domain setting, where models were trained on data from one domain and tested on another, without leveraging any adaptation mechanism to alleviate the domain shift problem (Nguyen and Grishman, 2015; Yubo et al., 2015; Hong et al., 2018b). To this end, our work explores the general problem of domain adaptation for ED where data comes from two different domains, a source and a target. In particular, we focus on the unsupervised setting, which requires no annotations for target data: the model has to learn to make use of both labeled source and unlabeled target samples to improve its performance on the target domain. To our knowledge, this is the first work on unsupervised domain adaptation (UDA) for ED in the literature.
The most prominent approach to UDA is representation learning based on the theory of learning from different domains developed by Ben-David et al. (2010). The main result provides a way to bound the loss of a model on the target domain by its performance on the source domain plus a domain-divergence term and an optimal joint error term (which is presumably negligible). Ganin et al. (2016) adopted this idea for deep learning architectures in their domain-adversarial neural network (DANN). They employed a domain-adversarial training procedure in which a domain classifier is learned concurrently and adversarially with the network's feature extractor, resulting in a joint representation for data from both domains that is not only discriminative but also domain-invariant. While DANN and its variants are very well-studied in computer vision's domain adaptation research, their NLP counterparts are pale in comparison, especially for a newly established architecture like BERT. Only a few works have adopted DANN to align the contextualized representations learned by BERT across domains (Lin et al., 2020; Naik and Rosé, 2020; Wright and Augenstein, 2020). Lin et al. (2020) even observed a negative effect when applying adversarial training compared with simply fine-tuning BERT on in-domain data. One explanation is that pre-training BERT on massive corpora has already induced a somewhat general representation, so DANN has little effect, while fine-tuning on the source dataset could cause over-fitting to the corresponding domain due to the immense capacity of the model. To this end, we propose fixing the parameters of the already universal language model while leveraging multiple adapter modules for the domain-adversarial training process. More specifically, inspired by the works of Liu et al. (2017a) and Houlsby et al. (2019) on effective multi-task learning, we augment the pre-trained BERT model by adding three different adapters to create a shared-private architecture: source and target adapters, which take as inputs data from their respective domains to capture the private properties of each, and a joint adapter, which encodes every sample in a subspace shared across domains through adversarial training. Orthogonality constraints together with a self-supervised auxiliary task are employed to ensure the representations of all adapters are informative while also attaining the above desired properties.
Recently, Ma et al. (2019) and Aharoni and Goldberg (2020) have shown that BERT's representations are extremely effective at clustering texts into their respective domains, and that a small subset of good in-domain data can already provide significant boosts in target performance, while the rest provides little to no improvement and in some cases even degrades the model's out-of-domain generalization. Considering this, we explicitly find hard instances to leave out when learning to extract domain-invariant features. Our data selection component estimates and minimizes the cost of transport between the source and target marginal representation distributions based on the Wasserstein-1 distance (also referred to as the Earth Mover's distance). It has been pointed out that the topology induced by this distance is much weaker than that of the KL-divergence used by adversarial training. Therefore, it can serve as a good necessary condition for the DANN component to achieve optimal alignment. The faraway source instances that induce the highest transportation costs are out-of-distribution samples that may introduce noise and hurt adaptation performance. Accordingly, they are omitted from the domain-adversarial training process. The entire computation makes use of representations from the source and target adapters, thus implicitly providing informative signals from the domain-specific adapters to the joint adapter without interrupting the joint representation learning procedure.

Related Work
Prior ED works have focused on the in-domain setting (Li et al., 2013; Chen et al., 2015; Yang and Mitchell, 2016; Nguyen and Grishman, 2018; Sha et al., 2018; Liu et al., 2017b; Tong et al., 2020; Nguyen et al., 2021), the cross-domain evaluation (Hong et al., 2018a), and the few/low-shot learning scenario (Lai et al., 2020a,b). Our work differs from these prior works in that we explore a new formulation for ED with unsupervised domain adaptation, where unlabeled data in the target domain is utilized to improve domain-invariant representation learning.
Recently, some efforts have been made to study the domain-related knowledge encoded in BERT's representations (Aharoni and Goldberg, 2020), and methods to leverage it to improve performance on domain-specific tasks, such as pre-training on additional data (Gururangan et al., 2020), fine-tuning on intermediate tasks (Phang et al., 2018; Garg et al., 2020), and data selection (Ma et al., 2019; Aharoni and Goldberg, 2020). Another line of research, multi-task learning, shares with domain adaptation the common goal of creating a universal representation space for all data. Previous approaches made use of multiple encoders to set up a shared-private architecture (Bousmalis et al., 2016; Liu et al., 2017a), which is usually impractical for BERT-based models because of their sizes. By fixing a pre-trained BERT as the base for general representations, Houlsby et al. (2019) and Stickland and Murray (2019) proposed to adapt the model to each task by adding small task-specific layers between BERT's layers and updating only these layers when fine-tuning on the corresponding task.

Model
Throughout this work, we formulate the ED task as a token-level multi-class classification problem (Nguyen and Grishman, 2015; Ngo et al., 2020). For the UDA setting in particular, we have a labeled source dataset D^s, in which each sample is a pair consisting of an event mention W^s_i = (w^s_{i1}, w^s_{i2}, ..., w^s_{im}) (m is the fixed number of words) and a trigger position u (1 ≤ u ≤ m) corresponding to the word w^s_{iu}. An encoder computes its latent representation x^s_i, which is then used by the event classifier to predict an event of type y^s_i. For the target domain, the parallel notations x^t_i and y^t_i are used (the latter only accessible in the target domain's test dataset).
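For concreteness, the classification step described above can be sketched as follows (a minimal NumPy illustration with placeholder dimensions and a linear event classifier, not the paper's actual BERT-based encoder):

```python
import numpy as np

def classify_trigger(token_reprs: np.ndarray, u: int, W: np.ndarray, b: np.ndarray) -> int:
    """Predict an event type for the trigger at position u (1-indexed).

    token_reprs: (m, d) latent representations of the mention's words.
    W, b: parameters of a linear event classifier over k event types.
    """
    x = token_reprs[u - 1]            # representation of the trigger word
    logits = x @ W + b                # (k,) unnormalized scores
    return int(np.argmax(logits))     # index of the predicted event type

# Toy usage: a 5-word mention, 8-dim representations, 4 event types.
rng = np.random.default_rng(0)
reprs = rng.normal(size=(5, 8))
W, b = rng.normal(size=(8, 4)), np.zeros(4)
label = classify_trigger(reprs, u=3, W=W, b=b)
```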

Baseline Model
As this is the first work on UDA for ED, this section aims to establish baselines for the task for further research. Recent works have shown a substantial boost in performance in the standard supervised setting of ED by leveraging contextual embeddings of large self-attention-based language models (Lai et al., 2020c). Accordingly, we utilize a pre-trained BERT encoder, together with its domain-adversarial variant, to create strong baselines for the UDA setting.
Without any domain adaptation mechanism, our BERT baseline simply follows the cross-domain evaluation setting of previous works. The model is fully fine-tuned on the source domain dataset, while at test time, data from the target domain is used to evaluate its performance.
On the other hand, the BERT+DANN baseline takes advantage of the available unlabeled target data through adversarial training. Specifically, a domain classification task is learned concurrently with the main downstream task, using unlabeled samples and their domain labels from both domains. By pushing the encoder to both minimize the event classification loss and maximally mislead the domain predictor, the resulting representation can be made indiscriminate with respect to the shift between the domains while remaining discriminative for the main learning task.
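The adversarial signal in DANN is commonly implemented with a gradient reversal layer, which acts as the identity in the forward pass and flips (and scales) the gradient in the backward pass. A minimal manual sketch (the class name and the scaling factor `lam` are illustrative, not a framework API):

```python
import numpy as np

class GradReverse:
    """Gradient reversal layer: identity forward, -lam * grad backward."""
    def __init__(self, lam: float = 1.0):
        self.lam = lam

    def forward(self, x: np.ndarray) -> np.ndarray:
        return x  # pass features through unchanged

    def backward(self, grad_output: np.ndarray) -> np.ndarray:
        return -self.lam * grad_output  # flip the domain-classifier gradient

grl = GradReverse(lam=0.5)
x = np.array([1.0, 2.0])
y = grl.forward(x)                         # identical to x
g = grl.backward(np.array([0.2, -0.4]))    # flipped: [-0.1, 0.2]
```

Because the encoder receives the negated domain-classification gradient, it is pushed to make the domains indistinguishable while the classifier tries to tell them apart.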
Finally, to demonstrate the ability of the adapter-based tuning approach to retain the original model's performance, we also evaluate a BERT+Adapter baseline. Following the recommendation of Pfeiffer et al. (2021), we augment a pre-trained BERT model by injecting a single bottleneck adapter module between the encoder's layers. The fine-tuning process then proceeds in the same manner as that of the BERT baseline, but only the parameters of the adapter modules are updated in this case.

Adapter-based Domain Representation
The pre-trained BERT model was previously optimized for the masked language modeling task in an unsupervised manner on several extremely large corpora. The diversity of these unlabeled texts also pushes the network toward being a good starting point for learning domain-invariant features, which would be lost if we fully fine-tuned it on the source domain task. Accordingly, we use a fixed pre-trained BERT model as the base for our adapters.

An adapter for each domain To explicitly create a shared-private representation subspace for each domain, we inject three adapters into the same base encoder. Formally, adapter modules a^s_l, a^t_l, a^j_l are added on top of each BERT layer. While these modules share the same architecture, they take as inputs data only from their corresponding sources. The joint adapter A^j produces our main representation, which is used by the event detection head h_c for the source classification task. On the other hand, the domain-specific adapters A^s and A^t are only used to help A^j find the optimal joint-domain space while simultaneously retaining good performance on the downstream task.
Adapter architecture: There are a variety of ways one can design the adapter modules' architecture. Following the observations of Pfeiffer et al. (2021), we choose the most efficient yet effective design: a single bottleneck neural network with a skip-connection, taking the features computed by BERT's feed-forward sub-layer as inputs. The adapter module in layer l can be decoupled into two parts: a down-sampling projection a^{dw}_l that maps the d_model-dimensional features to a bottleneck dimension c, and an up-sampling projection a^{up}_l that maps them back (Figure 1). Despite tripling the number of added adapter parameters, by setting c ≪ d_model, the amount that needs to be tuned is still less than 10% of the original network. Additionally, the factorized features enable effective adaptation by making use of the low-dimensional down-sampled representation, while also boosting classification performance by leveraging the free parameters of the up-sampling projection, as described in the next sections.
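A single adapter's forward pass can be sketched as follows (a NumPy illustration; the ReLU nonlinearity and the random initialization are placeholders, not necessarily the paper's exact choices):

```python
import numpy as np

def adapter_forward(h, W_dw, b_dw, W_up, b_up):
    """Bottleneck adapter with a skip-connection.

    h: (d_model,) features from BERT's feed-forward sub-layer.
    W_dw: (d_model, c) down-sampling projection, with c << d_model.
    W_up: (c, d_model) up-sampling projection back to d_model.
    Returns the adapted features and the bottleneck code z.
    """
    z = np.maximum(0.0, h @ W_dw + b_dw)  # down-sample + nonlinearity
    out = z @ W_up + b_up                 # up-sample back to d_model
    return h + out, z                     # skip-connection around the adapter

# Toy usage with BERT-base-like dimensions and the paper's bottleneck of 96.
rng = np.random.default_rng(1)
d_model, c = 768, 96
h = rng.normal(size=(d_model,))
out, z = adapter_forward(h,
                         rng.normal(size=(d_model, c)) * 0.02, np.zeros(c),
                         rng.normal(size=(c, d_model)) * 0.02, np.zeros(d_model))
```

The low-dimensional code `z` is what the layer-wise alignment later operates on, while the up-sampling projection restores the full feature dimension for the downstream task.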

Joint Representation Learning
To learn a joint representation that is as general as possible while also maintaining its discriminative property, we propose to combine two mechanisms with complementary effects: a layer-wise domain-adversarial (LDA) component and an adapter-wise domain disentanglement (ADD) component.

Domain-adversarial Training
LDA applies domain-adversarial training to the representation of A^j. Multiple refinements to the original DANN are introduced to mitigate its flaws and learn better domain-invariant features.
Dimension Reduction It is known that discriminative features computed by high-level layers usually lie on low-dimensional manifolds. As a result, naively applying adversarial training to BERT's representations, which require high dimensionality to capture contexts, can lead to the gradient vanishing problem. We leverage the adapter architecture to tackle this issue. Instead of the full-dimension outputs of the layers, we align domains based on the down-sampled versions of the representations, computed by a^{j,dw}_l. Consequently, an adapter module can be viewed as a two-step adaptation: a down-sampling projection step that extracts domain-invariant features, and a following up-sampling projection step that transforms the extracted general features into task-discriminative ones.
Layer-wise Alignment To enhance the alignment capability of our model, domain-adversarial training is applied to every layer's output. In particular, we incorporate the asymmetric relaxation of DANN (Wu et al., 2019), in which the per-layer domain classification loss L_{d,l} over a minibatch of size n is relaxed by a hyper-parameter β_l ≥ 0 controlling the maximal difference between the two marginal distributions (β_l = 0 recovers the original formulation). This modification addresses the target shift scenario, in which domain-adversarial training is unable to achieve the optimal solution. As outlined by Rogers et al. (2021), lower-level layers of BERT contain quite broad knowledge and thus encode a more random distribution when projected into label space. In contrast, higher-level layers are gradually more task-specific, effectively reducing the possible amount of label shift between the two domains. Therefore, we adopt a relaxation annealing strategy across layers and sum the per-layer objectives, L_d = Σ_l L_{d,l}, where each term on the right-hand side is a different relaxed domain classification loss computed by a separate domain classifier h^j_{d,l}.

Adapter-wise Domain Disentanglement
The role of the ADD component is to ensure the shared-private relationship among adapters. We want the joint adapter A^j to encode a shared representation space containing common information between domains and no domain-specific information, while the private adapters A^s and A^t should only accommodate distinct knowledge that belongs exclusively to their corresponding domains. Following the work of Liu et al. (2017a) and Bousmalis et al. (2016), an orthogonality constraint is imposed using the following similarity loss function:

L_s = ||(A^j_s)^T A^s_s||^2_F + ||(A^j_t)^T A^t_t||^2_F,

where ||·||_F is the Frobenius norm and A^{d1}_{d2} is a matrix whose rows are the outputs of adapter A^{d1} on inputs from domain d2. Minimizing L_s forces A^j into a complementary subspace with A^s and A^t, encouraging independence among the adapters and removing domain-specific noise that may contaminate the joint representation. However, whereas A^j is trained to be informative for the downstream classification, A^s and A^t are not constrained by any task, which could potentially lead to a trivial solution where the network learns to map each representation into the same orthogonal space with A^j while not having any expressive capability for its corresponding domain. To address this issue, we incorporate a self-supervised component, using the popular Masked Language Modeling (MLM) objective as our unsupervised task. The token predictor h_m : R^{d_model} → R^V (V is the vocabulary size) is shared between the source and target domains:

L_m = − Σ_{k=1}^{N_mask} log p_{h_m}(w_k | x_k),

where N_mask is the number of randomly masked input tokens, following the original procedure in Devlin et al. (2019). The benefit of adding the MLM component is twofold. On one hand, it serves as a constraint that keeps the representations of the domain-specific adapters informative. On the other hand, it also helps condition the joint adapter A^j on unsupervised knowledge from unlabeled target data, which can have a positive impact on the target domain's performance.
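The orthogonality constraint can be illustrated on batches of adapter outputs as follows (a NumPy sketch following the squared-Frobenius-norm form used by Bousmalis et al. (2016); the function name is illustrative):

```python
import numpy as np

def similarity_loss(A_joint: np.ndarray, A_private: np.ndarray) -> float:
    """Squared Frobenius norm of the cross-correlation between two
    batches of adapter outputs (rows are samples): ||A_j^T A_p||_F^2.
    Zero iff the two batches span orthogonal feature directions."""
    return float(np.sum((A_joint.T @ A_private) ** 2))

# Representations confined to orthogonal directions give zero loss ...
A_j = np.array([[1.0, 0.0], [1.0, 0.0]])
A_p = np.array([[0.0, 1.0], [0.0, -1.0]])
loss_orth = similarity_loss(A_j, A_p)      # 0.0

# ... while overlapping representations are penalized.
loss_same = similarity_loss(A_j, A_j)      # > 0
```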

Data Selection
Consider the Wasserstein-1 distance between the distributions P^s_X and P^t_X generating the source and target marginal representations, which can be written as:

W_1(P^s_X, P^t_X) = inf_{γ ∈ Π(P^s_X, P^t_X)} E_{(x, x′) ∼ γ} [ ||x − x′|| ],

where Π(P^s_X, P^t_X) is the set of all joint distributions with marginals P^s_X and P^t_X. There are several advantages to using this distance as the proxy for the data selection mechanism. First, the Wasserstein distance takes into account the geometry of the actual data distributions. Thus, it is intuitive to use it to evaluate the discrepancy between the marginal distributions and pick source samples that are geometrically close to samples from the target distribution. Furthermore, it has been proven that minimizing the KL-divergence, on which the LDA component is based to update A^j, also implies minimizing the Wasserstein distance between the corresponding distributions. Therefore, leaving out the most faraway samples based on this distance should provide a good necessary condition for LDA to achieve optimal alignment from the source to the target domain.
Approximate Wasserstein Distance Following the approximation of Shen et al. (2018), we employ a data selection head h_w to estimate the Wasserstein distance between the two representation distributions of A^s and A^t by maximizing the following empirical loss with respect to θ_w:

L_wd = (1/n_s) Σ_i h_w(A^s(x^s_i)) − (1/n_t) Σ_j h_w(A^t(x^t_j)).

For the above approximation to work, we need to enforce the Lipschitz constraint, forcing the hypothesis class of h_w to be 1-Lipschitz. Following Gulrajani et al. (2017), a gradient penalty L_gr is added to the loss, resulting in the overall estimation problem for the Wasserstein distance:

max_{θ_w} L_w = L_wd − λ_gr Σ_d L_gr(A^d),

where d ∈ {s, t} and λ_gr is a hyper-parameter.
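As a toy illustration of this estimation, consider a linear critic h_w(x) = x·w, for which the input gradient is exactly w, so the gradient penalty has a closed form (the linear critic and all names are illustrative simplifications, not the paper's feed-forward head):

```python
import numpy as np

def critic_objective(w, src, tgt, lam_gr=10.0):
    """Empirical Wasserstein estimate minus gradient penalty for a
    linear critic h_w(x) = x @ w. For a linear critic the input
    gradient is exactly w, so the Gulrajani-style penalty
    (||grad_x h_w(x)|| - 1)^2 reduces to (||w|| - 1)^2 in closed form."""
    estimate = np.mean(src @ w) - np.mean(tgt @ w)
    penalty = (np.linalg.norm(w) - 1.0) ** 2
    return estimate - lam_gr * penalty  # to be maximized w.r.t. w

# Toy usage: well-separated stand-ins for A^s and A^t representations.
rng = np.random.default_rng(2)
src = rng.normal(loc=1.0, size=(64, 4))
tgt = rng.normal(loc=-1.0, size=(64, 4))
w = np.full(4, 0.5)                   # ||w|| = 1, so the penalty is zero
obj = critic_objective(w, src, tgt)   # positive: domains are far apart
```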

Data Selection based on Wasserstein Distance
To avoid the negative transfer problem in the case of highly dissimilar domains, we propose a data selection mechanism based on the estimated Wasserstein distance. By minimizing the empirical distance using A^s and A^t, we find the representations that achieve the shortest transport distance between source and target samples. Then, a subset of ñ_s source samples is selected with the lowest h_w(·) scores, which correspond to the ñ_s shortest distances to the target domain. These source instances will be used by the joint adapter A^j, together with the target unlabeled data, to learn domain-invariant features in LDA.
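The selection step itself reduces to ranking source samples by their critic scores and keeping the lowest-scoring ones (a NumPy sketch; names are illustrative):

```python
import numpy as np

def select_sources(critic_scores: np.ndarray, n_keep: int) -> np.ndarray:
    """Keep the indices of the n_keep source samples with the lowest
    critic scores, i.e. those estimated closest to the target domain."""
    return np.argsort(critic_scores)[:n_keep]

# Toy usage: sample 3 has the highest transport cost and is left out.
scores = np.array([0.9, 0.1, 0.5, 2.0, 0.3])
kept = select_sources(scores, n_keep=3)  # indices [1, 4, 2]
```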

Alternating Minimization
Putting it all together, our final training objective is:

L = L_c + λ_d L_d + λ_w L_w + λ_s L_s + λ_m L_m,

where λ_d, λ_w, λ_s, λ_m are hyper-parameters that balance the importance of the corresponding losses against the main event detection loss L_c. Of the five terms on the right-hand side, the domain discrepancy losses (L_d and L_w) require optimization in different directions with respect to the added heads and the adapters, resulting in a min-max optimization problem. Previous works that made use of domain-adversarial training usually applied a gradient reversal layer to train the feature extractors. We find this approach to be unstable and to cause performance degradation. Following suggestions from Goodfellow et al. (2014) and Shu et al. (2018), we design an alternating minimization process that is compatible with our learning algorithm while also stabilizing domain-adversarial training. In the first stage, all parameters are fixed except those of the domain-adversarial heads and the data selection head. This step corresponds to estimating the corresponding distance functions between domains given the current representations. After updating k times (k is a hyper-parameter that controls the trade-off between computation and accuracy of the divergence estimations), a subset of the source minibatch is selected based on the approximated Wasserstein distance, which is then used for domain-adversarial training of the joint adapter in the next step. The second stage, while keeping the previously updated heads fixed, updates the rest of the model's parameters using standard gradient descent. All maximization problems of the discrepancy losses are converted into minimization using reversed domain labels.
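The alternating procedure can be illustrated on a toy one-dimensional problem, where the "features" are scalars and the critic is linear (everything here is an illustrative simplification of the actual min-max training, not the paper's algorithm):

```python
import numpy as np

def alternating_step(src, tgt, w, theta, k, lr=0.05):
    """One alternating-minimization step on a toy 1-D problem.
    Stage 1: with features fixed, take k ascent steps on the scalar
    critic w (crudely clipped as a stand-in for the Lipschitz constraint).
    Stage 2: with the critic fixed, shift the source features by theta
    to reduce the estimated distance. All names are illustrative."""
    for _ in range(k):  # estimate the divergence: critic objective is w * gap
        gap = np.mean(src + theta) - np.mean(tgt)
        w = np.clip(w + lr * gap, -1.0, 1.0)
    theta -= lr * w  # d/dtheta of w * mean(src + theta) is w
    return w, theta

rng = np.random.default_rng(3)
src = rng.normal(2.0, 1.0, 256)   # source features, mean ~2
tgt = rng.normal(0.0, 1.0, 256)   # target features, mean ~0
w, theta = 0.0, 0.0
for _ in range(300):
    w, theta = alternating_step(src, tgt, w, theta, k=5)
# theta drifts toward -2, pulling the shifted source mean onto the target's
```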
At test time, a new sample x_test goes through the trained joint adapter A^j to produce a domain-invariant representation A^j(x_test), which is then used by the prediction head h_c to produce the corresponding event label.

Experiments
We evaluate our model on two related tasks: event identification and event detection. Given a trigger word in the context of an event mention, the former is formulated as a binary classification problem in which the goal is to determine whether the trigger word expresses an event, while the latter is a multi-class classification task that requires the model to assign each prediction to one of the 34 pre-defined event types (including 1 negative type).

Datasets
The TimeBank dataset (Pustejovsky et al., 2003) is a fine-grained temporally annotated corpus of events and their positions and ordering in time. The texts of the dataset were chosen from a wide range of sources in the news media domain. Events are annotated in a binary manner.
The LitBank dataset (Sims et al., 2019) is a recently introduced corpus of literary events. The dataset contains excerpts from 100 literary works from the Project Gutenberg corpus. Labels for events are binary.

Unsupervised Domain Adaptation Setting
To formulate the unsupervised domain adaptation setting from the original dataset of each task, we split the target domain's documents into two parts at a ratio of 1 to 4: a training dataset without labels, which models have access to during learning, and a test dataset on which models are evaluated. For event identification, transfer experiments are performed in two directions: LitBank-to-TimeBank and the reverse, TimeBank-to-LitBank. In the event detection experiments, we combine samples from two closely related domains, nw and bn, to create a sizeable labeled source training dataset. Then, each of the other domains is considered the target domain of a single adaptation setting.

Implementation and Hyper-parameters
Our model leverages the pre-trained BERT-base model as the fixed foundation for all adapters, each of which has a down-sampled dimension of 96. All of the downstream heads are implemented as feed-forward networks with activation functions between layers. We train all models using a batch size of 150, which is composed of 90 source samples (60 of which are used for domain-adversarial training) and 60 target samples. The weights of the losses are chosen by a grid search over the range [0.01, 0.05, 0.1, 0.2, 0.5, 1, 5] using the bc domain as the development dataset. Every experiment is run 5 times with different random seeds, and performance is reported as the average result of the 5 runs.

Baseline
We compare the proposed model DAA with several baselines. In particular, for the task of event identification, we consider the performance of the domain-adversarial models implemented in Naik and Rosé (2020). For the event detection task, our baselines include the adaptation results of the BERT and BERT+Adapter models fine-tuned using only the source dataset, and finally BERT+DANN, which makes use of unlabeled target data through adversarial training.

Experimental Result
Event Identification The results of our event identification experiments are presented in Tables 2 and 3. In both settings, our proposed model DAA outperforms the naive implementation of domain-adversarial training on BERT by about 10 points in F1. We also note that high precision is observed for models transferring from LitBank to TimeBank, while the other direction yields high recall. This imbalance is caused by the extreme disparity between the two adaptation settings, which our model manages to address, significantly improving out-of-domain performance in both cases.

Event Detection

Table 1 showcases the results of our event detection experiment. The main conclusions from the table are: (1) The BERT baseline performs decently without using any mechanism to address the discrepancy between domains, owing to the generalization potential of large unsupervised pre-trained language models. However, naively adopting DANN for BERT has an adverse effect, notably reducing the performance of BERT+DANN on all target domains. This outcome is consistent with the results of Lin et al. (2020), further emphasizing the need for a compatible implementation of domain-adversarial training on BERT's representations. (2) The results of BERT+Adapter prove that the adapter-based tuning procedure is not only able to retain performance but also prevents over-fitting through capacity reduction, therefore performing better than the BERT baseline.

Ablation Study
To examine the effect of each proposed component individually, we perform an extensive ablation analysis of our DAA model by measuring the domain adaptation ability of each trained model with a single main component discarded (by setting the weight of its associated loss to 0), on ACE-05.
In Table 4, DAA-D, DAA-W, DAA-M, and DAA-S correspond to the performances of partial models with the domain-adversarial training, data selection component, self-supervised task, and orthogonality constraint removed, respectively. The results show that every incomplete model performs consistently worse compared to the full model. In particular, while in-domain performance is retained across settings, different domains experience varying degrees of reduction in target performance depending on their relation to the source domain. Notably, data drawn from the wl and un domains diverges substantially from the source domain. Therefore, components that address domain dissimilarity play important roles in improving adaptation capability, which is confirmed by the fact that models such as DAA-W and DAA-D have the lowest results.

Domain-adversarial Analysis
The central component of our architecture is undoubtedly LDA, whose responsibility is to ensure the joint adapter extracts domain-invariant features for classifying event triggers. Given the negative results of BERT+DANN, finding an appropriate way to implement domain-adversarial training for BERT is an important question. This section aims to demonstrate the effectiveness of our layer-wise implementation of DANN. We apply domain alignment to different portions of BERT. Specifically, we partition the 12 layers of the BERT-base encoder into 3 levels, Lower, Middle, and Upper, each corresponding to the 4 layers whose representations alone are used by domain-adversarial training. In addition, we present the results of Last and Up-Dim. The former is the original implementation, where only the last layer's output is aligned, while the latter is similar to our full model (Full) except that the representations with full dimension (768) are used instead of the down-sampled ones. Finally, No-Rel is the same as Full but with no relaxation. Table 5 showcases the results of this experiment. Overall, we observe performance degradation in all three partial adaptation settings. However, the changes vary across domains in each situation, probably stemming from the fact that adversarial training addresses different degrees of domain shift at each layer. Moreover, taking only the last layer's representation as input for the DANN component performs worse compared to all multi-layer counterparts. Notably, using representations with full dimension significantly reduces the model's out-of-domain performance. This result confirms the benefit of the bottleneck architecture: not only is the alignment of down-sampled representations more effective, but the free parameters of the up-sampling layers also increase the model's capacity for the main downstream task.

Domain Discrepancy Analysis
To verify the effect of our method on alleviating the negative impact of the domain shift problem on the learning process, we compare each model's performance in settings with varying shift magnitudes. Specifically, for each target domain, based on the learned Wasserstein distance between the two domains, we quantify the distance of each target domain sample (in the evaluation dataset) to the source dataset and group the samples into 2 disjoint sets: FAR, the 25% of target samples that are farthest from the source dataset, and CLOSE, the 25% that are closest. The domain adaptation performances on these sets for the 2 target domains bc and wl, together with the set of in-domain examples IN-DOM from the bn+nw domain, are provided in Table 6. When adapting to the bc domain, which has a low discrepancy to the source domain, the results for each setting show little variance, but we still observe the over-fitting of BERT, as the performance of the out-of-domain settings is lower than its in-domain score. Moreover, BERT+DANN is able to improve on the FAR set, but at the cost of degradation in the other two settings. In contrast, the negative effect of high discrepancy between domains is apparent in the case of the wl domain, as the gaps between the settings are all above 10 points. Notably, the results of BERT+DANN are lower than those of BERT, indicating that a naive implementation of DANN is not only unable to align the source and target domains but also causes negative transfer when trying to learn domain-invariant representations. On the other hand, in both cases, DAA is able to address the weaknesses of the baseline and improve the performance on FAR and CLOSE simultaneously.

Conclusion
We present a novel framework for ED in the UDA setting that effectively leverages the generalization capability of large pre-trained language models through a shared-private adapter-based architecture. A layer-wise domain-adversarial training process combined with Wasserstein-based data selection addresses the discrepancy between domains and produces domain-invariant representations. The proposed model achieves state-of-the-art results on several adaptation settings across multiple datasets.
In the future, we plan to extend our approach in several directions: (1) devising a method to incorporate the target domain's private adapter to further improve the model's out-of-domain performance; (2) adapting our framework to more general settings such as multi-source domain adaptation and domain generalization; and (3) extending our work to novel domains for ED (Trong et al., 2020).