APGN: Adversarial and Parameter Generation Networks for Multi-Source Cross-Domain Dependency Parsing

Thanks to the strong representation learning capability of deep learning, especially pre-training techniques with language model loss, dependency parsing has achieved a great performance boost in the in-domain scenario with abundant labeled training data for target domains. However, the parsing community has to face a more realistic setting where parsing performance drops drastically because labeled data exists only for several fixed source domains. In this work, we propose a novel model for multi-source cross-domain dependency parsing. The model consists of two components, i.e., a parameter generation network for distinguishing domain-specific features, and an adversarial network for learning domain-invariant representations. Experiments on a recently released dataset for multi-domain dependency parsing show that our model can consistently improve cross-domain parsing performance by about 2 points in averaged labeled attachment score (LAS) over strong BERT-enhanced baselines. Detailed analysis is conducted to gain more insights on the contributions of the two components.


Introduction
Dependency parsing aims to derive syntactic and semantic tree structures over input words (McDonald et al., 2013). Given an input sentence s = w_1 w_2 ... w_n, a dependency tree, as depicted in Figure 1, is defined as d = {(h, m, l) : 0 ≤ h ≤ n, 1 ≤ m ≤ n, l ∈ L}, where (h, m, l) is a dependency from the head word w_h to the child word w_m with the relation label l ∈ L.
Recently, supervised neural dependency parsing models have achieved great success, leading to impressive performance (Chen and Manning, 2014; Kiperwasser and Goldberg, 2016; Dozat and Manning, 2017; Li et al., 2019a), with recent parsers reaching a LAS of 95.03 on the standard Penn Treebank benchmark for English. To obtain competitive performance, supervised dependency parsing models rely on a sufficient amount of training data, which is inevitably dominated by several fixed domains. When the test data is sourced from similar domains, good performance can be achieved. However, performance can degrade significantly when the test data comes from a different domain with a large gap from the training domains. Thus domain adaptation for dependency parsing has been studied in a number of works (Koo et al., 2008; Yu et al., 2013; Sato et al., 2017; Clark et al., 2018; Li et al., 2020b). These works mostly focus on single-source cross-domain dependency parsing, assuming the training data is from a single source domain (Yu et al., 2013; Sato et al., 2017). In fact, multi-source cross-domain dependency parsing is a more practical setting, considering that several dependency parsing corpora from different domains have been developed (Peng et al., 2019). Intuitively, an effective exploitation of all these corpora can give better performance on the target domain than single-source domain adaptation.
Separating domain-invariant and domain-specific features is a popular way for domain adaptation to capture the similarity and discrepancy of different domains (Daumé III, 2007; Kim et al., 2016; Sato et al., 2017). Domain-invariant features indicate the shared feature space across domains, which has been widely adopted for knowledge transfer. Domain-specific features capture the differences between domains, which can be helpful if the domain gaps are accurately measured and effectively modeled. The learning of domain-invariant and domain-specific features is actually complementary because of mutual exclusivity, especially for single-source domain adaptation.
Although both single-source and multi-source settings can easily separate domain-invariant and domain-specific features via independent BiLSTMs, the relationships among domain-specific representations become more complicated as the number of source domains increases. Hence, how to model the relationships between different domain-specific features after a simple feature separation becomes more challenging for multi-source dependency parsing.
In this work, we for the first time apply adversarial and parameter generation networks (APGN) to multi-source cross-domain dependency parsing for extracting domain-invariant and domain-specific features. Experiments on a benchmark dataset show that our proposed model boosts parsing performance significantly, leading to averaged LAS improvements of 2 points over strong BERT-enhanced baselines. First, explorations on different unlabeled data sizes reveal that unlabeled data is a useful resource and its proper utilization further improves our model performance by a large margin. Then, we conduct in-depth analysis to gain crucial insights on the effect of the adversarial and parameter generation networks, finding that the two components are complementary and both capable of modeling short- and long-range dependencies. Finally, detailed comparative experiments on alternative domain representation strategies show that our distributed domain representation can accurately measure domain gaps and extract more reliable domain knowledge that benefits the dependency parsing task. We will release our code at https://github.com/suda-yingli/EMNLP2021-apgn to facilitate future research.

Baseline Model
In this work, we adopt the state-of-the-art deep BiAffine parser (Dozat and Manning, 2017) as our baseline model. Figure 2 shows the framework of the parser, which mainly contains four components, i.e., the input layer, encoder layer, MLP layer, and BiAffine layer.

Input layer. The input layer maps each word w_i into a dense vector representation x_i. First, we apply a BiLSTM to encode the constituent characters of each word w_i into its character representation rep^char_i. Then, we concatenate rep^char_i with emb^word_i as the input vector:

x_i = rep^char_i ⊕ emb^word_i
where emb^word_i is the pre-trained word embedding, and ⊕ denotes vector concatenation. In addition, we also use BERT to enhance our model: emb^word_i is simply replaced by the BERT representation rep^BERT_i.

Encoder layer. Following Dozat and Manning (2017), we employ a three-layer BiLSTM to sequentially encode the inputs x_0 ... x_n and generate context-aware word representations h_0 ... h_n. We omit the detailed computation of the BiLSTM due to space limitations.
MLP layer. The MLP layer uses two independent MLPs to obtain lower-dimensional vectors for each position 0 ≤ i ≤ n:

r^H_i = MLP^H(h_i);  r^D_i = MLP^D(h_i)
where r^H_i is the representation of w_i as a head word, and r^D_i as a dependent.

BiAffine layer. The score of a dependency i ← j is computed by biaffine attention:

score(i ← j) = [r^D_i ⊕ 1]^T W^b r^H_j

where the weight matrix W^b determines the strength of a link from w_j to w_i.
Parsing loss. The parsing loss for each position i is computed as a cross-entropy over all candidate heads, plus a relation label loss:

L_par(i) = − log ( e^{score(i←j)} / Σ_{0≤k≤n} e^{score(i←k)} ) − log p(l | i, j)

where w_j is the gold-standard head of w_i, and l is the corresponding gold relation label.
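To make the scoring concrete, the biaffine layer and the arc part of the parsing loss can be sketched in NumPy as follows. This is an illustrative sketch, not the authors' implementation; all dimensions and names are hypothetical, and the bias term is folded in by appending a constant 1 to the dependent vector.

```python
import numpy as np

def biaffine_arc_scores(r_dep, r_head, W_b):
    """Score every candidate arc i <- j with biaffine attention.

    r_dep:  (n, d) dependent representations r^D
    r_head: (n, d) head representations r^H
    W_b:    (d + 1, d) biaffine weight matrix; the extra input row acts
            as a bias term on the dependent side.
    Returns an (n, n) matrix S where S[i, j] scores the arc i <- j.
    """
    n = r_dep.shape[0]
    r_dep1 = np.concatenate([r_dep, np.ones((n, 1))], axis=1)  # append 1
    return r_dep1 @ W_b @ r_head.T

def arc_loss(scores, i, gold_head):
    """Cross-entropy parsing loss for position i (arc part only)."""
    logits = scores[i] - scores[i].max()              # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum()) # log-softmax over heads
    return -log_probs[gold_head]
```

The label term of the loss would use a second biaffine classifier over the relation set L in the same fashion.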

Proposed APGN Approach
The goal of multi-source cross-domain dependency parsing is to train a parser that generalizes well to the target domain given labeled training data from multiple source domains and unlabeled data from the target domain. The most straightforward approach is to train a parser on the concatenation of all source-domain training data. This can extract common features across different domains but fails to capture domain-specific knowledge. To address this issue, we propose an APGN approach for simultaneously modeling the discrepancy and commonality between different domains. As shown in Figure 3, our APGN model mainly contains two components, i.e., a PGN for distinguishing domain-specific features, and an adversarial network for learning domain-invariant representations.
In this section, we first give a detailed illustration of the parameter generation network, which takes a distributed domain embedding as input to alleviate the potential domain conflicts caused by a fixed one. Then, we introduce the adversarial network, which encourages the BiLSTM to extract purer shared information by fooling the domain classifier. Finally, we propose a new training strategy to make full use of all labeled and unlabeled data.


PGN

Jia et al. (2019) first propose the PGN to generate BiLSTM parameters based on fixed task and domain embeddings for NER domain adaptation, finding that the PGN can effectively capture domain differences. However, the vanilla PGN requires a cross-domain language model task as a bridge to help train the fixed domain embeddings. Considering the development of pre-training techniques with language model loss, as well as computational complexity, we first remove the language model from the PGN component and use pre-trained BERT to enhance model performance in the final experiments. Intuitively, each input word has its own domain distribution, so initializing all words with the same fixed domain embedding may lead to potential domain conflicts. We therefore improve the PGN by replacing the fixed domain embedding with a distributed one to more accurately integrate multi-domain information. As shown in the right part of Figure 3, our PGN takes distributed representations as inputs and dynamically generates the domain-related PGN-BiLSTM parameters.

PGN-BiLSTM encoder. To better capture domain-specific features, we employ a PGN-BiLSTM instead of a standard BiLSTM encoder. For convenience, we formalize the vanilla BiLSTM encoder as:

h_0 ... h_n = BiLSTM(x_0 ... x_n; V)

where V ∈ R^U can be regarded as a flattened vector containing all the BiLSTM parameters. Unlike a vanilla BiLSTM, which uses statically allocated parameters and updates them during training, the PGN-BiLSTM dynamically generates the BiLSTM parameters to reflect domain differences:

V^d = W ⊗ E
where ⊗ denotes matrix multiplication; W ∈ R^{U×D} is a parameter matrix to be trained; and E ∈ R^D is the distributed domain-aware sentence representation vector, explained below.
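The parameter generation step itself is a single matrix-vector product. A minimal NumPy sketch, with purely illustrative sizes U and D:

```python
import numpy as np

U, D = 12, 3                  # toy sizes: flattened-parameter dim, domain dim

rng = np.random.default_rng(42)
W = rng.normal(size=(U, D))   # trainable parameter matrix W in R^{U x D}
E = rng.normal(size=(D,))     # distributed domain-aware sentence vector

V_d = W @ E                   # sentence-specific flattened BiLSTM parameters
assert V_d.shape == (U,)      # V^d in R^U
```

In practice, V_d would be reshaped back into the individual BiLSTM weight matrices before running the encoder over the sentence.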
Distributed domain-aware sentence representation. The distributed domain-aware sentence representation vector can be regarded as a weighted sum of domain embeddings, where higher weights are expected to be assigned to domains more similar to the input sentence.
First, we compute the domain distribution probabilities of each word via a simple domain classifier:

z_i = softmax(MLP^dom(h^dom_i))

where h^dom_i is the representation vector of the i-th word generated by a separate standard BiLSTM.
Then, we compute a distributed domain-aware word representation vector for each word by aggregating domain embeddings according to the domain distribution:

e_i = Σ_{1≤j≤m} z_{i,j} · emb^dom_j

where emb^dom_j is the embedding vector of the j-th domain, and z_{i,j} is the probability of the i-th word belonging to the j-th domain.
Finally, we apply average pooling to obtain the distributed domain-aware sentence representation vector E, which is used to generate the BiLSTM parameters:

E = (1/n) Σ_{1≤i≤n} e_i
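The three steps above (per-word domain softmax, weighted domain embeddings, average pooling) can be sketched as follows. All names and sizes are illustrative, and a plain linear classifier stands in for the model's domain classifier:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))  # stable softmax
    return e / e.sum(axis=-1, keepdims=True)

def domain_sentence_repr(H_dom, W_cls, emb_dom):
    """H_dom:   (n, h) word vectors from the separate domain BiLSTM
    W_cls:   (h, m) weights of a simple linear domain classifier (assumed)
    emb_dom: (m, D) embedding table with one vector per domain
    Returns the (D,) distributed domain-aware sentence vector E.
    """
    Z = softmax(H_dom @ W_cls)   # (n, m): z_i, per-word domain distributions
    E_words = Z @ emb_dom        # (n, D): e_i, weighted domain embeddings
    return E_words.mean(axis=0)  # (D,): average pooling over the n words
```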
Domain classification loss. The domain classification module, shown in the right part of Figure 3, is trained by minimizing a standard cross-entropy loss:

L_dom = − Σ_{1≤i≤n} ẑ_i · log z_i
where m is the number of source domains (plus the target domain); n is the number of words in the input sentence; and ẑ_i is the gold-standard domain distribution vector, in which only the element corresponding to the domain the sentence comes from is 1.
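Since the gold distribution ẑ_i is one-hot, the loss reduces to the summed negative log-probability of the true domain over all words. A small sketch (function name hypothetical):

```python
import numpy as np

def domain_classification_loss(Z, gold_domain):
    """Z: (n, m) per-word domain probabilities z_i (rows sum to 1);
    gold_domain: index of the domain the sentence actually comes from.
    Computes L_dom = -sum_i z_hat_i . log z_i for one-hot z_hat_i."""
    eps = 1e-12                               # avoid log(0)
    return -np.log(Z[:, gold_domain] + eps).sum()
```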

The Adversarial Network
The goal of adversarial learning is to encourage the shared BiLSTM to extract domain-invariant features that are, as much as possible, not specific to any particular domain (Ganin et al., 2017). During training, we expect the BiLSTM to make it difficult for the domain classifier to correctly distinguish domain categories. The architecture of the adversarial network is shown in the left part of Figure 3. First, input words from different domains are encoded by the same standard BiLSTM. Before feeding the BiLSTM output h^inv_i to the domain classifier, h^inv_i goes through a gradient reversal layer (GRL). Following Ganin and Lempitsky (2015), the forward and backward propagations of the GRL are defined as:

GRL(h^inv_i) = h^inv_i;  ∂GRL(h^inv_i)/∂h^inv_i = −λ I

where λ is a hyper-parameter. After the GRL, the domain classifier is applied to identify the domain of each input word. Finally, the adversarial network is trained by minimizing the cross-entropy loss L_adv.
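The GRL has no parameters of its own: it is the identity in the forward pass and flips (and scales) gradients in the backward pass. A framework-agnostic sketch of this behavior:

```python
import numpy as np

class GradientReversalLayer:
    """Identity forward; multiply incoming gradients by -lambda backward."""

    def __init__(self, lam):
        self.lam = lam                        # the hyper-parameter lambda

    def forward(self, x):
        return x                              # GRL(x) = x

    def backward(self, grad_output):
        return -self.lam * grad_output        # gradient scaled by -lambda
```

With this layer in place, the domain classifier minimizes its loss as usual, while the reversed gradients push the shared BiLSTM toward features that are uninformative about the domain.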

Joint Training
In this work, we design a joint training strategy to make full use of all available training data, shown as Algorithm 1. In the first k iterations, mini-batches from the source domains and the target domain take turns to train the model. If the mini-batch comes from the source-domain labeled data, we jointly train the model with the parsing, adversarial, and domain classification losses. Otherwise, the model is trained with only the adversarial and domain classification losses. In this first stage, all data is used to separate domain-invariant and domain-specific features via the adversarial and parameter generation networks. In the second stage, after k iterations, only source-domain labeled data is used and the model is updated with the parsing loss until convergence, which helps deal with the overfitting problem of domain classification.

Algorithm 1: Joint training (S: source-domain labeled data; T: target-domain unlabeled data)
1: for each training iteration do
2:   if iteration ≤ k then
3:     Take turns to sample a mini-batch x from S and T.
4:     if x ∈ S then
5:       Accumulate loss L = L_par + αL_adv + βL_dom
6:     else
7:       Accumulate loss L = αL_adv + βL_dom
8:   else
9:     Sample a mini-batch x ∈ S
10:    Accumulate loss L = L_par
11:  Update parameters via minimizing L.


Experiments

Data. We use the Chinese multi-domain dependency parsing datasets released at the NLPCC-2019 shared task, containing four domains: one source domain, a balanced corpus (BC) from news-wire, and three target domains: product comments (PC) from Taobao, product blogs (PB) from Taobao headlines, and a web fiction named "ZhuXian" (ZX). Table 1 gives detailed data statistics. In this work, we pick one target dataset as the target domain, and the rest serve as source domains. For example, if the target domain is PC, the source domains are BC, PB, and ZX.

Evaluation. We use unlabeled attachment score (UAS) and labeled attachment score (LAS) to evaluate dependency parsing accuracy (Hajic et al., 2009).
Each model is trained for at most 1,000 iterations, and performance is evaluated on the dev data after each iteration for model selection. We stop training if the peak performance does not improve for 100 consecutive iterations.
Baseline models. To verify the effectiveness and advantage of our proposed model, we select the following approaches as our strong baselines.

Hyper-parameter Choices
We mostly follow the hyper-parameter settings of Dozat and Manning (2017), such as the learning rate and dropout ratios. The loss weights α and β are both set to 0.01. The domain embedding size is set to 8. The Chinese character embeddings are randomly initialized with dimension 100. For pre-trained word embeddings, we train word2vec (Mikolov et al., 2013). Preliminary experiments show that our model is insensitive to most of the above parameters, while the setting of the joint training iteration k has a larger impact on performance, as shown in the following results.
Joint training iteration k. Table 2 shows the results with different joint training iterations k on the dev data. Increasing k from 10 to 20 consistently improves performance on all domains, while performance drops significantly for k above 20. These results indicate that more joint training iterations not only increase model complexity but also make the model prone to overfitting the training data.


Main Results

Table 3 shows the final results and compares with multiple baselines on the test data. First, our proposed APGN model achieves the best results on all domains, demonstrating that the APGN is extremely useful for multi-source cross-domain dependency parsing. Second, comparing the results of ADE and PGN, we find that both the adversarial and parameter generation networks can capture useful information to improve parsing accuracy. Finally, although the performance of all models is clearly improved by utilizing BERT representations, our model still achieves consistently higher accuracy than the other baselines, further demonstrating the effectiveness of the proposed method.

Utilization of Unlabeled Data
Considering the lack of target-domain labeled data, we directly use unlabeled data for model training. For unlabeled sentences, the model discards the parsing loss and updates parameters with only the adversarial and domain classification losses. Figure 4 illustrates the influence of the target-domain unlabeled data size on the dev data. In each curve, we fix the size of the labeled data and incrementally add random subsets of unlabeled data. Since large-scale unlabeled data may lead to a sample imbalance problem, we randomly sample unlabeled data with ratios no larger than 1. On the one hand, using unlabeled data leads to consistently higher performance on all three domains, indicating that unlabeled data is an important resource that contributes to target-domain dependency parsing. On the other hand, the improvement in parsing accuracy levels off when the ratio reaches 0.75, showing that the APGN model can achieve its best performance with a suitable amount of unlabeled data.

Analysis
Ablation study. The results of the ablation study on the dev data are shown in Table 4, where "w/o Adv" and "w/o PGN" mean removing the adversarial network or the parameter generation network, respectively. We can see that removing any component from the APGN causes an obvious performance degradation. First, compared with "w/o two", "w/o PGN" further improves parsing performance, showing the usefulness and importance of the domain-invariant features generated by the adversarial network. Second, "w/o Adv" clearly achieves better performance than "w/o PGN", indicating that the parameter generation network is crucial. The reason may be that the parameter generation network correctly constructs domain relations and extracts practical domain-specific features, which is significant for dependency parsing. Finally and most importantly, our proposed APGN model achieves consistently higher accuracy than both "w/o PGN" and "w/o Adv", demonstrating that the two components are complementary.
Error analysis. Since the ablation study only gives overall performance trends, we conduct an in-depth error analysis to gain more insights on the contributions of the adversarial and parameter generation networks. We divide the gold-standard dependencies into seven subsets according to the absolute distance between the head word and the modifier word, and calculate the accuracy for each subset. The group with dependency distance 0 contains the words that take the pseudo node "root" as their head. Figure 5 compares the accuracy curves of the ADE ("w/o PGN"), PGN ("w/o Adv"), and APGN models with regard to dependency distance on the test data. First, parsing accuracy improves for all models as the dependency distance becomes smaller. The reason may be that contextualized information decays when the distance between two words is too large. Second, there is little difference between the ADE and PGN performances at the same dependency distance, indicating that the adversarial and parameter generation networks, as two typical feature extraction methods, have comparable capabilities in capturing short- and long-range dependencies. Finally, the APGN model achieves better performance than both the ADE and PGN models, demonstrating that the adversarial and parameter generation networks are complementary and can benefit from each other.

Comparisons on Alternative Domain Representation Strategies
Most previous works use a fixed domain embedding to indicate the domain of each input word (Jia et al., 2019; Li et al., 2019b). However, the fixed representation may lead to potential domain conflicts when a word belongs to multiple domains. As shown in Figure 6, each word has its own domain distribution, and it is difficult to describe all words with an explicit fixed representation. Hence, it is necessary to design a more accurate representation, named the distributed domain embedding, which can be regarded as a sum of the fixed domain embeddings weighted by the distributional probabilities.
Detailed comparative experiments are conducted to verify the effectiveness of the two domain representation strategies on various models, with results shown in Table 5. First, the APGN with fixed domain representations, as in Jia et al. (2019), achieves lower performance than the other models. The main reason may be that, without a cross-domain language model as a bridge, it is difficult for the PGN to model the relationships between different domains. Second, the APGN with distributed domain representations achieves the best performance among all models, revealing that the PGN with distributed domain embeddings can accurately measure domain similarity and significantly improve model performance. Finally, all models with the distributed domain representation outperform their counterparts with the fixed one by a large margin, demonstrating that the distributed domain representation helps reduce potential domain conflicts and extracts more reliable domain knowledge that benefits the parsing task.

Related Work
Domain adaptation has been extensively studied in many research areas, including machine learning (Wang et al., 2017; Kim et al., 2017), computer vision (Ganin and Lempitsky, 2015; Rozantsev et al., 2019), and natural language processing (Kim et al., 2016; Sun et al., 2020). Here, we first briefly review single-source domain adaptation research, and then give a more detailed account of studies on multi-source domain adaptation.
Single-source domain adaptation. Single-source domain adaptation assumes the training data comes from a single source domain. Due to the lack of target-domain labeled data, previous research mainly investigates unsupervised domain adaptation, attempting to create pseudo training samples by self-training (Charniak, 1997; Steedman et al., 2003; Reichart and Rappoport, 2007; Yu et al., 2015), co-training (Sarkar, 2001), or tri-training (Li et al., 2019c). However, selecting high-confidence samples remains a challenge.
Thanks to large-scale labeled web data released by the parsing community, recent works pay more attention to the semi-supervised scenario. Yu et al. (2013) give a detailed error analysis of cross-domain dependency parsing and address the ambiguous-feature problem. Sato et al. (2017) propose to separate domain-specific and domain-invariant features by applying adversarial learning to a shared-private model, but find little gain and even performance degradation, especially when the target-domain training data is small. Most recently, Li et al. (2019b) propose to leverage an extra domain embedding to indicate the domain source and achieve better performance on semi-supervised domain adaptation. In this work, we adopt the domain embedding method as our strong baseline.
Multi-source domain adaptation. Multi-source domain adaptation assumes the training data comes from multiple source domains. Many approaches focus on leveraging domain knowledge to extract domain-related features, thus boosting performance on the target domain (Daumé III, 2007; Guo et al., 2018; Li et al., 2020a; Wright and Augenstein, 2020). Zeng et al. (2018) design a domain classifier and an adversarial network to capture domain-specific and domain-invariant features, achieving good performance on machine translation. Guo et al. (2018) apply meta-training and adversarial learning to compute point-to-set distances as the weights of a multi-task learning network, leading to improvements on classification tasks.
As another interesting direction, Platanios et al. (2018) propose a parameter generation network that generates the parameters of the encoder and decoder from source and target language embeddings. Recently, a number of works attempt to use the parameter generation network to improve cross-domain or cross-lingual performance (Cai et al., 2019; Stoica et al., 2020; Jin et al., 2020; Nekvinda and Dusek, 2020). Particularly, Jia et al. (2019) propose to generate BiLSTM parameters based on task and domain representation vectors, leading to very promising performance on the cross-domain NER task.
Due to the limitation of annotated corpora and the essential difficulty of multi-source domain adaptation, such studies are still lacking for dependency parsing. Inspired by these prior works, we propose a novel approach to separate domain-invariant and domain-specific features via adversarial and parameter generation networks.

Conclusion
This work applies the APGN approach to multi-source cross-domain dependency parsing for the first time, obtaining better performance than multiple baselines, even when all models are enhanced with BERT representations. The ablation study reveals that the adversarial and parameter generation networks are both important and complementary in capturing domain-related features, which motivates a deeper analysis of the effectiveness of the two components. Based on the in-depth error analysis, we find that, despite local divergences, the domain-invariant and domain-specific features generated by the adversarial and parameter generation networks both have the power to model short- and long-range dependencies and can benefit from each other. Furthermore, detailed comparative experiments demonstrate that the distributed domain representation is extremely useful for reducing domain conflicts and accurately measuring domain similarity, thus extracting more reliable domain-specific features to boost parsing performance.

Acknowledgments
We thank our anonymous reviewers for their helpful comments. This work was supported by National Natural Science Foundation of China (Grant No. 61876116 and 62176173), and a Project Funded by the Priority Academic Program Development (PAPD) of Jiangsu Higher Education Institutions.