A Fine-Grained Domain Adaption Model for Joint Word Segmentation and POS Tagging

Domain adaption for word segmentation and POS tagging is a challenging problem for Chinese lexical processing. Self-training is one promising solution, which strives to construct a set of high-quality pseudo training instances for the target domain. Previous work usually assumes a universal source-to-target adaption to collect such a pseudo corpus, ignoring the varying gaps between the target sentences and the source domain. In this work, we start from joint word segmentation and POS tagging, presenting a fine-grained domain adaption method to model the gaps accurately. We measure the gaps by one simple and intuitive metric, and adopt it to develop a pseudo target-domain corpus based on fine-grained subdomains incrementally. A novel domain-mixed representation learning model is proposed accordingly to encode the multiple subdomains effectively. The whole process is performed progressively for both corpus construction and model training. Experimental results on a benchmark dataset show that our method gains significant improvements over a variety of baselines. Extensive analyses are performed to show the advantages of our final domain adaption model as well.


Introduction
Chinese Word Segmentation (CWS) and Part-Of-Speech (POS) tagging are two fundamental tasks for natural language processing (NLP) in Chinese (Emerson, 2005; Jin and Chen, 2008), serving as backbones for a number of downstream NLP tasks. Joint models of the two tasks can lead to better performance because the tasks are closely related and pipeline models suffer from the error propagation problem (Ng and Low, 2004; Zhang and Clark, 2008; Wang et al., 2011; Zeng et al., 2013; Zhang et al., 2018; Tian et al., 2020a), which can be alleviated in the joint architecture. Currently, joint CWS and POS tagging has achieved strong results with BERT inputs (Tian et al., 2020a,b). Our preliminary results show that the F1-score of joint POS tagging can be close to 95% when the training and test corpora both belong to a standard newswire domain. Unfortunately, this is not always the case in real applications. The performance might degrade dramatically when the source and target domains are highly different. Taking ZhuXian (an Internet novel) as an example, the same model can only obtain an F1-score of 89% for POS tagging according to our results.
This is a typical domain adaption problem for joint CWS and POS tagging. Self-training could be one promising solution (Inoue et al., 2018; Zou et al., 2019; Saito et al., 2020), as it can accomplish the goal fully automatically without any human intervention (Liu and Zhang, 2012). By using a source model to automatically label a large-scale raw corpus of the target domain, and then selecting a set of high-confidence pseudo-labeled instances as additional training data, we can obtain boosted performance on the target domain. The quality of the pseudo corpus is the key to success. For target sentences that are far from the source domain, the generated corpus based on them might be of extremely low quality (Shu et al., 2018; Zhao et al., 2019). Thus, these sentences must either be filtered out, resulting in a corpus biased against distant parts of the target domain, or be kept with considerable noise that degrades the overall target performance.
In this work, we suggest a fine-grained domain adaption method to alleviate the above problem of self-training. We define a simple and intuitive metric to measure the distance (gap) of a target sentence to the source domain. Based on the metric, we create a set of high-quality training corpora incrementally according to the distances of the target sentences to the source domain. Figure 1 shows the main idea. The process is conducted over several iterations in a progressive manner, where at each new iteration we add a small set of high-quality instances whose distances are only slightly larger than those of the previous iteration. Finally, we arrive at a training corpus that fully covers the target domain at various distances. At each iteration, we move only a little further along the distance axis, so the quality of the pseudo corpus can be largely ensured by the previous model. Through this fine-grained domain adaption, we obtain a training corpus of multiple types from different iterations, where each type differs from the others in both quality and input distribution. During the early iterations, the produced instances are likely of higher quality and close to the source domain, while in the later iterations, the quality might be lower and the distance to the source domain larger. To make full use of this corpus together with the source training set, we present a domain-mixed model for sophisticated representation learning to capture domain-aware and domain-invariant features (Daumé III, 2007; Ganin et al., 2016; Tzeng et al., 2017), which is also strengthened progressively by the incremental style of the fine-grained domain adaption.
We conduct experiments on the benchmark ZhuXian dataset to show the effectiveness of our method. In detail, the Penn Chinese Treebank version 6.0 (CTB6) (Xue et al., 2005) is used as the source corpus, belonging to the newswire domain, while the target ZhuXian corpus is from an Internet novel. Experimental results show that our fine-grained domain adaption is significantly better than previous self-training approaches. Moreover, we find that our domain-mixed representation learning model suits the fine-grained framework perfectly. We also conduct extensive analyses to understand our model comprehensively. We will release our code at github.com/JZX555/FGDA under the Apache License 2.0 to help reproduction.

Joint CWS and POS Tagging
This section describes the basic model of our joint CWS and POS tagging. Concretely, we regard the joint task as a character-level sequence labeling problem following Tian et al. (2020a). Given an input character sequence X = [x_1, ..., x_n], the output labels Y = [y_1, ..., y_n] are concatenations of word boundaries (i.e., BMES) and POS tags for all sentential characters. We exploit an ADBERT-BiLSTM-CRF model as our basic model, which is very strong in performance and highly parameter-efficient. The model includes two parts sequentially: (1) ADBERT for character representation, and (2) BiLSTM-CRF for feature extraction, label inference and training. Below, we introduce ADBERT directly; the BiLSTM-CRF is exactly the same as in Tian et al. (2020a), to which we refer readers for details.
Adapter-BERT We exploit BERT (Devlin et al., 2019) to derive character representations for a given sentence X = [x_1, ..., x_n], as it brings state-of-the-art performance for a range of Chinese language processing tasks. In particular, we patch BERT with adapters (Houlsby et al., 2019) inside all the included transformer units. In this way, fine-tuning BERT parameters is no longer necessary across different tasks, and we only need to tune the adapter parameters. More particularly, we let all adapters across different transformer units use a shared set of parameters to reduce the scale of tunable model parameters of our joint task. We refer to this method as ADBERT:

e_1, ..., e_n = ADBERT(X = [x_1, ..., x_n]),   (1)

where the detailed network of the transformer with adapters is illustrated in Appendix A.
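The adapter-sharing idea above can be sketched in plain Python. The sketch below is a minimal, dependency-free stand-in (toy dimensions, constant weights, ReLU non-linearity): the point is only that a single adapter object, i.e., one shared parameter set, is reused inside every transformer unit, so the tunable parameter count does not grow with depth.

```python
class Adapter:
    """A minimal bottleneck adapter: down-project, non-linearity, up-project,
    plus a residual connection. Dimensions and constant weights are
    illustrative, not the paper's actual values."""
    def __init__(self, hidden_size, bottleneck):
        self.W_down = [[0.01] * hidden_size for _ in range(bottleneck)]
        self.W_up = [[0.01] * bottleneck for _ in range(hidden_size)]

    def __call__(self, h):
        # ReLU(W_down h), then W_up back to hidden size, then residual add.
        down = [max(0.0, sum(w * x for w, x in zip(row, h))) for row in self.W_down]
        up = [sum(w * d for w, d in zip(row, down)) for row in self.W_up]
        return [x + u for x, u in zip(h, up)]

# The SAME adapter object (shared parameters) is applied in every mock
# transformer unit, mirroring the parameter sharing of ADBERT:
shared = Adapter(hidden_size=8, bottleneck=2)
h = [1.0] * 8
for _layer in range(12):  # 12 transformer units, one shared adapter
    h = shared(h)
```

With frozen BERT weights, only `W_down` and `W_up` would be trained.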

Our Method
The above joint CWS and POS tagging model performs well in the standard setting, where the test domain is similar to the training domain (Tian et al., 2020a,b). However, the performance might degrade dramatically when the test (i.e., target) domain differs significantly from the training (i.e., source) domain. Prior studies on cross-domain joint CWS and POS tagging (Liu and Zhang, 2012) have exploited self-training due to its effectiveness as well as its simplicity for domain adaption. Self-training aims to produce a set of high-confidence training instances of the target domain, which are used to train a target model. Here we follow this line of work, presenting a novel fine-grained domain adaption strategy. Fine-grained domain adaption is an extension of standard self-training, aiming to produce a helpful pseudo training corpus of the target domain. This line of work is essentially orthogonal to representation learning methods, which aim to learn sophisticated (e.g., domain-aware and domain-invariant) features for domain adaption. Thus, we also present a novel domain-mixed model based on the basic ADBERT-BiLSTM-CRF for effective exploitation of our fine-grained domain adaption. In the following, we first describe the fine-grained domain adaption method in detail, and then introduce our representation learning model.

Fine-Grained Domain Adaption
The overall flow of self-training includes three steps: (1) first, we train an initial model on the source corpus; (2) second, we apply the source model to a large-scale raw corpus, obtaining auto-labeled pseudo instances of the target domain; (3) finally, we select a set of high-confidence instances from the pseudo corpus, which are added to train the target model. The flow can be repeated over several iterations, where the model in step 1 is trained on the progressively added step-3 instances. However, according to our preliminary results, plain iterative self-training achieves only marginal improvements.
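The three steps can be written down as a small loop. This is a sketch with hypothetical stand-in callables (`train_model`, `confidence`) rather than the paper's actual components; it only shows the control flow of plain iterative self-training.

```python
def self_train(train_model, source_corpus, raw_target, confidence,
               threshold=0.8, iterations=3):
    """Plain iterative self-training. `train_model` returns a labeler,
    `confidence` scores a labeled instance; both are hypothetical stubs."""
    training_set = list(source_corpus)
    remaining = list(raw_target)
    model = None
    for _ in range(iterations):
        model = train_model(training_set)                     # step 1
        labeled = [(s, model(s)) for s in remaining]          # step 2
        selected = [(s, y) for s, y in labeled
                    if confidence(model, s, y) >= threshold]  # step 3
        training_set += selected
        remaining = [s for s in remaining
                     if all(s != t for t, _ in selected)]
    return model, training_set

# Toy demonstration with stub components: the "model" tags every word "N",
# and confidence is high only for short sentences.
train = lambda data: (lambda s: "N " * len(s.split()))
conf = lambda model, s, y: 1.0 if len(s) < 10 else 0.5
model, data = self_train(train, ["src one"],
                         ["tgt", "a very long target sentence"], conf)
```

Only the short target sentence passes the confidence threshold; the distant (long) one is never selected, illustrating the bias problem discussed above.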
The reason may be that the above process makes it difficult to ensure the quality of the selected instances, especially when the input target sentences are very distant from the source domain (Sohn et al., 2020). The step-1 models do not perform well on these sentences without any specialization. If these sentences are excluded because of their low quality, the final target model is trained on a biased corpus; if they are added into the target training corpus, considerable noise is introduced, degrading the overall performance. To address this problem, we propose a fine-grained domain adaption strategy to alleviate the influence of these large gaps during automatic corpus construction.
Concretely, we guide the iterative self-training by a specific, explicit distance metric. At each iteration, we add a set of high-confidence pseudo instances whose distances are only a little larger than those of the previous iteration. The sentences in each selection can be regarded as forming a special fine-grained subdomain of the target domain. In this way, the target model is gradually adapted to the distant sentences far away from the source domain, producing a higher-quality corpus at various distances. In contrast to direct source-to-target adaption, we adopt the OOV number (i.e., the number of newly-generated words that are out of the training vocabulary) as the distance measurement, which is highly simple and intuitive. We construct a set of high-quality automatic corpora by progressively moving from the zero/one-OOV target sentences to the large-OOV-number target sentences.
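The distance metric itself is a one-liner. The sketch below counts distinct OOV word types against the source vocabulary; whether the paper counts types or tokens is our assumption.

```python
def oov_distance(sentence_words, source_vocab):
    """Distance of a (segmented) target sentence to the source domain:
    the number of distinct words outside the source training vocabulary."""
    return len({w for w in sentence_words if w not in source_vocab})

# Toy source vocabulary; "修真" is a novel-domain word unseen in newswire.
vocab = {"我", "喜欢", "读", "书"}
d1 = oov_distance(["我", "喜欢", "修真"], vocab)  # one OOV word
d2 = oov_distance(["我", "读", "书"], vocab)      # fully in-vocabulary
```

Sentences are then bucketed by this distance, and each iteration admits only the next, slightly more distant bucket.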
Algorithm 1 shows the pseudo-code of fine-grained domain adaption. Initially, we set the first-iteration training dataset to the source corpus S, and then execute lines 3-16 repeatedly. First, we train a model M_i on the current-iteration training dataset T_i, and apply the model to the remaining raw corpus of the target domain, resulting in the auto-labeled corpus D̂_i, as shown at lines 3-4. Next, we conduct a lexicon-building process at line 5, which is later used for quality assurance: at each iteration, we collect a set of top-K confident word-POS pairs L_top-K by their weighted frequencies in D̂_i, which are added to the target lexicon L_tgt. Then the key part arrives at lines 6-15, where a new training dataset ST_i is selected, advancing the training corpus to T_{i+1}. We traverse all instances in D̂_i, and add the instances that satisfy C_oov, C_lex and C_conf together, where C_oov constrains the OOV number to control the distance to the source domain, and C_lex and C_conf ensure instance quality. Finally, at line 16, we remove the selected instances from the target domain corpus and start the next iteration.
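The selection step (lines 6-15 of Algorithm 1) can be sketched as follows. The exact form of each condition and the instance fields are our assumptions; the sketch only shows the conjunction of the three checks.

```python
def select_instances(auto_labeled, src_vocab, tgt_lexicon,
                     oov_limit, p_threshold):
    """One selection pass: keep instances satisfying C_oov, C_lex and C_conf.
    Each instance is a dict with predicted word-POS `pairs` and a model
    confidence `conf` (hypothetical field names)."""
    selected = []
    for inst in auto_labeled:
        oov_pairs = [(w, t) for w, t in inst["pairs"] if w not in src_vocab]
        c_oov = len(oov_pairs) <= oov_limit               # C_oov: distance control
        c_lex = all(p in tgt_lexicon for p in oov_pairs)  # C_lex: OOV pairs vetted
        c_conf = inst["conf"] >= p_threshold              # C_conf: model confidence
        if c_oov and c_lex and c_conf:
            selected.append(inst)
    return selected

lex = {("修真", "NN")}  # top-K word-POS pairs collected so far (toy)
insts = [
    {"pairs": [("我", "PN"), ("修真", "NN")], "conf": 0.9},
    {"pairs": [("御剑", "VV")], "conf": 0.95},  # OOV pair not in the lexicon
]
picked = select_instances(insts, {"我"}, lex, oov_limit=1, p_threshold=0.8)
```

Raising `oov_limit` across iterations reproduces the progressive zero-OOV-to-large-OOV schedule.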

Our Domain-Mixed Model
By fine-grained domain adaptation, we can obtain a training corpus of multiple types (i.e., S, ST_1, ..., ST_n in Algorithm 1, where n denotes the last iteration), where each type corresponds to a domain (i.e., S) or a subdomain (i.e., ST_i). Thus, the exploitation of the training corpus can be regarded as multi-source domain adaption (Zhang et al., 2015; Sun et al., 2015). To better explore the corpus, we propose a novel domain-mixed model to fully benefit from the fine-grained domain adaptation.
Our domain-mixed model follows a standard representation learning framework for domain adaption, which attempts to capture effective domain-aware and domain-invariant features. Figure 2 shows the overall architecture of the model, where two individual ADBERT-BiLSTM-CRF components are included, used for domain-aware and domain-invariant feature learning, respectively. Both feature learning modules are adapted at the ADBERT, and a shared BiLSTM-CRF is exploited across the two components. Below, we introduce the (sub)domain-aware and (sub)domain-invariant components in turn, and then describe overall inference and training.
The (Sub)Domain-Aware Component A major problem of our basic ADBERT-BiLSTM-CRF model is that it treats all (sub)domain types of our final training corpus equally. Here we take the (sub)domain types as inputs along with the sentences to derive domain-aware features. Concretely, we follow Jia et al. (2019) and Üstün et al. (2020), exploiting a Parameter Generator Network (PGN) on the adapter layers, which generates (sub)domain-aware parameters for the adapters inside the ADBERT.
We pack all parameters of the adapter layers into a single vector V by reshaping and concatenation, which can be perfectly unpacked in reverse for adapter calculation. As shown in Figure 2(a), we refer to ADBERT with PGN as PGN-ADBERT. Given an input sentence and (sub)domain type pair (X, dt), the overall calculation of the (sub)domain-aware character representations is formalized as follows:

V = Θ · e_dt,  e_1, ..., e_n = PGN-ADBERT(X, dt),   (2)

where Θ is a learnable parameter tensor of this component, e_dt is the (sub)domain type embedding, and PGN-ADBERT is a special case of ADBERT with the specified module parameters V. The resulting representations are then fed into the BiLSTM-CRF for our joint task.
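The parameter generation step amounts to a tensor contraction between Θ and the (sub)domain embedding. A minimal sketch, with toy dimensions and made-up weights:

```python
def pgn(theta, e_dt):
    """Parameter Generator Network as a contraction V = Θ · e_dt:
    theta has one row per packed adapter parameter, one column per
    embedding dimension. Shapes here are illustrative only."""
    return [sum(t * e for t, e in zip(row, e_dt)) for row in theta]

# Two toy (sub)domain embeddings (2-dim) and a Θ packing a 3-parameter adapter.
e_news = [1.0, 0.0]
e_novel = [0.0, 1.0]
theta = [[0.5, -0.5],
         [0.1, 0.3],
         [0.2, 0.2]]
V_news = pgn(theta, e_news)    # adapter parameters for the source domain
V_novel = pgn(theta, e_novel)  # adapter parameters for a target subdomain
```

Each (sub)domain type thus yields its own packed adapter vector V, which is unpacked back into the adapter weight matrices before the forward pass.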

The (Sub)Domain-Invariant Component
Domain-invariant features have been extensively investigated because of their generalization capability across different domains (Daumé III, 2007). Here we present a (sub)domain-invariant component to learn these general features across our source domain and the fine-grained target subdomains, in parallel to the (sub)domain-aware component. Figure 2(b) shows the architecture of this part. First, the character inputs X go through ADBERT, deriving the domain-invariant features e^iv_1, ..., e^iv_n; then we reconstruct the domain-aware features ē^dm_1, ..., ē^dm_n by specifying the input (sub)domain type dt, and these are fed into the BiLSTM-CRF for our joint task following our basic model. The domain-invariant features e^iv_1, ..., e^iv_n are learned in an adversarial manner (Ganin and Lempitsky, 2015; Ganin et al., 2016) via sentence-level (sub)domain type classification. We derive a sentence-level representation v by average pooling over these features, and then determine the (sub)domain type of the input sentence with a simple linear classifier. Note that we intentionally cheat the classifier to make v domain-irrelevant, aiming to obtain good domain-invariant features.
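The adversarial mechanism of Ganin and Lempitsky (2015) is usually implemented with a gradient reversal layer (GRL): identity in the forward pass, negated gradient in the backward pass, so the encoder is pushed to fool the (sub)domain classifier. A minimal, framework-free sketch of the two passes plus the average pooling mentioned above:

```python
def mean_pool(char_feats):
    """Sentence representation v: average pooling over character features."""
    n = len(char_feats)
    return [sum(f[d] for f in char_feats) / n for d in range(len(char_feats[0]))]

def grl_forward(features):
    """GRL forward pass: plain identity."""
    return features

def grl_grad(upstream_grad, lam=1.0):
    """GRL backward pass: the gradient flowing back to the feature extractor
    is negated (scaled by -lam), so minimizing the classifier loss MAXIMIZES
    it w.r.t. the encoder, yielding domain-irrelevant features."""
    return [-lam * g for g in upstream_grad]

v = mean_pool([[1.0, 2.0], [3.0, 4.0]])  # toy 2-char, 2-dim features
fwd = grl_forward(v)
bwd = grl_grad([1.0, -2.0], lam=0.5)
```

In an autograd framework this would be a custom function with these two behaviors; the paper's exact λ scheduling is not specified here.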
By design, the domain-invariant component tries to reconstruct and approximate the domain-aware component, since they share the same decoding part. We unite the domain-invariant features e^iv_1, ..., e^iv_n and the (sub)domain type dt to reconstruct the domain-aware features, which are then used for our joint task. The advantages of this design are that we can maximize the capacity of the domain-invariant features and further enhance the interaction between the domain-aware and domain-invariant features. Concretely, the reconstruction is implemented by a variational module with reparameterization (Kingma and Welling, 2014). Given the (sub)domain type dt and the character representation e^iv_i (i ∈ [1, n]), the domain-aware representation is calculated by:

μ_i, σ_i = BiAffine(e^iv_i, e_dt),  ē^dm_i = μ_i + σ_i ⊙ ε, ε ∼ N(0, I),   (3)

where we use BiAffine operations to generate a Gaussian distribution and then sample the domain-aware features ē^dm_i from the distribution.
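The reparameterization trick isolates the randomness in ε so the sample stays differentiable with respect to μ and σ. A minimal sketch (the BiAffine step that produces μ and σ is omitted; it would be a bilinear map of e^iv_i and e_dt):

```python
import random

def reparameterize(mu, sigma, rng):
    """Sample ē^dm_i = μ_i + σ_i ⊙ ε with ε ~ N(0, I). Gradients can flow
    through mu and sigma because the noise is an independent input."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]

rng = random.Random(0)
mu, sigma = [0.5, -1.0], [0.1, 0.2]
sample = reparameterize(mu, sigma, rng)

# At inference the noise can be dropped and μ used directly, as done later
# when decoding with the invariant component alone (ē^dm_i = μ_i):
deterministic = reparameterize(mu, [0.0, 0.0], rng)
```

Setting σ to zero recovers the deterministic decoding mode used in the analysis section.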

Inference and Training
We regard the (sub)domain-aware component as our major component, which outputs the final joint CWS and POS tagging results. The (sub)domain-invariant component is an auxiliary component that helps the learning of the major one. Intuitively, through an alignment between the major and auxiliary components, the learned features of our major component can be naturally decomposed into domain-aware and domain-invariant features.
Inference For inference, we use the (sub)domain types of S and ST_n (i.e., the last fine-grained subdomain type) to perform decoding for the source and target domains, respectively.
Training We exploit four optimization objectives for training, as shown in Figure 2: the joint CWS and POS tagging losses L_aware(X, dt, Y) and L_iv(X, dt, Y) of the two components, the adversarial loss

L_adv(X, dt) = log p_adv(dt|X),

which deceives the (sub)domain type classification, and the loss L_mse(X, dt), which minimizes the distance of the domain-aware features between our two components, leading to highly-resembled (aligned) character representations from the variational reconstruction. Further, we sum the four objectives together:

L = L_aware + L_iv + λ_1 · L_adv + λ_2 · L_mse,

resulting in the final objective of our domain-mixed model, where λ_1 and λ_2 are two hyperparameters.
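The final objective is just a weighted sum of the four per-batch loss values. A trivial sketch; which λ attaches to which auxiliary term is our assumption from the text, since the paper only states that λ_1 and λ_2 weight two of the objectives:

```python
def total_loss(l_aware, l_iv, l_adv, l_mse, lam1, lam2):
    """Final objective: the two CRF losses plus the weighted adversarial
    and alignment losses (weighting scheme assumed)."""
    return l_aware + l_iv + lam1 * l_adv + lam2 * l_mse

loss = total_loss(1.0, 0.8, 0.5, 0.2, lam1=0.5, lam2=1.0)
```

During training, λ_1 and λ_2 are annealed rather than fixed, as described in Appendix B.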

Datasets
We use the CTB6 dataset as the source domain (newswire), splitting it into training, development and test sections following Tian et al. (2020a). To verify the effectiveness of our proposed domain adaption method, we exploit the ZhuXian dataset as the target domain, which is an Internet novel and the only benchmark dataset for domain adaption of joint CWS and POS tagging. We strictly follow unsupervised domain adaptation, where only a test corpus is available for the target domain. Table 1 shows the data statistics, reporting the detailed sentence, word and character numbers. For the ZhuXian dataset, we use only the raw text and the test corpus.

Setting
Evaluation We adopt the standard word-level matching method to evaluate the performance of CWS and POS tagging. In particular, the joint strategy is used for POS tagging evaluation, considering word boundaries as well as POS tags as a whole. We calculate precision (P) and recall (R) values, and use their F1-score as the major evaluation metric. Considering that no development corpus is available for the target domain in a real scenario, we use the CTB6 development set to select the best-performing models.
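The joint word-level metric can be made concrete as follows: a predicted word counts as correct only if both its span and its POS tag match gold. This is a standard sketch of the metric, not the paper's evaluation script.

```python
def joint_f1(gold, pred):
    """Word-level joint CWS+POS F1. Items are (start, end, tag) triples
    over character offsets; a match requires span AND tag to agree."""
    correct = len(set(gold) & set(pred))
    p = correct / len(pred) if pred else 0.0
    r = correct / len(gold) if gold else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

gold = [(0, 2, "NN"), (2, 3, "VV"), (3, 5, "NN")]
pred = [(0, 2, "NN"), (2, 3, "AD"), (3, 5, "NN")]  # one tag wrong
f1 = joint_f1(gold, pred)
```

For CWS-only evaluation the tag field would simply be dropped from the triples.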
Hyperparameters All hyperparameters are set empirically according to previous studies as well as our preliminary findings (Tian et al., 2020a,b). Most importantly, our fine-grained domain adaption takes 12 iterations to reach its peak; the values of all other hyperparameters are described in Appendix B.

Main Results

Compared with the baseline model, self-training obtains very small performance gains (including iterative self-training), i.e., only close to 0.2%, which is insignificant. This result is inconsistent with the prior study, which shows large improvements from simple self-training. The main reason might be our strong baseline with BERT representations.

With fine-grained domain adaption, we can generate a higher-quality pseudo corpus. Therefore, the gains of the vanilla model over the baseline are very significant, with improvements of 0.78 and 0.93 for CWS and POS tagging, respectively, significantly better than the vanilla self-training systems due to the quality differences of the pseudo corpora. Using the final domain-mixed model, our fine-grained domain adaption can be improved further, leading to another improvement of 0.42 and 0.71 for CWS and POS tagging. These observations indicate that our method is highly effective for domain adaption of joint CWS and POS tagging.
We can see that our domain-mixed model helps normal self-training as well, showing the effectiveness of representation learning for domain adaption. We also compare our proposed domain-mixed model with the major component alone (Domain-PGN for short), the latter having been demonstrated to be effective in a different scenario (Jia et al., 2019). According to the results, Domain-PGN gives slightly better performance on CWS and POS tagging for both self-training and fine-grained domain adaption compared with the counterpart baseline. Our final domain-mixed model is much better, leading to significant performance increases on both tasks, especially in fine-grained domain adaption.
Interestingly, we find that our final model also brings better performance on the source CTB6 test dataset, unlike the self-training models, which can hurt the source performance to a certain extent. This finding indicates that our final model has strong practical value, since it enables one model to perform well on multiple domains.

Analysis
In this section, we conduct detailed experimental analyses for an in-depth understanding of our method.
The Exploration of BERT Our work exploits ADBERT instead of standard BERT fine-tuning. Here we examine the differences between them considering both performance and the size of trainable model parameters. Since ADBERT freezes all parameters of BERT, the number of trainable model parameters is reduced greatly. Table 3 shows the comparison results, where Finetuning indicates the standard BERT-CRF model with BERT parameters tunable, Adapter denotes the ADBERT model in which all adapters have separate parameters, and Adapter (shared) indicates our final ADBERT in which all adapters across different transformer layers share the same parameters. As shown, our final choice achieves performance comparable to the others with a much smaller number of trainable parameters; thus our final ADBERT is highly parameter-efficient.
The Instance Selection Strategy As described in Algorithm 1, we apply three conditions for instance selection at each iteration: C_oov, C_lex and C_conf. Here we conduct ablation experiments to check their necessity. Note that when C_oov is excluded, we select at most 2K instances at each iteration by probability, from high to low. Table 4 shows the results. As shown, all conditions are useful, and in addition, all results outperform the plain iterative self-training method. In particular, the model −C_oov −C_lex degrades into self-training with iterative adaption combined with the domain-mixed model. The comparison further demonstrates the advantage of our domain-mixed model.
The Size of the Pseudo Training Corpus It is interesting to compare fine-grained domain adaption with (one-iteration) self-training from the perspective of pseudo training dataset size. We align the iterations of fine-grained domain adaption with self-training by the size of the added training corpus from the ZhuXian domain. Figure 3 shows the comparison results. As shown, the performance of self-training hardly increases after 3K instances, while our fine-grained method gives significant improvements continually until iteration 12 (consuming a 20K corpus). The comparison shows that our fine-grained domain adaption is much more effective than self-training. However, our iterative fine-grained domain adaption needs more training time than non-iterative self-training.

Table 5: The results of the independent CWS task using our method on the ZhuXian dataset.
The Independent CWS Task Our major goal is joint CWS and POS tagging, but it is worth examining our method on the CWS task alone. Here we again use the CTB6 dataset as the source corpus and the ZhuXian dataset as the target domain; the basic model is exactly the same. Table 5 shows the final results. Our method achieves significant improvements on CWS alone, an increase of 1.09 (94.86 vs. 93.77), which means that our fine-grained domain adaption method is suitable for CWS as well. The other model tendencies are consistent with the joint task. Interestingly, we find that the independent CWS model has a lower improvement in recall. The reason may be that POS tagging provides several additional features, which lead the joint model to prefer more fine-grained segmentation, resulting in a larger recall value.
Domain-Aware vs. Domain-Invariant It is interesting to compare our (sub)domain-aware (PGN) and (sub)domain-invariant (VAR) components comprehensively. In fact, besides our integrated usage, the two components alone can also serve for domain adaption. PGN can be used directly for inference, while for VAR, we can perform decoding by setting ē^dm_i = μ_i in Equation 3. Here we analyze four models: PGN and VAR alone, and the integrated model performing inference with PGN (Final-PGN) and VAR (Final-VAR), respectively. All four models are trained on the same full training corpus (i.e., S + ST_1, ..., S + ST_1 + ... + ST_n, respectively and gradually). Figure 4 shows the results. As shown, PGN and VAR are actually comparable to each other, and in our final model, PGN is slightly better than VAR. We find that in our integrated model, both PGN and VAR are much better than when used alone, which shows the importance of the joint learning enabled by the carefully-designed L_mse.

The Sentential OOV Number Our fine-grained domain adaption is mainly advanced by the sentential OOV numbers with respect to the source training dataset. Thus, it is meaningful to examine the model performance on sentences with different OOV numbers. We divide the ZhuXian test dataset into four categories according to the OOV number per sentence: [0-1], [2-3], [4-5] and ≥6. All categories include a sufficient number of sentences for statistical comparison. Based on this division, we compare the performance of the fine-grained adaption, self-training and baseline models. Figure 5 shows the results. We can see that as the OOV number increases, the model performance decreases as a whole, which is reasonable. In addition, our final model can significantly improve the performance on sentences with higher OOV numbers.

Related Work
CWS and POS tagging are closely-related tasks for Chinese processing, which can be handled either jointly or in a pipeline (Ng and Low, 2004; Shi and Wang, 2007; Zhang and Clark, 2008; Jiang et al., 2008; Kruengkrai et al., 2009; Jiang et al., 2009; Sun, 2011). Joint models are able to obtain better performance, as they can alleviate the error propagation problem between the two tasks (Ng and Low, 2004; Zhang and Clark, 2008; Jiang et al., 2009; Wang et al., 2011). Recently, neural models have led to state-of-the-art results for joint CWS and POS tagging (Zheng et al., 2013; Shao et al., 2017; Zeng et al., 2013; Tian et al., 2020a). In particular, BERT representations (Devlin et al., 2019) and the BiLSTM neural network (Graves et al., 2013; Huang et al., 2015) have shown impressive results for the joint task (Zhang et al., 2018; Diao et al., 2019; Tian et al., 2020a,b). In this work, we adopt both BERT and BiLSTM to build a strong baseline for cross-domain adaption.

Domain adaptation has been extensively studied in both the machine learning and NLP communities (Daumé III, 2007; Ben-David et al., 2007; Søgaard, 2013; Zou et al., 2019; Saito et al., 2020). Typical methods of domain adaption fall into two main categories. The first category aims to create a set of pseudo training corpora for the target domain, while the second category attempts to learn transferable features from the source domain to the target. Self-training is one of the most representative methods of the first category (McClosky et al., 2006; Zou et al., 2019). For the second category, the representation learning of domain-specific and domain-invariant features has received the most attention recently (Glorot et al., 2011; Ganin et al., 2016; Tzeng et al., 2017; Long et al., 2017; Hoffman et al., 2018).
For the joint CWS and POS tagging task, Liu and Zhang (2012) and a later study investigate the task under the cross-domain adaption setting, both exploiting self-training. In particular, the latter suggests a lexicon-based type-supervised model for further enhancement, and publishes a benchmark dataset that is publicly available for cross-domain adaption of joint CWS and POS tagging. Since then, however, there has been no follow-up work on the joint task; the majority of studies focus on cross-domain adaption for the two individual tasks (Schnabel and Schütze, 2014; Peng and Dredze, 2016; Zhou et al., 2017; Gui et al., 2017; Ding et al., 2020). We propose a novel fine-grained domain adaption method with a domain-mixed representation learning model for the joint task.

Conclusion
We suggested a novel fine-grained domain adaption method for joint word segmentation and POS tagging. We started from the self-training strategy, which exploits a universal source-to-target transfer to generate pseudo training instances for the target domain, and argued that this strategy might lead to low-quality auto-labeled instances when the target sentences are distant from the source domain. To address the problem, we proposed fine-grained domain adaption, regarding the OOV number with respect to the source training corpus as the main advancing indicator to construct a higher-quality corpus progressively. In addition, we combined our method with the representation learning line of domain adaption, presenting a domain-mixed model for full exploitation of the produced training instances. We evaluated our method on the benchmark ZhuXian dataset using CTB6 as the source domain. The results showed that our method is highly effective, and our final model achieves significant improvements on the joint task.

A Transformer with Adapters

Figure 6 illustrates the internal network structure of the transformer unit in ADBERT. As shown, two adapter layers are inserted inside each transformer unit:

Adapter(h) = h + W_share^up · GeLU(W_share^down · h),

where W_share = {W_share^down, W_share^up} are the adapter parameters, which are much smaller than those of BERT in scale.
Here we further emphasize that when BERT is equipped with adapters, it can be regarded as static knowledge: all pretrained parameters are frozen for downstream tasks, so the BERT parameter values can be shared across these tasks.

B Hyperparameters
For the model part, we set all hidden sizes of the BiLSTM to 400, and the hidden sizes of all shared adapters to 192. We exploit the pretrained BERT-base-Chinese model (https://github.com/google-research/bert) for the character representations; thus the output dimension of the character representation is 768. The embedding of the domain type has a dimension of 50. For fine-grained domain adaption, the number K of high-confidence word-tag pairs in the top-K selection is set to 1000, and the probability threshold p_threshold is 0.8.

For training, we exploit online learning with a batch size of 16 to update the model parameters, and use the Adam algorithm with a constant learning rate of 2 × 10^-5 to optimize the parameters. Gradient clipping with a maximum value of 5.0 is adopted to avoid gradient explosion. We apply sequence-level dropout to the character representations to avoid overfitting, where the sequential hidden vectors are randomly set to zeros with a probability of 0.2. In particular, the two hyperparameters λ_1 and λ_2 in our overall training objective are auto-adjusted during training, annealed exponentially from 0 to 1 over the first 5,000 steps (Bowman et al., 2016).
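The annealing of λ_1 and λ_2 can be sketched as a simple schedule function. The exact curve shape below (a saturating exponential) is our assumption; the paper only states exponential annealing from 0 to 1 over the first 5,000 steps.

```python
import math

def annealed_lambda(step, total=5000, rate=5.0):
    """Exponential annealing from 0 toward 1 over the first `total` steps;
    `rate` controls how fast the curve saturates (assumed value)."""
    if step >= total:
        return 1.0
    return 1.0 - math.exp(-rate * step / total)

start = annealed_lambda(0)      # weight is 0 at the first step
mid = annealed_lambda(2500)     # partway through the warm-up
end = annealed_lambda(5000)     # fully annealed
```

The same schedule value can be applied to both λ_1 and λ_2, or each could have its own schedule; the paper does not distinguish.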