Translation as Cross-Domain Knowledge: Attention Augmentation for Unsupervised Cross-Domain Segmenting and Labeling Tasks

Because Chinese has no word delimiters and no inflection to indicate segment boundaries or word semantics, Chinese text understanding is difficult, and the tagging goal of Chinese segmenting and labeling tasks demands substantial word-level semantic knowledge. However, in unsupervised Chinese cross-domain segmenting and labeling tasks, a model trained on the source domain frequently suffers from deficient word-level semantic knowledge of the target domain. To address this issue, we propose a novel paradigm based on attention augmentation that introduces crucial cross-domain knowledge via a translation system. The proposed paradigm enables the model attention to draw on the cross-domain knowledge indicated by the implicit word-level cross-lingual alignment between the input and its corresponding translation. In addition to this model, which requires cross-lingual input, we also build an off-the-shelf model that removes the dependency on cross-lingual translations. Experiments demonstrate that our proposal significantly advances the state-of-the-art results of cross-domain Chinese segmenting and labeling tasks.¹


Introduction
As a language that has no word delimiter or inflection to indicate segment boundaries or word semantics, Chinese increases the difficulty of segmenting and labeling tasks, which need to correctly identify segment boundaries in a sequence of characters and to thoroughly understand word meanings. In this paper, we regard the knowledge about segment boundaries and detailed word meanings as word-level semantic knowledge, and we aim to promote Chinese cross-domain segmenting and labeling tasks, where the paucity of such knowledge in the target domain is usually a severe problem and frequently undermines the performance of a cross-domain model. Existing work achieves cross-domain segmenting and labeling either by crafting domain-specific knowledge (Liu et al., 2014), which is inflexible and cumbersome to adapt to different domains, or by learning on unannotated data of the target domain via language modeling (Ye et al., 2019; Jia et al., 2019), which cannot comprehend detailed word-level semantics beyond grasping the general meaning of a sentence.

* Equal contribution. ¹ Our code is available at https://github.com/lancopku/Attention-Augmentation

Figure 1: Two typical examples where cross-lingual contexts derived from machine translation help improve segmenting and labeling tasks. The arrows highlight the cross-lingual word that each Chinese character pays most attention to. In the first case, the integrity of the out-of-domain word 制霉菌素 is indicated by its translated English word. In the second case, the translation of the ambiguous word 追求 is pursuit instead of pursue, which indicates the correct tag, noun.
To address these issues, the key is to find cross-domain knowledge that is easily accessible yet contains crucial word-level semantic knowledge. Motivated by previous studies showing that segmenting and labeling tasks can benefit machine translation (Chang et al., 2008; Wang et al., 2014; Niehues and Cho, 2017; Zaremoodi and Haffari, 2019), we speculate that, in turn, machine-translated sentences can reveal fundamental segmenting knowledge and help infer detailed word-level semantics.
The cross-domain knowledge inferred from translations can be illustrated by the two examples in Figure 1. For Chinese word segmentation, the word in blue from the target domain is initially over-segmented. However, a translated version of the input sentence strongly indicates the integrity of this word. The part-of-speech tag of the word 追求 is originally ambiguous since it can be both a noun (as in the given context) and a verb. Nonetheless, it corresponds to distinct English words depending on whether it is used as a noun or as a verb, so its translated counterpart hints at the correct label.
Motivated by the above observations, we propose a novel paradigm based on attention augmentation to introduce word-level cross-domain knowledge via cross-lingual translation. The proposed paradigm complements the input sentence with its cross-lingual translation and enables the model attention to draw on the word-level knowledge implicitly embodied in the alignment of the input sentence pair. It then incorporates cross-lingual masked language modeling to further strengthen the word-level alignment, evolving into masked attention augmentation. The enhanced alignment, in turn, helps boost segmenting and labeling tasks. To make our proposal more practical, we use this model to tag raw text in the target domains and reap abundant synthetic data, which elicits the originally implicit cross-domain knowledge implied by the word-level alignment, and then use the synthetic data to train an off-the-shelf model that relies on no translation inputs. Experimental results on a series of cross-domain segmenting and labeling datasets demonstrate that our model substantially advances the state-of-the-art results.
Our contributions are highlighted as follows: • We propose a novel paradigm of attention augmentation that addresses the deficiency of word-level semantic knowledge for Chinese cross-domain segmenting and labeling tasks via augmented translation.
• We leverage this paradigm to derive plentiful synthetic data and then train a new tagging model with the synthetic data, relieving the model from the dependency on translation and further improving the practicability.
• The proposed approach significantly advances the state-of-the-art results of cross-domain Chinese segmenting and labeling tasks without any human-annotated data.

Related Work
Some previous work improves domain adaptation in a single-task framework. Domain-specific lexicons (Liu et al., 2014) are usually adopted for cross-domain tasks. However, high-quality dictionaries for target domains are not always available. Recent work (Ye et al., 2019; Ding et al., 2020) uses raw text in the target domain to train word embeddings or construct word collections (Liu et al., 2014). To relieve the burdensome work of crafting domain-specific knowledge, some studies attempt to align different domains, including mapping entity label spaces (Daumé III, 2007), sharing hidden feature representations (Yang et al., 2017), aligning feature distributions (Ganin et al., 2016) and using adaptation layers (Lin and Lu, 2018). However, it is difficult to align different domains in an unsupervised way.
Increasing work turns to multi-task learning since related NLP tasks can boost each other in a joint-learning framework. Language modeling (LM) is a common auxiliary objective that has been shown to be beneficial for sequence tagging (Rei, 2017). A natural idea is to learn contextualized embeddings by masked language modeling on text from the target domain (Han and Eisenstein, 2019). Jia et al. (2019) consider unsupervised domain adaptation for Named Entity Recognition (NER) via cross-domain language modeling tasks. Zhao et al. (2018) propose to incorporate unlabeled data for Chinese Word Segmentation (CWS) by combining a segmentation model with language models. In addition to language modeling, some work jointly optimizes syntactic parsing and semantic parsing objectives (Niehues and Cho, 2017; Zaremoodi and Haffari, 2019). Liu and Zhang (2012) propose to use character clustering and self-training to jointly train the CWS and POS tagging tasks. Tian et al. (2020) use a two-way attention mechanism to incorporate both context features and their corresponding auto-analyzed syntactic knowledge for each input character, and train CWS and POS tagging jointly. Some extensions turn to a multilingual setting, for example leveraging NMT systems to help cross-lingual NER for low-resource languages (Jain et al., 2019) or exploiting multilingual signals for POS tagging (Snyder et al., 2008). However, existing methods are still inefficient in learning word-level semantic knowledge, which is essential for segmenting and labeling tasks.

Approach
In this section, we introduce our approach in detail. We suppose that an annotated source domain $S = \{(x_i, y_i)\}_{i=1}^{|S|}$ consisting of input-label pairs is available, and our goal is to tag the unlabeled inputs $T = \{u_i\}_{i=1}^{|T|}$ from the target domain.
Each instance $x_i$ in $S$ represents an input sequence, and its ground-truth label $y_i$ is also a sequence. We implement our approach based on BERT. In addition, in this work we consider the situation where the target domain shares the same label set as the source domain.

Attention Augmentation
In light of the fact that segmenting and labeling tasks can benefit the machine translation task, we speculate that the translation process implicitly incorporates the segmenting process and comprehends detailed word semantics. We therefore intend to exploit the word-level semantic knowledge embodied in the translation process, including segmenting knowledge and detailed word meanings. We enable the prevailing attention-based model to attend to both the input sentence and its translation counterpart to infer labels, resulting in an augmented attention pattern; this paradigm is abbreviated as Attention Augmentation (A$^2$).
In the paradigm of attention augmentation, the original input sentence and its cross-lingual translation compose a translation pair. This pair is fed into a self-attention based segmenting or labeling model for further processing, which in our case is BERT.
Owing to the self-attention mechanism, the model can not only attend to the original input but also its translated version to predict labels for the original input. For one thing, the implicit cross-lingual word-level alignment embodied in the translation pair indicates the knowledge about segment boundaries. For another, the original context and the cross-lingual context complement each other and help understand the detailed word meanings in a reciprocal manner if the monolingual context is insufficient to infer the accurate word semantics. Especially, these two kinds of word-level semantic knowledge are usually not covered by the original context in the source domain.
Concretely, given an input sequence $x_i$ from the source domain, we first obtain its translated version $t_i$ with an available cross-lingual translation model. The original input $x_i$ and its translation $t_i$ are then packed together into a single sequence $\tilde{x}_i$, separated by a special [SEP] token. With $\tilde{x}_i$ as input, the pre-trained model encodes $\tilde{x}_i$ through a number of blocks built on the self-attention mechanism and outputs a predicted label sequence $\hat{y}_i$:

$$\hat{y}_i = \mathrm{softmax}(W_{cls} h_i + b_{cls}),$$

where $h_i$ denotes the encoded representations of the input $\tilde{x}_i$, and $W_{cls}$ and $b_{cls}$ are learnable parameters. The model is then updated with the cross-entropy loss $L_{cls}$ between $\hat{y}_i$ and the ground-truth labels $y_i$ of the input sequence $x_i$:

$$L_{cls} = \mathrm{CrossEntropy}(\hat{y}_i, y_i).$$

Since the ground-truth labels of the translated sentence $t_i$ are unavailable, the outputs at the positions corresponding to the translated sentence are ignored during training.
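To make this concrete, the snippet below sketches how the packed translation pair and the position-masked loss could be implemented with a Hugging Face token-classification model. The tokenizer and model names, the BMES tag set, and the example sentence are illustrative assumptions rather than the authors' released code.

```python
import torch
from transformers import BertTokenizerFast, BertForTokenClassification

# Illustrative setup: mBERT with a BMES tag set for CWS (an assumption, not the paper's exact config).
tokenizer = BertTokenizerFast.from_pretrained("bert-base-multilingual-cased")
model = BertForTokenClassification.from_pretrained("bert-base-multilingual-cased", num_labels=4)
tag2id = {"B": 0, "M": 1, "E": 2, "S": 3}

def pack_translation_pair(chars, tags, translation):
    # Pack "[CLS] x_i [SEP] t_i [SEP]"; only positions of the original Chinese input keep real labels.
    enc = tokenizer(chars, translation.split(), is_split_into_words=True,
                    truncation=True, return_tensors="pt")
    labels = torch.full_like(enc["input_ids"], -100)          # -100 is ignored by the CE loss
    for pos, (word_id, seq_id) in enumerate(zip(enc.word_ids(0), enc.sequence_ids(0))):
        if seq_id == 0 and word_id is not None:                # first segment = original input x_i
            labels[0, pos] = tag2id[tags[word_id]]
    return enc, labels

# Hypothetical example: the translation makes the integrity of 制霉菌素 (nystatin) explicit.
chars = list("他们在使用制霉菌素")
tags = list("BESBEBMME")                                       # 他们 | 在 | 使用 | 制霉菌素
enc, labels = pack_translation_pair(chars, tags, "They are using nystatin")

out = model(**enc, labels=labels)                              # translated positions carry no supervision
out.loss.backward()
```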

Masked Attention Augmentation
In order to better align the representations of the source language and the translated language, we incorporate cross-lingual masked language modeling (Conneau and Lample, 2019) into our work and develop the masked Attention Augmentation approach (mA$^2$). We randomly mask some tokens in the concatenated sequence $\tilde{x}_i$ and encourage the model to reconstruct the masked tokens. Since tokens in both languages can be masked, the model can attend to tokens in the other language to predict the masked tokens in the current language, which enhances the alignment between the source language and the translated language. We simultaneously optimize the cross-lingual language modeling objective and the preceding tagging objective. An overview of mA$^2$ is illustrated in Figure 2.
To be precise, given a translation pair, 15% of the tokens in both the original input sentence and the translated sentence are chosen at random for prediction. Each chosen token is replaced with a [MASK] token 80% of the time, with a random token 10% of the time, and left unchanged the remaining 10% of the time, resulting in the corrupted input sequence $x_i^c = \mathrm{Mask}(\tilde{x}_i)$. $x_i^c$ is then fed into the tagging model for encoding. The hidden states of the last layer are used to reconstruct the masked tokens in addition to predicting labels. To this end, we add an additional transformation layer to predict the masked tokens in $x_i^c$. The masked language modeling objective $L_{mlm}$ is formulated as follows:

$$\hat{z}_i = \mathrm{softmax}(W_{mlm} h_i^c + b_{mlm}), \qquad L_{mlm} = \mathrm{CrossEntropy}(\hat{z}_i, \tilde{x}_i),$$

where $W_{mlm}$ and $b_{mlm}$ are learnable parameters of the additional transformation layer, $\hat{z}_i$ is the predicted sequence, $h_i^c$ denotes the encoded representations of $x_i^c$, $\mathrm{Mask}(\cdot)$ denotes the function that masks some percentage of the input tokens at random, and the loss is computed only over the masked positions. The overall loss $L$ of mA$^2$ comprises the classification loss for the tagging purpose and the masked language modeling loss:

$$L = L_{cls} + L_{mlm}.$$

Note that cross-lingual language modeling has no requirement for segmenting or labeling tags of the input, so the unannotated raw text $T$ of the target domain can be engaged just for cross-lingual language modeling. This means the paradigm of masked attention augmentation is able to exploit both the annotated data in the source domain and the unannotated data in the target domain.
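A possible realization of the masking scheme and the joint objective is sketched below, reusing the tokenizer and model from the previous snippet; the extra MLM head and the helper names are assumptions for illustration, not the paper's exact implementation.

```python
import torch

# Extra transformation layer predicting masked tokens from the last hidden states
# (an assumed minimal head; the paper does not specify its exact form).
mlm_head = torch.nn.Linear(model.config.hidden_size, len(tokenizer))

def mask_tokens(input_ids, mlm_prob=0.15):
    # Select 15% of non-special tokens in BOTH languages; 80% -> [MASK], 10% -> random, 10% -> unchanged.
    labels = input_ids.clone()
    special = torch.tensor(tokenizer.get_special_tokens_mask(
        input_ids[0].tolist(), already_has_special_tokens=True), dtype=torch.bool).unsqueeze(0)
    probs = torch.full(labels.shape, mlm_prob).masked_fill_(special, 0.0)
    chosen = torch.bernoulli(probs).bool()
    labels[~chosen] = -100                                     # reconstruct only the chosen positions
    to_mask = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & chosen
    input_ids[to_mask] = tokenizer.mask_token_id
    to_random = torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & chosen & ~to_mask
    input_ids[to_random] = torch.randint(len(tokenizer), labels.shape)[to_random]
    return input_ids, labels                                   # rest of the chosen 15% stay unchanged

def ma2_loss(enc, tag_labels):
    corrupted, mlm_labels = mask_tokens(enc["input_ids"].clone())
    out = model(input_ids=corrupted, token_type_ids=enc.get("token_type_ids"),
                attention_mask=enc["attention_mask"], labels=tag_labels,
                output_hidden_states=True)                     # L_cls on original-sentence positions
    mlm_logits = mlm_head(out.hidden_states[-1])
    l_mlm = torch.nn.functional.cross_entropy(
        mlm_logits.view(-1, mlm_logits.size(-1)), mlm_labels.view(-1), ignore_index=-100)
    return out.loss + l_mlm                                    # L = L_cls + L_mlm
```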

Attention Augmentation Rebooted
The proposed A$^2$ and mA$^2$ take a translation pair as input for training. As a result, they also require the translation process during the inference stage. To make the models more practical, we therefore enable them to be deployed in scenarios where a translation of the input is not available.
To achieve this, we take advantage of the well-trained model with masked attention augmentation to annotate the unlabeled inputs $T$ of the target domain, establishing synthetic training data for the target domain. This annotation process is expected to elicit the word-level cross-domain knowledge that is originally implicit in the translation pairs. We then use the synthetic training data, which absorbs this word-level semantic knowledge, to train a new tagging model. This new model, termed mA$^2$-reboot for short, is unchained from the translation input and can be easily deployed for regular inference. In addition, it can be initialized with the model fine-tuned on the source domain so that it is endowed with basic tagging knowledge, as we do in practice.
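The data flow of the reboot step can be summarized as below; `translate` and `ma2_predict` are placeholders standing in for the translation engine and the trained mA$^2$ tagger, so this sketch fixes only the pipeline, not a concrete API.

```python
def build_synthetic_corpus(raw_target_sentences, translate, ma2_predict):
    """translate: Chinese sentence -> its English translation (e.g. a call to an MT system);
    ma2_predict: (sentence, translation) -> tag sequence predicted by the trained mA^2 model."""
    synthetic = []
    for sent in raw_target_sentences:
        pseudo_tags = ma2_predict(sent, translate(sent))   # translation is needed only here
        synthetic.append((sent, pseudo_tags))              # the reboot model trains on (sent, tags) alone
    return synthetic

# The synthetic pairs are then used to fine-tune a fresh monolingual tagger
# (initialized from the model fine-tuned on the source domain); that tagger is the
# mA^2-reboot model, deployed without any translation input.
```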

Experiments
We conduct experiments on three Chinese cross-domain segmenting and labeling tasks, namely part-of-speech tagging (POS tagging), Chinese Word Segmentation (CWS) and Named Entity Recognition (NER). As most studies do, we conduct cross-domain experiments on datasets from different domains.

Named Entity Recognition We employ the MSRA dataset (Levow, 2006) and the People's Daily (PFR) dataset. Each dataset is used for evaluation with the model trained on the other dataset.

Tasks
Since the unannotated data of the target domain can be engaged just for cross-lingual language modeling, for these tasks we discard the labels of the target domain's training set and treat the remaining raw text as the unannotated target-domain data for cross-lingual language modeling.

Experimental Settings
The implementation of our approach is based on BERT-base-multilingual since it needs to process the cross-lingual context. In our work, the cross-lingual context is provided by translating Chinese into English, and we employ a commercial translation engine, i.e., Google Translation, to obtain high-quality translations. The learning rate is set to $2 \times 10^{-5}$ for the CWS and NER tasks, and to $5 \times 10^{-5}$ for POS tagging. We adopt the Adam optimizer (Kingma and Ba, 2015) and train all the tasks for 20 epochs. Other hyper-parameters follow the default settings of BERT-base-multilingual. For mA$^2$, half of the batches come from the source domain and are used for both sequence tagging and masked language modeling, while the other half come from the target domain and are used only for the masked language modeling task. We report the F1-score for all three tasks. For the Chinese POS tagging task, evaluation jointly considers word segmentation and POS tagging, in the same way as NER. We compare our proposal with some general baselines, i.e., BERT-base-Chinese, BERT-base-multilingual and XLM (Conneau and Lample, 2019), which incorporates multilingual pre-training. Besides, we include some state-of-the-art methods on target domain data for comparison.
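The batch schedule described above could look roughly like the loop below, reusing `ma2_loss` and `mask_tokens` from the earlier sketches; `source_loader`, `target_loader` and `mlm_only_loss` are assumed helpers rather than the authors' training script.

```python
import torch

def mlm_only_loss(enc):
    # Target-domain batches: cross-lingual MLM only, no tagging supervision.
    corrupted, mlm_labels = mask_tokens(enc["input_ids"].clone())
    hidden = model(input_ids=corrupted, attention_mask=enc["attention_mask"],
                   output_hidden_states=True).hidden_states[-1]
    logits = mlm_head(hidden)
    return torch.nn.functional.cross_entropy(
        logits.view(-1, logits.size(-1)), mlm_labels.view(-1), ignore_index=-100)

optimizer = torch.optim.Adam(
    list(model.parameters()) + list(mlm_head.parameters()), lr=2e-5)  # 5e-5 for POS tagging

for epoch in range(20):
    # Alternate batches: half from the source domain (tagging + MLM),
    # half from the target domain (MLM only), as described in the settings.
    for src_batch, tgt_batch in zip(source_loader, target_loader):
        optimizer.zero_grad()
        loss = ma2_loss(src_batch["enc"], src_batch["tag_labels"]) + mlm_only_loss(tgt_batch["enc"])
        loss.backward()
        optimizer.step()
```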

Overall Performance
The results of POS tagging, CWS and NER are presented in Table 1, Table 2 and Table 3, respectively. For POS tagging, our approach outperforms the previous state-of-the-art models on a total of 8 datasets. For CWS, our proposal surpasses the previous best-performing BERT-base-Chinese, achieving 89.7 (+4.3 improvement) on the DM dataset and 93.5 (+1.7 improvement) on the PT dataset. Consistent improvement is also observed on the NER task, where our best result outperforms BERT-base-Chinese by 1.46 points on average. BERT-base-Chinese, BERT-base-multilingual, XLM and our approach are general approaches that can be applied to both segmenting and labeling tasks, and the last three are multilingual approaches. We notice that these models outperform the conventional state-of-the-art models on most datasets. We conjecture that the large-scale pre-training involved in these approaches helps the model learn better contextualized word embeddings for some specific domains from heterogeneous raw text, resulting in a performance boost for cross-domain segmenting and labeling tasks. However, what sets our approach apart is that the multilingual context provides advantageous alignments that help learn segment boundaries and detailed word meanings. As a result, our approach brings further significant improvement compared to the other models.
In addition to the general methods, we also compare the proposal with some state-of-the-art methods for the individual tasks. For POS tagging, TwASP provides additional auto-analyzed syntactic features and alleviates the deficiency of word-level semantic knowledge to some extent. However, it still suffers from problems that a monolingual context is inherently unable to resolve. For CWS, DAAT relies on intricate strategies to build annotated datasets for the target domain. Besides the inefficiency of this process, a major downside is that such an annotation process can only identify limited out-of-domain terms with high reliability. We also implement some competitive baseline methods for the NER task. Compared to these models, our cross-lingual context provides hints at the word-level semantic knowledge required by the target task, resulting in better performance.

Quantitative Analysis
Analysis reveals that the superiority of our proposal is two-fold: (1) the word-level cross-lingual alignment discloses information about segment boundaries; (2) the cross-lingual context helps the model comprehend detailed word meanings.
For the first advantage, the word-level alignment embodied in the translation pair indicates the integrity of a segment, especially for out-of-domain words that are not covered by the source domain. Since all three Chinese tasks necessitate the basic segmentation process, we calculate the recall score of out-of-domain words in all datasets to analyze this benefit. We compare the strong baseline BERT-base-multilingual and the proposed mA$^2$, which is built on BERT-base-multilingual, to directly show the effectiveness. Results are reported in Figure 3. Taking the DM dataset from cross-domain CWS as an example, BERT-base-multilingual achieves a recall score of 59.5% for out-of-domain words while our proposal advances the score to 70.1%. Consistent improvement is observed on the other datasets, which suggests the cross-lingual word alignment helps identify the integrity of segments, especially for out-of-domain words in cross-domain tasks.

Regarding the second advantage, detailed word meanings can be comprehended more fully by combining the cues revealed by multilingual contexts. To be specific, the cross-lingual alignment helps disambiguation, since the ambiguity of a word or segment can be reduced when it is expressed in two languages. For the POS tagging task, an accurate understanding of detailed word meanings is crucial, especially for Chinese, where no inflection indicates the POS tags and ambiguity is common. We regard words with multiple possible labels as ambiguous words and report the error rate on these words in Figure 3(b). As shown, an error rate reduction is obtained when our approach is applied. Taking the BC dataset as an instance, 9,470 ambiguous words are initially labeled incorrectly by the baseline model. After introducing the cross-lingual context, this number is reduced by 27.98%, which confirms the effectiveness of our approach in comprehending detailed word meanings.
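For reference, the two diagnostics can be computed along the lines of the sketch below; it is a simplified, type-level formulation (the paper does not specify its exact evaluation script), and `source_vocab` and `tag_lexicon` are assumed to be derived from the source-domain training data.

```python
def oov_recall(gold_words, pred_words, source_vocab):
    # Recall of out-of-domain words: a gold word unseen in the source domain counts
    # as recalled only if it also appears intact in the predicted segmentation.
    oov = [w for w in gold_words if w not in source_vocab]
    if not oov:
        return 1.0
    predicted = set(pred_words)
    return sum(w in predicted for w in oov) / len(oov)

def ambiguous_error_rate(gold_tagged, pred_tagged, tag_lexicon):
    # Error rate on ambiguous words: words the source-domain lexicon maps to more
    # than one POS tag. Assumes gold and predicted sequences are word-aligned.
    errors = total = 0
    for (word, gold_tag), (_, pred_tag) in zip(gold_tagged, pred_tagged):
        if len(tag_lexicon.get(word, ())) > 1:
            total += 1
            errors += int(pred_tag != gold_tag)
    return errors / total if total else 0.0
```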

Qualitative Analysis
In addition to the preceding quantitative analysis, we instantiate the two aforementioned advantages and give an analysis from a qualitative perspective. Figure 4 shows the augmented attention heatmaps of two cases corresponding to the two advantages. Note that the words in red are incorrectly tagged by BERT-base-multilingual but are correctly recognized by our proposed mA$^2$. Figure 4(a) demonstrates a case where the cross-lingual alignment suggests the segmenting knowledge. Concretely, the out-of-domain word in red attends to its corresponding English translation, which strongly indicates its integrity as a whole word and further helps the model correctly infer the segmentation labels.
Concerning the second advantage, Figure 4(b) shows a case where cross-lingual alignment helps the model comprehend detailed word meanings via disambiguation. BERT-base-multilingual fails to predict the word 追求 as a noun since it also occurs frequently as a verb in natural Chinese text. However, the word 追求 corresponds to different English words depending on whether it is used as a noun or as a verb, and the attention heatmap reveals a strong alignment to the English word pursuit, which represents what 追求 means as a noun. Given this clue, our approach correctly tags the word as a noun, confirming that the cross-lingual context promotes the understanding of detailed word semantics and enables disambiguation.

Effect of Cross-lingual Language Modeling
As the cross-lingual language modeling is used to enhance the alignment embodied in the input translation pairs, we conduct an ablation study to verify its effectiveness. The models with and without cross-lingual language modeling are exactly mA$^2$ and A$^2$. The comparison results on the POS tagging, CWS and NER tasks are shown at the bottom of Table 1, Table 2 and Table 3, respectively. We observe a significant performance boost with cross-lingual language modeling involved on all the datasets, which verifies the effectiveness of the cross-lingual language modeling in our approach. During training, the model can attend to either the original context or the cross-lingual context to predict the masked tokens, which drives the model to grasp the alignment between the two contexts. A more explicit and accurate alignment strengthens the foregoing two advantages of our approach and results in better performance.

Effect of Translation System
The machine translation system contributes cross-domain knowledge and thus plays an important role in our approach. Here we explore the effect of the translation system. As regards the effect of the languages to translate into, please refer to the Appendix. Generally, translations of the input can be obtained from standard machine translation packages. Besides the Google Translation system we used, an alternative is to train a translation model from scratch with existing datasets. Here we use a customized model, i.e., a DynamicConv (Wu et al., 2019) ZH-EN translation model trained on WMT17 (Bojar et al., 2017), to conduct comparative experiments on CWS.
The results of using different translation models are shown in Table 4. We observe that our customized translation model is slightly inferior to the commercial translation engine. This is expected, because a better translation system provides a more accurate indication of word-level cross-domain knowledge. Nevertheless, both translation models outperform the baseline model by a large margin and the gap between them is negligible. We speculate that the existing general training set for a machine translation model already covers most of the knowledge that the commercial translation engine can provide. In other words, our approach can be easily deployed with existing datasets and customized models.

Extending to in-domain Tasks
Previous analysis points out that the proposal supplements information about both segment boundaries and detailed word meanings. Such information is also required in in-domain tasks if the source domain data itself is insufficient to tackle the segmenting and labeling problems. We therefore explore extending the approach to in-domain tasks.
Here we conduct experiments on the in-domain NER task and take it as an appraisal of this extension. We employ the same datasets used in the cross-domain NER setting but train and test the model within the same domain.
Experiments show some encouraging results. To be specific, our approach outperforms the best-performing model and achieves 94.39 (+1.30 improvement) and 96.29 (+0.56 improvement) F1-score on the MSRA and PFR datasets respectively. Detailed results are reported in the Appendix. The explanation for the improvement accords with our earlier analysis. We also notice that the average improvement narrows on in-domain datasets, as expected, because the deficiency of word-level semantic knowledge is not as severe as in cross-domain datasets and our proposal is mainly oriented to cross-domain tasks. Nevertheless, the proposal performs well in both settings, which indicates that it is also a promising approach for in-domain segmenting and labeling tasks.

Extending to other Languages
Although Chinese text processing shows a strong need for knowledge of segment boundaries and detailed word meanings, such knowledge is also required in tagging tasks of other languages, so another potential application of our approach is the tagging tasks of other languages. Here we choose English, a commonly used language, as the task language and conduct experiments on the English NER task as a trial, following the setting of Jia et al. (2019); detailed results are reported in Appendix C. Since the cross-lingual context can also indicate entity boundaries in English and help disambiguate word meanings to some extent, we observe improvement on the English NER task as well. This result suggests that extending our approach to the tasks of other languages is also a promising direction.

Conclusions
We propose a novel paradigm of attention augmentation that supplements cross-domain word-level knowledge via machine translation for Chinese cross-domain segmenting and labeling tasks. We also construct a model that is unchained from the dependency on translation. The proposed approach substantially advances the state-of-the-art results in Chinese cross-domain segmenting and labeling tasks without any human-annotated data, demonstrating the effectiveness of our proposal.

A Effect of the Target Language

Here we analyze the effect of the language to translate into. The results of different target languages are shown in Table 5. First, results with all three languages outperform the baseline model by a large margin, which verifies that introducing cross-lingual context is beneficial for our tagging task. Second, we observe a small performance difference across the three languages. English and French achieve comparable results, and Japanese performs slightly worse than the other two languages. We speculate that this is because Japanese also requires a segmentation process, so the segment boundary information provided by Chinese-Japanese alignment is not as explicit as that of the other two languages. Nevertheless, the proposed approach proves effective with all three languages in general.

B Evaluation on In-domain Chinese NER

Table 6 shows the comparison results on in-domain datasets for the Chinese NER task. As we can see from Table 6, our approach outperforms the competitive baseline models on both datasets in the in-domain setting. The results suggest that our proposal is also a promising approach for in-domain tasks.

C Evaluation on Cross-domain English NER
As extending the proposal to more languages is also a potential application direction, we conduct experiments on English cross-domain NER as a trial. The Chinese translation of the English input serves as the cross-lingual context. Table 7 compares our approach with the previous state-of-the-art methods on these datasets. As shown, our approach obtains significant improvement over the previous methods. The English NER task also necessitates knowledge of segment boundaries as well as detailed word meanings, and we assume that the cross-lingual context helps supplement such knowledge for English to some extent; therefore, the proposal can benefit this task. This result also indicates that extending the proposal to tasks in more languages is a promising direction.