Target-Oriented Fine-tuning for Zero-Resource Named Entity Recognition

Zero-resource named entity recognition (NER) severely suffers from data scarcity in a specific domain or language. Most studies on zero-resource NER transfer knowledge from various data by fine-tuning on different auxiliary tasks. However, how to properly select training data and fine-tuning tasks is still an open problem. In this paper, we tackle the problem by transferring knowledge from three aspects, i.e., domain, language and task, and strengthening connections among them. Specifically, we propose four practical guidelines to guide knowledge transfer and task fine-tuning. Based on these guidelines, we design a target-oriented fine-tuning (TOF) framework to exploit various data from three aspects in a unified training manner. Experimental results on six benchmarks show that our method yields consistent improvements over baselines in both cross-domain and cross-lingual scenarios. Particularly, we achieve new state-of-the-art performance on five benchmarks.


Introduction
Named Entity Recognition (NER) is one of the fundamental tasks in natural language processing. Zero-resource NER has drawn more and more attention in recent studies (Täckström et al., 2012; Jia et al., 2019; Bari et al., 2020; Liu et al., 2020b; Wu et al., 2020a). In this setting, there is no labeled NER training data in the target domain or language. Therefore, zero-resource NER severely suffers from data scarcity.
As shown in Figure 1, the ideal training data for zero-resource NER is regarded as the Targets, which should satisfy two conditions at the same time: a) in the target domain or language, and b) annotated with NER labels.

Figure 1: The example of the Targets and essential knowledge from three aspects, i.e., Task, Language, and Domain. The middle rectangle denotes a Spanish NER example in the news domain, which is referred to as the Targets. The rounded rectangle above the Targets denotes knowledge from different tasks. The bottom left one denotes essential knowledge in the Spanish and English languages. The bottom right one denotes knowledge in the News and Twitter domains.

Thus it is intuitive to augment training data or transfer knowledge from three aspects, i.e., task, language, and domain. The aspect of domain/language can be divided into the source and the target, and the mainstream solution for zero-resource NER is transferring NER annotations from source domains/languages to target ones, e.g., from news to Twitter (Strauss et al., 2016) or from English to Spanish (Bari et al., 2020), where the former is referred to as cross-domain and the latter as cross-lingual.
Based on the mainstream approach, recent studies have conducted further exploration by fine-tuning contextualized word embeddings on different data. Their results show that exploiting only source labeled data for NER is not enough, due to the discrepancy of domain/language between the source and the target. To alleviate this problem, some studies focus on utilizing a large amount of target unlabeled data to transfer domain- or language-specific knowledge. AdaptaBERT (Han and Eisenstein, 2019) fine-tunes the masked language model (MLM) on unlabeled data in the target domain (e.g., social media). Both Pfeiffer et al. (2020) and Vidoni et al. (2020) add extra components to learn from unlabeled data in the target language (e.g., Spanish). Besides, Phang et al. (2020) apply non-NER labeled data in the source language (i.e., English) to transfer knowledge for cross-lingual NER, which suggests that annotations for non-NER tasks (e.g., MRC) are useful for the NER task. However, they only exploit non-NER annotations in the source language and ignore those in the target language (e.g., Spanish).
Though the aforementioned studies have improved the performance of zero-resource NER in cross-domain or cross-lingual scenarios, there are two main problems with these methods: a) they conduct knowledge transfer by considering only unlabeled target data and labeled source data, which is insufficient; in particular, they ignore the fact that labeled target data from non-NER tasks is available. b) They fine-tune contextualized word embeddings on various auxiliary tasks in a pipeline manner, where each task is performed only once. We argue that the fine-tuning procedure cannot capture enough knowledge from various data when trained only once; besides, it lacks effective strategies to approach the Targets more closely. To address these issues, we suggest it is necessary to exploit more diverse data and design strategies more oriented to the Targets. Therefore, we propose four practical guidelines on how to fully exploit available data to alleviate data scarcity. Concretely, we highlight the necessity of transferring knowledge from three aspects, i.e., task, language, and domain (Guideline-I). Then, for domain/language, we pay attention to the gap between the source and target data (Guideline-II). For task, we focus on the gap between non-NER tasks and NER (Guideline-III). Finally, we emphasize the importance of knowledge fusion between the target domain/language and the NER task (Guideline-IV). According to our proposed guidelines, we design a target-oriented fine-tuning (TOF) framework for zero-resource NER to approach the Targets. This framework applies three tasks (i.e., MLM, MRC, and NER) to capture knowledge from the above three aspects, and enhances the training with the MRC task, pseudo data, and continual learning, respectively. To validate the effectiveness and superiority of our approach, we conduct experiments on six popular benchmarks for zero-resource NER in cross-domain and cross-lingual scenarios.
Our contributions are summarized as follows: • We analyze the key factors of zero-resource NER and propose four practical guidelines to transfer knowledge from three aspects, i.e., Task, Language, and Domain, and to strengthen connections among them.
• We design a target-oriented fine-tuning (TOF) framework based on our guidelines to exploit more diverse knowledge and approach the Targets more closely.
• Experimental results verify the effectiveness of our method in both cross-domain and cross-lingual scenarios on six benchmarks. Particularly, our method achieves the state-of-the-art performance on five benchmarks.

Task Definition
The goal of zero-resource NER task is to transfer NER knowledge from labeled source data to unlabeled target data. Therefore, we assume that there are three kinds of data available for training: a) NER labeled source data, b) unlabeled target data, and c) non-NER labeled target data (e.g., MRC).

Basic Framework
Our method is built on AdaptaBERT, proposed by Han and Eisenstein (2019). This network is designed for unsupervised domain adaptation on sequence labeling tasks (e.g., NER). AdaptaBERT applies a two-step fine-tuning approach, which we describe in detail in this section.
Step-1: Domain Tuning. They fine-tune contextualized word embeddings by training a masked language model (MLM) to reconstruct randomly masked tokens. This is performed on a dataset containing all available target domain data and an equal amount of unlabeled source domain data.
Step-2: Task Tuning. They fine-tune contextualized word embeddings continually and learn the prediction model for the sequence labeling task. Following (Devlin et al., 2018), they build a strong NER system by simply feeding the contextualized embeddings into a linear classification layer. The log probability is computed by the log softmax,

\log p(y_t \mid w_{1:T}) = \beta_{y_t}^{\top} x_t - \log \sum_{y' \in Y} \exp(\beta_{y'}^{\top} x_t)    (1)

where the contextualized word embedding x_t captures information from the entire sequence w_{1:T} = w_1, w_2, ..., w_T, and \beta_y is a vector of weights for each tag y \in Y = {PER, ORG, LOC, MISC}. They train the model on labeled source domain data by minimizing the negative conditional log-likelihood of the labeled data.
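As a concrete illustration, the task-tuning classifier above can be sketched in a few lines of pure Python (a minimal sketch, not the authors' implementation; in practice x_t comes from BERT and training runs in a deep learning framework):

```python
import math

TAGS = ["PER", "ORG", "LOC", "MISC"]

def log_softmax(scores):
    """Numerically stable log softmax over a list of scores."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]

def tag_log_probs(x_t, betas):
    """log p(y_t | w_{1:T}) for one token embedding x_t: a dot product
    beta_y . x_t per tag, normalized with log softmax."""
    scores = [sum(b * x for b, x in zip(betas[y], x_t)) for y in TAGS]
    return dict(zip(TAGS, log_softmax(scores)))

def sequence_nll(embeddings, gold_tags, betas):
    """Negative conditional log-likelihood of the gold tag sequence,
    the quantity minimized during task tuning."""
    return -sum(tag_log_probs(x, betas)[y]
                for x, y in zip(embeddings, gold_tags))
```

Exponentiating the log probabilities of one token recovers a proper distribution over the four tags, and summing the gold-tag terms over the sentence gives the training loss.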

Our Approach
For zero-resource NER, we firstly analyze the problem of data scarcity. Then we propose four practical guidelines to guide knowledge transfer from different data, which is adapted to both cross-domain and cross-lingual scenarios. According to these guidelines, we design a target-oriented fine-tuning (TOF) framework for zero-resource NER.

Problem Analysis
The nature of the zero-resource NER task is to perform NER with no labeled target domain/language data. To deal with this task, it is intuitive to transfer essential knowledge from other available data. Concretely, when data satisfies two conditions at the same time: a) in the target domain/language (e.g., Twitter/Spanish), and b) annotated for the target task (i.e., NER), we consider it as our Targets. While the Targets is unavailable under the zero-resource setting, there is abundant data meeting either condition. Therefore, we transfer knowledge from three aspects, i.e., Domain, Language, and Task, as shown in Figure 1.

Domain. It contains knowledge from specific domains (e.g., Twitter). As shown in the bottom right rectangle of Figure 1, '@Garbriele Corno:' is a special expression that only exists in tweets, and '#' is used to highlight something.

Language. It refers to linguistic knowledge in various languages. For example, the word order of 'Como se conoce popularmente en Brasil al tenista' in Spanish is different from its English expression 'As the tennis player is popularly known in Brazil'. Besides, the expressions of 'Brazil' and 'tenista' in English vary from those in Spanish.
Task. It describes hand-crafted annotations for different tasks (e.g., NER and MRC), which are expensive and difficult to obtain. For example, the NER labels LOC and ORG denote names of locations and organizations, respectively. For the MRC task in Figure 1, 'W NJ' is annotated as the answer to the question 'Who has the most to lose?'.
Furthermore, we divide domain/language aspect into the source and target. Particularly, NER is regarded as the target task for zero-resource NER.

Four Practical Guidelines
Based on our analysis, we propose four practical guidelines on how to fully exploit available knowledge to alleviate data scarcity.
Guideline-I: It is necessary to exploit available knowledge from domain, language, and task.
Guideline-II: Bridge the gap between source domains/languages and target domains/languages.
Guideline-III: Bridge the gap between annotations for non-NER tasks and NER task.
Guideline-IV: Fuse the knowledge of both the target domain/language and NER task.

Target-Oriented Fine-tuning Framework
As shown in Figure 2, we design a Target-Oriented Fine-tuning (TOF) framework for zero-resource NER. It contains two components: a) Knowledge Transfer, which displays how to transfer not only domain/language but also task knowledge from various data, and b) Fine-tuning Process, which demonstrates a flow diagram of the complete finetuning process. Both components are designed based on our proposed guidelines, and their relations are illustrated in Figure 2.

Knowledge Transfer
As the right part of Figure 2 shows, to transfer both domain/language and task knowledge for the Targets, we consider six kinds of corpora: a) unlabeled NER dataset D_{t,no}, b) unlabeled NER dataset D_{s,no}, c) labeled MRC dataset D_{t,m}, d) labeled MRC dataset D_{s,m}, e) labeled NER dataset D_{t,n}, and f) labeled NER dataset D_{s,n}, where {a), c), e)} are in the target domain/language and {b), d), f)} are in the source. Note that e) is exactly the Targets, and a) is the Targets without considering labels.
According to Guideline-I, since there is no available data that satisfies the Targets, it is necessary to transfer knowledge relevant to the Targets from other data as much as possible. Apart from source NER labeled data, we not only exploit unlabeled target data, but also utilize non-NER labeled target data. Therefore, three kinds of data are considered, as shown in Figure 2: for 'Target Data', a) unlabeled NER dataset D_{t,no} and c) labeled MRC dataset D_{t,m}; for 'Source Data', f) labeled NER dataset D_{s,n}.

According to Guideline-II, there is a discrepancy between the source and target domain/language. To deal with this gap, it is essential to apply fine-tuning tasks on a mixture of the source and target data. Besides, an effective way to bridge the gap is transforming source data into the target format, e.g., translating source-language data into the target language. Therefore, we also collect b) unlabeled NER dataset D_{s,no} and d) labeled MRC dataset D_{s,m} as 'Source Data'.

Fine-tuning Process
Based on AdaptaBERT, we introduce an MRC task between the domain-tuning and task-tuning processes. Thus, our fine-tuning process contains three fine-tuning tasks, as follows.
Masked Language Model (MLM). To adapt contextualized word embeddings to both the source and target data, we use MLM (Devlin et al., 2018).
Based on Guideline-II, we train the model on a mixture of the datasets D_{t,no} and D_{s,no}. We use the same strategy as (Han and Eisenstein, 2019) to generate 10 random maskings for each instance.
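The masking step can be sketched as follows (a simplified sketch: real BERT masking also uses random/keep replacements and subword tokens, which we omit here; the names are ours):

```python
import random

MASK_TOKEN = "[MASK]"
MASK_PROB = 0.15      # standard BERT masking rate
NUM_VARIANTS = 10     # 10 random maskings per instance, as in AdaptaBERT

def mask_once(tokens, rng):
    """Mask ~15% of tokens; return the masked sequence and the
    positions the MLM must reconstruct."""
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < MASK_PROB:
            masked.append(MASK_TOKEN)
            targets[i] = tok
        else:
            masked.append(tok)
    return masked, targets

def make_variants(tokens, seed=0):
    """Generate 10 independently masked copies of one sentence."""
    rng = random.Random(seed)
    return [mask_once(tokens, rng) for _ in range(NUM_VARIANTS)]
```

Each sentence thus contributes ten differently masked training instances, which exposes the MLM to more reconstruction targets per example.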

Machine Reading Comprehension (MRC). Based on Guideline-III, we add a span extraction MRC task, which has three advantages: a) MRC can enhance the ability of the NER model on span extraction and help NER better capture semantic information of different entity types; b) the MRC framework can be used to solve the NER task (Li et al., 2020), so it becomes a bridge between NER and other tasks; and c) recent work on framing other tasks as MRC (Liu et al., 2020a) provides an idea for transferring knowledge from different tasks within a unified framework. The MRC model is implemented by feeding the contextualized word embedding of each token x_t into two linear classification layers, respectively. The probability of each token being the start or the end index of a span is computed as follows:

p_t^{start} = softmax(W_{start}^{\top} x_t),  p_t^{end} = softmax(W_{end}^{\top} x_t)

where W_{start} and W_{end} \in R^{d_1 \times 2} are learnable parameters, and d_1 denotes the dimension of the contextualized word embedding. Finally, the model is trained by optimizing the cross-entropy loss over p_t^{start} and p_t^{end}. According to Guideline-II, we train the MRC model on all available MRC data D_{t,m}, D_{s,m} and the NER data D_{s,n} that is transformed into MRC format D_{s,nm} following (Li et al., 2020).
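A minimal sketch of the per-token start/end classification and a simple greedy span decoding (illustrative only; in the real model W_start and W_end are learned jointly with BERT, and decoding may differ):

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def boundary_prob(x_t, w):
    """2-way classification for one token: w maps the d1-dim embedding
    x_t to (not-boundary, boundary) logits; return p(boundary)."""
    logits = [sum(wi * xi for wi, xi in zip(row, x_t)) for row in w]
    return softmax(logits)[1]

def decode_span(start_probs, end_probs):
    """Greedy decoding: best start index, then best end at or after it."""
    s = max(range(len(start_probs)), key=start_probs.__getitem__)
    e = max(range(s, len(end_probs)), key=end_probs.__getitem__)
    return s, e
```

Applying `boundary_prob` with W_start and W_end over all tokens yields the two probability sequences that `decode_span` turns into an answer span.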
Named Entity Recognition (NER). To fine-tune contextualized word embeddings continually and learn the prediction model, we feed contextualized word embeddings into a linear classification layer and maximize the probability of each token with the ground-truth entity label. Concretely, given an input token sequence with N words, we firstly feed it into the feature encoder f_θ to obtain contextualized word embeddings

h_1, h_2, ..., h_N = f_θ(x_1, x_2, ..., x_N)

where h_i is the feature vector corresponding to the i-th token x_i, and f_θ is based on a pre-trained language model, i.e., BERT (Devlin et al., 2018), with θ denoting the model parameters. Then h_i is fed into a linear classification layer with the softmax function to predict the probability distribution of entity labels, which is formulated as follows:

p(ŷ | x_i) = softmax(W h_i + b)

where ŷ ∈ Y with Y being one-hot vectors corresponding to different entity labels, and {W, b} denotes learnable parameters. The loss function is defined as the cross entropy between the predicted probability distribution of each entity label and the ground-truth one for each word. We train the NER model on D_{s,n} and predict labels on D_{t,no}.

Training
A novel training process is proposed to narrow the gap between the knowledge from available data and the Targets, which contains three processes, i.e., MRC enhancing, pseudo data enhancing, and continual learning enhancing.

MRC Enhancing. We fine-tune contextualized word embeddings by sequentially training the MLM f(·, θ_mlm), MRC g(·, θ_mrc), and NER h(·, θ_ner) models at Step-1∼3 in Figure 2.

Pseudo Data Enhancing. According to Guideline-IV, we use the trained NER model (Step-3) to generate pseudo labels on the unlabeled target NER data, yielding D̃_{t,n} (Step-4), and then fine-tune the NER model h(·, θ_ner) continually on the generated pseudo-labeled target data at Step-5.
Algorithm 1 The training process of TOF.
Input: Datasets D_{t,no}, D_{s,no}, D_{t,m}, D_{s,m}, D_{s,n}, and D_{s,nm}; MLM f(·; θ_mlm); MRC g(·; θ_mrc); NER h(·; θ_ner); pre-trained BERT θ^(0); number of pseudo-data iterations T.
Output: h(·, θ_ner^(T)).
1: Initialize θ_mlm = θ^(0)
2: Fine-tune f(·, θ_mlm) on {D_{t,no}, D_{s,no}}
3: Initialize θ_mrc = θ_mlm
4: Fine-tune g(·, θ_mrc) on {D_{t,m}, D_{s,m}, D_{s,nm}}
5: Initialize θ_ner = θ_mrc
6: Fine-tune h(·, θ_ner) on {D_{s,n}}
7: Gen pseudo-NER D̃_{t,n} ← h(·, θ_ner) on D_{t,no}
8: Initialize θ_ner^(0) = θ_ner
9: for i = 1 to T do
10:   Fine-tune g(·, θ_mrc) and h(·, θ_ner^(i)) on data including D̃_{t,n} and its MRC-format version
11:   Refine D̃_{t,n} ← h(·, θ_ner^(i)) on D_{t,no}
12: end for
13: Predict on unlabeled target data with h(·, θ_ner^(T))

Continual Learning Enhancing. We design a continual learning strategy to make full use of pseudo data and imitate the training procedure on the Targets. We continually perform fine-tuning between MRC and NER with pseudo data (Step-6∼7), based on the following three considerations: 1) the pseudo-labeled target NER data can be refined by the fine-tuned NER model after each iteration; 2) the pseudo data is transformed into MRC format, which directly introduces entity type knowledge in the target data through the definition of MRC questions; and 3) the pseudo data participates in both MRC and NER training, which can enhance knowledge connections between the two tasks. At Step-8∼9, we refine the pseudo data with the newly fine-tuned NER model and take it as training data. After T iterations, we conduct predictions on unlabeled target data with the NER model h(·, θ_ner^(T)) (Step-10).
The training procedure is summarized in Algorithm 1. D_{t,no} and D_{s,no} denote unlabeled NER data in the target and source domain/language, respectively. D_{t,m} and D_{s,m} denote labeled MRC data in the target and source domain/language, respectively. D̃_{t,n} and D_{s,n} denote labeled NER data in the target and source domain/language, respectively. Particularly, D̃_{t,n} is the pseudo-labeled NER data in the target domain/language generated by the NER model. Similarly, the source NER data D_{s,n} is transformed into the MRC format as D_{s,nm}. f(·; θ_mlm), g(·; θ_mrc), and h(·; θ_ner) denote the models of MLM, MRC, and NER, respectively. Note that 'Gen' in Algorithm 1 denotes the generation operation.
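The overall schedule can be sketched as a plain driver loop (the stage names and dataset labels are illustrative placeholders, not the authors' identifiers):

```python
def tof_schedule(T):
    """Return the ordered list of TOF training stages (cf. Algorithm 1):
    MLM -> MRC -> NER, pseudo labeling, then T continual-learning rounds."""
    schedule = [
        ("mlm", ["D_t_no", "D_s_no"]),            # domain/language tuning
        ("mrc", ["D_t_m", "D_s_m", "D_s_nm"]),    # task bridging
        ("ner", ["D_s_n"]),                        # source task tuning
        ("pseudo-label", ["D_t_no"]),              # generate pseudo NER data
    ]
    for _ in range(T):                             # continual learning rounds
        schedule += [
            ("mrc", ["pseudo NER data in MRC format"]),
            ("ner", ["pseudo NER data"]),
            ("refine", ["pseudo NER data"]),       # re-label with newer model
        ]
    schedule.append(("predict", ["target test set"]))
    return schedule
```

In a real pipeline each tuple would dispatch to the corresponding fine-tuning routine; the point of the sketch is the ordering: one pass of MLM, MRC, and NER, then T alternations between MRC and NER over progressively refined pseudo data.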

Data Preparation
We take CoNLL03 for English (en) in the news domain as the source dataset for both cross-lingual and cross-domain tasks. Cross-Lingual. We consider three NER datasets in target languages: CoNLL03 for German (de) (Tjong Kim Sang and De Meulder, 2003), and CoNLL02 for Dutch (nl) and Spanish (es) (Tjong Kim Sang, 2002). All datasets are labeled with 4 entity types: PER, ORG, LOC, and MISC. Each of them is split into training, validation and test sets following (Wu et al., 2020b). We use three MRC datasets in the target languages: MLQA (es) (Lewis et al., 2019), XQuAD (de) (Artetxe et al., 2019), and SQuAD (en) (Rajpurkar et al., 2016). Cross-Domain. We use three English datasets in target domains: the CBS SciTech News dataset (Jia et al., 2019), short as CBS, in the science and technology news domain, and Twitter NER (Zhang et al., 2018b) and WNUT16 (Strauss et al., 2016) in the social media domain. We use two English MRC datasets from the news and Twitter domains, respectively: NewsQA (Trischler et al., 2016) and TweetQA (Xiong et al., 2019). The statistics of the datasets are shown in Table 5 in Appendix A.

Data Preprocessing
NER datasets are processed in the 'BIO' scheme with four entity types, i.e., PER, LOC, ORG, and MISC, except for WNUT16. We perform the entity span detection task on WNUT16, since it is annotated with ten entity types, which differ from the annotations in the source domain/language. For MRC datasets, we transform all of them into a unified format following (Li et al., 2020) for MRC training. Besides, following (Li et al., 2020), we map the labeled NER datasets to labeled MRC datasets. Concretely, we use the description of each entity type for annotators as the query, and each sentence as the context. The corresponding answers for each query are the entity spans with the same entity type in the sentence. We delete all entity labels on the target data and only use the unlabeled data. We use the training and validation sets from the source for training and evaluation, and make predictions on the test sets from different target domains/languages.
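The NER-to-MRC mapping can be sketched as below (the query wordings are our own illustrative stand-ins for the annotator guidelines used as queries; the span format follows the Li et al. (2020) formulation):

```python
QUERIES = {  # entity-type descriptions used as MRC queries (illustrative)
    "PER": "Find person names, including real and fictional people.",
    "ORG": "Find organizations, such as companies and institutions.",
    "LOC": "Find locations, such as countries, cities, and places.",
    "MISC": "Find miscellaneous entities not covered by the other types.",
}

def bio_to_spans(tags):
    """Collect (type, start, end) entity spans from well-formed BIO tags
    (end index inclusive)."""
    spans, start, etype = [], None, None
    for i, tag in enumerate(tags + ["O"]):       # sentinel flushes last span
        if (tag == "O" or tag.startswith("B-")) and start is not None:
            spans.append((etype, start, i - 1))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans

def ner_to_mrc(tokens, tags):
    """One (query, context, answers) example per entity type."""
    spans = bio_to_spans(tags)
    examples = []
    for etype, query in QUERIES.items():
        answers = [(s, e) for t, s, e in spans if t == etype]
        examples.append({"query": query,
                         "context": " ".join(tokens),
                         "answers": answers})
    return examples
```

Each labeled NER sentence thus becomes four MRC examples, one per entity type, where the answers are exactly the entity spans of that type (possibly empty).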

Implementation Details
We use BERT-base and multilingual BERT (Devlin et al., 2018) to initialize contextualized word embeddings in the cross-domain and cross-lingual scenarios, respectively. We empirically follow the hyperparameter settings of (Han and Eisenstein, 2019) and (Li et al., 2020) except for the learning rate and batch size. Due to the discrepancy between various datasets, we choose the learning rate for the Adam optimizer according to the best performance of checkpoints on the validation set. The batch size is set to 32, 16 and 64 for MLM, MRC and NER, respectively. More hyperparameters for the training procedure are listed in Appendix B.

Systems
We evaluate the following systems by entity-level F1 scores. Moreover, we conduct each experiment 5 times and report the mean F1-score. AdaptaBERT. Han and Eisenstein (2019) perform domain-tuning and task-tuning as described in Section 2.2. We take AdaptaBERT as our baseline in the cross-domain scenario. AdaptaBERT + translation. Another baseline is set for the cross-lingual scenario, which additionally applies translated data on top of AdaptaBERT.

Main Results

Cross-Lingual. Table 1 shows the results of our TOF after removing 'pseudo data' and 'continual learning', respectively, which demonstrates the effectiveness of these two enhancing strategies. The improvement of our TOF on nl (2.49 ↑) is not as good as on the other two languages (es: 4.17 ↑ and de: 4.1 ↑), which results from the scarcity of MRC data in nl. The results well demonstrate the effectiveness of our proposed framework, which benefits from our four guidelines. Cross-Domain. We regard the re-implemented results of AdaptaBERT as our baseline, since it not only achieves the state-of-the-art performance on WNUT16, but also outperforms the previous state-of-the-art methods on both CBS and Twitter. Our framework yields obvious improvements over the baseline (CBS: 1.11 ↑, Twitter: 2.33 ↑ and WNUT16: 4.83 ↑) and achieves new state-of-the-art results on the three datasets. In conclusion, all these results verify the effectiveness and generalizability of our TOF in the cross-domain setting.
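Entity-level F1, the metric reported throughout, counts a predicted entity as correct only on an exact type-and-boundary match; a small sketch:

```python
def entity_f1(gold_spans, pred_spans):
    """Entity-level F1 over (type, start, end) spans: a prediction is a
    true positive only on an exact match, as in the CoNLL evaluation."""
    gold, pred = set(gold_spans), set(pred_spans)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Scores from the 5 repeated runs are then averaged to give the reported mean F1.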

Ablation Study
We conduct ablation studies to explore how MRC datasets make a difference at Steps 1∼3 in Figure 2. Table 2 highlights the impact of different MRC data in both cross-lingual and cross-domain scenarios.
In the cross-lingual scenario, we consider five kinds of MRC data: 1) 'w/o target MRC data' denoting training without MRC data in the target language; 2) 'w/o source MRC data' denoting training without MRC data in English; 3) 'w/o source MRC data (trans)' denoting training without translating the source MRC data into the target language; 4) 'w/o NER-MRC data' denoting training without transforming the NER data into MRC format; and 5) 'w/o NER-MRC data (trans)' denoting training without translating the NER-MRC data into the target language.
Results demonstrate that removing any data generally causes a performance drop, from which we draw more in-depth observations as follows. For es, 'NER-MRC data' brings the greatest performance drop (Row 4). For nl and de, 'source MRC data' has the greatest impact (Row 2). Besides, 'source MRC data' affects the performance more than 'target MRC data' (Row 2 vs. Row 1). We think this is because there is twice as much 'source MRC data' as target MRC data.
In the cross-domain scenario, since all three target datasets are in English but in different domains, we do not consider the translated data (Rows 3 and 5) in Table 2. Therefore, we conduct ablation studies on three kinds of data: 1) 'w/o target MRC data' denoting training without the target domain MRC data; 2) 'w/o source MRC data' denoting training without the source domain MRC data; and 4) 'w/o NER-MRC data' denoting training without transforming the NER data into MRC format.
According to the average results in Table 2, we observe that 'target MRC data', 'NER-MRC data', and 'source MRC data' are in descending order of impact. Intuitively, on Twitter, as shown in Table 2 (Row 1 vs. Row 2 and Row 4), 'target MRC data' has the greatest impact on the performance, since the amount of target MRC data is greater than or equal to that of 'source MRC data' and 'NER-MRC data'. However, CBS is affected most by 'NER-MRC data' (Row 4), since its target MRC data is collected from the news domain, not science and technology news. For WNUT16, both 'target MRC data' and 'source MRC data' bring larger drops than 'NER-MRC data' (Rows 1 and 2 vs. Row 4). We conjecture that since WNUT16 is an entity span detection task rather than standard NER, it is affected more by the golden MRC data than by the NER-MRC data.

Impact of Task Order
We explore the impact of two different task orders for MRC-enhancing, i.e., 'MLM → MRC → NER' and 'MRC → MLM → NER', as shown in Table 3. The results demonstrate that the former outperforms the latter in both cross-lingual and cross-domain scenarios. We conjecture that MLM captures knowledge of the data itself, e.g., domain-specific information and linguistic characteristics, while MRC captures task-specific information with annotations. Besides, MRC is more relevant to NER than MLM according to task relevance. Therefore, MRC is appropriate as an intermediate task.

Comparison with SpanBERT
We replace the pre-trained language model BERT with SpanBERT (Joshi et al., 2020) in AdaptaBERT and in the MRC-enhancing of our TOF to compare the span-enhancing method with ours. The results are shown in Table 4 (AdaptaBERT vs. SpanBERT in cross-domain NER). We observe that: 1) 'SpanBERT' is superior to BERT-base for the NER task (Row 3 vs. Row 1). Different from BERT, which masks different tokens for each instance, SpanBERT masks a span of several adjacent tokens, which is more related to the NER task. 2) 'SpanBERT' underperforms 'AdaptaBERT + MRC-enhancing' on CBS and WNUT16 (Row 2 vs. Row 3), which suggests that although SpanBERT is trained on a large corpus, it is not appropriate for some specific domains. Our MRC-enhancing method uses limited MRC data but achieves larger improvements, which shows that MLM cannot capture enough task-specific information and it is necessary to introduce other NER-related tasks. 3) Our MRC-enhancing method can make further improvements based on SpanBERT (Row 4 vs. Row 3).

Related Work
Zero-resource NER. Some studies (Jia and Zhang, 2020; Pfeiffer et al., 2020; Vidoni et al., 2020) focus on improving the architectures of existing models, adding new components into networks to capture specific knowledge, i.e., entity types, language and task characteristics. Different from these methods, our approach only modifies the training procedure without changing model structures. Other studies introduce different auxiliary tasks to alleviate data scarcity (Han and Eisenstein, 2019; Xue et al., 2020; Phang et al., 2020). They are usually based on multi-task learning or two-phase fine-tuning. Multi-task learning requires a balance between the target task and auxiliary tasks, which needs carefully designed objectives. Although two-phase fine-tuning is effective, it still exploits available data inadequately and depends on valid data selection. Our work differs in that we not only propose four practical guidelines to guide data selection and task fine-tuning, but also design a target-oriented fine-tuning framework to exploit more diverse data and target-oriented training strategies.

Data Augmentation. Our approach is inspired by some studies on text classification (Gururangan et al., 2020). Apart from traditional sequence labeling methods for NER (Huang et al., 2015; Ma and Hovy, 2016; Akbik et al., 2018; Liu et al., 2019), our work is also inspired by formatting other tasks as MRC, such as NER (Li et al., 2020), co-reference resolution, and event extraction (Liu et al., 2020a). These studies show the superiority and scalability of the MRC framework and provide a reference for our work. Different from (Li et al., 2020), which uses MRC to build a new solution architecture for NER, we exploit MRC to improve the training procedure of NER based on sequence labeling. Besides, we perform continual learning between MRC and NER to enhance the impact of MRC on NER.

Conclusion and Future Work
In this paper, we analyze the problem of data scarcity in zero-resource NER. To alleviate this issue, we propose four practical guidelines on transferring knowledge from three aspects, i.e., domain, language, and task, and strengthening connections between the source and target data. Based on these guidelines, we design a target-oriented fine-tuning framework to enhance the training procedure with various strategies. Our approach yields significant improvements on six benchmarks and achieves the state-of-the-art on five benchmarks. In the future, we will extend our framework to different target tasks and more task-specific enhancing strategies.
The statistics of all datasets are listed in Table 5.
We regard CoNLL03 in English as the source NER data in both cross-lingual and cross-domain scenarios. For target NER datasets, we consider crosslingual and cross-domain scenarios, respectively. In the cross-lingual scenario, CoNLL03 in German, CoNLL02 in Spanish, and CoNLL02 in Dutch denote the benchmark datasets in the target languages, i.e., German (de), Spanish (es), and Dutch (nl), respectively. In terms of MRC datasets, we apply MLQA in Spanish and XQuAD in German as labeled MRC datasets in the target languages, i.e., on es and de. Note that we use the initial validation and test splits in MLQA and XQuAD as the training and validation sets in our work. Since it is difficult to obtain labeled MRC datasets for nl, we consider the MRC data in the source language, i.e., English (en).
In the cross-domain scenario, the CBS SciTech News NER dataset, short as CBS, in the science and technology news domain, the Twitter NER dataset in the Twitter domain, and the shared task on entity span detection for WNUT2016 in the Twitter domain are considered as the target domain NER datasets. All three cross-domain benchmarks are in English. We use NewsQA in the news domain for MRC fine-tuning on CBS, due to the lack of available MRC data in the science and technology news domain. TweetQA is applied to both Twitter and WNUT16 NER as the MRC data in the target domain.