PRAM: An End-to-end Prototype-based Representation Alignment Model for Zero-resource Cross-lingual Named Entity Recognition

Zero-resource cross-lingual named entity recognition (ZRCL-NER) aims to leverage rich labeled source-language data to address the NER problem in a zero-resource target language. Existing methods are built on either data transfer or representation transfer. However, the former usually incurs additional computation costs, and the latter lacks explicit optimization specific to the NER task. To overcome these limitations, we propose a novel prototype-based representation alignment model (PRAM) for the challenging ZRCL-NER task. PRAM models the cross-lingual (CL) NER task and transfers knowledge from source languages to target languages in a unified neural network with end-to-end training, avoiding additional computation costs. Moreover, PRAM borrows the CL inference ability of multilingual language models and enhances it with a novel training objective, attribution-prediction consistency (APC), which explicitly enforces entity-level alignment between entity representations and predictions, as well as across languages using prototypes as bridges. The experimental results show that PRAM significantly outperforms existing state-of-the-art methods, especially in some challenging scenarios.


Introduction
Named Entity Recognition (NER) aims to identify the boundaries and categories of entities in a chunk of text (Tjong Kim Sang, 2002). Automatic NER is useful for various downstream applications, such as search engines (Cowan et al., 2015), dialogue systems (Bowden et al., 2018), and knowledge graphs (Al-Moslmi et al., 2020). Most recent advances in NER are achieved by deep neural networks trained on large amounts of labeled data (Lample et al., 2016; Chiu and Nichols, 2016; Li et al., 2020; Yu et al., 2020). However, not every language has sufficient labeled data for training a deep NER model. This motivates research on a new challenging task named zero-resource cross-lingual NER (ZRCL-NER), which aims to leverage a richly labeled source language to address the NER problem in an unlabeled target language.
Most advanced methods for this challenging task concentrate on transferring knowledge from the rich-resource language (RRL) to the zero-resource language (ZRL). According to the type of transferred knowledge, these methods can be divided into two categories: data transfer based methods and representation transfer based methods. The data transfer based methods transfer knowledge from the RRL to the ZRL by generating pseudo-labeled data (Mayhew et al., 2017; Jain et al., 2019; Zhou et al., 2022) or soft labels for unlabeled data (Chen et al., 2021; Liang et al., 2021; Li et al., 2022) in the ZRL via translation or teacher models, respectively. However, they follow a two-stage training setup that needs extra translation or teacher models, introducing additional computation costs. In contrast, the representation transfer based methods directly model cross-lingual NER in a unified model, borrowing the cross-lingual inference ability of multilingual pre-trained language models (mPLMs) (Devlin et al., 2019; Conneau and Lample, 2019; Conneau et al., 2020). To further enhance the cross-lingual inference ability, they enforce the mPLMs to learn language-independent features via representation alignment w.r.t. tokens (Kulshreshtha et al., 2020; Muller et al., 2021) or adversarial training w.r.t. languages (Keung et al., 2019). These optimizations are implicit for the NER task and thus show limited improvements on ZRCL-NER.
To overcome the above limitations, we propose a novel Prototype-based Representation Alignment Model (PRAM) for ZRCL-NER. PRAM builds upon an mPLM, falls into the representation transfer based framework, and performs end-to-end training that avoids tedious training stages, preventing computational costs that are not necessary for improving performance on the target language. We explicitly enforce entity-level alignment between representations and predictions in both source and target languages with a new training objective called Attribution-Prediction Consistency (APC). By treating prototypes as category anchors, we can align similar entity representations across languages by guiding the similarity distributions of entity representations relative to prototypes through predictions. Specifically, we treat the centroid of the entity representations belonging to the same class in the source language as a prototype. The APC minimizes the JS divergence between the predicted probability distribution (the model output for each entity mention) and the similarity distribution (the similarities between each mention representation and all prototypes). In this way, as the predicted probability distribution becomes more discriminative during training, the APC imposes alignment between each entity representation and its corresponding prototype. The alignment property in the representation space, in turn, regularizes the deviation of model predictions. Probability distribution changes in the source language lead to similar changes in the target language due to the mPLM's cross-lingual inference ability. Thus, similar entity representations of the source and target languages are clustered around the prototypes, and this alignment enhances cross-lingual inference.
To explore the performance of PRAM, we conduct experiments on three NER datasets under both single- and multi-source settings. The experimental results show that PRAM outperforms all state-of-the-art (SOTA) methods, especially in some more challenging scenarios. Specifically, in the single-source transfer setting on the WikiAnn dataset, PRAM achieves a 3.92% absolute improvement in F1 score over the SOTA. In the multi-source setting, PRAM achieves an average improvement of 3.88% in F1 score over the SOTA. Moreover, we demonstrate that the APC is able to impose entity-level alignment across languages by visualizing entity representations of the same class from different languages.
The contribution of this work is three-fold: (1) we propose PRAM, an end-to-end prototype-based representation alignment model for ZRCL-NER that avoids the additional computation costs of two-stage pipelines; (2) we design a novel training objective, attribution-prediction consistency (APC), which explicitly enforces entity-level alignment between representations and predictions, as well as across languages using prototypes as bridges; (3) extensive experiments show that PRAM significantly outperforms existing SOTA methods in both single-source and multi-source settings.

Related Work

Cross-lingual Named Entity Recognition
Cross-lingual NER methods can be roughly categorized into data transfer based and representation transfer based. The former aims to train a target-language NER model with pseudo-labeled data constructed from the labeled source-language data. Specifically, translation-based data transfer constructs pseudo-labeled data by translating texts and projecting labels from the source to the target language (Jain et al., 2019; Qin et al., 2021; Zhou et al., 2022). Knowledge-distillation-based data transfer generates soft labels for unlabeled target data with a teacher model trained on the source language (Wu et al., 2020a; Gou et al., 2021; Li et al., 2022). Such methods typically follow a two-stage training setup and need additional computing costs for training the translation models or the teacher models. The latter aims to model cross-lingual NER in a unified model with the help of mPLMs (Pires et al., 2019). To further exploit language-independent features, some cross-lingual representation transfer techniques have been introduced, including word-word alignment (Wang et al., 2020), adversarial training (Keung et al., 2019), and meta learning (Wu et al., 2020b). These optimizations are implicit for the NER task and fail to learn task-specific information, limiting their applicability to ZRCL-NER.

Multilingual Representation Learning
Multilingual representation learning aims to create representations of different languages in a unified semantic space that can be used for various tasks across languages. Pre-trained on monolingual corpora in 104 languages, Multilingual BERT (M-BERT) (Devlin et al., 2019) can encode multilingual representations and works well on zero-resource cross-lingual transfer tasks, which has motivated further work on multilingual pre-trained language models (Conneau and Lample, 2019; Conneau et al., 2020; Ouyang et al., 2021). Cao et al. (2020) demonstrate that the cross-lingual transfer ability of M-BERT is mainly due to the partial alignment of its cross-lingual representations, and that this ability can be further enhanced by introducing effective alignment procedures. Motivated by this, several representation alignment approaches have been proposed to improve cross-lingual transfer in downstream tasks. Wang et al. (2019) introduce a bilingual projection of contextual representations based on word alignments trained on parallel data for cross-lingual dependency parsing. Wang et al. (2021) align token representations from different languages via adversarial domain adaptation to efficiently apply M-BERT in cross-lingual information retrieval. Rotational alignment (Wang et al., 2020; Kulshreshtha et al., 2020) and adversarial training (Keung et al., 2019; Bari et al., 2020) have been applied to ZRCL-NER for representation alignment. Still, these techniques need to construct parallel corpora or train additional discriminators, which requires expensive labor or computing costs.

Methodology
This section introduces our prototype-based representation alignment model (PRAM), as illustrated in Figure 1. First, the ZRCL-NER task is described formally. Then, we introduce the basic cross-lingual NER model on which PRAM is built. Subsequently, our APC module is elaborated. Finally, we describe how to train PRAM in single- and multi-source settings.

Problem Definition
Zero-resource cross-lingual NER can be formulated as a sequence labeling problem. Given a sentence $x = \{x_i\}_{i=1}^{L}$ with $L$ tokens, a NER model aims to produce a label sequence $y = \{y_i\}_{i=1}^{L}$, where $y_i$ is the inferred label of the corresponding token $x_i$. In the source language, the annotated training data is denoted by $D^s_{train} = \{(x, y)\}$. In the target language, only the unlabeled data $D^t_u = \{x\}$ is available during training, and a small set of labeled data $D^t_{test} = \{(x, y)\}$ is reserved for evaluation. Formally, ZRCL-NER aims to learn a model based on $D^s_{train}$ and $D^t_u$ that performs well on $D^t_{test}$.
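To make the formulation concrete, here is a toy example of the data involved (the sentence, entities, and tags below are invented for illustration):

```python
# A toy labeled sample (x, y) from the source training data D_s_train:
# one BIO tag per token.
x = ["Angela", "Merkel", "visited", "Paris", "."]
y = ["B-PER", "I-PER", "O", "B-LOC", "O"]

# An unlabeled target-language sample from D_t_u is just a token list.
x_target = ["Angela", "Merkel", "besuchte", "Paris", "."]

# The model must predict exactly one label per token.
assert len(x) == len(y)
```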

The Basic Cross-lingual NER Model
The basic cross-lingual NER model is built by adding a linear classification layer on top of an mPLM, which can be formulated as:

$$H = \{h_i\}_{i=1}^{L} = \mathrm{mPLM}(x), \quad (1)$$

$$p_i = \mathrm{softmax}(W h_i + b), \quad (2)$$

where $H = \{h_i\}_{i=1}^{L}$ is the list of token representations, $W$ and $b$ are the classifier parameters, and $p_i$ is the predicted probability distribution for token $x_i$.
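A minimal PyTorch sketch of this basic model; for brevity the mPLM encoder is abstracted away and the classifier consumes precomputed token representations (the hidden size and tag inventory below are placeholders, not the paper's exact configuration):

```python
import torch
import torch.nn as nn

class BasicCLNER(nn.Module):
    """Linear token classifier on top of mPLM token representations (Eq. 1-2).
    In the paper the representations come from M-BERT; here the module takes
    precomputed hidden states as input for brevity."""
    def __init__(self, hidden_dim: int, num_classes: int):
        super().__init__()
        self.cls = nn.Linear(hidden_dim, num_classes)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, seq_len, hidden_dim) -> per-token probabilities p_i
        return torch.softmax(self.cls(h), dim=-1)
```

In practice, `h` would be produced by the mPLM encoder with its embedding layer and bottom layers frozen, as described in the implementation details.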
For all samples in $D^s_{train}$, the supervised loss is the token-level cross-entropy:

$$L_{sup}(\theta) = -\frac{1}{N} \sum_{(x, y) \in D^s_{train}} \sum_{i=1}^{L} y_i \log p_i, \quad (3)$$

where $N$ is the training data size. It is worth noting that only the source language is annotated with ground-truth labels during training.

The Attribution-Prediction Consistency Constraint
Without additional constraints, the basic cross-lingual NER model may be dominated by the source language, which can hinder its generalization to the target language. This issue is caused by the difference in entity representation distributions between the source and target languages (Libovický et al., 2019), and during training this distribution difference may become even more significant. To alleviate it, we propose a new training objective named attribution-prediction consistency (APC) that enforces entity-level alignment across languages. The APC is applied to both source and target languages, and optimizes the consistency between the predicted probability distribution and the similarity distribution between each representation and all source prototypes. Cooperating with the initial cross-lingual inference ability of the mPLM, the APC progressively enforces entity-level alignment across languages as training proceeds.

Firstly, we feed both the labeled data of the source language and the unlabeled data of the target language into the mPLM encoder within each batch. For each entity class, its prototype is obtained by averaging the representations of all entities belonging to that class in the source language, where the representations are extracted according to Eq. (1). This process can be formulated as follows:

$$C_k = \frac{1}{n_k} \sum_{i} \mathbb{1}[y_i = k]\, h_i, \quad (4)$$

where $\mathbb{1}[\cdot]$ is an indicator function, $k$ represents an entity class label, and $n_k$ denotes the number of tokens belonging to class $k$ in the source language. In practice, the prototypes are generated from the mini-batch instead of all samples to reduce computational costs. To further ensure the stability of updates, a moving average (Xie et al., 2018) is adopted to update the prototypes:

$$C_k \leftarrow \lambda C'_k + (1 - \lambda) C_k, \quad (5)$$

where $C'_k$ denotes the prototype of class $k$ from the previous step, and $\lambda \in (0, 1)$ is the moving average coefficient.
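The prototype computation in Eq. (4) and the moving-average update in Eq. (5) can be sketched as follows (a simplified mini-batch version; the function names are ours, not the paper's):

```python
import torch

def class_prototypes(h: torch.Tensor, y: torch.Tensor, num_classes: int) -> torch.Tensor:
    """Eq. (4): C_k is the mean of the source-language token representations
    labeled k. h: (n_tokens, d) representations; y: (n_tokens,) class labels."""
    protos = torch.zeros(num_classes, h.size(1))
    for k in range(num_classes):
        mask = y == k
        if mask.any():
            protos[k] = h[mask].mean(dim=0)  # average over the n_k tokens of class k
    return protos

def ema_update(prev: torch.Tensor, new: torch.Tensor, lam: float = 0.9) -> torch.Tensor:
    """Eq. (5): moving-average update C_k <- lam * C'_k + (1 - lam) * C_k."""
    return lam * prev + (1.0 - lam) * new
```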
Subsequently, we obtain the similarity distribution between each token and all prototypes. For each token $x_i$, we calculate the cosine similarity between its representation $h_i$ and each prototype $C_k$:

$$s_{(i,k)} = \frac{h_i \cdot C_k}{\|h_i\| \, \|C_k\|}. \quad (6)$$

The similarity distribution $q_i = \{q_{(i,k)}\}_{k=1}^{K}$ is produced by applying a softmax with a temperature coefficient $\tau$ (Chen et al., 2020) to the cosine similarity scores:

$$q_{(i,k)} = \frac{\exp(s_{(i,k)} / \tau)}{\sum_{k'=1}^{K} \exp(s_{(i,k')} / \tau)}. \quad (7)$$

Finally, the APC improves the consistency between the similarity distribution $q_i$ and the predicted probability distribution $p_i$ by minimizing their Jensen-Shannon divergence:

$$L_c(\theta) = \frac{1}{N} \sum_{i} \mathrm{JsDiv}(q_i \,\|\, p_i). \quad (8)$$
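Eqs. (6)-(8) can be sketched as follows (the function names and the `eps` smoothing are our additions; the paper's exact JS implementation may differ):

```python
import torch
import torch.nn.functional as F

def similarity_distribution(h: torch.Tensor, protos: torch.Tensor,
                            tau: float = 0.2) -> torch.Tensor:
    """Eq. (6)-(7): cosine similarity of each token to every prototype,
    turned into a distribution q_i with a temperature-scaled softmax."""
    sims = F.cosine_similarity(h.unsqueeze(1), protos.unsqueeze(0), dim=-1)  # (n, K)
    return torch.softmax(sims / tau, dim=-1)

def js_divergence(q: torch.Tensor, p: torch.Tensor, eps: float = 1e-12) -> torch.Tensor:
    """Jensen-Shannon divergence between row-wise distributions (Eq. 8)."""
    m = 0.5 * (q + p)
    kl = lambda a, b: (a * ((a + eps).log() - (b + eps).log())).sum(-1)
    return 0.5 * kl(q, m) + 0.5 * kl(p, m)

def apc_loss(h, protos, p, tau: float = 0.2) -> torch.Tensor:
    """Mean JS divergence between similarity distribution q and prediction p."""
    return js_divergence(similarity_distribution(h, protos, tau), p).mean()
```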

The Prototype-based Representation Alignment Model for Single-source Setting
Our PRAM is built by combining the basic cross-lingual NER model and the APC module. The total loss of PRAM is

$$L(\theta) = L_{sup}(\theta) + \alpha L^s_c(\theta) + \beta L^t_c(\theta), \quad (9)$$

where $\alpha$ and $\beta$ are the balancing weights for the source and target language, respectively.

Discussion: The first term $L_{sup}$ in Eq. (9) leverages the supervision signals in the source language to learn task-specific semantics. As training proceeds, the predicted probability distribution becomes more discriminative in both the source and target languages. This is further leveraged by the APC, i.e., $L^s_c$ and $L^t_c$, which imposes the similarity distribution to be consistent with the probability distribution. Because the similarity distribution (in both source and target languages) measures the similarity between each entity representation and all prototypes (produced from the source language), the APC indirectly enforces entity-level alignment between the source and target languages. The alignment property in the representation space, in turn, revises the deviation of model predictions.
Algorithm 1 shows the pseudocode for the overall training process of PRAM.

Extend PRAM to Multi-source Setting
PRAM can be easily extended to the multi-source scenario. Assuming there are $n$ source languages, we construct $n$ sets of prototypes, one set per source language, according to Eq. (4) and Eq. (5). For each sample in a source language, we obtain its similarity distribution according to the prototypes of the corresponding language. For each sample in the target language, we obtain $n$ similarity distributions according to the prototype sets of the different source languages. Figure 2 provides a visual illustration of how the similarity distributions are derived in the multi-source setting. The total loss in Eq. (9) for training PRAM can be rewritten as

$$L(\theta) = L_{sup}(\theta) + \alpha \sum_{i=1}^{n} L^{s_i}_c(\theta) + \beta \sum_{i=1}^{n} L^{(t,s_i)}_c(\theta), \quad (10)$$

where $L^{s_i}_c(\theta)$ denotes the consistency loss of the $i$-th source language, and $L^{(t,s_i)}_c(\theta)$ denotes the consistency loss of samples in the target language w.r.t. the prototypes of the $i$-th source language.
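The multi-source aggregation can be sketched as a simple sum over per-source APC terms (a sketch under the assumption that each per-source loss is computed exactly as in the single-source case; the function name is ours):

```python
def multi_source_total_loss(l_sup, src_losses, tgt_losses,
                            alpha: float = 1.0, beta: float = 1.0):
    """Sketch of Eq. (10): supervised loss plus APC terms summed over n sources.
    src_losses[i]: consistency loss of source language i w.r.t. its own prototypes.
    tgt_losses[i]: consistency loss of target samples w.r.t. source i's prototypes."""
    return l_sup + alpha * sum(src_losses) + beta * sum(tgt_losses)
```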

Experiment
We evaluate PRAM in both single-source and multi-source transfer settings and compare it with state-of-the-art models. Moreover, an ablation study and several analytical experiments are conducted to demonstrate the effectiveness of our model.

Datasets and Experiment Settings
Three benchmark datasets are included in our experiments: CoNLL-2002 (Tjong Kim Sang, 2002), CoNLL-2003 (Tjong Kim Sang and De Meulder, 2003), and WikiAnn (Pan et al., 2017; Rahimi et al., 2019). All datasets are labeled using the BIO scheme with four entity types: persons (PER), locations (LOC), organizations (ORG), and miscellaneous (MISC). Each dataset is split into training/development/test sets as initially published. The dataset statistics are listed in Table 1.
In the single-source transfer, we treat English as the source language and the others as target languages for both CoNLL-2002/2003 and WikiAnn. In the multi-source transfer, we follow the previous work (Wu et al., 2020a), and the target-language training sets with labels removed are selected as the unlabeled target data.

Implementation Details
The cross-lingual encoder is initialized with the parameters of cased M-BERT base released by HuggingFace Transformers. Following the previous work (Wu et al., 2020b), the parameters of the embedding layer and the bottom three layers of M-BERT are frozen. If a word is tokenized into several subwords by WordPiece, we only consider the first subword in the loss function. To avoid the adverse effect on cross-lingual transfer performance caused by excessive non-entity tokens, we adopt a non-entity down-sampling strategy (Li et al., 2021), as described in Appendix A. The balancing rate γ is set to 1.0, 1.5, or 2.0 depending on the case.
For all experiments, we use the AdamW optimizer with a learning rate of 1e-5. The batch size and the maximum sequence length are both set to 128 empirically. The early stopping strategy is adopted, and the maximum number of training steps is set to 20,000. Additionally, we use grid search for the other hyper-parameters: the temperature coefficient τ is selected from 0.15 to 0.25, the moving average coefficient λ from 0.8 to 0.99, and the loss weights α and β from 0.5 to 1.5 for CoNLL and from 1.0 to 3.0 for WikiAnn. We implement our approach with PyTorch 1.11.0, and all calculations are done on an NVIDIA Tesla V100 GPU. The entity-level micro-F1 score is used as the evaluation metric. For all experiments, we report the average F1 scores over 5 runs with different random seeds.

Baselines
We compare our proposed method with the following SOTA methods, including data transfer based methods and representation transfer based methods. TMP (Jain et al., 2019) proposes a system that improves entity-projection annotation by leveraging machine translation. BERT-f (Wu and Dredze, 2019) fine-tunes multilingual BERT on the source language and directly performs prediction in the target languages. AdvCE (Keung et al., 2019) introduces language-adversarial training on the contextual representations for cross-lingual NER. BERT-RA (Kulshreshtha et al., 2020) utilizes parallel corpora to supervise the rotational alignment of representations across different languages. TSL (Wu et al., 2020a) proposes teacher-student learning to transfer task-specific knowledge from the source to the target language. Unitrans (Wu et al., 2021) devises a pipeline to unify both model and data transfer for ZRCL-NER. AdvPicker (Chen et al., 2021) designs an adversarial learning framework to select less language-dependent data in the target language to improve ZRCL-NER performance. RIKD (Liang et al., 2021) proposes a cross-lingual NER approach combining knowledge distillation and reinforcement learning. MTMT (Li et al., 2022) proposes a similarity metric model based on knowledge distillation and multi-task learning for cross-lingual NER.

Table 3: F1 Scores (%) on the WikiAnn dataset in the single-source setting. We bold the best performance and underline the second-best performance.

Performance Comparison
Single-source Transfer: As shown in Tables 2 and 3, PRAM convincingly outperforms previous SOTA methods in most cases. Specifically, on the CoNLL-2002/2003 dataset (Table 2), PRAM achieves the best performance on German and Spanish and the second best on Dutch. Compared with the previous SOTA, PRAM improves the F1 score by 0.27% on average. On the WikiAnn dataset (Table 3), PRAM outperforms the previous SOTA by a large margin, with average F1-score improvements of 3.92% compared to MTMT (ranging from 3.20% for Chinese to 4.67% for Arabic). We observe that PRAM achieves a more significant boost on the WikiAnn dataset, where the source (English) and the target languages (Arabic, Hindi, and Chinese) come from distinct language families. The large differences between the source and target languages severely limit the transfer ability of previous methods, including the latest models MTMT, RIKD, and AdvPicker, but PRAM still performs well in such challenging settings.

Multi-source Transfer: In line with the previous work (Wu et al., 2020a), on the CoNLL-2002/2003 dataset we take German, Spanish, and Dutch as target languages. As shown in Table 4, PRAM obtains significant and consistent improvements on the three target languages. Specifically, PRAM improves the F1 score by 4.61% on average compared to BERT-f and by 3.88% compared to TSL-avg (TSL with averaged teacher models) (Wu et al., 2020a). Moreover, compared to the single-source setting, the cross-lingual performance of PRAM in the multi-source setting is consistently improved, due to aligning target representations with the prototypes of multiple source languages.

Ablation Study
To investigate the contributions of different modules in PRAM, we conduct ablation experiments with three variants: 1) PRAM w/o S-C removes the APC on the source language; 2) PRAM w/o T-C removes the APC on the target language; 3) PRAM w/o C does not use any consistency strategy. As shown in Table 5, the performance of all three variants drops significantly. Compared to PRAM w/o S-C, PRAM w/o T-C yields a larger drop in F1 scores. This indicates that the APC on the target language is more critical for improving the cross-lingual transfer ability, because it helps to align the target representations with the prototypes produced in the source language. Without the APC constraints, the average F1 score of PRAM w/o C decreases by 6.49% on CoNLL and 7.80% on WikiAnn compared to PRAM, due to the lack of explicit representation optimization.

Effectiveness of Attribution-Prediction Consistency
To validate the effectiveness of the APC, we conduct experiments with two variants of this training objective: Prediction Guiding (PG) employs the predicted probability distributions to supervise the similarity distributions, stopping gradient back-propagation through the prediction path; conversely, Similarity Guiding (SG) prevents gradient back-propagation through the similarity-metric path. As shown in Figure 3, both variants show a significant performance drop compared to the APC. The degradation of SG is more significant than that of PG, suggesting that the guidance from the predictions to the representation distributions plays the more important role. We also report the prediction deviation of the model under different strategies, defined as the mean squared error between the predicted probability distributions and the one-hot labels: $e = \frac{1}{n} \sum_i (y_i - p_i)^2$. The APC achieves a lower prediction deviation than PG. This demonstrates that the APC effectively improves model performance through the combined effect of aligning cross-lingual representations and using the alignment properties to revise prediction deviation.
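A sketch of this deviation metric (how the squared error is reduced over classes is our assumption; the paper only states the scalar formula above):

```python
import torch

def prediction_deviation(p: torch.Tensor, y_onehot: torch.Tensor) -> torch.Tensor:
    """Mean squared error between predicted distributions p (n, K) and one-hot
    labels y_onehot (n, K), averaged over the n evaluated tokens."""
    return ((y_onehot - p) ** 2).sum(-1).mean()
```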

Representation Visualization
To demonstrate that PRAM can align similar entity representations across languages, we randomly select 150 samples per class from the source and target languages and employ t-SNE (Van der Maaten and Hinton, 2008) to project their representations encoded by the mPLM into a two-dimensional space. As shown in Figure 4, PRAM results in greater alignment of similar representations across languages compared to the baseline, i.e., the basic cross-lingual NER model introduced in Section 3.2. When there is a large difference between the source and the target language (from English to Arabic), many target representations from the baseline are mixed together and distributed differently from similar source representations, which hinders cross-lingual transfer. In contrast, PRAM significantly enhances representation alignment across languages, especially in those classes where the baseline struggles (B-ORG, I-ORG, I-LOC, etc.). When the target language is similar to the source language (from English to German), PRAM further optimizes and aligns similar representations compared to the baseline, making entities belonging to the same class more tightly clustered.

Effect on the Source Language
To investigate the impact of cross-lingual transfer on the source language, we further evaluate the performance of PRAM in the single-source transfer setting. The results in Table 6 reveal that PRAM is indeed capable of boosting performance on the source language (English, 'en') for both the CoNLL and WikiAnn datasets. This demonstrates that PRAM can effectively handle both the source and target languages with a single end-to-end training run. PRAM is thus clearly more efficient than previous SOTA methods, which require separate models to be trained in two stages for each language.

Conclusion
This paper proposes a novel and effective prototype-based representation alignment model (PRAM) for ZRCL-NER. A novel training objective named APC is proposed to cooperate with the cross-lingual inference ability of the mPLM, which enhances the alignment between entity representations and predictions, as well as between the representations of homogeneous entities across languages. The experimental results show that PRAM achieves excellent performance in both single-source and multi-source transfer settings. Last but not least, the training of PRAM is performed end-to-end and only additionally utilizes unlabeled target data.

Limitations
PRAM effectively handles the ZRCL-NER task but has certain limitations. Firstly, since PRAM relies on the cross-lingual inference ability of the mPLM, its transfer ability may be restricted if the target language is not among the pre-trained languages of the mPLM. Secondly, PRAM may have high memory requirements in the multi-source transfer setting, where a large batch size is needed to ensure stable updates of the prototypes for the different source languages. This motivates us to improve the space efficiency of our method in the future.

A Non-entity Down-sampling Strategy
Since non-entity (O-class) samples constitute the bulk of all samples, it is crucial to alleviate the class imbalance caused by excessive non-entity samples. We adopt a non-entity down-sampling strategy to address this. All tokens in the input sentence participate in forward propagation, but only the entity tokens and part of the non-entity tokens are considered in the loss functions.
Since the target data is unlabeled, we assign the class with the highest predicted probability output by the classifier as the pseudo label of each target token:

$$\hat{y}^t_i = \arg\max_k \; p_{(i,k)}.$$

With the help of the labels of the source-language samples and the pseudo labels $\hat{y}^t$ of the target-language samples, we randomly down-sample the non-entity tokens to balance the numbers of non-entity tokens and entity tokens:

$$n^{ds}_o = \min(\gamma \cdot n_e, \; n_o),$$

where $n_o$ and $n_e$ are the numbers of non-entity tokens and entity tokens, respectively, $n^{ds}_o$ denotes the number of down-sampled non-entity tokens, and $\gamma$ is a balancing rate. When calculating the supervised loss $L_{sup}$ and the consistency loss $L_{consis}$, only the entity tokens and the sampled non-entity tokens participate.
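A possible implementation of the down-sampling step (the exact sampling scheme and the clamp to the available O-token count are our assumptions):

```python
import torch

def downsample_non_entities(labels: torch.Tensor, gamma: float = 1.0,
                            o_id: int = 0) -> torch.Tensor:
    """Return a boolean mask keeping every entity token and a random subset of
    non-entity (O) tokens, so that roughly gamma * n_e O-tokens remain.
    labels: (n,) gold labels (source) or pseudo labels argmax_k p_(i,k) (target)."""
    is_entity = labels != o_id
    n_e = int(is_entity.sum())
    o_idx = torch.nonzero(~is_entity).flatten()
    n_keep = min(len(o_idx), int(gamma * n_e))
    keep_o = o_idx[torch.randperm(len(o_idx))[:n_keep]]
    mask = is_entity.clone()
    mask[keep_o] = True  # entity tokens plus the sampled O-tokens enter the losses
    return mask
```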

Figure 1 :
Figure 1: (a) The illustration of PRAM based on a prototype-based representation alignment in an end-to-end manner; (b) The process of Attribution-Prediction Consistency (APC).

Algorithm 1
Overall training process of PRAM.
Require: D^s_train: training data in the source language; D^t_u: unlabeled data in the target language; θ_enc: parameters of the multilingual word encoder; θ_cls: parameters of the classifier; T_steps: the maximum number of training steps.
1: Initialize θ_enc and θ_cls
2: iter = 0
3: while iter < T_steps do
4:   Sample a mini-batch B with (x, y) ∈ D^s_train and x ∈ D^t_u
5:   for all samples in B do
6:     h_i ← f(θ_enc, x_i)
7:     p_i ← f(θ_cls, h_i)
8:   end for
9:   # supervised learning
10:  Calculate L_sup(θ) on (x, y) ∈ D^s_train as in Eq. (3)
11:  # attribution-prediction consistency (APC)
12:  Obtain the class prototypes C as in Eq. (4) and (5)
13:  for all samples in B do
14:    Produce q_i as in Eq. (6) and (7)
15:  end for
16:  Calculate L^s_c(θ) and L^t_c(θ) as in Eq. (8)
17:  # the total loss for training
18:  Calculate L(θ) as in Eq. (9)
19:  Update θ_enc and θ_cls via gradient back-propagation
20:  iter += 1
21: end while

Figure 2 :
Figure 2: The similarity distributions produced in the multi-source cross-lingual transfer.

Figure 3 :
Figure 3: The performance (F1 score, bar charts) and the prediction deviations (MSE score, line charts) of different training objectives on the WikiAnn dataset.

Figure 4 :
Figure 4: Two-dimensional t-SNE visualizations of the samples: denotes the tokens from the source language, and ▲ denotes the tokens from the target language.

Table 2 :
F1 Scores (%) on the CoNLL-2002/2003 dataset in the single-source setting. We bold the best performance and underline the second-best performance.

Table 4 :
F1 Scores (%) on the CoNLL dataset in the multi-source setting.We bold the best performance.

Table 6 :
F1 Scores (%) on the test set of the source language in the single-source settings.