Risk Minimization for Zero-shot Sequence Labeling

Zero-shot sequence labeling aims to build a sequence labeler without human-annotated datasets. One straightforward approach is utilizing existing systems (source models) to generate pseudo-labeled datasets and train a target sequence labeler accordingly. However, due to the gap between the source and the target languages/domains, this approach may fail to recover the true labels. In this paper, we propose a novel unified framework for zero-shot sequence labeling with minimum risk training and design a new decomposable risk function that models the relations between the predicted labels from the source models and the true labels. By making the risk function trainable, we draw a connection between minimum risk training and latent variable model learning. We propose a unified learning algorithm based on the expectation maximization (EM) algorithm. We extensively evaluate our proposed approaches on cross-lingual/domain sequence labeling tasks over twenty-one datasets. The results show that our approaches outperform state-of-the-art baseline systems.


Introduction
Sequence labeling is an important task in natural language processing. It has many applications such as Part-of-Speech (POS) tagging (DeRose, 1988; Toutanova et al., 2003) and Named Entity Recognition (NER) (Ratinov and Roth, 2009; Ritter et al., 2011; Lample et al., 2016; Ma and Hovy, 2016; Hu et al., 2020). Approaches to sequence labeling are mostly based on supervised learning, which relies heavily on labeled data. However, labeled data is generally expensive and hard to obtain (for low-resource languages/domains), which means that these supervised learning approaches fail in many cases.
Learning from imperfect predictions produced by rich-resource sources (such as cross-lingual or cross-domain transfer) (Yarowsky and Ngai, 2001; Guo et al., 2018; Huang et al., 2019; Hu et al., 2021) is a feasible and efficient way to tackle the low-resource problem. It transfers knowledge from rich-resource languages/domains to low-resource ones. One typical approach is to utilize existing systems to provide predicted results for the zero-shot datasets. However, due to the gap between the source and the target languages/domains, this approach may fail to recover the true labels. Several previous approaches try to alleviate this problem by relying heavily on cross-lingual information (e.g., parallel text (Wang and Manning, 2014; Ni et al., 2017)), labeled data in source languages (Chen et al., 2019), and prior domain knowledge (Yang and Eisenstein, 2015) for different kinds of zero-shot scenarios. However, these approaches are designed for specific settings and might not generalize to settings where the required resources are expensive to obtain or unavailable due to data privacy (Wu et al., 2020). Instead, we want a learning framework that addresses the zero-shot learning problem from a unified perspective.
In this work, we consider two widely explored settings in which we have access to: 1) imperfect hard predictions (Rahimi et al., 2019; Lan et al., 2020); or 2) imperfect soft predictions (Wu et al., 2020), produced by one or more source models on target unlabeled data, and propose two novel approaches. We start by introducing a novel approach based on the minimum risk training framework. We design a new decomposable risk function, parameterized by a fixed matrix, that models the relations between the noisy predictions from the source models and the true labels. We then make the matrix trainable, which leads to further expressiveness and connects minimum risk training to learning latent variable models. We propose a learning algorithm based on the EM algorithm, which alternates between updating a posterior distribution and optimizing model parameters.
To empirically evaluate our proposed approaches, we conduct extensive experiments on four sequence labeling tasks across twenty-one datasets. Our two proposed approaches, especially the latent variable model, outperform several strong baselines.

Sequence Labeling
Given a sentence x = x_1, …, x_n, its word representations are extracted from pre-trained embeddings and passed into a sentence encoder such as a BiLSTM, a Convolutional Neural Network (CNN), or multilingual BERT (Devlin et al., 2019) to obtain a sequence of contextual features. Without considering the dependencies between predicted labels, the Softmax layer computes the conditional probability as follows:

P_θ(y|x) = ∏_{i=1}^{n} P_θ(y_i|x)

Given the gold sequence y* = y*_1, …, y*_n, the general training objective is to minimize the negative log-likelihood of the sequence:

L(θ) = −∑_{i=1}^{n} log P_θ(y*_i|x)

For simplicity, throughout this paper, we assume that all the sequence labelers are based on the Softmax method.
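The position-wise factorization above can be sketched as follows. This is a toy numpy illustration in which the contextual encoder is abstracted into precomputed logits; the specific numbers are made up for demonstration.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable Softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def sequence_nll(logits, gold):
    """Negative log-likelihood of a gold label sequence under a
    position-wise Softmax labeler (no dependencies between labels)."""
    probs = softmax(logits, axis=-1)               # (n, K): one distribution per token
    return -np.sum(np.log(probs[np.arange(len(gold)), gold]))

# Toy setup: 3 tokens, 4 labels; the logits stand in for encoder output.
logits = np.array([[2.0, 0.1, 0.0, 0.0],
                   [0.0, 1.5, 0.2, 0.0],
                   [0.1, 0.0, 0.0, 2.5]])
loss = sequence_nll(logits, gold=np.array([0, 1, 3]))
```

Because the factorization is fully independent across positions, the sequence loss is just the sum of per-token cross-entropies.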

Cross-Lingual/Domain Transfer
Supervised models fail when labeled data are absent. Learning from imperfect predictions produced by rich-resource sources is a viable way to tackle the problem. Generally speaking, there are two settings for obtaining the imperfect predictions: single-source and multi-source.

The simplest single-source approach is to train a source model on one source language/domain and use it to directly predict labels on the target test data. We name this approach direct single-source transfer (DT). Another single-source approach is to use the predictions of the source model on a set of unlabeled target data to supervise the training of a target model. With imperfect hard predictions from the source model, the corresponding objective function is the cross-entropy loss between the imperfect hard predictions and the target model's soft predictions:

L_hard(θ) = −∑_{i=1}^{n} log P_θ(ŷ_i|x)

where ŷ denotes the pseudo label sequence of x predicted by the source model and ŷ_i is the pseudo label at position i. With imperfect soft predictions from the source model, the corresponding objective function is the KL-divergence (KL) or mean squared error (MSE) loss between the imperfect soft predictions and the target model's soft predictions (knowledge distillation, KD) (Wu et al., 2020).

For the multi-source setup, a simple approach consists of two steps. The first step is to apply DT with each source language to produce predictions on unlabeled target data. The second step is to mix the predictions from all the source models and perform supervised learning of a target model on the mixed pseudo-labeled dataset. However, the mixed pseudo-labeled dataset can be very noisy because predictions from different source models may contradict each other. Similar to the single-source setting, a more effective way is to aggregate the soft predictions from multiple sources and perform KD (Wu et al., 2020).
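The two single-source objectives can be sketched as follows. This is an illustrative numpy sketch, not the paper's implementation; `target_probs` stands for the target model's per-token Softmax outputs and the numbers are invented.

```python
import numpy as np

def hard_loss(target_probs, pseudo_labels):
    """Cross-entropy between hard pseudo labels from the source model
    and the target model's soft predictions."""
    n = len(pseudo_labels)
    return -np.sum(np.log(target_probs[np.arange(n), pseudo_labels]))

def kd_loss(source_probs, target_probs):
    """KL divergence from the source model's soft predictions to the
    target model's predictions (knowledge distillation)."""
    return np.sum(source_probs * (np.log(source_probs) - np.log(target_probs)))

# Toy example: 2 tokens, 3 labels.
target_probs = np.array([[0.6, 0.3, 0.1],
                         [0.2, 0.5, 0.3]])
source_probs = np.array([[0.8, 0.1, 0.1],
                         [0.1, 0.7, 0.2]])
hard = hard_loss(target_probs, pseudo_labels=source_probs.argmax(axis=1))
soft = kd_loss(source_probs, target_probs)
```

The hard loss discards everything but the argmax of the source distribution, while the KD loss keeps the full distribution, which is why soft predictions can carry extra information.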

Minimum Risk Training
In supervised learning, minimum risk training aims to minimize the expected error (risk) under the conditional probability:

J(θ) = ∑_{y∈Y(x)} P_θ(y|x) R(y*, y)

where R(y*, y) is the risk function that measures the distance between the gold sequence y* and the candidate sequence y, and Y(x) denotes the collection of all possible label sequences given the sentence x. The risk function can be defined in many ways depending on the application, such as the BLEU score in machine translation (Shen et al., 2016). However, in our setting there are no gold labels with which to compute R(y*, y). Instead, we assume there are multiple pre-trained source models that can be used to predict hard labels, and we define the risk function as R(ŷ, y), measuring the difference between the pseudo label sequence ŷ predicted by the source models and the candidate sequence y. The objective function becomes:

J(θ) = ∑_{y∈Y(x)} P_θ(y|x) R(ŷ, y)

Conventional minimum risk training is intractable, mainly due to the combination of two reasons: first, the set of candidate label sequences Y(x) is exponential in size and intractable to enumerate; second, the risk function is hard to decompose (or indecomposable). To tackle this problem, we define the risk function as a negative probability −P(ŷ|y) that fully decomposes by position. The objective function becomes:

J(θ) = −∑_{y∈Y(x)} P_θ(y|x) P(ŷ|y) = −∏_{i=1}^{n} ∑_{y_i} P_θ(y_i|x) P(ŷ_i|y_i)    (1)

We introduce a matrix ψ ∈ R^{K×K} to model P_ψ(ŷ_i|y_i), where K is the number of labels. Notice that ψ here is a fixed matrix that does not change during training. When learning from imperfect predictions, it is often implicitly assumed that the prediction from a source model is generally better than uniformly selecting a candidate label at random. Given this prior knowledge, we require P_ψ(ŷ_i = k|y_i = k) > 1/K. Therefore, we empirically define the matrix ψ as

P_ψ(ŷ_i = k′|y_i = k) = µ if k′ = k, and (1 − µ)/(K − 1) otherwise,

where µ > 1/K is a hyper-parameter. In the implementation, for convenience, we multiply an identity matrix by a hyper-parameter τ and then apply the Softmax operation to every column to obtain the matrix ψ.
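The construction of ψ from τ can be sketched as a small numpy helper. With K labels, the column-wise Softmax of τI places µ = e^τ / (e^τ + K − 1) on the diagonal, which is always greater than 1/K for τ > 0.

```python
import numpy as np

def build_psi(K, tau):
    """Column-wise Softmax over tau * I: each column places
    mu = e^tau / (e^tau + K - 1) on the matching label and spreads
    the remaining mass uniformly over the other K - 1 labels."""
    e = np.exp(tau * np.eye(K))
    return e / e.sum(axis=0, keepdims=True)

psi = build_psi(K=11, tau=3.0)   # 11 labels, as in the CoNLL NER setup
mu = psi[0, 0]                   # diagonal entry P(yhat = k | y = k), about 0.67
```

Larger τ sharpens ψ toward the identity matrix (µ → 1), while τ = 0 gives the uniform matrix (µ = 1/K).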
To further explain ψ, we give an example from the perspective of prediction in Table 1. Given a sentence x = "I cried", a label distribution P_θ(y|x) for the sentence, a pseudo label sequence ŷ = {Pron, Adj} predicted by the source model, and two settings µ_1 = 0.4 and µ_2 = 1 for ψ^(1) and ψ^(2) respectively, we compute P_θ(y_i|x) × P_ψ(ŷ_i|y_i) as shown in the table.
Table 1: An example of prediction results with two different ψs. Case 1, with a less sparse matrix than Case 2, obtains a better prediction. y_pred denotes the predictions of the sequence labeler using the corresponding matrix ψ.
Since ψ^(2) is an identity matrix, it predicts the label with the largest value at each position, and consequently assigns the wrong label Adj to the word "cried". On the contrary, ψ^(1) introduces some uncertainty by smoothing over the pseudo labels; as a result, it correctly predicts the word "cried" as Verb. From the perspective of training, which minimizes J(θ), if ψ is an identity matrix, then the model is a supervised model with ŷ as the supervision signal; on the other hand, if ψ is a uniform matrix, then the supervision signal becomes random and training becomes meaningless.
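The Table 1 computation can be reproduced with the following sketch. The per-token distributions here are hypothetical stand-ins over the labels (Pron, Adj, Verb), not the actual numbers from the paper.

```python
import numpy as np

# Hypothetical P_theta(y|x) for "I cried" over labels (Pron, Adj, Verb).
p_theta = np.array([[0.70, 0.20, 0.10],   # "I"
                    [0.05, 0.35, 0.60]])  # "cried"
pseudo = np.array([0, 1])                 # source model predicts (Pron, Adj)

def rescore(p_theta, pseudo, mu):
    """Compute P_theta(y_i|x) * P_psi(yhat_i|y_i) at each position."""
    K = p_theta.shape[1]
    off = (1.0 - mu) / (K - 1)
    psi = np.full((K, K), off) + (mu - off) * np.eye(K)  # psi[yhat, y]
    return p_theta * psi[pseudo, :]       # row i weighted by psi[yhat_i, :]

smoothed = rescore(p_theta, pseudo, mu=0.4)   # psi^(1): smoothed
identity = rescore(p_theta, pseudo, mu=1.0)   # psi^(2): identity matrix
# With smoothing, "cried" is recovered as Verb (index 2); the identity
# matrix simply copies the wrong pseudo label Adj (index 1).
```

The flip happens because the smoothed ψ lets the labeler's own confidence in Verb outweigh the source model's wrong vote for Adj.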

Extending to Leverage Soft Predictions
Previous work shows that soft predictions from source models can provide more information than hard predictions (Hinton et al., 2015; Wu et al., 2020). Our approach can also easily leverage this information by simply replacing the one-hot pseudo labels with the soft probability distributions from the source models. The training objective becomes:

J(θ) = −∏_{i=1}^{n} ∑_{y_i} P_θ(y_i|x) ∑_{ŷ_i} P_s(ŷ_i|x) P_ψ(ŷ_i|y_i)

where P_s denotes the source model's soft predictions.
For simplicity, in the rest of this section, we introduce our approaches based on the setup of using one-hot pseudo labels, but all the approaches can be extended to leverage soft predictions in a similar way.

Minimum Risk Training: A Latent Variable Model Perspective
In this subsection, we instead use a trainable matrix σ to model P_σ(ŷ|y). We initialize σ in the same way as ψ. Assuming that, conditioned on y, x and ŷ are independent of each other, we find that the non-negative term of equation (1) is a conditional marginal probability defined by a latent variable model in which y is the latent variable:

P(ŷ|x) = ∑_{y∈Y(x)} P_θ(y|x) P_σ(ŷ|y)
In latent variable model training, we generally optimize the negative conditional log-likelihood, and the objective function becomes:

J(θ, σ) = −log ∑_{y∈Y(x)} P_θ(y|x) P_σ(ŷ|y) = −∑_{i=1}^{n} log ∑_{y_i} P_θ(y_i|x) P_σ(ŷ_i|y_i)

Interpolation In practice, given a pre-defined hyper-parameter µ, we combine the fixed P_ψ(ŷ_i|y_i) with the trainable P_σ(ŷ_i|y_i) to get a new probability:

P_φ(ŷ_i|y_i) = λ P_ψ(ŷ_i|y_i) + (1 − λ) P_σ(ŷ_i|y_i)

where λ ∈ [0, 1] is a hyper-parameter and φ is the combined matrix. If λ = 1, this reduces to minimum risk training; otherwise, it is the latent variable model.
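The interpolation can be sketched as follows. For illustration, both matrices are built with the same Softmax-of-τI construction; in the real model σ would be a trainable parameter updated by gradient descent.

```python
import numpy as np

def stochastic_matrix(K, tau):
    """Column-stochastic matrix from a column-wise Softmax of tau * I."""
    e = np.exp(tau * np.eye(K))
    return e / e.sum(axis=0, keepdims=True)

def combine(psi, sigma, lam):
    """P_phi = lam * P_psi + (1 - lam) * P_sigma.
    lam = 1 recovers minimum risk training (fixed psi only);
    lam < 1 mixes in the trainable latent-variable-model matrix sigma."""
    return lam * psi + (1.0 - lam) * sigma

K = 11
psi = stochastic_matrix(K, tau=4.0)     # strong prior (large mu)
sigma = stochastic_matrix(K, tau=1.0)   # weak initialization (small mu)
phi = combine(psi, sigma, lam=0.5)
```

A convex combination of column-stochastic matrices is itself column-stochastic, so φ remains a valid conditional distribution for any λ ∈ [0, 1].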

From Single-source to Multi-source Setup
By modeling the joint distribution over the pseudo labels predicted by U source models on the target unlabeled data, we can easily extend our latent variable model to the multi-source setting. The objective function becomes:

J(θ, φ) = −∑_{i=1}^{n} log ∑_{y_i} P_θ(y_i|x) ∏_{u=1}^{U} P_φ(ŷ_i^{(u)}|y_i)

The overall architecture of the latent variable model is depicted in Figure 1.

Optimization
In this section, we propose a unified optimization scheme based on the EM algorithm (Dempster et al., 1977) to learn the parameters of the two proposed approaches. (An alternative is direct gradient descent optimization, which we find yields weaker results; we discuss this in the analysis section.) The EM algorithm is widely applied to learn parameters in a large family of models with latent variables, such as Gaussian mixture models. It is an iterative approach with two steps per iteration: the E-step, which optimizes a posterior distribution over the latent variables, and the M-step, which estimates the parameters of the latent variable model according to that posterior. As the single-source setup can be seen as a special case, we focus on the multi-source setup to derive the equations. We first introduce Q(y) = ∏_i Q(y_i) as a distribution over the latent variable y, and then derive an upper bound of J(θ, φ) via Jensen's inequality:

J(θ, φ) ≤ −∑_{i=1}^{n} ∑_{y_i} Q(y_i) log [ P_θ(y_i|x) ∏_{u=1}^{U} P_φ(ŷ_i^{(u)}|y_i) / Q(y_i) ] + C

where C is a residual term. To make the bound tight for particular θ and φ, we derive Q(y_i) as:

Q(y_i) = P_θ(y_i|x) ∏_{u=1}^{U} P_φ(ŷ_i^{(u)}|y_i) / ∑_{y_i′} P_θ(y_i′|x) ∏_{u=1}^{U} P_φ(ŷ_i^{(u)}|y_i′)

We sketch our parameter update strategy in the t-th iteration as follows:
• E-step: compute Q(y_i) using the parameters θ and φ from the (t − 1)-th iteration;
• M-step: update the parameters θ and φ together using a gradient-based approach by minimizing the upper bound above, with Q(y_i) fixed.
We repeat the two steps alternately until convergence. The overall process for the multi-source setup with unlabeled target data is given in Algorithm 1.

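The E-step posterior for the multi-source setup can be sketched as follows. This is an illustrative numpy sketch with made-up numbers; `phi` is indexed as `phi[yhat, y]` and each source contributes one pseudo label sequence.

```python
import numpy as np

def e_step(p_theta, pseudo_per_source, phi):
    """Tight posterior Q(y_i) ∝ P_theta(y_i|x) * prod_u P_phi(yhat_i^u|y_i),
    computed independently at each position."""
    Q = p_theta.copy()                       # (n, K)
    for pseudo in pseudo_per_source:         # one length-n label sequence per source
        Q = Q * phi[pseudo, :]               # emission row for each position's pseudo label
    return Q / Q.sum(axis=1, keepdims=True)  # renormalize per position

# Toy multi-source example: 2 tokens, 3 labels, 2 sources.
p_theta = np.array([[0.5, 0.3, 0.2],
                    [0.2, 0.3, 0.5]])
phi = np.array([[0.8, 0.1, 0.1],
                [0.1, 0.8, 0.1],
                [0.1, 0.1, 0.8]])
Q = e_step(p_theta, [np.array([0, 2]), np.array([0, 2])], phi)
# Inference then takes the argmax over Q at each position.
```

Because the posterior factorizes by position, the E-step is a single vectorized pass; the exponential sum over label sequences never has to be enumerated.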

Inference
For inference, we use Q(y) to obtain the prediction at each position:

y_i^pred = argmax_{y_i} Q(y_i)

Experiments
We use multilingual BERT (mBERT) as the sentence encoder to obtain word representations. Following Wu et al. (2020), each source model is pre-trained on its corresponding training data. We use the BIO scheme for the CoNLL and OntoNotes NER tasks and for Aspect Extraction. We run each model three times and report the average accuracy for the POS tagging task and the average F1-score for the other tasks.

Cross-Domain Sequence Labeling
We use the English portion of OntoNotes (v5) (Hovy et al., 2006), which contains six domains: broadcast conversation (bc), broadcast news (bn), magazine (mz), newswire (nw), telephone conversation (tc), and web (wb). More details can be found in Appendix A.1.

Approaches
Single-source Setup The following approaches are applicable for the single-source setup:
• DT: we use the pre-trained source model to directly predict the pseudo labels on the target unlabeled data.
• Hard: we use the pseudo labels from DT on the target unlabeled data to train a new model.

Multi-source Setup
The following approaches are applicable for the multi-source setup:
• Hard-Cat: we apply DT with all the source languages/domains, mix the resulting pseudo labels from all the sources on the unlabeled target data, and train a new model.
• Hard-Vote: we do majority voting at the token level on the pseudo labels from DT with each source and train a new model.

Table 3: Multi-source cross-domain results on OntoNotes. KD-re is our re-implementation of the KD approach (Wu et al., 2020). The reported results from Lan et al. (2020) are denoted with † for reference.

Both Setups
The following approaches are applicable for both single-/multi-source setups:
• KD-re: to fairly compare with the KD approach (Wu et al., 2020) in the same settings (such as the source model's cross-lingual ability), we re-implement the KD approach and adapt it to all tasks.
• MRT: our minimum risk training approach with the fixed matrix ψ, with soft or hard predictions.
• LVM: our latent variable model with parameter φ (combining the fixed matrix ψ and the trainable matrix σ), with soft or hard predictions.
We also provide the reported results from existing approaches for reference. Because of differences in experimental configuration, directly comparing our approaches with these reported results is generally not fair. For the CoNLL NER tasks, we provide the reported results from Wu et al. (2020). For the cross-domain sequence labeling tasks, we provide the reported results from Lan et al. (2020), who learn a consensus network to aggregate predictions from multiple sources.

Hyper-parameters
Hyper-parameter selection in transfer learning is difficult, as no labeled dataset is available for the target language. We select the hyper-parameters only on the development set of the English language and directly use the selected hyper-parameters for the other languages. This may result in sub-optimal performance but is more realistic. In latent variable model training, the latent variable is generally very flexible, which may result in sub-optimal performance; therefore, the initialization of the latent variable is crucial. In practice, we find that the best strategy is to initialize µ of ψ with a large value (e.g., 0.9) and µ of σ with a small value (e.g., 0.3), and to anneal λ from 1 to 0. At the early stage of training, this initialization offers a strong prior that keeps the encoder from going in a bad direction; at later stages, the warmed-up encoder can better guide the training of φ and vice versa. In this way, the encoder and φ achieve a good balance during training. More details of the hyper-parameters can be found in Appendix A.2.

Table 4: Results on the POS tagging tasks. KD-re is our re-implementation of the KD approach (Wu et al., 2020).

Results and Observations
For the single-source setting, we use English as the source language and the others as the unlabeled target languages. In the multi-source setting, we repeat our experiments multiple times, each time with one language as the target and the others as the sources. We evaluate all approaches on the CoNLL, Aspect Extraction, OntoNotes, and POS tagging tasks. We report the results in Tables 2, 3, and 4.
Observation #1 Our two approaches outperform several strong baselines on all tasks and in all scenarios (single-/multi-source with soft/hard predictions), especially the multi-source scenario, which demonstrates their effectiveness. It shows that modeling this kind of relation is fairly important and helps to recover the true labels from noisy data. Meanwhile, introducing uncertainty over the relations between the predicted labels from the source models and the true labels, in both training and prediction, significantly benefits our approaches.
Observation #2 Our LVM approach achieves overall improvements over the MRT approach on all tasks. It suggests that our LVM approach learns the relations between predicted labels from the source models and true labels better than MRT.
Other Minor Observations First, all the approaches that use unlabeled target data for training outperform DT. This suggests that leveraging the unlabeled target data (which may contain knowledge of the target language/domain) in training does help zero-shot transfer learning. Comparing approaches that leverage soft predictions from the sources with those that use hard predictions, the former generally outperform the latter, which suggests that soft predictions can still provide useful knowledge for samples with incorrect hard predictions. The reported results from Lan et al. (2020) are significantly worse; we speculate that this is because they use weaker embeddings and a different encoder (BiLSTM-CRF). KD-re outperforms our approaches on Ca and Id in the POS tagging task in the single-source setting, but its advantage is not statistically significant.

Analysis
We conduct the analysis on the multi-source setting with soft predictions from sources for its better performance.
Big Data Performance We experiment with our two models and the KD-re baseline on large target training sets for the POS tagging task. We randomly select 100,000 sentences (without labels) from the Wikipedia-003 section of the Ca language in the CoNLL 2017 shared task (Ginter et al., 2017). We randomly select 1,000, 10,000, and 100,000 sentences to train the three approaches, evaluate on the UD test set for each of the three languages respectively, and show the results in Figure 2. Our latent variable model outperforms the other two approaches over all the settings. Although KD outperforms MRT with fewer than 10,000 sentences, MRT achieves comparable results with enough unlabeled data. Besides, with more unlabeled data used for training, each model gains a considerable further boost.

Comparison to Direct Gradient Optimization
Our two proposed approaches can also be optimized directly with any gradient-based method, such as the AdamW optimizer (Loshchilov and Hutter, 2018). We use the two proposed approaches to compare the direct gradient-based training strategy with the EM algorithm, conducting experiments on the CoNLL NER task in the multi-source setting. Table 5 shows that the EM algorithm outperforms direct gradient-based training for our approaches, which is slightly different from previous findings (Berg-Kirkpatrick et al., 2010).

Comparison to Hard EM In this part, we compare our optimization strategy (soft-EM) with the hard-EM approach. Instead of computing a dense vector for Q(y_i), hard-EM computes a one-hot vector. We conduct experiments on our two proposed approaches on the CoNLL NER task in the multi-source setting. The results in Table 6 show that soft-EM gains a slight improvement over hard-EM on the MRT approach but differs significantly from hard-EM on our LVM approach.
Impact of Matrix ψ We analyze the relation between performance and different initializations of ψ. We experiment with the MRT approach in the single-source setup with soft predictions on the NER tasks; Figure 3 shows the results. The best value of τ is 2 for De and 3 for the others (resulting in µ = 0.43 and 0.67, respectively), which shows that the uncertainty introduced by a smooth ψ can effectively boost the model's performance. On the other hand, setting ψ to a nearly identity matrix with τ = 10 leads to worse scores.

Related Work
Cross-lingual/domain Sequence Labeling Recent works on cross-lingual transfer mainly consider two scenarios: single-source cross-lingual transfer (Yarowsky and Ngai, 2001; Wang and Manning, 2014; Huang et al., 2019) and multi-source cross-lingual transfer (Täckström et al., 2012; Guo et al., 2018; Rahimi et al., 2019; Hu et al., 2021). Wu et al. (2020) propose a knowledge distillation approach to further leverage unlabeled target data and achieve state-of-the-art results. Hu et al. (2021) propose a multi-view framework to selectively transfer knowledge from multiple sources by utilizing a small amount of labeled data. Cross-domain adaptation is also widely studied (Steedman et al., 2003). Existing works include bootstrapping approaches (Ruder and Plank, 2018), mixture-of-experts (Guo et al., 2018; Wright and Augenstein, 2020), and consensus networks (Lan et al., 2020).
Other previous work (Kim et al., 2017;Guo et al., 2018;Huang et al., 2019) utilized labeled data in the source domain to learn desired information. However, our proposed approaches do not require any source labeled data or parallel texts.

Conclusion
In this paper, we propose two approaches to the zero-shot sequence labeling problem. Our MRT approach uses a fixed matrix to model the relations between the predicted labels from the source models and the true labels. Our LVM approach uses trainable matrices to model these label relations. We extensively verify the effectiveness of our approaches on both single-source and multi-source transfer over both cross-lingual and cross-domain sequence labeling problems. Experiments show that MRT and LVM generally bring significant improvements over previous state-of-the-art approaches on twenty-one datasets.
Aspect Extraction We use the restaurant domain of subtask 1 in the SemEval-2016 shared task (Pontiki et al., 2016).

A.2 Hyper-parameter setting
We select the hyper-parameters according to the strategy described in the main paper. For multi-source cross-lingual/domain tasks, we select hyper-parameters based on performance on the English development set and apply them to the other target languages. For single-source cross-lingual/domain tasks, we simply use the same hyper-parameters as in the multi-source setting. In the inference step, we use P_θ(y|x) in the single-source setting and Q(y) in the multi-source setting to predict the label sequence. We empirically set the learning rate of mBERT to 2e-5 and the learning rate of φ to 2e-4 for the multi-source setup and 2e-5 for the single-source setup. We train each model for three epochs. We tune the following hyper-parameters.
τ and τ̃ for initializing matrices τ and τ̃ are used to initialize the matrices ψ and σ in our minimum risk training and latent variable model approaches, respectively. Because the label sets differ in size across tasks, the selection range differs as well. Taking the CoNLL NER tasks as an example, we tune τ for ψ (in MRT and LVM) in the range {1, 2, 3, 4, 10}, and τ̃ for σ (in LVM) in the same range. The CoNLL NER tasks have 11 labels (9 entity labels, a padding label, and an ending label), which means µ ∈ {0.21, 0.43, 0.67, 0.85, 1.0}.