MetaTS: Meta Teacher-Student Network for Multilingual Sequence Labeling with Minimal Supervision

Sequence labeling aims to predict a fine-grained sequence of labels for a piece of text. However, this formulation hinders the effectiveness of supervised methods due to the lack of token-level annotated data, a problem that is exacerbated when a diverse range of languages must be covered. In this work, we explore multilingual sequence labeling with minimal supervision using a single unified model for multiple languages. Specifically, we propose the Meta Teacher-Student (MetaTS) Network, a novel meta-learning method that alleviates data scarcity by leveraging large multilingual unlabeled data. Prior teacher-student frameworks for self-training rely on rigid teaching strategies, which can hardly produce high-quality pseudo-labels for consecutive and interdependent tokens. In contrast, MetaTS allows the teacher to dynamically adapt its pseudo-annotation strategy based on the student's feedback on the generated pseudo-labeled data of each language, thus mitigating error propagation from noisy pseudo-labels. Extensive experiments on both public and real-world multilingual sequence labeling datasets empirically demonstrate the effectiveness of MetaTS.


Introduction
Sequence labeling or tagging is the task of detecting the boundaries of all entity mentions occurring in unstructured text and classifying them into predefined types, with applications such as Named Entity Recognition (NER) (Chiu and Nichols, 2016) and Aspect-Based Sentiment Analysis (ABSA) (Mitchell et al., 2013). An entity mention is a single word or a sequence of words that carries key information, such as a person, location, or institution.
In the E-commerce search domain, we need to recognize product attributes, such as product type, brand, and size, from short queries to better understand users' preferences and intents.

Table 1: Ground-truth labels and noisy pseudo-labels for an English query NER example. We use colors to denote the entity type and brackets to indicate the entity boundary. Entity labels: Brand, ProductLine, Size, ProductType, NonContent, Misc.
Despite recent advances in deep learning models for sequence labeling (Huang et al., 2015; Raganato et al., 2017), they still rely on massive labeled data. Nonetheless, sequence labeling tasks usually lie in the low-data regime because human annotation of token-level labels is costly and labor-intensive, especially across a variety of languages (Xie et al., 2018), as search engines or social networks usually cover a diverse range of countries and locales using different languages. In this paper, we explore a unified multilingual sequence labeling model with minimal supervision, where each language has only limited labeled data.
The emergence of multilingual pre-trained language models (mPLMs) such as mBERT (Devlin et al., 2019) and XLM (Conneau and Lample, 2019) has enabled breakthroughs on various multilingual NLP tasks. However, it has recently been noted that mPLMs are not data-efficient and typically require sufficient fine-tuning data for superior performance on downstream tasks. To mitigate data scarcity, Semi-Supervised Learning (SSL) (Chapelle et al., 2009) is a promising paradigm that allows us to take advantage of large-scale unlabeled multilingual data. Self-training (Scudder, 1965) stands out among SSL approaches: a teacher model produces pseudo-labels for unlabeled examples, and a student model learns from these examples with the generated pseudo-labels. Self-training has shown promising results on instance-level tasks, e.g., image classification (Tarvainen and Valpola, 2017; Xie et al., 2020b). However, a major research challenge that dictates the success of self-training is the well-known confirmation bias problem (Arazo et al., 2020), which results in progressive drift on the noisy pseudo-labeled data provided by the teacher. This problem is more pronounced in sequence labeling (Ruder and Plank, 2018), as complicated dependencies between tokens pose tremendous challenges for rigid teaching strategies, e.g., a fixed teacher (Lee et al., 2013) or a periodically synchronized teacher, to generate accurate pseudo-labels for consecutive and interdependent tokens.
To encourage the teacher to generate better pseudo-labels for multilingual sequence labeling, we propose a novel Meta Teacher-Student (MetaTS) network, where the teacher learns dynamically and continuously from the student's feedback to adapt its teaching strategies, i.e., its pseudo-annotation choices. Concretely, given a language at each step, the student network is updated based on the pseudo-labeled data produced by the teacher. To quantitatively measure how well the teacher generates these pseudo-labels at the current step, we evaluate the difference between the student's performance before and after the update on the pseudo-labeled data of that language. The improvement or degradation of the student's performance serves as feedback to meta-optimize the teacher network (a.k.a. learning to learn (Finn et al., 2017)). Consider the example in Table 1: pseudo-labels (choice #2) are closer to the ground-truth labels of the sentence than pseudo-labels (choice #1). Better pseudo-annotation strategies by the teacher lead to more accurate pseudo-labels (e.g., choice #2 in Table 1), thus boosting the student's performance on the labeled data. As such, the proposed MetaTS method learns to teach the student with better token-level pseudo-labels and alleviates the serious confirmation bias problem in sequence labeling. Empirically, extensive experiments on the public multilingual open-domain NER dataset (Tjong Kim Sang, 2002a,b), the multilingual E2E-ABSA challenge benchmark (Pontiki et al., 2014), and a real-world large-scale multilingual E-commerce NER dataset demonstrate the effectiveness of the MetaTS method.

Overall, our contributions can be summarized as follows: (1) we explore a unified and effective multilingual sequence labeling setting with minimal supervision required; (2) we propose a novel MetaTS framework to alleviate the confirmation bias problem via learning from the student's feedback to generate better fine-grained pseudo-labels; (3) we conduct extensive experiments that verify the effectiveness of MetaTS.

Sequence Labeling (SL)
Sequence labeling is the process of identifying (boundary) and categorizing (type) entities in text over a predefined entity set C. Formally, given a sentence X = [x_1, x_2, ..., x_N] with N tokens, the goal is to predict a tag sequence Y = [y_1, y_2, ..., y_N], where y_n ∈ C for n ∈ [1, N]. Under the BIO schema (Li et al., 2012), the first token of an entity mention with type X is labeled B-X, the remaining tokens inside that mention are labeled I-X, and non-entity tokens are labeled O.
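The BIO schema above can be illustrated with a small helper; the spans and entity types here are hypothetical examples in the spirit of the query-NER labels of Table 1:

```python
def spans_to_bio(num_tokens, spans):
    """Convert entity spans (start, end_exclusive, type) into BIO tags:
    B-X on the first token of a mention, I-X on the rest, O elsewhere."""
    tags = ["O"] * num_tokens
    for start, end, etype in spans:
        tags[start] = "B-" + etype
        for i in range(start + 1, end):
            tags[i] = "I-" + etype
    return tags
```

For example, a five-token query with spans `(0, 1, "Brand")`, `(1, 3, "ProductType")`, and `(3, 5, "Size")` yields the tag sequence `B-Brand, B-ProductType, I-ProductType, B-Size, I-Size`.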
Low-Resource Multilingual SL Suppose that there are R languages L = [l_1, l_2, ..., l_R]. For each language l_i, there is only a small amount of labeled data, together with large-scale unlabeled data. Our goal is to train a unified supervised multilingual model that achieves better performance on all languages in the low-resource setting.

Multilingual Pre-trained Language
Model (mPLM) The emergence of mPLMs, such as mBERT (Devlin et al., 2019), XLM (Conneau and Lample, 2019), and mUnicoder (Yang et al., 2020), has led to significant performance gains on various multilingual NLP tasks (Hu et al., 2020). mPLMs leverage self-supervised learning on a large-scale multilingual unlabeled corpus, treating shared word-piece tokens as anchors across languages to produce weakly aligned multilingual representations. These multilingual contextualized embeddings are versatile and can substantially benefit downstream tasks. However, mPLMs are trained on open-domain data and lack adaptivity to specific domains in the low-data regime. Thus, it is critical to exploit the enormous unlabeled data of the downstream tasks to achieve task-aware adaptation.

Teacher-Student Network
The teacher-student (TS) network is a classic architecture widely used both in self-training (Scudder, 1965), where the student model has a similar or higher capacity than the teacher, and in knowledge distillation (Hinton et al., 2015), where the student model is smaller than the teacher. Mathematically, let T and S respectively be the teacher and the student network, parameterized by θ_T and θ_S. We use f(X; θ_T) and f(X; θ_S) to denote the entity label predictions for the sentence X by the teacher and the student, respectively. The knowledge transfer is then usually achieved by minimizing the loss between the predictions of the teacher and the student:

L( f(X; θ_T), f(X; θ_S) ),    (1)

where f(X; θ_T) can be a soft target or be converted to a hard target as the pseudo-labels, and L is the transfer loss that enforces consistency between the teacher and student probability distributions, such as the Cross-Entropy (CE) loss, the Kullback-Leibler (KL) divergence, or the Mean Squared Error (MSE).
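As a sketch of the transfer loss in Eq. (1), the minimal per-token implementation below uses cross-entropy as the choice of L and supports both soft and hard teacher targets; KL divergence or MSE would be drop-in alternatives:

```python
import math

def transfer_loss(teacher_probs, student_probs, hard=False):
    """One choice of the transfer loss L in Eq. (1) for a single token:
    cross-entropy between the teacher's target distribution and the
    student's predicted distribution. With hard=True, the teacher's soft
    prediction is first converted to a one-hot hard pseudo-label."""
    if hard:
        best = max(range(len(teacher_probs)), key=teacher_probs.__getitem__)
        target = [1.0 if i == best else 0.0 for i in range(len(teacher_probs))]
    else:
        target = teacher_probs
    return -sum(t * math.log(s) for t, s in zip(target, student_probs) if t > 0)
```

Note that with identical teacher and student distributions, the soft loss reduces to the entropy of the teacher distribution, while the hard loss reduces to the negative log-probability of the argmax class.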

Meta Teacher-Student Network
Inspired by the teacher-student interaction mechanism, we propose a meta teacher-student (MetaTS) network for low-resource multilingual sequence labeling. Our ultimate goal lies in learning from large-scale multilingual unlabeled data based on pseudo-labels to mitigate the shortage of labeled data for token-level classification. The framework of MetaTS is illustrated in Figure 1.

Student Network
Given a language l_i, recall that there is only limited labeled data together with large-scale unlabeled data. The student network learns the distilled knowledge of the unlabeled data from the teacher, in the form of the teacher's predictions on unlabeled sequences: the hard pseudo-label of the n-th token is ŷ_n = argmax_{c ∈ C} f_{n,c}(X; θ_T^(t)), where f_{n,c} denotes the probability of the n-th token belonging to the c-th class (c ∈ C), and θ_T^(t) is the teacher's parameters at the t-th step. We then achieve the knowledge transfer of Eq. (1) by minimizing the student's loss L_S on these hard pseudo-labels:

L_S = CE( Ŷ^(t), f(X; θ_S^(t)) ),    (2)

where CE is the cross-entropy loss, followed by the gradient update

θ_S^(t+1) = θ_S^(t) − η ∇_{θ_S} L_S,    (3)

where θ_S^(t) and θ_S^(t+1) are the parameters of the student before and after the update at the t-th step; they will be used for the meta-learning of the teacher in the next section.
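The pseudo-labeling and student loss above can be sketched as follows; this is a toy version operating on per-token probability lists rather than mBERT outputs, and in practice the student would then take a gradient step on L_S:

```python
import math

def hard_pseudo_labels(teacher_probs_seq):
    """argmax over classes per token: the teacher's hard pseudo-labels."""
    return [max(range(len(p)), key=p.__getitem__) for p in teacher_probs_seq]

def student_loss(teacher_probs_seq, student_probs_seq):
    """L_S: token-averaged cross-entropy of the student's predictions
    against the teacher's hard pseudo-labels."""
    labels = hard_pseudo_labels(teacher_probs_seq)
    return -sum(math.log(s[y]) for s, y in zip(student_probs_seq, labels)) / len(labels)
```

Converting to hard pseudo-labels sharpens the teacher's supervision signal, which matters later in the ablation comparing hard and soft labels.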

Teacher Network
The teacher network is jointly optimized with three objectives: a supervised learning loss L_sup, a semi-supervised regularization loss L_reg, and a meta-learning loss L_meta, i.e.,

L_T = L_sup + L_reg + L_meta.

Supervised learning
The supervised loss L_sup on the labeled data is defined as the cross-entropy between the ground-truth labels and the teacher's predictions:

L_sup = CE( Y, f(X; θ_T) ).    (4)

Semi-supervised regularization The regularization loss L_reg alleviates the overfitting of the teacher to the limited labeled data by enforcing prediction consistency between original and augmented unlabeled samples (Xie et al., 2020a). However, in the text domain, 1) data augmentation techniques have a much harder time preserving the original word- and sentence-level semantics than in the vision domain; and 2) external text augmentations are tedious and often unavailable for multilingual corpora, especially for low-resource languages. Thus, we do not explicitly augment the sentence but instead add random Gaussian noise G(0, σ²) to the BERT embedding of each token to increase the diversity of the sentence, which we name virtual data augmentation. Let z_{m,n} ∈ R^{|C|} denote the soft prediction f(X^{l_i}_m; θ_T) of the teacher on the n-th token of X^{l_i}_m, and let z^G_{m,n} ∈ R^{|C|} be the soft prediction of the same token with Gaussian noise G added. Thus, we have

L_reg = Σ_{m,n} I( z^{max}_{m,n} ) · CE( sharpen(z_{m,n}; τ), z^G_{m,n} ),

where τ is a temperature factor that controls the smoothness of the sharpened target, z^{max}_{m,n} denotes the maximum probability over the C classes, i.e., max_c z_{m,n,c}, and I is an indicator function used to mask tokens with low prediction confidence, i.e., I(z) equals 1 if z > ε and 0 otherwise, with a threshold ε ∈ (0, 1).

Meta learning The meta loss L_meta enforces the teacher to learn from the student's feedback on the current pseudo-labels in order to adjust its pseudo-annotation strategy, which is also known as learning to learn. To quantitatively measure the quality of the current pseudo-labels, we evaluate the student's loss on the labeled data before and after the update, i.e., at θ_S^(t) and θ_S^(t+1) as defined in Eq. (3). Their difference,

λ^{l_i}_meta = L_{S,lab}( θ_S^(t+1) ) − L_{S,lab}( θ_S^(t) ),

can be used as a dynamic feedback or reward signal to meta-optimize the teacher network in the direction that generates better pseudo-labels for the language l_i.
If the pseudo-labels at the t-th step improve the student network, λ^{l_i}_meta will be negative, and positive otherwise. Thus, the meta loss L_meta is defined as

L_meta = λ^{l_i}_meta · log p( Ŷ^{(t)}_{l_i} | X; θ_T ),

where Ŷ^{(t)}_{l_i} is the pseudo-labels for the language l_i produced by the teacher at the t-th step. Minimizing L_meta therefore raises the teacher's likelihood of pseudo-labels that helped the student and lowers it for those that hurt.
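Since the original equation is only partially recoverable, the sketch below encodes our reconstruction of the feedback coefficient and the meta loss; the λ · log p form is an assumption consistent with the sign convention described above:

```python
import math

def meta_feedback(loss_before, loss_after):
    """lambda_meta for language l_i: negative when the pseudo-labels
    improved the student on the labeled data (its loss dropped),
    positive when they hurt it."""
    return loss_after - loss_before

def meta_loss(loss_before, loss_after, log_prob_pseudo_labels):
    """Reconstructed sketch of L_meta = lambda * log p(pseudo-labels | teacher).
    Minimizing it raises the teacher's likelihood of helpful pseudo-labels
    (lambda < 0) and lowers it for harmful ones (lambda > 0)."""
    return meta_feedback(loss_before, loss_after) * log_prob_pseudo_labels
```

Note the sign logic: helpful pseudo-labels give a negative λ, and since log-probabilities are also negative, the product is positive, so gradient descent on L_meta pushes the pseudo-label log-probability upward.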

Alternating Training
During the teacher-student interaction stage, we alternately train the student network and the teacher network by minimizing L_S and L_T separately for each language. As such, the teacher and the student achieve mutual learning: at this stage, the student learns only from the multilingual unlabeled data with pseudo-labels produced by the teacher, while the teacher adjusts its pseudo-annotation strategy according to the feedback from the student. After distilling the knowledge from the teacher to the student network, we finally take the student model fine-tuned on the multilingual labeled data as the final model for evaluation.
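The alternating schedule can be sketched as a plain training loop; the round-robin language sampling and the `student_step`/`teacher_step` callbacks are hypothetical stand-ins for the actual L_S and L_T minimization:

```python
def alternating_training(languages, num_steps, student_step, teacher_step):
    """Round-robin over languages (a simplifying assumption: the paper only
    says a language is given at each step). At each step, first update the
    student on the teacher's pseudo-labels for that language, then update
    the teacher using the student's feedback. Returns the visit order."""
    schedule = []
    for t in range(num_steps):
        lang = languages[t % len(languages)]
        student_step(lang, t)   # minimize L_S on pseudo-labeled data of lang
        teacher_step(lang, t)   # minimize L_T = L_sup + L_reg + L_meta
        schedule.append((t, lang))
    return schedule
```

The key ordering constraint is that the student update precedes the teacher update within a step, so the teacher's meta loss can use the student's before/after losses from that same step.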

Datasets
We consider the following three multilingual sequence labeling datasets for experiments; their statistics are shown in Table 3. (i) Multilingual Open-domain NER (Tjong Kim Sang, 2002a,b) contains 4 entity types: person, location, organization, and miscellaneous.
(ii) Multilingual E2E-ABSA is an ABSA benchmark from SemEval ABSA challenge (Pontiki et al., 2014). We follow the settings of End-to-End Aspect-based Sentiment Analysis (Mitchell et al., 2013;Zhang et al., 2015), which jointly extracts aspect terms and the associated sentiments using a unified tagging scheme. It consists of English (

Setting
For the low-resource setting, we use only 1%, 10%, and 1% randomly sampled training data as the labeled data for each language of the open-domain NER, E2E-ABSA, and E-commerce query NER datasets, respectively, and treat the remaining training data as unlabeled data. This results in tens to thousands of labeled examples per language. We use the span-level micro F1-score (exact match) as the evaluation metric. We use mBERT, pre-trained on 104 languages, as the encoder; it has 12 layers, a 768d hidden size, 12 heads, and 110M total parameters. The hidden states of the last layer of the model are used as the token representations for token-level label prediction. mBERT is jointly optimized with the other parameters during the training stage.
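Span-level micro F1 with exact match counts a predicted entity as correct only if both its boundary and its type match a gold span exactly; a minimal sketch over BIO sequences (treating a stray I- tag as opening a new span is our assumption, as conventions differ):

```python
def bio_to_spans(tags):
    """Extract (start, end_exclusive, type) spans from a BIO tag sequence.
    A stray I- tag is treated as starting a new span (one common convention)."""
    spans, start, etype = [], None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing span
        inside_same = tag.startswith("I-") and tag[2:] == etype and start is not None
        if start is not None and not inside_same:
            spans.append((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, etype = i, tag[2:]
    return spans

def span_micro_f1(gold_seqs, pred_seqs):
    """Micro-averaged F1 over exact-match (boundary + type) spans."""
    tp = fp = fn = 0
    for gold, pred in zip(gold_seqs, pred_seqs):
        g, p = set(bio_to_spans(gold)), set(bio_to_spans(pred))
        tp += len(g & p)
        fp += len(p - g)
        fn += len(g - p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```

Exact match is strict: predicting only the first token of a two-token entity earns no credit at all, which is why span F1 is a harder metric than token accuracy.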

Implementation details
Initialization & Training For all experiments, the model is optimized with the Adam algorithm (Kingma and Ba, 2015). The weight matrices are initialized from a uniform distribution U(−0.01, 0.01). Gradients with a norm larger than 40 are clipped to that threshold. To alleviate overfitting, we perform early stopping on the validation set during both the teacher-student interaction and fine-tuning stages.
Hyperparameter For all three multilingual sequence labeling datasets, the hyperparameters are manually tuned on 10% randomly held-out labeled training data (downsampled version) of all languages. The initial learning rate for Adam is tuned amongst {10^−5, 2×10^−5, 3×10^−5, 5×10^−5, 10^−4}. The batch size is tuned amongst {8, 16, 32, 64}. The Gaussian noise variance σ² is tuned amongst {0.001, 0.01, 0.1, 1.0}; we found that when σ² is larger than 0.01, the model collapses. This is reasonable, since too large a σ² introduces noise that the model itself cannot denoise. The temperature factor τ is tuned amongst {0.5, 0.6, 0.7, 0.8, 0.9}. The threshold ε is tuned amongst {0.5, 0.6, 0.7, 0.8, 0.9}. For both the teacher and the student network, we apply label smoothing to Eq. (2) and Eq. (4) with a smoothing factor of 0.15. We use 128 as the maximum sentence length for all datasets. The detailed hyperparameters are listed in Table 2.
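The label smoothing applied to the cross-entropy targets mixes each one-hot target with a uniform distribution over the |C| classes; a minimal sketch with the paper's factor of 0.15 as the default:

```python
def smooth_labels(one_hot, factor=0.15):
    """Replace a one-hot target y with (1 - factor) * y + factor / |C|,
    so the model is never asked to be 100% confident in a single class."""
    k = len(one_hot)
    return [(1.0 - factor) * y + factor / k for y in one_hot]
```

With 4 classes, the true class target becomes 0.85 + 0.15/4 = 0.8875 and each other class 0.0375, keeping the distribution normalized while discouraging overconfident predictions on noisy pseudo-labels.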

Baselines
We compare our model with different groups of baseline methods to verify the effectiveness.
• Fully-supervised. (i) mBERT (Single) fine-tunes an mBERT on the sampled labeled data of each language; (ii) mBERT (Multi) fine-tunes an mBERT on the sampled labeled data of all languages; (iii) mBERT (Full) uses the full labeled data of all languages to fine-tune an mBERT, which is usually regarded as the upper bound. (iv) BOND (hard/soft/soft-high) (Liang et al., 2020) employs a state-of-the-art TS framework of self-training with hard pseudo-labels, soft pseudo-labels (Xie et al., 2016), as well as the proposed soft pseudo-labels on selected high-confidence tokens. For a fair comparison, we use mBERT as the base encoder for all baselines.

Multilingual Academic Benchmarks
We present the main results on the multilingual academic datasets for open-domain NER and E2E-ABSA in Table 4 and Table 5, respectively. MetaTS outperforms all baselines on both sequence labeling tasks by a large margin (NER: +1.36% average gain over MT (MSE); E2E-ABSA: +2.26% average gain over BOND (soft-high)).
• Supervised: (i) Supervised baselines perform much worse than semi-supervised baselines. This demonstrates that even with mPLMs like mBERT, supervised learning cannot achieve satisfactory results in the low-data regime. (ii) mBERT (Multi) significantly beats mBERT (Single), which shows that the joint usage of labeled data from multiple languages is better than each monolingual model when supervision signals are insufficient.
• Semi-supervised: (i) By leveraging large unlabeled data, semi-supervised baselines obtain considerable improvements. (ii) Our proposed MetaTS method still outperforms these semi-supervised baselines built on the traditional TS framework. This indicates that our meta teacher-student learning paradigm extracts more useful signal from the unlabeled data, as it learns to adjust its pseudo-annotation strategies by taking advantage of the student's learning feedback.

Multilingual Industrial Dataset
We present the main results on the multilingual industrial dataset for E-commerce NER in Table 6. Compared with widely used benchmark datasets in academia, this industrial dataset, as illustrated in Table 3, is more challenging in terms of: (i) large label space: there are many more (13) entity types, making prediction significantly more difficult; (ii) high entity coverage: almost all tokens in a user query are tagged with a non-O tag (>90% coverage rate); in low-coverage datasets, high performance does not mean the model identifies entities well, due to the high proportion of O tags; (iii) short text: user queries are usually short and lack sufficient contextual information for context-dependent token-level prediction; (iv) data imbalance: the labeled data across languages is very skewed, closer to the real-world distribution of high-resource and low-resource languages; (v) large-scale data size: this dataset has much more data (about 700k examples) than existing public datasets. Despite these challenges, MetaTS still achieves significant improvements over all baseline methods on most languages. This provides more convincing evidence that MetaTS generates higher-quality pseudo-labels, even for short-text data with a large label space, via the meta teacher-student learning paradigm.

Ablation Results
To verify the efficacy of each component, we compare MetaTS with its ablation variants in Table 7. w/ L_reg vs. w/o L_reg: For MetaTS w/o L_reg, we remove the regularization loss L_reg on the unlabeled multilingual data for the teacher. We observe remarkable performance drops on all three datasets. This indicates that the teacher works better when jointly trained with auxiliary tasks such as virtual data augmentation, since it enhances the teacher's prediction confidence on the unlabeled data. w/ L_meta vs. w/o L_meta: For MetaTS w/o L_meta, we remove the meta loss L_meta for the teacher. That is, we discard the instant feedback from the student on the generated pseudo-labels, so the teacher cannot dynamically adjust its pseudo-annotation strategy. As expected, MetaTS w/o L_meta shows significant degradation. We can also conclude that the meta-learning loss contributes more to the performance improvements. Hard labels vs. Soft labels: Compared with utilizing hard pseudo-labels to teach the student, we observe that soft pseudo-labels (Xie et al., 2016) substantially hurt the model performance and slow down convergence, even after high-confidence selection is introduced. A similar observation was made in a prior study (Kumar et al., 2020). We hypothesize that such performance drops may be attributed to soft pseudo-labels being noisier than sharpened hard pseudo-labels in meta-learning.

Impact of Labeled-Unlabeled Ratio
To investigate the effect of the labeled-unlabeled data ratio, we vary the labeled proportion of each language's training set and compare MetaTS with mBERT (Multi), MT (MSE), and BOND (soft-high). We use the average span-level micro F1 score over all languages of the multilingual E2E-ABSA dataset and change the labeled proportion from 0.1, 0.2, 0.3, 0.4 to 0.5. Since the remaining training data is treated as unlabeled data, the corresponding labeled-unlabeled ratios are 1:9, 1:4, 3:7, 2:3, and 1:1. As shown in Figure 2, the gap between MetaTS and all baseline methods grows as the labeled-unlabeled ratio shrinks. The semi-supervised baselines MT (MSE) and BOND (soft-high) show marginal improvements over the supervised learning method mBERT (Multi) and even perform worse when the labeled set becomes large. This verifies that MetaTS is much less sensitive to the drop in the labeled proportion for each language, by making effective use of the large amounts of multilingual unlabeled data.

Pseudo-Labeling Visualization
To qualitatively demonstrate that MetaTS generates better token-level pseudo-labels involving complicated dependency relations, we visualize the ground-truth labels, the self-training (BOND) pseudo-labels, and our MetaTS pseudo-labels for the three datasets. As illustrated in Table 8, we show only English cases for ease of understanding, although we observe the same consistent advantages in many other languages (quantitatively verified by the main results in Section 5.2).
As we can see, traditional teacher-student frameworks with self-training cannot handle token pseudo-labeling in complicated contexts, including (1) ambiguous entities: entities with ambiguous semantics can denote different types depending on their surrounding contexts. For example, in the open-domain NER, self-training usually confuses organization (ORG) with location (LOC), misclassifying Republic of Ireland (Case#1) as LOC due to the location word "Ireland".
In the E-commerce NER, half and half (Case#6) describes the visual features of wigs rather than the size; (2) entities in a transition context: the entities before and after the transition may have contrastive meanings. For example, the user expresses a positive sentiment towards atmoshpere but a negative sentiment towards food (Case#3); (3) high entity coverage: most tokens in the sentence are true entities rather than O. For example, in Case#2 and Case#5, self-training cannot identify the correct types for all occurring entities; (4) missing entities: self-training may fail to capture entities such as protions in Case#4. In contrast, our proposed MetaTS is more robust to these challenges, thanks to the meta teacher-student learning paradigm that adjusts the teacher's pseudo-labeling strategy according to the student's instant feedback.
Related Works

Multilingual Sequence Labeling
Most recent works on multilingual sequence labeling focus on improving cross-lingual transferability across languages (Täckström, 2012; Fang et al., 2017; Enghoff et al., 2018; Xie et al., 2018; Rahimi et al., 2019; Johnson et al., 2019; Wu et al., 2020a,b,c; Li et al., 2020a). Cross-lingual transfer (Li et al., 2020b) aims to leverage knowledge from source languages to improve performance only in target languages, emphasizing how to reduce language distribution gaps given the lack of labeled data for the target languages. Besides, each target language usually requires training an individual model, which is particularly resource-consuming. On the contrary, our goal is to improve the performance of all languages using a unified model. Only a few studies have explored building a unified multilingual model, with enough labeled data, to handle multiple languages. Different from those, we explore a motivated and challenging multilingual setting with minimal supervision.

Meta Learning
Inspired by human beings' ability to adapt and transfer knowledge from previous tasks, meta learning (Finn et al., 2017; Nichol et al., 2018; Pham et al., 2020; Yao et al., 2019, 2021) has been applied to low-resource NLP, such as text classification (Wu et al., 2019; Geng et al., 2019, 2020; Bao et al., 2020), relation classification (Han et al., 2018; Gao et al., 2019; Obamuyide and Vlachos, 2019), slot tagging (Hou et al., 2020), event detection (Deng et al., 2020), and natural language understanding (NLU) (Dou et al., 2019). Considering multilingualism, only a few works have explored meta learning to improve the cross-lingual transferability of low-resource languages, e.g., text classification (Li et al., 2020b), NLU (Nooralahzadeh et al., 2020), NER (Wu et al., 2020c), and machine translation (Gu et al., 2018). On the contrary, our ultimate goal is to utilize meta learning to better leverage multilingual unlabeled data to boost the performance of all languages. Our work is inspired by meta-policies for teaching mechanisms (Fan et al., 2018; Pham et al., 2020), which focus only on instance-level image classification tasks and rely on a single feedback signal from the student. Besides, the success of these two works is conditioned on additional techniques such as data augmentation for images, which is tedious and almost infeasible in challenging NLP tasks, especially multilingual sequence labeling.

Conclusion
The effectiveness of supervised methods for low-resource multilingual sequence labeling is limited by data scarcity. To tackle this challenge, we propose a novel MetaTS method to enhance the teacher-student framework of self-training, which leverages the student's feedback on multilingual token-level pseudo-labels to adjust the teacher's pseudo-annotation strategies. Extensive evaluations on both public academic benchmarks and a large-scale industrial dataset quantitatively and qualitatively demonstrate the effectiveness of MetaTS. In the future, the proposed MetaTS method can potentially be applied to multilingual natural language understanding (XLU) tasks (Hu et al., 2020) and generalized to multi-task learning problems.