Learning from Miscellaneous Other-Class Words for Few-shot Named Entity Recognition

Few-shot Named Entity Recognition (NER) exploits only a handful of annotations to identify and classify named entity mentions. The prototypical network shows superior performance on few-shot NER. However, existing prototypical methods fail to differentiate the rich semantics in other-class words, which aggravates overfitting under few-shot scenarios. To address the issue, we propose a novel model, Mining Undefined Classes from Other-class (MUCO), that can automatically induce different undefined classes from the other class to improve few-shot NER. With these extra-labeled undefined classes, our method improves the discriminative ability of the NER classifier and enhances the understanding of predefined classes with stand-by semantic knowledge. Experimental results demonstrate that our model outperforms five state-of-the-art models in both 1-shot and 5-shot settings on four NER benchmarks. The source code is released at https://github.com/shuaiwa16/OtherClassNER.git.


Introduction
Named Entity Recognition (NER) seeks to locate and classify named entities in sentences into predefined classes (Yadav and Bethard, 2019). Humans can immediately recognize new entity types given just one or a few examples (Lake et al., 2015). Although neural NER networks have achieved superior performance when provided with large-scale training examples, it remains a non-trivial task to learn from limited new samples, also known as few-shot NER (Fritzler et al., 2019).
Traditional NER models, such as LSTM+CRF (Lample et al., 2016), fail in few-shot settings: they calculate the transition probability matrix from statistics, which requires a large amount of data for optimization. Recently, the prototypical network (Snell et al., 2017) has shown potential for few-shot NER. The basic idea is to learn prototypes for each predefined entity class and an other class, then classify examples based on which prototypes they are closest to (Fritzler et al., 2019). Most existing studies focus on the predefined classes and leverage label semantics to reveal their dependencies for enhancement. However, they ignore the massive semantics hidden in the words of the other class (O class for short).
In this paper, we propose to learn from O-class words, rather than using only the predefined entity classes, to improve few-shot NER. In fact, the O class contains rich semantics and can provide stand-by knowledge for named entity identification and disambiguation. As shown in Figure 1(a), if we can detect an undefined class consisting of references to named entities (such as pronouns), then, due to their interchangeability (Katz and Fodor, 1963), we obtain prior knowledge for named entity identification. For example, Newton can be replaced with he or professor in S2 and S3. If we can detect additional classes including he and professor, we will have more evidence about where Newton may appear. In addition, if we can detect an undefined class composed of Action (O1), we may capture underlying relations between different named entities, which is important evidence when distinguishing the named entity type (Ghosh et al., 2016; Zheng et al., 2017).
Nevertheless, it is challenging to detect related undefined classes among O-class words, for two reasons: 1) Miscellaneous Semantics. The O class contains miscellaneous types of words. Based on our observations, although there are many related yet undefined classes, the noise may be even greater, e.g., function and stop words. These noisy classes have little or even negative impact on the identification of target entities. Therefore, distinguishing noise from task-related classes is a key point. 2) Lack of Golden Labels. We have neither labeled examples nor metadata for each undefined class. Zero-shot methods (Pushp and Srivastava, 2017) fail in this case, since they need metadata (such as class names and class descriptions) as known information. Unsupervised clustering methods also cannot meet the quality requirements, as shown in our experiments.
To handle these issues, we propose the Mining Undefined Classes from Other-class (MUCO) model, which leverages the rich semantics to improve few-shot NER. Instead of a single prototype, we learn multiple prototypes to represent the miscellaneous semantics of the O class. Figure 1(b) shows the difference between our method and previous methods. To distinguish task-related undefined classes without annotations, we leverage weakly supervised signals from the predefined classes and propose a zero-shot classification method called Zero-shot Miner. The main idea is inspired by transfer learning in the prototypical network: the prototypical network can be quickly adapted to a new class B when pre-trained on a related base class A. The underlying reason is that if two classes A and B are task-related, then when we push the examples of class A to cluster in the space, the examples of class B also tend to cluster, even without explicit supervision on class B (Koch et al., 2015). Based on this phenomenon, we first perform prototype learning on the predefined classes to cluster their words, and then regard O-class words that also tend to cluster as undefined classes. Specifically, we train a binary classifier to judge whether clustering occurs between any two words. After that, we label the found undefined classes back into the sentences to jointly recognize predefined and undefined classes for knowledge transfer. Our contributions can be summarized as follows:

• We propose a novel approach, MUCO, to leverage the rich semantics in the O class to improve few-shot NER. To the best of our knowledge, this is the first work exploring the O class in this task.
• We propose a novel zero-shot classification method for undefined-class detection. In the absence of labeled examples and metadata, our method makes creative use of the weakly supervised signal of the predefined classes to find undefined classes.
• We conduct extensive experiments on four benchmarks, comparing against five state-of-the-art baselines. The results under both 1-shot and 5-shot settings demonstrate the effectiveness of MUCO. Further studies show that our method can also be conveniently adapted to other domains.

Related Work
Few-shot NER aims to recognize new categories with just a handful of examples (Feng et al., 2018; Cao et al., 2019). Four groups of methods are adopted to handle the low-resource issue: knowledge-enhanced, cross-lingual enhanced, cross-domain enhanced, and active learning. Knowledge-enhanced methods exploit ontologies, knowledge bases or heuristic labeling (Fries et al., 2017; Tsai and Salakhutdinov, 2017; Ma et al., 2016) as side information to improve NER performance in limited-data settings, but suffer from low knowledge coverage. Cross-lingual (Feng et al., 2018; Rahimi et al., 2019) and cross-domain enhanced methods (Wang et al., 2018) respectively use labeled data from a counterpart language or a different domain as external supervised signals to avoid overfitting. When the language or domain discrepancy is large, these two methods inevitably face performance degradation (Huang et al., 2017). Active learning methods (Wei et al., 2019) explicitly expand the corpus by selecting the most informative examples for manual annotation, which requires extra human labor. Different from previous methods, we focus on mining the rich semantics in the O class to improve few-shot NER.

Prototypical Network
The prototypical network (Snell et al., 2017), initially proposed for image classification, has been successfully applied to sentence-level classification tasks, such as text classification and relation extraction (Gao et al., 2019).

Methodology

Figure 2 illustrates the architecture of the proposed MUCO model. MUCO is composed of two main modules: Undefined Classes Detection detects multiple undefined classes hidden in the O class to fully exploit its rich semantics; Joint Classification jointly classifies the undefined and predefined classes, so as to leverage the stand-by semantic knowledge in the undefined classes to enhance the understanding of the predefined classes.

Notation
In few-shot NER, we are given training examples $D_c$ for the set of predefined classes $C$, together with examples $D_o$ whose queried words fall into the O class. For each example $(x, y)$, $x$ is composed of $S$ and $w_j$, where $S = \langle w_1, w_2, \ldots, w_n \rangle$ stands for the sentence and $w_j$ is the queried named entity; $y$ is the class label of the queried named entity $w_j$. We denote the prototype of class $y$ as $p_y$ and the prototypes for all classes $C \cup O$ as $P = \{p_y \mid y \in C \cup O\}$. Formally, our goal is first to detect multiple undefined classes $O = \{o_1, o_2, \ldots, o_r\}$ to label the examples in $D_o$, and then to maximize the prediction probability $P(y|x)$ on $D_c$ and $D_o$.

Undefined Classes Detection
In few-shot NER, most of the words in a sentence belong to the O class. Different from the predefined classes, the O class means none-of-the-above and contains multiple undefined entity types. Previous methods ignore the fine-grained semantic information in the O class and simply regard O as a normal class. We argue that the O class should be further decoupled into multiple undefined classes to fully exploit the rich semantics hidden within it.
In this section, we aim to detect undefined classes in the O class. It is a non-trivial task, since we lack metadata and golden labels to help us distinguish undefined classes. What is worse, the examples from the O class are numerous and the search space is large. To handle the issue, we propose a zero-shot classification method called Zero-shot Miner that leverages weak supervision from the predefined classes for undefined-class detection. Our method is inspired by transfer learning: we argue that if an undefined class is task-related, then when we push the examples in the predefined classes to cluster in the space, the examples in the undefined class should also show signs of gathering, even without explicit supervision (Koch et al., 2015). For instance, in Figure 2, if we guide Emeneya and Newton (the green points 1, 3) to cluster in the space, professor and He (the grey points 9, 12) will also tend to cluster.
Based on this argument, undefined-class detection can be achieved by finding multiple groups of examples in the O class that tend to cluster while the prototypical network is trained on the predefined classes. As shown in Figure 2, our zero-shot classification method has three steps. In step 1, we train the prototypical network on the predefined classes to obtain the learned mapping function; through the learned mapping function, examples belonging to the same class cluster in the space. In step 2, we train a binary group classifier on the predefined classes, based on the position features from the learned and unlearned mapping functions, to judge whether any two points tended to cluster during the step-1 training. In step 3, we use the learned binary group classifier from step 2 to infer over examples in the O class and distinguish the undefined classes from each other. The following subsections illustrate the three steps sequentially.

Figure 2: The architecture of the proposed MUCO model. We first detect undefined classes from the O class, and then jointly classify the predefined classes and the found undefined classes for knowledge transfer. Specifically, in undefined classes detection, we propose a zero-shot classification method with three steps. In step 1, we learn a mapping function through prototypical network training on the predefined classes. In step 2, we learn a binary group classifier to judge whether any two points in the predefined classes tend to cluster during the step-1 training. In step 3, we use the binary group classifier to infer pairs of examples in the O class to distinguish multiple undefined classes.

Step 1: Mapping Function Learning
In the prototypical network, the mapping function $f_\theta(x)$ maps the example $x$ to a hidden representation. BERT is adopted as the mapping function in our model; it is a pre-trained language representation model that employs multi-head attention as the basic unit and has superior representation ability (Geng et al., 2019). We train the mapping function to correctly distinguish the predefined classes. First, we extract the feature of the queried word. Formally, given a training example $(x, y) \in D_c$, where $x$ is composed of the sentence $S = \langle w_1, w_2, \ldots, w_n \rangle$ and the queried word $w_j$, we extract the $j$-th representation of the sequence output of the last BERT layer as the hidden representation.
Then, following (Qi et al., 2018), we randomly initialize the prototype $p_y$ of class $y$ at the beginning of training, and then shorten the distance between the examples of class $y$ and the prototype $p_y$ during training. Compared to traditional prototypical learning (Snell et al., 2017), we do not need to waste part of the examples on prototype calculation. The similarity is measured as $d(x, p_y) = f_\theta(x) \cdot p_y$, where $f_\theta(x)$ and $p_y$ are first normalized by L2 normalization.
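As a concrete illustration, the normalized similarity above can be sketched in a few lines of numpy (the function name is ours, not from the paper's released code):

```python
import numpy as np

def proto_similarity(f_x, p_y):
    """Similarity d(x, p_y): L2-normalize both the mapped example
    f_theta(x) and the prototype p_y, then take their inner product
    (i.e., cosine similarity)."""
    f_x = f_x / np.linalg.norm(f_x)
    p_y = p_y / np.linalg.norm(p_y)
    return float(np.dot(f_x, p_y))
```

Because of the normalization, the value always lies in [-1, 1], which is what later motivates the trainable scale factor.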
The final optimization goal for training the mapping function is

$\mathcal{L}_{map} = -\log \dfrac{\exp(d(x, p_y))}{\sum_{p_c \in P_c} \exp(d(x, p_c))},$

where $P_c = \{p_c \mid c \in C\}$ stands for the prototypes of all the predefined classes.
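The loss above is an ordinary softmax cross-entropy over prototype similarities; a minimal numpy sketch (assuming the inner-product similarity defined earlier, with our own function name) is:

```python
import numpy as np

def proto_loss(f_x, prototypes, target):
    """Cross-entropy over prototype similarities: softmax of d(x, p_c)
    across all predefined-class prototypes, then the negative log
    probability of the target class."""
    f_x = f_x / np.linalg.norm(f_x)
    P = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    logits = P @ f_x                      # one similarity per prototype
    logits -= logits.max()                # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return float(-np.log(probs[target]))
```

Minimizing this loss pulls each example toward its own class prototype and away from the others.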

Step 2: Binary Group Classifier Training
Recall that to detect multiple undefined classes, we need to find multiple example groups, where the examples in each group have a tendency to cluster.
To handle the issue, we learn a binary group classifier on the predefined classes. The main idea is that if we can determine whether any two examples belong to the same group, we can distinguish the groups from each other. Formally, given a pair of examples $(x_i, y_i)$ and $(x_j, y_j)$ in $D_c$, their original positions $h_i, h_j$ from the unlearned mapping function $f_\theta(x)$, and their after-training positions $\tilde{h}_i, \tilde{h}_j$ from the learned mapping function $\tilde{f}_\theta(x)$, the probability $b_{ij}$ of $x_i$ and $x_j$ belonging to the same class is predicted by the binary group classifier from the concatenated position features $[h_i; h_j; \tilde{h}_i; \tilde{h}_j]$. By comparing the distance variation between the original positions $h$ and the after-training positions $\tilde{h}$, the classifier can tell whether aggregation occurs between any two points. The optimization goal of the binary group classifier is the binary cross-entropy

$\mathcal{L}_{group} = -\sum_{i=1}^{N}\sum_{j=1}^{N} \big[ y_{ij} \log b_{ij} + (1 - y_{ij}) \log (1 - b_{ij}) \big],$

where $N$ is the number of examples in the predefined classes and $y_{ij}$ is the label: if $x_i$ and $x_j$ are from the same predefined class ($y_i = y_j$), $y_{ij}$ is 1, otherwise 0.
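The key signal the classifier relies on is how the pairwise distance changes between the unlearned and the learned mapping function. A minimal sketch of such position features (our own illustrative function, not the paper's exact classifier input) is:

```python
import numpy as np

def pair_features(h_i, h_j, ht_i, ht_j):
    """Position features for a pair of examples: their distance before
    prototype training (h), after it (h~), and the change. If training
    pulled the pair together, the distance shrinks and the change is
    positive."""
    d_before = np.linalg.norm(h_i - h_j)
    d_after = np.linalg.norm(ht_i - ht_j)
    return np.array([d_before, d_after, d_before - d_after])
```

Feeding such features to any binary classifier lets it learn that a shrinking distance indicates "same group".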

Step 3: Binary Group Classifier Inference
After training, we feed each pair of examples $x_u$ and $x_v$ in $D_o$ to the binary group classifier to obtain the group-dividing results. The output $b_{uv}$ indicates the confidence that $x_u$ and $x_v$ belong to the same group. We set a threshold to divide the groups: if $b_{uv}$ is larger than the threshold $\gamma$, $x_u$ and $x_v$ belong to the same group (undefined class). If consecutive words belong to the same group, we treat them as one multi-word entity. Note that some examples in the O class may not belong to any group. We assume that these examples come from task-irrelevant classes, and no further classification is made for them.
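Thresholding pairwise confidences and reading off the resulting groups can be sketched with a small union-find over the pair scores (a sketch under our own assumptions; the paper does not specify the exact grouping algorithm):

```python
def group_by_threshold(scores, n, gamma=0.68):
    """Group O-class examples: connect every pair (u, v) whose classifier
    confidence b_uv exceeds gamma, then read off connected components as
    undefined classes. `scores` maps (u, v) index pairs to confidences."""
    parent = list(range(n))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path compression
            a = parent[a]
        return a

    for (u, v), b in scores.items():
        if b > gamma:
            parent[find(u)] = find(v)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    # singletons correspond to task-irrelevant words and are dropped
    return [g for g in groups.values() if len(g) > 1]
```

Note that transitivity is implied: if (0, 1) and (1, 2) both score above the threshold, all three land in one group.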
Soft Labeling. After group dividing, we obtain labels for multiple undefined classes $O = \{o_1, o_2, \ldots, o_r\}$. We further adopt a soft-labeling mechanism. For each undefined class $o_i$, we calculate the mean of its examples as the class center, then apply softmax over the cosine similarity between each example and its class center to obtain soft labels. Through soft labeling, we can take into account how likely each example is to belong to its undefined class.
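The soft-labeling step described above can be sketched in numpy as follows (our own illustrative function; the released code may differ in detail):

```python
import numpy as np

def soft_labels(examples, assignments, n_classes):
    """Soft labels for undefined classes: each class center is the mean of
    its assigned examples; each example then gets a softmax over its cosine
    similarity to every class center."""
    centers = np.stack([examples[assignments == c].mean(axis=0)
                        for c in range(n_classes)])
    ex = examples / np.linalg.norm(examples, axis=1, keepdims=True)
    ce = centers / np.linalg.norm(centers, axis=1, keepdims=True)
    sims = ex @ ce.T                                  # cosine similarities
    e = np.exp(sims - sims.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)           # rows sum to 1
```

An example close to its own center receives most of the probability mass, while borderline words get a flatter distribution.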

Joint Classification
In this section, we take into consideration both the predefined classes C and the found undefined classes O for joint classification. First, we label the examples of the undefined classes back into the sentences, as shown in the Joint Classification part of Figure 2. Then, we optimize the examples to make them closer to their corresponding prototypes for better discrimination. Compared to Equation 3, we add the prototypes of the O class, $P_o = \{p_{o_1}, p_{o_2}, \ldots, p_{o_r}\}$, as candidate prototypes.
Formally, given the examples $(x, y) \in D_c \cup D_o$, the corresponding prototype $p_y$ and the prototype set $P = P_c \cup P_o$ from both the predefined classes C and the undefined classes O, the optimization objective is defined as:

$\mathcal{L}_{joint} = -\log \dfrac{\exp(d(x, p_y))}{\sum_{p \in P} \exp(d(x, p))}.$

Scale Factor. When calculating $d(x, p_y)$, $f_\theta(x)$ and $p_y$ have been normalized, so the value is limited to [-1, 1]. When softmax activation is applied, the output is unable to approach the one-hot encoding, which imposes a lower bound on the cross-entropy loss (Qi et al., 2018). For instance, even if we give the golden prediction, i.e., 1 for the correct category and -1 for the wrong ones, the output probability $p(y|x) = e^1 / [e^1 + (|C \cup O| - 1)e^{-1}]$ is still unable to reach 1. The problem becomes more severe as we increase the number of named entity categories by introducing more classes for the O class. To alleviate the issue, we modify Eq. 6 by adding a trainable scalar $s$, shared across all classes, to scale the inner product, i.e., replacing $d(x, p)$ with $s \cdot d(x, p)$.
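The effect of the scale factor is easy to verify numerically: with similarities bounded in [-1, 1], a plain softmax cannot approach one-hot, while scaling by a larger s can. A minimal sketch:

```python
import numpy as np

def scaled_softmax(sims, s=10.0):
    """Softmax over prototype similarities scaled by a shared scalar s.
    With sims restricted to [-1, 1], plain softmax (s=1) has a capped
    maximum probability; a larger s widens the logit range so the correct
    class's probability can approach 1."""
    z = s * np.asarray(sims, dtype=float)
    e = np.exp(z - z.max())               # numerically stable softmax
    return e / e.sum()
```

For the "golden prediction" example from the text (similarity 1 for the correct class, -1 for two wrong ones), s=1 caps the correct-class probability well below 1, while s=10 pushes it above 0.99.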

Implementation Details
Following the traditional prototypical network (Snell et al., 2017), we pre-train the model on several base classes, whose types are disjoint from the few-shot classes and which have abundant labeled corpora. The underlying idea is to leverage existing fully annotated classes to improve the performance of the model on new classes with only a few annotations. All predefined classes (both base classes and few-shot classes) are used when searching for undefined classes, so that the annotations of undefined classes can be shared between pre-training and fine-tuning, which improves the transfer performance of our model.

Datasets
We conduct experiments on multiple datasets to reduce dataset bias, including three English benchmarks, Conll2003 (Sang and De Meulder, 2003), re3d (Defence Science and Technology Laboratory, 2017) and Ontonotes5.0 (Pradhan et al., 2013), and one Chinese benchmark, CLUENER2020. Conll2003 contains 20,679 labeled sentences distributed over 4 classes in the News domain. The data in re3d come from the defense and security domain, with 10 classes and 962 labeled sentences.
Ontonotes5.0 has 17 classes with 159,615 labeled sentences in mixed domains (News, BN, BC, Web and Tele). CLUENER2020 has 10 fine-grained entity types with 12,091 annotated sentences. For all datasets, we adopt BIO (Beginning, Inside, Outside) labeling, which introduces an extra O class for non-entity words.

Data Split
We divide the classes of each benchmark into two parts: base classes and few-shot classes. The few-shot classes for Conll / re3d / Ontonote / CLUENER are Person / Person, Nationality, Weapon / Person, Language, Money, Percent, Norp / Game, Government, Name, Scene, respectively. The rest are the base classes. The division is based on the average word similarity among classes (mean similarity is reported in Appendix A). Each time, the class with the largest semantic difference from the other classes is selected and added to the few-shot classes, until the number of few-shot classes reaches 1/3 of that of the base classes. In this way, we prevent the few-shot classes and base classes from being too similar, which would lead to information leakage. We do not follow previous methods that adopt different datasets as base and few-shot classes, because such splits contain overlapping classes, such as Person, which reduces the difficulty of the few-shot setting. For the base classes, all examples are used to train the base classifier. For the few-shot classes, only K examples are used for training, and the rest are used for testing. We adopt the N-way K-shot setting for the few-shot classes, where N is the number of few-shot classes and K is the number of examples sampled from each few-shot class. K is set to 1 and 5 respectively in our experiments. Note that we cannot guarantee the number of examples is exactly K when sampling, because one sentence may carry multiple class labels. Following (Fritzler et al., 2019), we ensure there are at least K labels for each few-shot class.
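Because a sentence can carry several entity labels at once, the "at least K labels per class" constraint is most naturally met by greedy sentence-level sampling. A minimal sketch of such a sampler (our own illustrative helper, not the paper's exact procedure):

```python
import random

def sample_support(examples, few_shot_classes, k):
    """Greedy N-way K-shot sampling at sentence level: keep adding
    sentences until every few-shot class has at least k labels. A single
    sentence may contribute labels to several classes, so per-class counts
    can exceed k."""
    counts = {c: 0 for c in few_shot_classes}
    support = []
    pool = list(examples)
    random.shuffle(pool)
    for sent, labels in pool:
        # only add the sentence if it helps some under-filled class
        if any(c in counts and counts[c] < k for c in labels):
            support.append((sent, labels))
            for c in labels:
                if c in counts:
                    counts[c] += 1
        if all(v >= k for v in counts.values()):
            break
    return support
```

Sentences whose labels are all outside the few-shot classes (e.g. pure O-class sentences) are skipped, keeping the support set tight.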

Evaluation Metrics
Following previous work, we measure precision, recall, and macro-averaged F1 scores over all few-shot classes. For fair comparison with the baselines, as long as a found undefined class is classified as the O class, the prediction is considered correct. We report the average over ten runs as the final results.
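For reference, macro-averaged F1 gives every few-shot class equal weight regardless of its frequency; a minimal sketch of the computation:

```python
def macro_f1(golds, preds, classes):
    """Macro-averaged F1: per-class precision and recall from exact label
    matches, combined into per-class F1, then averaged with equal class
    weight."""
    f1s = []
    for c in classes:
        tp = sum(1 for g, p in zip(golds, preds) if g == c and p == c)
        fp = sum(1 for g, p in zip(golds, preds) if g != c and p == c)
        fn = sum(1 for g, p in zip(golds, preds) if g == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)
```

This is the token/label-level view; span-level NER evaluation additionally requires entity boundaries to match exactly.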

Hyperparameters
For feature extraction, we adopt BERT-base as our backbone, which has 12-head attention layers and a 768-dimensional hidden embedding. For the learning rate, we adopt greedy search in the range of 1e-6 to 2e-4. We set the learning rate to 2e-5 when pre-training on base classes and 5e-6 when fine-tuning on few-shot classes. The threshold γ is set to 0.68 to ensure that the found undefined classes are sufficiently relevant to the predefined classes. The batch size is 128 and the maximum sequence length is 128. We set the scale factor in Eq. 7 to 10 at the beginning of training. Our code is implemented in TensorFlow, and all models fit into a single V100 GPU with 32GB memory. The training procedure lasts a few hours; the best result appears around epoch 100.

Baselines
We divide the baselines into two categories: 1) Supervised-Only Methods. BERT uses the pre-trained BERT model to sequentially label words in a sentence (Devlin et al., 2018). Prototypical network (PN) learns a metric space for each class (Snell et al., 2017). Both methods are trained only on the few-shot classes. 2) Few-shot Methods. L-TapNet+CDT (LTC) uses semantic associations between base and few-shot classes to improve prototype quality, and is only trained on base classes; we use the originally published code. Warm Prototypical Network (WPN) (Fritzler et al., 2019) is the transfer-learning version of PN, which is first pre-trained on base classes and then fine-tuned on few-shot classes. MAML first learns fast-adapted parameters on base classes and then fine-tunes them on few-shot classes (Finn et al., 2017). Tables 1 and 2 present the overall performance of the proposed approach on four NER benchmarks: Conll2003, re3d, Ontonotes5.0 and CLUENER2020. MUCO (ours) consistently outperforms state-of-the-art models, showing the effectiveness of exploiting the rich semantics in the O class and the superiority of the proposed MUCO model.

Overall Performance
Compared with the supervised-only methods (BERT and PN), the few-shot methods (TransferBERT, WPN, MAML, L-TapNet+CDT and MUCO (ours)) achieve better performance. By first training on base classes, these methods learn a prior, which prevents overfitting to densely labeled words. Among the few-shot methods, our model achieves the best performance. Previous methods regard O as a single class. On the contrary, we induce different undefined classes from the O class and add more task-related classes for joint training, which directly tackles the scarcity of data in few-shot learning and provides stand-by semantics to identify and disambiguate named entities, thereby improving the performance of few-shot NER. Whether on the English corpora (the first three) or the Chinese corpus (the last one), our method consistently improves the F1 score, showing its language-independent superiority. Task-agnostic superiority is also shown in Section 4.10. Our undefined-classes detection method is completely data-driven: the found undefined classes are automatically adjusted to be useful and task-related, based on the predefined classes of the current language or task.
To further evaluate our core module, undefined classes detection (Section 3.2), we introduce a Word-Similarity (WS) baseline. WS detects undefined classes by performing KMeans (Kanungo et al., 2002) on O-class words based on word similarity.
To be fair, WS, like our method, uses soft-label enhancement (Section 3.2.2). We report the final few-shot NER performance on Ontonotes5.0 for comparison. As shown in Figure 3, our method achieves better performance, which shows the superiority of our undefined classes detection module. The word-similarity baseline only uses the semantics of words and lacks weak supervision from the predefined classes, so noisy classes (such as punctuation) cannot be distinguished from task-related ones, which inevitably reduces the quality of the undefined classes.

Quality of Found Undefined Classes
In this section, we evaluate the quality of the found undefined classes from both quantitative and qualitative perspectives. All the following experiments are conducted on Ontonotes5.0.
For the quantitative analysis, we invite three computer engineers to manually label 100 sentences for human evaluation. The metrics are Intra-class Correlation (IC) and Inter-class Distinction (ID). IC measures how many labeled words actually belong to the declared class. ID counts how many labels belong to only one of the undefined classes rather than to multiple classes. We obtain golden labels by applying the majority-vote rule. Table 3 reports the average results over the undefined classes. Considering the zero-shot setting, accuracies of 49.15% and 50.85% are sufficiently high, indicating that the found undefined classes basically have semantic consistency within classes and semantic differences between classes.
For the qualitative analysis, we present a case study in Table 4. The words in O1, O2 and O3 are mainly the general-entity versions of Person, Location and Numerous, respectively. According to grammatical rules, general entities and named entities can be substituted for each other; Lincoln, for instance, can also be called president, so identifying general entities can provide additional positional knowledge and enhance named entity identification. The words in O4 and O5 are mainly Action, which may imply relations between different named entities and provide important evidence for named entity disambiguation (Tong et al., 2020). The errors mainly come from three sources: 1) surrounding words are incorrectly included, such as from in businessmen from in O1; 2) some strange words reduce intra-class consistency, such as was at the tail in O3; 3) there is semantic overlap between classes, such as between O4 and O5. Future work will explore how to improve the quality of the undefined classes.

Different Number of Undefined Classes
Since our model needs the number of undefined classes to be set manually, we observe the performance of the model under different settings. We set the number of undefined classes to 1/2/5/10/25/50 by adjusting the threshold γ. Figure 4 illustrates the F1 score of MUCO (ours) for various numbers of undefined classes. Performance degrades when the number is too large or too small. When the number is too large, the found classes overlap, resulting in severe performance degradation (-11.51%). When the number is too small, the model is unable to find enough task-related classes, limiting its ability to capture the fine-grained semantics in the O class. Empirically, we find that when the number of undefined classes is approximately equal to the number of few-shot classes, our method achieves the best performance (the number is 5 in Figure 4). We argue that the number of predefined classes is proportional to the amount of information hidden in the weak supervision. Therefore, with more predefined classes, we can also find more high-quality undefined classes.

Cross-Domain Ability
In this section, we examine whether our model achieves superior performance in the face of domain discrepancy. To simulate a domain adaptation scenario, we choose the benchmark Conll2003 (Sang and De Meulder, 2003) as the source domain and AnEM (Ohta et al., 2012) as the target domain. The entity types in AnEM, such as Pathological-Formation, are all medical academic terms, which ensures their discrepancy from the common classes in Conll2003. As illustrated in Table 5, our method achieves the best adaptation performance on the target domain. All predefined classes, in both the source and target domains, are used when detecting undefined classes. The annotations of undefined classes can thus be shared between pre-training and fine-tuning, which improves the transfer performance of our model.

Task-Agnostic Ability
Table 4: Annotated words in the found undefined classes.
O1: gentleman; journalist; president; ambassador; I; he; they; businessmen from; and those Huwei people who
O2: the harbour; this land, which; over the river; with the great outdoors; outsides; to nature; the skyline
O3: some; a major; the small number; supplied; not only one of the; empty; large; increase of; was at the tail
O4: believe; comfort; attacked or threatened; arrest; geared; talks; not dealing; discussions; agreement
O5: stop; have; do; discussion; take; seek; sat down; negotiated; think; failed; replace

In this section, we examine whether our assumption about the O class is task-agnostic and effective for few-shot token-level classification tasks other than NER. We conduct experiments on two tasks of widespread concern: Slot Tagging and Event Argument Extraction (Ahn, 2006). Slot Tagging aims to discover user intent in task-oriented dialogue systems. We adopt the Snips dataset (Coucke et al., 2018) for Slot Tagging; the train/test split is We,Mu,Pl,Bo,Se/Re,Cr. Event Argument Extraction aims to extract the main elements of an event from sentences. We adopt the ACE2005 dataset with 33 classes and 6 domains; the train/test split is bc,bn,cts,nw/un,wl. As illustrated in Table 6, the proposed model achieves superior performance on both tasks, which demonstrates the generalization ability of our method. No matter which task the predefined classes belong to, our method is always able to mine task-related classes from the O class to help eliminate the ambiguity of the predefined classes. The reason is that our detection method is entirely data-driven and does not rely on manually written undefined-class descriptions. The found categories automatically change according to the task type of the input predefined classes. Therefore, the cost of migrating our method between tasks is low.

Conclusion
In this paper, we propose Mining Undefined Classes from Other-class (MUCO) to utilize the rich semantics in the O class to improve few-shot NER. Specifically, we first leverage weakly supervised signals from the predefined classes to detect undefined classes in the O class. Then, we perform joint classification to exploit the stand-by semantic knowledge.