SMARTSPANNER: Making SPANNER Robust in Low Resource Scenarios


Introduction
NER is a fundamental information extraction task and plays an essential role in natural language processing applications such as information retrieval, question answering, machine translation and knowledge graphs (Liu et al., 2022). The goal of NER is to extract named entities (NEs) into predefined categories, such as person (PER), location (LOC), organization (ORG) and geo-political entity (GPE). With the rapid evolution of neural architectures (Hochreiter and Schmidhuber, 1997; Kalchbrenner et al., 2014; Vaswani et al., 2017) and large pretrained models (Devlin et al., 2019; Brown et al., 2020; Lewis et al., 2020), recent years have seen the paradigm shift of NER systems from sequence labeling (Chiu and Nichols, 2016; Luo et al., 2020; Lin et al., 2020) to span prediction (Xu et al., 2017; Mengge et al., 2020; Tan et al., 2020; Fu et al., 2021). It has been pointed out that nested entities are very common in NER (Finkel and Manning, 2009), i.e., a named entity can contain or embed other entities, as illustrated in Fig. 1.

Compared with SEQLAB, SPANNER has an obvious advantage: all candidate entities can be easily enumerated as different sub-sequences, which is straightforward for nested NER (Fu et al., 2021). However, our experiments reveal that the performance of SPANNER drops much more than that of SEQLAB when the amount of training data shrinks, which makes SPANNER difficult to apply in low resource scenarios. To mitigate this problem, we propose a novel method, SMARTSPANNER. By introducing a Named Entity Head (NEH) prediction task for each word in a given sentence, we perform multi-task learning together with the task of span classification for NER. (The NEH is the first word of a named entity; for example, "Carl" is the NEH of "Carl Dinnon" in Fig. 1. If a named entity has only one word, the NEH is itself.) We conduct experiments on both flat and nested standard benchmark datasets (CoNLL03, FEW-NERD, GENIA and ACE05). Experimental results demonstrate that SMARTSPANNER significantly improves robustness in low resource scenarios, as shown in Fig. 2. Our contributions are summarized as follows:

• To the best of our knowledge, we are the first to highlight the robustness problem of SPANNER in low resource scenarios.
• To mitigate this problem, we develop a simple and effective method named SMARTSPANNER. By introducing the task of NEH prediction, SMARTSPANNER achieves significant gains in low resource scenarios.
• We provide an in-depth analysis of the reasons for the strong robustness of the method SMARTSPANNER.

Related Work
There has been a long history of research involving NER (McCallum and Li, 2003). Traditional approaches are based on the Hidden Markov Model (HMM; Zhou and Su, 2002) or the Conditional Random Field (CRF; Lafferty et al., 2001). With the development of deep learning technology (Hinton and Salakhutdinov, 2006), SEQLAB methods such as LSTM-CRF (Huang et al., 2015) and BERT-LSTM-CRF (Devlin et al., 2019) achieve very promising results in the field of NER. However, these methods cannot directly handle nested structures because they can only assign one label to each token.
Since named entities are often nested (Finkel and Manning, 2009), various approaches for nested NER have been proposed in recent years (Wang et al., 2022; Shibuya and Hovy, 2020). One of the most representative directions is span-based methods, which recognize nested entities by classifying sub-sequences of a sentence (Xu et al., 2017; Mengge et al., 2020; Tan et al., 2020; Fu et al., 2021; Zaratiana et al., 2022; Weng and Zhang, 2023). SPANNER methods are naturally suitable for nested structures because nested entities can be easily detected as different sub-sequences. Although the strengths and weaknesses of SPANNER have been systematically investigated by Fu et al. (2021), its performance in low resource scenarios is not discussed. To the best of our knowledge, we are the first to highlight that the performance of SPANNER drops much more than that of SEQLAB when the amount of training data drops, which poses a challenge to the robustness of SPANNER in low resource scenarios. In order to address this challenge, we propose a novel method, SMARTSPANNER.
It should be noted that the low resource scenarios in this paper refer to those with at least 1,000 labeled sentences, which differ from the settings for few-shot NER (Huang et al., 2021; Ding et al., 2021) and are more common in real-world applications.

Problem Description
The commonly used NER standard benchmark dataset CoNLL03 (English) (Tjong Kim Sang and De Meulder, 2003) is selected to show the different sensitivities of the SEQLAB and SPANNER methods. 10%, 20%, 50% and 100% of the training data are used to train the models respectively, and the F1 scores on the test dataset are reported in Table 1.
From Table 1, although the performances of SEQLAB and SPANNER on the entire training data are almost the same, the F1 score of SPANNER drops much more than that of SEQLAB as the training data shrinks, i.e., the robustness of SPANNER in low resource scenarios needs to be greatly improved.

NEH for SPANNER
Given a sentence S = {w_1, ..., w_n} with n words, and a span (i, j) denoting the sub-sequence of S that starts with w_i and ends with w_j, w_i is an NEH if there is an NE span (i, j). For example, the words "Reporter_0", "Carl_1" and "Britain_4" in Fig. 1 are NEHs, and the other words are not.
From the above definition, we can conclude that w_i being an NEH is a necessary but not sufficient condition for span (i, j) to be an NE. That is to say, if w_i is not an NEH, the span (i, j) cannot be an NE. Therefore, if we introduce the NEH prediction task for SPANNER, the number of spans for semantic tag classification can be greatly reduced in both training and inference. For example, without the NEH prediction task, the number of spans in Fig. 1 (all sub-sequences of the sentence with 11 words) for SPANNER is 66. If the three NEHs in Fig. 1 are correctly predicted, the number of spans (the sub-sequences starting with the words "Reporter_0", "Carl_1" and "Britain_4") is reduced to 28. These are the basic ideas of our method, SMARTSPANNER. With the introduction of the NEH prediction task, the number of spans for semantic tag classification can be greatly reduced, so semantic tag classification in SMARTSPANNER is easier than in SPANNER. Meanwhile, the NEH prediction task is clearly easier than semantic tag classification (fewer categories and more balanced positive and negative samples). Therefore, with the help of the NEH prediction task, SMARTSPANNER reduces the difficulty of NER and is more robust in low resource scenarios.
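The reduction above can be checked with a short sketch, treating the sentence in Fig. 1 as 11 tokens with NEHs at positions 0, 1 and 4, as in the text:

```python
def count_spans(n, neh_positions=None):
    """Count candidate spans in an n-token sentence.
    If neh_positions is given, count only spans starting at an NEH."""
    starts = range(n) if neh_positions is None else neh_positions
    # a span starting at token i can end at any of tokens i..n-1
    return sum(n - i for i in starts)

total = count_spans(11)                # all sub-sequences: 66
filtered = count_spans(11, [0, 1, 4])  # NEH-starting spans only: 28
```

With perfect NEH prediction, only 28 of the 66 candidate spans reach the semantic tag classifier.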

SMARTSPANNER
An overview of our SMARTSPANNER method is shown in Fig. 3; it consists of two parts: NEH prediction and span classification. NEH prediction aims to predict whether a word is the first word of an entity, and span classification aims to classify spans into their corresponding semantic tags. The two parts are jointly trained under a multi-task learning framework with a shared encoder. The encoder is applied to a sentence to obtain contextual word representations, which are shared by the downstream NEH prediction and span classification tasks. In this work, the NEH prediction task is treated as a special type of span classification, where the span width is 1 and the number of classes is 2. The span classification task aggregates the span information for multi-class classification. During inference, only the words predicted as NEHs are used to generate candidate spans for span classification.

Encoder
Considering a sentence S with n tokenized words {w_i}_{i=1}^{n}, we convert the words to their contextual embeddings with a BERT (Devlin et al., 2019) encoder. We generate the input sequence by concatenating a [CLS] token, {w_i}_{i=1}^{n} and a [SEP] token, and use a series of L stacked Transformer blocks (TBs) to project the input to a sequence of contextual vectors {h_i}, i.e., the hidden states of the last Transformer block serve as the contextual word representations.

Span Classification
Considering a span (i, j), we use the boundary embeddings and the span length embedding to represent the span (Fu et al., 2021):

s_{i,j} = [h_i; h_j; z_{j−i+1}]

where z_{j−i+1} is the span length embedding, which is obtained from a learnable look-up table.
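A minimal sketch of this construction (our reading of the representation; the sizes — 768-dimensional hidden states for BERT-base and 50-dimensional length embeddings — follow the experiment settings reported later):

```python
import numpy as np

def span_representation(h, z_table, i, j):
    """Concatenate boundary embeddings h[i], h[j] with the span length
    embedding z_table[j - i + 1] (0-based token indices)."""
    return np.concatenate([h[i], h[j], z_table[j - i + 1]])

rng = np.random.default_rng(0)
h = rng.standard_normal((11, 768))       # contextual vectors from the encoder
z_table = rng.standard_normal((21, 50))  # length look-up table, max width 20
s = span_representation(h, z_table, 1, 2)  # span "Carl Dinnon"
```

The resulting vector has dimension 768 + 768 + 50 = 1586 and is fed to the classifier below.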
Next, we feed the span representation into a multi-layer perceptron (MLP) classifier and apply a softmax layer to obtain the probability P_s of each semantic tag.
Finally, we minimize the cross-entropy loss:

L_s = − Σ_{t=1}^{k} y_t log P_s(t)

where k is the number of semantic tags, and y_t is a label indicating whether the span (i, j) has tag t.
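This is a standard multi-class cross-entropy over the k tags, which can be sketched as:

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D logit vector."""
    e = np.exp(logits - logits.max())
    return e / e.sum()

def span_loss(logits, gold_tag):
    """Cross-entropy for one span: -log P_s(gold_tag)."""
    return -np.log(softmax(logits)[gold_tag])
```

For uniform logits over k tags the loss is log(k), the expected chance-level value.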

NEH Prediction
In this work, we treat NEH prediction as binary classification over the special span (i, i) (i.e., only the word w_i, 1 ≤ i ≤ n) in sentence S. The NEH probability P_h is obtained in the same way as in span classification (an MLP followed by a softmax over the two classes), and the corresponding cross-entropy loss L_h is minimized, where y_h denotes whether the word w_i is an NEH.
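Under the paper's "width-1 span with 2 classes" framing, NEH prediction reduces to the two-class case of the same scheme (a sketch; class 1 is taken to mean "is an NEH"):

```python
import numpy as np

def neh_loss(logits2, is_neh):
    """Cross-entropy for word w_i over the two classes,
    with class 1 meaning 'w_i is an NEH'."""
    e = np.exp(logits2 - logits2.max())
    p = e / e.sum()
    return -np.log(p[1] if is_neh else p[0])
```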

Joint Training and Inference
We jointly minimize a weighted combination of the two losses for training, where L_h and L_s are the losses of the NEH prediction task and the span classification task, and w is the hyper-parameter balancing the two tasks.
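The combination formula is elided in this version of the text; the sketch below assumes a simple convex combination L = w·L_h + (1 − w)·L_s, which is consistent with a single balancing hyper-parameter w, but the exact weighting used in the paper may differ:

```python
def joint_loss(l_h, l_s, w=0.2):
    """Assumed convex combination of NEH loss and span loss
    (w = 0.2 matches the setting reported in the experiments)."""
    return w * l_h + (1 - w) * l_s
```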
During inference, only the words predicted as NEHs are used to generate candidate spans for span classification.
Therefore, in order to keep the training and inference data distributions consistent, training the span classification model in SMARTSPANNER requires only a subset of the spans that are all needed in SPANNER. A selection strategy is designed for each span (i, j) in training, where rand is a randomly generated float number in [0, 1] and sp is a hyper-parameter serving as the selection threshold (0.05 in this paper).
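The selection rule itself is elided in this extraction; a plausible reconstruction, consistent with keeping the training distribution aligned with inference (where only NEH-starting spans are classified), is to always keep spans starting at a gold NEH and keep other spans only with probability sp:

```python
import random

def keep_for_training(starts_with_neh, sp=0.05, rng=random):
    """Assumed selection rule: keep span (i, j) if w_i is a gold NEH,
    otherwise keep it only when rand < sp."""
    return starts_with_neh or rng.random() < sp
```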
To compare the total numbers of training samples in SPANNER and SMARTSPANNER under this selection strategy, consider a sentence with n tokens and d named entities. We can compute the number of training samples generated by this sentence for SPANNER (N_sn) and for SMARTSPANNER (N_ssn) when the max span width is set to m. It should be noted that the training samples of NEH prediction are included in N_ssn (the first term n on the right-hand side). Supposing n = 100, d = 5, m = 10 and sp = 0.05, we have N_sn = 955 and N_ssn ≈ 197. This means the number of training samples is greatly reduced in SMARTSPANNER, and thus the training time is also greatly reduced. This is why we name the method "SMART".
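The arithmetic can be checked as follows. N_sn is exact; the formula for N_ssn is elided in this extraction, so `n_smartspanner` below is our reconstruction (n NEH samples, plus spans starting at the d NEHs, plus an sp-sampled fraction of the rest), which gives ≈195 rather than the paper's ≈197 — close, but the exact expression may differ:

```python
def n_spanner(n, m):
    """Number of spans of width <= m in an n-token sentence."""
    return sum(n - w + 1 for w in range(1, m + 1))  # = n*m - m*(m-1)/2

def n_smartspanner(n, d, m, sp):
    """Reconstructed (approximate) expected sample count for SMARTSPANNER."""
    neh_spans = d * m  # spans starting at one of the d NEHs (ignoring edges)
    return n + neh_spans + sp * (n_spanner(n, m) - neh_spans)

print(n_spanner(100, 10))  # 955, matching the paper
print(n_smartspanner(100, 5, 10, 0.05))
```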

Datasets
Four standard benchmark datasets, CoNLL03 English (Tjong Kim Sang and De Meulder, 2003), FEW-NERD (SUP) (Ding et al., 2021), GENIA (Kim et al., 2003) and ACE05 English, are selected for evaluation, where CoNLL03 and FEW-NERD are for flat NER and the others are for nested NER. The statistics of these datasets are shown in Table 2.
It should be pointed out that we follow Shibuya and Hovy (2020)'s preprocessing steps to split GENIA and ACE05 into train, development and test sets.

Experiment Settings
In our experiments, we implement the SPANNER and SMARTSPANNER methods based on the source code by Fu et al. (2021), where the pretrained model bert-base-uncased is used as the encoder. We also implement BERT-CRF as the SEQLAB method for comparison.
The training datasets for the low resource scenarios are constructed by applying random.shuffle to the entire set of training sentences and extracting the first 1,000, 2,000 or 5,000 sentences, where the random seeds in the 10 runs are set to 1, 2, 3, 4, 5, 6, 7, 8, 9 and 42 respectively. In addition, the random seed for the results in Table 1 is set to 42. The development datasets of CoNLL03, GENIA and ACE05 are used in full in the constructed low resource scenarios. As the development data of FEW-NERD is very large (18,824 sentences), we use its first 2,000 sentences as the development data when the training data contains 1,000, 2,000 or 5,000 sentences.
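The subset construction described above can be sketched as follows (the function name is ours; the seeded shuffle-then-truncate logic follows the description):

```python
import random

def low_resource_subset(sentences, k, seed):
    """Shuffle the full training set with a fixed seed and take the
    first k sentences, mirroring the construction described above."""
    data = list(sentences)
    random.Random(seed).shuffle(data)  # seeded random.shuffle
    return data[:k]

subset = low_resource_subset(range(20), 5, seed=42)
```

Fixing the seed makes each low resource split reproducible across runs.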
For the SPANNER and SMARTSPANNER methods, the embedding size of the span width is set to 50 and the max span width is set to 20. The MLP for span classification has two layers, which is the same as the setting of Fu et al. (2021). Considering that NEH classification is less challenging than span classification, we use a single-layer MLP for NEH classification.
The number of training epochs is set to 10, and the batch size is set to 16. During training, the SPANNER, SMARTSPANNER and SEQLAB models are optimized by AdamW (Loshchilov and Hutter, 2019) with the learning rate set to 0.00001 and a linear warmup scheduler. The hyper-parameters w for joint training and sp for span selection in SMARTSPANNER are set to 0.2 and 0.05 respectively. All models are trained on a single NVIDIA V100 GPU.

Main Results
The results of SPANNER, SMARTSPANNER and SEQLAB on the test datasets of CoNLL03, FEW-NERD, GENIA and ACE05 are reported in Table 3, which includes precision (P), recall (R) and F1 scores (SEQLAB only for the flat NER datasets, not the nested ones).
From the results of SPANNER and SMARTSPANNER on 1,000, 2,000 and 5,000 training sentences, it is obvious that the F1 scores of SMARTSPANNER are much better than those of SPANNER on all four datasets. In particular, when the training data contains 1,000 sentences, SMARTSPANNER shows the most pronounced improvement in F1 score over SPANNER (17.28% to 72.45% on CoNLL03, 0.00% to 19.57% on FEW-NERD, 23.06% to 50.66% on GENIA, and 20.44% to 55.38% on ACE05). Therefore, SMARTSPANNER is much more robust in low resource scenarios.
From the comparison of SEQLAB and SMARTSPANNER on the two flat NER datasets, it can be seen that the F1 scores of SMARTSPANNER are better than those of SEQLAB in all low resource settings except with 1,000 sentences on CoNLL03. It is worth mentioning that SMARTSPANNER is more effective than SEQLAB in all the low resource settings of FEW-NERD, which pose a significant challenge due to the large number of entity types in FEW-NERD.
Therefore, by introducing the NEH prediction task, SMARTSPANNER greatly improves the robustness of SPANNER in low resource scenarios, even better than SEQLAB.

Efficiency
According to our analysis, compared with SPANNER, SMARTSPANNER reduces the number of samples (spans) for training and inference. To verify this, we compare the efficiency of SPANNER and SMARTSPANNER in this section. From Table 4, it can be seen that SMARTSPANNER takes at least 26% less training time than SPANNER, both in all low resource settings and on the entire data of all four datasets. This is because, according to the selection strategy, far fewer negative samples for span classification are used in the training of SMARTSPANNER.

From Table 5, it can be seen that the inference time of SMARTSPANNER is at least 57% shorter than that of SPANNER on the four test datasets of CoNLL03, FEW-NERD, GENIA and ACE05. This is because only the spans that start with the predicted NEHs are used for span classification in SMARTSPANNER, which greatly reduces the number of spans at inference.

From the comparison results in Table 4 and Table 5, SMARTSPANNER is much more efficient than SPANNER during training and inference, in all low resource settings and on the entire data of all four datasets.

Analysis
We have shown the robustness of SMARTSPANNER in low resource scenarios. In this section, we take a deeper look to understand what contributes to its final performance.

Table 7: Number of inference samples for the tasks in SMARTSPANNER (SSN) and SPANNER (SN) methods on the test dataset of ACE05 (1,000, 2,000, 5,000 and all sentences for training respectively).

Task Analysis
SMARTSPANNER has two tasks, i.e., the NEH prediction task and the span classification (SP) task, while SPANNER has only the SP task. We first provide a detailed comparison of the training data for these tasks on ACE05, as shown in Table 6.
From Table 6, it can be seen that the NEH prediction task is the easiest, because it has the smallest number of classification categories and the best balance of positive and negative samples. Meanwhile, due to the more balanced positive and negative samples brought by the selection strategy, the span classification task in SMARTSPANNER is much easier than that in SPANNER. Although deep learning methods can solve difficult problems, they require large amounts of data. Therefore, SMARTSPANNER is more effective than SPANNER for NER in low resource scenarios. Furthermore, the total number of training samples in SMARTSPANNER is much smaller than in SPANNER (about a 70% reduction). This is why the training time of SMARTSPANNER is much shorter than that of SPANNER.
Next, we compare the numbers of inference samples for the tasks in SMARTSPANNER and SPANNER on the test dataset of ACE05 when 1,000, 2,000, 5,000 and all sentences are used for training, as shown in Table 7. It can be clearly seen that the total number of inference samples in SMARTSPANNER is much smaller than in SPANNER (more than a 70% reduction). This is why the inference time of SMARTSPANNER is much shorter.
Finally, we provide the results of the two tasks in SMARTSPANNER on the test dataset of ACE05 when 1,000, 2,000, 5,000 and all sentences are used for training. In Table 8, the rows labeled "NEH" list the results of the NEH prediction task, the rows labeled "SP" list the results of the SP task on all possible spans, and the rows labeled "NEH + SP" list the results of SMARTSPANNER (only the spans that start with the predicted NEHs are used for span classification).
From Table 8, it can be seen that the F1 scores of the NEH prediction task are higher than those of the span classification task due to its lower difficulty. When span classification in SMARTSPANNER is applied to all possible spans, the precision suffers because of the inconsistent distributions of training and test data (worse than that of SPANNER in Table 3). When NEH prediction is applied before span classification (i.e., only the spans that start with the predicted NEHs are used for span classification), the precision rates are greatly improved (by more than 10%), the recall rates are slightly decreased (by less than 2%), and the F1 scores are significantly improved (by more than 5%). The recall rates drop because the recall of NEH prediction is not 100%. Two cases from the test dataset of ACE05 are shown in A.1.
The analyses on CoNLL03, FEW-NERD and GENIA are provided in A.2.

Hyper-Parameter Analysis
There are two hyper-parameters for training SMARTSPANNER: the sample selection probability sp and the joint training weight w. In this section, we provide a detailed analysis of the values of these two hyper-parameters.

Sample Selection Probability
We vary the sample selection probability sp from 0 to 1 with step 0.05 and perform 10 independent runs of SMARTSPANNER for each value of sp on the dataset ACE05. Fig. 4 and Fig. 5 show the results (precision, recall and F1 scores) of NEH prediction and of NER in SMARTSPANNER respectively as sp varies (the joint training weight w remains at 0.2).
From Fig. 4, it can be seen that the recall rates of NEH prediction hardly change and the precision rates improve slightly as sp increases. Therefore, changing sp has little effect on the NEH prediction task; that is to say, even as sp changes, the distribution of the data for span classification in SMARTSPANNER during inference remains stable.

Figure 6: The average values of precision, recall and F1 scores of SMARTSPANNER on the test dataset of ACE05 with different joint training weights w (from 0.1 to 0.9 with step 0.1) in 10 independent runs when 1,000, 2,000, 5,000 and all sentences are used for training respectively.
From Fig. 5, it can be seen that the F1 scores of NER in SMARTSPANNER decrease as sp increases, especially when the training data is small (e.g., 1,000 sentences). This is due to the inconsistency between the training and inference data for span classification in SMARTSPANNER (the larger sp, the higher the inconsistency). Since the robustness of deep learning methods improves as the amount of data increases, SMARTSPANNER is less sensitive to sp when the training data grows (e.g., all sentences). From Fig. 5, it can be seen that 0.05 is a good choice for sp in low resource scenarios.

Joint Training Weight
We vary the joint training weight w from 0.1 to 0.9 with step 0.1 and perform 10 independent runs of SMARTSPANNER for each value of w on the dataset ACE05. Fig. 6 shows the results (precision, recall and F1 scores) of NER as w varies (the sample selection probability sp remains at 0.05).
From Fig. 6, it can be seen that values of w between 0.2 and 0.4 are suitable for SMARTSPANNER according to the F1 scores. The sensitivity of SMARTSPANNER to w decreases as the amount of training data increases.

Conclusion
In this paper, we find that the SPANNER method is sensitive to the amount of training data, i.e., the performance of SPANNER is worse than that of SEQLAB in low resource scenarios. To alleviate this problem, we propose SMARTSPANNER, which introduces the NEH prediction task into SPANNER in a multi-task learning manner. Experiments on the CoNLL03, FEW-NERD, GENIA and ACE05 datasets demonstrate that SMARTSPANNER is much more robust in low resource scenarios than SPANNER and greatly reduces both training and inference time. In addition, the reasons for the strong robustness of SMARTSPANNER are analyzed in depth on the ACE05 dataset.

Limitations
The SMARTSPANNER method proposed in this paper is very effective in low resource scenarios. However, when the training data contains more than 10,000 sentences, the advantages of SMARTSPANNER over SPANNER on the CoNLL03 and GENIA datasets are not so obvious. Furthermore, when all the training data of the FEW-NERD dataset (more than 100,000 sentences) is used, the results of SMARTSPANNER even drop a bit. Therefore, SMARTSPANNER is not strongly recommended in high resource scenarios.

Ethics Statement
In this section, we discuss the ethical considerations of this work from two aspects. First, for SMARTSPANNER, the code, data and pretrained models adopted from previous works are granted for research-purpose usage. Second, SMARTSPANNER improves the robustness of SPANNER in low resource scenarios by introducing the NEH prediction task. Hence we do not foresee any major risks or negative societal impacts of our work. However, like any other ML model, the named entities recognized by our model may not always be completely accurate and hence should be used with caution in real-world applications.

A.1 Case Study
Two cases from ACE05 are shown in Table 10, together with the results of SPANNER and SMARTSPANNER. From Table 10, it can be clearly seen that SMARTSPANNER performs better on NER than SPANNER, and that the predicted NEHs help improve the precision of NER in SMARTSPANNER.

A.2 Supplement Experimental Results
We provide the results of the task analysis on the datasets CoNLL03, FEW-NERD and GENIA in Table 11, Table 12 and Table 13. From Table 11, it can be seen that SMARTSPANNER also obtains much more balanced positive and negative samples on the three datasets. As shown in Table 11 and Table 12, SMARTSPANNER requires significantly fewer samples than SPANNER for training and inference on the three datasets, enabling much faster training and inference. Table 13 shows the results of the two tasks in SMARTSPANNER on the three datasets, which are consistent with the results on ACE05.
For reference, we report the results of NEH prediction in SMARTSPANNER on ACE05 with different joint training weight w in Fig. 7.
According to the results of Zaratiana et al. (2022), setting a larger epoch number leads to better results for SPANNER. Therefore, we increase the number of training epochs from 10 to 25 and conduct three independent experiments on the four datasets (using random seeds 1, 2 and 42 to obtain 1,000 training sentences). The comparison results are reported in Table 9. It can be seen that setting a larger epoch number only results in a significant improvement of the F1 score for SPANNER on CoNLL03 (with minor improvements on GENIA and ACE05, and no change on FEW-NERD). Moreover, the superiority of SMARTSPANNER over SPANNER remains evident across all four datasets, highlighting the robustness of SMARTSPANNER in low resource scenarios.

Table 13: The average values of precision (P), recall (R) and F1 scores of the two tasks (NEH and SP) in SMARTSPANNER on the test datasets of CoNLL03, FEW-NERD and GENIA in 10 independent runs (1,000, 2,000, 5,000 and all sentences for training respectively).

Figure 1: A sentence from ACE05 with 5 nested NEs. The superscript of each word indicates its index in the sentence.
Figure 3: An overview of the SMARTSPANNER method, which consists of (a) NEH prediction and (b) span classification. The two parts are jointly trained under the multi-task learning framework and jointly determine the final results.

Figure 4: The average values of precision, recall and F1 scores of NEH prediction on the test dataset of ACE05 with different sample selection probabilities sp (from 0 to 1 with step 0.05) in 10 independent runs when 1,000, 2,000, 5,000 and all sentences are used for training respectively.

Table 1 :
F1 scores of the SEQLAB and SPANNER methods on the test dataset of CoNLL03 with different percentages of training data and the maximum number of epochs set to 10.

Table 3 :
Overall results of SPANNER, SMARTSPANNER and SEQLAB on CoNLL03, FEW-NERD, GENIA and ACE05. P, R and F1 are mean values over 10 independent runs. The best F1 scores are in bold. "# Sents" stands for the number of sentences used for training.

Table 4 :
Training time (in seconds) of the SPANNER (SN) and SMARTSPANNER (SSN) methods on the CoNLL03, FEW-NERD, GENIA and ACE05 datasets.

Table 4 and Table 5:
For training, the running time is the average time to train one epoch; for inference, the running time is the time to complete NER on the test datasets. (The value is 24,614, not the 24,700 shown in Table 2, because spans exceeding the max span width, set to 20 in the experiments, are not used for training.)

Table 8 :
The average values of precision (P), recall (R) and F1 scores of the two tasks (NEH and SP) in SMARTSPANNER on the test dataset of ACE05 over 10 independent runs (1,000, 2,000, 5,000 and all sentences for training respectively).
GuoDong Zhou and Jian Su. 2002. Named entity recognition using an HMM-based chunk tagger. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 473–480, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.

Table 9 :
Averaged P, R and F1 values of the SPANNER and SMARTSPANNER methods over 3 independent runs, where the models are trained on 1,000 sentences randomly sampled from the training data with the maximum number of epochs set to 25.

Table 11 :
Descriptions of the training data of CoNLL03, FEW-NERD and GENIA for the tasks in SMARTSPANNER (SSN) and SPANNER (SN) methods, where # CAT means the number of classification categories, # PS means the number of positive samples, and # NS means the number of negative samples.