GNN-SL: Sequence Labeling Based on Nearest Examples via GNN

To better handle long-tail cases in the sequence labeling (SL) task, in this work, we introduce graph neural networks sequence labeling (GNN-SL), which augments the vanilla SL model output with similar tagging examples retrieved from the whole training set. Since not all retrieved tagging examples benefit the model prediction, we construct a heterogeneous graph and leverage graph neural networks (GNNs) to transfer information between the retrieved tagging examples and the input word sequence. The augmented node, which aggregates information from its neighbors, is used for prediction. This strategy enables the model to directly acquire similar tagging examples and improves the general quality of predictions. We conduct a variety of experiments on three typical sequence labeling tasks: Named Entity Recognition (NER), Part of Speech Tagging (POS), and Chinese Word Segmentation (CWS), demonstrating the effectiveness of GNN-SL. Notably, GNN-SL achieves SOTA results of 96.9 (+0.2) on PKU, 98.3 (+0.4) on CITYU, 98.5 (+0.2) on MSR, and 96.9 (+0.2) on AS for the CWS task, and results comparable to SOTA performance on the NER and POS datasets.


Introduction
Sequence labeling (SL) is a fundamental problem in NLP, encompassing a variety of tasks, e.g., Named Entity Recognition (NER), Part of Speech Tagging (POS), and Chinese Word Segmentation (CWS). Most existing sequence labeling algorithms (Clark et al., 2018; Zhang and Yang, 2018; Bohnet et al., 2018; Shao et al., 2017; Meng et al., 2019) can be decomposed into two parts: (1) representation learning: mapping each input word to a higher-dimensional contextual vector using neural network models such as LSTMs (Huang et al., 2019), CNNs (Wang et al., 2020), or pretrained language models (Devlin et al., 2018); and (2) classification: feeding the vector representation of each word to a softmax layer to obtain the classification label. Because this protocol relies on the model's ability to memorize the characteristics of training examples, its performance plummets when handling long-tail cases or minority categories. Intuitively, it is easier for a model to make predictions on long-tail cases at test time when it is able to refer to similar training examples. For example, in Figure 1, the model can more easily label the word "Phoenix" in the given sentence as an "ORGANIZATION" entity when referring to a similar example.
Benefiting from the success of retrieval-augmented models in NLP (Khandelwal et al., 2019, 2020; Guu et al., 2020; Lewis et al., 2020; Meng et al., 2021b), a simple yet effective method to mitigate the above issue is to apply the k nearest neighbors (kNN) strategy: the kNN model retrieves k similar tagging examples from a large cached datastore for each input word and augments the prediction with the probability computed from the similarity between the input word and each of the retrieved nearest neighbors. Unfortunately, this strategy has a significant shortcoming. Retrieved neighbors are related to the input word in different ways: some are related in semantics while others in syntax; some are close to the original input word while others are just noise. A more sophisticated model is required to capture the relationships between the retrieved examples and the input word.
In this work, inspired by recent progress in combining graph neural networks (GNNs) with retrieval-augmented models (Meng et al., 2021b), we propose GNN-SL to equip a general sequence labeling model with the ability to effectively refer to training examples at test time. The core idea of GNN-SL is to build a graph between the retrieved nearest training examples and the input word, and to use graph neural networks (GNNs) to model their relationships. To this end, we construct an undirected graph, where nodes represent both the input words and the retrieved training examples, and edges represent the relationships between nodes. Messages are passed between the input words and the retrieved training examples. In this way, we are able to more effectively harness evidence from the retrieved neighbors in the training set; by aggregating information from them, we obtain better token-level representations for final predictions.
To evaluate the effectiveness of GNN-SL, we conduct experiments on three widely-used sequence labeling tasks: Named Entity Recognition (NER), Part of Speech Tagging (POS), and Chinese Word Segmentation (CWS), and choose both English and Chinese datasets as benchmarks. Notably, applying GNN-SL to ChineseBERT (Sun et al., 2021), a robust Chinese pretrained language model, we achieve SOTA results of 96.9 (+0.2) on PKU, 98.3 (+0.4) on CITYU, 98.5 (+0.2) on MSR, and 96.9 (+0.2) on AS for the CWS task. We also achieve performance comparable to current SOTA results on CoNLL, OntoNotes 5.0, OntoNotes 4.0 and MSRA for NER, and on CTB5, CTB6, UD 1.4, WSJ and Tweets for POS. We also conduct comprehensive ablation experiments to better understand the working mechanism of GNN-SL.

Related Work
Retrieval Augmented Model Retrieval augmented models additionally use the input to retrieve information from a constructed datastore to improve model performance. As described in Meng et al. (2021b), this process can be understood as "an open-book exam is easier than a closed-book exam". Retrieval augmented models are most familiar in the question answering task, where the model generates related answers from a constructed datastore (Karpukhin et al., 2020; Xiong et al., 2020; Yih, 2020). Recently, this approach has been introduced to other NLP tasks with good performance, such as language modeling (LM) (Khandelwal et al., 2019; Meng et al., 2021b), dialogue generation (Fan et al., 2020; Thulke et al., 2021), and neural machine translation (NMT) (Khandelwal et al., 2020; Meng et al., 2021a; Wang et al., 2021).

Graph Neural Networks
The key idea behind graph neural networks (GNNs) is to aggregate feature information from the local neighbors of a node via neural networks (Liu et al., 2018; Veličković et al., 2017; Hamilton et al., 2017). Recently, more and more researchers have demonstrated the effectiveness of GNNs on NLP tasks. For text classification, Yao et al. (2019) use a Text Graph Convolution Network (Text GCN) to learn embeddings for both words and documents on a graph built from word co-occurrence and document-word relations. For information extraction, Lin et al. (2020) characterize the complex interaction between sentences and potential relation instances via a graph-enhanced dual attention network (GEDA). In the recent work GNN-LM, Meng et al. (2021b) build an undirected heterogeneous graph between an input context and its semantically related neighbors selected from the training corpus; GNNs are then applied on the graph to aggregate information from similar contexts when decoding a token.

kNN-SL
Sequence labeling (SL) is a typical NLP task, which assigns a label y ∈ Y to each word w in the given input word sequence x = {w_1, . . ., w_n}, where n denotes the length of the given sentence. We assume that {X, Y} = {(x_1, y_1), . . ., (x_N, y_N)} denotes the training set, where (x_i, y_i), 1 ≤ i ≤ N, denotes a pair containing a word sequence and its corresponding label sequence, and N denotes the size of the training set.

kNN-SL
The key idea of the kNN-SL model is to augment the classification process at inference time with a k nearest neighbor retrieval mechanism, which can be split into the following pipeline: (1) using an already-trained sequence labeling model (e.g., BERT (Devlin et al., 2018) or RoBERTa (Liu et al., 2019)) to obtain a word representation h for each token of the input word sequence; (2) using h as the query to find the k most similar tokens in the cached datastore, which stores the representation-label pairs of all tokens in the training set; and (3) interpolating the retrieval-based distribution with the vanilla model distribution to obtain the final prediction.

Vanilla probability p_vanilla For a given word w, the output h generated from the last layer of the vanilla SL model is used as its representation, where h ∈ R^m. Then h is fed into a multi-layer perceptron (MLP) to obtain the probability distribution p_vanilla via a softmax layer:

p_vanilla(y|w, x) = softmax(MLP(h))    (1)

kNN probability p_kNN The k nearest neighbors N retrieved from the datastore, each a pair (h_i, y_i) of a cached representation and its gold label, are converted into a distribution over labels, where the weight of each neighbor is an RBF kernel (Vert et al., 2004) of the distance to the original embedding h:

p_kNN(y|w, x) ∝ Σ_{(h_i, y_i) ∈ N} 1[y = y_i] exp(−d(h, h_i) / T)    (2)

where T is a temperature parameter to flatten the distribution. Finally, the vanilla distribution p_vanilla(y|w, x) is augmented with p_kNN(y|w, x), generating the final distribution p_final(y|w, x):

p_final(y|w, x) = λ p_kNN(y|w, x) + (1 − λ) p_vanilla(y|w, x)    (3)

where λ is adjustable to balance the kNN distribution and the vanilla distribution.
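To make the interpolation concrete, the sketch below computes p_kNN and p_final for a single token in NumPy. It is a minimal illustration assuming a datastore of cached (representation, label) pairs; the variable names (datastore_keys, datastore_labels, lam) are ours and not taken from any released implementation.

```python
import numpy as np

def knn_sl_probability(h, p_vanilla, datastore_keys, datastore_labels,
                       num_labels, k=32, temperature=1.0, lam=0.5):
    """Augment the vanilla label distribution of one token with kNN evidence.

    h:                (d,)   representation of the query token
    p_vanilla:        (num_labels,) softmax output of the vanilla SL model
    datastore_keys:   (N, d) cached representations of all training tokens
    datastore_labels: (N,)   gold label id of every cached token
    """
    # L2 distance between the query and every cached token representation
    dists = np.linalg.norm(datastore_keys - h[None, :], axis=1)
    knn_idx = np.argsort(dists)[:k]

    # Turn distances into a label distribution: exponentiated negative
    # distance, flattened by the temperature T, accumulated per label
    weights = np.exp(-dists[knn_idx] / temperature)
    p_knn = np.zeros(num_labels)
    for idx, w in zip(knn_idx, weights):
        p_knn[datastore_labels[idx]] += w
    p_knn /= p_knn.sum()

    # Interpolate the kNN and vanilla distributions with coefficient lambda
    return lam * p_knn + (1.0 - lam) * p_vanilla
```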

Overview
Intuitively, retrieved neighbors are related to the input word in different ways: some are similar in semantics while others in syntax; some are very similar to the input word while others are just noise.
To better model the relationships between retrieved neighbors and the input word, we propose graph neural networks sequence labeling (GNN-SL).
The proposed GNN-SL can be decomposed into five steps: (1) obtaining token features using a pre-trained vanilla sequence labeling model, which is the same as in kNN-SL; (2) obtaining k nearest neighbors from the whole training set for each input word; (3) constructing an undirected graph between each word within the sentence and its k nearest neighbors; (4) obtaining aggregated word representations through message passing along the graph; and (5) feeding the aggregated word representation to the softmax layer to obtain the final label. The full pipeline is shown in Figure 2.
Steps (1) and (2) are akin to the strategies taken in kNN-SL. We describe the details of steps (3) and (4) in order below.
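As a high-level illustration only, the five steps can be summarized as follows; the callables passed in (encode, knn_search, build_graph, gnn, classify) are hypothetical placeholders for the components detailed in the remainder of this section.

```python
def gnn_sl_predict(sentence, encode, knn_search, build_graph, gnn, classify, k=32):
    # (1) token features from the pre-trained vanilla SL model
    token_reprs = encode(sentence)                       # (n, d)
    # (2) k nearest neighbors per input word, retrieved from the training set
    neighbors = [knn_search(h, k) for h in token_reprs]
    # (3) undirected graph over input, neighbor, and label nodes
    graph = build_graph(token_reprs, neighbors)
    # (4) message passing yields aggregated representations; keep the input nodes
    aggregated = gnn(graph)[: len(token_reprs)]
    # (5) softmax classification over the aggregated representations
    return classify(aggregated).argmax(-1)
```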

Graph Construction
We formulate the graph as G = (V, E, A, R), where V represents the collection of nodes v and E represents the collection of edges e. A refers to node types and R refers to edge types.

Nodes
In the constructed graph, we define three types of nodes A: (1) Input nodes, denoted by a_input ∈ A, which correspond to words of the input sentence.
In the example of Figure 2 Step 3, the input nodes are displayed with the word sequence x = {Obama, lives, in, Washington}; (2) Neighbor nodes, denoted by a_neighbor ∈ A, which correspond to words in the retrieved neighbors. The context of each nearest neighbor is also included (and thus treated as neighbor nodes) to capture more abundant contextual information for the retrieved neighbors. In the example of Figure 2, for each input word with representation h, k nearest neighbors are queried from the cached representations of all words in the training set, with the L2 distance as the metric of similarity. Taking the input word {Obama} as the example with k = 2, we obtain two nearest neighbors {Obama, Trump} via the kNN search. The contexts of each retrieved nearest neighbor are also considered by adding both the left and right contexts around the retrieved nearest neighbor, where {Obama} is expanded to {[CLS], Obama, is} and {Trump} is expanded to {[CLS], Trump, is}. The analysis of the size of the context is conducted in Section 5.5.
(3) Label nodes, denoted by a_label ∈ A. Since the labels of the nearest neighbors provide important evidence for classifying the input node, we wish to pass the influence of the neighbors' labels to the input node along the graph. As will be shown in the ablation studies in Section 5.5, the introduction of label nodes brings a significant performance boost. As shown in Figure 2 Step 3, taking the input word {Obama} as the example, the two retrieved nearest neighbors are {Obama} and {Trump}, and both corresponding label nodes are {B-PER}.
With the above formulated, A can be rewritten as {a_input, a_neighbor, a_label}.
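For illustration, a minimal sketch of how the three node types could be assembled is given below; it assumes each datastore entry carries (label, sentence id, position) so that the left and right context of a retrieved word can be recovered, which is our bookkeeping choice rather than a detail from the paper.

```python
def build_nodes(input_words, retrieved, train_sentences, window=3):
    """Return a list of (node_type, content) pairs for one input sentence.

    input_words:     tokens of the input sentence
    retrieved:       for each input word, a list of (label, sent_id, pos) entries
    train_sentences: tokenized training sentences, indexed by sent_id
    """
    # (1) input nodes: one node per word of the input sentence
    nodes = [("input", w) for w in input_words]
    for entries in retrieved:
        for label, sent_id, pos in entries:
            sent = train_sentences[sent_id]
            # (2) neighbor nodes: the retrieved word plus `window` words of
            #     left and right context from its training sentence
            left, right = max(0, pos - window), min(len(sent), pos + window + 1)
            nodes += [("neighbor", tok) for tok in sent[left:right]]
            # (3) label nodes: the gold label of the retrieved word
            nodes.append(("label", label))
    return nodes
```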
Edges Given the three types of nodes A = {a_input, a_neighbor, a_label}, we connect them using different types of edges to enable information passing.
We define four types of edges for R: (1) edges within the input nodes a_input, denoted by r_input-input; (2) edges between the neighbor nodes a_neighbor and the input nodes a_input, denoted by r_neighbor-input; (3) edges within the neighbor nodes a_neighbor, denoted by r_neighbor-neighbor; and (4) edges between the label nodes a_label and the neighbor nodes a_neighbor, denoted by r_label-neighbor. All edges are bidirectional, allowing information to pass in both directions. We use different colors to differentiate the relations in Figure 2 Step 3.
r_input-input and r_neighbor-neighbor mimic the attention mechanism, aggregating context information within the input word sequence and within the expanded neighbor context respectively; they are shown in black and green in Figure 2. r_neighbor-input connects the retrieved neighbors to the query input word, transferring neighbor information to the input word. For r_label-neighbor, colored orange, information is passed from label nodes to neighbor nodes and is ultimately transferred to the input nodes.
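The sketch below wires up the four edge types over node indices. How densely the input words and the expanded neighbor contexts are connected, and whether the whole retrieved context or only its center word links to the query word, is our reading of Figure 2, so the exact wiring is illustrative.

```python
def build_edges(num_input, neighbor_spans, label_nodes, anchor_of):
    """Return (src, relation, dst) triples; every edge is also added reversed.

    num_input:      number of input nodes (indices 0 .. num_input - 1)
    neighbor_spans: (start, end) node-index range of each retrieved neighbor's context
    label_nodes:    node index of the label node of each retrieved neighbor
    anchor_of:      index of the input node that retrieved each neighbor
    """
    edges = []
    # (1) input-input: connect the words of the input sentence to each other
    edges += [(i, "input-input", j)
              for i in range(num_input) for j in range(num_input) if i != j]
    for (start, end), lab, anchor in zip(neighbor_spans, label_nodes, anchor_of):
        span = range(start, end)
        # (2) neighbor-input: connect the retrieved context to its query word
        edges += [(s, "neighbor-input", anchor) for s in span]
        # (3) neighbor-neighbor: connect words inside the retrieved context
        edges += [(s, "neighbor-neighbor", t) for s in span for t in span if s != t]
        # (4) label-neighbor: connect the label node to the retrieved context
        edges += [(lab, "label-neighbor", s) for s in span]
    # all edge types are bidirectional: mirror every triple
    edges += [(t, r, s) for (s, r, t) in edges]
    return edges
```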

Message Passing On The Graph
Given the constructed graph, we next use graph neural networks (GNNs) to aggregate information over the graph and obtain the final representation of each token to classify. More formally, we define the l-th layer representation of node n as:

h_n^(l) = Aggregate_{s ∈ N(n)} ( A(s, e, n) · M(s, e, n) )    (4)

where M(s, e, n) denotes the information transferred from node s to node n along edge e, A(s, e, n) denotes the edge weight modeling the importance of the source node s for the target node n under relation e, and Aggregate(·) denotes the function aggregating the transferred information from the neighbors N(n) of node n. We detail how to obtain M(·), A(·), and Aggregate(·) below.

Message For each edge (s, e, n), the message transferred from the source node s to the target node n is formulated as:

M(s, e, n) = W_ϕ(e) W^v_τ(s) h_s^(l−1)    (5)

where d denotes the dimensionality of the representation vectors, and W^v_τ(s) ∈ R^{d×d} and W_ϕ(e) ∈ R^{d×d} are two learnable weight matrices controlling the outflow of node s from the node side and the edge side respectively.

Attention For each edge (s, e, n), the attention weight is computed from the query vector of the target node, Q(n) = W^q_τ(n) h_n^(l−1), and the key vector of the source node, K(s) = W^k_τ(s) h_s^(l−1). As we use different types of edges for node connections, we follow Hu et al. (2020) and keep a distinct edge matrix W_ϕ(e) ∈ R^{d×d} for each edge type between the dot product of Q(n) and K(s):

A(s, e, n) = softmax_{s ∈ N(n)} ( Q(n) W_ϕ(e) K(s)^T · µ⟨τ(s), ϕ(e), τ(n)⟩ / √d )    (6)

where µ ∈ R^{|A|×|R|×|A|} is a learnable tensor denoting the contribution of each edge under a different relationship.

Aggregate For each edge (s, e, n), we now have the attention weight A(s, e, n) and the message M(s, e, n); the next step is to obtain the weighted-sum information from all neighboring nodes:

h_n^(l) = W^o_τ(n) ( ⊕_{s ∈ N(n)} MultiHead(s, e, n) ) ⊕ h_n^(l−1)    (7)

where MultiHead(s, e, n) concatenates the per-head products A(s, e, n) · M(s, e, n), ⊕ is element-wise addition, and W^o_τ(n) ∈ R^{d×d} is a learnable output matrix applied as a linear projection.
The aggregated representation for each input word is used as its final representation, passed to the softmax layer for classification. For all our experiments, the number of heads is 8.
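For concreteness, the following simplified, single-head PyTorch sketch implements a heterogeneous message-passing layer in the spirit of the equations above (and of Hu et al. (2020)). The parameterization, e.g. collapsing µ to a per-edge-type scalar, is a simplification for illustration rather than the exact GNN-SL layer.

```python
import torch
import torch.nn as nn

class HeteroGNNLayer(nn.Module):
    def __init__(self, dim, num_node_types, num_edge_types):
        super().__init__()
        self.dim = dim
        # per-node-type projections for queries, keys, values and outputs
        self.W_q = nn.Parameter(torch.randn(num_node_types, dim, dim) / dim ** 0.5)
        self.W_k = nn.Parameter(torch.randn(num_node_types, dim, dim) / dim ** 0.5)
        self.W_v = nn.Parameter(torch.randn(num_node_types, dim, dim) / dim ** 0.5)
        self.W_o = nn.Parameter(torch.randn(num_node_types, dim, dim) / dim ** 0.5)
        # per-edge-type matrices used inside the attention and the message
        self.W_e_att = nn.Parameter(torch.randn(num_edge_types, dim, dim) / dim ** 0.5)
        self.W_e_msg = nn.Parameter(torch.randn(num_edge_types, dim, dim) / dim ** 0.5)
        # simplified edge-type prior (the paper's mu is a |A| x |R| x |A| tensor)
        self.mu = nn.Parameter(torch.ones(num_edge_types))

    def forward(self, h, node_type, edges):
        """h: (num_nodes, dim); node_type: (num_nodes,) long; edges: list of (src, edge_type, dst)."""
        src = torch.tensor([s for s, _, _ in edges])
        rel = torch.tensor([e for _, e, _ in edges])
        dst = torch.tensor([n for _, _, n in edges])
        # type-specific query / key / value projections, one row per edge
        q = torch.einsum("ej,ejk->ek", h[dst], self.W_q[node_type[dst]])
        k = torch.einsum("ej,ejk->ek", h[src], self.W_k[node_type[src]])
        v = torch.einsum("ej,ejk->ek", h[src], self.W_v[node_type[src]])
        # attention logit: Q(n) . (W_e K(s)), scaled and weighted by the edge prior
        logits = (q * torch.einsum("ej,ejk->ek", k, self.W_e_att[rel])).sum(-1)
        logits = logits * self.mu[rel] / self.dim ** 0.5
        # message: edge-specific transform of the value vector of the source node
        msg = torch.einsum("ej,ejk->ek", v, self.W_e_msg[rel])
        # softmax over the incoming edges of every target node
        att = torch.zeros(len(edges))
        for n in dst.unique():
            mask = dst == n
            att[mask] = torch.softmax(logits[mask], dim=0)
        # weighted sum of messages per target node, output projection, residual
        agg = torch.zeros_like(h).index_add_(0, dst, att.unsqueeze(-1) * msg)
        return h + torch.einsum("nj,njk->nk", agg, self.W_o[node_type])
```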

Experiments
We conduct experiments on three widely-used subtasks of sequence labeling: named entity recognition (NER), part of speech tagging (POS), and Chinese Word Segmentation (CWS). Due to the page limit, we put our training details, including the vanilla SL model and the kNN retrieval, in Appendix A.

Control Experiments
To better show the effectiveness of the proposed model, we compare the performance of the following setups: (1) vanilla SL models: vanilla models naturally constitute a baseline for comparison, where the final layer representation is fed to a softmax function to obtain p_vanilla for label prediction; (2) vanilla + kNN: the kNN probability p_kNN is interpolated with p_vanilla to obtain final predictions; (3) vanilla + GNN: the representation generated from the final layer of the GNN is passed to the softmax layer to obtain the label probability p_GNN; (4) vanilla + GNN + kNN: the kNN probability p_kNN is interpolated with the GNN probability p_GNN to obtain final predictions, rather than with the probability from the vanilla model as in vanilla + kNN.

Named Entity Recognition
The task of NER is normally treated as a char-level tagging task: outputting an NER tag for each character. The details of the chosen baselines and datasets are in Appendix B, and the results are below.
Results Results for the NER task are shown in Table 1. From the results: (1) We observe a significant performance boost brought by kNN: +0.06, +1.62, +1.75 and +0.19 respectively for English CoNLL 2003, English OntoNotes 5.0, Chinese OntoNotes 4.0 and Chinese MSRA, which proves the importance of incorporating the evidence of retrieved neighbors.
(2) We observe a significant performance boost for vanilla+GNN over the vanilla model: +2.14 on BERT-Large and +1.78 on RoBERTa-Large for English OntoNotes 5.0, and +1.10 on BERT-Large and +0.40 on ChineseBERT-Large for Chinese MSRA. Since both vanilla+GNN and the vanilla model feed the final layer representation to the softmax function to obtain the final probability, and p_kNN does not participate in the final probability for either, the performance boost over vanilla SL demonstrates that we obtain better token-level representations using GNNs.
(3) When comparing vanilla+GNN with vanilla+kNN, we observe further improvements of +0.33, +2.14, +3.17 and +1.10 for English CoNLL 2003, English OntoNotes 5.0, Chinese OntoNotes 4.0 and Chinese MSRA respectively. This demonstrates that, as the evidence of the retrieved nearest neighbors (and their labels) has already been assimilated through GNNs in the representation learning stage, the extra benefit brought by interpolating p_kNN at the final prediction stage is significantly narrowed.

Chinese Word Segmentation
The task of CWS is normally treated as a char-level tagging problem: assigning seg or not-seg to each input character. We put the details of the chosen baselines and datasets in Appendix B, and below are the results.
Results Results for the CWS task are shown in Table 2. From the results, as in Section 5.2, with different vanilla models as the backbone we observe clear improvements from applying the kNN probability (vanilla + kNN) or the GNN model (vanilla + GNN), while vanilla + GNN and vanilla + GNN + kNN give the same results, e.g., for the PKU dataset +0.1 with BERT + kNN and +0.3 with both BERT + GNN and BERT + GNN + kNN. Notably, we achieve SOTA results on all four datasets with ChineseBERT: 96.9 (+0.2) on PKU, 98.3 (+0.3) on CITYU, 98.5 (+0.2) on MSR and 96.9 (+0.2) on AS.

Part of Speech Tagging
The task of POS is normally formalized as a character-level sequence labeling task, assigning a label to each input word. The details of the chosen baselines and datasets are in Appendix B, and below are the results.
Results Results for the POS task are shown in Table 4 for the Chinese datasets and Table 5 for the English datasets. To further visualize the phenomenon that some retrieved neighbors are close to the original input sentence while others are just noise, we show sampled examples from the NER English OntoNotes 5.0 dataset in Appendix C.

Ablation Study
The Number of Retrieved Neighbors To evaluate the influence of the number of retrieved neighbors, we conduct experiments on English OntoNotes 5.0 for the NER task, varying the number of neighbors k. The results are shown in Figure 3. As can be seen, as k increases, the F1 score of GNN-SL first increases and then decreases. The explanation is as follows: as more examples are retrieved, more noise is introduced and the relevance to the query decreases, which hurts performance.
Effectiveness of Label Nodes In Section 4.2, labels are used as nodes in the graph construction process. To evaluate the effectiveness of this strategy, we conduct contrast experiments in which the label nodes are removed. The experiments are based on the BERT-Large model, tuned to the best hyperparameters, and the results are shown in Table 3. All results show a decrease after removing the label nodes, notably -0.42 on the POS Chinese CTB6 dataset and -0.32 on the NER Chinese MSRA dataset, which proves the necessity of incorporating the label information of neighbors.
The Size of the Context Window In Section 4.2, to acquire contextual information for each retrieved nearest word, we expand it with its surrounding context. We experiment with varying context sizes to measure the influence. Results on NER English OntoNotes 5.0 are shown in Table 6. We observe that, as the context size increases, performance first goes up and then plateaus. This is because a decent amount of context is sufficient to provide enough information for prediction.
kNN search without Fine-tuning In Section 4, we use the representations obtained by fine-tuning the pre-trained model on the labeled training set to perform the kNN search. To validate its necessity, we also conduct an experiment that directly uses a BERT model without fine-tuning to extract representations. We evaluate its influence on NER CoNLL 2003 and observe a sharp decrease when switching the fine-tuned SL model to a non-fine-tuned BERT model, i.e., 93.17 vs. 92.83. This is due to the gap between the LM task and the SL task: nearest neighbors retrieved by a vanilla pre-trained language model might not be the NEAREST neighbors for the SL task.

Conclusion
In this work, we propose GNN-SL, which augments the vanilla SL model output with similar tagging examples retrieved from the whole training set. Since not all retrieved tagging examples benefit the model prediction, we construct a heterogeneous graph and leverage graph neural networks (GNNs) to transfer information from the retrieved nearest examples to the input word. This strategy enables the model to directly acquire similar tagging examples and improves its effectiveness in handling long-tail cases. We conduct extensive experiments and analyses on three sequence labeling tasks: NER, POS, and CWS. Notably, GNN-SL achieves SOTA results of 96.9 (+0.2) on PKU, 98.3 (+0.4) on CITYU, 98.5 (+0.2) on MSR, and 96.9 (+0.2) on AS for the CWS task.

Limitation
Admittedly, the main limitation of this work is the selection of the k nearest neighbors. Intuitively, high-quality nearest neighbors make it easier for the GNN to learn good representations. Thus, in future work, we will focus on the kNN selection process, including exploring similarity measures beyond vector-space distance and representations extracted with different strategies.

A Training Details
The Vanilla SL Model As described in Section 4, we need the pre-trained vanilla SL model to extract features that initialize the nodes of the constructed graph. For all our experiments, we choose the standard BERT-large (Devlin et al., 2018) and RoBERTa-large (Liu et al., 2019) for the English tasks, and the standard BERT-large and ChineseBERT-large (Sun et al., 2021) for the Chinese tasks.
kNN Retrieval In the kNN retrieval process, the number of nearest neighbors k is set to 32, and the size of the nearest-context window is set to 7 (three words on each side of the retrieved word plus the word itself). The two numbers are chosen according to the evaluations in Section 5.5 and perform best in our experiments. For the k nearest neighbor search, we use the last-layer output of the pre-trained vanilla SL model as the representation and the L2 distance as the similarity metric.
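A minimal sketch of this retrieval step is given below, assuming the datastore has been built offline from last-layer token representations; the use of FAISS is our choice for illustration (any exact L2 index would do).

```python
import faiss
import numpy as np

def build_index(train_reprs):
    """train_reprs: (N, d) array of last-layer representations of all training tokens."""
    index = faiss.IndexFlatL2(train_reprs.shape[1])   # exact L2 search
    index.add(train_reprs.astype(np.float32))
    return index

def retrieve_neighbors(index, query_reprs, k=32):
    """query_reprs: (n, d) representations of the input tokens; ids point into the datastore."""
    distances, ids = index.search(query_reprs.astype(np.float32), k)
    return distances, ids
```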

C Examples
To illustrate the augmentation effect of our proposed GNN-SL, we visualize the retrieved kNN examples as well as the input sentences in Table 7. In the first example, the long-tail case "Phoenix", which is labeled "LOCATION" by the vanilla SL model, is corrected to "ORGANIZATION" with the help of the nearest neighbors. In particular, we observe that both the 1st and 8th retrieved labels are "ORGANIZATION", while the 16th retrieved label is "LOCATION", which contradicts the ground truth. This phenomenon shows that retrieved neighbors do relate to the input sentence in different ways: some are close to the original input sentence while others are just noise, and our proposed GNN-SL is able to better model the relationships between the retrieved nearest examples and the input word. For the second example, with the augmentation of the retrieved nearest neighbors, our proposed GNN-SL outputs the correct label "PERSON" for the word "Tom Moody".

Figure 1: Example for the NER assignment when given a similar example.

Figure 2: An example of the GNN-SL process. Step 1 Representation Extraction: Suppose that we need to extract named entities for the given sentence: Obama lives in Washington. The representation for each word is the last hidden state of the pretrained vanilla sequence labeling model. Step 2 kNN Search: Obtaining k nearest neighbors for EACH input word from the cached datastore, which consists of representation-label pairs for all training data. Step 3&4 Graph Construction & Message Passing: The queried k nearest neighbors and the input words are constructed into a graph. Messages are passed from the nearest neighbors to each input word to obtain the aggregated representation. Step 5 Prediction with Softmax: The aggregated representation of each word is passed to a softmax layer to compute the likelihood of the assigned label.

Figure 3: Experiments on the English OntoNotes 5.0 dataset, varying the number of neighbors k.

Input Sentence #1: Hornak moved on from Tigers to Phoenix for studies and work.
Ground Truth: ORGANIZATION | Vanilla SL Output: LOCATION | GNN-SL Output: ORGANIZATION
Retrieved Nearest Neighbors and Retrieved Labels:
- 1st: Hornak signed accomplished performance in a Tigers display against Phoenix. → ORGANIZATION
- 8th: Blinker was fined 75,000 Swiss francs ($57,600) for failing to inform the English club of his previous commitment to Udinese. → ORGANIZATION
- 16th: Since we have friends in Phoenix, we pop in there for a brief visit. → LOCATION

Input Sentence #2: Australian Tom Moody took six for 82 but Tim O'Gorman, 109, took Derbyshire to 471.
Ground Truth: PERSON | Vanilla SL Output: - (Not an Entity) | GNN-SL Output: PERSON
Retrieved Nearest Neighbors and Retrieved Labels:
- 1st: At California, Troy O'Leary hit solo home runs in the second inning as the surging Boston Red Sox. → PERSON
- 8th: Britain's Chris Boardman broke the world 4,000 meters cycling record by more than six seconds. → PERSON
- 16th: Japan coach Shu Kamo said: The Syrian own goal proved lucky for us. → PERSON

Table 3: Experiments without label nodes on three tasks: NER, CWS, and POS.

Table 5: POS results for two English datasets: WSJ and Tweets.

Table 7: Retrieved nearest examples from the English OntoNotes 5.0 dataset, where the labeled words are underlined.