MINER: Improving Out-of-Vocabulary Named Entity Recognition from an Information Theoretic Perspective

NER models have achieved promising performance on standard NER benchmarks. However, recent studies show that previous approaches may over-rely on entity mention information, resulting in poor performance on out-of-vocabulary (OOV) entity recognition. In this work, we propose MINER, a novel NER learning framework, to remedy this issue from an information-theoretic perspective. The proposed approach contains two mutual-information-based training objectives: i) generalizing information maximization, which enhances representations via a deep understanding of context and entity surface forms; ii) superfluous information minimization, which discourages representations from rote memorizing entity names or exploiting biased cues in the data. Experiments on various settings and datasets demonstrate that it achieves better performance in predicting OOV entities.


Introduction
Named Entity Recognition (NER) aims to identify and classify entity mentions in unstructured text, e.g., extracting the location mention "Berlin" from the sentence "Berlin is wonderful in the winter". NER is a key component in information retrieval (Tan et al., 2021), question answering (Min et al., 2021), dialog systems, etc. Traditional NER models are based on feature engineering and machine learning (Zhou and Su, 2002; Takeuchi and Collier, 2002; Agerri and Rigau, 2016). Benefiting from the development of deep learning, neural-network-based NER models have achieved state-of-the-art results on several public benchmarks (Lample et al., 2016; Peters et al., 2018; Devlin et al., 2018; Yamada et al., 2020; Yan et al., 2021).
Recent studies (Lin et al., 2020; Agarwal et al., 2021) show that context does influence the predictions of NER models, but the main factor driving high performance is learning the named tokens themselves. Consequently, NER models underperform when predicting entities that have not been seen during training (Fu et al., 2020; Lin et al., 2020), which is referred to as the out-of-vocabulary (OOV) problem.

Table 1: Comparison between the in-dictionary (InDict) and out-of-dictionary (OutDict) parts of the CoNLL 2003 baseline (Lin et al., 2020), tested on BERT-CRF. The performance gap between InDict and OutDict is significantly large.
There are three classical strategies to alleviate the OOV problem: external knowledge, OOV word embeddings, and contextualized embeddings. The first is to introduce additional features, e.g., entity lexicons (Zhang and Yang, 2018) or part-of-speech tags, which alleviates the model's dependence on word embeddings. However, external knowledge is not always easy to obtain. The second strategy is to learn a better OOV word embedding (Peng et al., 2019; Fukuda et al., 2020). This strategy learns a static OOV embedding representation but does not directly utilize the context. The last is to fine-tune pre-trained models, e.g., ELMo (Peters et al., 2018) and BERT (Devlin et al., 2018), which provide contextualized word representations. Unfortunately, Agarwal et al. (2021) show that the higher performance of pre-trained models could be the result of learning the subword structure better.
How can we make the model focus on contextual information to tackle the OOV problem? Motivated by the information bottleneck principle (Tishby et al., 2000), we propose a novel learning framework: Mutual Information based Named Entity Recognition (MINER). The proposed method provides an information-theoretic perspective on the OOV problem by training an encoder to minimize task-irrelevant nuisances while keeping predictive information.
Specifically, MINER contains two mutual-information-based learning objectives: i) generalizing information maximization, which aims to maximize the mutual information between representations and well-generalizing features, i.e., context and entity surface forms; ii) superfluous information minimization, which prevents the model from rote memorizing entity names or exploiting biased cues by eliminating entity name information. Our code is publicly available at https://github.com/BeyonderXX/MINER.
Our main contributions are summarized as follows: 1. We propose a novel learning framework, i.e., MINER, from an information-theoretic perspective, aiming to improve robustness to entity changes by eliminating entity-specific information and maximizing well-generalizing information.
2. We show its effectiveness in several settings and on several benchmarks, and suggest that MINER is a reliable approach for better OOV entity recognition.

Background
In this section, we highlight the information bottleneck principle. We then analyze possible issues that arise when applying it to OOV entity recognition, and review related techniques used in deriving our framework.
The Information Bottleneck (IB) principle originated in information theory and provides a theoretical framework for analyzing deep neural networks. It formulates the goal of representation learning as an information trade-off between predictive power and representation compression. Given the input dataset (X, Y), it seeks to learn the internal representation Z of some intermediate layers by minimizing:

    L_IB = -I(Z; Y) + β I(X; Z),    (1)

where I denotes mutual information (MI), a measure of the mutual dependence between two variables. The trade-off between the two MI terms is controlled by the Lagrange multiplier β. A low loss indicates that the representation Z does not keep too much information from X while still retaining enough information to predict Y.
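As a concrete illustration (ours, not the authors'), the IB trade-off can be evaluated exactly for discrete toy variables: with X uniform on {0, 1, 2, 3} and Y = X mod 2, the compressed representation Z = X mod 2 keeps all label information while discarding the label-irrelevant bit, so it attains a lower IB loss than the identity representation Z = X.

```python
import numpy as np

def mutual_info(joint):
    """I(A; B) in bits, computed exactly from a joint probability table p(a, b)."""
    pa = joint.sum(axis=1, keepdims=True)
    pb = joint.sum(axis=0, keepdims=True)
    m = joint > 0
    return float((joint[m] * np.log2(joint[m] / (pa * pb)[m])).sum())

def joint_of(map_a, n_a, map_b, n_b):
    """Joint table of two deterministic functions of X, X uniform on {0..3}."""
    j = np.zeros((n_a, n_b))
    for x in range(4):
        j[map_a(x), map_b(x)] += 0.25
    return j

ident, label = (lambda x: x), (lambda x: x % 2)
beta = 0.3
# Z = X keeps everything: I(Z; Y) = 1 bit, I(X; Z) = 2 bits.
loss_full = (-mutual_info(joint_of(ident, 4, label, 2))
             + beta * mutual_info(joint_of(ident, 4, ident, 4)))
# Z = X mod 2 compresses away the irrelevant bit: I(Z; Y) = 1, I(X; Z) = 1.
loss_comp = (-mutual_info(joint_of(label, 2, label, 2))
             + beta * mutual_info(joint_of(label, 2, ident, 4)))
print(loss_full, loss_comp)  # the compressed representation achieves the lower loss
```

For any β > 0 the compressed Z wins, which is exactly the behavior the IB objective rewards.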
Section 5 suggests that directly applying IB to NER does not bring an obvious improvement. We argue that IB cannot guarantee a well-generalizing representation.
On the one hand, it has been shown that it is challenging to find a trade-off between high compression and high predictive power (Tishby et al., 2000; Wang et al., 2019; Piran et al., 2020): when compressing task-irrelevant nuisances, useful information will inevitably be discarded as well. On the other hand, the IB principle does not specify which parts of the features are well-generalizing and which are not, as we usually train a classifier solely to maximize accuracy. Consequently, neural networks tend to use any accessible signal to do so (Ilyas et al., 2019), which is referred to as the shortcut learning problem (Geirhos et al., 2020). For training sets of limited size, it may be easier for neural networks to memorize entity names than to classify them by context and common entity features (Agarwal et al., 2021). In Section 4, we demonstrate how we extend IB to the NER task and address these issues.

Model Architecture
In recent years, NER systems have undergone a paradigm shift from sequence labeling, which formulates NER as a token-level tagging task (Chiu and Nichols, 2016; Akbik et al., 2018; Yan et al., 2019), to span prediction (SpanNER), which regards NER as a span-level classification task (Mengge et al., 2020; Yamada et al., 2020; Fu et al., 2021). We choose SpanNER as the base architecture for two reasons: 1) SpanNER yields a whole-span representation, which can be directly used to optimize information; 2) compared with sequence labeling, SpanNER does better on sentences with more OOV words (Fu et al., 2021).
Overall, SpanNER consists of three major modules: token representation layer, span representation layer, and span classification layer. Besides, our method inserts a bottleneck layer to the architecture for information optimization.

Span Representation Layer
For all possible spans S = {s_1, s_2, ..., s_m} of sentence X, we re-assign a label y ∈ Y to each span. Take "Berlin is wonderful" as an example: its possible spans are "Berlin", "is", "wonderful", "Berlin is", "is wonderful", and "Berlin is wonderful", where the span "Berlin" is labeled LOC and the rest are labeled O. Given the start index b_i and end index e_i, the representation of span s_i is calculated from two parts: a boundary embedding and a span length embedding.
Boundary embedding: This part is calculated by concatenating the representations of the start and end tokens, h_{b_i} and h_{e_i}.

Span length embedding: In order to introduce the length feature, we additionally provide the length embedding t_{l_i}, which is obtained from a learnable look-up table.
Finally, the span representation is obtained by concatenating the boundary and length embeddings:

    s_i = [h_{b_i}; h_{e_i}; t_{l_i}],    (2)

where h_{b_i} and h_{e_i} are the token representations at the span boundaries.
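The span representation described above can be sketched in a few lines of PyTorch. This is a minimal illustration under our own naming (class and argument names are not from the paper); dimensions follow the paper's setup of BERT hidden size 768 and maximum span length 4.

```python
import torch
import torch.nn as nn

class SpanRepresentation(nn.Module):
    """Boundary embedding [h_b; h_e] concatenated with a learned length embedding."""

    def __init__(self, len_dim=50, max_len=4):
        super().__init__()
        # Learnable look-up table for span lengths 0..max_len.
        self.len_embed = nn.Embedding(max_len + 1, len_dim)

    def forward(self, token_reprs, spans):
        # token_reprs: (seq_len, hidden); spans: list of (b_i, e_i), ends inclusive.
        out = []
        for b, e in spans:
            boundary = torch.cat([token_reprs[b], token_reprs[e]])  # [h_b; h_e]
            length = self.len_embed(torch.tensor(e - b + 1))
            out.append(torch.cat([boundary, length]))
        return torch.stack(out)

reprs = torch.randn(5, 768)           # e.g. subtoken reprs of "Berlin is wonderful"
spans = [(0, 0), (0, 1), (1, 2)]      # enumerated candidate spans
vecs = SpanRepresentation()(reprs, spans)
print(vecs.shape)                     # (num_spans, 2 * hidden + len_dim)
```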

Information Bottleneck Layer
In order to optimize the information in the span representation, our method additionally adds an information bottleneck layer of the form:

    p(z_i | s_i) = N(µ_i, Σ_i), with (µ_i, Σ_i) = f_e(s_i),    (3)

where f_e is an MLP which outputs both the K-dimensional mean µ of z and the K × K covariance matrix Σ. We then use the reparameterization trick (Kingma and Welling, 2013) to obtain the compressed representation z_i.
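A minimal sketch of such a bottleneck layer is given below. For simplicity we assume a diagonal covariance (the paper states a full K × K matrix), and all names are ours; the reparameterization z = µ + σ ⊙ ε keeps the sampling step differentiable.

```python
import torch
import torch.nn as nn

class IBLayer(nn.Module):
    """MLP f_e predicts (mu, log-variance); z is sampled with reparameterization."""

    def __init__(self, in_dim=1586, k=50):
        super().__init__()
        self.f_e = nn.Sequential(nn.Linear(in_dim, 2 * k), nn.Tanh(),
                                 nn.Linear(2 * k, 2 * k))
        self.k = k

    def forward(self, span_repr):
        stats = self.f_e(span_repr)
        mu, logvar = stats[..., :self.k], stats[..., self.k:]
        eps = torch.randn_like(mu)                 # noise ~ N(0, I)
        z = mu + (0.5 * logvar).exp() * eps        # reparameterization trick
        return z, mu, logvar

z, mu, logvar = IBLayer()(torch.randn(3, 1586))
print(z.shape)  # one K-dimensional compressed representation per span
```

Because the noise enters through a deterministic transform of (µ, σ), gradients flow into f_e during training.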

Span Classification Layer
Once the information bottleneck layer has produced z_i, it is fed into the classifier to obtain the probability of its label y_i. Based on this probability, the basic loss function is calculated as follows:

    L_basic = -Σ_i log P(y_i | z_i), with P(y_i | z_i) = exp(score(z_i, y_i)) / Σ_k exp(score(z_i, y_k)),    (4)

where score(·, ·) measures the compatibility between a specified label and a span representation:

    score(z_i, y_k) = z_i^T y_k,    (5)

where y_k is a learnable representation of class k.
Heuristic decoding: A heuristic decoding solution for flat NER is provided to avoid predicting overlapped spans: among overlapping spans, we keep the span with the highest prediction probability and drop the others.
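The decoding step can be sketched as a greedy pass over spans sorted by probability (our illustrative implementation; data layout is an assumption):

```python
def heuristic_decode(predictions):
    """Keep the highest-probability span among overlapping ones.

    predictions: list of (start, end, label, prob) with inclusive ends.
    """
    kept = []
    for span in sorted(predictions, key=lambda s: -s[3]):   # highest prob first
        b, e = span[0], span[1]
        # Keep the span only if it overlaps none of the already-kept spans.
        if all(e < kb or b > ke for kb, ke, _, _ in kept):
            kept.append(span)
    return sorted(kept)

preds = [(0, 1, "LOC", 0.9), (1, 2, "ORG", 0.6), (3, 3, "PER", 0.8)]
print(heuristic_decode(preds))  # (1, 2, "ORG") overlaps the stronger (0, 1) span and is dropped
```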
It is worth noting that our method is flexible and can be combined with any other NER model based on span classification. In the next section, we introduce two additional objectives to tackle the OOV problem in NER.

MI-based objectives
Motivated by IB (Tishby et al., 2000; Federici et al., 2020), we can subdivide I(X; Z) into two components using the chain rule of mutual information (MI):

    I(X; Z) = I(Z; Y) + I(X; Z | Y).    (6)

The first term determines how much information about Y is accessible from Z. The second term, the conditional mutual information I(X; Z | Y), denotes the information in Z that is not predictive of Y.
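This decomposition can be checked numerically on a toy discrete example (ours, not from the paper). Because Z is a deterministic function of X, I(Z; Y | X) = 0 and the chain rule reduces to exactly the identity above.

```python
import numpy as np

rng = np.random.default_rng(0)
p_xy = rng.random((4, 2)); p_xy /= p_xy.sum()   # arbitrary toy joint p(x, y)
f = [0, 0, 1, 1]                                # representation z = f(x)

# Full joint p(x, y, z), z deterministic given x.
p = np.zeros((4, 2, 2))
for x in range(4):
    for y in range(2):
        p[x, y, f[x]] = p_xy[x, y]

def mi(j):
    """Mutual information (bits) of a 2-D joint table."""
    pa, pb = j.sum(1, keepdims=True), j.sum(0, keepdims=True)
    m = j > 0
    return float((j[m] * np.log2(j[m] / (pa * pb)[m])).sum())

i_xz = mi(p.sum(axis=1))                        # I(X; Z)
i_zy = mi(p.sum(axis=0))                        # I(Z; Y)
# I(X; Z | Y) = sum_y p(y) * I(X; Z | Y = y)
i_xz_given_y = sum(p[:, y, :].sum() * mi(p[:, y, :] / p[:, y, :].sum())
                   for y in range(2))
print(i_xz, i_zy + i_xz_given_y)                # the two sides agree
```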
For NER, which parts of the information retrieved from input are useful and which are redundant?
From human intuition, context should be the main predictive information for NER. For example, in "The CEO of X resigned", the type of X in this context should always be "ORG". Besides, entity mentions also provide much information for entity recognition: nearly all person names capitalize the first letter and follow the "firstName lastName" or "lastName firstName" pattern. However, the entity name itself is not a well-generalizing feature. By simply memorizing which spans are entities, a model may fit the training set, but it cannot predict entities it has never seen before.

Figure 1: Visualization of MINER, where x_1 and x_2 share the same context and entity labels, while their entity words differ. z_1 and z_2 are compressed entity representations sampled from p(z_1|x_1) and p(z_2|x_2), respectively, implemented by the information bottleneck (IB) layer. Our method adds two learning objectives to the basic architecture. The first maximizes the mutual information I(z_1; z_2) to enhance the context and entity surface form information of z_1 and z_2. The second minimizes the Jensen-Shannon divergence, an upper bound of I(x_1; z_1|x_2), aiming to eliminate task-irrelevant nuisances.
We convert the targets of Eq. (6) into a form that is easier to optimize via a contrastive strategy. Specifically, consider two contrastive samples x_1 and x_2 with similar contexts but different entity mentions of the same entity category, i.e., s_1 and s_2, respectively. Assume that x_1 and x_2 are both sufficient for inferring the label y. The mutual information between x_1 and z_1 can then be factorized into two parts:

    I(x_1; z_1) = I(z_1; x_2) + I(x_1; z_1 | x_2),    (7)

where z_1 and z_2 are the span representations of s_1 and s_2, respectively. I(z_1; x_2) denotes the information that is not entity-specific, and I(x_1; z_1 | x_2) represents the information in z_1 that is unique to x_1 and not predictable from sentence x_2, i.e., entity-specific information.
Thus any representation z containing all the information shared by both sentences would also contain the necessary label information, and sentence-specific information is superfluous. So, via the factorization in Eq. (7), the targets of Eq. (6) can be approximated by:

    maximize I(z_1; y) ≈ I(z_1; x_2),    (8)
    minimize I(x_1; z_1 | y) ≈ I(x_1; z_1 | x_2).    (9)

The target of Eq. (8) is defined as generalizing information maximization. We prove that I(z_1; z_2) is a lower bound of I(z_1; x_2) (the proof can be found in Appendix 7). InfoNCE (Oord et al., 2018) is a lower bound on MI and can be used to approximate I(z_1; z_2), which can be optimized by minimizing:

    L_gi = -E_p[ log ( exp(g_w(z_1, z_2)) / Σ_{z'} exp(g_w(z_1, z')) ) ],    (10)

where g_w(·, ·) is a compatibility score function approximated by a neural network, z_2 is the positive entity representation drawn from the joint distribution p of the original sample and the corresponding generated sample, and z' are negative entity representations drawn from the joint distribution of the original sample and other samples.
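An InfoNCE objective of this shape can be sketched as below. The bilinear form of g_w is our assumption (the paper only specifies "a neural network"); in-batch pairs serve as positives and all other rows as negatives, so the loss reduces to a cross-entropy over similarity scores.

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, w):
    """InfoNCE lower bound on I(z1; z2) with a bilinear score g_w(a, b) = a^T W b.

    z1, z2: (batch, k) paired representations; off-diagonal rows act as negatives.
    """
    scores = z1 @ w @ z2.t()                 # scores[i, j] = g_w(z1_i, z2_j)
    labels = torch.arange(z1.size(0))        # positives sit on the diagonal
    return F.cross_entropy(scores, labels)   # -E[log softmax over candidates]

k = 50
z1 = torch.randn(8, k)
z2 = z1 + 0.1 * torch.randn(8, k)            # positives: same span, replaced mention
w = torch.randn(k, k, requires_grad=True)
loss = info_nce(z1, z2, w)
loss.backward()                              # gradients flow into the score network
print(loss.item())
```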
The target of Eq. (9) is defined as superfluous information minimization. To restrict this term, we can minimize an upper bound of I(x_1; z_1 | x_2) (proofs can be found in Appendix 7) as follows:

    L_si = D_JS[p_{z_1} || p_{z_2}],    (11)

where D_JS denotes the Jensen-Shannon divergence, and p_{z_1} and p_{z_2} represent p(z_1|x_1) and p(z_2|x_2), respectively. In practice, Eq. (11) encourages z to be invariant to entity changes. The resulting Mutual Information based Named Entity Recognition model is visualized in Figure 1.
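The exact Jensen-Shannon divergence between two Gaussians has no closed form. A common tractable surrogate (this substitution is our illustrative choice; Federici et al. (2020) use a similar symmetrized bound) is the symmetrized KL between the two diagonal Gaussian posteriors, which does have a closed form:

```python
import torch

def kl_diag(mu1, lv1, mu2, lv2):
    """KL( N(mu1, diag(e^lv1)) || N(mu2, diag(e^lv2)) ), closed form per batch row."""
    return 0.5 * (lv2 - lv1 + (lv1.exp() + (mu1 - mu2) ** 2) / lv2.exp() - 1).sum(-1)

def sym_kl(mu1, lv1, mu2, lv2):
    """Symmetrized KL: a tractable stand-in for D_JS between Gaussian posteriors."""
    return 0.5 * (kl_diag(mu1, lv1, mu2, lv2) + kl_diag(mu2, lv2, mu1, lv1))

mu, lv = torch.zeros(3, 50), torch.zeros(3, 50)
print(sym_kl(mu, lv, mu, lv))          # identical posteriors -> zero divergence
print(sym_kl(mu, lv, mu + 1.0, lv))    # shifted means -> strictly positive
```

Minimizing this quantity pushes p(z_1|x_1) and p(z_2|x_2) together, which is precisely the invariance to entity changes that Eq. (11) asks for.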

Contrastive sample generation
It is difficult to obtain samples with similar contexts but different entity words. We therefore generate contrastive samples via the mention replacement mechanism (Dai and Adel, 2020): each mention in a sentence is replaced by another mention of the same entity type from the original training set, and the corresponding span labels are adjusted accordingly. For example, the "LOC" mention "Berlin" in the sentence "Berlin is wonderful in the winter" is replaced by "Iceland".
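The replacement step can be sketched as follows. Tokenization is simplified and the helper names are ours; the key detail is re-indexing the span boundaries when the new mention has a different length.

```python
import random

def replace_mentions(tokens, spans, mention_bank, rng=random):
    """Swap each mention for a same-type mention drawn from the training set.

    tokens: list of word tokens; spans: list of (start, end_inclusive, type);
    mention_bank: dict mapping entity type -> list of tokenized mentions.
    """
    out_tokens, out_spans, offset = list(tokens), [], 0
    for b, e, t in sorted(spans):
        new = rng.choice(mention_bank[t])
        out_tokens[b + offset:e + 1 + offset] = new          # splice in the new mention
        out_spans.append((b + offset, b + offset + len(new) - 1, t))
        offset += len(new) - (e - b + 1)                     # shift later spans
    return out_tokens, out_spans

bank = {"LOC": [["Iceland"], ["New", "York"]]}
toks, sps = replace_mentions(["Berlin", "is", "wonderful"], [(0, 0, "LOC")], bank)
print(toks, sps)  # e.g. "Berlin" replaced by "Iceland" or "New York", span re-indexed
```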

Training
Combining Eq. (4), (10), and (11), we obtain the following objective function to minimize:

    L = L_basic + γ L_gi + β L_si,    (12)

where γ and β are the weights of the generalizing information loss and the superfluous information loss, respectively.

Experiment
In this section, we verify the performance of the proposed method on five OOV datasets and compare it with other methods. In addition, we test the universality of the proposed method across various pre-trained models.

Datasets and Metrics
Datasets We performed experiments on: 1. WNUT2017 (Derczynski et al., 2017), a dataset focusing on unusual, previously-unseen entities, collected from social media.
4. Conll03-Typos, generated from Conll2003 (Sang and De Meulder, 2003); entities in the test set are replaced by typo versions (character modification, insertion, and deletion operations).
5. Conll03-OOV, generated from Conll2003 (Sang and De Meulder, 2003); entities in the test set are replaced by out-of-vocabulary entities. Table 2 reports statistics of the OOV problem on the test sets of each dataset. As shown in the table, the test sets of these datasets comprise a substantial amount of OOV entities.
Metrics We measure the entity-level micro-averaged F1 score on the test set to compare the results of different models. Li et al. (2020) share the same intuition as us, enriching word representations with context; however, that work is neither open source nor reported on the same datasets, so it cannot be compared with MINER.

Baseline methods

We compare our method with the following baselines:

• Fu et al. (2021), which computes attention weight probabilities over textual and text-relevant visual contexts separately.

• Li et al. (2021) (MIN), which utilizes both segment-level information and word-level dependencies, and incorporates an interaction mechanism to support information sharing between boundary detection and type prediction, enhancing performance on the NER task.
• Fukuda et al. (2020) (CoFEE), which refers to pre-trained word embeddings of known words with surfaces similar to target OOV words.
• Nie et al. (2020) (SA-NER), which utilizes semantic enhancement to reduce the negative impact of data sparsity. Specifically, the method obtains augmented semantic information from a large-scale corpus, and proposes an attentive semantic augmentation module and a gate module to encode and aggregate such information, respectively.

Implementation Details
BERT-large released by Devlin et al. (2018) is selected as our base encoder. The learning rate is set to 5e-5 and the dropout to 0.2. The output dimension of the information bottleneck layer is 50. To trade off performance and efficiency, we truncate sentences that exceed 128 tokens, and, based on the distribution of entity lengths across the datasets, choose 4 as the maximum enumerated entity length. The values of β and γ differ across datasets; empirically, β = 1e-5 and γ = 0.01 give promising results. The model is trained on an NVIDIA GeForce RTX 2080Ti GPU. Checkpoints with top-3 performance are evaluated on the test set to report averaged results.

Main Results
We demonstrate the effectiveness of MINER against other state-of-the-art models. As shown in Table 3, we conducted the following comparisons and analysis: 1) Our baseline model, i.e., SpanNER, does an excellent job of predicting OOV entities; compared with sequence labeling, span classification models the relation of entity tokens directly. 2) The performance of SpanNER is further boosted by our proposed approach, which proves the effectiveness of our method; as shown in Table 3, MINER outperforms almost all other SOTA methods without any external resources. 3) Compared with the Typos transformation, it is more difficult for models to predict OOV words: for a pre-trained model, a typo word may not appear in the training set, but it shares most subwords with the original token, whereas the subwords of an OOV entity may be rare. 4) The traditional information bottleneck does not significantly improve the OOV prediction ability of the model; we argue that it indiscriminately compresses the information in the representation, leading to underfitting. 5) Our model significantly improves performance under the Typos and OOV entity perturbations, demonstrating that MINER substantially improves robustness in the face of noise. 6) Our proposed method is universal and further improves OOV prediction performance for different embedding models, as we obtain stable improvements on BERT, RoBERTa, and ALBERT.

Figure 2(a): Results obtained by testing MINER (BERT-large) on TwitterNER; β is fixed to 1e-3, and the orange line is the F1 score when γ = 0.

Ablation Study
We also perform ablation studies to validate the effectiveness of each part of MINER. Table 4 shows the results of different settings of the proposed training strategy equipped with BERT. Adding only the L_gi loss, which enhances context and entity surface form information, already yields results better than the original PLMs; a similar phenomenon occurs for L_si. This reflects that both L_gi and L_si are beneficial for generalization in OOV entity recognition. Moreover, the results on the three datasets are significantly improved by adding both the L_gi and L_si learning objectives, meaning that L_gi and L_si boost each other. This supports that our method enhances representations via a deep understanding of context and entity surface forms, and discourages representations from rote memorizing entity names or exploiting biased cues in the data.

Figure 2(b): Results obtained by testing MINER (BERT-large) on TwitterNER; γ is fixed to 1e-4, and the orange line is the F1 score when β = 0.

Sensitivity Analysis of β and γ
To show the different influences of our proposed training objectives L_gi and L_si, we conduct a sensitivity analysis of the coefficients β and γ. Figure 2 shows the performance under different settings of the two coefficients. The orange line denotes the ablation result without the corresponding loss function (β = 0 or γ = 0). From Figure 2 we observe that performance is significantly enhanced with a small β or γ, with the best performance achieved at β = 1e-3 and γ = 1e-4, respectively. This probes the effectiveness of our proposed training objectives. As the coefficients increase further, performance shows a declining trend, meaning that over-constraining L_gi or L_si hurts the ability to generalize to OOV entities.

Interpretable Analysis
The above experiments show the promising performance of MINER in predicting unseen entities. To further investigate which parts of the sentence MINER focuses on, we visualize the attention weights over entities and contexts. We show an example from TwitterNER in Figure 4. The attention score is calculated by averaging the attention weights of the 0th layer of BERT. Taking the attention weights of the entity "State Street" as an example, the baseline model, i.e., SpanNER, clearly focuses on the entity words themselves, while the scores of our model are distributed more evenly, meaning that our method attends more to context information.
Related Work

External Knowledge
This group of methods makes it easier to predict OOV entities using external knowledge. Zhang and Yang (2018) utilize a dictionary listing numerous entity mentions. Integrating dictionary information can yield stronger "lookup" models, but there is no guarantee that entities outside the training set and vocabulary will be correctly identified. To diminish the model's dependency on OOV embeddings, part-of-speech tags have also been introduced. External resources are not always available, which is a limitation of this strategy.

OOV word Embedding
The OOV problem can be alleviated by improving the OOV word embedding. Bojanowski et al. (2017) use the character n-grams of each word to represent OOV word embeddings. Pinter et al. (2017) capture morphological features using a character-level RNN. Another technique is to first match OOV words with words seen during training, then replace the OOV words' embeddings with the seen words' embeddings. Peng et al. (2019) train a student network to predict the closest word representation to the OOV term. Fukuda et al. (2020) refer to pre-trained word embeddings of known words with surfaces similar to target OOV words. These methods learn a static OOV embedding representation and do not directly utilize the context.

Contextualized Embedding
In this strategy, contextual information is used to enhance the representation of OOV words. Hu et al. (2019) formulate the OOV problem as a K-shot regression problem and learn to predict the OOV embedding by aggregating only K contexts and morphological features. Furthermore, contextualized word embeddings can be provided by pre-trained models, which are pre-trained on large background corpora (Peters et al., 2018; Devlin et al., 2018; Liu et al., 2019). However, Yan et al. (2021) show that BERT is not always better at capturing context than GloVe-based BiLSTM-CRFs; its higher performance could be the result of learning the subword structure better.
Conclusion

In this work, we propose MINER, a novel NER learning framework from an information-theoretic perspective, with two mutual-information-based training objectives: generalizing information maximization and superfluous information minimization. Experiments on various datasets demonstrate that MINER achieves much better performance in predicting out-of-vocabulary entities.