The Role of Global and Local Context in Named Entity Recognition

Pre-trained transformer-based models have recently shown strong performance when applied to Named Entity Recognition (NER). As the complexity of their self-attention mechanism prevents them from processing long documents at once, these models are usually applied in a sequential fashion. Such an approach unfortunately only incorporates local context and prevents leveraging global document context in long documents such as novels, which might hinder performance. In this article, we explore the impact of global document context and its relationship with local context. We find that correctly retrieving global document context has a greater impact on performance than only leveraging local context, prompting further research on how to better retrieve that context.


Introduction
Named Entity Recognition (NER) is a fundamental task in Natural Language Processing (NLP), and is often used as a building block for solving higher-level tasks. Recently, pre-trained transformer-based models such as BERT (Devlin et al., 2019) or LUKE (Yamada et al., 2020) showed great NER performance and have been able to push the state of the art further.
These models, however, have a relatively short range because of the quadratic complexity of self-attention in the number of input tokens: as an example, BERT (Devlin et al., 2019) can only process spans of up to 512 tokens. For longer documents, texts are usually processed sequentially using a rolling window. Depending on the document, this local window may not always include all the context needed to perform inference, which may be present at the global document level. This leads to prediction errors (Stanislawek et al., 2019): in NER, this often occurs when the type of an entity cannot be inferred from the local context. For instance, in the following sentence from the fantasy novel Elantris, one cannot decide if the entity Elantris is a person (PER) or a location (LOC) without prior knowledge: "Raoden stood, and as he did, his eyes fell on Elantris again." In the novel, this prior knowledge comes from the fact that a human reader can recall previous mentions of Elantris, even at a very long range. A sequentially applied vanilla transformer-based model, however, might make an error without a neighboring sentence clearly establishing the status of Elantris as a city.
While some works propose to retrieve external knowledge to disambiguate entities (Zhang et al., 2022; Wang et al., 2021), external resources are not always available. Furthermore, external retrieval might be more costly or less relevant than performing document-level context retrieval, provided the document contains the needed information, which depends on the type of document. Therefore, we wish to explore the relevance of document-level context when performing NER. We place ourselves at the sentence level, and we distinguish and study two types of contexts:
• local context, consisting of surrounding sentences. This type of context can be used directly by vanilla transformer-based models, as their range extends beyond a single sentence. Fully using surrounding context as in Devlin et al. (2019) is, however, computationally expensive.
• global context, consisting of all sentences available at the document level. To enhance NER prediction at the sentence level, we retrieve a few of these sentences and provide them as context for the model.
In this article, we seek to answer the following question: is local context sufficient when solving the NER task, or would the model obtain better performance by retrieving global document context?

Related Works

Sparse Transformers
Since the range problem of vanilla transformer-based models is due to the quadratic complexity of self-attention in the number of input tokens, several works on sparse transformers proposed alternative attention mechanisms in the hope of reducing this complexity (Zaheer et al., 2020; Kitaev et al., 2020; Tay et al., 2020b,a; Beltagy et al., 2020; Choromanski et al., 2020; Katharopoulos et al., 2020; Child et al., 2019). While reducing self-attention complexity improves the effective range of transformers, these models still have issues processing very long documents (Tay et al., 2020c).

Context retrieval
Context retrieval in general has been widely leveraged for other NLP tasks, such as semantic parsing (Guo et al., 2019), question answering (Ding et al., 2020), event detection (Pouran Ben Veyseh et al., 2021), or machine translation.
In NER, context retrieval has mainly been used in an external fashion, for example by leveraging name lists and gazetteers (Seyler et al., 2018; Liu et al., 2019), knowledge bases (Luo et al., 2015) or search engines (Wang et al., 2021; Zhang et al., 2022). In contrast, we are interested in document-level context retrieval, which is comparatively seldom explored. While Luoma and Pyysalo (2020) study document-level context, their study is restricted to neighboring sentences, i.e. local context.

Retrieval Heuristics
In this paper, we wish to understand the role of both local and global contexts for the NER task. We split all documents in our dataset (described in Section 3.3) into sentences. We evaluate simple local and global sentence retrieval heuristics in terms of their impact on NER performance. We evaluate the following local heuristics:
• before: Retrieves the $k$ closest sentences to the left of the input sentence.
• after: Same as before, but to the right of the input sentence.
• surrounding: Retrieves the $k/2$ closest sentences on each side of the input sentence.
And the following global heuristics:
• random: Randomly retrieves a sentence from the whole document.
• samenoun: Randomly retrieves a sentence from the set of all sentences that have a noun in common with the input sentence. Intuitively, this heuristic will return sentences that contain entities of the input sentence, allowing for possible disambiguation. We use the NLTK library (Bird et al., 2009) to identify nouns.
• bm25: Retrieves sentences that are similar to the input sentence according to BM25 (Robertson, 1994). Retrieving similar sentences has already been found to increase NER performance (Zhang et al., 2022; Wang et al., 2021).
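As an illustration of the bm25 heuristic, the following is a minimal sketch of Okapi BM25 scoring between an input sentence and every other sentence of a document. It is not the implementation used in our experiments: the whitespace tokenizer, the parameter values k1 = 1.5 and b = 0.75, the exclusion of the input sentence from the candidates, and the function names are all assumptions made for the example.

```python
import math
from collections import Counter


def tokenize(sentence):
    # Assumption: a lowercase whitespace split stands in for a real tokenizer.
    return sentence.lower().split()


def bm25_scores(query, sentences, k1=1.5, b=0.75):
    """Score every sentence of the document against the query sentence."""
    docs = [tokenize(s) for s in sentences]
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of sentences containing each term.
    df = Counter(term for d in docs for term in set(d))
    query_terms = set(tokenize(query))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def retrieve_bm25(index, sentences, k):
    """Return the indices of the k sentences most similar to sentences[index]."""
    scores = bm25_scores(sentences[index], sentences)
    # Assumption: the input sentence itself is never a candidate.
    candidates = [i for i in range(len(sentences)) if i != index]
    candidates.sort(key=lambda i: scores[i], reverse=True)
    return candidates[:k]
```

When fewer than k sentences share a term with the input, the remaining slots are filled with zero-score sentences, which mirrors the situation where the heuristic ends up returning unrelated sentences.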
It has to be noted that global heuristics can sometimes retrieve local context, as they are not restricted in which sentences they can retrieve at the document level. For all configurations, we concatenate the retrieved sentences to the input. During this concatenation step, we preserve the global order between sentences in the document.
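The local heuristics and the order-preserving concatenation step can be sketched as follows. This is a simplified illustration rather than our actual code: the function names, the use of sentence indices, and the single-space separator between sentences are assumptions.

```python
def before(index, sentences, k):
    """Indices of the k closest sentences to the left of the input sentence."""
    return list(range(max(0, index - k), index))


def after(index, sentences, k):
    """Indices of the k closest sentences to the right of the input sentence."""
    return list(range(index + 1, min(len(sentences), index + 1 + k)))


def surrounding(index, sentences, k):
    """Indices of the k // 2 closest sentences on each side of the input."""
    half = k // 2
    return before(index, sentences, half) + after(index, sentences, half)


def build_input(index, retrieved, sentences):
    """Concatenate the retrieved sentences and the input in document order."""
    # Sorting the indices restores the global order of the document,
    # whichever heuristic (local or global) produced `retrieved`.
    ordered = sorted(set(retrieved) | {index})
    return " ".join(sentences[i] for i in ordered)
```

A complete pipeline would additionally need to keep track of which tokens belong to the input sentence, so that only their predictions are evaluated.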

Oracles
For each heuristic mentioned in Section 3.1, we also experiment with an oracle version. The oracle version retrieves 16 sentences from the document using the underlying retrieval heuristic, and retains only those that enhance the NER predictions the most. We measure this enhancement as the difference between the number of BIO tag errors made with and without the context. In essence, the oracle setup simulates a perfect re-ranker model, and allows us to study the maximum performance of such an approach.
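A minimal sketch of this oracle selection is given below, under stated assumptions: `predict` is a hypothetical callable standing in for the fine-tuned NER model, candidates are scored one at a time, and ties are broken by retrieval order. None of these names come from our code base.

```python
def count_errors(predicted, gold):
    """Number of positions where the predicted BIO tag differs from gold."""
    return sum(p != g for p, g in zip(predicted, gold))


def oracle_select(sentence, gold, candidates, predict, k=1):
    """Keep the k candidate sentences that reduce BIO tag errors the most.

    `predict(sentence, context)` is a hypothetical callable returning one
    BIO tag per token of `sentence`; `context` is either None or a single
    retrieved sentence.
    """
    baseline = count_errors(predict(sentence, None), gold)
    gains = []
    for candidate in candidates:
        errors = count_errors(predict(sentence, candidate), gold)
        gains.append((baseline - errors, candidate))
    # Each candidate is scored in isolation, so interactions between
    # selected sentences are not taken into account.
    gains.sort(key=lambda pair: pair[0], reverse=True)
    return [candidate for _, candidate in gains[:k]]
```

In the setup described above, `candidates` would hold the 16 sentences returned by the underlying heuristic.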

Dataset
To evaluate our heuristics, we use a corrected and improved version of the literary dataset of Dekker et al. (2019). This dataset comprises the first chapters of 40 novels, which we consider long enough for our experiments.

Dataset corrections
The original dataset suffers mainly from annotation issues. To fix them, we design an annotation guide and apply it consistently using a semi-automated process:
1. We apply a set of simple rules to identify obvious errors (for example, non-capitalized entities annotated as PER are often false positives). We manually review each heuristic choice before application.
2. We manually review each difference between BERT (Devlin et al., 2019) predictions on the dataset and annotations.
3. We manually correct the remaining errors.

Further annotations
The original dataset only consists of PER entities. We go further and annotate LOC and ORG entities. The final dataset contains 4476 PER entities, 886 LOC entities and 201 ORG entities.

NER Training
For all experiments, we use a pre-trained BERT$_{\text{BASE}}$ (Devlin et al., 2019) model, consisting of 110 million parameters, followed by a classification head at the token level to perform NER. We fine-tune BERT for 2 epochs with a learning rate of $2 \cdot 10^{-5}$ using the Hugging Face Transformers library (Wolf et al., 2020), starting from the bert-base-cased checkpoint.

NER evaluation
We perform 5-fold cross-validation on our NER dataset. We evaluate NER performance according to the CoNLL-2003 guidelines (Tjong Kim Sang and De Meulder, 2003), using the seqeval (Nakayama, 2018) Python library to ensure results can be reproduced.
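For readers unfamiliar with CoNLL-2003-style scoring, the sketch below illustrates entity-level evaluation, where a predicted entity counts as correct only if both its span and its type match the gold annotation exactly. It is a simplified pure-Python stand-in written for illustration, not the seqeval implementation: in particular, it assumes that an I- tag not preceded by a matching B- or I- tag is ignored, which may differ from seqeval's handling of malformed sequences.

```python
def extract_entities(tags):
    """Return the set of (start, end, type) spans encoded by BIO tags."""
    entities, start, kind = set(), None, None
    # A trailing "O" sentinel closes an entity that ends the sentence.
    for i, tag in enumerate(list(tags) + ["O"]):
        inside = tag.startswith("I-") and kind == tag[2:]
        if start is not None and not inside:
            entities.add((start, i, kind))
            start, kind = None, None
        if tag.startswith("B-"):
            start, kind = i, tag[2:]
    return entities


def entity_f1(gold_sentences, predicted_sentences):
    """Micro-averaged precision, recall and F1 over exact entity matches."""
    n_gold = n_pred = n_correct = 0
    for gold, predicted in zip(gold_sentences, predicted_sentences):
        gold_entities = extract_entities(gold)
        predicted_entities = extract_entities(predicted)
        n_gold += len(gold_entities)
        n_pred += len(predicted_entities)
        n_correct += len(gold_entities & predicted_entities)
    precision = n_correct / n_pred if n_pred else 0.0
    recall = n_correct / n_gold if n_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

Under this scheme, tagging Elantris as PER instead of LOC counts both as a missed LOC entity and as a spurious PER entity.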

Retrieval heuristics
The NER performance for retrieval heuristics can be seen in Figure 1. The samenoun heuristic performs the best among global heuristics, whereas the surrounding heuristic is the best among local heuristics. While the top results obtained with both heuristics are quite similar, we consider global heuristics as naive retrieval baselines: they could be bested by more complex approaches, which might enhance performance even more. Interestingly, the performance of both the before and bm25 heuristics decreases strongly after four sentences, and even drops below the no retrieval baseline. For both heuristics, this might be due to retrieving irrelevant sentences after a while. The bm25 heuristic is limited by the similar sentences present in the document: if there are not enough of them, the heuristic will retrieve unrelated ones. Meanwhile, the case of the before heuristic seems more puzzling, and could be indicative of a specific entity mention pattern that might warrant further investigation.

Oracle versions
NER results with the oracle versions of retrieval heuristics can be found in Figure 2. It is worth noting that the performance of the oracle versions of the heuristics always peaks when retrieving a single sentence. This might indicate that a single sentence is usually sufficient to resolve ambiguities, but it might also be a result of the oracle ranking sentences individually, thereby not taking into account their possible combinations. Global heuristics perform better than local ones overall, with the oracle version of the random heuristic even performing better than both the before and after heuristics. These results tend to highlight the benefits of using global document context, provided it can be retrieved accurately.

Retrieved sentences
To better understand which sentences are useful for predictions when performing global retrieval, we plot in Figure 3 the distribution of the distance between sentences and their retrieved contexts for heuristics samenoun and bm25. We find that, while useful sentences are most often close to the input sentence, a good number of useful sentences are still distant, highlighting the need for long-range retrieval.

Local context importance
To see whether or not local context is an important component of NER performance, we perform an experiment where we restrict the oracle version of the bm25 heuristic from retrieving local surrounding context. Results can be found in Figure 4. NER performance remains about the same without local context, which tends to show that local context is not strictly necessary for performance.

Conclusion and Future Work
In this article, we explored the role of local and global context in Named Entity Recognition. Our results tend to show that retrieving global document context is more effective at enhancing NER performance than retrieving only local context, even when using relatively simple retrieval heuristics. We also showed that a re-ranker model using simple document-level retrieval heuristics could obtain significant NER performance improvements, prompting further research on how to accurately retrieve global context for NER.

Limitations
We acknowledge the following limitations of our work:
• While the oracle selects a sentence according to the benefits it provides when performing NER, it does not consider the interactions between selected sentences. This may lead to lower performance when several sentences are retrieved at once.
• The retrieval heuristics considered are naive on purpose, as the focus of this work is not performance. Stronger retrieval heuristics may achieve better results than presented in this article.
• The studied documents only consist of the first chapter of each novel in a set of novels. Using longer texts would increase the amount of information available for retrieval by the presented global heuristics.