The Utility and Interplay of Gazetteers and Entity Segmentation for Named Entity Recognition in English

Recent papers have introduced methods to incorporate gazetteer features and entity segmentation techniques in neural named entity recognition models. These papers rely on different resources and include features not related to the use of gazetteers, rendering impossible the comparison of the relative effectiveness of the approaches. Here, we provide a comprehensive overview of methods for incorporating gazetteers and for entity segmentation. We evaluate representative methods from each in similar settings for a fair comparison and identify the ones that are consistently better across datasets and input representations. We further show that gazetteers improve entity segmentation and not just entity typing. Hence, we explore their utility in recognizing long entities, a problem for which entity segmentation techniques were developed. Our work explains the mechanisms via which gazetteers improve the performance of neural NER models.


Introduction
Named Entity Recognition (NER) has the unique property of being a task appealing to researchers and at the same time being fairly robust for immediate practical applications. In many domains, it is of interest to identify segments of text conveying a concept of a given type: a person (Grishman and Sundheim, 1996), an event (Hovy et al., 2006), a disease (Dogan et al., 2014), a gene, a chemical (Krallinger et al., 2015), a food (Magnolini et al., 2019), an item of clothing (Putthividhya and Hu, 2011), a research technique (Augenstein et al., 2017a), etc. Approaches to NER are typically not domain-specific, treating the problem as a sequence labelling task regardless of the categories of interest. Yet, researchers also widely agree that named entity recognition is a knowledge-intensive task (Ratinov and Roth, 2009; Seyler et al., 2018): the availability of external knowledge resources in the form of lists of example entities of a given type, or gazetteers, improves performance almost universally. Since gazetteers are readily available from knowledge bases, databases of products and specialized ontologies, having practical guidance on how to handle gazetteers in NER would be valuable.
In this paper, we provide a survey of how gazetteers have been used in neural approaches to NER in English and compare key approaches with the popular biLSTM-CRF architecture. To ensure that our conclusions accurately characterize the utility of gazetteers, we test the approaches on several datasets from different genres covering newswire, conversations and Twitter. The extensive head-to-head comparison reveals that while certain approaches are consistently beneficial, others are variable, impressively improving results on one dataset but reducing performance on others.
Gazetteers typically contain multi-word entries. In contrast, the majority of entity mentions in text consist of a single word, and models perform worse on longer entities. This discrepancy highlights a potential application of gazetteers for improving NER prediction for mentions of long entities. Recent work on entity segmentation as part of the named entity recognition task aims to recognize long entities better. We overview this work and compare these methods in the same way we compare methods for incorporating gazetteers.
We find that certain ways of incorporating gazetteers show stable improvements across datasets, while segmentation approaches do not appear to be that useful overall. We further explore the interplay of gazetteers with entity segmentation and their role in recognizing long entities. We find that incorporating gazetteers improves entity segmentation, not just entity typing, and that, depending on the input representation to the model, gazetteer types may be irrelevant. We also find that incorporating gazetteers can serve as an alternative method for recognizing long entities, likely due to the abundant presence of multi-word entities in the Wikipedia-derived gazetteers we used.
Our work provides (i) a concise overview of methods for incorporating gazetteers and entity segmentation in NER, (ii) a principled comparison of representative approaches for each aspect, and (iii) novel findings and analyses of the interplay between gazetteers and segmentation. Our findings can inform both future researchers and practitioners interested in NER.

biLSTM-CRF architecture for NER
We explore variants of the now classic biLSTM-CRF architecture for NER (Huang et al., 2015). We overview how gazetteer features and segmentation can be integrated in this paradigm and carry out a comparison of several representative methods. We use two input word representations: the 300-d GloVe vectors trained on Common Crawl (Pennington et al., 2014), which is the dominant representation in NER, and the 1024-d contextual ELMo (Peters et al., 2018) representations trained on the 1B Word Benchmark (Chelba et al., 2014). In each case, character-based word representations learned with CNNs (Ma and Hovy, 2016) are also concatenated. The final concatenated representation is used as input to bidirectional LSTMs (Hochreiter and Schmidhuber, 1997), followed by a CRF (Lafferty et al., 2001) layer. We use the implementation by Lample et al. (2016) for most experiments.

Why Use Gazetteers?
Gazetteers are large dictionaries consisting of lists of entities of a particular type. For example, a person gazetteer may consist of full names and parts of names such as the first names of people. Before we start our discussion of methods for incorporating gazetteers, it is worth considering if we have sufficient evidence that they are needed at all.
Gazetteers are needed for better generalization through improved entity coverage, to predict the type for words that have not been encountered in training and possibly even pre-training. This means the need will be more acute for practical deployment of NER and will be less pronounced on fixed datasets in which train and test data are sampled from overlapping or adjacent time periods, with high overlap of entities across both.
The most compelling example of the need to handle unseen language comes from work on NER on Twitter. Language on Twitter changes rapidly, much more rapidly than for other types of text (Eisenstein, 2013). This change requires models to be retrained periodically to maintain optimal performance for the current time period (Rijhwani and Preotiuc-Pietro, 2020). An alternative to retraining, not yet explored in the literature, is to develop methods that can make use of gazetteers, which could possibly be updated more quickly and cheaply than continuously annotating new training data.
Even in stable domains such as newswire, the ability of models to generalize to words not seen in the training data is low (Augenstein et al., 2017b; Fu et al., 2020a,b). Both traditional models with hand-crafted features (Finkel et al., 2005; Okazaki, 2007) and more recent neural network approaches (Collobert et al., 2011; Huang et al., 2015; Peters et al., 2018; Devlin et al., 2019) achieve lower performance on entities unseen in the training data.
Methods that make use of large pretrained language representations, neural (Collobert et al., 2011) or not (Miller et al., 2004), can ameliorate the problem of coverage to some extent. We do not yet know enough about how pretraining data should be chosen (Cherry and Guo, 2015), though there is some evidence that performance on downstream tasks correlates with the vocabulary coverage in the pre-training data (Dai et al., 2019). Prior work has reported that performance is lowest for words that appear neither in the training nor the pretraining vocabulary (Ma and Hovy, 2016). Moreover, the deteriorated performance on out of vocabulary words is not necessarily a failure of the models: many contexts simply do not provide sufficient knowledge to predict the type of an entity, even for people (Agarwal et al., 2021). Models need to expand their knowledge of entities and gazetteers are a natural way for doing that.

Where Do Gazetteers Come From?
Existing tables, lists, directories, databases and knowledge bases are widely available and can be used to derive gazetteers. Some researchers have specifically compiled various resources to form gazetteers, while others make use of those provided in prior work. Early work collected gazetteers from the CIA factbook for geographic locations, lists of popular person names, etc. (Mikheev et al., 1999). More recently, Ratinov and Roth (2009) derived a gazetteer from the Web and Wikipedia, and Chiu and Nichols (2016) used DBPedia.
In our work, we use the Ratinov and Roth (2009) wide-coverage gazetteer. It contains ∼3M entities grouped into ∼30 fine-grained categories. In some experiments, we use all categories, regardless of the entity types in the dataset. In the remaining experiments, we identify the gazetteer categories that most closely match the types in a given dataset and disregard the rest. The mapping can be found in the appendix. Table 1 shows the approximate percentage of entities by length in words in our gazetteer. Most entries are of length two, but unlike NER datasets (Table 3), the remaining entries are evenly distributed between length 1, 3 and ≥4. Most notably, around 24% of gazetteer entities have four or more words. In comparison, in NER datasets, where entities appear in the context of a sentence that is often a part of a longer document, such long entities typically make up about 2% of all entities. This distributional difference hints at the possibility of using gazetteers not only to improve coverage but also to improve performance on longer entities, for which segmentation methods are developed.
Regardless of the distribution of entity lengths, the total number of entities in gazetteers is much higher than that in NER datasets so a higher percentage of longer entities does not equate to a small number of short entities.
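As an illustration of the length statistics discussed above, the following is a minimal sketch of how the share of entries per length bucket could be computed. The function name, the whitespace tokenization and the toy entries are assumptions made for illustration; they are not the gazetteer or the exact procedure behind Table 1.

```python
from collections import Counter
from typing import Dict, Iterable

BUCKETS = ("1", "2", "3", "4+")


def length_distribution(entries: Iterable[str]) -> Dict[str, float]:
    """Fraction of entries per length bucket (length = number of whitespace tokens)."""
    counts = Counter()
    for entry in entries:
        n_words = len(entry.split())
        if n_words == 0:
            continue  # skip empty strings
        counts[str(n_words) if n_words < 4 else "4+"] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {bucket: counts[bucket] / total for bucket in BUCKETS}


# Toy entries (hypothetical, for illustration only).
toy_entries = ["Paris", "New York", "New York University", "Bank of New York Mellon"]
toy_distribution = length_distribution(toy_entries)
```

The same bucketing can be applied to the gold entities of an NER dataset to compare the two distributions side by side.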

Gazetteer Features for NER
Here, we overview the ways gazetteers have been integrated in NER models.

Discrete Gazetteer Lookup Features
Feature-based CRF models for NER used gazetteers to generate indicators for each word in a sentence (Bender et al., 2003; Minkov et al., 2005; Ratinov and Roth, 2009; Ritter et al., 2011; Yang et al., 2016; Seyler et al., 2018). The number of indicators equals the number of entity types in the dataset, and each indicates (with a binary 1/0 value) if the word is part of a gazetteer entry of the given type.
Many neural network approaches continue to incorporate gazetteers as discrete indicator features concatenated to the pre-trained word embeddings as the input (Collobert et al., 2011; Huang et al., 2015). Adding the features in later stages does not work as well (Magnolini et al., 2019). Both Collobert et al. (2011) and Huang et al. (2015) pre-process datasets to match the gazetteer entries to sentences, using both exact matches and multi-word partial matches to gazetteer entries. Chiu and Nichols (2016) perform a similar matching but use four binary values for each label, indicating whether the given word matches the gazetteer entity exactly (S), at the beginning (B), at the end (E) or at any of the words in between (I).
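The four-valued lookup scheme described above can be sketched as follows. This is a simplified illustration, not the implementation of any cited paper: it only considers exact matches of contiguous token spans against gazetteer entries (partial matching is omitted), and the function name, the tuple-based gazetteer format and the `max_len` cutoff are assumptions.

```python
from typing import Dict, List, Sequence, Set, Tuple

TAGS = ("B", "I", "E", "S")  # begin, inside, end, single-word match


def iobes_gazetteer_features(
    tokens: Sequence[str],
    gazetteers: Dict[str, Set[Tuple[str, ...]]],
    max_len: int = 5,  # assumed cap on the span length that is looked up
) -> Tuple[List[str], List[List[int]]]:
    """Return the type order and one binary vector (4 bits per type) per token."""
    types = sorted(gazetteers)
    feats = [[0] * (len(types) * len(TAGS)) for _ in tokens]
    for t_idx, etype in enumerate(types):
        entries = gazetteers[etype]
        for start in range(len(tokens)):
            last_end = min(len(tokens), start + max_len)
            for end in range(start + 1, last_end + 1):
                if tuple(tokens[start:end]) not in entries:
                    continue
                if end - start == 1:
                    positions = [(start, "S")]
                else:
                    positions = [(start, "B")]
                    positions += [(i, "I") for i in range(start + 1, end - 1)]
                    positions.append((end - 1, "E"))
                for i, tag in positions:
                    feats[i][t_idx * len(TAGS) + TAGS.index(tag)] = 1
    return types, feats


# Toy gazetteers (hypothetical entries).
toy_gaz = {
    "LOC": {("New", "York")},
    "ORG": {("New", "York", "University")},
}
toy_types, toy_feats = iobes_gazetteer_features(
    ["New", "York", "University", "opened"], toy_gaz
)
```

In this sketch a word can carry several active bits at once, for instance when it ends one entry and sits inside another, which mirrors the ambiguity of gazetteer matches.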

Continuous Gazetteer Features
The approach above does not use gazetteers very effectively. Gazetteers contain many more entities of each type than are available in even the largest training set (Table 3). One insight is to use the gazetteers as an additional source of training examples. A simple way is to add the gazetteer entries to the labeled data, without any context. Liu et al. (2019a) report that this data augmentation approach led to much worse overall results, presumably because of the great shift in label distributions. Another approach is to augment the training data by replacing entities in place with other entities from gazetteers. Song et al. (2020) reported no improvement with such a random entity replacement, likely due to the need for manual intervention when replacing entities of some types to maintain coherence of text (Agarwal et al., 2020).
A much more successful alternative is to learn a separate (or sub-) module, trained to predict types for text spans, using the gazetteer entries and synthetic negative examples sampled from a NER training set or even the gazetteer. We will refer to the separate module as a gazetteer network. It is straightforward to integrate the label distribution scores from this model in a semi-Markov CRF for sequence labeling (Ye and Ling, 2018) that operates at the span level (which we describe in greater detail later). The resulting combination is far more effective than discrete indicator gazetteer features. Magnolini et al. (2019) and Liu et al. (2019b) propose a similar approach. They learn a gazetteer network but instead of using the label score distribution, intermediate word representations (gazetteer embeddings henceforth) are incorporated in the NER model. Liu et al. (2019b) use a semi-Markov CRF operating at the span level and generate the gazetteer embeddings for each potential span. They follow the evaluation approach of Ma and Hovy (2016), breaking down results by whether an entity was seen only in training, seen only in pretraining, seen in both and seen in neither. The largest improvement was in the "seen in neither" subset, showing that this approach is particularly helpful for out-of-vocabulary words with respect to the training and pre-training data. Magnolini et al. (2019) use the standard word-level CRF and hence do not have spans available, so they input the full sentence to the gazetteer network. This makes the training and inference setups for the gazetteer network different, as entity phrases are used as input during training. They reported mixed results for this approach. In our experiments, we evaluated their method on a larger number of datasets but used a different approach for negative sampling for the gazetteer network training data. We observed some improvement on almost all datasets, contingent on the input representation.

Contextual Gazetteers
Learning from just the gazetteer has the drawback that the representations do not include any clues about the context in which the entity types are used. The same entity may appear in multiple gazetteers. Given that current methods heavily rely on entity memorization and little on context, this is possibly acceptable. For completeness however, we ought to mention that the link structure of Wikipedia can be used to derive dense representations for entity types directly (Long et al., 2016; Ganea and Hofmann, 2017; Mengge et al., 2020; Ghaddar and Langlais, 2018). Comparing gazetteer representations with and without context would be a direction for exploration in future work.

Entity Segmentation in NER
Early work (Collins and Singer, 1999; Downey et al., 2007; Ritter et al., 2011) treated NER as two subtasks: entity segmentation, i.e. finding spans of text that refer to named entities, and typing, i.e. assigning a type to the identified span. Recent efforts have also incorporated entity segmentation explicitly in neural models, with the goal of finding longer entities better (Xiao et al., 2019; Ye and Ling, 2018). Such work can be divided into two categories: multi-task learning and semi-Markov CRFs.

Multi-task Learning
Multi-task learning (MTL) involves jointly training multiple related tasks using the same representation such that the auxiliary tasks can help with the performance of the target task. Aguilar et al. (2017) use MTL with hard parameter sharing to add two auxiliary tasks: binary classification to identify entity spans (segmentation) and multi-class classification to type them without a CRF. Both tasks use the same biLSTM representation as the target task. The output of the additional tasks is not used in the NER task; they only act as regularizers for NER. Others (Stratos, 2017; Aguilar et al., 2018) also use auxiliary tasks as regularizers but instead of binary classification, the additional task performs multi-class classification into B (first word of entity), I (remaining entity words) and O (non-entity).
The auxiliary tasks in MTL can also be used for extra supervision by concatenating their output label distribution to the representation used by NER (Xiao et al., 2019). Unlike prior work, Xiao et al. (2019) do not use the same representation for the target and auxiliary task. Instead, they build a submodule called similarity-based auxiliary classifier (SAC). SAC takes as input the original input representation and adds token position embeddings (Vaswani et al., 2017), followed by multiple convolution layers. It maintains two randomly initialized vectors representing entity and non-entity classes. These vectors are combined with an attention layer to get the final word representations. The attention weights are calculated with the multiplicative attention function over the word representation and each of these two vectors. The final word representation is concatenated with the biLSTM representation for NER and the attention weights are used as a proxy for the probabilities of the word being an entity or not. The loss of both tasks is jointly optimized, with less weight given to entity segmentation than to NER. Yu et al. (2018) also add extra supervision, but with a two-step approach. They learn two character-level language models for entity and non-entity words and use their output as a binary feature in NER. Augenstein and Søgaard (2017) use MTL for NER in scientific texts, adding five auxiliary tasks. They add syntactic chunking, hyperlink prediction, multi-word expression identification, frame target annotation, and semantic super-sense tagging; the first three target segmentation. The auxiliary tasks are used one at a time with the target task. Since the datasets for NER and the auxiliary tasks are different, at each training step, a random task is chosen, followed by a random training instance.
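The auxiliary segmentation targets used by the multi-task approaches in this section need no extra annotation, since they can be derived from the NER tags themselves. The sketch below assumes BIO-encoded NER tags; the function names are hypothetical, and the default weight of 0.5 in `joint_loss` is only a placeholder for "less weight on segmentation", not a value reported by any of the cited papers.

```python
from typing import List, Sequence


def binary_segmentation_labels(ner_tags: Sequence[str]) -> List[int]:
    """1 if the word is part of any entity, 0 otherwise (binary auxiliary task)."""
    return [0 if tag == "O" else 1 for tag in ner_tags]


def bio_segmentation_labels(ner_tags: Sequence[str]) -> List[str]:
    """Drop the entity type and keep only the B/I/O position (multi-class auxiliary task)."""
    return [tag if tag == "O" else tag.split("-", 1)[0] for tag in ner_tags]


def joint_loss(ner_loss: float, seg_loss: float, seg_weight: float = 0.5) -> float:
    """Weighted sum of the target and auxiliary losses (seg_weight < 1 is an assumption)."""
    return ner_loss + seg_weight * seg_loss


example_tags = ["B-PER", "I-PER", "O", "B-LOC"]
example_binary = binary_segmentation_labels(example_tags)
example_bio = bio_segmentation_labels(example_tags)
```

When the auxiliary task acts only as a regularizer, these labels affect training through the loss alone; when it is used for extra supervision, its predicted distribution is additionally fed to the NER layers.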

Semi-Markov CRFs
Semi-Markov CRFs (Sarawagi and Cohen, 2004) are a variant of linear chain CRFs that capture dependencies between adjacent spans of text instead of adjacent words. The Markov assumption still holds across spans but not within a span. The goal is to find the best possible segmentation into spans using scores at the span level. The maximum length of spans is bounded to reduce computation cost. Sarawagi and Cohen (2004) use hand-crafted features for span representations but recent work has explored other techniques to represent spans. The gated recursive semi-Markov CRF (Zhuo et al., 2016) creates a pyramid-shaped feature extractor for spans. The bottom-most layer consists of word representations and hence length-one spans. Representations of adjacent words are combined to form length-two spans for the next layer and so on. The top layer consists of a single span with the full sentence. The hybrid semi-Markov CRF or HSCRF (Ye and Ling, 2018) does not use an explicit span representation. Instead, it considers the span-level score as the sum of the word-level CRF scores of the constituent words. Both the word-level and span-level CRFs are jointly optimized. Sato et al. (2017) use a two-step process to reduce the search space of the spans. They first generate possible spans from a separate model using a score cutoff and then find the best possible labelling over these spans instead of all spans up to a maximum specified length.
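To make the span-level search concrete, the following is a minimal sketch of Viterbi-style decoding for a semi-Markov model with a bounded span length. It is a generic dynamic program, not the decoder of any cited system: `span_score` and `trans_score` are assumed to be supplied by the caller (for example by a neural span scorer), and all names and the toy scores are hypothetical.

```python
from typing import Callable, Dict, List, Optional, Sequence, Tuple

Segment = Tuple[int, int, str]  # (start, end_exclusive, label)


def semi_markov_decode(
    n: int,
    labels: Sequence[str],
    span_score: Callable[[int, int, str], float],
    trans_score: Callable[[str, str], float],
    max_len: int,
) -> List[Segment]:
    """Best segmentation of n tokens into labeled spans of length <= max_len."""
    neg_inf = float("-inf")
    # best[j][y]: best score of a segmentation of tokens[:j] whose last span has label y
    best: List[Dict[str, float]] = [{y: neg_inf for y in labels} for _ in range(n + 1)]
    back: List[Dict[str, Optional[Tuple[int, Optional[str]]]]] = [
        {y: None for y in labels} for _ in range(n + 1)
    ]
    for j in range(1, n + 1):
        for y in labels:
            for length in range(1, min(max_len, j) + 1):
                i = j - length
                score = span_score(i, j, y)
                prev: Optional[str] = None
                if i > 0:
                    # Markov assumption across spans: only the previous span label matters.
                    prev = max(labels, key=lambda p: best[i][p] + trans_score(p, y))
                    score += best[i][prev] + trans_score(prev, y)
                if score > best[j][y]:
                    best[j][y] = score
                    back[j][y] = (i, prev)
    segments: List[Segment] = []
    j = n
    label: Optional[str] = max(labels, key=lambda y: best[n][y]) if n > 0 else None
    while j > 0:
        i, prev = back[j][label]
        segments.append((i, j, label))
        j, label = i, prev
    return segments[::-1]


# Toy scores (hypothetical): one 3-word ORG span is rewarded, single-word O spans are mildly good.
def toy_span_score(i: int, j: int, y: str) -> float:
    if y == "ORG":
        return 3.0 if (i, j) == (1, 4) else -1.0
    return 0.5 if j - i == 1 else -1.0


toy_segments = semi_markov_decode(
    5, ["O", "ORG"], toy_span_score, lambda p, y: 0.0, max_len=4
)
```

The cost grows with the number of tokens times `max_len` times the square of the number of labels, which is why the maximum span length is bounded. A hybrid scorer in the spirit of HSCRF could define `span_score` as a sum of word-level scores over the span, though that choice is not shown here.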

Datasets
We evaluate several of the above models on four datasets, to compare their performance. Table 2 shows the entity types in each dataset.
1. CoNLL is the English portion of the CoNLL'03 data (Tjong Kim Sang and De Meulder, 2003), extracted from the Reuters 1996 newswire corpus.
2. ON is the union of broadcast conversation (bc) and telephone conversation (tc) domains in the English portion of OntoNotes (Hovy et al., 2006). The number of entity types is the largest in this dataset. We merge the closely related categories, GPE with LOC and FAC with ORG, to allow us to easily map gazetteer labels to dataset labels.
3. BTC or Broad Twitter Corpus (Derczynski et al., 2016) consists of tweets. We use the recommended train, validation and test splits.
4. TTC or Temporal Twitter Corpus (Rijhwani and Preotiuc-Pietro, 2020) also consists of tweets. It has multiple training splits from years ranging from 2014 to 2018 and validation and test splits from 2019. We use the 2014 training split as it overlaps less with 2019 and hence is more challenging.
Some dataset statistics are shown in Table 3. CoNLL and OntoNotes are the largest and are roughly equal in size. OntoNotes has more long entities and fewer entities that are sequences of capitalized words. Such characteristics will favor methods that perform better segmentation without relying on standard orthography. BTC and TTC are smaller and have more distinct surface forms. The entities in TTC follow capitalization conventions but BTC has many entities that consist of words that are not capitalized. BTC also exhibits a different distribution of capitalization patterns between the training and the test set. In the training set, roughly half of the entities are sequences of words with a capitalized first letter but this number falls to just 28% in the test set. Most entities in all datasets consist of a single word. Only about 2% of entities have length more than three. OntoNotes contains the largest percentage of long entities, 10% of all.
We also present statistics on in/out of vocabulary words in the test data, with respect to the pretraining data, the training data and the gazetteer entries. An entity is considered seen in pre-training if the phrase is seen as such in the pre-training corpus or all of the constituent words are seen. For CoNLL and OntoNotes, almost all entities are seen in the pretraining data (>90%), followed by TTC (88%). For BTC, only 65-75% of entities are seen in pretraining. Adding ELMo increases the entity coverage by only a small amount in all datasets. The gazetteer provides better coverage than the training data. Training data coverage is especially low in TTC, making the dataset more challenging. According to our definition of seen, entities may be seen in the training data but not necessarily with the expected types. We also check if they are seen in the training data with the same type as in the test data. There is a decrease of only 1-5% when taking the type into consideration. Overwhelmingly, the type in training is also that in testing.
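The notions of "seen" used in these statistics can be sketched as below. This is an illustration under stated assumptions: the pretraining vocabulary is modeled as a plain set of strings that may contain both single words and full phrases, matching is case-sensitive, and the function names and toy data are hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple


def seen_in_pretraining(entity: str, vocab: Set[str]) -> bool:
    """Seen if the full phrase is in the vocabulary or every constituent word is."""
    return entity in vocab or all(word in vocab for word in entity.split())


def seen_in_training(
    entity: str,
    etype: str,
    train_types: Dict[str, Set[str]],  # surface form -> types observed in training
    require_same_type: bool = False,
) -> bool:
    """Seen in training, optionally only if observed with the same type."""
    if entity not in train_types:
        return False
    return etype in train_types[entity] if require_same_type else True


def coverage(flags: Iterable[bool]) -> float:
    """Fraction of test entities for which the flag is True."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 0.0


# Toy data (hypothetical).
toy_vocab = {"New", "York", "University", "Paris"}
toy_train = {"Paris": {"LOC"}, "Washington": {"PER"}}
toy_test: Tuple[Tuple[str, str], ...] = (
    ("New York University", "ORG"),
    ("Paris", "LOC"),
    ("Washington", "LOC"),
    ("Zorblax", "PER"),
)
pretrain_cov = coverage(seen_in_pretraining(e, toy_vocab) for e, _ in toy_test)
train_cov = coverage(seen_in_training(e, t, toy_train) for e, t in toy_test)
train_cov_typed = coverage(seen_in_training(e, t, toy_train, True) for e, t in toy_test)
```

The gap between the untyped and typed training coverage in this sketch corresponds to the small decrease reported when the type is taken into consideration.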

Experiments
Results for models using different word representations, without any gazetteer or segmentation features, are shown in Table 4. We report micro-F1 over all entity types, averaged over three runs. On all datasets, the combination of GloVe, ELMo and character-based representations works best. Given these results, it would have been reasonable to study methods of adding gazetteers only for this representation. However, given that the GloVe along with character-based embeddings is much more commonly used in recent work on NER, we also present results for that.
We now compare one representative approach from each class of methods for adding gazetteer and segmentation features to the biLSTM-CRF architecture. For a fair comparison, we add only the core idea of the model and use the same hyperparameters, noted in the appendix, removing different peripheral features such as part-of-speech tags and word shape, used in papers that introduced the idea.
Table 4: NER F1 of models with varying input representation to biLSTM-CRF. G refers to GloVe, E refers to ELMo and ch refers to the character-based representation. The highest value in each column is boldfaced.

Gazetteers
We compare four gazetteer-derived features. In each case, the gazetteer representation vector is passed through a feedforward layer with 32 neurons and ReLU activation and then concatenated to the input word representation to the biLSTM-CRF.
1. WORD GAZ We map gazetteer entity types to dataset types and split gazetteer entries into words using space as a delimiter, to create a vocabulary associated with each entity type, i.e. a list of words that appeared in person names etc. Each word in a sentence is associated with a binary valued vector with length equal to the number of entity types in the dataset. The dimension corresponding to a given type gets value one if the word is in the gazetteer vocabulary for that type and zero otherwise. The vector is all zeros for words that do not appear in the vocabulary for any entity type and can have multiple components with value one, when the word appeared in gazetteer entries of more than one type.
2. GAZ IOBES is similar to WORD GAZ but instead of a single bit, there are four values for each label denoting a match to an entity exactly (S), to the first word (B), to the last word (E) or to other words in between (I).
3. GAZ FINE is also similar to WORD GAZ but instead of selecting types from the gazetteer to match the label types for a given dataset, we use all 30 gazetteer types as is. The length of the gazetteer vector equals the number of the original gazetteer types without any mapping.
4. PHRASE GAZ Here, we perform phrase level matching between the sentence and the gazetteer. All subsequences (continuous sequences of words) in a sentence are matched to each entry in the gazetteers, retaining the gazetteer type only for the longest match. For example, only 'New York University' is matched, not the nested 'New York' entity in it. Each word in the longest match subsequence gets the bit corresponding to the matched gazetteer category switched on in the gazetteer vector representation.
Both word-level matching and learned gazetteers make it possible for the model to learn phrases beyond the exact entries in the gazetteers. For example, if the organization gazetteer contains 'GAZ High School' but the dataset contains 'TST High School', phrase-level matching would not match any of the words in 'TST High School' to a gazetteer. A word-level matching would at least mark 'High' and 'School' as organizations. Similarly, learned gazetteer embeddings may learn a similar representation for both and make it possible to recognize 'TST High School' correctly.
Table 5: NER F1 using GloVe+char. avg-diff and max-diff are the average and maximum increase over the base model across datasets. The most stable system (highest avg-diff) in each category is boldfaced.
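The difference between word-level and phrase-level matching can be illustrated with a small sketch. It is not the code used in the experiments: the function names are hypothetical, entries are stored as space-separated strings, and the way overlapping phrase matches are resolved (longest spans are claimed first and any span overlapping an already matched span is skipped) is an assumption.

```python
from typing import Dict, List, Sequence, Set


def word_gaz(tokens: Sequence[str], gazetteers: Dict[str, Set[str]]) -> List[List[int]]:
    """One bit per type: is the word in the vocabulary built from that type's entries?"""
    types = sorted(gazetteers)
    vocab = {t: {w for entry in gazetteers[t] for w in entry.split(" ")} for t in types}
    return [[int(tok in vocab[t]) for t in types] for tok in tokens]


def phrase_gaz(tokens: Sequence[str], gazetteers: Dict[str, Set[str]]) -> List[List[int]]:
    """One bit per type, switched on only for words inside the longest matching phrase."""
    types = sorted(gazetteers)
    n = len(tokens)
    feats = [[0] * len(types) for _ in tokens]
    covered = [False] * n
    for length in range(n, 0, -1):  # longest candidate spans first
        for start in range(0, n - length + 1):
            end = start + length
            if any(covered[start:end]):
                continue  # nested or overlapping shorter match: ignored
            phrase = " ".join(tokens[start:end])
            matched = [k for k, t in enumerate(types) if phrase in gazetteers[t]]
            if matched:
                for i in range(start, end):
                    covered[i] = True
                    for k in matched:
                        feats[i][k] = 1
    return feats


# Toy gazetteers built from the examples in the text (type order: LOC, ORG).
toy_gazetteers = {
    "LOC": {"New York"},
    "ORG": {"New York University", "GAZ High School"},
}
nyu = ["New", "York", "University"]
tst = ["TST", "High", "School"]
nyu_phrase, nyu_word = phrase_gaz(nyu, toy_gazetteers), word_gaz(nyu, toy_gazetteers)
tst_phrase, tst_word = phrase_gaz(tst, toy_gazetteers), word_gaz(tst, toy_gazetteers)
```

On the 'TST High School' example, the phrase-level vectors are all zeros while the word-level vectors mark 'High' and 'School' as organization words, which is the behavior contrasted in the text.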

Segmentation
We experiment with the different multi-task learning methods of incorporating segmentation and one of the semi-Markov CRF models.
Table 8: Entity segmentation F1 using GloVe+char. avg-diff and max-diff are the average and maximum increase over the base model across datasets. The most stable system (highest avg-diff) in each category is boldfaced.

Results
We test the gazetteer-enhanced and segmentation approaches using both GloVe+char and ELMo+GloVe+char as the input representations. Results are shown in Tables 5 and 6 respectively.

Consistency Across Datasets
We report the average and maximum improvement in F1-score over the base model across datasets. A high average improvement means that the model is consistently better across datasets spanning newswire, conversations and Twitter posts. A high maximum improvement with a low or negative average improvement means that the model can do well on some dataset but fails to perform well when tested on multiple datasets of varied genres.
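The two summary statistics can be written down directly. In the sketch below the function name is hypothetical and the scores are made-up placeholders chosen only to show the contrast between the two numbers; they are not results from any table in this paper.

```python
from typing import Dict, Tuple


def avg_and_max_diff(base: Dict[str, float], model: Dict[str, float]) -> Tuple[float, float]:
    """Average and maximum F1 difference (model minus base) over the datasets in `base`."""
    diffs = [model[name] - base[name] for name in base]
    return sum(diffs) / len(diffs), max(diffs)


# Placeholder scores (NOT real results): a large gain on one dataset, none or a loss elsewhere.
placeholder_base = {"dataset_1": 90.0, "dataset_2": 70.0, "dataset_3": 80.0, "dataset_4": 60.0}
placeholder_model = {"dataset_1": 89.0, "dataset_2": 75.0, "dataset_3": 80.0, "dataset_4": 60.0}
avg_diff, max_diff = avg_and_max_diff(placeholder_base, placeholder_model)
```

With these placeholders the maximum difference is 5.0 while the average is only 1.0, the pattern described above for a method that helps on one dataset but not across genres.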
The word indicator gazetteer features are the best and most consistent on average. With ELMo+GloVe, they also show the maximum improvement. WORD GAZ maps gazetteer labels to dataset labels whereas GAZ FINE does not do any dataset-specific mapping. Prior work has used the former but this experiment shows that we do not need to do such dataset-specific modifications of gazetteer labels. We can use fine-grained labels and obtain similar gains. PHRASE GAZ and LRN GAZ improve performance with GloVe+char, with especially high performance on TTC, but they are not as good with ELMo+GloVe+char. With both representations, the improvement from incorporating gazetteers is higher on BTC and TTC than on CoNLL and OntoNotes. This is likely because the gazetteer features are more ambiguous in CoNLL and OntoNotes (Table 7), with most words appearing in two or more possible categories.
Table 9: Entity segmentation F1 using ELMo+GloVe+char. avg-diff and max-diff are the average and maximum increase over the base model across datasets. The most stable system (highest avg-diff) in each category is boldfaced.
Segmentation methods can show vast improvement on specific datasets but they are unstable across datasets. The average improvement is negative in many cases for both representations. SEG SUP is the only segmentation approach that is consistently beneficial across datasets for both representations. HSCRF performs well too, but only when ELMo is used as well. In comparison, including gazetteers is consistently better.

Gazetteers for Segmentation
In this section, we explore if gazetteers can be used as an alternative for segmentation, owing to their stable performance across datasets. We ask the following question: are gazetteers recognizing new entities not previously marked as entities, or are they improving the typing of spans already recognized as entities by adding entity type information? To answer this question, we report the entity segmentation F1 in Tables 8 and 9. Entity segmentation is the task of finding the correct entity span, regardless of the type. Since pre-training data covers most entities (Table 3), one would expect gazetteers to improve typing of entities through a more explicit type signal. However, segmentation results are consistent with those of NER. The models that improved performance of NER also perform well on segmentation and often by similar margins as NER. In fact, following similar trends as NER, gazetteer-enhanced models are more consistent than segmentation methods at entity segmentation as well. Since gazetteer-enhanced models induce segmentation and improve performance despite the near-perfect coverage provided by the pretraining data, we expect them to improve performance of even the latest transformer-based models (Devlin et al., 2019) pre-trained on large corpora.
Next, we modify WORD GAZ to mark the presence in any gazetteer without taking the type into consideration (Table 11). Surprisingly, removing type information results in better performance with GloVe, indicating that most of the benefit came from segmentation and not typing. With ELMo included, however, we do not observe the same trend. While performance is better than the baseline system, it is not better than WORD GAZ that includes types. Though we cannot point conclusively to the reason behind such results with ELMo, we suspect it is due to the ambiguity within the pre-training data.
The model may not be using the gazetteer representation if there is a strong signal from the pretrained representation that the word is not an entity. This can only be verified if all representations are trained on the same pre-training corpus.
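The segmentation score used in this section differs from the NER score only in that entity types are ignored when comparing predicted and gold spans. The following is a minimal sketch of that computation for a single sentence, assuming BIO-encoded tag sequences; the function names are hypothetical and the toy tags are for illustration only.

```python
from typing import Sequence, Set, Tuple

Span = Tuple[int, int, str]  # (start, end_exclusive, type)


def extract_spans(tags: Sequence[str]) -> Set[Span]:
    """Collect entity spans from a BIO tag sequence."""
    spans: Set[Span] = set()
    start, etype = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing span
        begins = tag.startswith("B-") or (
            tag.startswith("I-") and (start is None or tag[2:] != etype)
        )
        if start is not None and (tag == "O" or begins):
            spans.add((start, i, etype))
            start, etype = None, None
        if begins:
            start, etype = i, tag[2:]
    return spans


def span_f1(gold: Set[Span], pred: Set[Span], typed: bool = True) -> float:
    """F1 over exact span matches; with typed=False only the boundaries must agree."""
    if typed:
        gold_set, pred_set = set(gold), set(pred)
    else:
        gold_set = {(s, e) for s, e, _ in gold}
        pred_set = {(s, e) for s, e, _ in pred}
    true_pos = len(gold_set & pred_set)
    if true_pos == 0:
        return 0.0
    precision, recall = true_pos / len(pred_set), true_pos / len(gold_set)
    return 2 * precision * recall / (precision + recall)


# Toy example: correct boundaries everywhere, one wrong type.
gold_spans = extract_spans(["B-PER", "I-PER", "O", "B-LOC"])
pred_spans = extract_spans(["B-ORG", "I-ORG", "O", "B-LOC"])
ner_f1 = span_f1(gold_spans, pred_spans, typed=True)
seg_f1 = span_f1(gold_spans, pred_spans, typed=False)
```

In the toy example the prediction is fully correct as a segmentation but only half correct as NER, which is the kind of gap that separates typing errors from boundary errors.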

Gazetteers for Long Entities
To further verify the effectiveness of gazetteers for segmentation, we break down performance by entity length in words, reporting F1 in Table 10 for the top performing gazetteer-enhanced model and segmentation model with ELMo+GloVe+char. Recall that the majority of entities in all corpora are of length one and that entities consisting of more than three words are rare, about 2% in three of the datasets, except in OntoNotes, where they are 10%. The highest performance is on entities of length two, followed by length one. Longer entities are recognized much more poorly.
Typically, segmentation methods have been used to improve performance on long entities. But our experiments reveal that gazetteers are better at it. With the exception of TTC, WORD GAZ is better on long entities and HSCRF on shorter ones. This is likely due to the presence of many long entities in gazetteers (§4, Table 1).

Conclusion
We provided a comprehensive overview of methods for incorporating gazetteers and inducing segmentation in NER. We chose these two areas because even though they have been explored separately in prior work, we find that they are interrelated and achieve similar goals. We implemented representative models from each category for a fair comparison. We found that while segmentation methods can achieve impressive improvements on specific datasets, gazetteer-enhanced models are more stable across datasets. Moreover, the simpler methods of gazetteer enhancement (binary valued discrete feature vector with word-level gazetteer matching) and segmentation (multi-task learning with extra supervision from an auxiliary binary classification task for segmentation) performed better within their respective categories.
Furthermore, contrary to expectation, we found that gazetteer-enhanced models improve entity segmentation, not just entity typing. In fact, one need not perform a gazetteer to dataset label mapping for incorporating gazetteers; using the original gazetteer types works just as well. Even more surprisingly, gazetteer types are even unnecessary depending on the input representation. With GloVe, performance improves by removing gazetteer types altogether. This is likely a consequence of gazetteers inducing segmentation. Lastly, we showed that gazetteers are better at finding long entities, another consequence of inducing segmentation. They are an effective alternative to segmentation techniques developed to identify long entities, which we found are unstable across datasets.