CNNBiF: CNN-based Bigram Features for Named Entity Recognition

Transformer models fine-tuned with a sequence labeling objective have become the dominant choice for named entity recognition tasks. However, a self-attention mechanism with unconstrained length can fail to fully capture local dependencies, particularly when training data is limited. In this paper, we propose a novel joint training objective which better captures the semantics of words corresponding to the same entity. By augmenting the training objective with a group-consistency loss component, we enhance our ability to capture local dependencies while still enjoying the advantages of the unconstrained self-attention mechanism. On the CoNLL2003 dataset, our method achieves a test F1 of 93.98 with a single transformer model. More importantly, our fine-tuned CoNLL2003 model displays significant gains in generalization to out-of-domain datasets: on the OntoNotes subset we achieve an F1 of 72.67, which is 0.49 points absolute better than the baseline, and on the WNUT16 set an F1 of 68.22, a gain of 0.48 points. Furthermore, on the WNUT17 dataset we achieve an F1 of 55.85, yielding a 2.92 point absolute improvement.


Introduction
Named Entity Recognition (NER) is a fundamental task in knowledge extraction that detects named entities in text and assigns them to pre-defined categories such as persons, organizations, and locations. It plays a critical role in various applications including question answering, information retrieval, co-reference resolution, and topic modeling (Yadav and Bethard, 2019). Pre-trained transformers fine-tuned with a sequence labeling objective have become the de facto standard for the NER task because these models have shown state-of-the-art performance without the human effort of feature engineering.
Despite these achievements, fine-tuning of pre-trained transformer models has two potential weaknesses: first, unconstrained self-attention implements a global receptive field for all interactions, with no inductive bias toward focusing on and composing local dependencies hierarchically (Dehghani et al., 2019; Wang et al., 2019), and second, with small amounts of labeled data, training such models end-to-end is susceptible to overfitting.
To address these limitations we propose a novel joint sequence labeling objective, inspired by BERT's next sentence prediction (NSP) objective (Devlin et al., 2019). In contrast with the NSP objective, which evaluates sentence pairs, we design a word level objective specifically for the NER task. On top of the conventional sequence labeling objective, our novel objective enables modeling of the relationship of adjacent words based on a new tagging scheme, which helps the model to better capture local dependencies in a sequence.
For the additional objective, we employ a simple convolutional architecture based on CNN bigram features (in short, CNNBiF) to better capture the relationships between adjacent words. Under the single loss objective of the conventional sequence labeling approach we have observed that the predictions output by pre-trained transformers quickly converge to the training target labels. Our joint learning approach regularizes these models to encourage them to better capture the semantic and syntactic dependencies between nearby words.
Our key contributions in this paper are:
• We propose a novel joint training objective to better capture the semantic and syntactic patterns of text through a single model architecture. The novel objective employs a new tagging scheme and a convolutional neural network architecture.
• We present results illustrating the efficacy of our model, showing (1) a performance increase over strong baseline models on two standard benchmark datasets and (2) performance gains on out-of-domain datasets, which shows that our approach is effective at reducing overfitting.

Related Work
Recent applications of multi-task objectives, with one of the objectives being named entity recognition, have demonstrated improved performance on the NER task. Zheng et al. (2017) applied multi-task learning to named entity recognition and relation extraction, showing improvements over the individual tasks. Martins et al. (2019) performed joint learning of NER and entity linking tasks in order to leverage the information in the two related tasks, using an LSTM model architecture. Similarly, Eberts and Ulges (2019) presented a joint learning model based on a single transformer network to leverage interrelated signals between the NER and entity relationship tasks.
Prior to the advent of transformer-based networks, CNN networks were applied successfully to various NLP classification tasks. Kim (2014) reports on the effectiveness of these networks, where a one-layer CNN is applied to pre-trained word vectors (Mikolov et al., 2013).

Proposed Approach
As illustrated in Figure 1, our model leverages a pre-trained transformer network. This network is fine-tuned with two sequence labeling objectives applied to the single NER task. The first sequence labeling objective is a standard NER objective with the IOB2 tagging scheme, described as follows. Given an input sequence of n words X = [x_1, x_2, ..., x_n], we perform a prediction on every word x_i to obtain a corresponding NER-tag sequence Y_e = [y_1, y_2, ..., y_n], where y_i ∈ D_e = {O, B-PER, I-PER, B-ORG, I-ORG, ...}, such that every new entity instance starts with a B tag and all subsequent words belonging to that entity instance are marked with an I tag. Given the example sentence "Obama graduated from Columbia University .", the expected NER-tag sequence is "B-PER O O B-ORG I-ORG O", as shown in Figure 1. The NER objective aims to learn the function F_e(Θ) : X → Y_e.
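For concreteness, the IOB2 scheme can be decoded into entity spans with a small helper (the function below is an illustrative sketch of the standard decoding convention, not code from the paper):

```python
def iob2_spans(tags):
    """Decode an IOB2 tag sequence into (start, end_exclusive, type) spans.

    A span starts at a B- tag (or, as a common repair convention, at an
    I- tag whose type differs from the running entity) and ends when an
    O tag or a new entity begins.
    """
    spans, start, etype = [], None, None
    for i, tag in enumerate(tags):
        if tag.startswith("B-") or (tag.startswith("I-") and etype != tag[2:]):
            if start is not None:          # close the previous span
                spans.append((start, i, etype))
            start, etype = i, tag[2:]
        elif tag == "O":
            if start is not None:
                spans.append((start, i, etype))
            start, etype = None, None
        # an I- tag matching the running type simply extends the span
    if start is not None:
        spans.append((start, len(tags), etype))
    return spans

# The paper's example: "Obama graduated from Columbia University ."
print(iob2_spans(["B-PER", "O", "O", "B-ORG", "I-ORG", "O"]))
# → [(0, 1, 'PER'), (3, 5, 'ORG')]
```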
The second sequence labeling objective applies a group-consistency loss component with a new Linkage or Separation (LS, for short) tagging scheme. Given the NER-tag sequence Y_e, we generate a corresponding LS-tag sequence Y_LS, labeling a word as L when it is internal to a mention (i.e., its NER tag has prefix I-); otherwise the word is labeled as S. For the example in Figure 1, we label 'University' as L because the word is part of the same entity as the previous word, and we label all other words as S, as shown in Figure 1. Furthermore, the feature vector for this word is computed by applying a convolutional network with a 2 × 1 kernel to the transformer output features for the current and preceding words. The group-consistency objective aims to learn the function F_LS(Θ) : X → Y_LS. For training, two loss functions are computed: L_e = −log p(y^e_i) for the NER labeling objective and L_LS = −log p(y^LS_i) for the LS labeling objective. The total loss is given by an unweighted sum: L = L_e + L_LS. The input sentence is tokenized into byte-pair encoding (BPE) tokens (Sennrich et al., 2016), so some individual words are represented by multiple tokens. When a word consists of multiple BPE tokens, we select the first token's representation as its feature vector.
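Since the LS tags are a deterministic function of the IOB2 tags, deriving them is a one-liner; a minimal sketch (the function name is ours, not from the paper):

```python
def iob2_to_ls(ner_tags):
    """Derive Linkage/Separation (LS) tags from an IOB2 tag sequence.

    A word is 'L' (Linkage) when it continues the entity begun by the
    previous word, i.e. its IOB2 tag has the 'I-' prefix; every other
    word (B-*, O) is 'S' (Separation).
    """
    return ["L" if tag.startswith("I-") else "S" for tag in ner_tags]

# The paper's example: "Obama graduated from Columbia University ."
print(iob2_to_ls(["B-PER", "O", "O", "B-ORG", "I-ORG", "O"]))
# → ['S', 'S', 'S', 'S', 'L', 'S']
```

Note that under IOB2 every I- tag is guaranteed to follow a B- or I- tag of the same type, which is what makes this per-token rule sufficient.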

Experiments
We fine-tune the pre-trained transformer model on two popular annotated English NER datasets (CoNLL2003 (Tjong Kim Sang and De Meulder, 2003) and OntoNotes 5.0 1 ), with the CNN-based bigram features included. The resulting models are tested on their respective test sets as an in-domain evaluation.
Next, to assess generalization to out-of-domain data, we use the fine-tuned CoNLL2003 model and evaluate its performance on out-of-domain benchmark datasets: PLONER (Fu et al., 2020), a cross-domain generalization evaluation set with three entity types (Person, Location, Organization), and WNUT17 2 .

1 https://catalog.ldc.upenn.edu/LDC2013T19
2 https://noisy-text.github.io/2017/emerging-rare-

Benchmark datasets. We benchmark on two popular NER datasets:
• CoNLL2003: The CoNLL2003 dataset 3 contains sentences with part-of-speech (POS), syntactic chunk, and named entity annotations from newswire articles. The named entity tags consist of four categories (Person, Location, Organization, and Miscellaneous for entities outside the previous three groups). We directly employ the training and test sets without any change.
• OntoNotes 5.0: The OntoNotes 5.0 dataset 4 comprises 1,745k words of English text from various genres (such as telephone conversations, newswire, newsgroups, broadcast news, broadcast conversation, weblogs, and religious texts), providing a deeper set of 18 named entity categories. The dataset is converted into the IOB2 tagging scheme with open-source code 5 .
The benchmark datasets are partitioned into training, development, and test sets, with the development set used for hyperparameter tuning and the test set for evaluation.
Out-of-domain datasets. We employ the OntoNotes and WNUT16 subsets of PLONER (Fu et al., 2020) and the WNUT17 test data 6 to evaluate the proposed approach on unseen domains with a model fine-tuned on the CoNLL2003 training data. The out-of-domain evaluation datasets are summarized in Table 1. We merge Corporation and Group into Organization, and Creative work and Product into Miscellaneous, to align with the four CoNLL2003 categories.
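The category merge described above amounts to a simple remapping of entity types. A minimal sketch, assuming the lowercase hyphenated WNUT17 label spellings (the mapping dictionary and helper below are ours, for illustration):

```python
# Align WNUT17 entity types with the four CoNLL2003 categories,
# merging Corporation/Group -> ORG and Creative work/Product -> MISC.
WNUT17_TO_CONLL = {
    "person": "PER",
    "location": "LOC",
    "corporation": "ORG",
    "group": "ORG",
    "creative-work": "MISC",
    "product": "MISC",
}

def remap_tag(tag):
    """Remap one IOB2 tag, e.g. 'B-corporation' -> 'B-ORG'; 'O' is unchanged."""
    if tag == "O":
        return tag
    prefix, etype = tag.split("-", 1)  # split only at the first hyphen
    return f"{prefix}-{WNUT17_TO_CONLL[etype]}"
```

Splitting only at the first hyphen keeps multi-part type names such as 'creative-work' intact.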
Following previous work, we measure the precision, recall, and F1 score for each entity category and report the micro-averaged values for each dataset. We use the RoBERTa-Large (RoBERTa-L) transformer model with a simple linear classifier for sequence labeling as a baseline model. We include the CNNBiF component on the baseline architecture and train the model with the two sequence labeling objectives. We also employ the FLERT model proposed by Schweter and Akbik (2020) to evaluate our approach. The FLERT model leverages document-level features to achieve state-of-the-art NER results. To reproduce the FLERT results, we keep their proposed XLM-RoBERTa-Large (XLM-R-L) transformer model (Conneau et al., 2020) and fine-tuning configuration. We add the CNNBiF component on top of that and train the model with the two sequence labeling objectives.
As the representation of each word in the input sequence, we use the last layer of the transformer with the common 'first' subword pooling strategy (Devlin et al., 2019). To fine-tune the transformers we use the AdamW optimizer (Loshchilov and Hutter, 2019) with a fixed budget of 20 epochs for all models. For the RoBERTa-L transformer we use a linear warmup and linear decay learning rate schedule with a learning rate of 1e-5, and for the FLERT (XLM-R-L) model we use a one-cycle training strategy with a learning rate of 5e-6, as suggested in their paper. We use the RoBERTa-L transformer model from HuggingFace 7 and the FLERT model from flairNLP 8 .

CNN-based Bigram Feature Component. On top of the two baseline models (RoBERTa-L and FLERT) we add our proposed CNNBiF component alongside the NER sequence labeling classifier. The input is the sequence of individual word representations, with padding vectors added on both sides. We apply a simple CNN layer with a 2 × 1 kernel filter, so that each output combines a word's representation with that of the preceding word. After truncating the final output, which pairs the last word with the right padding vector, we obtain a sequence of the same length and dimension as the input representations. On top of the CNNBiF layer we add a linear classifier to predict the Linkage or Separation tag of each pair representation.

Results & Analysis. First, to understand the impact of the CNN-based bigram features, we conduct a comparative evaluation of fine-tuning the RoBERTa-L and FLERT models with and without the CNNBiF module. As Table 2 shows, adding the CNNBiF module with the LS objective to the RoBERTa-L model outperforms the conventional sequence labeling approach on both the CoNLL2003 and OntoNotes 5.0 benchmarks. Similarly, we observe even stronger performance increases for the FLERT model when we include the CNNBiF module with the LS objective, achieving a test F1 of 93.98 on the CoNLL2003 test data.
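To make the CNNBiF computation concrete, the padding, width-2 convolution, and truncation can be sketched in plain Python. A 2 × 1 kernel over a padded sequence is equivalent to a linear map applied to the concatenation of each word's vector with its left neighbor's; the function below is a dependency-free sketch of that computation (names and the explicit-loop formulation are ours, not the paper's implementation):

```python
def cnn_bigram_features(reprs, weight, bias):
    """Bigram features via a width-2 convolution over word representations.

    reprs  : list of n word vectors (each a list of floats, length d_in)
    weight : d_out rows, each of length 2*d_in -- the flattened 2x1 kernel
    bias   : d_out floats
    Returns n output vectors: output i combines word i-1 and word i, with a
    zero padding vector standing in for the missing left neighbor at i = 0.
    The final convolution output, which pairs the last word with the right
    padding vector, is truncated so the output length matches the input.
    """
    d_in = len(reprs[0])
    pad = [0.0] * d_in
    padded = [pad] + reprs + [pad]           # padding on both sides
    outputs = []
    for i in range(1, len(padded)):
        pair = padded[i - 1] + padded[i]     # concatenation = 2x1 receptive field
        outputs.append([sum(w * x for w, x in zip(row, pair)) + b
                        for row, b in zip(weight, bias)])
    return outputs[:-1]                      # drop the pair ending in right padding

# Toy check with 1-d vectors and a kernel that sums the pair:
print(cnn_bigram_features([[1.0], [2.0], [3.0]], [[1.0, 1.0]], [0.0]))
# → [[1.0], [3.0], [5.0]]
```

In practice this corresponds to a standard 1-d convolution with kernel size 2 over the transformer's word-level outputs, followed by the linear LS classifier.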
To investigate the impact of the CNN-based bigram features on out-of-domain data, we fine-tune the RoBERTa-L and FLERT models on the CoNLL2003 training set and then evaluate these models on the out-of-domain datasets: OntoNotes (PLONER version), WNUT16 (PLONER version), and WNUT17. The results are shown in Table 3. We provide additional experiments as an ablation study of the RoBERTa-L model. When we additionally train the CNNBiF layer jointly on the LS task, we observe a much larger performance gain over the RoBERTa-L sequence labeling model. We find that the IOB2 sequence labeling task alone produces more mismatched predictions on multi-word entity mentions than on single-word mentions, and that the joint LS task alleviates this weakness of IOB2 sequence labeling. We also see that computing a single representation of adjacent tokens via a convolutional layer better captures their relationship and brings much higher performance on the LS task. Moreover, the FLERT results clearly show that the addition of the CNNBiF layer with the joint LS task significantly improves performance on the unseen domains. Interestingly, we observe that the FLERT model is slightly worse than the RoBERTa-L model on these out-of-domain sets. We conjecture this is because FLERT brings in more contextual information and is therefore more susceptible to overfitting and less generalizable to out-of-domain sets.

Table 4 shows how the CNNBiF layer affects the fine-tuning procedure and the prediction of singleton and multi-word entities on the WNUT17 test set. Very interestingly, we observe a slight performance improvement on singleton entities and a much larger performance gain on multi-word entities, demonstrating the importance of capturing local dependency patterns for the entity recognition task.