DyLex: Incorporating Dynamic Lexicons into BERT for Sequence Labeling

Incorporating lexical knowledge into deep learning models has proven very effective for sequence labeling tasks. However, previous works commonly have difficulty dealing with large-scale dynamic lexicons, which often introduce excessive matching noise and require frequent updates. In this paper, we propose DyLex, a plug-in lexicon incorporation approach for BERT-based sequence labeling tasks. Instead of leveraging embeddings of the words in the lexicon as in conventional methods, we adopt word-agnostic tag embeddings to avoid re-training the representation when the lexicon is updated. Moreover, we employ an effective supervised lexical knowledge denoising method to smooth out matching noise. Finally, we introduce a col-wise attention based knowledge fusion mechanism to guarantee the pluggability of the proposed framework. Experiments on ten datasets across three tasks show that the proposed framework achieves new SOTA, even with very large-scale lexicons.


Introduction
Sequence labeling is the task of assigning categorical labels to a text sequence. Many conventional NLP tasks, such as named entity recognition (NER), Chinese word segmentation (CWS), and slot-filling based natural language understanding (NLU), can be formalized as sequence labeling problems. Deep learning methods, especially the recently proposed BERT and its variants, have achieved great success on such sequence labeling tasks. However, BERT-based methods are generally built on word-piece or character embeddings. Word information (e.g., word boundary or type) is not fully exploited, which makes it difficult to accurately determine entity boundaries or correctly predict entity types.

Figure 1: "Iron Man" can be the name of a smart device or a movie, so the system cannot react properly to "Please play Iron Man" from a user. Another case, "Play just a little while longer now on Iron Man", requires the system to classify "Play" between the music and movie domains, and to decide whether "now" should be combined with "just a little while longer" as a whole.
As shown in Figure 1, it is infeasible to understand user's utterance correctly without using deterministic domain knowledge that "Iron Man" is the alias of a Smart Speaker or "just a little while longer" is a famous song. In commercial systems, the lexicon is widely used as an effective way to store various domain knowledge. In practice, the size of a lexicon can range from ten to a few million, and we usually need to update the contents of lexicons frequently, which dramatically increases the difficulty of incorporating lexicons into deep models. In this work, we will study how to effectively incorporate large-scale dynamic lexicons into BERT-based sequence labeling models.
Recent works on incorporating lexicon knowledge (Zhang and Yang, 2018; Mu et al., 2020; Li et al., 2020) can be summarized as follows. First, they match an input sentence against several lexicons to obtain all matched items. Second, they leverage the matched item information by modifying the structure of the transformer layer or the feature representation layer. However, 1) current methods normally learn additional embeddings for the words in the lexicons, which brings a challenge: if the lexicons are updated, the model must be re-trained; and 2) they only use the words in the lexicon but ignore the categories of words, which are important for many tasks.
In this paper, we propose DyLex, a general framework for incorporating frequently updated lexicons into sequence labeling models. The matching results of the input are reconstructed as word-agnostic tag sequences. We then design a supervised knowledge denoising module to smooth out noisy matches, and the remaining matches are used as additional feature input for knowledge fusion. This step is based on a col-wise attention that seamlessly fuses the word-piece embeddings of the input sentence with the lexicon features. Moreover, since we do not explicitly learn embeddings of the words in the lexicons, there is no need to retrain the entire model when updating the lexicons.
We conduct extensive experiments with the CWS, NER, and NLU tasks on various datasets. The results show that our model consistently outperforms the strong baselines and achieves new state-of-the-art results.
We summarize the contribution of this work as follows: 1) We propose a general framework for effectively introducing external lexical knowledge into sequence labeling tasks. Our framework supports dynamic updates of lexicons to facilitate industrial deployment.
2) We devise a novel knowledge denoising module to make full use of large-scale lexicons.
3) Our framework outperforms strong baselines and achieves SOTA results on three different sequence labeling tasks.

Approach
In this section, we present how to incorporate large-scale lexicons into BERT. As illustrated in Figure 2, the proposed DyLex framework contains two parts, namely the BERT-based sequence tagger and the Lexicon Knowledge (LexKg) extractor. The LexKg extractor has three submodules: Matching, Denoising and Fusing.

BERT as Encoder
The encoder is a stack of Transformer (Vaswani et al., 2017) layers, each of which employs multi-head attention to perform self-attention over the sequence and finally applies concatenation and a linear transformation to the results from each head. Each head in multi-head attention is computed in scaled dot-product form:

Attention(Q, K, V) = softmax(QK^T / √d_k) V

where Q, K, V are the query, key, and value matrices, respectively. Self-attention over an input X can then be formalized as:

SelfAttention(X) = Attention(XW^Q, XW^K, XW^V)

where W^Q, W^K, W^V are parameter matrices to be learned.

The LexKg Extractor
Matching Conventional methods normally learn additional word embeddings for the lexicons to incorporate lexicon knowledge, so the entire model must be retrained once the lexicons are updated. By designing a word-agnostic representation, our method is independent of the lexicon size and word content. Specifically, the Matching module takes a word sequence as input, uses a prefix tree-based fast matching algorithm (see Algorithm 1) to quickly retrieve the lexicons, and finally produces multiple word-agnostic tag sequences. Figure 2 (b) shows a concrete example.
In detail, we use a prefix Trie tree (Brass, 2008) to store and retrieve the lexicons. The non-leaf nodes of the Trie are made up of the word-pieces of lexicon words tokenized by the BERT tokenizer, while the leaf nodes are made up of the types of the lexicon words, namely the tag names (e.g., 'B-song' and 'I-song' as shown in Figure 2 (b)). For each subsequence of the input, the Trie may match several different candidates. Each match is categorized by the tag attached to a leaf node, and the rest of the sequence is filled with 'O' tags.
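The Trie-based fast matching described above can be sketched as follows. This is a minimal illustration under simplifying assumptions: words are pre-tokenized into space-separated pieces, and names such as `build_trie` and `fast_match` are ours, not the paper's.

```python
from typing import Dict, List, Optional

class TrieNode:
    """A node in the prefix trie; `tag` is set only on leaf (word-end) nodes."""
    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.tag: Optional[str] = None  # lexicon category, e.g. "song"

def build_trie(lexicon: Dict[str, str]) -> TrieNode:
    """Insert each (tokenized word -> category) entry into the trie."""
    root = TrieNode()
    for word, category in lexicon.items():
        node = root
        for piece in word.split():  # assume pre-tokenized, space-separated pieces
            node = node.children.setdefault(piece, TrieNode())
        node.tag = category
    return root

def fast_match(tokens: List[str], root: TrieNode) -> List[List[str]]:
    """Return one word-agnostic BIO tag sequence per matched span.

    For every start position, walk the trie as far as the input allows and
    emit a tag sequence ('B-cat', 'I-cat', ..., rest 'O') per complete match.
    """
    sequences = []
    for start in range(len(tokens)):
        node = root
        for end in range(start, len(tokens)):
            node = node.children.get(tokens[end])
            if node is None:
                break
            if node.tag is not None:  # a complete lexicon word ends here
                tags = ["O"] * len(tokens)
                tags[start] = f"B-{node.tag}"
                for k in range(start + 1, end + 1):
                    tags[k] = f"I-{node.tag}"
                sequences.append(tags)
    return sequences

# Hypothetical mini-lexicon; categories are the leaf tags.
trie = build_trie({"sparks fly": "song", "taylor swift": "singer", "swift": "brand"})
matches = fast_match(["play", "taylor", "swift", "sparks", "fly"], trie)
```

Note that overlapping candidates (here, "taylor swift" and the single-piece "swift") each yield their own tag sequence, which is exactly the matching noise the Denoising module must later smooth out.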
Formally, we denote the input sequence as U. A tag sequence obtained by fast matching is T^(i), where the superscript i indexes the tag sequence. The Matching submodule can be formalized as:

T^(i) = FastMatch(U),   E_t^(i) = TagEmb(T^(i)),   E_d^(i) = E_u + E_t^(i)

where E_u ∈ R^{l×hz} (here l is the sequence length and hz the hidden size) is the representation produced by the BERT encoder, E_t^(i) ∈ R^{l×hz} represents the embedding of the i-th tag sequence, and E_d^(i) ∈ R^{l×hz} is the corresponding output of this module.
Denoising The proposed fast matching algorithm can quickly obtain all subsequences that potentially match the lexicons. However, due to the large size of the lexicon, even an input sequence with only a few words may produce dozens of incorrect matches. Using Figure 2 (b) as an example, only Row 2 (i.e., the match to singer Taylor Swift) and Row 3 (i.e., the match to song Sparks Fly) are expected matches, whereas all the other tag sequences contain incorrect matches, namely the matching noise discussed in this work, which inevitably decreases final performance. We therefore devise a novel supervised knowledge denoising module to smooth them out.
The supervising signal can be automatically derived from the gold sequence labels of the training dataset. In the example of Figure 2 (b), each row corresponds to a single matching tag sequence; Row 2 and Row 3 are used as positive training samples, whereas the other two serve as negatives. Note that our method still works even if the lexicon category (e.g., name or song) is not provided; in that case, a tag sequence degenerates to marking only lexicon word boundaries.
Formally, we first obtain the representation R_d^(i) of the i-th tag sequence from its embedding E_d^(i). When classifying each tag sequence, we also need to consider the relationships among them. For example, Row 3 and Row 4 in Figure 2 (b) cannot both be True, since they share contradicting spans. Taking that into consideration, we first concatenate the [cls] columns of R_d^(i) (i.e., the first column) of all tag sequences to form a matrix R_cls ∈ R^{nd×hz}, where nd denotes the number of tag sequences (e.g., 4 in the example of Figure 2 (b)) and hz the hidden size. We then pass R_cls through a self-attention layer to model the interrelations among the sequences, followed by a sigmoid classifier:

Z = σ(SelfAttention(R_cls) w)

where σ represents the sigmoid function and Z ∈ R^{nd} is the classification result. The representations of the positively classified tag sequences are denoted R_d^+. These selected positive representations will be fused with the original BERT embedding E_u.
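The denoising head described above can be sketched in NumPy. This is a toy, forward-only sketch: the weight matrices are random stand-ins for learned parameters, and the single-head attention stands in for the paper's self-attention layer.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def denoise(R_cls, W_q, W_k, W_v, w_out) -> np.ndarray:
    """Single-head self-attention over the [cls] columns of all nd tag
    sequences, followed by a sigmoid scorer; returns Z in (0,1)^nd."""
    Q, K, V = R_cls @ W_q, R_cls @ W_k, R_cls @ W_v
    hz = R_cls.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(hz))   # (nd, nd): interrelation among sequences
    H = A @ V                            # context-aware sequence representations
    return 1.0 / (1.0 + np.exp(-(H @ w_out)))  # sigmoid keep/discard score

nd, hz = 4, 8                            # 4 candidate tag sequences, toy hidden size
R_cls = rng.standard_normal((nd, hz))    # stand-in for the stacked [cls] columns
W_q, W_k, W_v = (rng.standard_normal((hz, hz)) for _ in range(3))
w_out = rng.standard_normal(hz)
Z = denoise(R_cls, W_q, W_k, W_v, w_out)
keep = Z > 0.5                           # positively classified sequences survive
```

Scoring all candidate rows jointly through self-attention, rather than independently, is what lets the module reject mutually contradicting matches such as Rows 3 and 4 in Figure 2 (b).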
Knowledge Fusing In this stage, our framework produces a lexical-knowledge-enhanced representation E_k by fusing the BERT-based encoding E_u with the selected tag sequence representations R_d^+ via the proposed col-wise attention. Taking the j-th token of the input sequence as an example, its BERT-based representation E_u^(j) acts as the Query, and the corresponding tag representations R_d^(i,j)+ act as the Key and Value, so col-wise attention can be formulated as:

E_k^(j) = Attention(E_u^(j), R_d^(·,j)+, R_d^(·,j)+),   i = 1, ..., m

where m = |R_d^+|. Concatenating E_k^(j) over all l positions yields E_k.
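The col-wise attention above can be sketched as follows: each token attends over its own column across the m kept tag sequences. The function name `col_wise_fuse` and the random inputs are illustrative stand-ins.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def col_wise_fuse(E_u: np.ndarray, R_pos: np.ndarray) -> np.ndarray:
    """E_u: (l, hz) BERT token representations; R_pos: (m, l, hz) representations
    of the m denoised tag sequences. Token j queries column j of every sequence."""
    l, hz = E_u.shape
    E_k = np.empty_like(E_u)
    for j in range(l):
        q = E_u[j]                  # Query: the token's BERT representation
        K = V = R_pos[:, j, :]      # Key/Value: column j of every kept sequence
        w = softmax(q @ K.T / np.sqrt(hz))  # (m,) attention over sequences
        E_k[j] = w @ V              # fused lexical feature for token j
    return E_k

rng = np.random.default_rng(1)
l, hz, m = 5, 8, 2                  # toy sequence length, hidden size, kept matches
E_u = rng.standard_normal((l, hz))
R_pos = rng.standard_normal((m, l, hz))
E_k = col_wise_fuse(E_u, R_pos)     # (l, hz) knowledge-enhanced representation
```

Because attention is computed per column, the cost grows with the number of kept sequences m rather than with the lexicon size, which is what keeps the module pluggable.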

The Tagger
Finally, E_O is produced by combining E_u with E_k, and we use a linear classification layer, as used by the BERT tagger:

E_O = E_u ⊕ E_k,   O = Linear(E_O)

where ⊕ denotes the combination of the two representations and O is the classification result for each token.
We can see that the proposed framework is not an intrusive method but rather a pluggable one. Since it takes the encoder's output as input and returns a knowledge-enhanced text representation, the original model structure is not modified.

Experiments
We conduct experiments on several NLP tasks, including CWS (Chinese word segmentation), NER (named entity recognition), and NLU (natural language understanding). The experimental hyperparameter settings are listed in appendix F.

Primary Baselines
BERT-based Sequence Tagger The framework uses BERT as an encoder to represent the input sequence. As can be seen in Figure 2, we can get this baseline by removing the LexKg extractor part of DyLex.
Glyce (Meng et al., 2019) Glyce uses glyph vectors for Chinese character representations. With a lexicon, it has achieved the best performance on Chinese word segmentation to date.
FLAT and HSCRF+Softdict (Li et al., 2020; Liu et al., 2019a) Named entity recognition can benefit greatly from lexicons. FLAT utilizes lexicons with a lattice structure for Chinese entity recognition, and HSCRF with a softdict is used for English named entity recognition; both have achieved strong results.

Lexicon Construction
The lexicon mentioned in this article refers to a collection whose entries contain an item and a category. The item corresponds to a word, and the category corresponds to the type of the word. Categories are customized according to the task; for example, a category in the NER task can be the song name. A tag marks the type of a word in BIO format. Table 1 shows the notation, and appendix E gives a detailed lexicon fragment. The lexicon tag mentioned above is used to mark word categories, i.e., the values in the lexicon, which are strongly task-related. Figure 2(b) and the 'Tag' column in Table 1 display some examples.
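The entry format above can be illustrated with a small sketch: entries store only (item, category), and the BIO tags are derived from the category on the fly rather than stored. The sample entries are hypothetical.

```python
from typing import Dict, List

# A hypothetical lexicon fragment: each entry maps an item (the word,
# here space-separated into pieces) to a category.
lexicon: Dict[str, str] = {
    "sparks fly": "song",
    "taylor swift": "singer",
}

def bio_tags(item: str, category: str) -> List[str]:
    """Derive the BIO-format tag for each piece of a lexicon item."""
    pieces = item.split()
    return [f"B-{category}"] + [f"I-{category}"] * (len(pieces) - 1)

tags = bio_tags("sparks fly", lexicon["sparks fly"])  # ['B-song', 'I-song']
```

Deriving tags from categories is what makes the representation word-agnostic: adding a new song title to the lexicon introduces no new tag and therefore no new embedding.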
The lexicons used in our experiments are consistent with those used in the baseline methods. In the NLU task, since there is no related work using lexicons, we extract labeled spans from the training corpus and merge them with the lexicon used in the NER task. The lexicon sizes used in our experiments are listed in appendix B.

Task1: Chinese Word Segmentation
CWS aims to divide a sentence into meaningful chunks and is a primary task for Chinese text processing. Using lexicons in CWS tasks is common practice: brand-new words and internet buzzwords emerge every day, and adding them to lexicons is essential for better performance.
In this work, we experiment on two popular CWS datasets, i.e., PKU and CITYU (Emerson, 2005). The lexicon used in this experiment is consistent with the jieba word segmentation lexicon (https://github.com/fxsjy/jieba), which consists of a simplified Chinese lexicon from jieba and an extra traditional Chinese lexicon from the Taiwan version of jieba. We converted all traditional Chinese into simplified Chinese for all lexicons and datasets.
To fairly compare our model with the SOTA models, we use the same settings on dataset split with Meng et al. (2019).
As shown in Table 4, our method outperforms all the other compared baselines. Compared with Glyce, a strong baseline, our method obtains improvements of 0.44% and 0.7% on PKU and CITYU, respectively.

Task2: Named Entity Recognition
Named entity recognition is a typical sequence labeling task, and it relies heavily on external knowledge. Incorporating a lexicon as external knowledge can help determine the spans and types of entities. To fully verify the capability of the proposed framework on NER, we evaluate it on the Ontonotes (Weischedel and Consortium, 2013), MSRA (Levow, 2006), Resume (Zhang and Yang, 2018), and Weibo (Peng and Dredze, 2015; He and Sun, 2017) datasets. We first evaluate our framework on the Chinese datasets; the results are shown in Table 2. Except on Ontonotes, our approach achieves the best results among all methods with lexicons, on average 0.69% higher than FLAT. Compared with BERT, the best method without a lexicon, our approach improves even more dramatically, by 1.57%.
We also evaluate our framework on two English datasets (i.e., CoNLL 2003 and OntoNotes 5.0). The conclusion is similar to that on the Chinese datasets, as shown in Table 3. Compared with HSCRF and CSE, our method is 0.91% and 0.85% higher on average, with and without a lexicon respectively. LUKE (Yamada et al., 2020) scores the same as our method on CoNLL 2003 and also uses entity-related information; however, it does so through pre-training, which is orthogonal to our method.

Task3: Natural Language Understanding
NLU is a more challenging sequence labeling task, which aims to recognize the intent of spoken language and extract slots. As shown in Figure 1, in many practical application scenarios, one cannot tell the real intent unless the entity is provided as prior knowledge.
We evaluate the framework on an industrial dataset and two public datasets. The Chinese industrial dataset is a commercial dataset for a mobile phone assistant. The public datasets are Snips 3 and ATIS (Tür et al., 2010). The details of the three datasets are shown in appendix D.
The overall performance of our framework on the industrial dataset is listed in Table 5. On the test set, there are 0.76% and 1.53% improvements in intent detection and slot filling, respectively. The gain is more obvious on the SINGLE and MULTI sets: BERT cannot distinguish the intents "play music" and "play video" because it lacks the prior knowledge of whether "Love Story" is a song or a movie. In the MEDIA set, all sentences contain demonstrative words, such as "play music [xxx]" and "play video [xxx]". Such sentences do not depend on the type of xxx, as the demonstrative words (i.e., music and video) make the judgment easy; yet there is still a 0.5% increase in intent detection, and the increment in slot filling is even more obvious, reaching 2.21%.
The experimental results on Snips and ATIS are shown in Table 6; the settings follow previous works (E et al., 2019; Goo et al., 2018). Our framework outperforms the other methods on all three metrics (except the slot metric of ATIS): slot filling (F1), intent detection (Acc), and sentence accuracy (Acc), with results 1.27% higher on average than the previous best method. For ATIS, the improvement is smaller than for the other datasets, mainly because the dataset is relatively small and its slots are sparse, so lexicons are underutilized.

The Study of Match Length
Given an utterance, fast matching (Algorithm 1) often produces numerous matching results for each position. On the one hand, we are not sure which result is correct, so to retain the correct one we should keep as many results as possible. On the other hand, most matching results are invalid, bringing a lot of matching noise and increasing computation cost. We have to strike a balance between the two. As shown in Figure 3(a), the longer the match length, the higher the accuracy. Based on this observation, we select matching results in reverse order of match length. We also studied the number of selected results n for each position in the sentence: a larger n is more likely to keep the right matches, but it brings more noise. From Figure 3(b), F1 on the three datasets does not keep increasing as n grows. Taking efficiency into account, we generally select n = 1 or n = 2.
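The length-based selection heuristic above can be sketched as follows; the span format and the function name `select_matches` are our own illustration, not the paper's notation.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int, str]  # (start index, end index inclusive, category tag)

def select_matches(matches: List[Span], n: int) -> Dict[int, List[Span]]:
    """Keep the n longest matches starting at each position.

    Longer matches are empirically more likely to be correct (Figure 3(a)),
    so spans are ranked by length in reverse order before truncating to n."""
    by_start: Dict[int, List[Span]] = {}
    for start, end, tag in matches:
        by_start.setdefault(start, []).append((start, end, tag))
    return {s: sorted(spans, key=lambda m: m[1] - m[0], reverse=True)[:n]
            for s, spans in by_start.items()}

# Hypothetical matches: two candidates compete at start position 3.
spans = [(1, 2, "singer"), (2, 2, "brand"), (3, 4, "song"), (3, 3, "name")]
kept = select_matches(spans, n=1)  # the longer "song" span wins at position 3
```

With n = 1 this keeps exactly one candidate per start position, matching the efficiency-driven choice of n = 1 or n = 2 reported above.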

Effect of Dynamic Lexicon
One advantage of our proposed method is the ability to load lexicons dynamically. Instead of using embeddings of updated lexicon entries, we only use the lexicon words' category tags, so we can expand the scale of the lexicons arbitrarily without retraining. We studied the effect of lexicon size on performance. From Figure 4, we can see that without a lexicon, the performance is close to that of BERT-base; as the lexicon size increases, the performance improves as well.

Look Back on Denoising
Indistinguishable lexicon matching brings huge noise, and the quality of denoising affects the performance of the model. From Table 7, we can see that for both Exp-Dict and Sp-Dict, the more precise the denoising, the larger the improvement over BERT without a lexicon. Sp-Dict here is a specialized collection of domain lexicons; for example, such a lexicon only contains entities of the relevant category for the NER task, and its scale is relatively small. In this case, the matching noise brought by Sp-Dict is much smaller. From Table 7, we observe that the denoising accuracy with Sp-Dict is better than with Exp-Dict, which directly leads to the impressive improvement in the experiment. This also confirms the importance of denoising.

Table 7: The BERT column reports the F1 on each task, the Denoising column the accuracy of the denoising module, and the DyLex column the F1 of our method and its increment versus BERT. Exp-Dict is the lexicon corresponding to each experiment above, and Sp-Dict indicates specialized domain-related lexicons.

Fusion in Hard or Soft Way
After denoising, the results R_d should be fused with E_u for downstream tasks. The fusion can be done in a soft or hard way. In the soft setting, all of the R_d are weighted-summed before fusing with E_u; the advantage is that gradient back-propagation can be used to train the model end-to-end. In contrast, the hard method selects only those R_d whose denoising scores exceed a threshold. As shown in Table 8, the overall performance of hard fusion is better, since it fuses more accurate results. Besides, we also adopted Teacher Forcing (Williams and Zipser, 1989) in the soft/hard methods, but it did not yield promising accuracies.
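The soft/hard contrast can be sketched as follows; the scores, shapes, and function names are toy stand-ins for the learned denoising outputs.

```python
import numpy as np

def soft_select(R_d: np.ndarray, Z: np.ndarray) -> np.ndarray:
    """Soft: weight every tag-sequence representation by its denoising score,
    keeping the whole pipeline differentiable."""
    w = Z / Z.sum()                      # normalize scores into weights
    return np.tensordot(w, R_d, axes=1)  # (l, hz) weighted sum over sequences

def hard_select(R_d: np.ndarray, Z: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Hard: keep only the sequences whose score clears the threshold."""
    return R_d[Z > threshold]            # (m, l, hz) with m <= nd

rng = np.random.default_rng(2)
nd, l, hz = 4, 5, 8
R_d = rng.standard_normal((nd, l, hz))   # all candidate tag-sequence representations
Z = np.array([0.9, 0.2, 0.7, 0.1])       # toy denoising scores
soft = soft_select(R_d, Z)               # one blended representation
hard = hard_select(R_d, Z)               # keeps the two high-scoring sequences
```

Hard selection is non-differentiable at the threshold, which is the usual trade-off against the fully differentiable soft weighting; the paper's Table 8 finds the cleaner hard inputs win overall.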

Related Work
With the advance of deep learning, sequence labeling tasks, such as segmentation and NER, have achieved excellent performance. More and more methods tend to be character-based (Chen et al., 2006; Lu et al., 2016), especially in languages such as Chinese, Japanese, and Korean that require word segmentation. These languages do not have a natural segmentation delimiter like the white space in Latin-script languages. Character-based input in these languages avoids the accumulation of word segmentation errors and thus yields better performance (He and Wang, 2008; Liu et al., 2010; Li et al., 2014). However, the downside of purely character-based methods is that word information is not fully exploited.
To make full use of word information, incorporating a lexicon is an effective approach. Existing works on incorporating lexicons can be categorized as feature-based, lattice-based, and graph-based methods according to implementation complexity.
Feature based The feature-based method is the simplest. Some works directly use lexical information through simple matching features, while others leverage it via auxiliary tasks. One line of work builds templates first and uses template-matched lexicon entries to build features that help word segmentation. Mu et al. (2020) use simple lexicon-matching location information as features. Li et al. (2014) and Peters et al. (2017) adopt a word-level language modeling objective and multi-task learning to use word information implicitly. Yang et al. (2017b) transfer cross-domain and cross-lingual knowledge via multi-task learning.
Lattice based The lattice-based method uses a lattice structure. Zhang and Yang (2018) propose Lattice-LSTM for incorporating word lexicons into a character-based NER model. Rather than heuristically choosing one word for a character when multiple lexicon words match, they introduce an elaborate modification to the sequence modeling layer of the LSTM-CRF model (Huang et al., 2015). Considering that short paths in the lattice structure can cause the word-based structure to degenerate into a character-based one, Liu et al. (2019b) propose a novel word-character LSTM (WC-LSTM) model that adds word information via four strategies. Since the lattice structure is complex and dynamic, most existing lattice-based models cannot fully utilize the parallel computation of GPUs and usually have a low inference speed. Li et al. (2020) propose a Transformer-based model for Chinese NER that converts the lattice structure into a flat one.
Graph based The graph-based method uses a directed graph structure to fuse lexical information. Gui et al. (2019b) use a GNN-based method to explore multiple graph-based interactions among characters, potential words, and whole-sentence semantics to effectively alleviate word ambiguity. Sui et al. (2019) employ a collaborative graph network to assign both self-matched and nearest contextual lexical terms. To automatically learn how to incorporate multiple gazetteers into a NER system, another line of work proposes an approach based on graph neural networks with a multidigraph structure, which captures the information the gazetteers offer.

Conclusion and Future Work
In this paper, we propose DyLex, a framework that incorporates dynamic lexicons to improve the performance of BERT-like models on sequence labeling tasks.
To alleviate the problems caused by large-scale dynamic lexicons, we introduce word-agnostic tag embeddings and a knowledge denoising module. As a result, our framework outperforms state-of-the-art works on many sequence labeling tasks. In the future, extending the framework to text classification is a challenge, since a denoising corpus cannot be automatically constructed in that setting.

[Case-study tag tables omitted: for each example, the rows list the in-dict matching tags, the baseline prediction, and the DyLex prediction in BIO format.]

As shown above, we randomly select some examples of inconsistent predictions before and after adding the lexicons: examples [1-5] are from NER, examples [6-9] from NLU, and examples [10-12] from CWS. Each example contains the input sentence, the related matching result, the baseline prediction, and the DyLex prediction; highlighted parts indicate inconsistent results. We make some interesting observations. CASE I Different types of entities can occur in the same context. In example [1], "play" can be followed by a TRACK or an ALBUM (play [XX]), so the model is confused about whether XX is a TRACK or an ALBUM. In this case, lexicons provide enough type information to produce the correct result.
CASE II Chinese word segmentation granularity is flexible depending on the context. "南中国海 (South China Sea)" can be segmented into "南 (South)" and "中国海 (China Sea)", or regarded as a single word [11]. Here, an external lexicon is beneficial for controlling the granularity.
CASE III Word combinations inside a slot can have different interpretations, usually when the slot is long, which may cause discontinuity in slot extraction. For example, an improper O is inserted in the baseline prediction of [7]. By incorporating lexicons, the boundary information enhances the integrity of slot extraction.
CASE IV DyLex can adapt its predictions to updated lexicons. As examples [8-9] illustrate, given different lexicon entries, our framework can understand what "林星辰的音乐盒" (Lin Xingchen's music box) refers to and then dynamically produce the correct slot.

The Chinese industrial NLU dataset is a corpus built specifically for training mobile phone assistants. It includes an 80k training set, a 30k dev set, and a 30k test set. The annotation covers 500 types of intentions commonly used by mobile assistants, divided into 8 categories such as setting and control, with 400 slot categories in total. The data is labeled via crowdsourcing at a cost of about 1 dollar per sentence. Each sentence was marked by 3 annotators, and the final result was determined by voting. Finally, acceptance sampling was performed: professionals spot-checked the quality of each batch, and the error rate was controlled within 1%.