Enhanced Language Representation with Label Knowledge for Span Extraction

Span extraction, aiming to extract text spans (such as words or phrases) from plain text, is a fundamental process in Information Extraction. Recent works introduce label knowledge to enhance the text representation by formalizing the span extraction task as a question answering problem (QA Formalization), which achieves state-of-the-art performance. However, QA Formalization does not fully exploit the label knowledge and suffers from low efficiency in training/inference. To address these problems, we introduce a new paradigm to integrate label knowledge and further propose a novel model that explicitly and efficiently integrates label knowledge into text representations. Specifically, it encodes texts and label annotations independently and then integrates label knowledge into the text representation with an elaborately designed semantics fusion module. We conduct extensive experiments on three typical span extraction tasks: flat NER, nested NER, and event detection. The empirical results show that 1) our method achieves state-of-the-art performance on four benchmarks, and 2) it reduces training time and inference time by 76% and 77% on average, respectively, compared with the QA Formalization paradigm. Our code and data are available at https://github.com/Akeepers/LEAR.

Introduction
Information Extraction (IE), a fundamental task in natural language processing, aims to extract structured knowledge from unstructured texts. It usually involves extracting text spans (such as words or phrases) from plain text, e.g., in NER. Span extraction is usually formulated as a sequence labeling problem that assigns a categorical label to each token in a text.

Figure 1: Illustration of different paradigms for span extraction. X represents the text sequence; Y represents the category-related extra input (e.g., the question in the QA paradigm); y represents the corresponding category; f1 is the encoder that learns the text representation; f2 is the task layer that decodes the results; f1′ is the extra encoder that learns the representation of Y; f3 is the extra module that fuses text semantics and label knowledge.
Many efforts have been devoted to span extraction. Early approaches are mainly based on hand-crafted features such as domain dictionaries (Sekine and Nobata, 2004; Etzioni et al., 2005) and lexical features (Ahn, 2006). As neural networks have proven effective at learning text features automatically, many neural-based methods have been proposed (Huang et al., 2015; Strubell et al., 2017; Liu et al., 2018a; Cui et al., 2020). Recently, self-attention-based pre-trained language models such as BERT (Devlin et al., 2019) have been widely used to boost span extraction (Devlin et al., 2019; Yang et al., 2019a). However, most existing methods treat labels as independent and meaningless one-hot vectors, neglecting the prior information of labels (referred to as label knowledge).

Figure 2: Visualization of the attention mechanism for the token "judge" (QA Formalization). A darker color indicates a higher attention score.
To alleviate this limitation, several studies (Lin et al., 2019a) have started to integrate label knowledge into span extraction. Among them, QA Formalization is especially attractive due to its effectiveness (Levy et al., 2017; Li et al., 2020b; Du and Cardie, 2020; Li et al., 2020a). Simply put, QA Formalization treats span extraction as a question answering problem. Taking NER as an example, to extract "PERSON" entities, the task is formalized as answering the question "which person is mentioned in the text, in which a person represents a human or individual?" based on the given text. Benefiting from the label knowledge in the category-related questions, QA Formalization usually yields state-of-the-art performance in span extraction, even in low-resource scenarios.
QA Formalization, however, exhibits two key weaknesses. 1) Inefficiency: formalizing span extraction as QA causes a drastic reduction in training/inference efficiency. Specifically, the typical QA Formalization method concatenates question and text as the input (e.g., [CLS] question [SEP] text [SEP]) and jointly encodes them with a transformer-based encoder. The joint encoding has to transform every text into |C| pairs of the form (question, text), where |C| is the size of the label category set. This transformation increases both the size of the sample set and the length of the text sequences, and thus increases the time cost of training and inference. 2) Underutilization: the label knowledge is integrated into the text representation only implicitly, via the self-attention mechanism (Vaswani et al., 2017). As Figure 2 shows, the attention of the self-attention mechanism is distracted by the text and does not focus entirely on the question part. Thus, the label knowledge is not fully exploited to enhance the text representations.
To address the aforementioned problems, we propose a novel paradigm (see Figure 1) to integrate label knowledge. First, since joint encoding causes low efficiency, we decompose the question-text encoding process into two separate encoding modules: the text encoding module f1 and the question encoding module f1′. In this way, the size of the sample set is no longer expanded by |C| times. Second, to fully utilize the label knowledge, a fusion module f3 is designed to explicitly integrate the label and text representations. (The attention scores in Figure 2 come from the well-trained model of previous work (Li et al., 2020b).)
To instantiate the above paradigm, we further propose a model termed LEAR, which learns Label-knowledge EnhAnced Representations. A powerful encoder is essential for understanding the label annotations. However, training such an encoder from scratch is challenging since the number of label annotations is small. Thus we share the weights of f1 and f1′ (called the shared encoder), which learns the label knowledge with a large pre-trained model and introduces no extra parameters. Next, the learned label knowledge is integrated into the text representations by the semantics-guided attention module. We conduct experiments on five benchmarks across three typical span extraction tasks: flat NER, nested NER, and event detection (ED). Our model LEAR outperforms the QA Formalization baselines and achieves new state-of-the-art results. Furthermore, LEAR reduces training time and inference time by 76% and 77% on average, respectively.
To sum up, our contributions are as follows:

• We propose a new paradigm to exploit label knowledge to boost span extraction, which encodes texts and label annotations independently and integrates label knowledge into the text representation explicitly.
• We propose a novel model, LEAR, to instantiate the above paradigm. It introduces a shared encoder and semantics-guided attention to tackle the technical challenges.
• Experiments show that our method achieves SOTA performance on four benchmarks and is much faster than the previous SOTA approach. Further analysis confirms the effectiveness and efficiency of our model.

Task Formalization
We formulate the span extraction task as follows: given an input text X = (x_1, x_2, ..., x_n) consisting of n tokens, find all candidate spans in X and assign a label c ∈ C to each of them, where C is a predefined set of categories (or tag types, interchangeably). This formulation provides a uniform framework for modeling many important problems. For example, when C is the set of event types such as die, attack, and marry, span extraction is exactly the event detection task. If C instead consists of entity types such as person, organization, and location, span extraction turns into the well-known named entity recognition task.

QA Formalization is powerful in span extraction since it incorporates label knowledge. One of its prerequisites is the existence of reasonable questions. Usually, questions are generated by a manually designed pre-processing step, which is costly and lacks versatility and accessibility. For instance, Du and Cardie (2020) and Li et al. (2020b) construct such questions manually; the latter, for flat and nested NER, uses the annotations of each category (referred to as label annotations) as the questions. We follow this setting in our work for a fair comparison. Similarly, we utilize the annotations of event types in the ACE 2005 event detection task. Table 1 presents an example of those annotations.
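As a minimal illustration of the task signature described above (the function name and the toy capitalization rule are assumptions for this sketch, not the paper's method), span extraction maps a token sequence and a category set to (start, end, category) triples:

```python
from typing import List, Tuple

def extract_spans(tokens: List[str], categories: List[str]) -> List[Tuple[int, int, str]]:
    """Return (start, end, category) triples over token indices."""
    # Toy rule purely for illustration: mark capitalized tokens as PER spans.
    spans = []
    for i, tok in enumerate(tokens):
        if tok[0].isupper() and "PER" in categories:
            spans.append((i, i, "PER"))
    return spans

print(extract_spans(["Alice", "met", "Bob"], ["PER", "ORG"]))
# -> [(0, 0, 'PER'), (2, 2, 'PER')]
```

A real model replaces the toy rule with the learned decoder described in the Approach section; the interface stays the same for flat NER, nested NER, and event detection.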

Approach
In this section, we first give an overall description of the LEAR architecture. LEAR consists of three crucial modules: the semantics encoding module, the semantics fusion module, and the span decoding module. Our architecture (Figure 3) takes text X and the label annotations Y of category set C as input. The two inputs are processed by two encoder networks whose backbone is BERT (Devlin et al., 2019). The two encoders share weights (referred to as the shared encoder). Then the text embeddings and label embeddings produced by the shared encoder are fused by the semantics fusion module to derive label-knowledge-enhanced embeddings for the text. Finally, the label-knowledge-enhanced embeddings are used to predict whether each token is a start or end index for some category.

Semantics Encoding Module
The semantics encoding module aims to encode the text and the label annotations into real-valued embeddings. Since the number of label annotations is small compared with the whole sample set, it is challenging to build an encoder for the label annotations from scratch. Thus we introduce the shared encoder, inspired by siamese networks (Bromley et al., 1993). The shared encoder is efficient in learning the representation of label annotations and introduces no extra parameters.
Given the input text X and label annotations Y, LEAR first extracts their embeddings h_X ∈ R^{n×d} and h_Y ∈ R^{|C|×m×d}, where n is the length of X, m is the length of a label annotation, |C| is the size of the category set C, and d is the hidden dimension of the encoder. We denote this operation as:

h_X = Encoder(X),  h_Y = Encoder(Y),

where Encoder is the shared BERT encoder.
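The shape contract above can be sketched with a toy stand-in for the shared encoder (a random embedding table with tied weights; all names here are illustrative, not the paper's code). The point is that the same parameters encode both inputs and no extra encoder parameters are introduced:

```python
import numpy as np

rng = np.random.default_rng(0)
d, vocab = 8, 50
embed = rng.normal(size=(vocab, d))  # shared weights for both inputs

def encode(token_ids: np.ndarray) -> np.ndarray:
    # Same parameters serve text and label annotations (the "shared encoder").
    return embed[token_ids]

n, m, num_cat = 5, 3, 4
text_ids = rng.integers(0, vocab, size=n)            # text X, length n
label_ids = rng.integers(0, vocab, size=(num_cat, m))  # |C| annotations, length m

h_X = encode(text_ids)   # (n, d)
h_Y = encode(label_ids)  # (|C|, m, d)
print(h_X.shape, h_Y.shape)
```

In the real model the lookup is replaced by BERT, but the output shapes h_X ∈ R^{n×d} and h_Y ∈ R^{|C|×m×d} are as stated in the text.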

Semantic Fusion
The semantic fusion module aims at enhancing the text representation with label knowledge explicitly.
To this end, we devise a semantics-guided attention mechanism. Specifically, we first feed h_X and h_Y into a fully connected layer each, mapping their representations into the same feature space:

p_X = h_X U_1,  p_Y = h_Y U_2,

where U_1, U_2 ∈ R^{d×d} are the learnable parameters of the fully connected layers. Then, we apply the attention mechanism over the label annotations for each token in the text. For any 1 ≤ i ≤ n, let x_i be the i-th token of X, and h_{x_i} ∈ R^d be the i-th row of h_X. Likewise, for any 1 ≤ j ≤ m and category c ∈ C, let y^c_j be the j-th token of the annotation of c, and h_{y^c_j} its embedding from h_Y. We compute the dot product of the projected representations and apply a softmax function to obtain the attention scores:

a^c_{ij} = softmax_j(p_{x_i} · p_{y^c_j}).

Finally, we obtain fine-grained features by attention, which are in turn fused into the token embedding by the add operation:

ĥ^c_{x_i} = h_{x_i} + tanh(V (Σ_j a^c_{ij} h_{y^c_j}) + b),

where tanh(·) is the hyperbolic tangent function, and V ∈ R^{d×d} and b ∈ R^d are learnable parameters. Intuitively, ĥ^c_{x_i} encodes the information related to category c.
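The semantics-guided attention can be sketched end to end as follows (a minimal numpy illustration; parameter names U1, U2, V, b mirror the formulas above, but the random initialization and exact projection placement are assumptions of this sketch):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(1)
n, m, num_cat, d = 5, 3, 4, 8
h_X = rng.normal(size=(n, d))            # text embeddings
h_Y = rng.normal(size=(num_cat, m, d))   # label annotation embeddings
U1, U2, V = (rng.normal(size=(d, d)) for _ in range(3))
b = rng.normal(size=d)

p_X = h_X @ U1                            # map text into the shared space
p_Y = h_Y @ U2                            # map label annotations likewise

# Attention of each token x_i over the m annotation tokens of each category.
scores = np.einsum("nd,cmd->cnm", p_X, p_Y)  # (|C|, n, m)
attn = softmax(scores, axis=-1)

# Category-specific label feature per token, fused by the add operation.
label_feat = np.einsum("cnm,cmd->cnd", attn, h_Y)        # (|C|, n, d)
h_fused = h_X[None, :, :] + np.tanh(label_feat @ V + b)  # (|C|, n, d)
print(h_fused.shape)
```

The output h_fused stacks ĥ^c_{x_i} for all categories c and tokens x_i, matching the category-related embedding used by the span decoder.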
Repeating the process for all categories, we obtain the category-related embedding ĥ_{x_i} ∈ R^{|C|×d} for each token x_i.

Span Decoding
Now we are ready to select spans. Following Li et al. (2020b), we use the start/end tagging schema to annotate the target spans to extract. Specifically, for each token x_i, we compute the following vector:

start_{x_i} = sigmoid(f_o(M_s ∘ ĥ_{x_i} + b_s)) ∈ R^{|C|},

where M_s ∈ R^{|C|×d} and b_s ∈ R^d are learnable parameters (b_s is broadcast over rows), ∘ is element-wise multiplication, and f_o(·) is the function that sums up the rows of the input matrix. Intuitively, for any c ∈ C, start^c_{x_i} indicates the probability that x_i starts a span of category c. Likewise, we obtain end_{x_i}, which indicates the probabilities that x_i ends a span, by the same prediction procedure. Then we extract the results case by case, depending on whether spans of the same category can be nested.
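The start-position scorer can be sketched as follows: for each token, element-wise multiply its |C| category-related embeddings with M_s, sum over the feature dimension, and squash with a sigmoid (the bias placement here is a simplifying assumption of this sketch):

```python
import numpy as np

rng = np.random.default_rng(2)
n, num_cat, d = 5, 4, 8
h_fused = rng.normal(size=(num_cat, n, d))  # category-related token embeddings
M_s = rng.normal(size=(num_cat, d))
b_s = rng.normal(size=num_cat)              # bias shape chosen for simplicity

# f_o: sum each row of M_s ∘ h, yielding one logit per category per token.
logits = np.einsum("cnd,cd->nc", h_fused, M_s) + b_s  # (n, |C|)
start_prob = 1.0 / (1.0 + np.exp(-logits))            # sigmoid
print(start_prob.shape)  # one start probability per token and category
```

The end-position probabilities are obtained the same way with separate parameters M_e, b_e.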
Flat Span Decoding This is the case without nested spans in the same category.
The most widely adopted method is the nearest matching principle (Du and Cardie, 2020), which matches a start position of category c with the nearest following end position of c. In contrast, we follow the heuristic matching principle (Yang et al., 2019b), which determines spans through the lens of probability. Roughly speaking, among the candidate start and end positions of a category c, we only match those with high probabilities, where the probabilities are derived from the start/end vectors defined above. For detailed information on heuristic matching, please refer to the algorithm in Appendix A.1.
The two principles for span decoding are further compared by experiments in Appendix A.2.
Nested Span Decoding Now suppose that spans in the same category may be nested or overlapped.
Since the heuristic matching principle no longer works, we follow the solution of BERT-MRC (Li et al., 2020b). It employs a binary classifier to predict the probability that a pair of candidate start/end positions should be matched as a span. Specifically, for any category c, define the following binary classifier:

P^c_{i,j} = sigmoid(M [ĥ^c_{x_i} ; ĥ^c_{x_j}]),

where 1 ≤ i, j ≤ n, [· ; ·] denotes concatenation, and M ∈ R^{1×2d} is a learnable parameter. When P^c_{i,j} > 0.5, we predict that x_i and x_j demarcate a span of c.
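The boundary-pair classifier can be sketched as follows: concatenate the start-candidate and end-candidate token embeddings, apply the learned row vector M, and threshold the sigmoid output (random weights here are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 5, 8
h = rng.normal(size=(n, d))  # token embeddings for one category c
M = rng.normal(size=(2 * d,))

# All (i, j) pairs: i varies over the repeated block, j over the tiled block.
pairs = np.concatenate(
    [np.repeat(h, n, axis=0), np.tile(h, (n, 1))], axis=1)   # (n*n, 2d)
P = (1.0 / (1.0 + np.exp(-(pairs @ M)))).reshape(n, n)       # P[i, j]

matched = P > 0.5  # predicted span boundaries for category c
print(P.shape)
```

In practice one would only score pairs whose i is a candidate start and whose j ≥ i is a candidate end, rather than all n² pairs.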

Loss Function
Given the input text X = (x_1, x_2, ..., x_n) consisting of n tokens and the set C of categories, for any c ∈ C, define S^c ∈ {0,1}^n to be the vector whose i-th entry S^c_i = 1 if and only if x_i is a ground-truth start position of c. Likewise, define E^c ∈ {0,1}^n to indicate the ground-truth end positions. Recall the vectors start^c and end^c defined in Section 3.3. The start loss L_s and end loss L_e of our model are defined as follows:

L_s = Σ_{c∈C} CE(start^c, S^c),  L_e = Σ_{c∈C} CE(end^c, E^c),

where CE stands for the cross entropy.
Flat Span Extraction The final loss of our model is L = L_s + L_e.

Nested Span Extraction More notation is needed. Recall the matrix P^c ∈ R^{n×n} defined in Formula (9). Let M^c ∈ R^{n×n} be the binary matrix such that M^c_{i,j} = 1 if and only if the tokens x_i and x_j demarcate a ground-truth span of category c. Define the match loss

L_match = Σ_{c∈C} CE(W^c ∘ P^c, W^c ∘ M^c),

where W^c ∈ R^{n×n} is the binary matrix such that W^c_{i,j} = 1 if and only if P^c_{i,j} > 0.5 or M^c_{i,j} = 1. The final loss of our model is then L = α(L_s + L_e) + βL_match, where α and β are hyper-parameters controlling the contributions to the overall training objective.
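Since the start/end scorers output independent per-category probabilities, the cross entropy above is binary cross entropy over the indicator vectors. A minimal sketch of the flat-extraction objective (random probabilities and targets stand in for model outputs and gold labels):

```python
import numpy as np

def bce(p, t, eps=1e-9):
    # Binary cross entropy between probabilities p and 0/1 targets t.
    return -np.mean(t * np.log(p + eps) + (1 - t) * np.log(1 - p + eps))

rng = np.random.default_rng(4)
n, num_cat = 5, 4
start_prob = rng.uniform(0.01, 0.99, size=(n, num_cat))
end_prob = rng.uniform(0.01, 0.99, size=(n, num_cat))
S = rng.integers(0, 2, size=(n, num_cat))  # ground-truth start indicators S^c
E = rng.integers(0, 2, size=(n, num_cat))  # ground-truth end indicators E^c

L_s, L_e = bce(start_prob, S), bce(end_prob, E)
loss = L_s + L_e  # flat span extraction objective L = L_s + L_e
print(loss > 0)
```

For the nested case, the same BCE is applied to the masked pair matrices W^c ∘ P^c against W^c ∘ M^c and weighted by α, β as in the formula above.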

Experiments
In this section, we present the results of LEAR on five widely-used benchmarks.

Datasets
We evaluate our model on three span extraction tasks: flat NER, nested NER, and event detection. For flat NER, we conduct experiments on MSRA (Levow, 2006) and Chinese OntoNotes 4.0 (Pradhan et al., 2011). For nested NER, we evaluate our model on the ACE 2004 (Doddington et al., 2004) and ACE 2005 (NER) datasets. For event detection, we use the ACE 2005 (ED) dataset.
For MSRA and Chinese OntoNotes 4.0, which contain three and four types of entities respectively, we follow the data preprocessing strategies of Li et al. (2020b) and Meng et al. (2019) for a fair comparison. ACE 2005 (NER) and ACE 2004 both annotate 7 entity categories. For ACE 2005 (NER), we use the same data split as previous works (Lin et al., 2019b); for ACE 2004, we use the same setup as Katiyar and Cardie (2018). ACE 2005 (ED) annotates 33 types of events, and we follow the settings of Chen et al. (2015) to split the data into train, development, and test sets. More statistics of the datasets are listed in Appendix A.4.

Baselines
Named Entity Recognition We compare against the state-of-the-art models listed in Table 2. Furthermore, to compare the efficiency of QA Formalization and LEAR, we instantiate the traditional paradigm as a baseline in the simplest way: it contains only a BERT encoder and two fully connected layers as classifiers. We denote this baseline model as Traditional Formalization.

Experimental Setups
We use BERT (Devlin et al., 2019) as the backbone to learn the contextualized representations of the texts. More specifically, we implement our model on top of the BERT-large model for the NER tasks, the same as BERT-MRC (Li et al., 2020b). For the event detection task, we use the BERT-base model as the backbone. We adopt the Adam optimizer (Kingma and Ba, 2015) with a linearly decaying schedule to train our model. Detailed hyper-parameter settings are listed in Appendix A.3.
To make the results of the efficiency comparison experiment comparable (as shown in Table 3), all models take BERT-base as the backbone and use the same hyperparameters, except for the max_seq_len of QA Formalization: the larger max_seq_len accommodates the question taken as extra input.

Effectiveness Evaluation
We use micro-average precision, recall, and F1 as evaluation metrics. A prediction is considered correct only if both its boundary and category are predicted correctly.
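The evaluation described above can be sketched as follows (the function name and toy data are illustrative): a prediction counts as a true positive only if both boundary and category exactly match a gold span, and precision/recall/F1 are micro-averaged over all predicted triples.

```python
def micro_prf(pred, gold):
    """Micro-averaged P/R/F1 over (start, end, category) triples."""
    pred, gold = set(pred), set(gold)
    tp = len(pred & gold)  # exact boundary + category matches
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

p, r, f1 = micro_prf({(0, 1, "PER"), (3, 3, "ORG")}, {(0, 1, "PER")})
print(round(f1, 3))
# -> 0.667
```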

Main Results
Effectiveness Table 2 shows the performance of LEAR compared with the above state-of-the-art methods on the test sets. LEAR outperforms all other models on four benchmarks, with gains of +3.01%, +0.84%, +0.21%, and +0.17% on ACE 2005 (ED), OntoNotes 4.0, MSRA, and ACE 2004, respectively. This improvement indicates that explicit fusion with a dedicated module is better than implicit fusion based on the self-attention mechanism. Because of the joint encoding in QA Formalization, the attention of the self-attention mechanism is distracted by the text and does not focus entirely on the question, so the label knowledge introduced by the label annotation is not fully exploited. By contrast, LEAR learns knowledge-enhanced representations for each token through a semantics-guided fusion module, whose attention focuses entirely on the label annotation.
Efficiency Table 3 shows that LEAR is much faster than QA Formalization, reducing training and inference time by 76% and 77% on average, respectively. The reduction in training/inference time is positively correlated with the number of categories |C|, which benefits from breaking the joint-encoding limitation of QA Formalization. As Table 4 shows, the time complexity of LEAR during inference is O(n^2 + |C|mn), where we ignore the cost of encoding the label annotations: LEAR encodes all label annotations only once and reuses their representations during inference, which is favorable for industrial applications in resource-limited online environments. In contrast, the time complexity of QA Formalization is O(|C| · (n + m)^2), causing a dramatic decrease in inference efficiency. To summarize, the fundamental starting points of the proposed paradigm are: 1) decomposing question-text joint encoding into two separate encoding modules; and 2) explicitly integrating label knowledge with a dedicated module. The above experiments confirm that LEAR, an instantiation of the proposed paradigm, outperforms previous SOTA methods in both effectiveness and efficiency.

Analysis for Model Variants
To demonstrate the effectiveness of our method, we build a series of variants of LEAR. For the semantics encoding module, we set: 1) Label Embedding Layer (LEL): replacing the encoder of label annotations with a label embedding layer initialized by GloVe (Pennington et al., 2014). The F1 scores drop by 0.86% on average. The results show that the improvement of LEAR comes from understanding the label annotations, which the shared encoder handles well. 2) Label Name Encoding (LNE): replacing the label annotations with the corresponding label names. The results drop by 0.53% on average, indicating that label names contain less label knowledge than label annotations.
To investigate the semantics fusion strategy, we set: 1) Average Pooling & Add (AP & Add): replacing the semantics-guided attention mechanism with average pooling and integrating label knowledge by the add operation. The F1 scores drop by 0.80% on average. 2) Sentence Features & Similarity (SF & Sim): using the sentence-level features of the label annotations (i.e., the embedding of the [CLS] symbol) instead of token-level features, so that the semantics-guided attention mechanism turns into a similarity calculation between token embeddings and label features. The F1 scores drop by 0.56%. Both settings retain the extra learnable parameters introduced by the fusion module. The results show that the improvement comes from better exploitation of label knowledge, not from the larger number of parameters. Moreover, the results demonstrate that fine-grained (i.e., token-level) features are more effective.
All the above experiments show the effectiveness of LEAR. Furthermore, the worst-performing variants of LEAR still rival the QA Formalization method, which powerfully demonstrates the superiority of the proposed paradigm.

To verify that exploiting label knowledge is beneficial in data-scarce scenarios, we introduce LEAR w/o for comparison. LEAR w/o is short for LEAR without label knowledge; its settings are the same as LEAR except that BERT alone, rather than the shared encoder and label semantics fusion module, is used (i.e., standard fine-tuning). We conduct two sets of experiments for each dataset using various proportions of the training data: 1-shot and 5-shot. For the 1-shot setting, we sample one sentence for each category in the training set; the 5-shot setting is similar. We repeat each experiment 5 times. Table 6 shows that LEAR demonstrates superior performance, obtaining up to +12% absolute improvement and +6.8% on average across all datasets in the 1-shot setting. This is in line with our expectation, since LEAR enhances the text representation with label knowledge, which provides more prior information.
In the appendix, we further analyze the effect of different span decoding strategies and compare solving span extraction in the multi-label classification manner (our LEAR) with the sequence-labeling manner (e.g., with a CRF layer).

Related Work
Event Detection (ED). Event detection aims at extracting event triggers from a text and classifying them. It is predominantly solved in a representation-based manner, where triggers are represented by embeddings. When no extra information is available, the representation can be obtained by a powerful text encoder, usually based on a CNN (Chen et al., 2015), an RNN (Nguyen et al., 2016), or the attention mechanism (Yang et al., 2019b; Tong et al., 2020). Besides, the representation can be enhanced with extra information, such as syntactic information (Liu et al., 2018b; Cui et al., 2020) and knowledge bases (Liu et al., 2016). In particular, label knowledge is attracting more and more attention (Li et al., 2020a; Du and Cardie, 2020), usually by formalizing ED as a QA problem.
Named Entity Recognition (NER). Named entity recognition seeks to locate named entities in unstructured text and classify them into pre-defined categories such as person, organization, and location. Traditional methods treat it as a classification task and use CRFs (Lafferty et al., 2001; Sutton et al., 2007) as the backbone. Neural networks then became the prevalent tool for NER with the development of deep learning. Recently, the performance of NER has been further improved by large-scale language models such as ELMo (Peters et al., 2018) and BERT (Devlin et al., 2019). When label knowledge is available, state-of-the-art performance can be obtained by formulating NER as a QA problem.

Conclusion
In this paper, we propose a novel paradigm to exploit label knowledge to boost the span extraction task, and we instantiate it with a model named LEAR. Unlike existing QA Formalization methods, LEAR encodes the text and the label annotations independently and uses a semantics fusion module to integrate label knowledge into the text representation explicitly. In this way, we overcome the inefficiency and underutilization problems of QA Formalization. Experimental results show that our model outperforms previous works and enjoys significantly faster training/inference.

A.1 Heuristic Matching Algorithm

Algorithm 1 contains a finite state machine, which changes from one state to another in response to start^c and end^c. There are three states in total: 1) neither a start nor an end has been detected; 2) only a start has been detected; 3) both a start and an end have been detected. The state changes according to the following rules: State 1 changes to State 2 when the current token is a start; State 2 changes to State 3 when the current token is an end; State 3 changes to State 2 when the current token is a new start. Notably, if a start has already been detected and another start arises, we choose the one with the higher probability, and the same for ends.

A.2 Comparison of Span Decoding Strategies

(1) Strategy A treats span decoding as a multi-label classification problem with 2 × |C| binary classifiers that predict the boundaries of spans. This strategy is inspired by the QA task and is adopted in BERT-span and our LEAR. BERT-span v1 employs the heuristic match principle, and BERT-span v2 uses the nearest match principle, both mentioned in Section 3.3. (2) Strategy B, the most commonly used, treats span decoding as a multi-class classification problem with the BIO or BIOS schema and is adopted in BERT-softmax and BERT-crf. Compared with BERT-softmax, BERT-crf adds a conditional random field (CRF) layer to model the dependencies between predictions, usually yielding better performance but lower efficiency.
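The finite state machine for heuristic matching can be sketched as follows, decoding one category from per-token start/end probabilities (the function name and the 0.5 threshold are assumptions of this sketch, not the paper's code):

```python
from typing import List, Tuple

def heuristic_match(start_p: List[float], end_p: List[float],
                    thr: float = 0.5) -> List[Tuple[int, int]]:
    spans, cur_start, cur_end = [], None, None
    for i, (sp, ep) in enumerate(zip(start_p, end_p)):
        if sp > thr:
            if cur_start is not None and cur_end is not None:
                spans.append((cur_start, cur_end))  # State 3 -> State 2
                cur_start, cur_end = i, None
            elif cur_start is None or sp > start_p[cur_start]:
                cur_start = i  # keep the higher-probability start
        if ep > thr and cur_start is not None:
            if cur_end is None or ep > end_p[cur_end]:
                cur_end = i  # keep the higher-probability end
    if cur_start is not None and cur_end is not None:
        spans.append((cur_start, cur_end))
    return spans

print(heuristic_match([0.9, 0.1, 0.8, 0.1], [0.1, 0.7, 0.1, 0.9]))
# -> [(0, 1), (2, 3)]
```

A new start above threshold closes the current span (State 3 to State 2), while competing starts or ends within a span are resolved by probability, matching the rules stated above.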
The results show that: (1) the strategy used by LEAR performs better than the traditional one. The reason might be that the span decoding strategy in our approach is start/end position matching, which only needs to predict a span's boundary, while the strategy adopted in previous methods needs to predict both the boundary and the internal words, which is much harder, especially for longer spans. (2) The comparison between BERT-span v1 and BERT-span v2 shows that the heuristic match principle achieves better results by making the most of the probability information. (3) Besides, Strategy A has an extra benefit: it naturally handles the nested span issue, where candidate spans overlap across different categories.

A.3 Details of Hyper-Parameters Settings
All hyper-parameters of our model are listed in Table 8 in detail.