Injecting Knowledge Base Information into End-to-End Joint Entity and Relation Extraction and Coreference Resolution

We consider a joint information extraction (IE) model that solves named entity recognition, coreference resolution and relation extraction jointly over the whole document. In particular, we study how to inject information from a knowledge base (KB) into such an IE model, based on unsupervised entity linking. The KB entity representations we use are learned from either (i) hyperlinked text documents (Wikipedia), or (ii) a knowledge graph (Wikidata), and appear complementary in raising IE performance. Representations of corresponding entity linking (EL) candidates are added to the text span representations of the input document, and we experiment with (i) taking a weighted average of the EL candidate representations based on their prior (in Wikipedia), and (ii) using an attention scheme over the EL candidate list. Results demonstrate an increase of up to 5% F1-score for the evaluated IE tasks on two datasets. Despite the strong performance of the prior-based model, our quantitative and qualitative analysis reveals the advantage of using the attention-based approach.


Introduction
Information extraction (IE) comprises several subtasks, e.g., named entity recognition (NER), coreference resolution (coref), and relation extraction (RE). State-of-the-art results mainly report performance on single tasks, usually solving them at the sentence level (especially NER and RE). However, in practice, IE system decisions should be consistent at the document level, e.g., when processing news articles to automatically link entities (aside from potentially learning, e.g., new relations). Yet, the challenge of solving the tasks jointly at the document level has not received as much attention and remains hard (Durrett and Klein, 2014; Yao et al., 2019; Zaporojets et al., 2021). On the other hand, it is well established that IE models benefit from incorporating background information from knowledge bases (KBs). Still, so far this has been shown from the perspective of solving individual tasks such as relation classification or entity typing (e.g., Peters et al. (2019); Liu et al. (2020)). Integrating KBs into joint models, realizing and analyzing the more complex end-to-end setting, has been left unexplored.
In terms of the nature of KBs adopted in IE, current approaches use either (i) structured knowledge graphs comprising (subj, rel, obj) triples, e.g., Wikidata (Yang and Mitchell, 2017; Han et al., 2018; Zhang et al., 2019), or (ii) textual descriptions, usually in hyperlinked documents, e.g., Wikipedia (Martins et al., 2019; Yamada et al., 2020). It has not been established to what extent KB-text and KB-graph entity representations complement each other in boosting IE performance.
We address both research gaps of (a) integrating KB information into a joint end-to-end IE model for solving named entity recognition, coreference resolution and relation extraction, and (b) analyzing which KB representation is more beneficial for IE: KB-graph, trained on Wikidata, or KB-text, trained directly on Wikipedia. We particularly contribute: (i) a first span-based end-to-end architecture incorporating KB knowledge in a joint entity-centric setting, exploiting unsupervised entity linking (EL) to select KB entity candidates, (ii) exploration of prior- and attention-based mechanisms to combine the EL candidate representations into the model, (iii) assessment of the complementarity of KB-graph and KB-text representations, and (iv) consistent gains of up to 5% F1-score when incorporating KB knowledge in 3 document-level IE tasks evaluated on 2 different datasets.

Entity Representations
We experiment with 3 possible entity representations: KB-text, KB-graph, and the concatenation of both.
KB-text: We follow Yamada et al. (2016) to obtain entity representations using a skip-gram architecture (Mikolov et al., 2013a,b), training to jointly predict (i) the linked entities (through Wikipedia hyperlinks) given the target entity, and (ii) the neighboring words for a given entity hyperlink.
KB-graph: We follow Joulin et al. (2017) to train entity embeddings directly on Wikidata triples (subj, rel, obj), optimizing a linear classifier to predict the obj entity from the subj entity and the relation type rel.
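For concreteness, the following minimal PyTorch sketch illustrates the KB-graph objective described above: a linear classifier over shared embeddings predicts the obj entity from the subj entity and the relation. Dimensions, vocabulary sizes and the training step are illustrative placeholders, not the actual setup of Joulin et al. (2017) or of our experiments.

```python
import torch
import torch.nn as nn

class KBGraphEmbedder(nn.Module):
    """Sketch of fastText-style entity embeddings trained on (subj, rel, obj)
    triples: a linear classifier predicts the obj entity from the sum of the
    subj-entity and relation embeddings (all sizes are placeholders)."""

    def __init__(self, num_entities: int, num_relations: int, dim: int = 200):
        super().__init__()
        self.entity_emb = nn.Embedding(num_entities, dim)
        self.relation_emb = nn.Embedding(num_relations, dim)
        # Full softmax over all entities; a KB the size of Wikidata would
        # require sampled or hierarchical softmax instead.
        self.classifier = nn.Linear(dim, num_entities)

    def forward(self, subj_ids, rel_ids):
        x = self.entity_emb(subj_ids) + self.relation_emb(rel_ids)
        return self.classifier(x)  # unnormalized scores over obj entities

# Illustrative training step: cross-entropy against the gold obj entity.
model = KBGraphEmbedder(num_entities=1000, num_relations=50)
subj, rel, obj = torch.tensor([3]), torch.tensor([7]), torch.tensor([42])
loss = nn.functional.cross_entropy(model(subj, rel), obj)
loss.backward()
```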

KB module
For a span $s_i$ from token $l$ to $r$, we obtain the representation $g_i$ as input to the KB module by concatenating the respective hidden LSTM states $h_l$ and $h_r$, and an embedding $\psi_{r-l}$ for the corresponding span width $r-l$:
$g_i = [h_l;\, h_r;\, \psi_{r-l}]$ (1)
We look up a given span $s_i$ in a dictionary built from Wikipedia to determine its set of candidate entities $C_i$, as well as the prior probability $p_{ij}$ for each $c_{ij} \in C_i$, as per Yamada et al. (2016, §3).
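The span representation of eq. (1) and the candidate lookup can be sketched as follows (illustrative PyTorch code; the hidden size, the width-embedding size and the toy candidate dictionary are placeholders, not the actual configuration).

```python
import torch
import torch.nn as nn

# Placeholder sizes for the LSTM hidden states and span-width embedding.
HIDDEN, MAX_WIDTH = 128, 30
width_emb = nn.Embedding(MAX_WIDTH, 20)

def span_representation(lstm_states: torch.Tensor, l: int, r: int) -> torch.Tensor:
    """g_i = [h_l; h_r; psi_{r-l}] (eq. 1): concatenate the LSTM hidden
    states at the span boundaries with a learned span-width embedding."""
    psi = width_emb(torch.tensor(r - l))
    return torch.cat([lstm_states[l], lstm_states[r], psi], dim=-1)

g_i = span_representation(torch.randn(12, HIDDEN), 3, 5)  # (2*128 + 20,)

# Toy Wikipedia anchor-text dictionary: surface form -> [(entity, prior p_ij)].
candidate_dict = {
    "washington": [("Washington,_D.C.", 0.62), ("George_Washington", 0.25),
                   ("Washington_(state)", 0.13)],
}

def candidates(span_text: str):
    """Return the candidate entity set C_i with prior probabilities."""
    return candidate_dict.get(span_text.lower(), [])
```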
To combine the KB candidates $c_{ij}$, we either use (i) a uniform average (Uniform), (ii) the prior weights $p_{ij}$ (Prior), (iii) an attention scheme (Attention), or (iv) attention with prior information (AttPrior). The unnormalized attention scores for Attention and AttPrior are:
$\Phi_{\text{att}}(s_i, c_{ij}) = F_{\text{att}}([\,g_i;\; \xi_K(c_{ij})\,])$ (2)
$\Phi_{\text{attprior}}(s_i, c_{ij}) = F_{\text{attprior}}([\,g_i;\; \xi_K(c_{ij});\; p_{ij}\,])$ (3)
where $K \in \{\text{KB-text}, \text{KB-graph}, \text{both}\}$ refers to the entity representations from §2.1, $\xi_K$ returns such a representation for $c_{ij}$, and $F_*$ is a feed-forward neural network (FFNN). The KB representation for span $s_i$ is a weighted average of its candidates $C_i$:
$e_i^K = \sum_j \alpha_{ij}\, \xi_K(c_{ij})$ (4)
where the weights $\alpha_{ij}$ are either uniform ($1/|C_i|$), the prior $p_{ij}$, or softmax-normalized attention scores (softmax over $\Phi$ from eq. (2) or eq. (3)). The concatenation $[g_i; e_i^K]$ forms the KB-enriched representation for span $s_i$, used as input for the IE modules (§2.3).
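The AttPrior variant can be sketched as follows, assuming the score FFNN takes the concatenation of the span representation, the candidate's KB embedding and its prior as input (as in eq. (3)); layer sizes and dimensions are illustrative placeholders.

```python
import torch
import torch.nn as nn

# Placeholder dimensions (SPAN_DIM matches the 2*128 + 20 span sketch above).
SPAN_DIM, KB_DIM = 276, 200

att_prior_ffnn = nn.Sequential(
    nn.Linear(SPAN_DIM + KB_DIM + 1, 150), nn.ReLU(), nn.Linear(150, 1)
)

def kb_enriched_span(g_i: torch.Tensor,
                     cand_embs: torch.Tensor,   # (n_candidates, KB_DIM), xi_K(c_ij)
                     priors: torch.Tensor       # (n_candidates,), p_ij
                     ) -> torch.Tensor:
    """Attention-weighted average of candidate entity embeddings (eq. 4),
    concatenated to the span representation: [g_i; e_i^K]."""
    n = cand_embs.size(0)
    inputs = torch.cat([g_i.expand(n, -1), cand_embs, priors.unsqueeze(1)], dim=1)
    alpha = torch.softmax(att_prior_ffnn(inputs).squeeze(1), dim=0)  # alpha_ij
    e_i = (alpha.unsqueeze(1) * cand_embs).sum(dim=0)                # e_i^K
    return torch.cat([g_i, e_i], dim=-1)

enriched = kb_enriched_span(torch.randn(SPAN_DIM), torch.randn(3, KB_DIM),
                            torch.tensor([0.62, 0.25, 0.13]))
```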

Joint IE model
The joint IE model comprises 3 modules (Fig. 1), one per task, and is trained by minimizing a weighted combination of the 3 module losses. Note that NER and RE are framed as multi-label classification.
NER module: We use a FFNN on each span $s_i$ to produce scores $\Phi_{\text{NER}}(s_i) \in \mathbb{R}^{|L_E|}$, with $L_E$ the set of possible entity types. At inference, we accept type $l \in L_E$ for span $s_i$ if its corresponding score exceeds a classification threshold.
Coref module: We use the coreference scheme proposed by Lee et al. (2017), using a FFNN to produce scores $\Phi_{\text{coref}}(s_i, s_j)$; at inference time, the highest scoring antecedent of span $s_j$ is chosen (potentially $s_j$ itself). Indeed, to allow for singletons we accept self-references $(s_j, s_j)$ if NER predicts the span $s_j$ to be an entity.
RE module: Similar to Luan et al. (2018), we use a FFNN to produce scores $\Phi_{\text{RE}}(s_i, s_j) \in \mathbb{R}^{|L_R|}$ for each pair of spans $(s_i, s_j)$, with $L_R$ the set of relation types. We accept relation $l \in L_R$ for the pair $(s_i, s_j)$ if its corresponding score exceeds a classification threshold.
IE unification: The above modules make span-level predictions. We obtain entity-centric predictions using the coref clusters, by assigning the union of predicted entity/relation types within a coref cluster to all its members, as do Zaporojets et al. (2021).
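The entity-centric unification step can be sketched as follows (illustrative Python; the data structures are placeholders, not the model's actual ones).

```python
from collections import defaultdict
from itertools import product

def unify_predictions(clusters, span_types, span_relations):
    """Every mention in a coref cluster inherits the union of the entity types
    (and, pairwise between clusters, of the relation types) predicted for any
    member of its cluster.

    clusters:       list of lists of span ids (one list per coref cluster)
    span_types:     dict span_id -> set of predicted entity types
    span_relations: dict (span_id, span_id) -> set of predicted relation types
    """
    # Union of entity types per cluster, assigned back to every member.
    unified_types = {}
    for cluster in clusters:
        types = set().union(*(span_types.get(s, set()) for s in cluster))
        for s in cluster:
            unified_types[s] = types

    # Union of relation types between every pair of members of two clusters.
    unified_relations = defaultdict(set)
    for c1, c2 in product(clusters, repeat=2):
        rels = set().union(*(span_relations.get((s1, s2), set())
                             for s1 in c1 for s2 in c2))
        if rels:
            for s1, s2 in product(c1, c2):
                unified_relations[(s1, s2)] = rels
    return unified_types, dict(unified_relations)

# Toy usage: spans 0 and 2 corefer, so span 2 inherits span 0's type and relation.
ent_types, rel_types = unify_predictions(
    clusters=[[0, 2], [1]],
    span_types={0: {"person"}, 2: {"politician"}, 1: {"location"}},
    span_relations={(0, 1): {"citizen_of"}})
# ent_types[2] == {"person", "politician"}; rel_types[(2, 1)] == {"citizen_of"}
```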

Results
We summarize the comparison of the various model choices for both the DWIE and DocRED datasets in Table 2. First, looking into (Q1), we note that including background information from KB-graph and KB-text significantly boosts performance compared to the Baseline without any KB. Additionally, our model outperforms the results from Zaporojets et al. (2021) (not listed in the table) by about 2 percentage points F1, using the same input (GloVe) representations. Furthermore, we observe a general improvement in results when combining both representations, suggesting that a (hyper)text corpus (Wikipedia) and a knowledge graph (Wikidata) embed complementary information for raising IE performance.
Deeper analysis reveals that adding KB representations mainly benefits performance for "rare" entity types: e.g., limiting the DWIE test set to entity types that occur ≤50 times in the training set, NER F1 goes up by +13.9 for KB-both with AttPrior compared to Baseline, while the benefit gradually decreases for more frequently occurring entity types. For RE, we also see a clear overall performance gain from adding KB information (e.g., +5.1% F1 for both KB sources with AttPrior compared to Baseline on DWIE), yet the boost is not as clear for relations with fewer training instances. (The latter makes sense, since we inject KB representations of entities rather than explicitly also of relations; we leave studying the addition of relation embedding information for future work.) Second, for (Q2), we note that the AttPrior scheme is the overall winner among the different EL candidate weighting schemes. We observed that in terms of ranking EL candidates, Prior performs quite well on DWIE: for 86.5% of entity mentions it assigns the highest score to the correct EL candidate, while Attention and AttPrior achieve this for 46.2%, resp. 77.2% of the mentions, which basically confirms that DWIE has an entity distribution similar to Wikipedia's. Yet, it seems necessary to include alternative candidates, and the attention-based schemes can thus correct EL mistakes of Prior, as illustrated in Fig. 2. This correction leads to a resulting boost for the IE tasks, as reported in Table 2. E.g., we found that for DWIE, looking at clusters with entity mentions for which Prior makes wrong EL predictions, the AttPrior weighting scheme retrieves +3.7% more of the gold-standard annotated named entities (as opposed to just +0.6% in the clusters with correct Prior EL candidates). Perfecting the EL prediction would potentially boost IE performance even more.

Related Work
As stated earlier, we studied how to integrate (i) knowledge base information into IE, and particularly (ii) end-to-end IE combining multiple tasks (NER, relation extraction, coreference resolution), while (iii) taking an entity-centric perspective, i.e., focusing on making consistent decisions at the document level. For (i), integrating KBs into IE has been applied for individual tasks: relation classification (Poerner et al., 2020; Zhang et al., 2019; Yang and Mitchell, 2017), entity typing (Peters et al., 2019) and NER (Yamada et al., 2020). For (ii), span-based architectures (Lee et al., 2017; Fei et al., 2020) have recently been proposed. Our work unifies the KB integration concept into such a span-based IE system, in particular an entity-centric one (as per (iii)), building on Jia et al. (2019); Zaporojets et al. (2021). For the KB integration approach, we exploit entity representations trained on a hypertext corpus, as in Yamada et al. (2016); Ganea and Hofmann (2017); Yamada et al. (2020), or learned from a knowledge graph (Yang and Mitchell, 2017; Han et al., 2018; Zhang et al., 2019). Our results show that both offer complementary value for IE. Similarly to our work, Yamada and Shindo (2019) also explore using an attention-weighted combination of entity representations, but they use it to build a full document representation (with mentions having the entities as candidates) for a text classification task. In contrast, our span-based attention model is able to "inject" knowledge into each of the mentions separately, for more fine-grained downstream IE tasks that are mention-dependent, e.g., coreference resolution, relation extraction and NER.

Conclusion
We propose an end-to-end model for joint IE (NER + relation extraction + coreference resolution) incorporating entity representations from a background knowledge base (KB), using a span-based system. We find that representations built from a knowledge graph and a hypertext corpus are complementary in boosting IE performance. To combine candidate entity representations for text spans, we explore various weighting schemes: an attention-based combination is successful in combining prior frequency information from a hypertext corpus with contextual information to identify the relevant entity, and achieves the highest IE performance.