An End-to-end Model for Entity-level Relation Extraction using Multi-instance Learning

We present a joint model for entity-level relation extraction from documents. In contrast to other approaches - which focus on local intra-sentence mention pairs and thus require annotations on mention level - our model operates on entity level. To do so, a multi-task approach is followed that builds upon coreference resolution and gathers relevant signals via multi-instance learning with multi-level representations combining global entity and local mention information. We achieve state-of-the-art relation extraction results on the DocRED dataset and report the first entity-level end-to-end relation extraction results for future reference. Finally, our experimental results suggest that a joint approach is on par with task-specific learning, though more efficient due to shared parameters and training steps.


Introduction
Information extraction addresses the inference of formal knowledge (typically, entities and relations) from text. The field has recently experienced a significant boost due to the development of neural approaches (Zeng et al., 2014; Zhang and Wang, 2015; Kumar, 2017). This has led to two shifts in research: First, while earlier work has focused on sentence-level relation extraction (Hendrickx et al., 2010; Han et al., 2018; Zhang et al., 2017), more recent models extract facts from longer text passages (document level). This enables the detection of inter-sentence relations that may only be implicitly expressed and require reasoning across sentence boundaries. Current models in this area do not rely on mention-level annotations and aggregate signals from multiple mentions of the same entity.
The second shift has been towards multi-task learning: While earlier approaches tackle entity mention detection and relation extraction with separate models, recent joint models address these tasks at once (Bekoulis et al., 2018; Nguyen and Verspoor, 2019). This not only improves simplicity and efficiency, but is also commonly motivated by the fact that tasks can benefit from each other: For example, knowledge of two entities' types (such as person+organization) can boost certain relations between them (such as ceo of).
Figure 1 (example document text): The Portland Golf Club is a private golf club in the northwest United States, in suburban Portland, Oregon. The PGC is located in the unincorporated Raleigh Hills area of eastern Washington County, southwest of downtown Portland and east of Beaverton. PGC was established in the winter of 1914, when a group of nine businessmen assembled to form a new club after leaving their respective clubs. The golf club hosted the Ryder Cup matches of 1947, the first renewal in a decade, due to World War II. The U.S. team defeated Great Britain 11 to 1 in wet conditions in early November.
We follow this line of research and present JEREX ("Joint Entity-Level Relation Extractor"), a novel approach for joint information extraction. JEREX is, to our knowledge, the first approach that combines a multi-task model with entity-level relation extraction: In contrast to previous work, our model jointly learns relations and entities without relation annotations on mention level. Instead, it extracts document-level entity clusters and predicts relations between those clusters using a multi-instance learning (MIL) approach (Dietterich et al., 1997; Riedel et al., 2010; Surdeanu et al., 2012). The model is trained jointly on mention detection, coreference resolution, entity classification and relation extraction (Figure 1).
While we follow best practices for the first three tasks, we propose a novel representation for relation extraction, which combines global entity-level representations with localized mention-level ones. We present experiments on the DocRED (Yao et al., 2019) dataset for entity-level relation extraction. Though it is arguably simpler than recent graph propagation models (Nan et al., 2020) or special pre-training (Ye et al., 2020), our approach achieves state-of-the-art results.
We also report the first results for end-to-end relation extraction on DocRED as a reference for future work. In ablation studies we show that (1) combining global and local representations is beneficial, and (2) joint training appears to be on par with separate per-task models.

Related Work
Relation extraction is one of the most studied natural language processing (NLP) problems to date. Most approaches focus on classifying the relation between a given entity mention pair. Here, various neural network based models, such as RNNs (Zhang and Wang, 2015), CNNs (Zeng et al., 2014), recursive neural networks (Socher et al., 2012) or Transformer-type architectures (Wu and He, 2019), have been investigated. However, these approaches are usually limited to local, intra-sentence relations and are not suited for document-level, inter-sentence classification. Since complex relations require the aggregation of information distributed over multiple sentences, document-level relation extraction has recently drawn attention (e.g. Quirk and Poon 2017; Verga et al. 2018; Gupta et al. 2019; Yao et al. 2019). Still, these models rely on specific entity mentions to be given. While progress in the joint detection of entity mentions and intra-sentence relations has been made (Gupta et al., 2016; Bekoulis et al., 2018; Luan et al., 2018), the combination of coreference resolution with relation extraction for entity-level reasoning in a single, jointly trained model is widely unexplored.

Document-level Relation Extraction
Recent work on document-level relation extraction directly learns relations between entities (i.e. clusters of mentions referring to the same entity) within a document, requiring no relation annotations on mention level. To gather relevant information across sentence boundaries, multi-instance learning has successfully been applied to this task. In multi-instance learning, the goal is to assign labels to bags (here, entity pairs), each containing multiple instances (here, specific mention pairs). Verga et al. (2018) apply multi-instance learning to detect domain-specific relations in biological text. They compute relation scores for each mention pair of two entity clusters and aggregate these scores using a smooth max-pooling operation.
variant that is pre-trained on detecting co-referring phrases. They show that replacing RoBERTa with CorefRoBERTa improves performance on DocRED.
All these models have in common that entities and their mentions are both assumed to be given. In contrast, our approach extracts mentions, clusters them to entities, and classifies relations jointly.
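To make the bag-level aggregation mentioned above more concrete, the following minimal sketch shows one common form of smooth max-pooling, the log-sum-exp function. It is an illustration under our own assumptions, not code from any of the cited systems: the function name `smooth_max` and the toy scores are hypothetical.

```python
import math
from typing import Sequence


def smooth_max(scores: Sequence[float]) -> float:
    """Log-sum-exp over the mention-pair scores of one entity pair (one bag).

    The result is always >= max(scores) and approaches the maximum when one
    score clearly dominates. The maximum is subtracted before exponentiation
    for numerical stability.
    """
    highest = max(scores)
    return highest + math.log(sum(math.exp(s - highest) for s in scores))


# Hypothetical relation scores of three mention pairs of the same entity pair.
bag_score = smooth_max([2.0, -1.0, 0.5])
```

Unlike a hard maximum, this aggregation lets every instance in the bag contribute to the bag-level score and therefore receive a gradient during training.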

Joint Entity Mention and Relation Extraction
Prior joint models focus on the extraction of mention-level relations in sentences. Here, most approaches detect mentions by BIO (or BILOU) tagging and pair detected mentions for relation classification (e.g. Gupta et al., 2016; Bekoulis et al., 2018; Nguyen and Verspoor, 2019; Miwa and Bansal, 2016). However, these models are not able to detect relations between overlapping entity mentions. Recently, so-called span-based approaches (Lee et al., 2017) were successfully applied to this task (Luan et al., 2018; Eberts and Ulges, 2019): By enumerating each token span of a sentence, these models handle overlapping mentions by design. Sanh et al. (2019) train a multi-task model on named entity recognition, coreference resolution and relation extraction. Other work adds coreference resolution as an auxiliary task to propagate information through coreference chains. Still, these models rely on mention-level annotations and only detect intra-sentence relations between mentions, whereas our model explicitly constructs clusters of co-referring mentions and uses these clusters to detect complex entity-level relations in long documents using multi-instance reasoning.

Approach
JEREX processes documents containing multiple sentences and extracts entity mentions, clusters them to entities, and outputs types and relations on entity level. JEREX consists of four task-specific components, which are based on the same encoder and mention representations, and are trained in a joint manner. An input document is first tokenized, yielding a sequence of $n$ byte-pair encoded (BPE) (Sennrich et al., 2016) tokens. We then use the pre-trained Transformer-type network BERT (Devlin et al., 2019) to obtain a contextualized embedding sequence $(e_1, e_2, \dots, e_n)$ of the document. Since our goal is to perform end-to-end relation extraction, neither entities nor their corresponding mentions in the document are known at inference time.

Model Architecture
We suggest a multi-level model: First, we localize all entity mentions in the document (a) by a span-based approach (Lee et al., 2017). After this, detected mentions are clustered into entities by coreference resolution (b). We then classify the type (such as person or company) of each entity cluster by a fusion over local mention representations (entity classification) (c). Finally, relations between entities are extracted by reasoning over mention pairs (d). The full model architecture is illustrated in Figure 2.
(a) Entity Mention Localization Here our model performs a search over all document token subsequences (or spans). In contrast to BIO/BILOU-based approaches for entity mention localization, span-based approaches are able to detect overlapping mentions. Let $s := (e_i, e_{i+1}, \dots, e_{i+k})$ denote an arbitrary candidate span. Following Eberts and Ulges (2019), we first obtain a span representation by max-pooling the span's token embeddings:
$$e(s) := \text{max-pool}(e_i, e_{i+1}, \dots, e_{i+k}) \tag{1}$$
Our mention classifier takes the span representation $e(s)$ as well as a span size embedding $w^s_{k+1}$ (Lee et al., 2017) as meta information. We perform binary classification and use a sigmoid activation to obtain a probability for $s$ to constitute an entity mention:
$$\hat{y}^s := \sigma(\text{FFNN}^s(e(s) \circ w^s_{k+1})) \tag{2}$$
where $\circ$ denotes concatenation and FFNN$^s$ is a two-layer feedforward network with an inner ReLU activation. Span classification is carried out on all token spans up to a fixed length $L$. We apply a filter threshold $\alpha_s$ on the confidence scores, retaining all spans with $\hat{y}^s \geq \alpha_s$ and leaving a set $S$ of spans supposedly constituting entity mentions.
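The span search described in (a) can be sketched as follows. This is a minimal illustration in plain Python, not the released implementation: the function names, the toy embeddings and the confidence scores are all hypothetical, and the learned classifier FFNN$^s$ is replaced by made-up scores.

```python
from typing import Dict, List, Tuple

Vector = List[float]
Span = Tuple[int, int]  # (start, end) token indices, end exclusive


def max_pool(vectors: List[Vector]) -> Vector:
    """Element-wise maximum over a non-empty list of equally sized vectors."""
    return [max(column) for column in zip(*vectors)]


def enumerate_spans(n_tokens: int, max_len: int) -> List[Span]:
    """All token spans of the document with at most max_len tokens."""
    return [
        (start, start + length)
        for start in range(n_tokens)
        for length in range(1, max_len + 1)
        if start + length <= n_tokens
    ]


def filter_spans(spans: List[Span], scores: List[float], alpha_s: float) -> List[Span]:
    """Keep only spans whose mention probability reaches the threshold."""
    return [span for span, score in zip(spans, scores) if score >= alpha_s]


# Toy document with 4 tokens and 2-dimensional embeddings (made-up values).
embeddings = [[0.1, 0.9], [0.7, 0.2], [0.3, 0.3], [0.5, 0.8]]
spans = enumerate_spans(len(embeddings), max_len=2)
span_reprs: Dict[Span, Vector] = {
    (start, end): max_pool(embeddings[start:end]) for start, end in spans
}

# Hypothetical classifier outputs, one probability per candidate span.
scores = [0.9, 0.1, 0.2, 0.95, 0.3, 0.4, 0.86]
mentions = filter_spans(spans, scores, alpha_s=0.85)
```

Because every span is scored independently, overlapping spans such as (1, 3) and (2, 3) can both be retained, which is not possible with a single BIO tag sequence.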
(b) Coreference Resolution Entity mentions referring to the same entity (e.g. "Elizabeth II." and "the Queen") can be scattered throughout the input document. To later extract relations on entity level, local mentions need to be grouped to document-level entity clusters by coreference resolution. We use a simple mention-pair (Soon et al., 2001) model: Our component classifies pairs $(s_1, s_2) \in S \times S$ of detected entity mentions as coreferent or not, by combining the span representations $e(s_1)$ and $e(s_2)$ with an edit distance embedding $w^c_d$: We compute the Levenshtein distance (Levenshtein, 1966) between spans, $d := D(s_1, s_2)$, and use a learned embedding $w^c_d$. A mention pair representation $x^c$ is constructed by concatenation:
$$x^c := e(s_1) \circ e(s_2) \circ w^c_d \tag{3}$$
Similar to span classification, we conduct binary classification using a sigmoid activation, obtaining a similarity score between the two mentions:
$$\hat{y}^c := \sigma(\text{FFNN}^c(x^c)) \tag{4}$$
where FFNN$^c$ follows the same architecture as FFNN$^s$. We construct a similarity matrix $C \in \mathbb{R}^{m \times m}$ (with $m$ referring to the document's overall number of mentions) containing the similarity scores between every mention pair. By applying a filter threshold $\alpha_c$, we cluster mentions using complete linkage (Müllner, 2011), yielding a set $E$ containing clusters of entity mentions. We refer to these clusters as entities or entity clusters in the following.
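The two non-neural ingredients of (b), the edit distance and the threshold-based complete-linkage clustering, can be sketched as below. This is a simplified illustration under our own assumptions: the distance is computed over the characters of the mention strings, the clustering is a naive agglomerative loop rather than an optimized library routine, and the similarity matrix contains made-up scores.

```python
from typing import List


def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance via dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def complete_linkage(similarity: List[List[float]], alpha_c: float) -> List[List[int]]:
    """Merge clusters while their least similar member pair reaches alpha_c."""
    clusters = [[i] for i in range(len(similarity))]

    def linkage(c1: List[int], c2: List[int]) -> float:
        # Complete linkage: two clusters are only as similar as their
        # least similar pair of members.
        return min(similarity[i][j] for i in c1 for j in c2)

    while len(clusters) > 1:
        best_score, (a, b) = max(
            (linkage(clusters[a], clusters[b]), (a, b))
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        if best_score < alpha_c:
            break
        clusters[a] = sorted(clusters[a] + clusters[b])
        del clusters[b]
    return clusters


# Hypothetical similarity scores between four detected mentions.
C = [
    [1.00, 0.90, 0.20, 0.10],
    [0.90, 1.00, 0.30, 0.20],
    [0.20, 0.30, 1.00, 0.95],
    [0.10, 0.20, 0.95, 1.00],
]
entities = complete_linkage(C, alpha_c=0.85)
```

With complete linkage, a mention only joins a cluster if it is sufficiently similar to every mention already in it, which guards against chains of loosely related mentions being merged into one entity.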
(c) Entity Classification Next, we map each entity to a type such as location or person: We first fuse the mention representations of an entity cluster $\{s_1, s_2, \dots, s_t\} \in E$ by max-pooling:
$$x^e := \text{max-pool}(e(s_1), e(s_2), \dots, e(s_t)) \tag{5}$$
Entity classification is then carried out on the entity representation $x^e$, allowing the model to draw information from mentions spread across different parts of the document. $x^e$ is fed into a softmax classifier, yielding a probability distribution over the entity types. We assign the highest-scoring type to the entity.
(d) Relation Classification Our final component assigns relation types to pairs of entities. Note that the directionality, i.e. which entity constitutes the head/tail of the relation, needs to be inferred, and that the input document can express multiple relations between different mentions of the same entity pair. Let $R$ denote a set of pre-defined relation types. The relation classifier processes each entity pair $(e_1, e_2) \in E \times E$, estimating which, if any, relations from $R$ are expressed between these entities.
To do so, we score every candidate triple $(e_1, r_i, e_2)$, expressing that $e_1$ (as head) is in relation $r_i$ with $e_2$ (as tail). We design two types of relation classifiers: A global relation classifier, serving as a baseline, which consumes the entity cluster representations $x^e$, and a multi-instance classifier, which assumes that certain entity mention pairs support specific relations and synthesizes this information into an entity-pair level representation.
Global Relation Classifier (GRC) The global classifier builds upon the max-pooled entity cluster representations $x^e_1$ and $x^e_2$ of an entity pair $(e_1, e_2)$. We further embed the corresponding entity types ($w^e_1$ / $w^e_2$), which was shown to be beneficial in prior work (Yao et al., 2019), and compute an entity-pair representation by concatenating the two entity representations with the two type embeddings. This representation is fed into a two-layer FFNN (similar to FFNN$^s$), mapping it to the number of relation types $\#R$. The final layer features sigmoid activations for multi-label classification and assigns any relation type whose score exceeds a threshold $\alpha_r$.
Multi-instance Relation Classifier (MRC) In contrast to the global classifier (GRC), the multi-instance relation classifier operates on mention level: Since only entity-level labels are available, we treat entity mention pairs as latent variables and estimate relations by a fusion over these mention pairs. For any pair of entity clusters $e_1 = \{s^1_1, s^1_2, \dots, s^1_{t_1}\}$ and $e_2 = \{s^2_1, s^2_2, \dots, s^2_{t_2}\}$, we compute a mention-pair representation for any $(s_1, s_2) \in e_1 \times e_2$. This representation is obtained by concatenating the global entity embeddings (Equation (5)) with the mentions' local span representations (Equation (1)):
$$u(s_1, s_2) := e(s_1) \circ e(s_2) \circ x^e_1 \circ x^e_2 \tag{9}$$
Further, as we expect close-by mentions to be stronger indicators of relations, we add meta embeddings for the distances $d_s$, $d_t$ between the two mentions, both in sentences ($d_s$) and in tokens ($d_t$).
In addition, following Eberts and Ulges (2019), the max-pooled context between the two mentions, $c(s_1, s_2)$, is added. This localized context provides a more focused view on the document and was found to be especially beneficial for long, and therefore noisy, inputs:
$$u'(s_1, s_2) := u(s_1, s_2) \circ c(s_1, s_2) \circ w^r_{d_s} \circ w^r_{d_t} \tag{10}$$
This mention-pair representation is mapped by a single feed-forward layer to the original token embedding size (768), yielding $u''(s_1, s_2)$. These focused representations are then combined by max-pooling:
$$x^r = \text{max-pool}(\{u''(s_1, s_2) \mid s_1 \in e_1, s_2 \in e_2\}) \tag{12}$$
Akin to GRC, we concatenate $x^r$ with the entity type embeddings $w^e_1$ / $w^e_2$ and apply a two-layer FFNN (again, similar to FFNN$^s$). Note that for both classifiers (GRC/MRC), we need to score both $(e_1, r_i, e_2)$ and $(e_2, r_i, e_1)$ to infer the direction of asymmetric relations.
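The multi-instance fusion at the core of MRC can be illustrated with a short sketch. It only covers the data flow described above (one representation per mention pair, followed by max-pooling over all pairs of an entity pair) and makes several simplifying assumptions: the learned feed-forward layer and the distance embeddings are omitted, list concatenation stands in for $\circ$, and all names and toy vectors are hypothetical.

```python
from itertools import product
from typing import Dict, List, Tuple

Vector = List[float]


def max_pool(vectors: List[Vector]) -> Vector:
    """Element-wise maximum over a non-empty list of equally sized vectors."""
    return [max(column) for column in zip(*vectors)]


def mention_pair_repr(s1: Vector, s2: Vector, x_e1: Vector, x_e2: Vector,
                      context: Vector) -> Vector:
    """Local mention embeddings, global entity embeddings and local context."""
    return s1 + s2 + x_e1 + x_e2 + context


def entity_pair_repr(head_mentions: List[Vector], tail_mentions: List[Vector],
                     contexts: Dict[Tuple[int, int], Vector]) -> Vector:
    """Fuse all mention pairs of (head, tail) into one entity-pair vector."""
    x_e1 = max_pool(head_mentions)  # global head entity representation
    x_e2 = max_pool(tail_mentions)  # global tail entity representation
    pair_reprs = [
        mention_pair_repr(head_mentions[i], tail_mentions[j], x_e1, x_e2,
                          contexts[(i, j)])
        for i, j in product(range(len(head_mentions)), range(len(tail_mentions)))
    ]
    return max_pool(pair_reprs)


# Toy entity pair: the head has two mentions, the tail has one (made-up values).
head = [[1.0, 0.0], [0.0, 2.0]]
tail = [[0.5, 0.5]]
contexts = {(0, 0): [0.1], (1, 0): [0.9]}  # hypothetical pooled contexts
x_r = entity_pair_repr(head, tail, contexts)
```

In the full model, the resulting vector would additionally be concatenated with the entity type embeddings and passed to the relation classifier, once per direction of the entity pair.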

Training
We perform supervised multi-task training, where each training document features ground truth for all four subtasks (mention localization, coreference resolution, as well as entity and relation classification). We optimize the joint loss of all four components:
$$\mathcal{L} := \beta_s \cdot \mathcal{L}_s + \beta_c \cdot \mathcal{L}_c + \beta_e \cdot \mathcal{L}_e + \beta_r \cdot \mathcal{L}_r \tag{13}$$
$\mathcal{L}_s$, $\mathcal{L}_c$ and $\mathcal{L}_r$ denote the binary cross entropy losses of the span, coreference and relation classifiers. We use a cross entropy loss ($\mathcal{L}_e$) for the entity classifier. A batch is formed by drawing positive and negative samples from a single document for all components. We found such a single-pass approach to offer significant speed-ups both in learning and inference:
• Entity mention localization: We utilize all ground truth entity mentions $S_{gt}$ of a document as positive training samples, and sample a fixed number $N_s$ of random non-mention spans up to a pre-defined length $L_s$ as negative samples. Note that we only train and evaluate on the full tokens according to the dataset's tokenization, i.e. not on byte-pair encoded tokens, to limit computational complexity. Also, we only sample intra-sentence spans as negative samples. Since we found intra-mention spans to be especially challenging ("New York" versus "New York City"), we sample up to $N_s/2$ intra-mention spans as negative samples.
• Coreference resolution: The coreference classifier is trained on all span pairs drawn from ground truth entity clusters E gt as positive samples. We further sample a fixed number N c of pairs of random ground truth entity mentions that do not belong to the same cluster as negative samples.
• Entity classification: Since the entity classifier only receives clusters that supposedly constitute an entity during inference, it is trained on all ground truth entity clusters of a document.
• Relation classification: Here we use ground truth relations between entity clusters as positive samples and N r negative samples drawn from E gt ×E gt that are unrelated according to the ground truth.
Each component's loss is obtained by averaging over all samples. We learn the weights and biases of sub-component specific layers as well as the meta embeddings during training. BERT is fine-tuned in the process.
Table 1: Test set evaluation results of our multi-level end-to-end system JEREX on DocRED (using the end-to-end split). We either train the model jointly on all four sub-components (left) or arrange separately trained models in a pipeline (right) (* joint results are for MRC except for the last row).
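As a worked illustration of the joint training objective, the sketch below averages each component's loss over its samples and combines the four losses as a weighted sum. It is a plain-Python approximation for exposition only: the function names are hypothetical, the default weights are the values reported in the hyperparameter section of this paper, and an actual implementation would compute these losses on tensors inside a deep learning framework.

```python
import math
from typing import Sequence


def binary_cross_entropy(probabilities: Sequence[float], labels: Sequence[int],
                         eps: float = 1e-12) -> float:
    """Binary cross entropy, averaged over all samples."""
    total = 0.0
    for p, y in zip(probabilities, labels):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)


def cross_entropy(distributions: Sequence[Sequence[float]], gold: Sequence[int],
                  eps: float = 1e-12) -> float:
    """Cross entropy of predicted type distributions, averaged over entities."""
    total = -sum(math.log(max(dist[g], eps)) for dist, g in zip(distributions, gold))
    return total / len(gold)


def joint_loss(loss_s: float, loss_c: float, loss_e: float, loss_r: float,
               beta_s: float = 1.0, beta_c: float = 1.0,
               beta_e: float = 0.25, beta_r: float = 1.0) -> float:
    """Weighted sum of the span, coreference, entity and relation losses."""
    return beta_s * loss_s + beta_c * loss_c + beta_e * loss_e + beta_r * loss_r


# Made-up predictions for one document.
loss_s = binary_cross_entropy([0.9, 0.2, 0.7], [1, 0, 1])  # span classifier
loss_c = binary_cross_entropy([0.8, 0.1], [1, 0])          # coreference pairs
loss_e = cross_entropy([[0.7, 0.2, 0.1]], [0])             # entity types
loss_r = binary_cross_entropy([0.6, 0.3], [1, 0])          # relation scores
total = joint_loss(loss_s, loss_c, loss_e, loss_r)
```

Since all four losses are computed from the same encoded document, a single backward pass through the shared encoder updates it with respect to every sub-task.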

Experiments
We evaluate JEREX on the DocRED dataset (Yao et al., 2019). DocRED is the most diverse relation extraction dataset to date (6 entity and 96 relation types). It includes over 5,000 documents, each consisting of multiple sentences. According to Yao et al. (2019), DocRED requires multiple types of reasoning, such as logical or common-sense reasoning, to infer relations. Note that previous work only uses DocRED for relation extraction (which equals our relation classifier component) and assumes entities to be given (e.g. Wang et al. 2019; Nan et al. 2020). On the other hand, DocRED is exhaustively annotated with mentions, entities and entity-level relations, making it suitable for end-to-end systems. Therefore, we evaluate JEREX both as a relation classifier (to compare it with the state of the art) and as a joint model (as reference for future work on joint entity-level relation extraction).
While prior joint models focus on mention-level relations (e.g. Gupta et al. 2016; Bekoulis et al. 2018; Chi et al. 2019), we extend the strict evaluation setting to entity level: A mention is counted as correct if its span matches a ground truth mention span. An entity cluster is considered correct if it matches the ground truth cluster exactly and the corresponding mention spans are correct. Likewise, an entity is considered correct if the cluster as well as the entity type matches a ground truth entity. Lastly, we count a relation as correct if its argument entities as well as the relation type are correct. We measure precision, recall and micro-F1 for each sub-task and report micro-averaged scores.
Dataset split The original DocRED dataset is split into a train (3,053 documents), dev (1,000) and test (1,000) set. However, test relation labels are hidden and evaluation requires the submission of results via CodaLab. To evaluate end-to-end systems, we form a new split by merging train and dev. We randomly sample a train (3,008 documents), dev (300 documents) and test set (700 documents). Note that we removed 45 documents since they contained wrongly annotated entities with mentions of different types. Table 2 contains statistics of our end-to-end split. We release the split as a reference for future work.
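The strict entity-level matching described in this section can be sketched with plain set operations. The representation chosen here is our own assumption for illustration: an entity is a pair of a frozen set of mention spans and a type, a relation is a (head, relation type, tail) triple, and the type and relation labels in the toy data are made up.

```python
from typing import Iterable, Set, Tuple


def micro_prf(documents: Iterable[Tuple[Set, Set]]) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall and F1 over (predicted, gold) set pairs.

    A prediction only counts as correct if it is identical to a gold item.
    """
    tp = n_pred = n_gold = 0
    for predicted, gold in documents:
        tp += len(predicted & gold)
        n_pred += len(predicted)
        n_gold += len(gold)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Gold annotation of one toy document: two entities and one relation.
gold_a = (frozenset({(0, 3), (10, 11)}), "ORG")
gold_b = (frozenset({(5, 6)}), "LOC")
gold_relations = {(gold_a, "located in", gold_b)}

# Prediction: the cluster of the second entity is right, but its type is wrong.
pred_a = (frozenset({(0, 3), (10, 11)}), "ORG")
pred_b = (frozenset({(5, 6)}), "PER")
pred_relations = {(pred_a, "located in", pred_b)}

entity_scores = micro_prf([({pred_a, pred_b}, {gold_a, gold_b})])
relation_scores = micro_prf([(pred_relations, gold_relations)])
```

The example shows how errors propagate under this setting: the mistyped entity makes the predicted relation incorrect as well, even though the relation type and both clusters are right.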
Hyperparameters We use BERT BASE (cased) for document encoding, an attention-based language model pre-trained on English text (Devlin et al., 2019). Hyperparameters were tuned on the end-to-end dev set: We adopt several settings from Devlin et al. (2019), including the usage of the Adam optimizer with a linear warmup and linear decay learning rate schedule, a peak learning rate of 5e-5 and application of dropout with a rate of 0.1 throughout the model. We set the size of meta embeddings ($w^s$, $w^c$, $w^e$, $w^r_{d_s}$, $w^r_{d_t}$) to 25 and the number of epochs to 20. Performance is measured once per epoch on the dev set, out of which the best performing model is used for the final evaluation on the test set. A grid search is performed for the mention, coreference and relation filter thresholds ($\alpha_s = 0.85$, $\alpha_c = 0.85$, $\alpha_r$(GRC) $= 0.55$, $\alpha_r$(MRC) $= 0.6$) with a step size of 0.05. The number of negative samples ($N_s = N_c = N_r = 200$) and the sub-task loss weights ($\beta_s = \beta_c = \beta_r = 1$, $\beta_e = 0.25$) are manually tuned. Note that some documents in DocRED exceed the maximum context size of BERT (512 BPE tokens). In this case we train the remaining position embeddings from scratch.
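For illustration, a linear warmup and linear decay schedule with the peak learning rate stated above could look as follows. The warmup fraction is not specified in this section, so the default of 10% of all steps is purely an assumption, as are the function name and the step counts.

```python
def learning_rate(step: int, total_steps: int, peak_lr: float = 5e-5,
                  warmup_fraction: float = 0.1) -> float:
    """Learning rate at a given training step.

    Rises linearly from 0 to peak_lr during warmup, then decays linearly
    back to 0 at total_steps. warmup_fraction is an assumed value.
    """
    warmup_steps = max(1, int(total_steps * warmup_fraction))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    decay_steps = max(1, total_steps - warmup_steps)
    return peak_lr * max(0.0, (total_steps - step) / decay_steps)


# Schedule sampled at a few points of a hypothetical run with 1,000 steps.
schedule = [learning_rate(step, total_steps=1000) for step in (0, 50, 100, 550, 1000)]
```

In practice such a schedule is usually provided by the training framework, so this function only serves to make the shape of the curve explicit.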

End-to-End Relation Extraction
JEREX is trained and evaluated on the end-to-end dataset split (see Table 2). We perform 5 runs for each experiment and report the averaged results. To study the effects of joint training, we experiment with two approaches: (a) All four sub-components are trained jointly in a single model as described in Section 3.2 and (b) we construct a pipeline system by training each task separately and not sharing the document encoder. Table 1 illustrates the results for the joint (left) and pipeline (right) approach. As described in Section 3, each sub-task builds on the results of the previous component during inference. We observe the biggest performance drop for the relation classification task, underlining the difficulty in detecting document-level relations. Furthermore, the multi-instance based relation classifier (MRC) outperforms the global relation classifier (GRC) by about 2.4% F1 score. We reason that the fusion of local evidence by multi-instance learning helps the model to focus on appropriate document sections and alleviates the impact of noise in long documents. Moreover, we found the multi-instance selection to offer good interpretability, usually selecting the most relevant instances (see Figure 3 for examples). Overall, we observe a comparable performance by joint training versus using the pipeline system. This is also confirmed by the results reported in Table 4, where we evaluate the four components independently, i.e. each component receives ground truth samples from the previous step in the hierarchy (e.g. ground truth mentions for coreference resolution). Again, we observe the performance difference between the joint and pipeline model to be negligible. This shows that it is not necessary to build separate models for each task, which would result in training and inference overhead due to multiple expensive BERT passes. Instead, a single neural model is able to jointly learn all tasks necessary for document-level relation extraction, therefore easing training, inference and maintenance.
Table 4: Single-task performance of the joint model (left) and separate models (right) on the end-to-end split (* joint results are for MRC except for the last row).

Relation Extraction
We also compare our model with the state of the art on DocRED's relation extraction task. Here, entity clusters are assumed to be given. We train and test our relation classification component on the original DocRED dataset split. Since test set labels are hidden, we submit the best out of 5 runs on the development set via CodaLab to retrieve the test set results. Table 3 includes previously reported results from current state-of-the-art models. Note that our global classifier (GRC) is similar to the baseline by Yao et al. (2019). However, we replace mention span averaging with max-pooling and also choose max-pooling to aggregate mentions into an entity representation, yielding considerable improvement over the baseline. Using the multi-instance classifier (MRC) instead further improves performance by about 4.5%. Here our model also outperforms complex methods based on graph attention networks (Nan et al., 2020) or specialized pre-training (Ye et al., 2020), achieving a new state-of-the-art result on DocRED's relation extraction task.
Figure 3: Two example documents of the DocRED dataset. Highlighted are relations "creator" between "Queequeg" and "Herman Melville" (top) and "developer" between "Shadowrun Returns" and "Harebrained Schemes" (bottom). Bordered pairs are the top selections of the multi-instance relation classifier.
Top document: Queequeg is a fictional character in the 1851 novel Moby-Dick by American author Herman Melville. The son of a South Sea chieftain who left home to explore the world, Queequeg is the first principal character encountered by the narrator, Ishmael. The quick friendship and relationship of equality between the tattooed cannibal and the white sailor shows Melville's basic theme of shipboard democracy and racial diversity...
Bottom document: Shadowrun:Hong Kong is a turn-based tactical role-playing video game set in the Shadowrun universe. It was developed and published by Harebrained Schemes, who previously developed Shadowrun Returns and its standalone expansion. It includes a new single-player campaign and also shipped with a level editor that lets players create their own Shadowrun campaigns and share them with other players. In January 2015, Harebrained Schemes launched a Kickstarter campaign in order to fund additional features and content they wanted to add to the game, but determined would not have been possible with their current budget. The initial funding goal of US $100,000 was met in only a few hours. The campaign ended the following month, receiving over $1.2 million. The game was developed with an improved version of the engine used with Shadowrun Returns and Dragonfall. Harebrained Schemes decided to develop the game only for Microsoft Windows, OS X, and Linux, ...

Ablation Studies
We perform several ablation studies to evaluate the contributions of our proposed multi-instance relation classifier enhancements: We remove either the global entity representations $x^e_1$, $x^e_2$ (Equation 5) (a) or the localized context representation $c(s_1, s_2)$ (Equation 10) (b). The performance drops by about 0.66% F1 score when global entity representations are omitted, indicating that multi-instance reasoning benefits from the incorporation of entity-level context. When the localized context representation is omitted, performance is reduced by about 0.90%, confirming the importance of guiding the model to relevant input sections. Finally, we limit the model to fusing only intra-sentence mention pairs (c). In case no such instance exists for an entity pair, the closest (in token distance) mention pair is selected. Obviously, this modification reduces computational complexity and memory consumption, especially for large documents. Nevertheless, while we observe intra-sentence pairs to cover most relevant signals, exhaustively pairing all mentions of an entity pair yields an improvement of 0.67%.

Model                            F1
Relation Classification (MRC)    59.76
- (a) Entity Representations     59.10
- (b) Localized Context          58.85
- (c) Exhaustive Pairing         59.09
Table 5: Ablation studies for the multi-level relation classifier (MRC) using the end-to-end split. We either remove global entity representations (a), the localized context (b) or only use intra-sentence mention pairs (c). The results are averaged over 5 runs.
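The pair selection used in ablation (c) can be sketched as follows. This is a simplified illustration under our own assumptions: the `Mention` structure, the function names and the definition of token distance (number of tokens between two non-overlapping spans) are hypothetical and may differ from the actual implementation.

```python
from itertools import product
from typing import List, NamedTuple, Tuple


class Mention(NamedTuple):
    start: int     # index of the first token
    end: int       # index after the last token (exclusive)
    sentence: int  # index of the sentence containing the mention


def token_distance(m1: Mention, m2: Mention) -> int:
    """Number of tokens between two mentions (0 if they touch or overlap)."""
    if m1.end <= m2.start:
        return m2.start - m1.end
    if m2.end <= m1.start:
        return m1.start - m2.end
    return 0


def select_pairs(head: List[Mention], tail: List[Mention]) -> List[Tuple[Mention, Mention]]:
    """Intra-sentence mention pairs, or the single closest pair as a fallback."""
    pairs = list(product(head, tail))
    intra_sentence = [(a, b) for a, b in pairs if a.sentence == b.sentence]
    if intra_sentence:
        return intra_sentence
    return [min(pairs, key=lambda pair: token_distance(*pair))]


# Entity pair with one shared sentence: only that mention pair is kept.
head = [Mention(0, 2, 0), Mention(30, 31, 2)]
tail = [Mention(5, 6, 0), Mention(40, 42, 3)]
same_sentence = select_pairs(head, tail)

# Entity pair without a shared sentence: fall back to the closest pair.
fallback = select_pairs([Mention(0, 2, 0)], [Mention(20, 21, 1), Mention(50, 51, 3)])
```

Compared to exhaustive pairing, the number of mention pairs to encode drops from the full product of both clusters to the few pairs that share a sentence, which is where the savings in computation and memory come from.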

Conclusions
We have introduced JEREX, a novel multi-task model for end-to-end relation extraction. In contrast to prior systems, JEREX combines entity mention localization with coreference resolution to extract entity types and relations on an entity level. We report first results for entity-level, end-to-end relation extraction as a reference for future work. Furthermore, we achieve state-of-the-art results on the DocRED relation extraction task by enhancing multi-instance reasoning with global entity representations and a localized context, outperforming several more complex solutions. We showed that training a single model jointly on all subtasks instead of using a pipeline approach performs roughly on par, eliminating the need to train separate models and accelerating inference. One of the remaining shortcomings lies in false positive relations, i.e. relations that are plausible given the entities' types but are not actually expressed in the document. Exploring options to reduce these false positive predictions seems to be an interesting challenge for future work.