Sequential Cross-Document Coreference Resolution

Relating entities and events in text is a key component of natural language understanding. Cross-document coreference resolution, in particular, is important for the growing interest in multi-document analysis tasks. In this work we propose a new model that extends the efficient sequential prediction paradigm for coreference resolution to cross-document settings and achieves competitive results for both entity and event coreference while providing strong evidence of the efficacy of both sequential models and higher-order inference in cross-document settings. Our model incrementally composes mentions into cluster representations and predicts links between a mention and the already constructed clusters, approximating a higher-order model. In addition, we conduct extensive ablation studies that provide new insights into the importance of various inputs and representation types in coreference.


Introduction
Relating entities and events in text is a key component of natural language understanding: for example, determining whether two news articles describing hurricanes are referring to the same hurricane event. A crucial component of answering such questions is reasoning about groups of entities and events across multiple documents.
The goal of coreference resolution is to compute these clusterings of entities or events from extracted spans of text. While within-document coreference has been studied extensively (e.g., Lee et al. (2017)), there has been relatively less work on the cross-document task. However, growing interest in multi-document applications, such as summarization (e.g., Liu and Lapata (2019); Fabbri et al. (2019)) and reading comprehension (e.g., Yan et al. (2019); Welbl et al. (2018)), highlights the importance of developing efficient and accurate cross-document coreference models to minimize error propagation in complex reasoning tasks.
* Work done during internship at Amazon AI.
In this work we focus on cross-document coreference (CDCR), which implicitly requires within-document coreference (WDCR), and propose a new model that improves both coreference performance and computational complexity. Recent advances in within-document entity coreference resolution have shown that sequential prediction (i.e., making coreference predictions from left to right in a text) achieves strong performance (Lee et al., 2017) with lower computational costs. This paradigm is also well suited to real-world streaming settings, where new documents are received every day, since it can easily find coreferring events and entities in already processed documents, while most non-sequential models would require a full rerun. In this work, we show how this technique can first be extended to cross-document entity coreference and then adapted to cross-document event coreference.
Our method is also able to take advantage of the history of previously made coreference decisions, approximating a higher-order model (i.e., operating on mentions as well as structures over mentions). Specifically, for every mention, a coreference decision is made not over a set of individual mentions but rather over the current state of coreference clusters. In this way, the model is able to use knowledge about the mentions currently in a cluster when making its decisions. While higher-order models have achieved state-of-the-art performance on entity coreference, they have been used infrequently for event coreference. For example, Yang et al. (2015) use one Chinese restaurant process for WDCR and then a second for CDCR over the within-document clusters. In contrast, our models make within- and cross-document coreference decisions in a single pass, taking into account all prior coreference decisions at each step.
Our contributions are: (1) we present the first study of sequential modeling for cross-document entity and event coreference and achieve competitive performance with a large reduction in computation time, (2) we conduct extensive ablation studies both on input information and model features, providing new insights for future models.

Related Work
Prior work on coreference resolution is generally split into either entity or event coreference. Entity coreference is relatively well studied (Ng, 2010), with the largest focus on within-document coreference (e.g., Raghunathan et al. (2010); Fernandes et al. (2012); Durrett and Klein (2013); Björkelund and Kuhn (2014); Martschat and Strube (2015); Wiseman et al. (2016); Clark and Manning (2016); Kantor and Globerson (2019)). Recently, Joshi et al. (2019) showed that pre-trained language models, in particular BERT-large (Devlin et al., 2019), achieve state-of-the-art performance on entity coreference. In contrast to prior work on entity coreference, which is primarily sequential (i.e., left to right) and only within-document, our work extends the sequential paradigm to cross-document coreference and also adds incremental candidate composition.
There has been less work on event coreference since the task is generally considered harder. This is largely due to the more complex nature of event mentions (i.e., a trigger and arguments) and their syntactic diversity (e.g., both verb phrases and noun phrases). Prior work on event coreference typically involves pairwise scoring between mentions followed by a standard clustering algorithm to predict coreference links (Pandian et al., 2018; Choubey and Huang, 2017; Cremisini and Finlayson, 2020; Meged et al., 2020; Yu et al., 2020b; Cattan et al., 2020), classification over a fixed number of clusters (Kenyon-Dean et al., 2018), and template-based methods (Cybulska and Vossen, 2015b,a). While pairwise scoring (e.g., graph-based models, see §3.7) with clustering is effective, it requires tuned thresholds (for the clustering algorithm) and cannot use already predicted scores to inform later ones, since all scores are predicted independently. To the best of our knowledge, our work is the first to apply sequential models to cross-document event coreference.
Although a few previous works attempt to use information about existing clusters through incremental construction (Yang et al., 2015; Lee et al., 2012) or argument sharing (Barhom et al., 2019; Choubey and Huang, 2017), these either continue to rely on pairwise decisions or use shallow, non-contextualized features that have limited efficacy. For example, Xu and Choi (2020) explore a variant of cluster merging for WD entity coreference only that still relies on scores between individual mentions. Additional recent work on WD coreference investigates incremental construction of clusters for prediction (Xia et al., 2020; Toshniwal et al., 2020) and cluster ranking (Yu et al., 2020a). In contrast, our method makes coreference decisions between a mention and all existing coreference clusters across multiple documents using contextualized features and so takes advantage of interdependencies between mentions, even across documents, while making all decisions in one pass.

Overview and Task Definition
We propose a new sequential model for cross-document coreference resolution (see Figure 1) that predicts links between mentions and incrementally constructed coreference clusters computed across multiple documents. In the following sections, we will first describe our model for entity coreference (§3.2–3.5), then discuss adaptations to event coreference (§3.6), and finally conduct a time comparison with prior models (§3.7).
The goal of entity coreference is to determine whether two entity mentions refer to the same real-world entity, with an analogous definition for event coreference. Formally, define an entity mention x = ⟨e, V⟩, where e is an entity and V is a set of events in which e participates. We adopt the definition of an event as "a specific occurrence of something that happens" (Cybulska and Vossen, 2014). More specifically, V = [⟨t_1, r_1⟩, . . . , ⟨t_n, r_n⟩], where t_i is an event trigger and r_i ∈ R is the role e takes in the event with trigger t_i, from a fixed set of argument roles R.

Entity Mention Representation
Figure 1: Our model using sequential prediction with incremental clustering for cross-document entity coreference.

To construct a representation for entity mention x, we first embed the entity e, along with its context, as h_e using the embeddings from BERT (Devlin et al., 2019) of the start and end sub-word tokens of the entity span. We similarly embed each event v_i ∈ V as h_{v_i}. Then we compute an aggregated event representation h_v using a BiLSTM (Hochreiter and Schmidhuber, 1997) over all of the h_{v_i}, followed by mean-pooling. Finally, we combine the entity representation and event representations using an affine transformation to obtain the full mention representation h_x.
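A minimal numpy sketch of this composition, with toy dimensions; for brevity the BiLSTM aggregator is replaced by plain mean-pooling, and all names, shapes, and parameter values are illustrative rather than taken from the actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy embedding size (BERT-base would give 768)

def embed_span(start_emb, end_emb):
    # Mention embedding from the start/end sub-word token embeddings.
    return np.concatenate([start_emb, end_emb])  # shape (2d,)

def mention_representation(h_e, event_embs, W, b):
    # Aggregate the event embeddings h_{v_i}; the paper uses a BiLSTM
    # followed by mean-pooling, simplified here to mean-pooling alone.
    h_v = np.mean(event_embs, axis=0) if event_embs else np.zeros(2 * d)
    # Affine combination of entity and aggregated event representations.
    return W @ np.concatenate([h_e, h_v]) + b

h_e = embed_span(rng.standard_normal(d), rng.standard_normal(d))    # (16,)
events = [embed_span(rng.standard_normal(d), rng.standard_normal(d))
          for _ in range(3)]
W = rng.standard_normal((2 * d, 4 * d))
b = rng.standard_normal(2 * d)
h_x = mention_representation(h_e, events, W, b)
assert h_x.shape == (2 * d,)
```

Mentions without associated events fall back to a zero event representation here; how the actual model handles that case is not specified in the text.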

Incremental Candidate Composition
Let L_e = {P_1, . . . , P_n} be a set of coreference clusters over the antecedents of mention x_i. We compute a candidate cluster representation h_P for each set P of coreferring entity antecedents in L_e. In a similar manner to composition functions in neural dependency parsing, which incrementally combine head-word and modifier information to construct a subtree representation (Dyer et al., 2015, 2016; de Lhoneux et al., 2019), we incrementally combine document- and mention-level information to form a complete candidate cluster representation h_P. That is, for each x_j ∈ P, we combine h_{x_j} and h_{CLS_j}, the CLS token embedding from the document containing x_j, using a non-linear transformation h_{c_j} = tanh(W_x h_{x_j} + W_CLS h_{CLS_j} + b_c), where W_x, W_CLS ∈ R^{d_m×d_m} and b_c ∈ R^{d_m} are learned parameters. Then we average the representations h_{c_j} for all x_j ∈ P to obtain h_P. To allow the model to predict singleton mentions, we add an additional candidate S, with representation h_S = h_CLS, to the set of candidates for x_i, where coreference with S indicates x_i is a singleton. As we update L_e, we incrementally update h_P for all P ∈ L_e. Note that L_e can be either the gold coreference clusters over seen mentions (during training) or the current set of predicted clusters (during inference).
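A minimal sketch of this incremental composition, with toy dimensions and illustrative parameter values; the running-mean update is one natural way to keep h_P current as a cluster grows, and is an assumption rather than the paper's stated update rule:

```python
import numpy as np

d = 8  # toy representation size
rng = np.random.default_rng(1)
W_x, W_cls = rng.standard_normal((d, d)), rng.standard_normal((d, d))
b_c = rng.standard_normal(d)

def member_rep(h_x, h_cls):
    # Combine mention- and document-level information: tanh(W_x h_x + W_CLS h_CLS + b_c).
    return np.tanh(W_x @ h_x + W_cls @ h_cls + b_c)

class Cluster:
    """Incrementally maintained mean of member representations h_{c_j}."""
    def __init__(self):
        self.n, self.h = 0, np.zeros(d)

    def add(self, h_x, h_cls):
        c = member_rep(h_x, h_cls)
        self.n += 1
        self.h += (c - self.h) / self.n  # running-mean update of h_P

cl = Cluster()
for _ in range(4):
    cl.add(rng.standard_normal(d), rng.standard_normal(d))
assert cl.n == 4 and cl.h.shape == (d,)
```

Because each cluster keeps only a count and a mean, adding a mention updates h_P in O(d) time without revisiting earlier members.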

Coreference Link Prediction
We predict coreference links between a query entity mention x and a set of candidates by passing a set of similarity features through a softmax layer. Let C_x = {P_1, . . . , P_n, S} be the set of candidates for x. We first compute the similarity between each candidate P_j and the query using both cosine similarity f_cos and multi-perspective cosine (MP cosine) similarity f_mpcos (Wang et al., 2017). For multi-perspective cosine similarity, we first project the candidate and query into k shared spaces using k separate linear projections. Then, for each of the k new spaces, we compute the cosine similarity between the projected representations in that space.
Next, we combine these features with the elementwise product and the difference of the candidate and query to obtain the final feature representation f(x, P_j) = [f_cos; f_mpcos; h_x ⊙ h_{P_j}; h_x − h_{P_j}] (Equation 2). Then, for all candidates P_j, we compute the probability p(x, P_j) that the query x corefers with P_j by scoring each f(x, P_j) and normalizing over candidates with a softmax. We predict a link between x and the candidate with maximum coreference probability. If that candidate is S, then x is predicted as a singleton.
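A sketch of the similarity features and link probability under these definitions; the linear scoring layer `w`, the toy dimensions, and the random projections are simplifying assumptions, not the actual learned scorer:

```python
import numpy as np

rng = np.random.default_rng(2)
d, k = 8, 3  # toy representation size, number of MP-cosine perspectives
projections = [rng.standard_normal((d, d)) for _ in range(k)]

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def mp_cos(q, c):
    # Multi-perspective cosine: cosine in k separately projected spaces.
    return [cos(P @ q, P @ c) for P in projections]

def features(q, c):
    # [cos; MP-cos; elementwise product; difference], as in Equation 2.
    return np.concatenate([[cos(q, c)], mp_cos(q, c), q * c, q - c])

def link_probs(q, candidates, w):
    # Softmax over a (here linear) scoring of each candidate's features.
    scores = np.array([w @ features(q, c) for c in candidates])
    e = np.exp(scores - scores.max())
    return e / e.sum()

q = rng.standard_normal(d)
cands = [rng.standard_normal(d) for _ in range(3)]  # cluster reps incl. S
w = rng.standard_normal(1 + k + 2 * d)
p = link_probs(q, cands, w)
assert np.isclose(p.sum(), 1.0) and len(p) == 3
```

The predicted link is then simply `cands[p.argmax()]`, with the singleton candidate handled like any other.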

Sequential Cross-Document Prediction
To predict cross-document coreference links, we propose an algorithm that iterates through a list of documents and predicts coreference links between entity mentions in the current document and candidate clusters computed across all preceding documents (Figure 2).
We first impose an arbitrary ordering on a list of documents D. Then, for each i ∈ {1, . . . , |D|} and each entity mention x_n in document D[i], we compute candidate clusters C_{x_n} (§3.3) from the coreference clusters across all documents D[j] where j < i. Note that this includes both within-document and cross-document clusters.
After computing the candidate clusters for entity mention x_n, we compute similarity features and use these to predict a coreference link ℓ_n between x_n and one candidate P_j ∈ C_{x_n} (§3.4). Finally, we update the predicted clustering to account for the new link ℓ_n and compute new candidates for x_{n+1}.
Since the number of possible candidates for each x_n grows as the number of preceding documents (i) increases, we reduce the computational cost by only considering previous documents D[j] that are similar to D[i]. We define similar as having the same topic from a fixed set of topics T.
During training, we use gold entity clusters to compute the candidates (as in Figure 2) and gold document topic clusters T. In contrast, during inference we use the currently predicted coreference clusters to compute candidates. That is, we use L_e in place of G_e in lines 9 and 15 in Figure 2. Furthermore, we use predicted topic clusters T̂, computed using K-means (§4.5), in place of T.
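The overall iteration can be sketched as follows; the scorer here is a toy stand-in for the learned link predictor of §3.4 (with `None` playing the role of the singleton candidate S), and topic filtering is omitted for brevity:

```python
def sequential_xdoc_coref(documents, score):
    """Greedy sequential prediction over an ordered document list.

    `documents` is a list of lists of mentions; `score(mention, cluster)`
    returns a similarity (higher = more likely coreferent) and
    `score(mention, None)` scores the singleton candidate S.
    """
    clusters = []  # each cluster: list of mentions (within- and cross-doc)
    for doc in documents:
        for mention in doc:
            candidates = clusters + [None]  # None plays the role of S
            best = max(candidates, key=lambda c: score(mention, c))
            if best is None:
                clusters.append([mention])  # predicted singleton (so far)
            else:
                best.append(mention)        # link to an existing cluster
    return clusters

# Toy scorer: exact string match against any cluster member; 0.5 for S.
toy = lambda m, c: 0.5 if c is None else max(1.0 if m == x else 0.0 for x in c)
docs = [["earthquake", "Indonesia"], ["earthquake", "tsunami"]]
print(sequential_xdoc_coref(docs, toy))
# → [['earthquake', 'earthquake'], ['Indonesia'], ['tsunami']]
```

Note that a "singleton" cluster created early can still absorb later mentions, which mirrors how the predicted clustering L_e is updated after every link.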
Our model is trained to minimize cross-entropy loss computed in batches. Here, all M entity mentions in a single document form one batch and the loss is computed after M sequential predictions.

Adaptations for Event Coreference
We also adapt the same architecture and algorithm to cross-document event coreference resolution. Define an event x = ⟨t, A⟩, where t is the event trigger and A = [⟨e_1, r_1⟩, . . . , ⟨e_m, r_m⟩] is the set of its event arguments (i.e., entity–role pairs). If no entity takes some role r_i, then e_i = ∅.
We compute the event representation h_x analogously to the entity representations (§3.2). That is, we combine the event trigger representation with an aggregated entity representation, computed over the event arguments A. We then compute candidate clusters and predict coreference links in the same manner as for entities (§3.3, §3.4), with an additional feature in Equation 2 indicating whether event arguments corefer. Under the definition of event coreference, two events corefer when both their triggers and all of their arguments corefer. In practice, we relax the second requirement to most of their arguments, since argument role labeling may be noisy. We compute a binary feature g_{r_l} for each argument role r_l to indicate the coreference of e_l (the entity with role r_l in x) and e_l^{(P_j)} (the entity with role r_l in candidate cluster P_j). We compute a feature only for roles r_l ∈ R in which both the candidate and the query have some entity present (e_l ≠ ∅ and e_l^{(P_j)} ≠ ∅). Then, for each such r_l, if the two entities corefer then g_{r_l} = 1 and if they do not corefer then g_{r_l} = 0. Finally, we map each g_{r_l} to a learned embedding f_{r_l} ∈ R^{d_f} and compute an aggregated argument feature representation by averaging the f_{r_l} over the set of roles filled in both x and P_j. This feature is then concatenated into Equation 2 before prediction.
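A sketch of this argument coreference feature, with illustrative embedding values and a stand-in coreference predicate (during training this would be a lookup against the gold entity clusters):

```python
import numpy as np

rng = np.random.default_rng(3)
d_f = 4
# Learned embeddings for g = 0 and g = 1 (illustrative random values).
feat_emb = {0: rng.standard_normal(d_f), 1: rng.standard_normal(d_f)}

def argument_feature(query_args, cluster_args, entity_corefers):
    """Average embedded binary features over roles filled in both.

    `query_args` / `cluster_args`: dict role -> entity;
    `entity_corefers` is a predicate on entity pairs.
    """
    shared = [r for r in query_args if r in cluster_args]
    if not shared:
        return np.zeros(d_f)  # no role filled in both; feature is empty
    gs = [int(entity_corefers(query_args[r], cluster_args[r])) for r in shared]
    return np.mean([feat_emb[g] for g in gs], axis=0)

same = lambda a, b: a == b  # stand-in for the entity coreference lookup
f = argument_feature({"ARG0": "police", "TIME": "Tuesday"},
                     {"ARG0": "police", "LOC": "Jakarta"}, same)
assert np.allclose(f, feat_emb[1])  # only ARG0 is shared, and it corefers
```

How the model behaves when no role is shared is not spelled out in the text; the zero-vector fallback above is an assumption.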
The cross-document iteration algorithm for event coreference is analogous to Figure 2 with the modification that ComputeFeatures (line 11) now also takes the gold entity coreference clusters G e .

Time Comparison with Prior Methods
Algorithms for coreference resolution fall into two paradigms: sequential models (i.e., left-to-right prediction) and graph-based models (i.e., finding optimal connected components from a graph of pairwise similarity scores). This dichotomy is analogous to that in dependency parsing between transition-based parsers (i.e., greedy left-to-right models) and graph-based parsers. While the differences between the paradigms have been studied for dependency parsing (McDonald and Nivre, 2007, 2011), comparisons for coreference have been limited to WD entity coreference only (Martschat and Strube, 2015). In part, this is due to the usages of the two paradigms; sequential models are primarily used for WDCR while graph-based models are used for CDCR. However, as in dependency parsing, the sequential models can be made much more computationally efficient than the graph-based models, as we show with our model.
Let D be a set of documents with m mentions that form c coreference clusters. In a graph-based model, scores are computed between all pairs of mentions in all documents, and in a general sequential model scores are computed between a specific mention and all antecedents. Our model is a higher-order sequential model; it combines the general sequential paradigm with higher-order inference through incremental candidate clustering. Therefore, while both general sequential and graph-based models always require computing ∼m² scores, our model only needs to compute cm scores. Since in practice c ≪ m, our model is more efficient. We also note that graph-based models require an additional step to compute clusters using pairwise scores. In practice, agglomerative clustering is often used, but for an arbitrary distance matrix this is O(m³). In contrast, a higher-order sequential model computes clusters simultaneously with scores, alleviating the need for an additional step, and therefore is substantially more efficient.
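The gap between the two score counts can be made concrete with a small calculation (the numbers below are illustrative, not taken from ECB+):

```python
def pairwise_scores(m):
    # Graph-based / plain sequential: one score per pair of mentions.
    return m * (m - 1) // 2

def cluster_scores(mention_cluster_counts):
    # Higher-order sequential: each mention is scored against the clusters
    # (plus the singleton candidate S) existing when it is processed.
    return sum(c + 1 for c in mention_cluster_counts)

m, c = 1000, 50  # e.g., 1000 mentions forming ~50 clusters
# Upper bound: every mention sees at most c candidate clusters.
assert cluster_scores([min(i, c) for i in range(m)]) <= m * (c + 1)
print(pairwise_scores(m), "vs at most", m * (c + 1))  # 499500 vs at most 51000
```

With c ≪ m, the cm bound is roughly an order of magnitude fewer score computations here, before even counting the clustering step that graph-based models still need.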
These improvements in efficiency are even more important in real-world streaming scenarios: given a known set of clusters C for the document set D, compute coreference with the mentions in a new document. In both a general sequential model and a graph-based model, scores need to be computed between the new mentions and all mentions in D. However, our model only needs to compare the new mentions to the existing clusters C. Hence, our model can better handle the temporal component of many usage settings and is better suited to lifelong learning.

Data
We conduct experiments using the ECB+ dataset (Cybulska and Vossen, 2014), the largest available dataset for both within-document and cross-document event and entity coreference. The ECB+ dataset is an extension of the Event Coreference Bank dataset (ECB) (Bejan and Harabagiu, 2010), which consists of news articles clustered into topics by seminal events (e.g., "6.1 earthquake Indonesia 2009"). The extension of ECB adds an additional seminal event to each topic (e.g., "6.1 earthquake Indonesia 2013"). Documents on each of the two seminal events then form subtopic clusters within each topic in ECB+. Following the recommendations of Cybulska and Vossen (2015b), we use only the subset of annotations that have been validated for correctness in our experiments (see Table 1). As a result, our results are comparable to recent studies (e.g., Barhom et al. (2019)) but not earlier methods (see Upadhyay et al. (2016) for a more complete overview of evaluation settings). We use the standard partition of the dataset by topic into train (topics 1, 3, 4, 6-11, 13, 14, 16, 19-20, 22, 24-33), development (topics 2, 5, 12, 18, 21, 23, 34, 35), and test (topics 36-45) splits, and use subtopics (gold or predicted) for document clustering.

Identifying Event Structures
The ECB+ dataset does not include relations between events and entities. Although prior work used the Swirl (Surdeanu et al., 2007) semantic role labeling (SRL) parser to extract predicate-argument structures, this does not take advantage of recent advances in SRL. In fact, prior works on coreference using ECB+ have added a number of additional rules on top of the parser output to improve its coverage and linking. For example, Barhom et al. (2019) used a dependency parser to identify additional mentions. Therefore, in this work we use the current state-of-the-art SRL parser on the standard CoNLL-2005 shared task, which has improved performance by ∼10 F1 points both in- and out-of-domain.
Following prior work, we restrict the event structure to the following four argument roles: ARG0, ARG1, TIME, and LOC. However, we additionally add a type constraint during pre-processing that requires that entities of type TIME and LOC fill only the matching roles (TIME and LOC respectively).

Domain Adaptive Pre-training
Since BERT was trained on the BooksCorpus and Wikipedia (Devlin et al., 2019) and the ECB+ dataset contains news articles, there is a domain mismatch. In addition, the use of a domain corpus for pre-training helps address the data scarcity issue for events and entities, as indicated by Ma et al. (2020). Therefore, before training our coreference models, we first fine-tune BERT on the English Gigaword Corpus with both BERT losses, as this has been shown to be effective for domain transfer (Gururangan et al., 2020). Following Ma et al. (2020), we randomly sample 50k documents (626k sentences) and pre-train for 10k steps, using the hyperparameter settings from Devlin et al. (2019).

Baselines and Models
We experiment with the following baseline variations of our model: BERT-SeqWD, which computes coreference scores using only the entity (or event) representations, without any cross-document linking, and BERT-SeqXdoc, which computes coreference scores across documents but without candidate composition. This means the baseline BERT-SeqXdoc computes scores between the query mention and all antecedent mentions across all prior documents, rather than between the query and the clusters computed with candidate composition. For both event and entity coreference we experiment with our model, SeqXdoc+IC, with (+Adapt) and without adaptive pre-training. For entity coreference we compare against the following models:
• a Longformer (Beltagy et al., 2020) with cross-document attention during pre-training (Caciularu et al., 2021)
• Lemma - a strong baseline that links mentions with the same head-word lemma in the same document topic cluster (Barhom et al., 2019).

Implementation Details
Our models are tuned for a maximum of 80 epochs with early stopping on the development set (using CoNLL F1) with a patience of 20 epochs. All models are optimized using Adam (Kingma and Ba, 2015) with a learning rate of 2e-5 and treat all mentions in a document as a batch. We clip gradients to 30 to prevent exploding gradients. Document ordering is fixed within an epoch but randomized between epochs. We encode each document using BERT-base with a maximum document length of 600 tokens; 600 is the default threshold in our system, which also accommodates longer documents, and there are no documents over 512 tokens in ECB+. In our system, long documents are not truncated but rather split into multiple document pieces, which are then merged by our algorithm. Following adaptive pre-training, we do not fine-tune BERT.
To encode arguments/events we use an LSTM with hidden size 128, for the argument coreference features we use two learned embeddings of dimension d f = 50, and for the multi-perspective cosine similarity we use k = 1 projection layers with dimension 50 for entity coreference and k = 3 projection layers with the same dimension for event coreference. We do not tune our hyperparameters.
We follow Barhom et al. (2019) and use K-means, with K = 20, to compute document clusters for inference, using their implementation. Specifically, as features we use the TF-IDF scores of unigrams, bigrams, and trigrams in the unfiltered dataset, excluding stop words. We select K = 20 as this is the number of gold document clusters in the test data, but this can be modified without affecting our algorithm.
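A sketch of this document clustering using scikit-learn's standard TfidfVectorizer and KMeans APIs; the exact preprocessing in the reference implementation may differ:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def cluster_documents(docs, k=20):
    # TF-IDF over uni-, bi-, and trigrams, excluding English stop words.
    tfidf = TfidfVectorizer(ngram_range=(1, 3), stop_words="english")
    X = tfidf.fit_transform(docs)
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)

docs = ["earthquake hits Indonesia", "Indonesia earthquake kills dozens",
        "new phone released today", "company releases new phone"]
labels = cluster_documents(docs, k=2)
assert len(labels) == 4
```

The returned labels define the predicted topic clusters T̂ used to restrict candidate documents at inference time.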
Results

For entity coreference, our model outperforms most prior methods (see Table 2) and for event coreference our model demonstrates strong performance (see Table 3). Although Caciularu et al. (2021) achieve the highest performance, we observe first that much of their gains come from the use of a Longformer (80.4 CoNLL F1 for entities, 84.6 for events), a language model specifically designed for long contexts. Additionally, they fine-tune with coreference-specific data and special tokens, neither of which our models use. As observed by Xu and Choi (2020), improvements in the underlying language model can result in large gains in coreference performance without requiring algorithmic improvements. However, our model focuses on improving the coreference prediction algorithm, while using a standard BERT language model.
We observe that our algorithm provides large gains for both event and entity coreference. In particular, while naive applications of the sequential paradigm in the cross-document setting (BERT-SeqWD and BERT-SeqXdoc) perform poorly, the addition of incremental candidate clustering even without adaptive pre-training yields competitive results (+8.9 and +2.8 CoNLL F1 for entities and events respectively). Adaptive pre-training, which handles domain mismatch in a similar way to task-specific fine-tuning (e.g., as in Caciularu et al. (2021)), provides further gains (4.1 and 2.2 CoNLL F1 for entities and events respectively). Our results highlight the importance of higher-order inference (e.g., composition) when extending sequential prediction to cross-document settings.
We note that prior work (Barhom et al., 2019) used predicted entity coreference clusters. In a comparable setting, using the output from our best entity coreference model to compute argument coreference features (§3.4), we do not observe any drop in performance (i.e., the performance is identical). In addition, we use predicted document clusters for our experiments on both entity and event coreference. Due to the high quality of the document clustering, we only observe a drop of ∼1 CoNLL F1 point when using these predicted clusters, compared to the gold document clusters. However, we note that such a small decrease relies on the quality of the clustering, as shown by the larger gap (3 F1 points) observed by Cremisini and Finlayson (2020) with less accurate clusters.
Finally, we observe that the sequential paradigm with incremental candidate composition has several additional advantages. First, without candidate composition, sequential coreference resolution is typically a multi-label task. However, with composition, each mention now has exactly one correct coreferring antecedent during training (possibly a cluster rather than an individual mention), and this simplifies the learning task. While heuristics (e.g., choosing the closest antecedent as the gold antecedent, discussed in Martschat and Strube (2015)) also convert the task from multi- to single-label, they are problematic because they limit the amount of information that can be used during training. Additionally, candidate composition allows sequential models to make use of information not only about antecedents (in contrast to the graph-based models) but also about prior coreference decisions (in contrast to non-compositional sequential models).
Our results are consistent with prior work on the efficacy of sequential models (cf. mention-ranking models for WD entity coreference) (Martschat and Strube, 2015) and the importance of higher-order inference mechanisms (e.g., incremental candidate clustering) in cross-document tasks (Zhou et al., 2020). In addition, our results demonstrate the importance of algorithmic improvements, in addition to improvements in the underlying language model, for strong coreference performance.

Feature Ablation
Since mention representations in coreference vary widely, we conduct extensive feature ablations to provide insights for future work (see Table 4).
First we examine the vector representations used to encode mentions. While prior work used ELMo and pre-trained GloVE (Pennington et al., 2014) word and character embeddings, recent models use RoBERTa (Cattan et al., 2020; Yu et al., 2020b). We experiment with replacing BERT-base with RoBERTa-base and with using GloVE in addition to BERT in our models (see Appendix B for implementation) and observe large drops in performance. We hypothesize that the substantial performance difference between BERT and RoBERTa is due to the Next Sentence Prediction (NSP) objective used to train BERT but not RoBERTa. The NSP objective may force BERT to learn attention across multiple sentences, and therefore to understand the document as a whole, an ability that is important for coreference resolution. Therefore, we hypothesize that without task-specific fine-tuning, adaptive pre-training is most beneficial for coreference on ECB+.
We also observe that our entity coreference model is relatively less susceptible to feature changes than the event coreference model. For example, the event coreference model is particularly reliant on the argument features. Both replacing the argument composition BiLSTM with a mean-pooling operation (−Arg comp) and removing all argument information (−Args) result in large drops in performance (−2.5 and −2.1 respectively).
Finally, the contribution of the multi-perspective cosine similarity underscores the importance of cosine similarity, as observed by Cremisini and Finlayson (2020). These ablations, including on the importance of document-level information (−CLS), suggest new directions for token and document representations in coreference.

Effects of SRL
We investigate the impact of using a recent SRL parser to extract event structures (§4.2), compared to the Swirl parser used in prior work (see Table 5). We first observe that the additional extraction rules used in Barhom et al. (2019) are not necessary when using the new SRL parser. In fact, these rules actually result in a decrease in performance for both entity and event coreference (−1.6 and −0.4 respectively). In addition, when using the Swirl parser and additional rules (Swirl+Bh-rules), we observe a large drop for event coreference (−2.1) compared to entity coreference. This aligns with the heavier dependence of event coreference models on arguments (§5.2), which leads to greater model sensitivity to errors in the entity-event structures (from the SRL). Furthermore, we also see that the type constraint improves event coreference more when using the Swirl SRL (∆ = 1.3) than when using the new SRL (∆ = 0.4). Note that because we do not use role information for entity coreference (i.e., no argument coreference feature), adding or removing the type constraint does not affect entity coreference. These results highlight the importance of minimizing error propagation from the SRL into coreference resolution.

Conclusion
In this paper, we propose a new model for cross-document coreference resolution that extends the efficient sequential prediction paradigm to multiple documents. The sequential prediction is combined with incremental candidate composition that allows the model to use the history of past coreference decisions at every step. Our model achieves competitive results for both entity and event coreference and our analysis provides strong evidence of the efficacy of both sequential models and higher-order inference in cross-document settings. In future work, we intend to adapt this model to coreference across document streams and investigate alternatives to greedy prediction (e.g., beam search).

A Implementation details
The dataset is available here: http://www.newsreader-project.eu/results/data/the-ecb-corpus/.
Our models have approximately 9 million parameters and are trained with one Tesla V100-SXM2 GPU.
We evaluate our models using three coreference metrics. MUC counts discrepancies in links between the gold and predicted clusters (and thus ignores singletons). B³ computes, for each mention m, the difference between the gold cluster containing m and the predicted cluster containing m. Finally, CEAF-e finds the injective alignment between predicted and gold clusters that gives the highest similarity under a defined function. For more details on metrics, refer to Cai and Strube (2010).
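As an illustration, B³ can be computed directly from the two clusterings (a minimal reference implementation, not the evaluation scorer used in our experiments):

```python
def b_cubed(gold, predicted):
    """B^3 precision/recall over mentions; clusters are sets of mentions."""
    def cluster_of(m, clustering):
        return next(c for c in clustering if m in c)

    mentions = [m for c in gold for m in c]
    p = r = 0.0
    for m in mentions:
        g, pr = cluster_of(m, gold), cluster_of(m, predicted)
        overlap = len(g & pr)
        p += overlap / len(pr)  # per-mention precision
        r += overlap / len(g)   # per-mention recall
    n = len(mentions)
    return p / n, r / n

gold = [{"a", "b", "c"}, {"d"}]
pred = [{"a", "b"}, {"c", "d"}]
prec, rec = b_cubed(gold, pred)
assert abs(prec - 0.75) < 1e-9 and abs(rec - 2/3) < 1e-9
```

Unlike MUC, singletons contribute to both sums here, which is why B³ is sensitive to them.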
We report validation results for both entity (Table 6) and event coreference (Table 7).

B Feature Ablation
We experiment with 300-dimensional pre-trained GloVE (Pennington et al., 2014) embeddings in our model. Following Barhom et al. (2019), we use both GloVE and BERT in the entity mention representations (for entity coreference) and event trigger representations (for event coreference). In the argument representations we use only GloVE. Let x = ⟨e, V⟩ be an entity in document d and let s_{d_1}, . . . , s_{d_m} be the static GloVE embeddings for the tokens in d. First we apply a non-linear transformation to each: s'_{d_i} = tanh(W_t s_{d_i} + b_t), where W_t ∈ R^{1536×300}; here 1536 is twice the dimension of the BERT embeddings (2× because we use the start and end tokens of a mention) and 300 is the dimension of the GloVE embeddings. Then, we take the average of the s'_{d_i} to obtain s_CLS ∈ R^{1536}, a static document representation. Next we extract the representation for the entity x as in §3.2, denoted s_x. Finally, we combine these representations with the BERT representations as h̃_x = tanh(W^s_x s_x + W^B_x h_x + b_x) and h̃_CLS = tanh(W^s_CLS s_CLS + W^B_CLS h_CLS + b_CLS), which replace h_x and h_CLS in the candidate composition of §3.3 (where h_c is computed as before, with bias b_c), and where W^s_x, W^s_CLS, W^B_x, W^B_CLS ∈ R^{1536×1536} and b_x, b_CLS, b_c ∈ R^{1536} are learned parameters.