Highly Parallel Autoregressive Entity Linking with Discriminative Correction

Generative approaches have recently been shown to be effective for both Entity Disambiguation and Entity Linking (i.e., joint mention detection and disambiguation). However, the previously proposed autoregressive formulation for EL suffers from i) high computational cost due to a complex (deep) decoder, ii) non-parallelizable decoding that scales with the source sequence length, and iii) the need for training on a large amount of data. In this work, we propose a very efficient approach that parallelizes autoregressive linking across all potential mentions and relies on a shallow and efficient decoder. Moreover, we augment the generative objective with an extra discriminative component, i.e., a correction term which lets us directly optimize the generator's ranking. Taken together, these techniques tackle all the above issues: our model is >70 times faster and more accurate than the previous generative method, outperforming state-of-the-art approaches on the standard English dataset AIDA-CoNLL. Source code is available at https://github.com/nicola-decao/efficient-autoregressive-EL

Employing autoregressive language models better leverages the implicit knowledge accumulated during pre-training, exploiting a full cross-encoder of entities and their context. For ED, autoregressive generation is remarkably good (even in multilingual settings), while for EL, although state-of-the-art on multiple datasets, it suffers from several critical limitations. The generative model of De Cao et al. (2021a) outputs a version of the input document which is markup-annotated with mentions linked to their respective entities. This necessitates using an autoregressive decoder, precluding parallelism across mentions. Generation also has a high computational cost because it relies on a complex and deep Transformer (Vaswani et al., 2017) decoder. Transformers are state-less and their memory footprint scales with sequence length, making them memory-consuming when generating long sequences. Additionally, Transformer-based decoders are notably data-hungry, and training them effectively requires large amounts of data. For example, De Cao et al. (2021a) had to pre-train their model on Wikipedia abstracts.
In this work, we revisit the generative approach to EL and generate mention-entity pairs conditionally independently given the input. This allows for parallelism across mentions, which we exploit by employing a shallow LSTM-based decoder. To optimize the generator's ranking more explicitly, we use a discriminative correction term that pushes the scores of correct predictions above the rest. Moreover, to enable conditioning on long inputs, we employ an efficient Transformer encoder (Beltagy et al., 2020) designed to support long sequences. Figure 1 outlines our model.

Figure 1: Outline of our model: a Transformer-based document encoder embeds a document into vectors (the encoder is designed to support long text). Then, an entity detection module classifies which spans in the document are entity mentions. Conditioning on a mention embedding, an entity linking module first uses an LSTM to either generate or score candidates' textual identifiers and then a classifier to re-rank the candidates.

Contributions
We propose a highly parallel model for autoregressive entity linking that retains the advantages of being generative while being >70 times faster than a previous generative formulation and as fast as non-generative models. We optimize for the correctness of the decoder's ranking with a discriminative loss to further improve autoregressive EL. The model outperforms state-of-the-art approaches on the standard English AIDA dataset.

Background
Task Entity Linking (EL) is the task of predicting the set $Y$ of mention-entity pairs contained in an input text $x$ (Hoffmann et al., 2011). Each mention $m$ is a pair of start and end positions $(m_s, m_e)$ indicating a span in $x$. Each mention $m$ refers to an entity $e$ in a fixed Knowledge Base (KB). Note that entities can be referred to with multiple ambiguous surface forms (e.g., in Wikidata both "NYC" and "New York" refer to the entity "New York City").
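For concreteness, a minimal illustration of this input/output structure (the document and the token positions are invented for this example):

```python
# A toy document with its gold set Y of mention-entity pairs. Each mention
# is a (start, end) position pair; each entity is a unique KB name.
document = "NYC is the most populous city in the United States ."
Y = [
    ((0, 0), "New York City"),  # surface form "NYC" -> entity "New York City"
    ((8, 9), "United States"),  # inclusive token positions in the document
]
```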
Related work EL is typically decomposed into Mention Detection (MD, i.e., the task of finding mention spans in text) and Entity Disambiguation (ED, i.e., the task of disambiguating a mention to its respective entity). Many methods (Hoffart et al., 2011; Piccinno and Ferragina, 2014; Steinmetz and Sack, 2013) treat these sub-tasks separately, training different modules. More modern approaches, known as end-to-end EL, instead use a shared (typically neural) architecture. Kolitsas et al. (2018) use a bidirectional LSTM (Hochreiter and Schmidhuber, 1997) as an encoder and then local and global scoring functions to link mentions. They exploit pre-computed entity embeddings by Ganea and Hofmann (2017) and match them to contextualized mention representations. Martins et al. (2019) also explore joint learning of Named Entity Recognition (NER) and EL, showing that the two tasks benefit from joint training, while Li et al. (2020) approach EL specifically for questions.
In this work, we focus on monolingual EL in English, while a separate line of work explores cross-lingual entity linking (McNamee et al., 2011; Ji et al., 2015), i.e., linking from any source language to a canonical one (e.g., English), and multilingual entity linking (Botha et al., 2020), which is a generalization of both.
Autoregressive Linking The GENRE model by De Cao et al. (2021a) departs from framing EL as matching in vector space, and instead frames it as a sequence-to-sequence problem. GENRE tackles MD and ED for all mention-entity pairs jointly by autoregressively generating a version of the input markup-annotated with the entities' unique identifiers expressed in natural language. Although we focus on EL, GENRE was also applied to ED alone as well as to page-level document retrieval for fact-checking, open-domain question answering, slot filling, and dialog (Petroni et al., 2021). mGENRE (De Cao et al., 2021b) is the multilingual extension of GENRE.
Modern techniques (Wu et al., 2020; Botha et al., 2020) are based on a dense retriever module that uses maximum inner-product search (MIPS) to match mention vectors to entity embeddings. In contrast with MIPS for linking, generative models i) exploit knowledge learned during pre-training, ii) are memory-efficient as they do not need to store pre-computed entity representations, and iii) are full cross-encoders of context and entity, since decoders can use attention to the context. Bi-encoder solutions may be sub-optimal and memory-inefficient, although memory-efficient dense retrieval has recently received attention (Izacard et al., 2020; Min et al., 2021). A caveat of jointly modeling all mention-entity pairs with an autoregressive model (i.e., without any independence assumptions) is the lack of parallelism, which makes GENRE extremely slow on the complete task of EL. In addition, generation of open-ended text calls for a deep decoder and thus requires very large corpora for training.

Method
Our method learns by generating the observed mention-entity pairs $Y$ given an input document $x$. To enable parallelism, we assume that, given the document $x$, each mention-entity pair $\langle m, e \rangle \in Y$ is independent of the others. Moreover, each pair's probability is further factorized as the product of an MD and an ED component:

$$p(Y \mid x, \theta) = \prod_{\langle m, e \rangle \in Y} p(m \mid x, \theta_{MD}) \cdot p(e \mid m, x, \theta_{ED}) \quad (1)$$

where $\theta = \theta_{MD} \cup \theta_{ED}$ is a shared set of parameters (and $\theta_{MD} \cap \theta_{ED}$ need not be empty). To provide our model with a rich representation of the document, we encode it using a Longformer (Beltagy et al., 2020), a Transformer pre-trained with a masked language model objective that is designed to support long sequences.
Mention Detection There are different ways to model $p(m \mid x, \theta_{MD})$ (i.e., the probability that the span $m$ in $x$ is a mention). One is to score all possible spans, which requires a number of evaluations that is quadratic in the sequence length; for long documents, that is clearly unfeasible. Thus, to maximize efficiency, we factorize the probability of a span as the probability of its start $m_s$ times the conditional probability of its end $m_e$ given the start:

$$p(m \mid x, \theta_{MD}) = p(m_s \mid x, \theta_{MD}) \cdot p(m_e \mid m_s, x, \theta_{MD}) \quad (2)$$

The first term is the probability that position $m_s$ starts a mention, and the second is the probability that the mention has length $m_e - m_s + 1$, which we treat as a categorical variable. This factorization allows for both fast training and fast inference. During training, mentions are known. During inference, we consider only the positions for which the probability of starting a mention exceeds a threshold chosen to maximize micro-F1 on the validation set.
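A minimal PyTorch sketch of this factorized detector, under our own naming and layer sizes (the released code may differ):

```python
import torch
import torch.nn as nn

class MentionDetector(nn.Module):
    """Factorized MD: p(m) = p(m_s | x) * p(m_e | m_s, x), with the end
    modelled as a categorical distribution over mention lengths."""

    def __init__(self, hidden=768, max_len=15):
        super().__init__()
        self.start_scorer = nn.Linear(hidden, 1)         # does position i start a mention?
        self.length_scorer = nn.Linear(hidden, max_len)  # length of the mention starting at i

    def forward(self, token_embs, threshold=0.0):
        # token_embs: (seq_len, hidden) contextualized token embeddings.
        start_logits = self.start_scorer(token_embs).squeeze(-1)
        # Keep only positions whose start logit exceeds the tuned threshold
        # (a logit threshold of 0.0 corresponds to probability 0.5).
        starts = (start_logits > threshold).nonzero(as_tuple=True)[0]
        # For each predicted start, pick the most likely length.
        lengths = self.length_scorer(token_embs[starts]).argmax(-1) + 1
        ends = starts + lengths - 1
        return list(zip(starts.tolist(), ends.tolist()))

# Usage on a toy 20-token "document" (random embeddings for the sketch).
detector = MentionDetector()
spans = detector(torch.randn(20, 768))
```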

Entity Disambiguation
The disambiguation module learns to generate the unique name $t$ of the entity $e$ autoregressively (token by token, from left to right):

$$p(e \mid m, x, \theta_{ED}) = \prod_{i=1}^{|t|} p(t_i \mid t_{<i}, m, x, \theta_{ED}) \quad (3)$$

where $t$ is the unique name of $e$ in the KB. To fully exploit our design's potential for parallelism across mentions, we use a small single-layered LSTM (Hochreiter and Schmidhuber, 1997). This language model is not constrained to generating only valid entity names; moreover, maximum likelihood training does not directly optimize for the correctness of the generator's ranking. To mitigate these issues, when training the architecture we employ an auxiliary loss based on a discriminative classifier that assigns probability

$$p_{clf}(e \mid m, x, \theta_{ED}) = \frac{\exp f(t, m, x)}{\sum_{e' \in KB} \exp f(t_{e'}, m, x)} \quad (4)$$

where $f$ is an MLP (details in Section 4.2), $t$ is the unique name of $e$, $t_{e'}$ that of $e'$, and the normalization is over all entities in the KB (i.e., their unique names).
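A sketch of the auxiliary discriminative loss, assuming cross-entropy over a mention's candidate set (function and variable names are ours):

```python
import torch
import torch.nn.functional as F

def discriminative_loss(candidate_scores, gold_idx):
    """Cross-entropy over a mention's candidate entities: pushes the MLP
    score f(t, m, x) of the gold entity above those of the negatives.
    candidate_scores: (num_candidates,) one score per candidate name.
    gold_idx: position of the correct entity in the candidate list."""
    log_probs = F.log_softmax(candidate_scores, dim=-1)  # normalize over candidates
    return -log_probs[gold_idx]

# Usage: 5 pre-computed candidates, gold entity at index 2.
loss = discriminative_loss(torch.randn(5), gold_idx=2)
```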

Parameter Estimation
We estimate the parameters of all components jointly so as to maximize the model's likelihood on a dataset of observations using stochastic gradient descent (SGD; Robbins and Monro, 1951; Kiefer and Wolfowitz, 1952; Bottou, 2012). For the language model component, we employ length normalization (Sutskever et al., 2011) and label smoothing (Szegedy et al., 2016). All components are further regularized with dropout (Srivastava et al., 2014). As in several previous approaches and all systems we compare to, for linking we assume the availability of a pre-computed set of candidates instead of considering the whole KB. For that, we use the candidates by Pershina et al. (2015). We also use these candidates to provide negative samples for the discriminative loss during training (see Equation 4).
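A sketch of the length-normalized, label-smoothed token loss for the language model component (averaging over the name's tokens as the length normalization is our reading of the text):

```python
import torch
import torch.nn.functional as F

def lm_loss(logits, target_ids, smoothing=0.1):
    """Label-smoothed cross-entropy over the tokens of an entity name,
    averaged over its length (i.e., length normalization).
    logits: (name_len, vocab_size); target_ids: (name_len,)."""
    return F.cross_entropy(logits, target_ids,
                           label_smoothing=smoothing, reduction="mean")

# Usage: a 4-token entity name over a toy vocabulary of 100 tokens.
loss = lm_loss(torch.randn(4, 100), torch.tensor([5, 17, 42, 3]))
```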

Architecture details
As the document encoder, we use a Longformer (Beltagy et al., 2020). A Longformer is a RoBERTa (Liu et al., 2019) model with a limited attention window (we use 128 tokens). It has 12 layers, of which we use the first 8 (for faster computation), a hidden size of 768, and 12 heads, for a total of 149M parameters. The MD modules (i.e., $p(m_s \mid x, \theta_{MD})$ and $p(m_e \mid m_s, x, \theta_{MD})$) are both implemented as feed-forward NNs that take contextualized token embeddings as inputs. They have the architecture [LayerNorm, 128, ReLU, LayerNorm, 1], and we apply dropout of 0.1 before linear projections. The autoregressive ED module $p(t_i \mid t_{<i}, m, x, \theta_{ED})$ is implemented with an LSTM. Three feed-forward NNs predict the first hidden state, the first context vector, and a vector appended to each decoding step; all three are functions of the start and end embeddings of a mention.
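Read as PyTorch, the stated MD head could look as follows (a sketch; the dropout placement follows the text above):

```python
import torch.nn as nn

hidden = 768  # Longformer hidden size

# [LayerNorm, 128, ReLU, LayerNorm, 1], with dropout 0.1 before linear layers.
md_head = nn.Sequential(
    nn.LayerNorm(hidden),
    nn.Dropout(0.1),
    nn.Linear(hidden, 128),
    nn.ReLU(),
    nn.LayerNorm(128),
    nn.Dropout(0.1),
    nn.Linear(128, 1),
)
```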

Training details
We optimize our model with Adam (Kingma and Ba, 2015) with a weight decay of 1e-2. We use a learning rate of 1e-4 for the Longformer and of 1e-3 for all other components, with a linear decay schedule over a maximum of 10,000 steps and 500 warm-up steps. We train with a batch size of 32 for a maximum of 100 epochs, doing model selection on micro-F1 on the validation set. We also optimize the threshold of the MD component with a grid search between -5 and 5 in steps of 0.1, measuring micro-F1 on the validation set. Training takes approximately one hour on 4 Nvidia Titan X 12 GB GPUs.
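A sketch of this optimization setup, using AdamW as the decoupled-weight-decay variant of Adam and stand-in modules for the real parameter groups:

```python
import torch
import torch.nn as nn
from transformers import get_linear_schedule_with_warmup

# Stand-ins for the real modules (the Longformer encoder and the MD/ED heads).
encoder, heads = nn.Linear(768, 768), nn.Linear(768, 1)

optimizer = torch.optim.AdamW(
    [{"params": encoder.parameters(), "lr": 1e-4},   # Longformer
     {"params": heads.parameters(), "lr": 1e-3}],    # all other components
    weight_decay=1e-2,
)
# Linear decay over at most 10,000 steps with 500 warm-up steps.
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=500, num_training_steps=10_000)
```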

Results

We also report Mention Detection (MD) and Entity Disambiguation (ED) scores. Our method obtains an MD micro-F1 of ≈94 and an ED micro-F1 of ≈92 (note that the EL task scores a prediction as correct only when both mention detection and disambiguation are correct). Unfortunately, most of the baselines we compare to do not report this decomposition, so it is difficult to systematically investigate where our method stands on MD and ED. Nevertheless, Kolitsas et al. (2018) is the second-best system in terms of EL micro-F1, and its authors report an ED micro-F1 of ≈89; as a comparison, Broscheit (2019) reports ≈88 and van Hulst et al. (2020) ≈84. This suggests that our improvement mainly comes from improving ED.

Performance Evaluation
In Table 3, we compare the speed of our system against the two best baseline models from Table 2. We perform 3 independent runs on the validation set and report the number of queries per second on GPU (one Nvidia Titan X 12 GB), feeding the models one input at a time (i.e., batch size of 1). For GENRE (De Cao et al., 2021a), we truncate sequences to the maximum supported length. Our model parallelizes the generation of all entity identifiers and dispenses with generating superfluous text (i.e., the non-mentions), making it >70 times faster than GENRE, which has to re-generate the whole source input left-to-right in order to fill in the mention-entity markup sequentially. Notably, our model is also slightly faster than Kolitsas et al. (2018), a well-established model for EL.
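A sketch of this measurement protocol (`model_fn` and `inputs` are placeholders for the actual system and validation documents):

```python
import time

def queries_per_second(model_fn, inputs, n_runs=3):
    """Throughput with batch size 1, averaged over independent runs.
    model_fn: a callable running one forward pass; inputs: the documents."""
    rates = []
    for _ in range(n_runs):
        start = time.perf_counter()
        for x in inputs:           # one document at a time (batch size 1)
            model_fn(x)
        rates.append(len(inputs) / (time.perf_counter() - start))
    return sum(rates) / len(rates)
```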

Analysis
We investigate the importance of different aspects of our model formulation in an ablation study. In Table 2 (bottom-half) we report all results.

Discriminative Correction
We train with and without the discriminative correction term of Equation 4 to appreciate its impact on the results. Using only the LM component results in a 4% drop in performance, due to not directly optimizing for the correctness of the generator's ranking. Using the classifier alone also leads to a 4% drop. These ablations indicate that the auxiliary loss helps improve the generator's ranking.
Beam Search vs Complete Scoring To compare with previous work, we use pre-computed candidates for ED. This is feasible because the number of candidates to score is relatively small. In general, however, candidates might be too numerous to score them all. Thus, we test our model using Constrained Beam Search (CBS) as an approximation. With CBS (beam size of 5), performance drops by <1%, and micro-F1 remains higher than that of every other baseline, demonstrating that our formulation is robust in this setting too.
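CBS restricts each decoding step to tokens that continue a valid entity name, which is typically implemented with a prefix trie over the tokenized candidate names; a minimal sketch with toy token ids:

```python
def build_trie(candidate_token_ids):
    """Prefix trie over tokenized entity names; each node maps a token id
    to its child node."""
    trie = {}
    for ids in candidate_token_ids:
        node = trie
        for tok in ids:
            node = node.setdefault(tok, {})
    return trie

def allowed_tokens(trie, prefix):
    """Tokens the decoder may generate next, given what it has generated."""
    node = trie
    for tok in prefix:
        node = node.get(tok)
        if node is None:
            return []      # prefix is not a valid entity-name prefix
    return list(node.keys())

# Usage: two (toy) tokenized names sharing the first token.
trie = build_trie([[12, 7, 3], [12, 9]])
assert allowed_tokens(trie, [12]) == [7, 9]
```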
Ablating Candidates One of the benefits of the generative formulation is the ability to generate entity names (autoregressively through CBS) without the need for candidates. Thus, we test our model using CBS without candidates (i.e., all entities in the KB are viable candidates). In this setting, our model does not excel (a 42% drop in performance). The drop is not surprising: our generative component has only seen a fraction of the entity identifiers (1,537 out of ≈500,000 in the KB). Indeed, previous methods (e.g., De Cao et al., 2021a) were pre-trained on the whole of Wikipedia to mitigate this issue. We do not have the computational budget for such pre-training, so we leave this for follow-up work.

Conclusion
We revisit the generative approach to EL, exploiting independence assumptions that enable parallelism across mentions with a shallow LSTM decoder.