LNN-EL: A Neuro-Symbolic Approach to Short-text Entity Linking

Entity linking (EL) is the task of disambiguating mentions appearing in text by linking them to entities in a knowledge graph, a crucial task for text understanding, question answering, and conversational systems. In the special case of short-text EL, which poses additional challenges due to limited context, prior approaches have reached good performance by employing heuristics-based methods or purely neural approaches. Here, we take a different, neuro-symbolic approach that combines the advantages of interpretable rules based on first-order logic with the performance of neural learning. Even though constrained to use rules, we show that we reach performance competitive with or better than SoTA black-box neural approaches. Furthermore, our framework has the benefits of extensibility and transferability. We show that we can easily blend existing rule templates given by a human expert with multiple types of features (priors, BERT encodings, box embeddings, etc.), and even with scores resulting from previous EL methods, thus improving on such methods. As an example, on the LC-QuAD-1.0 dataset, we show more than a 3% increase in F1 score relative to previous SoTA. Finally, we show that the inductive bias offered by using logic results in a set of learned rules that transfers from one dataset to another, sometimes without finetuning, while still having high accuracy.

A particular type of entity linking, focused on short text (i.e., a single sentence or question), has attracted recent attention due to its relevance for downstream applications such as question answering (e.g., (Kapanipathi et al., 2021)) and conversational systems. Short-text EL is particularly challenging because the limited context surrounding mentions results in greater ambiguity (Sakor et al., 2019). To address this challenge, one needs to exploit as many features from as many sources of evidence as possible.
Consider the question in Figure 1(a), containing mention 1 (Cameron) and mention 2 (Titanic). DBpedia contains several person entities whose last name matches Cameron. Two such entities are shown in Figure 1(b), James_Cameron and Roderick_Cameron, along with their string similarity scores (in this case, character-level Jaccard similarity) to mention 1. In this case, the string similarities are quite close. In the absence of reliable discerning information, one can employ a prior such as preferring the more popular candidate entity, as measured by the in-degree of the entity in the KG (see Figure 1(b)). Given the higher in-degree, we can (correctly) link mention 1 to James_Cameron. However, for mention 2, the correct entity is Titanic_(1997_film), as opposed to Titanic the ship, but it actually has a lower string similarity. To link to the correct entity, one needs to exploit the fact that James_Cameron has an edge connecting it to Titanic_(1997_film) in the KG (see the ego network on the left in Figure 1(c)). Linking co-occurring mentions from text to connected entities in the KG is an instance of collective entity linking. This example provides some intuition as to how priors, local features (string similarity) and collective entity linking can be exploited to overcome the limited context in short-text EL.

Figure 1: (a) Question with 2 mentions that need to be disambiguated against DBpedia. (b) For each mention-candidate entity pair, the character-level Jaccard similarity is shown along with the in-degree of the entity in the knowledge graph. (c) (Partial) Ego networks for entities James_Cameron and Roderick_Cameron.
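To make the intuition concrete, the following Python sketch scores candidates by character-level string similarity plus a popularity prior. The in-degree numbers are hypothetical stand-ins for real KG statistics, and the character-set Jaccard variant and equal weighting are illustrative assumptions, not the learned model described later.

```python
def char_jaccard(a: str, b: str) -> float:
    """Character-level Jaccard similarity between two strings."""
    sa, sb = set(a.lower()), set(b.lower())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

# Hypothetical candidates with made-up in-degrees (stand-ins for KG statistics).
candidates = {
    "James_Cameron": 1000,     # in-degree of the entity in the KG
    "Roderick_Cameron": 20,
}

def link(mention: str, cands: dict) -> str:
    """Rank candidates by string similarity plus a popularity prior."""
    top = max(cands.values())
    def score(name: str) -> float:
        sim = char_jaccard(mention, name.replace("_", " "))
        prior = cands[name] / top   # prior normalized by the largest in-degree
        return sim + prior          # equal weighting, purely illustrative
    return max(cands, key=score)
```

With close string similarities (0.70 vs. roughly 0.64 here), the prior breaks the tie in favor of the more prominent entity, mirroring the mention 1 example above.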
While the use of priors, local features and non-local features (for collective linking) has been proposed before (Ratinov et al., 2011), our goal in this paper is to provide an extensible framework that can combine any number of such features and more, including contextual embeddings such as BERT encodings (Devlin et al., 2019) and Query2Box embeddings (Ren et al., 2020), and even the results of previously developed neural EL models (e.g., BLINK (Wu et al., 2020)).
Additionally, such a framework must not only allow for easy inclusion of new sources of evidence but also for interpretability of the resulting model (Guidotti et al., 2018). An approach that combines disparate features should, at the very least, be able to state, post-training, which features are detrimental and which features aid EL performance and under what conditions, in order to enable actionable insights in the next iteration of model improvement.
Our Approach. We propose to use rules in first-order logic (FOL), an interpretable fragment of logic, as a glue to combine EL features into a coherent model. Each rule in itself is a disambiguation model capturing specific characteristics of the overall linking. While inductive logic programming (Muggleton, 1996) and statistical relational learning (Getoor and Taskar, 2007) have long focused on learning FOL rules from labeled data, more recent approaches based on neuro-symbolic AI have led to impressive advances. In this work, we start with an input set of rule templates (given by an expert or available as a library), and learn the parameters of these rules (namely, the thresholds of the various similarity predicates as well as the weights of the predicates that appear in the rules), based on a labeled dataset. We use logical neural networks (LNN) (Riegel et al., 2020), a powerful neuro-symbolic AI approach based on real-valued logic that employs neural networks to learn the parameters of the rules. Learning of the rule templates themselves will be the focus of future work.

Summary of contributions
• We propose, to the best of our knowledge, the first neuro-symbolic method for entity linking (coined "LNN-EL") that provides a principled approach to learning EL rules.
• Our approach is extensible and can combine disparate types of local and global features as well as results of prior black-box neural methods, thus building on top of such approaches.
• Our approach produces interpretable rules that humans can inspect toward actionable insights.
• We evaluate our approach on three benchmark datasets and show competitive (or better) performance with SotA black-box neural approaches (e.g., BLINK (Wu et al., 2020)) even though we are constrained to using rules.
• By leveraging rules, the learned model shows a desirable transferability property: it performs well not only on the dataset on which it was trained, but also on other datasets from the same domain without further training.

Related Work
Entity Linking Models. Entity linking is a well-studied problem in NLP, especially for long text. Approaches such as (Bunescu and Pasca, 2006; Ratinov et al., 2011; Sil et al., 2012; Hoffart et al., 2011; Shen et al., 2015) use a myriad of classical ML and deep learning models to combine priors, local and global features. These techniques can, in general, be applied to short text, but the lack of sufficient context may render them ineffective. The recently proposed BLINK (Logeswaran et al., 2019; Wu et al., 2020) uses powerful transformer-based encoder architectures trained on massive amounts of data (such as Wikipedia and Wikia) to achieve SotA performance on entity disambiguation tasks, and is shown to be especially effective in zero-shot settings. BLINK is quite effective on short text (as observed in our findings); in our approach, we use BLINK both as a baseline and as a component combined into larger rules.
For short-text EL, some prior works (Sakor et al., 2019; Ferragina and Scaiella, 2012; Mendes et al., 2011) address the joint problem of mention detection and linking, with primary focus on identifying mention spans, while linking is done via heuristic methods without learning. (Sakor et al., 2019) also jointly extracts relation spans, which aid overall linking performance. The recent ELQ (Li et al., 2020) extends BLINK to jointly learn mention detection and linking. In contrast, we focus solely on linking and take a different strategy based on combining logic rules with learning. This facilitates a principled way of combining multiple types of EL features with interpretability, while learning with promising gradient-based techniques.

Rule-based Learning. FOL rules and learning have been successfully applied in some NLP tasks as well as other domains. Of these, the task closest to ours is entity resolution (ER), the task of linking two entities across two structured datasets. In this context, works like (Chaudhuri et al., 2007; Arasu et al., 2010; Wang et al., 2012; Hernández et al., 2013) use FOL rules for ER. Approaches such as (Singla and Domingos, 2006) and (Pujara and Getoor, 2016) induce probabilistic rules using MLNs (Richardson and Domingos, 2006) and PSL (Bach et al., 2017), respectively. None of these approaches use recent advances in neural-based learning; moreover, they focus on entity resolution, a task related to but distinct from short-text EL.
Given text T, a set M = {m_1, m_2, ...} of mentions, where each m_i is contained in T, and a knowledge graph (KG) comprising a set E of entities, entity linking is a many-to-one function that links each mention m_i ∈ M to an entity e_ij ∈ C_i, where C_i ⊆ E is a subset of relevant candidates for mention m_i. More generally, we formulate the problem as a ranking of the candidates in C_i so that the "correct" entity for m_i is ranked highest. Following existing approaches (e.g., (Sakor et al., 2019; Wu et al., 2020)), we use off-the-shelf lookup tools such as DBpedia Lookup to retrieve the top-100 candidates for each mention. While this service is specific to DBpedia, we assume that similar services exist or can be implemented on top of other KGs.

Logical Neural Networks
Fueled by the rise in complexity of deep learning, there has recently been a push towards learning interpretable models (Guidotti et al., 2018; Danilevsky et al., 2020). While linear classifiers and decision lists/trees may also be considered interpretable, rules expressed in first-order logic (FOL) form a much more powerful, closed language that offers semantics clear enough for human interpretation and a larger range of operators facilitating the expression of richer models. To learn such rules, neuro-symbolic AI typically substitutes conjunctions (disjunctions) with differentiable t-norms (t-conorms) (Esteva and Godo, 2001). However, since these norms do not have any learnable parameters (more details in Appendix A.1), their behavior cannot be adjusted, limiting their ability to model the data well.
In contrast, logical neural networks (LNN) (Riegel et al., 2020) offer operators that include parameters, thus allowing the operators to better fit the data. To maintain the crisp semantics of FOL, LNNs enforce constraints when learning operators such as conjunction. Concretely, LNN-∧ is expressed as:

LNN-∧(x, y) = max(0, min(1, β − w_1(1 − x) − w_2(1 − y)))

subject to:

β − (1 − α)(w_1 + w_2) ≥ α    (1)
β − α·w_1 ≤ 1 − α    (2)
β − α·w_2 ≤ 1 − α    (3)

where w_1, w_2 ≥ 0 are learnable weights, β is a learnable bias, and α ∈ [1/2, 1] is a hyperparameter. Note that max(0, min(1, ·)) clamps the output of LNN-∧ between 0 and 1 regardless of β, w_1, w_2, x, and y. The more interesting aspects are in the constraints. While Boolean conjunction only returns 1 (true) when both inputs are 1, LNNs relax this condition by using α as a proxy for 1 (and conversely, 1 − α as a proxy for 0). In particular, Constraint (1) forces the output of LNN-∧ to be greater than α when both inputs are greater than α. Similarly, Constraints (2) and (3) constrain the behavior of LNN-∧ when one input is low and the other is high. For instance, Constraint (2) forces the output of LNN-∧ to be less than 1 − α for y = 1 and x ≤ 1 − α. This formulation allows for unconstrained learning when x, y ∈ [1 − α, α]. By changing α, a user can control how much learning to enable (increase α to make the region of unconstrained learning wider, or decrease it for the opposite). Unlike the product t-norm, which increases only slowly with increasing x and y, LNN-∧ produces a high output as soon as both inputs are ≥ α and stays high thereafter, thus closely modeling Boolean conjunction semantics.
In case the application requires even more degrees of freedom, the hard constraints (1), (2) and (3) can be relaxed via the inclusion of slacks:

β − (1 − α)(w_1 + w_2) + Δ ≥ α
β − α·w_1 − δ_1 ≤ 1 − α
β − α·w_2 − δ_2 ≤ 1 − α

where δ_1, δ_2 and Δ denote (non-negative) slack variables. If any of Constraints (1), (2) and (3) in LNN-∧ are unsatisfied, the slacks help correct the direction of the inequality without putting pressure on the parameters w_1, w_2, and β during training. For the rest of the paper, LNN-∧ refers to this relaxed formulation. LNN negation is a pass-through operator, LNN-¬(x) = 1 − x, and LNN disjunction is defined in terms of LNN-∧ via De Morgan's law:

LNN-∨(x, y) = 1 − LNN-∧(1 − x, 1 − y)

While vanilla backpropagation cannot handle linear inequality constraints such as Constraint (1), specialized learning algorithms are available within the LNN framework. For more details, please check Riegel et al. (2020).
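The operators above can be sketched in a few lines of Python. This is a minimal, non-differentiable illustration only: the actual LNN framework learns w_1, w_2 and β under these constraints with specialized gradient-based algorithms. The parameter setting w_1 = w_2 = 4, β = 3.1 is one feasible choice at α = 0.7, picked by hand for illustration.

```python
def lnn_and(x, y, w1=1.0, w2=1.0, beta=1.0):
    """LNN conjunction: a weighted, clamped real-valued AND (Riegel et al., 2020)."""
    return max(0.0, min(1.0, beta - w1 * (1.0 - x) - w2 * (1.0 - y)))

def lnn_not(x):
    """LNN negation is a pass-through operator."""
    return 1.0 - x

def lnn_or(x, y, w1=1.0, w2=1.0, beta=1.0):
    """LNN disjunction via De Morgan's law over LNN-AND."""
    return 1.0 - lnn_and(1.0 - x, 1.0 - y, w1, w2, beta)

def satisfies_constraints(w1, w2, beta, alpha=0.7, eps=1e-9):
    """Check Constraints (1)-(3); eps absorbs floating-point noise."""
    return (beta - (1 - alpha) * (w1 + w2) >= alpha - eps   # (1) high, high -> high
            and beta - alpha * w1 <= 1 - alpha + eps        # (2) low x -> low output
            and beta - alpha * w2 <= 1 - alpha + eps)       # (3) low y -> low output
```

Note that the default Łukasiewicz-style setting (w_1 = w_2 = β = 1) violates Constraint (1) at α = 0.7, which is exactly why the constrained learning problem is non-trivial.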

LNN-EL
An overview of our neuro-symbolic approach for entity linking is depicted in Figure 3. We first discuss the feature generation component, which produces features using a catalog of feature functions (Section 4.1), and then the proposed model, which performs neuro-symbolic learning over a user-provided EL algorithm (Section 4.2).
Given the input text T, we assume labeled data in the form (m_i, C_i, L_i), where m_i ∈ M is a mention in T, C_i is a list of candidate entities e_ij (drawn from lookup services) for the mention m_i, and each l_ij ∈ L_i denotes a link/not-link label for the pair (m_i, e_ij). The first step is to generate, for each pair (m_i, e_ij), a set of feature values {f_k(m_i, e_ij)}, where each f_k is a feature function drawn from a catalog F of user-provided functions.

Feature Functions
Our collection of feature functions includes both non-embedding and embedding-based functions.
Non-embedding based.We include here a multitude of functions (see Table 1) that measure the similarity between the mention m i and the candidate entity e ij based on multiple types of scores.
Name: a set of general-purpose string similarity functions such as Jaccard, Jaro-Winkler, Levenshtein, Partial Ratio, etc., used to compute the similarity between m_i and e_ij's name.
Context: aggregated similarity of m_i's context to the description of e_ij. Here, we consider the list of all other mentions m_k ∈ M (k ≠ i) as m_i's context, together with e_ij's textual description obtained using KG resources. The exact formula we use is shown in Table 1, where Partial Ratio (pr) measures the similarity between each context mention and the description. (Partial Ratio computes the maximum similarity between a short input string and substrings of a second, longer string.) For normalizing the final score, we apply a min-max rescaling over all entities e_ij ∈ C_i.

Figure 3: Overview of our approach. Learnable parameters: θ_i (feature thresholds), fw_i (feature weights), rw_i (rule weights).
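The Context feature can be sketched as follows. The `partial_ratio` here is a simplified, position-aligned stand-in for the fuzzy-matching Partial Ratio (a real implementation would use a fuzzy string matching library), and the mentions and entity descriptions are invented for illustration.

```python
def partial_ratio(short: str, long: str) -> float:
    """Best match ratio between `short` and any equal-length substring of `long`
    (a simplified stand-in for fuzzy-matching Partial Ratio)."""
    short, long = short.lower(), long.lower()
    n = len(short)
    if n == 0 or len(long) < n:
        return 0.0
    best = 0
    for i in range(len(long) - n + 1):
        window = long[i:i + n]
        best = max(best, sum(a == b for a, b in zip(short, window)))
    return best / n

def context_scores(mentions, descriptions):
    """Ctx(m_i, e_ij) = sum over other mentions m_k of pr(m_k, e_ij.desc),
    min-max rescaled over each mention's candidate set C_i."""
    out = {}
    for m_i in mentions:
        others = [m for m in mentions if m != m_i]
        raw = {e: sum(partial_ratio(m_k, desc) for m_k in others)
               for e, desc in descriptions[m_i].items()}
        lo, hi = min(raw.values()), max(raw.values())
        out[m_i] = {e: (v - lo) / (hi - lo) if hi > lo else 0.0
                    for e, v in raw.items()}
    return out
```

On the running example, the film candidate's description mentions the co-occurring mention Cameron, so it dominates after rescaling.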
Type: the overlap similarity of mention m_i's type to e_ij's domain (class) set, similar to the domain-entity coherence score proposed in (Nguyen et al., 2014). Unlike in (Nguyen et al., 2014), instead of using a single type for all mentions in M, we obtain type information for each mention m_i using a trained BERT-based entity type detection model. We use KG resources to obtain e_ij's domain set, similar to the Context similarity.
Entity Prominence: measures the prominence of entity e_ij as the number of entities that link to e_ij in the target KG, i.e., indegree(e_ij). As with the Context score normalization, we apply min-max rescaling over all entities e_ij ∈ C_i.
Embedding based.We also employ a suite of pretrained or custom trained neural language models to compute the similarity of m i and e ij .
Pre-trained Embedding Models. These include SpaCy's semantic similarity function, which uses GloVe (Pennington et al., 2014) trained on Common Crawl. In addition to SpaCy, we also use scores from an existing entity linking system such as BLINK (Wu et al., 2020).

Custom-Trained Embedding Models. We additionally train a small EL model in which the mention m_i is encoded with BERT while the candidate entity is represented via a pre-trained graph embedding Wiki2Vec (Yamada et al., 2020), i.e., e_ij = Wiki2Vec(e_ij). The candidates are ranked in order of their cosine similarity to m_i, i.e., Sim_cos(m_i, e_ij). This mini EL model is optimized with a margin ranking loss so that the correct candidate is ranked higher.
BERT with Box Embeddings. While features such as Context (see Table 1) can exploit other mentions appearing within the same piece of text, they only do so via textual similarity. A more powerful method is to jointly disambiguate the mentions in text to the actual entities in the KG, thus exploiting the structural context in the KG. Intuitively, the simultaneous linking of co-occurring mentions in text to related entities in the KG is a way to reinforce the links for each individual mention. To this end, we adapt the recent Query2Box (Ren et al., 2020), whose goal is to answer FOL queries over a KG. The main idea there is to represent sets of entities (i.e., queries) as contiguous regions in the embedding space (e.g., axis-parallel hyper-rectangles, or boxes), thus reducing logical operations to geometric operations (e.g., intersection).
Since Query2Box assumes a well-formed query as input, one complication in directly applying it to our setting is that we lack the information necessary to form such an FOL query. For instance, in the example from Section 1, while we may assume that the correct entities for our Cameron and Titanic mentions are connected in the KG, we do not know how they are connected, i.e., via which relation. To circumvent this challenge, we introduce a special neighborhood relation N, such that v ∈ N(u) whenever there is some KG relation from entity u to entity v. We then define two box operations. The first represents mention m_i as a box, namely the smallest box that contains the set C_i of candidate entities for m_i; this is achieved by computing the dimension-wise minimum (maximum) of all entity embeddings in C_i to obtain the lower-left (upper-right) corner of the resulting box. The second operation takes m_i's box and produces the box containing its neighbors in the KG. Query2Box achieves this by representing Box_N via a center vector ψ and an offset vector ω, both of which are learned parameters. The box of neighbors is then obtained by translating the center of m_i's box by ψ and adding the offset ω to its sides.
Figure 4 shows how these operations are used to disambiguate Titanic while exploiting the co-occurring mention Cameron and the KG structure. We take the box for Cameron, compute its neighborhood box, then intersect it with the Titanic box. This intersection contains valid entities that can disambiguate Titanic and are connected to the entity for Cameron. For the actual score of each such entity, we take its distance to the center of the intersection box and convert it to a similarity score Sim_box(m_i, e_ij). We then linearly combine this with the BERT-based similarity measure:

Sim(m_i, e_ij) = β_box · Sim_box(m_i, e_ij) + (1 − β_box) · Sim_cos(m_i, e_ij)

where β_box is a hyperparameter that adjusts the relative importance of the two scores. The approach described can be easily extended to more than two mentions.
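A geometric sketch of these box operations in NumPy, on a toy 2-D embedding space: the coordinates, ψ and ω below are hand-picked for illustration, whereas in Query2Box ψ and ω (and the entity embeddings) are learned.

```python
import numpy as np

def mention_box(entity_embs):
    """Smallest axis-parallel box containing a mention's candidate entities."""
    E = np.asarray(entity_embs)
    return E.min(axis=0), E.max(axis=0)   # (lower-left, upper-right) corners

def neighbor_box(box, psi, omega):
    """Translate the box center by psi and widen each side by omega."""
    lo, hi = box
    center = (lo + hi) / 2 + psi
    half = (hi - lo) / 2 + omega
    return center - half, center + half

def intersect(b1, b2):
    """Box intersection (empty if any lower corner exceeds the upper one)."""
    return np.maximum(b1[0], b2[0]), np.minimum(b1[1], b2[1])

def box_similarity(entity_emb, box):
    """Similarity decays with distance from the box center."""
    lo, hi = box
    return 1.0 / (1.0 + np.linalg.norm(entity_emb - (lo + hi) / 2))
```

Candidates that fall inside (or near the center of) the intersection of the Titanic box with Cameron's neighborhood box receive the highest Sim_box scores.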

Model
In this section, we describe how an EL algorithm composed of a disjunctive set of rules is reformulated into the LNN representation for learning.

Entity Linking Rules are a restricted form of FOL rules comprising a set of Boolean predicates connected via the logical operators conjunction (∧) and disjunction (∨). A Boolean predicate has the form f_k > θ, where f_k ∈ F is one of the feature functions and θ is either a user-provided or a learned threshold in [0, 1]. Figure 5(a) shows two example rules R_1 and R_2, where, for instance, R_1(m_i, e_ij) evaluates to True if both the predicate jacc(m_i, e_ij) > θ_1 and Ctx(m_i, e_ij) > θ_2 are True. Rules can be disjuncted together to form a larger EL algorithm, as shown in Figure 5(b), which states that Links(m_i, e_ij) evaluates to True if any one of its rules evaluates to True.
The Links predicate is meant to store high-quality links between mentions and candidate entities that pass the conditions of at least one rule. The EL algorithm also acts as a scoring mechanism. In general, there are many ways in which scores can be computed. In a baseline implementation (no learning), we use the scoring function in Figure 5(c), where rw_i denote manually assigned rule weights and fw_i denote manually assigned feature weights. An EL algorithm is an explicit and extensible description of the entity linking logic, which can be easily understood and manipulated by users. However, obtaining performance competitive with deep learning approaches such as BLINK (Wu et al., 2020) requires a significant amount of manual effort to fine-tune the thresholds θ_i, the feature weights fw_i and the rule weights rw_i.

Table 2: Characteristics of the datasets (questions Q and entity mentions E, for the train and test splits).

Dataset | Train Q | Train E | Test Q | Test E
LC-QuAD 1.0 (Trivedi et al., 2017) | 4,000 | 6,823 | 1,000 | 1,721
QALD-9 (Usbeck et al., 2018) | 408 | 568 | 150 | 174
WebQSP_EL (Li et al., 2020) | 2,974 | 3,237 | 1,603 | 1,798

LNN Reformulation. To facilitate learning of the thresholds and weights in an EL algorithm, we map the Boolean-valued logic rules into the LNN formalism, where the LNN constructs LNN-∨ (for logical OR) and LNN-∧ (for logical AND) allow for continuous real-valued numbers in [0, 1]. As described in Section 3.2, LNN-∧ and LNN-∨ are weighted real-valued versions of the classical logical operators, where a hyperparameter α is used as a proxy for 1. Each LNN operator produces a value in [0, 1] based on the values of the inputs, their weights and the bias β; both the weights and β are learnable parameters. The score of each link is based on the score that the LNN operators give, with an added complication related to how we score the feature functions. To illustrate, for the EL rules in Figure 5, the score of a link is computed as:

score(Links(m_i, e_ij)) = LNN-∨(LNN-∧(score(jacc(m_i, e_ij) > θ_1), score(Ctx(m_i, e_ij) > θ_2)), LNN-∧(score(f_3 > θ_3), score(f_4 > θ_4)))

where f_3 and f_4 stand for the feature predicates of R_2 in Figure 5(a). Here the top-level LNN-∨ represents the disjunction R_1 ∨ R_2, while the two inner LNN-∧ capture the rules R_1 and R_2, respectively. For the feature functions with thresholds, a natural
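This nested scoring can be sketched as follows, using unweighted LNN operators with default parameters and the hard (no-learning) threshold scoring of the baseline; the four features and thresholds are illustrative placeholders rather than a specific rule set from the paper.

```python
def lnn_and(x, y, w1=1.0, w2=1.0, beta=1.0):
    """Clamped weighted real-valued AND."""
    return max(0.0, min(1.0, beta - w1 * (1.0 - x) - w2 * (1.0 - y)))

def lnn_or(x, y, w1=1.0, w2=1.0, beta=1.0):
    """Real-valued OR via De Morgan's law."""
    return 1.0 - lnn_and(1.0 - x, 1.0 - y, w1, w2, beta)

def hard_score(f, theta):
    """Baseline predicate scoring: keep the feature value if it clears theta."""
    return f if f > theta else 0.0

def link_score(feats, thetas):
    """score(R1 v R2) with R1 = (f1 > t1) AND (f2 > t2), R2 = (f3 > t3) AND (f4 > t4)."""
    f1, f2, f3, f4 = feats
    t1, t2, t3, t4 = thetas
    r1 = lnn_and(hard_score(f1, t1), hard_score(f2, t2))
    r2 = lnn_and(hard_score(f3, t3), hard_score(f4, t4))
    return lnn_or(r1, r2)
```

A candidate that strongly satisfies one rule gets a high score even when the other rule fires not at all, matching the disjunctive semantics of the EL algorithm.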
scoring mechanism would be to use score(f > θ) = f if f > θ else 0, which filters out the candidates that do not satisfy the condition f > θ and gives a non-zero score to the candidates that pass it. However, since this is a step function that breaks gradient flow through the network, we approximate it with a smooth, sigmoid-gated function of f − θ, where σ is the sigmoid function and θ is a learnable threshold generated as θ = σ(γ), to ensure that it lies in [0, 1].
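The contrast between the hard step and a smooth surrogate can be sketched as below. The exact gating form f · σ(k(f − θ)) and the sharpness constant k are illustrative assumptions; what matters is that the surrogate is differentiable in both f and γ while closely tracking the step function.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def hard_score(f: float, theta: float) -> float:
    """Step-function scoring: breaks gradient flow at f = theta."""
    return f if f > theta else 0.0

def smooth_score(f: float, gamma: float, k: float = 25.0) -> float:
    """Differentiable surrogate: gate f by a sigmoid around theta = sigmoid(gamma).
    gamma is the unconstrained learnable parameter; k controls sharpness."""
    theta = sigmoid(gamma)          # keeps the threshold in [0, 1]
    return f * sigmoid(k * (f - theta))
```

With γ = 0 (θ = 0.5), a feature value of 0.9 keeps nearly all its mass while 0.1 is suppressed to almost zero, approximating the hard filter without the discontinuity.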
Training. We train the LNN-formulated EL rules over the labeled data, performing gradient descent on a margin ranking loss over all the candidates in C_i. The loss function L(m_i, C_i) for mention m_i and candidate set C_i is defined as

L(m_i, C_i) = Σ_{e_in ∈ C_i \ {e_ip}} max(0, µ − score(m_i, e_ip) + score(m_i, e_in))

Here, e_ip ∈ C_i is the positive candidate, C_i \ {e_ip} is the set of negative candidates, and µ is a margin hyperparameter. The positive and negative labels are obtained from the given labels L_i (see Figure 3).

Inference. Given mention m_i and candidate set C_i, as in training, we generate features for each mention-candidate pair (m_i, e_ij) in the feature generation step. We then pass them through the learned LNN network to obtain final scores for each candidate entity in C_i, as shown in Figure 3.
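The margin ranking loss above can be sketched directly (the candidate names and scores below are made up for illustration; in practice one would use a framework loss such as PyTorch's `MarginRankingLoss` so gradients flow back through the LNN scores):

```python
def margin_ranking_loss(scores: dict, positive: str, mu: float = 0.6) -> float:
    """L(m_i, C_i): hinge loss pushing the positive candidate's score above
    every negative candidate's score by at least the margin mu."""
    pos = scores[positive]
    return sum(max(0.0, mu - pos + s)
               for cand, s in scores.items() if cand != positive)
```

The loss is zero exactly when the positive candidate beats every negative one by at least µ, which is what makes the top-ranked candidate the predicted link at inference time.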

Evaluation
We first evaluate our approach with respect to performance and extensibility, then interpretability and transferability. We also discuss training and inference times.
Datasets. As shown in Table 2, we consider three short-text QA datasets. LC-QuAD and QALD-9 comprise questions (Q) over DBpedia together with their corresponding SPARQL queries; we extract entities (E) from the SPARQL queries and manually annotate mention spans. The WebQSP_EL dataset (Li et al., 2020) comes with both mention spans and links to the correct entities. Since the target KG for WebQSP is Wikidata, we translate each Wikidata entity to its DBpedia counterpart using DBpedia Mappings. In addition, we discard mentions that link to DBpedia concepts (e.g., heaviest player linked to dbo:Person) and mentions m_i with an empty candidate set (i.e., C_i = ∅) or with all not-link labels.

Baselines. We compare our approach to (1) BLINK (Wu et al., 2020), the current state-of-the-art on both short-text and long-text EL; (2) three BERT-based models: (a) BERT, where both mention and candidate entity embeddings are obtained via a BERT_base pre-trained encoder, similar to (Gillick et al., 2019); (b) BERTWiki, where mention embeddings are obtained from BERT_base, while the candidate entity embedding comes from pre-trained Wiki2Vec (Yamada et al., 2020); and (c) Box, i.e., BERTWiki embeddings fine-tuned with Query2Box embeddings (see Section 4.1). In addition to the aforementioned black-box neural models, we also compare our approach to (3) two logistic regression models that use the same feature set as LNN-EL: LogisticRegression (without BLINK) and LogisticRegression_BLINK (with BLINK).
Furthermore, we use the following variants of our approach: (4) RuleEL, a baseline rule-based EL approach with manually defined weights and thresholds; (5) LogicEL, a baseline built on RuleEL where only the thresholds are learnable, based on the product t-norm (see Section 3.2); (6) LNN-EL, our core LNN-based method using non-embedding features plus SpaCy; and (7) LNN-EL_ens, an ensemble combining the core LNN-EL with additional features from existing EL approaches, namely BLINK and Box (we consider Box, as it outperforms BERT and BERTWiki on all datasets). Detailed rule templates are provided in Appendix A.3.
Setup. All the baselines are trained for 30 epochs, except for BLINK, which we use as a zero-shot approach. For the BERT approaches, we use BERT_base as the pre-trained model. We used two Nvidia V100 GPUs with 16GB memory each. We perform hyperparameter search for the margin µ and the learning rate in the ranges [0.6, 0.95] and [10^−5, 10^−1], respectively.

Table 3: Performance comparison of various baselines with our neuro-symbolic variants.

Results
Overall Performance. As seen in Table 3, among the logic-based approaches, LNN-EL outperforms LogicEL and RuleEL, showing that parameterized real-valued LNN learning is more effective than both the non-parameterized t-norm version (LogicEL) and the manually tuned RuleEL. The logistic regression models, which also learn weights over features, achieve performance competitive with the LNN-EL models; however, they lack the representational power that LNN-EL offers in the form of logical rules comprising conjunctions and disjunctions.
In other words, LNN-EL allows learning over a richer space of models, which helps achieve the better performance observed in Table 3. On the other hand, the simple BERT-based approaches (BERT, BERTWiki, Box) that are trained on the QA datasets underperform the logic-based approaches, which incorporate finer-grained features. BLINK (also a BERT-based approach, but trained on the entire Wikipedia) is used as a zero-shot approach and achieves SotA performance (when not counting the LNN-EL variants). The core LNN-EL version is competitive with BLINK on LC-QuAD and QALD-9, despite being a rule-based approach. Furthermore, LNN-EL_ens, which combines the core LNN-EL with both BLINK and Box features, clearly beats BLINK on LC-QuAD and QALD-9 and slightly outperforms it on WebQSP_EL.
Table 4 shows the Recall@k performance of LNN-EL against the BLINK model. Both LNN-EL and LNN-EL_ens have better Recall@k than BLINK on the LC-QuAD and QALD-9 datasets; however, BLINK's Recall@k is slightly better on the WebQSP_EL dataset.

Extensibility. Here, we inspect empirically how a multitude of EL features coming from various black-box approaches can be combined in a principled way with LNN-EL, often leading to better overall performance than the individual approaches. A detailed ablation study of the core LNN-EL version is provided in Appendix A.2.

Transferability. To study the transferability aspect, we train LNN-EL on one dataset and evaluate the model on the other two, without any fine-tuning. We use the core LNN-EL variant for this, but similar properties hold for the other variants.
Table 6 shows F1 scores on different train-test configurations, with the diagonal (underlined numbers) denoting the F1 score when trained and tested on the same dataset. We observe that LNN-EL transfers reasonably well, even in cases where training is done on a very small dataset. For example, when we transfer from QALD-9 (with only a few hundred questions to train on) to WebQSP_EL, we obtain an F1 score of 83.06, which is within 2 percentage points of the F1 score when trained directly on WebQSP_EL. We remark that the zero-shot BLINK by design has very good transferability and achieves F1 scores of 87.04, 89.14, 92.10 on LC-QuAD, QALD-9, WebQSP_EL respectively. However, BLINK is trained on the entire Wikipedia, while LNN-EL needs much less data to achieve reasonable transfer performance.

Conclusions
We introduced LNN-EL, a neuro-symbolic approach for entity linking on short text. Our approach complements human-given rule templates through neural learning and achieves competitive performance against SotA black-box neural models, while exhibiting interpretability and transferability without requiring a large amount of labeled data. While LNN-EL provides an extensible framework where one can easily add and test new features in existing rule templates, currently this is done manually. A future direction is to automatically learn the rules with the optimal combinations of features.

A Appendix
A.1 t-norm and t-conorm

While linear classifiers and decision lists/trees may also be considered interpretable, rules expressed in first-order logic (FOL) form a much more powerful, closed language that offers semantics clear enough for human interpretation and a larger range of operators facilitating the expression of richer models. For example, while linear classifiers such as logistic regression can only express a (weighted) sum of features, which is similar to logic's disjunction (∨) operator, logic also contains other operators including, but not limited to, conjunction (∧) and negation (¬).
As opposed to inductive logic programming (Muggleton, 1996) and statistical relational learning (Getoor and Taskar, 2007), neuro-symbolic AI utilizes neural networks to learn rules. Toward achieving this, the first challenge to overcome is that classical Boolean logic is non-differentiable and thus not amenable to gradient-based optimization (e.g., backpropagation). To address this, neuro-symbolic AI substitutes conjunctions (disjunctions) with differentiable t-norms (t-conorms) (Esteva and Godo, 2001). For example, the product t-norm, used in multiple neuro-symbolic rule learners (Evans and Grefenstette, 2018; Yang et al., 2017), is given by x ∧ y ≡ xy, where x, y ∈ [0, 1] denote input features in real-valued logic. The product t-norm agrees with Boolean conjunction at the extremities, i.e., when x and y are set to 0 (false) or 1 (true). However, when x, y ∈ (0, 1), its behavior is governed by the product function. More importantly, since it has no learnable parameters, this behavior cannot be adjusted, which limits how well it can model the data.
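The product t-norm and its dual t-conorm (the probabilistic sum) can be written down directly; the sketch below just illustrates that they agree with Boolean conjunction/disjunction at the extremities while having no parameters to tune:

```python
def t_norm_product(x: float, y: float) -> float:
    """Product t-norm: a differentiable AND with no learnable parameters."""
    return x * y

def t_conorm_product(x: float, y: float) -> float:
    """Dual t-conorm (probabilistic sum): a differentiable OR."""
    return x + y - x * y
```

In the interior of [0, 1], e.g. x = y = 0.5, the conjunction yields 0.25 regardless of the data, which is precisely the inflexibility that the parameterized LNN-∧ of Section 3.2 addresses.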

A.2 Ablation Study
To understand the role of each rule in LNN-EL, we also conduct an ablation study on the largest benchmark dataset, LC-QuAD (see Table 8). We observe that Context is the most performant rule on its own. Although PureName alone is behind the other two rules, PureName + Context improves the performance of Context by 1%. Meanwhile, Context + Type improves Context's performance by only 0.05%. Interestingly, the combination of all three rules performs slightly worse than PureName + Context, by 0.35%. These results show that the Type rule is the least important of the three. To be consistent with the RuleEL system, we use the "PureName + Context + Type" setting for LNN-EL in our experiments. Additionally, we show the transferability of LR in Table 9; this should be compared with the corresponding LNN-EL results in Table 6. In particular, we observe that LNN-EL outperforms LR in 4 out of 6 transferability tests, demonstrating that LNN-EL has superior transferability.

A.3 LNN-EL Rules
In our experiments, we explore the following rule modules, implemented in PyTorch: the Name rule, the Context rule, the Type rule, and the BLINK rule.

Figure 4: Candidates for linking the 'Titanic' mention appear in the intersection of the two boxes.

Figure 5: Example of entity linking rules and scoring.
A human expert can gain insights into this behavior by looking at the feature weights in each model. In Figure 6 (left), the disjunction tree with the Box feature is given a low weight of 0.26, thus discounting some of the other useful features in the same tree. Removal of the Box feature leads to a re-weighting of the features in the model; the modified disjunction tree (Figure 6 (right)) now has a weight of 0.42. Such visualization can help the rule designer judiciously select features to combine toward building a performant model.

Table 1: Non-embedding based feature functions. The Context feature is Ctx(m_i, e_ij) = Σ_{m_k ∈ M\{m_i}} pr(m_k, e_ij.desc), where m_k is a mention in the context of m_i.

Table 2: Characteristics of the datasets.

Table 6: F1 scores of LNN-EL in transfer settings.

A notable case is that of LNN-EL+BLINK and LNN-EL_ens trained on the WebQSP_EL dataset, where we observed that LNN-EL_ens's performance is inferior to that of LNN-EL+BLINK even though the former model has more features.

Table 7: Time per question for candidate & feature generation, along with train and inference time per question for LNN-EL_ens. All numbers are in seconds.