Encoding Semantic Resources in Syntactic Structures for Passage Reranking

In this paper, we propose to use semantic knowledge from Wikipedia and large-scale structured knowledge datasets available as Linked Open Data (LOD) for the answer passage reranking task. We represent question and candidate answer passages with pairs of shallow syntactic/semantic trees, whose constituents are connected using LOD. The trees are processed by SVMs and tree kernels, which can automatically exploit tree fragments. The experiments with our SVM rank algorithm on the TREC Question Answering (QA) corpus show that the added relational information substantially improves over the state of the art, e.g., a relative improvement of about 15.4% in P@1.


Introduction
Past work in TREC QA, e.g., (Voorhees, 2001), and more recent work in QA (Ferrucci et al., 2010) have shown that, to achieve human performance, QA systems must utilize semantic resources, e.g., Wikipedia 1. This requires the design of rules or machine learning features that exploit such knowledge while also satisfying syntactic constraints, e.g., the semantic type of the answer must match the question focus words. The engineering of such rules for open-domain QA is typically very costly. For instance, to automatically derive the correctness of the answer passage in the following question/answer passage (Q/AP) pair (from the TREC corpus 2):
Q: What company owns the soft drink brand "Gatorade"?
A: Stokely-Van Camp bought the formula and started marketing the drink as Gatorade in 1967. Quaker Oats Co. took over Stokely-Van Camp in 1983.
we would need to write the following complex rules: is(Quaker Oats Co., company), own(Stokely-Van Camp, Gatorade), took over(Quaker Oats Co., Stokely-Van Camp), took over(Y, Z)→own(Z, Y), and carry out logic unification and resolution. Therefore, approaches that can automatically generate patterns (i.e., features) from syntactic and semantic representations of the Q/AP are needed. In this respect, our previous work, e.g., (Moschitti et al., 2007; Moschitti and Quarteroni, 2008; Moschitti, 2009), has shown that tree kernels for NLP, e.g., (Moschitti, 2006), can exploit syntactic patterns for answer passage reranking, significantly improving search engine baselines. Our more recent work (Severyn and Moschitti, 2012; Severyn et al., 2013b; Severyn et al., 2013a) has shown that using automatically produced semantic labels in shallow syntactic trees, such as question category and question focus, can further improve passage reranking and answer extraction (Severyn and Moschitti, 2013).
1 http://www.wikipedia.org
2 It will be our running example for the rest of the paper.
However, such methods cannot solve the class of examples above as they do not use background knowledge, which is essential to answer complex questions. On the other hand, Kalyanpur et al. (2011) and Murdock et al. (2012) showed that semantic match features extracted from large-scale background knowledge sources, including the LOD ones, are beneficial for answer reranking.
In this paper, we tackle the candidate answer passage reranking task. We define kernel functions that can automatically learn structural patterns enriched by semantic knowledge, e.g., from LOD. For this purpose, we carry out the following steps. First, we design a representation for the Q/AP pair by engineering a pair of shallow syntactic trees connected with relational nodes (i.e., those matching the same words in the question and in the answer passages). Second, we use YAGO (Suchanek et al., 2007), DBpedia (Bizer et al., 2009) and WordNet (Fellbaum, 1998) to match constituents from Q/AP pairs and use their generalizations in our syntactic/semantic structures. We employ word sense disambiguation to match the right entities in YAGO and DBpedia, and consider all senses of an ambiguous word from WordNet.
Finally, we experiment with TREC QA and several models combining traditional feature vectors with automatic semantic labels derived by statistical classifiers and relational structures enriched with LOD relations. The results show that our methods improve over a strong IR baseline, e.g., BM25, by 96%, and over our previous state-of-the-art reranking models by up to 15.4% (relative improvement) in P@1.

Reranking with Tree Kernels
In contrast to ad-hoc document retrieval, structured representation of sentences and paragraphs helps to improve question answering (Bilotti et al., 2010). Typically, rules considering syntactic and semantic properties of the question and its candidate answer are handcrafted. Their modeling is in general time-consuming and costly. In contrast, we rely on machine learning and automatic feature engineering with tree kernels. We used our state-of-the-art reranking models, i.e., (Severyn et al., 2013b; Severyn et al., 2013a), as a baseline. The major difference from that approach is that we encode knowledge and semantics in different ways, using knowledge from LOD. The next sections outline our new kernel-based framework, while the most innovative aspects, such as the new LOD-based representations, are described in detail in Section 3.

Framework Overview
Our QA system is based on a rather simple reranking framework as displayed in Figure 1: given a question Q, a search engine retrieves a list of candidate APs ranked by their relevancy. Next, the question together with its APs is processed by a rich NLP pipeline, which performs basic tokenization, sentence splitting, lemmatization, and stopword removal. Various NLP components, embedded in the pipeline as UIMA 3 annotators, perform more involved linguistic analysis, e.g., POS tagging, chunking, NE recognition, constituency and dependency parsing, etc.
Each Q/AP pair is processed by a Wikipedia link annotator. It automatically recognizes n-grams in plain text that may be linked to Wikipedia and disambiguates them to Wikipedia URLs. Given that question passages are typically short, we concatenate them with the candidate answers to provide a larger disambiguation context to the annotator.
These annotations are then used to produce computational structures (see Sec. 2.2) that are input to the reranker. The semantics of such relational structures can be further enriched by adding links between Q/AP constituents. Such relational links can also be generated by: (i) matching lemmas as in (Severyn and Moschitti, 2012); (ii) matching the question focus type derived by the question classifiers with the type of the target NE as in (Severyn et al., 2013a); or (iii) matching the constituent types based on LOD (proposed in this paper). The resulting pairs of trees connected by semantic links are then used to train a kernel-based reranker, which is used to re-order the retrieved answer passages.

Relational Q/AP structures
We use the shallow tree representation that we proposed in (Severyn and Moschitti, 2012) as a baseline structural model. In more detail, each Q and its candidate AP are encoded into two trees, where lemmas constitute the leaf level, the part-of-speech (POS) tags are at the pre-terminal level, and the sequences of POS tags are organized into the third level of chunk nodes. We encoded structural relations using the REL tag, which links the related structures in Q/AP when there is a match between the lemmas in Q and AP. We marked the parent (POS tag) and grandparent (chunk) nodes of such lemmas by prepending a REL tag.
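The REL marking described above can be illustrated with a minimal sketch. It assumes that passages are already chunked, POS-tagged and lemmatized; the bracketed output format, the `REL-` separator, the `S` root label and the helper names `to_tree` and `mark_rel` are illustrative assumptions rather than the exact format used in our pipeline.

```python
from typing import FrozenSet, List, Tuple

# A chunk is (chunk_label, [(pos_tag, lemma), ...]); a passage is a list of chunks.
Chunk = Tuple[str, List[Tuple[str, str]]]


def to_tree(chunks: List[Chunk], shared: FrozenSet[str], root: str = "S") -> str:
    """Serialize chunks as a bracketed shallow tree, prepending REL- where needed."""
    parts = []
    for label, tokens in chunks:
        leaves = []
        chunk_has_match = False
        for pos, lemma in tokens:
            if lemma in shared:
                chunk_has_match = True
                pos = "REL-" + pos  # mark the parent (POS) node of a matched lemma
            leaves.append(f"({pos} {lemma})")
        if chunk_has_match:
            label = "REL-" + label  # mark the grandparent (chunk) node
        parts.append(f"({label} {' '.join(leaves)})")
    return f"({root} {' '.join(parts)})"


def mark_rel(question: List[Chunk], passage: List[Chunk],
             stopwords: FrozenSet[str] = frozenset()) -> Tuple[str, str]:
    """Return the Q and AP trees with REL tags on lemmas shared by both."""
    q_lemmas = {lemma for _, toks in question for _, lemma in toks}
    p_lemmas = {lemma for _, toks in passage for _, lemma in toks}
    # Assumption: stopwords never trigger a REL link.
    shared = frozenset((q_lemmas & p_lemmas) - set(stopwords))
    return to_tree(question, shared), to_tree(passage, shared)
```

On a fragment of the running example, a shared lemma such as "drink" would turn its POS node into `REL-NN` and its chunk into `REL-NP` in both trees.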
However, more general semantic relations, e.g., derived from the question focus and category, can be encoded using the REL-FOCUS-<QC> tag, where <QC> stands for the question class. In (Severyn et al., 2013b; Severyn et al., 2013a), we used statistical classifiers to derive question focus and categories of the question and of the named entities in the AP. We again mark (i) the focus chunk in the question and (ii) the AP chunks containing named entities of a type compatible with the question class, by prepending the above tags to their labels. The compatibility between the categories of named entities and questions is evaluated with a lookup in a manually predefined mapping (see Table 1 in (Severyn et al., 2013b)). We also prune the trees by removing the nodes beyond a certain distance (in terms of chunk nodes) from the REL and REL-FOCUS nodes. This removes irrelevant information and speeds up learning and classification. We showed that such a model outperforms bag-of-words and POS-tag sequence models (Severyn et al., 2013a).
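The pruning step can be sketched as follows, under the assumption that distance is counted as the number of positions between chunks in the linear chunk sequence. The function name `prune_chunks` and the fallback for passages without marked chunks are hypothetical; the default distance of 2 is the value reported in the Experiments section.

```python
from typing import List


def prune_chunks(chunk_labels: List[str], max_dist: int = 2) -> List[int]:
    """Return the indices of chunks to keep.

    A chunk is kept if it lies within max_dist chunk positions of a chunk
    whose label starts with REL (this covers both REL and REL-FOCUS).
    """
    anchors = [i for i, label in enumerate(chunk_labels) if label.startswith("REL")]
    if not anchors:
        # Assumption of this sketch: with no relational node, nothing is pruned.
        return list(range(len(chunk_labels)))
    return [i for i in range(len(chunk_labels))
            if any(abs(i - a) <= max_dist for a in anchors)]
```

With a single marked chunk in the middle of a nine-chunk passage, only the five chunks around it survive.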
An example of a Q/AP pair encoded using shallow chunk trees is given in Figure 2. Here, for example, the lemma "drink" occurs in both Q and AP (we highlighted it with a solid line box in the figure). "Company" was correctly recognized as a focus 4; however, it was misclassified as "HUMAN" ("HUM"). As no entities of the matching type "PERSON" were found in the answer by the NER system, no chunks were marked as REL-FOCUS on the answer passage side.
We slightly modify the REL-FOCUS encoding into the tree. Instead of prepending REL-FOCUS-<QC>, we only prepend REL-FOCUS to the target chunk node, and add a new node QC as the rightmost child of the chunk node, e.g., in Figure 2, the focus node would be marked as REL-FOCUS and the sequence of its children would be [WP NN HUM]. This modification in-

LOD for Semantic Structures
We aim at exploiting semantic resources for building more powerful rerankers. More specifically, we use structured knowledge about properties of the objects referred to in a Q/AP pair. A large amount of knowledge has been made available as LOD datasets, which can be used for finding additional semantic links between Q/AP passages.
In the next sections, we (i) formally define the novel semantic links between Q/AP structures that we introduce in this paper; (ii) provide basic notions of Linked Open Data along with three of its most widely used datasets, YAGO, DBpedia and WordNet; and, finally, (iii) describe our algorithm to generate linked Q/AP structures.

Matching Q/AP Structures: Type Match
We look for token sequences (e.g., complex nominal groups) in Q/AP pairs that refer to entities and entity classes related by isa (Eq. 1) and isSubclassOf (Eq. 2) relations and then link them in the structural Q/AP representations.

\[ \mathit{isa} : \mathit{entity} \times \mathit{class} \rightarrow \{\mathit{true}, \mathit{false}\} \tag{1} \]

\[ \mathit{isSubclassOf} : \mathit{class} \times \mathit{class} \rightarrow \{\mathit{true}, \mathit{false}\} \tag{2} \]

Here, entities are all the objects in the world, both real and abstract, while classes are sets of entities that share some common features. Information about entities, classes and their relations can be obtained from external knowledge sources such as the LOD resources. isa returns true if an entity is an element of a class (false otherwise), while isSubclassOf(class1, class2) returns true if all elements of class1 also belong to class2.
We refer to the token sequences introduced above as anchors and to the entities/classes they refer to as references. We define anchors to be in a Type Match (TM) relation if the entities/classes they refer to are in an isa or isSubclassOf relation. More formally, given two anchors $a_1$ and $a_2$ belonging to two text passages, $p_1$ and $p_2$, respectively, and given a function $R(a, p)$, which returns the reference of an anchor $a$ in passage $p$, we define $TM(r_1, r_2)$ as

\[ TM(r_1, r_2) = \begin{cases} \mathit{isa}(r_1, r_2) & \text{if } \mathit{isEntity}(r_1) \wedge \mathit{isClass}(r_2) \\ \mathit{isSubclassOf}(r_1, r_2) & \text{if } \mathit{isClass}(r_1) \wedge \mathit{isClass}(r_2) \end{cases} \tag{3} \]

where $r_1 = R(a_1, p_1)$, $r_2 = R(a_2, p_2)$, and isEntity(r) and isClass(r) return true if $r$ is an entity or a class, respectively, and false otherwise. It should be noted that, due to the ambiguity of natural language, the same anchor may have different references depending on the context.
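A minimal sketch of the TM relation over a toy knowledge base may help to fix the definitions. The data below is hypothetical, and three choices are assumptions of this sketch rather than part of the definition: isSubclassOf is computed transitively, a class counts as a subclass of itself, and TM returns false when neither case applies.

```python
from typing import Dict, Set

# Toy knowledge base (hypothetical data, for illustration only).
INSTANCE_OF: Dict[str, Set[str]] = {"Quaker_Oats_Company": {"company"}}
SUBCLASS_OF: Dict[str, Set[str]] = {"company": {"organization"}, "organization": set()}


def is_entity(r: str) -> bool:
    return r in INSTANCE_OF


def is_class(r: str) -> bool:
    return r in SUBCLASS_OF


def is_subclass_of(c1: str, c2: str) -> bool:
    """True if c2 is reachable from c1 via subclass links (reflexive by assumption)."""
    seen, stack = {c1}, [c1]
    while stack:
        current = stack.pop()
        if current == c2:
            return True
        for parent in SUBCLASS_OF.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return False


def isa(entity: str, cls: str) -> bool:
    """True if the entity belongs to cls directly or through a superclass chain."""
    return any(is_subclass_of(direct, cls) for direct in INSTANCE_OF.get(entity, ()))


def type_match(r1: str, r2: str) -> bool:
    """TM(r1, r2) following the two cases of the definition above."""
    if is_entity(r1) and is_class(r2):
        return isa(r1, r2)
    if is_class(r1) and is_class(r2):
        return is_subclass_of(r1, r2)
    return False  # assumption: no match when neither case applies
```

In this toy setting, `type_match("Quaker_Oats_Company", "company")` holds, which mirrors the link between "Quaker oats" and "company" in the running example.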

LOD for linking Q/A structures
LOD consists of datasets published online according to the Linked Data (LD) principles 5 and available in open access. LOD knowledge is represented following the Resource Description Framework (RDF) 6 specification as a set of statements. A statement is a subject-predicate-object triple, where the predicate denotes the directed relation, e.g., hasSurname or owns, between subject and object. Each object described by RDF, e.g., a class or an entity, is called a resource and is assigned a Uniform Resource Identifier (URI).
LOD includes a number of common schemas, i.e., sets of classes and predicates to be reused when describing knowledge. For example, one of them is RDF Schema (RDFS) 7, which contains the predicates rdf:type and rdfs:subClassOf, similar to the isa and isSubclassOf functions above. LOD contains a number of large-scale cross-domain datasets, e.g., YAGO (Suchanek et al., 2007) and DBpedia (Bizer et al., 2009). Datasets created before the emergence of LD, e.g., WordNet, are brought into correspondence with the LD principles and added to the LOD as well.

3.2.1 Algorithm for detecting TM
Algorithm 1 detects n-grams in the Q/AP structures that are in a TM relation and encodes TM knowledge in the shallow chunk tree representations of Q/AP pairs. It takes two text passages, $P_1$ and $P_2$, and a LOD knowledge source, $LOD_{KS}$, as input. We run the algorithm twice, swapping the roles of Q and AP between the two runs. For example, in one run, $P_1$ and $P_2$ could be, according to our running example, Q and the AP candidate, respectively, and $LOD_{KS}$ could be YAGO, DBpedia or WordNet.
When $LOD_{KS}$ is YAGO or DBpedia, we benefit from the fact that both are aligned with Wikipedia at the entity level by construction, so we can use so-called wikification tools, e.g., (Milne and Witten, 2009), to detect the anchors. Wikification tools recognize n-grams that may denote Wikipedia pages in plain text and disambiguate them to obtain a unique Wikipedia page. Such tools determine whether a certain n-gram may denote a Wikipedia page by looking it up in a precomputed vocabulary created using Wikipedia page titles and the internal link network (Csomai and Mihalcea, 2008; Milne and Witten, 2009).
Obtaining references. In line 2 of Algorithm 1, for each anchor, we determine the URIs of the entities/classes it refers to in $LOD_{KS}$. Here again, we have different strategies for different $LOD_{KS}$. In the case of WordNet, we use the all-senses strategy, i.e., the getURI procedure returns the set of URIs of the synsets that contain the anchor lemma.
When $LOD_{KS}$ is YAGO or DBpedia, we use wikification tools to correctly disambiguate an anchor to a Wikipedia page. Then, Wikipedia page URLs may be converted to DBpedia URIs by replacing the en.wikipedia.org/wiki/ prefix with dbpedia.org/resource/, and to YAGO URIs by querying YAGO for the subjects of the RDF triples with yago:hasWikipediaUrl 8 as a predicate and the Wikipedia URL as an object.
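The two URL-to-URI conversions can be sketched as below. The function names are hypothetical, the YAGO lookup runs over an in-memory list of triples instead of a real triple store, and differences in character encoding between Wikipedia and DBpedia identifiers are ignored.

```python
from typing import Iterable, List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

WIKI_PREFIX = "en.wikipedia.org/wiki/"
DBPEDIA_PREFIX = "dbpedia.org/resource/"


def wikipedia_url_to_dbpedia_uri(url: str) -> str:
    """Swap the Wikipedia prefix for the DBpedia resource prefix."""
    _, found, title = url.partition(WIKI_PREFIX)
    if not found or not title:
        raise ValueError(f"not an English Wikipedia page URL: {url!r}")
    return "http://" + DBPEDIA_PREFIX + title


def yago_uris_for(wikipedia_url: str, triples: Iterable[Triple]) -> List[str]:
    """Return subjects of triples (s, yago:hasWikipediaUrl, wikipedia_url)."""
    return [s for s, p, o in triples
            if p == "yago:hasWikipediaUrl" and o == wikipedia_url]
```

For the page of the running example, the first function yields `http://dbpedia.org/resource/Quaker_Oats_Company`.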
For instance, one of the anchors detected in the running example AP would be "Quaker oats"; a wikification tool would map it to wiki:Quaker_Oats_Company 9, and the respective YAGO URI would be yago:Quaker_Oats_Company.
Obtaining type information. Given a uri, if it is an entity, we look for all the classes it belongs to, or if it is a class, we look for all classes for which it is a subclass. This process is incorporated in the getTypes procedure in line 3 of Algorithm 1. We call such classes types. If $LOD_{KS}$ is WordNet, then our types are simply the URIs of the hypernyms of uri. If $LOD_{KS}$ is DBpedia or YAGO, we query these datasets for the values of the rdf:type and rdfs:subClassOf properties of the uri (i.e., objects of the triples with uri as subject and type/subClassOf as predicates) and add their values (which are also URIs) to the types set. Then, we recursively repeat the same queries for each retrieved type URI and add their results to the types. Finally, the getTypes procedure returns the resulting types set.
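The recursive type collection for DBpedia or YAGO can be sketched over an in-memory triple list. This is an illustration under stated assumptions: a real implementation would issue queries against the dataset, and the triples in the usage example are hypothetical.

```python
from typing import Iterable, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)
TYPE_PREDICATES = {"rdf:type", "rdfs:subClassOf"}


def get_types(uri: str, triples: Iterable[Triple]) -> Set[str]:
    """Collect all types of uri by following rdf:type and rdfs:subClassOf."""
    triples = list(triples)
    types: Set[str] = set()
    frontier = [uri]
    while frontier:
        current = frontier.pop()
        for subject, predicate, obj in triples:
            if subject == current and predicate in TYPE_PREDICATES and obj not in types:
                types.add(obj)        # a newly discovered type
                frontier.append(obj)  # repeat the same query for it
    return types
```

The `obj not in types` check also keeps the loop from running forever if the class hierarchy happens to contain a cycle.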
The extracted URIs returned by getTypes are HTTP ids; however, they frequently have human-readable names, or labels, specified by the rdfs:label property. If no label information for a URI is available, we can create the label by removing the technical information from the type URI, e.g., the http prefix and underscores. type.labels denotes the set of human-readable labels for a specific type. For example, one of the types extracted for yago:Quaker_Oats_Company would have the label "company".
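Label extraction with the fallback described above might look as follows. The helper names are hypothetical, lowercasing is an assumption made because the later match is case-insensitive, and dataset-specific identifier suffixes (e.g., numeric ids embedded in class names) are not handled.

```python
from typing import Iterable, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def label_from_uri(uri: str) -> str:
    """Derive a label by stripping namespace, prefix and underscores."""
    local = uri.rstrip("/#").rsplit("/", 1)[-1].rsplit("#", 1)[-1]
    local = local.split(":", 1)[-1]  # drop a shorthand prefix such as yago:
    return local.replace("_", " ").strip().lower()


def type_labels(uri: str, triples: Iterable[Triple]) -> Set[str]:
    """Prefer rdfs:label values; fall back to a label derived from the URI."""
    labels = {o for s, p, o in triples if s == uri and p == "rdfs:label"}
    return labels or {label_from_uri(uri)}
```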
Checking for TM. Further, the checkMatch procedure checks whether any of the labels in type.labels matches any of the chunks in $P_1$ returned by getChunks, fully or partially (line 5 of Algorithm 1). Here, the getChunks procedure returns a list of chunks recognized in $P_1$ by an external chunker.
More specifically, given a chunk, ch, and a type label, type.label, checkMatch checks whether the ch string matches 10 type.label or its last word(s). If no match is observed, we remove the first token from ch and repeat the procedure. We stop when a match is observed or when no tokens in ch are left. If a match is observed, checkMatch returns all the tokens remaining in ch as matchedTokens. Otherwise, it returns an empty set. For example, the question of the running example contains the chunk "what company", which partially matches the human-readable "company" label of one of the types retrieved for the "Quaker oats" anchor from the answer. Our implementation of the checkMatch procedure would return "company" from the question as one of the matchedTokens.
9 wiki: is a shorthand for the http prefix http://en.wikipedia.org/wiki/
10 Case-insensitive exact string match.
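One possible reading of the checkMatch procedure is sketched below. It interprets "matches type.label or its last word(s)" as: the remaining chunk tokens equal the whole label or a suffix of its words, compared case-insensitively. This interpretation, the whitespace tokenization of the label and the function name are assumptions of the sketch.

```python
from typing import List


def check_match(chunk_tokens: List[str], type_label: str) -> List[str]:
    """Return the matched tokens of the chunk, or an empty list if none match."""
    label_words = type_label.lower().split()
    lowered = [token.lower() for token in chunk_tokens]
    for start in range(len(lowered)):
        remaining = lowered[start:]
        # Compare against the last len(remaining) words of the label.
        if len(remaining) <= len(label_words) and remaining == label_words[-len(remaining):]:
            return chunk_tokens[start:]
        # Otherwise drop the first token (next loop iteration) and retry.
    return []
```

For the chunk "what company" and the label "company", the first comparison fails, "what" is dropped, and the second comparison succeeds, so the function returns `["company"]`.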
If the matchedTokens set is not empty, this means that $TM(R(anchor, P_2), R(matchedTokens, P_1))$ in Eq. 3 returns true. Indeed, $a_1$ is an anchor and $a_2$ is the matchedTokens sequence (see Eq. 3), and their respective references, i.e., the URI assigned to the anchor and the URI of one of its types, are in either an isSubclassOf or an isa relation by construction. Naturally, this is only one of the possible ways to evaluate the TM function, and it may be noise-prone.
Marking TM in tree structures. Finally, if a TM match is observed, i.e., matchedTokens is not an empty set, we mark the tree substructures corresponding to the anchor in the structural representation of $P_2$ ($P_2$.parseTree) and those corresponding to matchedTokens in that of $P_1$ ($P_1$.parseTree) as being in a TM relation. In our running example, we would mark the substructures corresponding to the "Quaker oats" anchor in the answer (our $P_2$) and the "company" matchedToken in the question (our $P_1$) shallow syntactic tree representations. We can encode TM match information into a tree in a variety of ways, which we describe below.

3.2.2 Encoding TM knowledge in the trees
$a_1$ and $a_2$ from Eq. 3 are n-grams; therefore, they correspond to leaf nodes in the shallow syntactic trees of $p_1$ and $p_2$. We denote the set of their preterminal parents as $N_{TM}$. We considered the following strategies of encoding the TM relation in the trees:
Fig. 3 shows an example of the $TM_{ND}$ annotation.

Wikipedia-based matching
Lemma matching for detecting REL may result in low coverage, e.g., it is not able to match different variants for the same name. We remedy this by using Wikipedia link annotation. We consider two word sequences (in Q and AP, respectively) that are annotated with the same Wikipedia link to be in a matching relation. Thus, we add new REL tags to Q/AP structural representations as described in Sec. 2.2.
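As an illustration of this matching rule, the sketch below pairs spans from Q and AP that a link annotator mapped to the same Wikipedia page. The input format (span text plus page URL) and the function name are assumptions, and the annotations in the usage example are hypothetical rather than actual annotator output.

```python
from typing import Dict, List, Tuple

LinkedSpan = Tuple[str, str]  # (surface text, Wikipedia page URL)


def wiki_rel_pairs(q_links: List[LinkedSpan],
                   ap_links: List[LinkedSpan]) -> List[Tuple[str, str]]:
    """Pair Q and AP spans that are annotated with the same Wikipedia link."""
    ap_by_url: Dict[str, List[str]] = {}
    for span, url in ap_links:
        ap_by_url.setdefault(url, []).append(span)
    return [(q_span, ap_span)
            for q_span, url in q_links
            for ap_span in ap_by_url.get(url, [])]
```

Each returned pair would then receive REL tags in the same way as a lemma match, even when the surface forms differ (e.g., "Quaker Oats" versus "Quaker Oats Co.").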

Experiments
We evaluated our different rerankers encoding several semantic structures on the passage retrieval task, using a factoid open-domain TREC QA corpus. The AQUAINT corpus 11 is used for searching the supporting passages.
Pruning. Following (Severyn and Moschitti, 2012), we prune the shallow trees by removing the nodes beyond a distance of 2 from the REL, REL-FOCUS or TM nodes.
LOD datasets. We used the core RDF distribution of YAGO2 12, WordNet 3.0 in RDF 13, and the datasets from the 3.9 DBpedia distribution 14.
11 http://catalog.ldc.upenn.edu/LDC2002T31
12 http://www.mpi-inf.mpg.de/yago-naga/yago1_yago2/download/yago2/yago2core_20120109.rdfs.7z
13 http://semanticweb.cs.vu.nl/lod/wn30/
14 http://dbpedia.org/Downloads39
Feature Vectors. We used a subset of the similarity functions between Q and AP described in (Severyn et al., 2013b). These are used along with the structural models. More explicitly:
Term-overlap features: a cosine similarity over question/answer, $sim_{COS}(Q, AP)$, where the input vectors are composed of lemma or POS-tag n-grams with n = 1, ..., 4.
PTK score: the output of the Partial Tree Kernel (PTK), defined in (Moschitti, 2006), when applied to the structural representations of Q and AP, $sim_{PTK}(Q, AP) = PTK(Q, AP)$ (note that this is computed within a pair). PTK defines similarity in terms of the number of substructures shared by two trees.
Search engine ranking score: the ranking score of our search engine assigned to AP divided by a normalizing factor.
SVM re-ranker. To train our models, we use SVM-light-TK 15, which enables the use of structural kernels (Moschitti, 2006) in SVM-light (Joachims, 2002). We use default parameters and the preference reranking model described in (Severyn and Moschitti, 2012; Severyn et al., 2013b). We used PTK and the polynomial kernel of degree 3 on standard features.
Pipeline. We built the entire processing pipeline on top of the UIMA framework. We included many off-the-shelf NLP tools, wrapping them as UIMA annotators to perform sentence detection, tokenization, NE recognition, parsing, chunking and lemmatization. Moreover, we used annotators for building new sentence representations starting from the tools' annotations, and classifiers for question focus and question class.
Search engines. We adopted Terrier 16 using the accurate BM25 scoring model with default parameters. We trained it on the TREC corpus (3Gb), containing about 1 million documents.
We performed indexing at the paragraph level by splitting each document into a set of paragraphs, which are then added to the search index. We retrieve a list of 50 candidate answer passages for each question.
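The term-overlap feature listed under Feature Vectors can be illustrated with a short sketch. It assumes raw n-gram counts pooled into a single vector for n = 1, ..., 4; whether the original features weight the counts or compute one similarity per n-gram order is not specified here, so both choices are assumptions, as are the function names.

```python
import math
from collections import Counter
from typing import Sequence


def ngram_counts(tokens: Sequence[str], n_max: int = 4) -> Counter:
    """Count all n-grams of the token sequence for n = 1..n_max."""
    counts: Counter = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(value * b[key] for key, value in a.items() if key in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def sim_cos(q_tokens: Sequence[str], ap_tokens: Sequence[str]) -> float:
    """Term-overlap similarity between Q and AP (lemmas or POS tags as tokens)."""
    return cosine(ngram_counts(q_tokens), ngram_counts(ap_tokens))
```

The same function serves both feature variants: passing lemma sequences gives the lemma-based similarity, and passing POS-tag sequences gives the POS-based one.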

Wikipedia link annotators. We use the Wikipedia Miner (WM) (Milne and Witten, 2009)
Metrics. We used common QA metrics: Precision at rank 1 (P@1), i.e., the percentage of questions with a correct answer ranked at the first position, and Mean Reciprocal Rank (MRR). We also report the Mean Average Precision (MAP). We perform 5-fold cross-validation and report the metrics averaged across all the folds together with the standard deviation.
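For reference, the three metrics can be computed as in the sketch below. Each question is represented by the list of binary relevance judgments of its reranked passages, in rank order. Normalizing average precision by the number of relevant passages in the retrieved list, and scoring questions without any relevant passage as zero, are assumptions of this sketch.

```python
from typing import List, Sequence


def precision_at_1(rankings: Sequence[List[bool]]) -> float:
    """Fraction of questions whose top-ranked passage is correct."""
    return sum(1 for r in rankings if r and r[0]) / len(rankings)


def mean_reciprocal_rank(rankings: Sequence[List[bool]]) -> float:
    """Mean of 1/rank of the first correct passage (0 if there is none)."""
    total = 0.0
    for ranking in rankings:
        for rank, relevant in enumerate(ranking, start=1):
            if relevant:
                total += 1.0 / rank
                break
    return total / len(rankings)


def mean_average_precision(rankings: Sequence[List[bool]]) -> float:
    """Mean over questions of the average precision at each correct passage."""
    total = 0.0
    for ranking in rankings:
        hits, precision_sum = 0, 0.0
        for rank, relevant in enumerate(ranking, start=1):
            if relevant:
                hits += 1
                precision_sum += hits / rank
        total += precision_sum / hits if hits else 0.0
    return total / len(rankings)
```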

Baseline Structural Reranking
In these experiments, we evaluated the accuracy of the following baseline models: BM25 is the BM25 scoring model, which also provides the initial ranking; CH+V is a combination of tree structures encoding Q/AP pairs using relational links with the feature vector; and CH+V+QC+TFC is CH+V extended with the semantic categorial links introduced in (Severyn et al., 2013b). Table 1 reports the performance of our baseline systems. The lines marked with (CoNLL, 2013) contain the results we reported in (Severyn et al., 2013b). Lines four and five report the performance of the same systems, i.e., CH+V and CH+V+QC+TFC, after small improvements and changes. Note that our latest version uses a different set of V features than (CoNLL, 2013). Finally, CH+V+QC+TFC* refers to CH+V+QC+TFC with the question type information of the semantic REL-FOCUS links represented as a distinct node (see Section 2.2). The results show that this modification yields a slight improvement over the baseline; thus, in the next experiments, we add LOD knowledge to CH+V+QC+TFC*.

Impact of LOD in Semantic Structures
These experiments evaluated the accuracy of the following models (described in the previous sections): (i) a system using Wikipedia to establish the REL links; and (ii) systems which use LOD knowledge to find type matches (TM).
The first header line of Table 2 shows which baseline system was enriched with the TM knowledge. The Type column reports the TM encoding strategy employed (see Section 3.2.2). The Dataset column reports which knowledge source was employed to find TM relations. Here, yago is YAGO2, db is DBpedia, and wn is WordNet 3.0. The first result line in Table 2 reports the performance of the strong CH+V and CH+V+QC+TFC* baseline systems. The line with the "wiki" dataset reports on CH+V and CH+V+QC+TFC* using both Wikipedia link annotations provided by ML and WM and hard lemma matching to find the related structures to be marked by REL (see Section 3.3 for details of the Wikipedia-based REL matching). The remaining systems are built on top of the baselines using both hard lemma and Wikipedia-based matching. We used bold font to mark the top scores for each encoding strategy.
The tables show that all the systems exploiting LOD knowledge, excluding those using DBpedia only, outperform the strong CH+V and CH+V+QC+TFC* baselines. Note that CH+V enriched with TM tags performs comparably to, and in some cases even outperforms, CH+V+QC+TFC*. Compare, for example, the outputs of CH+V+$TM_{NDF}$ using YAGO, WordNet and DBpedia knowledge and those of CH+V+QC+TFC* with no LOD knowledge.
Adding TM tags to the top-performing baseline system, CH+V+QC+TFC*, typically results in a further increase in performance. The best-performing system in terms of MRR and P@1 is the CH+V+QC+TFC*+$TM_{NF}$ system using the combination of WordNet and YAGO2 as the source of TM knowledge and Wikipedia for REL matching. It outperforms the CH+V+QC+TFC* baseline by 3.82% and 4.15% in terms of MRR and P@1, respectively. Regarding MAP, a number of systems employing YAGO2 in combination with WordNet and Wikipedia-based REL matching obtain a MAP score of 0.37, thus outperforming the CH+V+QC+TFC* baseline by 4%.
We used a paired two-tailed t-test for evaluating the statistical significance of the results reported in Table 2. ‡ and † correspond to the significance levels of 0.05 and 0.1, respectively. We compared (i) the results in the wiki line to those in the none line; and (ii) the results for the TM systems to those in the wiki line.
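A paired two-tailed t-test of this kind can be sketched with the standard library only. The sketch assumes that the pairing unit is the cross-validation fold, which gives 4 degrees of freedom for 5 folds; if scores were paired per question instead, the critical values would differ. The critical values are the standard two-tailed Student's t values for 4 degrees of freedom, the function names are hypothetical, and any numbers fed to the functions in an example are toy values, not results from Table 2.

```python
import math
import statistics
from typing import Sequence

# Two-tailed critical values of Student's t for 4 degrees of freedom (5 folds).
T_CRIT_DF4 = {0.05: 2.776, 0.10: 2.132}


def paired_t_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """t statistic of the per-fold differences a[i] - b[i]."""
    diffs = [x - y for x, y in zip(a, b)]
    sd = statistics.stdev(diffs)  # sample standard deviation
    if sd == 0:
        raise ValueError("all paired differences are identical")
    return statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))


def significance_marker(t: float) -> str:
    """Map a t statistic (df = 4) to the markers used in Table 2."""
    if abs(t) > T_CRIT_DF4[0.05]:
        return "‡"
    if abs(t) > T_CRIT_DF4[0.10]:
        return "†"
    return ""
```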
The table shows that we typically obtain better results when using YAGO2 and/or WordNet. Our intuition is that this is due to the fact that these resources are large-scale, have a fine-grained class taxonomy and contain many synonymous labels per class/entity, thus allowing us to have good coverage with TM links. The DBpedia ontology that we employed in the db experiments is shallower and contains fewer labels for classes; therefore, the number of discovered TM matches is not always sufficient for increasing performance. YAGO2 provides better coverage for TM relations between entities and their classes, while WordNet contains more relations between classes 19. Note that in (Severyn and Moschitti, 2012), we also used supersenses of WordNet (unsuccessfully), whereas here we use hypernymy relations and a different technique to incorporate semantic match into the tree structures.
Table 2: Results in 5-fold cross-validation on TREC QA corpus
The different TM-knowledge encoding strategies, $TM_N$, $TM_{ND}$, $TM_{NF}$ and $TM_{NDF}$, produce small changes in accuracy. We believe that the difference between them would become more significant when experimenting with larger corpora.

Conclusions
This paper proposes syntactic structures whose nodes are enriched with semantic information from statistical classifiers and knowledge from LOD. In particular, YAGO, DBpedia and WordNet are used to match and generalize constituents from QA pairs: such matches are then used in