An Empirical Investigation of Implicit and Explicit Knowledge-Enhanced Methods for Ad Hoc Dataset Retrieval



Introduction
Tens of millions of datasets have been published on the Web (Benjelloun et al., 2020), providing government data, scientific data, etc. Accordingly, ad hoc dataset retrieval is becoming an important specialized information retrieval task (Kato et al., 2021; Lin et al., 2022), aiming at finding datasets that are relevant to a user's query; the task is ad hoc because the space of possible queries is huge. Due to the magnitude and heterogeneity of dataset contents, existing solutions such as Google Dataset Search (Brickley et al., 2019) rely on the retrieval of dataset metadata provided by data publishers to describe their datasets, as illustrated in Table 1, which shows a real example taken from the NTCIR-E test collection (Kato et al., 2021).
Motivation. The metadata of a dataset resembles a structured multi-field document, and current implementations of ad hoc dataset retrieval are mainly adapted from conventional document retrieval methods like BM25 or FSDM (Chapman et al., 2020; Lin et al., 2022). It is well known that such lexical retrieval methods cannot identify semantic matches and hence fail to find datasets that are lexically disjoint from but semantically relevant to the query. The problem can be alleviated by incorporating knowledge into retrieval, either implicit knowledge encoded in a pre-trained language model (PLM) or explicit knowledge stored in an encyclopedic knowledge base. For example, to identify the semantic connection between "protein" in the query and "nutrient" in the dataset in Table 1, implicit knowledge-enhanced retrieval methods may measure the similarity between their word embeddings, while explicit methods may employ the similarity between their linked entities in a knowledge base. However, to the best of our knowledge, the effectiveness of incorporating knowledge into the emerging task of ad hoc dataset retrieval has not been systematically investigated. In particular, the following three research questions remain open.
RQ1. Despite a few preliminary PLM-based implementations (Kato et al., 2021), they were fine-tuned on a small training set. Research on ad hoc dataset retrieval is still in its infancy, and existing test collections provide fewer than 300 queries for training (Kato et al., 2021; Lin et al., 2022). The performance of PLM-based methods fine-tuned in such an in-domain setting may not generalize to practical settings. Will implicit knowledge-enhanced retrieval methods remain effective in ad hoc dataset retrieval in an out-of-domain setting?
RQ2. The high-quality encyclopedic knowledge bases available today, like Wikipedia and Wikidata, can be used to annotate queries and dataset metadata so that their semantic similarity can be measured to enhance lexical retrieval. More importantly, this is unsupervised and not limited by the availability of training data. However, such methods are so far under-studied and their effectiveness remains unknown. Will explicit knowledge-enhanced retrieval methods be effective in ad hoc dataset retrieval?
RQ3. Lexical matching, implicit knowledge, and explicit knowledge have the potential to capture different signals in retrieval. While each of them alone may not exhibit superb performance, an appropriate combination may produce better results, e.g., an interpolation of their retrieval scores. Will interpolated methods be more effective in ad hoc dataset retrieval?
Our Work and Contribution. To answer the above questions, we conducted a systematic investigation of implicit and explicit knowledge-enhanced methods for ad hoc dataset retrieval on two test collections (Kato et al., 2021; Lin et al., 2022). For implicit knowledge, we evaluated five PLM-based methods in both in-domain and out-of-domain settings. For explicit knowledge, we designed and evaluated methods based on entity similarity computed over two knowledge bases. We also explored different interpolation strategies.
As the first empirical investigation of this kind, our work fills the gap, and our results provide practical guidelines for researchers and developers working on ad hoc dataset retrieval, or even information retrieval in general. It helps establish an empirical basis that will facilitate future studies on this trending information retrieval task.
Code: https://github.com/nju-websoft/AHDR-KnowledgeEnhanced

Paper Structure. We will discuss related work in Section 2, describe the evaluated methods in Section 3, present our experimental setup and results in Section 4 and Section 5, respectively, and finally conclude the paper in Section 6.

Ad Hoc Dataset Retrieval
Ad hoc dataset retrieval is a specialized information retrieval task that aims to find the datasets most relevant to a user's query. The metadata of a dataset provided by its publisher typically consists of a set of fields such as title and description. While existing retrieval methods commonly rely on metadata (Chapman et al., 2020), which resembles a structured document, the task of dataset retrieval differs from document retrieval and has its own unique properties, e.g., queries often mention geospatial and temporal entities (Kacprzak et al., 2019; Chen et al., 2019), and metadata is relatively short and often incomplete (Neumaier et al., 2016).
Knowledge-enhanced retrieval methods have the potential to exploit these features, but they have not been sufficiently studied for this task. Indeed, in a recent benchmarking effort (Lin et al., 2022), only a number of lexical retrieval methods were implemented and evaluated. In Kato et al. (2021) and Chen et al. (2023), a few PLM-based implementations were evaluated, but they were fine-tuned on a small training set, risking overfitting, and their reported performance might not be generalizable.
Our empirical investigation significantly extends the above evaluation efforts. We systematically evaluate a range of state-of-the-art PLM-based methods for ad hoc dataset retrieval in both in-domain and out-of-domain settings. Moreover, we design and evaluate explicit knowledge-enhanced methods, which are under-studied in the literature.

Implicit Knowledge-Enhanced Retrieval
PLMs encode knowledge into learnable dense vectors (Talmor et al., 2020). PLM-based retrieval, a.k.a. dense retrieval, has developed rapidly in recent years and exhibits powerful text understanding capabilities that have helped improve the accuracy of document retrieval (Zhao et al., 2022). Among others, monoBERT (Nogueira and Cho, 2019) directly leverages the text classification capability of BERT (Devlin et al., 2019) to rank documents. DPR (Karpukhin et al., 2020) adopts a dual-encoder architecture that employs the implicit knowledge in a PLM and performs metric learning. Other dense retrieval models such as ColBERT (Khattab and Zaharia, 2020) and COIL (Gao et al., 2021) further exploit the implicit knowledge in PLMs by computing token-level matching through multiple vectors. Xiong et al. (2021) propose ANCE, which features dynamic negative sampling to improve the informativeness of training data. Condenser (Gao and Callan, 2021) adopts a novel pre-training architecture to compress information in the text. Further, coCondenser (Gao and Callan, 2022) extends Condenser by pre-training with a query-agnostic contrastive loss.
These state-of-the-art dense retrieval methods have not been applied to ad hoc dataset retrieval. We adapt them to this new task and thoroughly analyze their effectiveness in various settings.

Explicit Knowledge-Enhanced Retrieval
Since its introduction, the Semantic Web has benefited information retrieval systems by supplying explicit knowledge. In McCool and Miller (2003), an early semantic search prototype named TAP was presented, showing that knowledge bases can enhance retrieval systems. For Web search, Lu et al. (2009) demonstrated that ranking methods can be improved by semantic features. For entity search, researchers have employed entity linking techniques to annotate queries and measured the semantic relevance of a target entity to a query based on their learned embeddings (Gerritse et al., 2020).
Such explicit knowledge-enhanced retrieval methods have not received much attention in research on ad hoc dataset retrieval. We design a method for this new task and evaluate its various configurations using different knowledge bases, entity linking tools, and entity embeddings.

Methods
We divide knowledge-enhanced methods for ad hoc dataset retrieval into two types: those using implicit knowledge and those using explicit knowledge. Employing implicit knowledge embodied in PLMs to enhance retrieval is a widely used approach, which we briefly review in Section 3.1. In Section 3.2 we present a method for employing explicit knowledge in a knowledge base to enhance retrieval.
Problem Statement.Given a query q, the main task in ad hoc dataset retrieval is to compute the relevance of each dataset d to q denoted by rel(d, q), so that a ranked list of datasets can be returned.

Implicit Knowledge-Enhanced Retrieval
While the contents of different datasets can be in different formats (e.g., TXT, CSV, JSON), the metadata of a dataset typically consists of a set of fields such as title and description, which are commonly used in retrieval. For each dataset d, we concatenate the textual values of its metadata fields {T_{d,1}, T_{d,2}, ...} into a document T_d:

    T_d = T_{d,1} ⊕ T_{d,2} ⊕ · · · ,    (1)

where ⊕ represents concatenation. Any dense retriever reviewed in Section 2.2 can be used to compute the relevance of T_d to the query q as the relevance score of d:

    rel(d, q) = rel(T_d, q) ,    (2)

which exploits knowledge implicitly encoded by PLMs into learnable dense vectors. Dense retrievers are commonly supervised. Fine-tuning can be performed on task-specific training data (referred to as an in-domain setting) or on data for other tasks (referred to as an out-of-domain setting).
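As a concrete illustration, the field concatenation and relevance scoring above can be sketched as follows. This is only a minimal sketch: the bag-of-words `embed` function is a stand-in for a real PLM encoder, and all function names here are our own, not taken from the paper's implementation.

```python
import math
from collections import Counter

def concat_fields(metadata):
    # Build T_d by concatenating the textual values of the metadata fields.
    return " ".join(metadata.values())

def embed(text):
    # Placeholder for a PLM encoder: a sparse bag-of-words vector.
    # A real dense retriever would produce a learned dense vector instead.
    return Counter(text.lower().split())

def rel(query, metadata):
    # rel(d, q): cosine similarity between the query vector and the
    # vector of the concatenated metadata document T_d.
    q, d = embed(query), embed(concat_fields(metadata))
    dot = sum(q[w] * d[w] for w in q)
    norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(sum(v * v for v in d.values()))
    return dot / norm if norm else 0.0

meta = {"title": "Food nutrient database",
        "description": "Protein and vitamin content of common foods"}
score = rel("protein intake data", meta)  # positive: "protein" overlaps
```

With a fine-tuned dense retriever in place of `embed`, the same pipeline would also score lexically disjoint but semantically related metadata above zero.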

Explicit Knowledge-Enhanced Retrieval
We assume that explicit knowledge is given as a knowledge base describing a set of entities E.
To exploit such knowledge, we first link the query q and the document representation T_d of each dataset d to two sets of entities in E, denoted by E_q ⊆ E and E_d ⊆ E, respectively. Then we aggregate the pairwise entity similarities between E_q and E_d as the relevance of d to q. Entity similarity is measured based on the entities' textual and structural descriptions in the knowledge base.

Entity Linking
We link q to a set of entities E_q ⊆ E that are mentioned in q, to represent q. Analogously, we link T_d to a set of entities E_d ⊆ E that are mentioned in T_d, to represent d. Entity linking is an established research problem (Shen et al., 2015, 2023), and we use off-the-shelf tools in the experiments.

Entity Set Similarity
Let esim(e_i, e_j) be the similarity between two entities e_i, e_j ∈ E, which will be elaborated in Section 3.2.3. Let S be an m × n similarity matrix with m = |E_q| rows and n = |E_d| columns. For 1 ≤ i ≤ m and 1 ≤ j ≤ n, each element s_{i,j} represents the entity similarity between e_i ∈ E_q and e_j ∈ E_d, i.e., s_{i,j} = esim(e_i, e_j). We aggregate the similarity values in S as follows, as also depicted in Figure 1.
For each entity e_i ∈ E_q, we find its most similar entity in E_d and take their similarity value. We calculate the arithmetic mean (arithmean) of such similarity values over all the m entities in E_q:

    s_row = arithmean_{1 ≤ i ≤ m} ( max_{1 ≤ j ≤ n} s_{i,j} ) .    (3)

Analogously, we compute to what extent the entities in E_q can best "cover" the entities in E_d:

    s_col = arithmean_{1 ≤ j ≤ n} ( max_{1 ≤ i ≤ m} s_{i,j} ) .    (4)

Finally, considering that two similar sets of entities should both largely cover each other, we calculate the harmonic mean (harmomean) of s_row and s_col, which aggregates the similarity values in S as the relevance score of d:

    rel(d, q) = harmomean(s_row, s_col) .    (5)

By the property of the harmonic mean, rel(d, q) will be high only if both s_row and s_col are high.
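The coverage-based aggregation described above can be sketched directly from the definitions; a minimal sketch with function names of our own choosing:

```python
def harmomean(a, b):
    # Harmonic mean; zero if either input is zero.
    return 2 * a * b / (a + b) if a + b > 0 else 0.0

def aggregate(S):
    # S: m x n matrix with S[i][j] = esim(e_i, e_j) for e_i in E_q, e_j in E_d.
    m, n = len(S), len(S[0])
    # s_row: mean over E_q of each entity's best match in E_d.
    s_row = sum(max(S[i][j] for j in range(n)) for i in range(m)) / m
    # s_col: mean over E_d of each entity's best match in E_q.
    s_col = sum(max(S[i][j] for i in range(m)) for j in range(n)) / n
    # rel(d, q): harmonic mean, high only if both coverages are high (Eq. 5).
    return harmomean(s_row, s_col)

S = [[0.9, 0.1],
     [0.2, 0.8],
     [0.1, 0.3]]
rel_score = aggregate(S)  # ≈ 0.747
```

Note how the third query entity, with a best match of only 0.3, drags s_row down, and the harmonic mean then keeps the overall score below the better of the two coverages.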

Entity Similarity
Now we elaborate esim(e_i, e_j), the similarity between two entities e_i, e_j ∈ E. We measure and integrate their entity- and word-level similarities. First, we measure the entity-level similarity between e_i and e_j based on their embedding vectors. Representation learning is an established research problem (Wang et al., 2017; Yang et al., 2022), which encodes the textual and structural description of each entity in a knowledge base into a dense vector. Let e_i, e_j also denote the embeddings of e_i, e_j, respectively. We calculate their cosine similarity:

    s^ent_{i,j} = cos(e_i, e_j) .    (6)

Second, we measure the word-level similarity between e_i and e_j based on the embedding vectors of their mentions. Specifically, let W_i, W_j be the sets of words that appear in the mentions of e_i, e_j, respectively. We construct a |W_i| × |W_j| similarity matrix where each element represents the cosine similarity between two word embedding vectors. Then we aggregate the similarity values in the matrix in a way that resembles the aggregation process described in Section 3.2.2 and Figure 1. The result is denoted by s^word_{i,j}. Finally, we integrate the entity- and word-level similarities by calculating their harmonic mean:

    esim(e_i, e_j) = harmomean(s^ent_{i,j}, s^word_{i,j}) .    (7)

We choose the harmonic mean because it empirically outperforms several other combination functions such as arithmetic mean, maximum, and minimum. Table 7 in the appendix presents the results of using different combination functions.
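Putting the two levels together, esim can be sketched as below. The vector inputs are toy placeholders; real entity embeddings (e.g., from KGTK or RDF2vec) and word embeddings (e.g., from Wikipedia2vec) would be plugged in instead, and the helper names are ours.

```python
def cosine(u, v):
    # Cosine similarity between two vectors; zero for zero-norm inputs.
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0

def harmomean(a, b):
    return 2 * a * b / (a + b) if a + b > 0 else 0.0

def aggregate(S):
    # Coverage-based aggregation of a similarity matrix (as in Section 3.2.2).
    m, n = len(S), len(S[0])
    s_row = sum(max(S[i][j] for j in range(n)) for i in range(m)) / m
    s_col = sum(max(S[i][j] for i in range(m)) for j in range(n)) / n
    return harmomean(s_row, s_col)

def esim(ent_i, ent_j, words_i, words_j):
    # Entity-level similarity: cosine of the two entity embeddings.
    s_ent = cosine(ent_i, ent_j)
    # Word-level similarity: aggregate the |W_i| x |W_j| matrix of
    # cosine similarities between mention-word embeddings.
    s_word = aggregate([[cosine(u, v) for v in words_j] for u in words_i])
    # Integrate the two levels with a harmonic mean (Eq. 7).
    return harmomean(s_ent, s_word)
```

Because of the harmonic mean, an entity pair that matches at only one of the two levels scores low, which is exactly the behavior motivated in the text.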
Experimental Setup

Test Collections
We conducted experiments on two test collections for ad hoc dataset retrieval.

NTCIR-E
NTCIR-E is the English version of the test collection used in the NTCIR-15 Dataset Search task (Kato et al., 2021), including 46,615 datasets and 192 queries. The datasets were collected from Data.gov. The queries were crowdsourced, and were originally split into 96 as the training set and 96 as the test set. We further split the former into 76 as our training set and 20 as our validation set. The relevance of a dataset to a query has been annotated as irrelevant (0), partially-relevant (1), or relevant (2) as the gold standard.

ACORDAR
ACORDAR is a test collection specifically over RDF datasets (Lin et al., 2022), including 31,589 datasets and 493 queries. The datasets were collected from 543 open data portals. The queries were partially crowdsourced and partially collected from TREC, and were split into five subsets for five-fold cross-validation, each fold using three subsets as the training set, one subset as the validation set, and one subset as the test set. Similar to NTCIR-E, the relevance of a dataset to a query has been annotated as irrelevant (0), partially-relevant (1), or relevant (2) as the gold standard.

Evaluation Metrics
Retrieval on NTCIR-E relied on two metadata fields: title and description. ACORDAR further included two other fields: author and tags. Both NTCIR-E and ACORDAR provide top-10 retrieval results returned by (field-weighted) BM25 over these fields. Based on these first-stage retrieval results, we investigated the reranking performance of knowledge-enhanced retrieval methods. Two evaluation metrics were used: normalized discounted cumulative gain (NDCG) and mean average precision (MAP). When calculating MAP scores, both partially-relevant and relevant datasets in the gold standard were considered relevant.
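For reference, the two metrics can be computed as follows over a ranked list of gold relevance grades (0/1/2). A sketch under the assumption that average precision is computed over the relevant datasets appearing in the (top-10) ranked list:

```python
import math

def ndcg_at_k(grades, k):
    # grades: gold relevance grades (0/1/2) of the ranked results, in rank order.
    dcg = sum(g / math.log2(r + 2) for r, g in enumerate(grades[:k]))
    idcg = sum(g / math.log2(r + 2) for r, g in enumerate(sorted(grades, reverse=True)[:k]))
    return dcg / idcg if idcg else 0.0

def average_precision(grades):
    # Both partially-relevant (1) and relevant (2) count as relevant for MAP,
    # matching the setup described above.
    hits, precisions = 0, []
    for rank, g in enumerate(grades, start=1):
        if g > 0:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0
```

NDCG rewards placing relevant (2) above partially-relevant (1) datasets, while MAP, after the binarization, only rewards placing any relevant dataset early.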

Implicit Knowledge-Enhanced Retrieval
We used five popular PLM-based methods in our experiments: monoBERT-large (Nogueira and Cho, 2019), monoT5-large (Nogueira et al., 2020), coCondenser (Gao and Callan, 2022), ColBERT-v2 (Khattab and Zaharia, 2020), and ANCE (Xiong et al., 2021). For each model, we used its checkpoint pre-trained on MS MARCO (Nguyen et al., 2016) and reported its performance on the test set of each test collection to measure its performance in an out-of-domain (OOD) setting. We also fine-tuned it on the training and validation sets of each test collection and then measured its performance on the test set in an in-domain (ID) setting, where partially-relevant and relevant datasets in the training set were used as positive samples, and irrelevant datasets were used as negative samples. It is worth noting that we fine-tuned and used all these methods as black boxes, e.g., ANCE's special negative sampling strategy, as well as all the other special strategies incorporated into these methods, were executed.

Explicit Knowledge-Enhanced Retrieval
We used two well-known encyclopedic knowledge bases, Wikipedia and Wikidata, together with their corresponding entity linking tools and entity embeddings. For word embeddings, we consistently used those provided by Wikipedia2vec.

Experimental Results
Each of the following three subsections reports experimental results answering one of the three research questions raised in Section 1.

Implicit Knowledge-Enhanced Retrieval
As shown in Table 2, in the in-domain setting, reranking by coCondenser, ColBERT, and ANCE achieved significant improvements on both test collections in terms of NDCG@5. However, for monoBERT the improvement was marginal on NTCIR-E, and for monoT5 we even observed a performance drop on ACORDAR. The results indicated that the fine-tuned PLM-based methods might have overfitted the small training sets of existing test collections for ad hoc dataset retrieval. monoBERT and monoT5 performed extensive query-document interaction, probably leading to more severe overfitting and worse test results.
The out-of-domain setting assessed the generalizability of PLM-based methods. While reranking by monoT5, coCondenser, and ANCE achieved significant improvements on NTCIR-E in terms of NDCG@5, only monoT5 also achieved a significant improvement on ACORDAR, and for ColBERT we even observed a performance drop on both test collections. The results suggested that only a few domain-adapted PLM-based methods could directly generalize to the task of ad hoc dataset retrieval, which has its own unique features.
ColBERT, which relies on word-level matching, achieved relatively poor results, possibly due to ambiguous query words such as 'MCI', which could be mistakenly matched with the same word used in an unrelated sense.

Explicit Knowledge-Enhanced Retrieval
As shown in Table 3, only reranking using Wikipedia knowledge with TAGME achieved a significant improvement on NTCIR-E in terms of NDCG@5. We observed performance drops in all the other configurations, and noticeable differences between the performance of different entity linking tools (i.e., TAGME or REL) and different entity embeddings (i.e., KGTK or RDF2vec). The results suggested that reranking with explicit knowledge has potential but requires very careful implementation to be effective in ad hoc dataset retrieval.
More concretely, we attributed the relatively good performance of TAGME to its higher recall than REL: the sparse Wikipedia links found by REL could not sufficiently capture the semantics of the original text. RDF2vec, which mainly employs the graph structure of Wikidata, was inferior to KGTK, whose text version used in our experiments ignores graph structure and only exploits the textual descriptions of entities; the latter seemed to be more helpful than the graph structure.

Interpolation with BM25 Scores
In Section 5.1 and Section 5.2, we directly used the relevance scores computed by knowledge-enhanced retrieval methods to rerank first-stage retrieval results. Their unsatisfying results might be partially related to their weak sensitivity to exact lexical matches. Therefore, a straightforward extension is to interpolate their score with the BM25 score to enhance their capability of lexical matching, i.e., by calculating the sum of the two scores (both after min-max normalization). We chose sum because it was empirically among the best-performing fusion algorithms for interpolation. Table 8 in the appendix presents the results of using different fusion algorithms provided by ranx (Bassani and Romelli, 2022; https://amenra.github.io/ranx/fusion/).

For implicit knowledge-enhanced retrieval methods, as shown in Table 4, interpolation helped improve the performance of all the methods on ACORDAR in an out-of-domain setting, and all the improvements were significant in terms of NDCG@5 (represented by †). On NTCIR-E, interpolation improved the performance of monoBERT and ColBERT, although it worsened the performance of the other methods. With interpolation, reranking by almost all the methods (except for ColBERT) achieved significant improvements on both test collections in terms of NDCG@5 (represented by *), whereas without interpolation only monoT5 achieved that. The results demonstrated that domain-adapted PLM-based methods interpolated with BM25 scores could effectively generalize to ad hoc dataset retrieval.
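The interpolation step amounts to a min-max normalization of each score list followed by a sum. A minimal sketch (mapping the degenerate all-equal case to zero is our own convention):

```python
def minmax(scores):
    # Min-max normalization of a list of scores to [0, 1].
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # Degenerate case: all scores equal; map everything to 0 by convention.
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]

def interpolate(bm25_scores, knowledge_scores):
    # Sum of the min-max-normalized BM25 and knowledge-enhanced scores.
    return [a + b for a, b in zip(minmax(bm25_scores), minmax(knowledge_scores))]

bm25 = [12.3, 9.1, 4.0]   # first-stage lexical scores of the top results
dense = [0.2, 0.9, 0.5]   # knowledge-enhanced relevance scores
fused = interpolate(bm25, dense)
```

The three-way interpolation discussed later extends this in the obvious way: normalize the third score list and add it into the same sum.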
For explicit knowledge-enhanced retrieval methods, as shown in Table 5, interpolation helped improve the performance of all the methods on ACORDAR, and all the improvements were significant (represented by †). On NTCIR-E, interpolation also noticeably improved the performance of the methods using Wikidata knowledge, while its influence on the methods using Wikipedia knowledge was minor. With interpolation, reranking using Wikipedia knowledge with TAGME achieved significant improvements on both test collections in terms of NDCG@5 (represented by *), whereas without interpolation it achieved that only on NTCIR-E. The results demonstrated that reranking with explicit knowledge interpolated with BM25 scores could effectively benefit ad hoc dataset retrieval.

Interpolation of Implicit and Explicit Knowledge-Enhanced Retrieval

In Section 5.3.1, reranking with implicit knowledge and reranking with explicit knowledge, each interpolated with BM25 scores, both showed effectiveness, so we continued to explore whether the two kinds of knowledge could complement each other by further interpolating the implicit knowledge-enhanced relevance score, the explicit knowledge-enhanced relevance score, and the BM25 score, i.e., by calculating the sum of the three scores (all after min-max normalization). For a focused discussion, we reported the results of monoT5 and Wikipedia with TAGME, the best-performing implicit and explicit knowledge-enhanced retrieval methods interpolated with BM25 scores according to Table 4 and Table 5, respectively. Similar results were observed for the other methods and are hence omitted.
As shown in Table 6, such further interpolation achieved the highest scores in all the settings. In particular, compared with using explicit knowledge interpolated with BM25 scores, additionally incorporating implicit knowledge brought significant improvements on ACORDAR (represented by ‡), while their complementarity in the other settings was not strong. The results suggested that using both implicit and explicit knowledge interpolated with first-stage lexical retrieval scores could represent the current best practice in reranking for ad hoc dataset retrieval.

Conclusion and Future Work
We summarize our empirical findings to answer the three research questions raised in Section 1.

RQ1: Will implicit knowledge-enhanced retrieval methods remain effective in ad hoc dataset retrieval in an out-of-domain setting? According to the results presented in Section 5.1, only monoT5 remained effective on its own, whereas the other PLM-based methods could not consistently bring improvements. However, interpolated with BM25 scores, most of these methods effectively generalized to this new task, as shown by the results in Section 5.3.1. This demonstrates the necessity of combining dense and sparse retrieval for this task.
RQ2: Will explicit knowledge-enhanced retrieval methods be effective in ad hoc dataset retrieval? According to the results presented in Section 5.2 and Section 5.3.1, reranking with explicit knowledge could benefit this task only when interpolated with BM25 scores, and significant improvements were consistently observed only for the configuration using Wikipedia knowledge with TAGME for entity linking. This suggests that the incorporation of explicit knowledge into this task should be carefully designed, and calls for more effective implementations in the future.
RQ3: Will interpolated methods be more effective in ad hoc dataset retrieval? According to the results presented in Section 5.3.1 and Section 5.3.2, not only was interpolation with BM25 scores helpful, but implicit and explicit knowledge also exhibited complementarity. A combination of lexical matching, implicit knowledge, and explicit knowledge consistently achieved the best performance in our experiments, representing the current best practice for solving this task.
We hope our empirical findings and conclusions provide useful guidelines for the community in researching and practicing ad hoc dataset retrieval. In future work, we will continue to explore and address the unique challenges of this important task. Regarding implicit knowledge-enhanced retrieval, we are interested in constructing a large test collection to support in-domain supervision of more generalizable PLM-based retrieval methods. We also plan to apply PLM-based methods not only to dataset metadata but also to dataset contents, which are large and heterogeneous and thus pose great challenges. A related trending research direction is to explore the effectiveness of large language models in ad hoc dataset retrieval. Regarding explicit knowledge-enhanced retrieval, since the accuracy of entity linking observed in our experiments was not satisfactory and might have distorted the results, one idea is to extend existing test collections with manually annotated entities, which would also provide a useful resource for entity linking research.

Limitations
First, although we used two test collections for ad hoc dataset retrieval, our results may still be biased due to the small number of queries (fewer than a few hundred) in these test collections. The generalizability of our conclusions could be improved in the future when new and larger test collections become available.
Second, some observations in our experiments have yet to be justified.For example, while reranking by most PLM-based methods interpolated with BM25 scores showed effectiveness, it remains unclear why ColBERT was an exception.Exploring such reasons could help deepen our understanding of this task as well as the strengths and weaknesses of existing retrieval methods.
Third, our score interpolation performed in the experiments was helpful but simple.There could be more effective interpolation strategies.For example, instead of score interpolation, implicit and explicit knowledge could be integrated into a single model.We have witnessed the emergence of such efforts, which will be evaluated in our future work.
Fourth, following common practice in the literature, we only considered dataset metadata in retrieval without using dataset contents due to their magnitude and heterogeneity, which could not be effectively handled by current PLMs.It remains an open question as to whether and how dataset content should be exploited in retrieval.

Table 1 :
A query and the metadata of a relevant dataset.

Table 2 :
Performance of implicit knowledge-enhanced retrieval methods, with * indicating a significant improvement after reranking (according to a paired t-test with p < 0.05).

Table 3 :
Performance of explicit knowledge-enhanced retrieval methods, with * indicating a significant improvement after reranking (according to a paired t-test with p < 0.05).

Table 4 :
Performance of implicit knowledge-enhanced retrieval methods interpolated with BM25 scores, with * indicating a significant improvement after reranking, and † indicating a significant improvement after interpolation with BM25 scores (paired t-test with p < 0.05).

Table 5 :
Performance of explicit knowledge-enhanced retrieval methods interpolated with BM25 scores, with * indicating a significant improvement after reranking, and † indicating a significant improvement after interpolation with BM25 scores (paired t-test with p < 0.05).

Table 6 :
Performance of interpolation of implicit and explicit knowledge-enhanced retrieval methods, with * indicating a significant improvement after reranking, † and ‡ indicating a significant improvement after incorporating explicit and implicit knowledge, respectively (paired t-test with p < 0.05).

Table 7 :
Performance of explicit knowledge-enhanced retrieval (Wikipedia w/ TAGME) using different functions for combining entity-and word-level similarities.

Table 8 :
Performance of implicit knowledge-enhanced retrieval methods interpolated with BM25 scores using different fusion algorithms for interpolation.