Unsupervised Dense Retrieval for Scientific Articles

In this work, we build a dense-retrieval-based semantic search engine on scientific articles from Elsevier. The major challenge is that there is no labeled data for training and testing. We apply a state-of-the-art unsupervised dense retrieval model called Generative Pseudo Labeling that generates high-quality pseudo training labels. Furthermore, since the articles are unbalanced across different domains, we select passages from multiple domains to form balanced training data. For the evaluation, we create two test sets: one manually annotated and one automatically created from the meta-information of our data. We compare the semantic search engine with the currently deployed lexical search engine on the two test sets. The experimental results show that the semantic search engine trained with pseudo training labels can significantly improve search performance.


Introduction
Search engines are deeply integrated into Elsevier's information services for its scientific literature data. An example is the one provided by ScienceDirect1, which offers researchers search services over more than 19M full-text articles. These search engines are currently based on lexical search models such as BM25. The deployment of such models is greatly simplified by popular industry-standard libraries such as Elasticsearch2. However, lexical search suffers from the lexical gap problem, caused by misspellings, synonyms, abbreviations, ambiguous words, and the ignoring of word order (Formal et al., 2021).
Recently, dense retrieval (DR) models have proven highly effective at bridging the lexical gap while maintaining fast search speeds (Karpukhin et al., 2020; Xiong et al., 2020). DR models map queries and passages into a common vector space and retrieve relevant passages by searching for (approximate) nearest neighbors. DR has been well studied on laboratory data but is still at an early stage for industry-level applications (Hofstätter et al., 2022; Kim, 2022). In industry, DR is mainly applied to multi-modal search, where traditional lexical search is not possible, such as text-image search (Radford et al., 2021) or music search (Castellon et al., 2021).
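The nearest-neighbor retrieval step can be illustrated with a minimal sketch. Here the embeddings are random stand-ins for the vectors a trained bi-encoder would produce, and the exhaustive dot-product scan stands in for the approximate index used in production:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in embeddings: in production, a trained bi-encoder maps each
# passage to a 768-dimensional vector; here we use random unit vectors.
passage_vecs = rng.normal(size=(1000, 768)).astype(np.float32)
passage_vecs /= np.linalg.norm(passage_vecs, axis=1, keepdims=True)

def search(query_vec, k=10):
    """Exact nearest-neighbor search by dot product (cosine on unit vectors).
    Approximate indexes such as HNSW replace this linear scan at scale."""
    scores = passage_vecs @ query_vec
    topk = np.argsort(-scores)[:k]
    return topk, scores[topk]

# A query vector identical to passage 42 must rank that passage first.
ids, scores = search(passage_vecs[42])
```

The same two-step structure (encode once offline, scan or index at query time) is what makes DR fast enough for first-stage retrieval.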
It is of great interest to use state-of-the-art DR models to build semantic search engines for industry. Such search engines can enable efficient access to and search of Elsevier's scientific literature and help researchers in their journey (Elsevier, 2022). Our goal in this work is to develop a semantic search engine whose DR model needs no relevance-labeled data to train, thus allowing easy adaptation to new domains and easy deployment in industry.
There are several challenges to be tackled. First, training a DR model requires sufficient labeled data such as MS-MARCO (Nguyen et al., 2016), whereas no such data often exists for specific domains or startups. In our case, we have a large collection of passages from scientific articles but no relevance labels. Furthermore, it has been shown that DR models trained on one domain do not generalize to another (Thakur et al., 2021). The passages in our corpus have a different word distribution from those in MS-MARCO. Besides, the passages are also unbalanced regarding their domains (see Section 4.3). Therefore, using models trained on MS-MARCO will not yield high retrieval performance, and it is interesting to tackle this domain difference problem. Finally, there is no test set to evaluate search performance, and creating a good-quality test set is time-consuming and expensive. All these challenges hinder the application of DR models in industrial settings.
In this work, we trained a DR model using a state-of-the-art unsupervised dense retrieval method called GPL (Wang et al., 2021). It uses a pre-trained query generator to generate queries from passages. A passage is considered positive for the queries generated from it. Negative passages for the generated queries are retrieved using existing dense retrieval models trained on MS-MARCO. An existing cross-encoder model trained on MS-MARCO produces relevance scores for query-passage pairs, which serve as supervision signals to train the DR model.
Finally, we constructed two test sets, by either manual annotation or automatic extraction of existing relevance information from the meta field of the corpus. The experimental results show that our best model can significantly improve retrieval performance compared to lexical and semantic search baselines.
The semantic search engine we have created for our product is shown in Figure 1. It is currently deployed and running in beta test mode.

Dense retrieval
The first work on dense retrieval (DR) was proposed by Karpukhin et al. (2020). DR uses text encoders to represent queries and documents as dense vectors and retrieves documents by the similarity scores between query vectors and document vectors. It has been shown to achieve competitive first-stage retrieval performance compared with traditional lexical retrieval methods.

Unsupervised dense retrieval
Unsupervised dense retrieval (UDR) aims to train dense retrieval models without manually labeled data. It generates high-quality pseudo-labeled data and designs suitable loss functions to train DR models.
The first step is to generate positive examples, which is done by extraction or generation. For example, Izacard et al. (2021) extracted pairs of relevant texts from the same document using the inverse cloze task and independent cropping, while Wang et al. (2021) generated pseudo queries from passages with a pre-trained query generator. The second step is to obtain supervision signals: in-batch or retrieved passages have been used as negatives (Izacard et al., 2021; Xu et al., 2022). On the other hand, relevance scores from an existing generalizable cross-encoder have been used as the supervision signal (Wang et al., 2021).

GPL Model Training
Since there are no relevance labels for the passages in our corpus, we apply a recent unsupervised dense retrieval method, GPL (Wang et al., 2021), to train our dense retrieval model. We generate 3 queries from each passage using a pre-trained query generator (Nogueira et al., 2019). The passage-query pairs serve as pseudo positive examples. For each generated query, we retrieve similar passages using two existing DR models trained on the MS-MARCO dataset (Reimers and Gurevych, 2019), and take the first 50 from each model as pseudo negatives. Finally, we use a student-teacher training method. The teacher model is a cross-encoder trained on MS-MARCO, which shows good performance in zero-shot retrieval tasks (Hofstätter et al., 2020). The student model is the bi-encoder DR model to be learned.
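The three pseudo-labeling steps above can be sketched as follows. All model calls here are deliberately trivial stubs (the real pipeline uses a T5 query generator, MS-MARCO bi-encoders, and a cross-encoder); only the structure of the generated training examples is faithful:

```python
import numpy as np

rng = np.random.default_rng(1)

def generate_queries(passage, n=3):
    """Stub for the pre-trained query generator: n pseudo queries per passage."""
    return [f"query {i} for: {passage[:20]}" for i in range(n)]

def embed(text):
    """Stub for an MS-MARCO bi-encoder used to mine negatives."""
    v = rng.normal(size=32)
    return v / np.linalg.norm(v)

def cross_encoder_score(query, passage):
    """Stub for the teacher cross-encoder's relevance score (word overlap)."""
    return float(len(set(query.split()) & set(passage.split())))

passages = [f"passage number {i} about topic {i % 5}" for i in range(100)]
passage_vecs = np.stack([embed(p) for p in passages])

training_examples = []
for pid, passage in enumerate(passages[:3]):
    for q in generate_queries(passage):                # step 1: pseudo positives
        qv = embed(q)
        ranked = np.argsort(-(passage_vecs @ qv))
        negs = [i for i in ranked if i != pid][:50]    # step 2: mined negatives
        for nid in negs:
            # step 3: teacher margin between positive and negative
            delta = cross_encoder_score(q, passage) - cross_encoder_score(q, passages[nid])
            training_examples.append((q, pid, nid, delta))
```

Each resulting tuple (query, positive id, negative id, margin) is one training example for the distillation loss described below.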
Student-teacher training is used because the pseudo labels are noisy and cannot be directly used with the traditional pairwise ranking loss (Burges, 2010) or contrastive loss (Wu et al., 2018). Instead, cross-encoders have been demonstrated to generalize well across datasets (Hofstätter et al., 2020) and can thus be used as teacher models through knowledge distillation.
For the knowledge distillation we have used the MarginMSE loss (Hofstätter et al., 2020). It is defined as
$$\mathcal{L}_{MarginMSE} = \frac{1}{N}\sum_{i=1}^{N}\big(\delta_{be}(q_i, p_i^+, p_i^-) - \delta_{ce}(q_i, p_i^+, p_i^-)\big)^2,$$
where $\delta_{be}(q_i, p_i^+, p_i^-) = f_{be}(q_i) \cdot f_{be}(p_i^+) - f_{be}(q_i) \cdot f_{be}(p_i^-)$ and $\delta_{ce}(q_i, p_i^+, p_i^-) = f_{ce}(q_i, p_i^+) - f_{ce}(q_i, p_i^-)$. Here $f_{be}$ is the bi-encoder, which maps the text of a query or passage to a vector, $f_{ce}$ is the cross-encoder, which maps the text of a query and a passage to a score, $q_i$ is a query, $p_i^+$ is a positive passage, and $p_i^-$ is a negative passage. By minimizing $\mathcal{L}_{MarginMSE}$, the MarginMSE loss avoids the hard treatment of positives and negatives found in the pairwise ranking loss (Burges, 2010) and contrastive loss (Wu et al., 2018). For example, for (false positive, negative) pairs or (positive, false negative) pairs, we do not expect the bi-encoder to put them far apart in the embedding space or to give them very different similarity scores. The cross-encoder will assign a small $\delta$ value and the MarginMSE loss will teach the bi-encoder to produce a small $\delta$ value as well.
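As a sanity check, the MarginMSE objective can be written in a few lines of numpy. This is a toy illustration of the loss itself, not the training code; the vectors and teacher scores are made up:

```python
import numpy as np

def margin_mse_loss(q_vecs, pos_vecs, neg_vecs, teacher_pos, teacher_neg):
    """MarginMSE: make the bi-encoder's score margin match the cross-encoder's.
    Student margin: s(q, p+) - s(q, p-), with s the dot product of embeddings."""
    delta_student = np.sum(q_vecs * pos_vecs, axis=1) - np.sum(q_vecs * neg_vecs, axis=1)
    delta_teacher = teacher_pos - teacher_neg
    return float(np.mean((delta_student - delta_teacher) ** 2))

# If the student margin already equals the teacher margin, the loss is zero.
q = np.array([[1.0, 0.0]])
p_pos = np.array([[2.0, 0.0]])   # student score s(q, p+) = 2
p_neg = np.array([[1.0, 0.0]])   # student score s(q, p-) = 1
loss = margin_mse_loss(q, p_pos, p_neg,
                       teacher_pos=np.array([2.0]), teacher_neg=np.array([1.0]))
```

Because only the *difference* of scores is supervised, the student is free to assign any absolute scores, which is what makes the loss tolerant of noisy pseudo labels.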
Implementation. We use all-t5-base-v1 as the query generator because it is designed to generate keyword queries, which are similar to the terms or topics people search for in our product. We use msmarco-distilbert-base-v3 and msmarco-MiniLM-L-6-v3 as the negative retrievers, and ms-marco-MiniLM-L-6-v2 as the teacher cross-encoder, as suggested in GPL. We use sebastian-hofstaetter/distilbert-dot-tas_b-b256-msmarco as the starting checkpoint of the student bi-encoder because it is the best bi-encoder on MS-MARCO. The teacher model and the student model contain 22M and 66M parameters, respectively. All the aforementioned models can be downloaded from Huggingface (https://huggingface.co/models). We set the batch size to 16 and the maximum sequence length to 512. Note that the passages are snippets from the articles and have on average 474 English words or 723 WordPiece (Wu et al., 2016) tokens. Cutting off the passages loses information; it would be worthwhile to split the passages into shorter ones, and we leave this for future study.

Test Set Construction
Corpus. The corpus we are working on supports a web service providing concept definitions and subject overviews for researchers who want to expand their knowledge of scholarly and technical terms.4 For example, for the term "water purification", a web page is created that contains its definition, several scientific article snippets containing other definitions of the term, and several relevant terms as well. The corpus contains about 2M passages extracted from scientific articles. The articles come from 20 different domains and are not evenly distributed across them. Figure 3 shows the domain distribution.
Manual test set. We aim to develop a semantic search engine on top of this corpus, so that when a user searches for a term, the engine returns passages containing the definition of the term. Therefore, the ideal queries are questions about scientific terms, and the ideal relevant passages are those that discuss (part of) the definitions of the terms.
As the data contains scientific terms from 20 domains, we sample one term from each. We only sample terms that have Wikipedia pages, to increase the chance that relevant passages exist for a query.
We use the widely used pooling method from information retrieval (Ferro and Peters, 2019) to select passages for annotation. We include 3 different retrieval systems in the pool: the BM25 model (Pérez-Iglesias et al., 2009), the TAS model (Hofstätter et al., 2021), and the GPL model trained by us, in order to ensure the passages in the test set are diverse and not biased towards either lexical or semantic retrieval methods. We randomly sample from the top-10 passages in the ranking lists.
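The pooling step can be sketched as follows: take the union of each system's top-10, then randomly sample pairs for annotation so no single retrieval method dominates the pool. The ranked lists and pool sizes below are illustrative, not our actual data:

```python
import random

def build_pool(rankings, depth=10, sample_size=5, seed=0):
    """Pooling: union the top-`depth` passage ids from every system,
    then draw a random sample of the pool for manual annotation."""
    pool = set()
    for ranked_list in rankings.values():
        pool.update(ranked_list[:depth])
    rng = random.Random(seed)
    return rng.sample(sorted(pool), min(sample_size, len(pool)))

# Toy ranked lists of passage ids from the three pooled systems.
rankings = {
    "BM25": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
    "TAS":  [3, 4, 12, 13, 1, 2, 14, 15, 16, 17],
    "GPL":  [20, 3, 1, 21, 22, 23, 24, 25, 26, 27],
}
to_annotate = build_pool(rankings)
```

Note that passage 11, which only appears below rank 10 in one list, never enters the pool: the depth cutoff is what keeps annotation effort bounded.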
We had 3 workers annotate the selected query-passage pairs. Annotation conflicts were discussed until an agreement was reached. In total, 20 queries and 539 query-passage pairs were selected and annotated.
Auto test set. Although the manual test set is of high quality, it is too small and thus sensitive for evaluation. We therefore use the noisy meta information in our corpus to construct a larger test set. The passages in the corpus are organised by terms. Each term is associated with several passages that are considered relevant and contain the definition of the term. The extraction of definitions and relevant passages is done by a production system based on lexical methods, so the passages can be roughly taken as relevant to their term. To balance terms from different domains, we sample 10 terms from each domain and take all the passages associated with each term as relevant. In total, we have 200 queries and 3,562 relevance labels.

System Architecture
Figure 2 shows the architecture of the semantic search engine. The system is divided into two parts, offline and online. In the offline part, we download the corpus from Amazon S3 buckets; then, on Amazon SageMaker, we preprocess the corpus, train the bi-encoder model, and convert the passages into 768-dimensional vectors using the trained model. The HNSW5 algorithm is used to index the passages.
The online part is itself divided into two components. The first is an API-based service running on Amazon SageMaker. Its task is to convert the user query into a vector and find the passages closest to the query vector using the index created offline. The second is a UI-based interface running on an Amazon EC2 instance, which processes the user query and displays the associated passages.
The EC2 instance and API run on an Intel Xeon-based processor, and the cost of running them is 1 dollar per hour. For training the model, we use the AWS p3.8xlarge EC2 instance type, which is equipped with NVIDIA Tesla V100 GPUs. The cost of training the model was approximately 200 dollars. During inference, the system runs on a CPU instance and is able to process up to 70 requests/second. The average time needed to get the search results for a query is 0.03 seconds.

Baselines
TAS (Hofstätter et al., 2021). The sebastian-hofstaetter/distilbert-dot-tas_b-b256-msmarco model is a zero-shot baseline. It was the best bi-encoder on MS-MARCO when this paper was submitted. We also use this model as the starting checkpoint to train the GPL model.
BM25+CE. This is a two-stage baseline implemented by us. It consists of lexical retrieval and re-ranking. We first use BM25 to produce a ranked list of passages, then use a cross-encoder, ms-marco-MiniLM-L-6-v2, trained on MS-MARCO to re-rank the top-1000 passages. We use this model as the teacher model when training GPL.

Dataset
We use two test sets, Manual and Auto. Table 2 shows the statistics. Manual has 20 queries and Auto has 200 queries; Auto also has more labeled query-passage pairs. Note that there are no non-relevant labels for Auto; however, this does not affect the evaluation, as all remaining passages without a relevance label are counted as non-relevant. To speed up evaluation, we randomly sample a subset of passages for the models to retrieve from, combined with the passages in each of the two test sets. This results in two test corpora consisting of 100,513 and 102,506 passages for Manual and Auto, respectively. The test corpora have the same domain distribution as the full corpus.
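The convention of treating unlabeled passages as non-relevant falls out naturally when the relevance judgments are stored as a dictionary. A minimal NDCG@10 sketch (the qrels and ranking below are made-up examples):

```python
import math

def ndcg_at_10(ranked_ids, qrels):
    """qrels maps passage id -> relevance gain; any id absent from qrels
    counts as non-relevant (gain 0), as in our Auto test set evaluation."""
    dcg = sum(qrels.get(pid, 0) / math.log2(rank + 2)
              for rank, pid in enumerate(ranked_ids[:10]))
    ideal = sorted(qrels.values(), reverse=True)[:10]
    idcg = sum(g / math.log2(rank + 2) for rank, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

qrels = {"p1": 1, "p7": 1}                         # labels from the meta field
perfect = ndcg_at_10(["p1", "p7", "p3"], qrels)    # both relevant docs on top
```

A ranking that places the relevant passages first scores 1.0; pushing them down the list lowers the score, regardless of how many unlabeled passages appear.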

Retrieval performance
In this section, we aim to answer RQ 1. We use a subset of 83K passages from our corpus, generate 3 queries for each passage, and generate 100 negative passages for each query. Finally, we sample 4M training examples in the format $(q_i, p_i^+, p_i^-, \delta_i)$. It has been suggested that such a volume is enough to train a GPL model for a new domain (Wang et al., 2021). We also empirically examine the impact of training example size in Section 5.2. We train two GPL models: GPL is trained on 83K passages sampled randomly, while GPL_BLC is trained on 83K passages sampled evenly from the 20 domains. Since we aim to build a one-stage retrieval model, we compare our models with a lexical retrieval model, BM25, and a zero-shot dense retrieval model, TAS. We also compare with a two-stage method, BM25+CE.
Table 1 shows the retrieval performance. First, BM25 performs robustly well on the two test sets, while zero-shot TAS performs poorly. This indicates that dense retrieval models do not generalize well to new domains, a finding consistent with the work of Thakur et al. (2021). The difference between BM25 and TAS is larger on Auto, because we annotated both lexically and semantically relevant passages in the Manual test set, while most relevant passages in the Auto test set were obtained by lexical methods only. The dense retrieval model TAS is thus underestimated on Auto. Second, BM25+CE performs best. It improves NDCG@10, MAP@10, and MRR@10 by a large margin compared to BM25. The cross-encoder model (ms-marco-MiniLM-L-6-v2) is trained on MS-MARCO; the result thus indicates the good generalization capability of cross-encoder ranking models. Third, GPL and GPL_BLC perform better than BM25 on most of the metrics and better than TAS on all of them. For example, an MRR@10 of 87.62 means that GPL_BLC ranks a relevant passage in the first or second position for the average query, and an R@100 of 91.77 means that GPL_BLC retrieves 91.77% of the relevant passages in the top 100. Note that the performance difference between GPL and GPL_BLC is large on Manual but small on Auto. A possible reason is that on Auto most semantically relevant passages are not labeled in the test set.

The impact of training example size
In this section, we aim to answer RQ 2. We use all 2M passages in the corpus and generate 32M training examples to train the model. We save a checkpoint every 160K examples and evaluate model performance on the Manual test set.
Figure 5 shows the NDCG@10 score against the number of training examples. We observe that more training examples do help to improve the performance of the model. The performance increases quickly at the beginning, reaching an NDCG@10 of 0.74 with about 1M training examples, and then increases slowly towards an NDCG@10 of 0.80.
To sum up, it is not necessary to train the GPL model with all passages in our corpus; a volume of 1M training examples should be sufficient.

The impact of domain distribution
In this section, we aim to answer RQ 3. Since our corpus contains meta information about which domain each passage belongs to, we compare the model trained on 83K randomly sampled passages (Unbalanced) with the model trained on 83K passages evenly distributed over the 20 domains (Balanced). Figure 4 shows the NDCG@10 of the two models, using the Manual set as the test set. We observe that (1) there is a large improvement for the balanced 83K corpus compared to the unbalanced one; and (2) the NDCG@10 increases for most queries, and the improvement is especially large for those with low NDCG@10.
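The balanced sampling itself is straightforward: group passages by their domain meta field, then draw the same number from each group. A small sketch on a skewed toy corpus (the domain labels and sizes are illustrative):

```python
import random
from collections import Counter

def balanced_sample(passages, domains, per_domain, seed=0):
    """Draw the same number of passages from every domain so that
    low-resource domains are not drowned out by the dominant ones."""
    rng = random.Random(seed)
    by_domain = {}
    for p, d in zip(passages, domains):
        by_domain.setdefault(d, []).append(p)
    sample = []
    for d, items in sorted(by_domain.items()):
        sample.extend(rng.sample(items, min(per_domain, len(items))))
    return sample

# Skewed toy corpus: domain A has 80 passages, B has 15, C has 5.
passages = list(range(100))
domains = ["A"] * 80 + ["B"] * 15 + ["C"] * 5
sample = balanced_sample(passages, domains, per_domain=5)
counts = Counter("A" if p < 80 else "B" if p < 95 else "C" for p in sample)
```

Random sampling from this corpus would yield roughly 80% domain-A passages; the balanced sample contains exactly five from each domain.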

Case study
In this section, we show one query and the top 3 ranked passages from the Manual test set to analyze retrieval effectiveness. We showcase three models: BM25, TAS, and GPL. The case study helps us understand how the retrieved passages differ between the DR model trained on the target domain, the zero-shot DR model, and the lexical retrieval model. BM25, as expected, retrieves passages containing exact matches of the query words. As it is a bag-of-words model, we observe that the words "water" and "purification" do not always appear together in the passages. TAS can retrieve semantically similar passages, but they are sometimes off topic. For example, the 1st passage retrieved by TAS is about "fuel purification"; it even contains a definition, but not one of "water purification". GPL can retrieve relevant passages that even contain the definition: for example, its 1st passage is a good definition of "water purification". The case clearly shows that lexical retrieval and dense retrieval find very different passages, because their ways of representing texts are completely different. Furthermore, training DR models on the target domain improves retrieval performance by a large margin even though the training labels are noisy.

Conclusions & Future Work
In this work, we build a semantic search engine on scientific articles. To tackle the challenge of having no labeled data for either training or testing, we apply a state-of-the-art unsupervised dense retrieval model named GPL. As the articles are unbalanced across different domains, we sample passages from multiple domains to form balanced training batches. We also created two test sets for the evaluation: one manually annotated and one automatically constructed from the meta information of our corpus. We compare the semantic search engine with the currently deployed lexical search engine on the two test sets. Both the qualitative and quantitative experimental results show that the semantic search engine can significantly improve search performance. These results suggest that GPL is a robust and effective model for unsupervised dense retrieval.
In future work, we will train the query generator and the negative retrievers on our own data to generate better-quality positive and negative training examples and further improve retrieval performance.

Limitations
Currently, we see 3 limitations in our work. First, the query generator is trained on a different domain, which can cause it to skip the important keywords or phrases around which a query should be generated. Second, the negative retrievers are not adapted to the domain; the passages they retrieve are negatives but not "hard negatives", which limits what the student model can learn. Third, the semantic search engine we built has not been evaluated with the production user population. We plan to conduct an online evaluation in the future.

Figure 1 :
Figure 1: Interface of our semantic search engine.

Figure 2 :
Figure 2: Architecture of the semantic search engine.

Figure 3 :
Figure 3: Passage domain distribution. The top 5 domains cover about 58.1% of the passages, while the bottom 5 domains contain only about 3.97%.
Figure 4 :
Figure 4: Query-wise NDCG@10 of the models trained with all the 4M training examples on the Unbalanced and the Balanced corpus.

Table 1 :
Retrieval performance (%). The best values for each metric and for the upper-bound method are in bold.

Table 2 :
Statistics of test sets.