Open-source Large Language Models are Strong Zero-shot Query Likelihood Models for Document Ranking

In the field of information retrieval, Query Likelihood Models (QLMs) rank documents based on the probability of generating the query given the content of a document. Recently, advanced large language models (LLMs) have emerged as effective QLMs, showcasing promising ranking capabilities. This paper focuses on investigating the genuine zero-shot ranking effectiveness of recent LLMs, which are solely pre-trained on unstructured text data without supervised instruction fine-tuning. Our findings reveal the robust zero-shot ranking ability of such LLMs, highlighting that additional instruction fine-tuning may hinder effectiveness unless a question generation task is present in the fine-tuning dataset. Furthermore, we introduce a novel state-of-the-art ranking system that integrates LLM-based QLMs with a hybrid zero-shot retriever, demonstrating exceptional effectiveness in both zero-shot and few-shot scenarios. We make our codebase publicly available at https://github.com/ielab/llm-qlm.

Despite this success, the strong effectiveness of PLM-based rankers does not always generalise without sufficient in-domain training data (Thakur et al., 2021; Zhuang and Zuccon, 2021a, 2022). Transferring knowledge from other domains has been used to overcome this issue (Lin et al., 2023) by training these rankers on large-scale supervised QA datasets such as MS MARCO (Nguyen et al., 2017). Alternatively, generative large language models (LLMs) like GPT3 (Brown et al., 2020) have been used to synthesize domain-specific training queries, which are then used to train these rankers (Bonifacio et al., 2022; Dai et al., 2023). Despite their effectiveness, all of these methods incur significant expense in training a PLM-based ranker.
In this paper, we consider a third avenue to address this challenge: leveraging LLMs to function as Query Likelihood Models (QLMs) (Ponte and Croft, 1998; Hiemstra, 2000; Zhai and Lafferty, 2001). Essentially, QLMs are expected to understand the semantics of documents and queries, and to estimate the likelihood that each document can answer a certain query. Notably, recent advances in this direction have greatly enhanced the ranking effectiveness of QLM-based rankers by leveraging PLMs like BERT (Devlin et al., 2019) and T5 (Raffel et al., 2020). These PLM-based QLMs are fine-tuned on query generation tasks and subsequently employed to rank documents according to their query likelihood (Nogueira dos Santos et al., 2020; Zhuang et al., 2021; Lesota et al., 2021; Zhuang and Zuccon, 2021c).
We focus on a specific PLM-based QLM, the recently proposed Unsupervised Passage Re-ranker (UPR) (Sachan et al., 2022). UPR leverages advanced LLMs to obtain the query likelihood estimations. Empirical results show that large gains in ranking effectiveness can be obtained by using the T0 LLM (Sanh et al., 2022) as a QLM. A key aspect of this work is that this effectiveness is obtained without requiring additional fine-tuning data, which leads Sachan et al. to highlight the zero-shot ranking capabilities of their LLM-based QLM. However, we argue that the experimental setting used by Sachan et al. does not fully align with a genuine zero-shot scenario for the QLM ranking task. This is because T0 has already undergone fine-tuning on numerous question generation (QG) tasks and datasets, subsequent to its unsupervised pre-training.¹ Consequently, there exists a discernible task leakage to the downstream QLM ranking task, thereby rendering their approach more akin to a transfer learning setting than to a true zero-shot approach.
To gain a comprehensive understanding of the zero-shot ranking capabilities of LLM-based QLM rankers, in this paper we take a fresh look at this topic. Our approach involves harnessing the power of state-of-the-art transformer decoder-only LLMs, such as LLaMA (Touvron et al., 2023), which have undergone pre-training solely on unstructured text through unsupervised next token prediction. Importantly, the models we consider have not undergone any additional supervised instruction fine-tuning, ensuring a genuinely zero-shot setting for our investigation.
We further extend our analysis by comparing the effectiveness of these LLMs with various popular instruction-tuned LLMs in the context of zero-shot ranking tasks. Interestingly, our findings reveal that further instruction fine-tuning adversely affects the effectiveness of QLM ranking, particularly when the fine-tuning datasets lack specific QG tasks. This insight highlights the strong zero-shot QLM ranking ability of LLMs that solely rely on pre-training, thereby suggesting that further instruction fine-tuning is unnecessary for achieving strong zero-shot effectiveness. Building upon these insights, we push the boundaries of zero-shot ranking even further by integrating a hybrid zero-shot first-stage retrieval system, followed by re-ranking using the zero-shot LLM-based QLM re-rankers and a relevance score interpolation technique (Wang et al., 2021). Our approach achieves state-of-the-art effectiveness in zero-shot ranking on a subset of the BEIR dataset (Thakur et al., 2021).

Methodology
Zero-shot QLM re-ranker: We follow the setting introduced in previous works (Zhuang et al., 2021; Sachan et al., 2022) to evaluate the zero-shot QLM ranking capability of LLMs. Specifically, given a sequence of query tokens q and a set D containing candidate documents retrieved by a first-stage zero-shot retriever such as BM25, the objective is to rank all candidate documents d ∈ D based on the average log likelihood of generating all query tokens, as estimated by an LLM. The relevance scoring function is defined as:
S_{QLM}(q, d) = \frac{1}{|q|} \sum_{t=1}^{|q|} \log \mathrm{LLM}(q_t \mid p, d, q_{<t})    (1)
Here, q_t denotes the t-th token of the query, p is a model- and task-specific prompt used for prompting the LLM to behave like a question generator (see Appendix A for more details), and LLM(q_t | p, d, q_{<t}) refers to the probability of generating the token q_t given the prompt p, the candidate document d, and the preceding query tokens q_{<t}. It is important to note that, in a truly zero-shot ranking pipeline, the first-stage retriever should be a zero-shot method, and the QLM re-ranker should be pre-trained exclusively on unsupervised unstructured text data, with no fine-tuning on any QG data.
¹ There are at least 16 QG datasets according to the open-sourced T0 training: https://huggingface.co/datasets/bigscience/P3
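The scoring rule described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: `logprob_fn` is a hypothetical callable standing in for the LLM, assumed to return log LLM(q_t | p, d, q_<t) for every query token, and `toy_logprob_fn` is an invented stand-in used only so that the example runs without a model.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical interface: (query tokens, document text) -> per-token log-probabilities.
LogprobFn = Callable[[Sequence[str], str], List[float]]


def qlm_score(token_logprobs: Sequence[float]) -> float:
    """Average log-likelihood of the query tokens for one document."""
    if not token_logprobs:
        raise ValueError("the query must contain at least one token")
    return sum(token_logprobs) / len(token_logprobs)


def rank_by_query_likelihood(
    query_tokens: Sequence[str],
    docs: Dict[str, str],
    logprob_fn: LogprobFn,
) -> List[Tuple[str, float]]:
    """Score every candidate document and sort by descending score."""
    scores = {
        doc_id: qlm_score(logprob_fn(query_tokens, text))
        for doc_id, text in docs.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


def toy_logprob_fn(query_tokens: Sequence[str], doc_text: str) -> List[float]:
    """Toy stand-in for an LLM: query tokens found in the document are 'likely'."""
    words = set(doc_text.lower().split())
    return [
        math.log(0.5) if tok.lower() in words else math.log(0.01)
        for tok in query_tokens
    ]


docs = {"d1": "covid vaccine trial results", "d2": "stock market news"}
ranking = rank_by_query_likelihood(["covid", "vaccine"], docs, toy_logprob_fn)
print(ranking)
```

Averaging rather than summing the token log-probabilities keeps scores comparable across queries of different lengths; for a fixed query the two give the same document ordering.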
Interpolating with first-stage retriever: Following Wang et al. (2021), instead of solely relying on the query likelihood scores estimated by the LLMs, we also linearly interpolate the QLM score with the BM25 score from the first-stage retriever by using the weighted score sum:
S(q, d) = \alpha \cdot S_{BM25}(q, d) + (1 - \alpha) \cdot S_{QLM}(q, d)    (2)
Here, S_{BM25} and S_{QLM} denote the BM25 score and the QLM score, and α ∈ [0, 1] is the weight that balances their contributions. In our experiments, we heuristically apply min-max normalization to the scores and assign more weight to the QLM scores, given their pivotal role in the second-stage re-ranker. This is achieved by setting α = 0.2 without conducting any grid search. We use the Python library ranx (Bassani and Romelli, 2022) to implement the interpolation algorithm.
Few-shot QLM re-ranker: Since LLMs are strong few-shot learners (Brown et al., 2020), we also conducted experiments to explore how LLM-based QLM re-rankers could be further enhanced by providing a minimal number of human-judged examples. To achieve this, we employed a prompt template known as "Guided by Bad Questions" (GBQ) (Bonifacio et al., 2022). The GBQ template consists of only three triples, each containing a document, a good question, and a bad question. We use it to guide the LLM-based QLM to produce more accurate query likelihood estimations. We refer readers to the original paper for details about the GBQ template.
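As a rough illustration of the weighted score sum with min-max normalization, the sketch below uses plain Python rather than ranx, so it should be read as an approximation of the procedure and not as the code used in the experiments. The function names, the toy scores, and the handling of constant score lists are assumptions.

```python
from typing import Dict


def min_max_normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1]; a constant score list maps to 0.0 (an assumption)."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 0.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def interpolate(
    bm25: Dict[str, float], qlm: Dict[str, float], alpha: float = 0.2
) -> Dict[str, float]:
    """alpha weights the BM25 score, (1 - alpha) weights the QLM score."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    bm25_n, qlm_n = min_max_normalize(bm25), min_max_normalize(qlm)
    # The re-ranker scores the BM25 candidates, so both runs cover the same documents;
    # a document missing from the BM25 run would contribute 0.0 here.
    return {
        doc_id: alpha * bm25_n.get(doc_id, 0.0) + (1 - alpha) * qlm_n[doc_id]
        for doc_id in qlm_n
    }


# Toy scores for one query (invented for illustration).
bm25_scores = {"d1": 12.0, "d2": 9.0, "d3": 6.0}
qlm_scores = {"d1": -3.0, "d2": -1.0, "d3": -2.0}
fused = interpolate(bm25_scores, qlm_scores, alpha=0.2)
print(sorted(fused, key=fused.get, reverse=True))
```

Normalizing first matters because BM25 scores and average log-likelihoods live on very different scales; without it, α would not express a meaningful trade-off.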

Experimental Settings
LLMs: Our focus is on the behaviour of LLMs in the QLM ranking task, specifically in a genuine zero-shot setting. To accomplish this, we used LLaMA (Touvron et al., 2023) and Falcon (Almazrouei et al., 2023), both of which are transformer decoder-only models that are pre-trained solely on large, publicly available unstructured datasets (Penedo et al., 2023). We specifically consider open-source LLMs because we can inspect the data used to train them, and thus verify that no QG dataset was used.
To evaluate the influence of instruction fine-tuning data on QLM estimation, we compared these models with other well-known LLMs that were fine-tuned with instructions, including T5 (Raffel et al., 2020), Alpaca (Taori et al., 2023), StableLM, StableVicuna, and Falcon-instruct (Almazrouei et al., 2023). It is important to note that the fine-tuning instruction data for these models are unlikely to include QG tasks. Additionally, we follow Sachan et al. (2022) and include T0 (Sanh et al., 2022) and FlanT5 (Chung et al., 2022), which underwent fine-tuning specifically for QG instructions. All LLMs used in this paper are openly available; see Appendix B for more details.
To ensure feasibility, we conducted experiments on a popular subset of the BEIR benchmark datasets: TRECC (Voorhees et al., 2021), DBpedia (Hasibi et al., 2017), FiQA (Maia et al., 2018), and Robust04 (Voorhees, 2005). The evaluation metric used is nDCG@10, the official metric of the BEIR benchmark. Statistical significance analysis was performed using Student's two-tailed paired t-test with corrections, as per common practice in information retrieval. The results of this analysis are reported in Appendix D due to space constraints.
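For readers unfamiliar with the metric, the following is a minimal sketch of nDCG@10 for a single query. It assumes the linear-gain formulation with a log2 rank discount; evaluation toolkits differ in details such as exponential gain, so this sketch may not reproduce the exact numbers of any particular tool. The function names and the toy relevance judgements are illustrative.

```python
import math
from typing import Dict, List, Sequence


def dcg_at_k(gains: Sequence[float], k: int = 10) -> float:
    """Discounted cumulative gain with linear gain and log2(rank + 1) discount."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_doc_ids: List[str], qrels: Dict[str, int], k: int = 10) -> float:
    """nDCG@k for one query; documents without a judgement count as non-relevant."""
    gains = [qrels.get(doc_id, 0) for doc_id in ranked_doc_ids]
    ideal_gains = sorted(qrels.values(), reverse=True)
    ideal = dcg_at_k(ideal_gains, k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0


# Toy judgements: "a" is highly relevant, "b" is partially relevant.
qrels = {"a": 2, "b": 1}
print(ndcg_at_k(["a", "b", "c"], qrels))  # ideal ordering
print(ndcg_at_k(["c", "b", "a"], qrels))  # relevant documents ranked too low
```

The benchmark score for a dataset is then the mean of this per-query value over all test queries.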

Main results
We present our main results in Table 1, highlighting key findings. For fair comparison, all the re-rankers consider the top 100 documents retrieved by BM25. Firstly, it is evident that retrievers and re-rankers fine-tuned on MS MARCO training data consistently outperform zero-shot retrievers and QLM re-rankers across all datasets, except for T5-QLM-large, which is based on a smaller T5 model. This outcome is expected, since these methods benefit from extensive human-judged QA training data and the knowledge can be effectively transferred to the datasets we tested.
On the other hand, zero-shot QLMs and QG fine-tuned QLMs exhibit similar, competitive effectiveness. This finding is somewhat surprising, considering that QG fine-tuned QLMs are explicitly trained on QG tasks, making them a form of transfer learning. This finding suggests that pre-trained-only models such as LLaMA and Falcon possess strong zero-shot QLM ranking capabilities.
Another interesting finding is that instruction tuning can hinder LLMs' QLM ranking ability if the QG task is not included in the instruction fine-tuning data. This is evident in the results of Alpaca-7B, StableVicuna-13B, Falcon-7B-instruct and Falcon-40B-instruct, which are instruction-tuned versions of LLaMA-7B, LLaMA-13B, Falcon-7B and Falcon-40B, respectively. Our hypothesis for this unexpected finding is that instruction-tuned models tend to pay more attention to the task instructions and less attention to the input content itself. Although they are good at following instructions in the generation task, the most important information for evaluating query likelihood is in the document content; thus, instruction tuning hurts query likelihood estimation for LLMs. On the other hand, QG instruction-tuned LLMs show large improvements in QLM ranking. For example, the T0 and FlanT5 models are QG-tuned versions of T5 models, and they perform better. These results confirm that T0 and FlanT5 leverage their fine-tuning data and thus should be considered within the transfer learning setting.
In terms of model size, larger LLMs generally tend to be more effective, although there are exceptions. For instance, LLaMA-7B outperforms LLaMA-13B on DBpedia.

Interpolation with BM25
Table 2 demonstrates the impact of interpolating with BM25 scores. Notably, we observe a large decrease in the effectiveness of monoT5 re-rankers, which are trained on large-scale QA domain data, when interpolating with BM25. This finding aligns with a study conducted by Yates et al. (2021). In contrast, QLM re-rankers exhibited higher effectiveness across most datasets when using interpolation.
We note that the results in Table 2 are obtained by setting α = 0.2 without tuning this parameter, because we test our method in a zero-shot setting where this parameter must be set without validation data. Nonetheless, we conduct a post-hoc analysis on TRECC to understand the sensitivity of the results to this parameter. The results are presented in Figure 1, from which we draw the following conclusions:
1. The interpolation strategy consistently has a negative impact on monoT5-3B, while it consistently benefits instruction-tuned and zero-shot re-rankers.
2. Instruction-tuned re-rankers consistently underperform their corresponding zero-shot re-rankers, regardless of the value of α.
3. Optimal values of α for both instruction-tuned and zero-shot re-rankers fall within the range of 0.1 to 0.4.

Effective ranking pipeline
In Table 3 we push the boundary of our two-stage QLM ranking pipeline in both zero-shot and few-shot settings to obtain high ranking effectiveness.
For this purpose, we use the same linear interpolation as Equation 2 with α = 0.5 to combine BM25 and HyDE as the zero-shot first-stage retriever. The top 100 documents retrieved by this hybrid retriever are then re-ranked using QLMs. Firstly, our results suggest that the effectiveness of zero-shot first-stage retrieval can be improved by simply interpolating sparse and dense retrievers. Moreover, after QLM re-ranking, the nDCG@10 values surpass those in Table 1. This indicates that zero-shot QLM re-rankers benefit from a stronger first-stage retriever, leading to improved overall ranking effectiveness. For the few-shot results, we observe that providing only three GBQ examples to the model further enhances ranking effectiveness, although this effect is less pronounced for FlanT5. Remarkably, our QLM ranking pipeline achieves nDCG@10 on par with or higher than the state-of-the-art PROMPTAGATOR method on comparable datasets in both zero-shot and few-shot settings. It is important to note that PROMPTAGATOR requires training on a large amount of synthetically generated data for both the retriever and re-ranker, whereas our approach does not require any training. It is also worth highlighting that instruction-tuned LLMs continue to exhibit lower effectiveness than their pre-trained-only counterparts, even when a better first-stage retriever is employed and under a few-shot setting.
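The two-stage pipeline described above can be sketched as follows. This is a simplified illustration under several assumptions: `sparse` and `dense` are per-query score dictionaries from BM25 and HyDE, `rerank_fn` is a hypothetical callable returning the QLM score of a candidate, documents missing from one run receive a normalized score of 0.0, and all names and toy numbers are invented.

```python
from typing import Callable, Dict, List, Tuple


def normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Min-max normalization to [0, 1]; constant inputs map to 0.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    span = hi - lo
    return {d: (s - lo) / span if span > 0 else 0.0 for d, s in scores.items()}


def hybrid_first_stage(
    sparse: Dict[str, float], dense: Dict[str, float], alpha: float = 0.5, k: int = 100
) -> List[str]:
    """Fuse sparse and dense scores with a weighted sum and keep the top-k documents."""
    s_n, d_n = normalize(sparse), normalize(dense)
    fused = {
        doc_id: alpha * s_n.get(doc_id, 0.0) + (1 - alpha) * d_n.get(doc_id, 0.0)
        for doc_id in set(s_n) | set(d_n)
    }
    return sorted(fused, key=fused.get, reverse=True)[:k]


def two_stage_rank(
    sparse: Dict[str, float],
    dense: Dict[str, float],
    rerank_fn: Callable[[str], float],
    k: int = 100,
) -> List[Tuple[str, float]]:
    """Retrieve with the hybrid first stage, then re-rank the candidates by QLM score."""
    candidates = hybrid_first_stage(sparse, dense, alpha=0.5, k=k)
    rescored = {doc_id: rerank_fn(doc_id) for doc_id in candidates}
    return sorted(rescored.items(), key=lambda kv: kv[1], reverse=True)


# Toy scores for one query (invented for illustration).
bm25 = {"d1": 10.0, "d2": 8.0, "d3": 2.0}
hyde = {"d2": 0.9, "d3": 0.7, "d4": 0.1}
toy_qlm = {"d1": -1.0, "d2": -2.0, "d3": -0.5, "d4": -3.0}
result = two_stage_rank(bm25, hyde, rerank_fn=lambda d: toy_qlm[d], k=2)
print(result)
```

Only the candidates that survive the first-stage cut-off are passed to the re-ranker, which is what keeps the cost of the LLM scoring step bounded.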

Conclusion
In this paper, we adapt recent advanced LLMs into QLMs for ranking documents and comprehensively study their zero-shot ranking ability. Our results highlight that these LLMs possess remarkable zero-shot ranking effectiveness. Moreover, we observe that LLMs with additional instruction fine-tuning underperform in this task. This important insight was overlooked in previous studies. Furthermore, our study shows that by integrating LLM-based QLMs with a hybrid zero-shot retriever, a novel state-of-the-art ranking pipeline can be obtained that excels in both zero-shot and few-shot scenarios, showcasing the effectiveness and versatility of LLM-based QLMs.

Limitations
While theoretically our QLM method can be applied to any LLM, practical implementation requires access to the model output logits. Therefore, in this paper, our focus has been solely on open-source LLMs, for which we have access to the model weights. In contrast, approaches like Inpars and PROMPTAGATOR, which extract knowledge from the text produced by LLMs, do not require access to the model weights. Common commercial API services that expose popular closed-source models such as GPT-4, however, do not provide access to model logits. These models can easily be used within Inpars and PROMPTAGATOR by directly leveraging the generated text, but they cannot be used with our method. It is possible that commercial LLM providers will in future add functionality to their APIs for accessing model logits.
Our focus on open-source LLMs also offers us the opportunity to scrutinise the data used to train the LLMs to ascertain that no QG data was used. This ensures that a genuine zero-shot setting is considered, as opposed to previous work on LLM-based QLMs (Sachan et al., 2022). Although LLaMA and Falcon are primarily pre-trained using unsupervised learning on unstructured text data, it remains possible that the pre-training data contains text snippets that serve as instructions and labels for the QG task. In order to ascertain the authenticity of the zero-shot setting, it may be necessary to thoroughly analyze and identify such text snippets within the pre-training data.
In the paper, we could not report a complete statistical significance analysis of the results due to space limitations. Appendix D reports a detailed analysis. However, our analysis was limited by the unavailability of run files for some of the models published in previous works, as they were not released by the authors. In these cases, we could not perform statistical comparisons with respect to the runs we produced. We note this is a common problem when authors do not release their models' runs. We make all run files available, along with code, at https://github.com/ielab/llm-qlm.
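To make the dependence on logits concrete, the sketch below shows how per-token log-probabilities could be derived from raw output logits with a numerically stable log-softmax. It is a pure-Python illustration with a toy three-token vocabulary; the function names and the `step_logits` layout (one vocabulary-sized list per query position) are assumptions, not the interface of any particular model library.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one vocabulary-sized logit vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def query_token_logprobs(
    step_logits: Sequence[Sequence[float]], query_token_ids: Sequence[int]
) -> List[float]:
    """Pick the log-probability of each actual query token at its position.

    step_logits[t] is assumed to hold the logits the model produced for
    predicting the t-th query token, given the prompt, document and earlier tokens.
    """
    if len(step_logits) != len(query_token_ids):
        raise ValueError("need exactly one logit vector per query token")
    return [
        log_softmax(logits)[token_id]
        for logits, token_id in zip(step_logits, query_token_ids)
    ]


# Toy example: vocabulary of 3 tokens, query of 2 tokens.
toy_logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
logprobs = query_token_logprobs(toy_logits, [0, 2])
print(sum(logprobs) / len(logprobs))  # average log-likelihood of the query
```

An API that returns only generated text gives no way to read off the probability assigned to a given query token, which is why such models cannot be scored in this manner.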
Table 4: Prompts used for each LLM-dataset pair. Each entry lists four prompts, one per dataset. For Alpaca-7B and StableLM-7B we also prepend a system prompt according to the fine-tuning recipe of each model. For Alpaca-7B it is "Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n". For StableLM-7B it is "<|SYSTEM|># StableLM Tuned (Alpha version)\n- StableLM is a helpful and harmless open-source AI language model developed by StabilityAI.\n- StableLM is excited to be able to help the user, but will refuse to do anything that could be considered harmful to the user.\n- StableLM is more than just an information source, StableLM is also able to write poetry, short stories, and make jokes.\n- StableLM will refuse to participate in anything that could harm a human.\n"
[model label missing in source]
1. Generate a question that is the most relevant to the given article's title and abstract.\n{doc}
2. Generate a query that includes an entity and is also highly relevant to the given Wikipedia page title and abstract.\n{doc}
3. Generate a question that is the most relevant to the given document.\n{doc}
4. Generate a question that is the most relevant to the given document.\n{doc}
T0-3B/T0-11B
1. Please write a question based on this passage.\n{doc}
2. Please write a question based on this passage.\n{doc}
3. Please write a question based on this passage.\n{doc}
4. Please write a question based on this passage.\n{doc}
LLaMA-7B/LLaMA-13B/Falcon-7B/Falcon-13B/Falcon-7B-instruct/Falcon-13B-instruct
1. Generate a question that is the most relevant to the given article's title and abstract.\n{doc}\n\nHere is a generated relevant question:
2. Generate a query that includes an entity and is also highly relevant to the given Wikipedia page title and abstract.\n{doc}\n\nHere is a generated relevant question:
3. Generate a question that is the most relevant to the given document.\nThe document: {doc}\n\nHere is a generated relevant question:
4. Generate a question that is the most relevant to the given document.\nThe document: {doc}\n\nHere is a generated relevant question:
Alpaca-7B
1. ### Instruction:\nGenerate a question that is the most relevant to the given article's title and abstract.\n\n### Input:\n{doc}\n\n### Response:
2. ### Instruction:\nGenerate a query that includes an entity and is also highly relevant to the given Wikipedia page title and abstract.\n\n### Input:\n{doc}\n\n### Response:
3. ### Instruction:\nGenerate a question that is the most relevant to the given document.\n\n### Input:\n{doc}\n\n### Response:
4. ### Instruction:\nGenerate a question that is the most relevant to the given document.\n\n### Input:\n{doc}\n\n### Response:
StableLM-7B
1. <|USER|>Generate a question that is the most relevant to the given article's title and abstract.\n{doc}<|ASSISTANT|>Here is a generated relevant question:
2. <|USER|>Generate a query that includes an entity and is also highly relevant to the given Wikipedia page title and abstract.\n{doc}<|ASSISTANT|>Here is a generated relevant question:
3. <|USER|>Generate a question that is the most relevant to the given document.\nThe document: {doc}<|ASSISTANT|>Here is a generated relevant question:
4. <|USER|>Generate a question that is the most relevant to the given document.\nThe document: {doc}<|ASSISTANT|>Here is a generated relevant question:
StableVicuna-13B
1. ### Human: Generate a question that is the most relevant to the given article's title and abstract.\n{doc}\n### Assistant: Here is a generated relevant question:
2. ### Human: Generate a query that includes an entity and is also highly relevant to the given Wikipedia page title and abstract.\n{doc}\n### Assistant: Here is a generated relevant question:
3. ### Human: Generate a question that is the most relevant to the given document.\nThe document: {doc}\n### Assistant: Here is a generated relevant question:
4. ### Human: Generate a question that is the most relevant to the given document.\nThe document: {doc}\n### Assistant: Here is a generated relevant question:

A Models and datasets prompts
Given that various instruction-tuned LLMs might be fine-tuned using diverse system and instruction prompts, and that datasets vary in document formats across different domains, it is necessary to employ prompts tailored to each LLM-dataset pair to achieve optimal zero-shot ranking performance. Thus, we design a prompt for each LLM-dataset pair based on the LLM usage instructions provided by the original authors and on dataset features. All the prompts used for each LLM-dataset pair are listed in Table 4.
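As an illustration of how such a template might be applied, the sketch below fills one of the LLaMA/Falcon prompts from Table 4 with a document and appends the query whose likelihood is to be estimated. How the prompt, document, and query are concatenated and tokenized in the released code is not specified here, so the helper name `build_scoring_input` and the concatenation scheme are assumptions.

```python
from typing import Tuple

# One of the LLaMA/Falcon prompts listed in Table 4.
LLAMA_DOCUMENT_PROMPT = (
    "Generate a question that is the most relevant to the given document.\n"
    "The document: {doc}\n\n"
    "Here is a generated relevant question: "
)


def build_scoring_input(template: str, doc: str, query: str) -> Tuple[str, str]:
    """Return (conditioning prefix, full text).

    The prefix holds the prompt p and the document d; only the tokens of the
    query that follow the prefix would be scored by the LLM.
    """
    # str.replace is used instead of str.format so that braces inside the
    # document text cannot be misread as placeholders.
    prefix = template.replace("{doc}", doc)
    return prefix, prefix + query


prefix, full_text = build_scoring_input(
    LLAMA_DOCUMENT_PROMPT,
    doc="Aspirin reduces the risk of heart attack in some patients.",
    query="does aspirin prevent heart attacks",
)
print(full_text)
```

Keeping the prefix separate from the full text makes it straightforward to locate where the query begins, so that prompt and document tokens are excluded from the likelihood average.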

B List of Huggingface model names
Table 5 provides links to the Huggingface model hub (Wolf et al., 2020) for the LLMs used in this paper. All the models can be downloaded directly from the Huggingface model hub, with the exception of Alpaca-7B. For Alpaca-7B, we followed an open-sourced GitHub repository to perform the fine-tuning of LLaMA-7B ourselves.

C Descriptions of Baselines
• BM25 (Robertson and Zaragoza, 2009): A widely used statistical bag-of-words approach that is commonly used as the zero-shot first-stage retrieval method. We use the Pyserini "two-click reproductions" (Ma et al., 2022) to produce the BM25 results on BEIR datasets.
• QLM-Dirichlet (Zhai and Lafferty, 2001): The traditional QLM method that exploits term statistics and the Dirichlet smoothing technique to estimate query likelihood. We also use the Pyserini implementation for this baseline.
• HyDE (Gao et al., 2022): A two-step zero-shot first-stage retriever that leverages generative LLMs and Contriever. In the first step, a prompt is provided to an LLM to generate multiple documents relevant to the given query. Subsequently, in the second step, the generated documents are encoded into vectors using the Contriever query encoder and then aggregated to form a new query vector for the search process. We utilized the open-sourced implementation provided by the original authors for our experiments: https://github.com/texttron/hyde.
• Contriever-msmarco: A Contriever checkpoint further pre-trained on MS MARCO training data. We use the pre-built dense vector index and model checkpoint provided by Pyserini for this baseline.
• SPLADE-distill (Formal et al., 2022): A first-stage sparse retrieval model that exploits the BERT PLM to learn query/document sparse term expansion and weights. We use the pre-built index and SPLADE checkpoint provided by Pyserini to produce the results.
• DRAGON+ (Lin et al., 2023): A dense retriever model that is fine-tuned on an augmented MS MARCO corpus and uses multiple retrievers to conduct automatic relevance labeling. It stands as the current state-of-the-art dense retriever in the transfer learning setting. We use the scores reported on the BEIR leaderboard for this baseline.
• PROMPTAGATOR (Dai et al., 2023): These methods consist of a Transformer encoder-based retriever and re-ranker that are trained using synthetic queries generated by LLMs. They offer both zero-shot and few-shot settings. As public model checkpoints are not currently available, we refer to the scores reported in the original paper as our point of reference for comparing against our own methods and baselines.

D Statistical significance analysis
In Table 6 we report a statistical significance analysis for all the methods for which we can obtain a run file, along with our methods. The analysis was performed using Student's two-tailed paired t-test with corrections, as per common practice in information retrieval. We used the Python toolkit ranx (Bassani and Romelli, 2022) for generating the report.

• T5-QLM-large (Zhuang et al., 2021): A T5-based QLM method that is fine-tuned on MS MARCO QG training data. We implement this method with the open-sourced docTquery-T5 (Nogueira and Lin, 2019) checkpoint.
• monoT5-3B (Nogueira et al., 2020): A T5-based cross-encoder re-ranker that is fine-tuned on MS MARCO training data. We use the open-sourced implementation provided by the Inpars authors.
• monoT5-3B-Inpars-v2 (Jeronymo et al., 2023): A T5-based cross-encoder re-ranker that is fine-tuned on MS MARCO training data and in-domain synthetic queries generated by LLMs. It is the current state-of-the-art re-ranker in the transfer learning setting. We use the open-sourced implementation provided by the original authors.

Table 1: Main results. Re-rankers re-rank the top 100 documents retrieved by BM25. Transferred retrievers and re-rankers are fine-tuned on MS MARCO.

Table 3: Zero-shot/few-shot ranking systems. PROMPTAGATOR++ re-rankers use their own zero/few-shot PROMPTAGATOR first-stage retrievers; their scores are copied from the original paper as the model is not publicly available. Other re-rankers consider the top 100 documents retrieved by BM25 + HyDE.
*zero-shot and few-shot scenarios, showcasing the effectiveness and versatility of LLM-based QLMs.

Table 5: Huggingface model hub links for LLMs used in this paper.

Table 6: Overall effectiveness of the models and statistical significance analysis. The best results are highlighted in boldface. Superscripts denote significant differences (t-test, p ≤ 0.05). x -> y denotes the x retriever re-ranked by the y re-ranker.