Is ChatGPT Good at Search? Investigating Large Language Models as Re-Ranking Agents

Large Language Models (LLMs) have demonstrated remarkable zero-shot generalization across various language-related tasks, including search engines. However, existing work utilizes the generative ability of LLMs for Information Retrieval (IR) rather than direct passage ranking. The discrepancy between the pre-training objectives of LLMs and the ranking objective poses another challenge. In this paper, we first investigate generative LLMs such as ChatGPT and GPT-4 for relevance ranking in IR. Surprisingly, our experiments reveal that properly instructed LLMs can deliver competitive, even superior, results to state-of-the-art supervised methods on popular IR benchmarks. Furthermore, to address concerns about data contamination of LLMs, we collect a new test set called NovelEval, based on the latest knowledge and aiming to verify the model's ability to rank unknown knowledge. Finally, to improve efficiency in real-world applications, we delve into the potential for distilling the ranking capabilities of ChatGPT into small specialized models using a permutation distillation scheme. Our evaluation results show that a distilled 440M model outperforms a 3B supervised model on the BEIR benchmark. The code to reproduce our results is available at www.github.com/sunnweiwei/RankGPT.

As one of the most successful AI applications, Information Retrieval (IR) systems satisfy user requirements through several pipelined sub-modules, such as passage retrieval and re-ranking (Lin et al., 2020). Most previous methods heavily rely on manual supervision signals, which require significant human effort and demonstrate weak generalizability (Campos et al., 2016; Izacard et al., 2022). Therefore, there is a growing interest in leveraging the zero-shot language understanding and reasoning capabilities of LLMs in the IR area. However, most existing approaches primarily focus on exploiting LLMs for content generation (e.g., query or passage) rather than relevance ranking for groups of passages (Yu et al., 2023; Microsoft, 2023).
Compared to common generation settings, the objective of relevance re-ranking differs significantly from that of LLM pre-training: re-ranking agents need to comprehend the user requirement, globally compare the passages, and rank them by their relevance to the query. Therefore, leveraging the LLMs' capabilities for passage re-ranking remains a challenging and open problem.
To this end, we focus on the following research questions:
• (RQ1) How does ChatGPT perform on passage re-ranking tasks?
• (RQ2) How can we imitate the ranking capabilities of ChatGPT in a smaller, specialized model?
To answer the first question, we investigate prompting ChatGPT with two existing strategies (Sachan et al., 2022; Liang et al., 2022). However, we observe that they have limited performance and heavily rely on the availability of the log-probability of the model output. Thus, we propose an alternative instructional permutation generation approach, instructing the LLMs to directly output the permutation of a group of passages. In addition, we propose an effective sliding window strategy to address context length limitations. For a comprehensive evaluation of LLMs, we employ three well-established IR benchmarks: TREC (Craswell et al., 2020), BEIR (Thakur et al., 2021), and Mr.TyDi (Zhang et al., 2021). Furthermore, to assess the LLMs on unknown knowledge and address concerns of data contamination, we suggest collecting a continuously updated evaluation testbed and propose NovelEval, a new test set with 21 novel questions.
To answer the second question, we introduce a permutation distillation technique to imitate the passage ranking capabilities of ChatGPT in a smaller, specialized ranking model. Specifically, we randomly sample 10K queries from the MS MARCO training set, and each query is retrieved by BM25 with 20 candidate passages. On this basis, we distill the permutation predicted by ChatGPT into a student model using a RankNet-based distillation objective (Burges et al., 2005).
Our evaluation results demonstrate that GPT-4, equipped with zero-shot instructional permutation generation, surpasses supervised systems across nearly all datasets. Figure 1 illustrates that GPT-4 outperforms the previous state-of-the-art models by an average nDCG improvement of 2.7, 2.3, and 2.7 on TREC, BEIR, and Mr.TyDi, respectively. Furthermore, GPT-4 achieves state-of-the-art performance on the new NovelEval test set. Through our permutation distillation experiments, we observe that a 435M student model outperforms the previous state-of-the-art monoT5 (3B) model by an average nDCG improvement of 1.67 on BEIR. Additionally, the proposed distillation method demonstrates cost-efficiency benefits.
In summary, our contributions are three-fold:
• We examine instructional methods for LLMs on passage re-ranking tasks and introduce a novel permutation generation approach; see Section 3 for details.
• We comprehensively evaluate ChatGPT and GPT-4 on various passage re-ranking benchmarks, including a newly proposed NovelEval test set; see Section 5 for details.
• We propose a distillation approach for learning specialized models with the permutations generated by ChatGPT; see Section 4 for details.
2 Related Work
In this paper, we explore the use of ChatGPT and GPT-4 in passage re-ranking tasks, propose an instructional permutation generation method, and conduct a comprehensive evaluation on benchmarks from various domains, tasks, and languages. Recent work (Ma et al., 2023) concurrently investigated listwise passage re-ranking using LLMs. In comparison, our study provides a more comprehensive evaluation, incorporates a newly annotated dataset, and validates the proposed permutation distillation technique.

LLM Specialization
Despite their impressive capabilities, LLMs such as GPT-4 often come with high costs and lack open-source availability. As a result, considerable research has explored ways to distill the capabilities of LLMs into specialized, custom models. For instance, Fu et al. (2023) and Magister et al. (2023) have successfully distilled the reasoning ability of LLMs into smaller models. Self-instruct (Wang et al., 2023b; Taori et al., 2023) proposes iterative approaches to distill GPT-3 using its own outputs. Additionally, Sachan et al. (2023) and Shi et al. (2023) utilize the generation probability of LLMs to improve retrieval systems. This paper presents a permutation distillation method that leverages ChatGPT as a teacher to obtain specialized re-ranking models. Our experiments demonstrate that even with a small amount of ChatGPT-generated data, the specialized model can outperform strong supervised systems.

Passage Re-Ranking with LLMs
Ranking is the core task in information retrieval applications, such as ad-hoc search (Lin et al., 2020; Fan et al., 2022), Web search (Zou et al., 2021), and open-domain question answering (Karpukhin et al., 2020). Modern IR systems generally employ a multi-stage pipeline where the retrieval stage focuses on retrieving a set of candidates from a large corpus, and the re-ranking stage aims to re-rank this set to output a more precise list. Recent studies have explored LLMs for zero-shot re-ranking, such as instructional query generation or relevance generation (Sachan et al., 2022; Liang et al., 2022). However, existing methods have limited performance in re-ranking and heavily rely on the availability of the log-probability of the model output, and thus cannot be applied to the latest LLMs such as GPT-4. Since ChatGPT and GPT-4 have a strong capacity for text understanding, instruction following, and reasoning, we introduce a novel instructional permutation generation method with a sliding window strategy to directly output a ranked list given a set of candidate passages. Figure 2 illustrates examples of three types of instructions; all the detailed instructions are included in Appendix A.

Instructional Permutation Generation
As illustrated in Figure 2 (c), our approach involves inputting a group of passages into the LLMs, each identified by a unique identifier (e.g., [1], [2], etc.). We then ask the LLMs to generate the permutation of passages in descending order based on their relevance to the query. The passages are ranked using the identifiers, in a format such as [2] > [3] > [1] > etc. The proposed method ranks passages directly without producing an intermediate relevance score.
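Since the model output is free text, the identifier list has to be parsed back into a ranking. The following is a minimal sketch (the function name and fallback policy are ours, not from the paper) that also tolerates the occasional duplicate or missing identifiers discussed in Appendix F:

```python
import re

def parse_permutation(response: str, num_passages: int) -> list:
    """Parse a response like "[2] > [3] > [1]" into a passage ordering.

    Returns 0-based passage indices, most relevant first. Duplicate
    identifiers are dropped and missing ones appended in their original
    order, since LLM output is not guaranteed to be a clean permutation.
    """
    seen = []
    for match in re.findall(r"\[(\d+)\]", response):
        idx = int(match) - 1  # identifiers in the prompt are 1-based
        if 0 <= idx < num_passages and idx not in seen:
            seen.append(idx)
    # append any identifiers the model omitted
    seen += [i for i in range(num_passages) if i not in seen]
    return seen
```

The fallback of appending omitted identifiers keeps the output a valid permutation even when the model misbehaves.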

Sliding Window Strategy
Due to the token limitations of LLMs, we can only rank a limited number of passages using the permutation generation approach. To overcome this constraint, we propose a sliding window strategy. Figure 3 illustrates an example of re-ranking 8 passages using a sliding window. Suppose the first-stage retrieval model returns M passages. We re-rank these passages in back-to-first order using a sliding window. This strategy involves two hyperparameters: window size (w) and step size (s). We first use the LLMs to rank the passages from the (M − w)-th to the M-th. Then, we slide the window in steps of s and re-rank the passages within the range from the (M − w − s)-th to the (M − s)-th. This process is repeated until all passages have been re-ranked.
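The back-to-first sliding pass described above can be sketched as follows, where `rank_window` stands in for one LLM call with the permutation-generation instruction (the names are illustrative, not from the released code):

```python
def sliding_window_rerank(passages, rank_window, window_size=20, step=10):
    """Re-rank a list back-to-first with overlapping windows.

    `rank_window` takes a list of passages and returns them re-ordered,
    most relevant first; here it stands in for one LLM call. Each window
    overlaps the previous one by (window_size - step) passages, so the
    best passages of a window get re-ranked again in the next window.
    """
    ranked = list(passages)
    end = len(ranked)
    while end > 0:
        start = max(0, end - window_size)
        ranked[start:end] = rank_window(ranked[start:end])
        if start == 0:  # reached the front of the list
            break
        end -= step
    return ranked
```

With a toy `rank_window` that sorts numbers in descending order, one can verify that the top of the final list is the global maximum even though no single call sees all items.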

Specialization by Permutation Distillation
Although ChatGPT and GPT-4 are highly capable, they are also too expensive to deploy in commercial search systems. Using GPT-4 to re-rank passages would greatly increase the latency of the search system. In addition, large language models suffer from the problem of unstable generation. Therefore, we argue that the capabilities of large language models are redundant for the re-ranking task. Thus, we can distill the re-ranking capability of large language models into a small model by specialization.

Permutation Distillation
In this paper, we present a novel permutation distillation method that aims to distill the passage re-ranking capability of ChatGPT into a specialized model. The key difference between our approach and previous distillation methods is that we directly use the model-generated permutation as the target, without introducing any inductive bias such as consistency-checking or log-probability manipulation (Bonifacio et al., 2022; Sachan et al., 2023).
To achieve this, we sample 10,000 queries from MS MARCO and retrieve 20 candidate passages using BM25 for each query. The distillation objective is to reduce the difference between the permutation outputs of the student and of ChatGPT.

Training Objective
Formally, suppose we have a query q and M passages (p_1, ..., p_M) retrieved by BM25 (M = 20 in our implementation). ChatGPT with instructional permutation generation produces the ranking results of the M passages, denoted as R = (r_1, ..., r_M), where r_i ∈ [1, 2, ..., M] is the rank of the passage p_i. For example, r_i = 3 means p_i ranks third among the M passages. Now we have a specialized model s_i = f_θ(q, p_i) with parameters θ that calculates the relevance score s_i of the pair (q, p_i) using a cross-encoder. Using the permutation R generated by ChatGPT, we employ the RankNet loss (Burges et al., 2005) to optimize the student model:

L_RankNet = Σ_{i=1}^{M} Σ_{j=1}^{M} 1[r_i < r_j] · log(1 + exp(s_j − s_i))

RankNet is a pairwise loss that measures the correctness of relative passage orders. When using permutations generated by ChatGPT, we can construct M(M − 1)/2 pairs.
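As a concrete reference, the RankNet objective over a teacher permutation can be computed as in this minimal pure-Python sketch (in practice the double sum would be vectorized over batched tensor scores):

```python
import math

def ranknet_loss(scores, ranks):
    """Pairwise RankNet loss against a teacher permutation.

    scores[i]: student relevance score s_i for passage p_i.
    ranks[i]:  teacher rank r_i (1 = most relevant).
    For each pair with r_i < r_j the student should score p_i higher;
    log(1 + exp(s_j - s_i)) penalises inverted or weakly separated pairs.
    """
    loss = 0.0
    m = len(scores)
    for i in range(m):
        for j in range(m):
            if ranks[i] < ranks[j]:
                loss += math.log(1.0 + math.exp(scores[j] - scores[i]))
    return loss
```

A correctly ordered pair yields a small loss term, while an inverted pair yields a large one, which is what drives the student toward the teacher's ordering.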

Specialized Model Architecture
Regarding the architecture of the specialized model, we consider two model structures: the BERT-like model and the GPT-like model.

BERT-like model.
We utilize a cross-encoder model (Nogueira and Cho, 2019) based on DeBERTa-large. It concatenates the query and passage with a [SEP] token and estimates relevance using the representation of the [CLS] token.

GPT-like model.
We utilize LLaMA-7B (Touvron et al., 2023) with a zero-shot relevance generation instruction (see Appendix A). It classifies the query–passage pair as relevant or irrelevant by generating a relevance token. The relevance score is then defined as the generation probability of the relevance token.
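The relevance score described above can be sketched as a softmax over the decoder's next-token logits at the answer position. Whether the softmax is taken over the full vocabulary or only over the "Yes"/"No" tokens is a design choice the text leaves open; this sketch assumes the simple full-softmax variant:

```python
import math

def relevance_score(logits, yes_id):
    """Softmax probability of the relevance ('Yes') token.

    `logits`: raw next-token logits from the decoder at the answer
    position; `yes_id`: vocabulary index of the relevance token.
    """
    z = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - z) for l in logits]
    return exps[yes_id] / sum(exps)
```

Passages are then sorted by this probability, so only the relative magnitude of the "Yes" logit matters for the final ordering.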
Figure 5 illustrates the structure of the two types of specialized models.

Datasets
Our experiments are conducted on three benchmark datasets and one newly collected test set, NovelEval.

Benchmark Datasets
The benchmark datasets include TREC-DL (Craswell et al., 2020), BEIR (Thakur et al., 2021), and Mr.TyDi (Zhang et al., 2021). TREC is a widely used benchmark dataset in IR research. We use the test sets of the 2019 and 2020 competitions: (i) TREC-DL19 contains 43 queries; (ii) TREC-DL20 contains 54 queries. BEIR consists of diverse retrieval tasks and domains. We choose eight tasks in BEIR to evaluate the models: (i) Covid retrieves scientific articles for COVID-19-related questions. (ii) NFCorpus is a bio-medical IR dataset. (iii) Touche is an argument retrieval dataset. (iv) DBPedia retrieves entities from the DBpedia corpus. (v) SciFact retrieves evidence for claim verification. (vi) Signal retrieves relevant tweets for a given news title. (vii) News retrieves relevant news articles for news headlines. (viii) Robust04 evaluates poorly performing topics.
Mr.TyDi is a multilingual passage retrieval dataset covering ten low-resource languages: Arabic, Bengali, Finnish, Indonesian, Japanese, Korean, Russian, Swahili, Telugu, and Thai. We use the first 100 samples in the test set of each language.

A New Test Set -NovelEval
The questions in current benchmark datasets were typically gathered years ago, which raises the issue that existing LLMs already possess knowledge of these questions (Yu et al., 2023). Furthermore, since many LLMs do not disclose information about their training data, there is a potential risk of contamination of the existing benchmark test sets (OpenAI, 2023). However, re-ranking models are expected to possess the capability to comprehend, deduce, and rank knowledge that is inherently unknown to them. Therefore, we suggest constructing continuously updated IR test sets to ensure that the questions, passages to be ranked, and relevance annotations have not been learned by the latest LLMs, enabling a fair evaluation.
As an initial effort, we built NovelEval-2306, a novel test set with 21 novel questions. This test set is constructed by gathering questions and passages from 4 domains that were published after the release of GPT-4. To ensure that GPT-4 did not possess prior knowledge of these questions, we presented them to both gpt-4-0314 and gpt-4-0613; neither model could answer them. For instance, the question "Which film was the 2023 Palme d'Or winner?" pertains to the Cannes Film Festival that took place on May 27, 2023, rendering its answer inaccessible to most existing LLMs. Next, we searched for 20 candidate passages for each question using Google search. The relevance of these passages was manually labeled as: 0 for not relevant, 1 for partially relevant, and 2 for relevant. See Appendix C for more details.
6 Experimental Results of LLMs

Implementation and Metrics
On the benchmark datasets, we re-rank the top-100 passages retrieved by BM25 with Pyserini and use nDCG@{1, 5, 10} as the evaluation metrics. Since ChatGPT cannot manage 100 passages at a time, we use the sliding window strategy introduced in Section 3.2 with a window size of 20 and a step size of 10. On NovelEval, we randomly shuffle the 20 candidate passages returned by Google and re-rank them using ChatGPT and GPT-4 with permutation generation.
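For reference, nDCG@k over graded relevance labels (0/1/2, as used in NovelEval) can be computed as below. This sketch uses the exponential-gain formulation, one common variant; trec_eval-based tooling may instead use a linear gain, so exact scores can differ across evaluators:

```python
import math

def ndcg_at_k(relevances, k=10):
    """nDCG@k for a ranked list of graded relevance labels (e.g., 0/1/2).

    Uses gain (2^rel - 1) with the standard log2 position discount;
    the ideal DCG comes from sorting the same labels in decreasing order.
    """
    def dcg(rels):
        return sum((2 ** r - 1) / math.log2(i + 2) for i, r in enumerate(rels[:k]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0
```

A perfectly ordered list scores 1.0; misplacing a highly relevant passage near the bottom lowers the score smoothly rather than binarily.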
Table 1 presents the evaluation results obtained on the TREC and BEIR datasets. The following observations can be made: (i) GPT-4, when equipped with the permutation generation instruction, demonstrates superior performance on both datasets. Notably, GPT-4 achieves an average improvement of 2.7 and 2.3 in nDCG@10 on TREC and BEIR, respectively, compared to monoT5 (3B). (ii) ChatGPT also exhibits impressive results on the BEIR dataset, surpassing the majority of supervised baselines. (iii) On BEIR, we use GPT-4 only to re-rank the top-30 passages re-ranked by ChatGPT. This method achieves good results, while the cost is only 1/5 of that of using GPT-4 alone for re-ranking.
Table 2 illustrates the results on Mr.TyDi across ten low-resource languages. Overall, GPT-4 outperforms the supervised system in most languages, demonstrating an average improvement of 2.65 nDCG over mmarcoCE. However, there are instances where GPT-4 performs worse than mmarcoCE, particularly in low-resource languages like Bengali, Telugu, and Thai. This may be attributed to the weaker language modeling ability of GPT-4 in these languages and to the fact that text in low-resource languages tends to consume more tokens than English text, leading to over-cropping of passages. Similar trends are observed with ChatGPT, which is on par with the supervised system in most languages but consistently trails behind GPT-4 in all languages.

Results on NovelEval
Table 3 illustrates the evaluation results on our newly collected NovelEval, a test set containing 21 novel questions and 420 passages that GPT-4 had not learned. The results show that GPT-4 performs well on these questions, significantly outperforming the previous best supervised method, monoT5 (3B). Additionally, ChatGPT achieves a performance level comparable to that of monoBERT. This outcome implies that LLMs possess the capability to effectively re-rank unfamiliar information.

Table 5: Ablation study on TREC-DL19. We use gpt-3.5-turbo with permutation generation under different configurations.
The results are listed in Table 4. From the results, we can see that: (i) The proposed PG method outperforms both the QG and RG methods in instructing LLMs to re-rank passages. We suggest two explanations: First, from the result that PG has significantly higher top-1 accuracy compared to the other methods, we infer that LLMs can explicitly compare multiple passages with PG, allowing subtle differences between passages to be discerned. Second, LLMs gain a more comprehensive understanding of the query and passages by reading multiple passages with potentially complementary information, thus improving the model's ranking ability. (ii) With PG, ChatGPT performs comparably to GPT-4 on nDCG@1, but lags behind it on nDCG@10. The Davinci model (text-davinci-003) performs poorly compared to ChatGPT and GPT-4. This may be because the generation stability of Davinci and ChatGPT trails that of GPT-4. We delve into the stability analysis of Davinci, ChatGPT, and GPT-4 in Appendix F.

Ablation Study on TREC
We conducted an ablation study on TREC to gain insights into the detailed configuration of permutation generation. Table 5 illustrates the results.

Initial Passage Order. While our standard implementation utilizes the ranking result of BM25 as the initial order, we examined two alternative variants: random order (1) and reversed BM25 order (2). The results reveal that the model's performance is highly sensitive to the initial passage order. This could be because BM25 provides a relatively good starting passage order, enabling satisfactory results with only a single sliding-window re-ranking pass.
Number of Re-Ranking Passes. Furthermore, we studied the influence of the number of sliding-window passes. Models (3-4) in Table 5 show that re-ranking more times may improve nDCG@10, but it can hurt the ranking performance on top passages (e.g., nDCG@1 decreased by 3.88). Re-ranking the top 30 passages using GPT-4 showed notable accuracy improvements (see model (5)). This provides an alternative way to combine ChatGPT and GPT-4 in passage re-ranking and reduce the high cost of using the GPT-4 model.

Results of LLMs beyond ChatGPT
We further test the capabilities of LLMs beyond the OpenAI series on TREC DL-19. As shown in Table 6, we evaluate the top-20 BM25 passage re-ranking nDCG of proprietary LLMs from OpenAI, Cohere, Anthropic, and Google, and of three open-source LLMs. We see that: (i) Among the proprietary LLMs, GPT-4 exhibits the highest re-ranking performance. Cohere Rerank also shows promising results; however, it should be noted that it is a supervised specialized model. In contrast, the proprietary models from Anthropic and Google fall behind ChatGPT in terms of re-ranking effectiveness. (ii) As for the open-source LLMs, we observe a significant performance gap compared to ChatGPT. One possible reason for this discrepancy could be the complexity involved in generating permutations of 20 passages, which seems to pose a challenge for the existing open-source models.
We analyze the model's unexpected behavior in Appendix F, and the cost of API calls in Appendix H.

Experimental Results of Specialization
As mentioned in Section 4, we randomly sample 10K queries from the MS MARCO training set and employ the proposed permutation distillation to distill ChatGPT's predicted permutations into specialized re-ranking models. The specialized re-ranking model can be DeBERTa-v3-Large with a cross-encoder architecture or LLaMA-7B with the relevance generation instruction. We also implement specialized models trained using the original MS MARCO labels (i.e., supervised learning) for comparison.

Results on Benchmarks
Table 7 lists the results of the specialized models, and Table 13 includes the detailed results. Our findings can be summarized as follows: (i) Permutation distillation outperforms its supervised counterpart on both the TREC and BEIR datasets, potentially because ChatGPT's relevance judgments are more comprehensive than the MS MARCO labels (Arabzadeh et al., 2021). (ii) The specialized DeBERTa model outperforms the previous state-of-the-art (SOTA) baseline, monoT5 (3B), on BEIR with an average nDCG of 53.03. This result highlights the potential of distilling LLMs for IR, since it is significantly more cost-efficient. (iii) The distilled specialized model also surpasses ChatGPT, its teacher model, on both datasets. This is probably because the re-ranking stability of the specialized models is better than that of ChatGPT. As shown in the stability analysis in Appendix F, ChatGPT is very unstable in generating permutations.

Analysis on Model Size and Data Size
In Figure 4, we present the re-ranking performance of specialized DeBERTa models obtained through permutation distillation and supervised learning of different model sizes (ranging from 70M to 435M) and training data sizes (ranging from 500 to 10K).
Our findings indicate that the permutation-distilled models consistently outperform their supervised counterparts across all settings, particularly on the BEIR datasets. Notably, even with only 1K training queries, the permutation-distilled DeBERTa model achieves superior performance compared to the previous state-of-the-art monoT5 (3B) model on BEIR. We also observe that increasing the number of model parameters yields a greater improvement in the ranking results than increasing the training data. Finally, we find that the performance of supervised models is unstable across different model sizes and data sizes. This may be due to the presence of noise in the MS MARCO labels, which leads to overfitting problems (Arabzadeh et al., 2021).

Conclusion
In this paper, we conduct a comprehensive study on passage re-ranking with LLMs. We introduce a novel permutation generation approach to fully explore the power of LLMs. Our experiments on three benchmarks demonstrate the capability of ChatGPT and GPT-4 in passage re-ranking. To further validate LLMs on unfamiliar knowledge, we introduce a new test set called NovelEval. Additionally, we propose a permutation distillation method, which demonstrates superior effectiveness and efficiency compared to existing supervised approaches.

Limitations
The limitations of this work include the main analysis for OpenAI ChatGPT and GPT-4, which are proprietary models that are not open-source.
Although we also tested open-source models such as FLAN-T5, ChatGLM-6B, and Vicuna-13B, their results still differ significantly from ChatGPT's. How to further exploit open-source models is a question worth exploring. Additionally, this study solely focuses on examining LLMs in the re-ranking task. Consequently, the upper bound of the ranking effect is contingent upon the recall of the initial passage retrieval. Our findings also indicate that the re-ranking effect of LLMs is highly sensitive to the initial order of passages, which is usually determined by first-stage retrieval, such as BM25. Therefore, there is a need for further exploration into effectively utilizing LLMs to enhance first-stage retrieval and into improving the robustness of LLMs with respect to the initial passage order.

Ethics Statement
We acknowledge the importance of the ACM Code of Ethics and fully agree with it. We ensure that this work is compatible with the provided code, in terms of the publicly accessed datasets and models. Risks and harms of large language models include the generation of harmful, offensive, or biased content. These models are often prone to generating incorrect information, sometimes referred to as hallucinations. We do not expect the studied models to be an exception in this regard. The LLMs used in this paper have been shown to suffer from bias, hallucination, and other problems. Therefore, we do not recommend the use of LLMs for ranking tasks with social implications, such as ranking job candidates or ranking products, because LLMs may exhibit racial bias, geographical bias, gender bias, etc., in the ranking results. In addition, the use of LLMs in critical decision-making sessions may pose unspecified risks. Finally, the distilled models are licensed under the terms of OpenAI because they use ChatGPT. The distilled LLaMA models are further licensed under the non-commercial license of LLaMA.

A Instructions
A.1 Query Generation Instruction

The query generation instruction (Sachan et al., 2022) uses the log-probability of the query given the passage.
A.2 Relevance Generation Instruction

Given a passage and a query, predict whether the passage includes an answer to the query by producing either 'Yes' or 'No'.
Passage: Its 25 drops per ml, you guys are all wrong. If it is water, the standard was changed 15-20 years ago to make 20 drops = 1 mL. The viscosity of most things is temperature dependent, so this would be at room temperature. Hope this helps.
Query: how many eye drops per ml
Does the passage answer the query?
Answer: Yes
Passage: RE: How many eyedrops are there in a 10 ml bottle of Cosopt? My Kaiser pharmacy insists that 2 bottles should last me 100 days but I run out way before that time when I am using 4 drops per day. In the past other pharmacies have given me 3 10-ml bottles for 100 days. E: How many eyedrops are there in a 10 ml bottle of Cosopt? My Kaiser pharmacy insists that 2 bottles should last me 100 days but I run out way before that time when I am using 4 drops per day.

This instruction is used to train LLaMA-7B specialized models.
Given a passage and a query, predict whether the passage includes an answer to the query by producing either 'Yes' or 'No'.
This is RankGPT, an intelligent assistant that can rank passages based on their relevancy to the query.
The following are {{num}} passages, each indicated by number identifier []. I can rank them based on their relevance to query: {{query}}

[1] {{passage_1}}
[2] {{passage_2}}
(more passages) ...

The search query is: {{query}}

I will rank the {{num}} passages above based on their relevance to the search query. The passages will be listed in descending order using identifiers, and the most relevant passages should be listed first, and the output format should be [] > [] > etc, e.g., [1] > [2] > etc.

system:
You are RankGPT, an intelligent assistant that can rank passages based on their relevancy to the query. user:
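The chat-format instruction above could be assembled programmatically along these lines. The exact turn structure of the released prompts may differ (e.g., the released implementation may interleave assistant acknowledgments), so treat this as an illustrative sketch:

```python
def build_rerank_messages(query, passages):
    """Assemble a chat-style prompt for permutation generation.

    Follows the system/user templates shown above; the {{...}}
    placeholders become the query and the identifier-tagged passages.
    """
    num = len(passages)
    messages = [
        {"role": "system",
         "content": "You are RankGPT, an intelligent assistant that can rank "
                    "passages based on their relevancy to the query."},
        {"role": "user",
         "content": f"I will provide you with {num} passages, each indicated "
                    f"by number identifier []. Rank them based on their "
                    f"relevance to query: {query}."},
    ]
    # one turn per passage, tagged with its 1-based identifier
    for i, p in enumerate(passages, 1):
        messages.append({"role": "user", "content": f"[{i}] {p}"})
    messages.append(
        {"role": "user",
         "content": f"The search query is: {query}. Rank the {num} passages "
                    f"above based on their relevance to the search query. "
                    f"The output format should be [] > [], e.g., [1] > [2]."})
    return messages
```

The resulting list can be passed directly as the `messages` argument of a chat-completion API call.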

B Instructional Methods on LLMs as Re-Ranker
This paper focuses on the re-ranking task: given M passages for a query q, re-ranking aims to use an agent f(·) to output their ranking results R = (r_1, ..., r_M), where r_i ∈ [1, 2, ..., M] denotes the rank of p_i. This paper studies using LLMs as f(·).

B.1 Instructional Query Generation
Query generation has been studied in Sachan et al. (2022) and Muennighoff (2022), in which the relevance between a query and a passage is measured by the log-probability of the model generating the query based on the passage. Figure 2 (a) shows an example of instructional query generation. Formally, given query q and a passage p_i, their relevance score s_i is calculated as:

s_i = (1 / |q|) Σ_{t=1}^{|q|} log p(q_t | q_{<t}, p_i, I_query)

where |q| denotes the number of tokens in q, q_t denotes the t-th token of q, and I_query denotes the instructions, referring to Figure 2 (a). The passages are then ranked based on the relevance score s_i.
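Given per-token log-probabilities from an LLM API, query-generation scoring reduces to length-normalized averaging followed by a sort; a minimal sketch (names are ours, not from the paper):

```python
def rank_by_query_generation(per_passage_logprobs):
    """Rank passages by the average log-probability of generating the
    query conditioned on each passage (higher = more relevant).

    `per_passage_logprobs[i]` holds log p(q_t | q_<t, p_i, I_query) for
    each query token t, as exposed by an API's log-probability field.
    Returns (order, scores): passage indices best-first, and the scores.
    """
    scores = [sum(lp) / len(lp) for lp in per_passage_logprobs]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order, scores
```

Length normalization (dividing by |q|) keeps the score comparable across queries of different lengths.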

B.2 Instructional Relevance Generation
Relevance generation is employed in HELM (Liang et al., 2022). Figure 2 (b) shows an example of instructional relevance generation, in which LLMs are instructed to output "Yes" if the passage is relevant to the query and "No" otherwise. The relevance score is then derived from the generation probability of the relevance token.

C Details of NovelEval

Both GPT-4 models (gpt-4-0314 and gpt-4-0613) we tested achieved 0% question-answering accuracy on the obtained test set.
We searched for 20 candidate passages for each question using Google search. These passages were manually labeled for relevance by a group of annotators, including the authors and their highly educated colleagues. To ensure consistency, the annotation process was repeated twice. Each passage was assigned a relevance score: 0 for not relevant, 1 for partially relevant, and 2 for relevant. When evaluating the latest LLMs, we found that all non-retrieval-augmented models tested achieved 0% accuracy in answering the questions on the test set. This test set provides a reasonable evaluation of the latest LLMs at the moment. Since LLMs may be continuously trained on new data, the proposed test set should be continuously updated to counteract contamination of the test set by LLMs.

Figure 5 illustrates the detailed model architecture of the BERT-like and GPT-like specialized models.

F Model Behavior Analysis
In the permutation generation method, the ranking of passages is determined by the list of model-output passage identifiers. However, we have observed that the models do not always produce the desired output, as evidenced by occasional duplicate or missing identifiers in the generated text. In Table 10, we present quantitative results of the unexpected model behavior observed during experiments with the GPT models.
Repetition. The repetition metric measures the occurrence of duplicate passage identifiers generated by the model. The results indicate that ChatGPT produced 14 duplicate passage identifiers when re-ranking 97 queries on the two TREC datasets, whereas text-davinci-003 and GPT-4 did not exhibit any duplicates.
Missing. We counted the number of times the model failed to include all passages in the re-ranked permutation output. Our findings revealed that text-davinci-003 has the highest number of missing passages, totaling 280 instances. ChatGPT also misses a considerable number of passages, occurring 153 times. On the other hand, GPT-4 demonstrates greater stability, with only one missing passage in total. These results suggest that GPT-4 has higher reliability in generating permutations, which is critical for effective ranking.

Rejection. We have observed instances where the model refuses to re-rank passages, as evidenced by responses such as "None of the provided passages is directly relevant to the query ...". To quantify this behavior, we counted the number of times this occurred and found that GPT-4 rejects ranking the most frequently, followed by ChatGPT, while the Davinci model never refused to rank. This finding suggests that chat LLMs tend to be more adaptable compared to completion LLMs and may exhibit more subjective responses. Note that we do not explicitly prohibit the models from rejecting ranking in the instructions, as we find that it does not significantly impact the overall ranking performance.

RBO. The sliding window strategy involves re-ranking the top-ranked passages from the previous window in the next window. The models are expected to produce consistent rankings in the two windows for the same group of passages. To measure the consistency of the model's rankings, we use RBO (rank-biased overlap), which calculates the similarity between two ranking results. The findings show that ChatGPT and GPT-4 are more consistent in ranking passages compared to the Davinci model. GPT-4 also slightly outperforms ChatGPT in terms of the RBO metric.
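A truncated-prefix form of RBO can be sketched as follows; the original measure also defines extrapolated variants, while this sketch computes only the weighted-overlap sum up to `depth`:

```python
def rbo(list1, list2, p=0.9, depth=None):
    """Rank-biased overlap between two rankings, truncated at `depth`.

    A geometrically weighted (parameter p) average of the overlap of the
    top-d prefixes of the two lists; identical prefixes approach 1 as
    depth grows, disjoint lists score 0.
    """
    depth = depth or min(len(list1), len(list2))
    score, weight = 0.0, 1.0 - p
    s1, s2 = set(), set()
    for d in range(1, depth + 1):
        s1.add(list1[d - 1])
        s2.add(list2[d - 1])
        overlap = len(s1 & s2)  # agreement of the top-d prefixes
        score += weight * (p ** (d - 1)) * (overlap / d)
    return score
```

Because the sum is truncated, identical length-D lists score 1 − p^D rather than exactly 1; comparisons at a fixed depth remain meaningful.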

G Analysis on Hyperparameters of Sliding Window
To analyze the influence of parameters of the sliding window strategy, we adjust the window size and set the step size to half of the window size.The main motivation for this setup is to keep the expected

Figure 2: Three types of instructions for zero-shot passage re-ranking with LLMs. The gray and yellow blocks indicate the inputs and outputs of the model. (a) Query generation relies on the log probability of LLMs to generate the query based on the passage. (b) Relevance generation instructs LLMs to output relevance judgments. (c) Permutation generation generates a ranked list of a group of passages. See Appendix A for details.

Figure 3: Illustration of re-ranking 8 passages using sliding windows with a window size of 4 and a step size of 2. The blue color represents the first two windows, while the yellow color represents the last window. The sliding windows are applied in back-to-first order, meaning that the first 2 passages in the previous window will participate in re-ranking the next window.

Figure 4: Scaling experiment. The dashed lines indicate the baseline methods: GPT-4, monoT5, monoBERT, and ChatGPT. The solid green line and solid gray line indicate the specialized DeBERTa models obtained by the proposed permutation distillation and by supervised learning on MS MARCO, respectively. This figure compares the models' performance on TREC and BEIR across varying model sizes (70M to 435M) and training data sizes (500 to 10K).
We use DeBERTa-V3-base, which concatenates the query and passage with a [SEP] token and utilizes the representation of the [CLS] token. To generate candidate passages, we randomly sample 10K queries and use BM25 to retrieve 20 passages for each query. We then re-rank the candidate passages using the gpt-3.5-turbo API with permutation generation instructions, at a cost of approximately $40. During training, we employ a batch size of 32 and utilize the AdamW optimizer with a constant learning rate of 5 × 10^−5. The model is trained for two epochs. Additionally, we implement models trained using the original MS MARCO labels for comparison. The LLaMA-7B model is optimized with the AdamW optimizer, a constant learning rate of 5 × 10^−5, mixed precision (bf16), and the DeepSpeed ZeRO-3 strategy. All experiments are conducted on 8 A100-40G GPUs.

Figure 5: Model architecture of BERT-like and GPT-like specialized models.

Table 1: Results (nDCG@10) on TREC and BEIR. The best performing unsupervised and overall system(s) are marked bold. All models except InPars and Promptagator++ re-rank the same BM25 top-100 passages. † On BEIR, we use gpt-4 to re-rank the top-30 passages re-ranked by gpt-3.5-turbo to reduce the cost of calling the gpt-4 API.

Table 4: Comparison of different instructions and API endpoints. The best performing system(s) are marked bold. PG, QG, and RG denote permutation generation, query generation, and relevance generation, respectively.
We compare the proposed permutation generation (PG) with query generation (QG) (Sachan et al., 2022) and relevance generation (RG) (Liang et al., 2022) on TREC-DL19. An example of the three types of instructions is in Figure 2, and the detailed implementation is in Appendix B. We also compare four LLMs provided by OpenAI.

I will provide you with {{num}} passages, each indicated by number identifier []. Rank them based on their relevance to query: {{query}}.

Table 10: Analysis of model stability on TREC. Repetition refers to the number of times the model generates duplicate passage identifiers. Missing refers to the number of missing passage identifiers in the model output. Rejection refers to the number of times the model refuses to perform the ranking. RBO, i.e., rank-biased overlap, refers to the consistency of the model in ranking the same group of passages twice.