Discrete Prompt Optimization via Constrained Generation for Zero-shot Re-ranker

Re-rankers, which order retrieved documents according to their relevance to a given query, have gained attention in information retrieval (IR). Rather than fine-tuning a pre-trained language model (PLM), large language models (LLMs) have been utilized as zero-shot re-rankers with excellent results. While LLMs are highly sensitive to prompts, the impact and optimization of prompts for the zero-shot re-ranker have not yet been explored. Along with highlighting the impact of optimization on the zero-shot re-ranker, we propose a novel discrete prompt optimization method, Constrained Prompt generation (Co-Prompt), with a metric estimating the optimum for re-ranking. Co-Prompt guides the texts generated by a PLM toward optimal prompts based on this metric, without parameter updates. Experimental results demonstrate that Co-Prompt leads to outstanding re-ranking performance against the baselines. Moreover, Co-Prompt generates prompts that are more interpretable for humans than those of other prompt optimization methods.


Introduction
Information retrieval (IR) is the task of searching for documents relevant to a given query from a large corpus. As re-ranking the documents fetched by the retriever effectively improves both performance and latency, recent studies have proposed several kinds of re-rankers by fine-tuning pre-trained language models (PLMs) (Nogueira and Cho, 2019; Nogueira et al., 2020). Furthermore, Sachan et al. (2022) show that large language models (LLMs) such as GPT-3 (Brown et al., 2020) can be exploited as zero-shot re-rankers with a prompt describing the task. They also highlight the importance of an appropriate prompt for eliciting the full performance of LLMs, rather than updating parameters. They choose an optimal prompt among handcrafted candidates by cross-validation. However, such a manual search for discrete prompts is highly expensive and yields prompts with limited transferability.
To resolve this issue, several methods have been proposed for automatically optimizing discrete prompts. They focus on text classification or mask-filling tasks while overlooking open-ended generation (Shin et al., 2020; Gao et al., 2021; Prasad et al., 2022). Recently, Deng et al. (2022) addressed discrete prompt optimization for generation tasks with reinforcement learning by designing a reward function that measures how well the generated text matches a discrete label. However, tasks that require a continuous output score remain unaddressed; we aim at prompt optimization for one such task: re-ranking.
In this paper, we propose Constrained Prompt generation, Co-Prompt, a left-to-right discrete prompt optimization method that requires no additional model training. By defining a metric of prompt optimum for re-ranking, we interpret the search for the optimal prompt as constrained generation with two modules: a zero-shot re-ranker as a discriminator and any decoder-only PLM as a generator. The discriminator calculates the likelihood (i.e., the metric) that the prompt sequence is optimal for guiding an LLM to distinguish relevant documents among a large set for a given query. The generator samples prompt tokens having a high prior given the previous prompt sequence, effectively restricting the prompt candidates the discriminator must evaluate. An overview of Co-Prompt is shown in Figure 1.
We validate our method, Co-Prompt, against other optimization baselines on two LLMs, T0 (Sanh et al., 2022) and OPT (Zhang et al., 2022), with two benchmark datasets, MS-MARCO (Nguyen et al., 2016) and Natural Question (Kwiatkowski et al., 2019). Experimental results show that Co-Prompt consistently generates well-performing prompts over the baselines, regardless of the LLM and dataset. Qualitative analyses also support the interpretability of the prompts generated by Co-Prompt, which resemble human language patterns.
Our contributions in this work are threefold:
• We highlight the impact of an optimal prompt on a zero-shot re-ranker by exploiting prompt optimization methods.
• We propose Co-Prompt, a novel discrete prompt optimization method via constrained generation for a zero-shot re-ranker.
• We experimentally show that Co-Prompt consistently guides the re-ranker well against the baselines and that its outputs resemble human language patterns.

Related Work

Document Ranking with Generative Model
Using a generative model is one of the dominant methods for ranking retrieved documents, defining the relevance score as the query likelihood score (Nogueira dos Santos et al., 2020; Ju et al., 2021). More recently, Sachan et al. (2022, 2023) showed that an LLM can serve as either a zero-shot re-ranker or a training module for an unsupervised dense retriever. However, unlike ours, these approaches require carefully designed manual prompts, which may limit transferability.
Prompt Optimization As prompting is considered a key variable when exploiting LLMs for various NLP tasks, finding the optimal prompt has become important for getting the best performance out of LLMs (Kojima et al., 2022; Xie et al., 2022).
Recent prompt optimization work has focused on discrete prompt search (Shin et al., 2020; Gao et al., 2021; Deng et al., 2022) or soft prompt learning over a continuous space (Liu et al., 2021; Qin and Eisner, 2021; Lester et al., 2021). While existing optimization methods mainly consider text classification or mask-filling tasks, their applicability to re-ranking remains underexplored. In this paper, we optimize discrete prompts for a zero-shot re-ranker via constrained generation so that more relevant document-query pairs receive higher relevance scores.
Constrained Generation Constrained generation aims at deriving text sequences that satisfy a certain constraint (Keskar et al., 2019). Utilizing a discriminator to guide generation toward the constraint via Bayes' rule is one of the widely used constrained generation methods (Dathathri et al., 2020; Krause et al., 2021; Chaffin et al., 2022). Inspired by the effectiveness of discriminator-based methods, we adopt the zero-shot re-ranker as a discriminator when generating optimal discrete prompt sequences.

Preliminaries
An LLM re-ranks a retrieved document d with respect to its relevance to a given query q, using the query generation score:

$$\log p(q \mid d, \rho) = \frac{1}{|q|} \sum_{t=1}^{|q|} \log p(q_t \mid q_{<t}, d, \rho), \qquad (1)$$

where |q| denotes the token length of the query q and ρ is a natural language prompt guiding the LLM to generate the query q. Since the prompt ρ is the only controllable variable in Equation 1, searching for an optimal prompt is a simple yet effective way to enhance the performance of LLMs. Thus, in this work, we focus on a prompt optimization strategy.
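As an illustration, the length-normalized query generation score of Equation 1 can be sketched in a few lines of Python; `toy_logprob` is a hypothetical stand-in for an LLM's next-token log-probability, not the actual model:

```python
import math

def relevance_score(query_tokens, document, prompt, token_logprob):
    # Equation 1: mean log-likelihood of the query tokens given the
    # document and the prompt, (1/|q|) * sum_t log P(q_t | q_<t, d, rho).
    total = 0.0
    for t, tok in enumerate(query_tokens):
        total += token_logprob(document, prompt, query_tokens[:t], tok)
    return total / len(query_tokens)

# Toy stand-in for an LLM's log-probability (hypothetical values):
# tokens that appear in the document are assumed four times more likely.
def toy_logprob(document, prompt, prefix, token):
    return math.log(0.4) if token in document.split() else math.log(0.1)

score = relevance_score(
    ["what", "is", "bm25"],
    "bm25 is a ranking function",
    "Please write a question based on this passage.",
    toy_logprob,
)
```

In practice `token_logprob` would be a forward pass of the re-ranking LLM over the template described in Appendix A.4.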

Constrained Prompt Generation
We define the optimal prompt ρ* for the re-ranker as the one that maximizes the query generation scores:

$$\rho^{*} = \operatorname*{arg\,max}_{\rho} \; \mathbb{E}_{(q, d) \in \mathcal{D}} \left[ \log p(q \mid d, \rho) \right], \qquad (2)$$

where D is the dataset for the retriever, consisting of pairs of a query and its relevant document. We solve the task of searching for the optimal prompt ρ* over the document-query pair dataset D with discriminator-based constrained generation. The generation is guided by Bayes' rule:

$$P(\rho_t \mid \mathcal{D}, \rho_{1:t-1}) \propto P_{M_D}(D_s \mid \rho_{1:t}) \, P_{M_G}(\rho_t \mid \rho_{1:t-1}), \qquad (3)$$

where M_D is a zero-shot re-ranker serving as a discriminator, M_G is a decoder-only PLM serving as a generator, and D_s is a dataset sampled from D.
Discriminator The discriminator M_D measures how effectively the prompt sequence ρ_{1:t} guides the zero-shot re-ranker to generate the query from the given document by computing the likelihood P_{M_D}(D_s | ρ), defined as the expectation of the relevance score over the document-query pairs (q_i, d_i) of the sampled dataset D_s with the prompt ρ:

$$P_{M_D}(D_s \mid \rho) = \mathbb{E}_{(q_i, d_i) \in D_s} \left[ \log p(q_i \mid d_i, \rho) \right]. \qquad (4)$$

We use this likelihood as the metric for prompt optimum. The other option of the likelihood is discussed in Appendix B.1.

Generator The generator M_G samples the pool of prompts to be evaluated by the discriminator, since computing Equation 3 over all possible tokens in the vocabulary requires a prohibitively high computational cost. The decoder-only PLM is exploited to sample prompt tokens ρ_t having a high prior P_{M_G}(ρ_t | ρ_{1:t-1}) in a zero-shot manner.
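A minimal sketch of the discriminator's likelihood as an average over sampled pairs; `toy_relevance` is a hypothetical stand-in (simple word overlap) for the LLM-based relevance score:

```python
def discriminator_likelihood(pairs, prompt, relevance_score):
    # Equation 4: average relevance score over the sampled
    # document-query pairs D_s under the candidate prompt rho.
    return sum(relevance_score(q, d, prompt) for q, d in pairs) / len(pairs)

# Hypothetical relevance function: reward word overlap between query and document.
def toy_relevance(query, document, prompt):
    q, d = set(query.split()), set(document.split())
    return len(q & d) / len(q)

pairs = [
    ("what is bm25", "bm25 is a ranking function"),
    ("who wrote hamlet", "hamlet is a tragedy by shakespeare"),
]
likelihood = discriminator_likelihood(pairs, "Please write a question", toy_relevance)
```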
We combine these modules to optimize the prompt by iteratively performing two steps: candidate generation and evaluation. We use beam search as the decoding strategy for left-to-right prompt generation. The detailed steps of the decoding strategy are shown in Algorithm 1.
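The candidate-generation and evaluation loop of Algorithm 1 can be sketched as a toy beam search; `gen_prior` and `disc_score` are hypothetical stand-ins for the generator prior and the discriminator likelihood:

```python
def co_prompt_beam_search(start, vocab, gen_prior, disc_score, beam_width, max_len):
    # Iterate two steps: (1) candidate generation - the generator proposes
    # high-prior token extensions per beam; (2) evaluation - the
    # discriminator keeps the top-B extended prompts.
    beams = [start]
    for _ in range(max_len):
        candidates = []
        for prompt in beams:
            top_tokens = sorted(vocab, key=lambda t: gen_prior(prompt, t),
                                reverse=True)[:beam_width]
            candidates.extend(prompt + [t] for t in top_tokens)
        beams = sorted(candidates, key=disc_score, reverse=True)[:beam_width]
    return beams

# Hypothetical scoring functions for illustration only.
vocab = ["write", "a", "question", "about", "banana"]
good = {"write", "a", "question"}

def gen_prior(prompt, token):
    # Prefer tokens not already in the prompt; break ties by length.
    return (token not in prompt) + len(token) / 100.0

def disc_score(prompt):
    # Pretend task-relevant words make a better prompt.
    return sum(1 for t in prompt if t in good)

beams = co_prompt_beam_search(["Please"], vocab, gen_prior, disc_score,
                              beam_width=2, max_len=3)
```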

Experimental Setups
We describe the experimental setups for validating the performance of the prompts. Our code is publicly available at github.com/zomss/Co-Prompt.
Datasets We employ two information retrieval datasets: 1) MS-MARCO (Nguyen et al., 2016), collected from Bing search logs, and 2) Natural Question (NQ; Kwiatkowski et al., 2019), fetched from Google search engines. We only use the document data of each dataset for evaluation. More information is given in Appendix A.1.

Evaluation Metrics
We evaluate the results with two metrics, ACC and nDCG. 1) ACC is the percentage of relevant documents among the total retrieved ones. 2) nDCG, normalized discounted cumulative gain, reflects that more relevant documents should be ranked higher.
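For reference, nDCG@k can be computed as below; the ACC@k shown is one plausible reading of the description above (fraction of relevant documents among the top-k), not necessarily the paper's exact definition:

```python
import math

def acc_at_k(ranked_ids, relevant_ids, k):
    # Assumed reading of ACC@k: share of the top-k retrieved
    # documents that are relevant.
    return sum(1 for d in ranked_ids[:k] if d in relevant_ids) / k

def ndcg_at_k(gains, k):
    # gains[i] is the graded relevance of the document at rank i (0-based).
    dcg = sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))
    ideal = sorted(gains, reverse=True)
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal[:k]))
    return dcg / idcg if idcg > 0 else 0.0
```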

Prompt Baselines
We compare Co-Prompt against four baselines: 1) Null Prompt is an empty prompt without any token. 2) Manual Prompt is the handcrafted prompt "Please write a question based on this passage" selected by Sachan et al. (2022). 3) P-Tuning is a soft prompt optimization method that yields prompt embeddings from a prompt encoder (Liu et al., 2021). 4) RL-Prompt is a discrete prompt optimization method that trains a policy network (Deng et al., 2022). Note that we modify RL-Prompt and P-Tuning to be applicable to the re-ranking task.

Implementation Details The discriminator M_D is the same model as the zero-shot re-ranker. Since the generator M_G should be a decoder-only model, in the case of T0, GPT2-Large (Radford et al., 2019) is utilized as the generator. OPT, a decoder-only model, is used as both the discriminator and the generator. We use "Please" as the start token for a direct comparison with the manual prompt, and fix the beam width B to 10 and the maximum prompt length L to 10 in our experiments.
Environment We conduct all experiments, including prompt searching and document re-ranking, on V100 32GB GPUs. We use the BEIR framework (Thakur et al., 2021; http://beir.ai/) for evaluating the re-ranked results and for the passage retrieval datasets. The retrievers, BM25 and DPR, are also from the same framework. We employ T0 and OPT, with 3B and 2.7B parameters respectively, as the discriminator and the re-ranker, publicly available on the Hugging Face model hub (https://huggingface.co/models; Wolf et al., 2020).

Result
In this section, we show the overall results of our method, Co-Prompt, with a detailed analysis.

Impact of Start Tokens
We exploit other options for the start token, such as "Score" and "This", as shown in Table 2. Regardless of the start token, Co-Prompt consistently generates prompts that elicit the performance of the LLM efficiently. However, we observe that finding the optimal start token for the dataset is important for achieving better results.

Impact of Generator
As shown in Table 3, even when different generators are used, the generated prompts guide the zero-shot re-ranker efficiently. Still, the differences in performance are caused by the vocabulary mismatch between the two modules. Although the performance of our method does not vary significantly with the generator, a more suitable generator may be necessary for better results.
Relevance Score We analyze the distributions of relevance scores for positive and negative document-query pairs. As the negative documents for a given query are retrieved by BM25, they are related to the query but do not directly contain the answer. As shown in Figure 2, a difference between the two distributions exists despite some overlap. Also, an LLM can distinguish which pair is positive even without a prompt. However, we observe that the effect of discrete prompt optimization on the zero-shot re-ranker is to increase the mean and variance of the relevance score.
Case Study of Prompts Table 2 shows the discrete prompts generated by our method and the discrete prompt baselines when exploiting OPT as a re-ranker. While the prompts from RL-Prompt are ungrammatical gibberish close to a random word sequence, our method, Co-Prompt, generates prompts that are interpretable for humans, follow human language patterns, and surpass the performance of the other discrete prompts. Also, the word 'question', one of the keywords describing the task, is included in the prompts from Co-Prompt regardless of the dataset. This implies that the prompts from our method can provide a natural user interface for improving human understanding of how LLMs work. See Appendix B.3 for more examples of Co-Prompt.

Conclusion
In this paper, we propose Co-Prompt, a left-to-right prompt optimization method for the zero-shot re-ranker via constrained generation. Co-Prompt effectively restricts the prompt candidates and evaluates the optimum of these prompts without any parameter updates. We experimentally show that our method consistently outperforms the baselines across all experiments. Moreover, the impact of prompt optimization, including the baselines, on the zero-shot re-ranker highlights its importance. We also present an interesting finding: the optimal prompt is interpretable for humans. For future work, we plan to expand our method to other open-ended generation tasks using LLMs.

Limitations
As shown in Table 1, our method is experimentally demonstrated to be effective for two LLMs. However, OPT, a decoder-only model, is more suited to the prompts generated by Co-Prompt. This seems to be because T0, an encoder-decoder model, requires a separate generator such as GPT-2; the performance of the prompts may vary with the generator's vocabulary and training process. Also, there is a trade-off between search time and performance. While increasing the beam size and the number of document-query pairs raises the probability of finding a more optimal prompt, it makes the search time proportionally longer.

Ethics Statement
Our work contributes to enhancing the retrieval performance of a zero-shot re-ranker by optimizing the discrete prompt via constrained generation. We are keenly aware of the possibility of offensive or upsetting prompts caused by the bias of the generator itself, even though no such prompts appeared in our experiments. Because there is no additional training for prompt optimization, our method has difficulty removing the bias of the language model itself. As studies on reducing the bias of language models and filtering out inappropriate expressions are being actively conducted, we expect these problems to be sufficiently resolved in the future.

A.1 Datasets
We employ two information retrieval datasets for evaluating the performance of the zero-shot re-ranker with the prompts: 1) MS-MARCO (Nguyen et al., 2016) and 2) Natural Question (NQ; Kwiatkowski et al., 2019; Karpukhin et al., 2020). Both datasets are benchmarks for evaluating information retrieval systems (Thakur et al., 2021). Only 1,500 document-query pairs each from the MS-MARCO test split and the NQ development split are utilized for prompt optimization.

A.2 Metrics
As mentioned in Section 4, we employ two metrics, 1) ACC and 2) nDCG. In addition, we use one more metric: 3) MAP, the mean average precision of the relevant documents' ranks for a given query.

A.3 Retrievers
We use two types of retrievers, sparse and dense, for retrieving the documents to be re-ranked by LLMs. 1) BM25 (Robertson and Zaragoza, 2009) is a representative sparse retriever that computes the relevance score between a document and a query based on term frequency and inverse document frequency. BM25 has been widely employed because of its speed and effective performance.
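A compact sketch of the Okapi BM25 scoring described above (standard formulation; the parameter defaults k1=1.5 and b=0.75 are conventional choices, not taken from the paper):

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, corpus, k1=1.5, b=0.75):
    # Okapi BM25: term frequency saturated by k1, document-length
    # normalization controlled by b, weighted by inverse document frequency.
    n = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n
    tf = Counter(doc_terms)
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)
        idf = math.log((n - df + 0.5) / (df + 0.5) + 1)
        freq = tf[term]
        denom = freq + k1 * (1 - b + b * len(doc_terms) / avgdl)
        score += idf * freq * (k1 + 1) / denom
    return score

corpus = [["bm25", "ranking"], ["dense", "retrieval"], ["sparse", "retrieval"]]
s = bm25_score(["bm25"], corpus[0], corpus)
```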
2) DPR (Karpukhin et al., 2020) interprets training a dense retriever as a metric learning problem. A bi-encoder initialized with BERT (Devlin et al., 2019) is trained with contrastive learning, exploiting positive and negative passages for a given query. It outperforms traditional sparse retrievers.

A.4 Zero-shot Re-rankers
We employ two LLMs, T0 and OPT, as re-rankers with the prompt. 1) T0, one of the T5 series (Raffel et al., 2020), consists of transformer encoder-decoder layers. The models are versions of T5 fine-tuned for multi-task learning with prompted datasets. 2) OPT, a publicly open model, consists of decoder-only transformer layers. Its performance is comparable to that of GPT-3 models. We exploit OPT instead of GPT-3 due to academic budget constraints.
A template is needed when transmitting a document, a prompt, and a query to the zero-shot re-ranker together. Following the template setting of UPR, the template used in the experiments is "Passage: {document} {delimiter} {prompt} {delimiter} {query}". The delimiters used in the experiments are " " for T0 and "\n" for OPT.
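The template above can be assembled with a trivial helper; the `model` argument for switching delimiters is an illustrative assumption:

```python
def build_input(document, prompt, query, model="OPT"):
    # UPR-style template: "Passage: {document} {delimiter} {prompt}
    # {delimiter} {query}", with "\n" as the delimiter for OPT and
    # " " for T0, as stated in the paper.
    delimiter = "\n" if model == "OPT" else " "
    return f"Passage: {document}{delimiter}{prompt}{delimiter}{query}"
```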

A.5 Baselines
Manual Prompt Sachan et al. (2022) not only proposed an unsupervised passage re-ranker exploiting LLMs but also carefully selected the optimal prompt among handcrafted candidates, validated by the re-ranked results on BM25 passages of the NQ development set. The manually optimized prompt "Please write a question based on this passage" effectively guides zero-shot re-rankers to generate the query corresponding to the document.

P-Tuning Liu et al. (2021) proposed P-Tuning, which generates soft prompts (i.e., continuous prompt embeddings) rather than discrete ones. They employed a prompt encoder consisting of long short-term memory layers, trained to return optimal soft prompts for the task. While the method mainly focuses on text classification, we define the loss objective as the query generation log-likelihood for application to re-ranking. The prompt encoder is trained on document-query pairs for 10 epochs to generate 10-length soft prompts.

RL-Prompt Deng et al. (2022) proposed discrete prompt generation with reinforcement learning, applicable to open-ended generation tasks. They validated the method on text style transfer, one of the open-ended text generation tasks. To align with the re-ranking task, we define the reward for the policy network as the query generation log-likelihood given the document and the prompt. Following the settings of RL-Prompt, a 5-token prompt is created through 12,000 training steps with a policy network model.

B Analysis
B.1 Likelihood P_{M_D}(D_s | ρ_{1:t})

In this section, we call the likelihood proposed in Equation 4 the base metric. We consider another option for the likelihood P_{M_D}(D_s | ρ_{1:t}), defined in a contrastive manner, and compare it with the base metric in Table 5.

Contrastive Measurement
The query generation score should be high for positive document-query pairs D_s^+ and low for negative pairs D_s^-. In a contrastive manner, the likelihood exploits the contrast between P_base(D_s^+ | ρ) and P_base(D_s^- | ρ) as follows:

$$P_{\text{cont}}(D_s \mid \rho) = P_{\text{base}}(D_s^{+} \mid \rho) - P_{\text{base}}(D_s^{-} \mid \rho). \qquad (5)$$

As shown in Table 5, the base metric attains a consistent level of performance regardless of the dataset and LLM, whereas the contrastive metric shows inferior performance on MS-MARCO.
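A sketch contrasting the two likelihood options; the exact contrastive form (here, a difference of the base scores over positive and negative pairs) is an assumption, since the original equation is not reproduced in this text:

```python
def base_metric(scores):
    # Base likelihood: mean relevance score over sampled pairs (Equation 4).
    return sum(scores) / len(scores)

def contrastive_metric(pos_scores, neg_scores):
    # Assumed contrastive form: how much higher the positive pairs
    # score than the BM25-retrieved negative pairs under the same prompt.
    return base_metric(pos_scores) - base_metric(neg_scores)
```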

B.2 Impact of Generator
We show more detailed results for the prompts from the different generators in Table 4. While the generated prompts follow human language patterns, there are some differences in the words used.

B.3 Detailed Results
We evaluate the performance of the zero-shot re-ranker with various metrics at the top-20 and top-100 documents, as shown in Table 6. Co-Prompt ranks 1st or 2nd on every metric across all experiments. On the other hand, the manual prompt, optimized for NQ, records inferior performance on MS-MARCO. Also, the other optimization methods, RL-Prompt and P-Tuning, fail to achieve the best record in any experiment. This shows that the optimal prompt for the zero-shot re-ranker is produced by our method, Co-Prompt.
In addition, qualitative inspection confirms that the prompts generated by Co-Prompt are similar to human language patterns compared to those from RL-Prompt. The keyword "question" is included in most of the prompts generated by Co-Prompt. Considering that the other optimization methods produce dense prompt embeddings or ungrammatical gibberish, Co-Prompt suggests a new direction in which a prompt can function as a natural user interface for understanding a black-box model.

Figure 1 :
Figure 1: An overview of the constrained prompt generation process.

Figure 2 :
Figure 2: Distributions of relevance scores between document-query pairs. The positive pairs are relevant ones and the negative pairs irrelevant.
Algorithm 1: Co-Prompt, a beam search-based prompt generation algorithm with a discriminator and a generator. D_s: document-query pairs, B: beam width, L: maximum prompt length, N: the number of final prompts, V: vocabulary set.

Table 1 :
ACC@k of the re-ranked results with the prompts when k is 20 and 100. The best scores are marked in bold, and the second-best are underlined.

Table 2 :
Comparison of different discrete prompts, evaluated on the top-20 documents retrieved by BM25. The best results for each re-ranker are marked in bold.

Table 3 :
Comparison between the prompts from the different generators. The best results are marked in bold.

As shown in Table 1, Co-Prompt consistently shows a robust performance gain in all scenarios, regardless of the LLM, the dataset, and the retriever. Specifically, Co-Prompt applied to OPT achieves better results than the other methods. This indicates that the prompts generated by our method are more appropriate as an instruction guiding LLMs than those from other prompt optimization methods. More detailed results of the re-ranked performance with various metrics are shown in Appendix B.3.

Table 4 :
Comparison of the prompts from the different generators, evaluated on the document set retrieved from MS-MARCO by BM25. The best results for each metric are marked in bold.

Table 5 :
Comparison between the two options of likelihood in terms of ACC@k.

Table 6 :
Detailed results of the LLM re-ranker with different prompts. The performance is evaluated with three metrics at the top-20 and top-100 documents.