Topic-DPR: Topic-based Prompts for Dense Passage Retrieval

Prompt-based learning's efficacy across numerous natural language processing tasks has led to its integration into dense passage retrieval. Prior research has mainly focused on enhancing the semantic understanding of pre-trained language models by optimizing a single vector as a continuous prompt. This approach, however, leads to a semantic space collapse; identical semantic information seeps into all representations, causing their distributions to converge in a restricted region. This hinders differentiation between relevant and irrelevant passages during dense retrieval. To tackle this issue, we present Topic-DPR, a dense passage retrieval model that uses topic-based prompts. Unlike the single prompt method, multiple topic-based prompts are established over a probabilistic simplex and optimized simultaneously through contrastive learning. This encourages representations to align with their topic distributions, improving space uniformity. Furthermore, we introduce a novel positive and negative sampling strategy, leveraging semi-structured data to boost dense retrieval efficiency. Experimental results from two datasets affirm that our method surpasses previous state-of-the-art retrieval techniques.


Introduction
Dense Passage Retrieval (DPR), due to its efficacy and efficiency, has gained significant attention recently (Karpukhin et al., 2020; Lee et al., 2020). DPR encapsulates the semantic information of queries and passages within a low-dimensional embedding space and measures relevance using cosine distance.
Prompt-based learning is an effective emerging technique for multiple natural language processing tasks (Liu et al., 2021a; Lester et al., 2021; Wang et al., 2022a). This technique uses a task-specific prompt as input to augment the performance of Pre-trained Language Models (PLMs). Prompts, typically discrete text templates with task-specific information, need explicit definitions. To circumvent local optima during tuning, researchers (Li and Liang, 2021; Liu et al., 2021b) suggested deep prompt tuning, which trains a single vector as a continuous prompt. This approach demonstrated effectiveness and flexibility in text-generation tasks. Inspired by deep prompt tuning, recent research (Tang et al., 2022; Tam et al., 2022) has integrated continuous prompts into retrieval tasks. By adding task-specific semantic information as input, these prompts improve PLMs' knowledge utilization and guide PLMs to produce more relevant text representations. Consequently, relevant passage representations are likely closer to the query, thus securing higher rankings.
However, past research has not fully addressed the limitations of a single prompt when dealing with the diverse semantics of a comprehensive dataset. The engagement of all passages with a singular prompt induces a uniform semantic shift.
Imagine using a single prompt like "Explain the main concept of [QUERY]," where [QUERY] is the actual query. While this prompt may effectively steer the Pre-trained Language Model (PLM) to generate representations capturing aspects of each discipline, it may overlook the diverse semantics and terminologies across fields. Consequently, it could have difficulty differentiating articles from distinct disciplines. A single prompt might fall short of capturing the nuances within each discipline, like subtopics or specialized areas. Employing one prompt for all disciplines may result in a uniform semantic shift and a convergence of passage representations in a restricted region of the embedding space, as depicted in Figure 1. This semantic space collapse (Li et al., 2020; Gao et al., 2019; Xiao et al., 2023) can blur the distinction between relevant and irrelevant passages, potentially masking irrelevant passages amidst relevant ones. Therefore, prompt generation is pivotal for this semantically nuanced task. Further analysis is conducted in Section 5.
In this paper, we explore the use of multiple continuous prompts to address the anisotropic issue of deep prompt tuning in dense passage retrieval, a persisting challenge. Challenge 1: The effective generation of multiple continuous prompts based on the corpus. A simple approach is partitioning the dataset into subsets via topic modeling, each sharing a common topic-based prompt. This strategy allows the distribution of latent topics across a probabilistic simplex and unsupervised extraction of semantics (Blei et al., 2003; Li et al., 2022b), enabling the definition and initialization of distinct, interpretable topic-based prompts. Challenge 2: The integration of topic-based prompts into the Pre-trained Language Models (PLMs). Although our topic-based prompts are defined on a probabilistic simplex using topic modeling, ensuring topical independence, constructing such a simplex and learning topical knowledge within the PLMs' embedding space presents a challenge due to inherent model differences. As a result, we make the topic-based prompts trainable and adopt contrastive learning (Chen et al., 2020; Gao et al., 2021) for optimizing topical relationships.
To tackle these challenges, we introduce a novel framework, Topic-DPR, that efficiently incorporates topic-based prompts into dense passage retrieval. Instead of artificially inflating the number of prompts, we aim to define a prompt set reflecting the dataset's diverse semantics through a data-driven approach. We propose a unique prompt generation method that utilizes topic modeling to establish the number and initial values of the prompts, which we term topic-based prompts. These prompts are defined within a probabilistic simplex space (Patterson and Teh, 2013), initialized using a topic model such as hierarchical Latent Dirichlet Allocation (hLDA) (Griffiths et al., 2003). Moreover, we propose a loss function based on contrastive learning to preserve the topic-topic relationships of these prompts and align their topic distributions within the simplex. The impact of topic-based prompts serves as a pre-guidance for the PLMs, directing representations towards diverse sub-topic spaces. For dense retrieval, we consider query similarities and design a tailored loss function to capture query-query relationships. We use contrastive learning to maintain query-passage relationships, maximize the similarity between queries and relevant passages, and minimize the similarity between irrelevant pairs. Considering the semi-structured nature of the datasets, we also introduce an in-batch sampling strategy based on multi-category information, providing high-quality positive and negative samples for each query during fine-tuning.
The efficacy of our methods is confirmed through comprehensive experiments, emphasizing the role of topic-based prompts within the Topic-DPR framework. The key contributions are: 1. We propose an unsupervised method for continuous prompt generation using topic modeling, integrating trainable parameters for PLM adaptation.
2. We introduce Topic-Topic Relation, a novel prompt optimization goal. It uses contrastive learning to maintain topical relationships, addressing the anisotropic issue in traditional deep prompt tuning.
3. Our framework supports the simultaneous use and fine-tuning of multiple prompts in PLMs, improving passage ranking by producing diverse semantic text representations.
2 Related Work

Dense Passage Retrieval
Recent advancements in PLMs such as BERT (Devlin et al., 2018), RoBERTa (Liu et al., 2019), and GPT (Brown et al., 2020) have enabled numerous unsupervised techniques to derive dense representations of queries and passages for retrieval. These approaches primarily use a Bi-Encoder structure to embed text in a low-dimensional space and learn similarity relations via contrastive learning, contrasting traditional sparse retrieval methods like BM25 or DeepCT (Robertson et al., 2009; Dai and Callan, 2019). DPR (Karpukhin et al., 2020) pioneered an unsupervised dense passage retrieval framework, affirming the feasibility of using dense representations for retrieval independently. This efficient and operational approach was further refined by subsequent studies (Xiong et al., 2020; Gao and Callan, 2021; Ren et al., 2021; Wang et al., 2022b) that focused on high-quality negative sample mining, additional passage relation analysis, and extra training. The essence of these methods is to represent texts in a target space where queries are closer to relevant passages and distant from irrelevant ones.

Prompt-based Learning
As PLMs, such as GPT-3 (Brown et al., 2020), continue to evolve, prompt-based learning (Gu et al., 2021; Lester et al., 2021; Qin and Eisner, 2021; Webson and Pavlick, 2021) has been introduced to enhance semantic representation and preserve pre-training knowledge. Hence, for various downstream tasks, an effective prompt is pivotal. Initially, discrete text templates were manually designed as prompts for specific tasks (Gao et al., 2020; Ponti et al., 2020; Brown et al., 2020), but this could lead to local-optima issues due to the neural networks' continuous nature. Addressing this, Li and Liang (2021) and Liu et al. (2021b) highlighted the universal effectiveness of well-optimized prompt tuning across various model scales and natural language processing tasks.
Recent studies have adapted deep prompt tuning for downstream task representation learning. PromCSE (Jiang et al., 2022) uses continuous prompts for semantic textual similarity tasks, enhancing universal sentence representations and accommodating domain shifts. Tam et al. (2022) introduced parameter-efficient prompt tuning for text retrieval across in-domain, cross-domain, and cross-topic settings, with P-Tuning v2 (Liu et al., 2021b) exhibiting superior performance. DPTDR (Tang et al., 2022) incorporates deep prompt tuning into dense passage retrieval for open-domain datasets, achieving exceptional performance with minimal parameter tuning.
3 The proposed Topic-DPR

Consider a collection of M documents represented as D = {(T_i, A_i, C_i)}_{i=1}^{M}, where each 3-tuple (T_i, A_i, C_i) denotes a document with a title T_i, an abstract A_i, and a set of multi-category information C_i. The objective of dense passage retrieval is to find relevant passages A_j for a given query T_i, where their multi-category information sets intersect, i.e., C_i ∩ C_j ≠ ∅.

Topic-based Prompts
The principal distinction between our Topic-DPR and other prompt-based dense retrieval methods lies in using multiple topic-based prompts to enhance embedding space uniformity and improve retrieval performance. The idea behind creating topic-based prompts is to assign each document a unique prompt that aligns with its semantic and topical diversity. We use semantics, defined by the topic distributions within the simplex space, to initialize the count and values of the topic-based prompts.
We use topic modeling to reveal concealed meanings by extracting topics from a corpus, as explained in the Appendix. Topics are defined on a probabilistic simplex (Patterson and Teh, 2013), connecting documents and dictionary words via interpretable probabilistic distributions. We employ hierarchical Latent Dirichlet Allocation (hLDA) (Griffiths et al., 2003), a traditional topic modeling approach, to construct the topic-based prompts. hLDA provides a comprehensive representation of the document collection and captures the hierarchical structure among the topics, which is crucial for seizing the diversity of the corpus's semantic information.
As shown in Figure 2, hLDA defines hierarchical topic distributions of documents and distributions over words, enabling the generation of topic-based prompts from all hidden topics and corresponding topic words. hLDA creates a hierarchical tree of K topics with h levels; each level comprises multiple nodes, each representing a specific topic. This hierarchical structure allows our method to adapt to varying levels of granularity in the topic space, yielding more targeted retrieval results.
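As a concrete illustration, the step from a fitted topic model to prompt inputs can be sketched as follows. The Dirichlet-sampled `beta` matrix is a toy stand-in for hLDA's topic-word distributions (a real run would fit hLDA on the corpus), and all names and sizes are illustrative:

```python
import numpy as np

# Toy stand-in for a fitted hLDA model: `beta` plays the role of the
# topic-word distributions over the vocabulary.
rng = np.random.default_rng(0)
vocab = ["quantum", "qubit", "protein", "gene", "market", "trade", "neural", "graph"]
K, K_s, L = 4, 2, 3              # all topics, higher-level topics kept, top-L words
beta = rng.dirichlet(np.ones(len(vocab)), size=K)  # K x |V| topic-word probabilities

# Keep only the K_s higher-level topics and read off their top-L topic words,
# which later serve as the input to the prompt encoder.
top_words = [[vocab[j] for j in np.argsort(beta[k])[::-1][:L]] for k in range(K_s)]
```

Each entry of `top_words` is one topic's word list, the raw material for one topic-based prompt.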
Let ∆_K signify the probabilistic simplex. After uncovering K topics from the corpus, the topic distribution θ^(i) of each document d_i in ∆_K is defined as θ^(i) = (c_1, ..., c_K), where c_k signifies the k-th topic's component of d_i and Σ_k c_k = 1.
Our topic-based prompts aim to disperse document representations to alleviate semantic space collapse. For these prompts, maintaining significant semantic differences is vital. We only use the higher-level K_s topics, a subset of the K topics, to form these prompts. These high-level K_s topics are distinctly unique and suitable for defining prompts. Using hLDA, all documents are assigned to one or more topics from the subset in an unsupervised manner, enabling similar documents to share the same topics. This approach enables our method to capture the corpus's inherent topic structure and deliver more accurate and diverse retrieval results.
Each topic t_k ∈ {t_1, ..., t_Ks} can be interpreted as a dictionary subset given by the top-L words with the highest probabilities in t_k, defined as β^(k) = {w_1, ..., w_L}. We utilize these top-L words (topic words) to generate each topic-based prompt. We then propose a prompt encoder E_Θ to embed the discrete word distribution β^(k), i.e., token ids, into a continuous vector V_k = E_Θ(β^(k)), assisting PLMs in avoiding local optima during prompt optimization. As shown in Figure 2, E_Θ primarily comprises a residual network. The embedding layer preserves the topic words' semantic information, and the linear layer represents the trainable parameters Θ. The prompt encoder generates each vector V_k ∈ {V_1, ..., V_Ks} as the representation of the topic-based prompt based on topic t_k. During retrieval, a document d_i is assigned to a topic-based prompt P^(i) ∈ {V_1, ..., V_Ks} generated by the topic t^(i) for which the document has the highest topic component. Documents with similar topic distributions share the same prompt. During the contrastive learning fine-tuning phase, the PLMs can clearly distinguish simple negative instances with different prompts and focus more on hard negatives with identical prompts.
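A minimal numpy sketch of this pipeline follows. The embedding table, mean pooling, and single linear layer are assumptions for illustration; only the residual structure and the argmax assignment mirror the description above:

```python
import numpy as np

# Sketch of the prompt encoder E_Theta and the document-to-prompt assignment.
rng = np.random.default_rng(1)
V_size, dim, L, K_s, n_docs = 100, 16, 5, 3, 4

emb = rng.normal(size=(V_size, dim)) * 0.02  # embedding layer (topic-word semantics)
W = rng.normal(size=(dim, dim)) * 0.02       # trainable linear parameters Theta

def prompt_encoder(word_ids):
    """Map one topic's top-L word ids to its continuous prompt vector V_k."""
    e = emb[word_ids].mean(axis=0)           # pool the topic-word embeddings
    return e + e @ W                         # residual connection keeps word semantics

prompts = np.stack([prompt_encoder(rng.integers(0, V_size, size=L))
                    for _ in range(K_s)])    # V_1, ..., V_Ks

theta = rng.dirichlet(np.ones(K_s), size=n_docs)  # doc-topic distributions on the simplex
assignment = theta.argmax(axis=1)            # each doc takes its dominant topic's prompt
```

Documents with the same argmax topic share a prompt, matching the sharing behavior described above.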

Deep Prompt Tuning
To incorporate our topic-based prompts into the PLMs, we utilize the P-Tuning v2 (Liu et al., 2021b) methodology to initialize a trainable prefix matrix M of dimensions dim × (num × 2 × dim), corresponding to our topic-based prompts in Figure 2, where dim denotes the hidden size, num refers to the number of transformer layers, and the factor 2 accounts for a key vector K and a value vector V. These dimensions specifically support the attention mechanism (Vaswani et al., 2017), which operates on key-value pairs and must align with the transformer's hidden size and layer structure.
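The bookkeeping implied by these dimensions can be sketched as follows; the row dimension and the reshape layout are assumptions that simply follow the sizes quoted above:

```python
import numpy as np

# Prefix-matrix shape bookkeeping for P-Tuning v2-style prompts.
dim, num = 768, 12                       # hidden size, transformer layer count

M = np.zeros((dim, num * 2 * dim))       # the trainable prefix matrix
per_layer = M.reshape(dim, num, 2, dim)  # split into (layer, key/value, hidden)

key_prefix_l0 = per_layer[:, 0, 0, :]    # key prefix injected at layer 0
value_prefix_l0 = per_layer[:, 0, 1, :]  # value prefix injected at layer 0
```

Each transformer layer thus receives its own key and value prefix slice of width `dim`.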
As illustrated in Figure 3 (middle), we encode the title T_i and its assigned prompt P^(i) as the query q_i = Attention[M(P^(i)), PLMs(T_i)], and the abstract along with its prompt as the passage p_i = Attention[M(P^(i)), PLMs(A_i)]. The attention operation can be formulated as Attention(Q, K, V) = softmax(QK^T / √dim)V. This calculates a weighted sum of the input embeddings, i.e., M(P^(i)) and PLMs(T_i), or M(P^(i)) and PLMs(A_i), based on their contextual relevance. Our approach uses the attention mechanism to amalgamate topic-based prompts with PLM embeddings, enabling focused attention on significant semantic aspects of the input. This facilitates dynamic adjustment of embeddings based on topic-based prompts, leading to more contextually pertinent representations.
Figure 3: The Topic-DPR framework comprises three main components. First, we associate the document titles or abstracts with topic-based prompts based on their topic distributions (left). Second, during the deep prompt tuning phase, the prefix matrix houses the parameters for these prompts, and a pre-trained language model serves as the encoder for titles or abstracts, with the attention mechanism facilitating inter-layer output interactions (middle). Lastly, we introduce three contrastive learning objectives to group relevant queries and passages on a simplex for efficient dense retrieval tasks (right).
Typically, we use the first token [CLS] of the output vectors as the query or passage representation. Unlike the Prefix-tuning approach, Topic-DPR simultaneously employs and optimizes multiple prompts, enhancing the model's ability to capture the corpus's diverse semantics, thereby improving retrieval performance.

Contrastive Learning
Topic-DPR aims to learn the representations of queries and passages such that the similarity between relevant pairs exceeds that between irrelevant ones. Here, we classify the contrastive loss into three categories: query-query, query-passage, and topic-topic relations.
Query-Query Relation. The objective of learning similarity in the query-query relation is to increase the distance between a negative query q^-_{i,j} and the query q_i, while enhancing the similarity between a positive query q^+_{i,z} and q_i. Given a query q_i with m positive queries {q^+_{i,z}}_{z=1}^{m} and n negative queries {q^-_{i,j}}_{j=1}^{n}, we optimize the loss function as the negative log-likelihood of the query:
L_qq = - Σ_{z=1}^{m} ρ(q_i, q^+_{i,z}) log [ e^{s(q_i, q^+_{i,z})/γ} / ( e^{s(q_i, q^+_{i,z})/γ} + Σ_{j=1}^{n} e^{s(q_i, q^-_{i,j})/γ} ) ]   (3)
where γ is the temperature hyperparameter and s(·) denotes the cosine similarity function. We define ρ(·) as the correlation coefficient of the positive pair, which is discussed in Section 3.3.3.
Query-Passage Relation. Different from Eq. 3, the query-passage similarity relation regards the query q_i as the center, pulls the positive passages p^+_{i,z} closer, and pushes the negative passages p^-_{i,j} away. Formally, we optimize the loss function as the negative log-likelihood of the positive passage:
L_qp = - Σ_{z=1}^{m} ρ(q_i, p^+_{i,z}) log [ e^{s(q_i, p^+_{i,z})/γ} / ( e^{s(q_i, p^+_{i,z})/γ} + Σ_{j=1}^{n} e^{s(q_i, p^-_{i,j})/γ} ) ]   (4)
Since the objective of the dense passage retrieval task is to find the passages relevant to a query, we consider the query-passage similarity relation critical for Topic-DPR, with Eq. 3 serving as an auxiliary to Eq. 4.
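A small numpy sketch of this weighted contrastive objective for a single query; the toy vectors and the `gamma` default are illustrative, and `rho` is passed in as precomputed correlation weights:

```python
import numpy as np

def cos(a, b):
    """Cosine similarity s(., .) between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def query_query_loss(q, positives, negatives, rho, gamma=0.05):
    """Weighted negative log-likelihood over one query's positives,
    contrasted against its in-batch negatives."""
    neg = sum(np.exp(cos(q, n) / gamma) for n in negatives)
    loss = 0.0
    for w, p in zip(rho, positives):
        pos = np.exp(cos(q, p) / gamma)
        loss += -w * np.log(pos / (pos + neg))
    return loss
```

The loss shrinks as positives align with the query and grows as negatives do, which is the behavior the text describes.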
Topic-Topic Relation. The motivation for optimizing multiple topic-based prompts lies in the fact that a set of diverse prompts can guide the representations of queries and passages toward the desired topic directions more effectively than a fixed prompt. However, with prompts distributed across K_s topics, the margins between them are still challenging to distinguish using conventional fine-tuning methods. Consequently, we aim to enhance the diversity of these topic-based prompts in the embedding space through contrastive learning to better match their topic distributions. Given a batch of passages {A_i}_{i=1}^{N}, we encode them into the PLMs using the K_s topic-based prompts V and generate K_s × N passage representations {p^1_1, ..., p^{K_s}_1, ..., p^{K_s}_N}, where p^k_i denotes the representation of passage A_i under prompt V_k. We propose a loss function for each prompt V_k designed to push the other prompts {V_z}_{z≠k} away with the assistance of the passages, as formulated below:
L_tt = Σ_{k=1}^{K_s} (1/N) Σ_{i=1}^{N} Σ_{z≠k} max(0, M - (1 - s(p^k_i, p^z_i)))   (5)
In this function, M is the margin hyperparameter, signifying the similarity discrepancy among passages across various topics. It is premised on the belief that unique prompts can steer the same text's representation towards multiple topic spaces. Consequently, PLMs can focus on relationships among instances with identical prompts and disregard distractions from unrelated samples with differing prompts. The additional prompts impose constraints on the PLMs, spreading pertinent instances over diverse topic spaces. This approach explains our exclusive use of higher-level topics from hLDA for topic-based prompt definition (Section 3.2).
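One plausible reading of this margin objective in numpy; the hinge below is an assumption consistent with the description (same passage under different prompts must differ by at least the margin), not necessarily the paper's exact published form:

```python
import numpy as np

def topic_topic_loss(P, M=0.4):
    """P: (K_s, N, dim) array, passage i encoded under prompt k.
    Penalize same-passage representations under different prompts whose
    similarity discrepancy (1 - cosine) falls below the margin M."""
    Pn = P / np.linalg.norm(P, axis=-1, keepdims=True)
    Ks = P.shape[0]
    loss = 0.0
    for k in range(Ks):
        for z in range(Ks):
            if z == k:
                continue
            sim = (Pn[k] * Pn[z]).sum(axis=-1)               # s(p_i^k, p_i^z) per passage
            loss += np.maximum(0.0, M - (1.0 - sim)).mean()  # hinge on the discrepancy
    return loss / Ks
```

Identical representations under every prompt incur the full margin penalty, while well-separated topic spaces incur none.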

In-batch Positives and Negatives
For dense retrieval, identifying positive and negative instances is vital for computing the loss functions Eq. 3 and Eq. 4. In our approach, we feed a batch of N documents into the PLMs per iteration and sample positives and negatives from this batch. Importantly, we employ the multi-category information of these documents to pinpoint relevant queries or passages, aligning with our problem's objective. Queries or passages sharing intersecting multi-category information are considered positive. The correlation coefficient of a positive pair (q_i, q^+_j) can be expressed as ρ(q_i, q^+_j) = |C_i ∩ C_j| / |C_i ∪ C_j|. This parallels the positive pair in the query-passage relation. By default, all other queries or passages in the batch are deemed irrelevant.
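A sketch of this in-batch mining step; the Jaccard overlap used for ρ is one simple choice consistent with intersecting category sets, not necessarily the paper's exact coefficient, and the category labels are illustrative:

```python
def jaccard(C_i, C_j):
    """Category-set overlap used as the correlation coefficient rho."""
    inter = len(C_i & C_j)
    return inter / len(C_i | C_j) if inter else 0.0

def mine_positives(categories):
    """For each batch index, list the other indices sharing a category,
    paired with their rho weight; everything else is treated as negative."""
    return [[(j, jaccard(C_i, C_j))
             for j, C_j in enumerate(categories) if j != i and C_i & C_j]
            for i, C_i in enumerate(categories)]

batch = [{"cs.CL", "cs.IR"}, {"cs.IR"}, {"math.ST"}]
pos = mine_positives(batch)  # [[(1, 0.5)], [(0, 0.5)], []]
```

The third document has no category overlap with the rest, so it contributes only negatives for the other queries.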

Combined Loss Functions
In this section, we combine the three relations presented above to obtain the combined loss function for fine-tuning over each batch of {(q_i, p_i)}_{i=1}^{N} and {V_k}_{k=1}^{K_s}: L = L_qp + α(L_qq + L_tt), where α is a hyper-parameter to weight the losses.

Experimental Settings
Datasets We evaluate Topic-DPR's retrieval performance through experiments on two scientific document datasets: the arXiv-Article (Clement et al., 2019) and USPTO-Patent datasets (Li et al., 2022a).
For dense passage retrieval, we extract titles, abstracts, and multi-category information from these semi-structured datasets, using titles as queries and relevant abstracts as optimal answers. The Appendix details these datasets' statistics.
Evaluation Metrics Considering the realistic literature retrieval process and dataset passage count, we use Accuracy (Acc@1, 10), Mean Reciprocal Rank (MRR@100), and Mean Average Precision (MAP@10, 50) for performance assessment. Furthermore, we apply a representation analysis tool to gauge the alignment and uniformity of the PLMs' embedding space (Wang and Isola, 2020).
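These metrics can be computed from per-query rankings as sketched below; the normalization of truncated average precision by min(R, k) follows one common convention and may differ from the paper's evaluation script:

```python
def mrr_at_k(first_rel_ranks, k=100):
    """first_rel_ranks: 1-based rank of the first relevant passage per query
    (None if no relevant passage appears in the ranking)."""
    return sum(1.0 / r for r in first_rel_ranks if r is not None and r <= k) \
        / len(first_rel_ranks)

def acc_at_k(first_rel_ranks, k):
    """Fraction of queries with a relevant passage in the top k."""
    return sum(1 for r in first_rel_ranks if r is not None and r <= k) \
        / len(first_rel_ranks)

def average_precision_at_k(rels, k, total_rel):
    """rels: 0/1 relevance flags of one ranked list, truncated at k."""
    hits, score = 0, 0.0
    for rank, rel in enumerate(rels[:k], start=1):
        if rel:
            hits += 1
            score += hits / rank
    return score / min(total_rel, k) if total_rel else 0.0
```

MAP@k is then the mean of `average_precision_at_k` over all queries.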
Baselines For a baseline comparison, we employ the standard sparse retrieval method, BM25 (Robertson et al., 2009). The efficacy of our proposed techniques is assessed against DPR and DPTDR, adapted to our datasets. DPR, an advanced dual-encoder method, incorporates contrastive learning for dense passage retrieval, while DPTDR, a contemporary leading method using deep prompt tuning, employs a continuous prompt to boost PLMs' retrieval efficiency. Due to the absence of specific positive examples for each query, we apply the positive sampling approach across all techniques to guarantee fair comparisons.
Implementation Details We initialize our Topic-DPR parameters using two uncased PLMs, BERT-base and BERT-large, obtained from Huggingface. All experiments are executed on Sentence-Transformers with an NVIDIA Tesla A100 GPU. The Appendix details all the hyper-parameters.

Experimental Results
Table 1 presents the outcomes of our experiments using BERT-base and BERT-large models on the arXiv-Article and USPTO-Patent datasets. In comparison to sparse methods, dense retrieval techniques show significant performance improvement, emphasizing dense retrieval's importance. When contrasting DPTDR with DPR, DPTDR exhibits superior performance in the MAP@10 and MAP@50 metrics due to its continuous prompt enhancement. Our topic-based prompts in Topic-DPR boost the Acc and MRR metrics. Furthermore, Topic-DPR outperforms baseline methods across all metrics. Specifically, our Topic-DPR-base exceeds DPTDR-base by 3.00/2.42 and 2.64/2.98 points in MAP@10 and MAP@50, which are vital for large multi-category passage retrieval. Additionally, in the deep prompt tuning setting, our Topic-DPR ♠, despite slight performance degradation, still maintains comparable performance with only 0.1%-0.4% of the parameters tuned. The consistent enhancements across diverse settings, models, and metrics manifest the robustness and efficiency of our Topic-DPR method. This research establishes Topic-DPR as an effective deep prompt learning-based dense retrieval method, setting a new state-of-the-art for the datasets. Ablation experiments are reported in the Appendix.
5 Analysis on Topic-DPR

Quality of Topic Words
To assess whether the topic words extracted from hLDA are helpful for dense passage retrieval tasks, we conducted an experiment that directly used the topic words as prompts to train DPR with BERT-base. Each query was transformed into "[TOPIC WORDS...] + [QUERY]" and each passage into "[TOPIC WORDS...] + [PASSAGE]". As shown in Table 2, when noise is introduced, the model experiences more interference, resulting in decreased performance. This indicates that simply adding random words to the queries and passages is detrimental to the model's ability to discern relevant information. However, with the help of the topical information extracted from hLDA, the model performs better than the original DPR. This suggests that introducing high-quality topic words is beneficial for domain disentanglement, as it allows the model to better differentiate between various subject areas and focus on the relevant context. By incorporating topic words as prompts, the model is able to generate more diverse and accurate semantic representations, ultimately improving its ability to retrieve relevant passages.

Representations with Topic-based Prompts
We visualize the representations with different types of labels and analyze the influence of topic-based prompts intuitively. As shown in Figure 4(a), the passages distributed across the K_s topics overlap in the vanilla BERT. As depicted in Figure 4(b), they are arranged in a pentagonal shape, resembling the probabilistic simplex of the topic model ∆_K. Here, each topic is independent and discrete, with clearly distinguishable margins between them. This outcome is primarily due to the topic-based prompts, which are initialized by the topic distributions and direct the representations to exhibit significant topical semantics. During the dense retrieval phase, the passages in sub-topic spaces are more finely differentiated using the loss functions Eq. 3 and Eq. 4. As illustrated in Figure 4(d), passages belonging to the same category cluster together, enabling queries to identify relevant passages with greater accuracy. These observations suggest that our proposed topic-based prompts can encourage PLMs to generate more diverse semantic representations for dense retrieval.

Alignment and Uniformity
We experiment with 5,000 pairs of queries and passages from the USPTO-Patent dataset's development set to analyze the quality of representations in terms of three metrics: alignment, uniformity, and cosine distance. Alignment and uniformity are two properties that measure the quality of representations (Wang and Isola, 2020). Specifically, the alignment measures the expected distance between the representations of the relevant pairs (x, x^+): Align(p_pos) = E_{(x, x^+) ~ p_pos} ‖f(x) - f(x^+)‖², (8) where f(x) is the L2 normalization function.
And the uniformity measures the degree of uniformity of the whole representation distribution p_data: Uniform(p_data) = log E_{x, y ~ i.i.d. p_data} e^{-2‖f(x) - f(y)‖²}. (9)
As demonstrated in Table 3, the results of all dense retrieval methods surpass those of the vanilla BERT. Comparing DPR with DPTDR, the latter exhibits better alignment performance yet poorer uniformity, with representations from DPR displaying significant differences in average similarity between relevant and irrelevant pairs. This phenomenon highlights a shortcoming of deep prompt tuning in dense retrieval, where the influence of a single prompt can lead to anisotropic issues in the embedding space. Furthermore, this observation can explain why DPTDR underperforms DPR in certain metrics, as discussed in Section 4.2. Regarding the results of our Topic-DPR, the alignment of representations decreases to 0.38, marginally better than DPTDR. Importantly, our method substantially enhances uniformity, achieving the best scores and indicating a greater separation between relevant and irrelevant passages. Thus, our Topic-DPR effectively mitigates the anisotropic issue of the embedding space.
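The two measures follow Wang and Isola (2020) and can be sketched directly; the squared-distance exponent and the temperature t = 2 below are the standard choices, assumed to match the paper's setup:

```python
import numpy as np

def l2n(x):
    """L2-normalize row vectors (the function f in Eq. 8 and Eq. 9)."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def alignment(q, p):
    """Expected squared distance over relevant (query, passage) pairs."""
    return float((np.linalg.norm(l2n(q) - l2n(p), axis=-1) ** 2).mean())

def uniformity(x, t=2.0):
    """Log of the average Gaussian potential over all distinct pairs."""
    f = l2n(x)
    d2 = ((f[:, None, :] - f[None, :, :]) ** 2).sum(-1)
    mask = ~np.eye(len(x), dtype=bool)
    return float(np.log(np.exp(-t * d2[mask]).mean()))
```

Lower values are better for both: perfectly matched pairs give alignment 0, and maximally spread points drive uniformity down.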

Conclusion
In this paper, we examine the limitations of using a single task-specific prompt for dense passage retrieval. To address the anisotropic issue in the embedding space, we introduce multiple novel topic-based prompts derived from the corpus's semantic diversity to improve the uniformity of the space. This approach proves highly effective in identifying and ranking relevant passages during retrieval. We posit that the strategy of generating data-specific continuous prompts may have broader applications in NLP, as these prompts encourage PLMs to represent more diverse semantics.

Limitations
Our method achieves promising performance in enhancing the semantic diversity of representations for dense passage retrieval, but we believe there are two limitations to be explored in future work: (1) Neural topic modeling could be jointly trained with the dense passage retrieval task, so that the topics extracted for our topic-based prompts automatically comply with the retrieval objective. (2) Additional document metadata, such as authors, sentiments, and conclusions, could be considered to further assist document retrieval.

Figure 1 :
Figure 1: Anisotropic issue of deep prompt tuning with a single prompt. Topical information in prompts aids in distinguishing irrelevant passages in our Topic-DPR.
Figure 2: The definition of the topic-based prompts. We model the document corpus using topic modeling to obtain topic distributions. Higher-level K_s topics from the hierarchy are then inputted into the prompt encoder to construct our topic-based prompts. Notably, the parameters of the linear layers within the encoder will be optimized.

Figure 4 :
Figure 4: T-SNE visualization (Van der Maaten and Hinton, 2008) of the representations from the vanilla BERT-base and the Topic-DPR-base. Each point represents one passage under a specific label, and each color represents one class. The two types of labels we use are the five topics generated by the hLDA and the 412 categories in the USPTO-Patent dataset.

Table 2 :
Quality analysis of the topic words on the test set of the arXiv-Article dataset. DPR with topic words indicates that each example in the training data has its corresponding topic words added as prompts. DPR with random words indicates that each example in the training data has random words added as prompts.

Table 3 :
Quality analysis of the representations generated by different methods with BERT-base. The quality is better when all the above numbers are lower, except Sim(q, p+).