Parameter-Efficient Prompt Tuning Makes Generalized and Calibrated Neural Text Retrievers

Prompt tuning attempts to update only a few task-specific parameters in pre-trained models. It has achieved performance comparable to fine-tuning of the full parameter set on both language understanding and generation tasks. In this work, we study the problem of prompt tuning for neural text retrievers. We introduce parameter-efficient prompt tuning for text retrieval across in-domain, cross-domain, and cross-topic settings. Through an extensive analysis, we show that the strategy can mitigate the two issues -- parameter inefficiency and weak generalizability -- faced by fine-tuning based retrieval methods. Notably, it can significantly improve the out-of-domain zero-shot generalization of the retrieval models. By updating only 0.1% of the model parameters, the prompt tuning strategy can help retrieval models achieve better generalization performance than traditional methods in which all parameters are updated. Finally, to facilitate research on retrievers' cross-topic generalizability, we curate and release an academic retrieval dataset with 18K query-result pairs across 87 topics, making it the largest topic-specific dataset to date.


Introduction
Seeking relevant texts has long been a fundamental problem for a broad range of natural language processing (NLP) applications, such as open-domain question answering (Chen et al., 2017), retrieval-augmented language modeling (Guu et al., 2020), and fact verification (Thorne et al., 2018). Its recent progress has been largely driven by neural approaches (Karpukhin et al., 2020; Khattab and Zaharia, 2020), especially large-scale pre-trained language models with ever-growing numbers of parameters. For example, a recent study attempts to leverage models with up to 10 billion parameters (Ni et al., 2021).

Figure 1: For DPR (Karpukhin et al., 2020) trained on OpenQA datasets, PE learning (e.g., P-Tuning v2) offers parameter-efficiency and improved generalization thanks to better calibration and query-length robustness.
Meanwhile, an increasing number of studies have focused on the parameter-efficiency and generalizability challenges of neural methods. In terms of parameter efficiency, the common practices (Karpukhin et al., 2020) rely on fine-tuning dual encoders for queries and documents separately and thus cause parameter redundancy (Geigle et al., 2022). Furthermore, fine-tuning the full parameters of a pre-trained retriever for multi-lingual (Litschko et al., 2022) or cross-topic settings can also result in parameter inefficiency. Moreover, despite neural approaches' in-domain advantages, it has been found that their cross-domain generalization cannot match the simple BM25 method (Thakur et al., 2021). Consequently, these issues pose challenges to developing cost-effective neural text retrievers.
Recently, parameter-efficient (PE) transfer learning, including prompt tuning (Li and Liang, 2021; Liu et al., 2021c; Lester et al., 2021), adapters (Houlsby et al., 2019), and hybrid methods (Hu et al., 2021; Zaken et al., 2022), has been shown to achieve performance comparable to fine-tuning on language understanding and generation tasks while employing very few task-specific tuning parameters. Inspired by this progress, we propose to study whether and how PE learning can benefit neural text retrieval in terms of both parameter efficiency and generalizability.
In this work, we systematically examine a line of mainstream PE methods in in-domain, cross-domain, and cross-topic settings. As expected, most PE approaches perform comparably to fine-tuning on in-domain retrieval. Excitingly, PE prompt tuning (Li and Liang, 2021; Liu et al., 2022) can also encourage neural text retrievers to generalize on the cross-domain benchmark BEIR (Thakur et al., 2021) and on OAG-QA, a new multi-discipline academic cross-topic retrieval dataset we construct. For example, by simply replacing fine-tuning with the parameter-efficient P-Tuning v2 (Liu et al., 2022), we achieve relative gains ranging from 3.5% to 105.0% on out-of-domain BEIR datasets.
Through empirical analyses, we attempt to provide an understanding of the better generalization brought by PE prompt tuning. First, PE prompt tuning can empower the neural model with better confidence calibration, which refers to the principle that a model's predicted label probabilities should correspond to the ground-truth correctness likelihood (Guo et al., 2017). Second, it encourages better performance on queries whose lengths differ from those seen in in-domain training, demonstrating PE methods' generalization capacity to out-of-domain datasets.
To summarize, this work aims to advance neural text retrievers from three aspects: • Problem: we propose to leverage PE learning for neural text retrievers with far fewer tuning parameters. We demonstrate that PE prompt tuning can not only perform comparably to fine-tuning in-domain but also enable neural retrievers to achieve significant generalization advantages over fine-tuning on cross-domain and cross-topic benchmarks.
• Understanding: we provide an understanding of PE learning's outperformance across domains and topics. Our analysis suggests that its generalization advantage largely comes from its confidence-calibrated predictions and query-length robustness.
• Dataset: we construct OAG-QA, an academic paper retrieval dataset curated from real-world questions and expert answers, to test retrievers' cross-topic generalizability. With 22 disciplines and 87 topics, OAG-QA is the largest fine-grained topic retrieval dataset to date.

Related Work
Neural Text Retrieval. Text retrievers traditionally rely on sparse, lexical-based inverted indexes to rank candidate documents containing query terms (e.g., TF-IDF and BM25). They benefit from simplicity but often suffer from the lexical gap (Berger et al., 2000). Recently, neural text retrievers, including dense retrievers (Karpukhin et al., 2020; Xiong et al., 2021; Hofstätter et al., 2021), late-interaction models (Khattab and Zaharia, 2020; Santhanam et al., 2021), and hybrid or re-ranking models (Nogueira et al., 2019; Wang et al., 2020b), have become popular, as they can capture semantic-level query-document similarity thanks to the advances in pre-trained language models (Han et al., 2021).
Generalization in Text Retrieval. The weaker generalizability of neural retrievers compared to conventional lexical ones has recently raised concerns in the community (Liu et al., 2021a,b; Chen et al., 2022) and motivated BEIR, a heterogeneous cross-domain generalization benchmark (Thakur et al., 2021). While recent works employ ideas like bigger pre-trained models (Ni et al., 2021) or unsupervised pre-training on large corpora (Izacard et al., 2021) to improve scores on BEIR, few of them focus on better transfer strategies based on existing architectures and datasets for out-of-domain generalization.
Parameter-Efficient (PE) Learning. The sizes of pre-trained language models keep soaring (Brown et al., 2020), posing great challenges to traditional task transfer based on full-parameter fine-tuning. A recent focus has been on emerging PE transfer learning, including prompt tuning (Li and Liang, 2021; Liu et al., 2021c; Lester et al., 2021), adapters (Houlsby et al., 2019), and hybrid methods (Hu et al., 2021; Zaken et al., 2022). These methods employ very few tuning parameters to achieve transfer performance comparable to fine-tuning. Despite abundant research on problems like language understanding (Houlsby et al., 2019; Liu et al., 2022) and generation (Li and Liang, 2021), how PE learning impacts retrieval remains under-explored.

Challenges in Neural Text Retrieval
The neural text retriever, which leverages pre-trained language models, e.g., BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019), as the backbone, has significantly mitigated the lexical gap (Berger et al., 2000) in text retrieval and become a standard component for many NLP applications (Chen et al., 2017; Guu et al., 2020; Petroni et al., 2021). It consists of several different categories, and in this work we focus on the following two dominant ones.
• Dense Retriever (Karpukhin et al., 2020): Dense retrieval learns dual encoders that map queries and documents into a dense vector space such that relevant query-document pairs have shorter distances. It usually adopts the inner product for the sake of efficiency:

\[ \mathrm{sim}(q, p) = E_Q(q)^{\top} E_P(p) \]

where \(E_Q(\cdot)\) and \(E_P(\cdot)\) are dense encoders that map queries and documents to dense vectors, respectively. A rule-of-thumb training objective is the noise contrastive estimation (NCE) loss, which takes the query \(q_i\), its relevant (positive) document \(p_i^{+}\), and \(n\) irrelevant (negative) documents \(p_{i,j}^{-}\):

\[ L = -\log \frac{e^{\mathrm{sim}(q_i, p_i^{+})}}{e^{\mathrm{sim}(q_i, p_i^{+})} + \sum_{j=1}^{n} e^{\mathrm{sim}(q_i, p_{i,j}^{-})}} \quad (1) \]

• Late-Interaction Retriever (Khattab and Zaharia, 2020): ColBERT combines the strengths of the bi-encoder and the cross-encoder to encode the query and the document at a finer granularity into multi-vector representations. The relevance is estimated using rich yet scalable interactions between the query and document representations. Specifically, the model produces an embedding for every token in the query and the document and computes the relevance as the sum of the maximum similarities between each query token vector and all document token vectors:

\[ \mathrm{sim}(q, d) = \sum_{i=1}^{|E_q|} \max_{j=1,\dots,|E_d|} E_{q_i} E_{d_j}^{\top} \quad (2) \]

where \(E_q\) and \(E_d\) are the sequences of token embeddings for query \(q\) and document \(d\).
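To make the two scoring functions concrete, below is a minimal PyTorch sketch of the in-batch NCE loss in Eq. 1 and the MaxSim relevance in Eq. 2; the tensor shapes, function names, and use of explicit negatives are illustrative assumptions rather than details of any released implementation.

```python
import torch
import torch.nn.functional as F

def dpr_nce_loss(q_emb, pos_emb, neg_emb):
    """In-batch NCE loss sketch for a dense retriever (Eq. 1).
    q_emb:   [B, d]    query embeddings E_Q(q_i)
    pos_emb: [B, d]    positive document embeddings E_P(p_i^+)
    neg_emb: [B, n, d] negative document embeddings E_P(p_{i,j}^-)
    """
    pos_scores = (q_emb * pos_emb).sum(-1, keepdim=True)        # [B, 1]
    neg_scores = torch.einsum("bd,bnd->bn", q_emb, neg_emb)     # [B, n]
    logits = torch.cat([pos_scores, neg_scores], dim=-1)        # [B, 1 + n]
    labels = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, labels)                      # positive sits at index 0

def colbert_maxsim(q_tok_emb, d_tok_emb):
    """Late-interaction relevance sketch (Eq. 2): sum over query tokens of
    the maximum similarity to any document token.
    q_tok_emb: [Lq, d] per-token query embeddings
    d_tok_emb: [Ld, d] per-token document embeddings
    """
    sim = q_tok_emb @ d_tok_emb.T                               # [Lq, Ld]
    return sim.max(dim=-1).values.sum()
```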
Challenges. Neural retrieval approaches, such as dense retrievers and late-interaction models, have outperformed lexical ones on typical open-domain question answering datasets, e.g., NaturalQuestions (Kwiatkowski et al., 2019). However, recent studies (Litschko et al., 2022; Thakur et al., 2021) unveil some of their inherent limitations, posing the following challenges: • Parameter Inefficiency: Though full-parameter fine-tuning empowers neural retrievers to achieve good results, it introduces substantial parameter redundancy from two aspects. First, training dual encoders doubles the number of parameters to be tuned, and improving strategies such as parameter sharing (Yan et al., 2021; Geigle et al., 2022) have to sacrifice retrieval performance. Second, cross-lingual (Litschko et al., 2022) and cross-domain (Thakur et al., 2021) transfer may require additional full-parameter tuning on each individual task and consequently multiply the number of parameters several times. • Weak Generalizability: Though neural retrievers offer advantages on in-domain datasets, e.g., OpenQA datasets (Karpukhin et al., 2020), some of them, particularly dense retrievers, cannot generalize well to zero-shot cross-domain benchmarks (Thakur et al., 2021). However, the zero-shot setting is widely adopted in downstream scenarios, as constructing annotated retrieval training datasets can be prohibitively expensive. This challenge also broadly connects to the generalizability of neural networks.
In this work, we aim to explore solutions for addressing the above challenges in neural text retrieval. Specifically, we focus on parameter-efficient transfer learning, which has offered alternative strategies for the downstream usage of pre-trained models in natural language processing.

Parameter-Efficient Transfer Learning
We introduce the parameter-efficient transfer learning (PE learning) framework and notable techniques. Different from fine-tuning (Devlin et al., 2019), which updates the full parameters of pre-trained models for each target task, PE learning aims to achieve comparable performance to fine-tuning by tuning only a small portion of parameters per task (Houlsby et al., 2019; Li and Liang, 2021; Liu et al., 2022).

Transformers
The success of PE learning largely takes advantage of the Transformer architecture (Vaswani et al., 2017). Transformers are composed of stacked layers, each containing a multi-head attention module and a feed-forward network (FFN). The attention function can be written as

\[ \mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d}}\right) V \quad (3) \]

where the query \(Q\), key \(K\), and value \(V\) are

\[ Q = x W_q + b_q, \quad K = x W_k + b_k, \quad V = x W_v + b_v \quad (4) \]

The multi-head attention performs \(N\) heads in parallel and concatenates their outputs to form the input \(h\) to the FFN, where \(f\) is an activation function:

\[ \mathrm{FFN}(h) = f(h W_1 + b_1) W_2 + b_2 \quad (5) \]

Different PE learning methods modify different modules of a Transformer to achieve parameter efficiency.
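For reference, the following is a simplified single-head sketch of the attention and FFN sub-layers described above; the module and variable names are ours, and residual connections and LayerNorm are omitted. It highlights the projections and bias terms in Eqs. 4 and 5 that the PE methods below modify or freeze.

```python
import torch
import torch.nn as nn

class AttentionAndFFN(nn.Module):
    """Single-head attention + FFN sketch; each nn.Linear carries a weight and
    a bias term (Eqs. 4 and 5), which BitFit later tunes in isolation."""
    def __init__(self, d_model=768, d_ff=3072):
        super().__init__()
        self.w_q = nn.Linear(d_model, d_model)   # Q = x W_q + b_q
        self.w_k = nn.Linear(d_model, d_model)   # K = x W_k + b_k
        self.w_v = nn.Linear(d_model, d_model)   # V = x W_v + b_v
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))

    def forward(self, x):                        # x: [B, L, d_model]
        q, k, v = self.w_q(x), self.w_k(x), self.w_v(x)
        attn = torch.softmax(q @ k.transpose(-1, -2) / q.size(-1) ** 0.5, dim=-1)
        return self.ffn(attn @ v)                # residuals/LayerNorm omitted
```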

Parameter-Efficient Learning Methods
We introduce several emerging PE learning methods. Figure 2 illustrates the technical differences between them.
Adapters (Houlsby et al., 2019; Pfeiffer et al., 2020). The adapter inserts small modules between Transformer layers, which form a bottleneck that limits the number of tuned parameters:

\[ h \leftarrow h + f(h W_{\mathrm{down}}) W_{\mathrm{up}} \quad (6) \]

where \(h\) is the input, \(W_{\mathrm{down}} \in \mathbb{R}^{d \times r}\) and \(W_{\mathrm{up}} \in \mathbb{R}^{r \times d}\) are projection matrices, and \(f(\cdot)\) is the activation function (Cf. Figure 2 (a)).
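A minimal sketch of such a bottleneck adapter module; the hidden size of 768, bottleneck size of 64, and ReLU activation are illustrative choices:

```python
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter sketch (Eq. 6): h <- h + f(h W_down) W_up.
    Only the adapter weights are trained; the surrounding PLM is frozen."""
    def __init__(self, d_model=768, r=64):
        super().__init__()
        self.down = nn.Linear(d_model, r)        # W_down: d x r
        self.up = nn.Linear(r, d_model)          # W_up:   r x d
        self.act = nn.ReLU()

    def forward(self, h):
        return h + self.up(self.act(self.down(h)))
```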
BitFit (Zaken et al., 2022). Each Transformer layer consists of self-attention, FFN, and LayerNorm operations, all of which contain bias terms, as shown in Eqs. 4 and 5. BitFit proposes to tune only the bias terms b(·) of the Transformer (Cf. Figure 2 (d)).

Lester et al. & P-Tuning (Liu et al., 2021c). This approach inserts trainable continuous prompts into the input sequence of the Transformer. Given a PLM, let \(e(\cdot)\) be the input embedding function that maps input tokens to input embeddings. For a template \(T = \{[P_{0:i}], x, [P_{i+1:m}], y\}\), where \(x\) is the context and \(y\) is the target (e.g., the [MASK] token), the model's input becomes

\[ \{h_0, \dots, h_i, e(x), h_{i+1}, \dots, h_m, e(y)\} \quad (7) \]

where \(h_i\) is a trainable prompt embedding (Cf. Figure 2 (b)).
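This idea can be sketched as a small module that prepends trainable prompt embeddings to the (frozen) input embeddings; the initialization scale and prompt length below are illustrative assumptions:

```python
import torch
import torch.nn as nn

class InputPrompt(nn.Module):
    """Continuous prompts prepended to the input embeddings (Eq. 7).
    Only the prompt embeddings are trained; the PLM stays frozen."""
    def __init__(self, prompt_len=100, d_model=768):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(prompt_len, d_model) * 0.02)

    def forward(self, input_embeds):             # [B, L, d_model] from e(.)
        b = input_embeds.size(0)
        prompts = self.prompt.unsqueeze(0).expand(b, -1, -1)
        return torch.cat([prompts, input_embeds], dim=1)
```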
Prefix-Tuning (Li and Liang, 2021) & P-Tuning v2 (Liu et al., 2022). Prefix-tuning prepends l trainable key and value embeddings as a prefix to the attention at each layer of the language model. Specifically, the trainable prefix vectors \(P_k, P_v \in \mathbb{R}^{l \times d}\) are concatenated to the original key vectors \(K\) and value vectors \(V\), respectively. The computation of an attention head becomes

\[ \mathrm{head}^{(i)} = \mathrm{Attn}\!\left(x W_q^{(i)}, \left[P_k^{(i)}; K^{(i)}\right], \left[P_v^{(i)}; V^{(i)}\right]\right) \quad (8) \]

Here the superscript (i) refers to the part of the vectors that corresponds to the i-th head. Prefix-tuning has been empirically shown to be comparable to fine-tuning on a wide range of downstream tasks, including text generation (Li and Liang, 2021), natural language understanding (NLU), and sequence labeling (Liu et al., 2022).
Since the retrieval task is more related to NLU, we employ P-Tuning v2's implementation, which makes several optimizations on top of prefix-tuning (Cf. Figure 2 (c)).
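A simplified sketch of the per-layer key/value prefixes used by prefix-tuning and P-Tuning v2 follows; in a real implementation the prefixes would be reshaped per attention head and injected into every layer of the frozen backbone, which we omit here for brevity.

```python
import torch
import torch.nn as nn

class KeyValuePrefix(nn.Module):
    """Per-layer trainable prefix sketch (Eq. 8): l key/value vectors are
    concatenated in front of K and V; the backbone weights stay frozen."""
    def __init__(self, prefix_len=100, d_model=768):
        super().__init__()
        self.p_k = nn.Parameter(torch.randn(prefix_len, d_model) * 0.02)
        self.p_v = nn.Parameter(torch.randn(prefix_len, d_model) * 0.02)

    def forward(self, K, V):                     # K, V: [B, L, d_model]
        b = K.size(0)
        pk = self.p_k.unsqueeze(0).expand(b, -1, -1)
        pv = self.p_v.unsqueeze(0).expand(b, -1, -1)
        return torch.cat([pk, K], dim=1), torch.cat([pv, V], dim=1)
```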

In-Domain Parameter-Efficiency
In this section, we describe the data and settings we used for the in-domain OpenQA experiments and evaluate the retrieval performance of the parameter-efficient methods introduced above.
Settings. We evaluate the in-domain performance of the PE methods introduced above on OpenQA datasets. We use top-k retrieval accuracy as our evaluation metric, which measures the percentage of questions for which at least one of the top-k retrieved documents contains the answer.
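The metric can be computed as in the following sketch, where the answer-containment check is assumed to be given (e.g., by string matching against the annotated answers):

```python
def top_k_accuracy(ranked_doc_ids, answer_bearing, k=20):
    """Share of questions with at least one answer-bearing document in the
    top-k results.
    ranked_doc_ids:  list (one entry per question) of ranked doc-id lists
    answer_bearing:  list of sets; answer_bearing[i] holds the doc ids that
                     contain an answer to question i
    """
    hits = sum(
        1 for docs, gold in zip(ranked_doc_ids, answer_bearing)
        if any(d in gold for d in docs[:k])
    )
    return hits / len(ranked_doc_ids)
```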
Results. We identify the best-performing hyper-parameters for each method, and the results are shown in Table 1. As expected, P-Tuning v2 and BitFit are comparable to the fine-tuned baseline on all datasets. P-Tuning v2 also performs the best among the tested PE approaches on four in-domain datasets. On the other hand, Lester et al. & P-Tuning performs slightly weaker than the fine-tuned baseline. Adapter shows weak performance, which might be attributed to the specific implementation we use (Pfeiffer et al., 2020) (i.e., other versions with different implementations or more tunable parameters may do better). The results empirically demonstrate that PE methods can cut the necessary tuning parameters down to 0.1% while providing competitive performance on in-domain data.
Interestingly, we also notice that on the out-of-domain dataset SQuAD, P-Tuning v2, Lester et al. & P-Tuning, and BitFit substantially outperform their fine-tuned counterpart.

Cross-Domain and Cross-Topic Generalizability
In this section, we examine the zero-shot generalizability of fine-tuning and PE learning. We take P-Tuning v2 (Liu et al., 2022), which has the highest average in-domain accuracy, as a representative PE method. In particular, as previous work seldom looks into cross-topic generalization, we introduce OAG-QA, the largest fine-grained cross-topic retrieval dataset to date. For cross-domain evaluation, we adopt the well-established BEIR (Thakur et al., 2021) benchmark.
OAG-QA: A Fine-Grained Cross-Topic Scientific Literature Retrieval Dataset

OAG-QA is a fine-grained topic-specific passage retrieval dataset constructed by collecting high-quality questions and answers from online question-and-answer (Q&A) forums, such as Quora and Stack Exchange. These forums offer people the chance to ask questions and receive answers from other expert users, potentially with references to academic papers. These references can consequently be aligned to paper entities with rich meta-information (e.g., abstract, field-of-study (FOS)) in the Open Academic Graph (OAG) (Zhang et al., 2019), the largest publicly available academic entity graph to date. We collect questions from two influential websites: Stack Exchange in English and Zhihu in Chinese. On top of the collected pairs of questions and paper titles, we align the papers to OAG (Zhang et al., 2019; Wang et al., 2020a; Tang et al., 2008) paper ids via the public API. In terms of topics, disciplines from Stack Exchange and tags from Zhihu naturally serve as fine-grained topics attached to each question. The resulting dataset consists of 17,948 unique queries from 22 scientific disciplines and 87 fine-grained topics. For each topic, we sample 10,000 candidate papers, including the ground truth, from the same disciplines as annotated by OAG, and take their titles and abstracts as the corpus.

Zero-Shot Cross-Domain Generalization
Datasets. We adopt Benchmarking-IR (BEIR) proposed in (Thakur et al., 2021), a zero-shot generalization benchmark for evaluating retrievers on tasks across domains. It consists of 18 zero-shot evaluation datasets (15 of which are available) from 9 heterogeneous retrieval tasks. The datasets differ in corpus size (3.6k–15M documents), query and document length, and domain (e.g., news articles vs. scientific papers).
Settings. Following (Thakur et al., 2021), we train the models on one dataset and report zero-shot performance on the other datasets. We choose DPR (Karpukhin et al., 2020) from dense retrievers and ColBERT (Khattab and Zaharia, 2020) from late-interaction models to explore retrieval effectiveness under PE and full-parameter fine-tuning settings. Following the settings of BEIR, we use the open-sourced multi-dataset DPR checkpoint (Karpukhin et al., 2020) and the ColBERT model trained on MS MARCO (Nguyen et al., 2016).
To obtain comparable evaluation across the datasets and tasks in BEIR (Thakur et al., 2021), we use Normalized Discounted Cumulative Gain (nDCG@k), which accommodates both binary and graded relevance judgments when measuring ranking quality.
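As a sketch, nDCG@k for a single query can be computed as below with linear gains; the reported BEIR numbers come from the official evaluation tooling, so this snippet is only illustrative.

```python
import math

def ndcg_at_k(ranked_relevances, all_relevances, k=10):
    """nDCG@k for a single query with linear gains.
    ranked_relevances: relevance grades of retrieved docs, in ranked order
    all_relevances:    ground-truth relevance grades of all judged docs
    """
    def dcg(rels):
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))
    ideal = dcg(sorted(all_relevances, reverse=True))
    return dcg(ranked_relevances) / ideal if ideal > 0 else 0.0
```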
Results. Table 4 reports the results of DPR and ColBERT on the 15 datasets of BEIR. For DPR, P-Tuning v2 generalizes much better than the fine-tuned model on all datasets except MS MARCO and DBPedia. We observe that the datasets where our method improves by more than 5 points, such as Touche-2020 and SciFact, usually consist of long documents with average lengths over 200. We conjecture that DPR trained on OpenQA has been biased toward the 100-word document length of its ordinary setting. In summary, P-Tuning v2 achieves an absolute 5.2% improvement over the fine-tuned baseline on average. Thus, P-Tuning v2 greatly improves the out-of-domain generalization of dense retrieval models.
On the other hand, ColBERT trained by P-Tuning v2 also outperforms the fine-tuned ColBERT on almost all (13/15) datasets. P-Tuning v2 slightly underperforms on NQ and Quora, where documents are relatively short. For the out-of-domain average scores, P-Tuning v2 outperforms the baseline ColBERT by an absolute gain of 2.4%. Compared to DPR, fine-tuned ColBERT generalizes better, probably because it is trained on the larger and more diverse MS MARCO and its architecture is more scalable. But P-Tuning v2 still holds a generalization advantage over the fine-tuned model. In conclusion, the results show that with similar in-domain performance, P-Tuning v2 improves zero-shot cross-domain generalization compared to fine-tuning.

Zero-Shot Cross-Topic Generalization
In addition to cross-domain generalization, cross-topic generalization is a more pragmatic and meaningful challenge for retrieval tasks. For example, in a scientific literature retrieval system, the corpus sizes, abstract lengths, and writing styles would not vary too much. The challenge lies in refining retrievers for more fine-grained fields of study.
Table 4: Zero-shot cross-domain generalization evaluated on 14 datasets of BEIR (Thakur et al., 2021). All scores are nDCG@10, and the "FT" scores are taken from BEIR's report. ("*" denotes in-domain datasets; "FT" denotes fine-tuning; "PT2" denotes P-Tuning v2.)

Settings. We use the same trained DPR (Karpukhin et al., 2020) and ColBERT (Khattab and Zaharia, 2020) models introduced in Section 6.2 and conduct a zero-shot evaluation. We measure top-20 retrieval accuracy on the dataset of each topic and report the average score over each discipline.
Results. Table 5 compares models trained by P-Tuning v2 and fine-tuning using top-20 retrieval accuracy. P-Tuning v2 outperforms fine-tuning in 20/22 disciplines for DPR and 18/22 disciplines for ColBERT. Notably, P-Tuning v2 performs poorly in Algebra and Linear Algebra, two fields that contain a large number of mathematical symbols, for both DPR and ColBERT.
Overall, P-Tuning v2 is on average better than the fine-tuned baseline, gaining 2.6% and 1.2% absolute improvements with DPR and ColBERT, respectively.

An Understanding of the Generalization
How does PE learning help neural text retrievers generalize well? While it might be attributed to PE learning's flatter loss minima or alleviated catastrophic forgetting, in this work we investigate two other quantifiable factors: confidence calibration and query-length robustness.

Confidence Calibration
Although metrics like accuracy are usually the primary concern in machine learning, there are other properties to care about, such as calibration. Calibration refers to a model's ability to provide class probabilities that correspond to the likelihood of its predictions being correct. A calibrated model provides trustworthy confidence in its predictions, which is particularly important for algorithms deployed in critical real-world scenarios. Notwithstanding their higher accuracy, modern neural networks are known to be miscalibrated (Guo et al., 2017). Recent literature has also demonstrated that cross-domain calibration is a good proxy for a model's out-of-domain generalizability (Wald et al., 2021). To measure a retriever's calibration, we resort to the Expected Calibration Error (ECE) proposed in (Naeini et al., 2015):

\[ \mathrm{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{n} \left| \mathrm{acc}(B_m) - \mathrm{conf}(B_m) \right| \quad (9) \]

which bins the estimates from \(n\) samples within \([0, 1]\) into \(\{B_m\}\), a set of \(M\) equal-length buckets, and compares each bucket's average accuracy \(\mathrm{acc}(B_m)\) with its average estimated probability \(\mathrm{conf}(B_m)\). Each sample \(i\) has its label \(y_i\), estimated label \(\hat{y}_i\), and estimated probability \(\hat{p}_i\).
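A short sketch of ECE with equal-width bins, following Eq. 9; the bin count and array-based interface are illustrative choices:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE sketch with M equal-width bins over [0, 1] (Eq. 9).
    confidences: predicted probability of the predicted class, per sample
    correct:     1 if the predicted label equals the gold label, else 0
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    n = len(confidences)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            acc = correct[mask].mean()       # acc(B_m)
            conf = confidences[mask].mean()  # conf(B_m)
            ece += (mask.sum() / n) * abs(acc - conf)
    return ece
```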
Following prior work (Penha and Hauff, 2021), we cast the ranking problem as multi-class classification to compute ECE. We take queries with valid top-5 predictions, apply a softmax over the retrieval scores per query, and turn the ranking into a 5-class classification to derive ECE (Cf. Table 6) and calibration diagrams (Cf. Figure 3).

Table 6: Expected Calibration Error (Naeini et al., 2015) of fine-tuning (FT) and P-Tuning v2 (PT2) based on DPR (Karpukhin et al., 2020); the smaller the better.

Findings. As shown in Table 6 and Figure 3, we find that the P-Tuning v2 based DPR is better calibrated than its fine-tuned counterpart, on both in-domain and cross-domain datasets. The only exception is the TREC-COVID dataset in BEIR, which evaluates on only 50 queries and may thus exhibit high variance. To conclude, even though fine-tuning and P-Tuning v2 share similar in-domain performance, their levels of calibration still differ considerably, which accords with the observation in (Guo et al., 2017) that better accuracy does not imply better calibration. Such calibration can explain P-Tuning v2's generalizability, as (Wald et al., 2021) theoretically prove that superior multi-domain calibration usually leads to better cross-domain generalization.

Query-Length Robustness
Mismatched query lengths across datasets are another hidden factor. For example, in the four OpenQA datasets we experiment with, most query lengths fall in the interval from 8 to 40, while other datasets can have very different query lengths. Fine-tuning changes the pre-trained model's parameters and may consequently bias text retrievers toward certain query lengths; PE methods are free from such worries.
Findings. We present a case study on two typical datasets, Quora and ArguAna from BEIR (Thakur et al., 2021), to test this hypothesis. The query lengths are computed by splitting the plain query texts on white-space. For clearer visualization, we split the lengths into equal-sized bins. As shown in Figure 4, when queries are of medium length (30-100), both P-Tuning v2 and fine-tuning perform comparably. But when queries are either relatively short (in Quora) or long (in ArguAna), P-Tuning v2 generalizes much better than fine-tuning. This indicates that PE learning based neural text retrievers (e.g., with P-Tuning v2) are more robust to varied query lengths at test time.
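The per-length-bucket accuracy used in this case study can be sketched as follows; the bin edges below are illustrative, whereas the figure uses equal-sized bins.

```python
from collections import defaultdict

def accuracy_by_query_length(queries, hits, edges=(0, 30, 100, 10_000)):
    """Bucket queries by whitespace-token length and report retrieval
    accuracy per bucket.
    queries: list of query strings
    hits:    list of 0/1 flags, 1 if the query was answered correctly
    """
    buckets = defaultdict(list)
    for q, hit in zip(queries, hits):
        length = len(q.split())
        for lo, hi in zip(edges[:-1], edges[1:]):
            if lo < length <= hi:
                buckets[(lo, hi)].append(hit)
                break
    return {b: sum(v) / len(v) for b, v in buckets.items() if v}
```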

Conclusion
We propose to leverage PE prompt tuning for neural text retrieval and show, for the first time on this problem, that it achieves performance comparable to full-parameter fine-tuning. Furthermore, PE approaches like P-Tuning v2 improve cross-domain and cross-topic generalization, which, as we show, largely stems from improved confidence calibration and query-length robustness. Finally, we construct and release OAG-QA, the largest fine-grained topic-specific academic retrieval dataset, which contains 87 fine-grained topics and 17,948 query-paper pairs, to support future research.

Limitations
In this section we discuss several potentially unresolved topics related to this work. First, despite the superior parameter efficiency of PE learning, a long-standing challenge is that it converges more slowly and is more sensitive to hyper-parameters (typically, the learning rate) than fine-tuning. We make the same observation in this work and have to bypass the problem by training longer and trying multiple groups of hyper-parameters. It is thus important to design more robust and stable training strategies for prompt tuning in the future.
Second, the OAG-QA dataset requires further exploration. As indicated in Table 7, we purposely leave 20 samples in each fine-grained topic for future investigations into the effectiveness of PE learning in few-shot and meta-learning settings. Conventionally, these settings require fine-tuning the whole model for each task separately, causing great redundancy. However, PE learning's extreme parameter efficiency can come to the rescue. We leave this investigation for future work.
Third, PE learning's calibration and generalization properties should ideally be applicable to other language tasks, such as text understanding and generation. In this work, we focus on neural text retrieval, as it usually faces more distribution-shift scenarios. However, many other practical problems also suffer from the challenges of biased training data and weak generalization, and the application of PE learning to them remains largely unexplored.

B.1 Implementation of DPR

Original DPR (Karpukhin et al., 2020). We used the open-sourced DPR checkpoint trained on multi-dataset data with the bert-base-uncased model (sequence length: 256). The results are aligned with those reported by the DPR authors.
DPR with P-Tuning v2 (Liu et al., 2022). For P-Tuning v2 training, we used a batch size of 128 and a sequence length of 256. We trained the question and passage encoders, which are based on the bert-base-uncased model, for up to 40 epochs for the large datasets (NQ, TriviaQA, SQuAD, and the multi-dataset setting) and 100 epochs for the small datasets (TREC, WQ), with a learning rate of 0.01 and a prefix length of 100, using Adam, linear scheduling with 5% warm-up, and a dropout rate of 0.1.
DPR with Lester et al. & P-Tuning (Liu et al., 2021c). As with P-Tuning v2, we used the bert-base-uncased model as the base model; however, we only applied modifications to the input and set the learning rate to 0.01. We tried different prompt lengths, such as 100 and 200, to test the performance of the model.
DPR with BitFit (Zaken et al., 2022). For BitFit training, we used the same batch size, sequence length, dropout rate, and learning rate as in P-Tuning v2, as well as the same bert-base-uncased model. We trained the model for 40 epochs on the same datasets using the Adam optimizer and linear scheduling with 5% warm-up. We froze all other parameters and trained only the bias terms.
DPR with Adapter (Houlsby et al., 2019). When training the adapters, we used the PfeifferConfig architecture; except for a learning rate of 3e-5 and 50 epochs, the hyper-parameters and datasets were the same as for BitFit as introduced in the paragraph above. We adopted the adapter implementation from ADAPTER-TRANSFORMERS (Pfeiffer et al., 2020).

B.2 Implementation of ColBERT
Original ColBERT (Khattab and Zaharia, 2020). In full-parameter training, we adopted the hyper-parameters provided by (Khattab and Zaharia, 2020). We trained the ColBERT model with a learning rate of 3e-6 and a batch size of 32. We fixed the number of embeddings per query at 32 and followed (Thakur et al., 2021) in setting the number of document embeddings to 300. The embedding dimension was set to 128. The model was trained for up to 400k iterations.
With P-Tuning v2, we trained ColBERT from the parameters of bert-base-uncased for up to 400K steps on the MS MARCO dataset with a learning rate of 0.01 and a prefix length of 64. We used a batch size of 32 and fixed the number of embeddings per query at 32 and the number of embeddings per document at 300. The embedding dimension was set to 128.

Figure 2: The illustration of four parameter-efficient methods. The PLM module represents a certain sub-layer of a PLM, e.g., the attention or FFN. The components in blue are frozen and the yellow ones are trainable.

Table 2: Examples of disciplines, topics, and example query-paper pairs (only titles are shown) in OAG-QA.

Table 7: Statistics of OAG-QA.

When training DPR with Adapter, the adapter-transformers library (version 2.2.0) was used.