Stochastic Bridges as Effective Regularizers for Parameter-Efficient Tuning

Parameter-efficient tuning methods (PETs) have achieved promising results in tuning large pre-trained language models (PLMs). By formalizing the frozen PLM and the additional tunable parameters as a system and its controls respectively, PETs can be theoretically grounded in optimal control and viewed as optimizing the terminal cost and running cost from the optimal control literature. Despite the elegance of this theoretical grounding, in practice existing PETs often ignore the running cost and only optimize the terminal cost, i.e., they focus on optimizing the loss function of the output state regardless of the running cost that depends on the intermediate states. Since it is non-trivial to directly model the intermediate states and design a running cost function, we propose to use latent stochastic bridges to regularize the intermediate states and use this regularization as the running cost of PETs. As the first work to propose regularized PETs that use stochastic bridges as regularizers (running costs) on the intermediate states, we show the effectiveness and generality of this regularization across different tasks, PLMs and PETs. In view of its great potential and capacity, we believe more sophisticated regularizers can be designed for PETs and better performance can be achieved in the future. The code is released at \url{https://github.com/thunlp/stochastic-bridge-pet/tree/main}.


Introduction
Recent years have witnessed the dramatic growth of pre-trained language models (PLMs) in various fields (Devlin et al., 2019; Dosovitskiy et al., 2021). As the size of PLMs continues to increase, the number of parameters has now reached hundreds of billions (Brown et al., 2020; Smith et al., 2022), making fine-tuning the whole PLM both computationally impractical and environmentally unfriendly. In view of this, a variety of Parameter-Efficient Tuning methods (PETs) have been proposed (Houlsby et al., 2019; Hu et al., 2022; Zaken et al., 2022; Lester et al., 2021). By tuning only a small number of additional parameters, PETs can be comparable to full-parameter fine-tuning.
Despite the success of PETs, their underlying mechanism remains an open problem. Recently, several works have proposed to interpret PETs with optimal control theory. Yang and Liu (2022) first show that the optimization in Prefix Tuning (Li and Liang, 2021) (a typical PET) can be considered as the search for optimal control variables in the context of optimal control, i.e., the trainable prefixes can be seen as the control variables that drive the PLM (the system) to the desired output. Ding et al. (2022) further show that the optimal control perspective can be applied to almost all PETs. The optimization of PETs' parameters can be seen as minimizing the two cost functions in the optimal control literature: (1) the terminal cost L_T, which measures the quality of the terminal state, and (2) the running cost L_R, which measures the feasibility of the controlled intermediate states and the control variables. Although L_T corresponds well to the loss function of the model output, L_R is only vaguely described as the regularizer on the parameters of PETs (control variables) in Yang and Liu (2022) and Ding et al. (2022), ignoring the dependency of L_R on the intermediate states.
In this work, we show that designing a running cost to regularize intermediate states not only makes the optimal control perspective of PETs more theoretically sound but also empirically leads to better PETs. We begin by assuming that in PLMs, the intermediate hidden states for generating different tokens in a sentence have different dynamics (or trajectories), and that these dynamics can be approximated with stochastic processes in a latent space. Specifically, we first freeze the PLM and learn a mapping from the original hidden state space of the PLM to a latent space. In the latent space, the dynamics of the intermediate hidden states for generating different target tokens can be approximated with different target-specific diffusion bridges. The obtained mapping can then be plugged into the model to regularize the intermediate hidden states when training PETs. Besides, since a diffusion bridge is (1) a Markov process and (2) a solution to a stochastic differential equation (SDE), we correspondingly propose two methods to learn the mapping: (1) fitting the Markov transition probability density function (PDF) and (2) fitting the SDE directly. The two methods act as a trade-off between efficiency and effectiveness: the first incurs only negligible computational cost and achieves satisfactory results, while the second is slower but yields better regularizers.
We conduct experiments on different PLMs of different sizes, and the experimental results on GLUE (Wang et al., 2019) under both full-set and few-shot settings demonstrate the effectiveness of our proposal across four different PETs. Further analyses show that the learned regularizer helps pull apart the hidden states of different label words. We also observe that, when we project the intermediate hidden states of PETs trained without our regularizer into our latent space, the better the PETs perform, the closer their latent states are to our latent bridges. This spontaneous approaching behavior may indicate that stochastic-bridge-like latent dynamics naturally exist in well-trained PETs.
In summary, our work has the following contributions: (1) Guided by the optimal control perspective of PETs, we design latent stochastic bridge regularizers on the intermediate states during the training of PETs. (2) We propose two methods to construct the latent space according to the two representations of stochastic bridges, offering a trade-off between efficiency and effectiveness. (3) Our regularizers are shown to be effective and general across different PLMs, different PETs, and different tasks. (4) We show that well-trained PETs without any regularization spontaneously exhibit stochastic-bridge-like latent dynamics.

Definition and Mathematical Notations
Consider using an L-layer PLM with vocabulary V to handle a text-to-text task D. For each sample (x, y) ∈ D, y ∈ V is the output token and x ∈ V^N is the input token sequence^1, where N is the length of x. With x as the input, each layer of the PLM outputs a sequence of hidden states, and we denote the hidden states of the i-th PLM layer as h^{(i)} = {h^{(i)}_j}_{j=1}^{N} ∈ R^{d×N}. We denote the position where the model outputs the target y as o, i.e., the model should predict y with the hidden state h^{(L)}_o.

Optimal Control Perspective of PETs
Conventionally, adapting the PLM to D requires full-parameter fine-tuning:

min_{∆θ} E_{(x,y)∼D} [ L(h^{(L)}_o, y) + R(∆θ) ],  s.t. h^{(i)} = G^{(i)}_{θ+∆θ}(h^{(i−1)}), h^{(0)} = Embed(x),  (1)

where θ is the pre-trained parameters, ∆θ is the full-parameter update, L is the loss function, R is the regularization function, G^{(i)}_{θ+∆θ} is the i-th layer forward propagation with the updated parameters, and Embed transforms the input tokens into embeddings.
As |θ| continues to increase, full-parameter fine-tuning becomes impractical, and various PETs are proposed to mitigate this problem. Let ϕ = {ϕ^{(i)}}_{i=0}^{L} be the PETs' parameters. Ding et al. (2022) give a unified view of PETs from the perspective of optimal control, and Eq. 1 can be re-written as

min_ϕ E_{(x,y)∼D} [ L_T(h^{(L)}_o, y) + L_R(ϕ) ],  s.t. h^{(i)} = Ĝ^{(i)}_θ([ϕ^{(i)}; h^{(i−1)}]),  (2)

where Ĝ^{(i)}_θ represents the i-th layer forward propagation intervened by PETs, [•; •] is the concatenation operation, L_T is the terminal cost, and L_R is the running cost. Since |ϕ| ≪ |θ|, PETs can greatly reduce the tuning cost (more details in Appx. A). Typically, L_T corresponds to the prediction loss, and L_R can be seen as the regularizer on the PETs' parameters ϕ. However, in the optimal control literature, L_R depends not only on the control variables ϕ, but also on the controlled intermediate states {h^{(i)}}_{i=1}^{L}. In this paper, we show that additionally introducing the dependence on {h^{(i)}}_{i=1}^{L} into L_R makes the optimal control perspective of PETs more theoretically sound, and empirically leads to better PETs.

^1 Here we assume y ∈ V since a sample where y ∈ V^M can be decomposed into M samples: the i-th sample is ([x; y_{<i}], y_i) for auto-regressive language modeling or ([x; y_{−i}], y_i) for auto-encoding language modeling.
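Footnote 1's decomposition can be sketched in a few lines; this is an illustrative helper (the function name and the list-based token representation are our own, not from the released code):

```python
def decompose_autoregressive(x, y):
    """Split a sample (x, y) with a multi-token target y into M samples
    ([x; y_<i], y_i), following the auto-regressive case of footnote 1."""
    return [(x + y[:i], y[i]) for i in range(len(y))]

# The i-th sample conditions on x plus the first i target tokens.
samples = decompose_autoregressive(["the", "cat"], ["sat", "down"])
```

Each returned pair is one single-token prediction problem, so the single-token assumption y ∈ V loses no generality.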
[Figure 1: Illustration of the drift vector field of the latent diffusion bridge and the drift vector field defined by the learned mapping g_γ for PETs.]

Diffusion Bridges
A diffusion process X = (X_t)_{t∈[0,T]} is a continuous-time Markov process. For any t_a < t_b, the diffusion process is equipped with a transition Probability Density Function (PDF) p(t_b, b | t_a, a), which gives the probability density of reaching b at time t_b given that the process reached a at time t_a. A diffusion process is also the solution to an Itô SDE

dX_t = µ(t, X_t) dt + σ(t, X_t) dB_t,

where B_t is a standard Brownian motion, µ(•, •) is called the drift function, and σ(•, •) is called the diffusion function.
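The Itô SDE above can be simulated with the standard Euler-Maruyama scheme; the sketch below (function names and parameter values are our own) discretizes dX_t = µ(t, X_t) dt + σ(t, X_t) dB_t:

```python
import math
import random

def euler_maruyama(mu, sigma, x0, T=1.0, n_steps=1000, seed=0):
    """Simulate dX_t = mu(t, X_t) dt + sigma(t, X_t) dB_t on [0, T]
    with the Euler-Maruyama scheme; returns the sampled path."""
    rng = random.Random(seed)
    dt = T / n_steps
    t, x = 0.0, x0
    path = [(t, x)]
    for _ in range(n_steps):
        dB = rng.gauss(0.0, math.sqrt(dt))  # Brownian increment ~ N(0, dt)
        x = x + mu(t, x) * dt + sigma(t, x) * dB
        t += dt
        path.append((t, x))
    return path

# Example: an Ornstein-Uhlenbeck process dX_t = -0.5 X_t dt + 0.3 dB_t.
path = euler_maruyama(lambda t, x: -0.5 * x, lambda t, x: 0.3, x0=1.0)
```

The same routine simulates a diffusion bridge once the bridge's drift is supplied.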
A diffusion bridge X^{T;α,β} is a diffusion process conditioned on path observations at the two endpoints (0, α) and (T, β), i.e., X^{T;α,β}_0 = α and X^{T;α,β}_T = β. For simplicity, we assume α = 0 in this work and omit the superscript α. We consider two typical diffusion bridges, the Brownian bridge and the Ornstein-Uhlenbeck bridge (OU bridge). We present here the properties of the Brownian bridge and leave the properties of the OU bridge to Appx. B.

Proposition 2.1 (Properties of Brownian Bridge).
A Brownian bridge X^{T;β} with X^{T;β}_0 = 0 and X^{T;β}_T = β is the solution to the following SDE:

dX_t = ((β − X_t) / (T − t)) dt + dB_t,  X_0 = 0,  (3)

and its transition PDF from (0, 0) gives the Gaussian marginal X_t ∼ N(tβ/T, t(T − t)/T).  (4)

Diffusion bridges and SDEs are battle-tested tools for modeling the stochastic dynamics of complex systems in engineering (Sobczyk, 2013), finance (Wang and Sloan, 2011), etc. Since the dynamics of PLMs' hidden states are necessarily complex, diffusion bridges and SDEs serve as ideal tools for modeling them.
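As a quick numerical check of these standard Brownian-bridge properties (a simulation sketch with arbitrary step and sample counts, not code from the paper), simulating the bridge SDE with drift (β − X_t)/(T − t) reproduces the known marginal mean tβ/T and variance t(T − t)/T for T = 1:

```python
import math
import random

def brownian_bridge_path(beta, T=1.0, n_steps=200, rng=None):
    """Sample one Brownian bridge path from (0, 0) to (T, beta) via its SDE
    dX_t = (beta - X_t) / (T - t) dt + dB_t, using Euler-Maruyama."""
    rng = rng or random.Random()
    dt = T / n_steps
    x, t = 0.0, 0.0
    xs = [x]
    for _ in range(n_steps):
        drift = (beta - x) / (T - t)
        x += drift * dt + rng.gauss(0.0, math.sqrt(dt))
        t += dt
        xs.append(x)
    return xs

rng = random.Random(0)
paths = [brownian_bridge_path(beta=2.0, rng=rng) for _ in range(2000)]
mid = [p[100] for p in paths]                    # values at t = 0.5
mean_mid = sum(mid) / len(mid)                   # theory: t * beta = 1.0
var_mid = sum((v - mean_mid) ** 2 for v in mid) / len(mid)  # theory: t(1-t) = 0.25
```

Every sampled path also lands (up to one discretization step of noise) on the prescribed endpoint β.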

The Overall Framework
Building latent dynamics in the latent space. Since directly regularizing the intermediate states and constructing the running cost are non-trivial, we introduce a projection from the intermediate state space to a latent space, and leverage diffusion bridges as regularizers to construct the running cost. Specifically, we define an r-dimensional latent space U ⊆ R^r (r < d) and a learnable mapping g_γ : R^d × R^d → U, where γ denotes the parameters. At each layer of the PLM, g_γ projects the hidden state h^{(i)}_o and its context state h̄^{(i)} into the latent space U. Since h^{(i)}_o is contextualized while latent bridges are not, introducing the dependency on h̄^{(i)} informs g_γ about the context at the i-th layer and allows g_γ to decontextualize the hidden states. We simply take the average of the hidden states at the i-th layer, h̄^{(i)} = (1/N) Σ_{j=1}^{N} h^{(i)}_j, as the context. We define the latent states with discrete time as

u_D = {(t_{i+1}, g_γ(h^{(i)}_o, h̄^{(i)}))}_{i=0}^{L},  (5)

where t_{i+1} is the normalized layer index. We include the 0-th layer (input layer) because some PETs (e.g., prompt tuning) act on the 0-th layer. We use t_0 = 0 and t_{L+2} = 1 to represent the two endpoints. By using a natural cubic spline knotted at the points in u_D, we further obtain a continuous representation of the states in the latent space U:

u_C(t), t ∈ [0, 1], the natural cubic spline interpolation of the points in u_D.  (6)

Learning the mapping from hidden state space to latent space. Since adapting PLMs to downstream tasks can be seen as transferring the knowledge obtained from pre-training tasks to downstream tasks, we argue that the latent dynamics of intermediate hidden states for generating the same token y should be similar in both the pre-training and downstream tasks. Therefore, we train the mapping g_γ on the corpus that is used to pre-train the backbone PLM^2, and then apply the learned mapping to downstream tasks to encourage the latent dynamics to be similar to those in pre-training. Specifically, we assume that the states for generating the token y in the latent space U form a trajectory that is, with high probability, a path sampled from X^{1;β_y}, where X^{1;β_y} is the diffusion bridge describing the latent dynamics for generating y, and β_y is the tail endpoint of the bridge. More details of X^{1;β_y} will be discussed in Sec. 3.2.
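A minimal sketch of building the discrete latent states u_D: the toy mapping g and the uniform spacing t_{i+1} = (i + 1)/(L + 2) of the normalized layer indices are our own assumptions, chosen to be consistent with t_0 = 0 and t_{L+2} = 1 marking the two endpoints:

```python
def discrete_latent_states(layer_states, g):
    """Map the L+1 per-layer states (layer 0 = input layer) to time-stamped
    latent points {(t_{i+1}, g(h_o, h_bar))}. layer_states[i] is a pair
    (h_o, all_token_states) for layer i; h_bar is the token average."""
    L = len(layer_states) - 1
    u_D = []
    for i, (h_o, h_all) in enumerate(layer_states):
        # context state: average the hidden states of all N tokens
        h_bar = [sum(col) / len(col) for col in zip(*h_all)]
        t = (i + 1) / (L + 2)  # assumed uniform normalized layer index
        u_D.append((t, g(h_o, h_bar)))
    return u_D

# Toy example: 2-d hidden states, a toy g projecting to a 1-d latent value.
g = lambda h_o, h_bar: sum(h_o) + sum(h_bar)
states = [([1.0, 0.0], [[1.0, 0.0], [3.0, 2.0]]) for _ in range(5)]  # L = 4
u_D = discrete_latent_states(states, g)
```

In the paper's setting, g would be the learned three-layer MLP g_γ and the latent value would be an r-dimensional vector.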
On the corpus where the PLM is pre-trained, we fix the PLM and use its hidden states {h^{(i)}}_{i=0}^{L} to learn g_γ by maximizing the goodness of approximation for the latent states u under the bridge X^{1;β_y}:

max_γ E_{(x,y)} [ goodness(u, X^{1;β_y}) ],  (7)

where u can be u_D (Eq. 5) or u_C (Eq. 6) depending on the fitting method, and goodness(•, •) is a function that also depends on the choice of the fitting method, measuring how likely u is a sample trajectory of X^{1;β_y}. In Sec. 3.3, we will define this function alongside the fitting methods.
Regularizing PETs with latent dynamics. After learning g_γ with Eq. 7, we freeze γ and use the goodness function as the running cost in Eq. 2 for PETs on downstream tasks. The objective becomes

min_ϕ E_{(x,y)∼D} [ L_T(h^{(L)}_o, y) − α · goodness(u, X^{1;β_y}) ],  (8)

where the second term is the running cost and α is a hyper-parameter controlling the regularization intensity. By optimizing Eq. 8, PETs learn to predict y correctly while keeping the latent states at position o conformant to the diffusion bridge X^{1;β_y}. Note that introducing g_γ as the regularizer does not increase the number of trainable parameters for PETs during the training stage since γ is fixed, and since we only use the pre-training corpus, no extra information about downstream tasks is leaked. Moreover, the regularizer only helps train better PETs and does not intervene in inference.
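The combined objective (terminal cost minus α times the goodness of the latent trajectory) can be sketched as follows; the function names and toy goodness are hypothetical, and a real implementation would operate on tensors with automatic differentiation:

```python
import math

def regularized_pet_loss(logits, target, latent_states, bridge_goodness, alpha=0.1):
    """Sketch of a terminal-plus-running-cost objective: cross-entropy at the
    output position, plus a running cost. bridge_goodness returns a
    higher-is-better score, so it enters the loss with a negative sign."""
    # terminal cost: cross-entropy of the output distribution at position o
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))  # stable log-sum-exp
    terminal = log_z - logits[target]
    # running cost: penalize latent trajectories the bridge finds unlikely
    running = -bridge_goodness(latent_states)
    return terminal + alpha * running

# Toy goodness preferring small latent values (a stand-in for Eq. 9 or Eq. 11).
goodness = lambda u: -sum(v * v for _, v in u)
loss = regularized_pet_loss([2.0, 0.5], 0, [(0.5, 1.0)], goodness, alpha=0.1)
```

Setting alpha = 0 recovers the vanilla PET objective, matching how the regularization intensity is varied in the analyses.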

Determining Endpoints for Bridges
An intuitive approach to determining the endpoints of the diffusion bridge for each target token is to optimize the endpoints together with the mapping g_γ. However, optimizing the endpoints and g_γ jointly may admit a trivial solution: both endpoints become 0 ∈ R^r and g_γ always outputs 0. Since 0 is always a point in the sample path of such a degenerate diffusion bridge, the value of the goodness function can be meaninglessly high. Although sophisticated constraints could be imposed here, as the first work that uses diffusion bridges as regularizers, we simply pre-determine the endpoints and keep them fixed, and leave introducing constraints as future work.

Specifically, we apply principal component analysis (PCA) to the output token embedding matrix V ∈ R^{|V|×d} of the PLM, obtaining an r-dimensional embedding matrix, and re-normalize each row to have norm η. Let the resulting embedding matrix be β ∈ R^{|V|×r}. We then use 0 ∈ R^r as the heads of all the bridges, and the rows of β as the tails, i.e., the r-dimensional embedding of y in β is used as β_y in X^{1;β_y}. The intuition for using β as the tails is that the trajectories of the intermediate states for similar target tokens should be close. In V, similar tokens are close, and β obtained by PCA can well preserve the token similarity after dimension reduction.
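The endpoint construction can be sketched as follows; this is a from-scratch PCA via power iteration on a toy matrix (library-free purely for illustration), not the paper's implementation:

```python
import math
import random

def pca_project(V, r, eta, iters=500, seed=0):
    """Reduce the |V| x d embedding matrix to r dims with PCA (power
    iteration + deflation on the covariance), then renormalize each
    projected row to norm eta, yielding the bridge tails beta."""
    n, d = len(V), len(V[0])
    means = [sum(row[j] for row in V) / n for j in range(d)]
    X = [[row[j] - means[j] for j in range(d)] for row in V]
    # covariance matrix (d x d)
    C = [[sum(X[k][i] * X[k][j] for k in range(n)) / n for j in range(d)]
         for i in range(d)]
    rng = random.Random(seed)
    comps = []
    for _ in range(r):
        v = [rng.gauss(0, 1) for _ in range(d)]
        for _ in range(iters):  # power iteration for the top eigenvector
            w = [sum(C[i][j] * v[j] for j in range(d)) for i in range(d)]
            nrm = math.sqrt(sum(x * x for x in w))
            v = [x / nrm for x in w]
        lam = sum(v[i] * sum(C[i][j] * v[j] for j in range(d)) for i in range(d))
        comps.append(v)
        # deflate: remove the found component from the covariance
        C = [[C[i][j] - lam * v[i] * v[j] for j in range(d)] for i in range(d)]
    B = []
    for row in X:
        z = [sum(row[j] * c[j] for j in range(d)) for c in comps]
        nrm = math.sqrt(sum(x * x for x in z)) or 1.0
        B.append([eta * x / nrm for x in z])  # renormalize each row to norm eta
    return B

# Toy 4-token, 2-d embedding matrix reduced to r = 1 with eta = 2.0.
B = pca_project([[2.0, 0.1], [1.0, 0.2], [-1.0, 0.0], [-2.0, -0.1]], r=1, eta=2.0)
```

Each row of the returned matrix has norm η, so every bridge tail lies on a sphere of the chosen radius while preserving the dominant similarity structure of V.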

Fitting the Mapping g_γ
We use the Brownian bridge to illustrate the fitting of g_γ; the procedure can easily be adapted to the OU bridge.

Method 1: Approximating the Transition PDF. Generalizing Eq. 4 to high dimensions, we can derive the transition PDF from (0, 0) to (t_{i+1}, g_γ(h^{(i)}_o, h̄^{(i)})), where t_i has the same definition as in u_D (Eq. 5). To make g_γ approximate the transition PDF, we maximize the sum of the log-probabilities of u_D under the Brownian bridge X^{1;β_y}:

goodness(u_D, X^{1;β_y}) = −Σ_{i=0}^{L} ||g_γ(h^{(i)}_o, h̄^{(i)}) − t_{i+1} β_y||² / (2 t_{i+1}(1 − t_{i+1})) + C,  (9)

where C is a constant. Here, g_γ can be seen as a mapping from the hidden state space to the latent space that predicts the expectation of the Brownian bridge X^{1;β_y} at {t_{i+1}}_{i=0}^{L}.

Method 2: Approximating the SDE. Since the Brownian bridge is the solution to the SDE in Eq. 3, we let g_γ approximate the SDE. Solving the SDE requires continuous latent states, while we only have L + 1 discrete observations, so we use the continuous representation u_C introduced in Eq. 6. Generalizing Eq. 3 to high dimensions, the SDE approximated by g_γ can be defined as:

dY_t = g_γ(h_o(t), h̄(t), t) dt + dB_t,  (10)

where h_o(t) and h̄(t) denote the states taken along the continuous representation of Eq. 6, and B : [0, 1] → R^r is a standard r-dimensional Brownian motion.
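Method 1's goodness can be sketched for the Brownian bridge case: under X^{1;β_y}, the marginal at time t is Gaussian with mean tβ_y and per-dimension variance t(1 − t), so the log-probability of the discrete latent states (constants dropped) is, under these standard assumptions:

```python
def bridge_log_prob_goodness(u_D, beta_y):
    """Goodness of discrete latent states under a Brownian bridge from
    (0, 0) to (1, beta_y): sum of Gaussian log-densities of u_t around
    the bridge mean t * beta_y with variance t(1 - t), constants dropped."""
    total = 0.0
    for t, u in u_D:
        var = t * (1.0 - t)
        total += -sum((u[k] - t * beta_y[k]) ** 2
                      for k in range(len(u))) / (2.0 * var)
    return total

# Latent states lying exactly on the bridge mean reach the maximum value 0.
on_mean = [(0.25, [0.5]), (0.5, [1.0]), (0.75, [1.5])]
score = bridge_log_prob_goodness(on_mean, [2.0])
```

Any deviation from the mean trajectory makes the goodness strictly negative, which is what lets it serve as a running cost.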
Here, we additionally introduce the dependence on t for g_γ, since time information has been shown to be important in previous neural differential equation works (Zhang et al., 2020; Dupont et al., 2019). Following Li et al. (2020), when two SDEs share the same diffusion function, the KL divergence between the probability measures induced by the two SDEs is finite. Since the diffusion function σ ≡ I for both Eq. 10 and the multi-dimensional generalization of Eq. 3, the KL divergence between the probability measure µ_Y of Eq. 10 and µ_X of the generalized Eq. 3 can be estimated as

KL(µ_Y || µ_X) ≈ E [ ∫_0^1 (1/2) ||g_γ(h_o(t), h̄(t), t) − µ(t, u_C(t))||² dt ],  (11)

where µ(•, •) is the drift function of the pre-determined Brownian bridge X^{1;β_y}. We use the negative KL divergence as the goodness function to optimize the mapping g_γ. Here, g_γ can be seen as a mapping from the hidden state space to the latent space that approximates the drift vector field of the underlying Brownian bridge X^{1;β_y}.
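The running cost of Method 2 can be sketched as a quadrature estimate of this Girsanov-style KL term; the one-dimensional toy u_C, g, and the Brownian bridge drift µ(t, x) = (β_y − x)/(1 − t) below are our own simplification:

```python
def kl_running_cost(g, u_C, beta_y, n_steps=100):
    """Quadrature estimate of 0.5 * integral of || g(x, t) - mu(t, x) ||^2 dt
    along the continuous latent path x = u_C(t), where mu is the drift of
    the Brownian bridge to beta_y: mu(t, x) = (beta_y - x) / (1 - t)."""
    total = 0.0
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = (k + 0.5) * dt          # midpoint rule, stays away from t = 1
        x = u_C(t)
        mu = [(beta_y[j] - x[j]) / (1.0 - t) for j in range(len(x))]
        gt = g(x, t)
        total += 0.5 * sum((gt[j] - mu[j]) ** 2 for j in range(len(x))) * dt
    return total

beta = [2.0]
u_C = lambda t: [2.0 * t]                                   # the bridge mean path
g_match = lambda x, t: [(beta[0] - x[0]) / (1.0 - t)]       # exactly the bridge drift
kl_zero = kl_running_cost(g_match, u_C, beta)               # KL vanishes
```

When g reproduces the bridge drift exactly, the cost is zero; any other drift (e.g., the constant zero function) yields a strictly positive value.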

Experiments
To verify the effectiveness and generality of the regularizers built on stochastic bridges, we conduct experiments on (1) different PLMs: BERT large (340M) (Devlin et al., 2019) and Deberta xlarge (750M) (He et al., 2021); (2) different PETs: Prompt tuning, LoRA, BitFit and Adapter; (3) different diffusion bridges: Brownian bridge and OU bridge.We show that the regularizers effectively improve the performance on GLUE (Wang et al., 2019) under both full-set and few-shot settings.

Experimental Setups
Datasets. Since both BERT large and Deberta xlarge use Wikipedia and BookCorpus (Zhu et al., 2015) for pre-training, we use these two corpora to train g_γ. We report F1 for MRPC and QQP, Matthews correlation for CoLA, and accuracy for the other tasks. We report the average performance and the standard deviation on the development set over 3 different runs. We append [MASK] to each sequence and require the PLM to output the label word at [MASK] (e.g., negative or positive for SST-2). We exclude STS-B because it is a regression task.

Models and PETs. We use the checkpoint released by Shoeybi et al. (2019) for BERT large, and the official v1 checkpoint for Deberta xlarge. We use a simple three-layer MLP to build g_γ. For Prompt tuning, we use a soft prompt of length 20, appended to the end of each sequence. For LoRA, we apply it to the query and value projections of the attention modules. For Adapter, we apply it to the output of the attention and feed-forward modules. For BitFit, we tune all the bias terms in linear layers and layer normalization modules. Hereafter, we use "PDF regularizer" to refer to using g_γ fitted by approximating the transition PDF, "SDE regularizer" to refer to using g_γ fitted by approximating the SDE, and "vanilla x" to refer to the PET x without regularizers.

Hyper-parameters. Hyper-parameters are listed in Appx. E. We focus on the difference in performance between vanilla PETs and regularized PETs. Therefore, we set the hyper-parameters to common values from previous works and do not perform much hyper-parameter search, but we ensure the hyper-parameters for vanilla and regularized PETs are the same for a fair comparison.

Full-set Results
The experimental results for BERT large and Deberta xlarge are reported in Tab. 1 and Appx. C respectively. Due to space limitations, see Tab. 6 for the complete results including the OU bridge regularizers. The first line of each block in the table is the performance of the vanilla PET, and the remaining lines are the performances of the regularized PETs.
In general, both the Brownian and OU bridges, and both the PDF and SDE regularizers, improve the performance of PETs, showing the effectiveness of our proposed regularizers. In particular, for Prompt tuning, the SDE regularizer with either diffusion bridge yields an average performance improvement of more than 2%. We assume this is because Prompt tuning has far fewer trainable parameters than other PETs, and it only acts at the input layer, which is far from the supervision signal of the terminal cost L_T. Therefore, when provided with the regularization on the hidden states, the prompts receive more guidance and eventually reach a better local optimum.
Overall, the two diffusion bridges in our experiments do not show much difference. As for the two fitting methods, the SDE regularizer is generally more effective, especially for Prompt tuning where the number of trainable parameters is restricted. However, we also observe that the SDE regularizer is about 3 times slower than the PDF regularizer, which brings a trade-off between performance and efficiency. One can expect better performance by leveraging more sophisticated underlying stochastic bridges, exploring more reasonable endpoints for bridges, and designing a better mapping g_γ. As the first work using latent stochastic bridges as regularizers, we mainly consider the simplest cases and aim to show the potential of the approach.

Few-shot Results
In Tab. 1, the improvements are more substantial on the small datasets MRPC, CoLA and RTE. This is probably because in large datasets, the abundant data provides enough information to train high-quality PETs, while in small datasets, the data is insufficient and the regularizer can offer additional supervision. To validate this, we conduct experiments on GLUE under the few-shot setting.
The 16-shot results are shown in Tab. 2, and the results for the OU bridge, the results for 4-, 8-, and 32-shot, and the results for Deberta xlarge are placed in Appx. C. For all PETs, the SDE regularizer yields an improvement of more than 3%. In particular, the SDE regularizer on LoRA brings an improvement of 5.2%. Also, there is now a substantial boost on what were originally rich-resource datasets, such as MNLI, QQP and QNLI. The PDF regularizer also gives modest improvements. Though slightly inferior to the SDE regularizer, it is still satisfactory, considering that the PDF regularizer brings such a performance improvement with little computational cost. We additionally observe that the improvement is more significant on Deberta xlarge in Tab. 8, demonstrating the potential of our regularizers on larger models.

Analyses
To better understand the role played by our regularizers, we analyze the hidden states of PETs trained with and without regularizers, choosing Prompt tuning as a representative. By varying the hyper-parameter α in Eq. 8, we show that as the regularization intensity gets stronger, the clusters of hidden states corresponding to different labels become more distinguishable. We also show that the hidden states of vanilla PETs spontaneously approach the latent bridges in the latent space without knowing the bridges, indicating that there may exist intrinsically diffusion-bridge-like latent dynamics for PETs.

Distances between Labels Are Widened
We use the prompts obtained with and without regularizers on full-set GLUE, and record the intermediate hidden states {h^{(i)}_{[MASK]}}_{i=1}^{L}. We vary the regularization intensity by adjusting the coefficient α in Eq. 8 to inspect its impact on the hidden states. Note that when α = 0, this degenerates to the vanilla PET.
We randomly sample 100 samples for each label in MNLI, use UMAP (McInnes et al., 2018) to reduce the dimension of the last layer's hidden states of Prompt tuning and plot them in Fig. 2. It shows clearly that for both regularizers, as the regularization intensity gets stronger, the hidden states of the last layer become more distinguishable among labels.By looking at the axes of these plots, we find that the distances between the clusters generally increase when the regularization intensity is increased.We also notice that the SDE regularizer better helps separate the hidden states of the last layer by substantially enlarging the distance between the centroids of different labels, which could be one of the reasons why the SDE regularizer has better effectiveness in almost all experiments.
We also calculate Pearson's correlation between α and the average distance between the centroids of different clusters. The results are shown in Tab. 4. On all the datasets, α has a positive correlation with the average centroid distance, and on most of the datasets the correlation is significant (p-value < .05). This indicates that as the regularization intensity gets stronger, the centroids of different label clusters become more distant, which is a desired effect because the regularizer encourages the hidden states for different labels to conform to different latent bridges.
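The two quantities involved, Pearson's correlation and the average centroid distance, are straightforward to compute; below is a small self-contained sketch (not the paper's analysis code):

```python
import math

def pearson(xs, ys):
    """Pearson's correlation coefficient between two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def avg_centroid_distance(clusters):
    """Average pairwise Euclidean distance between label-cluster centroids;
    clusters is a list of point lists, one list per label."""
    cents = [[sum(p[j] for p in c) / len(c) for j in range(len(c[0]))]
             for c in clusters]
    dists = [math.dist(a, b) for i, a in enumerate(cents) for b in cents[i + 1:]]
    return sum(dists) / len(dists)
```

Correlating the α values against the corresponding avg_centroid_distance values reproduces the kind of analysis reported here.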

Hidden States Spontaneously Approach the Latent Bridges
An interesting phenomenon we observe is that the vanilla PETs' intermediate hidden states spontaneously approach our latent bridges when projected by our mapping g_γ. That is, applying g_γ to the hidden states of vanilla PETs, we find that as the performance of vanilla PETs improves, the average distance from g_γ(h^{(•)}_o, h̄^{(•)}) to our latent bridge gets smaller. Here, similar to Wang et al. (2022), we define the distance from g_γ(h^{(•)}_o, h̄^{(•)}) to its corresponding latent bridge X^{1;β_y} using Eq. 9 without the constant. Note that the vanilla PETs have no access to g_γ and the latent bridges during training, and g_γ also has no access to the PETs during its fitting.
We show the above phenomenon by conducting analyses in few-shot scenarios with PDF regularizer, and reporting in Tab. 3 the correlation between (1) the number of shots and the average distance from latent hidden states to latent bridges (2) the performance and the average distance from latent hidden states to latent bridges.We report Kendall's rank correlation for (1), and Pearson's correlation for (2).See Appx.D for the detailed setup.
From Tab. 3, the number of shots has a negative correlation with the distance, and the correlation is significant on 4 out of 6 datasets. This indicates that as the amount of available data increases for vanilla PETs, their intermediate hidden states in the latent space spontaneously approach the latent bridges, even without knowing the mapping g_γ and the bridges. Additionally, the results in Tab. 3 show a negative correlation between the performance of vanilla PETs and the distance to the latent bridges, and it is significant on 3 out of 6 datasets.

Table 3: Kendall's rank correlation between the number of shots and the distance to the latent bridges (Dist-Shot), and Pearson's correlation between the performance and the distance to the latent bridges (Dist-Perf).
Altogether, the two findings on correlation show that as the PETs' performance improves, their intermediate hidden states projected by g_γ spontaneously approach our bridges in the latent space. This implies that there exist intrinsically diffusion-bridge-like latent dynamics in PETs, and it also justifies our use of diffusion bridges as regularizers.

Related Works
Recent years have witnessed the success of PLMs (Raffel et al., 2020; Brown et al., 2020). However, as the sizes of PLMs continue to grow, it becomes increasingly impractical to perform fine-tuning on downstream tasks. Many efforts have been devoted to PETs, aiming to tune only a few parameters rather than the whole PLM on downstream tasks. For example, Prompt tuning (Lester et al., 2021) prepends tunable embeddings to the input, Adapter (Houlsby et al., 2019) inserts small modules into each layer, BitFit (Zaken et al., 2022) tunes only the bias terms, and LoRA (Hu et al., 2022) decomposes the weight updates into low-rank matrices. In this paper, based on the theoretical grounding of PETs in optimal control (Yang and Liu, 2022; Ding et al., 2022), we propose stochastic bridges as regularizers on intermediate hidden states and introduce regularized PETs.
Our work is also closely related to continuous-time neural differential equations (NDEs) (Chen et al., 2018; Rubanova et al., 2019; Li et al., 2019; Kidger et al., 2021). Continuous-time NDEs model the dynamics of hidden states with ODEs or SDEs parameterized by neural networks. Inspired by these works, we use SDEs to represent the latent dynamics of PETs in the latent space. Our work differs from them in that we use neural SDEs as regularizers for intermediate hidden states, rather than as feature extractors. We also notice that Wang et al. (2022) explore the use of the Brownian bridge for regularizing model dynamics across time. Our work differs from theirs in that we regularize the dynamics of intermediate hidden states across model layers, and we additionally show that other diffusion bridges can easily be applied as regularizers. As far as we know, we are the first to show the diffusion-bridge-like dynamics of hidden states across PLM layers and to use diffusion bridges as regularizers on intermediate hidden states.
Conclusion

Starting from the optimal control perspective of PETs, we notice that existing PETs lack a running cost that regularizes the intermediate hidden states. We thus propose to use stochastic bridges in a latent space as regularizers for PETs. Experimental results on different models, tasks and PETs show that the proposed regularizers effectively improve the PETs' performance. Our analyses further show that the hidden states of vanilla PETs spontaneously approach our diffusion bridges, indicating that there may exist intrinsically diffusion-bridge-like dynamics in PETs. As the first work using stochastic bridges as regularizers, we show their effectiveness and generality, and we believe this is a promising direction.

Limitations
Introducing the regularizers inevitably incurs additional computational cost in the training of PETs.To show their impact on the training speed, we plot the time-performance curves for both PDF and SDE regularizers on full-set GLUE in Figures 5, 6, 7 and 8.
Across the different PETs, the regularized PETs with the PDF regularizer have running times similar to the vanilla PETs. On the two large datasets, QQP and MNLI, regularized PETs with the SDE regularizer take about 2 to 3 times longer than vanilla PETs to achieve their best performance. However, on medium-sized (QNLI, SST-2) and small datasets (CoLA, MRPC, RTE), the time to achieve the best results with the SDE regularizer is comparable to that of vanilla PETs.
Overall, the PDF regularizer can effectively improve the performance of PETs without introducing much computational cost. The SDE regularizer is a good choice in scenarios that focus more on the inference-time performance of PETs and are less concerned about the longer training time, or when the dataset is small.
Our method does not introduce additional risks beyond the original risks of PETs.

A Background for Parameter-Efficient Tuning Methods
The large number of parameters in PLMs makes fine-tuning impractical; therefore, different PETs are proposed to mitigate the problem.
The current PETs can be categorized into three groups: addition-based, specification-based and reparameterization-based (Ding et al., 2022). To verify the generality of our method, we include one or two PETs from each category in this work, and we give a brief review of these PETs below.
Prompt Tuning is an addition-based PET. It prepends or appends trainable virtual tokens P ∈ R^{m×d} to each sequence x ∈ R^{n×d} to form a new input sequence [P; x] or [x; P] ∈ R^{(n+m)×d}, where n and m are the lengths of the original sequence and the virtual tokens respectively, and d is the embedding dimension. The virtual tokens P can either be continuous (Lester et al., 2021) or be restricted to embeddings of discrete tokens in the vocabulary (Gao et al., 2021).
Adapter (Houlsby et al., 2019) is an addition-based PET that inserts small bottleneck modules (a down-projection, a non-linearity, and an up-projection with a residual connection) into each layer.

B Properties of the OU Bridge

An OU bridge is built from an OU process dX_t = −qX_t dt + σ dB_t, where q is the mean-reversion coefficient and σ is the diffusion coefficient of the OU process. Its transition probability density function from (s, a) is Gaussian, with variance

(σ²/q) · sinh(q(T − t)) sinh(q(t − s)) / sinh(q(T − s)).  (12)
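Assuming the standard OU-bridge variance formula consistent with the sinh terms of Eq. 12 (our reconstruction), a small sketch verifies that the OU bridge recovers the Brownian bridge variance in the limit q → 0:

```python
import math

def ou_bridge_var(t, s=0.0, T=1.0, q=1.0, sigma=1.0):
    """Marginal variance of an OU bridge pinned at times s and T, following
    the sinh-form density: (sigma^2/q) * sinh(q(T-t)) sinh(q(t-s)) / sinh(q(T-s))."""
    return (sigma ** 2 / q) * math.sinh(q * (T - t)) * math.sinh(q * (t - s)) \
        / math.sinh(q * (T - s))

def brownian_bridge_var(t, s=0.0, T=1.0, sigma=1.0):
    """Marginal variance of a Brownian bridge pinned at times s and T."""
    return sigma ** 2 * (T - t) * (t - s) / (T - s)
```

Since sinh(x) ≈ x for small x, ou_bridge_var approaches brownian_bridge_var as q shrinks, which matches the observation that the two bridges behave similarly in the experiments.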

C Other Results for GLUE Experiments
In this section, we present the complete results including the OU bridge regularizer for Tab. 1 and Tab. 2. We also report the results for Deberta xlarge, and the results on few-shot GLUE for both BERT large and Deberta xlarge under 4-, 8-, 16-, and 32-shot settings. We observe that BERT large cannot give reasonable answers on CoLA: the Matthews correlations are around 0 for all the PETs and all the shot settings we experimented with. However, the situation improves for the larger model Deberta xlarge. Therefore, we exclude CoLA for BERT large and keep it for Deberta xlarge. We only select the Brownian bridge as the representative in this section, since the Brownian bridge and the Ornstein-Uhlenbeck bridge show no significant difference in Tab. 1.
In Tab. 6 and Tab. 5, we report the performance of the OU bridge regularizers. The experimental setups are the same as those of Tab. 1 and Tab. 2 respectively. The performances of the OU bridge and the Brownian bridge show no significant difference. In Tab. 7, we report the performance of Deberta xlarge on the full GLUE datasets. On all four PETs, the SDE regularizer outperforms the PDF regularizer, which is consistent with the results in Tab. 1. The results for 4-, 8-, and 32-shot for BERT large and Deberta xlarge are plotted in Fig. 3 and Fig. 4 respectively. For simplicity, we only plot the average performance for each PET. We report the results of the 16-shot experiments for Deberta xlarge in Tab. 8. The setup is almost the same as the experiment in Sec. 4. Notice that the SDE regularizer helps Deberta xlarge substantially on CoLA for most of the PETs, indicating that the SDE regularizer can effectively provide useful guidance when data is scarce and the task is hard.

D Calculation of Correlation in Sec. 5
In this section, we elaborate on how we calculate the correlations reported in Tab. 3.

D.1 Correlation between Number of Shots and Distance to Bridge
Definition D.1 (Tie). A pair of observations {(x_i, y_i), (x_j, y_j)} is defined as tied if x_i = x_j or y_i = y_j.
Since we generate the few-shot datasets using 5 random seeds for each shot, each PET has 5 results per shot. This yields observations with ties; e.g., the two distance observations for the first and second seeds under 8-shot, {(8, d_1), (8, d_2)}, are tied. To calculate the correlation for data with ties, the tau-b variant of Kendall's rank correlation is more suitable than Pearson's correlation. We therefore report Kendall's rank correlation between the number of shots and the distance from the hidden states to the latent bridges.
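The tie correction of tau-b can be made concrete with a small pure-Python implementation (in practice one would use an off-the-shelf routine such as scipy.stats.kendalltau, which computes tau-b by default; this sketch with hypothetical distance values is only illustrative):

```python
from math import sqrt
from itertools import combinations

def kendall_tau_b(xs, ys):
    """Kendall's tau-b: (nc - nd) / sqrt((n0 - n1) * (n0 - n2)),
    where n1 and n2 count pairs tied in x and in y respectively."""
    nc = nd = n1 = n2 = 0
    n0 = len(xs) * (len(xs) - 1) // 2
    for (xi, yi), (xj, yj) in combinations(zip(xs, ys), 2):
        dx, dy = xi - xj, yi - yj
        if dx == 0:
            n1 += 1          # tied in x (e.g., same number of shots)
        if dy == 0:
            n2 += 1          # tied in y
        if dx * dy > 0:
            nc += 1          # concordant pair
        elif dx * dy < 0:
            nd += 1          # discordant pair
    return (nc - nd) / sqrt((n0 - n1) * (n0 - n2))

# two seeds per shot: the shot values are tied within each shot
shots = [8, 8, 16, 16]
dists = [0.9, 0.8, 0.5, 0.4]   # hypothetical distances to the bridge
tau = kendall_tau_b(shots, dists)  # negative: distance shrinks as shots grow
```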

D.2 Correlation between Performance and Distance to Bridge
We mix all the few-shot results across different shots and different seeds to form observations of performance and the distances from the hidden states to the bridges, and then calculate Pearson's correlation.
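The pooled correlation is the standard sample Pearson formula; a minimal sketch with hypothetical numbers:

```python
import numpy as np

def pearson(perf, dist):
    """Sample Pearson correlation between pooled performances and distances."""
    p = np.asarray(perf, dtype=float) - np.mean(perf)
    d = np.asarray(dist, dtype=float) - np.mean(dist)
    return float(p @ d / np.sqrt((p @ p) * (d @ d)))

# results pooled over shots and seeds (hypothetical values)
performance = [55.0, 58.0, 61.0, 64.0]
distance = [0.9, 0.7, 0.6, 0.3]
r = pearson(performance, distance)
assert -1.0 <= r <= 1.0
```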
E Hyper-parameters

E.2 Training PETs on full-set GLUE
We run all the experiments for 50k steps and evaluate on the development set every 1k steps. We use a batch size of 32 for BERT large and 16 for Deberta xlarge. We choose a learning rate of 1e-3 for Prompt tuning and 1e-4 for the other PETs, for both PLMs. We use a weight decay of 0.01, a maximum gradient norm of 1.0, and no learning rate warm-up for all the experiments. We search for the best regularization strength α in {0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0} for the PDF regularizer, and in {0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 0.2, 0.5, 1.0} for the SDE regularizer. The best α values for the PDF regularizer are listed in Tab. 9, and those for the SDE regularizer in Tab. 10.
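The search itself is a plain grid search over the listed values; schematically (evaluate_dev is a hypothetical stand-in for training a PET with strength α and scoring it on the development set):

```python
PDF_ALPHAS = [0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
SDE_ALPHAS = [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 0.2, 0.5, 1.0]

def search_alpha(alphas, evaluate_dev):
    """Return the regularization strength with the best dev-set score."""
    scores = {a: evaluate_dev(a) for a in alphas}
    return max(scores, key=scores.get)

# toy stand-in: dev score peaks at alpha = 0.2
best = search_alpha(PDF_ALPHAS, lambda a: -(a - 0.2) ** 2)
assert best == 0.2
```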

E.3 Training PETs on few-shot GLUE
We run all the experiments for 1k steps and evaluate on the development set every 50 steps. For all shots, both regularizers, and both models, we use a batch size of 2. Other hyper-parameters are kept the same as in the experiments on full-set GLUE.

F Performance of Regularizers Trained with Tiny Corpus
Although we use the pre-training corpus to train our mapping g γ, the training is actually fast and data-efficient. We show that when using only 10,000 documents from the pre-training corpus (about 0.1% of the corpus), the obtained regularizers still perform well and are comparable to the regularizers trained on the whole pre-training corpus. We train the mapping g γ for 5,000 iterations with a batch size of 128. On a single NVIDIA A100 GPU, the training finishes within 1 hour for the PDF regularizer and 3 hours for the SDE regularizer. The cost of training our regularizers is quite small compared to the resources required for pre-training. We conduct the same experiments as in Sec. 4.2 and Sec. 4.3 with the regularizers trained on the tiny corpus. The results are presented in Tab. 11 and Tab. 12 respectively.
On few-shot GLUE, the impact of shrinking the corpus is relatively noticeable. But overall, the regularizers still perform well on all the PETs: the drops in performance are small compared to the boost they bring over the vanilla PETs.

Figure 1: An overview of our proposed latent stochastic bridge regularizer.
Few-shot Experiments. We randomly sample 2 × k examples from the original training set D_train for each class. The sampling is performed 5 times with different seeds to form 5 training and development sets {(D^{(i)}_train, D^{(i)}_dev)}^{5}_{i=1}, with each being k-shot. Each time, we train PETs on D^{(i)}_train, select the best model on D^{(i)}_dev, and report its performance on the original development set D_dev.
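The sampling procedure above can be sketched as follows (a simplified version; dataset here is assumed to be a list of (example, label) pairs):

```python
import random
from collections import defaultdict

def sample_k_shot(dataset, k, seed):
    """Draw 2*k examples per class; split into a k-shot train set
    and a k-shot dev set for one seed."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for example, label in dataset:
        by_label[label].append((example, label))
    train, dev = [], []
    for label in sorted(by_label):
        picked = rng.sample(by_label[label], 2 * k)  # without replacement
        train += picked[:k]
        dev += picked[k:]
    return train, dev

data = [(i, i % 2) for i in range(100)]  # toy binary-labeled dataset
# 5 seeds -> 5 distinct 16-shot train/dev splits
splits = [sample_k_shot(data, k=16, seed=s) for s in range(5)]
assert all(len(tr) == 32 and len(dv) == 32 for tr, dv in splits)
```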
Adapter (Houlsby et al., 2019) is an addition-based PET. It inserts a two-layer MLP after the attention module and the feed-forward module at each layer. Denote h ∈ R^d as the input of Adapter, r as the intermediate dimension of Adapter's MLP, W_d ∈ R^{r×d} and W_u ∈ R^{d×r} as the down-projection and up-projection of Adapter, and σ as the activation function. Then the computation of Adapter can be formulated as h ← W_u σ(W_d h) + h.
BitFit (Zaken et al., 2022) is a specification-based PET. It specifies the bias terms in the layer normalization modules and the linear transformation modules as trainable.
LoRA (Hu et al., 2022) is a reparameterization-based PET. It assumes that, when training the model, the updates ∆W to the model's pre-trained parameters W ∈ R^{d×k} are low-rank, and thus reparameterizes the ∆W of each matrix in the attention module with a low-rank decomposition ∆W = BA, where B ∈ R^{d×r} and A ∈ R^{r×k}. For a forward pass h = Wx, the computation of LoRA can be written as h = (W + ∆W)x = Wx + BAx.

B Properties of the Ornstein-Uhlenbeck Bridge
Proposition B.1 (Properties of the Ornstein-Uhlenbeck Bridge). An Ornstein-Uhlenbeck bridge X^{T;β} pinned at X^{T;β}_0 = 0 and X^{T;β}_T = β is the solution to the following SDE:
dX_t = q (β − X_t cosh(q(T − t))) / sinh(q(T − t)) dt + σ dW_t,
where q is the mean-reversion coefficient and σ is the diffusion coefficient of the OU process. The transition probability density function is the Gaussian
p(X_t = x | X_s) = N(x; [sinh(q(T − t)) X_s + sinh(q(t − s)) β] / sinh(q(T − s)), (σ²/q) sinh(q(T − t)) sinh(q(t − s)) / sinh(q(T − s))). (12)
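As a numerical sanity check, an Ornstein-Uhlenbeck bridge variance of the form (σ²/q)·sinh(q(T−t))·sinh(q(t−s))/sinh(q(T−s)), as in Eq. (12), reduces to the Brownian bridge variance σ²(T−t)(t−s)/(T−s) in the limit q→0, consistent with the Brownian bridge used elsewhere in the paper:

```python
import math

def ou_bridge_var(s, t, T, q, sigma):
    """Variance of the OU bridge at time t given X_s (Eq. (12) form)."""
    return (sigma ** 2 / q) * math.sinh(q * (T - t)) * math.sinh(q * (t - s)) \
        / math.sinh(q * (T - s))

def brownian_bridge_var(s, t, T, sigma):
    """Variance of the Brownian bridge at time t given X_s."""
    return sigma ** 2 * (T - t) * (t - s) / (T - s)

# as q -> 0, sinh(qx) ~ qx, so the OU bridge variance approaches the
# Brownian bridge variance
s, t, T, sigma = 0.0, 0.3, 1.0, 1.0
for q in (1e-2, 1e-4):
    diff = abs(ou_bridge_var(s, t, T, q, sigma) - brownian_bridge_var(s, t, T, sigma))
    assert diff < 1e-3
```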

Figure 3: The average BERT large few-shot GLUE results trained with different PETs under different shots. The results are averaged across 5 different seeds and the error bars indicate the 95% confidence intervals. The SDE regularizer consistently outperforms the baseline PDF regularizer.

Figure 4: The average Deberta xlarge few-shot GLUE results trained with different PETs under different shots. The results are averaged across 5 different seeds and the error bars indicate the 95% confidence intervals.
The visualization of the last layer's hidden states on MNLI using the prompt trained (a) without a regularizer, (b-d) with the SDE regularizer, and (e-g) with the PDF regularizer. The table in (h) shows the Pearson's correlation between the regularization strength α and the average distance between the centroids of different label clusters.
The hyper-parameters are listed in Appx. E. The SDE regularizer outperforms the PDF regularizer on most of the PETs except Prompt tuning.

Table 7: The results on GLUE for Deberta xlarge.

Table 9: Best α for the PDF regularizer on full-set GLUE.

Table 10: Best α for the SDE regularizer on full-set GLUE.