Inference-Time Policy Adapters (IPA): Tailoring Extreme-Scale LMs without Fine-tuning

Large language models excel at a variety of language tasks when prompted with examples or instructions. Yet controlling these models through prompting alone is limited. Tailoring language models through fine-tuning (e.g., via reinforcement learning) can be effective, but it is expensive and requires model access. We propose Inference-time Policy Adapters (IPA), which efficiently tailors a language model such as GPT-3 without fine-tuning it. IPA guides a large base model during decoding time through a lightweight policy adapter trained to optimize an arbitrary user objective with reinforcement learning. On five challenging text generation tasks, such as toxicity reduction and open-ended generation, IPA consistently brings significant improvements over off-the-shelf language models. It outperforms competitive baseline methods, sometimes even including expensive fine-tuning. In particular, tailoring GPT-2 with IPA can outperform GPT-3, while tailoring GPT-3 with IPA brings a major performance boost over GPT-3 (and sometimes even over GPT-4). Our promising results highlight the potential of IPA as a lightweight alternative to tailoring extreme-scale language models.

Resource-intensive fine-tuning, through supervised learning and, more recently, reinforcement learning (RL), has shown promise in tailoring language models to arbitrary objectives. However, fine-tuning requires accessing and updating a model's parameters, which are extremely large or altogether inaccessible in state-of-the-art models (e.g., GPT-4; OpenAI, 2023b). This makes fine-tuning infeasible for the broader community.
Alternatively, inference-time algorithms can tailor a language model without access to its parameters. These algorithms adjust the model's output distribution based on task-specific heuristics, while leaving the underlying model untouched. However, inference-time heuristics are traditionally hand-designed for specific tasks (e.g., Lu et al., 2021, 2020; Liu et al., 2021a; Yang and Klein, 2021; Qin et al., 2022a), and inference-time methods are often less effective than fine-tuning a model with reinforcement learning (Lu et al., 2022a).
Drawing inspiration from RL and inference-time techniques, we propose Inference-time Policy Adapters (IPA), which tailors a base language model at inference time toward arbitrary task-specific objectives without the need to fine-tune it. To do so, IPA combines a large base language model's output distribution with that of a smaller-sized model (an adapter policy), and optimizes the combined distribution towards a given objective with reinforcement learning. IPA uses two key ideas to make learning efficient. First, IPA only updates the adapter's parameters, avoiding the need to update the base LM. Second, IPA replaces the base model with an approximate policy, a smaller model that approximates the base model's distribution. The approximate policy is either a smaller model from the same language model family or a distilled version of the base model. At inference time, we use the combined distribution of the base model and the trained policy adapter.
Experiments across five challenging text generation tasks show that IPA brings consistent improvements over off-the-shelf language models and competitive baselines, sometimes even including expensive fine-tuning. In particular, tailoring GPT-2 with IPA can outperform GPT-3, while tailoring GPT-3 with IPA brings a major performance boost over GPT-3 (and sometimes even over GPT-4). Our results highlight IPA as a lightweight alternative for tailoring large language models to a wide range of objectives. IPA opens new ways to augment or customize LLMs using only academic-level resources.

Background
In this section, we introduce our text generation setting (§2.1) and give a brief background on tailoring language models with reinforcement learning (§2.2). We then introduce our IPA algorithm for tailoring large language models without fine-tuning (§3).

Problem Setting
Text generation is the task of generating an output sequence $y$ given an input sequence $x$. We consider standard autoregressive language models, which decompose a sequence's probability as $p_\theta(y \mid x) = \prod_{t=1}^{|y|} p_\theta(y_t \mid y_{<t}, x)$, where $p_\theta$ is a neural network with parameters $\theta$. Intuitively, our goal is to 'tailor' a pretrained model $p_\theta$ towards a user-specified objective (e.g., safety). Concretely, we assume that the objective is quantified by a reward function $R(y) \in \mathbb{R}$. We then aim to adjust $p_\theta$ so that its generated sequences have high reward and reasonable language quality (e.g., fluency).
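To make this factorization concrete, the following minimal sketch scores a continuation under an autoregressive LM by summing token-level log-probabilities. It assumes the Hugging Face transformers library and a GPT-2 checkpoint, both purely illustrative choices rather than part of our method:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def sequence_logprob(prompt: str, continuation: str) -> float:
    # log p(y|x) = sum_t log p(y_t | y_<t, x)
    x = tok(prompt, return_tensors="pt").input_ids
    y = tok(continuation, return_tensors="pt").input_ids
    ids = torch.cat([x, y], dim=1)
    with torch.no_grad():
        logits = lm(ids).logits                       # (1, seq_len, vocab)
    logp = torch.log_softmax(logits[0, :-1], dim=-1)  # position t-1 predicts token t
    targets = ids[0, 1:]
    token_logps = logp[torch.arange(len(targets)), targets]
    return token_logps[x.shape[1] - 1:].sum().item()  # continuation tokens only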

Preliminary: Tailoring LMs with RL
Online policy-based reinforcement learning has emerged as an effective way to adjust a language model towards a reward function. Formally, these algorithms (e.g., PPO (Stiennon et al., 2022), Quark (Lu et al., 2022b), or NLPO (Ramamurthy* et al., 2023)) optimize a language model $p_\theta$ towards generating outputs $y$ that maximize a given reward $R$, often along with regularization to maintain language quality. At a high level, these algorithms use a policy $p_\theta$ to collect input-output examples, score the outputs with a reward function $R$, and produce an optimized policy $p_{\theta^\star}$. Formally,

$$p_{\theta^\star} = f_{\mathrm{RL}}(p_\theta, R; \theta'). \quad (1)$$

Here $\theta' \subseteq \theta$ denotes the subset of $p_\theta$'s parameters that are updated by the algorithm. The key idea behind IPA is to use a full model $p_\theta$ to collect examples, but update a small set of parameters $\theta'$.
Inference-time Policy Adapters (IPA)

We introduce Inference-time Policy Adapters (IPA), a lightweight approach to tailoring language models toward a user-specified objective. IPA trains a small adapter policy that adjusts the outputs of a (larger) base model at inference time in order to maximize a reward. In doing so, IPA avoids the cost of updating the large base model, without the need to hand-design inference-time heuristics.

Policy Adaptation
First, we introduce the notion of 'tailoring' used by IPA, which mainly involves three policies. First, IPA starts with a base policy $p_\theta$, which is the language model to tailor. Second, IPA introduces an adapter policy $p_\phi$, which is a language model with the same output space as the base policy (i.e., vocabulary), but different parameters $\phi$. Finally, IPA combines the base and adapter policies into a tailored policy:

Definition 1 (Tailored policy). The tailored policy $p_{\theta \leftarrow \phi}$ combines the distributions of the base policy $p_\theta$ and the adapter policy $p_\phi$:

$$p_{\theta \leftarrow \phi}(y_t \mid y_{<t}, x) = \frac{1}{Z} \, p_\theta(y_t \mid y_{<t}, x) \cdot p_\phi(y_t \mid y_{<t}, x),$$

where $Z$ is a normalization factor.
The tailored policy is a product of experts (Hinton, 2002), which amounts to multiplying the next-token probabilities from the base and adapter policies, then normalizing the result. IPA's tailored policy has two key properties. First, it allows for adjusting the base policy's output without direct access to the base policy's parameters. This is critical for tailoring modern LLMs that expose the model's output distribution but not the model's parameters. Second, the policy adapter can be a much smaller model (i.e., $|\phi| \ll |\theta|$). This provides an efficient way to tailor a large base model.
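For illustration, the product-of-experts combination reduces to adding next-token log-probabilities and renormalizing. A minimal PyTorch sketch (tensor shapes are illustrative):

import torch

def tailored_next_token_dist(base_logits, adapter_logits):
    # base_logits, adapter_logits: (vocab,) next-token logits from the frozen
    # base policy p_theta and the adapter policy p_phi.
    log_p_base = torch.log_softmax(base_logits, dim=-1)
    log_p_adapter = torch.log_softmax(adapter_logits, dim=-1)
    combined = log_p_base + log_p_adapter   # log of the product of experts
    return torch.softmax(combined, dim=-1)  # renormalization plays the role of Z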

Adapter Training with RL
Our goal is to adjust the tailored policy towards a user-specified objective. The key idea in IPA is to train the tailored policy to optimize a reward with reinforcement learning, while only updating the parameters of the adapter policy.
Concretely, we use a reinforcement learning algorithm $f_{\mathrm{RL}}$ (Eqn. 1) to optimize the tailored policy $p_{\theta \leftarrow \phi}$ with a reward function $R$. Notably, we keep the base policy's parameters ($\theta$) frozen, and only update the adapter policy's parameters ($\phi$). That is,

$$p_{\theta \leftarrow \phi^\star} = f_{\mathrm{RL}}(p_{\theta \leftarrow \phi}, R; \phi).$$

Intuitively, the adapter policy $p_\phi$ learns to rescale the frozen base policy $p_\theta$, yielding a policy that is 'tailored to' the reward. Notice that our framework does not depend on a specific RL algorithm. As we will demonstrate later, IPA proves to be effective when paired with three different RL algorithms (Lu et al., 2022b; Schulman et al., 2017; Ramamurthy et al., 2023), and in principle, it can easily integrate with others.
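To make the parameter-freezing concrete, below is a minimal REINFORCE-style sketch of one adapter update. This is a simplification for exposition, not the Quark/PPO/NLPO training used in our experiments; the names base, adapter, and reward_fn are illustrative assumptions:

import torch

for p in base.parameters():
    p.requires_grad_(False)                    # the base policy stays frozen
opt = torch.optim.Adam(adapter.parameters(), lr=1e-5)

def train_step(prompt_ids, max_new_tokens=32):
    ids, logps = prompt_ids.clone(), []
    for _ in range(max_new_tokens):
        with torch.no_grad():                  # no gradient through the base
            base_logits = base(ids).logits[:, -1]
        adapter_logits = adapter(ids).logits[:, -1]
        log_dist = torch.log_softmax(          # tailored policy p_{theta<-phi}
            torch.log_softmax(base_logits, -1)
            + torch.log_softmax(adapter_logits, -1), dim=-1)
        tok_id = torch.multinomial(log_dist.exp(), 1)
        logps.append(log_dist.gather(-1, tok_id))
        ids = torch.cat([ids, tok_id], dim=-1)
    reward = reward_fn(ids)                    # score the sampled output with R
    loss = -reward * torch.stack(logps).sum()  # policy gradient w.r.t. phi only
    opt.zero_grad(); loss.backward(); opt.step()

Because the base policy's forward pass runs under no_grad, the optimizer touches only the adapter's parameters $\phi$.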
Approximate Policy. When the base model is extremely large (e.g., GPT-3), its forward pass is too costly to be used in the RL training loop. To overcome this, we propose using an approximate policy in IPA.
Definition 2 (Approximate policy). The approximate policy $p_{\tilde\theta}$ is a smaller-sized neural model, parameterized by $\tilde\theta$, that approximates the distribution of the base policy and replaces it in the RL-based adapter training:

$$p_{\tilde\theta \leftarrow \phi^\star} = f_{\mathrm{RL}}(p_{\tilde\theta \leftarrow \phi}, R; \phi).$$

In practice, we can obtain an approximate policy in two different ways. First, we can use a smaller pre-trained language model from the same model family. We do this if the smaller model has similar conditional generation behavior to the base policy. For instance, we use an off-the-shelf GPT2-XL as the approximate policy to tailor GPT-3 in open-ended generation. Alternatively, we can use a distilled base policy as the approximate policy. A distilled base policy is a language model trained on generations from the base policy,

$$\tilde\theta = \arg\max_{\tilde\theta} \; \mathbb{E}_{y \sim p_\theta(\cdot \mid x)} \left[ \log p_{\tilde\theta}(y \mid x) \right],$$

known as sequence-level knowledge distillation (Kim and Rush, 2016; West et al., 2022). For example, to tailor GPT-3 for lexically constrained generation, we tune GPT2-XL on prompt-generation pairs from GPT-3 to get a distilled base policy.
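As a minimal sketch of the distillation route, the following collects generations from the base model (via a hypothetical wrapper query_base_model, standing in for API calls) and fine-tunes a smaller LM on the resulting prompt-generation pairs with the standard LM loss:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2-xl")
student = AutoModelForCausalLM.from_pretrained("gpt2-xl")
opt = torch.optim.Adam(student.parameters(), lr=1e-5)

prompts = ["The dog chased the"]                     # illustrative prompt set
pairs = [(x, query_base_model(x)) for x in prompts]  # y ~ p_theta(.|x), via API

for prompt, generation in pairs:
    ids = tok(prompt + generation, return_tensors="pt").input_ids
    # The standard LM loss maximizes log p_student; a fuller version would
    # mask the prompt tokens so only the generation contributes to the loss.
    loss = student(ids, labels=ids).loss
    opt.zero_grad(); loss.backward(); opt.step()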
IPA at Inference Time. At inference time, IPA uses the tailored policy $p_{\theta \leftarrow \phi}$ for decoding. Namely, at each time step we obtain the next-token distribution from the tailored policy $p_{\theta \leftarrow \phi}(y_t \mid y_{<t})$, which can then be used with a standard decoding algorithm (e.g., nucleus sampling).
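Putting the pieces together, here is a minimal decoding sketch that combines the two next-token distributions at each step and applies nucleus filtering; it reuses tailored_next_token_dist from the earlier sketch, and the hyperparameters are illustrative:

import torch

def nucleus_sample(dist, p=0.9):
    probs, idx = torch.sort(dist, descending=True)
    keep = torch.cumsum(probs, dim=-1) - probs < p  # smallest set with mass >= p
    probs = probs * keep
    probs = probs / probs.sum()
    return idx[torch.multinomial(probs, 1)]

def decode(base, adapter, ids, steps=32, p=0.9):
    for _ in range(steps):
        with torch.no_grad():
            dist = tailored_next_token_dist(
                base(ids).logits[0, -1], adapter(ids).logits[0, -1])
        next_id = nucleus_sample(dist, p).view(1, 1)
        ids = torch.cat([ids, next_id], dim=-1)
    return ids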

Toxicity Reduction
Language models are susceptible to generating toxic completions, even when prompted with seemingly innocuous text (Gehman et al., 2020). Here, we investigate the extent to which IPA is effective at reducing the toxicity of language models in an open-ended setting.
Datasets and Metrics. The task is to generate a fluent continuation $y$ while avoiding offensive content for a given prompt $x$. We evaluate on the REALTOXICITYPROMPTS benchmark (Gehman et al., 2020), which contains 100k prompts designed to elicit toxic generations. Following the experimental setup of Liu et al. (2021b), we use the Perspective API to measure maximum toxicity, defined as the average maximum toxicity over 25 sampled generations, and the (empirical) toxicity probability of at least 1 out of 25 generations being toxic. In addition, we report fluency as the perplexity of generated output according to an off-the-shelf GPT2-XL model, and diversity as the count of unique n-grams normalized by the length of text.
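For concreteness, given a matrix of Perspective API scores for the 25 sampled continuations per prompt, the two toxicity metrics can be computed as in the following sketch (the 0.5 threshold for calling a generation toxic follows common practice and is an assumption here):

import numpy as np

def toxicity_metrics(scores: np.ndarray):
    # scores: (num_prompts, 25) Perspective API toxicity of sampled continuations
    per_prompt_max = scores.max(axis=1)
    max_toxicity = per_prompt_max.mean()      # average maximum toxicity
    tox_prob = (per_prompt_max > 0.5).mean()  # P(>= 1 of 25 generations is toxic)
    return max_toxicity, tox_prob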

Setup and Baselines
We apply IPA to tailor two types of base policy: off-the-shelf GPT-2 and GPT-3. To tailor GPT-2, we use the base policy itself in adapter training, referred to as IPA(GPT-2). For tailoring GPT-3, we use an off-the-shelf GPT-2 and a distilled GPT-3 as the approximate policy for adapter training, labeled as IPA-(GPT-3) and IPA*(GPT-3), respectively. Notice that IPA-(GPT-3) is equivalent to directly applying the policy adapter trained to tailor GPT-2 on top of GPT-3. In all these scenarios, we initialize the policy adapter with a pre-trained GPT2-large model.
We use Quark as the RL algorithm to optimize the policy adapter, and conduct an ablation study to assess the effect of using different RL algorithms in adapter training. As the reward function, we use the Perspective API score, which measures the toxicity of the completed sequence. During inference, we use nucleus sampling with p = 0.9 to generate 25 samples for all baselines.
Results. Table 1 shows the results of the toxicity reduction task. For tailoring GPT-2, IPA outperforms all learning-based and decoding-based methods on the toxicity score while maintaining language quality (i.e., fluency and diversity). Notably, compared to previous methods, IPA better preserves the fluency of the base policy, suggesting that the adapter policy effectively injects the desired quality into the base policy while retaining its existing strengths.
As for tailoring GPT-3, we found that applying the policy adapter optimized for GPT-2 directly on top of GPT-3 (denoted as IPA-) leads to a noticeable reduction in toxicity while maintaining language quality, demonstrating the adaptability and reusability of IPA. We observed further improvement when using distilled GPT-3 as the approximate policy in adapter training to tailor GPT-3 (denoted as IPA*).
In both cases, our method significantly outperforms all previous baselines for controlling GPT-3, especially the costly domain-adaptive training (DAPT), which exhaustively fine-tunes GPT-3 on a non-toxic corpus. Our results highlight the promise of IPA as a cost-efficient way to align large language models with user-specified objectives without the need for fine-tuning. Finally, as shown in Table 2, ablation studies on using different RL algorithms for optimizing the policy adapter show that IPA can smoothly accommodate different RL algorithms, all leading to better performance compared to other baselines.

Lexically Constrained Generation
Next, we evaluate the effectiveness of IPA in lexically constrained generation, where, given a set of constraint words, the model needs to generate a sentence that includes all the given constraints. While prior work on this task evaluated the order-invariant satisfaction of the constraints, i.e., a generation was deemed correct when it simply includes all the keywords (Lin et al., 2020; Lu et al., 2020), we consider a more challenging setup of ordered lexical constraints, where a generation is considered correct only when it includes all the keywords in the order specified in the input prompt.
Datasets and Metrics. We use CommonGen (Lin et al., 2020), a dataset for generative commonsense reasoning where the task is to generate a coherent sentence given a set of concept words. To evaluate both the inclusion and the order of the given constraints, we deliberately instruct the models to generate a sentence with the given keywords while following the order in which they appear in the input prompt.
For automatic evaluation, we gauge constraint satisfaction with coverage, a binary metric that counts a generation as correct only when it includes all the given words and matches the order in which they appear (a sketch of this metric follows below). We also measure the fluency of each generation using a critic model fine-tuned on CoLA (Warstadt et al., 2019). For human evaluation, we assess the quality and plausibility of model generations for 100 randomly sampled test examples on a 3-point Likert scale; see details in Appendix Figure 5.
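A minimal sketch of this ordered-coverage metric (exact string matching only; a fuller version would also credit inflected forms of the keywords):

def ordered_coverage(generation: str, keywords: list[str]) -> int:
    # 1 only if every keyword occurs, in the order given in the prompt.
    pos = 0
    text = generation.lower()
    for kw in keywords:
        idx = text.find(kw.lower(), pos)
        if idx == -1:
            return 0  # keyword missing, or found out of order
        pos = idx + len(kw)
    return 1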
Setup and Baselines. As we will demonstrate later, GPT-3 (text-davinci-003) is surprisingly poor at faithfully following ordered lexical constraints, even after being given explicit instructions. Our goal is to tailor GPT-3 to be more reliable for constraint satisfaction in the zero-shot setting. We use distilled GPT-3, a GPT2-XL fine-tuned on prompt-output pairs from GPT-3's zero-shot generations on the CommonGen train set, as the approximate policy for adapter training, since an off-the-shelf GPT-2 cannot perform lexically constrained generation out of the box. We initialize the policy adapter with a pre-trained GPT2-large model. We use Quark as the RL algorithm for adapter optimization and choose our reward to be the product of the coverage score of the lexical constraints and the fluency score inferred by the CoLA model, promoting fluency while satisfying the ordered lexical constraints. We compare IPA with its base policy GPT-3, as well as the more advanced LLMs GPT-3.5 and GPT-4 (OpenAI, 2023a). As a strong supervised baseline, we also fine-tune GPT-3 on the CommonGen train set, which contains human-written outputs with the correct lexical order, denoted as GPT-3-SFT.
Results. We present the experimental results in Table 3. Surprisingly, powerful language models such as GPT-3 struggle to faithfully follow ordered lexical constraints even after being given explicit instructions. IPA leads to remarkable improvements on top of GPT-3 and surpasses more advanced models such as GPT-3.5 and GPT-4 in terms of constraint coverage, while achieving better or comparable generation quality. Noticeably, IPA outperforms fine-tuned GPT-3 in both constraint coverage and generation quality at a fraction of the cost: while fine-tuning GPT-3 costs $156.82, training a distilled GPT-3 as the approximate policy requires only $28.59 for generating outputs from GPT-3. Our results highlight the potential of IPA as a cost-efficient way to enhance the capability of large language models without the need for fine-tuning.

Open-ended Generation
To evaluate the effectiveness of IPA in a general setting, we conduct experiments on an open-ended generation task, following the experimental setup of Li et al. (2022b). The goal is to make machine-generated content more fluent, coherent, and human-like.
Datasets and Metrics. We experiment on the news domain using the XSum dataset (Narayan et al., 2018). Following Li et al. (2022b), we filter out news articles with fewer than 160 tokens. We use the first 32 words as our input prompt and generate 84 tokens as continuations. We evaluate using both automatic metrics and pairwise human evaluation. For automatic evaluation, we use aggregate n-gram diversity and coherence scores (Li et al., 2022b), as well as MAUVE (Pillutla et al., 2021), which measures the distributional similarity between the set of human-written gold texts and machine-generated texts. To measure the human-likeness of generated texts, we employ the OpenAI detector, a classifier for distinguishing AI- vs. human-written text. We use the classifier's probability assigned to 'human' text as an additional metric, denoted as Critic. For human evaluation, we randomly sample 100 test examples and generate continuations using different models. We conduct pairwise comparisons (of our method against baselines) on coherence and fluency using Amazon Mechanical Turk; see details in Appendix Figure 4.
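As a sketch, the aggregate n-gram diversity score can be computed as the product, over several n, of the fraction of unique n-grams, following the formulation in prior work; the choice of n = 2..4 here is an assumption:

def ngram_diversity(tokens: list[str], ns=(2, 3, 4)) -> float:
    # Product over n of (# unique n-grams / # n-grams); higher = more diverse.
    score = 1.0
    for n in ns:
        ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        score *= len(set(ngrams)) / max(len(ngrams), 1)
    return score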
Setup and Baselines. We apply IPA to tailor two types of base policy: off-the-shelf GPT2-XL and GPT-3. To tailor GPT-2, we use the base policy itself in adapter training, referred to as IPA(GPT-2). For tailoring GPT-3, we use an off-the-shelf GPT2-XL and a distilled GPT-3 as the approximate policy for adapter training, denoted as IPA-(GPT-3) and IPA*(GPT-3), respectively. Notice that IPA-(GPT-3) is equivalent to directly applying the policy adapter trained to tailor GPT-2 on top of GPT-3. In all these scenarios, we initialize the policy adapter with a pre-trained GPT2-large model, and use Quark as the RL algorithm to optimize the policy adapter. As the reward function, we use the product of the diversity, coherence, and critic scores described above. For tailoring GPT-2, we compare decoding with IPA against six decoding strategies: greedy, top-k sampling (k = 50), nucleus sampling (p = 0.95), typical sampling (τ = 0.95) (Meister et al., 2023), SimCTG (Su et al., 2022), and contrastive decoding (Li et al., 2022b). The latter three methods are designed to improve the coherence and naturalness of the generated text. For all decoding baselines, we use the hyperparameters recommended in the respective papers. For tailoring GPT-3, we compare decoding with IPA against GPT-3's default decoding strategy, nucleus sampling (p = 0.95). Other decoding methods are not applicable to GPT-3 due to its limited access through the API interface.
Results. As shown in the top section of Table 4, when tailoring GPT-2, IPA significantly outperforms all other decoding baselines across all automatic metrics. Notably, it achieves an absolute improvement of 20.26% over the best-performing baseline on the MAUVE score, which has shown a high correlation with human judgments of quality. Our pairwise human evaluation in Figure 2 also verifies these results: IPA generates significantly more coherent and fluent texts compared to other decoding baselines. Overall, on average, human evaluators preferred IPA 1.8× more often than other decoding baselines.
Regarding tailoring GPT-3, as shown in the bottom section of Table 4, directly applying the policy adapter optimized for GPT-2 on top of GPT-3 (denoted as IPA-) results in a remarkable improvement in generation quality across all automatic metrics, demonstrating the adaptability and reusability of IPA. We observed further improvement when using distilled GPT-3 as the approximate policy in adapter training to tailor GPT-3 (denoted as IPA*).
Our findings are corroborated by the human evaluation results in Figure 2, which show a noticeable preference by human evaluators for outputs from GPT-3 tailored by IPA over the off-the-shelf GPT-3.

Dialogue Safety Control
Existing dialogue systems often fail to respond safely to potentially unsafe user utterances (Kim et al., 2022), limiting their deployment in real-world applications. Here, we aim to evaluate IPA for controlling the safety of a dialogue model.

Datasets and Metrics.
We experiment on DIASAFETY (Sun et al., 2022), a challenging dataset containing 54K context-sensitive unsafe examples.
The task is to generate a coherent response to a potentially unsafe utterance while avoiding offensive, harmful, toxic, or biased language. DIASAFETY contains human-written safe and unsafe responses, which we use to train a dialogue safety classifier.
We use the classifier score as an automatic measure of safety. In addition, we conduct a human evaluation of safety and coherence (3-point Likert scale) for 200 randomly sampled test examples on Amazon Mechanical Turk; see Appendix A, Figure 3 for details.
Setup and Baselines. We apply IPA to tailor models from the Blenderbot family (Roller et al., 2021), which are pretrained dialogue agents. Specifically, we use Blenderbot-3B-distill as the frozen base policy and a smaller Blenderbot-1B-distill as the approximate policy, and initialize the policy adapter with a Blenderbot-1B-distill model. We use Quark as the RL algorithm to optimize the policy adapter. To preserve coherence and engagingness while controlling the safety of a dialogue response, we choose our reward to be the product of the safety score from our trained dialogue safety classifier and the coherence and engagingness scores from UniEval-Dialogue (Zhong et al., 2022). During inference, we use nucleus sampling (Holtzman et al., 2020) with p = 0.6 and temperature 1.0. We compare IPA with its base policy, i.e., Blenderbot-3B-distill, and other off-the-shelf dialogue models including DialoGPT (Zhang et al., 2020) and GODEL (Peng et al., 2022), as well as ChatGPT (OpenAI, 2022). ChatGPT is known to have safeguards through content filtering and is considered a strong baseline.
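Schematically, the combined reward is simply the product of the three component scores; the scorer functions below are illustrative stand-ins for the trained safety classifier and UniEval-Dialogue:

def dialogue_reward(context: str, response: str) -> float:
    # Product of the safety classifier score and UniEval-Dialogue's coherence
    # and engagingness scores; all three scorers are illustrative stand-ins.
    return (safety_score(context, response)
            * coherence_score(context, response)
            * engagingness_score(context, response))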

Results
As shown in Table 5, IPA significantly improves dialogue safety and coherence over its base policy Blenderbot-3B-distill, surpassing other dialogue models including DialoGPT and GODEL.
Compared with ChatGPT, IPA achieves comparable safety performance based on both automatic and human evaluation while showing improved coherence. Upon further investigation, we found that ChatGPT often generates canned responses like "I'm a language model; I'm not allowed..." as hard safeguards, which hurt the coherence and naturalness of the dialogue flow. In contrast, Blenderbot tailored by IPA can generate safe responses that are coherent, natural, and human-like. Our results demonstrate the potential of IPA to enhance controllability in various NLP applications beyond conditional text generation.

Knowledge-grounded Dialogue
Ideally, knowledge-grounded dialogue systems should generate responses that are faithful to the given knowledge K. However, models tend to generate hallucinated responses that contain unverifiable information (Dziri et al., 2022a; Rashkin et al., 2021a; Dziri et al., 2022c). To address this undesirable behavior, we use IPA to tailor a dialogue model towards generating more faithful content. Given the knowledge K and the conversation history H, the task is to generate a response r that is faithful to K and coherent with H.

Dataset and Metrics

We experiment on Wizard of Wikipedia (WoW), in which a Wizard and an Apprentice engage in a conversation. The Wizard's role is to provide information on a specific topic, while the Apprentice's task is to seek further details. WoW has been shown to suffer from hallucinations in more than 60% of the turns (Dziri et al., 2022b), making it a valuable dataset for studying and addressing hallucination issues. To evaluate the faithfulness of responses and rate them against the knowledge snippets and gold responses, we use the FaithDial test data (Dziri et al., 2022a) at test time. FaithDial is a hallucination-free benchmark created by modifying the hallucinated responses within the WoW dataset.

To measure faithfulness, we use the critic model (Dziri et al., 2022a), which returns the percentage of utterances identified as faithful. Additionally, we use BERTScore to measure the semantic similarity between the generated response r and the knowledge K, and the token-level F1 score to rate the lexical overlap between r and K. To measure coherence and engagingness, we use the UniEval model (Zhong et al., 2022).
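A minimal sketch of the token-level F1 between the response and the knowledge (whitespace tokenization is an illustrative simplification):

from collections import Counter

def token_f1(response: str, knowledge: str) -> float:
    # Lexical overlap between response r and knowledge K.
    r, k = response.lower().split(), knowledge.lower().split()
    overlap = sum((Counter(r) & Counter(k)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(r), overlap / len(k)
    return 2 * precision * recall / (precision + recall)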
Setup and Baselines. Similar to the dialogue safety experiment, we use the Blenderbot-3B-distill and Blenderbot-1B-distill models (Roller et al., 2021) as our base policy and approximate policy, respectively, and initialize the policy adapter with a Blenderbot-1B-distill model. We use Quark as the RL algorithm to optimize the policy adapter. To preserve coherence and engagingness while ensuring the faithfulness of a dialogue response, we choose our reward to be the product of the faithfulness score from the critic model described above and the coherence and engagingness scores from UniEval-Dialogue (Zhong et al., 2022). During inference, we use nucleus sampling (Holtzman et al., 2020) with p = 0.6 and temperature 1.0.

Results
As shown in Table 6, supervised models struggle to generate faithful dialogue responses grounded in the given knowledge. This is mainly due to the poor quality of their supervision data: WoW has been shown to suffer from hallucinations in more than 60% of the turns (Dziri et al., 2022a). Moreover, pretrained dialogue models like Blenderbot perform even worse at generating faithful responses, despite being trained on WoW and other knowledge-grounded dialogue datasets during pretraining. IPA significantly improves the faithfulness of generated dialogue responses over its base policy Blenderbot while preserving dialogue quality (i.e., coherence and engagingness), outperforming all other baselines. Our results showcase the potential of IPA to improve the reliability and trustworthiness of NLP systems in various downstream applications.

Related Work
Controlled Decoding. Recent studies have explored controlled generation at inference time by designing new decoding algorithms (Keskar et al., 2019; Mireshghallah et al., 2022; Li et al., 2022a; Chen et al., 2022; Zhang et al., 2022). For example, to ensure the inclusion of given keywords, NeuroLogic decoding (Lu et al., 2020) and GBS (Hokamp and Liu, 2017) generalize beam search for lexically constrained decoding by constraining the decoding space with keyword-related penalties. DExperts (Liu et al., 2021b) modifies the output distribution during decoding with attribute-specific expert models. However, these decoding methods are designed for particular control types only. Another line of research develops gradient-based decoding for more general control (Qin et al., 2020, 2022b; Sha, 2020; Dathathri et al., 2020b; Kumar et al., 2021). For example, COLD decoding (Qin et al., 2022b) introduces energy-based modeling to impose arbitrary constraints on text and samples with Langevin dynamics. PPLM (Dathathri et al., 2020b) steers generation towards desirable attributes via the gradients of a small fine-tuned model. Despite their progress, these approaches rely on computationally expensive gradient computations.
Reinforcement Learning for Natural Language Generation. RL has historically been used in multiple NLG tasks, such as machine translation (Wu et al., 2016; Nguyen et al., 2017), summarization (Paulus et al., 2017), dialogue (Li et al., 2016; Zhou et al., 2017), and text games (Narasimhan et al., 2015; Hausknecht et al., 2020), to ensure that the generated text is optimized for an arbitrary non-differentiable reward. This was often done using online policy gradient methods such as REINFORCE (Sutton and Barto, 2018), leading to documented issues with reward hacking, where a model attains high reward but fails to solve the spirit of the task and produces inarticulate text (Choshen et al., 2020; Kiegeland and Kreutzer, 2021). Recent advances leverage the success of LMs in modeling language fluency to introduce a KL reward penalty, which significantly increases the naturalness of generated text (Ouyang et al., 2022; Korbak et al., 2022). This method has been used extensively to tune a base LM via online on-policy (Ramamurthy* et al., 2023), off-policy (Guo et al., 2022; Lu et al., 2022b), and offline (Snell et al., 2023; Korbak et al., 2023) RL. Such methods quickly become computationally infeasible for extreme-scale LMs with billions of parameters; thus IPA tunes a separate, smaller policy designed to adapt an existing larger LM.
Parameter-Efficient Fine-Tuning. Prompting and prefix-tuning (Li and Liang, 2021) adapt a very large model to a specific task. However, they are sensitive to the order of words or examples (Zhao et al., 2021; Webson and Pavlick, 2022), lack associative clarity (Min et al., 2022), and prompt tuning works well only for very large models (Mahabadi et al., 2021; Liu et al., 2022b). These methods compose the input to the model. In contrast, parameter-efficient fine-tuning offers a clean way to compose parameters directly by adding or updating a smaller subset of model parameters. A common strategy is to prune the model parameters and introduce sparsity (Han et al., 2017; Frankle and Carbin, 2019; Frankle et al., 2020). The effectiveness of this approach has also been substantiated with the use of RL (Yu et al., 2020). Instead of pruning individual units, structured pruning removes an entire group, such as attention heads in pretrained models (Michel et al., 2019; Voita et al., 2019). Additionally, Li et al. (2018) demonstrate the effectiveness of optimizing a model in a low-dimensional, randomly oriented subspace. Later studies (Aghajanyan et al., 2021) have shown that the intrinsic dimensionality decreases when pretraining larger models. Hu et al. (2022) learn a low-rank factorization via projection matrices and apply it to the self-attention weights. More recent methods add a small set of parameters, called adapters (Rebuffi et al., 2017), or compact adapters (Mahabadi et al., 2021), which are model-specific (Stickland and Murray, 2019). Pfeiffer et al. (2020) introduced a continuously evolving AdapterHub that stitches together different pre-trained adapters for languages and tasks, inspired by routing networks (Rosenbaum et al., 2019) optimized through reinforcement learning (Kirsch et al., 2018; Chang et al., 2019). Though these methods are efficient, they require access to the model's internal representations and gradients, which is not feasible for large models like GPT-3 with limited access.

Conclusion
In this work, we introduce IPA, a lightweight Inference-time Policy Adapter that tailors a frozen large language model towards desirable properties (e.g., safety, coherence, faithfulness) in an efficient, generalizable, and flexible way. IPA inherits the generalizability of the RL approach, enabling tailoring models with arbitrary objectives, while incorporating the plug-and-play flexibility of inference-time techniques. IPA can effectively customize base policy models towards specified objectives without sacrificing the favorable attributes the base models already have. Crucially, IPA enables tailoring massive language models that are generally untunable due to computational cost or restricted API access. As IPA operates independently of the choice of online RL algorithm, it can flexibly accommodate future alternative optimization techniques. Extensive experiments across five challenging and diverse text generation tasks demonstrate that IPA consistently outperforms competitive baselines, for both small and massive model sizes, sometimes even beating expensive fine-tuning. We hope our work sheds light on creative and efficient algorithmic innovations that complement the pursuit of model scale using academic-level resources.

A Human Evaluation
We illustrate the human evaluation layouts on Amazon Mechanical Turk for the Dialogue Safety Control, Open-ended Generation, and Lexically Constrained Generation tasks in Figures 3, 4, and 5.

Figure 1 :
Figure 1: Inference-time Policy Adapters (IPA) efficiently steer a large-scale language model (such as GPT-3) at decoding time through a lightweight policy adapter trained to reflect an arbitrary user objective with reinforcement learning.

Figure 2 :
Figure 2: Pairwise human evaluation in terms of overall quality for open-ended generation on XSum, with off-the-shelf GPT2-XL (top) and GPT-3 (bottom) as the base policy to tailor.

Figure 3 :
Figure 3: Human evaluation layout on Amazon Mechanical Turk for Dialogue Safety Control.

Figure 4 :
Figure 4: Human evaluation layout on Amazon Mechanical Turk for open-ended generation.

Figure 5 :
Figure 5: Human evaluation layout on Amazon Mechanical Turk for lexically constrained generation.

Table 1 :
Automatic evaluation for Toxicity Reduction, with off-the-shelf GPT2-large (top) and GPT-3 (bottom) as the base policy to tailor.

Table 2 :
Comparison of different RL algorithms for training IPA for Toxicity Reduction, with off-the-shelf GPT2-large as the base policy to tailor.

Table 3 :
Automatic and human evaluation results for Lexically Constrained Generation. Human evaluation scores are on a 3-point Likert scale.

Table 5 :
Automatic and human evaluation results for Dialogue Safety Control. Human evaluation scores are on a 3-point Likert scale.

Table 6 :
Evaluation results for knowledge-grounded dialogue generation on FaithDial. We use off-the-shelf Blenderbot as the base policy to tailor.