LLM-Adapters: An Adapter Family for Parameter-Efficient Fine-Tuning of Large Language Models

The success of large language models (LLMs), like GPT-4 and ChatGPT, has led to the development of numerous cost-effective and accessible alternatives that are created by fine-tuning open-access LLMs with task-specific data (e.g., ChatDoctor) or instruction data (e.g., Alpaca). Among the various fine-tuning methods, adapter-based parameter-efficient fine-tuning (PEFT) is undoubtedly one of the most attractive topics, as it only requires fine-tuning a few external parameters instead of the entire LLM while achieving comparable or even better performance. To enable further research on PEFT methods for LLMs, this paper presents LLM-Adapters, an easy-to-use framework that integrates various adapters into LLMs and can execute these adapter-based PEFT methods for different tasks. The framework includes state-of-the-art open-access LLMs such as LLaMA, BLOOM, and GPT-J, as well as widely used adapters such as Series adapters, Parallel adapters, Prompt-based learning, and Reparametrization-based methods. Moreover, we conduct extensive empirical studies on the impact of adapter types, placement locations, and hyper-parameters to find the best design for each adapter-based method. We evaluate the effectiveness of the adapters on fourteen datasets from two different reasoning tasks, Arithmetic Reasoning and Commonsense Reasoning. The results demonstrate that using adapter-based PEFT in smaller-scale LLMs (7B) with few extra trainable parameters yields comparable, and in some cases superior, performance to powerful LLMs (175B) in zero-shot inference on both reasoning tasks.


Introduction
Large language models (LLMs) such as GPT-3 (Brown et al., 2020), BLOOM (Scao et al., 2022), and LLaMA (Touvron et al., 2023) have shown impressive performance in various natural language processing (NLP) tasks. Fine-tuning (Brown et al., 2020; Raffel et al., 2020; Wei et al., 2021; Taori et al., 2023) is a popular technique that adapts LLMs to specific downstream tasks by training them on task-specific datasets. However, the most powerful LLMs, specifically instruction-following LLMs such as ChatGPT and GPT-4, are currently closed-source. These models only offer user interfaces or APIs, and their source code is not accessible to researchers or developers. Consequently, researchers and developers are unable to utilize these LLMs as backbone models for developing fine-tuning methods for specific downstream tasks. The lack of access to these LLMs hinders innovation and the advancement of the state of the art.
To mitigate this issue, the Stanford Alpaca team (Taori et al., 2023) utilizes the Self-Instruct strategy (Wang et al., 2022) to generate an instruction-following dataset comprising 52k examples and fine-tunes the open-source LLaMA (7B) model on these instructions to obtain an instruction-following LLaMA model. However, the massive size of LLMs, often comprising billions of parameters (Brown et al., 2020; Zhang et al., 2022; Chowdhery et al., 2022; Touvron et al., 2023), makes fine-tuning the entire LLM for a downstream task highly challenging and inefficient. To overcome this challenge, Alpaca-LoRA integrates a parameter-efficient fine-tuning (PEFT) (Houlsby et al., 2019; Mangrulkar et al., 2022; Fu et al., 2022) method, Low-Rank Adaptation (LoRA) (Hu et al., 2021), into Alpaca. This integration allows the parameter-efficiently fine-tuned model to achieve performance comparable to full fine-tuning with few trainable parameters. The success and potential of Alpaca and Alpaca-LoRA have sparked a range of adaptations and applications, including Chinese Alpaca, Japanese Alpaca, Thai Alpaca, medical Alpaca (ChatDoctor) (Yunxiang et al., 2023), movie recommendation Alpaca (RecAlpaca) (Wang et al., 2023), and multi-modal Alpaca (LLaMA-Adapter). Additionally, a toolbox for fine-tuning LLMs, known as LMFlow (Diao et al., 2023), has been developed.
Overall, we offer a promising framework for fine-tuning LLMs on a range of downstream tasks. We envision that LLM-Adapters will serve as a valuable resource for advancing research on adapter-based PEFT of LLMs, facilitating the deployment of such research pipelines, and enabling practical applications of this technique in real-world systems. Additionally, we will keep all of the code open source and will continue to update the framework with new adapters, LLMs, and tasks.

Adapter Family
Adapters for LLMs are neural modules integrated into LLMs that contain a small number of extra trainable parameters, allowing for efficient fine-tuning on specific tasks without affecting the pre-trained parameters of the LLM. We use Θ to denote the parameters of the LLM and Φ to denote the parameters of the adapter module. During training, the LLM parameters (Θ) remain fixed, while the adapter parameters (Φ) are adjusted for a specific task. As a result, the representations generated by the LLM are not distorted by task-specific tuning, while the adapter module acquires the capability to encode task-specific information.
This framework presents three types of adapters within LLMs: Series Adapter, Parallel Adapter, and LoRA. Our plan for future work involves updating the framework with new adapters. Series Adapter. Inspired by Houlsby et al. (2019), our framework adds a bottleneck feed-forward layer in series with each multi-head attention and feed-forward layer of a Transformer (Vaswani et al., 2017) block, which serves as the basis for most LLMs (Brown et al., 2020; Touvron et al., 2023; Wang and Komatsuzaki, 2021). Figure 1(a) shows that bottleneck adapters consist of a two-layer feed-forward neural network that includes a down-projection matrix, a non-linearity function, an up-projection, and a residual connection between input and output.
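A minimal PyTorch sketch of such a bottleneck module; the bottleneck size, the choice of ReLU, and the zero initialization of the up-projection are illustrative choices, not necessarily those of the framework's actual implementation:

```python
import torch
import torch.nn as nn

class SeriesAdapter(nn.Module):
    """Bottleneck adapter: down-project, non-linearity, up-project, residual."""

    def __init__(self, d_model: int, bottleneck: int = 256):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.act = nn.ReLU()
        self.up = nn.Linear(bottleneck, d_model)
        # Zero-init the up-projection so the adapter starts as an identity map
        # and does not perturb the pre-trained representations at step 0.
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual connection between input and the bottleneck output.
        return x + self.up(self.act(self.down(x)))
```

In series placement, this module is applied to the output of the attention or feed-forward sub-layer before it is passed on, so only the `down`/`up` parameters (Φ) are trained.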
Parallel Adapter. The Parallel Adapter implements Pfeiffer et al. (2020b)'s architecture, which integrates bottleneck feed-forward layers in parallel with the multi-head attention and feed-forward layers of a Transformer block in LLMs. As shown in Figure 1(b), the adapters are incorporated alongside each transformer layer.
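The parallel placement can be sketched as a wrapper around a frozen sub-layer, whose input is also fed to the bottleneck branch (the wrapped sub-layer is an arbitrary module here; sizes, activation, and initialization are illustrative):

```python
import torch
import torch.nn as nn

class ParallelAdapter(nn.Module):
    """Wraps a frozen Transformer sub-layer and adds a bottleneck branch in parallel."""

    def __init__(self, sublayer: nn.Module, d_model: int, bottleneck: int = 256):
        super().__init__()
        self.sublayer = sublayer
        for p in self.sublayer.parameters():
            p.requires_grad = False  # base LLM weights (Theta) stay fixed
        self.down = nn.Linear(d_model, bottleneck)
        self.act = nn.ReLU()
        self.up = nn.Linear(bottleneck, d_model)
        nn.init.zeros_(self.up.weight)  # branch contributes nothing at step 0
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Unlike a series adapter, the branch sees the sub-layer's INPUT,
        # and its output is summed with the sub-layer's output.
        return self.sublayer(x) + self.up(self.act(self.down(x)))
```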
LoRA. Hu et al. (2021) propose LoRA, which aims to efficiently fine-tune pre-trained models with fewer trainable parameters. LoRA introduces trainable low-rank decomposition matrices into the LLM's existing layers, enabling the model to adapt to new data while keeping the original LLM weights fixed to retain previous knowledge. Specifically, LoRA reparameterizes each model layer expressed as a matrix multiplication by injecting low-rank decomposition matrices, as illustrated in Figure 1(c). This reparameterization enables the model to be fine-tuned without updating the full dense weight matrix. By reducing the rank of the injected matrices, LoRA also reduces the number of trainable parameters when fine-tuning LLMs.
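Concretely, for a frozen pre-trained weight matrix W, LoRA learns a low-rank update so the effective weight becomes W + BA, with B zero-initialized so training starts from the pre-trained model. A minimal sketch (the alpha/r scaling and initialization scale follow common practice and are assumptions here):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pre-trained linear layer plus a trainable low-rank update B @ A."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # retain the original knowledge untouched
        # A: (r, d_in) with small random init; B: (d_out, r) zero-initialized,
        # so B @ A = 0 and the layer matches the pre-trained model at step 0.
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Low-rank path: project to rank r, then back up; never form B @ A densely.
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())
```

With rank r much smaller than the hidden size, each adapted layer adds only r * (d_in + d_out) trainable parameters.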
LLM-Adapters provides a configuration file that allows users to customize the architecture settings, facilitating flexibility and extensibility. Users can select the block and layer for inserting any adapter, or use standard adapter architectures from the literature.
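As a hypothetical illustration of what such a configuration might express; every key name below is an assumption for exposition, not the framework's actual schema:

```python
# Illustrative only: key names and values are assumptions, not LLM-Adapters' real schema.
adapter_config = {
    "adapter_type": "lora",                  # e.g., "series", "parallel", or "lora"
    "target_modules": ["q_proj", "v_proj"],  # which Linear layers to adapt
    "rank": 8,                               # LoRA rank (or bottleneck size for adapters)
    "alpha": 16,                             # LoRA scaling factor
    "insert_blocks": "all",                  # or a list of Transformer block indices
}
```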

Datasets
Adapter methods in LLM-Adapters are evaluated on six math reasoning datasets. These are: (1) the GSM8K (Cobbe et al., 2021) dataset, consisting of linguistically diverse math word problems written by human problem writers for grade school students; (2) the SVAMP (Patel et al., 2021) benchmark, which includes one-unknown arithmetic word problems for students up to 4th grade and is derived from an existing dataset with slight modifications; (3) the MultiArith (Roy and Roth, 2016) dataset, featuring math word problems that require multiple reasoning steps and operations; (4) the AddSub (Hosseini et al., 2014) dataset, comprising addition and subtraction arithmetic word problems; (5) the AQUA (Ling et al., 2017) dataset, which contains algebraic word problems with natural language rationales; and (6) the SingleEq (Koncel-Kedziorski et al., 2015) dataset, consisting of single-equation algebra word problems with multiple math operations over non-negative rational numbers and one variable.
To prepare the training data for the adapters in our experiment, we randomly shuffle each math reasoning dataset and then divide it into two subsets: a training set comprising 80% of the data and a test set containing the remaining 20%. Next, we combine all the training subsets into a single, unified dataset and use it to train all the adapters. To obtain the supervision signals needed to train the adapters, we extract rationales and answers for the training examples from the log files of Zero-shot-CoT (Kojima et al., 2022) experiments with GPT-3 (text-davinci-003). In total, we acquire 3,260 mathematical word problems for fine-tuning and 816 for testing, each accompanied by a corresponding rationale and answer. Notably, we do not apply any filtering procedure to exclude samples with erroneous rationales or answers.
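The shuffle, split, and merge steps above can be sketched as follows; representing each dataset as a list of example dicts and fixing a seed are assumptions for illustration:

```python
import random

def split_dataset(examples, train_frac=0.8, seed=42):
    """Shuffle one dataset, then split it into train/test subsets."""
    rng = random.Random(seed)
    examples = list(examples)  # copy so the caller's list is untouched
    rng.shuffle(examples)
    cut = int(len(examples) * train_frac)
    return examples[:cut], examples[cut:]

def build_splits(all_datasets):
    """Split each math dataset 80/20, then pool the training halves."""
    unified_train, test_sets = [], {}
    for name, examples in all_datasets.items():
        train, test = split_dataset(examples)
        unified_train.extend(train)  # one unified training set for all adapters
        test_sets[name] = test       # per-dataset test sets for evaluation
    return unified_train, test_sets
```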

Methods for Comparison
The study by Houlsby et al. (2019) introduces a method called S-Adapter h , which integrates adapter layers after both the multi-head attention modules and the MLP modules of each transformer layer. By solely training the parameters of the adapter layers, S-Adapter h can achieve near state-of-the-art performance on text classification tasks. In contrast, Pfeiffer et al. (2020b) propose a modified method, called S-Adapter p , which only integrates task adapters after the MLP modules in each transformer layer. Furthermore, Pfeiffer et al. (2020b) also introduce an invertible adapter architecture for adapting a pre-trained multilingual model to a new language. Another approach, called Parallel Adapters (P-Adapter), integrates adapter layers in parallel with multi-head attention layers or MLP layers. Hu et al. (2021) propose LoRA, which freezes the pre-trained model weights and injects trainable rank decomposition matrices into each layer of the Transformer architecture, greatly reducing the number of trainable parameters for downstream tasks.

Implementation Details
To facilitate INT8 training of large models with parallel adapters, we incorporate the parallel adapter layers into multi-head attention layers and MLP layers, in parallel with their Linear layers. This approach offers flexibility, as the parallel layers can be integrated with any Linear layer within the model. To ensure a fair comparison among adapter methods, we approximately match the fine-tuning compute budget for each adapter: we set the rank of the LoRA A and B matrices to 8, while for S-Adapter h and S-Adapter p the bottleneck size is set to 256 and 768, respectively. For P-Adapter, the adapter layers are placed in the multi-head attention modules with a bottleneck size of 256. Our evaluation uses LLaMA-7B, BLOOM-7.1B, and GPT-J-6B as backbone models. We fine-tune and run inference for all adapter methods on these models using two 3090 GPUs, each with 24 GB of memory. Fine-tuning each model takes approximately 20 minutes.

Table 1 reports the accuracy of different LLMs parameter-efficiently fine-tuned with different adapters on six math reasoning datasets. The teacher model (175B) outperforms adapter-based parameter-efficiently fine-tuned LLMs across various tasks. However, on simple math reasoning datasets such as MultiArith, AddSub, and SingleEq, adapter-based methods such as LLaMA-7B with LoRA achieve comparable performance. This indicates that, given sufficient task-specific training data, adapter-based parameter-efficient fine-tuning (PEFT) of smaller LLMs may have the potential to perform similarly to very large language models. Moreover, the performance of adapter-based PEFT methods varies among LLMs of similar size. For example, the latest open-source LLM, LLaMA, outperforms GPT-J and BLOOM in most cases.
Comparing different adapters, LoRA achieves excellent performance with significantly fewer trainable parameters, suggesting that task-specific fine-tuning may not require excessive learnable parameters. Overall, these findings demonstrate the potential for adapter-based PEFT of smaller LLMs to achieve high performance on specific tasks with few trainable parameters.
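A back-of-the-envelope comparison makes the parameter gap concrete. Assuming a hidden size of 4096 (as in LLaMA-7B; the exact totals depend on which and how many layers are adapted), a rank-8 LoRA layer and a bottleneck-256 series adapter differ by more than an order of magnitude in per-layer trainable parameters:

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters of one LoRA-adapted linear layer: A (r x d_in) + B (d_out x r)."""
    return r * d_in + d_out * r

def bottleneck_params(d_model: int, bottleneck: int) -> int:
    """Trainable parameters of one bottleneck adapter (two Linear layers with biases)."""
    down = d_model * bottleneck + bottleneck  # down-projection + bias
    up = bottleneck * d_model + d_model       # up-projection + bias
    return down + up

d = 4096  # assumed hidden size
print(lora_params(d, d, r=8))       # 65536 per adapted square layer
print(bottleneck_params(d, 256))    # 2101504 per adapter, ~32x more
```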

Conclusion and Future Work
This paper presents LLM-Adapters, a framework that includes adapter-based parameter-efficient fine-tuning (PEFT) methods of large language models (LLMs) for various tasks. The framework is research-friendly, efficient, modular, and extendable, and contains state-of-the-art open-access LLMs and widely used adapters. The experiments on six math reasoning datasets demonstrate that adapter-based PEFT with the proposed LLM-Adapters framework can achieve comparable or even better performance than powerful LLMs in zero-shot inference on simple math reasoning datasets. Future work includes integrating new adapters and evaluating them with larger-scale language models on more tasks to enable further research on PEFT methods of LLMs. The code is available at the provided link for further exploration and development.