Task-Optimized Adapters for an End-to-End Task-Oriented Dialogue System

Task-Oriented Dialogue (TOD) systems are designed to carry out specific tasks by tracking dialogue states and generating appropriate responses to help users achieve defined goals. Recently, end-to-end dialogue models pre-trained on large datasets have shown promising performance in conversational systems. However, they share the same parameters to train the tasks of the dialogue system (NLU, DST, NLG), so debugging each task is challenging. They also require considerable effort to fine-tune a large number of parameters to create a task-oriented chatbot, making them difficult for non-experts to handle. Therefore, we aim to train models that are relatively lightweight and fast compared to a PLM. In this paper, we propose an end-to-end TOD system with Task-Optimized Adapters, which learn independently per task and add only a small number of parameters after fixed layers of the pre-trained network. We also enhance the performance of the DST and NLG modules through reinforcement learning, compensating for the learning capacity that adapter training lacks and enabling natural and consistent response generation that is appropriate for the goal. Our method is model-agnostic and does not require prompt-tuning, since it takes only the input data without a prompt. In our experiments, our method shows competitive performance on the MultiWOZ benchmark compared to existing end-to-end models. In particular, we attain state-of-the-art performance on the DST task of the MultiWOZ 2.2 dataset.


Introduction
Task-oriented dialogue systems are trained to achieve specific goals and to enhance efficiency and convenience in various fields such as customer service centers and healthcare information retrieval.
Task-oriented dialogue systems are divided into key components: understanding the user's intent (NLU), tracking the current dialogue states (DST), and generating responses based on previous sessions (NLG). Pipeline-based systems train each component separately, so they have the advantage of optimizing each module and raising the performance of a given task. User feedback is, however, difficult to propagate to each module, and the input to each component depends on the result of the previous module (Chen et al., 2017). Recently, dialogue systems have been trained in an end-to-end manner with transfer learning or by pre-training networks on large dialogue corpora. However, building efficient end-to-end TOD systems requires a large amount of data and has some limitations due to parameter sharing. End-to-end models backpropagate the gradients from the output end back through the entire neural network. This raises a parameter-efficiency issue, because all parameters are updated for every downstream scenario. It is also challenging to debug each task and to take task-flow characteristics into account.
Therefore, we propose a simple structure that adds adapters to the core modules (NLU, DST, NLG) of the TOD system, as shown in Figure 1. With the adapters, each task can be optimized with only a small number of trainable parameters while the pre-trained model's parameters remain fixed. Additionally, the approach is safe from the catastrophic forgetting problem (French, 1999), which causes pre-trained models to lose important skills acquired during pre-training. The key is to apply a transfer learning strategy that yields compact and extensible downstream models in the dialogue system (Houlsby et al., 2019). This makes it easy for people to train large-scale end-to-end TOD models. Also, by applying REINFORCE (Sutton et al., 1999), we attempt to reduce the expected score gap caused by the small number of adapter parameters compared to full fine-tuning. Specifically, we use Joint Goal Accuracy, and a weighted sum of BLEU and Success rate, as rewards for training the DST and NLG adapters. To the best of our knowledge, this is the first work that uses Joint Goal Accuracy as a reward for an E2E TOD system.

[Figure 1 example: a restaurant-domain dialogue history, the belief state "[restaurant] area=centre pricerange=expensive", and the NLG Adapter panel with the response "There is an African place named Bedouin in the centre. How does that sound?"]
To address the aforementioned problems, we propose a Task-Optimized Adapter for an end-to-end Task-Oriented Dialogue system (TOATOD), applying reinforcement learning to the DST and NLG tasks. In summary, our key contributions are as follows:
• We present a new architecture that enables per-task debugging of the end-to-end model by using separated adapters.
• Without updating the original parameters of the PLM, we train an end-to-end TOD system efficiently with only a few trainable parameters.
• We design a reward function not just for the NLG task but also for the DST task with metric-aware reinforcement learning, which is a novel approach.
• The proposed approach outperforms previous models on the DST task of MultiWOZ 2.2 and shows results comparable to full fine-tuning on the NLU and NLG tasks.

Background
Pipeline-based Task-Oriented Dialogue System. Conventional task-oriented dialogue systems are usually based on the pipeline method, consisting of language understanding (NLU), dialogue state tracking (DST), policy learning (POL), and language generation (NLG). This kind of modularization allows each component to be optimized independently, making it easier to update the model and understand how it works. Pipeline-based systems, however, have several limitations. The modules are trained sequentially, so it is hard to align them to common optimization targets (Liu and Lane, 2018). This makes the system more complex and makes it harder to backpropagate accumulated errors. The performance of earlier components affects the following modules, so if upstream modules perform poorly, errors that occurred earlier may propagate and be amplified in downstream components (Liu and Lane, 2018).
End-to-end Task-Oriented Dialogue System. End-to-end task-oriented dialogue systems, on the other hand, are easier to optimize and are trained to directly map the input to the output in a single model. They can leverage large amounts of data for robust learning, and the entire system can be optimized under end-to-end settings, which leads to better performance. A general approach for building end-to-end systems is to fine-tune pre-trained language models (Budzianowski and Vulić, 2019). This approach utilizes the strength of pre-trained networks, which helps the models leverage pre-trained knowledge while also adapting to task-specific data. For example, SimpleTOD (Hosseini-Asl et al., 2020) solved task-oriented dialogue as a causal language modeling task using several versions of GPT (Radford and Narasimhan, 2018; Radford et al.).
Pre-training of Dialogue Language Model. Recently, methods that pre-train a dialogue language model (Wu et al., 2020; Zhang et al., 2020b; Peng et al., 2021; Su et al., 2022; He et al., 2022a), instead of fine-tuning pre-trained networks, have outperformed the previous baselines on the benchmark. For instance, SPACE-3 (He et al., 2022a) captures contextualized knowledge from large-scale dialogue corpora by pre-training a unified language model. However, there are still some issues with these methods. A large number of parameters is required for training backbone models like BERT, T5 (Raffel et al., 2020), GPT, and UniLM (Dong et al., 2019). As shown in Table 1, T5 base and T5 small require over 220M and 60M trainable parameters, respectively. These methods also disregard the task-flow features of task-oriented dialogue systems. In addition, it is still hard to debug each module because the model parameters are shared and jointly optimized. PPTOD (Su et al., 2022) integrated modules into a unified model with task-specific prompts and alleviated error accumulation in a plug-and-play way. Still, this method is not completely free from interference among tasks because of its fully shared parameters.
Adapter tuning for NLP. Since pre-trained models for NLP tasks have become mainstream, they are mainly used for transfer learning to downstream tasks. However, a parameter-efficiency issue has been raised, because updating all PLM parameters for every downstream scenario is expensive.
To address the issue, the adapter module was proposed to transfer a PLM such as BERT with parameter-efficient tuning (Houlsby et al., 2019), and it shows performance comparable to full fine-tuning. The main idea behind the adapter is to train the network on the downstream task with task-specific parameters while keeping the original pre-trained parameters unchanged. The module is composed of two feed-forward layers and a non-linear layer, which can be inserted into the transformer blocks of an end-to-end model. It projects d-dimensional input features into a smaller dimension m and then projects them back into the original dimension, so the total number of parameters added per layer, including biases, is 2md + d + m.
The number of additional parameters per task can be restricted by setting m < d (Li et al., 2021).
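As a rough illustration of the count above, the following sketch evaluates 2md + d + m for one adapter layer. The function name and the example sizes (d = 768, m = 64) are illustrative assumptions and are not the configuration reported in this paper.

```python
def adapter_params_per_layer(d: int, m: int) -> int:
    """Parameter count of one bottleneck adapter layer.

    Down-projection: d*m weights + m biases.
    Up-projection:   m*d weights + d biases.
    Total:           2*m*d + d + m.
    """
    down = d * m + m
    up = m * d + d
    return down + up


# Illustrative sizes only (assumed, not taken from the paper).
d, m = 768, 64
print(adapter_params_per_layer(d, m))
```

Because the count grows linearly in m, shrinking the bottleneck is the direct way to cap the number of additional parameters per task.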
In most previous research, the adapter was used for efficient learning. In this work, we use adapter modules not just for efficient learning (Stickland and Murray, 2019) but also for task-optimized learning, in line with recent studies that have mainly focused on separating parameters (Lin et al., 2021; Feng et al., 2022; Bapna and Firat, 2019).
Reinforcement learning for Text Generation. Although token-based supervised learning is a widely adopted training method for text generation tasks, Ranzato et al. (2016) highlight two major problems associated with this approach.
The first is the exposure bias problem: during training, the model is exposed to the ground-truth outputs, which allows it to learn to generate text that is similar to the training data. During evaluation, however, the model is not exposed to the ground-truth outputs and instead generates text conditioned on the words it has previously generated itself. This can lead to errors and a significant deviation of the generated text from the training data. The second is the token-level loss problem: the training objective is based solely on the prediction of individual words and neglects the overall coherence and fluency of the predicted token sequence.
To address these problems, researchers have applied reinforcement learning to text generation tasks (Ranzato et al., 2016; Li et al., 2016; Paulus et al., 2018; Chen et al., 2020; Wang et al., 2021; Ye et al., 2022), using metrics such as BLEU (Papineni et al., 2002) or ROUGE (Lin, 2004) as rewards for sequence-level training. In our study, we also use the REINFORCE method and task-oriented dialogue metrics to continue training our model after supervised learning, not only to address these challenges but also to mitigate the performance degradation caused by the use of an adapter.

Adapter for each task (NLU, DST, NLG)
The adapter module is designed to adapt the pre-trained network for each of the tasks (NLU, DST, NLG) in a task-oriented dialogue system. The output of the j-th feed-forward layer with residual connection in the transformer block is represented as H_j ∈ R^{n×d}, where n is the input dimension and d is the hidden dimension. As shown in Figure 2, the overall architecture of the adapter module includes multiple feed-forward projections, referred to as the down-projection and the up-projection, followed by layer normalization (LN). The down-projection with W_down ∈ R^{d×h} projects the input H_j, which is then passed through the ReLU activation function, and the up-projection with W_up ∈ R^{h×d} projects the output back to the original dimension. The bottleneck dimension h is a hyperparameter that sets the smaller dimension to which the original input is projected. Each adapter also has a residual connection to avoid vanishing gradients (Rebuffi et al., 2017; He et al., 2016).
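The data flow described above can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the helper names are hypothetical, biases and the learned LN scale and shift are omitted, and LN is assumed to be applied to the up-projected output before the residual addition, since the text does not pin down its exact placement.

```python
import math
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python product of an (n x k) matrix and a (k x p) matrix."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def layer_norm(row: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize one hidden vector to zero mean and unit variance."""
    mean = sum(row) / len(row)
    var = sum((v - mean) ** 2 for v in row) / len(row)
    return [(v - mean) / math.sqrt(var + eps) for v in row]


def adapter_forward(h: Matrix, w_down: Matrix, w_up: Matrix) -> Matrix:
    """Bottleneck adapter: down-project, ReLU, up-project, LN, residual add.

    ASSUMPTION: LN is applied to the up-projected output before the residual.
    """
    hidden = [[max(0.0, v) for v in row] for row in matmul(h, w_down)]  # n x h
    up = matmul(hidden, w_up)                                            # n x d
    normed = [layer_norm(row) for row in up]
    return [[x + y for x, y in zip(h_row, n_row)] for h_row, n_row in zip(h, normed)]


# Toy example with n = 1, d = 4, h = 2 (values are arbitrary).
h_j = [[1.0, 2.0, 3.0, 4.0]]
w_down = [[0.1, -0.2], [0.3, 0.4], [-0.5, 0.6], [0.7, 0.8]]
w_up = [[0.5, -0.5, 0.25, 0.0], [0.1, 0.2, 0.3, 0.4]]
print(adapter_forward(h_j, w_down, w_up))
```

With the up-projection initialized to zero, this variant reduces to the identity on H_j, so the frozen network's behavior is unchanged at the start of training.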

Metric-Aware Reinforcement Learning for DST & NLG module
The overall loss function for the metric-aware reinforcement learning is given by Equation (2). The equation describes how the network updates its parameters in order to maximize both the likelihood of the generated response (token loss) and the quality of the response (policy loss). The token loss can be described as a categorical cross-entropy loss, where ŷ denotes the predicted probability of the ground truth and y is the target probability. The token loss CE(y, ŷ) measures how well the predicted token probabilities match the target tokens for a given dialogue context during reinforcement learning. By applying the REINFORCE method, the network can update its weights in the direction that earns the model more reward, even when the reward function is non-differentiable.
The policy loss J_policy(θ) measures how well the model can generate a high-probability token sequence that results in high rewards. Here, ŷ denotes the token sequence predicted by the model. The policy loss described in Equation (3) is calculated as the negative log probability of the token sequence that has the highest probability, multiplied by the reward.
The parameter α in the overall loss (2) is a hyperparameter, a scalar value between 0 and 1 that weighs the importance of these two losses. In this way, the model is trained to predict the correct labels (categorical cross-entropy loss) and, at the same time, to make good decisions that result in high rewards (policy loss). We define separate reward functions for the DST and NLG modules. The DST reward, Reward_DST, is calculated as the sum of the Joint Goal Accuracy (JGA) and a constant value of 1. The JGA measures how well the model predicts the values for every slot in the dialogue turns. With JGA as a reward, the model is encouraged to accurately track the state of the dialogue, which is crucial for generating appropriate responses and improving the performance of task-oriented dialogue systems.
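The pieces described in this section can be sketched as follows. This is a hedged illustration: the function names are hypothetical, and the mixing form α · J_policy + (1 − α) · CE is an assumption inferred from the text (α lies between 0 and 1, α = 1 removes the CE loss, and a smaller α puts more weight on the CE loss). It is not a confirmed transcription of Equation (2).

```python
from typing import Sequence


def cross_entropy(gold_token_log_probs: Sequence[float]) -> float:
    """Token loss: mean negative log-probability of the ground-truth tokens."""
    return -sum(gold_token_log_probs) / len(gold_token_log_probs)


def policy_loss(decoded_log_probs: Sequence[float], reward: float) -> float:
    """REINFORCE-style loss: negative log-probability of the decoded
    (highest-probability) sequence, multiplied by the reward."""
    return -reward * sum(decoded_log_probs)


def reward_dst(joint_goal_accuracy: float) -> float:
    """DST reward as described in the text: JGA plus a constant 1."""
    return joint_goal_accuracy + 1.0


def mixed_loss(alpha: float, policy: float, token: float) -> float:
    """ASSUMED mixing: alpha weights the policy loss, (1 - alpha) the token loss."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * policy + (1.0 - alpha) * token


# Toy numbers (arbitrary): log-probabilities of two tokens and a JGA of 0.5.
ce = cross_entropy([-1.0, -3.0])
pl = policy_loss([-0.5, -0.5], reward_dst(0.5))
print(mixed_loss(0.7, pl, ce))
```

Since the reward enters only as a scalar multiplier on the log-probability, it never needs to be differentiated, which is why a metric such as JGA can be used directly.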
In Equation (5), the BLEU score and the Success rate are used as rewards to guide the learning process of the NLG module. y_u denotes the ground-truth token sequence of each utterance, and ŷ_u denotes the predicted token sequence. To calculate the Success rate, we batch at the session level rather than the utterance level; y and ŷ without the subscript u denote the session-level ground truth and prediction. The weighting factor β is adjusted to balance these two metrics. The hyperparameter β has to be chosen carefully because there is a trade-off between the two metrics in the RL setting (Wu et al., 2021), where increasing one metric may come at the cost of decreasing the other. We experiment with the choice of α and β in Sections 6.1.3 and 6.1.4.
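One possible form of such a weighted reward is sketched below. The exact form of Equation (5) is not recoverable from the text, so this is an assumption: a convex combination in which β weights the session-level Success rate and (1 − β) weights the utterance-level BLEU score, a direction suggested by the later observation that a higher β raises the Success rate relative to BLEU. The function name is hypothetical.

```python
def reward_nlg(bleu: float, success: float, beta: float) -> float:
    """ASSUMED convex combination of the two NLG metrics.

    bleu:    utterance-level BLEU, scaled to [0, 1]
    success: session-level Success rate, scaled to [0, 1]
    beta:    weight on Success; (1 - beta) is the weight on BLEU
    """
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    return beta * success + (1.0 - beta) * bleu


# Arbitrary illustrative values, not results from the paper.
print(reward_nlg(bleu=0.2, success=0.8, beta=0.5))
```

Putting both metrics on the same scale before mixing keeps β interpretable as a pure trade-off weight.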

Datasets
We evaluate our method on the dialogue state tracking (DST) and end-to-end response generation (NLG) tasks with the MultiWOZ 2.1 and 2.2 datasets, and on the intent prediction (NLU) task with the Banking77, CLINC150, and HWU64 datasets.
MultiWOZ 2.1, 2.2. The MultiWOZ (Budzianowski et al., 2018) dataset has been widely used to evaluate the performance of TOD systems. It consists of 8,438, 1,000, and 1,000 multi-turn dialogues for the training, dev, and test sets, collected through a Wizard-of-Oz (WOZ) setup. The dialogues cover a wide range of domains and topics. MultiWOZ 2.2 (Zang et al., 2020) is an improved version of MultiWOZ 2.1 (Eric et al., 2020) that corrects annotation errors, inconsistencies, and ontology issues, and adds span annotations for standardization.
Banking77 (Casanueva et al., 2020). This dataset is a collection of real-life customer banking service queries covering 77 intents. It consists of 13,083 utterances. Each query is labeled with a single intent; however, the intents are hard to differentiate because they correspond to very similar tasks.
CLINC150 (Larson et al., 2019). This is a multi-domain dataset that contains 23,700 utterances covering 150 intent classes over 10 domains.
HWU64 (Liu et al., 2019)
We follow the preprocessing method from UBAR to delexicalize the slot values in each system response. We evaluate our models using the older version of the standardized evaluation script for MultiWOZ 2.1 and the newly released version for MultiWOZ 2.2, both released by Nekvinda and Dusek (2021). The script has been adopted by the official MultiWOZ dataset GitHub repository (footnote 2). Other implementation details are described in Appendix C.

Dialogue State Tracking
We evaluate our models on the DST task with the MultiWOZ 2.1 and 2.2 datasets. We compute Joint Goal Accuracy on the test set, which measures how many values are filled accurately compared to the ground-truth states for all slots. Joint Goal Accuracy is considered a more difficult and more important metric in most research (Zhou and Small, 2019; Dey et al., 2022), because once a single wrong prediction has been made, the model receives no score for that turn.

2 https://github.com/budzianowski/multiwoz
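The all-or-nothing nature of the metric described above can be sketched as follows. This is a minimal illustration rather than the official evaluation script: the function name, the slot naming scheme, and the example states are assumptions made for the sketch.

```python
from typing import Dict, List

BeliefState = Dict[str, str]  # slot name -> value, e.g. "restaurant-area" -> "centre"


def joint_goal_accuracy(predicted: List[BeliefState], gold: List[BeliefState]) -> float:
    """Fraction of turns whose predicted belief state matches the gold state exactly.

    A turn counts as correct only if every slot and value agrees, so a single
    wrong slot value yields no credit for that turn.
    """
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must cover the same turns")
    if not gold:
        return 0.0
    correct = sum(1 for p, g in zip(predicted, gold) if p == g)
    return correct / len(gold)


# Hypothetical two-turn dialogue: the second turn has one wrong slot value.
gold_states = [
    {"restaurant-area": "centre"},
    {"restaurant-area": "centre", "restaurant-pricerange": "expensive"},
]
pred_states = [
    {"restaurant-area": "centre"},
    {"restaurant-area": "centre", "restaurant-pricerange": "cheap"},
]
print(joint_goal_accuracy(pred_states, gold_states))
```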

Evaluation Result
We compare our best models, TOATOD small and TOATOD base, to models trained with a pre-trained network. Table 2 shows that our models are competitive with the end-to-end models on the current benchmark. On the 2.1 dataset, our models show relatively good performance despite the small number of trainable parameters. As shown in Table 5, the Joint Goal Accuracy of TOATOD base with only the task-optimized adapter (SL) is 53.33, which is slightly lower than that of other models using T5 base as the backbone. We therefore reduce this performance degradation by applying metric-aware REINFORCE and report the final results. The trainable parameters of our models are fewer than half of those of the baseline with the fewest parameters. The result implies that the adapter module makes the network more adaptable to the DST task by activating only a few parts of the model. Because of its relatively small number of parameters, our model is more robust to overfitting on confused labels. As mentioned in Section 4.1, MultiWOZ 2.2 is the cleaned version of the 2.1 dataset, so performance is better on 2.2. Among the top results on the 2.2 dataset, our models obtain state-of-the-art performance. This demonstrates that TOATOD optimizes well on the given task while retaining the prior knowledge learned from the pre-trained network.

End-to-End Response Generation
We test our methods on the end-to-end response generation (NLG) task with MultiWOZ 2.1 and 2.2, as in the DST evaluation. Four metrics are used to measure the quality of the generated responses. We measure whether the system provides the appropriate entity (Inform rate), answers all the requested information (Success rate), and responds fluently (BLEU score). The Combined Score for end-to-end response generation is computed as BLEU + 0.5 × (Inform + Success). Under the end-to-end setting, the models have to predict proper dialogue states and then generate responses based on those states.
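As a small worked example of the Combined Score formula above, the sketch below evaluates it for arbitrary input values. The function name and the numbers are illustrative assumptions and are not results from this paper.

```python
def combined_score(bleu: float, inform: float, success: float) -> float:
    """Combined Score = BLEU + 0.5 * (Inform + Success), all on a 0-100 scale."""
    return bleu + 0.5 * (inform + success)


# Arbitrary illustrative values: BLEU 20, Inform 90, Success 80.
print(combined_score(20.0, 90.0, 80.0))  # 20 + 0.5 * 170 = 105.0
```

Because Inform and Success are averaged while BLEU is added at full weight, a one-point change in BLEU moves the Combined Score as much as a two-point change in either rate.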

Evaluation Result
From Table 3, our models achieve comparable results (within 2 points of the SOTA model) on all datasets. The adapter module allows each part of the model to be fine-tuned independently, so we can optimize the DST and NLG tasks separately. Our model performs well on the NLG task based on the belief states from the DST module and the base knowledge gained from the large-scale dialogue dataset during pre-training. TOATOD base attains the best scores on the Inform and Success rates. This shows that the reinforcement learning in our approach is effective for adjusting the trade-off between the BLEU score and the other metrics.

Table 4: [...] (Casanueva et al., 2020), and * are from the leaderboard for the DialoGLUE paper (Mehri et al., 2020b) and the benchmark (footnote 3).

Intent Classification
We test our models on the NLU task with Banking77, CLINC150, and HWU64. Intent prediction is the task of identifying the intent behind a given input. The task is normally framed as a classification problem, so we use turn accuracy as the metric.

3 https://eval.ai/web/challenges/challenge-page/708/leaderboard/1943

Evaluation Result
From Table 4, while our models do not achieve the highest score on the NLU task, they perform well relative to their size when compared to other models. This highlights the effectiveness of our task-optimized adapter approach in balancing model performance and efficiency.
Further Analysis and Discussion
The BLEU scores fall slightly, but the Combined Score rises along with Inform and Success, which means that incorporating reinforcement learning into the training process leads our model to complete tasks more effectively. We conduct hyperparameter tuning only for TOATOD base, so there is some gap in the degree of performance improvement between TOATOD small and TOATOD base.

Hyperparameters of REINFORCE
The hyperparameter α balances the importance of the cross-entropy loss and the policy loss. We re-train the DST and NLG modules by optimizing the mixed objective function to obtain higher rewards.
The second hyperparameter, β, is a scaling factor that controls the trade-off between the BLEU score and the Success rate for the NLG module. The BLEU score is calculated based on the number of matching n-grams between the generated text and the reference text. A higher score indicates that the generated text is more similar to the label text. The Success rate measures the proportion of dialogues in which the model successfully completes the task. A higher Success rate means the model is better at achieving the user's goal in the dialogue. In some cases, however, the model may generate text that is close to the reference text (high BLEU score) but not relevant to the current dialogue (low Success rate). We therefore aim to reduce the gap between the two metrics via hyperparameter tuning.
In Table 7, the result of the α experiment on the DST task indicates that α = 1.0 yields the best performance, suggesting the significance of the policy loss in the overall loss function. By optimizing the policy loss, the model is encouraged to make decisions that result in high rewards, which improves the performance on the DST task.
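The n-gram matching that underlies the BLEU score described above can be sketched as a clipped n-gram precision. This is only one ingredient of BLEU: the brevity penalty and the geometric mean over n-gram orders are deliberately left out, and the function names are illustrative assumptions.

```python
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(candidate: List[str], reference: List[str], n: int) -> float:
    """Share of candidate n-grams that also occur in the reference.

    Each candidate n-gram is credited at most as often as it appears in the
    reference, so repeating a matching word does not inflate the score.
    """
    cand = ngrams(candidate, n)
    ref = ngrams(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total


# Toy example: the repeated "the" is only credited once.
print(clipped_precision(["the", "the", "cat"], ["the", "cat"], 1))
print(clipped_precision(["the", "the", "cat"], ["the", "cat"], 2))
```

A response can score well on this kind of surface overlap while still missing the requested information, which is the gap between BLEU and Success rate that the β weighting is meant to address.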

α and β of NLG-optimized adapter
We experiment with several combinations of α and β within {α: 0.3, 0.5, 0.7, 0.9, 1.0 / β: 0.4, 0.5, 0.6, 0.7}. Regardless of the hyperparameters, performance improves after applying reinforcement learning. The hyperparameters α=0.5 and β=0.7 yield the best Combined Score, which is 5.44 points higher than that of supervised learning. As shown in Figure 3, in the experiment that removes the impact of the CE loss (α=1), we found that the higher the β, the more the Success rate increases relative to the BLEU score, as we expected.
As shown in Table 6, when β is fixed, a bigger weight on the CE loss (smaller α) ensures a higher BLEU score. On the other hand, the Success rate starts to decline from a certain point, which is where the trade-off appears. Therefore, α and β need to be properly tuned for good performance on the NLG task.

Conclusion
We propose TOATOD, task-optimized adapters for an end-to-end task-oriented dialogue system. By adopting task-optimized adapters, we utilize end-to-end models without updating the pre-trained parameters and enable per-task debugging, which differs from previous research. In addition, we apply the REINFORCE algorithm with a metric-aware reward function directly not only to the NLG task but also to the DST task to prevent score degradation. As a result, we attain performance comparable to the previous SOTA models on every benchmark with a very small number of trainable parameters. For the DST task of MultiWOZ 2.2, our TOATOD model outperforms the current SOTA systems.

Limitations
We train the task-optimized adapters on top of the pre-trained weights of a dialogue LM. Therefore, if applied to other dialogue tasks such as chit-chat and conversational QA, the performance could be lower than that shown in our research. Future work is also needed to clarify why the performance was better on the MultiWOZ 2.2 dataset; we expect that this is because our model does not overfit to the confused labels. Our model performs inference in an end-to-end manner but is trained like a modular system for each task. End-to-end learning is currently under study. We could adapt multi-task end-to-end learning to our method, which may lead to better performance. We could also analyze the inner workings of the task-optimized adapters by applying XAI techniques.

Ethics Statement
We honor the ethical codes set out in the ACL Code of Ethics. All of the datasets used in our study are from previous studies and do not have privacy issues.

[...] 51.59. It is important to carefully choose the bottleneck dimension when using the adapter module in a task-oriented dialogue system. As described in Table 1, the number of trainable parameters of our model with a bottleneck dimension of 256 is still significantly smaller than the number of PLM parameters, so we choose the size of 1/2.

C Implementation Details
We used 8 A100 (80GB) GPUs, but all of them were in use only during reinforcement learning; during supervised learning, we used 4 GPUs. For reinforcement learning of the DST task-optimized adapters, we set the learning rate to 1e-5 and the batch size to 32 (utterance-level). In contrast, for reinforcement learning of the NLG task-optimized adapters, we set the learning rate to 1e-6 and the batch size to 4 (session-level). During the entire training process, we set the random seed to 42. For the NLU task, we used 1 RTX A5000 (24GB) GPU and trained the models without reinforcement learning. We set the batch size to 64 and performed single runs with random seeds. When training T5 small and T5 base on Banking77 and CLINC150, we used learning rates of 0.001 and 0.15. For T5 small and T5 base on HWU64, we used learning rates of 0.01 and 0.1.

Figure 1: Overview of the Task-Optimized Adapters for an End-to-End Task-Oriented Dialogue System

Figure 2: Architecture of the sub-modules in TOATOD.A red line indicates the forward path during DST inference.The DB result, based on the output of DST inference, is used for NLG inference sequentially.All of the adapter layers (NLU, DST, NLG) share the same architecture.

Table 1: This table shows the size of the pre-trained T5 model (frozen shared parameters) and of the trainable adapter per task in our models. We do not experiment with a large model, because the trainable parameter size of TOATOD large is bigger than T5 base.

As illustrated in Figure 2, the adapters are inserted after the feed-forward layer following the multi-head attention layer of the transformer blocks of the end-to-end model. This enables the model to learn task-specific representations while preserving the shared parameters learned during pre-training. As described in Table 1, our model consists of large frozen shared parameters per task and a small number of trainable parameters that account for about 14% of the entire network. While the original network's parameters are frozen, the j-th adapter of task i ∈ {NLU, DST, NLG}, A_ij, is computed as in Equation (1).

Table 2: Joint Goal Accuracy for DST results. We use the best models after hyperparameter tuning and applying REINFORCE. Our models get a slightly higher score on the 2.2 dataset. Trainable Params denotes the number of trainable parameters of the network. The values with * are from SPACE-3 (He et al., 2022a).

HWU64 (Liu et al., 2019). This dataset consists of 25,716 examples. It maps user utterances to structured, but more abstract. The data provides annotations for the 64 intents from 21 different domains.

Table 3: Inform, Success, BLEU, and Combined Score for NLG results. All results of other models are cited from the official leaderboard. The values with * are from RSTOD. For the MultiWOZ 2.2 evaluation, we used our models trained on MultiWOZ 2.1 after replacing the DST-optimized adapter with the one trained on MultiWOZ 2.2.

Table 5: Task performance of TOATOD base before and after applying REINFORCE. SL means supervised learning and RL means reinforcement learning. The test results of TOATOD small are attached in Appendix B.

Table 7: Hyperparameter experiment with α on the DST task. We test TOATOD base on MultiWOZ 2.1.