Learning Better Intent Representations for Financial Open Intent Classification

With the recent surge of NLP technologies in the financial domain, banks and other financial entities have adopted virtual agents (VAs) to assist customers. A challenging problem for VAs in this domain is determining a user's reason or intent for contacting the VA, especially when the intent was unseen or open during the VA's training. One method for handling open intents is adaptive decision boundary (ADB) post-processing, which learns tight decision boundaries from intent representations to separate known and open intents. We propose incorporating two methods for supervised pre-training of intent representations: prefix tuning and fine-tuning just the last layer of a large language model (LLM). With this proposal, our accuracy is 1.63%-2.07% higher than the prior state-of-the-art ADB method for open intent classification on the banking77 benchmark, among others. Notably, we only supplement the original ADB model with 0.1% additional trainable parameters. Ablation studies also determine that our method yields better results than fully fine-tuning the entire model. We hypothesize that our findings could stimulate a new optimal method of downstream tuning that combines parameter-efficient tuning modules with fine-tuning a subset of the base model's layers.


Introduction
As the popularity of virtual agent (VA) dialogue systems increases and their application in the finance domain is explored, the problem of intent classification demands greater attention. Several recent finance-specific VAs leverage technical advancements to respond to natural language queries (Galitsky and Ilvovsky, 2019; Khan and Rabbani, 2020). Determining the user's intent ensures that the VA can appropriately tailor its responses and/or perform relevant actions. Initial works in intent classification limited the task to classifying utterances as one of N known intents

Utterance                            Label
When will I get my card?             Card Arrival
What exchange rates do you offer?    Exchange Rate
My card hasn't arrived yet.          Card Arrival
Is it a good time to exchange?       Exchange Rates
...                                  ...
Is it possible to get a refund?      Open
Why has my withdrawal not posted?    Open

Table 1: Example user utterances and associated intent labels from the banking77 dataset (Casanueva et al., 2020). In this example, only the Card Arrival and Exchange Rate intents were known in training, and thus refund- and withdrawal-related requests are Open intents in this context.
and achieved high accuracy (Weld et al., 2021). However, as depicted in Table 1, real-world applications often encounter intents unseen in the training data that can be considered open in the current context. Accounting for the open class establishes an (N + 1)-class classification task (Shu et al., 2017), where the open class is used as a label for any unidentified intent.
An optimal classifier for this problem must balance correctly labelling known-class utterances while avoiding mistakenly classifying open utterances as one of the known classes. (Zhang et al., 2021a) address this problem by proposing a novel loss function to learn an adaptive decision boundary (ADB) for each known intent. At inference, samples that do not fall within any ADB are classified as open. Compact intent representations are required as input for the ADB post-processing learning step; in the case of (Zhang et al., 2021a), the representations are learnt by fine-tuning the last layer of BERT (Devlin et al., 2019). Since most intent classification methods require post-processing on intent representations, our work focuses on deriving richer representations by leveraging large language models (LLMs) in an efficacious manner while still minimizing trainable parameters.
Following the introduction of the transformer in (Vaswani et al., 2017a), an influx of LLM architectures has continually progressed state-of-the-art (SOTA) performance on many natural language processing (NLP) tasks (Otter et al., 2021). Usually these models are pre-trained on a general self-supervised learning task, after which they are fine-tuned for a specific task. Fine-tuning such a model can be computationally prohibitive due to the immense number of trainable parameters. Furthermore, (Kaplan et al., 2020) found that the most important factor for LLM performance is likely model size, indicating that the development of even larger models is probable. Inspired by in-context prompting, (Li and Liang, 2021) proposed prefix tuning as a parameter-efficient alternative to fine-tuning for natural language generation (NLG). The LLM's parameters are frozen and trainable prefix tokens are prepended to the input sequence. Prefix tuning has been adapted to natural language understanding (NLU) and performs comparably to full fine-tuning across scales and tasks (Liu et al., 2022).
We achieve SOTA results by augmenting the pre-training architecture of ADB open intent classification (Zhang et al., 2021a) with prefix tuning. The combination of prefix tuning with fine-tuning only the last transformer layer was motivated by (Kumar et al., 2022), which discovered that fine-tuning the entire model can distort pre-trained features. We find that, alone, both prefix tuning and fine-tuning the last layer under-perform full fine-tuning of BERT, but when trained in tandem they exceed it.
The rest of this paper is structured as follows: Section 2 summarizes prior works in both intent classification and parameter-efficient tuning (PET). Our methodology and model architecture are defined in Section 3. In Sections 4 and 5 respectively, we provide our experimentation structure and corresponding results, as well as several ablations. We finish with a conclusion and a brief discussion regarding limitations and ethics.

Financial Virtual Agents
The effectiveness of VAs has led to their adoption in the financial domain. (Galitsky and Ilvovsky, 2019) demonstrated an exemplary session with a financial VA where the user queried for investment advice. CalFE leverages commercial chatbot frameworks to train a finance-specific VA (Khan and Rabbani, 2020). (Ng et al., 2020) evaluates the impact of a VA's social presence on usage intention in VAs for finance. All of these works require extracting intent from user utterances.

Intent Detection
Intent classification is a well-established NLU task, but most research limits the problem to known classes (Zhang et al., 2019; E et al., 2019; Qin et al., 2019; Zhang et al., 2021b). While having prior knowledge of all expected intents is ideal, this is rarely possible in a production environment, especially for new dialogue systems. More realistically, a subset of intents are anticipated and new intents are discovered after deployment. (Brychcín and Král, 2017) recognized the challenge of identifying intents prior to training and proposed an unsupervised method to group intents, but by doing so, likely ignored information available in the already identified intents. (Xia et al., 2018) employed zero-shot learning to identify emerging intents but used an LSTM, which is hindered by non-parallelized learning and challenges in propagating long-range dependencies. The same issue is present in DeepUnk, a BiLSTM-based intent classification method using margin loss (Lin and Xu, 2019). (Zhan et al., 2021) shared our open intent classification problem formulation but synthetically generated out-of-domain samples for training, which may not be as realistic as a fine-grained open class representation.
Our work directly extends the ADB approach to establishing an open class representation (Zhang et al., 2021a). The novelty of our adaptation is in leveraging prefix tuning in combination with partial fine-tuning to improve the pre-training of known intent representations without drastically increasing the number of trainable parameters. In parallel with our work, (Zhang et al., 2022) extended their ADB approach to learn distance-aware intent representations. Doing so resulted in comparable performance to our modification of their original approach. However, our tuning method is model-agnostic and can easily be incorporated with their distance-aware representation learning, likely improving the SOTA further.

Parameter Efficient Tuning
The desire for PET quickly emerged following the introduction of LLMs. Adapter modules insert task-specific parameters sequentially between transformer layers while the rest of the model remains frozen (Houlsby et al., 2019). (Li and Liang, 2021) and (Lester et al., 2021) simultaneously substantiated the efficacy of prepending tokens to attention mechanisms as a means of efficient tuning. In (Li and Liang, 2021), the prefixes are applied at each layer of the transformer, while (Lester et al., 2021) only prepends to the input sequence. (Liu et al., 2022) applied the same method to NLU tasks using deep prefixes with optional reparameterization. Without reparameterization, simple embeddings are learnt for the prefixes. Reparameterization inserts a multilayer perceptron (MLP) between the embeddings and prefix tokens, which allows for more complex embeddings.
Recently, (He et al., 2022) determined the theoretical impact of various PET methods and deduced that they are all modifications of a similar function. Allocating additional parameters to other PET modules as suggested by (He et al., 2022) could optimize intent representations beyond what is possible with prefixes alone. For now we limit our work to the most efficient method for low-resource settings, prefix tuning. To the best of our knowledge, this is the first PET work to combine partial fine-tuning with prefix tuning.

Methodology
In this section we explain our procedure for open intent classification. Section 3.1 describes prefix tuning, the method with which we supplement partial fine-tuning. Section 3.2 provides a brief summary of training the original ADB method that we have extended (Zhang et al., 2021a).

Prefix-Tuning
Prefix tuning prepends trainable prefix tokens P_k, P_v in front of the Key and Value vectors of multi-head attention in each transformer layer. The attention mechanism is then applied to the concatenation of prefix and original tokens. Equation 1 details the computation:

    head = Attn(Q, [P_k; K], [P_v; V]) = softmax( Q [P_k; K]^T / sqrt(d_k) ) [P_v; V]    (1)
where Q, K, and V are the Query, Key, and Value matrices from the original transformer (Vaswani et al., 2017a). Often, prefix-tuning methods use an MLP to reparameterize the prefix, since direct embedding can lead to unstable training and decreased performance (Li and Liang, 2021). However, (Liu et al., 2022) found that for NLU tasks, the efficacy of reparameterization is task-dependent. From our experiments, we determine that reparameterizing the prefixes is crucial for intent classification. Following training, the MLP weights and biases from reparameterization are dropped and only the prefixes are kept.
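The prefix mechanism above can be sketched in a few lines of code. The following is a minimal single-head, pure-Python illustration of attending over Key/Value sequences extended with prefix tokens; the function name and the list-of-vectors representation are ours for clarity, and a real implementation would operate on batched tensors at every transformer layer.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attn_with_prefix(Q, K, V, P_k, P_v):
    """Single-head attention with prefix tokens prepended to the Key
    and Value sequences, as in Equation 1. Q, K, V are lists of
    d-dimensional vectors; P_k, P_v hold the prefix tokens."""
    K_ext = P_k + K          # [P_k; K]
    V_ext = P_v + V          # [P_v; V]
    d_k = len(K_ext[0])
    out = []
    for q in Q:
        # Scaled dot-product scores over the extended Key sequence.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K_ext]
        weights = softmax(scores)
        # Weighted sum over the extended Value sequence.
        out.append([sum(w * v[j] for w, v in zip(weights, V_ext))
                    for j in range(len(V_ext[0]))])
    return out
```

Note that the output length still matches the original query sequence; only the keys and values are lengthened, so the prefixes influence every position's output without adding tokens to the sequence itself.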

Training
Figure 1 illustrates our pre-training architecture of prefix tuning plus tuning the last transformer layer to extract intent representations. The orange components of the diagram are trainable and the blue are frozen. This example shows a prefix length of two, but the length is a flexible hyperparameter. We detail our entire hyperparameter settings in Section 4.2. The outputs of BERT are first fed into a mean-pooling function to aggregate the sequence into a single vector x_i, as described by Equation 2:

    x_i = mean-pooling([CLS, T_1, ..., T_M])    (2)

where i refers to the current training sample and CLS, T_1, ..., T_M are the final hidden states of the classification token and the M input tokens. A dense layer transforms the vector to the intent representation feature space and the resultant vector is finally passed to a linear classifier. We pre-train on known intents and their labels with softmax as the loss function to optimize both the prefix tokens and the last transformer layer. Equation 3 is the softmax loss:

    L_s = -(1/n) * sum_{i=1}^{n} log( exp(z_{y_i}) / sum_j exp(z_j) )    (3)

where n is the batch size, y_i is the label of the i-th sample, and z_j refers to the output logit of the j-th class. Following pre-training, the intent representations are extracted from our model for ADB post-processing. ADB learns a tight spherical decision boundary for each known intent. At inference, intent representations that fall outside of all decision boundaries are classified as open. For clarification, the only alteration we make to the ADB method is the addition of the prefix tokens in Figure 1. See (Zhang et al., 2021a) for more information regarding decision boundaries and other training details.
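The ADB inference rule described above can be sketched as follows. This is a simplified illustration that assumes the learnt centroids and boundary radii are already given; the variable names are ours, and the actual method of (Zhang et al., 2021a) learns the radii with a specialized boundary loss rather than taking them as input.

```python
import math

def adb_classify(x, centroids, radii):
    """Classify an intent representation x against spherical decision
    boundaries: return the label of the nearest known intent whose
    boundary contains x, otherwise the open class."""
    best_label, best_dist = None, float("inf")
    for label, c in centroids.items():
        # Euclidean distance from x to this intent's centroid.
        dist = math.sqrt(sum((xj - cj) ** 2 for xj, cj in zip(x, c)))
        if dist <= radii[label] and dist < best_dist:
            best_label, best_dist = label, dist
    return best_label if best_label is not None else "open"
```

For example, with a centroid per known intent and a unit radius, a representation near the "card arrival" centroid is assigned that label, while a representation outside every boundary falls through to "open".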

Datasets
BANKING: A dataset of 77 banking intents comprising 13,083 banking-specific customer service queries (Casanueva et al., 2020). It is also commonly referred to as "banking77", but (Zhang et al., 2021a) uses "BANKING"; since we compare our results primarily with theirs, we conform to their choice.
OOS: A subset of CLINC150 specifically designed for out-of-scope intent prediction (Larson et al., 2019), with 22,500 in-domain and 1,200 out-of-domain samples over 150 different intents spanning 10 domains.

StackOverflow:
The processed version of the StackOverflow dataset (Xu et al., 2015), which has 20 different intents and 1,000 samples for each.

Experiment Settings
In accordance with previous methods, we randomly sample 25%, 50%, and 75% of intent classes during training as the "known" classes. The remaining classes are set aside as open and removed from the training sets. We use BERT (bert-base-uncased) provided by Hugging Face (Wolf et al., 2020) to extract intent representations from utterances. The learning rate for the prefixes and transformer parameters is set to 2e-5, since experimenting with different learning rates for the prefixes and the last transformer layer did not consistently lead to a performance increase. All experiments are conducted on an NVIDIA 2080 Ti GPU. To fairly compare our method, we keep all other hyperparameters the same as (Zhang et al., 2021a). For all results we average performance over ten random seeds.
Regarding prefix-specific settings, we use reparameterization with a hidden size of 512 unless otherwise specified. The overall parameter size is determined by the prefix length. In this task, we found that enlarging the prefix length did not lead to a consistent performance increase due to its low-rank bottleneck. (He et al., 2022) also discusses that allocating additional parameters in self-attention is only worthwhile if they make up less than 0.1% of the parameter budget. Therefore, we choose our default prefix length as 10, which equates to roughly 0.1% of BERT's trainable parameters.
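The prefix parameter budget can be estimated directly from the model dimensions. This is a back-of-the-envelope sketch under our assumptions: BERT-base's standard configuration of 12 layers and hidden size 768, an approximate total of 110M parameters, and the reparameterization MLP excluded since it is dropped after training.

```python
# Deep prefix tuning learns key and value prefix vectors at every layer.
prefix_len = 10     # default prefix length used in this work
n_layers = 12       # BERT-base transformer layers
hidden = 768        # BERT-base hidden size
bert_params = 110_000_000  # approximate BERT-base parameter count

prefix_params = prefix_len * n_layers * 2 * hidden  # 2 = key and value
print(prefix_params)                                # 184,320 parameters
print(100 * prefix_params / bert_params)            # ~0.17% of BERT
```

This rough count lands on the order of 0.1% of BERT-base, consistent with the budget quoted above, while a prefix length of 100 would already exceed 1.5%.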

Baselines
We compare our results to the most competitive open intent classification methods: DeepUnk (Lin and Xu, 2019), (K + 1)-way (Zhan et al., 2021), and the ADB method we directly extend (Zhang et al., 2021a). The DeepUnk results are taken from (Zhang et al., 2021a), which replaced the BiLSTM with BERT to generate intent representations for a fair comparison. (Zhan et al., 2021) also uses BERT as its encoder but keeps just the CLS token's final hidden state instead of pooling the entire sequence. (Zhan et al., 2021) did not test on the same OOS split, and cells corresponding to that configuration are left blank in the tables in Section 5.

Results
Our main results and respective baseline comparisons are presented in Tables 2 and 3. Table 2 is limited to accuracy averaged over all classes, including the open class, and macro F1 over the same set of classes. For a fine-grained analysis of open intent performance, Table 3 contrasts the F1 score of the open class with the macro F1 over the remaining known classes. PFT-ADB denotes our method of adding prefix tuning to ADB, and the best result for each section is in boldface.
For each dataset we tested, PFT-ADB improves on all prior methods, with the minor exception of the StackOverflow F1 score and known score. Specifically, as shown in Table 2, we achieve accuracy improvements of (1.63%, 1.95%, 2.07%) on BANKING, (0.50%, 1.22%, 3.03%) on OOS, and (1.01%, 0.76%, 0.82%) on StackOverflow for known intent ratios (25%, 50%, 75%). The consistency of our results across configurations suggests that paying closer attention to pre-training intent representations can enhance the distinction of decision boundaries in the post-processing step. Additionally, we do not add a significant number of trainable parameters to existing methods (only 0.1%), successfully avoiding trading substantial costs for a performance increase. Note that our results are comparable to those of the most recently released DA-ADB (Zhang et al., 2022) model. We believe that, due to their orthogonal nature, DA-ADB and our approach could be combined for further performance improvements.
We note that the dataset with the lowest performance gain is StackOverflow. (Zhang et al., 2021a) found that their novel post-processing method, ADB, was most effective on this dataset compared to prior methods. They hypothesized that this was due to being able to form tighter decision boundaries for the technical jargon more prevalent in StackOverflow. Following this reasoning, it could be that for this dataset the post-processing method is paramount, and enriching the intent representations alone is not enough to yield a substantial performance improvement.
It is important that an open intent classification method balances performance on known classes while still identifying open intents. Table 3 verifies that despite changing pre-training tuning methods, ADB post-processing still adequately addresses this issue. The performance increase is consistent between both the open class and the known classes for each dataset, indicating that prefix tuning does not interfere with optimizing both aspects of the open intent problem. Again, we anticipate that combining PFT-ADB with the newer DA-ADB could result in even better performance.
The following ablations focus on the OOS dataset since it covers multiple domains and we wanted to generalize beyond just the financial domain.

Effect of Reparameterization and Tuning Variations
In Table 4 we show that, under the same dataset and known intent ratio, performance varies considerably when adopting an MLP as the prefix encoder. In the first row, the embedding-only method leads to poor results of 64.40% accuracy. Contrarily, introducing a 2-layer MLP to encode the prefixes increases performance by around 15%. More importantly, the result is stable and reproducible, indicating that using an MLP to reparameterize the prefixes is crucial for obtaining consistent performance. Results using prefix tuning alone (rows 1 and 2) in this task are slightly worse than ADB's fine-tuning results. In particular, the performance gap in identifying the open intent is more salient, revealing prefix tuning's lower capacity for out-of-scope classification. However, when we incorporate prefix tuning along with tuning the last transformer layer, we find a surprisingly large performance increase. For the embedding and MLP methods, tuning the last transformer layer boosts performance to 86.40% and 90.07%, respectively, with only an additional 0.1% of ADB's parameters. Since the latter transformer layers capture high-level features of utterances, we believe that this small number of parameters steers the higher layers to learn more task-oriented information as well as fit intents into a better-distributed latent space.
We also try the common method of full fine-tuning, i.e., unfreezing all of BERT's parameters, which was not done in (Zhang et al., 2021a). Its performance is still 1% lower than our method's, while we use only 8.1% of the parameters.

Impact of Prefix Lengths
We experimented with the prefix length to determine its effect on performance. From Table 5, we observe that as the prefix length increases from 10 to 100 (parameter size from 0.1% to 1.6%), the results do not follow the same ascending pattern. We argue that simply adding more prefix tokens does not lead to a consistent performance boost due to its bottleneck. (He et al., 2022) determined that prefix tuning is another form of low-rank update, which cannot make use of more than 0.1% of additional parameters.

Fine-Tuning Different Groupings of Layers
Combining prefix tuning with fine-tuning a subset of transformer layers is, to the best of our knowledge, a novel approach. Fine-tuning the last layer alone is ideal for minimizing trainable parameters. We aim to determine whether varying which layer, or group of several layers, is unfrozen can achieve better results than the last layer alone. Table 6 summarizes our findings. The layer of interest is specified with the variable x: "Just x" is fine-tuning layer x alone, and "x and Rest" is fine-tuning layer x and all subsequent layers. Using this notation, "Layer 1 and Rest" is akin to fine-tuning all of BERT. "No-FT" refers to prefix tuning without any additional fine-tuning. For this row, and when x is 12, the results between the two main columns are of course the same. Several interesting observations are evident in Table 6. Firstly, fine-tuning at least one layer in addition to prefix tuning is strictly necessary for optimal performance. Under the constraint of tuning just a single layer, the last performs best. The latter layers of the model are where higher-level details of natural language are processed. We hypothesize that tuning this layer best incorporates the propagation of prior prefixes with the base model. Tuning earlier layers may have a similar effect, but if the subsequent layers are frozen, the understanding of prompts is obfuscated, since the latter frozen layers have no experience attending to prefixes.
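The two tuning schemes in Table 6 can be made precise with a small helper. This is an illustrative sketch; the function and its names are ours, assuming BERT-base's 12 transformer layers numbered from 1.

```python
def trainable_layers(scheme, x, total=12):
    """Return the transformer layer indices left unfrozen under the
    notation of Table 6: "just" tunes layer x alone, while "and_rest"
    tunes layer x and all subsequent layers. The prefix tokens are
    trainable under every scheme."""
    if scheme == "just":
        return [x]
    if scheme == "and_rest":
        return list(range(x, total + 1))
    raise ValueError(f"unknown scheme: {scheme}")
```

Under this notation, "Layer 1 and Rest" unfreezes all twelve layers (full fine-tuning of BERT), while "Just 12" matches our default of tuning only the last layer alongside the prefixes.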
Another notable finding is that if performance is to be prioritized, fine-tuning the final two layers together is better than the last layer alone. This suggests that the prefixes are complex enough that their value is maximized when the final two layers are tuned in tandem. However, the trade-off of a minor performance increase at the cost of doubling the trainable parameters may not be worthwhile, depending on the application. Lastly, we note that as layers beyond the final two are trained in the "x and Rest" column, performance begins to degrade. This supports the observation made by (Kumar et al., 2022) that fine-tuning disturbs pre-trained features in the base model. Training only the final layer(s) avoids perturbing low-level semantics learnt in earlier layers of the base model, but still adds sufficient capacity to attend to the prefixes.

Fine-Tuning Various Components in Last Layer
While fine-tuning only the last transformer layer reduces the trainable parameter count to 8%, this is still large compared to the 0.1% parameter count of the prefixes alone. We isolate various components of the last transformer layer to determine whether some could be frozen to further reduce the parameter count. The results are presented in Table 7. Tuning the entire layer significantly outperformed every other variation, suggesting that there is an important relationship between the prefixes and every component of the final transformer layer.
Tuning each of the components in the last layer is essential to procure maximum prefix performance.

Conclusion
We have shown that incorporating prefix tuning into the ADB intent representation pre-training method achieves SOTA results in the financial domain on the banking77 benchmark dataset, among others. Furthermore, our tuning method does not sacrifice an excessive parameter count for the performance gain. The combination of prefix tuning with fine-tuning only the last transformer layer is simple yet, to the best of our knowledge, novel, and surfaces interesting questions regarding the mechanisms through which they interact. We intend to address the limitations presented hereafter in the near future.

Limitations
Despite achieving SOTA results on open intent classification tasks, our work has several facets that could be developed further. Firstly, we tune the last transformer layer along with the prefixes, making our method less parameter efficient than prefixes alone. Other approaches to fine-tuning the last layer of the transformer during pre-training should be investigated. Moreover, this work does not include any other PET method such as adapter tuning (He et al., 2021) or LoRA (Hu et al., 2021). We anticipate that using other PET methods will reveal new observations regarding their interaction with partial fine-tuning. We also restrict our study to simple single-intent dialogues, while industry-deployed models would likely encounter noise as well as multiple intents; testing the robustness of our method under these conditions could be valuable. Lastly, we plan to research whether our success with prefix tuning in combination with partial fine-tuning generalizes to other NLU and financial tasks.

Ethics Statement
Recent impressive achievements in NLP thanks to the advent of LLMs do not come without cost. Most relevant to our paper are the environmental impact and inequitable distribution of such technologies (Strubell et al., 2019). The resources required to train an LLM are large, which from the environmental perspective increases our contribution to climate change, and from an equity perspective limits who can access, research, and use the model. While the self-supervised pre-training step often has the greatest resource requirements, fine-tuning LLMs is undertaken by many more parties following a model's public release. The numerous task-specific deployments of popular models likely have greater net CO2 emissions than the initial pre-training. Our work directly combats this concern by promoting parameter-efficient tuning as an efficacious alternative to relatively expensive fine-tuning. The fraction of trainable parameters reduces tuning memory requirements, in turn reducing power consumption and environmental impact. Additionally, the reduction in required memory enables the adoption of LLMs by those who do not have access to expensive high-quality hardware or cloud platforms. Finally, storing copies of the model for each task is efficient: only a single copy of the frozen LLM is needed, along with the smaller prefixes and, in our case, the trained last transformer layer, resulting in similar benefits to the reduction of memory.

Table 3 :
Open and known comparison of main results for known intent ratios 25%, 50%, and 75% on the BANKING, OOS, and StackOverflow datasets. F1 score and macro F1 score are reported for the open class and known classes respectively.

Table 5 :
Results of tuning with different prefix lengths, using the OOS dataset with a 75% known intent ratio.