Making Pre-trained Language Models Better Learn Few-Shot Spoken Language Understanding in More Practical Scenarios

Most previous few-shot Spoken Language Understanding (SLU) models typically need to be trained on a set of data-rich source domains and then adapted to the target domain with a few examples. In this paper, we explore a more practical scenario for few-shot SLU, in which we only assume access to a pre-trained language model and a few labeled examples, without any other source domain data, and we concentrate on understanding how far few-shot SLU can be pushed in this setting. To this end, we develop a prompt-based intent detection model for few-shot settings, which leverages BERT's original next sentence prediction pre-training task and a prompt template to detect the user's intent. For slot filling, we propose an approach of reconstructing slot labels, which reduces training complexity by reducing the number of slot labels in few-shot settings. To evaluate few-shot SLU in this more practical scenario, we present two benchmarks, FewShotATIS and FewShotSNIPS, and design a dynamic sampling strategy that constructs the two datasets according to the learning difficulty of each intent and slot. Experiments on FewShotATIS and FewShotSNIPS demonstrate that our proposed model achieves state-of-the-art performance.


Introduction
Spoken Language Understanding (SLU) is one of the fundamental modules of task-oriented dialogue systems and mainly includes two sub-tasks, intent detection and slot filling. The remarkable success of most neural SLU models typically relies on a large quantity of training data (Qin et al., 2020, 2021; Wang et al., 2022). However, acquiring large amounts of annotated data for specific domains is arduous and expensive in practical applications. A brand-new application may have few or even no training examples, which motivates us to address the challenge of the SLU module in few-shot settings.
Previous few-shot SLU studies mainly focused on semi-supervised learning (Basu et al., 2021; Gaspers et al., 2021; Kumar et al., 2022) and metric learning (Hou et al., 2020a; Krone et al., 2020a; Yang and Katiyar, 2020; Hou et al., 2021a; Gao et al., 2022; Hou et al., 2022; Yang et al., 2022). These models need to be trained on source domains with abundant data and then adapted to the data-scarce target domain. Departing from this pattern by making minimal assumptions about available resources, we explore a more practical scenario for few-shot SLU, in which we only use moderately sized pre-trained models, such as BERT (Devlin et al., 2018) or RoBERTa (Liu et al., 2019), and only a tiny amount of annotated examples to fine-tune the pre-trained model. This setting is attractive because (1) few-shot settings conform more closely to real application scenarios, as it is straightforward to obtain a few annotated examples (e.g., 10 examples); (2) training such models does not demand extensive hardware resources; and (3) the development of prompt tuning has brought a novel paradigm for few-shot learning, and updating the parameters of the pre-trained model typically leads to better performance (Nigam et al., 2019; Han et al., 2021, 2022; Zhu et al., 2022a). We focus on exploring what performance a few-shot SLU model can achieve without any other source domain data.
We revisit existing few-shot SLU benchmarks (Larson et al., 2019; Hou et al., 2020a,b) and find that their settings deviate from practical application scenarios for two main reasons. (1) These benchmarks provide data-rich source domains for training, such as SNIPS (Hou et al., 2020a) and CLINC (Larson et al., 2019). However, in practical applications, there might be little or even no training data available.
(2) Some benchmarks uniformly sample k examples at random for each intent and slot. However, in practical applications, the learning difficulty varies across intents and slots and may necessitate different amounts of training data. To fill these gaps, we construct two multi-domain few-shot SLU benchmarks, FewShotATIS and FewShotSNIPS, without any source domain data, based on ATIS (Price, 1990) and SNIPS (Coucke et al., 2018), respectively. In addition, we design a dynamic sampling strategy that allocates the number of samples depending on the evaluation metrics of each intent and slot. This sampling strategy is more reasonable and adheres to practical scenarios, since slight variations in the training data can have a major impact on a model's performance in few-shot settings. The proposed benchmarks provide a fair comparison between various methods on common ground and a way of measuring progress in practical scenarios.
Recently, prompt tuning has achieved competitive results with only a limited amount of training data, e.g., prompt tuning based on BERT with the masked language model (as shown in Figure 1(a)), which has been proven effective for text classification (Gao et al., 2020; Hu et al., 2021; Zhu et al., 2022b). However, it suffers from the following challenges when employed for intent detection.
(1) The candidate vocabulary for each intent must be manually constructed based on natural language templates. (2) The semantics of intent are complex and difficult to represent with fixed-length tokens.
Figure 2: The BIO labeling format and the reconstructed slot label format. W is a token of the dialogue; S and E-T denote slot and entity type.

Moreover, there are some challenges for the slot filling task in few-shot settings. (1) The BIO labeling format (Huang et al., 2015) is typically utilized in standard datasets for slot filling, as shown in Figure 2(a). In the case of sufficient data, the beginning label 'B-' of an entity provides supervised information for the model to detect entity boundaries. Nevertheless, in few-shot settings, the excessive number of slot labels in this format exacerbates the difficulty of model training.
(2) As the number of training samples decreases, it becomes increasingly difficult for the model to differentiate between similar slot labels, such as "from.city_name" vs. "to.city_name" in Figure 2.
To address the aforementioned challenges, we propose a BERT-NSP-Prompt model for intent detection and a reconstructing slot labels approach for slot filling. In particular, BERT-NSP-Prompt leverages the next sentence prediction (NSP) task of BERT and a constructed prompt template to complete intent detection in few-shot or even zero-shot settings, as shown in Figure 1(b). It assesses which intent description text is the most "fluent" following sentence of the user's dialogue. Moreover, an approach of reconstructing slot labels is proposed to reduce model training complexity by reducing the number of slot labels in few-shot settings, as shown in Figure 2(b): we convert the BIO slot labeling format to a slot entity labeling format, which cuts the number of slot categories in half. Furthermore, we introduce the focal loss function (Lin et al., 2017) to distinguish between similar slot labels.
To sum up, our key contributions are as follows.
(1) We investigate more practical scenarios for few-shot SLU and propose the BERT-NSP-Prompt model and a reconstructing slot labels approach.
(2) We construct two multi-domain few-shot SLU benchmarks, FewShotATIS and FewShotSNIPS, to encourage the research community to create algorithms that can demonstrate generalization capabilities with minimal data.
(3) We conduct extensive experiments to demonstrate the effectiveness of our model, and the experimental results reflect that our models achieve state-of-the-art performance on the two datasets.

Related work
Most works use semi-supervised learning or metric learning to implement few-shot SLU tasks (Hou et al., 2020a; Zhu et al., 2020; Krone et al., 2020b; Zhang et al., 2020; Hou et al., 2021b). Hou et al. (2020a) adopted TapNet and label dependency transferring to complete the slot filling task. Krone et al. (2020b) used a prototypical network to complete few-shot SLU tasks. Han et al. (2021) further extended the prototypical network to achieve joint learning, which guarantees that the two tasks can mutually enhance each other. Another line of work adopted a retrieval framework to match the token spans in the input with the most similar labeled spans in a retrieval index. Cui et al. (2021) leveraged prompts for few-shot NER. However, these models still require a set of data-rich source domains. Recently, prompt tuning has taken advantage of the powerful generalization ability of pre-trained language models and dramatically reduces the reliance on supervised data for downstream tasks (Lester et al., 2021; Gu et al., 2021). This paradigm narrows the gap between the pre-trained model and downstream tasks, which provides novel insight into few-shot learning. We propose a simple and effective method based on pre-trained models to accomplish few-shot SLU tasks. The proposed model neither transfers from a source domain to the target domain nor uses any additional data resources, and it will serve as an essential baseline for future exploration. In addition, we believe the constructed FewShotATIS and FewShotSNIPS datasets can comprehensively evaluate models and inspire research on few-shot SLU.

Datasets: FewShotATIS and FewShotSNIPS
In few-shot SLU, existing benchmarks usually uniformly sample k examples at random for each intent and slot, such as SNIPS (Hou et al., 2020a) and CLINC (Larson et al., 2019). However, a more reasonable sampling strategy that conforms to practical scenarios is the following: suppose the total number of samples is limited; easy-to-classify intents or slots are given fewer than k samples, while hard-to-classify intents and slots are given more than k samples. Thus, we design a dynamic sampling strategy to simulate this "sampling-iterative" process, which assigns the number of samples according to the evaluation metrics of each intent and slot. Since small changes in the training data may significantly affect a model's performance in few-shot settings, the advantages of the strategy are magnified further. Finally, we adopt the dynamic sampling strategy to build two new multi-domain few-shot SLU benchmarks, FewShotATIS and FewShotSNIPS.

Dynamic Sampling Strategy
In few-shot settings, models are susceptible to subtle variations of the training data: small changes in the training set may significantly affect a model's performance. More importantly, preliminary experimental results show that some intents and slots are easy to predict correctly from keywords alone, while others are difficult and must be predicted from context and syntactic information. These experiments reflect that different types of intents and slots have different learning difficulties and may require different numbers of training samples. Based on the above discussion, and to better simulate practical applications, we propose a dynamic sampling strategy for constructing a few-shot dataset, as shown in Figure 3. Specifically, suppose the number of samples is fixed in the iteration process (i.e., limiting the total number of samples for each iteration). The dynamic sampling strategy then assigns the number of samples for each intent and slot based on the evaluation metrics at each iteration, and the dataset is built through a "sampling-iterative" process. Compared to random sampling, the dynamic sampling strategy can more reasonably sample data for different intents and slots.
More specifically, few-shot learning experiments usually sample K samples of each category uniformly, called a K-shot experiment (Dong and Xing, 2018), where K ∈ {k_1, k_2, · · · , k_n} is a parameter and the k_1-shot, k_2-shot, · · · , k_n-shot experiments are independent of each other. In the dynamic sampling strategy, however, we treat a k_n-shot experiment as an n-iteration experiment. For the (i + 1)-th iteration, the total number of samples is increased by k_{i+1} − k_i per category on average: easy-to-classify intents or slots receive fewer than (k_{i+1} − k_i) new samples, while hard-to-classify intents and slots receive more than (k_{i+1} − k_i). The pseudo-code is shown in Algorithm 1: for each shot setting k, the strategy assigns num_sampled[0, · · · , N] for the N categories based on the previous F1 scores, samples data for each category accordingly, trains model_k, evaluates model_k on the F1 score metric (computed with BERT), and updates F1[k][0, · · · , N] from the evaluation result.

Table 1 reports the details of FewShotATIS and FewShotSNIPS. k samples are sampled for each intent and slot to form the k-shot dataset. Due to the differences between the intent detection and slot filling tasks, we use different sampling spans when constructing the datasets. During the experiments, we adopt finer-grained and larger-range sampling. Experimental results show that the intent detection task achieves decent performance at 10-shot and the slot filling task at 40-shot, as shown in Appendix A. Therefore, we construct datasets ranging from 2-shot to 10-shot for the intent detection task and from 5-shot to 40-shot for the slot filling task. It is worth noting that the test sets of the two datasets are identical to those of the standard datasets, so the two benchmarks can measure a model's performance comprehensively and solidly. The datasets are constructed to simulate real applications and to encourage the community to build algorithms capable of generalizing with only a few examples per intent and slot.
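The allocation step of the strategy can be sketched as follows. This is a minimal sketch, not the paper's exact Algorithm 1 (which additionally retrains and re-evaluates the model between iterations); the function and variable names are illustrative. It splits a fixed per-iteration budget across categories in proportion to how far each category's F1 score is from perfect:

```python
def allocate_samples(f1_scores, budget):
    """Split a fixed sampling budget across categories so that
    low-F1 (hard) categories receive more new samples than
    high-F1 (easy) ones."""
    # Weight each category by how far it is from a perfect F1.
    difficulty = [1.0 - f1 for f1 in f1_scores]
    total = sum(difficulty)
    if total == 0:  # all categories already perfect: split evenly
        base = budget // len(f1_scores)
        counts = [base] * len(f1_scores)
    else:
        counts = [int(budget * d / total) for d in difficulty]
    # Hand out any remainder from integer rounding, hardest categories first.
    remainder = budget - sum(counts)
    order = sorted(range(len(f1_scores)), key=lambda i: f1_scores[i])
    for i in range(remainder):
        counts[order[i % len(order)]] += 1
    return counts
```

For example, with per-category F1 scores [0.9, 0.5, 0.2] and a budget of 12 new samples, the hardest category (F1 = 0.2) receives the largest share, matching the intuition that hard-to-classify intents and slots need more data.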

Task Definition
Task-oriented dialogue systems are usually oriented to a specific domain, in which the intents and slots are typically predefined and limited in number. Thus, intent detection is considered a text classification task, and slot filling is considered a sequence labeling task. In few-shot settings, the number of samples available for training is much smaller than in a standard dataset, and learning both tasks simultaneously may increase the complexity of model training. Furthermore, it is difficult to unify the few-shot sampling of the two tasks under K-shot sampling (Dong and Xing, 2018). For a given dialogue X = {x_1, x_2, x_3, · · · , x_m}, the intent detection task is to learn a model M_I that predicts the intent I, and the slot filling task is to learn a model M_S that predicts the slot label sequence S = {s_1, s_2, · · · , s_m}.

Figure 4: Illustration of the BERT-NSP-Prompt model. For each dialogue, the input with the prompt template is constructed according to each intent in the dataset.
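Concretely, one annotated training example under this task definition might look like the following. The field names and the sample itself are illustrative, not taken from the datasets:

```python
# One annotated SLU training example: an utterance X = {x_1, ..., x_m},
# a sentence-level intent I, and a per-token slot label sequence in BIO
# format. Field names are illustrative.
example = {
    "tokens": ["book", "a", "flight", "from", "denver", "to", "boston"],
    "intent": "atis_flight",
    "slots":  ["O", "O", "O", "O", "B-from.city_name", "O", "B-to.city_name"],
}

# Intent detection maps the whole utterance to I; slot filling labels
# every token, so tokens and slot labels must align one-to-one.
assert len(example["tokens"]) == len(example["slots"])
```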

Prompt-based Intent Detection Model
In few-shot settings, we propose a BERT-NSP-Prompt model, which utilizes the next sentence prediction (NSP) task of BERT pre-training to detect the intent of the dialogue; no new parameters are introduced to the model. The architecture of BERT-NSP-Prompt is shown in Figure 4. The NSP task aims to predict whether two sentences are contextually related, so it can evaluate whether two sentences concern the same topic and carry similar semantics. We transform intent detection into the NSP task by building a prompt that unifies the BERT pre-training and intent detection tasks. The input of BERT-NSP-Prompt consists of four parts: the sentence template T_A, the original dialogue A, the sentence template T_B, and the natural language description of the intent label B. T_A and T_B are manually designed prompt templates. The format of the input sequence is as follows:

X′ = [CLS] T_A A [SEP] T_B B [EOS],

where [CLS] is the special identifier used to complete the next sentence prediction task, and [SEP] and [EOS] are the special identifiers for sentence segmentation and sentence ending.
For example, there are a total of seven intents in the SNIPS dataset, so for an original dialogue x, the final intent is predicted by constructing seven inputs X′ with prompts. Appendix B shows the description text of each intent in the SNIPS dataset. After the construction of the input X′, the representation of each input token E = (e_0, · · · , e_n) consists of Token Embedding, Segment Embedding, and Position Embedding, where n is the length of the input X′. The input sequence E is then encoded by BERT to obtain the hidden states of the final layer:

(h_0, h_1, · · · , h_n) = BERT(E),

where h_0 is the hidden state vector of [CLS].

Figure 5: Illustration of the slot filling model based on reconstructing slot labels.
The output layer uses the BERT pre-trained NSP task classifier to determine the relationship between sentence A and sentence B. The text description of the intent most relevant to the user's dialogue is then determined by comparing the "IsNext" scores of the NSP task, and the intent is predicted. The probability distribution of the intent is shown in Equation (3):

p = softmax(W_nsp · h_0 + b_nsp),   (3)

where W_nsp and b_nsp are learnable parameters. The loss function for intent detection is the cross entropy:

L_ID = −l_I · log(p),

where l_I is the intent label.
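The prompt construction and argmax-over-intents prediction described above can be sketched as follows. The template wording is illustrative (the paper's exact T_A and T_B are not given here), and `nsp_score` stands in for BERT's NSP head, which would return the "IsNext" probability for one constructed input:

```python
def build_prompt_inputs(dialogue, intent_descriptions,
                        template_a="The user says:",
                        template_b="The intent is:"):
    """Pair the dialogue with every candidate intent description in the
    [CLS] T_A A [SEP] T_B B [EOS] layout. Templates here are
    illustrative, not the paper's exact prompts."""
    return [f"[CLS] {template_a} {dialogue} [SEP] {template_b} {desc} [EOS]"
            for desc in intent_descriptions]

def predict_intent(dialogue, intents, nsp_score):
    """Pick the intent whose description the NSP head rates as the most
    plausible next sentence. `intents` is a list of (name, description)
    pairs; `nsp_score` is any callable scoring one constructed input."""
    inputs = build_prompt_inputs(dialogue, [d for _, d in intents])
    scores = [nsp_score(x) for x in inputs]
    best = max(range(len(intents)), key=lambda i: scores[i])
    return intents[best][0]
```

In practice `nsp_score` would run the frozen or fine-tuned BERT NSP classifier; for a quick sanity check, even a toy lexical-overlap scorer suffices to exercise the selection logic.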

Slot Filling Via Reconstructing Slot Labels
The slot filling task is generally solved as a sequence labeling task using the BIO labeling format in standard datasets: the beginning of an entity is marked with 'B-' and the interior of an entity with 'I-'. In the case of sufficient data, the beginning label 'B-' provides supervised information for the model to detect entity boundaries. However, in few-shot settings, the excessive number of slot labels increases the difficulty of model training.
To address this issue, we adopt a two-step approach to reconstructing slot labels. First, the sequence labeling task in BIO format is transformed into a slot entity prediction task: the model predicts the entity types of all tokens in the dialogue. Second, the prediction results are reconstructed into BIO format by rules, following the left-to-right order of natural language, for evaluation. The reconstructing slot labels approach is shown in Figure 5. Applying the approach to the slot filling task reduces the number of slot labels by half, and thus half of the classifier parameters, ultimately reducing the difficulty of training the model.

Table 3: Performance of slot filling on FewShotATIS and FewShotSNIPS (F1-score). "BIO" is the model labeled with BIO format, "ET" denotes the model with reconstructed slot labels, and "ET+FL" represents the model with reconstructed slot labels and the focal loss function.
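The two conversion steps can be sketched as follows; this is a minimal illustration of the label mapping, with the caveat (inherent to the rule-based reconstruction) that two adjacent entities of the same type would be merged into one span:

```python
def bio_to_entity_type(bio_labels):
    """Step 1: collapse BIO labels to plain entity types, so both
    'B-city' and 'I-city' become 'city', halving the label inventory."""
    return [lab if lab == "O" else lab.split("-", 1)[1] for lab in bio_labels]

def entity_type_to_bio(type_labels):
    """Step 2: rebuild BIO format left-to-right for evaluation: the first
    token of each maximal run of one entity type gets 'B-', the rest 'I-'."""
    bio, prev = [], "O"
    for lab in type_labels:
        if lab == "O":
            bio.append("O")
        elif lab == prev:          # continuation of the current entity run
            bio.append("I-" + lab)
        else:                      # a new entity type starts here
            bio.append("B-" + lab)
        prev = lab
    return bio
```

For a sequence with non-adjacent entities the round trip is lossless, e.g. `["O", "B-from.city_name", "I-from.city_name", "O", "B-to.city_name"]` maps to entity types and back to the same BIO labels.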
In addition to the problems mentioned above, in low-resource scenarios, semantically similar slots are often more difficult to distinguish, e.g., the from.city_name vs. to.city_name slots in the ATIS dataset. The reasons are as follows. (1) Due to the small number of samples, it is difficult for the model to learn the textual structure of the dialogue, such as "from location A to location B."
(2) Both are slots of location-type entities, so their encoded representations are close in the semantic space. To address the problem that similar labels are difficult to distinguish, we introduce the focal loss function to replace the typically used cross-entropy loss function.
The focal loss is defined as:

FL(p_t) = −(1 − p_t)^γ · log(p_t),

where p_t is the probability the model assigns to the correct class, and γ is a non-negative hyperparameter that balances the loss values of easy-to-classify and hard-to-classify samples; when γ is 0, the focal loss reduces to the cross entropy. The focal loss introduces the coefficient term (1 − p_t)^γ into the cross entropy to reduce the relative loss of easy-to-classify samples (p_t > 0.5), so that the model focuses more on hard-to-classify samples. When p_t → 1, the loss value of easy-to-classify samples is toned down; when p_t → 0, the coefficient term (1 − p_t)^γ → 1, which differs little from the cross entropy loss. As γ increases, the term (1 − p_t)^γ further decreases the loss contribution of easy-to-classify samples, and the weights of hard-to-classify samples are relatively elevated.
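For a single prediction, the focal loss of Lin et al. (2017) (with the optional class-balancing factor α omitted, as in the formula above) can be written directly:

```python
import math

def focal_loss(p_t, gamma=2.0):
    """Focal loss for one prediction: FL(p_t) = -(1 - p_t)^gamma * log(p_t),
    where p_t is the probability the model assigns to the correct class."""
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

# With gamma = 0 this is exactly the cross-entropy term -log(p_t); larger
# gamma shrinks the loss of confident (easy, p_t > 0.5) predictions most,
# so hard samples dominate the gradient.
```

For instance, an easy sample with p_t = 0.9 contributes far less loss under γ = 2 than under plain cross entropy, while a hard sample with p_t = 0.1 keeps nearly its full cross-entropy loss.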

Experiments
As stated above, our few-shot SLU setting assumes access to a moderately sized pre-trained language model and a few labeled examples, without data from any other source domain. As such, moderately sized pre-trained models, namely BERT and RoBERTa, are employed as the baseline models. The details of implementation and evaluation metrics can be found in Appendix C. The main reasons for not considering metric learning models as baselines are as follows. (1) These models necessitate training on source domains with voluminous datasets before adapting to the data-scarce target domain, which is not feasible in our few-shot SLU setting.
(2) The two presented benchmarks have limited annotations and no source domain data, making it difficult to train the previous metric learning models efficiently. As such, it is impossible to fairly compare the performance of the proposed and previous models on the two proposed datasets. To further compare with previous few-shot SLU models, we conduct comparative experiments following the dataset setting of Hou et al. (2020a); the results are presented in Appendix D.

Main Results
We compare the proposed models with the pre-trained BERT and RoBERTa baselines over a range of training data amounts in few-shot settings. The results for intent detection and slot filling on FewShotATIS and FewShotSNIPS are presented in Table 2 and Table 3, respectively. To mitigate the potential impact of randomness, we ran each experiment five times with different random seeds and report the average performance with standard deviation. It is evident that the proposed BERT-NSP-Prompt and BERT-SF(ET+FL) outperform the baseline models. It is noteworthy that the proposed models' performance is significantly improved in this more practical scenario.
In particular, BERT-NSP-Prompt reveals exceptionally competitive results in zero-shot settings. The results verify that BERT-NSP-Prompt can well activate the knowledge related to the intent detection task in the pre-trained model: even in the absence of training data, intent detection can still leverage the knowledge contained in the pre-trained model to achieve satisfactory performance. BERT-SF(ET+FL) also demonstrates remarkable performance. Notably, reconstructing slot labels yields a remarkable improvement, which we attribute to the reduced complexity of the training process for the slot filling task. In comparison, intent detection performs better and shows a marked improvement in few-shot settings, mainly because (1) the intent detection task is relatively straightforward and (2) the features and patterns of intent are easier to capture. This indicates that reducing the classification difficulty of the model is a beneficial idea for few-sample learning. The results demonstrate that the proposed model is stable in performance and brings practical value. Table 2 shows that BERT-NSP-Prompt achieves intent accuracy of 75.42% and 79.57% in zero-shot settings on FewShotATIS and FewShotSNIPS respectively, which indicates that BERT-NSP-Prompt can accurately detect the user's intent by using the prompt and the prior information from the NSP task, even without domain-specific labeled data for training. This justifies that our designed prompt template reduces the gap between the BERT pre-training task and the intent detection task, so as to effectively utilize the knowledge acquired by BERT during the pre-training stage. As the number of samples increases, both BERT-NSP-Prompt and BERT-ID steadily improve in intent accuracy. In the 10-shot setting, BERT-NSP-Prompt still outperforms BERT-ID on FewShotATIS and FewShotSNIPS by 5.37% and 3.48%, respectively.

Effects of Intent Detection Model
Experimental results demonstrate the effectiveness of the proposed model across a variety of training data amounts. Furthermore, FewShotATIS and FewShotSNIPS can effectively be used to assess a model's generalizability. Table 2 reports that intent detection using BERT-ID yields an accuracy of only 20.29% on FewShotATIS but 59.02% on FewShotSNIPS in the 2-shot setting. An obvious reason is that FewShotATIS contains only samples of the flight domain, with relatively high similarity between intent labels, making intent detection more challenging.
In contrast, FewShotSNIPS includes more domains, thereby making the boundaries between different intents more conspicuous. Moreover, the performance of RoBERTa-ID is slightly lower than that of BERT-ID in the few-shot setting. Our preliminary analysis is that the larger number of RoBERTa parameters necessitates more training data to adequately fit the downstream task; as the quantity of training data increases, model size becomes the dominant factor in performance, and the performance of RoBERTa will keep improving. An additional advantage of the proposed model is that it can be effortlessly transferred to multi-intent SLU without any modifications. Table 3 shows that reconstructing slot labels brings about 4.41% and 7.64% improvement on FewShotATIS in the 5-shot and 40-shot settings, respectively. Similarly, the improvement is 9.64% and 2.08% in the 5-shot and 40-shot settings on FewShotSNIPS, with the most significant improvement observed in the 10-shot setting. The experimental results of BERT-SF(ET+FL) are similar to those of RoBERTa(ET+FL). This illustrates the effectiveness of the proposed reconstructing slot labels approach for slot filling across varying training data ranges. We believe this is mainly because training data is lacking in the few-shot setting, so reducing the training complexity of the model benefits its training. Comparing the results of BERT-SF(ET) with BERT-SF(ET+FL) (in Table 3), we find that replacing the cross-entropy function with the focal loss function improves performance by 8.10% on FewShotATIS and 2.28% on FewShotSNIPS in 10-shot settings.
This indicates that introducing focal loss can balance the "hard-to-classify" and "easy-to-classify" slots. In addition, the slot filling task requires more training data than the sentence-level intent detection task to achieve high performance, primarily because slot filling can be regarded as a word-level classification task, which is more complex.

Effects of Reconstructing Slot Labels
The results indicate that reconstructing slot labels contributes more significantly to the improvement of the proposed model than introducing the focal loss function. Reducing the complexity of slot labels, and thereby requiring fewer samples for training, is an exciting direction to explore, and the approach can be extended to other sequence labeling tasks in limited-resource scenarios.

Effects of Dynamic Sampling Strategy
To further assess the effectiveness of the proposed dynamic sampling strategy, we compare the results of applying the dynamic sampling strategy with the average sampling strategy on two datasets in each iteration. Both sampling approaches ensure that the training set of the previous experiment is a subset of the training set of the latter experiment, e.g., the 2-shot training data is encompassed within the 4-shot training data.
The results of applying different sampling strategies on BERT-NSP-Prompt and BERT-SF(ET+FL) for the intent detection and slot filling tasks are shown in Tables 4 and 5, where k-shot denotes k samples of each intent. The results of the Dynamic Sampling (DS) strategy are better than those of the Average Sampling (AS) strategy in the majority of settings on both FewShotATIS and FewShotSNIPS, which indicates that the dynamic sampling strategy can effectively sample according to the performance of different intents and slots. Comparing BERT-NSP-Prompt(AS) with BERT-NSP-Prompt(DS) in the 4-shot and 6-shot settings on FewShotATIS, improvements of 6.16% and 5.13% are observed, respectively. Additionally, comparing BERT-SF(ET+FL+AS) with BERT-SF(ET+FL+DS) in the 10-shot and 20-shot settings on FewShotATIS, improvements of 13.4% and 3.56% in F1 score are observed, respectively. This verifies that the smaller the number of labeled examples within a specific range, the more significant the performance improvement brought by the dynamic sampling strategy, and reflects that the SLU model is susceptible to the amount of data when training data is limited.
The experimental results for each intent on both datasets are in Appendix E. The experimental results report that the model with a dynamic sampling strategy has a smaller variance than the model with an average sampling strategy. It further demonstrates the effectiveness of the dynamic sampling strategy from fine-grained intent categories.
In addition, we compare the two sampling methods on the pre-trained model RoBERTa; more experimental results are in Appendix F. The experimental results are consistent with those on BERT, which reflects that the dynamic sampling process samples data according to the difficulty of each intent rather than overfitting to the BERT model. We believe the dynamic sampling strategy significantly enhances a model's performance in few-shot settings and can be effectively extended to other data sampling tasks.

Effects of γ in Focal Loss Function
The setting of γ in the focal loss function significantly affects the model's performance, so we conduct a sensitivity analysis of γ. We run experiments on FewShotATIS and FewShotSNIPS for BERT-SF(ET+FL) with γ ∈ {0.25, 0.5, 1, 2, 4}; Figure 6 illustrates the results. In the focal loss function, the larger the value of γ, the higher the weight the loss function assigns to "hard-to-classify" samples and the more attention the model pays to them. On both datasets, the F1 scores of slot filling first increase and then decrease as γ grows, indicating that assigning too much weight to the "hard-to-classify" samples hurts the overall performance of the model. Based on these results, we set γ to 2 in our experiments.

Conclusion
In this paper, we explore a more realistic scenario for few-shot SLU and make two main contributions. (1) We build two benchmarks, FewShotATIS and FewShotSNIPS, to simulate the few-shot SLU task in a more realistic scenario. (2) We develop BERT-NSP-Prompt, which utilizes BERT's NSP task together with a prompt template to detect user intent. In addition, we propose a slot label reconstruction approach that reduces the difficulty of training the classifier by reducing the number of labels. Experimental results indicate that the proposed model achieves state-of-the-art performance and endows the SLU module with solid generalization ability. Two directions are worth exploring in future work in the few-shot setting: (1) fully exploiting the close relationship between intents and slots to improve SLU performance, and (2) further reducing the complexity of model training.

A The Results of BERT
To simulate intent detection and slot filling tasks in low-resource scenarios, we sample few-shot data from the ATIS and SNIPS datasets and conduct k-shot experiments. We set fine-grained k values to observe how the model's performance changes as the training data increases. The k-shot settings for intent detection and slot filling are shown in Table 6 and Table 7. Based on these results, we can set reasonable k values when constructing the k-shot datasets, so that the model's performance can be thoroughly evaluated with only a few k-shot experiments. Table 8 shows the description text of each intent in the SNIPS dataset, which contains seven intents.

Table 8: The description text of each intent in the SNIPS dataset.

Intent                 Description Text
AddToPlaylist          He is adding something to playlist.
PlayMusic              He wants to play some music.
BookRestaurant         He is booking a restaurant.
SearchCreativeWork     He is searching for something like game, videos or TV shows.
RateBook               He is rating a book with a score.
GetWeather             He wants to know the weather information.
SearchScreeningEvent   He is searching for a movie.

C Implementation Detail and Evaluation Metrics
We implement the BERT (Devlin et al., 2018) and RoBERTa (Liu et al., 2019) models based on the Hugging Face PyTorch Transformers library (Wolf et al., 2019). The pre-trained BERT model is initialized with bert-base-uncased, and the pre-trained RoBERTa model with roberta-base. Each contains a 12-layer Transformer encoder with a multi-headed attention mechanism of 12 attention heads per layer and a hidden size of 768. The initial learning rate is 5e-5, the maximum sentence length is 80, and the dropout rate is 0.1. Model parameters are optimized using Adam (Kingma and Ba, 2014) with a weight decay optimizer. Under this experimental environment and parameter setting, the experiment requires 4,776M of memory and takes 145s per training epoch. To account for experimental randomness, we run each experiment 5 times with different random seeds and report the average performance and standard deviation. Intent accuracy is the evaluation metric for the intent detection task, and slot F1 score is the metric for the slot filling task.
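The seed-averaged numbers reported in the tables can be reproduced with a small aggregation helper; this sketch (the helper name `aggregate_runs` is ours, not from the paper) assumes the standard deviation reported is the sample standard deviation over the 5 runs.

```python
import statistics

def aggregate_runs(scores):
    """Summarize one metric over repeated runs with different random seeds,
    returning (mean, sample standard deviation) as reported in the tables."""
    return statistics.mean(scores), statistics.stdev(scores)
```

For example, `aggregate_runs([94.2, 94.8, 95.0, 94.5, 94.0])` yields the mean and standard deviation that would appear as one table cell.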

D The Results on the SNIPS Dataset
We evaluate our models following the k-shot dataset provided by Hou et al. (2020a) on SNIPS (Coucke et al., 2018). SNIPS contains seven intents: GetWeather (We), PlayMusic (Mu), AddToPlaylist (Pl), RateBook (Bo), SearchScreeningEvent (Se), BookRestaurant (Re), and SearchCreativeWork (Cr). Previous few-shot SLU methods train models on five source domains, use the sixth for development, and test on the seventh, covering all seven domains through seven-fold cross-validation. In comparison, our model is only fine-tuned on the testing domain, without training on any source domains. Experiments show that the proposed model achieves competitive performance compared to previous few-shot SLU models even though it uses no source-domain data. The main reason for the performance improvement is that the slot label reconstruction approach reduces the complexity of model training: the test set covers only one domain, so it contains few slot categories, and reconstructing the slot labels further cuts the number of slot types in half, greatly reducing the model's search space. These experiments demonstrate that the slot label reconstruction approach is simple and effective for few-shot SLU.
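The "cuts the number of slot types in half" step can be illustrated with a minimal sketch. This is an assumption about the reconstruction scheme, not the paper's exact definition: collapsing each BIO pair B-X / I-X into the single entity-type label X roughly halves the label inventory, and the helper name `reconstruct_slot_labels` is ours.

```python
def reconstruct_slot_labels(bio_labels):
    """Collapse BIO-format slot labels into entity-type labels.

    Each pair B-X / I-X maps to the single label X, roughly halving the
    label inventory; "O" is kept unchanged. (Illustrative sketch only;
    the paper's exact reconstruction scheme is defined in its main text.)
    """
    return [lab if lab == "O" else lab.split("-", 1)[1] for lab in bio_labels]
```

For instance, `reconstruct_slot_labels(["O", "B-fromloc.city_name", "I-fromloc.city_name"])` yields `["O", "fromloc.city_name", "fromloc.city_name"]`, shrinking the classifier's output space accordingly.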

E Intent Accuracy on FewShotSNIPS and FewShotATIS
The results of BERT-NSP-Prompt on FewShotSNIPS and FewShotATIS (6-shot) are reported in Table 10 and Table 11. They show that the model trained with the dynamic sampling strategy has a smaller variance than the model trained with average sampling. The main reason may be that the dynamic sampling strategy assigns a different number of samples according to the learning difficulty of each intent. This further demonstrates the effectiveness of the dynamic sampling strategy: it is more in line with practical scenarios and improves the overall performance of intent detection.

F The Results of RoBERTa in Different Sampling Strategies
The results of different sampling strategies with the RoBERTa models on the intent detection and slot filling tasks are shown in Table 12 and Table 13. As the total number of samples increases, the results of the dynamic sampling strategy show a steady upward trend. These results are consistent with those on BERT, suggesting that the dynamic sampling strategy samples data according to the difficulty of each intent rather than overfitting to a particular pre-trained model. We believe the dynamic sampling strategy significantly enhances model performance in few-shot settings and can readily extend to other data sampling tasks.

Table 13: The performance of different sampling strategies for slot filling (F1). "BIO" is the model labeled in BIO format, "ET" denotes the model with reconstructed slot labels, and "ET+FL" represents the model with reconstructed slot labels and the focal loss function. "AS" is Average Sampling, and "DS" denotes Dynamic Sampling.