PROTAUGMENT: Unsupervised diverse short-texts paraphrasing for intent detection meta-learning

Recent research considers few-shot intent detection as a meta-learning problem: the model learns to learn from a consecutive set of small tasks named episodes. In this work, we propose ProtAugment, a meta-learning algorithm for short-text classification (the intent detection task). ProtAugment is a novel extension of Prototypical Networks that limits overfitting on the bias introduced by the few-shot classification objective at each episode. It relies on diverse paraphrasing: a conditional language model is first fine-tuned for paraphrasing, and diversity is later introduced at the decoding stage at each meta-learning episode. The diverse paraphrasing is unsupervised as it is applied to unlabeled data, which is then fed to the Prototypical Network training objective as a consistency loss. ProtAugment is the state-of-the-art method for intent detection meta-learning, at no extra labeling effort and without the need to fine-tune a conditional language model on a given application domain.


Introduction
Intent detection, a sub-field of text classification, involves classifying user-generated short texts into intent classes, usually for conversational agent applications (Casanueva et al., 2020). Since conversational agent applications are domain-specific, intent detection is a challenging task because of labeled data scarcity and the number of classes (intents) it usually involves (Dopierre et al., 2020). As a consequence, recent research (Snell et al., 2017; Ren et al., 2018) considers few-shot intent detection as a meta-learning problem: the model is trained to classify user utterances from a consecutive set of small tasks named episodes. Each episode contains a limited number of C classes alongside a limited number of K labeled data points for each of the C classes; this is usually referred to as a C-way K-shot setup. At test time, the algorithm is evaluated on classes that were not seen during training. That is the reason why meta-learning is sometimes referred to as learning to learn: it mimics human abilities to learn iteratively from different and small tasks. Meta-learning has successfully been applied to a wide set of NLP tasks: hypernym detection (Yu et al., 2020), low-resource machine translation (Gu et al., 2018), machine understanding tasks (Dou et al., 2019), and structured query generation (Huang et al., 2018). Most meta-learning algorithms (Section 2) were developed in the course of the last 5 years. It has recently been empirically demonstrated that comparative studies in follow-up papers of (Snell et al., 2017) are debatable for short-text classification because of the two following main issues (Dopierre et al., 2021). First, comparative studies involve simple and limited datasets in terms of the number and separability of classes (SNIPS (Coucke et al., 2018), a very popular dataset, includes only 7 classes, with the current best model reaching over 99% accuracy (Cao et al., 2020)). Second, as we better understand (Niven and Kao, 2019), fine-tune (Liu et al., 2019b; Hao et al., 2020), and refine (Khetan and Karnin, 2020) BERT-derived models, it is not clear whether the different meta-learning frameworks can be considered state-of-the-art due to their architecture or due to the improvements of available text encoders at the time of their conception. (Dopierre et al., 2021) concludes that Prototypical Networks (Snell et al., 2017) (which used LSTM-based text encoders when introduced in NLP) are actually the state-of-the-art for intent detection when equipped with a fine-tuned BERT text encoder. Improving Prototypical Networks has therefore proven to be a very challenging task.
A cornerstone challenge is that meta-learning models can easily overfit on the biased distribution introduced by a few training examples (Yang et al., 2021). In order to prevent overfitting, and inspired by (Xie et al., 2020), we introduce an unsupervised diverse paraphrasing loss in the Prototypical Networks framework. A key idea is consistency learning: by augmenting unlabeled user utterances, PROTAUGMENT enforces more robust text representation learning. Unfortunately, back-translation is a poor data augmentation strategy for short texts: neural machine translation provides very similar (if not identical) sentences to the original ones, which hinders its ability to provide diverse augmentations (Section 5.3). Consequently, in this work, we transfer a denoising autoencoder pre-trained on the sequence-to-sequence task (Lewis et al., 2020) to the paraphrase generation task and then use it to generate paraphrases. As fine-tuning is very efficient for such a model, it is not easy to optimize it for diverse paraphrasing. (Goyal and Durrett, 2020) present an approach for diverse paraphrasing that reorders the original sentence to guide the conditional language model towards generating diverse sentences; the diversity in that work is provided by the reordering of the elements, which surprisingly affects the attention mechanism. In (Liu et al., 2020), expression diversity is part of an unsupervised paraphrasing system supported by simulated annealing. Both approaches imply domain transfer, and consequently as many diverse paraphrasing models to maintain as there are considered application domains, which does not scale well. In this work, we instead introduce diversity in the downstream decoding algorithm used for paraphrase generation. Diverse decoding methods are mostly extensions of the beam search algorithm, including noise-based algorithms (Cho, 2016), iterative beam search (Kulikov et al., 2019), clustered beam search (Tam, 2020), and diverse beam search (Vijayakumar et al., 2018). There is no clear optimal solution: the choice is task-specific and depends on one's tolerance for lower-quality outputs in the diversity/fluency trade-off (Ippolito et al., 2019). While diverse beam search allows partially controlling the diversity/fluency trade-off, we further demonstrate that constraining diverse beam search to generate tokens not seen in the input sentence (that is, constrained diverse beam search) is a simple yet powerful strategy to further improve the diversity of the paraphrases. Paired with paraphrased user utterances and their consistency loss incorporated into Prototypical Networks, our model is the best method for intent detection meta-learning on 4 public datasets, with neither extra labeling effort nor domain-specific conditional language model fine-tuning. We also show that PROTAUGMENT, having access to only 10 samples of each class of the training data, still significantly outperforms a Prototypical Network which is given access to all samples of the same training data.

Neural architectures for meta-learning
Past works on meta-learning for classification tasks investigate how to best predict a query point's class at the episode scale. This process is bounded to the set of C classes considered in a given episode.
Matching Networks (Vinyals et al., 2016) predict the class of a query point as the average cosine distance between the query vector and all support vectors of each class. Prototypical Networks (Snell et al., 2017) extend Matching Networks: after obtaining support vectors from the encoder, a class prototype is produced via a class-wise vector averaging operation. All query points are then classified with respect to their distance (cosine or euclidean) to all prototypes. Like Prototypical Networks, Relation Networks (Sung et al., 2018) emerged from Computer Vision applications and were later successfully applied to NLP (Zhang et al., 2018). They introduce a relation module, which captures the relationship between data points: instead of using a pre-defined distance (euclidean or cosine most of the time), this approach allows such networks to learn the metric by themselves. This is achieved using either a shallow feed-forward sub-network or a Neural Tensor Layer relation module (Socher et al., 2013) (intermediate learnable matrices). Another extension to Prototypical Networks is provided in (Ren et al., 2018). Unlabeled data are incorporated using two distinct approaches: i) taking unlabeled data from the same classes as the episode, or ii) using any unlabeled data and incorporating both a distractor cluster and a masking strategy to minimize the impact of distant unlabeled points. The first approach is unrealistic for meta-learning, as it implies knowing the unlabeled data's class. The second method assumes that all the noise is centered around a single distractor cluster and introduces an additional hyperparameter for masking, which is hardly tunable for small few-shot datasets.

Prototypical networks
In prototypical networks, each class is mapped to a representative point, called a prototype. Each sample is first encoded into a vector using an embedding function $f_\phi$ with learnable parameters $\phi$; this is the function we want to optimize. Using these embeddings, we compute each prototype $p_c$, $c \in C_{ep}$, as the mean vector of the embedded support points belonging to class $c$, as described in Equation 1:

$$p_c = \frac{1}{|S_c|} \sum_{x_i \in S_c} f_\phi(x_i)$$

where $S_c$ denotes the set of support points of class $c$.
Given those prototypes and a distance function $d$, prototypical networks assign a label to a query point by computing the softmax over the distances between this point's embedding and the prototypes, as in Equation 2:

$$p(y = c \mid x) = \frac{\exp(-d(f_\phi(x), p_c))}{\sum_{c' \in C_{ep}} \exp(-d(f_\phi(x), p_{c'}))}$$

In the original paper, (Snell et al., 2017) use the euclidean distance; we also observed consistently slightly worse results with the cosine distance.
The supervised loss function $L$ is the average negative log-probability of the correct class assignment over all query points. At test time, episodes are created using classes from $C_{test}$, and accuracy is measured on the query point assignments, given prototypes derived from the support points.
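The episode computation described above can be sketched in a few lines of NumPy. This is an illustrative re-implementation, not the paper's code; the function names and array shapes are our own choices.

```python
import numpy as np

def prototypical_forward(support_emb, support_labels, query_emb, n_classes):
    """Compute class probabilities for query points from support embeddings.

    support_emb: (N_support, dim) embeddings f_phi(x) of the support points
    support_labels: (N_support,) integer class ids in [0, n_classes)
    query_emb: (N_query, dim) embeddings of the query points
    """
    # One prototype per class: the mean of its support embeddings (Equation 1).
    prototypes = np.stack(
        [support_emb[support_labels == c].mean(axis=0) for c in range(n_classes)]
    )
    # Squared euclidean distance between each query point and each prototype.
    d = ((query_emb[:, None, :] - prototypes[None, :, :]) ** 2).sum(-1)
    # Softmax over negative distances (Equation 2).
    logits = -d
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    return prototypes, probs

def episode_loss(probs, query_labels):
    # Supervised loss: average negative log-probability of the correct class.
    return -np.log(probs[np.arange(len(query_labels)), query_labels]).mean()
```

At test time the same forward pass is run with prototypes built from the support points of unseen classes; only the accuracy of the query assignments is reported.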

PROTAUGMENT
In this section, we present our semi-supervised approach PROTAUGMENT. Along with the labeled data randomly chosen at each episode, this approach uses U unlabeled data points randomly drawn from the whole dataset, that is, data from training, validation, and test labels. We first perform a data augmentation step on this unlabeled data, obtaining M paraphrases for each unlabeled sentence. The $m$-th paraphrase of $x$ is denoted $\tilde{x}_m$.
Then, given the unlabeled data and their paraphrases, we compute a fully unsupervised loss. Finally, we combine the supervised loss $L$ (the Prototypical Network loss using labeled data) and the unsupervised loss (denoted $\bar{L}$) and run back-propagation to update the model's parameters.

Generating augmentations through paraphrasing
The BART (Lewis et al., 2020) model is a Transformer-based sequence-to-sequence architecture trained as a denoising autoencoder: it learns to reconstruct artificially corrupted input text. While it is trained to reconstruct the original noised input, it can be fine-tuned for task-specific conditional generation by minimizing the cross-entropy loss on new training input-output pairs (Bevilacqua et al., 2020). In PROTAUGMENT, we fine-tune a pre-trained BART model on the paraphrasing task. The paraphrase sentence pairs we use for this task are taken from 3 different paraphrase detection datasets: Quora (Sharma et al., 2019), MSR (Zhao and Wang, 2010), and Google PAWS-Wiki (Yang et al., 2019; Zhang et al., 2019).
Those datasets have different sizes; the largest one, Quora, consists of 149,263 pairs of duplicate questions. To balance turns of sentences (question/non-question paraphrases), 50% of our fine-tuning paraphrase dataset is made of Quora, 5.6% of MSR, and 44.4% of PAWS-Wiki. This yields 94,702 sentence pairs to train the model on the paraphrasing task. We include both code and data in our GitHub repository. Using this fine-tuned paraphrasing model, we can generate paraphrases of unlabeled sentences, hopefully obtaining paraphrases that represent the same intents as the original sentences. To add some diversity to the generated paraphrases, we use Diverse Beam Search (DBS) instead of the regular Beam Search. As Vijayakumar et al. (2018) have shown in the original paper, adding a dissimilarity term during the decoding step helps the model produce sequences that are quite far from each other while still retaining the same meaning. The next section describes how we constrain this decoding to enforce even more diversity among the paraphrases generated in PROTAUGMENT.
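As a toy illustration of that dissimilarity term (the Hamming diversity of Vijayakumar et al. (2018)), the sketch below lowers a candidate token's score in proportion to how often earlier beam groups already picked it at the current decoding step. The function name and the dict-based "logits" are our simplifications, not actual decoder internals.

```python
from collections import Counter

def hamming_diversity_penalty(logits, earlier_group_tokens, penalty=0.5):
    """Subtract penalty * (number of times a token was emitted by earlier
    beam groups at this step) from that token's score, so later groups
    are nudged towards different continuations."""
    counts = Counter(earlier_group_tokens)
    return {tok: score - penalty * counts[tok] for tok, score in logits.items()}
```

With two earlier groups having emitted "a", a later group's preference can flip from "a" to "b" even if "a" scores slightly higher, which is exactly the intended diversity effect.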

Constrained user utterances generation
While DBS enforces diversity between the generated sentences, it does not ensure diversity between the generated paraphrases and the original sentence: it was designed for tasks that do not need diversity with respect to the input (translation, image captioning, question generation). To enforce that our generated paraphrases are diverse enough, we further constrain DBS by forbidding the use of parts of the original sentence. In the following paragraphs, we introduce two forbidding strategies.
Unigram Masking. In this strategy, we randomly select tokens from the input sentence that will be forbidden at the generation step. The goal is to force the model to use words in the generated sentences that differ from those in the original sentence. Each word of the input sentence is randomly masked with a probability $p_{mask}$. The underlying assumption is that forbidding tokens at the beginning of a sentence with a higher probability than those at the end may have a greater impact on the beam search algorithm: since decoding is a conditional task based on previously generated tokens, masking the first tokens may significantly impact diversity. We therefore introduce two additional variants: one that puts more probability on the first tokens, and the reverse, with more weight on the last tokens. To ensure that all three variants mask the same number of tokens on average, we make sure the area under the curve of the three probability functions is equal to a fixed value noted $p_{mask}$.
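The three variants can be sketched as follows. The linear shape of the position weights is our assumption (the text above only specifies that the variants share the same area under the curve); the function name is ours as well.

```python
import numpy as np

def mask_probabilities(n_tokens, p_mask=0.7, variant="flat"):
    """Per-position probability of forbidding each input token at decoding.

    Each variant is rescaled so that its mean is p_mask: the three curves
    have the same area, hence mask the same number of tokens in expectation.
    """
    if variant == "flat":
        weights = np.ones(n_tokens)
    elif variant == "down":  # more probability on the first tokens
        weights = np.linspace(2.0, 0.0, n_tokens)
    elif variant == "up":    # more probability on the last tokens
        weights = np.linspace(0.0, 2.0, n_tokens)
    else:
        raise ValueError(f"unknown variant: {variant}")
    probs = weights / weights.mean() * p_mask
    # Clipping keeps valid probabilities; for large p_mask it can slightly
    # lower the effective average of the "down"/"up" variants.
    return np.clip(probs, 0.0, 1.0)
```

Each token of the input sentence is then independently forbidden with its position's probability before decoding starts.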

Bi-gram Masking
Another strategy we consider is to prevent the paraphrasing model from generating the same bi-grams as the original sentence. This time, we do not mask any single word but rather force the model to change the sentence's structure, which will hopefully increase the diversity of the generated paraphrases.
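A sketch of how such a constraint can be derived from the input. The helper name is ours; how the banned bi-grams are fed to the decoder depends on the generation library (many sequence-to-sequence decoders accept a bad_words_ids-style list of token-id sequences).

```python
def forbidden_bigrams(input_ids):
    """All bi-grams of the input sentence, to be banned at generation time.

    Banning these forces the decoder to break the original sentence's
    structure rather than copying it span by span.
    """
    return [list(pair) for pair in zip(input_ids, input_ids[1:])]
```

A sentence of n tokens yields n-1 banned bi-grams; single-token inputs yield none.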

Unsupervised diverse paraphrasing loss
After generating paraphrases for each unlabeled sentence, we create unlabeled prototypes. For each unlabeled sentence $x_u \in U$, we derive the unlabeled prototype $p_{x_u}$ as the average embedding of the paraphrases of $x_u$, as described in Equation 3:

$$p_{x_u} = \frac{1}{M} \sum_{m=1}^{M} f_\phi(\tilde{x}_{u,m})$$
After obtaining the unlabeled prototypes, we compute the distances between all unlabeled samples and all unlabeled prototypes. Given those distances, we model the probability of each unlabeled sample being assigned to each unlabeled prototype (Equation 4), as in the supervised part of the Prototypical Networks, except this time it is fully unsupervised. This probability should be close to 1 between an unlabeled sample and its associated unlabeled prototype, and close to 0 otherwise.

Given these assignment probabilities between unlabeled samples and unlabeled prototypes, we can compute a fully unsupervised cross-entropy loss $\bar{L}$, training the model to bring each sentence closer to its augmentations' prototype and further from the prototypes of the other unlabeled sentences. Recall that $f_\phi$ is the embedding function with learnable parameters $\phi$ (Section 3.2).
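The unsupervised cross-entropy loss above can be sketched in NumPy, mirroring the supervised computation. This is an illustrative re-implementation with our own names and shapes, not the paper's code.

```python
import numpy as np

def unsupervised_loss(unlabeled_emb, paraphrase_emb):
    """Consistency loss over unlabeled data.

    unlabeled_emb: (U, dim) embeddings of the U unlabeled sentences
    paraphrase_emb: (U, M, dim) embeddings of the M paraphrases of each
    sentence. Sentence i should land closest to prototype i, which is
    built from its own paraphrases.
    """
    # Unlabeled prototype = mean paraphrase embedding (Equation 3).
    prototypes = paraphrase_emb.mean(axis=1)                     # (U, dim)
    # Squared euclidean distance from each sentence to each prototype.
    d = ((unlabeled_emb[:, None, :] - prototypes[None, :, :]) ** 2).sum(-1)
    # Softmax over negative distances (Equation 4), log-space for stability.
    logits = -d
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Cross-entropy with target i for sentence i: pull each sentence
    # towards its own paraphrases' prototype.
    return -np.mean(np.diag(log_probs))
```

When each sentence sits near its own paraphrases the loss is near zero; mixing sentences with another sentence's paraphrases makes it large.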
After obtaining both the supervised loss $L$ and the unsupervised loss $\bar{L}$, we combine them into the final loss using a loss annealing scheduler (see Equation 5), which gradually incorporates the unsupervised loss as training progresses.
The goal is to rely mainly on the supervised loss first, so that the model gets a sense of the classification task. Then, incorporating more and more knowledge from unlabeled samples makes the model more robust to noise, which is essential as it is constantly tested on classes it has never seen before. We explore three different strategies for gradually increasing the unsupervised contribution: a linear approach (α = 1), an aggressive one (α = 0.25), and a conservative one (α = 4).
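A plausible form of such a schedule is sketched below. The exact shape of Equation 5 is not shown here, so this polynomial ramp is an assumption of ours, chosen to be consistent with the three α values described above.

```python
def unsupervised_weight(step, total_steps, alpha=1.0):
    """Annealing weight for the unsupervised loss: 0 at the start of
    training, 1 at the end. alpha=1 gives a linear ramp; alpha=0.25 is
    aggressive (ramps up early); alpha=4 is conservative (stays low longer).
    """
    return min(step / total_steps, 1.0) ** alpha
```

The final episode loss would then be something like `L + unsupervised_weight(step, total_steps, alpha) * L_bar`, a plausible combination rather than the paper's exact formula.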

Datasets
We consider the DialoGLUE benchmark (Mehri et al., 2020), a set of natural language understanding benchmarks for task-oriented dialogue, which contains three datasets for intent detection: Banking77, HWU64, and Clinic150; the three datasets were already available prior to the release of DialoGLUE. Additionally, we also consider the Liu57 intent detection dataset, as it contains the same order of magnitude of intent classes and is user-generated as well. All datasets are public and in English.

Banking77
The Banking77 dataset (Casanueva et al., 2020) classifies 13,083 user utterances related to banking into 77 different intents. This dataset i) is specific to a single domain (banking) and ii) requires a fine-grained understanding due to intents being very similar. Following (Mehri et al., 2020) and contrary to (Casanueva et al., 2020), we designate a validation set alongside the training and testing sets for this dataset (Table 1).

HWU64 HWU64 (Xingkun Liu and Rieser, 2019) classifies 25,716 user utterances with 64 user intents. It features intents spanning 21 domains (alarm, audio, audiobook, calendar, cooking, datetime, ...). When separating training, validation, and test labels, we ensure each domain is represented only in one set of labels. This ensures the model learns to discriminate between both intents and domains.

Clinic150 This dataset (Larson et al., 2019) classifies 150 user intents in perfectly equally-distributed classes. This chatbot-style dataset was initially designed to detect out-of-scope queries, though in our experiments we discard the out-of-scope class and only keep the 150 labeled classes, as in (Mehri et al., 2020).
Liu57 Introduced by Liu et al. (2019a), this intent detection dataset is composed of 54 classes. It was collected on Amazon Mechanical Turk, where workers were asked to formulate queries for a given intent in their own words. It is highly imbalanced: the most (resp. least) common class holds 5,920 (resp. 24) samples.

Experimental settings
Conditional language model and language model. For the BART fine-tuning process, we used the default hyper-parameters reported in (Lewis et al., 2020), and we fine-tuned the BART model for a single epoch (two hours on a Titan RTX GPU). Increasing the number of fine-tuning epochs degrades performance on the intent detection task: the downstream diverse beam search struggles to find diverse enough beam groups, since further fine-tuning lowers the model's perplexity (this is also hinted at in (Bevilacqua et al., 2020)). Our text encoder $f_\phi$ is a bert-base model, and the embedding of a given sentence is the last-layer hidden state of its first token. For each dataset, this model is fine-tuned on the masked language modeling task for 20 epochs. The encoder of our meta-learner is then initialized with the weights of this fine-tuned model.
Datasets. From a dataset point of view, we create two data profiles: full (all the training data is available, the usual meta-learning scenario) and low (only 10 samples are available for each training class, an even more challenging scenario in which the model meta-learns on very few samples per training class). All experimental setups are run 5 times. For each run, we randomly select training, validation, and testing classes, as well as the samples for the low setting. We train the few-shot models for a maximum of 10,000 C-way K-shot episodes, evaluating and testing every 100 episodes, and stopping early if the validation accuracy has not improved for at least 20 evaluations. We evaluate and test using 600 episodes, as in other few-shot works (Snell et al., 2017; Chen et al., 2019). We compare the systems in the following standard few-shot evaluation scenarios: 5-way 1-shot and 5-way 5-shot.
Paraphrasing.At each episode, we draw U = 5 unlabeled samples to generate paraphrases from.
For the back-translation baseline, we use the publicly available translation models from the Helsinki-NLP team. We use the following pivot languages: fr, es, it, de, nl, which yields 5 augmentations for each unlabeled sentence. For our experiments with Diverse Beam Search, we generate sentences using 15 beams, grouped into 5 groups of 3 beams. In each group, we select the generated sentence that is the most different from the input sentence, using BLEU as a metric for diversity. This yields M = 5 paraphrases for each unlabeled sentence, as in the back-translation baseline. DBS uses a diversity penalty parameter to penalize words that have already been generated by other beams, enforcing diversity. As advised in the original DBS paper (Vijayakumar et al., 2018), we set the diversity penalty to 0.5 in our experiments, which provides diversity while limiting model hallucinations. Our Unigram Masking strategy's masking probability is set to $p_{mask} = 0.7$, found by linear search from 0 to 1 with steps of 0.1.
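The per-group selection step can be sketched as follows. For brevity, this stand-in scores candidates by n-gram overlap with the input rather than computing actual BLEU, and the function name is ours.

```python
def most_different(candidates, source, n=2):
    """From one beam group's candidates, keep the sentence that shares the
    fewest n-grams with the source sentence (a rough stand-in for the
    BLEU-based selection described above)."""
    def ngrams(text, n):
        toks = text.lower().split()
        return set(zip(*[toks[i:] for i in range(n)]))

    src = ngrams(source, n)

    def overlap(cand):
        c = ngrams(cand, n)
        return len(c & src) / max(len(c), 1)

    return min(candidates, key=overlap)
```

Running this once per beam group (5 groups of 3 beams) yields the M = 5 paraphrases per unlabeled sentence.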

Evaluation of paraphrase diversity
We evaluate the diversity of paraphrases for each method and report results for two representative datasets in Table 3 (due to space limitations, the report for all datasets is given in appendix B). For each paraphrasing method and each dataset, metrics are computed over unlabeled sentences and their paraphrases. To assess the diversity of the paraphrases generated by the different methods, BLEU, a popular metric in neural machine translation, is a poor choice (Bawden et al., 2020). We use the bi-gram diversity (dist-2) metric proposed by (Ippolito et al., 2019), which computes the number of distinct 2-grams divided by the total number of tokens. We also report the average similarity (denoted use) within each sentence set, using the Universal Sentence Encoder as an independent sentence encoder. Results show that paraphrases obtained with back-translation are too close to each other, resulting in high sentence similarity and low bi-gram diversity. On the other hand, DBS generates more diverse sentences with a lower similarity. Our masking strategies strengthen this effect and yield even more diversity. The measured diversity strongly correlates with the average accuracy on the intent detection task (Table 4).
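The dist-n computation is straightforward; below is a minimal sketch (our own helper, following the definition above).

```python
def dist_n(sentences, n=2):
    """distinct-n: number of distinct n-grams across a set of sentences,
    divided by the total number of tokens. Higher means more diverse."""
    total_tokens = 0
    ngrams = set()
    for s in sentences:
        toks = s.lower().split()
        total_tokens += len(toks)
        ngrams.update(zip(*[toks[i:] for i in range(n)]))
    return len(ngrams) / max(total_tokens, 1)
```

Repetitive paraphrase sets score low because their n-grams collapse into a small set, while diverse sets keep the ratio high.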

Intent detection results
In this section, we discuss the accuracy results for the different meta-learners, for the standard 5-way {1, 5}-shot meta-learning scenarios, as provided in Table 4. The reported metric is the accuracy on the test set at the iteration where the validation accuracy is maximal. Our DBS+unigram strategy row corresponds to the flat masking strategy, with $p_{mask} = 0.7$. First, all methods augmented with unsupervised diverse paraphrasing outperform prototypical networks. However, back-translation demonstrates only a limited improvement over the vanilla prototypical network, due to the narrow diversity of its paraphrases for short texts. Using paraphrases from DBS yields better results (about 0.5 points over back-translation, on average), hinting that using diverse paraphrases in the unsupervised consistency loss allows the few-shot model to build more robust sentence representations and therefore provides improved generalization capabilities. Those results are consistent across the different datasets, except for Clinic, for which accuracies are all very high, making the methods hardly separable: the dataset is not challenging enough, or, in other words, meta-learning is robust to unbalanced short-text classification problems given the nature of that dataset.

These results illustrate the need for unsupervised paraphrasing and show that using diverse paraphrases provides a significant performance leap.
In the 1-shot (resp. 5-shot) scenario, our best meta-learner improves over prototypical networks by 5.27 (resp. 2.85) points on average. Remember that these improvements are made in an unsupervised manner, hence at no additional labeling cost. Slightly differing from (Xie et al., 2020), we do not find statistical differences depending on the rate at which $\bar{L}$ is annealed in the PROTAUGMENT loss (α ∈ {0.25, 1, 4}), which makes it easier to tune: our unsupervised loss serves as a consistency regularization. Due to space limitations, this analysis is available in appendix D.
Adding our masking strategies on top of DBS has a significant impact on all datasets, with the unigram variant up about 2 points over vanilla DBS on average. On all datasets except Clinic, given only 10 labeled samples per class (low profile), it even outperforms the supervised baseline which is given the full training data (full profile). This means that PROTAUGMENT does better than prototypical networks with far fewer (15 times, and up to 47 times, depending on the dataset) labeled sentences per class. Those results indicate that our method more than compensates for the lack of labeled data, and that no matter the amount of data available for the training classes, there is a performance ceiling that cannot be overcome without adding unsupervised knowledge from the validation and test classes. In the full profile, when given all the training data, our method greatly surpasses the Prototypical Network: 3.58 points given 1 shot, on average. Moreover, PROTAUGMENT is not only suited for the case where very little training data is available (low profile): when sampling shots from the entire training dataset (full profile), it outperforms a fully supervised baseline. Furthermore, note that our method is consistently more stable than the supervised baselines, as its average standard deviation over the different runs is much lower than the vanilla Prototypical Network's.

Table 4: 5-way 1-shot and 5-way 5-shot accuracy on the test sets for each dataset. The ours method is PROTAUGMENT (unsupervised consistency loss using diverse paraphrases) equipped with different paraphrasing strategies. For each dataset × C-way K-shot setting, we compute the average and the standard deviation over the 5 runs (see Section 5.2), so that the last two columns contain the average accuracy ± the average standard deviation. For each data profile, we highlight the best method in bold. We underline the methods on the low profile which perform better than the Prototypical Networks on the full profile. We trained 400 different meta-learners: 5 methods, 2 data profiles, 4 datasets, 2 meta-learning setups (K = 1, 5), and 5 runs for each configuration.

Masking strategies
We experimented with three variants of the unigram strategy (Section 4.2), each assigning a different drop chance to each token depending on its position in the input sentence. In our experiments, we did not observe any significant difference in performance when putting more weight on the first tokens (down), the last tokens (up), or the same weight on all tokens (flat) (detailed results in appendix C).
We also conducted experiments tuning the value of $p_{mask}$ from 0 to 1, selecting 0.7 as the best trade-off (Figure 2). This figure also clearly shows that the Clinic dataset is one order of magnitude easier to solve than the other datasets.

Conclusion
In this work, we proposed PROTAUGMENT, a meta-learning architecture for the problem of classifying user-generated short texts (intents). We first introduced an unsupervised paraphrasing consistency loss in the prototypical networks framework to improve its representational power. Then, while the recent diverse beam search algorithm was designed to enforce diversity between the generated paraphrases, it does not ensure diversity between the generated paraphrases and the original sentences. To make up for the latter, we introduced constraints in the diverse beam search generation, further increasing diversity. Our thorough evaluation demonstrates that PROTAUGMENT offers a significant leap in accuracy on the most recent and challenging datasets. PROTAUGMENT vastly outperforms prototypical networks, which were found to be the best meta-learning framework for short texts (Dopierre et al., 2021) against unsupervised-extended Prototypical Networks (Ren et al., 2018), Matching Networks (Vinyals et al., 2016), Relation Networks (Sung et al., 2018), and Induction Networks (Geng et al., 2019), thereby making PROTAUGMENT the new state-of-the-art for this task. We provide the source code of PROTAUGMENT as well as the code for the evaluations reported in this paper in a public repository.

A Diverse paraphrase samples

orig: what's the recipe for fish soup
back: What is the recipe for fish soup
dbs 0: How do you make fish soup? How is the recipe determined?
dbs 1: How can you recipe for fish-sugary food?
dbs 2: What are the recipes for Fish soup and how is it prepared?
orig: Find easy recipe for almond milk
back: Find an easy recipe for almond milk
dbs 0: What are some good recipe for Almond milk?
dbs 1: What are some good ways of making Almond milk?
dbs 2: How do I make Almond milk for a beginner?

Figure 1: PROTAUGMENT illustrated on a 3-way 2-shot short-text classification meta-learning task (C = 3, K = 2). BART is pre-trained for the paraphrasing task on three datasets: Quora (Sharma et al., 2019), MSR (Zhao and Wang, 2010), and Google PAWS-Wiki (Yang et al., 2019; Zhang et al., 2019). The paraphrase model is used to paraphrase unlabeled samples, equipped with diversity strategies (back-translation being proposed as a baseline). The final loss is computed using a loss annealing scheduler, which is expected to smooth the supervised (given shots) and unsupervised (augmented unlabeled sentences) prediction errors to yield parameter gradients. A new episode means sampling other classes along with their support and query points.

Figure 2: 5-way 1-shot accuracy of the DBS-unigram-flat method using different values of $p_{mask}$. Setting this value to 0 corresponds to vanilla DBS without masking strategies.
orig: Will I need to wear a coat today?
back: Should I wear a coat today?
dbs 0: Today, do I need to put on a coat
dbs 1: Should I wear a coat and what kind of coat
dbs 2: What should I wear to work today, and why

orig: can you play m3 file
back: can you read m3 file
dbs 0: M3 files: can I play the entire M3 file?
dbs 1: Is there any way to play 3M files on Earth without downloading it
dbs 2: Is there any way to play M3 files on Windows?
Table 1: Main statistics of the intent detection evaluation datasets. For HWU64, each split's number of classes varies at each run to ensure there is no cross-split domain, hence the decimal number.
orig: How long will my transfer be pending for?
back: How long will my transfer be on hold?
dbs 0: How long will my transfer be pending? I am in first year.
dbs 1: When are all transfers coming up and how many days are they expected?
dbs 2: If I have a transfer for a while, how long should I wait for it?

orig: I am not sure where my phone is.
back: I don't know where my phone is.
dbs 0: I am not really sure where my phone is located
dbs 1: How can I find the location of any Android mobile
dbs 2: I don't know where is my cell phone

orig: Are you considered a bot?
back: Do you consider yourself a bot?
dbs 0: Is there any way to be considered a bot?
dbs 1: Is there any way to know if someone is bot?
dbs 2: Is there any way to know if a person is bot?

dbs 0: What are some ways to cancel a transaction (in any way)?
dbs 1: What are some ways of cancelling an account (in some cases also involving transaction).
dbs 2: Is there any way in which I can cancel an existing transaction?

orig: go silent for a day
back: Quiet for a day.
dbs 0: Do you stay silent for a day or go silent for another day
dbs 1: To the rest of the day, stay peaceful and collected.
dbs 2: So, to the rest of the day, go silent only.

orig: What do I need to do for a refund?
back: What do I need to do to get a refund?
dbs 0: What should I do now that I am not getting a refund?
dbs 1: What are things that should be done before resumption of service (reimbursed)?
dbs 2: What should you do if you want to refund your period of data?

orig: does pizza hut have my order ready
back: Does Pizza Hut has my order ready
dbs 0: Does the pizza Hut have all your orders ready?
dbs 1: Does pizza Hut have all your orders made up?
dbs 2: Does the pizza Hut have all your orders ready for delivery?

Table 6: Paraphrase evaluation on all 4 datasets. The unigram variant shown here uses the flat masking strategy with $p_{mask} = 0.7$.

Table 7: Performance of DBS+unigram strategies putting either more chance to mask the first tokens (down), the last tokens (up), or the same chance on all tokens (flat). All strategies use $p_{mask} = 0.7$. Overall, there is no significant difference between the three strategies.

Table 8: Performance of DBS+unigram strategies with different values of the loss annealing parameter α. All strategies use $p_{mask} = 0.7$. Overall, there is no significant difference when changing the value of α.