Dense Retrieval as Indirect Supervision for Large-space Decision Making

Many discriminative natural language understanding (NLU) tasks have large label spaces. Learning such a process of large-space decision making is particularly challenging due to the lack of training instances per label and the difficulty of selecting among many fine-grained labels. Inspired by dense retrieval methods for passage finding in open-domain QA, we propose a reformulation of large-space discriminative NLU tasks as a learning-to-retrieve task, leading to a novel solution named Dense Decision Retrieval (DDR). Instead of predicting fine-grained decisions as logits, DDR adopts a dual-encoder architecture that learns to predict by retrieving from a decision thesaurus. This approach not only leverages rich indirect supervision signals from easy-to-consume learning resources for dense retrieval, but also leads to enhanced prediction generalizability with a semantically meaningful representation of the large decision space. When evaluated on tasks with decision spaces ranging from hundreds to hundred-thousand scales, DDR greatly outperforms strong baselines, by 27.54% in P@1 on two extreme multi-label classification tasks, 1.17% in F1 on ultra-fine entity typing, and 1.26% in accuracy on three few-shot intent classification tasks on average. Code and resources are available at https://github.com/luka-group/DDR.


Introduction
Many discriminative natural language understanding (NLU) tasks require making fine-grained decisions from a large candidate decision space. For example, a task-oriented dialogue system, when responding to users' requests, needs to frequently detect their intents from hundreds of options (Zhang et al., 2020b; Ham et al., 2020). Description-based recommendation in e-commerce needs to search from millions of products in response to users' descriptions (Gupta et al., 2021; Xiong et al., 2022).
Previous studies for such NLU tasks still train classifiers as the solution (Gupta et al., 2021; Yu et al., 2022; Lin et al., 2023a). However, we argue that this straightforward approach is less practical for large-space decision making for several reasons. First, more decision labels naturally lead to data scarcity, since collecting sufficient training data for all labels incurs significantly more cost. This is compounded by rare labels in the long tail of highly skewed distributions, which suffer severely from the lack of sufficient training instances (Zhang et al., 2023), leading to an overgeneralization bias where frequent labels are more likely to be predicted than rare ones (Xu et al., 2022). Second, the non-semantic, logit-based representation of decision labels in a classifier makes the model hard to generalize to rarely seen labels and not adaptable to labels unseen in training. This issue also impairs the applicability of decision-making models in real-world scenarios, such as recommendation (Xiong et al., 2022) and task-oriented parsing (Zhao et al., 2022a), where the decision space may expand rapidly and crucial labels may be absent in training.
In contrast, motivated by the semantic similarity of examples annotated with identical labels, recent studies propose contrastive learning schemes that leverage Siamese encoding architectures to maximize similarity scores of representations for positive example pairs (Henderson et al., 2020; Zhang et al., 2020a; Dahiya et al., 2021a; Mehri and Eric, 2021; Zhang et al., 2021; Xiong et al., 2022). Meanwhile, inductive bias from NLU tasks such as masked language modeling (Mehri et al., 2020; Dai et al., 2021) and natural language inference (NLI; Li et al. 2022; Du et al. 2022) has been shown to be beneficial for learning on rare and unseen labels via indirect supervision (Yin et al., 2023). However, when dealing with very large decision spaces, existing methods still face critical trade-offs between generalizability and efficiency of prediction.
Inspired by the recent success of dense retrieval methods that learn to select answer-descriptive passages from millions of candidate documents for open-domain QA (Karpukhin et al., 2020; Lee et al., 2021; Zhan et al., 2021), we propose an indirectly supervised solution named Dense Decision Retrieval (DDR). DDR provides a general reformulation of large-space decision making as learning to retrieve from a semantically meaningful decision thesaurus constructed from task-relevant resources. The model adopts the dual-encoder architecture from dense retrieval models to embed input texts and label descriptions, and learns to predict by retrieving from the informative decision thesaurus instead of predicting fine-grained decisions as logits. In this way, DDR not only leverages rich indirect supervision signals from easy-to-consume learning resources for dense retrieval, but also achieves enhanced prediction performance and generalizability with a semantically meaningful representation of the large decision space. We evaluate DDR on large decision spaces ranging from hundreds (few-shot intent detection) and ten-thousand (ultra-fine entity typing) to hundred-thousand scales (extreme multi-label classification). DDR obtains state-of-the-art performance on 6 benchmark datasets in multiple few-shot settings, improving over the most competitive baselines by 27.54% in P@1 for extreme classification, 1.17% in F1 for entity typing and 1.26% in accuracy for intent detection on average. Ablation studies show that both the constructed informative label thesaurus and indirect supervision from dense retrieval contribute to the performance gain.
The technical contributions of this work are three-fold. First, we present a novel and strong solution, DDR, for NLU tasks with large-space decision making that leverages indirect supervision from dense retrieval. Second, we provide a semantically meaningful decision thesaurus construction that further improves the decision-making ability of DDR. Third, we comprehensively verify the effectiveness of DDR on tasks of fine-grained text classification, semantic typing and intent detection, where the size of the decision spaces ranges from hundreds to hundreds of thousands.

Related Work
Indirect Supervision Indirectly supervised methods (Roth, 2017; He et al., 2021; Yin et al., 2023) seek to transfer supervision signals from a resource-rich source task to enhance a more resource-limited target task. A method of this kind often involves reformulating the target task into the source task. Previous studies have investigated using source tasks such as NLI (Li et al., 2022; Yin et al., 2019; Lyu et al., 2021, inter alia), extractive QA (Wu et al., 2020; FitzGerald et al., 2018; Li et al., 2020, inter alia), abstractive QA (Zhao et al., 2022a; Du and Ji, 2022) and conditioned generation (Lu et al., 2022; Huang et al., 2022b; Hsu et al., 2022, inter alia) to enhance more expensive information extraction or semantic parsing tasks. Recent studies have also transferred these technologies to specialized domains such as medicine (Xu et al., 2023) and software engineering (Zhao et al., 2022b), where model generalization and the lack of annotations are more significant challenges. There has also been foundational work studying the informativeness of supervision signals in such settings (He et al., 2021).
However, the aforementioned studies are not designed for discriminative tasks with very large decision spaces, and they do not apply directly due to issues such as high inference costs (NLI) and the requirement that decision content be contained in the input (QA). We instead propose dense retrieval, which naturally serves as a proper and efficient form of indirect supervision for large-space decision making.
NLU with Large Decision Spaces Many concrete NLU tasks deal with large decision spaces, including description-based recommendation in e-commerce (Gupta et al., 2021; Xiong et al., 2022) and Web search (Gupta et al., 2021), user intent detection in task-oriented dialog systems (Zhang et al., 2020b; Ham et al., 2020), and fine-grained semantic typing (Choi et al., 2018; Chen et al., 2020). Previous studies for such tasks either train classifiers (Yu et al., 2022; Lin et al., 2023a) or rely on contrastive learning from scratch (Zhang et al., 2021; Xiong et al., 2022); both are generally impaired by insufficient training data and generalize poorly to rarely seen and unseen labels. These challenges motivate us to explore a practical solution with indirect supervision from a dense retriever.

Figure 1: Overview of DDR, which reformulates general large-space decision making tasks as dense retrieval. The label thesaurus is constructed with detailed descriptions from publicly available resources, and the dual encoder learns to maximize similarity between embeddings of the input and the label thesaurus entry, with indirect supervision from a pre-trained dense retriever.

Method
We first describe the reformulation of large-space decision making tasks as dense retrieval (§3.1). We then introduce several automatic ways to construct a high-quality label thesaurus (§3.2). Lastly, we demonstrate the dual-encoder architecture of DDR (illustrated in Fig. 1) that retrieves decisions (§3.3).

Preliminary
The mechanism of dense retrieval for general decision making in large spaces can be formalized as follows. Given a textual input, such as "Illy Ground Espresso Classico Coffee: Classico, classic roast coffee has a lingering sweetness and delicate..." for description-based product recommendation, a model learns to infer the corresponding labels by mapping the input to textual descriptions in a large label thesaurus (covering, e.g., names, brands, and components), instead of simply predicting a label index. Formally, we are given a textual input x, a decision space covering L labels, and a label thesaurus that contains corresponding entries (label descriptions) D = {d_1, d_2, ..., d_L}. We first split every entry into text passages of a maximum length (e.g., 100 words) as the basic retrieval units, yielding M passages in total, P = {p_1, p_2, ..., p_M}. (Since a label-descriptive entry, such as those in XMC tasks, is sometimes too long to encode with a Transformer encoder, we break it down into multiple passages to avoid the quadratic dependency w.r.t. input length.) A retriever R: (x, P) → P_R is a function that takes as input a text sequence x and the label thesaurus P, and returns a ranking list of passages P_R, where labels represented by higher-ranked passages are more likely to be relevant to the input. As demonstrated in Fig. 1, DDR also leverages indirect supervision from open-domain QA by initializing the parameters of the retriever R with those from DPR.
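The entry-splitting step can be sketched in plain Python as follows; `build_passage_index` is an illustrative helper (not from the paper) that chunks each thesaurus entry into passages of at most 100 words while keeping a passage-to-label mapping, so a retrieved passage can be resolved back to its decision label.

```python
def build_passage_index(thesaurus, max_words=100):
    """Split each label's thesaurus entry into passages of at most
    `max_words` words (the basic retrieval units), keeping a parallel
    passage -> label mapping for resolving retrieved passages to labels."""
    passages, passage_to_label = [], []
    for label, entry in thesaurus.items():
        words = entry.split()
        for start in range(0, len(words), max_words):
            passages.append(" ".join(words[start:start + max_words]))
            passage_to_label.append(label)
    return passages, passage_to_label
```

A 250-word entry, for example, becomes three passages (100 + 100 + 50 words), all mapped back to the same label.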

Thesaurus Construction
Depending on the target task, the decision thesaurus can provide knowledge about the decision space from various perspectives. Given label names, we create informative descriptions for each label automatically without extra cost, referring to publicly available resources to construct a high-quality thesaurus. To name a few: 1) Lexicographical knowledge bases: dictionaries such as WordNet (Miller, 1995) and Wiktionary provide accurate definitions as well as plentiful use cases that help to understand words and phrases.
2) Large language models: instruction-tuned LLMs such as ChatGPT and Vicuna-13B (Chiang et al., 2023) are able to generate comprehensive and reasonable descriptions of terms when prompted for explanations.
3) Training examples: when a label with detailed information also appears as the input of an example, we directly adopt that input text; alternatively, we aggregate the input texts of multiple examples assigned the same label if they share high sentence-level semantic similarity.
We describe thesaurus construction in practice for each investigated downstream task in §4.
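The fallback order across these resources can be sketched as follows. This is a minimal illustration, not the paper's implementation: `dictionary_lookup`, `llm_describe`, and `example_texts` are hypothetical callables/inputs standing in for WordNet/Wiktionary lookups, LLM prompting, and training-example aggregation, respectively.

```python
def describe_label(label, dictionary_lookup, llm_describe, example_texts):
    """Build a thesaurus entry for `label`, trying the cheapest reliable
    source first: a lexicographical KB, then an instruction-tuned LLM,
    then input texts of training examples annotated with the label.
    Falls back to the bare label name if every source comes up empty."""
    definition = dictionary_lookup(label)        # e.g., WordNet / Wiktionary
    if definition:
        return f"{label}: {definition}"
    generated = llm_describe(label)              # e.g., prompt ChatGPT / Vicuna
    if generated:
        return f"{label}: {generated}"
    examples = example_texts.get(label, [])      # aggregated training inputs
    if examples:
        return f"{label}: " + " ".join(examples)
    return label
```

In practice each task in §4 uses whichever source best covers its label space; the helper only shows the prioritized fallback idea.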

Learning to Retrieve Decisions
Encoders Since a key motivation of DDR is to leverage indirect supervision from retrieval tasks, we adopt a dual-encoder architecture similar to dense retrievers such as DPR (Karpukhin et al., 2020), where two sentence encoders (BERT (Devlin et al., 2019) by default), E_x and E_p, represent the input text x and a passage p individually. The similarity between input and passage, denoted sim(x, p), is computed as the dot product of their vectors.
Training The objective of the reformulated largespace decision making problem is to optimize the dual encoders such that relevant pairs of input text and label description (thesaurus entry) should have higher similarity than irrelevant pairs in the embedding space.Hence, the overall learning objective is a process of contrastive learning.
To tackle single-label classification, where each instance contains one input x_i, one relevant (positive) passage p_i^+, and n irrelevant (negative) passages p_{i,j}^-, we minimize the negative log-likelihood of the positive label description:

L(x_i, p_i^+, p_{i,1}^-, ..., p_{i,n}^-) = -log [ exp(sim(x_i, p_i^+)) / ( exp(sim(x_i, p_i^+)) + Σ_{j=1}^{n} exp(sim(x_i, p_{i,j}^-)) ) ].   (1)
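The single-positive objective above is a softmax over one positive and n negative similarities; a minimal plain-Python sketch (illustrative helper name, scalar similarities assumed already computed):

```python
import math

def nll_loss(sim_pos, sim_negs):
    """Negative log-likelihood of the positive passage: softmax of the
    positive similarity against the n negative similarities."""
    denom = math.exp(sim_pos) + sum(math.exp(s) for s in sim_negs)
    return -math.log(math.exp(sim_pos) / denom)
```

With no negatives the loss is zero, and it shrinks as the positive similarity grows relative to the negatives, which is exactly the contrastive behavior the training objective asks for.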
We extend this to multi-label classification, where each input has m positive passages and more than one of them participates in the model update simultaneously. Accordingly, a cross-entropy loss is minimized to encourage similarity between all m positive pairs instead, averaging the per-positive loss over the positives:

L = -(1/m) Σ_{k=1}^{m} log [ exp(sim(x_i, p_{i,k}^+)) / ( exp(sim(x_i, p_{i,k}^+)) + Σ_{j=1}^{n} exp(sim(x_i, p_{i,j}^-)) ) ].
Positive, negative and hard negative passages: the decision thesaurus entries that describe the true labels of an input are deemed positive. In-batch negatives are the positive entries of other inputs in the same mini-batch; they make computation more efficient while achieving better performance than random passages or other negative entries (Gillick et al., 2019; Karpukhin et al., 2020). If not otherwise specified, we first train with positives and in-batch negatives to obtain a weak DDR model; we then retrieve from the label thesaurus with the weak model for the training examples, treat wrongly predicted label descriptions as hard negatives, and augment the training data with them to obtain a stronger DDR.
We note that, different from open-domain QA where one positive pair is considered at a time with Eq. 1, considering multiple positive pairs is very important when in-batch negatives are adopted and the decision space, though large, follows a highly skewed distribution. Within a mini-batch, multiple inputs may share the same popular labels from the head of the long-tail label distribution. In this case, these inputs have multiple positive passages: one randomly sampled from the label's passages as usual, and some from the positives of other in-batch inputs.
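The handling of shared labels within a mini-batch can be sketched as follows; `in_batch_targets` is an illustrative helper (not from the paper) that marks, for each instance, which in-batch passages count as positives rather than negatives.

```python
def in_batch_targets(batch_labels):
    """For each instance in a mini-batch, list the indices of in-batch
    passages that are positives for it: under a skewed label distribution
    two inputs may share the same head label, so that label's passage is
    a positive for both instances, not an in-batch negative."""
    targets = []
    for label_i in batch_labels:
        positives = [j for j, label_j in enumerate(batch_labels)
                     if label_j == label_i]
        targets.append(positives)
    return targets
```

For a batch with labels ["a", "b", "a"], instances 0 and 2 each get two positives ({0, 2}), which is why the multi-positive loss above is needed instead of Eq. 1.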
Inference After training is completed, we encode all passages in the label thesaurus once with E_p and index them offline using FAISS (Johnson et al., 2019). Given a new input, we obtain its embedding with E_x and retrieve the top-k passages whose embeddings are closest to that of the input.
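For illustration, the inner-product top-k search can be sketched exhaustively in plain Python (in practice a FAISS index replaces this linear scan; `retrieve_topk` is an illustrative helper, not the paper's code):

```python
import heapq

def retrieve_topk(query_vec, passage_vecs, k):
    """Exhaustive inner-product search standing in for a FAISS index:
    return the indices of the k passages whose embeddings have the
    highest dot product with the query embedding."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))
    scores = [(dot(query_vec, p), i) for i, p in enumerate(passage_vecs)]
    return [i for _, i in heapq.nlargest(k, scores)]
```

Retrieved passage indices are then mapped back to their labels via the passage-to-label mapping built at thesaurus-construction time.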

Experiments
We evaluate DDR on three NLU tasks with large decision spaces under two criteria: 1) only a minority of the label space is observed during training: extreme multi-label classification (§4.1) and entity typing (§4.2); or 2) limited amounts of examples are available per label: intent classification (§4.3). We investigate decision spaces of up to hundreds of thousands of labels. Dataset statistics are shown in Appx. Tab. 1.

Extreme Multi-label Classification
Task Many real-world applications can be formulated as eXtreme Multi-label Classification (XMC) problems, where relevant labels from an extremely large label set are to be predicted given a text input. Considering the ever-growing label set with newly added products or websites and the time-consuming collection of example-label pairs, we follow the few-shot XMC setting (Gupta et al., 2021; Xiong et al., 2022) to choose relevant labels from all available seen and unseen labels.

Datasets and Metrics
The public benchmark dataset LF-Amazon-131K collects pairs of relevant products with textual descriptions for commercial recommendation, while LF-WikiSeeAlso-320K aims to link relevant Wikipedia passages for reference (Bhatia et al., 2016; Gupta et al., 2021).
We use metrics widely adopted in the XMC (Chien et al., 2023) and IR (Thakur et al., 2021) literature: precision@k, the proportion of the top-k predicted labels that are true labels, and recall@k, the proportion of true labels found in the top-k predictions.

Baselines We compare DDR with strong XMC baselines in three categories: 1) Transformer-based models that learn semantically meaningful sentence embeddings with Siamese or triplet network structures, including Sentence-BERT (Reimers and Gurevych, 2019) and MPNet (Song et al., 2020); 2) competitive methods originally proposed for scalable and accurate predictions over extremely large label spaces, including XR-Linear (Yu et al., 2022), which recursively learns to traverse an input from the root of a hierarchical label tree to a few leaf node clusters, Astec (Dahiya et al., 2021b), with four sub-tasks for varying trade-offs between accuracy and scalability, and SiameseXML (Dahiya et al., 2021a), based on a novel probabilistic model that melds Siamese architectures with high-capacity extreme classifiers; 3) methods specifically designed to improve performance with few-shot labels, including ZestXML (Gupta et al., 2021), which learns to project a data point's features close to the features of its relevant labels through a highly sparsified linear transform, and MACLR (Xiong et al., 2022), which pre-trains Transformer-based encoders with a self-supervised contrastive loss.
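The two evaluation metrics defined above can be sketched directly (illustrative helper names; `predicted` is a ranked label list, `relevant` the set of true labels):

```python
def precision_at_k(predicted, relevant, k):
    """Fraction of the top-k predicted labels that are true labels."""
    topk = predicted[:k]
    return sum(1 for label in topk if label in relevant) / k

def recall_at_k(predicted, relevant, k):
    """Fraction of true labels recovered within the top-k predictions."""
    topk = predicted[:k]
    return sum(1 for label in relevant if label in topk) / len(relevant)
```

For example, with predictions ["a", "b", "c"] and true labels {"a", "c", "d"}, both precision@3 and recall@3 are 2/3.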

Implementation Details
The label space of LF-WikiSeeAlso-320K is composed of relevant passage titles, so the corresponding full content is available in Wikipedia and we use it as the label thesaurus. For relevant product advertisement and recommendation, concrete product descriptions provide sufficient information as the label thesaurus for LF-Amazon-131K. We extract product textual descriptions mainly from the Amazon data in the XMC benchmark (Bhatia et al., 2016) and public Amazon product metadata (Ni et al., 2019), leaving 95.67% of labels well documented. As introduced in §3.2, we then use ChatGPT to obtain relevant information for the remaining undocumented labels.
We perform label thesaurus construction on an Intel(R) Xeon(R) Gold 5217 CPU @ 3.00GHz with 32 CPUs and 8 cores per CPU. It takes 59.12 minutes to construct the label thesaurus for LF-WikiSeeAlso-320K and 28.24 minutes for LF-Amazon-131K. We use the same infrastructure to prepare the label thesaurus for the following tasks unless otherwise specified.
Due to the lack of explicit knowledge connecting the majority of labels to training examples, DDR only learns to distinguish positive example-label pairs from in-batch negatives, without further negative mining. For inference with the constructed label thesaurus, we first retrieve over the large label space for a test example in an unsupervised way, based on Sentence-BERT embeddings of the title and the first sentence. After training DDR with few-shot labels, we mimic the practice of DPR+BM25 for open-domain QA.

Ultra-fine Entity Typing
Task Entities can often be described by very fine-grained types (Choi et al., 2018), and the ultra-fine entity typing task aims at predicting one or more fine-grained words or phrases that describe the type(s) of a specific mention (Xu et al., 2022). Consider the sentence "He had blamed Israel for failing to implement its agreements." Besides person and male, the mention "He" has other very specific types that can be inferred from the context, such as leader or official, given the "blamed" behavior and the "implement its agreements" affair. Ultra-fine entity typing has a broad impact on various NLP tasks that depend on type understanding, including coreference resolution (Onoe and Durrett, 2020), event detection (Le and Nguyen, 2021) and relation extraction (Zhou and Chen, 2022).

Datasets and Metrics
We leverage the UFET dataset (Choi et al., 2018) to evaluate the benefits DDR draws from indirect supervision from dense retrieval. Among its 10,331 entity types, fine-grained labels (121 labels such as engineer) are more challenging to predict accurately than coarse-grained labels (9 labels such as person), and this issue is exacerbated for ultra-fine types (10,201 labels such as flight engineer). Following recent entity typing literature (Li et al., 2022; Du et al., 2022), we train DDR on the (originally provided) limited crowd-sourced examples without relying on distant resources such as knowledge bases or head words from the Gigaword corpus. We follow prior studies (Choi et al., 2018) in evaluating macro-averaged precision, recall and F1.
Baselines We consider two categories of competitive entity typing models as baselines: 1) methods capturing example-label and label-label relations, e.g., BiLSTM (Choi et al., 2018), which concatenates the context representation learned by a bidirectional LSTM with the mention representation learned by a CNN; LabelGCN (Xiong et al., 2019), which learns to encode global label co-occurrence statistics and their word-level similarities; LRN (Liu et al., 2021a), which models the coarse-to-fine label dependency as causal chains; Box4Types (Onoe et al., 2021), which captures hierarchies of types as topological relations of boxes; and UniST (Huang et al., 2022a), which conducts name-based label ranking; 2) methods leveraging inductive bias from pre-trained models for entity typing, e.g., MLMET (Dai et al., 2021), which uses pre-trained BERT to predict the most probable words for a "[MASK]" token inserted around the mention as type labels, and LITE (Li et al., 2022) and Context-TE (Du et al., 2022), which both leverage indirect supervision from pre-trained natural language inference.

Implementation Details
The UFET dataset was built by first asking crowd workers to annotate entity types and then using WordNet (Miller, 1995) to expand these types automatically by generating all their synonyms and hypernyms based on the most common sense. We therefore automatically obtain label thesaurus entries from definitions and examples in WordNet and Wiktionary, which cover 99.99% of the whole label set. The time cost of this label thesaurus construction is around 2.07 minutes. Initialized with the original DPR checkpoint pre-trained on open-domain QA datasets, DDR first optimizes the model given positive example-label pairs and in-batch negatives. With the fine-tuned model, we then perform dense retrieval over the label set for each training example, keeping label documents that score highly but are not in the true label set as hard negatives. DDR further updates the model with these additional hard negatives. For inference in multi-label entity type classification, we adopt labels with retrieval scores higher than a threshold that yields the best F1 score on the development set.
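The threshold-selection step can be sketched as follows; `tune_threshold` is an illustrative helper, and the example-averaged F1 used here is our simplifying assumption (the paper reports macro-averaged metrics, and the exact averaging used for tuning is not specified in this excerpt):

```python
def tune_threshold(dev_scores, dev_gold, candidates):
    """Pick the retrieval-score threshold maximizing average F1 on the
    dev set: labels scoring above the threshold are predicted types."""
    def f1(pred, gold):
        tp = len(pred & gold)
        if tp == 0:
            return 0.0
        p, r = tp / len(pred), tp / len(gold)
        return 2 * p * r / (p + r)

    best_t, best_f1 = None, -1.0
    for t in candidates:
        total = 0.0
        for scores, gold in zip(dev_scores, dev_gold):
            pred = {label for label, s in scores.items() if s > t}
            total += f1(pred, gold)
        avg = total / len(dev_gold)
        if avg > best_f1:
            best_t, best_f1 = t, avg
    return best_t, best_f1
```

At inference time the tuned threshold turns a ranked retrieval list into a variable-size set of predicted types per mention.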

Results
In Tab. 3, we show the performance of DDR and other entity typing methods. DDR obtains the state-of-the-art F1 score, outperforming the best baseline trained from scratch (LRN) by 6.7 and the best baseline with inductive bias from a pre-trained language model (Context-TE) by 0.6.

Few-shot Intent Classification
Task As a fundamental element in task-oriented dialog systems, intent detection is normally conducted in the NLU component to identify a user's intent given an utterance (Ham et al., 2020). Recently, accurately identifying intents in the few-shot setting has attracted much attention due to data scarcity issues resulting from the cost of data collection as well as privacy and ethical concerns (Lin et al., 2023b). Following the few-shot intent detection benchmark (Zhang et al., 2022), we focus on the challenging 5-shot and 10-shot settings.

Datasets and Metrics
To evaluate the effectiveness of DDR for NLU with large decision spaces, we pick three challenging intent datasets with relatively large numbers of semantically similar intent labels. Banking77 (Casanueva et al., 2020) is a single-domain dataset that provides 77 very fine-grained intents in the banking domain. HWU64 (Liu et al., 2021b) is a multi-domain (21 domains) dataset recorded by a home assistant robot, including 64 intents ranging from setting alarms and playing music to search and movie recommendation. CLINC150 (Larson et al., 2019) prompts crowd workers to provide questions or commands in the manner they would interact with an artificially intelligent assistant, covering 150 intent classes over 10 domains. We report accuracy for this single-label classification task.
Baselines There are two families of intent learning algorithms for coping with limited amounts of example-label pairs: 1) classifiers based on sentence embeddings from PLMs, including USE (Yang et al., 2020); and 2) data augmentation methods such as ICDA (Lin et al., 2023a), a most recent work that tackles the challenge of limited data for intent detection by generating high-quality synthetic training data: it first fine-tunes a PLM on a small seed of training data to synthesize new data points, and then employs pointwise V-information (PVI) based filtering to remove unhelpful ones.

Implementation Details

Based on the observation that utterances labeled with identical intents share very similar sentence-level semantics, we first use the whole set of unlabeled training examples to represent their corresponding pseudo labels predicted by Sentence-BERT (Reimers and Gurevych, 2019). We perform prediction on a single NVIDIA RTX A5000 GPU; label thesaurus construction takes 4.17 minutes for Banking77, 6.45 minutes for HWU64 and 7.61 minutes for CLINC150. Regardless of the basis for label thesaurus construction, DDR is trained in two phases in each round: only few-shot positive example-label pairs and in-batch negatives are used in the first phase, while labels wrongly predicted for training examples in the prior phase are used as additional hard negatives in the second phase. After the second round of training, the latest DDR makes predictions on the whole set of training examples for the final label thesaurus construction, from which DDR retrieves the label with the highest score as the prediction.
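The pseudo-label-based thesaurus construction described above can be sketched as grouping unlabeled utterances by their predicted intent; `thesaurus_from_pseudo_labels` and its `max_utterances` cap are illustrative assumptions, not reported implementation details.

```python
def thesaurus_from_pseudo_labels(utterances, pseudo_labels, max_utterances=50):
    """Represent each intent label by the unlabeled training utterances a
    sentence encoder assigns to it, concatenated into one thesaurus entry
    (capped at `max_utterances` utterances per label for illustration)."""
    grouped = {}
    for text, label in zip(utterances, pseudo_labels):
        grouped.setdefault(label, [])
        if len(grouped[label]) < max_utterances:
            grouped[label].append(text)
    return {label: " ".join(texts) for label, texts in grouped.items()}
```

In later rounds the same aggregation is repeated with pseudo labels predicted by the trained DDR instead of Sentence-BERT, refreshing the thesaurus entries.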
Results Tab. 4 presents the performance of different intent detection methods in the 5- and 10-shot settings. Without extra synthetic training data, the proposed DDR shows significant performance gains over existing baselines. Although ICDA improves accuracy by increasing the data augmentation scale, DDR achieves comparable performance without crafting 4x or even 128x synthetic instances. In Tab. 5, we additionally study the quality of the constructed label thesaurus, based on predictions over the unlabeled whole training set from Sentence-BERT and from DDR after the first round of training. We find that a more accurate pseudo connection between examples and labels leads to a higher-quality label thesaurus, which ultimately benefits intent detection.

Conclusions
In this paper, we focus on discriminative NLU tasks with large decision spaces. By reformulating these tasks as learning-to-retrieve tasks, we are able to leverage rich indirect supervision signals from dense retrieval. Moreover, by representing decision spaces with a thesaurus, we provide rich semantic knowledge of each label to improve the understanding of labels by dense decision retrievers. Experiments on 6 benchmark datasets show the effectiveness of our method on decision spaces scaling from hundreds to hundreds of thousands of candidates. Future work can extend our method to more large-space decision making tasks, especially in the low-resource setting.

Acknowledgement
We appreciate the reviewers for their insightful comments and suggestions. The logo for DDR used in this paper is sourced from a Wikipedia page under CC BY-SA 3.0, for which we appreciate the contribution of the Wikipedia community.
Nan Xu is supported by the USC PhD Fellowship.Fei Wang is supported by the Annenberg Fellowship and the Amazon ML Fellowship.

Limitations
The proposed DDR leverages a dual-encoder architecture with the inner (dot) product to compute embedding similarity, obtaining state-of-the-art performance on the three investigated challenging large-space decision making tasks. More expressive models for embedding inputs and label thesaurus entries, such as joint encoding with a cross-encoder, are not discussed. Moreover, other ways to model the connection between an input and a label thesaurus entry, such as using a sequence-to-sequence language model to generate the label name given the input text and label thesaurus entry, are not explored yet. We believe adopting more advanced dense retrieval algorithms can further improve performance for large-space decision making, and we leave this as an exciting future direction.
Mingtao Dong is supported by the Provost's Research Fellowship. Muhao Chen is supported by the NSF Grant IIS 2105329, the NSF Grant ITE 2333736, the DARPA MCS program under Contract No. N660011924033 with the United States Office of Naval Research, a Cisco Research Award, two Amazon Research Awards, and a Keston Research Award. The computing of this work has been partly supported by a subaward of NSF Cloudbank 1925001 through UCSD.

Table 1 :
Statistics and experimental settings on datasets with label spaces ranging from 64 to 320K.

Table 2 :
Results of few-shot XMC where the training subset covers 1% (left) and 5% (right) of labels from the whole set. DDR outperforms the second-best MACLR in both settings on two datasets. Indirect supervision from DPR boosts performance against training from scratch (the second row from the bottom), while label thesaurus construction improves accuracy over using textual label names (the third row from the bottom).

Table 3 :
Results of the UFET task. DDR outperforms competitive baselines with (upper 5 methods) or without (lower 3 methods) inductive bias from task pre-training.

Table 4 :
Few-shot intent detection accuracy on three benchmark datasets. DDR achieves much higher accuracy than existing strong learning baselines without data augmentation, while remaining competitive with ICDA, which requires a considerable amount of augmented data. ICDA prepares additional synthetic data at scales ranging from the same amount as the original few-shot training set (ICDA-XS) to 128 times that amount (ICDA-XL). In the bottom row, we also mark the performance of DDR against the most comparable ICDA variant. Results not available in the original papers are marked as −.

Table 5 :
Impact on intent detection from label thesaurus constructed with predictions from Sentence-BERT and DDR on the unlabeled whole training set.

After training DDR with this first version of the label thesaurus, we then update pseudo labels for each training example with predictions from DDR for the second-round training.