

Introduction
Conversational agents such as Amazon Alexa and Apple Siri rely on understanding intents to handle a user's query efficiently. For example, the query 'Will it be colder in Ohio' requires fetching the weather update for 'Ohio' and would be associated with the intent GetWeather. The agents are trained with a pre-defined set of intents such as {GetWeather, RateBook, BookRestaurant} so as to perform goal-oriented user tasks. But over time, a user may want to perform newer tasks, introducing hitherto unknown intents. For example, 'Play some music from 1954' would be associated with the intent PlayMusic, which may not be part of the pre-defined set of intents.
Detection of emerging novel intents has been studied with different models over the last decade. There are works on incremental learning in dynamic environments for evolving new classes (Zhou and Chen, 2002; Kuzborskij et al., 2013; Scheirer et al., 2012). Several approaches (Sun et al., 2016; Masud et al., 2010; Haque et al., 2016; Wang et al., 2020; Mu et al., 2017b,a) detect new classes as a form of outlier detection, but they do not distinguish among multiple new class labels and are therefore not effective for novel multi-class detection. Xia et al. (2018) and Siddique et al. (2021) detect user intents using a generalized zero-shot intent detection framework. However, they assume that the unseen intent class labels are already known, whereas in our case neither the number of unseen intent classes nor the corresponding class labels are known. Another line of work (Xia et al., 2021; Halder et al., 2020) supplies the system with new intents, albeit with a limited amount of tagged data per class, and then incrementally learns the new classes with an efficient algorithm. These models assume that some instances of the new classes are provided for model building. In a realistic setting, however, the system may have no knowledge of the number and types of new intents appearing; at most it may recognise that some new out-of-domain samples are being generated. The problem, therefore, is to probe the incoming data wisely and, with minimum human intervention, identify all types of emerging novel intents and intelligently tag a limited set of data covering all discovered intents, which can then be fed into a model for retraining. More concretely, the system is first trained with an initial set of known intents; alongside, an out-of-distribution (OOD) detector is trained to identify datapoints that do not fit the known intents. When a substantial number of such points is detected, the task is to (a) identify whether the points originate from the introduction of a single novel intent or multiple ones, and (b) choose a limited number of samples to annotate so that the classifier can be retrained efficiently.
To determine the number of novel intents present in the OOD data, we adopt a clustering-based approach with the idea that each cluster represents a novel intent. By progressively increasing the number of clusters, we can obtain a highly accurate estimate of the number of novel intents. If the sample points of an intent mostly correspond to a well-formed cluster, the implication is that, without much probing, we can shortlist enough training samples for that class through silver tagging. On the other hand, if the sample points of an intent tend to intertwine with points of other intents in the feature space, they can be considered uncertain points that require human intervention for tagging (gold tagging). With this intuition in place, we design a mix of silver and gold tagging to produce high-quality training samples that can be used to retrain the classifier.
Our proposed framework for Multiple Novel Intent Detection (MNID) is compared with competitive baselines and evaluated across several standard public NLU datasets, where it performs substantially better. We use datasets with different numbers of intent classes. SNIPS (Coucke et al., 2018) and ATIS (Tur et al., 2010) are smaller datasets with fewer intent classes, 7 and 21, respectively. HWU (Liu et al., 2019a), BANKING (Casanueva et al., 2020) and CLINC (Larson et al., 2019) contain a large number of intent classes, 64, 77 and 150, respectively.
The paper is organized as follows. We discuss the problem setting and solution overview in Section 2. Our algorithmic framework is described in Section 3. We present the datasets with experimental statistics and data pre-processing in Section 4. In Section 5, we discuss the experimental design and baselines. Detailed evaluation results with different algorithmic variations are in Section 6. We conclude with a summary in Section 7.

Algorithm 3 (Novel Class Detection), concluding steps: 8: K ← 2 * K; 9: end while; 10: CL ← store all K clusters; 11: return (L, N_new, CL).

Algorithm 4: Cluster Quality Based Annotation, CQBA(L, CL). 1: Take p points from each cluster in CL and annotate them to identify good clusters (G_CL) and bad clusters (B_CL). 2: Add the p * |CL| annotated point labels to L. 3: for each bad cluster do 4: Take q more points from the bad cluster and annotate them. 5: Add the q * |B_CL| point labels to L.


Problem Setting and Solution Overview

Problem Setting: Let T ⊂ W be the test set and W − T = D the rest of the dataset, out of which |D_init| (<< |D|) labelled datapoints belonging to N_init (< N) classes are provided initially, while the rest of the data is unlabelled; the total number of intent classes N is not known apriori. The task is to design an algorithm that (a) detects all the remaining N − N_init classes and (b) spends a limited budget of B − |D_init| annotations on new datapoints of high fidelity, so that the classifier achieves high accuracy when retrained.

Solution Overview: The solution steps are as follows. (a) Identify the OOD (out-of-distribution) datapoints which do not belong to the initial N_init classes; this can be considered a pre-processing step. (b) Use a part of the allotted budget to annotate a portion of these OOD datapoints. The points to annotate are selected by repeatedly running a clustering algorithm with an increasing number of clusters; the intuition/expectation is that each cluster hosts a separate intent, hence annotating the cluster centres leads to the discovery of the maximum number of novel intents. (c) Further identify which classes are well clustered in the feature space and which are not. Use another portion of the budget to increase the annotations in the not-so-well-formed clusters and then build a classifier over all the classes. Rationale: if a cluster is well formed, it most likely hosts a single class, hence there is no need to annotate further points there; instead, more points are annotated in the not-so-well-formed clusters. (d) Use the classifier to classify points from the clusters. Identify low-confidence points in the bad clusters and annotate them; high-confidence points in the good clusters are silver-annotated. Rationale: the low-confidence points in the bad clusters are the most uncertain points, hence annotating them helps increase classifier accuracy. Similarly, high-confidence points in the good clusters almost surely belong to those particular clusters, hence silver annotation is pursued. (e) Retrain the classifier.
The overall MNID framework with different algorithmic modules is shown in Fig. 1.
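To make the end-to-end flow concrete, the sketch below mirrors the stages of Fig. 1 in Python; every function and variable name is a hypothetical placeholder for the modules described in the next section, not the actual implementation.

```python
from typing import Callable, Sequence

def mnid_pipeline(D: Sequence, D_init: dict, B: int,
                  detect_ood: Callable, ncd: Callable,
                  cqba: Callable, ppas: Callable,
                  train: Callable):
    """Illustrative MNID control flow; the stage functions are passed in as
    callables and all names here are placeholders, not the authors' code."""
    L = dict(D_init)                      # labelled pool, seeded with the initial data
    OS = detect_ood(D, L)                 # pre-process: OOD samples w.r.t. the N_init classes
    L, N_new, CL = ncd(OS, L)             # Stage 1: novel class detection (budget B_1)
    L, good_cl, bad_cl = cqba(L, CL)      # Stage 2: cluster-quality-based annotation (budget B_2)
    clf = train(L)                        # classifier over known + newly found classes
    L = ppas(clf, CL, good_cl, bad_cl, L, B)  # Stage 3: gold + silver annotation, remaining budget
    return train(L)                       # final retraining on the enriched labelled set
```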

MNID: Solution Detail
The proposed framework for Multiple Novel Intent Detection (MNID) is explained through Algorithm 1. As highlighted in the overview, the algorithm consists of a data pre-processing step followed by three stages, each of which is discussed below. The total budget of (gold) annotation is B. In addition, the system can undertake unlimited silver annotation. The advantage of the silver strategy is that it is free, as no human probing is required; however, it is also likely to introduce noise if used indiscriminately.

Pre-processing: OOD Detection (OODD): For the dataset D, this module (Algorithm 2) takes the initial labelled data D_init as input and predicts the out-of-domain (OOD) samples on the remaining data, D − D_init. We call the set of predicted OOD samples OS. This forms the data pre-processing step.

Stage 1. Novel Class Detection (NCD): In this sub-module (Algorithm 3), we aim to find all the new classes, N_new. On the OOD samples OS obtained by Algorithm 2, we perform clustering using K-Means. We start with K = 1 and the number of new classes N_new = 0, and repeat the following steps: (i) run K-Means clustering;
(ii) annotate x points from each cluster, add those points to L, and identify n′ new classes; (iii) increase the new class count (N_new ← N_new + n′); (iv) double the cluster count K (comparing N_new with K/2). We execute these steps until the cluster count exceeds the new-intent count. The algorithm returns the current annotations L, the newly discovered class count N_new and the newly formed clusters CL. The budget spent in this step is B_1.

Stage 2. Cluster Quality Based Annotation (CQBA): In this step (Algorithm 4), we evaluate the quality of each cluster obtained by the previous algorithm. We annotate p points from each cluster; if all p points belong to the same class, we term it a good cluster, otherwise a bad cluster. An example of a bad cluster in BANKING would be one containing data points from multiple highly similar classes, such as declined_cash_withdrawal and pending_cash_withdrawal. For the bad clusters, we annotate q more points. All these annotated points are then added to the labelled data L. The budget spent in this step is B_2; the remaining budget B − (B_1 + B_2) is used in the next step.

Stage 3. Post-Processing Annotation Strategy (PPAS): In this step (Algorithm 5), we add more data to the labelled set L through gold annotation (gold strategy) as well as silver-annotated data (silver strategy). To select these data points, we first train a classifier M on the labelled set L obtained in the previous step (CQBA) and consider the clusters CL. We predict on the remaining points of the clusters to obtain the confidence of each datapoint. We then apply the silver strategy based on the confidence score (CS) and the gold strategy in a round-robin fashion, operating on each cluster one after another.

Gold strategy: the least confident data points are annotated from the bad clusters (if present) or else from the good clusters. The gold strategy retrieves the data points with the lowest scores from each cluster in a round-robin way until the budget is exhausted.

Silver strategy: if the confidence score (CS) of a datapoint in a cluster is greater than a predefined threshold TH, we measure its average cosine similarity with the points already annotated within that cluster. If this similarity is above a predefined threshold τ, we label the point with the class label of that cluster. The threshold τ is required to choose good samples selectively instead of accepting all points. The silver strategy does not require human intervention, so it adds nothing to the annotation cost, but multiple conditions are checked to prevent noise in the training set.

Final step: We again train the neural model M on L and test on T to obtain accuracy and F1.
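As a concrete illustration of Stage 1, the sketch below gives one possible reading of the doubling-K loop using scikit-learn's K-Means; the `annotate` callable stands in for the human oracle and all names are placeholders rather than the authors' code.

```python
import numpy as np
from sklearn.cluster import KMeans

def novel_class_detection(OS, annotate, x=2, known_labels=()):
    """Sketch of NCD: OS is an (n, d) array of OOD embeddings and `annotate`
    maps a sample index to its gold intent label (the human oracle)."""
    L = {}                                    # index -> gold label collected so far
    discovered = set(known_labels)
    K, N_new = 1, 0
    while True:
        km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(OS)
        for c in range(K):
            members = np.where(km.labels_ == c)[0]
            # annotate the x points closest to the cluster centre
            dists = np.linalg.norm(OS[members] - km.cluster_centers_[c], axis=1)
            for idx in members[np.argsort(dists)[:x]]:
                label = annotate(idx)
                L[idx] = label
                if label not in discovered:   # a hitherto unseen intent
                    discovered.add(label)
                    N_new += 1
        if K > N_new:                         # stop once clusters outnumber new intents
            break
        K *= 2                                # otherwise double K and re-cluster
    return L, N_new, km
```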

Dataset and Pre-Processing
We perform our experiments on a variety of datasets that are widely used as benchmarks for Natural Language Understanding tasks: SNIPS (Coucke et al., 2018), ATIS (Tur et al., 2010), HWU (Liu et al., 2019a), BANKING (Casanueva et al., 2020) and CLINC (Larson et al., 2019). SNIPS (7) and ATIS (21) are smaller datasets with fewer intents (counts in brackets), whereas HWU (64), BANKING (77) and CLINC (150) are larger datasets with many intents. ATIS is the most imbalanced and skewed dataset, and in BANKING several intents are highly similar to one another. The detailed statistics of these datasets, including our experimental framework, are shown in Table 1. Since the datasets are already fully labelled, annotation essentially means utilizing the already available labels.

In the pre-processing step, we filter the out-of-domain samples. We consider four algorithms for detecting out-of-domain samples, among them (i) Maximum Softmax Probability (MSP) (Hendrycks and Gimpel, 2018), DOC and FS-OOD (Tan et al., 2019); the full list and results appear in Table 2. We fine-tune BERT embeddings (bert-base-uncased) for all of these out-of-domain sample detection methods. We set the threshold for MSP to 0.5, as in Lin and Xu (2019). FS-OOD (Tan et al., 2019) provides the best accuracy and F1 for detecting OOD samples (OS); only DOC performs better in the case of SNIPS, but overall FS-OOD outperforms the other approaches, so we use the out-of-domain data produced by FS-OOD. (FS-OOD: https://github.com/SLAD-ml/few-shot-ood; other OOD models: https://huggingface.co)
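For reference, the MSP criterion with the 0.5 threshold can be expressed as the following check over classifier logits; this is a sketch of the MSP baseline only (our pipeline uses FS-OOD), and the helper name is ours.

```python
import torch
import torch.nn.functional as F

def msp_ood_mask(logits: torch.Tensor, threshold: float = 0.5) -> torch.Tensor:
    """MSP-style OOD detection sketch: a sample is flagged as out-of-domain
    when its maximum softmax probability over the N_init known intents falls
    below `threshold` (0.5, following Lin and Xu, 2019)."""
    max_prob, _ = F.softmax(logits, dim=-1).max(dim=-1)  # confidence on known intents
    return max_prob < threshold                          # boolean mask: True -> OOD sample
```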

Experimental Setup
The efficacy of the algorithm needs to be tested on two aspects: (a) the number of unknown intents identified, and (b) the accuracy achieved when the data is annotated by our algorithm, MNID. To test the accuracy, we use several state-of-the-art classification algorithms used for intent detection.

Different Neural Models: We explore different neural models to evaluate MNID, as discussed next. 1. IFSTC (Xia et al., 2021): this fine-tunes a trained model on few-shot data of new classes using entailment and hybrid strategies; we use the hybrid strategy (the best performing in their work). 2. PolyAI (Casanueva et al., 2020): it performs intent classification based on dual sentence encoders, Universal Sentence Encoder (USE) (Cer et al., 2018) and ConveRT; since the authors have taken down the ConveRT model, we apply USE only. Along with the above two, we also consider other standard models, 3. BERT ('bert-base-uncased') (Devlin et al., 2019) and 4. RoBERTa ('roberta-base-uncased') (Liu et al., 2019b), for evaluation on these datasets. We fine-tune these pre-trained language models for 15 epochs on the smaller datasets (SNIPS, ATIS) and 50 epochs on the larger datasets (HWU, BANKING, CLINC), with a learning rate of 2e-05 and the Adam optimizer. Early stopping is employed to stop training. For all methods, we provide the same number of gold-annotated data points obtained using our pipeline and report the performance.

Baselines: We compare the performance of our method against two annotation techniques for choosing B − |D_init| data points: 1) Gl_F: this is the ideal scenario where we are given F (=10) data points for each of the new classes (Gold Few, abbreviated as 'Gl_F'); 2) Rn_F: here, we randomly choose B − |D_init| data points from the unlabelled data (Random Few, abbreviated as 'Rn_F').

Clustering Algorithms: One of the building blocks of MNID is to cluster datapoints, so the efficacy of MNID depends on employing an efficient clustering algorithm. We carry out a detailed study with several unsupervised and semi-supervised clustering algorithms and choose the best.
The unsupervised algorithms include (i) K-Means (KM) (MacQueen et al., 1967) and (ii) agglomerative clustering (AG); the complete set of unsupervised and semi-supervised algorithms we compare is listed in Table 4. Other than KM and AG, all the other unsupervised methods, along with some of the semi-supervised methods such as DTC and CDAC, need the ground-truth number of clusters for training, and we provide it to them (an extra advantage for them). For semi-supervised methods such as KCL, MCL, DTC and DAL, we start with double the ground-truth number of clusters and let the method determine the number of clusters.
Hyper-parameters and Settings: For the post-processing annotation strategy of MNID, we set the cosine similarity threshold τ to 0.8 and the confidence threshold TH to 0.5; this combination of τ and TH gives the best results among the settings we experimented with. (Code for the clustering baselines: https://github.com/thuiar/TEXTOIR.) For all datasets, we use a setting similar to κ-shot with κ = 10. For N intents, we define our total budget as B = κ × N. We use the same budget for all our experiments. We experiment on an NVIDIA Tesla K40m GPU with 12 GB GDDR5 memory (6 Gbps clock). All the methods took less than 8 GPU hours of training.
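For concreteness, a fine-tuning loop consistent with the configuration above might look as follows; the dataset objects, batch sizes and helper name are assumptions, and AdamW is used as a close stand-in for the Adam optimizer mentioned in the text. This is a sketch, not the training script used in the experiments.

```python
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForSequenceClassification

def finetune_bert_classifier(train_ds, dev_ds, num_labels,
                             epochs=15, lr=2e-5, patience=3, device="cuda"):
    """Sketch of the reported configuration: 15 epochs (SNIPS/ATIS) or 50 epochs
    (HWU/BANKING/CLINC), lr 2e-5, early stopping on dev loss. Dataset items are
    assumed to be dicts with input_ids, attention_mask and labels tensors."""
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=num_labels).to(device)
    opt = torch.optim.AdamW(model.parameters(), lr=lr)   # Adam-style optimiser
    best_loss, bad_epochs = float("inf"), 0
    for _ in range(epochs):
        model.train()
        for batch in DataLoader(train_ds, batch_size=16, shuffle=True):
            batch = {k: v.to(device) for k, v in batch.items()}
            loss = model(**batch).loss                   # cross-entropy from labels
            loss.backward()
            opt.step()
            opt.zero_grad()
        model.eval()
        with torch.no_grad():
            dev_loss = sum(model(**{k: v.to(device) for k, v in b.items()}).loss.item()
                           for b in DataLoader(dev_ds, batch_size=32))
        if dev_loss < best_loss:                         # early stopping on dev loss
            best_loss, bad_epochs = dev_loss, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break
    return model
```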

Experimental Results
In this section, we discuss the experimental outcomes for MNID and the competing baselines. We also show results for different clustering algorithms and for variations of the distinct components of MNID.

(A) Discovery of Novel Intents: For SNIPS and ATIS, we could discover 2 out of 2 (100%) and 7 out of 8 (87.5%) new intent classes, respectively; for HWU, BANKING and CLINC, nearly all the new intents from the unknown intent set were discovered as well. Due to data skewness (ATIS) and high intent similarity (BANKING, CLINC), MNID misses one intent.
(B) Performance of MNID: Table 3 shows the performance of the different models, IFSTC, PolyAI, BERT and RoBERTa, when trained with the datasets provided by MNID. To maintain fairness, MNID, Rn_F and Gl_F use the same overall number of gold-annotated data points; in addition, MNID uses silver-annotated data points, while the others have no way of creating high-quality silver-annotated data. Each cell in the table consists of values for Gl_F, Rn_F and MNID. As expected, Rn_F performs the worst across all settings. However, except in two scenarios, we observe that MNID consistently performs better than Gl_F. For all four settings across the five datasets, MNID's improvements over the Gl_F predictions are statistically significant (p < 0.05) according to McNemar's test. Our approach also works well on the highly imbalanced ATIS dataset, in which some classes have fewer than 10 data points, and on the BANKING dataset, in which the intents are closely related, e.g., 'top-up-reverted' and 'top-up-failed'. This is because, although Gl_F chooses uniformly across all classes, MNID selectively labels datapoints with high uncertainty, thus providing the classifier with the right ingredients to perform better. For IFSTC on the SNIPS dataset, MNID underperforms compared to Gl_F, but by a very small margin; this happens because SNIPS has very few new classes, hence Gl_F can choose ideal candidates. The best performance of MNID as well as of the two baselines is in the PolyAI setting when it is used with the Universal Sentence Encoder (USE). Since PolyAI performs the best, all our subsequent results are reported on PolyAI (USE).
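The paired significance test mentioned above can be computed with a standard McNemar implementation; the sketch below, using statsmodels, assumes two aligned vectors of per-example correctness for Gl_F and MNID on the same test set.

```python
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

def mcnemar_pvalue(correct_a, correct_b):
    """Paired McNemar test on per-example correctness indicators (0/1 arrays)
    of two classifiers evaluated on the same test set."""
    a, b = np.asarray(correct_a, bool), np.asarray(correct_b, bool)
    table = [[np.sum(a & b),  np.sum(a & ~b)],
             [np.sum(~a & b), np.sum(~a & ~b)]]
    return mcnemar(table, exact=True).pvalue   # p < 0.05 -> significant difference
```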

Different Variations of MNID
MNID consists of three steps: (a) novel class detection (NCD), (b) cluster quality based annotation (CQBA) and (c) post-processing annotation strategy (PPAS). In each of these steps, certain parameters can be varied. We systematically discuss the impact of these parameters on MNID performance.

Variations at NCD
(a) Performance of Clustering Algorithms: We explore different unsupervised and semi-supervised clustering algorithms in our MNID framework.
Overall accuracy and F1 score for open intent discovery by the different approaches are shown in Table 4. From Table 4, it is seen that unsupervised approaches perform better than semi-supervised models. The semi-supervised techniques get biased by the initial seed and fail to discover the diverse clusters needed to detect all the new intent classes. K-Means (KM) performs the best across all datasets in terms of accuracy and F1 score, except for the HWU dataset, where DEC and DTC (F1 only) outperform it. This is most probably due to its robustness and the absence of outliers in the dataset. We therefore use K-Means as the clustering algorithm for MNID.
(b) Class discovery with the number of clusters: From Fig. 3a, we observe an increasing trend in the number of classes discovered as the number of clusters increases, which shows that classes get more evenly distributed across clusters as their number grows. The rate at which new classes are discovered grows roughly linearly with the number of clusters until most of the classes have been detected. The horizontal lines represent the gold number of new intents.
(c) Effect of the number of points (x) used in clustering: Fig. 3b shows that the accuracy on all datasets drops as we increase the number of points annotated per cluster for new class discovery beyond x = 2. This is because most of the budget gets exhausted during clustering, leaving very little budget to annotate low-confidence points in the subsequent steps. Note that at least two points from a cluster need to be annotated for new class discovery.


Variations at CQBA

We experiment with different values of point selection (p, q) for the CQBA module (Algorithm 4) and observe how accuracy changes for the three larger datasets, HWU, BANKING and CLINC. We get the best accuracy for (p, q) = (3, 2), i.e., 3 (p) points from a good cluster and 5 (p + q) points from a bad cluster, as shown in Table 6. Since we perform the gold annotation strategy on the bad clusters, a higher number of selected points is required there to identify the classes.


Variations at PPAS

(a) Gold/silver strategy variations: The different combinations of applying the gold and silver strategies to good and bad clusters are compared in Table 5. We observe that the best result is obtained with MNID-9, i.e., choosing high-confidence points from the good clusters for the silver strategy and low-confidence points from the bad clusters (or, if none are detected, from the good clusters) for the gold strategy. This strategy ensures that during silver annotation we choose points with high fidelity while for gold annotation we choose points with high uncertainty, both of which help in developing a highly accurate classifier. Applying the silver strategy to high-confidence points from good clusters (variation 7 vs 9) and the gold strategy to low-confidence points from bad clusters (variation 4 vs 9) alone enhances accuracy and F1 by ∼1-3% for the three large datasets. MNID-9 corresponds to our proposed approach, MNID.

(b) Silver Strategy Analysis: We inspect the silver strategy with respect to cosine similarity, confidence score and strategy accuracy.
(i) Effect of Cosine Similarity and Confidence Score (CS): We study the effect of the cosine similarity threshold used in the silver strategy of MNID. From Fig. 4a, we observe that the best results are always obtained with a higher cosine similarity threshold of 0.8. In the case of BANKING, HWU and ATIS, accuracy drops at 0.9, whereas for the other datasets it remains almost identical. Fig. 4b shows how accuracy varies with different confidence scores. We observe that for all the datasets the best results are obtained at a threshold of 0.5; a lower threshold allows more diverse datapoints to be selected via cosine similarity, which in turn improves model performance. In both cases, increasing the threshold beyond the optimal point results in too few datapoints being selected, which is not enough for the classifier to learn meaningfully; hence accuracy drops. We therefore set the cosine similarity threshold τ = 0.8 and the confidence threshold TH = 0.5 when choosing high-confidence points to be annotated by the silver strategy.
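Putting the two thresholds together, the silver-labelling rule can be read as the following check; this is a sketch with hypothetical variable names, using τ = 0.8 and TH = 0.5 as chosen above.

```python
import numpy as np

def silver_accept(point_emb, confidence, annotated_embs, TH=0.5, tau=0.8):
    """Accept a point for silver labelling with its cluster's label only if
    (i) the classifier confidence exceeds TH and (ii) its average cosine
    similarity to the already gold-annotated points of the cluster exceeds tau."""
    if confidence <= TH:
        return False
    sims = annotated_embs @ point_emb / (
        np.linalg.norm(annotated_embs, axis=1) * np.linalg.norm(point_emb) + 1e-12)
    return float(np.mean(sims)) > tau
```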
(ii) Strategy Accuracy: The accuracy of data point selection by the silver strategy for different MNID variations is shown in Table 7. We see that the strategy of choosing high-confidence points from good clusters produces points with high fidelity. Table 7 also shows the average number of points per class selected by this strategy for the various datasets. A sufficient number of silver points is annotated even under this very strict criterion. Note that the average number of points per intent class is highest for BANKING because multiple intents are very similar to each other, and hence more points pass the cosine similarity threshold τ.

Conclusion
We have developed MNID (Multiple Novel Intent Detection), an end-to-end framework to identify multiple novel intents within a fixed annotation cost. The algorithm intelligently uses clusters to first discover the classes and then estimate how the datapoints of a class are distributed, that is, whether they congregate strongly among themselves and separate from other classes, or are entangled with datapoints of other classes. For these two types of situations we propose two different strategies: a silver strategy that takes advantage of the clusters so that many points can be annotated without any extra human cost, and a gold strategy that annotates highly uncertain points. This two-pronged approach lets us annotate highly precise points automatically while annotating the most uncertain points (with respect to the class they belong to) with human assistance. We have carried out rigorous analysis and experimentation to establish the core idea of our algorithm. We observe that classifiers fed with the dataset created by MNID can beat the best standard few-shot setting, in which κ instances of each class are assumed to be provided and annotated by humans, whereas in our case we first have to discover the classes and then find instances of each class. One limitation of MNID is that it is unable to detect intents whose classes are very similar to each other. For example, the query "Can you explain why my payment is still pending?" in the BANKING dataset belongs to the "pending transfer" category, but our system detects it as the "pending card payment" intent, since the two intents are quite similar. We shall try to address this issue in future work. We have presently worked on a setting where novel intents appear in one step; we would like to extend this framework to explore the dynamics of periodically evolving intents.