infoVerse: A Universal Framework for Dataset Characterization with Multidimensional Meta-information

The success of NLP systems often relies on the availability of large, high-quality datasets. However, not all samples in these datasets are equally valuable for learning, as some may be redundant or noisy. Several methods for characterizing datasets based on model-driven meta-information (e.g., model’s confidence) have been developed, but the relationship and complementary effects of these methods have received less attention. In this paper, we introduce infoVerse, a universal framework for dataset characterization, which provides a new feature space that effectively captures multidimensional characteristics of datasets by incorporating various model-driven meta-information. infoVerse reveals distinctive regions of the dataset that are not apparent in the original semantic space, hence guiding users (or models) in identifying which samples to focus on for exploration, assessment, or annotation. Additionally, we propose a novel sampling method on infoVerse to select a set of data points that maximizes informativeness. In three real-world applications (data pruning, active learning, and data annotation), the samples chosen on infoVerse space consistently outperform strong baselines in all applications. Our code and demo are publicly available.


Introduction
The construction of large datasets is one of the essential ingredients for success in various NLP tasks (Wang et al., 2019). However, not all data points are equally important to learn from; many datasets often contain low-quality samples, e.g., incorrect labels (Toneva et al., 2019) or annotation artifacts (Gururangan et al., 2018). Thus, data characterization (Roth and Mattis, 1990), a technique for transforming raw data into useful information for a target task, has huge potential to improve the model's performance by trimming problematic samples (Pleiss et al., 2020) or providing better practices for effective data collection, e.g., active learning (Beluch et al., 2018) and adversarial annotation (Nie et al., 2019).

Figure 1: The proposed framework, infoVerse. By capturing various aspects of data characteristics, infoVerse provides an effective data characterization. By selecting maximally informative subsets on infoVerse, we can improve performance on a variety of data-centric real-world problems like active learning.
However, data characterization via human assessment is highly limited due to the huge cost of dealing with large datasets and the vagueness of the assessment itself. To this end, several types of model-driven meta-information have been investigated; for example, the model's confidence is standard meta-information widely used in active learning (Beluch et al., 2018). Swayamdipta et al. (2020) recently showed that the training dynamics of the model's predictions can indicate the relative importance of training samples. Various types of meta-information continue to be proposed from different intuitions (Salazar et al., 2020; Paul et al., 2021), but their relationships and potential beyond relying on any individual one have yet to be explored. Hence, this work answers the following two research questions: (1) is there a (hidden) complementary effect between various meta-information for better data characterization, and (2) is the combined meta-information useful for real-world applications?
In this paper, we introduce infoVerse: a universal framework for better dataset characterization that incorporates multiple aspects of data characteristics. Specifically, infoVerse combines various types of meta-information that offer different views of data characteristics (e.g., how difficult a sample is to learn, how certain multiple models are, and how likely the sample is). Consequently, we can extract richer information about data informativeness from their complementary effect, and infoVerse can guide users (or models) in deciding which samples to focus on for exploration, assessment, or annotation. To extend the advantages of infoVerse to real-world problems, we further propose a novel sampling method suited to infoVerse based on determinantal point processes (DPP), which are known to be effective for finding diverse, high-quality sets of samples (Gillenwater et al., 2012; Chen et al., 2018). It enables us to select data points that maximize information at a set level, rather than a sample level, on the multidimensional space of infoVerse.
In detail, we first construct infoVerse based on diverse meta-information, which can be broadly classified into four categories (Section 3). The complementary effect of the multiple meta-information in infoVerse helps reveal distinct regions in the dataset, such as hard-to-predict and mislabeled ones, which are not observable in the original semantic feature space (Section 4). In Section 5, we empirically show that our framework consistently outperforms strong baselines in various data-centric applications, like data pruning (Toneva et al., 2019; Paul et al., 2021), active learning (Yuan et al., 2020), and data annotation (Xie et al., 2020a), although it is not specifically designed for those problems. This result opens up the potential of infoVerse for other data-centric applications, unlike application-specific approaches.
Our results show that a dataset could be distinctively characterized when many different but complementary dimensions are considered together. We believe that our infoVerse framework could evolve continuously with the development of new meta-information and hence serve as an effective platform for better characterization of datasets and construction of high-quality datasets.

Related Works
Quantifying and characterizing datasets. Although the quantity of a dataset usually receives the attention for the success of various NLP tasks, the quality of the dataset is also an important factor. While constructing data with humans in the loop is quite reliable, as in Dynabench (Kiela et al., 2021a), it is expensive and labor-intensive. Hence, some works show the benefits of using a model for quantifying and characterizing datasets; for example, Rodriguez et al. (2021) demonstrate that the model has the ability to annotate, detect annotation errors, and identify informative examples. In this line of work, several model-driven meta-information have been proposed (Toneva et al., 2019; Swayamdipta et al., 2020; Beluch et al., 2018), as we explain in detail in Section 3 and Appendix A. Most prior works focus on finding new meta-information; however, as these measures are obtained under different intuitions and aspects of data characteristics, one can expect a complementary effect between them that provides richer information about the dataset. This direction has so far been under-explored, and we try to fill this gap in this work.
Informative subset selection. Finding an informative subset is key for various real-world applications; for example, active learning requires selecting the most informative samples among unlabeled samples for labeling (Settles, 2009). The most widely used approaches to finding those samples in active learning are based on uncertainty.
Although several uncertainty measurements have been successfully applied to various NLP tasks, Dasgupta (2011) pointed out that focusing only on uncertainty leads to a sampling bias toward repetitive patterns. To this end, diversity-based approaches have been proposed (Sener and Savarese, 2018). However, as this approach might select samples that provide little new information, recent works suggest methods combining uncertainty and diversity to take advantage of both (Ash et al., 2020; Yuan et al., 2020). Our work provides a better way to select informative samples by effectively incorporating multiple aspects of data characteristics within a single universal framework.

infoVerse: Universal Framework for Multi-aspect Data Characterization

In this section, we present infoVerse, a universal framework for better data characterization. Our high-level idea is to extract the complementary effect between various meta-information, as they originate from different aspects of data characteristics. In Section 3.1, we briefly introduce the meta-information used to construct infoVerse. In Section 3.2, we present a novel sampling method that extends the advantages of infoVerse to solving real-world problems. We remark that our framework can easily be extended with new meta-information and is not limited to specific measures, although a fixed set is used in the experiments.

Meta-information for infoVerse
To construct infoVerse, one needs to determine which meta-information to use, and a better capability for data characterization is expected with more meta-information. Hence, we first conduct an extensive investigation of the existing meta-information and find that it can be categorized into four classes based on how the data characteristics are extracted: (1) Static Measures, (2) Training Dynamics, (3) Model Uncertainty, and (4) Pre-trained Knowledge.

Figure 2: Correlation between the meta-information considered in infoVerse on the QNLI dataset.

Static Measures. Confidence and Entropy (Shannon, 1948) of the predictive probability from the classifier's output have been popularly used to characterize each data point. The BADGE score (Ash et al., 2020), a gradient norm of the linear classifier with respect to the training loss, is designed to capture the uncertainty of the prediction. Finally, Task Density and Relative Density, defined with the kNN distance on the classifier's feature embeddings (DeVries et al., 2020), effectively estimate the uniqueness of data at a task level.
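As a concrete illustration, the two simplest static measures, Confidence and Entropy, can be computed directly from a classifier's softmax outputs. The sketch below is ours (the paper does not prescribe an implementation), and it omits BADGE and the density measures:

```python
import numpy as np

def static_measures(probs, labels):
    """Per-sample static measures from a single trained classifier.

    probs:  (N, K) softmax class probabilities
    labels: (N,)   integer true labels
    Returns (confidence, entropy) per sample.
    """
    rows = np.arange(len(labels))
    confidence = probs[rows, labels]                          # probability of the true class
    entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)  # predictive entropy
    return confidence, entropy
```

A peaked prediction yields high confidence and low entropy; a flat one yields the opposite.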
Training Dynamics. The training dynamics of samples vary largely depending on the samples' characteristics and hence can provide useful information; e.g., the confidence of mislabeled samples increases slowly relative to normal ones. Swayamdipta et al. (2020) investigate the usefulness of the mean (Confidence) and standard deviation (Variability) of the model's prediction for the true class across training epochs; they observe that highly variable samples are usually useful, while low-confidence ones carry a risk of noisy labels. The Forgetting Number (Toneva et al., 2019), the number of transitions from being classified correctly to incorrectly during training, is also shown to be an effective measurement for finding redundant samples. Pleiss et al. (2020) reveal that the Area Under Margin (AUM), the sum of the gaps between the logits of the true class and the most confusing class across training epochs, differs among easy, hard, and mislabeled samples.
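Assuming per-epoch logits have been logged during training, the four training-dynamics measures can be sketched as follows (array shapes and names are our assumptions, not from the paper):

```python
import numpy as np

def training_dynamics(logits, labels):
    """Training-dynamics measures from logged per-epoch logits.

    logits: (E, N, K) classifier logits at each of E epochs
    labels: (N,)      integer true labels
    Returns (confidence, variability, forgetting, aum) per sample.
    """
    E, N, K = logits.shape
    rows = np.arange(N)
    exp = np.exp(logits - logits.max(axis=2, keepdims=True))
    probs = exp / exp.sum(axis=2, keepdims=True)
    true_p = probs[:, rows, labels]                  # (E, N) true-class probability
    confidence = true_p.mean(axis=0)                 # mean over epochs
    variability = true_p.std(axis=0)                 # std over epochs
    correct = (probs.argmax(axis=2) == labels).astype(int)  # (E, N)
    # Forgetting Number: count of correct -> incorrect transitions
    forgetting = np.maximum(correct[:-1] - correct[1:], 0).sum(axis=0)
    # AUM: summed margin of the true logit over the largest other logit
    true_logit = logits[:, rows, labels]
    other = logits.copy()
    other[:, rows, labels] = -np.inf
    aum = (true_logit - other.max(axis=2)).sum(axis=0)
    return confidence, variability, forgetting, aum
```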
Model Uncertainty. As intrinsic randomness within the model affects the samples' predictions, such uncertainty has been widely used in various fields (Lakshminarayanan et al., 2017; Lee et al., 2021b). There are two popular ways to measure it: Monte-Carlo Dropout (MC-Dropout) (Gal and Ghahramani, 2016) with different Dropout masks, and Deep Ensembles (Lakshminarayanan et al., 2017) with differently (randomly) initialized models. Specifically, the following four meta-information are used for uncertainty quantification: 1) Entropy of the average predicted class distribution of multiple models; 2) BALD (Houlsby et al., 2011), the mutual information between data samples and the classifier; 3) Variation Ratio (Beluch et al., 2018), the proportion of predictions that differ from the majority-voted one; and 4) EL2N score (Paul et al., 2021), an approximation of a sample's contribution to the change of the training loss. In addition, we also include the average and variability of confidence across different models.
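Given T sets of softmax outputs, whether from ensemble members or MC-Dropout passes, the first three uncertainty measures can be sketched as below; EL2N is omitted as it additionally requires the one-hot labels. Names and shapes are our assumptions:

```python
import numpy as np

def ensemble_uncertainty(member_probs):
    """Uncertainty measures from T stochastic predictions
    (deep-ensemble members or MC-Dropout passes).

    member_probs: (T, N, K) softmax outputs of each member
    Returns (entropy, bald, variation_ratio) per sample.
    """
    mean_p = member_probs.mean(axis=0)                             # (N, K)
    entropy = -np.sum(mean_p * np.log(mean_p + 1e-12), axis=1)     # entropy of the mean
    member_ent = -np.sum(member_probs * np.log(member_probs + 1e-12), axis=2)
    bald = entropy - member_ent.mean(axis=0)                       # mutual information
    votes = member_probs.argmax(axis=2)                            # (T, N) hard predictions
    majority = np.array([np.bincount(v).argmax() for v in votes.T])
    variation_ratio = 1.0 - (votes == majority[None, :]).mean(axis=0)
    return entropy, bald, variation_ratio
```

When all members agree, BALD and the variation ratio vanish; disagreement raises both.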
Pre-trained Knowledge. As general text representations provide information complementary to task-specific ones, using pre-trained language models to extract meta-information is another popular direction. For example, the MLM score (Salazar et al., 2020), a Pseudo-Log-Likelihood (Wang and Cho, 2019) of a Masked Language Model (MLM), gives low values to sentences with inconsistent context. To reduce the computational cost, we use its approximation following Yuan et al. (2020). Also, the Semantic Density of each sample, based on the kNN distance (DeVries et al., 2020) over sentence-BERT embeddings (Reimers and Gurevych, 2019), can assess its uniqueness compared to other sentences.
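A minimal sketch of the (unapproximated) pseudo-log-likelihood: mask each position in turn and sum the log-probability the MLM assigns to the original token. Here `masked_logprob` is a hypothetical stand-in for one forward pass of a masked LM such as BERT; wiring up an actual model is beyond this sketch:

```python
import math

def pseudo_log_likelihood(tokens, masked_logprob):
    """Pseudo-log-likelihood (PLL) of a sentence under a masked LM
    (Salazar et al., 2020). `masked_logprob(tokens, t)` is assumed to
    return log P(tokens[t] | tokens with position t masked).
    """
    return sum(masked_logprob(tokens, t) for t in range(len(tokens)))
```

Note this costs one forward pass per token, which is why the paper uses the cheaper approximation of Yuan et al. (2020).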
Overall, with these 23 meta-information, we construct the new feature space infoVerse. Note that complementary and redundant measures can be spotted by low and high correlation values in Figure 2, respectively. Remarkably, we observe that similar correlations consistently appear across different datasets and models, meaning that this "meta" information is largely dataset- and task-agnostic (see Appendix D).
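Putting the pieces together, constructing infoVerse amounts to stacking the per-sample measures into a feature matrix; the z-normalization below is our choice (the paper does not pin it down here), and the correlation matrix mirrors the Figure 2-style analysis for spotting redundant (high |corr|) vs. complementary (low |corr|) measures:

```python
import numpy as np

def build_infoverse(measures):
    """Stack per-sample meta-information into the infoVerse feature space.

    measures: dict mapping measure name -> (N,) array (23 measures in the paper).
    Returns (X, corr): the z-normalized (N, M) feature matrix and the
    M x M Pearson correlation matrix between measures.
    """
    X = np.stack(list(measures.values()), axis=1).astype(float)
    X = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)
    corr = np.corrcoef(X, rowvar=False)
    return X, corr
```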

Maximally-Informative Subset Selection
Consideration of multiple meta-information via infoVerse enables better data characterization, but it is non-trivial to apply this framework to practical problems such as data pruning and active learning. One of the main challenges stems from the multidimensional nature of infoVerse, which requires a new sampling method rather than an existing single-score-based one; for example, Beluch et al. (2018) and Swayamdipta et al. (2020) choose the top samples ordered by a specific meta-information, like confidence or uncertainty, but such an ordering is hard to define in a multidimensional space. Moreover, the single-score strategy cannot capture the relationships between the selected samples in the subset; hence, it suffers from a lack of diversity, especially when the subset is small (see Figure 3). Lastly, manual verification of the effectiveness of each feature becomes very costly when multiple features are considered. Motivated by this, we propose to focus on maximizing the informativeness of the subset and employ DPP, as it enables easy customization of the sampling method by defining appropriate score and similarity functions. In the end, we provide a new, effective subset selection method for infoVerse based on DPP, which selects the maximally informative subset by leveraging the capability of infoVerse for data characterization at a set level.
Determinantal point processes. Formally, a DPP on a set of samples X = {x_1, ..., x_N} is a probability measure P on 2^X, the set of all subsets of X. Under the assumption that P gives a nonzero probability to the empty set, the probability of each subset X ⊆ X is P(X) ∝ det(L_X), where L ∈ R^{N×N} is a real, positive semidefinite (PSD) kernel matrix and L_X denotes the sub-matrix of L indexed by the subset X. Here, we define the entries of L as

L_ij = q(x_i) S_ij q(x_j),

where q(x) ∈ R^+ is a score of sample x used to weight samples with a high quality or desired property, such as high confidence or uncertainty, and S_ij = φ(x_i)^T φ(x_j) ∈ [−1, 1] represents the similarity between samples x_i and x_j with a normalized feature vector φ(x) ∈ R^d, ||φ(x)||_2 = 1. We note that the determinant det(L_X) is proportional to the volume spanned by the vectors q(x)φ(x) for x ∈ X; hence, sets of high-quality and diverse samples have high probability under the DPP distribution. Consequently, DPP provides a natural way to find the maximally informative subset by selecting the subset with the highest probability among the sets with the same number of samples (Kulesza and Taskar, 2011).

To apply DPP on infoVerse, we consider the following design choices. For S_ij, we use a Gaussian kernel with Euclidean distance (Bıyık et al., 2019) on the normalized features x̃ of infoVerse:

S_ij = exp(−β ||x̃_i − x̃_j||_2^2),

where we use a fixed value β = 0.5. Regarding the score q(x), for data pruning we use a density defined by the kNN distance (Carbonera and Abel, 2015) to the same class' samples on infoVerse:

D_KNN(x) = min_K { ||z_θ(x) − ẑ_θ||_2 },

where ẑ_θ ∈ D_train \ {z_θ(x)} and min_K{·} is defined as the K-th smallest value in a set. In our experiments, we commonly set K = 5. As D_KNN has a negative value, we use its negative inverse, i.e., q(x) = −1/D_KNN. Intuitively, this encourages selecting samples that preserve the informativeness captured on infoVerse as much as possible.
For active learning and data annotation, we instead use the inverse as q(x) to select samples whose information is hard to capture from their neighbors and which will hence be beneficial when labeled. Finally, we adopt the efficient greedy method (Chen et al., 2018) for DPP, as finding the set with the highest probability is NP-hard.

Figure 3 shows 10 samples selected by different selection methods on the QNLI training dataset with a fine-tuned RoBERTa-large classifier. Here, one can observe that single-score-based selection methods like Ambig and Hard (Swayamdipta et al., 2020) indeed suffer from a lack of diversity. CoreSet (Sener and Savarese, 2018) or K-means clustering can select diverse samples, but they are known to be vulnerable to outliers (Georgogiannis, 2016). In contrast, DPP successfully selects informative and diverse samples, as shown at the bottom right of Figure 3; their log-determinant, i.e., approximate set-informativeness, is much higher than that of the others.
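To make the selection procedure concrete, the following sketch builds a quality-weighted DPP kernel of the form L_ij = q_i S_ij q_j with a Gaussian similarity kernel and runs a naive greedy MAP selection; the incremental-gain method of Chen et al. (2018) used in the paper is faster but more involved. Function names and the toy setup are ours:

```python
import numpy as np

def dpp_select(features, quality, k, beta=0.5):
    """Greedy MAP selection under a DPP with kernel L_ij = q_i * S_ij * q_j.

    features: (N, d) infoVerse features (assumed z-normalized)
    quality:  (N,)   positive per-sample scores q(x)
    Naive greedy: at each step add the item that maximizes log det(L_X).
    """
    d2 = ((features[:, None, :] - features[None, :, :]) ** 2).sum(-1)
    S = np.exp(-beta * d2)                       # Gaussian similarity kernel
    L = quality[:, None] * S * quality[None, :]  # quality-weighted DPP kernel
    selected = []
    for _ in range(k):
        best, best_gain = None, -np.inf
        for i in range(len(L)):
            if i in selected:
                continue
            idx = selected + [i]
            sign, logdet = np.linalg.slogdet(L[np.ix_(idx, idx)])
            if sign > 0 and logdet > best_gain:
                best, best_gain = i, logdet
        selected.append(best)
    return selected
```

On a toy set with two near-duplicate points and one distant point, selecting two items skips the duplicate in favor of the diverse pair, which is exactly the behavior illustrated in Figure 3.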

Data Characterization via infoVerse
In this section, we demonstrate how infoVerse can help analyze a given dataset via better data characterization. Specifically, Figure 4 presents infoVerse on the QNLI dataset along with other representative feature spaces for data characterization: the classifier's embedding (at the final layer before the linear head) and the data map (Swayamdipta et al., 2020). As the classifier's embedding and infoVerse are high-dimensional spaces, we project them to 2D via t-SNE (Van der Maaten and Hinton, 2008) for visualization. First, one can observe that infoVerse maps samples into distinguishable regions based on their characteristics; for example, samples with high variability are further separated into different regions. Specifically, in Figure 4, these samples have high variability in common while differing in other measures: the regions within dotted boxes have relatively high Ensemble Entropy and MC Entropy, respectively, which cannot be distinguished in the data map, showing the advantage of infoVerse in dataset characterization.
This benefit of infoVerse for data characterization becomes clearer when we focus on the incorrectly predicted samples (black squares/circles in Figure 4). As shown on the left of Figure 4, it is hard to find their characteristics in the classifier's embedding, as they are scattered over different regions. The data map (Swayamdipta et al., 2020) maps these samples to regions with low confidence (hard-to-learn) or high variability (ambiguous), but not perfectly, as these regions also include correctly predicted samples. In contrast, infoVerse successfully characterizes these incorrectly predicted samples and maps them into three distinct regions with different distributions of meta-information: 1) Hard-and-disagreed, 2) Easy-mistake, and 3) Ambiguous. As shown on the right of Figure 4, both the Hard-and-disagreed and Ambiguous regions have high ensemble uncertainty, but the Hard-and-disagreed region has relatively low confidence and variability, which means it is also hard to learn; this may imply incorrect annotation due to intrinsic difficulty, as one can verify in the examples in Figure 4. In contrast, the Easy-mistake region has much lower uncertainty than the other incorrectly predicted regions, which indicates that the prediction is certainly wrong. This may indicate that the mistakes happened during annotation even though the samples are easy to annotate correctly. More results of data characterization on other datasets with infoVerse are presented in Appendix D.

infoVerse for real-world applications
In this section, we demonstrate the advantages of infoVerse on three real-world problems: 1) Data Pruning (Swayamdipta et al., 2020), 2) Active Learning (Beluch et al., 2018), and 3) Data Annotation (Kiela et al., 2021b). These problems commonly require characterizing and quantifying data to determine which samples to select, but with different goals; hence, various methods have been explored separately. In contrast, we demonstrate the potential of infoVerse as a universal framework for such data-centric problems.

Data Pruning
The goal of data pruning is to select the most informative subset of a given training dataset while maintaining the performance of the model trained on the subset; hence, measuring a sample's informativeness is key to data pruning. This problem has been popularly investigated with various meta-information, as it is important for improving the efficiency of various NLP tasks.
Setups. For the data pruning experiments, we use two datasets, QNLI and WinoGrande. To demonstrate the effectiveness of infoVerse-DPP, we compare it with various state-of-the-art approaches to data pruning. We first consider random sampling (Random); then, we consider three approaches from Swayamdipta et al. (2020) (Easy, Hard, and Ambig), which select samples by scoring them with a specific meta-information (average confidence and variability). In addition, we introduce two additional data pruning methods: Forget (Toneva et al., 2019) and EL2N (Paul et al., 2021). Finally, we consider density-based approaches, as they are arguably the most natural way to preserve the characteristics of the dataset: Coreset (Sener and Savarese, 2018) and Density (Yuan et al., 2020). More details of datasets and training can be found in Appendix B.2.
Results. Figure 5 shows the performance under varied pruning ratios on WinoGrande (see Appendix C for the other tasks). We first note that the effectiveness of each pruning method varies significantly with the pruning ratio. For example, Hard and Ambig show good performance at small pruning ratios, but they often fail at large pruning ratios.

To demonstrate the advantages of infoVerse-DPP, we present Figure 6 to qualitatively show how the proposed method works for data pruning. Interestingly, the type of meta-information that dominates the selected samples changes dynamically as the pruning ratio increases, from confidence to model uncertainty to variability: the most redundant (i.e., high-confidence) samples are pruned first, followed by hard and uncertain ones. In contrast, other baselines have no selection pattern or only a static one, as shown in Figure 15. It is worth noting that effective pruning strategies indeed vary with the pruning ratio; for instance, Swayamdipta et al. (2020) disclose that high-confidence samples should be pruned at low pruning ratios due to redundancy, but these samples become essential for training as the ratio increases (e.g., 83%). While Swayamdipta et al. (2020) could manually check the varying effectiveness of confidence and find an effective pruning strategy based on that, such a manual approach becomes very costly when the number of considered measurements increases. In this respect, our framework offers an efficient solution, as it prunes samples toward maximizing the informativeness of the remaining ones rather than focusing on specific measurements. The observed pruning curriculum also illustrates how infoVerse with DPP outperforms the other selection methods: it automatically adapts the pruning strategy across varying ratios.
In addition, to verify the complementary effect between different categories of meta-information, we compare the model's performance when pruning samples based on each category alone using DPP. As shown in Table 3, the effectiveness differs largely depending on whether each category captures the aspects of the dataset important for data pruning. However, when they are combined to construct infoVerse, the performance is significantly improved, which implicitly reveals that they are mutually complementary. More results are presented in Appendix C and E.

Active Learning
Active learning (AL) aims to find the subset of unlabeled samples that would be most informative when labeled and used for training the model. AL usually consists of multiple iterations of the following two steps: (1) select a subset of unlabeled data under a specific sampling method and expand the labeled data by annotating the subset; (2) train the model with the new training dataset. More details and experimental results are in Appendix B.3.

Setups. To demonstrate the effectiveness of infoVerse in AL, we compare it with state-of-the-art AL methods on various datasets, following recent works on AL for NLP tasks (Yuan et al., 2020; Margatina et al., 2021). Specifically, we evaluate infoVerse on three datasets: SST-2, RTE, and AGNEWS (Zhang et al., 2015). Several AL methods are used as baselines, based on three different strategies: (1) uncertainty-based (Entropy and BALD, as described in §3.1); (2) diversity-based (BERT-KM and FT-BERT-KM (Yuan et al., 2020)), which focus on covering the data distribution; and (3) hybrid methods that consider both aspects jointly (BADGE (Ash et al., 2020) and ALPS (Yuan et al., 2020)), along with Random sampling. All experiments are conducted using BERT-base (Devlin et al., 2019). We construct infoVerse for unlabeled datasets with their pseudo-labels (Lee et al., 2013).
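The two-step AL iteration described above can be sketched as a generic loop, with the acquisition function (uncertainty-, diversity-, or infoVerse-DPP-based) passed in as a callback. All names are illustrative, and "annotating" here simply means revealing labels that are already stored:

```python
import numpy as np

def active_learning(X, y, init_idx, train_fn, acquire_fn, rounds, batch_size):
    """Generic AL loop (a sketch, not the paper's exact pipeline).

    train_fn(X_l, y_l)     -> fitted model
    acquire_fn(model, X_u) -> (len(X_u),) acquisition scores, higher = select
    Each round, the top-`batch_size` unlabeled samples are "annotated"
    and added to the labeled pool; the model is then retrained.
    """
    labeled = list(init_idx)
    for _ in range(rounds):
        model = train_fn(X[labeled], y[labeled])
        unlabeled = [i for i in range(len(X)) if i not in set(labeled)]
        scores = acquire_fn(model, X[unlabeled])
        picked = np.argsort(scores)[::-1][:batch_size]
        labeled += [unlabeled[j] for j in picked]
    return train_fn(X[labeled], y[labeled]), labeled
```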
Results. We first summarize the results in Figure 7 and Table 4. Figure 7 presents the test accuracy of the trained model at each AL iteration on the SST-2 dataset (results on the other datasets are presented in Appendix C due to limited space).
In addition, Table 4 shows the average test accuracy across multiple AL iterations, indicating the overall performance of each method. Here, one can observe that infoVerse-DPP shows consistent improvements over the baselines; it outperforms them on RTE and SST-2 and performs comparably to the strongest baseline, BALD, on AGNEWS. Consequently, infoVerse-DPP achieves the lowest average rank (1.3) among the tested AL methods.

Next, we conduct additional experiments to understand in depth how infoVerse-DPP selects informative unlabeled samples and improves the model's performance. Specifically, on SST-2, we compare the average meta-information of the samples selected by infoVerse-DPP and by two representative baselines, Random and Entropy. Figure 8 presents the results; Entropy selects the most uncertain samples (8(c)), but it struggles to select unseen samples (8(d)) and also risks selecting noisy samples (8(a)). In contrast, infoVerse-DPP incorporates multiple meta-information during selection; for example, it selects highly variable samples with moderately low confidence, which has been demonstrated to be a key characteristic of effective training samples (Swayamdipta et al., 2020). Also, the selected samples exhibit a certain level of uncertainty along with low sentence-level density (and hence can introduce new patterns into the training samples).

Data Annotation
Finally, we demonstrate the advantage of infoVerse on data annotation (Kiela et al., 2021b): providing the most effective set of unlabeled samples that are expected to improve the model's performance after they are annotated by human labelers.

Setups. We consider two datasets, SST-5 (Socher et al., 2013) and IMP (Du et al., 2021). Following Du et al. (2021), we first conduct unsupervised data retrieval to prepare 10,000 high-quality candidates from 2M unlabeled sentences from Common Crawl (Wenzek et al., 2020) and the Reddit corpus (https://convokit.cornell.edu/documentation/subreddit.html). We then apply each selection method to choose the final queries for data annotation: 1,000 samples for SST-5 and 600 samples for IMP. Finally, we ask crowd-workers to annotate the selected samples using Amazon Mechanical Turk (Crowston, 2012), with at least three different annotators per sample. Due to limited resources, we compare two representative methods, Random and Entropy, with ours (infoVerse-DPP). We include more details in Appendix B.4.
Results. Table 5 shows the performance of different selection methods on the SST-5 and IMP datasets. One can observe that infoVerse with DPP consistently finds more informative sets of samples, leading to extra performance gains over the other sampling methods on both datasets. We further measure the disagreement between annotators on the newly annotated IMP dataset in Figure 9. The order of samples annotated by our method is more linearly aligned with the annotators' disagreement than that of the other sampling methods, indicating that our method prefers to choose harder and more informative samples first. Consequently, unlike prior methods relying on a single meta-information such as confidence (Xie et al., 2020b) or uncertainty (Mukherjee and Awadallah, 2020), our multi-dimensional approach with infoVerse can provide useful contributions to data annotation. Finally, we note that experimental results and discussions on computational cost and the complementary effect of meta-information are presented in Appendix E and F, respectively.

Conclusion
We propose a new framework, infoVerse, for characterizing datasets in various aspects of data informativeness. Specifically, infoVerse utilizes various types of meta-information that capture different aspects of data characteristics. The combination of diverse meta-information helps detect distinct regions of dataset characteristics that are not observable in previous feature spaces. In addition, we propose a novel sampling method that selects data points maximizing the information at a set level, rather than a sample level, on the multidimensional space of infoVerse. We empirically demonstrate the benefit of infoVerse on three applications: data pruning, active learning, and data annotation. infoVerse with the proposed subset selection method shows consistent improvement over the strong baselines of each problem. We believe our framework will evolve with the growth of data-centric approaches and contribute to a better understanding of datasets and an improvement of dataset quality.

Limitations
In this paper, we propose a new framework that extracts various aspects of information about given data, relying on existing model-driven meta-information from trained models. Hence, if there are flaws within the used models, such as biased predictions (Sun et al., 2019) or learned spurious correlations (Liu et al., 2021), our framework can be directly affected and risks inheriting or amplifying such problematic behaviors. However, as our framework is not limited to specific models or meta-information, one can mitigate this problem by using robustly trained models (Sagawa et al., 2020) or introducing meta-information specialized for these issues (Lee et al., 2021a). In addition, despite the empirical gains we observe, our subset selection method is not theoretically guaranteed to be (or tightly bound to) the optimal set of maximal informativeness, which remains an interesting direction. A further study is also needed to show whether samples selected from infoVerse lead to low inter-annotator agreement in manual annotation yet provide more accurate information than pseudo-labels. Abnormality detection using infoVerse, e.g., for noisy labels, out-of-distribution samples, or annotation artifacts, is another interesting future direction.

Broader Impact and Ethical Implications
Our work aims to quantify data informativeness from multiple perspectives, capturing properties that cannot be revealed by any single perspective. In particular, infoVerse lends insight into data through the models we already have. Thus, infoVerse has the potential to guide the construction of high-quality datasets, e.g., by removing mislabeled samples. From this point of view, it is possible to develop a system or general platform for effectively collecting data, like Dynabench and Snorkel. We anticipate that a general platform built on infoVerse could contribute to human-in-the-loop machine learning systems.
Although our work empirically demonstrates improvements over various real-world problems, the current version of infoVerse is potentially vulnerable to undesirable properties of samples in a dataset (e.g., gender bias (Bordia and Bowman, 2019)), as the meta-information measures used to construct infoVerse do not consider such properties. However, this can be alleviated by incorporating meta-information specialized for such properties.

A Summary and Formal Definition of Meta-information
In this section, we first present a detailed summary of the considered meta-information in Table 6. Then, we provide a formal definition of each meta-information measure introduced in Section 3.1. Here, we consider a classification task with K classes. x and y indicate the input tokens and the corresponding true label, respectively. f_θ denotes the classifier, a pre-trained Transformer (Vaswani et al., 2017) such as BERT (Devlin et al., 2019) or RoBERTa (Liu et al., 2019). p_θ(y|x) is the predictive distribution of the classifier, and z_θ(x) is the contextualized embedding before the linear classifier, i.e., f_θ = W^T z_θ.

A.1 Static Measures
Static measures are meta-information extracted from a single static model, which is the most natural and simple approach. In total, 5 different meta-information measures are used.

Task Density (DeVries et al., 2020)
The Euclidean distance to the Kth nearest sample is used as an (inverse) density measure, following Carbonera and Abel (2015):

Density(x) = min_K { ||z_θ(x) − ẑ_θ||_2 },

where ẑ_θ ∈ D_train \ {z_θ(x)} and min_K{·} denotes the Kth smallest value in a set. In our experiments, we set K = 5.

Relative Density (DeVries et al., 2020)
As Task Density does not utilize label information, we further consider the relative density, the difference between the kNN density to true-class samples and to other-class samples. Hence, if this value is large, x is near its true class and far from other classes. In our experiments, we set K = 5.
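As an illustration, the kNN-based density can be computed as in the following minimal numpy sketch; the function name and the precomputed embedding matrix are hypothetical, not part of the released code.

```python
import numpy as np

def knn_density(embeddings: np.ndarray, k: int = 5) -> np.ndarray:
    """For each row, the Euclidean distance to its k-th nearest other row.

    Larger values mean lower density (the sample sits in a sparse region).
    """
    # Pairwise squared distances via ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    sq = np.sum(embeddings ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * embeddings @ embeddings.T
    d2 = np.maximum(d2, 0.0)        # guard against tiny negative values
    np.fill_diagonal(d2, np.inf)    # exclude the sample itself
    # k-th smallest distance per row (index k-1 after sorting)
    return np.sort(np.sqrt(d2), axis=1)[:, k - 1]
```

An isolated point then receives a larger score (lower density, more informative) than a point inside a tight cluster.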

A.2 Training Dynamics
The training dynamics of samples vary largely depending on the samples' characteristics and hence can provide useful information; e.g., the confidence of mislabeled samples increases slowly relative to normal ones. In total, we use 4 meta-information measures in this category. Here, E is the total number of training epochs.
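Given the per-epoch probability of the true label logged during training, the two core dynamics measures (mean confidence and variability) reduce to simple statistics, as in this sketch; the array layout is an assumption for illustration.

```python
import numpy as np

def training_dynamics(probs_per_epoch: np.ndarray):
    """probs_per_epoch: shape (E, N) -- predictive probability of the true
    label for N samples at each of E training epochs.

    Returns per-sample (confidence, variability): the mean and standard
    deviation of the true-label probability over epochs.
    """
    confidence = probs_per_epoch.mean(axis=0)   # high -> easy-to-learn
    variability = probs_per_epoch.std(axis=0)   # high -> ambiguous
    return confidence, variability
```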

A.3 Model Uncertainty
As intrinsic randomness within the model affects a sample's prediction, such uncertainty has been widely used in various fields. In total, 6 different meta-information measures are considered; as we compute them both for the MC-Dropout ensemble and for the ensemble of models trained with multiple random seeds, 12 measures are obtained in total. Here, T is the total number of models trained with different random seeds.

Table 6: Categorization of the meta-information used to construct infoVerse. The arrow in parentheses indicates the more informative direction for each measure; e.g., less confident data (↓) are less likely to have been seen and are thus more informative.

Static Measures
• Confidence (↓): predictive probability of the true label
• Entropy (↑): entropy of the predictive distribution
• BADGE (↑): norm of the gradient with respect to the parameters of the final (linear) layer
• Task Density (↓): Euclidean distance to the Kth nearest element in the contextualized embedding from the fine-tuned classifier
• Relative Density (↓): difference of Task Density between other-class samples and true-class samples

Training Dynamics
• Confidence (↓): average predictive probability of the true label over the training epochs
• Variability (↑): variance of the predictive probability of the true label over the training epochs
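The ensemble-based uncertainty measures of Section A.3 can be sketched from the T stochastic predictive distributions. The sketch below shows two common examples (predictive entropy and the BALD mutual-information score) as illustrative assumptions; the paper's full set of six measures follows Table 6.

```python
import numpy as np

def ensemble_uncertainty(probs: np.ndarray):
    """probs: shape (T, N, K) -- predictive distributions from T stochastic
    forward passes (MC-Dropout) or T models trained with different seeds.

    Returns per-sample total uncertainty (entropy of the mean prediction)
    and the BALD score (disagreement among the ensemble members).
    """
    eps = 1e-12
    mean_p = probs.mean(axis=0)                                  # (N, K)
    total = -np.sum(mean_p * np.log(mean_p + eps), axis=-1)      # predictive entropy
    aleatoric = -np.sum(probs * np.log(probs + eps), axis=-1).mean(axis=0)
    bald = total - aleatoric                                     # epistemic part
    return total, bald
```

A sample on which the ensemble members disagree receives a high BALD score even if each individual prediction is confident.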

A.4 Pre-trained Knowledge
As a general text representation provides information complementary to the task-specific one, using pre-trained language models to extract meta-information is another popular direction. We use 2 meta-information measures extracted from pre-trained models that are agnostic to the target task.
22. Semantical Density (DeVries et al., 2020): analogous to Task Density,

Density(x) = min_K { ||z(x) − ẑ||_2 },

where ẑ ∈ D_train \ {z(x)} and min_K{·} denotes the Kth smallest value in a set. We set K = 5. Unlike Task Density, z is extracted from a pre-trained sentence encoder, e.g., sBERT (Reimers and Gurevych, 2019), which is known to capture the relationship between sentences.
23. Pseudo-Log-Likelihood (PLL) (Salazar et al., 2020; Yuan et al., 2020): Originally, Salazar et al. (2020) use the following masked language modeling (MLM) score from a pre-trained language model θ_MLM as the Pseudo-Log-Likelihood (PLL) (Wang and Cho, 2019):

PLL(x) = Σ_{l=1}^{L} log p_{θ_MLM}(x_l | x_\l),

where x_\l := (x_1, . . . , x_{l−1}, x_{l+1}, . . . , x_L). However, this requires L forward passes per sentence. Hence, we instead use its approximation following Yuan et al. (2020), which computes the PLL in a single pass without masking tokens.
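Given the token-level logits from one unmasked forward pass, the single-pass approximation amounts to summing, at each position, the log-probability the model assigns to the token actually present. A minimal numpy sketch, assuming the logits have already been extracted from the MLM:

```python
import numpy as np

def pll_from_logits(logits: np.ndarray, token_ids: np.ndarray) -> float:
    """Single-pass PLL approximation in the spirit of Yuan et al. (2020):
    instead of masking each position and re-running the MLM L times, take
    one unmasked forward pass and sum the log-probability each position
    assigns to its own token. logits: (L, V); token_ids: (L,).
    """
    # numerically stable log-softmax over the vocabulary dimension
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return float(log_probs[np.arange(len(token_ids)), token_ids].sum())
```

A fluent sentence (tokens matching the model's top predictions) thus scores higher than an implausible one, at the cost of a single forward pass.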

B Details of Experiment

B.1 Dataset
In this section, we provide the details of all datasets used in this work and the hyperparameters used for training the models. For all datasets, we use the given standard training and validation sets. We present the details of the downstream datasets in Table 7. All of the data we used can be downloaded from the HuggingFace dataset hub, https://huggingface.co/datasets/. For experiments, we report accuracy using the official test set on WinoGrande and AGNEWS. For datasets where the labels of the official test set are not available (SST-2, RTE, MNLI, and QNLI), we use the given validation set. Also, the maximum sequence length is commonly set to 128.

B.2 Data Pruning
For data pruning experiments, we commonly fine-tune a RoBERTa-large classifier (Liu et al., 2019), which has 355M parameters, following (Swayamdipta et al., 2020). For fine-tuning, we train for 10 epochs with a learning rate of 1e-5 and a batch size of 16 (except WinoGrande, where we use 64 following (Swayamdipta et al., 2020) due to optimization difficulty) with the Adam optimizer (Kingma and Ba, 2014). For each pruning ratio, selection method, and dataset, we run three times with different random seeds.

B.3 Active Learning
Active learning (AL) aims to find the most informative subset of unlabeled samples to label and add to the training dataset. AL usually consists of multiple iterations of the following two steps: (1) annotate a subset of unlabeled data chosen by a sampling method, and (2) add the labeled data to the previous round's dataset and re-train the model on the new training dataset. In each round, we train the model from scratch to avoid overfitting, following Hu et al. (2019).
To be specific, for the experiments in Section 5, we select 100 examples for RTE and 500 examples for CoLA and AGNEWS at each iteration from the training dataset. 8 The selected examples are then moved from the unlabeled pool to the labeled dataset in each iteration.
To simulate AL, we sample a batch of k sentences from the training dataset and query labels for this batch; batch size k is set to 500 for SST-2 and AGNEWS, and 100 for RTE, which is a relatively small dataset. For each sampling method and dataset, we run the AL simulation five times with different random seeds. Also, we fine-tune models for five epochs on SST-2 and AGNEWS, and ten epochs on RTE. We experiment with the BERT-base model (110M parameters) provided by HuggingFace Transformers (Wolf et al., 2019) under the Apache License 2.0. Our implementation is based on existing code repositories 9 with the MIT License, using the same hyperparameters as (Yuan et al., 2020). We use AdamW (Loshchilov and Hutter, 2019) with a learning rate of 2e-5.
BERT-KM (Yuan et al., 2020): a diversity-based baseline that applies k-means clustering to the l2-normalized BERT output embeddings of the fine-tuned model to select k data points.
FT-BERT-KM (Yuan et al., 2020): the same algorithm as BERT-KM, except that the BERT embeddings from the previously fine-tuned model are used.
ALPS (Yuan et al., 2020): the input sentence is randomly masked, and the masked language modeling (MLM) loss of BERT is predicted as a proxy for model uncertainty.
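The diversity-based BERT-KM baseline can be sketched as follows: l2-normalize the embeddings, run k-means, and return the sample closest to each centroid. This is a minimal sketch with a deterministic farthest-point initialization, not the baseline's exact implementation.

```python
import numpy as np

def kmeans_select(embeddings: np.ndarray, k: int, iters: int = 20):
    """Select k diverse indices: k-means on l2-normalized embeddings,
    then take the sample nearest each centroid (BERT-KM style)."""
    x = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    # farthest-point initialization: start at sample 0, then repeatedly
    # add the point farthest from the centers chosen so far
    chosen = [0]
    for _ in range(1, k):
        d = np.min([((x - x[c]) ** 2).sum(-1) for c in chosen], axis=0)
        chosen.append(int(d.argmax()))
    centers = x[chosen].copy()
    for _ in range(iters):  # standard Lloyd iterations
        d = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        assign = d.argmin(axis=1)
        for j in range(k):
            if np.any(assign == j):
                centers[j] = x[assign == j].mean(axis=0)
    d = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    # one representative per cluster: the sample nearest its centroid
    return sorted({int(d[:, j].argmin()) for j in range(k)})
```

On well-separated clusters, the selection covers each cluster with one representative rather than concentrating in a single dense region.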

B.4 Data Annotation
Here, we provide the details of the annotation pipeline with crowd workers. During the experiments, we annotate the unlabeled samples selected by each selection method for the SST-5 and IMP datasets. To this end, we use Amazon's Mechanical Turk crowd-sourcing platform (Crowston, 2012). Figures 10 and 11 show the interfaces used to collect annotations from crowd workers for each task; the top provides the summary and the middle provides the details. To improve the quality of the collected preference labels, we only hire Master workers, identified as high-performing workers by Amazon's Mechanical Turk system. Overall, we gather at least 3 annotations for each sample. For the experiments with annotated samples, we use the same experimental setup as data pruning in Section 5.1. Also, for annotator disagreement, we report the variance within the multiple annotations. We will release the annotated dataset for future research.
C Additional Results

C.1 Data Pruning
First, in Figure 13, we plot the test accuracy of fine-tuned RoBERTa-large across different pruning ratios on the CoLA, SST-2, RTE, and QNLI datasets. While the baseline methods suffer from inconsistent performance across pruning ratios (for example, Hard and Ambig perform well at low pruning ratios but degrade steeply as the pruning ratio increases), infoVerse-DPP consistently outperforms them overall. In addition, we plot the dynamics during data pruning with infoVerse-DPP in Figure 14, similar to Figure 6. Here, one can observe that infoVerse-DPP adaptively finds an effective strategy. Finally, we present the ablation results for our two components: (1) infoVerse and (2) the DPP-based sampling method. As shown in Table 8, the DPP-based sampling method provides a clear improvement in multidimensional space (vs. Coreset). Furthermore, as infoVerse provides a richer feature space than the standard classifier's embedding, this gain is further enlarged when they are combined.

C.2 Active Learning
In Figure 12, we present the test accuracy of fine-tuned BERT-base at each AL iteration on RTE and AGNEWS, respectively. Here, one can observe that infoVerse-DPP shows comparable performance to the state-of-the-art AL baselines.

D infoVerse on Other Datasets
In this section, we present the correlation matrices between the 23 meta-information measures on other datasets, similar to Figure 2. As shown in Figure 16, the correlation matrices across different datasets (and tasks) are quite similar, which implies that the meta-information captures general characteristics of datasets.

E Experiments to Verify Complementary Effect of Meta-information
To further verify the complementary effect among multiple meta-information measures, we conduct simple toy experiments in this section. Similar to (Swayamdipta et al., 2020), we train a simple linear classifier on each feature space, assuming that gold labels for each task are available for training, e.g., whether a given sample is noisily labeled or not. Here, we consider four different abnormality detection tasks on the QNLI dataset: mispredicted, mislabeled (or noisy-labeled), out-of-distribution (OOD), and adversarial (Adv) samples, respectively.
In Table 9, one can verify that accuracy increases as more meta-information is used, implying that the measures are indeed complementary and provide richer information when considered jointly. In the end, infoVerse shows better performance than the original classifier's semantic embedding in all tested cases. We also consider a reduced feature space, *infoVerse, obtained by applying a PCA-based feature selection method to the correlation matrix, and verify that it performs comparably using only half of the meta-information. However, since infoVerse incurs only a small additional cost compared to *infoVerse, we use all 23 meta-information measures in our experiments. It is noteworthy that new meta-information can easily be included in our framework and contribute to an even more informative feature space. In the remainder of this section, we provide the details of this experiment. Here, we use a RoBERTa-large classifier (Liu et al., 2019) fine-tuned on the QNLI dataset (Wang et al., 2019). 1) Finding mispredicted samples: we train a single linear classifier with the SGD optimizer on each feature space to classify whether a given sample is correctly predicted or not, assuming that the corresponding binary labels are available for training. Then, we measure performance only on the mispredicted test samples.
2) Detecting mislabeled samples: following the setup in (Swayamdipta et al., 2020), we artificially inject 10% label noise into training samples with high confidence (i.e., easy-to-learn). Then, we train a single linear classifier with the SGD optimizer on each feature space to classify whether a given sample has a corrupted label or not, assuming that the corresponding binary labels are available for training.
3) Detecting out-of-distribution samples: we consider the samples of QNLI's development set as in-distribution (inD) and the samples of MNLI's development set (both matched and mismatched) as OOD. Then, we train a single linear classifier with the SGD optimizer on each feature space to classify whether a given sample is inD or OOD, assuming that the corresponding binary labels are available for training.
4) Detecting adversarial sentences: here, we consider the task of classifying whether a given sentence is normal (MNLI) or adversarially collected (ANLI (Nie et al., 2019)). Then, we train a single linear classifier with the SGD optimizer on each feature space to classify whether the given sample is normal or adversarial, again assuming that the corresponding binary labels are available for training.
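The probing setup shared by all four tasks, a single linear classifier trained with SGD on a feature space, can be sketched as follows. The function names and optimizer settings are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def train_linear_probe(X, y, lr=0.1, epochs=200, seed=0):
    """Train a logistic-regression probe with plain SGD on feature space X
    to predict binary abnormality labels y (0/1)."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for i in rng.permutation(len(X)):  # one SGD pass over shuffled data
            p = 1.0 / (1.0 + np.exp(-(X[i] @ w + b)))  # sigmoid
            g = p - y[i]                               # log-loss gradient
            w -= lr * g * X[i]
            b -= lr * g
    return w, b

def probe_accuracy(X, y, w, b):
    preds = (X @ w + b > 0).astype(int)
    return float((preds == y).mean())
```

A feature space is "richer" for a given abnormality exactly when such a probe separates abnormal from normal samples more accurately.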

F Computational Cost
The computation cost of infoVerse depends on which meta-information measures are used to construct it. As introduced in Table 1, we consider meta-information in four categories (static measures, training dynamics, model uncertainty, and pre-trained knowledge). Computing these categories requires 1, E, T, and 2 forward passes of the trained model per sample, respectively, where E denotes the total number of training epochs and T the total number of models trained with different random seeds. The proposed sampling method has O(N²M) time complexity when returning N items from M total items, but we remark that it can be further reduced with a simple approximation (Chen et al., 2018). Moreover, our method requires no additional training cost, since it only utilizes models trained in the standard way (cross-entropy and multiple random seeds) and pre-trained models.
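The greedy selection step with the stated O(N²M) cost can be sketched as below, following the fast greedy MAP inference of Chen et al. (2018); this is a sketch of the selection step under a given PSD kernel L, not the full infoVerse-DPP pipeline.

```python
import numpy as np

def greedy_dpp(L: np.ndarray, n: int):
    """Greedy MAP inference for a DPP with kernel L (M x M, PSD):
    repeatedly add the item with the largest marginal log-determinant gain,
    updating the gains incrementally (Chen et al., 2018)."""
    M = L.shape[0]
    c = np.zeros((n, M))           # Cholesky-style update vectors
    d = np.diagonal(L).copy()      # current marginal gains
    selected = []
    for t in range(n):
        j = int(np.argmax(d))      # item with the largest remaining gain
        selected.append(j)
        if t == n - 1:
            break
        # rank-one update of the remaining items' conditional gains
        e = (L[j] - c[:t].T @ c[:t, j]) / np.sqrt(d[j])
        c[t] = e
        d = d - e ** 2
        d[selected] = -np.inf      # never re-pick a selected item
    return selected
```

Because the gain combines item quality (the kernel diagonal) and dissimilarity to already-selected items, a near-duplicate of a selected item is skipped in favor of a more distinct one.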
For example, we measured the runtime of our approach on the CoLA dataset: training, constructing infoVerse, and DPP sampling consume 3852s, 100s, and 10s, respectively. This demonstrates that the overall cost of our method is relatively minor compared to the training expense. Moreover, this cost is incurred only once, at initial construction. It is also worth noting that all meta-information measures within the same category are obtained from the same forward passes, and there is no additional cost after infoVerse is constructed.
In addition, although this work does not focus on reducing computational overhead, we believe simple practices can further reduce the cost of constructing infoVerse, for example, substituting the additional forward passes by saving the model outputs on the fly during training (Huang et al., 2017; Tarvainen and Valpola, 2017).