Learning with Different Amounts of Annotation: From Zero to Many Labels

Training NLP systems typically assumes access to annotated data with a single human label per example. Given imperfect labeling from annotators and the inherent ambiguity of language, we hypothesize that a single label is not sufficient to learn the spectrum of language interpretation. We explore new annotation distribution schemes, assigning multiple labels per example for a small subset of training examples. Introducing such multi label examples at the cost of annotating fewer examples brings clear gains on the natural language inference and entity typing tasks, even when we simply first train with single label data and then fine-tune with multi label examples. Extending the MixUp data augmentation framework, we propose a learning algorithm that can learn from training examples with different amounts of annotation (zero, one, or multiple labels). This algorithm efficiently combines signals from uneven training data and brings additional gains in low annotation budget and cross domain settings. Together, our methods achieve consistent gains on both tasks, suggesting that distributing labels unevenly among training examples can be beneficial for many NLP tasks.


Introduction
Crowdsourcing annotations (Rajpurkar et al., 2016; Bowman et al., 2015) has become a common practice for developing natural language processing benchmark datasets. Even after thorough quality control, it is often infeasible to reach complete annotator agreement, as annotators make mistakes (Freitag et al., 2021) and ambiguity is a key feature of human communication (Asher and Lascarides, 2005). Rich prior work (Passonneau et al., 2012; Pavlick and Kwiatkowski, 2019; Nie et al., 2020; Min et al., 2020; Ferracane et al., 2021) shows

1 Code and data split is available at https://github.com/szhang42/Uneven_training_data.

Figure 1: Re-thinking how to distribute annotation budget. Each blue tag represents a human annotation for the corresponding x. Examples in the orange shaded area are assigned many labels (multi label data), examples in the yellow shaded area are assigned a single label (single label data), and examples in the grey shaded area are not assigned any labels. Models trained on a combination of multi label, single label, and unlabeled data outperform models trained on single label data on both label accuracy and label distribution metrics for the entailment and entity typing tasks.
that disagreement among annotators is not an annotation artifact but rather core linguistic phenomena.
Despite observing such inherent ambiguity, most work has not embraced ambiguity in the training procedure. Most existing datasets (Wang et al., 2019; Rajpurkar et al., 2016) provide a single label per training example while collecting multiple labels for examples in the evaluation set, with a few notable exceptions on subjective tasks (Passonneau et al., 2012; Ferracane et al., 2021). We challenge this paradigm and re-distribute the annotation budget unevenly among training examples, generating a small number of training examples with multiple labels. Without changing mainstream model architectures (Vaswani et al., 2017), we change the annotation budget allocation; Figure 1 illustrates this uneven labeling scheme. We present a retrospective study (Liu et al., 2021) with datasets from prior work (Nie et al., 2020; Choi et al., 2018). We first evaluate our approach on densely annotated NLI datasets, where human disagreement is prevalent (Pavlick and Kwiatkowski, 2019). We report majority label accuracy and distribution metrics (e.g., KL divergence, which measures a model's ability to estimate the human label distribution). Our experiment on a multi label task, fine-grained entity typing (Choi et al., 2018), exhibits a similar trend: acquiring multiple labels for a single example is more effective than labeling as many examples as possible.
Lastly, we present an in-depth study comparing models trained with multi label data and models trained with single label data. Training with single label examples leads to low-entropy label distributions that fail to capture human disagreement. While calibration techniques such as smoothing the output distribution (Guo et al., 2018) can alleviate over-confident model predictions and improve distributional metrics, they erroneously introduce uncertainty even for unambiguous examples. Our study suggests that an uneven label distribution scheme, paired with a learning architecture that combines three different types of training examples, provides an efficient and effective solution.

Data Configuration
We first describe our training data configuration and then discuss our learning algorithms. We notate the input feature vector as x and the output label distribution as y. We have three types of training examples: an unlabeled data set X_u = {x_u^1, x_u^2, ..., x_u^{u_n}}, where u_n is the total number of unlabeled examples; a single label data set X_s = {(x_s^1, y_s^1), (x_s^2, y_s^2), ..., (x_s^{s_n}, y_s^{s_n})}, where s_n is the total number of single label examples; and a multi label data set X_m = {(x_m^1, {y_m^{1,1}, ..., y_m^{1,k}}), ..., (x_m^{m_n}, {y_m^{m_n,1}, ..., y_m^{m_n,k}})}, where m_n is the total number of multi label examples and k is the number of annotations per example. For multi label examples, we aggregate the multiple annotations to generate y_m^*. Unlike y_s, which is a one-hot vector, y_m^* is a distribution over labels: for the label distribution estimation problem, we average (y_m^{i,1}, ..., y_m^{i,k}); for the label prediction problem, we take the majority label, argmax over (y_m^{i,1}, ..., y_m^{i,k}). The annotation cost for generating training datasets can be described as a function of two factors (Sheng et al., 2008): the number of examples and the number of labels per example. Both impact model performance and are highly associated with annotation cost. In most existing studies (Wang et al., 2019), the training data is a set of examples annotated with a single label, X_s. Supervised learning assumes access to X_s, unsupervised learning assumes additional unlabeled examples X_u, and semi-supervised learning assumes a mixture of X_u and X_s. Here, we focus on the distribution of annotations over examples and make the simplifying assumption that annotation cost scales linearly with the number of labels.
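The aggregation of k annotations into the target y_m^* can be sketched as follows. This is an illustrative helper (`aggregate_multi_labels` is our own name, not from the paper's released code): averaging the one-hot votes yields a soft distribution for label distribution estimation, while the argmax yields the majority label for label prediction.

```python
import numpy as np

def aggregate_multi_labels(annotations, num_classes, mode="distribution"):
    """Aggregate k annotations for one example into a target y*_m.

    annotations: list of k integer class ids (one per annotator).
    mode="distribution": average the one-hot votes (label distribution
    estimation); mode="argmax": one-hot majority vote (label prediction).
    """
    counts = np.bincount(annotations, minlength=num_classes).astype(float)
    if mode == "distribution":
        return counts / counts.sum()                 # soft target over labels
    return np.eye(num_classes)[counts.argmax()]      # one-hot majority label
```

For example, three NLI annotations (E, E, N) become the soft target (2/3, 1/3, 0) in distribution mode and the one-hot vector (1, 0, 0) in argmax mode.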
We propose a setup where we distribute the annotation label budget unevenly across training examples, resulting in unlabeled examples, single label examples, and multi label examples. We do not collect any new annotations in this work; we re-use datasets from prior work (Choi et al., 2018; Chen et al., 2020b), resplitting them to simulate different label distribution scenarios. For each task, we study the X_s setting, which considers a fixed number of supervised, single label examples. Then, we introduce the X_s + X_m setting, which includes both multi label and single label examples (keeping the total amount of annotation the same as in the X_s setting). Lastly, we study adding unlabeled examples X_u to both settings.

Task
We consider two classification tasks: Natural Language Inference (NLI) and fine-grained entity typing. Recent papers (Pavlick and Kwiatkowski, 2019; Nie et al., 2020) have shown that human annotators disagree on the NLI task because of its inherent ambiguity. Such disagreement is not an annotation artifact but rather exhibits the judgement of annotators with differing interpretations of entailment (Reidsma and op den Akker, 2008).

Premise: A woman in a tan top and jeans is sitting on a bench wearing headphones.
Hypothesis: A woman is listening to music.
Old Labels: E E N N E
New Labels: N (93) E (7)

Sentence with Target Entity: During the Inca Empire, {the Inti Raymi} was the most important of four ceremonies celebrated in Cusco.
Entity Type Labels: event, festival, ritual, custom, ceremony, party, celebration

Table 1: Examples from the ChaosSNLI and Ultra-Fine Entity Typing datasets. In the NLI task, each label corresponds to one annotator's judgement (entailment (E) / neutral (N) / contradiction (C)). In fine-grained entity typing, the entity mention is marked in blue with curly brackets. Each positive type label is treated as a single label.

Table 2: Training data configurations. The number of labels is consistent in all settings: in the NLI task, each multi label example contains 10 labels, and in the UFET task, each multi label example contains 2 labels. For completeness, we also provide the original training data configurations.

Named entity recognition (Sang and Meulder, 2003), in its vanilla setting with a handful of classes, is a straightforward task with high inter-annotator agreement. However, when the label set grows, comprehensive annotation becomes challenging, and most distant supervision examples offer only partial labels. Many real world tasks (Bhatia et al., 2016) involve such complex, large label spaces, where comprehensively annotating examples is often infeasible. We choose the ultra-fine entity typing dataset (Choi et al., 2018), which provides typing into a rich ontology of over 10K label candidates. Unlike NLI, fine-grained entity typing is a multi label classification task, where a single example is assigned a set of gold type labels. Thus, acquiring multiple labels for the same example reveals correlations among labels (e.g., musicians are also artists). Table 1 shows an example of each task, and Table 2 shows the full experimental data configuration, which is explained below.
NLI: Label Distribution Estimation NLI (Dagan et al., 2005; Bowman et al., 2015) is the task of deciding whether a hypothesis h is supported by a given premise p. It is a three-way classification task with "entailment", "contradiction", and "neutral" as labels, and has recently been reframed as a human label distribution prediction task.
We use the training data from the original SNLI (Bowman et al., 2015) and MNLI (Williams et al., 2018) datasets, containing 549K and 392K instances respectively. Recent work presents the ChaosNLI dataset (Nie et al., 2020), which collects 100 annotations per example. For ChaosNLI examples used in training, we randomly sample 10 of the 100 annotations for each example. For single label data, we directly sample from the original SNLI/MNLI data according to the annotation budget (e.g., 150K or 6K examples).
Ultra Fine Entity Typing (UFET): Multi Label Classification UFET takes a sentence and an entity mention, and labels the mention with a set of entity types from a rich type ontology covering 10K types. Each example is annotated with 5 labels on average: 0.9 general types, 0.6 fine-grained types, and 3.9 ultra-fine types. We consider each positive type annotation as a single label; thus the original data setting is a combination of X_s and X_m examples (mostly X_m). We simulate the X_s setting and the X_s + X_m setting for our study.
The dataset consists of 6K crowd-sourced examples, randomly split evenly into train, development, and test sets. We fix the total training label budget at 500 labels. For the X_s setting, we randomly sample 500 examples with one label each. For the X_s + X_m setting, we sample 100 examples with one label and 200 examples with two labels. We only modify the training data and use the original evaluation set.
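As a sanity check on the budget arithmetic under the paper's linear-cost assumption, both UFET settings cost exactly 500 labels, and the 6K NLI split works the same way. This is an illustrative helper (`total_labels` is our own name, not from the paper's code):

```python
def total_labels(config):
    """Total annotation cost under the linear-cost assumption: each
    (num_examples, labels_per_example) pair costs their product."""
    return sum(n_examples * n_labels for n_examples, n_labels in config)

# UFET, 500-label budget: both settings cost the same.
xs_only = [(500, 1)]                 # 500 single label examples
xs_plus_xm = [(100, 1), (200, 2)]    # 100 single + 200 double-labeled
assert total_labels(xs_only) == total_labels(xs_plus_xm) == 500

# NLI, 6K-label budget: 1K single + 0.5K ten-way annotated examples.
assert total_labels([(1000, 1), (500, 10)]) == 6000
```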

Learning
We introduce learning algorithms that can handle different types of training data. We first describe the feature extractors for both tasks, which map natural language to a dense vector representation x, and then discuss the learning algorithms. For the learning algorithms, we first discuss learning with annotated examples only (single label and multi label), then describe a strategy to integrate unlabeled data. All learning configurations are optimized with the cross entropy (CE) loss.

Base Model
We present the base models used to derive the input feature vector x from natural language examples. Training details and hyperparameter settings can be found in the appendix.
NLI We use a RoBERTa (Liu et al., 2019) based classification model: we encode the concatenated hypothesis and premise and pass the resulting [CLS] representation through a fully connected layer to predict the label distribution. (Our resplitting breaks the original assumption that premises are disjoint across splits, so a premise can now occur in both training and evaluation with different hypotheses. However, we find that performance on examples with and without overlapping premises in the training set does not vary significantly.)

UFET We follow the baseline architecture presented in Choi et al. (2018): a bidirectional LSTM generates contextualized representations, and the model computes an attention-weighted sum of these representations over the words in the sentence to represent an example. This representation is then used to decide the membership of each label in the 10K ontology.

MixUp We build on MixUp (Zhang et al., 2018), a data augmentation method that interpolates pairs of training examples, in contrast to augmentation methods that generate new text, such as back-translation (Sennrich et al., 2016; Zhang et al., 2021b) and word replacement. We describe the original MixUp algorithm below.

Labeled Examples Only
Given two examples (x_m, y_m) and (x_n, y_n), where x is the input feature vector and y is a one-hot label encoding, MixUp constructs augmented training examples by incorporating the intuition that linear interpolations of feature vectors should lead to linear interpolations of the associated targets:

x̃ = λ x_m + (1 − λ) x_n,
ỹ = λ y_m + (1 − λ) y_n,

where λ is a scalar that mixes both the inputs and labels, sampled from a Beta(η, η) distribution with hyperparameter η. The newly generated pair (x̃, ỹ) is used as a training example, and the learning objective is

L = L_CE(d(x̃; φ), ỹ),

where L_CE is the cross entropy loss and d(·; φ) is a classifier on top of the encoder model that takes the mixed representation x̃ as input and returns a probability distribution over the label set. The interpolated annotated examples x_m and x_n can each be either single label or multi label data. We define the loss from interpolating a single label example with a multi label example as L_s,m, the loss from interpolating two multi label examples as L_m,m, and the loss from interpolating two single label examples as L_s,s. The MixUp (Zhang et al., 2018) loss in our X_s + X_m setting is then

Mixup(X_s, X_m) = L_s,s + L_m,m + α L_s,m,

where α is a coefficient weighting the cross-type term (Tarvainen and Valpola, 2017; Berthelot et al., 2019; Fan et al., 2020).
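The interpolation above can be sketched as follows. This is a minimal illustration (`mixup_pair` is our own name): because y may be either a one-hot vector (single label) or an aggregated soft distribution (multi label), the same rule produces the pairs behind the L_s,s, L_m,m, and L_s,m terms.

```python
import numpy as np

def mixup_pair(x_m, y_m, x_n, y_n, eta=0.4, rng=None):
    """One MixUp interpolation: lambda ~ Beta(eta, eta) mixes both the
    feature vectors and the label targets with the same coefficient."""
    rng = rng or np.random.default_rng(0)
    lam = rng.beta(eta, eta)
    x_tilde = lam * np.asarray(x_m, float) + (1 - lam) * np.asarray(x_n, float)
    y_tilde = lam * np.asarray(y_m, float) + (1 - lam) * np.asarray(y_n, float)
    return x_tilde, y_tilde
```

Since both targets sum to one, the mixed target ỹ is again a valid distribution, so the same cross entropy loss applies to every pair type.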

Semi-supervised Learning
Now we introduce unlabeled examples into the training algorithm. Following prior work (Berthelot et al., 2019), we generate pseudo labels for each unlabeled example. For an unlabeled x_u, we use the model's current predictions to generate the pseudo labels (Xie et al., 2020). Given the unlabeled data set X_u = {x_u^1, ..., x_u^N}, the classifier generates a pseudo label distribution q_n for each data point x_u^n. We sharpen this distribution by taking the argmax of q_n, yielding a one-hot vector over the labels. The classifier used to generate the pseudo labels is trained jointly, end to end, using the learning signals from the labeled data.
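The sharpening step can be sketched as below (an illustrative helper; `sharpen_pseudo_label` is our own name). Note that the hard argmax used here is one choice; Berthelot et al. (2019) instead sharpen with a temperature.

```python
import numpy as np

def sharpen_pseudo_label(q):
    """Sharpen a predicted distribution q for an unlabeled example
    into a one-hot pseudo label by taking the argmax."""
    one_hot = np.zeros_like(np.asarray(q, dtype=float))
    one_hot[np.argmax(q)] = 1.0
    return one_hot
```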
MixUp Three Types of Data After generating pseudo labels for the unlabeled data, we have three types of input: single label examples X_s, multi label examples X_m, and unlabeled examples X_u, all with corresponding labels. We introduce MixUp interpolation among the three types of data, integrating all of them into the objective function:

Mixup(X_s, X_m, X_u) = L_s,s + L_m,m + α (L_s,m + L_s,u + L_m,u).
For all settings, we set the maximum value of loss weight α as 2.0 and linearly ramp up α from 0 to its maximum value over the first 100 iterations of training as is common practice (Tarvainen and Valpola, 2017;Berthelot et al., 2019).
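The combined objective and the ramp-up schedule can be sketched together (illustrative helpers with our own names; in practice each pair loss is a mean cross entropy over a batch of mixed examples):

```python
def combined_mixup_loss(pair_losses, alpha):
    """Mixup(Xs, Xm, Xu) = L_ss + L_mm + alpha * (L_sm + L_su + L_mu).
    `pair_losses` maps each pair type to its cross entropy loss."""
    return (pair_losses["ss"] + pair_losses["mm"]
            + alpha * (pair_losses["sm"] + pair_losses["su"] + pair_losses["mu"]))

def ramped_alpha(step, max_alpha=2.0, ramp_steps=100):
    """Linearly ramp alpha from 0 to max_alpha over the first
    `ramp_steps` training iterations, then hold it constant."""
    return max_alpha * min(step / ramp_steps, 1.0)
```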

Experiments
We present the performance of our labeling scheme and learning framework in this section. All experiments are run three times with different random seeds to determine the variance, which is small.

Evaluation Metrics
NLI We follow the evaluation metrics of the original papers (Bowman et al., 2015; Nie et al., 2020). We report classification accuracy, computed twice: once against the aggregated gold labels of the original 5-way annotated dataset (old), and once against the aggregated labels of the 100-way annotated dataset (new). We also report distributional evaluation metrics: Jensen-Shannon divergence (Endres and Schindelin, 2003) and Kullback-Leibler divergence (Kullback and Leibler, 1951). We present an analysis of the different evaluation metrics in Section 4.5.
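The two distributional metrics can be computed as below (a minimal numpy sketch with our own function names; note that natural log is used here, whereas the metric of Endres and Schindelin (2003) is the square root of the base-2 Jensen-Shannon divergence):

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete label distributions;
    eps guards against log(0)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def js_divergence(p, q):
    """Jensen-Shannon divergence: average KL of p and q to their
    mixture m = (p + q) / 2. Symmetric and bounded by log 2."""
    m = 0.5 * (np.asarray(p, float) + np.asarray(q, float))
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
```

Both are computed per example between the model's predicted distribution and the human label distribution, then averaged over the evaluation set.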
UFET We compute macro-averaged precision, recall, and F1, and the average mean reciprocal rank (MRR), following prior work.

NLI Results
In Table 3, we evaluate the impact of introducing multi label data in the full data setting. Even with a large annotation budget, learning with single label data shows limited performance, and we see substantial gains on both accuracy and distribution metrics by replacing 5K single label examples with a small amount of multi label data (500 examples). X_s + X_m outperforms previously published results (X_s) from Nie et al. (2020). Here we try vanilla curriculum learning, which first trains a model with X_s data and then fine-tunes with X_m data. With these encouraging initial results, we further explore different learning objectives in more constrained annotation budget scenarios (150K and 6K). The results on the ChaosMNLI dataset are presented in Table 4: annotating as many examples as possible with a single label is not effective compared to dedicating even a small amount of the budget to multi-annotated data (500 examples, each 10-way annotated).
Now we compare different methods of integrating multi label and single label data. As a baseline, we denote simply combining multi label and single label data as CE (combined). Simple combination does not work well when the number of multi label examples (0.5K) is much smaller than the number of single label examples (145K), but it shows comparable performance in the 6K setting, where multi label and single label data are more balanced (0.5K multi label vs. 1K single label examples). Upsampling multi label data improves over CE (combined). CE (X_s then X_m), which first trains the model with single label data and then fine-tunes with multi label data, works better still, consistently achieving strong performance across experimental settings.
Next, we discuss gains from the MixUp data augmentation methods. We observe small yet consistent gains from example MixUp in the single label setting (i.e., X_s: MixUp (X_s) vs. X_s: CE), confirming findings from previous studies (Zhang et al., 2018). Integrating multi label training examples into the MixUp objective shows gains in the low annotation budget setting. In the high annotation budget setting, where we have far fewer multi label than single label examples (500 vs. 145K), CE (X_s then X_m) yields better results. Nonetheless, MixUp augmentation shows consistent gains over simple shuffled combination (MixUp (X_s, X_m) vs. CE (combined)).
Our results suggest that the annotation budget should be distributed carefully. Even under the same label budget and the same learning objective, the distribution of labels among examples results in performance differences (i.e., X_s: CE vs. X_s + X_m: CE (combined)). Incorporating unlabeled examples (MixUp (X_s, X_u) vs. MixUp (X_s)) improves performance in the low label budget setting (6K) but is detrimental in the high label budget setting (150K). We hypothesize that imperfect pseudo labels for unlabeled examples can interfere with learning.

Table 5: Results on the UFET dataset. The top two rows use the full crowd-sourced data and the bottom rows are based on smaller label annotation budgets, thus the results are not comparable (see Table 2 for details).

UFET Results

Table 5 reports performance on the ultra-fine entity typing dataset. Instead of using both crowd-sourced data and distant supervision data (Choi et al., 2018), we focus on crowd-sourced data to simulate the single label and multi label settings. As in the previous results, each row block represents a different annotation label budget: the top two rows use the full crowd-sourced data, while the bottom rows are based on smaller budgets such as 500 single label examples (see Table 2 for details), so the two blocks are not comparable. Again in this task, using a single label per example results in inferior performance compared to having multiple labels per example (X_s + X_m: CE (X_s then X_m) vs. X_s: CE), as multi label data helps the model learn label-label interactions. As in the NLI task, adding the MixUp objective to the single label setting shows gains (X_s: MixUp (X_s) vs. X_s: CE). Having multi label data is crucial for high performance, and MixUp again shows gains in this low resource setting.

Analysis
How do different learning algorithms compare under domain shift? We compare two promising methods, single then multi (CE (X_s then X_m)) and MixUp (MixUp (X_s, X_m)), for their performance in the out-of-domain setting. Prior work suggests MixUp approaches can effectively compensate for the mismatch between test data and training data (Zhu et al., 2019). Table 6 shows the performance of models trained on SNLI and tested on the MNLI dataset. We observe improved accuracy with MixUp compared to the curriculum approach (training with single label data and then fine-tuning with multi label data).

Which examples should receive multiple labels? We compare selecting examples for multi-annotation by their label distribution entropy against randomly sampled examples. Easy-to-learn examples, with the lowest label distribution entropy, are the least effective, but the difference is small in our settings. Similarly, our experiments varying the number of labels per example (5-way, 10-way, 20-way) did not show meaningful differences. The experimental results can be found in Table 10 in the appendix.
Can we use multi label data exclusively, without any single label data? In our main experiments, we mixed multi label data with single label data. Here we present a study comparing X_m-only and X_s-only settings on the NLI task, while keeping a small annotation budget steady (1K labels). On the ChaosSNLI dataset, the model trained with multi label data only outperforms the model trained with single label data only, and we observe a similar trend on the ChaosMNLI dataset. We cannot claim that X_m only will outperform X_s only in all settings, as models benefit from being exposed to diverse examples, but in this low resource setting we observe gains from using multi-annotated data alone.

Calibration: Alternative Approach to Improve Label Distribution Prediction
We introduced multi label training examples as an efficient way to estimate the distribution of labels. Here, we study alternative ways to improve label distribution prediction, borrowing ideas from the calibration literature, and compare calibration with training on multi label data.
The key observation is that the predicted label distribution of a model trained with single labels is over-confident, with a predicted label entropy of 0.414 (Table 7) compared to the human annotated label entropy of 0.732. Thus, we smooth the output distribution with three calibration methods (Guo et al., 2018; Miller et al., 1996). The temp. scaling and pred smoothing methods are post-hoc and do not require re-training the model. For all methods, we tune a single scalar hyperparameter per dataset such that the entropy of the predicted label distribution matches the entropy of the human label distribution.
• temp. scaling: scale by multiplying the non-normalized logits by a scalar hyperparameter.
• pred smoothing: post-process the softmaxed label distribution by moving α probability mass from the label with the highest mass to all labels equally.
• train smoothing: process the training label distribution by shifting α probability mass from the gold label to all labels equally.

Table 7 reports the performance of the calibration methods. All calibration methods improve performance on both distribution metrics (JSD and KL). Temperature scaling yields slightly better results than label smoothing, consistent with the findings of Desai and Durrett (2020), who show temperature scaling is better for in-domain calibration than label smoothing. Nonetheless, all of these results are substantially worse than using multi label data during training.
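The two post-hoc methods can be sketched as follows (illustrative helpers with our own names; following the description above, scaling multiplies the logits by a scalar, where a scalar below 1 flattens the distribution, and pred smoothing assumes α does not exceed the top label's mass):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def temperature_scale(logits, t):
    """temp. scaling: multiply the non-normalized logits by a scalar t
    before the softmax (t < 1 raises the prediction entropy)."""
    return softmax(t * np.asarray(logits, float))

def pred_smooth(p, alpha):
    """pred smoothing: move alpha mass off the most likely label and
    redistribute it uniformly over all labels (assumes alpha <= max p)."""
    p = np.asarray(p, float).copy()
    p[p.argmax()] -= alpha
    return p + alpha / len(p)
```

In both cases the scalar (t or α) would be tuned per dataset until the average prediction entropy matches the human label entropy.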
Can we estimate the distribution of ambiguous and less ambiguous examples? Figure 2 shows the empirical example distribution over entropy bins. The leftmost plot (a) shows the human label entropy over our evaluation set, and plot (b) shows the prediction entropy of the baseline RoBERTa model. The model trained with single label examples is over-confident about its predictions. With label smoothing (plot c), the over-confidence problem is alleviated, but the entropy distribution still does not match that of the ground truth. Training with multi label data (plot d) makes the prediction entropy distribution similar to the ground truth.

Related Work
Assessing the annotation cost associated with learning has long been studied (Turney, 2002). Sheng et al. (2008) studies the tradeoff between collecting multiple labels per example vs. annotating more examples. Researchers have also explored different data labeling strategies, such as active learning (Fang et al., 2017), providing fine-grained rationales (Dua et al., 2020), retrospectively studying the amount of training data necessary for generalization (Mishra and Sachdeva, 2020), and the policy learning approach (Kratzwald et al., 2020).
In this work, we study uneven distribution of label annotation budget for training examples, which has not been explored to our knowledge. Label propagation has been extensively used to infer pseudo-labels for unlabeled data, which are used to train the classifier (Zhou et al., 2004;Li et al., 2016). Our use of MixUp can be viewed as a way to propagate label information between the single labeled, multi labeled, and unlabeled data.
Rich prior work studies ambiguity in language interpretation (Aroyo and Welty, 2015). A few studies (Passonneau et al., 2012; Ferracane et al., 2021) frame diverging, subjective interpretations as multi label classification, and a few others (Glickman et al., 2005; Zhang et al., 2017; Chen et al., 2020b) introduce graded human responses. Mayhew et al. (2020) study training a machine translation system with the goal of generating a diverse set of reference translations. Pavlick and Kwiatkowski (2019) examine the distribution behind human judgements for NLI, and Nie et al. (2020) present the larger-scale data collection that we build on.
An earlier version of this paper (Zhang et al., 2021a) studies capturing inherent human disagreement in the NLI task through calibration and a small amount of multi-annotated training examples. This paper expands upon it, introducing a new learning framework for such uneven label distribution schemes. Concurrent to our work, Zhou et al. (2021) introduce distributed NLI, a new NLU task whose goal is to predict the distribution of human judgements, applying additional distribution estimation methods such as Monte Carlo (MC) Dropout and deep ensembles. While we share a similar goal, our work focuses on how to distribute training labels across examples and how to learn under this new label distribution scheme.

Conclusion
Our work demonstrates the benefits of introducing a small amount of multi label examples at the cost of annotating fewer examples. The proposed learning algorithm, extended from MixUp, flexibly takes signals from different types of training examples (single label, multi label, and unlabeled data) and shows gains over simply combining different datasets in low annotation budget settings. In this work, we retrospectively study existing data to question original annotation collection designs. Exploring reinforcement learning or active learning to predict an optimal distribution of the annotation budget will be an exciting avenue for future work.

Dataset | Setting | # single | # multi | # unlabeled | # labels | # examples
S/MNLI | X_s + X_m | 1K | 0.5K | 0 | 1K * 1 + 0.5K * 10 = 6K | 1.5K
S/MNLI | X_s + X_u | 6K | 0 | 549K - 6K | 6K * 1 = 6K | 549K
S/MNLI | X_s + X_m + X_u | 1K | 0.5K | 549K - 1.5K | 1K * 1 + 0.5K * 10 = 6K | 549K

Table 11: Training data configurations for the 6K NLI setting. Each configuration is characterized by the number of labels and the number of examples. The number of labels is consistent in all settings; in the NLI task, each multi label example contains 10 labels. For completeness, we also provide the original training data configurations.