Embracing Ambiguity: Shifting the Training Target of NLI Models

Natural Language Inference (NLI) datasets contain examples with highly ambiguous labels. While much existing work pays little attention to this fact, several recent efforts have been made to acknowledge and embrace the existence of ambiguity, such as UNLI and ChaosNLI. In this paper, we explore the option of training directly on the estimated label distribution of the annotators in the NLI task, using a learning loss based on this ambiguity distribution instead of the gold-labels. We prepare AmbiNLI, a trial dataset obtained from readily available sources, and show it is possible to reduce ChaosNLI divergence scores when finetuning on this data, a promising first step towards learning how to capture linguistic ambiguity. Additionally, we show that training on the same amount of data but targeting the ambiguity distribution instead of gold-labels can result in models that achieve higher performance and learn better representations for downstream tasks.


Introduction
Ambiguity is intrinsic to natural language, and creating datasets free of this property is a hard if not impossible task. Previously, it was common to disregard it as noise or as a sign of poor quality data. More recent research, however, has drawn our attention towards the inevitability of ambiguity, and the necessity to take it into consideration when working on natural language understanding tasks (Pavlick and Kwiatkowski, 2019; Chen et al., 2020; Nie et al., 2020; Swayamdipta et al., 2020). This ambiguity stems from the lack of proper context or differences in background knowledge between annotators, and leads to a large number of examples where the correctness of labels can be debated.
ChaosNLI (Nie et al., 2020) is a dataset created by manually annotating a subset of the SNLI (Bowman et al., 2015), MNLI (Williams et al., 2018), and αNLI (Bhagavatula et al., 2020) datasets. Each of the total 4,645 samples received 100 annotations. Through this data, they were able to generate a probability distribution over the labels for these samples, which they call the human agreement distribution, with the goal of using it to evaluate the ability of current state-of-the-art models to capture ambiguity. The divergence scores between the model's predicted probability distribution and the true target distribution are computed and compared against random and human baselines. They showed that models trained using gold-labels have very poor performance on the task of capturing the human agreement distribution.
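The divergence metric used in this evaluation can be sketched as follows. This is a minimal NumPy version; the helper names and the toy distributions are ours for illustration, not from the ChaosNLI codebase:

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """Kullback-Leibler divergence KL(p || q) for discrete distributions."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    return float(np.sum(p * np.log(p / q)))

def jsd(p, q):
    """Jensen-Shannon Divergence: symmetric and bounded by log(2)."""
    m = 0.5 * (np.asarray(p, dtype=float) + np.asarray(q, dtype=float))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# A model that outputs a confident one-hot prediction on an ambiguous
# example diverges far more than one matching the human distribution.
human = [0.45, 0.40, 0.15]    # e.g. [entailment, neutral, contradiction]
one_hot = [1.0, 0.0, 0.0]
soft = [0.50, 0.35, 0.15]
```

A gold-label-trained model tends to produce distributions like `one_hot` above, which is exactly what inflates its JSD against the human agreement distribution.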
Although this is a promising first step, it remains unclear how to train models with a better understanding of ambiguity, and what tangible benefits we can obtain when actually doing so. In this work, we study the possibility of shifting the training target of models from gold-labels to the ambiguity distribution, a simple and intuitive yet until now unexplored approach in this domain. We hypothesize that when we finetune a model in this way, we can achieve lower divergence scores in the ChaosNLI benchmark. Further, we believe that it should also bring accuracy improvements in NLI and other downstream tasks. The intuition behind our performance expectations is that an ambiguity distribution offers a more informative and less misleading view on the answer to the task, which allows models to learn more from the same data.
We prepare a trial dataset with ambiguity distributions obtained from available SNLI and MNLI data, and run experiments to confirm our hypotheses. We refer to it as AmbiNLI, but we do not encourage its use in further work. Instead, we encourage the community to follow this direction by performing further data collection in this area.
Our main contributions are showing that 1) models trained on ambiguity distributions can more closely capture the human agreement distribution, and 2) targeting the ambiguity distribution instead of gold-labels on the same amount of data can yield higher performance and better representations for downstream tasks.


Data Sources
SNLI / MNLI. Both SNLI and MNLI provide labels assigned by 5 annotators on some subsets of the data (marked "Original" in Table 1). Examples where no human agreement could be reached (no majority) were given a special label (-1) and are commonly filtered out. Although the precision given by 5 labels is much lower than that of the 100 annotations provided in ChaosNLI, we believe that even a rough and imprecise ambiguity representation is more beneficial than gold-labels alone.
UNLI. UNLI (Chen et al., 2020) presents a subset of SNLI as a regression task, where each example is annotated with a real value in the range [0,1]. Values close to 0 indicate contradiction, and values close to 1 represent entailment. Each entry has one label only, but since it is real-valued, it is also possible to extract a distribution from it. Even though it appears to be a less suitable data source, we still intend to investigate its effectiveness for our purposes.
ChaosNLI. ChaosNLI provides annotations from 100 humans for 3,113 examples in the development sets of SNLI and MNLI. We will call these subsets ChaosSNLI and ChaosMNLI. In order to allow for comparison with the original paper, we use them for testing only.

Creating AmbiNLI
Original SNLI and MNLI data with 5 annotations can be converted to an ambiguity distribution by simply counting the number of annotations for each label and normalizing the counts into probabilities. We make sure to avoid overlap between ChaosNLI and "Original" data by removing the samples used in ChaosNLI from the data we include in AmbiNLI. In the case of UNLI, we have taken only the 55,517 samples from the training set, so there is no overlap with ChaosNLI. We apply a simple linear approach to convert the UNLI regression value p into a probability distribution z_NLI (a plot of the composed conversion function can be found in Appendix A).
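Both conversions can be sketched as follows. The count normalization follows directly from the text; the piecewise-linear shape of the UNLI mapping below (pure contradiction at p = 0, pure neutral at p = 0.5, pure entailment at p = 1) is our assumption, as the exact coefficients are only given by the plot in Appendix A. Label order is [entailment, neutral, contradiction]:

```python
import numpy as np

def counts_to_dist(counts):
    """Normalize 5-annotator label counts [e, n, c] into probabilities."""
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum()

def unli_to_dist(p):
    """Map a UNLI regression score p in [0, 1] to an [e, n, c] distribution.
    Assumed piecewise-linear interpolation: contradiction at p=0,
    neutral at p=0.5, entailment at p=1."""
    if p < 0.5:
        return np.array([0.0, 2 * p, 1 - 2 * p])
    return np.array([2 * p - 1, 2 - 2 * p, 0.0])
```

By construction the output always sums to 1, so it can be used as a soft training target directly.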

Experiments
In our experiments, we use BERT-base (Devlin et al., 2019) with pre-trained weights and a softmax classification head. We use a batch size of 128 and learning rate of 1e-5.
Learning to capture question ambiguity. In our main experiment, we aim to judge whether it is possible to learn how to capture the human agreement distribution. We first obtain a base model in the same manner as Nie et al. (2020), by pretraining it for 3 epochs on the gold-labels of the SNLI and MNLI training sets. We observed that this pre-training step is necessary to provide the model with a general understanding of the NLI task to compensate for the low amount of ambiguity data available. We then finetune the model on our AmbiNLI dataset, setting the training objective to be the minimization of the cross-entropy between the output probability distribution and the target ambiguity distribution. For evaluation, we compute the ChaosNLI divergence scores, measured using the Jensen-Shannon Divergence (JSD), as was done in their original experiments. Furthermore, we explore what effect our ambiguity learning has on accuracy by comparing models trained on exactly the same data but with gold-label training versus ambiguity training. In order to achieve this, we prepare a version of AmbiNLI where we replace the ambiguity distributions with gold-labels. Since the two models have seen the exact same data, performance differences can be directly attributed to the process of capturing ambiguity. We report accuracy on ChaosNLI using their re-computed gold-labels.
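The training objective above is ordinary cross-entropy with a soft target distribution in place of a one-hot label. A minimal NumPy sketch (the toy logits and variable names are illustrative, not from our actual BERT training code):

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def soft_cross_entropy(logits, target_dist, eps=1e-12):
    """H(target, softmax(logits)); reduces to the standard cross-entropy
    loss when target_dist is a one-hot gold label."""
    probs = softmax(np.asarray(logits, dtype=float))
    return float(-np.sum(np.asarray(target_dist) * np.log(probs + eps)))

target = np.array([0.6, 0.3, 0.1])        # ambiguity distribution target
confident = np.array([5.0, 0.0, -2.0])    # logits peaked on label 0
matched = np.log(target + 1e-12)          # logits whose softmax equals target
```

The loss is minimized when the model's output distribution equals the target, at which point it equals the entropy of the target; a model that collapses to a confident one-hot output is penalized even when its argmax is correct.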
Further accuracy analysis. To reinforce our hypothesis that accuracy improvements can be gained by leveraging the extra knowledge that models capture with ambiguity, we run an additional experiment on the ChaosMNLI dataset. We split it into three folds, and perform three-fold cross validation by training the model on two folds and evaluating on the third. Again, we start with our baseline model and compare the gold-label approach against ours.
Performance in different entropy ranges. We also study the model performance in different entropy ranges of the ChaosMNLI set. We bin the evaluation samples based on their entropy value into three equally sized ranges, and compare the models within each range.

Transfer learning. In this last experiment, we aim to compare the usefulness of the representations that the BERT encoder is able to learn when training on ambiguity distributions as opposed to gold-labels. We use UNLI and IMDB movie reviews (Maas et al., 2011) as the two downstream tasks for evaluation. As we want to focus on the representations learned during the ambiguity training phase, we freeze the BERT layers during downstream finetuning and update only the new classification head. We try 1-layer and 2-layer heads using the ELU (Clevert et al., 2016) activation function and a hidden size of 128. We use the original train, development, and test splits for UNLI, and an 80/10/10% split for IMDB movie reviews. We track development set loss and stop after two epochs without improvement. Each experiment is run for 5 trials with different seeds, and the mean and standard deviation are reported for each metric.

Results and Discussion
Training on the ambiguity distribution can reduce divergence scores. Table 2 details the results of our main experiment. Accuracy and JSD are provided for both the SNLI and MNLI sections of ChaosNLI. Due to differences in hyperparameters or random seeds, we were not able to exactly reproduce the base model of Nie et al. (2020), but we achieve similar results. We follow with models further finetuned on different configurations of our AmbiNLI dataset. AmbiSM refers to the data originating from the original 5-label distributions only, while AmbiU refers to the data we obtained from UNLI. AmbiSMU thus refers to the full dataset. For each combination, we also trained a model on gold-labels (marked as "Gold" in the table) for comparison. With the exception of ChaosSNLI when including UNLI data, every experiment yielded a notable divergence score improvement. The AmbiSM model shows a 20.5% and 21.7% JSD decrease on ChaosSNLI and ChaosMNLI respectively. This means that we can learn to capture the human agreement distribution when we use it as a training target.

UNLI's skewed distribution worsens scores. When looking at the AmbiU and AmbiSMU results in Table 2, it becomes apparent that UNLI data is not always beneficial. Specifically, it seems to worsen scores in all metrics except for ChaosMNLI accuracy. The distribution of labels in UNLI is drastically different from that of the remaining data, and we believe that when a model is finetuned on it, this distribution shift has a negative influence. We have found a very large number of samples with labels very close to 0 or 1, which translate into very extreme, non-ambiguous distributions when converted.
To confirm this, we filtered out all UNLI samples with a probability label p < 0.05 or p > 0.97, and ran the "Filtered" experiments. Indeed, in AmbiU, this naive filtering process yields about 20% lower JSD scores and about 5% higher accuracy. We conclude that UNLI data, under the current conversion approach, is somewhat problematic.
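The filter itself is a simple threshold on the raw UNLI score; a sketch, where the sample format with a "p" field is hypothetical:

```python
def filter_unli(samples, low=0.05, high=0.97):
    """Drop UNLI samples whose regression score p is near 0 or 1,
    since those convert into extreme, non-ambiguous distributions."""
    return [s for s in samples if low <= s["p"] <= high]
```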
Training on the ambiguity distribution can yield accuracy improvements. We have found that, for the case of AmbiSM, a model trained to target the ambiguity distribution achieves higher accuracy. This means that more precise knowledge can be acquired when learning the true underlying ambiguity of questions instead of the sometimes misleading gold-label. When using UNLI data (AmbiU and AmbiSMU) however, the results are mixed, as discussed above. Thus, to further strengthen our argument on the benefit of ambiguity data, we refer to the supplementary experiment results in Table 3, where we obtain a 2.1% accuracy improvement when performing three-fold cross-validation on the ChaosMNLI dataset. When performing a qualitative analysis on the predictions of the AmbiSM and AmbiSM Gold models, we found that the former has a stronger tendency towards neutrality, both in the number of correctly predicted neutral labels and in the average neutrality score given. However, it also resulted in some examples now being incorrectly labeled as neutral. It seems to be the case that the neutral label is the main source of ambiguity. Most ambiguous questions have a considerable amount of neutral probability, which likely produces the shift. For more details, including label counts for correct predictions as well as some prediction examples, refer to Appendix B.
Divergence scores are stable. Through the entropy range comparison of Table 4 we learn that divergence scores remain similar across different entropy subsets, showing that the model is capable of recognizing which questions are ambiguous, and of appropriately adjusting the entropy level of its output. Accuracy decreases dramatically in high entropy ranges, but this matches our intuition: both human annotators and models will be uncertain about the correct answer, which leads to mismatches between the model prediction and the assigned label.
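The entropy binning behind this comparison can be sketched as follows; we read "three equally sized ranges" as equal-width bins over [0, log 3], which is an assumption, and the helper names are ours:

```python
import numpy as np

def entropy(dist, eps=1e-12):
    """Shannon entropy of a discrete label distribution (natural log)."""
    dist = np.asarray(dist, dtype=float)
    return float(-np.sum(dist * np.log(dist + eps)))

def bin_by_entropy(dists, n_bins=3):
    """Assign each distribution to one of n_bins equal-width entropy
    ranges between 0 (one-hot) and log(num_labels) (uniform)."""
    ents = np.array([entropy(d) for d in dists])
    edges = np.linspace(0.0, np.log(len(dists[0])), n_bins + 1)
    # digitize against the interior edges yields indices 0..n_bins-1
    return np.clip(np.digitize(ents, edges[1:-1]), 0, n_bins - 1)
```

One-hot human distributions land in the lowest bin and near-uniform ones in the highest, which is where accuracy drops while JSD stays stable.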
Ambiguity models learn better representations for transfer learning. Lastly, in Table 5, we observe a consistent performance improvement in transfer learning to different tasks. From the results we can infer that, by targeting the ambiguity distribution, the model can capture better linguistic representations than by targeting gold-labels. We believe that a similar trend should be visible in other tasks as well, and that the margins of improvement should increase with more ambiguity data to train on.
Is ambiguity training worth the extra labeling cost? One argument against this method is the apparent extra labeling cost required. Indeed, when comparing the gold-label and ambiguity approaches at an equal number of total labels, the gold-label approach would likely attain higher performance due to the difference in number of samples. However, we argue that collecting multiple labels has several benefits besides ambiguity distribution generation. Most importantly, they help avoid mis-labelings and raise the overall quality of the dataset. On many occasions, multiple labels are already being collected for these reasons, but occasionally not released (for example, Bowman et al. (2015) note that additional labels were collected for 10% of the training data). They can also be used in other methods such as item response theory (Lalor et al., 2016). Furthermore, this paper's main intention is not to encourage multi-label collection at the cost of sample quantity, but rather to show the benefits of exploiting the ambiguity distribution if it is available.

Conclusion
We hypothesized that the intrinsic ambiguity present in natural language datasets can be exploited instead of treating it like noise. We used existing data to generate ambiguity distributions for subsets of SNLI, MNLI, and UNLI, and trained new models that are capable of more accurately capturing the ambiguity present in these datasets.
Our results show that it is indeed possible to exploit this ambiguity information, and that for the same amount of data, a model trained to recognize ambiguity shows signs of higher performance in the same task as well as in other downstream tasks. However, our dataset was created using existing resources and lacks in quality and quantity. While it was enough to show that this research direction is promising, it limited the strength of our results. In future work, we wish to obtain larger amounts of data by using crowdsourcing techniques, and expand our scope to other NLP tasks as well.

Figure 1: Linear approach to converting the UNLI regression value into an ambiguity distribution.
A Conversion Function
Figure 1 shows a plot of the linear conversion approach that we have taken to convert UNLI data into a probability distribution.

B Qualitative Analysis
To investigate the prediction differences between an ambiguity-trained model and one trained on gold-labels, we compared AmbiSM and AmbiSM Gold predictions on the ChaosMNLI dataset (see Table 6). We use the new labels obtained from the ChaosNLI majority vote instead of the original MNLI labels. We focus on two situations: 1) when only AmbiSM predicts the label correctly and 2) when only AmbiSM Gold predicts the label correctly. We picked samples from the high entropy regions to observe how the models deal with ambiguity. Generally, AmbiSM has a stronger tendency towards neutrality. However, it was also able to show confidence in some samples that are more clearly entailed or contradicted. On the other hand, we also observe samples that AmbiSM missed due to this tendency, while AmbiSM Gold predicted them correctly.
Furthermore, we show the label counts for the samples that were correctly labeled by only one of the two models in Figure 2. The labels of the samples predicted correctly only by AmbiSM Gold follow the same distribution as the ChaosMNLI dataset as a whole. However, among the samples predicted correctly only by AmbiSM, we find a higher proportion of neutral labels. This suggests that the model trained on ambiguity targets deals better with neutral labels in NLI; neutral labels are likely the biggest source of ambiguity.

Only AmbiSM is correct

Premise: They were in rotation on the ground grabbing their weapons.
Hypothesis: The woman rolled and drew two spears before the horse had rolled and broken the rest.

Premise: Thus, the imbalance in the volume of mail exchanged magnifies the effect of the relatively higher rates in these countries.
Hypothesis: There is an imbalance in ingoing vs outgoing mail.

Table 6: Examples of ChaosMNLI predictions for AmbiSM and AmbiSM Gold. CHAOS is the human distribution, ASM is the distribution predicted by AmbiSM, and ASMG is the distribution predicted by AmbiSM Gold. The labels E, N, and C stand for entailment, neutral, and contradiction, and their probabilities are appended.

Figure 2: Count plot of the labels of the samples correctly predicted by only AmbiSM Gold (left) or only AmbiSM (middle), and the labels of the whole ChaosMNLI set (right). The labels e, n, and c stand for entailment, neutral, and contradiction respectively.