Quantifying Train-Evaluation Overlap with Nearest Neighbors



Introduction
Benchmark datasets in NLP (Rajpurkar et al., 2016; Wang et al., 2018) are invaluable for driving and tracking progress in the field. While evaluating on a held-out set of data ideally tests for generalizability to new data, frequent overlap between training and evaluation sets hinders assessing a model's generalization capacity (Elangovan et al., 2021; Lewis et al., 2021a; Krishna et al., 2021). In this paper, we quantify the overlap between the training and evaluation splits of datasets through a simple metric based on a nearest neighbor approach, and analyze datasets along the axis of dataset collection method.
We categorize data collection methods frequently used in the literature into four categories, based on how naturally the language is captured; some datasets harvest user-generated content (e.g., movie reviews paired with their scores), while language in other datasets is written by crowdworkers to fool existing models (Nie et al., 2020) or synthetically generated from templates (Warstadt et al., 2020).
We analyze the train-evaluation overlap in eleven NLP datasets varying in data collection method on two tasks, classification and extractive question answering, through a nearest neighbors approach. To quantify the overlap between training and evaluation datasets, we identify the nearest train neighbor to each evaluation example using cosine similarity between the input representations. We experiment with two types of representations: general sentence embeddings (Gao et al., 2021) and task-specific embeddings (after task-specific training (Devlin et al., 2019)). Then, we copy the label of the nearest training example to each evaluation example, constructing a simple nearest neighbor baseline model. In nearly every setting, we show that copying labels from the nearest train example alone achieves a competitive baseline, indicating overlap in content between the training and evaluation sets without any task-specific training. We find that naturally-collected datasets exhibit stronger training and evaluation set overlap compared to more synthetic and adversarially-generated data.
We introduce a new metric, named InsSim, which summarizes the distance from each evaluation example to its nearest training examples, indicating the train-evaluation overlap. We use the nearest neighbor classifier and the InsSim score to estimate the difficulty of individual evaluation examples, and suggest splitting evaluation datasets into challenging and easier subsets. Our analysis motivates careful benchmark designs (Koh et al., 2021a) that aim to capture both natural language usage and distributional shifts.

Related Work
Representing a sequence of tokens as a single, fixed-dimensional vector (Reimers and Gurevych, 2019; Arora et al., 2017; Kiros et al., 2015) has been studied extensively. Such an encoder can act as a dense passage retriever (Karpukhin et al., 2020), paired with an efficient similarity search method (Qin et al., 2020).
Two prior studies in question answering (Lewis et al., 2021a; Krishna et al., 2021) look in depth into the overlap between the training and evaluation sets. They identify the most similar training example either by answer string match or by comparing the question embedding constructed for passage retrieval. Follow-up work further develops a QA model (Lewis et al., 2021b) for copying the answer from the nearest training example, after augmenting training examples with generated question-answer pairs. Our study in Section 4.3 extends this setting to a wide range of tasks and different embedding methods. Similar to our work, Elangovan et al. (2021) examine train-test overlap for text classification tasks. They also compute the similarity of each test instance to the entire training set using a similarity function. However, they utilize a bag-of-words approach to represent text (where we use sentence embeddings). In addition, we provide analysis for a broad range of datasets.
Many works have explored whether models simply memorize the training dataset or actually learn the task, thus generalizing to unseen examples. Our nearest-neighbor match classification method resembles ProtoBERT (Tänzer et al., 2022), which shows promising performance on rare classes. That model classifies examples by comparing distance to the centroid of training examples belonging to each class. Our method is simpler, without estimating a probability distribution over the output classes. Tirumala et al. (2022) also study the effect of dataset size and model size on memorization, but look at the dynamics of memorization in language models during training, finding that larger language models tend to memorize data faster, and that certain parts of speech are memorized faster than others.
Other work studies different subsets of datasets and how this can change evaluation. Ethayarajh et al. (2022) study dataset difficulty in terms of the lack of usable information to a particular model V, as well as the difficulty of data subsets using a measure of pointwise V-information for individual data instances. As in our work, Swayamdipta et al. (2020) study the difficulty of individual instances, although they focus on the training rather than the evaluation set. Similarly, Godbole and Jia (2022) propose a method for better evaluation of generalization on more difficult examples (those assigned lower likelihood by a pretrained LM), focused on creating the train-eval split. In our work, we introduce a very simple and generalizable method of splitting examples by whether classification with the nearest training example can succeed.
Recent work (Sakaguchi et al., 2021; Malinin et al., 2021, 2022; Koh et al., 2021b) focuses on modeling distributional shifts in carefully constructed real-world datasets, such as simulating shifts by drawing the training set from one region and the test set from another. This can be one path to mitigating frequent train-evaluation overlap in naturally occurring datasets.

Categorizing Dataset Collection Method
NLP datasets are collected through diverse methods for multiple purposes; some datasets mirror user-facing applications closely (e.g., question answering and machine translation datasets), while other datasets are carefully designed for diagnostic purposes. With the rise of harder-to-interpret, high-capacity models (Brown et al., 2020; Chowdhery et al., 2022), many datasets are designed to probe model qualities. Would different data collection methods yield different levels of train-evaluation overlap? To investigate this, we first categorize the data collection methods of datasets below. We propose a discrete scale of naturalness, from purely synthetic to user-generated, as follows:

• Synthetic (SYN): template-generated or highly-constrained crowd-sourced text. Here, both inputs and outputs are synthetically generated. We note that our definition of synthetic data includes highly-constrained crowd-sourced text, by which we mean that the annotators have limited freedom in the content of their annotations. For example, for the WinoGrande dataset, workers are instructed to choose an anchor word to use in the twin sentences, they are given a range for sentence length, and they are asked to maintain 70% overlap between sentences. This is less natural than what the human might have generated on their own.
We provide examples of the datasets of each type we study here, approximately ordered from the least to most natural datasets.
WinoGrande A crowd-sourced commonsense reasoning benchmark inspired by the Winograd Schema Challenge, in which twin sentences with a small edit distance each have a missing word and two possible options (Sakaguchi et al., 2021).

CSQA 2.0 (Commonsense Question Answering 2.0) A corpus of crowdsourced yes/no commonsense reasoning questions (e.g., "a playing card is capable of cutting soft cheese?") (Talmor et al., 2021).

ANLI (Adversarial NLI) A natural language inference corpus with data collected "adversarially" in three rounds using a human-in-the-loop approach (Nie et al., 2020).

MNLI (Multi-Genre Natural Language Inference) A corpus of crowdsourced sentence pairs with annotations for textual entailment (given a premise and a hypothesis, does the first entail, contradict, or remain neutral to the other). We conduct experiments using both the matched (in-domain) and mismatched (cross-domain) evaluation sets (Williams et al., 2018).

SQuAD 2.0 (Stanford Question Answering Dataset 2.0) A corpus of crowdsourced questions (along with a Wikipedia context) and annotated answer spans. Unlike SQuAD 1.1, not all questions have answers (Rajpurkar et al., 2018).

MRPC (Microsoft Research Paraphrase Corpus) A corpus of sentence pairs extracted from online news sources, where each pair is annotated for whether the sentences are semantically equivalent (Dolan and Brockett, 2005). The sentences were paired based on heuristics (e.g., "two sentences share at least three common words").

NQ (Natural Questions) A corpus of questions from popular Google search queries, paired with a retrieved Wikipedia document and annotated with an answer. We use the simplified MRQA version, which removes unanswerable questions, yes/no questions, and questions without a short answer span, and considers the paragraph containing a short answer span as the context instead of the entire document (Kwiatkowski et al., 2019; Fisch et al., 2019).

TweetEval A corpus of tweets covering multiple classification tasks (Barbieri et al., 2020), though we use the subset of the dataset specific to sentiment analysis. We also pre-process the data to remove examples with the neutral label, making the classification task binary (positive/negative) for out-of-domain evaluation with SST-2.

SST-2 (Stanford Sentiment Treebank) A corpus of movie review sentences with annotations for sentiment (positive/negative) (Socher et al., 2013).

AG News A corpus of news articles from the web, categorized into four topics (business, sci/tech, sports, world) (Zhang et al., 2015).

IMDb (IMDb Review Dataset) A balanced corpus of movie reviews from IMDb with negative (score ≤ 4 out of 10) and positive (score ≥ 7 out of 10) reviews (Maas et al., 2011).

Nearest Neighbor Analysis with Two Types of Encoders
We begin studying overlap with an analysis of nearest neighbor data instances between the train and evaluation datasets. We define the nearest neighbor for each evaluation example x_e in the given training dataset X_train. This depends on the embedding function E(x) and the training dataset X_train. Following prior work (Snell et al., 2017; Tänzer et al., 2022), we define the similarity between two examples x_i and x_j as the cosine similarity between their embeddings, E(x_i) and E(x_j). We describe how to encode each example below.
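Concretely, the nearest-neighbor lookup can be sketched with normalized embeddings and a dot product; the function name and array layout below are our own illustrative choices, not the authors' released code:

```python
import numpy as np

def nearest_neighbor(eval_emb, train_embs):
    """Index and cosine similarity of the nearest training example.

    eval_emb:   (d,) embedding E(x_e) of one evaluation example
    train_embs: (n, d) embeddings E(x_i) of the training set X_train
    """
    # After L2-normalization, dot products equal cosine similarities.
    e = eval_emb / np.linalg.norm(eval_emb)
    t = train_embs / np.linalg.norm(train_embs, axis=1, keepdims=True)
    sims = t @ e                  # (n,) cosine similarities to x_e
    idx = int(np.argmax(sims))    # nearest train neighbor
    return idx, float(sims[idx])
```

For large training sets, the exhaustive argmax would typically be replaced by an efficient approximate similarity search, as in the retrieval work cited above.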

Instance Encoder
We consider two types of encoder E(x) for each data instance x: a general sentence embedding function and an embedding function learned while optimizing for the target task. We study two tasks, classification and extractive question answering (Rajpurkar et al.). Classification tasks map input text x to y from a pre-defined label set Y, and question answering tasks map an input x consisting of a question q and an evidence passage c to an answer string y, which is a span in the evidence passage.
As the output should be entailed from the input, we only pass the input to the instance encoder. We note that such a nearest neighbors approach to studying overlap of the input could be extended to generation tasks such as translation, summarization, or semantic parsing, although we do not examine these in this work.

General Sentence Embedding [E_g] We consider (1) the [CLS] token embedding from a pre-trained language model without fine-tuning, and (2) SimCSE embeddings (Gao et al., 2021), which showed strong performance over various benchmark datasets. Gao et al. (2021) first encode the input sentence with a pretrained language model, take the [CLS] representation to get a fixed-dimensional representation, and improve it with a contrastive learning objective (Chen et al., 2020). Specifically, they construct positive sentence pairs by applying two different standard dropout masks (Gal and Ghahramani, 2016) to the input representation of the same sentence, and construct negative pairs by taking other sentences in the same mini-batch. While we choose these two embeddings for our analysis, other sentence embedding methods (Kiros et al., 2015; Wu et al., 2020) could be used.
Task Specific Learned Embedding [E_t] To construct task-specific embeddings, we first fine-tune a pre-trained language model to perform our target tasks, following standard recipes for fine-tuning pre-trained LMs. Unless otherwise specified, we use the RoBERTa-large model (Liu et al., 2019b).
For classification, we pass the [CLS] representation through a fully-connected layer to predict the correct label from the label set. For extractive QA, we encode the concatenation of question and context tokens and pass the final representations of the context tokens through a fully-connected layer to predict the answer start and end tokens.
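The answer-span decoding step for extractive QA can be illustrated with a small search over start/end scores; the function name `best_span` and the `max_len` cap are illustrative assumptions rather than details taken from the paper:

```python
import numpy as np

def best_span(start_logits, end_logits, max_len=30):
    """Highest-scoring (start, end) token span with start <= end.

    start_logits, end_logits: (seq_len,) per-token scores from the
    fully-connected layer over the context token representations.
    """
    best, best_score = (0, 0), -np.inf
    for s in range(len(start_logits)):
        # Only consider spans up to max_len tokens long.
        for e in range(s, min(s + max_len, len(end_logits))):
            score = start_logits[s] + end_logits[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best
```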

Nearest Neighbor Analysis
We first provide some manual inspection of similar examples. In every dataset, there is more lexical overlap when the nearest neighbor is found using general representations, supporting our qualitative observations.

Classification with the Nearest Neighbor
After identifying the nearest training example for each evaluation example, what can we do with it? Inspired by a recent study in question answering (Lewis et al., 2021a), which copies the answer of the training question that is most similar to the evaluation question (where the evaluation question is a duplicate or paraphrase of the train question), we apply this method widely to all datasets we study to build a non-parametric classification model. This is similar to the ProtoBERT model (Tänzer et al., 2022), which uses a k-nearest neighbor classification algorithm; however, we use the label from the nearest neighbor without constructing an embedding representing each class label. For extractive QA tasks, we use the answer as the label and calculate performance as the exact match to the nearest neighbor's answer. High performance of this baseline indicates greater train-evaluation overlap. Table 2 presents the results for the two embedding types we study, as well as two training data sizes. Here, we look at gold labels, and focus on differences between embedding types and training data sizes. We also report, in parentheses, the difference from the classification performance obtained by taking the farthest training example, along with a random baseline which assigns labels according to the label distribution. We also show the fully fine-tuned RoBERTa-large performance as an upper bound. Fine-tuned performance for all datasets and other models is shown in Appendix B.
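The copy-the-nearest-label baseline can be sketched as follows; the function name and array shapes are our own, and this is a minimal sketch rather than the authors' implementation:

```python
import numpy as np

def nn_copy_accuracy(eval_embs, eval_labels, train_embs, train_labels):
    """Accuracy of the non-parametric baseline that copies, for each
    evaluation example, the label of its nearest training example
    (cosine similarity over L2-normalized embeddings)."""
    t = train_embs / np.linalg.norm(train_embs, axis=1, keepdims=True)
    e = eval_embs / np.linalg.norm(eval_embs, axis=1, keepdims=True)
    nearest = (e @ t.T).argmax(axis=1)             # nearest train index per eval example
    predicted = np.asarray(train_labels)[nearest]  # copied labels
    return float((predicted == np.asarray(eval_labels)).mean())
```

For extractive QA, the same routine applies with answer strings as labels, scoring by exact match.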

How does nearest neighbor classification work with different encoders? Comparing general CLS token embeddings (without fine-tuning) with SimCSE embeddings, we see mixed results: sometimes using SimCSE results in higher performance, sometimes general CLS token embeddings do. However, the gap between nearest-neighbor and farthest-neighbor performance is generally smaller for CLS embeddings without fine-tuning than for SimCSE embeddings, indicating that the nearest semantic neighbor may be more relevant with SimCSE embeddings than with CLS tokens, which follows prior work (Reimers and Gurevych, 2019).
After fine-tuning, copying the label of the nearest neighbor shows strong performance across all datasets except WinoGrande. We attribute this strong performance to the task-specific nature of fine-tuned CLS embeddings (Reimers and Gurevych, 2019); while nearest neighbors may have low semantic similarity, they are close together in terms of task similarity (e.g., examples that require the model to do the same type of reasoning are more similar), leading to high nearest neighbor performance.
How does nearest neighbor classification interact with data collection methods? The nearest neighbor performance roughly corresponds with the degree of naturalness; for all user-generated classification tasks (LAB and USE), copying the label of the nearest neighbor shows competitive performance, even without task-specific fine-tuning. On challenging, synthetically and adversarially generated datasets (WinoGrande and ANLI), however, the nearest neighbor approach shows smaller gains. We hypothesize that this is because researchers can control data diversity and task difficulty in the synthetic setting to make a benchmark more challenging, which cannot be done in the natural case. In addition, higher performance on natural data might signify a closer match with the model's pre-training data. We also note that the correspondence between performance and data collection method could be due to task difficulty and type, as the user-generated datasets tend to be easier for models to learn. Label match to the nearest neighbor is nearly always higher than to the farthest neighbor and performs better than the random baseline, showing that a simple nearest neighbor approach captures the overlap between train and evaluation sets.
How does nearest neighbor classification vary with encoder model power and training data size? Figure 1 shows the nearest neighbor classification performance for label predictions of models of varying power and training data size on selected user-generated and synthetic/crowdsourced datasets. Here we study predicted labels rather than gold labels, and use RoBERTa-large, RoBERTa-base (Liu et al., 2019b), and DistilBERT (Sanh et al., 2020). As fine-tuned CLS embeddings achieve high performance due to task-specific or reasoning similarity, we use SimCSE representations for more general semantic similarity between nearest neighbors. Across all datasets, the nearest neighbor classification appears to be relatively consistent regardless of the size of the encoder model. For more natural datasets (bottom row of Figure 1), we see a large increase in performance when the training data size increases from 10k to the full dataset; this is less consistent for synthetic and crowdsourced datasets (top row of Figure 1). This could indicate that for more natural datasets, or easier tasks, a larger amount of data leads to a higher comparative overlap, but this is not necessarily the case with synthetic and crowdsourced data.

We inspect examples where nearest neighbor classification fails, and manually categorize them into three types:

• not similar: failure at general semantic similarity
• mismatch: semantic/task similarity mismatch
• ambiguous: the label for either the evaluation or train example is ambiguous (or incorrect)

We note that the first two categories, not similar and mismatch, are failures of the nearest neighbors approach, while the last category, ambiguous, concerns the dataset itself. Table 12 in Appendix E provides examples. We show the percentage of annotated examples in each category for each dataset in Figure 2. The majority of manually annotated examples were ambiguous, which is a possible reason why the model performs worse on instances without label match.
How does nearest neighbor classification perform under domain shift? We analyze distribution shifts on two classification tasks, sentiment classification and natural language inference. We report the classification results from copying the nearest neighbor in the training set (parallel to Section 4.3) in Table 3. We find that the most similar example in the train set is less likely to have the same label as the evaluation example when the evaluation example is taken from a different distribution. Yet, the nearest neighbor classification almost always outperforms the baseline, sometimes strongly.

Quantifying Overlap with Instance Similarity
In this section, we introduce a new metric, Instance Similarity (InsSim), and use it to identify easy and challenging instances in the evaluation dataset.
Defining InsSim We define a metric, InsSim(x_e), for each individual evaluation example x_e based on its nearest neighbors in the provided training dataset. We denote by topN(x_e, X_train, k) the set of k nearest examples to x_e in the total training dataset X_train, according to the similarity function described in Section 4.

$$\mathrm{InsSim}(x_e) = \frac{1}{k} \sum_{x_i \in \mathrm{topN}(x_e, X_{\mathrm{train}}, k)} \mathrm{sim}(x_e, x_i)$$

We conduct our analysis with a default setting of k = 5.
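A direct reading of this formula, assuming precomputed embeddings as above (the function name is ours, not the paper's):

```python
import numpy as np

def ins_sim(eval_emb, train_embs, k=5):
    """InsSim(x_e): mean cosine similarity between x_e and its k nearest
    training examples (k = 5 by default, as in the paper)."""
    e = eval_emb / np.linalg.norm(eval_emb)
    t = train_embs / np.linalg.norm(train_embs, axis=1, keepdims=True)
    sims = t @ e               # cosine similarity to every training example
    topk = np.sort(sims)[-k:]  # the k largest similarities: topN(x_e, X_train, k)
    return float(topk.mean())
```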
Interpreting InsSim The higher InsSim(x_e), the easier it is for a machine learning model to estimate P(y_e|x_e), provided the label of the example matches those of its nearest train neighbors (we study this further in this section). An alternative metric would be to estimate the input data distribution and evaluate the likelihood of x_e under this distribution. While P(x) estimates how likely x_e is with respect to the entire training set X_train, InsSim only considers the k closest elements in the training dataset. Given the strong few-shot learning ability of recent pre-trained models (Liu et al., 2019b; Brown et al., 2020), we anticipate this metric can more effectively capture the predicted performance on example x_e.
We report the average InsSim score on each dataset in Table 4. A higher score implies heavier train-evaluation dataset overlap. Using task-specific embeddings brings examples significantly closer together across all datasets. The number of total training instances varies significantly across datasets (see Table 6 in Appendix A), so larger datasets tend to exhibit higher InsSim. We find that the average InsSim tends to be higher for tasks that are more naturally generated, indicating less data diversity between training and evaluation sets. Our metric is coarse in that it does not specify whether the similarity between instances is caused by lexical or topical overlap (e.g., containing the same entity) or syntactic overlap (e.g., similar sentence structure).
To better evaluate model generalization, we propose to divide evaluation examples into two subsets: (1) MATCH, examples where the evaluation label equals the label of the nearest gold train example, and (2) MISMATCH, examples where it does not. We use general sentence embeddings (SimCSE) as the representations for better generalizability. We hypothesize that the MATCH subset is easier for models.
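This split can be computed with the same nearest-neighbor machinery; the following is a minimal sketch with our own function name and array conventions:

```python
import numpy as np

def match_mismatch_split(eval_embs, eval_labels, train_embs, train_labels):
    """Indices of MATCH (eval gold label equals the nearest train gold
    label) and MISMATCH (the rest) evaluation examples."""
    t = train_embs / np.linalg.norm(train_embs, axis=1, keepdims=True)
    e = eval_embs / np.linalg.norm(eval_embs, axis=1, keepdims=True)
    nearest = (e @ t.T).argmax(axis=1)  # nearest train index per eval example
    same = np.asarray(train_labels)[nearest] == np.asarray(eval_labels)
    return np.flatnonzero(same), np.flatnonzero(~same)
```

Model performance can then be reported separately on the two returned index sets.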
How does model performance differ between the MATCH and MISMATCH subsets? We show RoBERTa-large performance on each of these subsets, along with the difference between them, in Table 5. As expected, performance is generally higher when labels match, confirming our hypothesis. However, this is not the case for WinoGrande. We conjecture this is because semantic similarity is not as relevant to the WinoGrande reasoning task. This is further supported by the large gap between performance on the two subsets for the AG News dataset, for which semantic similarity is more strongly relevant. In addition, Table 5 shows the percent of total examples in the MISMATCH subset; we see that overall performance on the dataset loosely inversely correlates with the proportion of MISMATCH examples, further illustrating that these examples are more difficult.
Can we use the InsSim score to identify difficult evaluation examples? We further split our MATCH and MISMATCH data subsets by their InsSim score: we report the performance breakdown on the highest and lowest 30% of the data sorted by InsSim. RoBERTa-large performance on these sets is also shown in Table 5. Our results indicate that a higher InsSim leads to higher performance on examples where the evaluation label matches the nearest train example's label, but not necessarily when they do not match. In challenging datasets (WinoGrande, ANLI, MNLI, and MRPC), when the label of the evaluation example does not match the label of the nearest training example, being closer to the nearest neighbor actually hurts model performance, suggesting over-generalization from the nearest training example. These results emphasize that, in addition to evaluating model performance on a full dataset, it could be useful to evaluate models on these subsets individually to better assess model generalization; performance can be significantly different on more challenging subsets. We will publicly release our code for splitting datasets into MATCH and MISMATCH subsets at https://github.com/GauriKambhatla/train_eval_overlap.

Conclusion
In this paper, we analyze eleven downstream NLP datasets for train-evaluation overlap using a nearest neighbors approach, quantified with a simple measure of instance similarity. We categorize datasets according to their data collection method, and find that more naturally-collected data and easier tasks tend to demonstrate higher train-eval overlap than more synthetically-generated data and difficult tasks. Lastly, we suggest using nearest neighbor analysis to split the evaluation data into easier and more challenging subsets, determined by the overlap with the training set, and advocate studying model performance on these subsets as well as the full dataset for a more comprehensive evaluation of model generalizability.

Limitations
Our study is limited in scope, covering only classification and extractive QA tasks in English; the trends we highlight in this work might not generalize to different tasks or other languages. We also acknowledge that we only use BERT-based models for our analysis, so it is uncertain whether these findings are applicable to other models. In addition, the overlap we describe in this paper is defined by semantic similarity rather than literal overlap between sentences and phrases. We are not claiming that this overlap is good or bad; rather, we show that when the overlap is large, it is more difficult to evaluate model generalization.
We note that there are multiple confounding factors in our results. First, while we highlight the role of dataset collection method in our analysis, the naturalness of the data collection method is negatively correlated with task difficulty (i.e., the more natural datasets we study are also the least difficult). As a result, differences in performance can be attributed to task difficulty as well as data collection method. Second, our study is limited in its scope of similarity metrics (only cosine similarity) and embeddings used to compute similarity. Using a different embedding or metric may change the results.

A Dataset Statistics
We provide additional statistics about the datasets we studied, including licensing and data split sizes (Table 6).

B Model Performance & Compute
Here we list the fully fine-tuned model performance for each model on each validation dataset for varying amounts of training data. DistilBERT (66M parameters) performance is listed in Table 10, RoBERTa-base (123M parameters) performance in Table 9, and RoBERTa-large (354M parameters) performance in Table 8. We take the average of three runs to obtain the numbers listed in these tables. We run all experiments on RTX 8000 GPUs.

D Additional Nearest Instance Examples
Table 11 shows additional examples of nearest neighbors for the datasets not shown in Table 1.

Figure 1 :
Figure 1: Nearest neighbor classification performance (%) between the prediction on the evaluation example and its nearest train example for trained models on selected datasets. The x-axis shows the model (DistilBERT, RoBERTa-base, or RoBERTa-large) and the y-axis shows the size of the training data, from 500 examples to the full dataset.

Figure 2 :
Figure 2: Distribution of label-mismatch examples (%) for selected datasets, from more synthetic (top) to more natural (bottom).

Table 1
We show evaluation examples and their nearest train neighbors with both general and task-specific representations in Table 1. We provide examples for additional datasets in Appendix D (see Table 11 for qualitative examples) and quantitative unigram overlap in Appendix A (Table 7).

Table 2 :
Nearest neighbor classification results obtained by copying the gold label from the nearest training example with different embedding methods. 500 and FULL represent the size of the training data set. The number in parentheses represents the gap from copying the label of the farthest training example. Random performance and the RoBERTa-large fine-tuned performance are shown for lower- and upper-bound comparisons.

Table 3 :
Nearest neighbor classification results under domain shift. The E_g (CLS) embeddings are the token embeddings from the pre-trained LM without fine-tuning; the E_t embeddings are from after fine-tuning on the full training data set. The number in parentheses represents the gap from copying the label of the farthest training example. The in-domain performance values are presented for comparison.

Table 4 :
Average InsSim score of the evaluation subset on each dataset. The first column is computed against 1K randomly sampled training examples, the second column against the full training portion of each dataset.

Table 5 :
RoBERTa-large performance on MATCH (gold eval label is equal to the nearest gold train label) and MISMATCH (the rest) subsets of the full evaluation data. We use SimCSE embeddings for similarity. Performance on the eval examples with the highest (High) or lowest (Low) 30% of InsSim is shown, with bolded values indicating whether performance is higher on the low or high subset. We compare to performance on the full MISMATCH or MATCH subsets in the All column (MM or M, respectively). The difference between MATCH and MISMATCH values is shown in the ∆ column.

Table 6 :
Dataset statistics. For Natural Questions (NQ), we use the MRQA subset, and for TweetEval, we use the sentiment split, with neutral-label examples filtered out. The WinoGrande and CSQA 2.0 datasets are licensed under CC-BY, ANLI under Creative Commons Non-Commercial 4.0, and MNLI, the TweetEval sentiment task, and NQ (MRQA version) under MIT. All the datasets we study are in English.

Table 7 :
Average unigram overlap between evaluation examples and their nearest train examples according to different representation types. General embeddings are denoted E_g and task-specific embeddings E_t.
MRPC
Phrase 1: Saddam's other son, Odai, surrendered Friday, but the Americans are keeping it quiet because he's a U.S. agent.
Phrase 2: Hussein's other son, Uday, surrendered yesterday, but the Americans are keeping it quiet because he's a US agent.
Phrase 1: The only other JI member to reveal similar information is Omar al Faruq, now held at a secret location by the United States.
Phrase 2: The only other JI member to reveal similar information is Omar al Faruq, now held by the United States at a secret location.
Phrase 1: Initial reports said the attackers fired from a mosque within the city, 30 miles west of Baghdad.
Phrase 2: The Centcom statement said the gunmen appeared to have fired from a mosque in the city, 50 km (32 miles) west of Baghdad.

IMDb
Haines is excellent as the brash cadet who thinks West Point will really amount to something now that he has arrived. Haines displays his easy, goofy comic persona as he takes on West Point and Joan Crawford, the local beauty...
One of the biggest hits of 1926, Brown of Harvard is an exciting comedy/drama featuring regatta and football scenes that gave William Haines the role he needed to become a major star. It's patented Haines all the way: brash smart aleck who takes nothing serious until he is rejected by everyone...
As Jack Nicholson's directorial debut, Drive, He Said displays at the least that he is a gifted director of actors. Even when the story might seem to lose its way to the audience (and to a modern audience, if they can find it, which pops up now and again on eBay, it might seem more free-formed than they think)...

Table 11 :
Examples of the most similar instances for the evaluation example according to two embedding methods.

MNLI (mismatch)
Eval: P: uh i don't know i i have mixed emotions about him uh sometimes i like him but at the same times i love to see somebody beat him / H: I like him for the most part, but would still enjoy seeing someone beat him.
Train: P: You can imagine what a thorn in the flesh I am to him! / H: You can imagine how much he is bothered by me, even though I treat him well
Labels: Eval: Entail; Train: Neutral

WinoGrande (ambiguous)
Eval: Randy only ever added a little bit of hot sauce to his food, especially compared to Adam, as _ was much more sensitive to spice.
Train: Randy found it easier to be healthy than Derrick because _ did not eat a wide variety of fruits and vegetables. (Derrick)

AG News
Eval: Intel Doubles Dividend, Boosts Buyback by $11.5 Bln (Update2) Intel Corp., the world's biggest computer-chip maker, doubled its quarterly dividend and boosted its stock buyback program by $11.
Train: Intel Doubles Dividend, Expands Buyback Chip giant Intel Corp. reported Wednesday that its board doubled the company's quarterly dividend and authorized an expansion of its ongoing stock repurchase program.

Table 12 :
Examples of label-mismatched eval and nearest train examples for each category.