Measuring Context-Word Biases in Lexical Semantic Datasets

State-of-the-art pretrained contextualized models (PCMs), e.g., BERT, use tasks such as WiC and WSD to evaluate their word-in-context representations. This inherently assumes that performance in these tasks reflects how well a model represents the coupled word and context semantics. We question this assumption by presenting the first quantitative analysis of the context-word interaction being tested in major contextual lexical semantic tasks. To achieve this, we run probing baselines on masked input, and propose measures to calculate and visualize the degree of context or word biases in existing datasets. We perform the analysis on both models and humans. Our findings demonstrate that models are usually not being tested for word-in-context semantics in the same way as humans are in these tasks, which helps us better understand the model-human gap. Specifically, for PCMs, most existing datasets fall at the extreme ends (the retrieval-based tasks exhibit strong target word bias while WiC-style tasks and WSD show strong context bias); in comparison, humans are less biased and achieve much better performance when both word and context are available than with masked input. We recommend our framework for understanding and controlling these biases for model interpretation and future task design.


Introduction
Meaning contextualization (i.e., identifying the correct meaning of a target word in linguistic context) is essential for understanding natural language, and has been the focus of many lexical semantic tasks. Pretrained contextualized models (PCMs) have brought large improvements in these tasks, including WSD (Hadiwinoto et al., 2019; Loureiro and Jorge, 2019; Huang et al., 2019; Blevins and Zettlemoyer, 2020), WiC (Pilehvar and Camacho-Collados, 2019; Garí Soler et al., 2019) and entity linking (EL) (Wu et al., 2020; Broscheit, 2019).
These superior performances have been taken as proof that PCMs can successfully model word-in-context semantics. Many studies have investigated the process of lexical contextualization in these PCMs. Specifically, Vulić et al. (2020) and Aina et al. (2019) found that language models 'contextualize' words in higher layers while type-level information is better kept in lower layers. Voita et al. (2019) point out that different learning objectives affect the contextualization process, and Garí Soler and Apidianaki (2021) and Pimentel et al. (2020) show that PCMs can capture words' ambiguity levels.
While most of these studies have focused on probing the inner workings of the PCM feature space, there is no systematic study to quantify the word-context interaction (either learned by PCMs or intrinsic) across different lexical semantic tasks. On one hand, these datasets often vary in their emphasis on context vs. target words. For example, we could expect tasks such as WSD and WiC to rely more on context by design, as the target words are either given or the same in each input pair. On the other hand, models may find shortcuts in datasets to avoid learning the complex word-context interaction. What is missing in the current literature is an accurate quantification of the word-context interplay being tested in each task so that we can fully understand task goals and model performance. In particular, we need to flag the situation where a model can solve a task by relying solely on context or on the target words. Such heavy word or context reliance hinders a scientific assessment of a model's meaning contextualization abilities, as it essentially bypasses the key word-context interaction challenge in human understanding of lexical semantics. We therefore refer to such heavy reliance on target words or context in a contextual lexical semantic dataset as target word bias or context bias.
This study presents an analysis framework to quantify this context-word interaction by measuring context and target word biases across lexical semantic tasks. We first run controlled probing baselines by masking the input to show the context or the target word alone. Based on the model's performance on these probing baselines, we calculate two ratios that reflect how much of the model performance on a dataset can be achieved by simply relying on context alone or the target word alone, i.e., the degree of context or target word bias (see Figure 1, discussed fully in Section 3). The design of the probing baselines follows previous studies that applied input perturbation techniques for model and task analysis in GLUE (Pham et al., 2020), NLI (Poliak et al., 2018; Wang et al., 2018; Talman et al., 2021) and relation extraction (Peng et al., 2020). While previous probing studies usually assume, without human verification, that corrupted input carries no meaningful information, we provide a fairer comparison with model performance by collecting human judgments on the same masked input in four tasks. Such comparison reveals whether the biases are learned by models from the datasets or are inherent in the tasks.
Our key findings are: (1) the tasks can be clearly divided into target-word-biased (the retrieval-based tasks) and context-biased (WiC-style tasks and WSD). Among the retrieval-based tasks, domain affects the ambiguity level and thus the target word bias: models even achieve the best performance using target words alone in the medical domain. (2) AM2iCo and Sense Retrieval show less extreme model biases and challenge a model more to represent both the context and the target words. (3) A similar trend of biases exists in humans but is much less extreme, as humans find semantic judgment more difficult on masked input and require both word and context to do well in each task. This analysis helps us better understand the nuanced differences between models and humans in existing tasks, and we recommend applying the framework when designing new datasets to check whether word and context are both required and whether models rely on the coupled word and context semantics in a similar way to humans.

Task Selection
We examine the following contextual lexical semantic tasks; for illustration, we list example data for each task in Appendix A.

Word Sense Disambiguation (WSD). WSD (Navigli, 2009; Raganato et al., 2017) requires a model to assign sense labels to target words in context from a set of possible candidates for the target words. Following standard practice, we use SemCor (sense-annotated texts created from the Brown Corpus) as the train set and Semeval2007 as the dev set, and report accuracy on the concatenated ALL test set.

The WiC-style Tasks (WiC, WiC-TSV, MCL-WiC and XL-WiC). To alleviate WSD's requirement for a sense inventory, WiC (Pilehvar and Camacho-Collados, 2019) presents a pairwise classification task where each pair consists of two word-in-context instances. The model needs to judge whether the target words in a pair have the same contextual meanings. WiC-TSV (Breit et al., 2021) extends the WiC framework to multiple domains and settings. This study adopts the combined setting where each input consists of a word-in-context instance paired with a definition and a hypernym, and the task is to judge whether the sense intended by the target word in context matches the one described by the definition and is a hyponym of the hypernym. The WiC-style tasks have also been extended to multilingual and crosslingual settings in MCL-WiC (Martelli et al., 2021), XL-WiC (Raganato et al., 2020) and, more recently, AM2iCo (Liu et al., 2021). MCL-WiC provides test sets for five languages with full gold annotation scores; however, it only covers training data in English. To ensure the analysis tests the same data distribution during both training and testing, we only use the English dataset of MCL-WiC. XL-WiC extends WiC to 12 languages. We perform analysis on the German dataset in XL-WiC (50k train and 20k test examples), as it is the only language with sufficient train data and human validation performance. AM2iCo covers 14 datasets, each of which pairs English word-in-context instances with word-in-context instances in a target language. In this study, we perform analysis on the English-Chinese dataset, which contains 13k train and 1k test examples.
Sense Retrieval (SR). With the same train and test data as WSD, SR (Loureiro and Jorge, 2019) requires a model to retrieve a correct entry from the full sense inventory of WordNet (Miller, 1998).
AIDA and Wikification. An important application scenario for testing meaning contextualization is Entity Linking (EL). EL maps a mention (an entity in its context) to a knowledge base (KB), which is usually Wikipedia in the general domain. The target word and its context help resolve name variations and lexical ambiguity, which are the main challenges in EL (Shen et al., 2014). In addition, the context itself can help learn better representations for rare or new entities (Schick and Schütze, 2019; Ji et al., 2017). We test on two popular Wikipedia-based EL benchmarks: AIDA (Hoffart et al., 2011) and Wikification (Wiki) (Ratinov et al., 2011; Bunescu and Paşca, 2006). AIDA provides manual annotations of entities with Wikipedia and YAGO2 labels for 946, 216 and 231 articles as train, dev and test sets respectively. The Wiki dataset is based on the hyperlinks from Wikipedia. We randomly sampled 50k sentences from Wikipedia as the test set and another 50k as the dev set; the rest is used for training. For both AIDA and Wiki, the search space is the full Wikipedia entity list.
WikiMed and COMETA. To test domain effects, we evaluate on two medical EL tasks. We use the WikiMed corpus (Vashishth et al., 2020), an automatically extracted medical subset of Wikipedia, for medical wikification. Each mention is mapped to a Wikipedia page linked to a concept in UMLS (Bodenreider, 2004), a massive medical concept KB. We define the search space as the Wikipedia entities covered in UMLS. With the same Wikipedia ontology but a different domain subset, WikiMed can be directly compared with Wiki for assessing domain influence. We also test on COMETA (Basaldella et al., 2020), a medical EL task in social media. COMETA consists of 20k English biomedical entity mentions from online posts on Reddit. The expert-annotated labels are linked to SNOMED CT (Donnelly et al., 2006), another widely-used medical KB.
We report accuracy for WSD and all the WiC-style tasks, and accuracy@1 for the retrieval-based tasks including Wiki, AIDA, etc., as sketched below.
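For concreteness, a minimal sketch of the two metrics follows; the input formats (a flat label list for classification, a ranked candidate list per query for retrieval) and the function names are our own illustrative assumptions.

```python
# A minimal sketch of the evaluation metrics used in this study.
# Input formats are illustrative assumptions, not any dataset's schema.

def accuracy(predicted_labels, gold_labels):
    """Plain accuracy for WSD and the WiC-style tasks."""
    correct = sum(p == g for p, g in zip(predicted_labels, gold_labels))
    return correct / len(gold_labels)

def accuracy_at_1(ranked_candidates, gold_labels):
    """accuracy@1 for retrieval tasks: is the top-ranked candidate gold?"""
    correct = sum(r[0] == g for r, g in zip(ranked_candidates, gold_labels))
    return correct / len(gold_labels)

print(accuracy(["T", "F"], ["T", "T"]))                       # 0.5
print(accuracy_at_1([["Paris", "Paris, Texas"]], ["Paris"]))  # 1.0
```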

Probing Baselines
Context vs. Word: For the main experiment, we design the WORD baseline, where we input only the target word to the model, and the CONTEXT baseline, where the target word is replaced with a [MASK] token in the input. The model is then trained and tested on the perturbed input. High performance in CONTEXT or WORD indicates strong context or target word bias. Example baseline input is shown in Table 1.

Lower Bound: Apart from a RANDOM baseline, we also set up a LABEL baseline where all the input is masked and the learning is only from the label distribution in the task. Notice that training the LABEL baseline is preferable to simply counting label occurrences in the data, as the former can work with both continuous and categorical label spaces. All the probing baselines are compared with model performance on the full input (FULL). We refer to model M's performance in WORD, CONTEXT, LABEL and FULL as M_W, M_C, M_L and M_Full respectively.

Human Evaluation: To measure the inherent task biases, we collect human judgments (HUM) for a subset of tasks (WiC, XL-WiC, AM2iCo and AIDA) chosen as representative of the tasks described in Section 2.1 and feasible given our annotation resources. WiC, XL-WiC and AM2iCo cover WiC-style datasets in different languages; AIDA is chosen as a representative retrieval-based task. We follow the quality control procedures in Pilehvar and Camacho-Collados (2019) and Liu et al. (2021) to recruit two different annotators for each baseline input from CONTEXT and WORD and for the FULL input in each task. The annotators are recruited from Prolific; they have graduate degrees and are fluent or native in the language of the dataset. In each setup, an annotator is assigned 100 randomly sampled examples from the test set of each task, with a 50-example overlap between the two annotators for agreement calculation. The annotators are asked to perform meaning judgment in WiC, XL-WiC and AM2iCo, and to find the corresponding Wikipedia pages for entities in AIDA. For CONTEXT input, where the target words are masked, annotators are encouraged to first guess what the target words could be. For WORD input, annotators are asked to think of the most representative meaning of the out-of-context words when performing the tasks. As the pairs of input are always the same word by design in WiC and XL-WiC, we assume humans will give the True judgment for all the examples and will therefore score 0.5 on WORD input in WiC and XL-WiC. As for humans' LABEL baseline performance, while humans are not given any prior indication of how the task labels are distributed, it is reasonable to expect that an annotator will choose randomly between the available labels or stick with one label when there is no input. We therefore approximate the human LABEL baseline as 0.5 for WiC, XL-WiC and AM2iCo, and 0 for AIDA.
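To make the perturbation concrete, below is a minimal sketch of constructing the three input variants from a single word-in-context instance. The instance format (a sentence plus character offsets for the target word) and the function name are illustrative assumptions, not the actual schema of any of the datasets.

```python
# A minimal sketch of constructing the probing-baseline inputs from one
# word-in-context instance. Field names and offsets are illustrative.

MASK = "[MASK]"  # BERT's mask token; other PCMs use different strings

def make_baseline_inputs(sentence: str, start: int, end: int):
    """Return FULL, CONTEXT and WORD variants of one instance.

    start/end are character offsets of the target word in `sentence`.
    """
    target = sentence[start:end]
    full = sentence                                      # FULL: unperturbed
    context = sentence[:start] + MASK + sentence[end:]   # CONTEXT: target masked
    word = target                                        # WORD: target word alone
    return full, context, word

full, context, word = make_baseline_inputs(
    "There are many breeds of dogs.", 15, 21)
print(context)  # There are many [MASK] of dogs.
print(word)     # breeds
```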

Calculating the Bias Measures
Based on a model M's performance on the full input and on the baseline input, we propose Bias^M_C and Bias^M_W to measure the model's context and target word biases in a dataset:

Bias^M_C = (M_C - M_L) / (M_Full - M_L)    (1)

Bias^M_W = (M_W - M_L) / (M_Full - M_L)    (2)

Bias^M_C is the ratio of M_C to M_Full with the LABEL performance M_L deducted from both. M_L has to be deducted as it is unrelated to the input; otherwise, the ratio would give an inflated bias measurement. Bias^M_W is calculated in the same way, with M_W in place of M_C. The two measures can also be seen as M_C and M_W under min-max normalization, where the min value is M_L and the max value is M_Full, so the normalized values can be fairly compared across datasets. Bias^M_C and Bias^M_W reflect how much of what a model has learned from the input in a dataset can be achieved from context alone or the target word alone, giving us indicators of the degree of context and target word biases in the dataset. These bias indicators in turn tell us how important the masked part of the input is. For example, a Bias^M_C of 0.9 means that 90% of what the model has learned from the full input can be achieved from the context alone. The remaining 10% is gained by adding the masked target word, and since this gap is small under a high context bias, we can conclude that the model can do well from the context alone and is not learning much from the target word.
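The measures are straightforward to compute once the four scores are available. The following sketch transcribes Equations (1) and (2) directly; the numbers in the usage example are illustrative and not taken from our result tables.

```python
# A direct transcription of Equations (1) and (2): min-max-normalised
# baseline scores, with LABEL performance as the floor and FULL
# performance as the ceiling.

def bias(baseline: float, label: float, full: float) -> float:
    """Bias ratio for one baseline score (CONTEXT or WORD)."""
    return (baseline - label) / (full - label)

# Illustrative numbers only (not values from the paper's tables):
m_full, m_label = 0.70, 0.50
print(f"context bias: {bias(0.68, m_label, m_full):.2f}")  # 0.90
print(f"word bias:    {bias(0.52, m_label, m_full):.2f}")  # 0.10
```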
Like models, humans can also be biased, as they can use their prior knowledge (e.g., humans can guess the typical meaning of a word without knowing the context) to make predictions based on partial input (Gardner et al., 2021). Measuring how well humans can perform on the baseline input helps us understand the biases inherent in a task. We therefore calculate the context and target word bias scores for humans in the same way.

Experiment setup
The underlying model for our main experiments is BERT (Devlin et al., 2019), the most popular PCM, which offers dynamic contextual word representations as bidirectional hidden layers from a transformer architecture. To ensure the general trend of our findings is consistent across different models, we also performed the analysis using ROBERTA (Liu et al., 2019), which improves upon BERT through optimized design decisions during training.
We adopt standard model finetuning setups in each task. We use the base uncased variant of BERT for general domain experiments and PUBMEDBERT (Gu et al., 2020) for the medical tasks. For WSD, we use GLOSSBERT (Huang et al., 2019), which learns a sentence-gloss pair classification model based on BERT. For the WiC-style tasks, we follow the SuperGLUE (Wang et al., 2019) practice of concatenating BERT's last-layer [CLS] and target word token representations for each input pair, followed by a linear classifier. For the retrieval-based tasks, including SR and EL, we adopt a bi-encoder architecture to model queries and target candidates with BERT (Wu et al., 2020). For the query, we insert [ and ] to mark the start and end positions of the target word in context. Each target candidate is reformatted as "[CLS]Name || Description[SEP]". Name is an entity title (EL) or synset lemmas from WordNet (SR). Description is the first sentence of an entity's Wikipedia page (Wiki & WikiMed), a gloss (SR), or n/a (COMETA). The model learns to draw the true query-target pairs' representations closer using a triplet loss with triplet miners during finetuning (Liu et al., 2020). For each experiment, we perform a grid search for the learning rate in [1e-5, 2e-5, 3e-5] and select models with early stopping on the dev set. We also run all the models with three random seeds and select the models with the best performance on the dev set. The performance across random seeds is stable, as shown by the small standard deviations reported in Table 7 in the appendix. All PCM configurations are listed in Appendix D; results with ROBERTA are reported in Appendix E.
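As an illustration of the retrieval input formatting described above, a hedged sketch follows. The example strings and the exact whitespace around the bracket markers are our own assumptions, and [CLS]/[SEP] are left to the tokenizer.

```python
# A sketch of the bi-encoder input formatting described in the text.
# Example values and bracket spacing are assumptions.

def format_query(sentence: str, start: int, end: int) -> str:
    """Mark the target word span with [ and ] in the query sentence."""
    return sentence[:start] + "[ " + sentence[start:end] + " ]" + sentence[end:]

def format_candidate(name: str, description: str = "") -> str:
    """Render one retrieval candidate as 'Name || Description'.

    [CLS]/[SEP] are added later by the tokenizer, so they are omitted here.
    """
    return f"{name} || {description}" if description else name

print(format_query("Miltoniopsis is a genus of orchids.", 0, 12))
# [ Miltoniopsis ] is a genus of orchids.
print(format_candidate("Miltoniopsis",
                       "Miltoniopsis is an orchid genus native to South America."))
```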

Main Results and Discussion
We report BERT's baseline performance in Figure 2, based on which we calculate Bias^BERT_C and Bias^BERT_W for each dataset and plot the results (black dots) in Figure 1 (we also report ROBERTA biases in Appendix E and find a similar trend). For comparison, we plot human baseline performance and biases alongside the model performance in each figure.

Model biases
Models can learn extreme context or target word biases from the datasets. One obvious observation from Figure 1 is that, probed with BERT, most of the datasets lie close to the dashed red lines: tasks such as WiC and MCL-WiC lie towards the right, close to the vertical red line indicating 1.0 context bias; the retrieval-based tasks such as WikiMed and Wiki lie towards the top, close to or even above the horizontal red line indicating 1.0 target word bias. This pattern indicates that BERT can score highly on these datasets by relying only on the target words or only on the context. In other words, context or target words can be largely ignored when the model learns to solve the tasks. It is therefore questionable how much word-context interaction, which requires the modeling of both word and context representations, is actually learned by BERT when applied to these tasks.
Moreover, the datasets tend to concentrate in two corners. That is, models usually learn strong bias from either context or the target word: the retrieval-based datasets (e.g., Wiki) lie in the top left corner, showing large target word bias and low context bias; the WiC-style datasets and WSD lie in the bottom right corner with large context bias and low target word bias. XL-WiC is an exception, as it contains both strong context and target word biases. We will come back to this later in Section 3.2, where we compare model and human performance.

Figure 2: BERT and human performance on probing baselines across popular context-aware lexical semantic tasks. For the retrieval-based tasks, we report @1 accuracy, and the LABEL and RANDOM baselines are not visible as they are close to 0.
AM2iCo and SR are closest to testing word-context interaction in models. Few existing datasets in effect require the modeling of the context-word interaction, which should result in both low context and low target word biases. SR and AM2iCo can be seen as two such datasets; in Figure 1, they can be found further inside the red lines, towards the bias-free bottom left corner. This is because these two tasks are designed to require balanced attention over context and target words. In SR, a system needs to model the target words in order to retrieve all the possible senses associated with the word, and because there is plenty of ambiguity in the dataset, context is also crucial to identify the correct sense. AM2iCo was specifically designed to include adversarial examples that penalize models relying only on the context, and it therefore elicits the lowest context bias from models among the WiC-style tasks. As such, SR and AM2iCo are the closest tasks we have to testing word-context interaction.
Domains affect lexical ambiguity and the target word bias.
The retrieval-based tasks in this study offer comparable settings for assessing domain effects; we report each task's sense entropy (our measure of lexical ambiguity) alongside its target word bias in Table 2. Confirming our hypothesis, sense entropy in a task does roughly correlate with the model's target word bias: the medical domain tasks (WikiMed and COMETA) contain the lowest lexical ambiguity, as reflected by the lowest sense entropy, and therefore missing context in these two tasks does not bring much negative impact on model performance, resulting in the highest target word biases; whereas higher sense entropy and thus higher lexical ambiguity (e.g., Wiki and then SR) necessarily requires context alongside the target word, which leads to lower target word biases.
Context can harm model performance in medical EL. We notice that the model's target word bias in COMETA and WikiMed can go beyond 1.0, indicating that model learning is dominated entirely by the target words, with the context being useless or even harmful. This comes as a surprise, as medical EL has been treated as a contextual lexical semantic task where context is usually provided in the hope of higher modeling accuracy. We examined the errors from FULL as compared with WORD, and found that the model tends to get distracted by related context words. Table 3 shows an example where the retrieval model selects the entry that is closer to a context word ("Miltonia") than to the target word ("Miltoniopsis"), while in fact knowing the target word alone is sufficient to retrieve the correct label. This indicates that the model has not learned a good strategy for incorporating word and context representations from the datasets (i.e., it does not know when to focus on the context and when to focus on the target words).

Human vs Model
There are inherent task biases. Our first finding is that humans show a similar trend of biases across the tasks in comparison to model biases (except for XL-WiC). This is evident from Figure 1, where, with the human bias indicators, WiC still lies near the bottom right corner with relatively high context bias, AIDA lies near the top left corner with high target word bias, and AM2iCo remains in the middle. This confirms that some degree of bias is inherent in the task design, so that humans can also rely on either target words or context alone to perform the task to some extent.
Humans are less biased than models. That being said, the second and more important finding is that humans exhibit overall much weaker biases than models in all four tasks. If we compare human performance with model performance in Figure 2, we can see that the human CONTEXT and WORD baseline scores are lower relative to FULL. For clearer comparison, we calculate and plot the minimum gap between FULL and either of the two baselines in Figure 3, and we can see a substantial difference between humans and models, with humans exhibiting much larger gaps across the four tasks. The much larger human gaps also move all four tasks further towards the bias-free bottom left corner, as shown in Figure 1. In other words, humans are more likely than models to rely on both word and context, as the absence of either part has a much more negative impact on humans performing these tasks.
The most dramatic difference is in XL-WiC, where the model's strong target word bias disappears in humans. The task of XL-WiC by nature should not leak any information from the target word alone (hence the 0 target word bias for humans), as the input pair always contains the same target word. The high target word bias from models comes from the fact that the dataset does not contain sufficient ambiguous cases where the same word pair can have both true and false labels depending on the contexts. We confirm this by calculating the per-word average label entropy of the training data for all WiC-style tasks (discarding words that only occur once) in Table 4: XL-WiC has the lowest label entropy at 0.2084, and on average a word pair has the same label in 87% of the examples in which it appears in the dataset. The model therefore learns a correlation between the word itself and the label without needing context for disambiguation. Finally, the fact that models cannot achieve a similarly large jump in performance from masked input to FULL as humans do could indicate that word-context interaction is particularly challenging for models, and this might eventually explain the model-human gap. We take AM2iCo as an example that explicitly requires word-context interaction (Table 5). While BERT achieves comparable results with humans in CONTEXT and WORD, a significantly larger human-model gap is found in FULL, indicating that word-context interaction is what the model lacks most to achieve human-like performance.
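A sketch of the per-word label entropy computation described above is given below; it assumes binary True/False labels coded as 1/0 and keys examples by the target word, both illustrative simplifications.

```python
# A sketch of the per-word average label entropy computation (Table 4):
# estimate the entropy of each word's label distribution over its training
# examples, discard words occurring only once, and average.

import math
from collections import defaultdict

def per_word_label_entropy(examples):
    """examples: iterable of (word, label) pairs with labels coded 1/0."""
    labels_by_word = defaultdict(list)
    for word, label in examples:
        labels_by_word[word].append(label)

    entropies = []
    for labels in labels_by_word.values():
        if len(labels) < 2:            # discard words that only occur once
            continue
        p = sum(labels) / len(labels)  # probability of the True label
        h = 0.0 if p in (0.0, 1.0) else -(
            p * math.log2(p) + (1 - p) * math.log2(1 - p))
        entropies.append(h)
    return sum(entropies) / len(entropies)

# Toy data: "bank" is ambiguous, "oxygen" always carries one label.
data = [("bank", 1), ("bank", 0), ("bank", 1), ("oxygen", 1), ("oxygen", 1)]
print(per_word_label_entropy(data))  # (0.918 + 0.0) / 2 ≈ 0.459
```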

             CONTEXT  WORD  FULL
Human           69    68.5  87.9
BERT            66    61    71
Human - BERT     3     7.5  16.9

Table 5: Model-human gap in CONTEXT, WORD and FULL in AM2iCo.

Target words are important in WiC for humans.
The much lower context bias from humans in tasks such as WiC suggests that the absence of the target words drastically decreases performance. In fact, the human CONTEXT baseline (0.61) is even worse than BERT's (0.65), as shown in Figure 2. This may also come as a surprise, considering that the target words are always the same and only the context differs in each pair of input. We examined human responses in CONTEXT and found that humans can guess another valid target word based on the context, which gives a different prediction. Table 1 shows such an example. While the original WiC label of the input is F, our annotator gave T for the CONTEXT input, guessing that the target word is type. This is a reasonable prediction, as type fits the contexts and does hold its meaning across the two sentences. We refer to such new examples with human-elicited target words as GUESSEDWORD input. The same annotator was able to give the WiC label F when we revealed the original target word (breed), which has the specific meaning of species in sentence 1 and personality in sentence 2 (see the FULL input in Table 1). BERT, however, still predicts F regardless of the target word change in this GUESSEDWORD example.
As a qualitative analysis of the human-model discrepancy on CONTEXT, we examined 20 cases where annotators failed to predict the WiC labels (from the corresponding FULL input) while BERT succeeded. In 11 cases, humans guessed other valid target words that justify their predictions. We then performed a preliminary analysis testing BERT on all 11 GUESSEDWORD cases where the human-elicited target words change the labels (we show more examples in Table 8), and found that for 7 out of 11, BERT is insensitive to the changed target words and maintains its original prediction. This suggests BERT does not appreciate the same word-context interaction as humans, and makes predictions mainly based on context rather than modeling contextual lexical semantics in WiC.
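The probe itself amounts to a simple substitution test, sketched below; `predict` stands in for any finetuned WiC classifier, and the `<TARGET>` placeholder format is hypothetical.

```python
# A sketch of the GUESSEDWORD sensitivity probe: substitute the
# human-elicited word for the original target word and check whether the
# model's prediction changes. `predict` is a placeholder for any finetuned
# WiC classifier; the <TARGET> placeholder format is hypothetical.

from typing import Callable, Sequence

def is_insensitive(predict: Callable[[Sequence[str]], str],
                   templates: Sequence[str],
                   original_word: str,
                   guessed_word: str) -> bool:
    """Return True if swapping in the guessed word leaves the label unchanged."""
    original = [t.replace("<TARGET>", original_word) for t in templates]
    guessed = [t.replace("<TARGET>", guessed_word) for t in templates]
    return predict(original) == predict(guessed)
```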

Implications for future dataset design
We recommend this analysis framework for future dataset design and result interpretation in contextual word representation evaluation. In particular, we recommend (1) creating probing baselines by masking the context and the word (if relevant), (2) providing a sample of the same input to humans (details in Section 2.2), and (3) comparing human and model performance on the full input vs. the masked baseline(s) and then calculating the bias indicators, as sketched below. In terms of task design, we would ideally want both models and humans to show low baseline performance and thus low bias measures. When interpreting results, apart from evaluating model performance on the FULL input, we should also ensure that the model shows a human-like gap in performance (between FULL and the baseline(s)) on the same data.
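As a minimal illustration of step (3), the sketch below compares the FULL-to-baseline gap from Figure 3 between a model and a human sample, using the AM2iCo numbers from Table 5 (as fractions).

```python
# A minimal illustration of the recommended check: compare the minimum
# FULL-to-baseline gap (Figure 3) between the model and the human sample.
# The scores are the AM2iCo numbers from Table 5, expressed as fractions.

def min_gap(full: float, context: float, word: float) -> float:
    """min(FULL - CONTEXT, FULL - WORD); a small gap signals strong bias."""
    return min(full - context, full - word)

bert_gap = min_gap(full=0.71, context=0.66, word=0.61)      # 0.05
human_gap = min_gap(full=0.879, context=0.69, word=0.685)   # ~0.19

# A human-like model should show a gap comparable to the human one;
# here BERT's gap is much smaller, flagging a bias.
print(f"BERT gap: {bert_gap:.2f}, human gap: {human_gap:.2f}")
```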

Conclusion and Limitations
This study presented an analysis framework to disentangle and quantify the context-word interplay in popular contextual lexical semantic benchmarks. With our proposed bias measures, we plot datasets on a continuum, and we found that, for models, most existing datasets lie at the two ends with excessive biases (WiC-style tasks and WSD are heavily context-biased while retrieval-based tasks are heavily target-word-biased) that essentially bypass the key challenges in word-context interaction. SR and AM2iCo were identified as two tasks with less extreme biases that can therefore better test the representation of both word and context, and we call for more tasks that challenge models to do so. In addition, we identified that the degree of lexical ambiguity, as a byproduct of domain, affects target word bias (medical > general) in retrieval-based tasks. Most importantly, we differentiate biases learned by models from task-inherent biases by collecting human responses on the same baseline input. We found that models' heavy context and target word biases are not attested to the same extent in humans, who usually need both context and target words to perform well in the tasks. This suggests that models are relying on different cues instead of modeling contextual lexical semantics as intended by the tasks. Our paper highlights the importance of understanding these biases in existing datasets and encourages future dataset and model design to control for these biases and to focus more on testing the challenging word-context interaction in context-sensitive lexical semantics.
One limitation of this study is that we do not have large-scale quantitative evidence to pinpoint the cues the models rely on from partial input. Possible future directions are to design ablation studies that identify any spurious correlations the models have learned, and to introduce adversarial examples that penalize sole reliance on context or target words in both task design and model training.

A Example data

Table 6 lists example input and labels for tasks surveyed in this study.

B Dev performance
Table 7 shows BERT biases calculated over three runs on the dev set with standard deviation reported.

C Examples of the context bias in WiC
See Table 8 for two examples where the model relies solely on the context to make the prediction.

D Model configurations
All PCMs are from https://huggingface.co/. Model configurations are listed in Table 9.
E ROBERTA Performance (Figure 4)

F Agreement in WiC-style tasks (Table 10)

G Annotation Guideline
Figure 5 shows an example annotation guideline for the CONTEXT experiment in WiC.

Figure 1: Plotting context and target word biases from BERT (black) and humans (blue) across popular context-aware lexical semantic datasets. The green and yellow shades roughly indicate the areas of high target word bias and high context bias (>0.8). We would ideally want a dataset to lie towards the bottom left corner, which is bias-free. The dashed red lines indicate 1.0 context bias (right) and 1.0 target word bias (top), implying a dataset is in effect dealt with by relying on context alone or target words alone.

Figure 3: The minimum gap between FULL and CONTEXT or WORD, i.e., min(FULL - CONTEXT, FULL - WORD), with BERT and human performance. A small gap indicates strong bias.

Figure 4: Plotting context and target word biases when applying ROBERTA across popular context-aware lexical semantic datasets. The green and yellow shades roughly indicate the areas of high target word bias and high context bias (>0.8). The dashed red lines indicate 1.0 context bias (right) and 1.0 target word bias (top), implying the model only requires the target words alone or the context alone in this dataset.
Table 6: Example input and labels for a selection of context-sensitive lexical semantic tasks surveyed in this study. Acc: accuracy; ρ: Spearman's correlation; r: Pearson's correlation; P&R: precision and recall.

Figure 5: An annotation guideline for conducting the human CONTEXT baseline in WiC.

Table 1: Example input of FULL, CONTEXT and WORD in WiC. Target words are in brackets, and the original WiC label for the FULL example is F. GUESSEDWORD shows human-elicited target words based on CONTEXT.

Table 2: Target word bias and sense entropy across retrieval-based tasks.


Table 3: Error analysis of BERT predictions on FULL and WORD from WikiMed.


Table 7: Average context and target word biases over three runs with three different random seeds on the dev set of each dataset. Standard deviations are reported in parentheses.

Table 8: Example input of WORD, CONTEXT and FULL in WiC. The original WiC label for these examples is F. GUESSEDWORD contains human-elicited target words that flip the label. Comparing CONTEXT and GUESSEDWORD also shows BERT's context bias in WiC, as BERT is not sensitive to the target word change.

Table 10: Human agreement in CONTEXT and FULL in WiC-style tasks.