WiC-TSV: An Evaluation Benchmark for Target Sense Verification of Words in Context

We present WiC-TSV, a new multi-domain evaluation benchmark for Word Sense Disambiguation. More specifically, we introduce a framework for Target Sense Verification of Words in Context, whose uniqueness lies in its formulation as a binary classification task, which makes it independent of external sense inventories, and in its coverage of various domains. This makes the dataset highly flexible for the evaluation of a diverse set of models and systems in and across domains. WiC-TSV provides three different evaluation settings, depending on the input signals provided to the model. We set baseline performance on the dataset using state-of-the-art language models. Experimental results show that even though these models can perform decently on the task, there remains a gap between machine and human performance, especially in out-of-domain settings. WiC-TSV data is available at https://competitions.codalab.org/competitions/23683.


Introduction
Word Sense Disambiguation (WSD) is a longstanding task in Natural Language Processing and Artificial Intelligence. While progress has been made in recent years, mainly thanks to the surge of transformer-based language models such as BERT (Loureiro and Jorge, 2019; Vial et al., 2019; Huang et al., 2019), the evaluation of WSD models has been limited to a set of (mostly SemEval-based) standard WSD datasets (Raganato et al., 2017). These datasets usually come in one of two forms: lexical sample, in which a target word is placed in various contexts, triggering different senses, and all-words, in which all the content words in a given text are to be disambiguated. Both settings, however, come with a major restriction: word senses in the datasets are linked to external sense inventories such as WordNet (Fellbaum, 1998). Therefore, existing benchmarks are limited to only those WSD systems in which sense distinctions are defined according to an underlying sense inventory. This not only restricts the model's flexibility, but also enforces the assumption that complete data is available. However, as general sense inventories are complex to maintain, they often lag behind current usage, leading to the absence of novel terms and term usages. Furthermore, the coverage of domain-specific terms and named entities in general sense inventories is quite limited, while domain-specific sense inventories are rare and in most cases incomplete.
As a motivating example, let us assume Technology as the target domain and the collection of information on the current technology landscape as a goal. Therefore, the following context needs to be disambiguated in order to evaluate its relevance: From 1970 to 2007, Apple's chief executive was former Beatles road manager Neil Aspinall.
Even when incorporating a general sense inventory (which would include senses for the fruit and the tree) and a technology-specific sense inventory (which would include the sense for Apple Inc. the technology company), the actual target sense of this context (i.e., Apple Corps Limited, a multimedia corporation founded by the Beatles) may still be missing, which makes the annotation of the correct sense impossible. For these reasons, the current WSD task formulation and existing benchmarks are not fully able to evaluate the suitability of disambiguation systems in realistic domain-specific and/or enterprise settings.
In this paper, we try to fill this gap by proposing a re-formulation of the existing WSD task as well as a new benchmark for evaluating WSD systems under this paradigm.
Target Sense Verification (TSV) formulates the disambiguation of a word as a binary classification task where the equivalence of the intended sense of a word in context and a single given sense is evaluated. For instance, in the example above, the system would need to decide whether the sentence refers to Apple Inc. the technology company or not, by being provided with a sense indicator for solely Apple Inc. (e.g., the hypernym technology company or the definition).
A system able to efficiently solve the TSV task could be effectively used in the scenario of collecting and tagging large amounts of textual data; e.g., from social media, news agencies, blogs and for downstream tasks such as information retrieval, sentiment analysis or relation extraction. Furthermore, such a system could be a good candidate for entity linking (EL) as the task statement of TSV resembles the usage of enterprise knowledge graphs (Galkin et al., 2017) for EL: typically, small domain-specific enterprise knowledge graphs only contain entities from the domain of interest, partially or completely missing the general purpose senses of the contained labels.
In order to train and evaluate models for TSV, we constructed WiC-TSV (Word in Context - Target Sense Verification), a multi-domain dataset, and evaluated standard unsupervised and supervised approaches (including language models). While WiC-TSV's training and development sets consist of general purpose instances, the test set contains domain-specific instances from three different domains. Therefore, this dataset aims at evaluating the ability of a model to (1) disambiguate the word in context without an external sense inventory, (2) deal with unseen instances and incomplete data, and (3) transfer the intrinsic knowledge (gained on general domain data) into a specific domain.

Related Work
Word Sense Disambiguation. The task of WSD consists of associating a word in context with its most appropriate entry in a given sense inventory (Navigli, 2009), e.g., WordNet. For WSD there are many associated datasets (Raganato et al., 2017; Vial et al., 2018; Röder et al., 2018; Ling et al., 2015), including domain-specific ones (Agirre et al., 2009; Faralli and Navigli, 2012). The main difference between WSD and its re-formulation TSV is that for TSV the availability of a sense inventory is not required. Instead of associating a word in context with its most appropriate sense, the usage of a single given sense in the provided context is to be verified. Systems that aim to solve the proposed task are therefore not required to model all senses of the target word, but only a single sense instead.
This facilitates the development of systems for specific domains or settings, as no general-domain knowledge resource is required to perform this task. For instance, an Indonesian company may want to retrieve all sentences referring to the Java island and not other unrelated senses. This framing of the task is frequent in business and data mining settings where domain-specific knowledge resources or inventories may be available, without the need for modeling instances from other domains.
WiC. The task closest to the proposed WiC-TSV is probably Word-in-Context (Pilehvar and Camacho-Collados, 2019, WiC), which our dataset is based on. WiC is a binary classification dataset where a target word is presented within two different contexts. The task consists of deciding whether the word is associated with the same sense in the two contexts or not. WiC is also one of the tasks included in the general language understanding framework SuperGLUE (Wang et al., 2019).
WiC-TSV inherits some of the desirable properties of WiC, such as independence from external sense inventories and the binary classification nature of the task. However, though our benchmark draws ideas from the Word-in-Context benchmark, it provides a different evaluation setting with additional flavors. The main difference with respect to our dataset lies in the presence of relevant information such as hypernyms and definitions, which makes our dataset more realistic and a direct proxy for downstream evaluation: in WiC-TSV the ambiguous target word in a single context is compared against a specific target sense (indicated by provided hypernyms and definitions), in contrast to the comparison of the intended senses of the target word in two different contexts. Also, the task is more targeted at word-level representation, as in one of the tasks (i.e. hypernymy task) the model is not provided with any contextual information and, therefore, needs to have a clear understanding of the word to be able to make correct judgements. Moreover, WiC-TSV includes instances from three domains (cocktails, medicine, and computer science) in its test set, which makes the benchmark more challenging and comparable to a real setting.

WiC-TSV: The Benchmark
A goal of this benchmark is to evaluate the ability of a model to verify the target sense of a word in a context without the usage of an external sense inventory, i.e., without knowing all possible senses of the target word. Another model quality targeted by the benchmark is the ability to transfer intrinsic knowledge into a specific domain. Since domain-specific training data is hard to obtain in most areas, being able to learn from general purpose data and still perform well on domain-specific data is a huge advantage in a real-world setting.
To this end, we constructed a benchmark satisfying the following requirements:
1. Knowledge of only a single sense of the target word;
2. Knowledge of the definition and/or hypernyms of the target sense;
3. Ability to test the model's capability to disambiguate both general purpose and domain-specific senses;
4. Ability to test the model's capability to classify usages of previously unseen words.
Formally, each instance in the dataset consists of a target word w, a context c containing the target word w, and its corresponding target sense s represented by either its definition (Task 1), its hypernym/s (Task 2), or both definition and hypernyms (Task 3). The task aims to determine whether the intended sense of the word w used in the context c matches the target sense s. Table 1 contains examples of instances from the WiC-TSV test set. Furthermore, a small sample of 10 instances is available online in the form of a survey, where the achieved score is shown to the user after submission.
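The formal definition above can be pictured as a small record type. The following is only an illustrative sketch: the class name, the field names and the helper `visible_sense_fields` are assumptions made here and do not reflect the released file format, and the definition string in the example is invented for illustration.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class TSVInstance:
    """One Target Sense Verification instance (field names are illustrative)."""
    target_word: str             # w
    context: str                 # c, a text containing w
    definition: str              # part of the target sense s (Tasks 1 and 3)
    hypernyms: Tuple[str, ...]   # part of the target sense s (Tasks 2 and 3)
    label: bool                  # True if the sense of w in c matches s


def visible_sense_fields(task: int) -> Tuple[str, ...]:
    """Return which target-sense fields a system may read in a given task."""
    fields = {
        1: ("definition",),
        2: ("hypernyms",),
        3: ("definition", "hypernyms"),
    }
    if task not in fields:
        raise ValueError("task must be 1, 2 or 3")
    return fields[task]


# The Apple sentence from the introduction; the definition text is made up here.
example = TSVInstance(
    target_word="Apple",
    context="From 1970 to 2007, Apple's chief executive was former "
            "Beatles road manager Neil Aspinall.",
    definition="an American company that designs consumer electronics and software",
    hypernyms=("technology company",),
    label=False,  # the context refers to Apple Corps, not the technology company
)
```

Keeping the label binary and attaching exactly one target sense to each instance is what removes the need for a full sense inventory.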

Dataset Construction
In this section we detail the construction of the dataset. First, we describe the construction of the training and development set (Section 3.1.1) and then the test set (Section 3.1.2), with a special focus on the creation of the domain-specific subsets.

Training and Development Set
Instances in the training and development sets do not focus on a specific domain. They are based on the Word-in-Context (WiC) dataset (Pilehvar and Camacho-Collados, 2019), in which each instance contains a target word w and two contexts c_1 and c_2. The contexts from WiC for noun instances come from two resources: WordNet and Wiktionary. To maintain the desirable characteristics of the WiC dataset (e.g., balanced data, not having repeated contexts across instances), the splits of the original training and development sets were treated separately in the following way: starting from a noun-only sub-sample, for each context c_i, the sense of the target word w was mapped to the corresponding synset of WordNet, adding a sense identifier. Each WiC instance was then split into two instances, one for each context. For initial negative instances (i.e., w has different intended senses in c_1 and c_2), the sense identifiers of these two instances were switched. To avoid information leakage, only one of the two instances was kept for the WiC-TSV dataset. Finally, for each sense, the definition and hypernyms were derived from WordNet using the sense identifiers.
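The conversion described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' script: the type and function names are hypothetical, the sense identifiers are placeholders, and choosing the kept instance at random is an assumption, since the text only states that one of the two instances was kept.

```python
import random
from typing import NamedTuple


class WiCInstance(NamedTuple):
    word: str
    context_1: str
    context_2: str
    sense_1: str  # sense identifier of `word` in context_1
    sense_2: str  # sense identifier of `word` in context_2


class TSVInstance(NamedTuple):
    word: str
    context: str
    target_sense: str
    label: bool


def wic_to_tsv(wic: WiCInstance, rng: random.Random) -> TSVInstance:
    """Turn one sense-annotated WiC instance into a single TSV instance."""
    positive = wic.sense_1 == wic.sense_2
    if positive:
        # Each context keeps its own sense identifier.
        candidates = [(wic.context_1, wic.sense_1), (wic.context_2, wic.sense_2)]
    else:
        # Negative WiC instance: switch the sense identifiers so that each
        # context is paired with a sense it does not express.
        candidates = [(wic.context_1, wic.sense_2), (wic.context_2, wic.sense_1)]
    # Keep only one of the two derived instances (random choice is assumed).
    context, target_sense = rng.choice(candidates)
    return TSVInstance(wic.word, context, target_sense, positive)
```

Definitions and hypernyms would then be looked up from the sense identifier in a separate step.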

Test Sets
To make the dataset more challenging and realistic, the test set incorporates both general purpose and domain-specific instances.
General Purpose (WNT/WKT). The general purpose instances were generated analogously to 3.1.1. Hence, this test set is composed of both WordNet and Wiktionary examples, with definitions and hypernyms extracted from WordNet.
In the following we describe the construction of the domain-specific subsets. The main difference between the domain-specific and WNT/WKT test sets is that in the former the target sense remains the same. This means that even though "fork" might have different senses within the computer science domain, we are only interested in one of these senses.

Cocktails (CTL). For the cocktails instances the target words were taken from the "All about cocktails" thesaurus. The thesaurus contains 300 entries describing not only cocktails, but also beverages, garnishes and glassware, among others. For instances obtained from this resource, the hypernym "cocktail" is used in the WiC-TSV dataset, while the definition is derived from the thesaurus.
Table 1 (excerpt). Definition: "Python is an interpreted, high-level, general-purpose programming language"; Hypernyms: "object oriented programming language"; Label: F; Context: "The present paper compares the recently studied pythons with those examined 20 years ago , and uses the combined dataset to assess the ecological sustainability ."
Medical Subjects (MSH). For medical subject instances we use terms, definitions and hypernyms from the MeSH thesaurus (www.nlm.nih.gov/mesh/). This thesaurus is used for indexing medical articles and therefore contains a wide variety of terms in this domain. We considered various types, such as diseases, symptoms and body parts as target words.
Computer Science (CPS). Target words in the computer science domain were gathered manually, without a readily available thesaurus. The definitions were derived from the lead section of the corresponding Wikipedia page, while hypernyms were created by the consensus of two domain experts.
In order to create the domain-specific instances, first a list of ambiguous words and their domain-specific target senses was fixed for each domain. Then, we used the Wikilinks dataset (Singh et al., 2012) as a basis for collecting different contexts containing the target words. This dataset contains documents (blog posts scraped from the web) and the links from these documents to Wikipedia pages, which were used to assign the intended sense (i.e., target sense or other sense) to the target word. Where needed, additional contexts were collected manually by using a search engine to find contexts for the target word. The intended senses for these instances were assigned manually.
Postprocessing. After creating the initial domain-specific instances, the subsets were checked manually to remove non-suitable and unsolvable instances. To maintain a rather realistic evaluation setup, data was not completely cleaned, meaning that contexts can contain noisy elements such as headings or meta-info derived from the websites (e.g., "posted by").

Data Cleaning
While the quality of the domain-specific instances is assured due to their manual creation process, an additional data cleaning step was introduced in which general purpose instances were manually curated. The instances from the test set were split into four sets with an overlap of 20%. Each set was evaluated by an annotator regarding correctness and solvability of the instances. For example, when the hypernym of an instance was too generic to help in the disambiguation process, or the context itself was too ambiguous, the instance was marked as "to filter out". Each marked instance was reviewed by a second annotator, who could either confirm or reject the removal request. Instances marked by both annotators were removed.
An example of such a removed instance would be the context "The zero sign in American Sign Language is considered rude in some cultures ." for the target word "zero" with the target definition 'a mathematical element that when added to another number yields the same number'. In American Sign Language (ASL), "zero sign" is a ring-shaped hand sign using the thumb and pointing finger, similar to the OK-gesture. The provided instance mixes two senses of "zero sign". On the one hand, it refers to the hand gesture itself (synonymous to OK-gesture) which does not fully match the target sense. On the other hand, it also refers to the sign of the digit zero in ASL, which does match the target sense.
Other examples of filtered instances involve sentences where the target word may have been used metaphorically.
This procedure resulted in the removal of 106 instances. About 8% of these instances were part of the evaluation sets created to measure human performance (see 3.4; annotations for these instances were removed before calculating the metrics presented there): the annotators achieved a mean accuracy of only 56% on these instances. This shows that the data cleaning step was necessary in order to ensure the data quality of the test set.

Statistics
A statistical overview of the dataset and its splits is shown in Table 2. The 3832 available instances were split into train, development and test sets with a ratio of 56:10:34, which allows a thorough analysis of the generalisation capabilities of tested systems, while still providing an appropriately sized training set.
The test set contains around 55% general purpose instances and 45% from specific domains. For each domain, the number of unique target words is relatively low compared to the general domain subset, which results in a higher number of instances per target word. However, for domain-specific words, a great variety of senses is used in the contexts, yielding a high diversity among the instances. For all three splits, positive and negative instances are approximately balanced.

Human Performance
To estimate the human performance upper bound, a sub-sample of the test set was manually annotated. Performance was evaluated in the setting of Task 3, meaning that both the definition and the hypernyms were provided for disambiguation. A random selection of 250 instances was split into two evaluation sets of 150 instances each, resulting in a 20% overlap. Each evaluation set was assigned to a non-expert annotator who is a native English speaker. No additional information, in particular none from the respective ontology or about other senses of the target, was provided to the annotators, and they were instructed not to use external knowledge sources (e.g., if they were not familiar with the domain-specific sense of a word). Results of the human performance evaluation can be found in Table 3. The mean accuracy for the evaluated datasets was 85%, with individual scores of 81% and 89%. To estimate the inter-annotator reliability, the agreement of the two annotators on the overlapping instances was calculated: for 42 instances (84%) the annotators agreed on the label.
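The figures in this paragraph are consistent with each other, which a few lines of arithmetic make explicit. The sketch below assumes that the 20% overlap is measured relative to the 250 sampled instances; the function names are illustrative.

```python
def overlap_between_two_sets(total: int, set_size: int) -> int:
    """Number of shared instances when two sets of `set_size` cover `total`."""
    return 2 * set_size - total


def share(part: int, whole: int) -> float:
    """Fraction of `whole` represented by `part`."""
    return part / whole


total_instances = 250   # randomly selected test instances
set_size = 150          # size of each annotator's evaluation set
agreed = 42             # overlapping instances with identical labels

overlap = overlap_between_two_sets(total_instances, set_size)
overlap_share = share(overlap, total_instances)   # overlap relative to the sample
agreement = share(agreed, overlap)                # raw inter-annotator agreement
```

Note that the agreement figure is a raw percentage on the overlapping instances, not a chance-corrected coefficient.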
When evaluating the instances per domain, it can be seen that the general purpose instances were more difficult than the domain-specific ones: annotators achieved an average accuracy of 82% (individual scores of 77% and 87%) on the general purpose instances, while the mean accuracies on the domains were 89% (83% and 96%), 92% (88% and 96%), and 86.5% (89% and 84%) for MSH, CTL, and CPS, respectively. This performance difference is even more evident when comparing to the performance of non-native speakers: an additional experiment showed that evaluators whose native language is not English only achieved an average accuracy of about 77% on the WNT/WKT instances, while their performance on the domain-specific subsets was comparable to that of native speakers.

Experimental Results
In this section we evaluate the performance of different baseline models on our WiC-TSV benchmark. For our experiments we considered two main systems, namely BERT (Devlin et al., 2019) and FastText (Joulin et al., 2017), as well as unsupervised baselines adapted to the corresponding tasks in WiC-TSV.

Evaluation Tasks
The benchmark provides three different tasks depending on the input information available: definition-based (Section 4.1.1), hypernym-based (Section 4.1.2), and both (Section 4.1.3).

Task 1: Definition Information
In this task, the goal is to identify if the intended sense of the target word in the context matches the target sense described by the definition.
In other words, the model has to check if the sense represented by the definition can fit within the given context. For this task, the system is provided with a context (in which the target word is marked) along with a definition (which describes one of the possible senses of the word).
Baselines. The first baseline is based on the pretrained transformer-based language model BERT. It consists of a simple classification layer on top of the BERT model which is responsible for encoding the input. For this task, we concatenate the context and the definition and feed the whole sequence to BERT. Then, the classification layer takes as input the concatenation of three different vectors, all provided by BERT: the [CLS] token representation, the representation of the target word in the context and the average representation of the words in the definition. This is similar to the baseline BERT model employed in SuperGLUE (Wang et al., 2019). It is worth mentioning that BERT is originally trained using WordPiece tokenization (Wu et al., 2016), which means that each word can be broken down into more than one sub-word. Therefore, in order to have a fixed length representation for each word, we take the average of its sub-word representations. Finally, the whole model is fine-tuned on the training set.
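How the three vectors are combined can be sketched without any deep learning library. The snippet below assumes that the encoder output is already available as one vector per sub-word token, with the [CLS] vector at position 0, and that the token spans of the target word and of the definition are known; the function names and the span convention (start inclusive, end exclusive) are illustrative choices, not the authors' code.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def mean_vector(vectors: Sequence[Vector]) -> Vector:
    """Element-wise average, used to pool sub-word (or word) vectors."""
    if not vectors:
        raise ValueError("cannot average an empty span")
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def classifier_input(
    hidden_states: Sequence[Vector],
    target_span: Tuple[int, int],
    definition_span: Tuple[int, int],
) -> Vector:
    """Concatenate [CLS], pooled target word and pooled definition vectors."""
    cls_vector = list(hidden_states[0])
    target_vector = mean_vector(hidden_states[target_span[0]:target_span[1]])
    definition_vector = mean_vector(
        hidden_states[definition_span[0]:definition_span[1]])
    return cls_vector + target_vector + definition_vector


# Toy 2-dimensional "hidden states" for a 6-token input.
toy_states = [[1.0, 0.0], [2.0, 2.0], [4.0, 4.0], [0.0, 0.0], [1.0, 3.0], [3.0, 1.0]]
features = classifier_input(toy_states, target_span=(1, 3), definition_span=(4, 6))
```

With a hidden size of d, the classification layer therefore receives a vector of size 3d regardless of how many sub-words the target word or the definition contain.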
For the FastText-based baseline, we first extract the embeddings for each word in the context and in the definition. Then, the representation of each of the two texts is computed as the average of the embeddings of the words it contains. Next, these two representations are concatenated to form a fixed length vector which we then feed to a fully connected layer. Finally, we put a simple classification layer on top of this fully connected layer and train the model on the training set.
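The feature construction for this baseline can be sketched as follows. The embedding table is a toy dictionary standing in for pretrained FastText vectors, and lowercased whitespace tokenization and a zero vector for texts without any known word are simplifying assumptions made only for this illustration.

```python
from typing import Dict, List

Vector = List[float]


def average_embedding(tokens: List[str], table: Dict[str, Vector], dim: int) -> Vector:
    """Average the embeddings of all tokens found in the table."""
    vectors = [table[token] for token in tokens if token in table]
    if not vectors:
        return [0.0] * dim  # assumption: fall back to a zero vector
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def fasttext_style_features(
    context: str, definition: str, table: Dict[str, Vector], dim: int
) -> Vector:
    """Concatenate the averaged context and definition representations."""
    context_vector = average_embedding(context.lower().split(), table, dim)
    definition_vector = average_embedding(definition.lower().split(), table, dim)
    return context_vector + definition_vector


toy_table = {"java": [1.0, 0.0], "island": [0.0, 1.0], "coffee": [1.0, 1.0]}
toy_features = fasttext_style_features("Java island", "coffee", toy_table, dim=2)
```

The resulting vector has twice the embedding dimension and would be the input to the fully connected layer.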
We also evaluated GlossBERT (Huang et al., 2019) on our dataset. The authors describe a weak supervision algorithm that consists of surrounding the target word with special symbols (quotation marks are used in the available implementation). We provide results both with (GBERT_ws) and without (GBERT) weak supervision. We chose the hyper-parameters as suggested by the authors, trained for 6 epochs and achieved the highest scores on the 4th epoch.
The unsupervised baselines U-BERT and U-dBERT, which do not make use of the training set, are simple threshold-based classifiers that take into account the cosine distance between a target word representation and a definition representation. These vectors are obtained from BERT and DistilBERT, respectively.
As before, we derive the target word vector by taking the embedding of the target word in the context, and the definition vector by averaging over all embeddings of the definition. The threshold is tuned on the development set with a step size of 0.02.
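A threshold classifier of this kind can be sketched in a few lines. The search range [0, 2] (the full range of the cosine distance), the use of accuracy as the tuning criterion and the rule "predict a match when the distance is at most the threshold" are assumptions of this sketch; only the step size of 0.02 is taken from the text.

```python
import math
from typing import List, Sequence, Tuple


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """1 minus the cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def tune_threshold(
    distances: List[float], labels: List[bool], step: float = 0.02
) -> Tuple[float, float]:
    """Grid-search the distance threshold that maximises dev-set accuracy."""
    best_threshold, best_accuracy = 0.0, -1.0
    n_steps = int(round(2.0 / step))
    for i in range(n_steps + 1):
        threshold = i * step
        correct = sum((d <= threshold) == y for d, y in zip(distances, labels))
        accuracy = correct / len(labels)
        if accuracy > best_accuracy:  # keep the smallest best threshold
            best_threshold, best_accuracy = threshold, accuracy
    return best_threshold, best_accuracy


def predict(distance: float, threshold: float) -> bool:
    """Target sense is verified if the vectors are close enough."""
    return distance <= threshold
```

At test time, `predict` would be applied with the threshold found on the development set, so no parameter is fitted on the training data.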

Task 2: Hypernym Information
For this task, the system is provided with a target word (in a context) and a set of hypernyms for the target sense. The task is to identify if the intended sense given through the context is a hyponym of the provided hypernyms. Note that, unlike Task 1, no definition is involved in this setting and the task is directed only by hypernymy information.
Baselines. We used baseline models similar to those used in the previous task. The only difference lies in how we shape the inputs fed to these models.
For the supervised and unsupervised BERT-based models, we put together the context with the hypernyms to form the input. Similarly, for the FastText-based model, the hypernyms' embeddings are concatenated with the context's representation and fed to the classifier.

Task 3: Both Sources of Information
In the third task systems are provided with both definition and hypernymy information.
Baselines. For this task, we concatenate the definition and the hypernyms, and feed the generated sequence together with the context to BERT. Then, the concatenation of the [CLS] token representation, the representation of the target word in the context and the average representation of the words in the definition/hypernyms sequence is fed to the classification layer.
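The way the input sequence changes across the three tasks can be illustrated with a small helper. The explicit [CLS]/[SEP] markers, the comma-joined hypernym list and the function name are assumptions of this sketch (a real tokenizer would add its special tokens itself); the example strings are invented for illustration.

```python
from typing import Sequence


def build_bert_input(
    context: str,
    definition: str = "",
    hypernyms: Sequence[str] = (),
    task: int = 3,
) -> str:
    """Join the context with the sense information available in `task`."""
    hypernym_text = ", ".join(hypernyms)
    if task == 1:
        sense_text = definition
    elif task == 2:
        sense_text = hypernym_text
    elif task == 3:
        sense_text = " ".join(part for part in (definition, hypernym_text) if part)
    else:
        raise ValueError("task must be 1, 2 or 3")
    return f"[CLS] {context} [SEP] {sense_text} [SEP]"


task3_input = build_bert_input(
    "Java is an island.",
    definition="an island of Indonesia",
    hypernyms=("island",),
    task=3,
)
```

Only the second segment differs between the tasks, so the same classifier architecture can be reused for all three settings.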
For the unsupervised model we use the same BERT input and take the representation of the target in context and the average over the definition and hypernyms as input vectors. For the FastText-based baseline, the hypernyms' embeddings are concatenated with both the context's representation and the definition representation, and the combination is fed to the classifier.

Results
Table 4 shows the overall results for the three tasks. As can be observed, GlossBERT performs best in terms of accuracy and F1. BERT-L is slightly worse, but achieves the best recall. The worst supervised baseline, FastText, does not perform better than a naive baseline that retrieves all instances as true. This also reinforces the challenging nature of the benchmark, as even BERT-based models are far from human annotator performance (estimated at 85.3% accuracy). Clearly, the definition information is more helpful than the hypernyms for BERT, while the combination of both attains the best overall results. Yet GlossBERT reaches a better performance with the definition only. The unsupervised models only perform well with hypernyms. Though U-dBERT reaches the best precision in Task 1, its recall remains very low, and so does its overall performance.
Another point to highlight is the high recall of BERT-based models, in contrast to their precision. This is mainly attributed to the domain-specific subsets, as will be analysed below. As for the comparison between BERT-based models, the larger model (BERT-L) performs, as expected, better than the base model (BERT-B) overall.
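For reference, the four reported measures and the behaviour of the naive all-true baseline can be sketched as follows. The toy labels are invented and perfectly balanced; they are not the benchmark's numbers.

```python
from typing import Dict, Sequence


def binary_metrics(gold: Sequence[bool], predicted: Sequence[bool]) -> Dict[str, float]:
    """Accuracy, precision, recall and F1 for the positive (True) class."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g and p)
    fp = sum(1 for g, p in pairs if not g and p)
    fn = sum(1 for g, p in pairs if g and not p)
    tn = sum(1 for g, p in pairs if not g and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Naive baseline: label every instance as a match.
toy_gold = [True, True, False, False]
all_true = binary_metrics(toy_gold, [True] * len(toy_gold))
```

On balanced data such a baseline reaches perfect recall with precision and accuracy near one half, which is why a high recall alone says little about a model's quality on this task.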

Analysis
In order to better understand the results, in this section we perform a focused analysis on the performance split by domain. In fact, the results on the domain-specific instances are in the same ballpark as the WNT/WKT test set. This can be attributed to the fact that specific domains highly constrain the set of possible senses for a word, resulting in an easier WSD classification task (Magnini et al., 2002). On the other hand, WordNet is known to be quite fine-grained (e.g., the noun run has 16 different senses in WordNet). Surprisingly, unsupervised DistilBERT achieves the best accuracy over all tasks and classifiers on MSH. However, both unsupervised models do not perform well on WNT/WKT and CTL. We can observe that supervised models are significantly more reliable than unsupervised models and produce more similar scores across different tasks and datasets.

Domain-based Analysis
In general, for BERT-based models, recall is substantially higher than precision on the domain-specific subsets. This is desirable in a retrieval setting where a high-coverage retrieval of relevant cases is of more importance. Interestingly, among the two BERT alternatives, the smaller model performs better on the domain-specific subsets, suggesting that it is more robust to domain changes. This is an important observation which needs further careful investigation in future work, given that most evaluation benchmarks (on which the larger model consistently outperforms the smaller one) comprise in-domain test sets, which cannot reveal robustness across domains.

In-domain Few-shot Analysis
Although large annotated domain-specific training sets are rarely available, the presence of a small training set forms a realistic scenario. Incorporating these domain-specific instances in model training could potentially increase prediction performance. To investigate this hypothesis, we performed an additional analysis focusing on the usage of in-domain examples in the learning process, where for each domain 100 instances from the test sets were used as a training set. To enforce the assumption that not all target senses would be seen during the training process, we put aside all instances of 3 target words for each domain test set (see footnote 11). Two additional domain-based strategies were considered: (1) few-shot learning: only using the domain-specific instances, and (2) continued learning: extending the existing general-purpose training set with the domain-specific instances.
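One way to realise such a split is sketched below. It is an illustration under assumptions rather than the authors' procedure: instances are represented as (target word, payload) pairs, the held-out words and the 100 training instances are drawn at random with a fixed seed, and all instances not used for training are assumed to form the evaluation set.

```python
import random
from typing import Dict, List, Tuple

Instance = Tuple[str, str]  # (target word, payload such as the context)


def few_shot_split(
    instances: List[Instance],
    n_train: int = 100,
    n_held_out_words: int = 3,
    seed: int = 0,
) -> Dict[str, List[Instance]]:
    """Split one domain's instances into train, seen-test and unseen-test."""
    rng = random.Random(seed)
    words = sorted({word for word, _ in instances})
    held_out = set(rng.sample(words, n_held_out_words))
    # Training instances may only come from words that are not held out.
    candidates = [i for i, (word, _) in enumerate(instances) if word not in held_out]
    train_ids = set(rng.sample(candidates, min(n_train, len(candidates))))
    train = [instances[i] for i in sorted(train_ids)]
    test = [inst for i, inst in enumerate(instances) if i not in train_ids]
    train_words = {word for word, _ in train}
    return {
        "train": train,
        "test_seen": [inst for inst in test if inst[0] in train_words],
        "test_unseen": [inst for inst in test if inst[0] not in train_words],
    }
```

Repeating the split with several seeds and averaging the scores would reduce the dependence of the results on one particular sample.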
For this analysis we focused on Task-3 and BERT-large, which performed better overall. Table 6 shows the F1 results. In general, few-shot learning works surprisingly well overall (achieving the best overall performance in the CTL and MSH domains). On CTL pairs unseen during training, it even performs considerably better than the same BERT model trained in the continued learning setting. In the CPS domain, for both few-shot in-domain learning and continued learning the performance on seen target words is quite high, while the prediction of unseen target words produces relatively low F1 scores, which indicates a low ability to generalise to new senses. As for the model trained on the general-domain dataset, it performs best in the CPS domain, but performs considerably lower than the domain-tuned alternatives in the CTL domain. Indeed, the domain-tuned BERT systems clearly outperform the same model trained on the general domain on seen pairs, showing the importance of obtaining word-specific examples to boost performance. However, this may not be realistic in practice, and therefore further research should be devoted to improving the generalization capabilities of disambiguation systems, and language models in particular. These findings are consistent with the results of an experiment conducted with GlossBERT in the few-shot learning setting on Task-1: the overall accuracy increase ranged from 0.1% (CPS) to 13.6% (CTL) compared to the model trained solely on general domain instances.

Footnote 11: To add robustness to the results, three different random samples were considered for this experiment, with the results being averaged after the three different runs.
Table 6: F1 score for the in-domain few-shot analysis (Task-3) using BERT-L trained on general domain (WNT), domain-specific (Dom) and general domain fine-tuned on the target domain (WNT+D). In addition to the full test set (All), results are split on seen (See) and unseen (Uns), as per the presence or absence of the target word in the domain-specific training set.

Conclusions and Future Work
In this paper we have introduced the Target Sense Verification task, a re-formulation of WSD where the equivalence of the intended sense of a word in context and a single given sense is evaluated. Furthermore, we presented WiC-TSV, a multi-domain benchmark which differs from existing WSD datasets in three main ways: (1) it is based on TSV and therefore framed as a binary classification task where only one target sense needs to be verified, (2) it is independent from external sense inventories, and (3) its test set contains instances from three specific and heterogeneous domains: cocktails, medical subjects and computer science. Our benchmark therefore opens the floor for different disambiguation algorithms that do not require modeling the entirety of a sense inventory. This characteristic also provides a crucial advantage in enterprise and domain-specific settings as it facilitates the development of systems which are only aimed at modelling the domain at hand. Moreover, having these out-of-domain test instances makes our benchmark more robust and generalisable, preventing statistical models from learning (or making it harder for them to learn) spurious correlations from the training set, which has been proven to be an issue in standard NLP tasks (Poliak et al., 2018; Gururangan et al., 2018; Linzen, 2020). In our initial experiments we found that current state-of-the-art disambiguation techniques based on pre-trained language models such as BERT are very accurate at handling ambiguity, even in specialised domains. However, there is still room for improvement as highlighted by the gap with the human performance. This benchmark therefore opens up avenues for future research on domain-transfer and on developing general-purpose solutions which can perform well on a variety of domains without the need for large amounts of training data.
As future work, we are planning to further investigate and analyse the robustness of pre-trained models with respect to domain changes. Also, it would be interesting to develop hybrid models which take both definition and hypernymy information into account -in this paper we combined both sources in BERT in a simple manner, but more complex models should lead to further improvements.