FEWS: Large-Scale, Low-Shot Word Sense Disambiguation with the Dictionary

Current models for Word Sense Disambiguation (WSD) struggle to disambiguate rare senses, despite reaching human performance on global WSD metrics. This stems from a lack of data for both modeling and evaluating rare senses in existing WSD datasets. In this paper, we introduce FEWS (Few-shot Examples of Word Senses), a new low-shot WSD dataset automatically extracted from example sentences in Wiktionary. FEWS has high sense coverage across different natural language domains and provides: (1) a large training set that covers many more senses than previous datasets and (2) a comprehensive evaluation set containing few- and zero-shot examples of a wide variety of senses. We establish baselines on FEWS with knowledge-based and neural WSD approaches and present transfer learning experiments demonstrating that models additionally trained with FEWS better capture rare senses in existing WSD datasets. Finally, we find humans outperform the best baseline models on FEWS, indicating that FEWS will support significant future work on low-shot WSD.


Introduction
Word Sense Disambiguation (WSD) is the task of identifying the sense, or meaning, that an ambiguous word takes in a specific context. Recent WSD models (Huang et al., 2019; Blevins and Zettlemoyer, 2020; Bevilacqua and Navigli, 2020) have made large gains on the task, surpassing the estimated 80% F1 human performance upper bound on WordNet annotated corpora (Navigli, 2009). Despite this breakthrough, the task remains far from solved: performance on rare and zero-shot senses is still low, and in general, current WSD models struggle to learn senses with few training examples (Kumar et al., 2019; Blevins and Zettlemoyer, 2020). This performance gap stems from limited data for rare senses in current WSD datasets, which are annotated on natural language documents that contain a Zipfian distribution of senses (Postma et al., 2016). More generally, since each word has a different set of candidate senses and new senses are regularly coined, it is almost impossible to gather a large number of examples for each sense in a language. This makes the few-shot learning setting particularly important for WSD.
We introduce FEWS (Few-shot Examples of Word Senses), a dataset built to comprehensively train and evaluate WSD models in few- and zero-shot settings. Overall, the contributions of FEWS are two-fold: as training data, it exposes models to a broad array of senses in a low-shot setting, and its large evaluation set allows for more robust evaluation of rare senses.
[Figure 1: examples of the data in FEWS, each with a context (C) and two senses (S1, S2). C: I liked my friend's last status... S1: to enjoy... [or] be in favor of. S2: To show support for, or approval of, something on the Internet by marking it with a vote. C: A transistor-diode matrix is composed of vertical and horizontal wires with a transistor at each intersection. S1: A grid-like arrangement of electronic components, especially one intended for information coding, decoding or storage. S2: A rectangular arrangement of numbers or terms having various uses [in mathematics].]
FEWS achieves high coverage of rare senses by automatically extracting example sentences from Wiktionary (Table 1). Not only is this sense coverage higher than that of existing datasets (e.g., SemCor, the largest manually annotated WSD dataset, only covers approximately 33,000 senses (Miller et al., 1993)), it also extends to senses related to new domains (Figure 1). We establish performance baselines on FEWS with both knowledge-based approaches and a recent neural biencoder model for WSD (Blevins and Zettlemoyer, 2020). We find that the biencoder, despite being the strongest baseline on FEWS, still underperforms human annotators, particularly on zero-shot senses where the biencoder trails by more than 10%. We also present transfer learning experiments and find that adding FEWS as additional training data improves performance on all but the most frequent senses (MFS) in the WSD Evaluation Framework (Raganato et al., 2017); this suggests that future improvements on FEWS could generalize to other WSD benchmarks. FEWS is available at https://nlp.cs.washington.edu/fews.

Related Work
WSD is a long-standing NLP task and is the focus of many datasets. The current de facto benchmark for modeling English WSD is the WSD Evaluation Framework (Raganato et al., 2017), which includes the SemCor dataset (Miller et al., 1993) as training data and consolidates a number of evaluation sets (Pradhan et al., 2007a; Palmer et al., 2001; Snyder and Palmer, 2004; Navigli et al., 2013; Moro and Navigli, 2015) into a standardized evaluation suite. These datasets are annotated with the senses (known as synsets) from WordNet, a manually constructed ontology of semantic relations (Miller et al., 1993).
Most existing datasets for WSD, including those in the WSD Evaluation Framework and others like Pradhan et al. (2007b), are annotated on natural language documents that contain a Zipfian distribution of word senses (Kilgarriff, 2004). This data source causes these datasets to have low coverage of rare senses, leading to worse performance on these less common senses (Postma et al., 2016; Kumar et al., 2019). In contrast, we use example sentences from Wiktionary as an alternative source of text for WSD data with FEWS. This means that FEWS has a more uniform sense distribution, providing more balanced coverage across different senses of words.
Wiktionary has previously been used as a resource for WSD research. Most work has investigated mapping Wiktionary senses onto WordNet synsets (Meyer and Gurevych, 2011; Matuschek and Gurevych, 2013); other work has learned similar mappings for Wikipedia articles (Mihalcea (2007); Navigli and Ponzetto (2012); inter alia). More similar to our work, Henrich et al. (2012) and Segonne et al. (2019) mine WSD examples from Wiktionary to augment labeled WSD data for non-English languages. However, FEWS is the first dataset specifically designed to evaluate zero- and few-shot learning with the balanced dictionary sense distribution.
Wiktionary is curated by volunteers; the data is manually annotated and high quality, and there is no additional annotation cost to construct FEWS. Furthermore, using example contexts from a dictionary allows FEWS to cover many senses underrepresented in existing WSD datasets, such as rare senses of words or senses pertaining to specific domains. However, we note that due to the crowdsourced nature of the examples in Wiktionary and the subjectivity of fine-grained sense distinctions, inconsistencies in the underlying data may introduce some annotation errors into FEWS. Examples of the data in FEWS are shown in Figure 1. In FEWS, each example context contains one or more instances of the ambiguous target word and is labeled with the sense (and corresponding definition) that describes that word as used in the context; this is in contrast to all-words WSD, where many of the content words in the context are annotated.

Dataset Creation
To create FEWS, we extracted all of the definitions for content words (nouns, verbs, adjectives, and adverbs) and example contexts associated with those definitions from a checkpointed version of English Wiktionary. While processing the Wiktionary data, we collected two types of contexts: (1) quotations (93% of extracted examples), which are quotations of natural language text found by Wiktionary contributors that contain the target word used with the relevant sense, and (2) illustrations (7%), which are short sentences or fragments written by contributors to illustrate the word sense in context. The target words in each context are marked by the Wiktionary formatting metadata; examples where no words are marked or the marked word differs too much from the base form of the dictionary entry are discarded. We additionally filter out examples that are too short to provide a meaningful context for the marked word.
We then labeled the target words in each extracted context with the sense ID generated for the definition associated with that sentence; this gave us 254,506 annotated WSD example contexts covering 148,333 senses. Finally, we filtered out examples with monosemous words since predicting the sense in these cases is a trivial task. However, Loureiro and Camacho-Collados (2020) recently found that unambiguous examples can improve WSD performance; therefore, we include the unambiguous cases as an additional file in the FEWS dataset. After filtering these unambiguous examples from the main dataset, FEWS in total contains 121,459 examples covering 71,391 sense types.
Finally, we split the data into training and evaluation sets. The majority of the data are quotations, which we use to populate the train, development, and test sets as they more closely resemble naturally occurring text than the illustrations. To create the development and test sets, we randomly select 10,000 examples for each evaluation set and ensure that each of these evaluation examples pertains to a different sense. We verify that half of those examples were labeled with senses that only occurred once in the unsplit data to create a zero-shot subset of each evaluation set, and the other half of the evaluation senses comprise the few-shot evaluation subsets. The remaining quotations that were not used for the development or test set are included as the training data. Finally, we remove the illustrations for senses in the zero-shot evaluation subsets and add the remaining illustrations to the training data; this addition makes the extended train set.
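The splitting procedure can be illustrated with a minimal sketch. It is not the released preprocessing code: it builds a single evaluation set rather than separate development and test sets, and the function name `split_examples`, the `(example_id, sense_id)` input format, and the seeded shuffle are assumptions made for illustration.

```python
import random
from collections import Counter

def split_examples(examples, n_eval, seed=0):
    """Split (example_id, sense_id) pairs into train, few-shot, and zero-shot sets.

    Hypothetical helper: each evaluation example has a distinct sense, half of
    the evaluation senses occur only once in the unsplit data (zero-shot), and
    the other half keep at least one remaining example in train (few-shot).
    """
    rng = random.Random(seed)
    counts = Counter(sense for _, sense in examples)
    shuffled = list(examples)
    rng.shuffle(shuffled)

    half = n_eval // 2
    train, few_shot, zero_shot = [], [], []
    used_senses = set()
    for example in shuffled:
        _, sense = example
        if sense in used_senses:
            # The sense already has an evaluation example; the rest go to train.
            train.append(example)
        elif counts[sense] == 1 and len(zero_shot) < half:
            zero_shot.append(example)
            used_senses.add(sense)
        elif counts[sense] > 1 and len(few_shot) < half:
            few_shot.append(example)
            used_senses.add(sense)
        else:
            train.append(example)
    return train, few_shot, zero_shot
```

Under these assumptions, a zero-shot sense never appears in the training split, while every few-shot sense keeps at least one training example.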

Dataset Analysis
We present a comprehensive analysis of FEWS to demonstrate that the dataset provides high coverage of many diverse word senses in a low-shot manner.

High Coverage of Words and Senses
The FEWS dataset covers 35,416 polysemous words and 71,391 senses (Table 1). The complete dataset covers 53.21% of the senses for words that appear in it (out of their Wiktionary sense inventories). Figure 2 shows this high coverage of senses for five different words compared to the coverage of the same words in SemCor (Miller et al., 1993). We see that while SemCor tends to have more examples for these common words, most examples correspond to a single sense of the word. However, the FEWS train set covers many more senses per word, albeit with fewer total examples. This high coverage of senses is particularly important for the evaluation sets provided in FEWS (Table 2). Each evaluation set (development and test) covers 10,000 different senses; half of these examples are few-shot and occur in the training set, and the other half of the evaluation senses are zero-shot. In comparison, the current benchmark for WSD evaluation (Raganato et al., 2017) only contains 796 unique zero-shot and 761 unique few-shot senses (where the sense is seen three or fewer times in the SemCor (Miller et al., 1993) training set) across development and test evaluation sets. This much larger sample of few- and zero-shot evaluation examples means that FEWS provides a robust setting to evaluate model performance on less common senses.

Low-Shot Learning
Because the data in FEWS come from example sentences for definitions in Wiktionary, each sense occurs in only a few labeled examples. This low-shot nature of the data is shown for five common words in Figure 2: each sense of these words occurs only one to four times in the training data.

Domains in FEWS
Definitions in Wiktionary are tagged with keywords, which we include as metadata for their respective senses in the dataset. These keywords indicate that the senses in FEWS pertain to topics ranging from literature and archaic English to sports and the sciences and come from various English dialects (Figure 4). FEWS also covers many new domains not covered in existing WSD corpora, with six keywords corresponding to internet culture.
We also find that a number of the senses (<1% in FEWS) are indicated to be toxic or offensive language: our analysis contains tags such as ethnic slurs, offensive, vulgar, and derogatory that correspond to examples of toxic language. For many of these examples, the meaning of the target word is only toxic due to the context in which it appears. These examples provide an opportunity for improving models for hate speech detection and related tasks, but we leave this exploration to future work.

Baselines for FEWS
We run a series of baseline approaches on FEWS to demonstrate how current methods for WSD perform on this dataset. We consider a number of knowledge-based approaches (Lesk, Lesk+emb, and MFS) and two neural models that build on pretrained encoders (Probe BERT and BEM BERT). We also ascertain how well humans perform on FEWS as a potential upper bound for model performance.

Knowledge-based Baselines
Most Frequent Sense (MFS) The MFS baseline assigns each target word in the evaluation set the candidate sense that is most frequently observed as the correct sense of that word in the training set. The MFS heuristic is known to be a particularly strong baseline in WSD datasets labeled on natural language documents (Kilgarriff, 2004); however, we expect this to be a weaker baseline on FEWS since the distribution of senses is much more balanced (and half of the evaluation senses are completely unseen during training).
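A minimal sketch of the MFS heuristic follows. The function names, the `(lemma, sense_id)` input format, and the tie-breaking rule (the first listed candidate wins ties, including the case where no candidate was seen in training) are assumptions for illustration and are not specified in the text above.

```python
from collections import Counter, defaultdict

def train_mfs(train_examples):
    """Count how often each sense is the gold label for each lemma."""
    counts = defaultdict(Counter)
    for lemma, sense in train_examples:
        counts[lemma][sense] += 1
    return counts

def predict_mfs(counts, lemma, candidates):
    """Return the candidate sense seen most often in training for this lemma.

    Assumption: ties (including all-zero counts for unseen lemmas or senses)
    are broken in favour of the first candidate in the list.
    """
    seen = counts.get(lemma, Counter())
    return max(candidates, key=lambda sense: seen[sense])
```

This sketch also makes the weakness on zero-shot senses visible: a sense with no training examples can only be predicted through the arbitrary tie-breaking rule.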
Lesk The simplified Lesk algorithm assigns to each ambiguous target word the sense whose gloss has the highest word overlap with the context surrounding that target word (Kilgarriff and Rosenzweig, 2000). We specifically use the Lesk-definitions baseline from this work, meaning that we do not include words from example sentences in the set compared against the context, since these example sentences are used as the contexts in FEWS.
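The gloss-overlap idea can be sketched as follows. The tokenizer, the small stopword list, and the example glosses are illustrative assumptions; the preprocessing actually used for this baseline is not described here.

```python
import re

# Illustrative stopword list (an assumption, not the one used in the paper).
STOPWORDS = {"a", "an", "the", "of", "and", "or", "on", "in", "that", "he", "she", "is"}

def content_words(text):
    """Lowercase, tokenize on alphanumeric runs, and drop stopwords."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS}

def simplified_lesk(context, glosses):
    """Pick the sense whose gloss shares the most words with the context.

    glosses maps sense_id -> definition text; example sentences are not used.
    """
    context_set = content_words(context)
    return max(glosses, key=lambda s: len(context_set & content_words(glosses[s])))

glosses = {
    "bank.n.1": "a financial institution that accepts deposits of money",
    "bank.n.2": "the sloping land beside a river or other body of water",
}
prediction = simplified_lesk("He sat on the bank of the river and watched the water", glosses)
```

Here the second gloss shares "river" and "water" with the context, so it is selected.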
Lesk+emb This baseline is an extension of the above approach that incorporates word embeddings (Basile et al., 2014). A vector representation is built for the context around an ambiguous word (v_c) and the glosses of each sense of that word (v_g), where v_c and v_g are the element-wise sums of the word vectors for words in the context and gloss, respectively. The sense that corresponds to the v_g with the highest cosine similarity to v_c is chosen as the label for the target word. We use GloVe embeddings (Pennington et al., 2014) for our implementation of this baseline.
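The following sketch illustrates the vector-sum and cosine-similarity steps. It uses a hand-made two-dimensional embedding table in place of GloVe, and the function names and the handling of out-of-vocabulary words (skipped) are assumptions.

```python
import math

def sum_vectors(words, embeddings, dim):
    """Element-wise sum of the vectors of all in-vocabulary words."""
    total = [0.0] * dim
    for word in words:
        vec = embeddings.get(word)
        if vec is not None:  # assumption: out-of-vocabulary words are skipped
            total = [t + x for t, x in zip(total, vec)]
    return total

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def lesk_emb(context_words, gloss_words, embeddings, dim):
    """Return the sense whose gloss vector v_g is closest (cosine) to v_c."""
    v_c = sum_vectors(context_words, embeddings, dim)
    return max(
        gloss_words,
        key=lambda s: cosine(v_c, sum_vectors(gloss_words[s], embeddings, dim)),
    )

# Toy embeddings for illustration only; a real run would load GloVe vectors.
toy = {"river": [1.0, 0.0], "water": [0.9, 0.1], "money": [0.0, 1.0], "deposit": [0.1, 0.9]}
senses = {"bank.n.1": ["money", "deposit"], "bank.n.2": ["water"]}
choice = lesk_emb(["river", "water"], senses, toy, dim=2)
```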

Neural Baselines
Probe BERT This baseline is a linear classifier trained on contextualized representations output by the final layer of a frozen pretrained model; we use BERT as our pretrained encoder (Devlin et al., 2019). We train this classifier by performing a softmax over all of the senses in the Wiktionary sense inventory and mask out any senses not relevant to the target word.
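The masking step can be sketched in plain Python as below. The probe itself is implemented in PyTorch; this stand-alone function only illustrates how a softmax over the full sense inventory can be restricted to the candidate senses of the target word. The function name and the use of negative infinity for masked logits are assumptions.

```python
import math

def masked_softmax(logits, candidate_ids):
    """Softmax over all sense logits, with non-candidate senses masked out.

    logits: one score per sense in the full inventory.
    candidate_ids: indices of the senses that belong to the target word.
    """
    allowed = set(candidate_ids)
    if not allowed:
        raise ValueError("the target word needs at least one candidate sense")
    masked = [x if i in allowed else float("-inf") for i, x in enumerate(logits)]
    top = max(masked)  # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in masked]
    norm = sum(exps)
    return [e / norm for e in exps]

probs = masked_softmax([2.0, 1.0, 5.0, 0.0], candidate_ids=[0, 1])
```

In this example the third sense has the highest raw logit, but it receives zero probability because it is not a candidate for the target word.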
BEM Our other neural baseline is the biencoder model (BEM) for WSD introduced by Blevins and Zettlemoyer (2020). The BEM has two independent encoders, a context encoder that processes the context sentence (including the target word) and a gloss encoder that encodes the glosses of senses into a sense representation. The BEM takes the dot product of the target word representation from the context encoder and sense representations from the gloss encoder, and it labels the target word with the sense that has the highest dot product score. We train BEM BERT by initializing each encoder with BERT and training on the FEWS train set.
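The scoring step of the biencoder can be sketched as follows. The two encoders are not modeled: the sketch assumes the target word representation and the sense (gloss) representations have already been computed, and the names `bem_predict`, `target_vec`, and `sense_vecs` are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def bem_predict(target_vec, sense_vecs):
    """Score each candidate sense by a dot product and return the best one.

    target_vec: representation of the target word from the context encoder.
    sense_vecs: maps sense_id -> representation from the gloss encoder.
    """
    scores = {sense: dot(target_vec, vec) for sense, vec in sense_vecs.items()}
    best = max(scores, key=scores.get)
    return best, scores

best_sense, sense_scores = bem_predict(
    [1.0, 2.0],
    {"sense_1": [0.5, 0.5], "sense_2": [1.0, 1.0]},
)
```

Because senses are represented through their glosses rather than through a fixed output layer, this scoring rule can be applied to senses that never appeared in training, which is what allows zero-shot prediction.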

Human Performance
Finally, we calculate the estimated human performance on the FEWS development set. The three human annotators were native English speakers, who each completed the same randomly chosen 300-example subset of the development set. The examples were sampled such that half (150) of these examples came from the few-shot split and the other half came from the zero-shot split. Similar to the modeling baselines, we evaluate each annotator's performance by scoring them against the sense associated with that example in Wiktionary (which we assume to be gold labels).
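Scoring against the Wiktionary labels, separately for the few-shot and zero-shot halves, can be sketched as below. The data layout (dictionaries keyed by example ID) and the function name are assumptions for illustration.

```python
from collections import defaultdict

def accuracy_by_subset(predictions, gold, subsets):
    """Percent accuracy per subset ("few-shot", "zero-shot") and overall ("all").

    predictions, gold: map example_id -> sense_id.
    subsets: map example_id -> subset name.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for example_id, gold_sense in gold.items():
        hit = int(predictions.get(example_id) == gold_sense)
        for name in (subsets[example_id], "all"):
            correct[name] += hit
            total[name] += 1
    return {name: 100.0 * correct[name] / total[name] for name in total}
```

The same function applies to annotators and models alike, since both are scored against the same assumed gold labels.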

Experimental Setup
Data All baselines for the FEWS dataset are trained on the train set unless specifically stated to have been trained on the extended train set. All models for FEWS are tuned using the development set and then evaluated on the held-out test set.
Experimental Details Our probe and BEM baselines are implemented in PyTorch (https://pytorch.org/) and optimized with Adam (Kingma and Ba, 2015). For the BEM, we use the implementation provided by Blevins and Zettlemoyer (2020) (https://github.com/facebookresearch/wsd-biencoders). We obtain the bert-base-uncased encoder from Wolf et al. (2019) to get the BERT output representations for the probe and to initialize the BEM models. Further hyperparameter details are given in Appendix A.

Modeling Results
Table 3 shows the results of our baseline experiments on FEWS. We find that the MFS baseline is weak overall, primarily because it is unable to predict anything about the held-out, zero-shot senses; the Lesk algorithms both outperform this baseline in the overall setting, with the Lesk+emb approach scoring 1.78-4.2% better than the original Lesk approach across the different evaluation subsets. However, on the few-shot examples in both the development and test sets, we see that the MFS baseline outperforms both of the Lesk baselines. This shows that, for the few-shot examples, the MFS heuristic remains a reasonably strong baseline even with the more uniform sense distribution of FEWS (and indicates that the distribution of examples drawn from the dictionary is less uniform than expected).
The neural baselines we run generally outperform the knowledge-based ones. The Probe BERT model does fairly well on the few-shot examples, outperforming the MFS baseline by about 20 accuracy points; however, it is unable to disambiguate words in the zero-shot splits correctly since the probe cannot generalize to unseen senses. In comparison, BEM BERT performs well across the entire evaluation set. In particular, the BEM achieves much better zero-shot performance than other baselines, though performance on this subset still lags behind few-shot performance.
Finally, we see that humans perform better than all of the considered baselines, particularly on zero-shot senses where humans outperform the BEM by 11.53 points. More details about the human evaluation are given below (Section 5.3).
Additionally, we find that training on the extended train set has little effect on these baselines: the MFS and Probe BERT baselines perform slightly worse (with deltas of -0.08% and -0.21% on the test set, respectively), and BEM BERT performs 1.05% better. Appendix B presents full results for the extended train baselines.

Human Evaluation Results
The human annotators achieved an average accuracy of 80.11% (with each annotator getting 84.67%, 78.67%, and 77.00% accuracy) and an average inter-annotator agreement of κ = 0.802. We find that humans perform slightly better on the examples that correspond to few-shot examples for the dataset than those corresponding to zero-shot examples despite not using the training data, with an average of 80.44% and 79.87% accuracy on those two subsets, respectively. We also report the performance of the baselines for FEWS on the same subset of the development set that was manually completed by humans (Table 4). We find that the baselines perform similarly on this subset, with a small decrease in performance compared to the full development set (with decreases in accuracy ranging between 0.81% and 3.46% when compared to the development set).

Transfer Learning with FEWS
Next, we investigate how useful FEWS is at improving WSD performance on existing benchmarks. We perform transfer learning experiments by iteratively finetuning models on FEWS and the WSD Evaluation Framework (Raganato et al., 2017), with one acting as the intermediate dataset and evaluating performance on the other, target dataset. We find that on global metrics, this approach performs similarly to finetuning only on the training data for each benchmark; however, transferring from FEWS to the WSD Framework improves performance on less-frequent and zero-shot senses. This suggests that FEWS provides valuable WSD information not covered in SemCor.

Experimental Setup
We apply the supplementary training approach presented in Phang et al. (2018) to perform our transfer learning experiments. We initialize a BEM with the best model developed on the intermediate dataset and evaluate on the target dataset in two ways: first, by evaluating this BEM on the target dataset in a zero-shot manner, without additional finetuning; and second, by finetuning the BEM on the target training data before performing the target evaluation. We refer to these models as BEM zero-shot and BEM intermediate, respectively. As a baseline, we also compare against the best BEM trained only on the target dataset (with no exposure to the intermediate dataset), BEM BERT. The model implementation details are identical to those for the baseline experiments on FEWS (Section 5.1).
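The three conditions compared in this setup can be summarized with a schematic sketch. The training and evaluation routines are passed in as callables because their details are outside the scope of this illustration; `run_transfer`, `train_fn`, and `eval_fn` are hypothetical names, `train_fn` is assumed to return a new model rather than modify its input, and model selection on a development set is omitted.

```python
def run_transfer(init_model, train_fn, eval_fn,
                 intermediate_train, target_train, target_eval):
    """Evaluate zero-shot transfer, staged finetuning, and a target-only baseline.

    train_fn(model, data) -> new model (assumed not to mutate its input).
    eval_fn(model, data) -> score on the target evaluation data.
    """
    intermediate_model = train_fn(init_model, intermediate_train)
    return {
        # Trained on the intermediate dataset only, then evaluated directly.
        "zero_shot": eval_fn(intermediate_model, target_eval),
        # Trained on the intermediate dataset, then finetuned on the target.
        "intermediate": eval_fn(train_fn(intermediate_model, target_train), target_eval),
        # Trained on the target dataset only.
        "baseline": eval_fn(train_fn(init_model, target_train), target_eval),
    }
```

For example, with FEWS as the intermediate dataset and SemCor as the target training data, the three entries correspond to the zero-shot, intermediate, and baseline models named above.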

Data
Models that are finetuned for the WSD Evaluation Framework are trained using SemCor, a large dataset annotated with WordNet synsets and commonly used for training WSD models (Miller et al., 1993). Following previous work, we use SemEval-2007 (SE07; Pradhan et al., 2007a) as a validation set and hold out the other evaluation sets in the Framework as test sets: Senseval-2 (SE2; Palmer et al., 2001), Senseval-3 (SE3; Snyder and Palmer, 2004), SemEval-2013 (SE13; Navigli et al., 2013), and SemEval-2015 (SE15; Moro and Navigli, 2015). Similarly to the baseline experiments, models that are trained on FEWS use the train set (note that we do not use the extended train set in these experiments), are validated on the development set, and finally evaluated on the held-out test set.

FEWS Results
We first consider the setting where SemCor acts as the intermediate dataset and FEWS as the target (Table 3). We find that BEM SemCor performs similarly to training only on FEWS (with an improvement of 0.21% on the test set). BEM zero-shot, which is not finetuned on the FEWS train set, unsurprisingly performs worse than any of the BEMs that saw the FEWS training data but outperforms the Lesk baselines.

Table 5: F1-score on the English all-words WSD task in the WSD Evaluation Framework (Raganato et al., 2017). We compare the best model from Blevins and Zettlemoyer (2020) (BEM BERT) against (1) a model first trained on FEWS and then trained on SemCor (BEM FEWS) and (2) a model trained on FEWS and evaluated on this task without further finetuning (BEM zero-shot).

WSD Evaluation Framework Results
We then consider the opposite setting, where FEWS is the intermediate dataset and the WSD Evaluation Framework acts as the target evaluation (Table 5).
On the overall evaluation set (the ALL metric), we again see that BEM FEWS performs similarly to the BEM BERT baseline, and that the zero-shot BEM model underperforms the other biencoders.
We then break down performance on the target evaluation set by sense frequencies: we evaluate performance on the most frequent sense (MFS) of each word in the evaluation (i.e., the sense each word takes most frequently in the SemCor training set); the less frequent senses (LFS) of words, or any sense a word takes besides its MFS; zero-shot words that are not seen during training on SemCor; and zero-shot senses, also not seen during training (Table 6). We find that both transfer learning models perform better on LFS and zero-shot examples than BEM BERT. (For the models trained on FEWS, it is possible they have seen senses from the Wiktionary sense inventory that are closely related to those in the zero-shot subsets; however, these are represented differently in each dataset and correspond to different definitions to be encoded by the biencoder.) In particular, the zero-shot transfer model does well on these subsets, demonstrating that a fair amount of WSD knowledge about uncommon senses can be transferred in a zero-shot manner between these datasets (albeit at the expense of higher performance on the MFS subset). This result also shows how much the MFS group dominates existing WSD metrics and highlights the need for focused evaluations of other types of word senses. Finally, we see that even without exposure to the natural sense distribution in natural language texts, the zero-shot model still performs significantly better on the MFS of words than the LFS, with a 17.1 F1 point difference between the two subsets; this is likely because the BERT encoder is exposed to the sense distribution of English natural language documents during pretraining.

Conclusion
We establish baseline performance on FEWS with both knowledge-based approaches and recently published neural WSD models. Unsurprisingly, neural models based on pretrained encoders perform best on FEWS; however, the human evaluation shows there is still room for improvement, particularly for zero-shot senses. Finally, we also present results on transferring word sense knowledge from FEWS onto existing WSD datasets with staged finetuning. While our naive approach for transferring knowledge from FEWS does not improve performance on the global WSD metric, adding FEWS as an additional training signal improves performance on uncommon senses in existing evaluation sets.
We hope that FEWS will inspire future work focusing on better methods for capturing rare senses in WSD and better modeling of word sense in niche domains like internet culture or technical writing.

Table 7: Accuracy of the FEWS baselines trained on the extended train set. In each group of rows, we report (1) the extended train baseline, (2) the comparable baseline trained on the standard train set, and (3) the performance delta between the two (where a positive delta indicates the extended train baseline performs better).

A Model Hyperparameters
Each model reported in this paper was tuned on a single hyperparameter sweep over the reported ranges and chosen based on the appropriate development set metric (accuracy on FEWS, F1 performance on the Unified WSD Framework).
Probe BERT The linear layer in the BERT probe baseline is trained for 100 epochs. It is tuned over a range of learning rates ([5e-6, 1e-5, 5e-5, 1e-4], with a final learning rate of 1e-4). We use a batch size of 128 to train this probe.
BEM For the biencoder model (BEM), we use the codebase provided by Blevins and Zettlemoyer (2020). Following this work, we train the BEM for 20 epochs with a warmup phase of 10,000 steps; we use a context batch size of 4 and a gloss batch size of 256. Each BEM is tuned over the following learning rates: [1e-6, 5e-6, 1e-5, 5e-5]. The BEM BERT and BEM SemCor models had a final learning rate of 5e-6, and BEM FEWS had a final learning rate of 1e-6.

B Extended Train Baselines
The extended train set in FEWS contains all of the quotation-based examples from the train set as well as the additional, shorter example illustrations that are written by Wiktionary editors to exemplify a particular sense of a word. We retrain the MFS, Probe BERT, and BEM BERT baselines on this extended training set; the other baselines we consider (Lesk and Lesk+emb) are calculated without using either of the training sets. Table 7 compares the extended train baselines against those trained on the standard train set. For the MFS and Probe BERT baselines, we find that adding the extra, stylistically different illustrations in the extended train set slightly hurts performance. However, the stronger BEM BERT is able to better use this data and achieves somewhat stronger performance with the additional training data. Notably, most of this improvement in BEM BERT comes from the zero-shot evaluation setting, even though the extended train set does not contain any of these zero-shot senses.