Non-Parametric Few-Shot Learning for Word Sense Disambiguation

Word sense disambiguation (WSD) is a long-standing problem in natural language processing. One significant challenge in supervised all-words WSD is classifying the senses of the majority of words, which lie in the long tail of the distribution. For instance, 84% of the annotated words have fewer than 10 examples in the SemCor training data. The issue is all the more pronounced because the imbalance occurs in both the word and sense distributions. In this work, we propose MetricWSD, a non-parametric few-shot learning approach that mitigates this data imbalance. By learning to compute distances among the senses of a given word through episodic training, MetricWSD transfers knowledge (a learned metric space) from high-frequency words to infrequent ones. MetricWSD constructs training episodes tailored to word frequencies and explicitly addresses the skewed distribution, as opposed to prior work that trains parametric models with all words mixed together. Without resorting to any lexical resources, MetricWSD obtains strong performance against parametric alternatives, achieving a 75.1 F1 score on the unified WSD evaluation benchmark (Raganato et al., 2017b). Our analysis further validates that infrequent words and senses enjoy significant improvements.


Introduction
Word sense disambiguation (WSD) (Navigli, 2009) is a widely studied problem that aims to assign words in text to their correct senses. Despite advances over the years, a major challenge remains the naturally present data imbalance: models suffer from extreme skew, making the long-tail examples a major focus of learning. In the English all-words WSD task (Raganato et al., 2017b), 84% of the annotated words have fewer than 10 training examples. Recent approaches tackle this problem by resorting to extra sense information such as glosses (sense definitions) and semantic relations to mitigate the issue of rare words and senses (Luo et al., 2018a,b; Kumar et al., 2019; Huang et al., 2019; Blevins and Zettlemoyer, 2020; Bevilacqua and Navigli, 2020). However, most work relies on parametric models that share parameters between words and adopts standard supervised learning that mixes all words of different frequencies. We argue that this customary paradigm misses an opportunity to explicitly address the data imbalance issue.
In this work, we propose MetricWSD, a simple non-parametric model coupled with episodic training to address the long-tail problem, drawing inspiration from few-shot learning methods such as Prototypical Networks (Snell et al., 2017). Given a word, the model represents its senses by encoding a sampled subset (the support set) of the training data and learns a distance metric between these sense representations and the representations of the remaining subset (the query set). This lightens the model's load: it learns an effective metric space instead of learning each sense representation from scratch. By sharing only the parameters of the text encoder, the model transfers the knowledge of the learned metric space from high-frequency words to infrequent ones. We devise a sampling strategy that takes word and sense frequency into account and constructs support and query sets accordingly. In combination, this non-parametric approach naturally fits imbalanced few-shot problems, a more realistic setting when learning from a skewed data distribution as in WSD.
We evaluate MetricWSD on the unified WSD evaluation benchmark (Raganato et al., 2017b), achieving a 75.1% test F1 and outperforming parametric baselines while using only the annotated sense supervision. A further breakdown analysis shows that the non-parametric model outperforms its parametric counterparts on low-frequency words and senses, validating the effectiveness of our approach.

Related Work
Word sense disambiguation has been studied extensively as a core task in natural language processing. Early work computes relatedness through concept–gloss lexical overlap without supervision (Lesk, 1986; Banerjee and Pedersen, 2003). Later work designs features to build word-specific classifiers (word experts) (Zhong and Ng, 2010; Shen et al., 2013; Iacobacci et al., 2016). All-words WSD unifies the datasets and training corpora by collecting large-scale annotations (Raganato et al., 2017b), which has become the standard testbed for the WSD task. However, due to the naturally long-tailed annotation distribution, word-expert approaches fall short of utilizing information across different words.
Recent supervised neural approaches replace word-expert models with word-independent classifiers that extract sentence features more effectively and achieve higher performance (Kågebäck and Salomonsson, 2016; Raganato et al., 2017a). Approaches that use large pretrained language models (Peters et al., 2018; Devlin et al., 2019) further boost performance (Hadiwinoto et al., 2019). Recent work turns to incorporating gloss information (Luo et al., 2018a,b; Huang et al., 2019; Loureiro and Jorge, 2019; Blevins and Zettlemoyer, 2020). Other work explores additional lexical resources such as knowledge graph structures (Kumar et al., 2019; Bevilacqua and Navigli, 2020; Scarlini et al., 2020a,b). All the above approaches mix words in the dataset and are trained under a standard supervised learning paradigm. The work closest to ours is Holla et al. (2020), which converts WSD into an N-way, K-shot few-shot learning problem and explores a range of meta-learning algorithms. That setup assumes disjoint sets of words between meta-training and meta-testing and thus deviates from the standard WSD setting.

Task Definition
Given an input sentence x = (x_1, x_2, ..., x_n), the goal of the all-words WSD task is to assign a sense y_i to every word x_i, where y_i ∈ S_{x_i} ⊂ S for a given sense inventory S such as WordNet. In practice, not all the words in a sentence are annotated; only a subset of positions I ⊆ {1, 2, ..., n} is identified for disambiguation. The goal is to predict y_i for i ∈ I.
We regard all the instances of a word w ∈ W as a classification task T_w, since only the instances of word w share the output label set S_w. We define the input x̄ = (x, t), where x is an input sentence and 1 ≤ t ≤ n is the position of the target word; the output is y_t for x_t. A WSD system is a function f such that y = f(x̄). Our method groups the training instances by word w: A(w) = {(x̄_i, y_i)}_{i=1..N(w)}, where N(w) is the number of training instances for T_w. This allows for word-based sampling, as opposed to mixing all words as in standard supervised training.
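As a minimal illustration of this per-word grouping, the sketch below builds A(w) from a toy annotated corpus; the sentences and sense labels are invented stand-ins for SemCor instances.

```python
from collections import defaultdict

# Toy annotated corpus: each instance is (sentence_tokens, target_position, sense_label).
# The words and sense labels below are illustrative, not taken from SemCor.
instances = [
    (["he", "sat", "by", "the", "bank"], 4, "bank%river"),
    (["the", "bank", "approved", "the", "loan"], 1, "bank%finance"),
    (["she", "drew", "a", "plant"], 3, "plant%flora"),
    (["the", "plant", "closed", "down"], 1, "plant%factory"),
    (["deposit", "money", "at", "the", "bank"], 4, "bank%finance"),
]

def group_by_word(instances):
    """Group instances into per-word tasks A(w); N(w) is then len(A(w))."""
    tasks = defaultdict(list)
    for sent, t, sense in instances:
        tasks[sent[t]].append(((sent, t), sense))  # ((x, t), y_t) pairs
    return dict(tasks)

tasks = group_by_word(instances)
print({w: len(a) for w, a in tasks.items()})  # N(w) for each word
```

Episodes can then be sampled per word from these groups rather than from one mixed pool.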

Episodic Sampling
We construct episodes by word with a tailored sampling strategy to account for the data imbalance issue. In each episode, all examples A(w) of a word w are split into a support set S(w) containing J distinct senses and a query set Q(w) by a predefined ratio r (putting r% into the support set). When the support set is smaller than a predefined size K, we use the sets as they are. This split maintains the original sense distribution of infrequent words, as they will be used fully as support instances during inference. Frequent words, on the other hand, normally have abundant examples to form the support set. To mimic the few-shot behavior, we sample a balanced number of examples per sense in the support set for frequent words (referred to as the P_b strategy). We also compare to the strategy where the examples of all senses of the word are sampled uniformly (referred to as the P_u strategy). We present the complete sampling procedure in Algorithm 1.

Algorithm 1: Episodic Sampling
1: K: maximum sample number for the support set
2: r: support-to-query splitting ratio
3: P: sampling strategy ∈ {P_b, P_u}
4: Initialize empty dataset D = ∅
5: for all w ∈ W do
6:   Retrieve A(w) and randomly split A(w) into S̃(w) and Q̃(w) with ratio r
7:   if |S̃(w)| ≤ K then
8:     S(w) ← S̃(w); Q(w) ← Q̃(w)
9:   else
10:    J ← number of senses in S̃(w)
11:    S̃_j(w) ← examples of sense j in S̃(w)
12:    for k = 1 ... |S̃(w)| do
13:      j ← the sense of the k-th example
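A rough sketch of this episode construction is given below. The per-sense quota of K divided by the number of senses for the balanced (P_b) strategy is our assumption for illustration, not a detail taken from the paper.

```python
import random
from collections import defaultdict

def sample_episode(examples, r=0.5, K=8, strategy="balanced", rng=random):
    """Build one (support, query) episode for a word, in the spirit of
    Algorithm 1. `examples` is a list of (input, sense_label) pairs.
    The per-sense quota K // num_senses for the balanced (P_b) strategy
    is an assumption of this sketch."""
    examples = list(examples)
    rng.shuffle(examples)
    split = max(1, int(len(examples) * r))       # r% goes to the support side
    support_cand, query = examples[:split], examples[split:]
    if len(support_cand) <= K:                   # infrequent word: keep the
        return support_cand, query               # original sense distribution
    by_sense = defaultdict(list)
    for ex in support_cand:
        by_sense[ex[1]].append(ex)
    if strategy == "balanced":                   # P_b: equal examples per sense
        per_sense = max(1, K // len(by_sense))
        support = [ex for exs in by_sense.values() for ex in exs[:per_sense]]
    else:                                        # P_u: uniform over examples
        support = rng.sample(support_cand, K)
    return support, query
```

During training, one such episode is drawn per sampled word per step; the prototypes are computed from the support set and the loss from the query set.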

Learning Distance Metric
We use BERT-base (uncased) (Devlin et al., 2019) as the context encoder. We follow Blevins and Zettlemoyer (2020) closely and denote the context encoding as f_θ(x̄) = BERT(x)[t], where the context encoder is parameterized by θ. If a word x_t is split into multiple word pieces, we take the average of their hidden representations. In each episode, the model encodes the contexts in the support set S(w) and the query set Q(w); the encoded support examples are averaged and treated as the sense representations (prototypes). For word w, the prototype for sense j among the sampled J senses is computed from the support examples:

c_j = (1 / |S_j(w)|) Σ_{(x̄, y) ∈ S_j(w)} f_θ(x̄). (1)

We use the dot product as the scoring function s(·,·) between the prototypes and the query representations to obtain the probability of predicting sense j for a query example (x̄', y'):

p(y' = j | x̄') = exp(s(c_j, f_θ(x̄'))) / Σ_{j'=1..J} exp(s(c_{j'}, f_θ(x̄'))). (2)

We also experiment with the negative squared ℓ2 distance as the scoring function, as suggested in Snell et al. (2017), and find no improvement.
The loss is the negative log-likelihood and is minimized by gradient descent. During inference, we randomly sample min(I_S, |A_j(w)|) examples from the training set for sense j as the support set, where I_S is a hyperparameter. We also experimented with a cross-attention model that learns a scoring function for every pair of instances, similar to the BERT-pair model in Gao et al. (2019); however, we did not find it to perform better than the dual-encoder model.
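The prototype computation and dot-product scoring above can be sketched as follows, with tiny 2-dimensional vectors standing in for the BERT encodings f_θ(x̄).

```python
import numpy as np

def prototypes(support_reps, support_labels):
    """Average the encoded support examples of each sense into a prototype c_j."""
    senses = sorted(set(support_labels))
    protos = np.stack([
        np.mean([r for r, y in zip(support_reps, support_labels) if y == j], axis=0)
        for j in senses
    ])
    return protos, senses

def predict_proba(query_rep, protos):
    """Softmax over dot-product scores between the query encoding and prototypes."""
    scores = protos @ query_rep
    scores = scores - scores.max()  # subtract max for numerical stability
    exp = np.exp(scores)
    return exp / exp.sum()

# Toy 2-d "encodings"; in the real model these come from the BERT encoder.
support = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
labels = ["sense_a", "sense_a", "sense_b", "sense_b"]
protos, senses = prototypes(support, labels)
probs = predict_proba(np.array([1.0, 0.0]), protos)
print(senses[int(np.argmax(probs))])  # the query is nearest sense_a's prototype
```

Because only the encoder is learned, adding a new sense at inference time requires no new parameters, only support examples to average into a prototype.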

Relation to Prototypical Networks
Our non-parametric approach is inspired and closely related to Prototypical Networks (Snell et al., 2017) with several key differences. First, instead of using disjoint tasks (i.e., words in our case) for training and testing, MetricWSD leverages the training data to construct the support set during inference. Second, we control how to sample the support set using a tailored sampling strategy (either balanced or uniform sense distribution). This encourages learning an effective metric space from frequent examples to lower-frequency ones, which is different from adapting between disjoint tasks as in the typical meta-learning setup.

Experiments
We evaluate our approach with the WSD framework proposed by Raganato et al. (2017b). We train our model on SemCor 3.0 and use SemEval-2007 (SE07) for development and the rest: Senseval-2 (SE02), Senseval-3 (SE03), SemEval-2013 (SE13), and SemEval-2015 (SE15) for testing. Following standard practice, we report performance on the separate test sets, the concatenation of all test sets, and the breakdown by part-of-speech tags. For all the experiments, we use the BERT-base (uncased) model as the text encoder.
Baselines We first compare to two simple baselines: WordNet S1, which always predicts the first sense, and MFS, which always predicts the most frequent sense in the training data. We then compare our approach to BERT-classifier, a linear classifier built on top of BERT (all weights are learned together); as opposed to our non-parametric approach, BERT-classifier has to learn the output weights from scratch. We also compare to a supervised baseline using contextualized word representations that extends the input context with its surrounding sentences in the SemCor dataset (Hadiwinoto et al., 2019), and to a non-parametric nearest-neighbor baseline, BERT-kNN, which obtains sense representations by averaging BERT-encoded representations of training examples of the same sense and predicts the sense representation nearest to the input. Its BERT weights are frozen, so, unlike our approach, it does not learn the metric space. Models using only supervised WSD data fall back to predicting the most frequent sense (MFS) when encountering unseen words. For reference, we also list the results of recent state-of-the-art methods that incorporate gloss information, including EWISE (Kumar et al., 2019), EWISER (Bevilacqua and Navigli, 2020), GlossBERT (Huang et al., 2019), and BEM (Blevins and Zettlemoyer, 2020). More implementation details are given in Appendix A.
Overall results Table 1 presents the overall results on the WSD datasets. Among systems that do not use gloss information, MetricWSD achieves strong performance against all baselines; in particular, it outperforms BERT-classifier by 1.4 points and BERT-kNN by 2.5 points in test F1. Using gloss information boosts performance by a large margin, especially for unseen words, for which systems without access to glosses can only default to the first sense. We believe adding glosses has the potential to further enhance our non-parametric approach, and we leave this to future work.

Performance on infrequent words and senses
The performance breakdown for words and senses in different frequency groups is given in Figure 2. The non-parametric methods (both MetricWSD and BERT-kNN) are better at handling infrequent words and senses. In particular, our approach outperforms BERT-classifier by 3.5% on words with ≤ 10 occurrences and by 6.6% on senses with ≤ 10 occurrences, demonstrating the effectiveness of MetricWSD in handling scarce examples.
Ablation on sampling strategies We provide an ablation study for the sampling strategy on the development set. The system using the balanced strategy (P_b) achieves a 71.4 F1 on the development set and drops to 69.2 F1 when the uniform strategy (P_u) is used. Balancing the sampled senses achieves significantly higher performance than sampling with the uniform distribution, and this observation is consistent across different hyperparameter settings.

Figure 3: t-SNE visualization of the learned representations f_θ(x̄) for the examples of note (v) and provide (v) in the SemCor dataset. It shows that MetricWSD is better than BERT-classifier in grouping different senses.

Table 2: Example contexts with BERT-classifier and MetricWSD predictions, e.g., "The art of change-ringing is peculiar to the English, and, like most English peculiarities, unintelligible to the rest of the world."

Analysis
Qualitative analysis Table 2 shows examples that are correctly predicted by our method but incorrectly predicted by BERT-classifier. MetricWSD correctly predicts the sense art%1:09:00:: (a superior skill that you can learn by study and practice and observation), which has only 6 training examples, whereas BERT-classifier incorrectly predicts the sense art%1:06:00:: (the products of human creativity; works of art collectively), which has many more training examples.

Visualization of learned representations
We conduct a qualitative inspection of the learned representations of the BERT-classifier model and MetricWSD. Figure 3 shows the encoded representations of all 105 examples of the word note (with part-of-speech tag v) in the SemCor dataset. BERT-classifier fails to learn distinct groupings of the senses, while MetricWSD forms clear clusters. Note that even for the sense (red) with only a few examples, our method learns representations that are meaningfully grouped. Similarly, MetricWSD separates the senses of the word provide (with part-of-speech tag v) more clearly than BERT-classifier, especially for the rare sense (pink).

Conclusion
In this work, we introduce MetricWSD, a few-shot non-parametric approach for mitigating the data imbalance issue in word sense disambiguation. Through metric learning and episodic training, the model learns to transfer knowledge from frequent words to infrequent ones. MetricWSD outperforms previous methods that use only the standard annotated sense supervision and shows significant improvements on low-frequency words and senses. In the future, we plan to incorporate lexical information to further close the performance gap.

Ethical Considerations
We identify areas where WSD applications and our proposed approach will impact or benefit users. WSD systems are often used as an assistive submodule for other downstream tasks, rendering the risk of misuse less pronounced. However, risk may still arise when biased data leads to erroneous disambiguation. For example, the word "shoot" might have a higher chance of being interpreted as a harmful action, among its other possible meanings, when the context mentions racial or ethnic groups that are represented in a biased way in the training data. Our proposed method does not directly address this issue. Nonetheless, our approach offers an opportunity to alleviate the risk by providing an easier way to inspect and remove biased prototypes, instead of making predictions with learned output weights, for which it is hard to attribute a system's biased behavior. We hope future work extends the approach and tackles this problem more explicitly.