MWE as WSD: Solving Multiword Expression Identification with Word Sense Disambiguation

Recent approaches to word sense disambiguation (WSD) utilize encodings of the sense gloss (definition), in addition to the input context, to improve performance. In this work we demonstrate that this approach can be adapted for use in multiword expression (MWE) identification by training models which use gloss and context information to filter MWE candidates produced by a rule-based extraction pipeline. Our approach substantially improves precision, outperforming the state-of-the-art in MWE identification on the DiMSUM dataset by up to 1.9 F1 points and achieving competitive results on the PARSEME 1.1 English dataset. Our models also retain most of their WSD performance, showing that a single model can be used for both tasks. Finally, building on similar approaches using Bi-encoders for WSD, we introduce a novel Poly-encoder architecture which improves MWE identification performance.


Introduction
Word sense disambiguation (WSD), the task of predicting the appropriate sense for a word in context, and multiword expression (MWE) identification, the task of identifying MWEs in a body of text, both deal with determining the meaning of words in context (Maru et al., 2022; Constant et al., 2017). They have traditionally been treated as separate tasks, but this is potentially disadvantageous: WSD performed on words which are part of unrecognized MWEs cannot produce correct meanings, and the meanings of polysemous MWEs remain ambiguous even after identification. For example, the sentence "She inherited a fortune after her grandfather kicked the bucket" tells us that someone's grandfather has died, but we would not expect to find meanings associated with death in the sense inventories of either kick or bucket. WSD cannot capture the meanings of these words in context unless the relevant MWE is identified first. However, like many MWEs, kick the bucket can also have a literal, compositional meaning, as in "He kicked the bucket down the hill," so we cannot indiscriminately mark all combinations of words from known MWEs as MWEs. MWEs can also have multiple possible senses in the same way words can: break up can refer both to objects physically breaking apart and to romantic relationships ending, so even in cases where it is correctly identified as a MWE its meaning is ambiguous without WSD. Identifying the meanings of all words in a sentence requires solving these tasks together.

*Both authors contributed equally to this work.
WSD and MWE identification can be used in preprocessing to improve performance of downstream tasks such as machine translation or information extraction (Zaninello and Birch, 2020; Song et al., 2021; Barba et al., 2021a). They also have more direct applications in helping language learners - for whom MWEs are particularly challenging (Christiansen and Arnon, 2017; Pulido, 2022) - understand the meaning of words or MWEs in context.
In this paper, we propose a system that tackles these tasks together, using a MWE lexicon and rule-based pipeline to identify MWE candidates and a trainable model to both perform WSD and filter MWE candidates. Our model is a modified Poly-encoder (Humeau et al., 2020), a natural extension of previous work using Bi-encoders for WSD (Blevins and Zettlemoyer, 2020; Kohli, 2021). Utilizing gloss information¹ allows our model to consider the meaning of MWEs and filter out candidates where the constituents of a MWE are present but the MWE meaning does not fit the context, such as the aforementioned literal usage of kick the bucket. Our method improves precision and achieves state-of-the-art F1 for MWE identification on the DiMSUM dataset (Schneider et al., 2016) and competitive performance on the PARSEME 1.1 English data (Ramisch et al., 2018). To the best of our knowledge, this work is the first to use glosses as a resource for MWE identification. Our contributions are summarized as follows:
• We present a system which solves MWE identification and WSD together, achieving state-of-the-art results for MWE identification on DiMSUM and only 6% less F1 for WSD than an equivalent single-task model.
• We make all of our code, models and data public².
Related Work

Word Sense Disambiguation
Until the last few years, most approaches to WSD treated senses only as one of many possible labels in a classification task. This formulation limits the information available to the model about each sense to only what is learnable from the training data, and can lead to poor performance on rare or unseen senses. To mitigate these problems, recent approaches have improved performance by incorporating sense glosses (Blevins and Zettlemoyer, 2020; Barba et al., 2021a; Zhang et al., 2022). Our work is inspired by this methodology and utilizes gloss information to improve MWE identification. In particular, Blevins and Zettlemoyer (2020) demonstrate that a simple Bi-encoder model consisting of two BERT (Devlin et al., 2019) models can achieve competitive WSD performance, with Kohli (2021) improving Bi-encoder training for WSD and Song et al. (2021) achieving further performance gains through improved sense representations. Bi-encoder models are also particularly efficient at inference time because gloss representations can be computed in advance and cached.

Poly-encoders
The Poly-encoder architecture was proposed by Humeau et al. (2020) as a middle ground between Bi-encoders and Cross-encoders (which jointly encode all possible input pairs), retaining the speed advantage of the Bi-encoder while allowing information to flow between the two encoder outputs like the Cross-encoder. It can be used in place of a Bi-encoder in tasks such as information retrieval (Li et al., 2022), text reranking (Kim et al., 2022), or, in our case, MWE identification and WSD.

Multiword Expression Identification
Precisely defining what constitutes a MWE has proven to be difficult (Maziarz et al., 2015), but they can be broadly defined as groupings of words whose meaning is not entirely composed of the meanings of included words (Sag et al., 2002; Baldwin and Kim, 2010). This includes idioms such as a taste of one's own medicine, verb-particle constructions such as break up or run down, compound nouns such as bus stop, and any other grouping of words with non-compositional semantics. In fact, a significant portion of noun MWEs are named entities (Savary et al., 2019).
The task of MWE identification is locating these MWEs in a given body of text. Common approaches to solving MWE identification include rule-based systems (Foufi et al., 2017; Pasquer et al., 2020), CRF-based systems (Liu et al., 2021), and token tagging systems (Rohanian et al., 2019). Rule-based systems remain competitive with neural models in this task, and many systems, including ours, use MWE lexicons in order to identify MWEs, which Savary et al. (2019) argue are critical to making progress in MWE identification. Kurfalı and Östling (2020) and Kanclerz and Piasecki (2022) are similar to our work in that they frame the task of MWE identification as a classification problem, although neither uses gloss information.
Among all the types of MWEs, verbal MWEs are particularly difficult to identify due to their surface variability: constituents can be conjugated or separated so that they become discontinuous (Pasquer et al., 2020). Much work on verbal MWE identification, especially in languages other than English, has been done as part of recent iterations of the PARSEME shared task (Ramisch et al., 2018).
Figure 1: Each model scoring the MWE take advantage. "Draw advantage from" is the gloss for one possible sense. The gloss encoder produces sense representations r_s using the [CLS] embedding in all models. The MWE representation r_w is an average of constituents for the Bi-encoder and the combination of attention for each code for the Poly-encoder. The DCA Poly-encoder learns separate codes for target and non-target tokens, allowing it to attend differently to the MWE and surrounding context. Scores are the similarity between r_s and r_w computed as the dot product, and the model predicts the sense with the highest score.

Bi-encoder
Bi-encoders for WSD, as defined by Blevins and Zettlemoyer (2020), consist of two BERT (Devlin et al., 2019) models: a context encoder T_c and a gloss encoder T_g, which embed the context and sense glosses into the same embedding space. Given an input sentence c = (w_0, ..., w_n) containing the target words to disambiguate, we first tokenize it and use the context encoder to produce representations for each token. Because tokenization may break words or MWEs up into multiple subwords, word or MWE representations r_w are computed as an average of all included subwords.
Then, for each target word or MWE, the gloss encoder produces a sense representation r_s for each possible sense by encoding its gloss and taking the [CLS] token embedding.
Scores corresponding to possible senses for each target word are computed as the dot product similarity of the word and sense representations, and the model predicts the highest scoring sense.
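The Bi-encoder scoring step described above can be sketched as follows. This is an illustrative NumPy sketch, not the authors' implementation: random arrays stand in for the BERT context and gloss encoder outputs, and the helper names are ours.

```python
import numpy as np

def subword_average(hidden_states, subword_idx):
    # r_w: average of the target word/MWE's subword embeddings
    return hidden_states[subword_idx].mean(axis=0)

def score_senses(r_w, gloss_reps):
    # phi(w, s) = r_w . r_s for each candidate sense; predict the argmax
    scores = gloss_reps @ r_w
    return scores, int(np.argmax(scores))

# Toy stand-ins for the encoder outputs
rng = np.random.default_rng(0)
hidden = rng.normal(size=(10, 8))      # 10 subword states, hidden size 8
glosses = rng.normal(size=(3, 8))      # [CLS] embeddings for 3 sense glosses
r_w = subword_average(hidden, [2, 3])  # target MWE spans subwords 2 and 3
scores, pred = score_senses(r_w, glosses)
```

Because the gloss side depends only on the sense inventory, the `glosses` matrix can be precomputed and cached, which is the efficiency advantage noted above.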

Poly-encoder
Like the Bi-encoder, the Poly-encoder has a context encoder T_c for target word contexts and a gloss encoder T_g for glosses. There is also a new set of parameters that Humeau et al. (2020) refer to as code embeddings, Q. These codes are used as queries to extract information from context representations produced by the context encoder. The inputs to the Poly-encoder are the same as to the Bi-encoder, sense representations r_s are computed identically, and predictions are still the highest scoring sense. However, senses are scored differently.
We take the last hidden state of the context encoder as the context representation r_c = T_c(c), which we use along with the code embeddings Q = (q_1, ..., q_m) in the first dot-product attention step (code-context attention) of the Poly-encoder.
We use a different set of code embeddings for single words and MWEs. The number of embeddings, m, is a hyperparameter, and their dimensionality is the same as the encoders' hidden size. The context representation r_c is used as both keys and values in this dot-product attention module, yielding a code-attended context Y_ctxt. The representation a code q_i extracts is:

y_ctxt^i = Σ_j w_j^{q_i} r_c^j, where (w_1^{q_i}, ..., w_n^{q_i}) = softmax(q_i · r_c^1, ..., q_i · r_c^n)

Sense representations r_s are then used as queries and the code-attended context representations Y_ctxt are used as keys and values in a final dot-product attention module, yielding a gloss-attended code-context. For a given sense s of a word or MWE:

y_final = Σ_i w_i y_ctxt^i, where (w_1, ..., w_m) = softmax(r_s · y_ctxt^1, ..., r_s · y_ctxt^m)

We then take the dot product of the gloss-attended code-context y_final and each gloss embedding r_{s_0}, ..., r_{s_k}, yielding a score for each gloss: ϕ(w, s_i) = y_final · r_{s_i}.
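The two attention steps of the Poly-encoder scoring can be sketched compactly with NumPy. This is a minimal illustration under our own naming, not the authors' code; the arrays stand in for encoder outputs and learned codes.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def poly_encoder_scores(r_c, Q, gloss_reps):
    """r_c: (n, d) context states; Q: (m, d) code embeddings;
    gloss_reps: (k, d) sense representations r_s. Returns k scores."""
    # Code-context attention: each code q_i attends over the context states
    Y_ctxt = softmax(Q @ r_c.T) @ r_c                  # (m, d)
    # Gloss-attended code-context: each r_s attends over the m code outputs
    y_final = softmax(gloss_reps @ Y_ctxt.T) @ Y_ctxt  # (k, d)
    # phi(w, s_i) = y_final_i . r_{s_i}
    return np.einsum("kd,kd->k", y_final, gloss_reps)
```

Note that, as in the Bi-encoder, the gloss representations can still be precomputed; only the two lightweight attention steps depend on the input context.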

Distinct Codes Attention
Since the Poly-encoder was originally designed to compute sentence representations, it contains no mechanism for explicitly focusing on a specific set of target words/subwords. To address this problem, we propose a variation of the Poly-encoder which we call "distinct codes attention" (DCA). We change the code-context attention step of the Poly-encoder so that it can attend differently to target words and the surrounding context, using two sets of code embeddings: one set for target words, Q_t, and one set for non-target words, Q_nt. Since we also maintain different code embeddings for single words and MWEs, this gives us a total of four sets of code embeddings.
In the first attention module, code-context attention, we construct two key matrices, one to be used with the target code queries Q_t and one to be used with the non-target code queries Q_nt. First we create two masks which pick out target or non-target subwords: the target mask M_t, which is 1 at the indices of target subwords and 0 otherwise, and the non-target mask M_nt, which is the opposite. We then multiply each mask by the encoded context r_c to get target and non-target key matrices K_t = M_t r_c and K_nt = M_nt r_c. Next we compute target and non-target query results and add them: QK^T = Q_t K_t^T + Q_nt K_nt^T.
Finally, we apply a softmax and multiply QK^T by the encoded context r_c to yield the code-attended context, Y_ctxt = softmax(QK^T) r_c. The gloss-attended code-context and final scores are then computed identically to the standard Poly-encoder.
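The masked code-context attention of DCA can be sketched as follows. This is an illustrative NumPy sketch under our own naming (random or toy arrays stand in for the learned codes and encoder states), not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dca_code_context(r_c, Q_t, Q_nt, target_idx):
    """r_c: (n, d) encoded context; Q_t/Q_nt: (m, d) target and
    non-target code embeddings; target_idx: target subword indices."""
    n, _ = r_c.shape
    M_t = np.zeros((n, 1))
    M_t[target_idx] = 1.0               # 1 at target subword indices
    M_nt = 1.0 - M_t                    # non-target mask is the opposite
    K_t, K_nt = M_t * r_c, M_nt * r_c   # masked key matrices
    QKt = Q_t @ K_t.T + Q_nt @ K_nt.T   # add target and non-target results
    return softmax(QKt) @ r_c           # Y_ctxt = softmax(QK^T) r_c
```

The masking ensures the target codes can only draw attention mass toward target subwords (and the non-target codes toward everything else) before the two score matrices are summed.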

MWE Identification Pipeline
We use a rule-based pipeline inspired by Kulkarni and Finlayson (2011) for MWE identification. First, we compute initial candidates as all combinations of words in a sentence whose lemmas correspond to a MWE in our lexicon. That is, any group of words that, when lemmatized, corresponds to a known MWE, regardless of order or location in the sentence, is a candidate. This ensures we rarely miss known MWEs, but also produces many false positives, such as in that in "That was back in 1954, 55 years ago." Next, we filter the candidates by removing MWEs which are out of order or too gappy (>3 words in between constituents), and optionally by discarding MWE candidates judged to be incorrect by our DCA Poly-encoder (or another) model. We refer to the combination of rule-based extraction and filters with no model as the rule-based pipeline. Since the model is applied as a final filter after extraction and the other filters, it can only improve precision. While the heuristic filters involving order and gappiness exclude some valid MWEs as well, they empirically improved performance on development data, and the majority of exclusions made by these filters correctly remove false positives from candidate generation, as can be seen in Table 1. Note that many candidates excluded by one filter would also be excluded by another filter later in the pipeline. In cases of overlap between candidates remaining after filtering, we use only the candidate judged to be most likely by our model (or least gappy, in pipelines without models).
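The candidate extraction and gap filter can be sketched with a toy lexicon. This is a simplified illustration, not the authors' pipeline: `LEXICON` and `MAX_GAP` are hypothetical names, lemma sets are matched order-insensitively, and the real system additionally checks constituent order against the lexicon form.

```python
from itertools import combinations

# Hypothetical mini-lexicon mapping a set of lemmas to a MWE entry
LEXICON = {frozenset(["kick", "the", "bucket"]): "kick_the_bucket",
           frozenset(["break", "up"]): "break_up"}
MAX_GAP = 3  # drop candidates with >3 intervening words

def extract_candidates(lemmas):
    """All word combinations whose lemmas match a lexicon entry."""
    cands = []
    for r in range(2, len(lemmas) + 1):
        for combo in combinations(range(len(lemmas)), r):
            key = frozenset(lemmas[i] for i in combo)
            if key in LEXICON:
                cands.append((combo, LEXICON[key]))
    return cands

def gap_filter(cands):
    """Keep only candidates whose constituents are not too gappy."""
    return [(c, m) for c, m in cands
            if max(b - a - 1 for a, b in zip(c, c[1:])) <= MAX_GAP]

cands = gap_filter(extract_candidates(
    ["he", "kick", "the", "bucket", "down", "the", "hill"]))
```

As the paper notes, this exhaustive matching rarely misses known MWEs but over-generates (here, the second "the" also yields a spurious candidate), which is exactly what the later filters and the model are for.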

Model Filter
Because all of our MWE candidates correspond to words (and consequently subwords) in the input sentence, we can produce a representation r_w for each MWE candidate, along with scores for each of their possible senses, the same way we do for words. However, since no MWE has a sense corresponding to the case where that candidate is a false positive, we define a special sense s_n representing the case where the candidate is not a MWE.
Since s_n has no gloss, we cannot use the gloss encoder to compute a representation for it; instead we make this representation a learnable parameter r_{s_n}, with the same dimensionality as the model's hidden size. This can then be used in our model's scoring functions to compute a score for the candidate not being a MWE. When using a model to filter, we exclude any MWE candidates whose highest scoring sense is the "not a MWE" sense s_n, retaining only candidates for which the following holds:

max_{s ∈ S_w} ϕ(w, s) > ϕ(w, s_n)

Note that since this filtering process involves computing scores for all possible senses, it also effectively performs WSD on any polysemous MWEs.
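The model filter reduces to a single comparison once the sense scores are computed. Below is an illustrative NumPy sketch with toy random vectors standing in for the trained encoders and the learned r_{s_n}; the names are ours, not the authors'.

```python
import numpy as np

def keep_candidate(sense_scores, not_mwe_score):
    # Retain w only if some real sense outscores the "not a MWE" sense s_n:
    # max_{s in S_w} phi(w, s) > phi(w, s_n)
    return max(sense_scores) > not_mwe_score

rng = np.random.default_rng(1)
r_w = rng.normal(size=8)              # candidate MWE representation
gloss_reps = rng.normal(size=(3, 8))  # encoded sense glosses
r_sn = rng.normal(size=8)             # learnable "not a MWE" embedding
keep = keep_candidate(gloss_reps @ r_w, r_sn @ r_w)
```

When the candidate is kept, the argmax over the real senses doubles as the WSD prediction, which is why a single forward pass serves both tasks.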
Experimental Setup

Lexicon
We use WordNet (Miller, 1995) as our MWE lexicon for all experiments, treating every entry containing the character "_" as a MWE. All sense glosses are taken from WordNet 3.0.

Training Data
We train our models on SemCor (Miller et al., 1993), a WSD dataset containing a total of 226,036 examples annotated with senses from WordNet. In order to make the data usable for MWE identification in addition to WSD, we preprocess it in the following ways. First, we explicitly mark any words whose lemma includes the character "_" as MWEs, such that during training the possible labels for these MWEs also include the "not a MWE" sense.

MWE Identification Evaluation
We evaluate our system on the English section of the PARSEME 1.1 Shared Task (Ramisch et al., 2018) and the DiMSUM dataset (Schneider et al., 2016). We do not evaluate on STREUSLE (Schneider et al., 2018) as it requires predicting lexical categories and supersenses (the STREUSLE evaluation script rejects input without appropriate lexical categories/supersenses), while our system predicts only the presence or absence of MWEs. To measure WSD performance, we use the evaluation framework established by Raganato et al. (2017) and evaluate on the English all-words task.

PARSEME 1.1
The PARSEME data focuses on verbal MWEs, containing 3471 sentences in the training set and 3965 in the test set. Because the data contains only verbal MWEs, when evaluating on PARSEME we limit the output of our pipeline to verbal MWEs.

DiMSUM
The DiMSUM data consists of online reviews, tweets and TED Talks which have been annotated with MWEs and other information. There are 4799 sentences in the training set, and 1000 in the test set. Because noun phrases are marked as MWEs in DiMSUM, when evaluating on DiMSUM our pipeline also marks consecutive nouns as MWEs.

Results
Table 4 shows MWE identification performance for the rule-based pipeline (Section 3.4), and the same pipeline with the DCA Poly-encoder included as a final filter, for various training data. Comparisons to the Bi-encoder and standard Poly-encoder can be found in Section 6.1, or in detail in Appendix C.
Our system achieves moderate performance on PARSEME and competitive performance on DiMSUM when trained only on the modified SemCor data. When fine-tuned on both MWE identification datasets it further improves, reaching state-of-the-art performance on DiMSUM. Systems fine-tuned on either PARSEME or DiMSUM alone perform even better on their corresponding test set, but worse on the other test set, likely due to differences in domain and MWE type between the datasets.
High precision stands out as a strength of our approach, but it suffers from low recall; even the rule-based pipeline with no model filter lags behind other systems in recall. We attribute this mainly to the issue of lexicon dependence described in Section 8; MWEs missing from our lexicon account for a majority of our false negatives, as we show in our error analysis (Section 6.2). These findings echo Savary et al. (2019) on the importance of lexicons for MWE identification, and suggest that there is room to improve performance by expanding the lexicon. While it is difficult to pinpoint exactly why we achieve state-of-the-art F1 on DiMSUM and not PARSEME, one significant difference is that more than 40% of the DiMSUM test set MWEs are noun phrases, most of which we can detect without relying on a lexicon (as described in Section 4.4.2). For PARSEME, we must always rely on our lexicon.

For trainable models we report the mean (± standard deviation for the F1 score) of three runs with random seeds. Because our system uses gold POS tags/lemmas to look up sense glosses, we compare against systems using gold information where available, such as Liu et al. (2021) and Kirilin et al. (2016).

WSD Performance
We compare performance on the English WSD allwords task to Blevins and Zettlemoyer (2020), a similar Bi-encoder system trained only for WSD.
Recent work in WSD has achieved higher scores (Barba et al., 2021b), but our goal is to understand how the addition of the MWE identification task affects WSD performance.

Table 5: English WSD all-words task F1.

Our system retains most but not all of its WSD performance: F1 is 2% lower when trained on our modified SemCor data and 6% lower when fine-tuned on PARSEME+DiMSUM. We attribute this drop in F1 from fine-tuning to potentially confusing labels in the fine-tuning data: the gold label of positive examples is always the MWE's first sense, which may be incorrect for polysemous MWEs, and as we show in Section 6.2, many negative example MWEs actually have senses appropriate for the context they are in. Consequently, the model cannot rely entirely on matching sense glosses to input contexts for this data and may forget some knowledge useful for WSD.
Comparing models, the DCA Poly-encoder outperforms the standard Poly-encoder on WSD, but its performance does not significantly differ from the Bi-encoder. We leave Poly-encoder architectures better suited for WSD to future work. We also see that while removing rule-based filters from the DCA pipeline lowers PARSEME F1, it slightly raises DiMSUM F1, suggesting that the necessity of these filters depends on the data. However, removing the rule-based filters only works because the DCA Poly-encoder can accurately exclude false positives: removing the same filters from the purely rule-based pipeline results in a very low F1. Finally, the DCA Poly-encoder substantially outperforms the standard Poly-encoder (PolyEnc) on both datasets and surpasses the Bi-encoder on PARSEME, demonstrating that our DCA Poly-encoder model can improve MWE identification performance.

Error Analysis
We perform an error analysis on the output of our SemCor-trained and fine-tuned models on both test sets, taking 50 false positives and 50 false negatives from each combination of model and dataset (for a total of 400 examples). Select examples can be seen in Table 6, and detailed statistics about the outcome of our analysis can be found in Appendix B.
We find that for >80% of false positives (computed excluding false positives from the DiMSUM noun phrase detector, which does not use the lexicon) a sense from our lexicon was appropriate for the given context, but the target words were not marked as a MWE in the data. Many of these MWEs were present in our lexicon but nowhere in the test set, suggesting discrepancies between the scope of what WordNet and these datasets respectively define as MWEs. However, there were also a number of false positives that are marked as MWEs in other places in the dataset. This could happen if these combinations of words were only marked as MWEs when they had specific meanings or particularly non-compositional semantics, but this was not the case for the examples we examined. These results speak to the difficult and potentially subjective nature of annotating MWEs, and we hope to see work exploring this area in the future.
For false negatives, >85% were cases where the target MWE was missing from the lexicon, confirming that the bottleneck for recall is our system's lexicon. For the majority of the remaining false negatives, an appropriate sense for the given context was present in our lexicon, meaning that these were failures of our MWE identification system and not the lexicon. However, the fact that errors in matching meaning to context account for <20% of false positives and <15% of false negatives shows that our model has successfully learned how to judge whether a group of words constitutes a MWE with a given meaning. See the true negatives in Table 6 for examples of MWEs excluded based on meaning.

Conclusion
In this work, we present an approach to MWE identification using rule-based candidate extraction with a model filter, achieving strong results on the PARSEME 1.1 English data and state-of-the-art results for MWE identification on the DiMSUM dataset. Our system performs both MWE identification and WSD with the same model, demonstrating that these tasks can be tackled together. We also introduce a modified Poly-encoder architecture better suited to MWE identification.
Our system's strength is its high precision for MWE identification. We show its low recall to be a function of lexicon size, and in future work we intend to expand the lexicon by mining MWEs and generating glosses for them, which has the potential to substantially increase recall for lexicon-based systems. Improved approaches for multitask training of MWE identification/WSD models could also be valuable; the ideal pipeline would be competitive with state-of-the-art systems in both tasks, and not just MWE identification.
Ideal applications of our system include MWE identification when a lexicon of target MWEs is available, or cases where quickly performing both MWE identification and WSD is valuable, such as in language learning and assisted reading tools.

Limitations
While our system performs well, the output of our MWE pipeline is limited to MWEs that are present in our lexicon or detectable with simple rules. Furthermore, because our model uses gloss text as input, we cannot effectively filter MWE candidates without sense glosses. Consequently, our approach to MWE identification depends on the presence of a high-quality lexicon which includes MWE lemmas and sense glosses, making it ill-suited for scenarios where data like this may not be available yet, such as in low-resource languages. However, we are optimistic that work in MWE discovery (Ramisch et al., 2010) and gloss/definition generation (Bevilacqua et al., 2020) will help to mitigate this problem by automating parts of the data creation process.

A Training Details

All models were trained on a single GeForce GTX TITAN X GPU, with hyperparameters tuned using Weights & Biases (Biewald, 2020) to run random sweeps and track performance. Separate sweeps were run for the Bi-encoder and Poly-encoder, each having a maximum of 20 runs and using early stopping to terminate runs with poor performance. Our total compute time was approximately 160 days, though this would have been significantly lower using a newer model of GPU. Our models have 220M parameters, and fully training for 15 epochs on the modified SemCor data takes approximately 1.2 × 10^17 FLOPs. We used Prodigy (Montani and Honnibal) as our annotation tool. Further detail, including all training hyperparameters and instructions for reproduction, can be found in our published code.

B Error Analysis Details
This appendix contains details about the frequency with which we found various types of false positives and false negatives in our error analysis.

B.1 PARSEME
In the table below, Def? represents the % of false positives where a sense appropriate for the predicted MWE was present in our lexicon. MWE? represents the % of false positives where the MWE was present in other sentences in the dataset and, for false negatives, the % where it was present in our lexicon.

B.2 DiMSUM
Our results on DiMSUM are similar to those of PARSEME, except that for the system using the SemCor model 22% of the false positives were from the rule-based consecutive noun tagger, with that number increasing to 56% for the fine-tuned model (the false positive rate drops substantially after fine-tuning the filtering model, as can be seen in Table 4, which leads to these errors accounting for a higher percentage of total false positives).

Table 1: Tokens excluded by rule-based filters. True negatives represent correct exclusions (i.e., false-positive candidates), and false negatives represent incorrect exclusions.

Table 2: SemCor after each addition of data.

Since some discontiguous MWEs in SemCor are labeled only on a subset of the included words, we add stranded constituents to their parent MWE by attaching nearby words whose lemmas match constituents missing from the labeled MWE³. Finally, because SemCor contains no labeled negative examples of MWEs (instances where the constituent words of a MWE are all present but their meaning in context does not match any of the MWE senses) we add these ourselves. We generate synthetic negative examples using the rule-based pipeline with its filters inverted, marking combinations of words whose lemmas correspond to a known MWE but are out of order or very gappy as negative examples whose gold label is the "not a MWE" sense. We randomly add negative examples in this fashion until they account for just over 50% of the MWE examples in the training data. To mitigate the risk of the model learning only the heuristics used to generate these synthetic negatives, we also manually annotate a small number of examples. We do this by running the rule-based pipeline (Section 3.4) on the SemCor data and annotating output MWEs with their appropriate sense from WordNet or the "not a MWE" sense based on context. Because we exclude words already marked as MWEs and many MWEs in SemCor have already been annotated, >50% of the newly annotated examples are negative.

Given a target w with gold sense gloss g_s and |S_w| = j possible senses in the lexicon, this formalizes to:

L(w, g_s) = −ϕ(w, g_s) + log Σ_{x∈X} exp(ϕ(w, x))

X = {s_0, ..., s_j, s_n} if w is a MWE, and X = {s_0, ..., s_j} otherwise
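The cross-entropy objective with the extra "not a MWE" label can be sketched numerically. This is an illustrative sketch under our own naming: `phi` holds precomputed scores with the s_n score last, and the function is not the authors' training code.

```python
import numpy as np

def mwe_wsd_loss(phi, gold_idx, is_mwe):
    """Cross-entropy over the label set X.
    phi: scores [phi(w,s_0), ..., phi(w,s_j), phi(w,s_n)], s_n last.
    For plain words, the "not a MWE" score is excluded from X."""
    x = phi if is_mwe else phi[:-1]
    # L(w, g_s) = -phi(w, g_s) + log sum_{x in X} exp(phi(w, x))
    return -x[gold_idx] + np.log(np.exp(x).sum())
```

For MWE candidates the softmax normalizer includes s_n, so the model is pushed to rank the gold sense above the "not a MWE" option as well as above the other senses.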

Table 4: Test set results on PARSEME 1.1 English and DiMSUM for MWE identification. All DCA Poly-encoder models function as a final filter after the rule-based pipeline. Training data is listed in parentheses: S=SemCor, P=PARSEME, D=DiMSUM.

Table 6: Representative errors (FP/FN) and incorrect MWEs successfully excluded by the model filter (TN).

| FP | ...were propped up on a foot-warmer, ... | prop up | never marked as MWE in dataset
PARSEME | FN | Never mind, Mrs. Bray will join you later. | never mind | missing from lexicon
PARSEME | FP | ...his mind drifted off to the accounts... | drift off | sense "fall asleep" does not apply
PARSEME | TN | ...as we sat side by side... | sit by | sense "be inactive" does not apply
DiMSUM | FP | Aww, thank you. | thank you | marked as MWE in 4 other sentences
DiMSUM | FN | All our dreams can come true,... | come true | missing from lexicon
DiMSUM | FN | ...this was a breathe of fresh air. | | present in lexicon; model filter false negative
DiMSUM | TN | ...impact my wardrobe has on the environment. | have on | sense "dress in" does not apply

Table 8: PARSEME Error Analysis

The Def? and MWE? percentages for false positives in the table below are computed excluding consecutive noun tagger false positives.

Table 9: DiMSUM Error Analysis

Table 10 below contains full scores for systems omitted from the main paper for brevity.