Taxonomy of Problems in Lexical Semantics

Semantic tasks are rarely formally defined, and the exact relationship between them is an open question. We introduce a taxonomy that elucidates the connection between several problems in lexical semantics, including monolingual and cross-lingual variants. Our theoretical framework is based on the hypothesis of the equivalence of concept and meaning distinctions. Using algorithmic problem reductions, we demonstrate that all problems in the taxonomy can be reduced to word sense disambiguation (WSD), and that WSD itself can be reduced to some problems, making them theoretically equivalent. In addition, we carry out experiments that strongly support the soundness of the concept-meaning hypothesis, and the correctness of our reductions.


Introduction
This paper proposes a taxonomy of several problems in lexical semantics, consisting of a clear definition of each task, and a theory-driven analysis establishing the relationships between them (Figure 1). The taxonomy includes word sense disambiguation (WSD), word-in-context (WiC), lexical substitution (LexSub), and word synonymy (Syn). We consider their monolingual, cross-lingual, and multilingual variants. With the exception of WSD, they are all defined as binary decision problems.
Our theoretical problem formulations correspond to well-studied semantic tasks. In practice, these tasks are rarely precisely defined, and instead depend on annotated datasets. For example, the definitions of lexical substitution differ between papers, and involve imprecise terms, such as "the overall meaning of the context" or "suitable substitute." The exact relationships between these tasks have not been rigorously demonstrated. Altogether, the recent literature suggests that a more detailed taxonomy is very much needed.
We start by formally defining the problems in terms of concepts and contexts, and proceed to determine their relative hardness by specifying reduction algorithms which produce a solution for one problem by applying an algorithm for another. In particular, we demonstrate that all problems in the taxonomy can be reduced to WSD, which confirms the principal role of this problem in lexical semantics. Furthermore, we show by mutual reductions that WSD and multilingual variants of WiC and LexSub are theoretically equivalent. Finally, we shed light on how they relate to lexical translation and wordnets.
The soundness of the problems in the taxonomy hinges on the consistency of judgments of sameness of word meaning. Hauer and Kondrak (2022) demonstrate the theoretical equivalence of the monolingual WiC and WSD via mutual reduction. We posit the following generalization of their sense-meaning hypothesis to multilingual concepts: different word instances have the same meaning if and only if they express the same concept. This empirically falsifiable proposition, which we refer to as the concept-meaning hypothesis, allows us to incorporate multilingual tasks, including lexical synonymy and substitution, into our theoretical framework.
In addition to showing that our theoretical propositions follow directly from our definitions and assumptions, we perform a series of experiments for the purpose of testing their empirical applicability and soundness. In particular, we test three problem reductions on standard benchmark datasets using independently developed systems based on pre-trained language models. Manual error analysis reveals no counter-examples to our concept-meaning hypothesis.
Our main contribution is a novel taxonomy of formally-defined problems, which establishes the reducibility or equivalence relations between the principal tasks in lexical semantics. In addition, we carry out a series of experiments that support the correctness of our theoretical findings.

Theoretical Formalization
In this section, we formally define the problems in our proposed taxonomy, and discuss the relationship between these theoretical problems and the computational tasks addressed in prior work.

Words
All semantic problems in Figure 1 take at least one word as a parameter. In our definitions, a word is not necessarily an orthographic word, but rather a triple consisting of a lemma, a part of speech, and a language. The problems are divided into three categories based solely on the language of the words (rather than contexts): monolingual (same language), cross-lingual (different languages), and multilingual (same or different languages). Thus, a multilingual problem can be seen as the union of the corresponding monolingual and cross-lingual problems. While this categorization theoretically admits "monolingual" problem instances consisting of a word in one language and a context in a different language, such instances are rare in practice.

Contexts and Concepts
Alternatively, we can categorize semantic problems according to the number of contexts which must be considered in each instance: zero, one, or two, respectively, in the leftmost three columns of Figure 1. Contexts are denoted by variable names starting with C. We broadly define a context as a discourse (not necessarily a sentence) with a focus, which is a word or sequence of words that expresses a specific concept. Contexts that consist of the same discourse but differ in focus are considered distinct.
The expression "a word expresses a concept given a context" signifies that the word can be used to refer to the concept that corresponds to the focus of that context. Note that the word itself is not required to occur in the context, or even match the language of the context. For example, consider the context "bats live in caves" with the focus bats, which disambiguates the word bat to its animal sense. The focus can be expressed by the word bat or its synonym chiropteran. The languages of the word and the context need not be the same. For example, the Spanish context "un murciélago entro en mi casa" disambiguates the English word bat as an animal rather than an instrument.
A lexical concept, or simply concept, refers to a discrete word meaning. A concept gloss, such as "flying nocturnal rodent," is a special type of context, in which the entire definition is the focus, and which uniquely determines the concept. We assume that the concept gloss Cs which defines the meaning of the concept s can be expressed in any language.
We assume the availability of complete sets of words (i.e., lexicons) and lexical concepts. The methods for creating such resources are beyond the scope of this paper.

Monolexical Problems
We first define three problems that take a single word argument. We refer to these theoretical problems by the same acronyms as their corresponding computational tasks: WSD, TSV, and WiC.
Word sense disambiguation (WSD) is the task of classifying a word in context according to its sense, given an inventory of possible senses for each word. For each word, there is a one-to-one mapping between its senses and the concepts that it can express. We can therefore define the WSD problem more generally, to return a concept rather than a sense. This avoids the need for a predefined sense inventory for each word.
WSD(C, w) := "the concept which is expressed by the word w given the context C"
Note that this formulation does not require the word to occur in the context. By convention, the return value of the WSD predicate is undefined if the word is not meaningful given the context; for example, the English word metre does not express any concept given the Italian context "la metro di Roma è efficiente." In contrast, any binary predicate is assumed to return FALSE in such cases.
Target sense verification (Breit et al., 2021, TSV) is the binary classification task of deciding whether a given word in a given context expresses a given sense. As with WSD, we define the TSV problem on concepts rather than senses. We assume that the concept s is represented by its gloss Cs.
TSV(C, w, s) := "the word w expresses the concept s given the context C"
The TSV problem can be viewed as a binary analogue of the WSD problem, such that the following equivalence holds:
TSV(C, w, s) ⇔ WSD(C, w) = s
The word-in-context task (WiC) is a binary classification task proposed by Pilehvar and Camacho-Collados (2019): given a pair of sentences, decide whether or not a word has the same meaning in both sentences. We define the corresponding WiC problem using concepts, on the basis of the concept-meaning hypothesis.
WiC(Cx, Cy, w) := "the word w expresses the same concept given the contexts Cx and Cy"
Hauer and Kondrak (2022) demonstrate the equivalence of WiC, TSV, and WSD by pairwise reductions, which are denoted by purple arrows in Figure 1. In particular, the following formula specifies the reduction of WiC to WSD:
WiC(Cx, Cy, w) ⇔ WSD(Cx, w) = WSD(Cy, w)
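The two WSD-based equivalences for TSV and WiC can be illustrated with a minimal Python sketch. The lookup table `TOY_WSD`, the concept labels, and the second and third example contexts are hypothetical stand-ins for a real WSD system; only the control flow mirrors the definitions, including the convention that a binary predicate returns FALSE when WSD is undefined.

```python
from typing import Optional

# Hypothetical toy oracle: (context, word) -> concept label.
# A real WSD system would replace this table.
TOY_WSD = {
    ("bats live in caves", "bat"): "bat.animal",
    ("a bat flew out of the attic", "bat"): "bat.animal",
    ("she swung the bat at the ball", "bat"): "bat.club",
}

def wsd(context: str, word: str) -> Optional[str]:
    """Concept expressed by `word` given `context`; None stands for 'undefined'."""
    return TOY_WSD.get((context, word))

def tsv(context: str, word: str, concept: str) -> bool:
    """TSV(C, w, s): TRUE iff WSD(C, w) is defined and equals s."""
    return wsd(context, word) == concept

def wic(cx: str, cy: str, word: str) -> bool:
    """WiC(Cx, Cy, w): TRUE iff WSD is defined for both contexts and agrees."""
    sx, sy = wsd(cx, word), wsd(cy, word)
    return sx is not None and sx == sy
```

The explicit `is not None` check in `wic` keeps two undefined results from being treated as "the same concept".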

Word-in-Context Problems
We now introduce a set of binary predicates which include WiC and its variants. We start with the most general problem of the set, MultiWiC, and then define MonoWiC and CL-WiC as its special cases, in which the two words wx and wy are constrained to be either in the same or different languages, respectively.
MultiWiC(Cx, Cy, wx, wy) := "the words wx and wy express the same concept given the contexts Cx and Cy, respectively"
The WiC problem defined in Section 2.3 is a special case of MonoWiC, in which wx = wy.
MonoWiC(Cx, Cy, wx, wy) := "the words wx and wy from the same language express the same concept given the contexts Cx and Cy, respectively"
Martelli et al. (2021) extend the WiC task to include cross-lingual instances, which consist of a pair of contexts in different languages, in which the two focus words have the same meaning. Our definition of the corresponding theoretical problem is similar:
CL-WiC(Cx, Cy, wx, wy) := "the words wx and wy from different languages express the same concept given the contexts Cx and Cy, respectively"
Clearly, any instance of MultiWiC is either an instance of MonoWiC or CL-WiC.

Lexical Substitution Problems
The next set of problems each involve a pair of words in a single context. These problems formalize the semantic task of lexical substitution (McCarthy and Navigli, 2007), and its different variants and settings, such as cross-lingual substitution (Mihalcea et al., 2010). Our definitions are more precise than conventional ones, as we define substitutes on the basis of identity of expressed concepts. By virtue of our concept-meaning hypothesis, the definitions formalize the notions of "meaning-preserving substitutions" and "correct translations" present in previous work. However, they are restricted to lexical substitutions, excluding compositional compounds and phrases.
MonoLexSub(C, wx, wy) := "the words wx and wy from the same language express the same concept given the context C"
In other words, wx and wy are mutually substitutable given the context C. For example, MonoLexSub returns TRUE given C = "the gist of the prosecutor's argument", wx = core, and wy = heart.
The CL-LexSub problem is a cross-lingual counterpart of MonoLexSub. The definition of CL-LexSub is the same as that of MonoLexSub, except that the two words are required to be in different languages. For example, CL-LexSub("she batted the ball", bat, murciélago) returns FALSE.
CL-LexSub(C, wx, wy) := "the words wx and wy from different languages express the same concept given the context C"
Finally, we define a multilingual lexical substitution problem which generalizes MonoLexSub and CL-LexSub by removing their respective language constraints:
MultiLexSub(C, wx, wy) := "the words wx and wy from any language(s) express the same concept given the context C"
While the goal of many conventional lexical substitution datasets is to produce sets of substitutes, these generative problems are reducible to the corresponding binary classification problems by iterating over the set of substitution candidates. More formally, the problem of generating lexical substitutes reduces to MultiLexSub by returning the set: {w | MultiLexSub(C, wx, w)}.
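The reduction of generative substitution to the binary predicate amounts to a filter over candidates. The following sketch assumes a toy table `EXPRESSES` and a hypothetical candidate list; the non-substitute candidate ring and the concept label are illustrative only.

```python
# Hypothetical toy data: the concept each word expresses in the single context.
CONTEXT = "the gist of the prosecutor's argument"
EXPRESSES = {
    (CONTEXT, "gist"): "central_meaning",
    (CONTEXT, "core"): "central_meaning",
    (CONTEXT, "heart"): "central_meaning",
}

def multi_lex_sub(context, wx, wy):
    """MultiLexSub(C, wx, wy): both words express the same concept given C."""
    sx = EXPRESSES.get((context, wx))
    sy = EXPRESSES.get((context, wy))
    return sx is not None and sx == sy

def generate_substitutes(context, wx, candidates):
    """Return {w | MultiLexSub(C, wx, w)} restricted to a finite candidate list."""
    return {w for w in candidates if multi_lex_sub(context, wx, w)}

substitutes = generate_substitutes(CONTEXT, "core", ["gist", "core", "heart", "ring"])
```

As in the set-builder formula, the target word itself is a member of the result; a practical system would likely remove it before presenting substitutes.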

Word Synonymy Problems
Our final set of semantic problems are defined on a pair of word lemmas, without any context parameters.
The MonoSyn predicate formalizes the relation of word synonymy in the monolingual setting. Given two words in the same language, it returns TRUE iff they are mutually substitutable in some context.
MonoSyn(wx, wy) := "the words wx and wy from the same language express the same concept in some context"
For example, MonoSyn(core, heart) is TRUE because there exists a context in which the two words express the same concept (cf. Section 2.5). The MonoSyn problem formalizes the linguistic Substitution Test for synonymy: wx and wy are synonyms if the meaning of a sentence that contains wx does not change when wy is substituted for wx (Murphy and Koskela, 2010).
We define the cross-lingual synonymy problem CL-Syn in a similar manner. The only difference from MonoSyn is that the two words are required to be from different languages.
CL-Syn(wx, wy) := "the words wx and wy from different languages express the same concept in some context"
The CL-Syn predicate corresponds to the relation of translational equivalence between words. Two words in different languages are translationally equivalent if there exists a context in which they are literal translations. For example, CL-Syn(heart/EN, coeur/FR) is TRUE because the two words are mutual translations given the context "the heart of the matter." As with the other problem families, we unify MonoSyn and CL-Syn into a single predicate MultiSyn, which places no constraints on the language of the given words.
MultiSyn(wx, wy) := "the words wx and wy from any language(s) express the same concept in some context"
MultiSyn is not only a generalization but also the union of the relations of synonymy and translational equivalence, which are represented by MonoSyn and CL-Syn, as postulated by Hauer and Kondrak (2020).

Problem Reductions
Given an algorithm for a problem Q, a P-to-Q reduction solves an instance of a problem P by combining the solutions of one or more instances of Q. The reducibility of P to Q is denoted P ≤ Q. Mutual reductions of two problems to one another, i.e., P ≤ Q and Q ≤ P, demonstrate their equivalence.
In this section, we present several problem reductions, which constitute the main contribution of this paper. The reductions are shown in Figure 1 by the directed arrows from P to Q. The black arrows denote the special cases, which immediately reduce to the more general problems. Taken together, the reductions establish the equivalence of six problems: WSD, TSV, WiC, MonoWiC, MultiWiC, and MultiLexSub. A method which solves any of these problems can be used to construct methods which solve the other problems by applying a sequence of reductions. In addition, a method for one of those six problems can be used to solve any of the other problems in Figure 1, again via reductions.

*Syn ≤ *LexSub ≤ *WiC
We first present a set of six reductions, which are denoted by blue arrows in Figure 1. Each of the corresponding nine problems involves comparing the meanings of a pair of words, given some contexts.
The three lexical substitution problems defined in Section 2.5 can be viewed as special cases of the corresponding word-in-context problems, in which both contexts are identical. Succinctly:
*LexSub(C, wx, wy) ⇔ *WiC(C, C, wx, wy)
The asterisk in these and the following reductions can be replaced on both sides by "Mono", "CL-", or "Multi". To reiterate, a cross-lingual problem explicitly assumes that the input words are in different languages, while a multilingual problem can accept inputs in the same or different languages.
The three word synonymy problems defined in Section 2.6 are reducible to the corresponding lexical substitution problems. In particular, to reduce MultiSyn to MultiLexSub, we search for a concept gloss Cs in which both words express the same concept. Succinctly:
*Syn(wx, wy) ⇔ ∃s : *LexSub(Cs, wx, wy)
The correctness of these six reductions follows from the fact that the (infinite) set of all contexts is partitioned into equivalence classes, each of which corresponds to a single concept.
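The chain Syn ≤ LexSub ≤ WiC can be sketched for the multilingual case as two one-line wrappers around a MultiWiC oracle. The oracle below is a toy built from a hypothetical lexicon and hypothetical glosses (only "flying nocturnal rodent" is taken from the text); the existential quantifier is realized as a search over a finite gloss inventory.

```python
# Hypothetical gloss inventory: concept gloss -> concept label.
GLOSS_CONCEPT = {
    "the central meaning of something": "central_meaning",
    "the organ that pumps blood": "heart_organ",
    "flying nocturnal rodent": "bat_animal",
}
# Hypothetical lexicon: word -> concepts it can express.
LEXICON = {
    "core": {"central_meaning"},
    "gist": {"central_meaning"},
    "heart": {"central_meaning", "heart_organ"},
    "bat": {"bat_animal"},
}

def multi_wic(cx, cy, wx, wy):
    """Toy MultiWiC oracle: each word expresses its context's concept, and they match."""
    sx, sy = GLOSS_CONCEPT.get(cx), GLOSS_CONCEPT.get(cy)
    return (sx is not None and sx == sy
            and sx in LEXICON.get(wx, set()) and sy in LEXICON.get(wy, set()))

def multi_lex_sub(c, wx, wy):
    """LexSub as WiC with both contexts identical."""
    return multi_wic(c, c, wx, wy)

def multi_syn(wx, wy):
    """Syn as an existential search over concept glosses."""
    return any(multi_lex_sub(gloss, wx, wy) for gloss in GLOSS_CONCEPT)
```

The Mono and CL- variants would add a same-language or different-language check on the two words before calling the oracle.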

Reductions to WSD
The reductions in the preceding section demonstrate that all theoretical problems defined in Section 2 can be reduced to MultiWiC. We next demonstrate that all those problems, including MultiWiC itself, can also be reduced to WSD. Thus, an algorithm that solves WSD would be sufficient to solve all other problems. For clarity, the nine reductions in this section are not shown explicitly in Figure 1, with the exception of the crucial MultiWiC-to-WSD reduction, denoted by a red arrow.
Given a method for solving WSD, we can solve any *WiC instance by checking whether the concepts expressed by the two words in the corresponding contexts are the same. This set of reductions generalizes the WiC-to-WSD reduction (Section 2.3):
*WiC(Cx, Cy, wx, wy) ⇔ WSD(Cx, wx) = WSD(Cy, wy)
Similarly, any *LexSub instance can be solved by comparing the concepts expressed by the two words given the single context:
*LexSub(C, wx, wy) ⇔ WSD(C, wx) = WSD(C, wy)
Finally, the word synonymy problems can be solved by searching for a concept which can be expressed by both words.
*Syn(wx, wy) ⇔ ∃s : WSD(Cs, wx) = WSD(Cs, wy)
The correctness of the reductions in this section follows directly from the concept-meaning hypothesis which underlies our theory.
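The WSD-based reductions of this section, together with the language constraints behind the asterisk notation, can be sketched as follows. This is an illustration under simplifying assumptions: a word is reduced to a (lemma, language) pair without a part of speech, the table `TOY_WSD` stands in for a WSD system, and the concept labels are hypothetical.

```python
BAT_EN, BAT_ES = ("bat", "en"), ("murciélago", "es")
CTX_EN = "bats live in caves"
CTX_ES = "un murciélago entro en mi casa"
CTX_HIT = "she batted the ball"
GLOSS = "flying nocturnal rodent"

# Hypothetical toy oracle: (context, word) -> concept; missing means undefined.
TOY_WSD = {
    (CTX_EN, BAT_EN): "bat.animal", (CTX_EN, BAT_ES): "bat.animal",
    (CTX_ES, BAT_EN): "bat.animal", (CTX_ES, BAT_ES): "bat.animal",
    (GLOSS, BAT_EN): "bat.animal", (GLOSS, BAT_ES): "bat.animal",
    (CTX_HIT, BAT_EN): "bat.hit",
}

def wsd(context, word):
    return TOY_WSD.get((context, word))

def language_ok(variant, wx, wy):
    """Language constraint selected by the asterisk: 'Mono', 'CL', or 'Multi'."""
    same = wx[1] == wy[1]
    return {"Mono": same, "CL": not same, "Multi": True}[variant]

def wic(variant, cx, cy, wx, wy):
    sx, sy = wsd(cx, wx), wsd(cy, wy)
    return language_ok(variant, wx, wy) and sx is not None and sx == sy

def lex_sub(variant, c, wx, wy):
    return wic(variant, c, c, wx, wy)

def syn(variant, wx, wy, glosses):
    return any(lex_sub(variant, g, wx, wy) for g in glosses)
```

With this toy table, `lex_sub("CL", CTX_HIT, BAT_EN, BAT_ES)` is false, which matches the cross-lingual example given in Section 2.5.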

MultiWiC ≤ MultiLexSub
We close this section by demonstrating that MultiWiC is reducible to MultiLexSub, which is denoted by a red arrow in Figure 1. This reduction, along with the reverse reduction presented in Section 3.1, establishes the equivalence between the two problems. Formally:
MultiWiC(Cx, Cy, wx, wy) ⇔ MultiLexSub(Cx, wx, wy) ∧ MultiLexSub(Cy, wy, wx) ∧ ∀w : MultiLexSub(Cx, wx, w) ⇔ MultiLexSub(Cy, wy, w)
The first two terms on the right-hand side of the reduction formula test whether the two words are mutually substitutable in their respective contexts. The universal quantifier ensures that every substitute in one of the contexts is also an appropriate substitute in the other context, and vice versa.
The correctness of this reduction hinges on the assumption that there are no universal colexifications (Bao et al., 2021), which states that for any pair of concepts, there exists some language which lexifies but does not colexify them. In other words, there exists a language in which no word can express both concepts. Therefore, if the sets of contextual synonyms of wx in Cx and wy in Cy are identical, the concept expressed by the two word tokens must be the same.
In theory, the universal quantifier in the reduction formula is defined over all words in all languages. In practice, only the synonyms and translations of the two words need to be checked, and a smaller set of diverse languages may be sufficient to obtain good accuracy.
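The role of the quantifier, and of the candidate set used to approximate it, can be seen in a small sketch. The reduction below checks mutual substitutability and then compares substitute sets over a finite candidate list; the lexicon, the concept labels, and the second English context are hypothetical toy data.

```python
# Hypothetical toy data: English "bat" colexifies two concepts, Spanish does not.
SENSES = {"bat": {"animal", "club"}, "murciélago": {"animal"}}
CONCEPT_OF = {
    "bats live in caves": "animal",
    "un murciélago entro en mi casa": "animal",
    "she swung the bat at the ball": "club",
}

def multi_lex_sub(c, wx, wy):
    """Toy oracle: both words can express the concept of the context's focus."""
    s = CONCEPT_OF.get(c)
    return s is not None and s in SENSES.get(wx, set()) and s in SENSES.get(wy, set())

def multi_wic(cx, cy, wx, wy, candidates):
    """MultiWiC via MultiLexSub, with the quantifier limited to `candidates`."""
    if not (multi_lex_sub(cx, wx, wy) and multi_lex_sub(cy, wy, wx)):
        return False
    return all(multi_lex_sub(cx, wx, w) == multi_lex_sub(cy, wy, w)
               for w in candidates)

ANIMAL, CLUB = "bats live in caves", "she swung the bat at the ball"
english_only = multi_wic(ANIMAL, CLUB, "bat", "bat", ["bat"])
with_spanish = multi_wic(ANIMAL, CLUB, "bat", "bat", ["bat", "murciélago"])
```

With English candidates alone the two uses of bat are wrongly judged identical; adding a word from a language that does not colexify the two concepts separates them, which is the intuition behind the no-universal-colexification assumption.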

Relationship to Synsets
A wordnet is a theoretical construct which is composed of synonym sets, or synsets, such that each synset corresponds to a unique concept, and each sense of a given word corresponds to a different synset. Actual wordnets, such as Princeton WordNet (Miller, 1995), are considered to be imperfect implementations of the theoretical construct.
We define the following monolexical problem, which decides whether a given word can express a given concept:
Sense(w, s) := "the word w expresses the concept s in some context"
An algorithm for the Sense problem could be used to decide whether a given word belongs to the synset that corresponds to a given concept.

*Syn ≤ Sense ≤ WSD
The word synonymy problems defined in Section 2.6 are reducible to the Sense problem. Two words are synonyms if they both express the same concept in some context. In particular, to reduce MultiSyn to Sense, we search for a concept which can be expressed by both words.
MultiSyn(wx, wy) ⇔ ∃s : Sense(wx, s) ∧ Sense(wy, s)
A monolingual wordnet can be converted into a thesaurus, in which the entry for a given word consists of all of its synonyms. A bilingual wordnet can be converted into a translation dictionary, in which the entry for a given word consists of all its cross-lingual synonyms, possibly grouped by sense, and accompanied by glosses.
Given a method for solving WSD, we can solve a Sense instance by checking whether the word expresses the concept given the context of its gloss. Formally:
Sense(w, s) ⇔ WSD(Cs, w) = s
The correctness of this reduction follows from the assumption that a concept gloss uniquely determines the concept. Under our definitions, given a concept gloss, the WSD predicate can only return the corresponding concept, and does so if and only if the given word can express that concept; otherwise the return value is undefined.
The reducibility of Sense to WSD implies that implementing the WSD predicate as it is defined in Section 2.3 would make it possible to construct synsets from nothing more than a list of concept glosses, as well as correct and expand existing wordnets to new domains and languages. In fact, any of the set of six WSD-equivalent problems (Figure 1) could be used for these tasks; we therefore refer to them as wordnet-complete (WN-complete).
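As an illustration of synset construction from glosses, the sketch below implements Sense by calling a WSD oracle on the concept gloss, and then populates one synset per gloss. The oracle is a toy that consults a hypothetical lexicon, so the example is circular by design; with a real WSD system the lexicon `SENSES` would not be available and would instead be the output. The second gloss and the concept labels are assumptions.

```python
# Hypothetical gloss list: concept -> gloss.
GLOSSES = {
    "bat.animal": "flying nocturnal rodent",
    "bat.club": "a club used for hitting a ball",
}
# Hypothetical ground truth used only by the toy oracle.
SENSES = {
    "bat": {"bat.animal", "bat.club"},
    "chiropteran": {"bat.animal"},
    "murciélago": {"bat.animal"},
}

def wsd(context, word):
    """Toy oracle: given a gloss, return its concept iff the word can express it."""
    for concept, gloss in GLOSSES.items():
        if gloss == context and concept in SENSES.get(word, set()):
            return concept
    return None  # undefined

def sense(word, concept):
    """Sense(w, s), decided by running WSD on the gloss of s."""
    return wsd(GLOSSES[concept], word) == concept

def build_synsets(lexicon):
    """One synset per gloss: all words for which Sense holds."""
    return {s: {w for w in lexicon if sense(w, s)} for s in GLOSSES}

synsets = build_synsets(SENSES)
```

The resulting synsets are multilingual, since the predicate places no constraint on the language of the word.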

Substitution Lemma
The final proposition formalizes the relationship between synsets, senses, and lexical translations. It follows directly from the previously stated definitions, reductions, and assumptions.
MultiLexSub(Cx, wx, wy) ⇔ Sense(wy, WSD(Cx, wx))
The lemma provides a theoretical justification for methods that associate contextual lexical translations and synonyms with the synset identified by a WSD model. For example, BabelNet synsets are populated by translations of word instances that correspond to a given concept (Navigli and Ponzetto, 2010). Specifically, the existence of a translation pair (wx, wy) in a context Cx implies that wy lexicalizes the concept expressed by wx in Cx. Another example is the method of Luan et al. (2020), which leverages contextual translations to improve the accuracy of WSD.

Empirical Validation
In this section, we implement and test three principal reductions: MultiWiC to WiC, MultiWiC to WSD, and MultiLexSub to WSD. For each reduction, we reiterate its theoretical basis, describe our implementation, and discuss the results. We emphasize that the goal of our experiments is not to challenge the state of the art, but rather to test the reductions empirically, and, by extension, the hypothesis they are based on. Since the resources used for the implementations are necessarily imperfect, and the systems are each designed and optimized for a different target task, the reductions are expected to produce much less accurate predictions on the existing benchmark datasets than state-of-the-art methods.
Our primary interest is in identifying any possible counter-examples to our concept-meaning hypothesis. However, it must be noted that the presence of a small number of such exceptions in the existing datasets does not invalidate the theory. On the other hand, the scarcity of counter-examples should not be interpreted as a proof, but rather as supporting evidence for the correctness of our theoretical claims.

Solving MultiWiC with WiC
We first empirically test the counter-intuitive proposition that a multilingual semantic task can be reduced to a set of monolingual instances. In particular, given a method for solving WiC, we can solve any MultiWiC instance by deciding whether there exists a concept such that both given words express the concept given their corresponding contexts and the concept gloss. Formally:
MultiWiC(Cx, Cy, wx, wy) ⇔ ∃s : WiC(Cx, Cs, wx) ∧ WiC(Cy, Cs, wy)
The correctness of this reduction follows from the assumption that a concept gloss uniquely disambiguates every word that can express the concept.

Implementation of the Reduction
In practice, instead of checking all possible concepts, we limit our search to concepts that can be expressed by either of the two words. For each such concept, we create two WiC instances, one in each language, using a gloss retrieved from a lexical resource, and translated, as needed, into the language of each instance. We then solve each of the created WiC instances using a model trained exclusively on WiC data in that language. The reduction returns TRUE iff both WiC instances are classified as positive.
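The procedure just described can be sketched as a loop over candidate concepts. Everything named below is hypothetical: `gloss_for` stands in for gloss retrieval and translation, `wic_models` maps a language code to a monolingual WiC classifier, and the toy table, the French context, and the concept labels are illustrative data rather than the systems used in the experiments.

```python
def multi_wic_via_wic(x, y, concepts, gloss_for, wic_models):
    """x and y are (context, word, language) triples; TRUE iff some candidate
    concept yields a positive WiC decision in both languages."""
    (cx, wx, lx), (cy, wy, ly) = x, y
    return any(
        wic_models[lx](cx, gloss_for(s, lx), wx)
        and wic_models[ly](cy, gloss_for(s, ly), wy)
        for s in concepts
    )

# Toy stand-ins, for illustration only.
TOY = {
    ("bats live in caves", "bat"): "bat.animal",
    ("she swung the bat at the ball", "bat"): "bat.club",
    ("la chauve-souris dort le jour", "chauve-souris"): "bat.animal",
}

def gloss_for(concept, lang):
    return f"[{lang} gloss of {concept}]"

def make_toy_wic(lang):
    def wic(context, gloss, word):
        s = TOY.get((context, word))
        return s is not None and gloss == gloss_for(s, lang)
    return wic

models = {"en": make_toy_wic("en"), "fr": make_toy_wic("fr")}
candidates = ["bat.club", "bat.animal"]
x = ("bats live in caves", "bat", "en")
y_same = ("la chauve-souris dort le jour", "chauve-souris", "fr")
y_diff = ("she swung the bat at the ball", "bat", "en")
```

Passing the classifiers in as callables keeps the reduction independent of any particular WiC model.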
We test the reduction on the English-French test set of the MCL-WiC shared task (Martelli et al., 2021), which contains 1000 MultiWiC instances. The dataset is agnostic toward WordNet sense distinctions and annotations. We train the English WiC model on the English training and development sets (8k and 1k instances, respectively), and the French WiC model on the French development set (1k instances). The latter set is quite small, but we are not aware of any larger dedicated French WiC training data.
We create each WiC instance by prepending the input word, followed by a separator token, to each input context, including concept glosses. We retrieve concept glosses from BabelNet (Navigli and Ponzetto, 2010), using the Python API. While English lemmas are provided in the dataset, French lemmas are not. We therefore lemmatize French words using the spaCy fr_core_news_md model. Since BabelNet does not contain French glosses for all concepts, we generate them by translating the first English gloss in BabelNet using the OPUS-MT-EN-FR model from Helsinki NLP.
We train our English and French WiC models using LIORI (Davletov et al., 2021). All training was completed in under eight hours on two NVIDIA GeForce RTX 3090 GPUs. With the default hyper-parameter settings, the models obtain accuracies of 87.0% and 73.3% on the English and French monolingual test sets, respectively. This is lower than the 91.1% and 86.4% results reported by Davletov et al. (2021). We attribute this to our use of smaller, purely monolingual training data, which is in line with our theoretical reduction. Based on these numbers, we estimate the probability of a pair of WiC instances being both correctly classified as 0.870 * 0.733 = 0.638.

Results and Discussion
Our implementation correctly classifies 631 out of the 1000 instances in the test set. This is very close to the estimate computed in the previous section, which suggests that our reduction is approximately as reliable as our imperfect resources and systems allow.
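The comparison between the estimate and the observed accuracy is simple arithmetic, reproduced below. Note that multiplying the two monolingual accuracies treats the errors of the English and French models as independent, which is an assumption of the estimate rather than something measured.

```python
acc_en, acc_fr = 0.870, 0.733   # monolingual WiC test accuracies reported above
expected = acc_en * acc_fr      # both WiC instances correct, assuming independence
observed = 631 / 1000           # MultiWiC instances classified correctly
gap = expected - observed

print(f"expected={expected:.3f} observed={observed:.3f} gap={gap:.3f}")
```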
We manually analyzed a random sample of 50 MultiWiC classification errors. For each of the 25 false negatives, LIORI returned FALSE for all sentence pairs in either English (12 instances), French (8 instances), or both languages (5 instances). Each instance could be explained by either a LIORI error, or a missing sense in BabelNet. For 24 of the 25 false positives, we identified one or more incorrect positive WiC classifications. The final false positive was caused by an incorrect tokenization of the target word in the MCL-WiC dataset: disordered instead of mentally disordered.
Since all errors can be attributed to the systems and resources, they constitute no evidence against the correctness of our reduction. On the other hand, these results support our theoretical finding that multilingual problems can be reduced to monolingual problems. This in turn supports our methodology of grounding lexical semantics in the expression of language-independent concepts.

Solving MultiWiC with WSD
In this section, we test our MultiWiC-to-WSD reduction. In doing so, we generalize the WiC-to-WSD reduction of Hauer and Kondrak (2022) to multiple words and languages. Given a MultiWiC instance, we apply a WSD system to each context-word pair, and classify it as positive iff both words are tagged with the same BabelNet synset:
MultiWiC(Cx, Cy, wx, wy) ⇔ WSD(Cx, wx) = WSD(Cy, wy)

Implementation of the Reduction
Our system of choice is AMuSE-WSD (Orlando et al., 2021). It provides pre-trained WSD models for a diverse set of languages, and handles all stages of the WSD pipeline, including tokenization, lemmatization, and part-of-speech tagging. We apply the provided AMUSE-LARGE-MULTILINGUAL-CPU model, with all other parameters left at their default values.
Following Hauer and Kondrak (2022), we estimate an upper bound on the performance of our reduction, using analogous notation and formulae. For the expected accuracy of English and non-English WSD, we use the English-ALL and XL-WSD accuracy results reported by the AMuSE-WSD authors, 0.739 and 0.673. This estimation method also depends on the average number of senses per target word. Per the BabelNet API, an average MCL-WiC target word has 14 senses. The resulting overall accuracy estimate is 0.752, which is the average of 0.539 and 0.965 for the positive and negative MultiWiC instances, respectively.

Results and Discussion
The results on the MCL-WiC test sets range from 51.8% on English-Arabic to 55.1% on English-French. While the estimate in the previous section is substantially higher, it does not take into account tokenization errors and missing senses in BabelNet. On the English-French dataset, we found that false negatives outnumber false positives by a factor of six; the accuracy is 22.8% and 87.4% on the positive and negative MultiWiC instances, respectively.
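The figures in this paragraph are mutually consistent, as the following check shows. It assumes that the English-French test set contains equal numbers of positive and negative instances, which is not stated explicitly here but is implied by the overall values being plain averages of the per-class values.

```python
pos_acc, neg_acc = 0.228, 0.874           # per-class accuracy, English-French
overall = (pos_acc + neg_acc) / 2         # balanced classes assumed
fn_to_fp = (1 - pos_acc) / (1 - neg_acc)  # ratio of false negatives to false positives

est_pos, est_neg = 0.539, 0.965           # per-class estimates from the previous section
estimate = (est_pos + est_neg) / 2

print(f"overall={overall:.3f} fn/fp={fn_to_fp:.1f} estimate={estimate:.3f}")
```

Under the balance assumption the ratio does not depend on the size of the test set, since the class sizes cancel.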
For our manual analysis, we randomly selected 25 false positives and 25 false negatives produced by our implementation on the English-French test set. In 41 of the 50 cases, we determined the cause of the incorrect MultiWiC classification to be an incorrect sense returned by AMuSE-WSD for one or both target words. In addition, 7 of the 50 cases represent tokenization errors. One MultiWiC instance, which involves English reflected and French consignée, is most likely an MCL-WiC annotation error. The final error is attributable to a sense missing from BabelNet, which prevents AMuSE-WSD from considering it as a candidate. Specifically, it is the "administer" sense of the verb dispense (as in "dispense justice"), which can be found in the Merriam-Webster Online Dictionary.
Since manual analysis yields no counter-examples to our theory, we interpret the results as empirical support for this reduction, and, by extension, our taxonomy of semantic tasks, and the hypothesis on which it is based.

Solving MultiLexSub with WSD
In the final experiment, we test the MultiLexSub-to-WSD reduction derived in Section 3.2:
MultiLexSub(C, wx, wy) ⇔ WSD(C, wx) = WSD(C, wy)
The overall method is similar to that of Guo and Diab (2010), but uses our precise binary formulation of lexical substitution.

Implementation of the Reduction
We use the dataset from the SemEval 2010 shared task on cross-lingual lexical substitution (Mihalcea et al., 2010), which consists of a trial set of 300 instances, and a test set of 1000 instances. Each instance consists of an English sentence which includes a single target word and a list of Spanish gold substitutes provided by annotators.
Since our formulation of lexical substitution is binary rather than generative or ranking-based, we convert each of the SemEval instances into a pair of binary instances: one positive and one negative. For the positive instance, we take the first Spanish substitute, the one that was most frequently suggested by the annotators. For the negative instance, we randomly select a Spanish word from the set of all substitutes in the dataset for that English target word, provided that it is not among the gold substitutes for that specific instance. If there is no such substitute, we instead choose a random Spanish word from the dataset.
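The conversion into binary instances can be sketched as below. The function name, the argument layout, and the toy data are hypothetical; the sketch also assumes that gold substitutes are ordered by annotator frequency, and that the fallback draw excludes the instance's own gold substitutes, a detail the description leaves open.

```python
import random

def make_binary_pair(context, target, gold, pool_by_target, all_spanish, rng):
    """Turn one instance into a (positive, negative) pair of binary instances.

    gold           -- gold substitutes, most frequently suggested first (assumed)
    pool_by_target -- all substitutes seen in the dataset for each English target
    all_spanish    -- every Spanish word in the dataset, used as a fallback
    """
    positive = (context, target, gold[0], True)
    # Substitutes for the same target elsewhere, minus this instance's gold ones.
    candidates = sorted(set(pool_by_target.get(target, ())) - set(gold))
    if not candidates:
        # Fallback: any Spanish word in the dataset (gold excluded by assumption).
        candidates = sorted(set(all_spanish) - set(gold))
    negative = (context, target, rng.choice(candidates), False)
    return positive, negative

# Hypothetical toy data.
pool = {"bat": ["murciélago", "bate"], "field": ["campo"]}
vocab = ["murciélago", "bate", "campo"]
rng = random.Random(0)
pos, neg = make_binary_pair("bats live in caves", "bat", ["murciélago"], pool, vocab, rng)
```

Sorting the candidates before sampling makes the draw reproducible for a fixed seed, since set iteration order is not guaranteed to be stable across runs.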
For each binary instance created in this way, we create two WSD instances using a simple template: 'w' as in 'C', where w is the target word, and C is the context. We obtain the context for the Spanish word by translating the English context via Helsinki NLP's OPUS-MT-EN-ES model. We return a positive MultiLexSub classification iff AMuSE-WSD assigns the same BabelNet synset ID to both English and Spanish target words.
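The template and the synset comparison can be sketched as follows. The callables `translate` and `tag_synset` are hypothetical placeholders for the translation model and the WSD tagger, and do not reflect the actual interfaces of OPUS-MT or AMuSE-WSD; the synset identifiers and the Spanish translation are toy values.

```python
def template(word, context):
    """Build the WSD input: 'w' as in 'C'."""
    return f"'{word}' as in '{context}'"

def multi_lex_sub_via_wsd(context_en, word_en, word_es, translate, tag_synset):
    """Positive iff both target words are tagged with the same synset ID."""
    context_es = translate(context_en)
    syn_en = tag_synset(template(word_en, context_en), word_en, "EN")
    syn_es = tag_synset(template(word_es, context_es), word_es, "ES")
    return syn_en is not None and syn_en == syn_es

# Toy stand-ins, for illustration only.
TRANSLATIONS = {"bats live in caves": "los murciélagos viven en cuevas"}
TOY_SYNSETS = {"bat": "syn:animal", "murciélago": "syn:animal", "bate": "syn:club"}

def toy_translate(context):
    return TRANSLATIONS[context]

def toy_tagger(text, word, lang):
    return TOY_SYNSETS.get(word)

result = multi_lex_sub_via_wsd("bats live in caves", "bat", "murciélago",
                               toy_translate, toy_tagger)
```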
Our procedure for estimating the expected accuracy of our reduction is the same as in Section 5.2.1. The only difference is the average number of senses per word, which in this case is 23, yielding an estimated accuracy of 75.8%.

Results and Discussion
The binary classification accuracy of our implementation on 2000 MultiLexSub instances created from the SemEval test set is 63.2%, which is substantially below the estimate in the previous section. This can be partially explained by a relatively high number of tokenization errors in the test set. We again observe a strong bias toward negative classification: the results on the positive and negative instances are 27.1% and 99.3% accuracy, respectively. Because of this, we selected only positive instances for our error analysis.
We manually analyzed a sample of 50 randomly-selected false negatives from the test set. In 44 of the 50 cases, the cause of the misclassification was an AMuSE-WSD error (on English in 30 cases, on Spanish in 14). Some of those errors may be caused by an imperfect translation of the English context, or a missing BabelNet sense of the Spanish gold substitute. In 5 cases, the English input was incorrectly tokenized; for example, the compound noun key ring was split into two word tokens, with one instance having ring as its focus. The final case likely involves an annotation error in the SemEval dataset: campo as a translation of field given the context of "effective law enforcement in the field."
We conclude that all incorrect classifications can be attributed to a resource or system used by our implementation, and thus none of them represents a counter-example to our hypothesis.

Conclusion
Starting from basic assumptions about the expression of concepts by words in context, we have developed consistent formulations of thirteen different problems in lexical semantics. We have shown that a "wordnet-complete" subset of these tasks can each be used to solve any of the others via reduction. These problems can be used to construct, correct, or expand multilingual synonym sets, the building blocks of important linguistic resources such as WordNet and BabelNet. We believe that this work will lead to a greater understanding of lexical semantics and its underlying linguistic phenomena, as well as new applications and better interpretation of empirical results. Based on our theory, we intend to develop methods for constructing fully explainable and interpretable linguistic resources.

Limitations
While we do include multilingual datasets in our experiments, our error analysis is limited to languages of the Indo-European family, specifically English, French, and Spanish, as these are the languages covered by our datasets which we can confidently analyze.In addition, it is possible to question some of the assumptions made in our theory, which should be kept in mind when considering our work.For example, we assume that, for each content word token in a discourse, there exists a single concept which that word is intended by the sender to express, regardless of whether it appears unambiguous to the receiver.However, unlike in mathematics, theoretical assumptions may not always hold in practice; for example, puns often exploit multiple meanings of a word for humorous effect.While such cases are not frequently considered in lexical semantics, we can expect exceptions to almost any assumption or conclusion regarding human languages.

Figure 1: Taxonomy of problems in lexical semantics. Arrows indicate reducibility. The six wordnet-complete problems within the dotted area are equivalent, and all other problems in the taxonomy are reducible to them.