A Mixture-of-Experts Model for Antonym-Synonym Discrimination

Discriminating between antonyms and synonyms is an important and challenging NLP task: antonyms and synonyms often share the same or similar contexts and are thus hard to distinguish. This paper proposes two underlying hypotheses and employs the mixture-of-experts framework as a solution. The framework works on the basis of a divide-and-conquer strategy, where a number of localized experts focus on their own domains (or subspaces) to learn their specialties, and a gating mechanism determines the space partitioning and the expert mixture. Experimental results show that our method achieves state-of-the-art performance on the task.


Introduction
Antonymy-synonymy discrimination (ASD) is a crucial problem in lexical semantics and plays a vital role in many NLP applications such as sentiment analysis, textual entailment and machine translation. Synonymy refers to semantically similar words (words having similar meanings), while antonymy indicates the oppositeness or contrastiveness of words (words having opposite meanings). Although telling apart antonyms and synonyms looks simple on the surface, it actually poses a hard problem, because antonyms and synonyms are often interchangeable in the same contexts.
A few research efforts have been devoted to computational solutions of the ASD task, which comprise two mainstreams: pattern-based and distributional approaches. The underlying idea of pattern-based methods is that antonymous word pairs co-occur with each other in some antonymy-indicating lexico-syntactic patterns within a sentence (Roth and im Walde, 2014; Nguyen et al., 2017). In spite of their high precision, pattern-based methods suffer from limited recall owing to the sparsity of lexico-syntactic patterns and to lexical variations.
Distributional methods work on the basis of the distributional hypothesis, which states that "words similar in meaning tend to occur in similar contexts" (Harris, 1954). Traditional distributional methods are based on discrete context vectors. Scheible et al. (2013) verified that using only the contexts of certain word classes can help discriminate antonyms from synonyms. Santus et al. (2014) argued that synonyms are expected to have a broader and more salient intersection of their top-K salient contexts than antonyms, and proposed an average-precision-based unsupervised measure.
With the advent of word embeddings as continuous representations (Mikolov et al., 2013; Mnih and Kavukcuoglu, 2013; Pennington et al., 2014), several neural methods have been proposed to elicit ASD-specific information from pretrained word embeddings in a supervised manner. Etcheverry and Wonsever (2019) used a siamese network to ensure the symmetric, reflexive and transitive properties of synonymy, and a parasiamese network to model the antitransitivity of antonymy. Ali et al. (2019) projected word embeddings into synonym and antonym subspaces respectively, and then trained a classifier on the features from these distilled subspaces, where the trans-transitivity of antonymy was taken into consideration.

This paper follows the distributional approach and studies the ASD problem on the basis of pretrained word embeddings. Two hypotheses underlie our method: (a) antonymous words tend to be similar on most semantic dimensions but differ on only a few salient dimensions; (b) the salient dimensions may vary significantly for different antonymies throughout the whole distributional semantic space. With respect to hypothesis (b), we find that a tailored mixture-of-experts (MoE) model (Jacobs et al., 1991) fits it well: the semantic space is divided into a number of subspaces, and each subspace has one specialized expert that elicits the salient dimensions and learns a discriminator for that subspace. As to hypothesis (a), a similar opinion was expressed by Cruse (1986), who observed that antonymous words tend to have many common properties but differ saliently along one dimension of meaning. In addition, our experimental results show that each expert requires only four salient dimensions to achieve the best performance.
Finally, we would like to point out the main differences of our method from existing ones. Firstly, our MoE-ASD model adopts a divide-and-conquer strategy, where each subspace is in the charge of one relatively simple localized expert that focuses on only a few salient dimensions, while existing methods rely on a global model that must grasp all the salient dimensions across all the subspaces. Secondly, our method enforces only the symmetric property of synonymy and antonymy, and ignores the other algebraic properties such as the transitivity of synonymy and the trans-transitivity of antonymy, because these algebraic properties do not always hold at the word level, owing to the polysemous nature of words.

Method
This paper proposes a novel ASD method based on the mixture-of-experts framework, called MoE-ASD. Its architecture is illustrated in Figure 1. Our code and data are released at https://github.com/Zengnan1997/MoE-ASD. The model solves the problem in a divide-and-conquer manner: the problem space is divided into a number of subspaces, and each subspace is in the charge of a specialized expert, which focuses on the salient dimensions of that subspace and makes the decision for word pairs. A gating module is trained jointly with these experts. The details are as follows.

Figure 2: A localized expert

Localized Experts
All the experts are homogeneous: they share the same network architecture but have different parameter values. Given a word pair (w_1, w_2) as input, each expert E_i computes an unnormalized probability a_i(w_1, w_2) of the pair being antonymous. As stated in Section 1, our method adopts the hypothesis that antonymous words tend to be similar on most semantic dimensions but different on a few salient dimensions. Each expert therefore first elicits the salient dimensions, and then makes a decision based on a feature vector constructed from them. Figure 2 illustrates how an expert works.
Let w_1 and w_2 denote the pre-trained word embeddings of words w_1 and w_2 respectively, whose dimensionality is d_e. Each expert E_i distills d_u salient dimensions from them by projecting them from R^{d_e} into R^{d_u}:

    u_j = M_i^T w_j + b_i,  for j = 1, 2

where M_i is a matrix of size d_e × d_u and b_i is a vector of length d_u. Next, a relational feature vector r is constructed by concatenating the element-wise sum, the element-wise absolute difference, the cosine similarity, and the negation-prefix feature:

    r = [u_1 + u_2 ; |u_1 − u_2| ; cos(u_1, u_2) ; f_{w_1,w_2}]

Here, f_{w_1,w_2} is the negation-prefix feature that denotes whether w_1 and w_2 differ only by one of the known negation prefixes {de, a, un, non, in, ir, anti, il, dis, counter, im, an, sub, ab}, following Ali et al. (2019) and Rajana et al. (2017). It is evident that the feature vector is symmetric with respect to the input word pair; that is, the word pairs (w_1, w_2) and (w_2, w_1) lead to the same feature vector. It is worth noting that the absolute difference is used instead of the signed difference, in order to preserve the symmetric properties of both synonymy and antonymy. We note that Roller et al. (2014) used the difference between two word vectors as a useful feature for detecting hypernymy, which is asymmetric.
The relational feature vector r then goes through an MLP, whose hidden layer has d_h units, to obtain the antonymy score:

    a_i(w_1, w_2) = v_i^T φ(W_i r + c_i) + o

where W_i and c_i are the weights and bias of the hidden layer, φ is its activation function, v_i is the output weight vector, and o is the output bias.
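To make the expert computation concrete, here is a minimal NumPy sketch of one localized expert. The parameter values are random placeholders, and the tanh activation and the exact parameter names are our assumptions for illustration, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
d_e, d_u, d_h = 300, 4, 16  # embedding, salient-subspace, and hidden sizes

# Parameters of one expert E_i (randomly initialized for illustration).
M_i = rng.normal(size=(d_e, d_u))          # projection matrix
b_i = np.zeros(d_u)                        # projection bias
W_i = rng.normal(size=(d_h, 2 * d_u + 2))  # hidden layer over r (length 2*d_u + 2)
c_i = np.zeros(d_h)
v_i = rng.normal(size=d_h)                 # output weights
o_i = 0.0                                  # output bias

def expert_score(w1, w2, has_negation_prefix=False):
    """Antonymy score a_i(w1, w2) of a single localized expert."""
    # Project both embeddings onto the d_u salient dimensions.
    u1 = M_i.T @ w1 + b_i
    u2 = M_i.T @ w2 + b_i
    # Symmetric relational features: sum, |difference|, cosine, negation flag.
    cos = u1 @ u2 / (np.linalg.norm(u1) * np.linalg.norm(u2) + 1e-8)
    r = np.concatenate([u1 + u2, np.abs(u1 - u2),
                        [cos, float(has_negation_prefix)]])
    # One-hidden-layer MLP (tanh is an assumed activation) gives the raw score.
    return v_i @ np.tanh(W_i @ r + c_i) + o_i

w1, w2 = rng.normal(size=d_e), rng.normal(size=d_e)
# Swapping the pair leaves the score unchanged, as every feature is symmetric.
assert np.isclose(expert_score(w1, w2), expert_score(w2, w1))
```

Note how the symmetry claimed in the text falls out of the construction: each of the four features is invariant under swapping w_1 and w_2, so the whole expert is as well.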

Gating Mechanism for Expert Mixture
Assume there are M localized experts in the MoE-ASD model. For an input word pair (w_1, w_2), the problem is how to derive the final score for antonymy detection.
In our MoE-ASD model, the final score is a weighted average of the M scores from the localized experts:

    a(w_1, w_2) = Σ_{i=1}^{M} g_i · a_i(w_1, w_2)

where the weight vector g lies in the M-dimensional simplex and denotes the proportional contributions of the experts to the final score. A gating mechanism is used to calculate g for each specific word pair (w_1, w_2), fulfilling a dynamic mixture of experts:

    g = softmax(M_g^T (w_1 + w_2))

where M_g ∈ R^{d_e × M} is the parameter matrix of the gating module. The i-th column of M_g can be thought of as the representative vector of the i-th expert, and the dot product between the sum of the two word embeddings and this representative vector is the attention weight of expert E_i. Softmax is then applied to the attention weights to get g. It is evident that the gating module is also symmetric with respect to the input word pair. The symmetric properties of both the gating module and the localized experts endow our model with a symmetry that makes it distinct from other state-of-the-art methods such as Parasiam (Etcheverry and Wonsever, 2019).
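The gating computation above can be sketched in a few lines of NumPy. Again the parameter values are random placeholders, used only to check the two properties the text claims (g lies in the simplex; the gate is symmetric in the word pair):

```python
import numpy as np

rng = np.random.default_rng(1)
d_e, M = 300, 8  # embedding dimensionality and number of experts

# Gating parameter matrix; column i is the representative vector of expert i.
M_g = rng.normal(size=(d_e, M))

def softmax(z):
    e = np.exp(z - z.max())  # shift for numerical stability
    return e / e.sum()

def gate(w1, w2):
    """Mixture weights g = softmax(M_g^T (w1 + w2)); symmetric in (w1, w2)."""
    return softmax(M_g.T @ (w1 + w2))

def mixture_score(w1, w2, expert_scores):
    """Final antonymy score: weighted average of the M expert scores."""
    return gate(w1, w2) @ expert_scores

w1, w2 = rng.normal(size=d_e), rng.normal(size=d_e)
g = gate(w1, w2)
assert np.isclose(g.sum(), 1.0) and np.all(g >= 0)  # g lies in the simplex
assert np.allclose(g, gate(w2, w1))                 # gating is symmetric
```

Because the gate sees only the sum w_1 + w_2, the final score inherits the symmetry of the individual experts.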

Model Prediction and Loss Function
Given a word pair (w_1, w_2), the probability of it being antonymous is obtained by simply applying the sigmoid function to the final score:

    p(w_1, w_2) = σ(a(w_1, w_2))

Let A denote the training set of N word pairs, t^(n) denote the gold label of the n-th word pair, and p^(n) the predicted probability of being antonymous. Our model uses the cross-entropy loss function:

    L = − Σ_{n=1}^{N} [ t^(n) log p^(n) + (1 − t^(n)) log(1 − p^(n)) ]

Evaluation

Dataset. We evaluate our method on the dataset of Nguyen et al. (2017), which was created from WordNet (Miller, 1995) and Wordnik. The word pairs of antonyms and synonyms are grouped by word class (Adjective, Noun and Verb), with a 1:1 ratio of antonyms to synonyms in each group. The statistics of the dataset are shown in Table 1. To make a fair comparison with previous algorithms, the dataset is split into training, validation and testing data in the same way as in previous works.

Methods for Comparison
We make a comparison against the following ASD methods: (1) Concat, a baseline method that concatenates the two word vectors and feeds the result into an MLP with two hidden layers (with 400 and 200 hidden units respectively) and ReLU activations; (2) the method of Ali et al. (2019), which projects word embeddings into subspaces to distill task-specific information and then trains a classifier based on the distilled subspaces; (3) Parasiam (Etcheverry and Wonsever, 2019), the parasiamese network described in Section 1.

Table 3: Performance evaluation with the dLCE embeddings

Experimental Settings
We use the 300-dimension FastText word embeddings (Bojanowski et al., 2017).

Comparison with SOTA methods

Our model consistently outperforms the state-of-the-art methods on all three subtasks, which manifests the effectiveness of the mixture-of-experts model for ASD and validates hypothesis (b) that the salient dimensions may vary significantly throughout the whole space. We also find that the performance on the Noun class is relatively low compared with the Verb and Adjective classes, which coincides with observations in previous work (Scheible et al., 2013), possibly because the polysemy phenomenon is more significant among nouns.

Table 4: Ablation analysis of the features
Besides vanilla word embeddings, existing ASD methods have also used the dLCE embeddings (Nguyen et al., 2016), often obtaining better results with them. However, a large number of antonym and synonym pairs were used in the process of learning the dLCE embeddings, which may lead to severe overfitting. In spite of this concern, we also test our method with the dLCE embeddings on the dataset; the results, listed in Table 3, show that it still outperforms these competitors.

Ablation Analysis of Features
We also make an ablation analysis of the four kinds of features, by removing each of them from our model in turn. It can be seen from Table 4 that all the features make their own contributions to ASD. Different parts of speech have different sensitivities to different features: verbs are most sensitive to the absolute difference, while both adjectives and nouns are most sensitive to the cosine similarity. The reason behind these observations deserves further exploration.

Table 5: Performance of our model and the baseline models on the lexical-split datasets

Hyperparameter Analysis
The number of salient dimensions (d_u) and the number of experts (M) are two prominent hyperparameters in our MoE-ASD model. By varying their values, we study their influence on the performance.
Firstly, fixing M = 256, we vary d_u from 2^1 to 2^8 and plot the F1-scores on the validation and testing data in Figure 3. All three subtasks (Adjective, Noun and Verb) achieve their best performance at d_u = 4 on both the validation and testing data. This validates hypothesis (a) that antonymous words tend to be different on only a few salient dimensions.
Secondly, fixing d_u = 4, we vary M from 2^1 to 2^8 and plot the F1-scores in Figure 4. Overall, the performance improves with a larger number of experts. We conjecture that only marginal improvement would be obtained by increasing the number of experts further, but we have not run such experiments.

Lexical Memorization
To eliminate the bias introduced by the lexical memorization problem (Levy et al., 2015), we perform lexical splits to obtain training and testing datasets with zero lexical overlap. The statistics of the lexical-split datasets are listed in Table 6. Table 5 shows the results of our method and Parasiam on the lexical-split datasets, using the FastText and dLCE pretrained word embeddings. Our MoE-ASD model outperforms Parasiam on all three lexical-split datasets; however, significant decreases in the F1-scores are also observed.
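A lexical split of this kind can be implemented by partitioning the vocabulary into disjoint train/test word sets and keeping only the pairs whose two words fall in the same set. The sketch below uses made-up word pairs (label 1 = antonym, 0 = synonym); the exact split procedure used for Table 6 may differ:

```python
import random

def lexical_split(pairs, test_ratio=0.2, seed=0):
    """Split labeled word pairs so that train and test share no words."""
    # Collect and deterministically shuffle the vocabulary.
    vocab = sorted({w for pair in pairs for w in pair[:2]})
    random.Random(seed).shuffle(vocab)
    test_words = set(vocab[: int(len(vocab) * test_ratio)])
    train_words = set(vocab) - test_words
    train = [p for p in pairs if p[0] in train_words and p[1] in train_words]
    test = [p for p in pairs if p[0] in test_words and p[1] in test_words]
    # Pairs straddling the two word sets are discarded entirely.
    return train, test

pairs = [("hot", "cold", 1), ("big", "large", 0), ("big", "small", 1),
         ("happy", "sad", 1), ("fast", "quick", 0), ("fast", "slow", 1)]
train, test = lexical_split(pairs, test_ratio=0.3)
train_vocab = {w for p in train for w in p[:2]}
test_vocab = {w for p in test for w in p[:2]}
assert not (train_vocab & test_vocab)  # zero lexical overlap
```

Discarding the straddling pairs is what shrinks the lexical-split datasets, and it is one reason F1-scores drop under this evaluation regime.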

Conclusions
This paper first presents two hypotheses for the ASD task (namely, that antonymous words tend to differ on only a few salient dimensions, and that these salient dimensions may vary significantly across different antonymies) and then motivates an ASD method based on the mixture-of-experts framework. Experimental results have manifested its effectiveness and validated the two underlying hypotheses. It is worth noting that our method is distinct from the other state-of-the-art methods in two main aspects: (1) it works in a divide-and-conquer manner, dividing the whole space into multiple subspaces and having one expert specialized for each subspace; (2) it is inherently symmetric with respect to the input word pair.