Understanding Feature Focus in Multitask Settings for Lexico-semantic Relation Identification

Discovering whether words are semantically related and identifying the specific semantic relation that holds between them is of crucial importance for automatic reasoning over text data. For that purpose, different methodologies have been proposed that either (1) tackle feature engineering, (2) fine-tune latent semantic spaces, or (3) take advantage of cognitive links between semantic relations in multitask settings. In this paper, we investigate how feature engineering and multitask architectures can be improved and subsequently combined to identify lexico-semantic relations. Evaluation results over a set of gold-standard datasets show that (1) combinations of similar features (feature sets) are beneficial, (2) asymmetric distributional features are a strong cue to discriminate asymmetric relations and play an important role in multitask architectures, and (3) shared-private models improve over binary and fully-shared classifiers and correctly balance the focus on features between private and shared layers.

Most approaches focus on modeling a single semantic relation and consist in deciding whether a given relation r holds between a pair of words (w1, w2). The vast majority of efforts (Vulić and Mrkšić, 2018; Wang and He, 2020) concentrate on hypernymy, the key organizational principle of semantic memory, but studies also exist on antonymy (Nguyen et al., 2017b; Ali et al., 2019), meronymy (Glavaš and Ponzetto, 2017) and co-hyponymy (Jana et al., 2020). Within this scope, different strategies have been proposed that either define new features (Santus et al., 2017; Vu and Shwartz, 2018) or build latent semantic spaces specific to the relation at hand (Nguyen et al., 2017a; Rei et al., 2018; Wang and He, 2020).
More recently, multitask strategies have been proposed, which consist in concurrently learning correlated lexico-semantic relations (Attia et al., 2016;Balikas et al., 2019;Bannour et al., 2020), the underlying idea being that if two (or more) tasks are cognitively interlinked, a learning architecture should improve its generalization ability by taking into account the shared information existing between the tasks (Caruana, 1998).
In this paper, we propose to investigate how feature engineering can be coupled with multitask strategies for the identification of lexico-semantic relations. On the one hand, Vu and Shwartz (2018) show that the introduction of the generalized cosine (Mult) drastically improves results over the mere concatenation of word embeddings, thus clearly evidencing the limitations of general-purpose latent spaces. However, a complete study of symmetric and asymmetric characteristics and their combination is still lacking, with the exception of (Santus et al., 2017), one of the most complete works in the field.
On the other hand, although existing multitask strategies have been showing promising results, they neither take advantage of specialized features nor do they implement state-of-the-art architectures that have been successful for text classification (Liu et al., 2017). This might be due to the fact that the combination of features within shared-private multitask architectures is not straightforward and requires specific tuning.
Evaluation results over a set of gold-standard datasets (RUMEN (Balikas et al., 2019), ROOT9 (Santus et al., 2016), WEEDS (Weeds et al., 2004) and BLESS (Baroni and Lenci, 2011)) of an architecture coupling optimized feature sets and shared-private models show that:
• The combination of features within a family set improves performance over the use of a unique family member;
• Asymmetric distributional features are a strong cue to discriminate asymmetric lexico-semantic relations;
• Shared-private models improve over binary and fully-shared classifiers (Balikas et al., 2019; Bannour et al., 2020) and correctly balance the focus on features between private and shared layers;
• Asymmetric distributional features play an important role in multitask architectures, being an important source of information for combining symmetric and asymmetric tasks.

Related Work
Three major research directions have been proposed to identify lexico-semantic relations: (1) feature engineering, (2) construction of fine-tuned semantic spaces, and (3) multitask architectures.

Within the first direction, (Levy et al., 2015) and (Vylomova et al., 2016) proposed similar evaluations to combine word input vectors ($\vec{w}_1$, $\vec{w}_2$), following the initial experiments of (Baroni et al., 2012; Roller et al., 2014; Weeds et al., 2014). In particular, word pairs are encoded as the concatenation of the constituent word representations. Both studies evidence that the distributional hypothesis is domain-dependent by nature, and as such models based on these input representations may not generalize across domains. To overcome this limitation, (Nguyen et al., 2017b) proposed to represent contextual patterns as continuous vectors with successful results, while (Vu and Shwartz, 2018) showed the benefit of integrating the element-wise multiplication of the input vectors (Mult).

The second main research direction aims to build fine-tuned neural latent semantic spaces. (Nguyen et al., 2017a) proposed HyperVec, where embeddings are learned in a specific order to capture the hypernym-hyponym distributional hierarchy from a background knowledge of hypernym-hyponym pairs. (Vulić and Mrkšić, 2018) rather proposed a post-processing strategy that retrofits the background knowledge into an original latent space. Such methods suffer from limited coverage as they affect only vectors of seen words. To deal with this limitation, (Kamath et al., 2019) presented a post-processing method that specializes vectors of all vocabulary words by learning a global specialization function, and (Wang and He, 2020) followed the same idea but proposed to learn two projection functions. In the same line, (Bouraoui et al., 2020) introduced a framework that fine-tunes BERT (Devlin et al., 2019) to include relational information.
The third approach tackles relation identification from the architecture point of view. Within this context, (Attia et al., 2016) can be viewed as a coarse-grained analysis, as they propose a multitask convolutional neural network where one task acts as domain adaptation (relatedness between two words) and the second task is a multiclass classification problem for hypernymy, meronymy, synonymy and antonymy. Instead, (Balikas et al., 2019) proposed a fine-grained approach that determines whether the learning process of a given semantic relation can be improved by the concurrent learning of another relation, where relations are synonymy, co-hyponymy, hypernymy and meronymy. (Bannour et al., 2020) implemented the same fully-shared model, but introduced the idea of data augmentation via attention models.
Although fine-tuned embeddings have evidenced improved results over generic ones, they are relation- and knowledge-dependent. One exception is proposed by (Meng et al., 2019), who learn text embeddings in a spherical space (aka. JoSE) suitable for relational information. Feature engineering also affords a "cheap performance boost" (Vu and Shwartz, 2018) in resource-free environments. But a complete study of the combination of features is still missing, as is the definition of asymmetric features in the context of continuous spaces, although a great deal of work exists for the discrete case (Kotlerman et al., 2010; Santus et al., 2017). Finally, studies in multitask settings neither take advantage of powerful multitask models such as shared-private architectures (Liu et al., 2017), which allow combining task-specific and cross-task information, nor benefit from the fruitful combination of distributional and pattern-based features. In this paper, we propose to deal with the aforementioned limitations in a resource-free setup.

Feature Engineering
In addition to word embedding concatenation, we define three families of features in continuous semantic spaces, based on the distributional hypothesis (symmetric and asymmetric features) and on the paradigmatic approach (pattern-based features).

Distributional Representation
Most studies have evidenced the superiority of concatenating representational word vectors to infer the semantic relationship of a word pair (Vu and Shwartz, 2018), and we follow this line of research. Let (w1, w2) be a word pair and $\vec{w}_1$, $\vec{w}_2$ their respective distributional representations of dimension d. The input distributional feature of the word pair is noted $\vec{w}_1 \oplus \vec{w}_2$.
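As a minimal sketch (function and variable names are ours, not the paper's), the concatenated pair input can be built with NumPy:

```python
import numpy as np

def pair_input(w1_vec: np.ndarray, w2_vec: np.ndarray) -> np.ndarray:
    """Concatenate the two word embeddings (w1 ⊕ w2) into a single
    input vector of dimension 2d."""
    return np.concatenate([w1_vec, w2_vec])
```

For two d-dimensional embeddings, the result is the 2d-dimensional vector that serves as the baseline input representation throughout the paper.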

Symmetric Distributional Features
Studies have evidenced the interest of coupling word embeddings with specific features to improve relation identification. In particular, the cosine similarity measure cos has shown promising results (Garten et al., 2015; Barkan, 2017). However, Vu and Shwartz (2018) have demonstrated the effectiveness of integrating the element-wise multiplication of the input vectors, which can be seen as a generalized cosine (cosG, aka. Mult), defined in equation 1.
While the cosine only provides a single value as input, cosG yields an input of dimension d, thus evidencing a dimensional issue. As a consequence, we propose to transform cosG into a single value through a linear activation layer, as in equation 2. The resulting cosG1D can be seen as a control value of cosG that takes the dimensional bias into account (from high to low dimension).
The counterpart of equation 2 is the d-fold duplication of the cosine value. This metric, called cosine broadcast (cosBr) and defined in equation 3, aims to control the dimensional issue from a low to a high dimension.
As such, in equation 4, we define a family of symmetric distributional features.
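Since equations 1–4 are only referenced in the text, the following is a sketch of how the symmetric family can be written under our reading of the definitions above (W and b denote the weights and bias of the linear activation layer; the exact published forms may differ):

```latex
\begin{align}
cosG(\vec{w}_1,\vec{w}_2)   &= \vec{w}_1 \odot \vec{w}_2
  && \text{(element-wise product, dimension } d\text{)} \\
cosG1D(\vec{w}_1,\vec{w}_2) &= W \cdot cosG(\vec{w}_1,\vec{w}_2) + b
  && \text{(linear layer, dimension } 1\text{)} \\
cosBr(\vec{w}_1,\vec{w}_2)  &= [\, \cos(\vec{w}_1,\vec{w}_2) \,]_{\times d}
  && \text{(cosine duplicated } d \text{ times)} \\
CosF &= \{ \cos,\ cosG,\ cosG1D,\ cosBr \}
\end{align}
```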
In the next subsection, we detail the design of new asymmetric distributional measures based on the Kullback-Leibler divergence.

Asymmetric Distributional Features
Asymmetry has shown successful results in the discrete case (Kotlerman et al., 2010; Santus et al., 2017), the underlying idea being that the relation between words may be unbalanced, such that one word attracts the other more than the opposite. Here, we define different asymmetric features in the continuous space based on the Kullback-Leibler divergence (Kullback and Leibler, 1951). To fit the continuous case, we transform each dimension of a word vector with the sigmoid (σ) function such that all values range between 0 and 1. Thus, each word can be considered as a probability distribution, and the asymmetric metric Kull is defined in equation 5.
To take into account both directions of the asymmetry, we propose to concatenate the Kull values for both directions as defined in equation 6.
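One plausible reading of equations 5 and 6, assuming on our part that the sigmoid-squashed vectors are renormalized so that each sums to one, is:

```latex
\begin{align}
p_i &= \frac{\sigma(w_{1,i})}{\sum_{j} \sigma(w_{1,j})}, \qquad
q_i  = \frac{\sigma(w_{2,i})}{\sum_{j} \sigma(w_{2,j})} \\
Kull(\vec{w}_1,\vec{w}_2) &= \sum_{i=1}^{d} p_i \log \frac{p_i}{q_i} \\
kull(\vec{w}_1,\vec{w}_2) &= Kull(\vec{w}_1,\vec{w}_2) \oplus Kull(\vec{w}_2,\vec{w}_1)
\end{align}
```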
Similarly to cosG, we define the multiplicative version of Kull, such that kullG integrates the element-wise multiplication of the input vectors, as defined in equations 7 (single asymmetry) and 8 (concatenation of both asymmetries).
Similarly to cosG1D, and to take into account the dimensional issue of the multiplicative version of the Kullback-Leibler divergence, we define kullG1D in equations 9 and 10.
Similarly to cosBr, we define kullBr based on the (d times) duplication of the Kullback-Leibler value for both directions, as in equation 11.
As such, in equation 12, we define a family of asymmetric distributional features.
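The basic asymmetric feature can be sketched in NumPy as follows. The renormalization of the sigmoid-squashed vectors into proper probability distributions is our assumption, since the paper's exact formulation is not shown:

```python
import numpy as np

def sigmoid(x: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-x))

def kl(p: np.ndarray, q: np.ndarray) -> float:
    # Kullback-Leibler divergence between two discrete distributions.
    return float(np.sum(p * np.log(p / q)))

def kull(w1: np.ndarray, w2: np.ndarray) -> float:
    # Squash each embedding into (0, 1) with the sigmoid, renormalise to a
    # probability distribution (our assumption), then take the KL divergence.
    p = sigmoid(w1); p /= p.sum()
    q = sigmoid(w2); q /= q.sum()
    return kl(p, q)

def kull_pair(w1: np.ndarray, w2: np.ndarray) -> np.ndarray:
    # Both directions of the asymmetry: [Kull(w1, w2), Kull(w2, w1)].
    return np.array([kull(w1, w2), kull(w2, w1)])
```

As expected for an asymmetric measure, kull(w1, w2) and kull(w2, w1) generally differ, while kull(w, w) is exactly zero.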
In the next subsection, we present the encoding strategy of patterns embodying the paradigmatic approach.

Pattern-based Paradigmatic Features
Patterns are part of the paradigmatic approach (Hearst, 1992), which suggests that specific word sequences may exist that link two words in a given relation. Some examples of sequences between word pairs are given in Table 1, which evidence that some of them can be spurious, and do not necessarily include patterns.
Here, we propose to implement the methodology of  to encode patterns into continuous spaces. As such, we transform the k most frequent patterns occurring between w1 and w2 using either a BiLSTM or the Universal Sentence Encoder (USE) (Cer et al., 2018), and then perform average pooling to get the final input representation. The encoded i-th most frequent pattern is defined in equation 13, where j ∈ {BiLSTM, USE} and i ∈ [1..k] (keeping only the k most frequent patterns allows dealing with spurious sequences), and the average representation of the k patterns is noted pat. Similarly to CosF and KullF, we define a family of pattern-based features PatF in equation 14.
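The encode-then-pool step can be sketched as follows. We stand in a toy deterministic token encoder for the BiLSTM/USE used in the paper, so the example is self-contained; only the average-pooling structure mirrors the described method:

```python
import numpy as np

def encode_pattern(tokens: list[str], dim: int = 8) -> np.ndarray:
    # Stand-in sentence encoder (the paper uses a BiLSTM or USE): each
    # token gets a deterministic pseudo-random vector, and the pattern
    # representation is the mean of its token vectors.
    vecs = []
    for tok in tokens:
        rng = np.random.default_rng(sum(ord(c) for c in tok))
        vecs.append(rng.standard_normal(dim))
    return np.mean(vecs, axis=0)

def pattern_feature(patterns: list[list[str]], dim: int = 8) -> np.ndarray:
    # Average pooling over the k most frequent encoded patterns.
    return np.mean([encode_pattern(p, dim) for p in patterns], axis=0)
```

Swapping `encode_pattern` for a trained BiLSTM or a USE call would give the actual pat feature.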
In the next section, we present the multitask settings that have been implemented to take into account relations between lexico-semantic relations.

Multitask Settings
Multitask architectures have been shown to successfully combine closely related lexico-semantic relations. Within this scope, the fully-shared architecture has systematically been implemented (Attia et al., 2016; Balikas et al., 2019; Bannour et al., 2020); it relies on a unique shared representation capable of solving the different tasks learned concurrently from a given input.
However, the shared-private model has proved to boost results for text classification (Liu et al., 2017). In particular, a shared-private network combines N + 1 different representations (one shared and N task-specific). As such, the shared layer should transfer the joint information contained in all tasks, while private layers should focus on the specific information of each task.
Moreover, N + 1 different input representations may coexist in the shared-private case, while a unique input representation exists for fully-shared models. Here, we propose to implement both fully-shared and shared-private architectures for different combinations of input representations and features X = ($\vec{w}_1 \oplus \vec{w}_2$, CosF, KullF, PatF). In particular, forward selection (Kohavi and Sommerfield, 1995) is used for feature selection, as the search space is huge ($2^{10}$ possible combinations).

Multitask Architectures
The neural architectures are presented in figure 1 for two tasks. Formally, let $X_k$ be an input vector; we compute a shared layer $S(X_k)$ as in equation 15, where $W^S_k$ is a weight matrix, $b^S_k$ a bias vector, and $k \in [1, K]$ (K being the number of shared layers).
A private layer $H_j(Z_q)$, which solves task $T_j$, is defined in equation 16. For the fully-shared architecture, $Z_1 = S(X_K)$, and for the shared-private model, $Z_1 = S(X_K) \oplus X_i$, where $X_i$ is the specific input vector for task $T_i$. Finally, the N decisions are defined in equation 17.
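A minimal NumPy forward pass of a two-task shared-private network, under our assumptions about layer sizes and activations (ReLU hidden layers, sigmoid binary decisions), can be sketched as:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def layer(dim_in: int, dim_out: int):
    # Small random weights and zero bias (illustrative initialisation).
    return rng.standard_normal((dim_out, dim_in)) * 0.1, np.zeros(dim_out)

d_in, d_hid = 6, 4
W_s, b_s = layer(d_in, d_hid)            # shared layer S
W_p1, b_p1 = layer(d_hid + d_in, d_hid)  # private layer, task 1
W_p2, b_p2 = layer(d_hid + d_in, d_hid)  # private layer, task 2
w_o1, w_o2 = rng.standard_normal(d_hid), rng.standard_normal(d_hid)

def shared_private(x, x1, x2):
    s = relu(W_s @ x + b_s)
    # Each private layer sees the shared representation concatenated with
    # its task-specific input (Z = S(X) ⊕ X_i, as in the text).
    h1 = relu(W_p1 @ np.concatenate([s, x1]) + b_p1)
    h2 = relu(W_p2 @ np.concatenate([s, x2]) + b_p2)
    return sigmoid(w_o1 @ h1), sigmoid(w_o2 @ h2)
```

Dropping the task-specific concatenation (feeding only `s` to both output heads) recovers the fully-shared variant.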
The parameters are updated by minimising the binary cross-entropy. Hence, the weights of the shared layer are updated by alternately minimising the loss function of each task, while the private layers are updated only for their specific task.

Forward Selection
In order to optimize the feature combination for all N + 1 tasks and thus find the best input vectors for the shared and private layers (i.e. X, X1 and X2 in figure 1), we perform forward selection. As such, we first train the given model to find the best combination of features within a given family (i.e. within CosF, KullF and PatF individually). Once the best within-family combination has been defined for all families, we train the model for all combinations of the best within-family combinations of features. Note that for the shared-private architecture, we first train the private models independently to determine Xi (i ∈ [1, N]) and, based on these findings, we train the shared-private model to determine X, constrained by the previously learned private models with input Xi.
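Greedy forward selection itself is straightforward; a generic sketch (the `score` callback stands in for training and evaluating the model on a candidate feature set) is:

```python
def forward_selection(features: list[str], score) -> list[str]:
    """Greedy forward selection: repeatedly add the feature that most
    improves the validation score, stopping when no feature helps."""
    selected: list[str] = []
    best = float("-inf")
    remaining = list(features)
    while remaining:
        # Pick the candidate whose addition maximises the score.
        cand = max(remaining, key=lambda f: score(selected + [f]))
        cand_score = score(selected + [cand])
        if cand_score <= best:
            break  # no remaining feature improves the current set
        selected.append(cand)
        remaining.remove(cand)
        best = cand_score
    return selected
```

In the paper's setting, `score` would train the private (or shared-private) model with the candidate feature set and return its validation performance.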

Datasets
There exists a large body of related work on the identification of lexico-semantic relations. The first gold-standard dataset, WEEDS, was proposed by (Weeds et al., 2004) in the context of studies on measures of lexical similarity. Following the same objective, (Baroni and Lenci, 2011) introduced the well-known BLESS dataset, and (Santus et al., 2016) compiled the ROOT9 dataset, which contains word pairs randomly extracted from EVALution (Santus et al., 2015), Lenci/Benotto (Benotto, 2015) and BLESS (Baroni and Lenci, 2011). Within the context of concurrent identification of lexico-semantic relations, (Balikas et al., 2019) recently introduced the RUMEN dataset to include synonymy. As the patterns are not included in the original datasets, we downloaded the English Wikipedia dump and extracted all patterns that do not exceed a maximum length of 10 words. All datasets are summarized with their specific characteristics in Table 2.
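The pattern-extraction step described above can be sketched as follows, with a hypothetical tokenised corpus; the sketch keeps only token sequences of at most `max_len` words occurring between the two target words:

```python
def extract_patterns(corpus_sents: list[str], w1: str, w2: str,
                     max_len: int = 10) -> list[list[str]]:
    """Collect the token sequences appearing between w1 and w2 in a
    corpus, keeping only those of at most max_len tokens."""
    patterns = []
    for sent in corpus_sents:
        toks = sent.split()
        if w1 in toks and w2 in toks:
            i, j = toks.index(w1), toks.index(w2)
            if i < j and (j - i - 1) <= max_len:
                patterns.append(toks[i + 1:j])
    return patterns
```

A real pipeline over the Wikipedia dump would add sentence splitting, lemmatisation and frequency counting to keep only the k most frequent patterns per pair.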

Learning Configurations
Experiments rely on GloVe embeddings as well as spherical text embeddings (JoSE) (Meng et al., 2019). All state-of-the-art models presented in section 6 have been implemented to provide average results and perform statistical tests.

Lexical Split
As suggested in (Levy et al., 2015), lexical split is applied to all our experiments so that there is no vocabulary intersection between the test set and the train/validation sets. Note that for learning purposes, each dataset is split into train (50%), validation (20%) and test (30%) sub-datasets.
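One simple way to realise a lexical split (a sketch under our assumptions, since the paper does not detail its procedure) is to partition the vocabulary first and keep only the pairs whose two words fall on the same side, discarding straddling pairs:

```python
import random

def lexical_split(pairs: list[tuple], train_ratio: float = 0.7, seed: int = 0):
    """Split word pairs so that train and test share no vocabulary:
    partition the vocabulary, then keep only pairs fully inside one side."""
    vocab = sorted({w for pair in pairs for w in pair[:2]})
    rng = random.Random(seed)
    rng.shuffle(vocab)
    cut = int(train_ratio * len(vocab))
    train_vocab, test_vocab = set(vocab[:cut]), set(vocab[cut:])
    train = [p for p in pairs if p[0] in train_vocab and p[1] in train_vocab]
    test = [p for p in pairs if p[0] in test_vocab and p[1] in test_vocab]
    return train, test
```

By construction, no word can appear on both sides, which is the property the evaluation protocol requires.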

Evaluation
All comparative results against four state-of-the-art models (Vu and Shwartz, 2018; Balikas et al., 2019; Bannour et al., 2020) are presented in Table 3 for an average of 25 runs, with statistical significance evidenced over four gold-standard datasets.

Private Models
We first start by analysing the impact of feature combination on private models, i.e. when a unique lexico-semantic relation is taken into account in the learning process (source code is available at https://github.com/Houssam93/Feature-Focus-in-Multi-Task-Learning-NLP). This stands for the first four rows of Table 3. Unsurprisingly, the introduction of a combination of (possibly new) features (Best MLP) outperforms existing models (Vu and Shwartz, 2018) and the multilayer perceptron (MLP) that only includes word embedding concatenation (i.e. the simplest baseline). Note that the Best MLP model subsumes the architectures of  and (Vu and Shwartz, 2018), as it allows the combination of all family features as input.
To better understand the impact of feature engineering, we illustrate results involving all combinations of within-family features and all combinations of the best between-family features in figure 2 (a). Within the CosF family alone (i.e. when only cosine-based metrics are used for the learning process), results clearly evidence the dimensional issue, with cos and cosG1D, the one-dimensional metrics, showing the worst individual results. The second important finding lies in the fact that metric combination steadily improves over individual metrics. In particular, (cosG, cosBr, cosG1D) gives rise to the strongest results in the vast majority of cases, particularly for hypernymy.

Table 3: Overall results for all architectures with GloVe embeddings. Lexical split is applied. , † and + denote p-value ≤ 0.05 based on the t-Test assuming unequal sample variances of metric values between respectively (Best MLP) against (Vu and Shwartz, 2018), (Best SP) against (Best MLP), and (Best SP) against (Bannour et al., 2020).
Within the KullF family alone (red dots in figure 2), results seem to indicate that kullBr is the least performing feature (alone and in combination), although regularities are difficult to establish as different results can be observed depending on the dataset. Similarly to the previous observation, the combination of asymmetric features provides improved results in the vast majority of cases, suggesting that individual values encode complementary information.
Within the PatF family (black dots in figure 2), the BiLSTM encoding seems to provide superior results to the USE encoding, but more importantly, results clearly show that pattern-based features can be a strong cue for the classification process provided that a large number of patterns can be extracted, as shown for ROOT9 (see Table 2 for the number of patterns).
More surprisingly, the CosF features steadily yield stronger results than the KullF and PatF features for asymmetric relations (hypernymy and meronymy), thus suggesting that symmetry is an important characteristic for all relations.
Finally, results clearly show that the combinations of the best features per family steadily outperform the results of individual family features, thus demonstrating their complementarity. In particular, symmetric and asymmetric distributional features combine successfully for asymmetric relations, while pattern-based and cosine-based features combine successfully for co-hyponymy. However, only symmetric distributional features allow maximum performance for synonymy, which can easily be understood as this is a symmetric relation. To strengthen these comments, we give the distribution of features for the best configurations in Table 4 (first row) for all datasets and relations.

Multitask Models
Results of the multitask architectures are presented in rows 5-8 of Table 3. In particular, the Best Fully-shared network stands for the model of Balikas et al. (2019) with an optimized set of input features, as opposed to their original settings, which rely on the mere concatenation of word embeddings. Figures clearly show the superiority of the shared-private network (Best SP) over the fully-shared model (Best FS) in most cases, suggesting that the combination of private and shared information is beneficial to the decision process. However, the Best MLP is a hard model to beat: the Best SP statistically outperforms the former architecture 4 times out of 8, and 2 times out of 8 without statistical significance. The contrary is only true for ROOT9 (wrt. the F1 score), where Best MLP statistically exceeds Best SP.

The important issue in shared-private architectures is to understand how well they distribute the feature space between private and shared layers. For that purpose, we analyse figure 2 (b), which shows feature combinations for the shared layer, i.e. when two tasks are learned concurrently. Note that in this case, the best combinations from the private models (learned separately) restrict the learning process. The first main conclusion is that asymmetric distributional features (KullF) steadily compete with cosine-based features (CosF), even clearly outperforming the latter for BLESS, which is definitely not the case within private models. The same conclusion can be drawn for pattern-based features (PatF), whose impact is much more important in the shared layers than in the private models when compared to CosF. This suggests that while private models focus more on symmetric features, shared-private models take advantage of asymmetric features to capture task dissimilarity (indeed, in the concurrent tasks there is always at least one asymmetric task).
Another interesting observation is that the best models are usually not a combination of different family features: only 2 cases out of 8 show improved results with feature combination. In fact, such results suggest that private and shared layers distinctively balance the family feature space. We clearly see this situation in Table 4 by looking at the complementarity of the input feature vectors of private (row 1) and shared-private models (row 3). For instance, when maximizing the hypernymy task within the shared-private model over RUMEN, the private input vectors are (cosG, cosBr, cosG1D, kull, pat*,BiLSTM) for hypernymy and (cosG, cosBr) for synonymy, while the shared input vector is (kullG, kullBr, kullG1D). It is worth noticing that this situation does not hold for the fully-shared models, as they are clearly biased towards cosine-based metrics and rarely include asymmetric distributional and pattern-based features.

Spherical text embeddings
We propose to compare our feature-based architectures with relational embeddings, namely JoSE (Meng et al., 2019), the underlying idea being to understand how feature-based strategies compare with and possibly complement fine-tuned neural semantic spaces. Results are illustrated in Table 5.
Results of the baseline MLP model do not evidence a clear advantage of relational embeddings over general-purpose ones like GloVe, with BLESS being the only exception. However, it is interesting to notice that the improvement brought by combinations of features is much larger for JoSE embeddings. Indeed, while the MLP model with GloVe outperforms the JoSE version 5 times out of 8, the Best MLP model with JoSE outperforms the GloVe version 5 times out of 8, thus suggesting that spherical embeddings are sensitive to feature engineering.
Finally, while shared-private architectures provide the overall best results, a clear distinction between the two embeddings is difficult to establish, although a small tendency towards JoSE embeddings seems to emerge. Indeed, while the hypernymy relation is better tackled by relational embeddings (3 out of 4 configurations), meronymy is better handled by GloVe despite being an asymmetric relation. With respect to symmetric relations (synonymy and co-hyponymy), the situation slightly converges towards relational embeddings, with better results in 2 out of 3 experiments.

Conclusions
In this paper, we proposed the definition of asymmetric distributional features in continuous spaces based on the Kullback-Leibler divergence, and suggested combining them with families of symmetric distributional and pattern-based features using a feature selection process. We then analysed the impact of feature combination in multitask settings that combine private and shared layers. Results evidenced the benefits of feature combination in the private models, and highlighted the importance of asymmetric (distributional and paradigmatic) features in the shared layers. Moreover, shared-private architectures showed the capacity to balance feature families between private and shared layers, thus taking full advantage of most features in the decision process.