SHIKEBLCU at SemEval-2020 Task 2: An External Knowledge-enhanced Matrix for Multilingual and Cross-Lingual Lexical Entailment

Lexical entailment recognition plays an important role in tasks such as Question Answering and Machine Translation. Multilingual and cross-lingual lexical entailment (LE) prediction, two important branches of lexical entailment, form the two subtasks of SemEval-2020 Task 2. In previous monolingual LE studies, researchers leverage external linguistic constraints to transform word embeddings for the LE relation. In our system, we expand the number of external constraints in multiple languages to obtain more specialised multilingual word embeddings. For the cross-lingual subtask, we apply a bilingual word embedding mapping method in the model. The mapping method takes specialised embeddings as inputs and retains the embeddings' LE features after the mapping. Our results for the multilingual subtask are about 20% and 10% higher than the baseline in graded and binary prediction, respectively.


Introduction
Lexical entailment (LE) refers to the hyponymy-hypernymy relation, also known as TYPE-OF or IS-A, which is a fundamental asymmetric lexical relation (Vulić et al., 2017). It is a basic requirement for tasks like Question Answering (QA) and Recognizing Textual Entailment (RTE). More general reasoning over cross-lingual and multilingual LE relationships can improve language understanding in multilingual contexts (Upadhyay et al., 2018). Cross-lingual LE recognition is crucial to tasks such as recognizing cross-lingual textual entailment (Conneau et al., 2018) and machine translation (Padó et al., 2009). Predicting binary and graded scores for multilingual and cross-lingual lexical entailment is the task of SemEval 2020 Task 2 (Glavaš et al., 2020).
There are two subtasks. Subtask A is to predict binary or graded LE for a monolingual pair of words, e.g., (building, construction), in each of six languages (English, German, Italian, Croatian, Turkish and Albanian). Subtask B, predicting cross-lingual LE, gives a pair of words in two different languages, each marked with a language prefix, e.g., (en dinosaur, de kreatur). The six languages combine into 15 cross-lingual pairs, grouped into the DE-X, EN-X, IT-X, HR-X and SQ-X sets. Each subtask includes both binary and graded prediction for a given pair of words. Binary prediction determines whether there is an LE relation between two concepts, while graded lexical entailment (GR-LE) measures the strength of the LE relation on a continuous 0-6 scale (Vulić et al., 2017; Rei et al., 2018). For instance, (apple, fruit) receives a score of 6, while the pair (apple, flower) gets 1.2, because an apple is a kind of fruit rather than a kind of flower. Since LE is an asymmetric relation, the score of (fruit, apple) is not the same as that of (apple, fruit).
Research on monolingual GR-LE mainly focuses on training LE-specialised word embeddings based on external linguistic constraints (Vulić and Mrkšić, 2018; Kamath et al., 2019). The constraints, namely external knowledge, are synonymy pairs such as (pretty, beautiful), antonymy pairs such as (nice, bad) and lexical entailment pairs such as (sandwich, food). They are useful for the LE relation and are extracted from lexical resources such as WordNet. The number of external constraints determines how many word embeddings get trained, so for each language the amount of external knowledge needs to be large enough.
For cross-lingual GR-LE prediction, previous works transfer the space from the source language to the target language. The basis of these methods is to obtain unified bilingual word embeddings, which can also be trained by mapping the spaces of two languages into a shared one based on a bilingual dictionary (Ruder, 2017). After the bilingual embeddings are obtained, the operations in cross-lingual models are similar to those in the monolingual one.
In our work, we apply a large number of external constraints to train specialised word embeddings for monolingual LE, and we introduce a bilingual word embedding mapping method for the cross-lingual subtask. The inputs of the mapping method are the LE-specialised word embeddings that are the outputs of subtask A. Several experiments are conducted to examine the effect of external constraints. Our contributions are as follows:
• We expand the number of external constraints in the six given languages and use them in the monolingual LE model, surpassing the baseline in every language.
• We merge the bilingual word embedding mapping method with the monolingual LE model for cross-lingual LE prediction.
• We conduct experiments with different numbers and kinds of external constraints to show the constraints' effectiveness.

System Description
The components of our system are shown in Figure 1. We propose a system that focuses on transforming word embeddings for the LE relation. For the monolingual LE subtask, the inputs are word pairs and external lexical constraints in the same language. We employ LEAR (Vulić and Mrkšić, 2018) to specialise the input vectors based on external constraints. The method was originally used for English GR-LE; here we apply it to all six languages with more external constraints added. After transforming the input embeddings, the outputs are monolingual LE-specialised word embeddings (Section 2.1). Next, to solve the cross-lingual subtask, we treat the outputs of subtask A as inputs to a bilingual mapping method that maps two different vector spaces into a shared one (Section 2.2). The final step of both subtasks is to score the entailment relation using the trained word embeddings (Section 2.3).

The main part of our system for subtask A is the same as LEAR (Lexical Entailment Attract-Repel) (Vulić and Mrkšić, 2018). It is a post-processing method that fine-tunes the embeddings of words observed in external linguistic constraints. The constraints consist of synonymy pairs S such as (nice, kind), antonymy pairs A such as (poor, rich), and lexical entailment pairs L such as (apple, fruit), i.e., C = S ∪ A ∪ L. The model defines two symmetric objectives: the ATTRACT (Att) objective aims to pull synonymy pairs (ATTRACT pairs) closer, while the REPEL (Rep) objective pushes antonymy pairs (REPEL pairs) away from each other. Meanwhile, the model adjusts vector norms so that higher-level concepts have larger norms and lower-level concepts have smaller norms in Euclidean space.

Figure 2: The transformations of the input word embeddings shown in a 2D plane. (a) The original pre-trained input word embeddings. (b) Step 1: adjusting word embedding directions by pulling similar words closer or pushing opposite words away; the vectors of (broccoli, food, vegetable) and (building, construction) move closer according to the external constraints. (c) Step 2: adjusting norms to reflect concept level according to the LE constraints; as described by the lexical entailment pairs L, building is a kind of construction and broccoli is a type of vegetable as well as a kind of food, so food and construction are higher-level concepts and their norms are larger than the others.

The set of K word pairs for which the Att or Rep score is to be computed is denoted by $B$. These pairs are referred to as the positive examples. The set of corresponding negative examples $T$ is created by coupling each positive ATTRACT example $(x_l, x_r)$ with a negative example pair $(t_l, t_r)$, where $t_l$ is the vector closest (within the current batch, in terms of cosine similarity) to $x_l$, and $t_r$ is the vector closest to $x_r$. The Att objective for a batch of ATTRACT constraints $B_A$ is then given as:

$$\mathrm{Att}(B_A, T_A) = \sum_{i=1}^{K} \Big[ \tau\big(\delta_{att} + \cos(x_l^i, t_l^i) - \cos(x_l^i, x_r^i)\big) + \tau\big(\delta_{att} + \cos(x_r^i, t_r^i) - \cos(x_l^i, x_r^i)\big) \Big] \qquad (1)$$

where $\tau(x) = \max(0, x)$ is the hinge loss and $\delta_{att}$ is the similarity margin imposed between the negative and positive vector pairs. Similarly, for each positive REPEL pair $(x_l, x_r)$, the negative example pair $(t_l, t_r)$ couples the vector $t_l$ that is most distant from $x_l$ and the vector $t_r$ that is most distant from $x_r$. The Rep objective for a batch of REPEL word pairs $B_R$ is then defined as:

$$\mathrm{Rep}(B_R, T_R) = \sum_{i=1}^{K} \Big[ \tau\big(\delta_{rep} + \cos(x_l^i, x_r^i) - \cos(x_l^i, t_l^i)\big) + \tau\big(\delta_{rep} + \cos(x_l^i, x_r^i) - \cos(x_r^i, t_r^i)\big) \Big] \qquad (2)$$

where $\delta_{rep}$ is the corresponding margin for REPEL pairs. In addition to these two objectives, LEAR defines a regularisation term to preserve the useful semantic content of the original distributional vector space. Let $V(B)$ denote the set of distinct words in a constraint batch $B$; the regularisation term is then $\mathrm{Reg}(B) = \lambda_{reg} \sum_{x \in V(B)} \lVert y - x \rVert_2$, where $y$ is the transformed vector of any vector $x$, and $\lambda_{reg}$ is the regularisation factor.
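The two symmetric objectives described above can be illustrated with a minimal pure-Python sketch. It is not the LEAR implementation: the function names, the margin values passed in, and the choice to skip a pair's own vectors when picking in-batch negatives are assumptions made for this example.

```python
import math

def cos(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def hinge(x):
    """tau(x) = max(0, x)."""
    return max(0.0, x)

def pick_negative(anchor, pair, batch, most_similar=True):
    """Pick the in-batch vector closest to (or most distant from) `anchor`.

    Vectors from the anchor's own pair are skipped (an assumption of this sketch).
    """
    candidates = [v for p in batch if p is not pair for v in p]
    key = lambda v: cos(anchor, v)
    return max(candidates, key=key) if most_similar else min(candidates, key=key)

def attract_loss(batch, delta_att):
    """Hinge loss that is zero once each synonym pair is closer than its negatives by delta_att."""
    total = 0.0
    for pair in batch:
        x_l, x_r = pair
        t_l = pick_negative(x_l, pair, batch, most_similar=True)
        t_r = pick_negative(x_r, pair, batch, most_similar=True)
        pos = cos(x_l, x_r)
        total += hinge(delta_att + cos(x_l, t_l) - pos)
        total += hinge(delta_att + cos(x_r, t_r) - pos)
    return total

def repel_loss(batch, delta_rep):
    """Hinge loss that is zero once each antonym pair is farther apart than its negatives by delta_rep."""
    total = 0.0
    for pair in batch:
        x_l, x_r = pair
        t_l = pick_negative(x_l, pair, batch, most_similar=False)
        t_r = pick_negative(x_r, pair, batch, most_similar=False)
        pos = cos(x_l, x_r)
        total += hinge(delta_rep + pos - cos(x_l, t_l))
        total += hinge(delta_rep + pos - cos(x_r, t_r))
    return total

# Toy 2-D batch: two synonym pairs whose members already point in similar directions.
syn_batch = [([1.0, 0.0], [1.0, 0.1]), ([0.0, 1.0], [-0.1, 1.0])]
print(attract_loss(syn_batch, delta_att=0.1))
```

In an actual training loop these losses would be minimised by gradient descent over the word vectors; the sketch only evaluates them for fixed vectors.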
The most important objective of the method is an asymmetric distance-based objective that rearranges the norms of the vectors, so that the specialised vectors reflect the asymmetry of the LE relation. We adopt the best-performing asymmetric objective from Vulić and Mrkšić (2018), computed over a batch of LE constraints B_L. The full objective is then the sum of the Att, Rep, LE and regularisation terms.

Figure 2 shows the whole transformation of the word embeddings. First, the model adjusts the vector directions according to the Att or Rep objective; this step captures the symmetric similarity of word pairs. The next step rearranges the vector norms according to LE(B_L), so that the norms reveal the concepts' level. The final transformed word embeddings are saved for the following task.
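To illustrate how vector norms can carry the asymmetry of the LE relation, the sketch below uses a normalised norm difference, (|x| − |y|) / (|x| + |y|), which is one of the asymmetric distances proposed with LEAR. Which variant is used in this system is not stated above, so that choice, the function names, and the simple sum with cosine distance (in the spirit of the scoring step in Section 2.3) are assumptions of this example.

```python
import math

def norm(v):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(a * a for a in v))

def cosine_distance(x, y):
    """1 - cosine similarity; symmetric in x and y."""
    dot = sum(a * b for a, b in zip(x, y))
    return 1.0 - dot / (norm(x) * norm(y))

def norm_distance(x, y):
    """Assumed asymmetric distance: negative when x has the smaller norm."""
    nx, ny = norm(x), norm(y)
    return (nx - ny) / (nx + ny)

def le_term(batch):
    """Sum of norm distances over (hyponym, hypernym) pairs.

    Minimising it pushes hyponym norms below hypernym norms.
    """
    return sum(norm_distance(x, y) for x, y in batch)

def le_score(x, y):
    """Lower values indicate stronger evidence that x entails y (assumed combination)."""
    return cosine_distance(x, y) + norm_distance(x, y)

# Toy vectors: similar direction, but the hypernym has the larger norm.
apple = [1.0, 0.2]
fruit = [3.0, 0.9]
print(le_score(apple, fruit), le_score(fruit, apple))
```

Because the cosine part is symmetric, any difference between the two printed scores comes from the norm term alone, which is what makes the score direction-sensitive.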

Subtask B: Cross-lingual LE Training
The main idea of the cross-lingual subtask is to map any two transformed vector spaces from subtask A into one shared space using a dictionary. Because the vector spaces are trained for the LE relation, the symmetric and asymmetric features are retained in the mapped spaces. We follow the bilingual word embedding mapping method proposed by Artetxe et al. (2018).
Let X and Z be the word embedding matrices in two languages for a given bilingual dictionary, so that their ith rows X_i* and Z_i* are the embeddings of the ith dictionary entry. The aim is to learn the transformation matrices W_X and W_Z so that the mapped embeddings XW_X and ZW_Z are close to each other. The core step of the method is an orthogonal transformation.
The outputs of the method are the mapped vectors of the two languages. After the bilingual word embedding training, we use these outputs and the scoring function of Section 2.3 to compute the LE score.
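The reason an orthogonal mapping keeps the LE features can be seen in a toy 2-D sketch. It is a deliberate simplification and not the procedure of Artetxe et al. (2018): it fits a single rotation from the source space onto a fixed target space using the closed-form 2-D solution, whereas the method above learns W_X and W_Z and includes further steps. All names and data are hypothetical.

```python
import math

def fit_rotation(X, Z):
    """Angle of the 2-D rotation R that minimises sum ||R x_i - z_i||^2
    over the dictionary-aligned rows of X and Z."""
    s = sum(x[0] * z[1] - x[1] * z[0] for x, z in zip(X, Z))
    c = sum(x[0] * z[0] + x[1] * z[1] for x, z in zip(X, Z))
    return math.atan2(s, c)

def rotate(v, theta):
    """Apply the rotation by angle theta to a 2-D vector."""
    c, s = math.cos(theta), math.sin(theta)
    return [c * v[0] - s * v[1], s * v[0] + c * v[1]]

def norm(v):
    return math.sqrt(v[0] * v[0] + v[1] * v[1])

# Toy dictionary: row i of X (source language) translates to row i of Z (target language).
# Here Z is simply X rotated by 90 degrees, so a perfect mapping exists.
X = [[1.0, 0.0], [0.0, 2.0], [3.0, 1.0]]
Z = [rotate(x, math.pi / 2) for x in X]

theta = fit_rotation(X, Z)
mapped = [rotate(x, theta) for x in X]

# A rotation leaves every norm unchanged, so the norm-encoded concept level survives.
for x, m in zip(X, mapped):
    print(round(norm(x), 6), round(norm(m), 6))
```

Rotations also preserve the angles between vectors, so both the symmetric (cosine) and asymmetric (norm) information in the specialised embeddings is carried into the shared space.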

Scoring Lexical Entailment
After obtaining LE-specialised word embeddings, LE scores are given by a distance function I_LE(x, y) that reflects both the cosine distance between the vectors and the asymmetric difference between their norms (Vulić and Mrkšić, 2018), where x and y are the vectors of any two words x and y in one subtask. We then normalise the results of the function to the range (0, 6), as the task requires. For binary detection, we simply transform the graded score into a binary label using a binarisation threshold t: if I_LE(x, y) < t, we predict that the LE relation holds between the two given concepts.

The input word embeddings are from Grave et al. (2018). All the vectors are 300-dimensional. For all languages except English, we first shrink the input vector spaces according to word frequency lists that contain 50,000 words, so that our model runs smoothly and fast. However, this means that some words have embeddings in the original larger vector space but not in the reduced space. We also notice that the datasets contain some multiword expressions, e.g., macchina per scrivere (typewriter in English), which may not have corresponding word embeddings either. To address these problems, we load the word embeddings in several ways.
First, we try to obtain the embedding from the reduced vector space. If the input word is not in the reduced space, there are three cases: it is in the original larger space, it is a multiword expression, or neither. The process is shown in Figure 3. A multiword expression is split at its underscores, and we look up the embedding of each part in the larger vector space. If any part is missing from that space, the embedding of the multiword expression is randomly initialised. If all parts are found, the final embedding of the multiword expression is the average of the embeddings of its parts. Words that fall into none of the above cases are assigned random word embeddings.
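The lookup order described above can be sketched as follows. The function name, the toy 2-D vocabularies, and the uniform random initialisation are assumptions of this example; the text does not specify how the random vectors are drawn.

```python
import random

def load_embedding(word, reduced, full, dim=300, rng=random):
    """Return an embedding for `word` using the fallback order described above.

    reduced / full: dicts mapping a word to a list of floats.
    """
    # 1. Reduced (frequency-filtered) space.
    if word in reduced:
        return reduced[word]
    # 2. Original larger space.
    if word in full:
        return full[word]
    # 3. Multiword expression: average the parts if every part is known.
    parts = word.split("_")
    if len(parts) > 1 and all(p in full for p in parts):
        vectors = [full[p] for p in parts]
        return [sum(column) / len(vectors) for column in zip(*vectors)]
    # 4. Anything else, including multiwords with a missing part: random vector
    #    (uniform initialisation is an assumption of this sketch).
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]

# Toy 2-D vocabularies standing in for the real 300-dimensional spaces.
full = {
    "casa": [1.0, 0.0],
    "macchina": [2.0, 0.0],
    "per": [0.0, 2.0],
    "scrivere": [1.0, 1.0],
    "raro": [0.5, 0.5],
}
reduced = {"casa": [1.0, 0.0]}

print(load_embedding("macchina_per_scrivere", reduced, full, dim=2))
```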
External Resources. For all the provided languages, one part of the external constraints is extracted from ConceptNet (Speer et al., 2017), following the idea of LEAR. For each language, word pairs in synonym and antonym relations are included as symmetric resources, and concepts in the IsA relation are regarded as asymmetric resources (see Table 1a). For English, the other part is the same set as in LEAR (Vulić and Mrkšić, 2018): synonymy and antonymy constraints from Zhang et al. (2014) and Ono et al. (2015), extracted from WordNet (Fellbaum, 1998) and Roget's Thesaurus (Kipfer, 2009), and asymmetric LE constraints, also extracted from WordNet. We add these 1,023,082 pairs of synonyms, 380,873 pairs of antonyms, and 1,545,630 LE pairs to the English external lexical constraints.

Table 1: Summary of the number of constraints used in our model.
Expansion of Lexical Constraints. With more lexical constraints, more embeddings are trained for the LE relation. However, for languages like Turkish and Albanian, the available external resources are not sufficient (see Table 1a). Therefore, instead of searching for constraints directly in such languages, we use Google Translator to translate the English constraints into the other languages and thereby expand their constraints. Furthermore, we use the translations to construct the bi-dictionaries between every two languages for the cross-lingual subtask. The final number of lexical constraints applied in our model is displayed in Table 1b.
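A minimal sketch of this expansion step is shown below. Small lookup tables stand in for the translation service, and the function names, the dropping of pairs with an untranslated word, and the construction of a bi-dictionary by pivoting through shared English words are all assumptions of this example rather than details given in the text.

```python
def translate_constraints(pairs, table):
    """Translate English word pairs with a lookup table.

    Pairs containing a word without a translation are dropped (assumption).
    """
    return [(table[a], table[b]) for a, b in pairs if a in table and b in table]

def build_bidictionary(table_src, table_tgt):
    """Pair up the two translations of every English word both tables cover."""
    shared = table_src.keys() & table_tgt.keys()
    return sorted((table_src[w], table_tgt[w]) for w in shared)

# Hypothetical translation tables standing in for translator output.
en_to_de = {"apple": "apfel", "fruit": "obst", "dog": "hund"}
en_to_tr = {"apple": "elma", "fruit": "meyve"}

# English LE constraints as (hyponym, hypernym) pairs.
en_le = [("apple", "fruit"), ("dog", "animal")]

tr_le = translate_constraints(en_le, en_to_tr)
de_tr_dictionary = build_bidictionary(en_to_de, en_to_tr)
print(tr_le)
print(de_tr_dictionary)
```

Translating word by word ignores polysemy, which is one way translated constraints can end up noisier than constraints collected directly in the target language.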

Results and Discussion
Our final results on the test set and the comparison with the baseline systems for two subtasks are shown in Table 2 and Table 3. The baseline model is from .
Subtask A. Our system surpasses the baseline in every language as a result of the expansion of external constraints. Our best-performing language is the same as the baseline's, English, but our results are higher than the baseline by 18% and 8% in graded and binary LE scores, respectively. For Albanian, the worst-performing language for the baseline, we improve the results from 0.32 to 0.56 in graded LE and from 0.57 to 0.72 in binary LE. Even for our worst-performing language, Turkish, the graded LE score is 0.53 and the binary one is 0.70, while the baseline scores are 0.43 and 0.64.

We also evaluate the system on the evaluation set to verify that increasing the amount of external lexical knowledge is beneficial to the model. We compare the performance of the model for graded LE prediction in two settings: training the model only with lexical constraints extracted from ConceptNet, and training it with all constraints, including the translated ones. Since the evaluation set does not include Albanian, the comparison covers only the other five languages. Figure 4a depicts the results for the five languages. The evaluation results show a pattern similar to the test results. The English scores are the highest, since English has far more constraints than the other languages. As Italian and German have approximately equal numbers of constraints, their scores are close to each other. Turkish is the lowest on both the evaluation and test sets.
We also analyse other factors that may affect the results. We find that, among the three kinds of external lexical constraints, LE pairs contribute the most to the model. Figure 4b shows the importance of each kind of constraint. When trained with the same number of constraints of each kind, the model performs best with LE pairs and worst with antonyms only. The number of Turkish LE constraints is the smallest among all languages. This explains why Turkish obtains the lowest results even though its total number of lexical constraints is not the smallest. The translations also influence the effectiveness of the constraints, so the Turkish LE word embeddings we obtain are less useful than those of the other languages.
Subtask B. The situation for cross-lingual LE is more complex. The bilingual mapping method is not effective for all language combinations. The baseline surpasses our system on the DE-IT, DE-TR, EN-IT, HR-TR and IT-TR sets for the graded LE task and on the DE-TR and IT-TR sets for binary LE prediction. One of the reasons is that the word embeddings from subtask A, as the inputs of the mapping method, determine its results. The Turkish specialised embeddings are not trained as well as the others, so the graded results of combinations involving Turkish are weaker.

Conclusion
We use the LEAR model to fine-tune the input word vector spaces, and we expand the number of external constraints used in the model so that more word embeddings are specialised. The results demonstrate that adding more knowledge to the model, especially LE constraints, makes the relations among the vectors more pronounced and benefits monolingual LE prediction. We also apply the specialised word embeddings together with the mapping method to predict cross-lingual LE.
For future work, we expect that better bi-dictionaries and proper handling of polysemy will improve performance. We will also consider applying external constraints in both languages to the cross-lingual model, since they work well in monolingual LE prediction.