SemEval-2020 Task 2: Predicting Multilingual and Cross-Lingual (Graded) Lexical Entailment

Lexical entailment (LE) is a fundamental asymmetric lexico-semantic relation, supporting the hierarchies in lexical resources (e.g., WordNet, ConceptNet) and applications like natural language inference and taxonomy induction. Multilingual and cross-lingual NLP applications warrant models for LE detection that go beyond language boundaries. As part of SemEval 2020, we carried out a shared task (Task 2) on multilingual and cross-lingual LE. The shared task spans three dimensions: (1) monolingual vs. cross-lingual LE, (2) binary vs. graded LE, and (3) a set of 6 diverse languages (and 15 corresponding language pairs). We offered two different evaluation tracks: (a) Dist: for unsupervised, fully distributional models that capture LE solely on the basis of unannotated corpora, and (b) Any: for externally informed models, allowed to leverage any resources, including lexico-semantic networks (e.g., WordNet or BabelNet). In the Any track, we received runs that push the state of the art across all languages and language pairs, for both binary LE detection and graded LE prediction.

Binary and Graded Lexical Entailment. For this task, we follow the definition of lexical entailment as thoroughly discussed by Vulić et al. (2017, Section 2), namely as a taxonomical asymmetric hyponymy-hypernymy or is-a relation. Although commonly treated as a binary relation ("Is X a type of Y?"), cognitive theories of concept (proto)typicality and category vagueness (Rosch, 1975; Kamp and Partee, 1995) suggest that LE is rather a graded relation: humans can perceive the degree to which the LE relation holds between concepts ("To what degree is X a type of Y?").1 The graded nature of the LE relation has been empirically validated through the human annotations crowdsourced for the HyperLex dataset (Vulić et al., 2017). Its creation catalyzed research on models for predicting graded LE (Nguyen et al., 2017; Nickel and Kiela, 2017; Tifrea et al., 2019; Le et al., 2019).
Multilingual and Cross-Lingual Lexical Entailment. Despite its potential for a variety of cross-lingual and multilingual applications such as multilingual taxonomy construction, machine translation, and multilingual natural language inference (Mihalcea et al., 2010; Negri et al., 2013; Ehrmann et al., 2014; Fu et al., 2014; Bordea et al., 2016; Conneau et al., 2018, inter alia), LE detection, and especially its graded variant, has so far been studied predominantly in monolingual settings (Geffet and Dagan, 2005; Weeds et al., 2014; Santus et al., 2014; Kiela et al., 2015; Shwartz et al., 2016; Shwartz et al., 2017; Glavaš and Ponzetto, 2017; Roller et al., 2018, inter alia), with most models and evaluations, unsurprisingly, focusing on English. Existing work on multilingual and cross-lingual LE (Vyas and Carpuat, 2016; Upadhyay et al., 2018; Kamath et al., 2019) has been rather limited, focusing predominantly on major and mutually similar languages and on binary LE detection.
1 For instance, chess is perceived as a less typical sport than basketball, but it is definitely a more typical sport than sitting.
Shared Task. Aiming to catalyze the development of models for predicting LE, we organized the shared task described in this paper. The shared task had a broad scope, covering reasoning over lexical entailment from multiple perspectives: the subtasks included both monolingual and cross-lingual setups, as well as both binary LE detection and graded LE prediction (i.e., predicting the degree to which LE holds between concepts). The task encompassed a set of 6 etymologically and typologically diverse languages (and the 15 corresponding language pairs): English (EN), German (DE), Italian (IT), Croatian (HR), Turkish (TR), and Albanian (SQ). We offered two different evaluation tracks. In the distributional (Dist) track we allowed only fully distributional systems, capturing LE solely on the basis of unannotated corpora. In contrast, the Any track invited systems that exploit any kind of additional external resources, including lexico-semantic networks. Overall, we did not observe any empirically strong systems in the Dist track, further corroborating the finding from prior work that building LE-oriented vectors distributionally is more difficult than for some other relations, such as broader semantic relatedness (Henderson and Popa, 2016). However, several runs submitted to the Any track pushed the state of the art in both binary LE detection and graded LE prediction, for most of the languages and language pairs in our evaluation.

Data
We started from the LE datasets we previously created and published (Vulić et al., 2019b), covering four languages (EN, DE, IT, HR), and extended those datasets to two new languages (TR, SQ). For completeness, we describe the details of the annotation process and the creation of the final multilingual and cross-lingual datasets for the shared task.
Starting Point: Graded LE in English. HyperLex (Vulić et al., 2017) comprises 2,616 English (EN) word pairs (2,163 noun pairs and 453 verb pairs) annotated for the graded LE relation. Unlike in symmetric similarity datasets (Hill et al., 2015; Gerz et al., 2016; Camacho-Collados et al., 2017), the word order in each pair (X, Y) is important: the pairs (X, Y) and (Y, X) can obtain drastically different graded LE ratings. The word pairs were first sampled from WordNet to represent a spectrum of different word relations (e.g., hyponymy-hypernymy, meronymy, co-hyponymy, synonymy, antonymy, no relation). Ratings on the [0, 6] interval were then collected through crowdsourcing by posing the graded LE question "To what degree is X a type of Y?" to human subjects, with each pair rated by at least 10 raters: a score of 6 indicates a perfect LE relation between the concepts X and Y (in that order), and 0 indicates the absence of the LE relation. The final score for each pair was averaged across individual ratings. The final EN HyperLex dataset reveals that gradience effects are indeed present in human annotations: it contains word pairs with ratings distributed across the entire [0, 6] rating interval.
Word Pair Translation. We first created monolingual HyperLex datasets in three target languages, German (DE), Italian (IT), and Croatian (HR), as described by Vulić et al. (2019b). For this shared task, we repeated the procedure for two more languages: Turkish (TR) and our surprise test language, Albanian (SQ). We first translated word pairs from the EN HyperLex dataset and then re-scored the translated pairs in the target language. We chose the translation approach because (1) the original EN HyperLex pairs were already carefully selected through a controlled sampling procedure (ensuring wide coverage of diverse relations), and (2) we wanted the datasets in different languages to be as comparable as possible in terms of concept coverage. The translation approach has been validated in previous work on creating multilingual semantic similarity datasets (Leviant and Reichart, 2015; Camacho-Collados et al., 2017). Most importantly, it allows for the automatic construction of cross-lingual graded LE datasets.
We followed the standard word pair translation procedure (Leviant and Reichart, 2015; Camacho-Collados et al., 2017). Each EN HyperLex pair was first translated independently by two native speakers of the target language. We observed translation agreement in the range of 80%-90% across the five target languages. Translation disagreements were resolved by a third annotator, who selected the better of the two differing translations. We allowed multi-word translations only if there was no appropriate single-word translation, e.g., typewriter (EN) → pisaći stroj (HR).
Concept Scoring and Cross-Lingual Datasets. The resulting 2,616 concept pairs in all five target languages were annotated using a procedure analogous to that of EN HyperLex: the rating interval was [0, 6], and each word pair was rated by 4 (for DE, IT, HR) or 5 (for TR, SQ) native speakers. We then constructed the cross-lingual datasets automatically, leveraging the word pair translations and scores in the five target languages. For this, we followed the established methodology used for creating cross-lingual semantic similarity datasets (Camacho-Collados et al., 2015; Camacho-Collados et al., 2017). In short, we first intersected aligned concept pairs (obtained through translation) in two languages: e.g., father-ancestor in English and otac-predak in Croatian yield the cross-lingual pairs father-predak and otac-ancestor. We then computed the graded LE score of each cross-lingual pair as the average of the corresponding monolingual scores. Finally, we retained only the cross-lingual pairs for which the corresponding monolingual scores differ by ≤ 1.0: this heuristic (Camacho-Collados et al., 2017) mitigates undesirable inter-language semantic shift. Table 1 shows examples of word pairs with score annotations from the monolingual and cross-lingual datasets.
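The construction procedure above (pairing via the shared EN source pair, averaging the two monolingual scores, and discarding pairs whose scores diverge by more than 1.0) can be sketched as follows; the pair IDs and scores are toy values, not actual dataset entries:

```python
def build_crosslingual_pairs(pairs_l1, pairs_l2, max_diff=1.0):
    """pairs_l1 / pairs_l2 map an original EN pair id to its translated
    (hyponym, hypernym) pair and its averaged graded LE score."""
    crosslingual = {}
    for pair_id in set(pairs_l1) & set(pairs_l2):  # aligned via the EN source pair
        (x1, y1), s1 = pairs_l1[pair_id]
        (x2, y2), s2 = pairs_l2[pair_id]
        if abs(s1 - s2) > max_diff:   # heuristic against inter-language shift
            continue
        avg = (s1 + s2) / 2.0         # cross-lingual score = mean of monolingual ones
        crosslingual[(x1, y2)] = avg  # two mixed-language pairs per aligned pair
        crosslingual[(x2, y1)] = avg
    return crosslingual

en_hr = build_crosslingual_pairs(
    {0: (("father", "ancestor"), 5.5), 1: (("chess", "sport"), 4.5)},
    {0: (("otac", "predak"), 5.9), 1: (("šah", "sport"), 2.0)},
)
# Pair 1 is filtered out (score difference 2.5 > 1.0); pair 0 yields
# ("father", "predak") and ("otac", "ancestor"), both scored (5.5 + 5.9) / 2.
```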
Final Shared Task Datasets. The obtained monolingual datasets vary slightly in size due to the elimination of unavoidable same-word pairs (i.e., pairs in which both English words were translated into the same target language word). The cross-lingual datasets additionally vary in size because of the elimination of cross-lingual pairs for which the scores of the corresponding monolingual pairs mutually differed by more than 1.0. For each monolingual and cross-lingual dataset we set aside 500 word pairs as the development portion and retained all remaining word pairs for the test portion used in the final evaluation. The resulting dataset sizes are given in Table 2. The distribution of graded LE scores in the monolingual test sets and (a sample of) the cross-lingual test sets is given in Figure 1 and Figure 2, respectively. The majority of pairs lie in the outer intervals (i.e., [0, 1) and [5, 6]), with this being more pronounced for the cross-lingual datasets. Nonetheless, the inner interval (i.e., [1, 5)) covers a significant portion (≈ 30%) of (evenly distributed) word pairs, confirming the gradience of the LE relation.

Tasks, Subtasks, and Tracks
Tracks. The participants were asked to designate one of the two evaluation tracks for each of their submitted runs: (1) in the Any track we allowed any kind of model/system to produce the LE predictions; the participants in this track were allowed to use external resources, including lexico-semantic networks like WordNet (Fellbaum, 1998) or BabelNet (Navigli and Ponzetto, 2012); (2) in contrast, in the Dist track, we allowed only distributional models relying exclusively on unannotated corpora (of any size). Each participant could submit at most 3 runs in each of the tracks.
Tasks and Subtasks. We defined four top-level tasks: (1) monolingual graded LE prediction, (2) monolingual binary LE detection, (3) cross-lingual graded LE prediction, and (4) cross-lingual binary LE detection. Each language in (1) and (2) (e.g., graded LE prediction for SQ) and each language pair in (3) and (4) (e.g., binary LE detection for HR-TR) instantiates one concrete subtask. Participants could submit predictions for an arbitrary subset of subtasks; they were also free to tackle only graded LE prediction or only binary LE detection.
Evaluation Metrics. For each graded LE prediction subtask, we measured the alignment between predictions and gold LE scores using Spearman's rank correlation coefficient (Spearman's ρ), in line with previous work on similar concept pair scoring datasets (Hill et al., 2015; Levy et al., 2015; Vulić et al., 2017, inter alia). For the binary LE detection subtasks, we resorted to the standard F1 measure.
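As an illustration of the two metrics, here is a minimal pure-Python sketch (the official evaluation presumably relied on standard implementations such as scipy.stats.spearmanr):

```python
def spearman_rho(gold, pred):
    """Spearman's rho: Pearson correlation computed on (tie-averaged) ranks."""
    def ranks(xs):
        order = sorted(range(len(xs)), key=lambda i: xs[i])
        r = [0.0] * len(xs)
        i = 0
        while i < len(order):                  # assign average ranks over ties
            j = i
            while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
                j += 1
            avg = (i + j) / 2.0 + 1.0
            for k in range(i, j + 1):
                r[order[k]] = avg
            i = j + 1
        return r
    rg, rp = ranks(gold), ranks(pred)
    n = len(gold)
    mg, mp = sum(rg) / n, sum(rp) / n
    cov = sum((a - mg) * (b - mp) for a, b in zip(rg, rp))
    sg = sum((a - mg) ** 2 for a in rg) ** 0.5
    sp = sum((b - mp) ** 2 for b in rp) ** 0.5
    return cov / (sg * sp)

def f1_score(gold, pred):
    """Standard F1 over binary {0, 1} labels (1 = the LE relation holds)."""
    tp = sum(1 for g, p in zip(gold, pred) if g == p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0
```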

Participating Systems
We now describe in more detail the approaches adopted by the three teams who submitted system description papers.
Team BMEAUT (Kovács et al., 2020). The BMEAUT method for LE detection is a rule-based approach that exploits Wiktionary definitions (Meyer and Gurevych, 2012) and relies on dependency parsing and semantic graphs. In the first step, the authors apply the dict_to_4lang tool (Recski et al., 2016) to Wiktionary definitions of concepts (which can be both unigrams and multi-word expressions, i.e., phrases) in order to induce directed graphs conforming to the 4lang formalism (Kornai et al., 2015). 4lang graphs are directed graphs with concepts as nodes and three types of edges: edges of type 0 denote attribution (cat −0→ four-legged), lexical entailment (cat −0→ mammal), or unary predication (cat −0→ meow); edges of type 1 and 2 denote the relations between a predicate and its subject and object, respectively. Kovács et al. (2020) first extract definitions from Wiktionary using language-specific templates. Each definition is then transformed into a 4lang graph with the help of a language-specific Universal Dependencies (Nivre et al., 2016) parser. Let (x, y) be a candidate word pair from one of our test sets. Kovács et al. (2020) start from the 4lang graph of x and include all concepts to which x has an outgoing edge of type 0. They then iteratively expand this initial 4lang graph of x by adding, in each iteration, all concepts that have an incoming edge of type 0 from any of the nodes already in the graph. If, via this expansion, y gets included in the graph, they conclude (in the binary setup) that x indeed entails y, i.e., that (x, y) is a positive LE instance. Somewhat expectedly, this rule-based approach has very high precision but relatively low recall.
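The iterative expansion over type-0 edges amounts to a reachability check in the 4lang graph. A minimal sketch (the toy edge set is illustrative, not actual 4lang output):

```python
def entails(zero_edges, x, y):
    """zero_edges maps each concept to the concepts it points to via type-0
    edges; (x, y) is a positive LE pair if y is reachable from x."""
    reached, frontier = {x}, [x]
    while frontier:
        node = frontier.pop()
        for nxt in zero_edges.get(node, ()):
            if nxt == y:
                return True   # y was reached: declare (x, y) a positive LE pair
            if nxt not in reached:
                reached.add(nxt)
                frontier.append(nxt)
    return False

# Toy 4lang-style type-0 edges: cat →0 mammal →0 animal
edges = {"cat": ["mammal", "four-legged"], "mammal": ["animal"]}
```

Because the check only fires when a definition chain actually connects x to y, the procedure is precise but misses pairs whose connection is not covered by the extracted definitions, matching the high-precision/low-recall behavior noted above.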
This is why, in the next step, the authors augment their 4lang graph containing extractions from Wiktionary definitions with hypernymy pairs from WordNet (for EN and IT they use language-specific WordNets; for DE they translate German terms to English and use the English WordNet).
The reliance on the time-consuming rule-based design of language-specific Wiktionary extractors and language-specific LE detection rules on 4lang graphs prevented the authors from submitting predictions for the lower-resource languages (HR, SQ, TR). The authors also did not submit any graded LE predictions or cross-lingual predictions. Producing reliable scores on a continuous scale has proven difficult for graph-based semantic similarity measures (Wu and Palmer, 1994; Lin, 1998), for which distributional similarity measures have been shown to perform better. We presume something similar to be the case with LE and the proposed 4lang graphs: it is inherently difficult to derive a reliable LE score from paths and distances in a symbolic representation such as a (directed) graph.
Team UAlberta (Hauer et al., 2020). The UAlberta approach to cross-lingual binary LE detection combines sentence-level translations (i.e., parallel corpora), distributional word vectors (i.e., word embeddings), and multilingual lexical resources. Their base method, dubbed BITEXT, mines candidates for cross-lingual LE from parallel corpora: they simply run the FastAlign (Dyer et al., 2013) word alignment algorithm and assume that the LE relation holds between all aligned pairs of words. As clarified by the authors, this will, in most cases, extract cross-lingual synonyms, which, strictly speaking, do satisfy the LE relation; in some other cases, alignments will be established between close (e.g., first-order) hyponymy-hypernymy pairs, in which case the bitext word alignment alone does not indicate the direction of the LE relation. The authors simply declare any pair of words from our cross-lingual datasets to stand in the LE relation if they find that pair in the list of generated word alignments. The second run, dubbed VECTORS, extends BITEXT by exploiting similarities between distributional word embeddings. To paraphrase the authors, the intuition behind VECTORS is that mutually semantically similar concepts tend to entail the same set of other concepts. Let (x, y) be a cross-lingual word alignment generated with BITEXT. Let X = {x_1, ..., x_n} be the set of terms semantically most similar to x in one language, and let Y = {y_1, ..., y_m} be the set of terms semantically most similar to y in the other language. Note that this does not require a bilingual word embedding space, merely two monolingual ones: the sets X and Y are obtained by thresholding monolingual word embedding similarities (the threshold value is tuned on the development portions of our LE datasets). Finally, all pairs (x_i, y_j) ∈ X × Y are considered to stand in the LE relation.
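A minimal sketch of the VECTORS expansion as described above; the similarity threshold and the toy vectors are illustrative (the actual threshold is tuned on the development data):

```python
def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv)

def neighbours(word, space, threshold):
    """Monolingual neighbours of `word` above a cosine-similarity threshold."""
    return {w for w, vec in space.items()
            if w != word and cosine(space[word], vec) >= threshold}

def expand_alignment(x, y, space_l1, space_l2, threshold=0.7):
    """Given an aligned pair (x, y) from BITEXT, declare every pair in the
    Cartesian product of the two neighbour sets an LE instance."""
    X = neighbours(x, space_l1, threshold) | {x}
    Y = neighbours(y, space_l2, threshold) | {y}
    return {(xi, yj) for xi in X for yj in Y}

# Toy monolingual spaces: "hound" is close to "dog"; "car" is not.
space_l1 = {"dog": [1.0, 0.0], "hound": [0.9, 0.1], "car": [0.0, 1.0]}
space_l2 = {"Hund": [1.0, 0.0]}
pairs = expand_alignment("dog", "Hund", space_l1, space_l2)
```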
Finally, in the third run, the authors couple the FastAlign-based bitext aligner with the BABELALIGN algorithm, which aligns concepts across languages based on BabelNet (Navigli and Ponzetto, 2012), a massively multilingual lexico-semantic network.
Team SHIKEBLCU (Wang et al., 2020). The approach of Wang et al. (2020) extends the well-established line of work on specializing (i.e., fine-tuning) distributional word vectors for lexical relations, be it symmetric semantic similarity (Faruqui et al., 2015; Mrkšić et al., 2017; Ponti et al., 2018) or the asymmetric LE relation (Kamath et al., 2019; Vulić et al., 2019b), using constraints from external lexico-semantic resources like WordNet for supervision. At the core of the approach is Lexical Entailment Attract-Repel (LEAR) (Vulić and Mrkšić, 2018), a retrofitting model that specializes distributional word vectors using external constraints (synonyms, antonyms, and LE pairs). Wang et al. (2020) first LE-specialize with LEAR the monolingual word embedding space of each language independently, using language-specific constraints collected from ConceptNet (Speer et al., 2017). The number of constraints obtained for languages other than EN is, however, significantly smaller, with especially few constraints for the lower-resource languages HR, TR, and SQ. Because of this, they add constraints in the target languages by translating the EN constraints via Google Translate. A similar approach of automatic constraint translation has already proven very effective in the context of symmetric similarity-based specialization of embedding spaces for low-resource languages. This way, Wang et al. (2020) obtain an LE-specialized embedding space for each language.
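LEAR-style specialization rearranges both vector directions and vector norms so that an asymmetric distance can encode the LE direction. The following is only a sketch of such a distance, assuming the cosine-plus-norm form used by LEAR-style models; the sign convention of the norm term is our assumption here, and the vectors are toy values:

```python
def le_distance(x, y):
    """Asymmetric LE distance: cosine distance plus a norm-difference term,
    so that d(x, y) != d(y, x) whenever the two norms differ."""
    nx = sum(a * a for a in x) ** 0.5
    ny = sum(b * b for b in y) ** 0.5
    cos = sum(a * b for a, b in zip(x, y)) / (nx * ny)
    # Assumed convention: the distance shrinks when the candidate
    # hyponym x has the smaller norm of the two.
    return (1.0 - cos) + (nx - ny) / (nx + ny)
```

The key property, independent of the sign convention, is asymmetry: flipping the order of a pair with unequal norms changes the score, which is exactly what a symmetric cosine measure cannot provide.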
Following that, in the second step, they learn a linear projection between the LE-specialized monolingual spaces with the VecMap tool for inducing bilingual word embedding spaces (Artetxe et al., 2018). Recent comparative evaluations (Vulić et al., 2019a) identified VecMap as one of the most robust algorithms for inducing cross-lingual embedding spaces. The word translations obtained with Google Translate when translating the EN constraints are also fed to VecMap as supervision for inducing the bilingual embedding spaces.

Official Evaluation
We now report the official results of our evaluation. We first describe the baselines (Section 5.1) and then present the performance of all submitted runs (Section 5.2).

Baselines
For the Dist track we use the simple cosine similarity between distributional word vectors as a baseline. To this end, we use the 300-dimensional FastText embeddings (Bojanowski et al., 2017) trained on the Wikipedias of the respective languages. For the cross-lingual (sub)tasks we induce bilingual embedding spaces via simple Procrustes alignment (Smith et al., 2017), using 5K-word translation dictionaries. Since LE is an asymmetric relation and cosine similarity is a symmetric measure, we did not expect this baseline to be particularly competitive and expected most participants to outperform it.
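The Procrustes step of this cross-lingual baseline can be sketched as follows (toy 2-dimensional vectors stand in for the real FastText embeddings; the orthogonal map has the standard closed-form SVD solution):

```python
import numpy as np

def procrustes(src, tgt):
    """Orthogonal W minimizing ||src @ W - tgt||_F (closed-form SVD solution)."""
    u, _, vt = np.linalg.svd(src.T @ tgt)
    return u @ vt

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Seed dictionary: row i of src translates to row i of tgt.
src = np.array([[1.0, 0.0], [0.0, 1.0]])
tgt = np.array([[0.0, 1.0], [-1.0, 0.0]])  # target space = source rotated by 90°
W = procrustes(src, tgt)

# After mapping, a translation pair lands on (nearly) the same vector, and
# cosine similarity in the shared space serves as the (symmetric) LE score.
score = cosine(src[0] @ W, tgt[0])
```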
For the Any track, we used GLEN, our recent neural explicit specialization model for LE, as a competitive baseline. GLEN is a simple feed-forward network that learns to specialize distributional word vectors for LE based on three types of lexico-semantic constraints (originating primarily from WordNet): synonyms, antonyms, and LE constraints. Unlike in standard retrofitting models (Faruqui et al., 2015; Mrkšić et al., 2017), in explicit retrofitting the constraints are not used to directly tune the vectors of the words from those constraints, but are rather exploited as training examples for learning a general specialization function (in the case of GLEN, a feed-forward network). This way, explicit retrofitting models (Glavaš and Ponzetto, 2017) can specialize any distributional vector from the original monolingual space, and not just the vectors of words appearing in the constraints, as is the case with classic retrofitting models. We use the pre-trained GLEN instance from our original work, which was trained using only EN constraints. To make GLEN applicable to the monolingual tasks in other languages, as well as to the cross-lingual tasks, we first project the monolingual embeddings of the other languages to the EN monolingual embedding space. For more details, we refer the reader to the original paper.
Both baselines (cosine similarity in the Dist track and GLEN in the Any track) produce real-valued scores which can be directly used in our graded LE evaluations. For the binary LE detection subtasks, we first binarize the scores at some threshold value: for both baselines, we tune the threshold on the respective development dataset portions.
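The threshold binarization for both baselines can be sketched as follows; the dev scores and labels are toy values, and sweeping over the observed scores is just one simple way to implement the tuning:

```python
def binarize(scores, threshold):
    return [1 if s >= threshold else 0 for s in scores]

def f1(gold, pred):
    tp = sum(g == p == 1 for g, p in zip(gold, pred))
    fp = sum(g == 0 and p == 1 for g, p in zip(gold, pred))
    fn = sum(g == 1 and p == 0 for g, p in zip(gold, pred))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def tune_threshold(dev_scores, dev_labels):
    """Pick the threshold maximizing F1 on the development portion."""
    best_t, best_f1 = 0.0, -1.0
    for t in sorted(set(dev_scores)):  # every observed score is a candidate
        score = f1(dev_labels, binarize(dev_scores, t))
        if score > best_f1:
            best_t, best_f1 = t, score
    return best_t

t = tune_threshold([0.1, 0.4, 0.6, 0.9], [0, 0, 1, 1])
# With threshold 0.6, the dev predictions match the gold labels exactly.
```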

Results
Monolingual Results. In Table 3 we report the results for all monolingual binary LE detection subtasks; Table 4 displays the results for the corresponding graded LE subtasks. Unfortunately, we did not see any encouraging results in the Dist track: the two runs submitted for binary LE detection fail to outperform the simple cosine similarity baseline, and we received no submissions for graded LE prediction in the Dist track. Despite the limited number of overall submissions, this also points to the complexity of graded LE reasoning based solely on raw text data and the distributional signal.
The results in the Any track are, however, much more exciting. Both Wang et al. (2020) and Kovács et al. (2020) manage to outperform our competitive baseline GLEN in binary LE detection, in some cases by a fairly wide margin (e.g., BMEAUT for EN and IT, SHIKEBLCU for HR and SQ). The results achieved by SHIKEBLCU on the graded monolingual LE tasks (Table 4) are even more encouraging and truly push the state of the art in graded multilingual LE prediction: the improvements over GLEN amount to ≥ 20 Spearman correlation points for the low-resource languages in our evaluation (TR, HR, SQ). Both BMEAUT and SHIKEBLCU use language-specific constraints (BMEAUT by extracting 4lang graphs from language-specific Wiktionaries, and SHIKEBLCU by translating the EN constraints from WordNet), whereas GLEN relies only on EN constraints and then transfers the LE specialization to other languages via cross-lingual word embeddings. The especially good performance in graded LE of SHIKEBLCU, who translate large constraint sets to the target languages, is aligned with a similar finding established for the symmetric relation of semantic similarity: (a) translating constraints and specializing directly in the target language (Ponti et al., 2019) substantially outperforms (b) specializing in the source language (EN) followed by transferring the specialization model to the target languages via cross-lingual embedding spaces.
Cross-Lingual Results. The results of our cross-lingual evaluation are shown in Table 5 (binary LE detection) and Table 6 (graded LE prediction). The results mostly follow the trends of the monolingual results: we received no successful runs in the Dist track, but we see very encouraging results in the Any track. As in the monolingual settings, SHIKEBLCU outperforms the competitive GLEN baseline, although now with somewhat narrower margins and not for all language pairs, especially in the graded LE setup (Table 6). An encouraging finding is that the largest gains of SHIKEBLCU over GLEN are for language pairs involving Albanian (SQ), our surprise evaluation language and arguably the most resource-lean language in our evaluation (e.g., the margins in graded LE prediction in favor of SHIKEBLCU are 13%, 19%, and 14% for HR-SQ, IT-SQ, and SQ-TR, respectively). Although none of the UAlberta runs (Hauer et al., 2020) outperform GLEN, it is encouraging to see that competitive performance can be achieved by simple methods with relatively low resource demands (e.g., their run-1, the VECTORS approach, requires only parallel corpora and monolingual word embedding spaces).

Conclusion
As a fundamental asymmetric lexico-semantic relation, lexical entailment (LE) supports the construction of concept hierarchies and downstream applications that require reasoning and inference. Multilingual and cross-lingual applications need models that can detect LE across languages. This is why we carried out a SemEval shared task (Task 2) on predicting LE, spanning (1) monolingual vs. cross-lingual LE, (2) binary LE detection vs. graded LE prediction, and (3) a set of 6 diverse languages (and 15 language pairs). In the track in which we allowed any external resources to be used (the Any track), we received submissions that substantially push the state of the art across all languages and language pairs, for both binary LE detection and graded LE prediction. We hope that these methodological advances, instigated by the SemEval task and the constructed datasets, will inform and inspire further work in fields such as multilingual taxonomy induction and natural language inference.