IRB-NLP at SemEval-2022 Task 1: Exploring the Relationship Between Words and Their Semantic Representations

What is the relation between a word and its description, or a word and its embedding? Both descriptions and embeddings are semantic representations of words. But what information from the original word remains in these representations? Or, more importantly, which information about a word do these two representations share? Definition Modeling and Reverse Dictionary are two opposite learning tasks that address these questions. The goal of the Definition Modeling task is to investigate the power of the information lying inside a word embedding to express the meaning of the word in a humanly understandable way, as a dictionary definition. Conversely, the Reverse Dictionary task explores the ability to predict word embeddings directly from their definitions. In this paper, by tackling these two tasks, we explore the relationship between words and their semantic representations. We present our findings based on the descriptive, exploratory, and predictive data analysis conducted on the CODWOE dataset. We give a detailed overview of the systems that we designed for the Definition Modeling and Reverse Dictionary tasks, and that achieved top scores in several subtasks of the SemEval-2022 CODWOE challenge. We hope that our experimental results concerning the predictive models and the data analyses we provide will prove useful in future explorations of word representations and their relationships.


Introduction
The COmparing Dictionaries and WOrd Embeddings (CODWOE) task (Mickus et al., 2022) is aimed at explaining two different types of semantic descriptions of words: dictionary glosses and word embeddings. A dictionary gloss is a brief textual explanation of a word and a word embedding is a vector representation that captures the word's semantic and syntactic properties (Smith, 2020).
In order to investigate the relationship between these two types of descriptions, two complementary tracks were put together: 1. the Definition Modeling (DEFMOD) track, where correct glosses need to be generated from word embedding vectors (Noraset et al., 2017); and 2. the Reverse Dictionary (REVDICT) track, where correct embedding vectors should be generated from dictionary glosses (Hill et al., 2016). The datasets for both tracks cover five different languages: English (EN), Spanish (ES), French (FR), Italian (IT), and Russian (RU).
The key challenge of the CODWOE task is that it needs to be performed without external data, which precludes the use of pretrained models and vectors. Additionally, the training dataset is relatively small in comparison to the datasets on which models are typically trained.
Our strategy was to adapt an RNN-based decoder model (Noraset et al., 2017) for the DEFMOD track, and to use a transformer-based encoder (Devlin et al., 2019) for the REVDICT track. With the limited amount of available data in mind, we hypothesized that models should not be large. Therefore we aimed to limit the model complexity by reducing the number of parameters, for example by using a subword tokenizer (Kudo and Richardson, 2018), which yields a smaller dictionary of optimized subword fragments. All of the models we used were built for a single language, and their structure and parameters were optimized either iteratively or by way of Bayesian hyperparameter optimization (BHO) (Snoek et al., 2012).
We conducted data analyses of the CODWOE datasets and analyses of the developed machine learning models. We performed a statistical and visual analysis of the pretrained CODWOE embeddings, i.e., of their distributions and relationships. The DEFMOD analyses include an analysis of model performance factors and a qualitative analysis of generated glosses. In the REVDICT predictive analysis, we investigate the impact of many different settings on model performance, defined in terms of distance and similarity scores between predicted and target vectors. We show that our adaptation of the DEFMOD architecture (Noraset et al., 2017) can perform competitively and that the use of multiple word embeddings can clearly improve the generation of word glosses. For REVDICT, we demonstrate that our approaches achieve top performance in terms of ranking, which makes them suitable for information retrieval applications. Our models perform competitively, and our results on the CODWOE challenge can be found in Table 1. We make the code of our models and data analyses publicly available 1 .

Table 1: Aggregated language-level ranks of our team for the DEFMOD (DM) and REVDICT (RD) tracks (the number of teams competing in a subtask is given in parentheses).

TASK        EN     ES     FR     IT     RU
DM-all      2 (9)  1 (7)  1 (6)  5 (7)  5 (6)
RD-sgns     3 (9)  1 (7)  1 (6)  1 (7)  2 (6)
RD-char     4 (7)  3 (5)  4 (5)  2 (6)  2 (5)
RD-electra  3 (4)         3 (4)

Related Work

Definition Modeling The Definition Modeling (DEFMOD) task was introduced by Noraset et al. (2017). Subsequent work focused on variations of the problem of predicting a word gloss from the word sense. These approaches consider gloss prediction based on sense-specific word embeddings (Gadetsky et al., 2018; Kabiri and Cook, 2020; Zhu et al., 2019), and on a word-based context indicating the word sense (Bevilacqua et al., 2020; Gadetsky et al., 2018; Mickus et al., 2019; Yang et al., 2020; Zhang et al., 2020).
The proposed approaches are based either on RNNs (Gadetsky et al., 2018; Kabiri and Cook, 2020; Zhang et al., 2020; Zhu et al., 2019) or Transformers (Bevilacqua et al., 2020; Mickus et al., 2019). All of the previous approaches rely on word embeddings pretrained on large corpora, most commonly word2vec (Mikolov et al., 2013).
Sense-aware approaches that take embeddings as input make use either of sense-aware word embeddings (Gadetsky et al., 2018; Kabiri and Cook, 2020) or of a decomposition of word embeddings into sense-specific vectors (Zhu et al., 2019).
The initially proposed architecture of Noraset et al. (2017) is often used as a baseline solution. The most commonly used measure of model performance is the BLEU metric (Papineni et al., 2002). Although there is some overlap in the datasets used, most experiments rely on a specific dataset. The reported model performances vary greatly. Noraset et al. (2017) report BLEU of 31 and 23, depending on the dictionary. Subsequent experiments report, for the same approach, BLEU scores that range from as little as 11 (Gadetsky et al., 2018) to as much as 60 (Kabiri and Cook, 2020). The variation can be great even for the same language and experimental setup (Kabiri and Cook, 2020). The original approach of Noraset et al. (2017) remains competitive in the sense-aware setting, with the sense-aware approaches achieving BLEU increases that range between 1 and 2 points (Gadetsky et al., 2018; Kabiri and Cook, 2020; Zhang et al., 2020) and 5 to 6 points (Kabiri and Cook, 2020; Yang et al., 2020; Zhang et al., 2020), depending on the setting.
While we view Definition Modeling primarily as a theoretically interesting task, potential applications include the explainability of word embeddings and the automatic generation of dictionaries, which might be of interest in low-resource settings.
Reverse Dictionary The Reverse Dictionary (REVDICT) task consists of finding the right word given a description of it (Bilac et al., 2004; Dutoit and Nugues, 2002; Zock and Bilac, 2004). It is a formulation of the tip-of-the-tongue (TOT) problem (Brown and McNeill, 1966) that occurs during text synthesis: a condition in which a person knows a lot about a word, such as its meaning and origin, but is unable to recall it. REVDICT is a complex task: there are countless variations of input definitions that should lead to the same one-word concept. This complexity comes in part from the representation of one-word concepts in the human mind. People tend to relate concepts on the conceptual and lexical levels and form a highly connected network of abstractions (Zock and Bilac, 2004).
Therefore, a natural approach to solving REVDICT is to form a semantic network with nodes (one-word concepts) and edges (associations) and to search it for the target word (Thorat and Choudhari, 2016; Zock and Bilac, 2004). REVDICT can also be realized directly by comparing the input definition with all the definitions in the dictionary and returning the most similar ones, without taking into account any semantic or grammatical information (El-Kahlout and Oflazer, 2004). However, REVDICT systems that include semantics give better results, as in Méndez et al. (2013) and Calvo et al. (2016), where words are represented as vectors in a semantic space.
Recent REVDICT approaches utilize deep learning (DL) to map arbitrary-length definition phrases to the vector representation of the target word (Hill et al., 2016; Malekzadeh et al., 2021; Qi et al., 2020; Yan et al., 2020). The success of DL approaches indicates that REVDICT can be solved implicitly, i.e., by learning directly from the given data, and does not require an explicit injection of domain knowledge. Based on this observation, a DL approach is a good choice for solving the REVDICT task.

Dataset
The CODWOE datasets (Mickus et al., 2022) cover five languages (EN, ES, FR, IT, RU) and are derived from the Dbnary lexical data 2 . Each data point corresponds to a single word and contains word embedding vectors and the word's gloss. Three types of embeddings are used, labeled sgns (pretrained word2vec), electra (contextual pretrained embeddings), and char (character-based embeddings). The pretrained embeddings are based on large corpora containing approximately 1B tokens.
Each dataset is divided into three sections: training, validation (development), and test. The training and validation datasets have 43,608 and 6,375 samples, respectively. Each track also has a separate test set: the DEFMOD test dataset has 6,221 samples, while the REVDICT test dataset has 6,208 samples.
More detailed statistics and analyses of the dataset can be found in the Appendices, including the gloss statistics (Table 5) and embedding vector statistics (Table 11). Descriptive analysis of the embedding vectors shows large variation in values depending on the language and the embedding type (Figures 3 and 4). Additionally, an exploratory analysis showed that the embeddings for different languages are easily separable (Figures 6 and 5). Interestingly, patterns of vector-based word similarity seem to differ significantly across embedding types, and in this regard there are no visible relations between different embeddings (Figure 7).

2 http://kaiko.getalp.org/about-dbnary/

System overview
Both the DEFMOD and the REVDICT models rely on unigram subword tokenizers (Kudo, 2018) trained on glosses from the train datasets.

Definition Modeling
Our approach to the challenging task of Definition Modeling on a limited dataset consists of preprocessing the input data, extracting the semantic information from the dataset, and controlling the model size and complexity.
The inspection of the learning data revealed that the gloss texts are often long since they consist of several alternative definitions. We opted to include only one definition per learning example. Our intuition is that this approach, also taken in (Noraset et al., 2017), alleviates the learning problem by inducing the model to learn shorter and atomic definitions. The approach should also reduce noise (since the number of alternative definitions in a gloss is arbitrary).
The inspection of the glosses also revealed the presence of lexicographic labels that precede the gloss definitions. These labels, present for all languages except English, convey data about, for example, word semantics (e.g., geography, history) or temporal category (e.g., archaic). We chose to remove these labels since they introduce noise (the presence and number of labels appear arbitrary), increase the dictionary size, and thus make the learning problem harder.
To construct the dictionary we use the unigram subword tokenizer (Kudo, 2018) implemented as part of the SentencePiece tool (Kudo and Richardson, 2018). The reasons for using the subword tokenization were the expected improvement in performance for low-resource tasks (Kudo, 2018) and the reduction in the number of model parameters corresponding to token embeddings.
Since we opted for a deep learning model that depends on token embeddings, we initialized the token embeddings with GloVe vectors (Pennington et al., 2014) trained on the dataset of normalized and cleaned atomic glosses. To demonstrate that the GloVe vectors capture a degree of word semantics, we aggregated the vectors on a gloss level using tf-idf weighting. Then, for each "target" gloss from a sample of English glosses, we inspected the other dataset glosses ordered by cosine similarity to the target. This revealed that GloVe similarity corresponds to similarity in gloss meaning. Additionally, we found that models initialized with GloVe vectors achieve a lower final loss.
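The gloss-level aggregation just described can be sketched as follows. The token vectors and idf values below are toy stand-ins for the trained GloVe embeddings and corpus statistics, included only to illustrate the weighting scheme:

```python
import numpy as np

# Toy stand-ins for trained GloVe token vectors and corpus idf weights.
vecs = {"cat": np.array([1.0, 0.0]),
        "small": np.array([0.0, 1.0]),
        "animal": np.array([1.0, 1.0])}
idf = {"cat": 2.0, "small": 1.0, "animal": 1.5}

def gloss_vector(tokens):
    """tf-idf-weighted average of token vectors (tf enters via token repetition)."""
    weights = np.array([idf[t] for t in tokens])
    mat = np.stack([vecs[t] for t in tokens])
    return (weights[:, None] * mat).sum(axis=0) / weights.sum()

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

target = gloss_vector(["small", "cat"])
similar = gloss_vector(["small", "animal"])
```

Ordering all glosses by `cosine` similarity to a target gloss vector reproduces the inspection procedure described above.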
Machine learning model We decided to use an adaptation of the RNN-based model of Noraset et al. (2017), which proved competitive in a number of experimental settings. In the context of the DEFMOD task, the model takes as input one or more word embeddings (sgns, electra, or char) and produces a gloss (a sequence of tokens) that should correspond to the word's correct gloss.
From the input embeddings, we form two vectors, the seed vector s that is used to initialize the RNN, and the context vector c. For both the seed and the context vectors we consider using a single embedding, concatenation of embeddings, and a nonlinear transformation of the concatenation. At each position in the sequence the context vector is passed as input, together with the RNN's output, to the special GRU-like gated cell (Noraset et al., 2017). The output of the gated cell is then transformed (via linear transformation and softmax activation) to produce token-level probabilities. The gated cell can learn to effectively combine the semantic context with the RNN-level features in guiding the generation process (Noraset et al., 2017). The network architecture we use is labeled as S+G in Noraset et al. (2017).
The described model performs conditional generation of tokens in a sequence, which is a standard approach in RNN-based language modeling. The probability of a gloss g = (g_1, …, g_T) is factorized under the assumption that each token g_i depends on the previous tokens, the seed embedding s, and the context c:

p(g | s, c) = ∏_{i=1}^{T} p(g_i | g_1, …, g_{i−1}, s, c)

In (Noraset et al., 2017), the context is equal to the seed, i.e., the input word embedding. In our case, both the seed and the context can either be a single embedding or a function of multiple embeddings. This approach enables us to leverage the information from several word embeddings in a flexible way. For example, sgns embeddings can be used as the seed while the context can be formed by passing all the embeddings through a multilayer perceptron. Another important difference is that we use unigram subword tokenization (Kudo, 2018). Finally, we experiment with using both LSTM and GRU as the network's RNN components.
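A minimal sketch of the gated update that combines the RNN output with the semantic context, in the spirit of the GRU-like cell of Noraset et al. (2017). The exact weight layout, dimensions, and gate wiring here are illustrative, not the paper's precise formulation:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # hidden/context size (illustrative)
Wz, Wr, Wh = (0.1 * rng.normal(size=(d, 2 * d)) for _ in range(3))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_cell(h, c):
    """Combine the RNN output h with the semantic context c via a GRU-style gate."""
    x = np.concatenate([h, c])
    z = sigmoid(Wz @ x)                                # update gate
    r = sigmoid(Wr @ x)                                # reset gate on the context
    h_new = np.tanh(Wh @ np.concatenate([r * c, h]))   # candidate state
    return (1.0 - z) * h + z * h_new                   # gated mixture

h = rng.normal(size=d)   # RNN output at the current step
c = rng.normal(size=d)   # context vector (e.g., concatenated embeddings projected to d)
out = gated_cell(h, c)
```

In the full model, `out` would then pass through a linear layer and softmax to produce the token-level probabilities described above.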

Reverse Dictionary
We approach REVDICT as a supervised vector regression task and employ an end-to-end deep learning solution. Our model is based on a transformer architecture (Vaswani et al., 2017) used as a definition sentence encoder, and a fully connected feed-forward network used as an output regression module.
The transformer is used to produce useful representations from given inputs, where the inputs are tokenized definition sentences. For each subword token in the input sequence, the transformer gives a representation in the form of a vector. Our REVDICT systems implement three different approaches for aggregating the output vectors produced by the transformer: 1. sum, where we sum the representations given for each token in the input sequence; 2. average, where we average the representations given for each token in the input sequence; and 3. eos, where we use only the representation of the last token in the input sequence, i.e. end-of-sequence (eos) token. The output module further transforms these representations into word embedding vectors.
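The three aggregation strategies amount to simple reductions over the per-token transformer outputs; a sketch on a toy matrix:

```python
import numpy as np

def aggregate(token_reprs, mode):
    """Collapse per-token transformer outputs (seq_len x d) into one vector."""
    if mode == "sum":
        return token_reprs.sum(axis=0)
    if mode == "average":
        return token_reprs.mean(axis=0)
    if mode == "eos":
        return token_reprs[-1]  # representation of the end-of-sequence token
    raise ValueError(f"unknown mode: {mode}")

# Toy (seq_len=3, d=2) stand-in for transformer outputs.
reprs = np.array([[1.0, 2.0],
                  [3.0, 4.0],
                  [5.0, 6.0]])
```

The aggregated vector is what the output regression module transforms into the predicted word embedding.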
Additionally, we utilize a multi-task learning approach (Caruana, 1997; Ruder, 2017). To support multi-task learning, we implemented multiple output regression modules that simultaneously predict different types of embedding vectors from the same representations produced by a single encoder. Multi-task learning is used during the model training phase, and only the output of one output module is used to make the final predictions. The motivation for using a multi-task learning approach is to benefit from the inductive transfer between tasks, which could improve the results on a single task (Caruana, 1997).
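The multi-task setup can be sketched as a shared encoder representation feeding one regression head per target embedding type. Dimensions and weight initialization below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_emb = 16, 8  # illustrative sizes

# One linear regression head per target embedding type, sharing the encoder output.
heads = {name: 0.1 * rng.normal(size=(d_emb, d_model))
         for name in ("sgns", "char", "electra")}

def predict_all(encoded):
    """All heads read the same shared representation and are trained jointly."""
    return {name: W @ encoded for name, W in heads.items()}

encoded = rng.normal(size=d_model)  # stand-in for the aggregated transformer output
preds = predict_all(encoded)
```

At prediction time, only the head corresponding to the requested embedding type is read out, matching the description above.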

Model Selection and Experimental Setup
In this section we describe the technical details of data preprocessing and model selection that comprise our methods of constructing the DEFMOD and REVDICT models. The conceptual description of the methods is given in Section 3.

Definition Modeling
Our choices regarding the technical details of data preprocessing and model construction were guided by what we will call development experiments. These experiments consisted of training the model on the train set, and observing both the final development set loss and the quality of the produced glosses.
Output gloss quality was assessed using a separate "trial" dataset: a small dataset of 200 items provided by the organizers, each item consisting of the embedding vectors, the original word, and the gloss text. The assessment was performed for English glosses only and aimed to evaluate the quality of the generated text and the similarity between the output and the original glosses. A choice was deemed an improvement if it led to an improvement in development loss and either improved the generated glosses or caused no degradation in gloss quality. The development of the final algorithm was performed iteratively and heuristically; however, the overall improvement over the iterations is confirmed by the results of the test set evaluations.
Dataset transformation The transformation of the original dataset is performed by creating unambiguous training examples and removing the uninformative data that makes the problem harder.
In the original dataset, a gloss often consists of several equivalent but differently phrased definitions. We divided the dictionary glosses into atomic definitions by splitting the text strings around the ";" character. This heuristic was motivated by gloss sample analysis, and an inspection of a sample of the resulting atomic glosses revealed that it works in the majority of cases. Each atomic gloss in the new dataset was paired with all the embedding vectors of the original gloss.
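The splitting and pairing step can be sketched as follows (the example gloss is invented for illustration):

```python
def atomic_glosses(gloss):
    """Split a multi-definition gloss on ';' and drop empty fragments."""
    return [part.strip() for part in gloss.split(";") if part.strip()]

def expand_example(gloss, embeddings):
    """Pair each atomic gloss with all embedding vectors of the original entry."""
    return [(atom, embeddings) for atom in atomic_glosses(gloss)]
```

Each original data point thus becomes one training example per atomic definition, all sharing the same embedding vectors.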
In order to remove the lexicographic labels from the beginning of the gloss texts, simple language-specific regular expressions and removal rules were formed based on gloss sample analysis. This approach proved effective for a large majority of glosses.
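The actual removal rules were language-specific; the single pattern below is a simplified, hypothetical illustration that strips one leading parenthesized label:

```python
import re

# Simplified stand-in for the language-specific removal rules: strip a single
# leading parenthesized lexicographic label such as "(Geografia)" or "(Désuet)".
LABEL_RE = re.compile(r"^\([^)]*\)\s*")

def strip_label(gloss):
    return LABEL_RE.sub("", gloss)
```

The real preprocessing code, covering the per-language label conventions, is available in our repository.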
To perform further normalization, we additionally lowercased all the glosses and removed the punctuation from the end of texts. The code used to preprocess the original dataset, the new dataset, and the transformation log can be found in the code repository. We note that both the SentencePiece dictionary and the GloVe vectors used for DEFMOD are derived from the transformed dataset. The statistics of the transformed glosses are presented in Table 6.

Dictionary We used the unigram subword tokenizer (Kudo, 2018) available as part of the SentencePiece tool (Kudo and Richardson, 2018) and trained it using the default parameters. Experiments in Gowda and May (2020) suggest that a vocabulary of 8000 subwords is a good default choice for several languages in the case of machine translation. Additionally, our development experiments showed that English models using a vocabulary of 8000 subwords are superior to 10000-subword models. We therefore set the number of unigram tokens to 8000 for English, and to 8500 for the other, highly inflected languages, which are expected to have a higher number of distinct suffixes.
Pretrained token embeddings GloVe embeddings (Pennington et al., 2014) of the subword tokens, introduced to initialize the tokens with corpus-level semantic information, were constructed as follows. The model was trained on the set of transformed glosses, and the embedding size was fixed to 256 (the size of the gloss embeddings). The number of training iterations was set to 50, the "cutoff" parameter x_max was set to 10, and all the other parameters retained their default values. No frequency-based vocabulary pruning was performed.
Machine learning model We fixed the maximum sequence length of the RNN models to 64 subword tokens. Our intuition is that this alleviates the learning problem and could lead to models focused on generating shorter but more correct glosses.
The models were optimized using the AdamW algorithm (Loshchilov and Hutter, 2017) and the standard categorical cross-entropy loss. The training process was stopped after a fixed number of epochs, or if the best solution did not improve by more than 0.1% over 10 epochs. During inference, the optimal solution was constructed using the beam search algorithm implementation provided by the competition organizers 3 .
We iteratively improved the models using the described development experiments, i.e., relying on the development set loss and analysis of model glosses produced for the trial dataset. We experimented with several architectural elements and hyperparameters: the formulation of the seed (RNN init. value) and context (gate input) of the network, RNN cell type, dropout, learning rate (LR) and LR scheduler, and the number of training epochs.
The most successful variant is constructed by using the concatenation of all the gloss embeddings as the context and the sgns embedding as the seed. This variant uses an input dropout of 0.1 and a network dropout of 0.3. The input dropout is applied to the seed and context vectors, as well as to the word embeddings. The network dropout is applied to the output of the RNN (final layer) and to the output of the gate cell. The chosen learning rate is 0.001, and the "plateau" LR scheduler is used: the LR is multiplied by 0.1 if there is no improvement over 5 epochs.
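The plateau schedule, together with the early-stopping rule described earlier (stop if the best solution does not improve by more than 0.1% over 10 epochs), amounts to the following control logic. This is a pure-Python sketch that replays a development-loss curve; our experiments used standard optimizer and scheduler implementations:

```python
def run_training_control(dev_losses, lr=0.001, lr_patience=5, stop_patience=10,
                         min_rel_impr=0.001):
    """Replay a dev-loss curve through the plateau/early-stopping rules:
    multiply the LR by 0.1 after `lr_patience` epochs without improvement,
    and stop after `stop_patience` epochs without a >0.1% relative improvement.
    Returns the final LR and the epoch at which training stopped."""
    best = float("inf")
    since_lr = since_stop = 0
    for epoch, loss in enumerate(dev_losses):
        if loss < best * (1.0 - min_rel_impr):
            best, since_lr, since_stop = loss, 0, 0
        else:
            since_lr += 1
            since_stop += 1
            if since_lr >= lr_patience:
                lr *= 0.1
                since_lr = 0
            if since_stop >= stop_patience:
                return lr, epoch
    return lr, len(dev_losses) - 1
```

For a flat loss curve, the LR is decayed twice (at epochs 5 and 10) before training stops at epoch 10.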
For the context vector, we tried single embeddings and the combined embeddings merged via a multilayer perceptron. Both variants proved inferior to the concatenation of all vectors. The merged seed vector proved no different from the single embedding seed, so we opted for the simpler solution. Both the development experiments and the results showed no difference between the LSTM and the GRU cell.
Analysis of errors revealed that models sometimes produce a deformed output (a very short or non-alphabetic string), and that this almost never occurs simultaneously for two distinct models. Therefore, a simple heuristic improvement is to combine a model with a fallback model that is used in case of deformed outputs. We combined a model with a concatenated context and a model with a single-embedding context, or two models with distinct RNN cell types. A more detailed analysis of the model variants can be found in Appendix A.2.
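The fallback combination reduces to a deformity check and a conditional switch; the thresholds in this sketch are illustrative:

```python
def is_deformed(gloss, min_chars=3):
    """Heuristic check for deformed outputs: very short or non-alphabetic strings.
    The exact thresholds are illustrative."""
    return len(gloss.strip()) < min_chars or not any(ch.isalpha() for ch in gloss)

def combined_gloss(primary, fallback):
    """Use the primary model's gloss unless it is deformed."""
    return primary if not is_deformed(primary) else fallback
```

Since deformed outputs almost never coincide across two distinct models, the combined system rarely emits a deformed gloss.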

Reverse Dictionary
We conducted various development experiments before deciding on the final configuration of our REVDICT solutions. In all of the experiments, we used the entire set of training data to train the model and the entire set of validation (development) data for scoring. We used Mean Squared Error (MSE) as the loss function during training. We tested the effect of adding a cosine loss term to the MSE with different coefficients, but we obtained the best results without the cosine loss. We also used MSE for scoring models during Bayesian hyperparameter optimization (BHO).
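The combined objective we tested can be written as MSE plus a weighted cosine-distance term; setting the coefficient to zero recovers the plain MSE that ultimately worked best:

```python
import numpy as np

def revdict_loss(pred, target, cos_weight=0.0):
    """MSE plus an optional cosine-distance term; cos_weight=0 is plain MSE."""
    mse = float(np.mean((pred - target) ** 2))
    cos = float(pred @ target / (np.linalg.norm(pred) * np.linalg.norm(target)))
    return mse + cos_weight * (1.0 - cos)

pred = np.array([1.0, 0.0])
target = np.array([0.0, 1.0])
```

In training, this scalar would be averaged over a batch of predicted and target embedding vectors.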
To determine the optimal model size, we searched the space of two transformer hyperparameters: the number of attention heads and the number of layers. We used a grid search over the values v ∈ {1, 2, 4, 8} for both hyperparameters. Additionally, we used BHO (Snoek et al., 2012) to find the optimal model for each grid point. However, increasing the model size did not increase the performance of the model. These results were in line with our expectations, given the small size of the datasets. Accordingly, we decided to use a transformer with two heads and two layers. Additionally, we experimented with the maximum length of the input sequence and achieved better validation performance with 256 tokens than with 512 tokens.
We compared performance with and without token embeddings initialization with GloVe vectors. Contrary to our expectations, there was no significant difference in validation performance between these two options, so we skipped the GloVe initialization in the REVDICT system settings. Another development experiment we conducted was to find the optimal method for aggregating the output vectors produced by the transformer, described in Section 3.2. We found that the average method gives the best results in all cases. Furthermore, we examined the influence of the number of layers in the output module on the final prediction. According to the results, there is no benefit in increasing the number of layers in the output module, so we chose a single-layer fully connected network. We also chose Rectified Linear Unit (ReLU) activation function for the output regression module, because it yielded better performance than hyperbolic tangent (Tanh) activation.
Finally, we made six different solutions for the REVDICT task. All of these solutions used a two-head, two-layer transformer encoder.

Table 3: Results for the IRB-NLP team systems on the DEFMOD task. MoverScore, BLEU, and lemma-BLEU results are given for each of the five languages. The best result across all teams and models is given, followed by the results of our two best systems. The overall best results of our team are bolded, and the rankings can be found in Table 1.

Results
Definition Modeling On the DEFMOD task, the models were evaluated using three metrics: BLEU score (Papineni et al., 2002), lemma-level BLEU score, and MoverScore (Zhao et al., 2019). While the BLEU score is based on matching token n-grams between the reference and the model-produced text, MoverScore calculates a measure of distance between texts embedded in a semantic space, i.e., between two sets of contextual word embeddings computed using a transformer model. Table 3 contains scores for two of our best model configurations, "version 3" and "version 4". Both model configurations are described in detail at the end of Section 4.1. While version 3 models are based on a GRU RNN and trained for 300 epochs, version 4 models are built with either GRU or LSTM and trained for 450 epochs. The fallback strategy, which yields slight performance gains, is also used. These results are presented and analyzed in more detail in Appendix A.2.
Results in Table 3 show that our models are competitive with other teams' models on English, Spanish, and French, especially in terms of the BLEU scores. MoverScore results are weaker than those produced by the top models, but rank in the upper half of the systems, except for Italian and Russian, languages for which our models' performance is below average. Rankings aggregated across all the scores, displayed in Table 1, reflect the above observations and show that the models we produced can perform quite competitively.
Our approach shows inter-language variation, both in relative (ranks) and absolute (score values) terms. The full results provided by the organizers 4 show that this is also true for other teams; for example, a few of the high-performing models perform markedly better for Italian and Russian than for other languages. However, some approaches yield more stable results across all languages.
All of the models produced in the CODWOE shared task perform weakly in terms of BLEU. Existing DEFMOD approaches commonly achieve BLEU scores in the range of 20 to 30 (Kabiri and Cook, 2020; Noraset et al., 2017), with some settings yielding BLEU as high as 60 (Kabiri and Cook, 2020). Even the experiments with the weakest reported results (Gadetsky et al., 2018; Kabiri and Cook, 2020) report BLEU scores of approximately 12, while the best CODWOE scores are below BLEU 10.
CODWOE DEFMOD models perform better in terms of MoverScore, a metric designed for machine summarization (Zhao et al., 2019). An analysis of a number of summarization systems showed that MoverScore values range between 15 and 24, with an absolute minimum of 10 and an average slightly below 20 (Fabbri et al., 2021). In comparison, top CODWOE systems reach scores between 12 and 15, except in the case of French, which puts them on the lower end of the summarization scale.
As for the representativeness of the test data, the visual analysis performed in Appendix A.1 shows that the distribution of test gloss embeddings matches the train distribution well. Another factor that potentially influences performance is word rarity. We observed that the English test examples contain a significant number of rare words (such as "pelta", "akimbo", "gothy", or "dungarees"), while some DEFMOD experiments explicitly focus on the most frequent words (Noraset et al., 2017).
The greatest performance gains for the models we used come from using all three vector embeddings to form a context vector. This suggests that future approaches can benefit from leveraging several distinct embeddings types as input for gloss generation.
We believe that the question of the influence of various factors on the performance of DEFMOD systems is important and under-explored. These factors include model structure and parameters, performance metric, dataset size (both for training and pre-training), and the semantic relation between training and test data. Closely related is the question of the nature of semantic generalization that DEFMOD systems are capable of -what kind of examples (and relations contained within them) can inform a successful inference of glosses for unseen embeddings.
Further performance-related analyses can be found in Appendix A.2. Appendix A.3 contains a qualitative analysis of glosses that shows that generated glosses can capture varying levels of semantic properties of the correct glosses. We hypothesize that these variations in similarity are hard to capture with metrics such as MoverScore and BLEU.
Reverse Dictionary We used the following metrics for internal validation of our REVDICT solutions (described in Section 4.2): Mean Squared Error (MSE), Cosine Similarity (COS), and Centered Kernel Alignment (CKA) (Cortes et al., 2012; Kornblith et al., 2019). The COS measure has noted drawbacks (Heidarian and Dinneen, 2016); therefore, we use the linear CKA similarity measure to gain another perspective on model performance. Validation scores can be found in Appendix B.2, Table 12. It is evident that each subsequent approach gives better validation results than the previous ones.
Test predictions were scored by the following metrics: MSE, COS, and Cosine-Based Ranking (RNK). The RNK measure is defined as the proportion of test samples with cosine similarity to the model output embedding higher than that of the ground truth embedding. The final results of our solutions can be found in Table 13 (see Appendix B.2). Here, each subsequent approach has lower scores than the previous ones, which is the complete opposite of the validation results. This suggests potential overfitting to the dev dataset, which could be the result of BHO. However, this is contrary to expectations, as the last two solutions use three times fewer BHO points and should not overfit to the dev dataset. The reason for this phenomenon is unclear and needs further investigation. Finally, the best REVDICT results for each team can be found in Appendix B.2 (Table 14 for MSE, Table 15 for COS, and Table 16 for RNK). The test results and overall rankings of our solutions are summarized in Table 4.

Figure 1: Example of two different predictions for a ground truth vector V_GT, where the predicted vector V_1 has better MSE and COS scores than V_2, and V_2 has a better RNK score than V_1. The rest of the points represent vectors of other test samples.
Compared to other solutions, our systems have average or below-average performance in terms of MSE and COS test scores. However, they perform significantly better than the other approaches in terms of RNK test scores, from which we conclude that our solutions are better suited for the retrieval task. This is an interesting situation, which we illustrate with the example shown in Figure 1. It depicts two different predictions, V_1 and V_2, the first with better MSE and COS scores, and the second with a better RNK score. The second solution prefers a vector subspace with a lower density of test samples, even if the absolute distance from the correct vector is greater. With a smaller set of possible surrounding solutions, retrieving the vector V_GT from the vector V_2 is more precise than retrieving it from the vector V_1.

Conclusion
Definition Modeling and Reverse Dictionary are two opposite learning tasks for exploring the relationship between different semantic representations of words. The CODWOE SemEval task (Mickus et al., 2022) is designed to investigate these tasks in five different languages using three different types of word embeddings.
We propose an adaptation of an existing DEFMOD model and analyze its performance and the glosses generated by the model. We believe that DEFMOD is a theoretically interesting problem and that further investigations should focus on discovering which types of semantic generalization the models are able to perform, and how this generalization ability is influenced by both the data and the models' structure. The existing DEFMOD experiments are largely incomparable since they are based on different data and setups. We believe that a contribution of the CODWOE task is the creation of a multilingual evaluation setting, as well as the use of the flexible MoverScore as an evaluation metric.
Our REVDICT systems are deep regression models based on the transformer architecture that achieved top scores for the difficult-to-predict sgns (word2vec) embeddings. In most cases, our REVDICT solutions perform significantly better than the other systems in terms of the RNK score. These results imply that our solutions could be an appropriate approach for retrieving the right word from its description, a problem crucial for solving the tip-of-the-tongue (TOT) problem (Brown and McNeill, 1966) in machine-assisted text synthesis.
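A transformer-based regression model of this kind (gloss tokens in, one embedding vector out, trained with MSE loss) can be sketched in PyTorch as follows. The hyperparameters, pooling strategy, and layer sizes are illustrative only; the submitted systems described in Section 4.2 differ:

```python
import torch
import torch.nn as nn

class RevDictRegressor(nn.Module):
    """Minimal transformer-based regression model: gloss token ids in,
    one target embedding vector (e.g., a 256-d sgns vector) out."""

    def __init__(self, vocab_size: int, emb_dim: int = 256,
                 d_model: int = 128, nhead: int = 4, num_layers: int = 2):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model, padding_idx=0)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead,
                                           dim_feedforward=256,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head = nn.Linear(d_model, emb_dim)  # regression head

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        mask = token_ids.eq(0)                   # True at padding positions
        h = self.encoder(self.tok(token_ids), src_key_padding_mask=mask)
        h = h.masked_fill(mask.unsqueeze(-1), 0.0)
        # Mean-pool over non-padding positions, then project to emb_dim.
        pooled = h.sum(dim=1) / (~mask).sum(dim=1, keepdim=True).clamp(min=1)
        return self.head(pooled)
```

Training then minimizes `nn.MSELoss()` between the predicted vector and the pretrained embedding of the defined word.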
In summary, the models that we produced for the CODWOE task perform competitively when compared to other participants' models, and can therefore serve as a reasonable starting point for future work on the DEFMOD and REVDICT problems. We believe that promising directions for future optimization include the construction of multilingual and multi-task models, as well as investigations of the influence of external data, primarily in the form of large pre-training corpora.

A.1 Train and Test Data
Motivated by the weak performance of the DEFMOD models (see Section 5), we examined whether the distributions of the train and test data are comparable. To this end, we created 2D projections of the sgns and electra embeddings for all five languages using the t-SNE method (Van der Maaten and Hinton, 2008). The projections, depicted in Figure 2, show that the train and test distributions of the embeddings match well. It is therefore reasonable to expect that the distributions of the gloss texts are similar as well, as the gloss semantics is expected to match the semantics of the corresponding words. However, this conjecture should be confirmed experimentally, for example by per-gloss aggregation of pretrained word embeddings extracted from large corpora. The figure also shows that the electra vectors are more separable than the sgns vectors. The separability of the embedding vectors varies across languages, probably influenced by the corpora used for pre-training the embeddings. We note that the observations about the train and test embedding distributions also apply to the REVDICT problem, which is aimed at predicting the embeddings from gloss texts.
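A comparison of this kind can be sketched with scikit-learn's t-SNE; projecting the train and test vectors jointly keeps them in a shared coordinate system (the perplexity value is illustrative):

```python
import numpy as np
from sklearn.manifold import TSNE

def project_2d(train_vecs: np.ndarray, test_vecs: np.ndarray):
    """Jointly project train and test embeddings to 2D with t-SNE so that
    the two distributions can be compared in the same coordinate system."""
    both = np.vstack([train_vecs, test_vecs])
    # Perplexity must be smaller than the number of samples; 30 is a
    # common choice for datasets of a few thousand vectors.
    proj = TSNE(n_components=2, perplexity=min(30, len(both) - 1),
                init="pca", random_state=0).fit_transform(both)
    return proj[: len(train_vecs)], proj[len(train_vecs):]
```

The two returned 2D arrays can then be scatter-plotted in different colors to inspect whether the train and test clouds overlap.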
Basic gloss statistics can be found in Table 5. There is a large variation in gloss size between languages; e.g., the longest gloss in the ES dataset is almost twice the size of the longest EN gloss. In addition, the longest glosses in the validation (development) datasets are significantly smaller than those in the train datasets, on average 42.55% smaller. The 'dictionary size' column in the table is the number of distinct tokens in each dataset. Dictionary sizes vary; for example, the EN dictionary is approximately half the size of the RU dictionary. The differences between the gloss and dictionary sizes suggest that it is reasonable to use a separate model for each language.
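Statistics of this kind are straightforward to compute; the following sketch assumes glosses are whitespace-tokenized strings, as in the CODWOE data files:

```python
from statistics import median

def gloss_stats(glosses):
    """Basic per-dataset gloss statistics: token counts per gloss and the
    'dictionary size', i.e., the number of distinct tokens."""
    lengths = [len(g.split()) for g in glosses]
    vocab = {tok for g in glosses for tok in g.split()}
    return {
        "num_glosses": len(glosses),
        "max_len": max(lengths),
        "median_len": median(lengths),
        "dictionary_size": len(vocab),
    }
```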
Basic statistics of the transformed dataset can be found in Table 6. As expected, the transformed glosses are significantly smaller than the glosses in the original dataset. For example, the median transformed gloss size is on average 29.25% smaller.

A.2 DEFMOD Models' Performance
Here we supplement Section 5 with a more fine-grained analysis of the DEFMOD models. Table 8 contains the models' performances. As can be seen, the largest gains are achieved by using all of the embedding vectors as input for gloss generation (context=allvec). There is a negligible difference between the LSTM and GRU RNNs, with GRU performing slightly better. Using a fallback model always slightly improves the MoverScore of a model. In Table 8, the architecture of the fallback model is the architecture of the main model with the corresponding parameter replaced by the value in the 'fallback' column. Interestingly, using contextual electra vectors does not help, i.e., the sgns (word2vec) vectors, which are not context-aware, perform comparably. This is true even when only a single embedding is used, i.e., when context equals electra. The parity of sgns and electra is unexpected, since both the train and test datasets contain polysemous electra vectors and words with multiple senses.
It is also interesting to consider the influence of the training data on a model's performance. We hypothesize that a DEFMOD model's score on a single test example is positively correlated with the semantic closeness of the example to the examples in the train set. To test this hypothesis, we calculate the Spearman correlation between the test MoverScore and BLEU scores on one side, and the cosine similarity of the test embedding to the most similar train embeddings on the other. This is done for the best-performing submitted model from Table 8. We also calculate the average scores on two sets of 10% of the test examples that are least similar and most similar to the train examples. Since the embeddings (sgns and electra) were built on large outside corpora, it is reasonable to believe that they capture the semantic similarity of the associated words and glosses.
Surprisingly, the results show a lack of consistent and strong correlation: the correlations range from weakly negative to weakly positive, depending on both the language and the embedding type. This lack of correlation could be caused by many factors, including the nature of the model, the nature of the pretrained embeddings, and the semantics of the cosine similarity measure.
Future extensions and improvements of the proposed analysis could reveal the nature of the train data necessary for the DEFMOD models to generalize successfully, and perhaps point to a similarity measure that reveals more fine-grained properties of such generalization.
Table 7: Correlation between the best DEFMOD model's scores on one side, and the closeness of the test examples to the train set on the other. The unit of correlation is an example from the test set, and its similarity to the train set is calculated as the average cosine similarity with the 10 most similar train embeddings.
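The correlation analysis described above can be sketched as follows, using scipy's Spearman implementation; the helper names are ours, not those of our actual analysis code:

```python
import numpy as np
from scipy.stats import spearmanr

def closeness_to_train(test_vecs: np.ndarray, train_vecs: np.ndarray,
                       k: int = 10) -> np.ndarray:
    """For each test embedding, the average cosine similarity to its k
    most similar train embeddings (as in the Table 7 caption)."""
    T = test_vecs / np.linalg.norm(test_vecs, axis=1, keepdims=True)
    R = train_vecs / np.linalg.norm(train_vecs, axis=1, keepdims=True)
    sims = T @ R.T
    topk = np.sort(sims, axis=1)[:, -k:]   # k largest per test example
    return topk.mean(axis=1)

def score_closeness_correlation(scores, test_vecs, train_vecs, k=10):
    """Spearman correlation between per-example test scores (e.g.,
    MoverScore or BLEU) and the closeness of each test example to the
    train set."""
    rho, pval = spearmanr(scores, closeness_to_train(test_vecs, train_vecs, k))
    return rho, pval
```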

A.3 Qualitative Analysis of Generated Glosses
The DEFMOD models achieve weak results in comparison to previous state-of-the-art approaches, which is probably due to the comparatively small amount of training and pretraining data. Here we demonstrate that the generated glosses can nevertheless capture a degree of the semantics of the correct glosses. Table 9 shows four categories of semantic similarity between the correct and model-generated glosses, in descending order (highest similarity first). These categories include hits or near hits (correct glosses), "near misses" (glosses that capture a significant amount of the original meaning), somewhat similar glosses, and complete misses. Several examples demonstrate that the subword-based models can produce syntactically incorrect glosses. Table 10 contains generated glosses for different senses of the word "consider", which demonstrate that the model was able to approximate, to a degree, the semantics of the senses.
A principled analysis of the generated and correct glosses, based on a well-defined semantic annotation scheme, might prove revealing, but it would be time-consuming and impractical. It would therefore be of interest to automate such efforts, for example by exploring whether large pretrained transformers able to measure fine-grained semantic similarity can be used for this purpose.
[Table 9 residue: "Of or pertaining to" / "abundantly" / "In an abundant manner ; in a sufficient degree ; in large measure" / "In a very manner"]
Table 10: Glosses generated by the top submitted DEFMOD model, alongside the correct glosses, for the multiple senses of the word "consider".
Word | True Gloss (describing the sense) | Generated Gloss
consider | To assign some quality to | To hold the opinion
consider | To look at attentively | To make something certain
consider | To have regard to ; to take into view or account ; to pay due attention to ; to respect | To hold into
consider | To think of doing | To permit
consider | To debate ( or dispose of ) a motion | To make something certain

B Appendix - Analysis of REVDICT Data and Models

B.1 Data Analysis
Here we analyze the properties of the pretrained embedding vectors assigned to the words defined by the glosses. We start by analyzing the numeric values contained in the vectors. Basic statistics of the vector elements can be found in Table 11. It is noticeable that there are large variations in values depending on the language and the embedding type. For example, there is a significant difference between maximum values, especially between electra and sgns. To further investigate the vector elements, we visualize the shapes of their distributions for the train datasets (Figures 3 and 4). The distribution shapes look similar for the dev datasets. Next, we explore the vector data by reducing dimensionality to the 2D space using the Pairwise Controlled Manifold Approximation Projection (PaCMAP) algorithm (Wang et al., 2020). Figure 5 shows the distributions of all three types of embeddings in the train and validation (development) datasets for English, French, and Russian. We also visualize the distributions of sgns (word2vec) and char embeddings for all languages in Figure 6. As can be seen, the vector distributions vary greatly between the embedding types. Additionally, for all the embedding types, the vectors of different languages occupy distinct areas and are easily separable.
We further investigate the relationships between different embeddings in the following way. We first cluster the values of the electra vectors with the k-means algorithm. We set the number of clusters to five and assign a different color to each cluster. We retain the electra cluster-based color of the samples (glosses) while visualizing the vectors of the other embedding types, as shown in Figure 7. It can clearly be seen that the electra-based clusters are not preserved for the other embedding types.
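The cluster-color transfer can be sketched with scikit-learn's k-means; the returned labels are reused as plot colors for the other embedding types:

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_colors(electra_vecs: np.ndarray, n_clusters: int = 5):
    """Cluster the electra vectors with k-means and return one cluster id
    per sample. Reusing the same ids (colors) when plotting the sgns and
    char projections of the same glosses shows whether electra's
    neighborhood structure survives in the other embedding spaces."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    return km.fit_predict(electra_vecs)
```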

B.2 Model Performance
Here we present validation and test scores for our six REVDICT solutions described in Section 4.2. We use the following metrics for internal validation of our REVDICT solutions: Mean Squared Error (MSE), Cosine Similarity (COS), and Centered Kernel Alignment (CKA) (Cortes et al., 2012; Kornblith et al., 2019). Validation scores for each REVDICT approach can be found in Table 12. The last three rows contain the total scores for each metric and each of our REVDICT solutions. A total score is the sum of the values over all datasets, and we use it for a simple comparison of the solutions. It is evident that each subsequent approach gives better validation results than the previous ones.
Test predictions are scored by these metrics: MSE, COS, and Cosine-Based Ranking (RNK). The RNK measure is defined as the proportion of test samples with cosine similarity to the model output embedding higher than the ground truth embedding. The final results for all our solutions can be found in Table 13. Here, each subsequent approach has lower scores than the previous ones, which is the complete opposite of the validation results. This suggests potential overfitting to the dev dataset that could be the result of Bayesian hyperparameter optimization (BHO). However, this is contrary to expectations, as the last two solutions have three times fewer BHO points and should not overfit to the dev dataset. The reason for this phenomenon is unclear and needs further investigation. The best REVDICT results for each team can be found in Table 14 for the MSE score, Table 15 for the COS score, and Table 16 for the RNK score. When compared to other solutions, our systems have low to average performance according to the MSE scores. For the COS scores, our systems perform very well on sgns (word2vec) vectors and poorly on the other embedding types. In terms of RNK (ranking), our systems almost always yield the top performance, and this result is consistent across languages and embedding types.
Table 13: Test results for all our REVDICT (RD) approaches. For each score, comparative results are shown in color. Green is used for the best and red for the worst-performing solution per row (the metric defines whether higher or lower values are better). The total score is the sum of the values over all datasets and embeddings.
Figure 5: Distributions of all three embedding types in the train (2nd row) and validation (development, 1st row) datasets after dimensionality reduction to 2D space. sgns (word2vec, 1st column), char (2nd column), and electra (3rd column) embeddings are depicted for English (orange), French (green), and Russian (blue).
Table 15: COS test scores for each team in the REVDICT task. The results of our team (team 1) are in bold. For each task, comparative results are shown in color. Green is used for the best and red for the worst-performing solution per column.
Table 16: RNK test scores for each team in the REVDICT task. The results of our team (team 1) are in bold. For each task, comparative results are shown in color. Green is used for the best and red for the worst-performing solution per column.
Figure 7: Projection of the clusters in the electra embedding space (3rd column) to the spaces of the other two embedding types, sgns (1st column) and char (2nd column). The analysis is performed for English (rows 1-2), French (rows 3-4), and Russian (rows 5-6) train and validation (development) datasets, after dimensionality reduction to 2D space.