Exploring the Representation of Word Meanings in Context: A Case Study on Homonymy and Synonymy

This paper presents a multilingual study of word meaning representations in context. We assess the ability of both static and contextualized models to adequately represent different lexical-semantic relations, such as homonymy and synonymy. To do so, we created a new multilingual dataset that allows us to perform a controlled evaluation of several factors such as the impact of the surrounding context or the overlap between words, conveying the same or different senses. A systematic assessment on four scenarios shows that the best monolingual models based on Transformers can adequately disambiguate homonyms in context. However, as they rely heavily on context, these models fail at representing words with different senses when occurring in similar sentences. Experiments are performed in Galician, Portuguese, English, and Spanish, and both the dataset (with more than 3,000 evaluation items) and new models are freely released with this study.


Introduction
Contrary to static vector models, which represent the different senses of a word in a single vector (Erk, 2012; Mikolov et al., 2013), contextualized models generate representations at the token level (Peters et al., 2018; Devlin et al., 2019), thus being an interesting approach to model word meaning in context. In this regard, several studies have shown that clusters produced by some contextualized word embeddings (CWEs) are related to different senses of the same word (Reif et al., 2019; Wiedemann et al., 2019), or that similar senses can be aligned in cross-lingual experiments (Schuster et al., 2019).
However, more systematic evaluations of polysemy (i.e., word forms that have different related meanings depending on the context (Apresjan, 1974)) have shown that even though CWEs present some correlations with human judgments (Nair et al., 2020), they fail to predict the similarity of the various senses of a polysemous word (Haber and Poesio, 2020).
As classical datasets to evaluate the capabilities of vector representations consist of single words without context (Finkelstein et al., 2001) or heavily constrained expressions (Kintsch, 2001; Mitchell and Lapata, 2008), new resources with annotations of words in free contexts have been created, including both graded similarities (Huang et al., 2012; Armendariz et al., 2020) and binary classification of word senses (Pilehvar and Camacho-Collados, 2019; Raganato et al., 2020). However, as these datasets largely include instances of polysemy, they are difficult to solve even for humans (in fact, the highest reported human upper bound is about 80%) as the nuances between different senses depend on non-linguistic factors such as the annotator procedure or the target task (Tuggy, 1993; Kilgarriff, 1997; Hanks, 2000; Erk, 2010).
In this paper, we rely on a more objective and simple task to assess how contextualized approaches (both neural network models and contextualized methods of distributional semantics) represent word meanings in context. In particular, we observe whether vector models can identify unrelated meanings represented by the same word form (homonymy) and the same sense conveyed by different words (synonymy). In contrast to polysemy, there is a strong consensus concerning the representation of homonymous senses in the lexicon, and it has been shown that homonyms are cognitively processed differently than polysemous words (Klepousniotou et al., 2012; MacGregor et al., 2015). In this regard, exploratory experiments in English suggest that some CWEs correctly model homonymy, approximating the contextualized vectors of a homonym to those of its paraphrases (Lake and Murphy, 2020), and showing stronger correlation with human judgments than those of polysemous words (Nair et al., 2020). However, as homonyms convey unrelated meanings depending on the context, it is not clear whether the good performance of CWEs actually derives from the contextualization process or simply from the use of explicit lexical cues present in the sentences.
Taking the above into account, we have created a new multilingual dataset (in Galician, Portuguese, English, and Spanish) with more than 3,000 evaluation items. It allows for carrying out more than 10 experiments and controlling factors such as the surrounding context, the word overlap, and the sense conveyed by different word forms. We use this resource to perform a systematic evaluation of contextualized word meaning representations. We compare different strategies using both static embeddings and current models based on deep artificial neural networks. The results suggest that the best monolingual models based on Transformers (Vaswani et al., 2017) can adequately identify homonyms having different meanings. However, as they strongly rely on the surrounding context, words with different meanings are represented very closely when they occur in similar sentences. Apart from the empirical conclusions and the dataset, this paper also contributes new BERT and fastText models for Galician.
Section 2 presents previous studies about word meaning representation. Then, Section 3 introduces the new dataset used in this paper. In Section 4 we describe the models and methods to obtain the vector representations. Finally, the experiments and results are discussed in Section 5, while Section 6 draws some conclusions of our study.

Related Work
A variety of approaches has been implemented to compute word meaning in context by means of standard methods of distributional semantics (Schütze, 1998; Kintsch, 2001; McDonald and Brew, 2004; Erk and Padó, 2008). As compositional distributional models construct sentence representations from their constituent vectors, they take into account contextualization effects on meaning (Mitchell and Lapata, 2008; Baroni and Zamparelli, 2010; Baroni, 2013). However, these approaches often have scalability problems as their representations grow exponentially with the size of the sentences. Therefore, the datasets used to evaluate them are composed of highly restricted phrases (Grefenstette and Sadrzadeh, 2011).
The rise of artificial neural networks in natural language processing popularized the use of vector representations, and the remarkable performance of neural language models (Melamud et al., 2016; Peters et al., 2018) led to a productive line of research exploring to what extent these models represent linguistic knowledge (Rogers et al., 2020). However, few of these works have focused on lexical semantics, and most of the relevant results in this field come from evaluations in downstream tasks. In this regard, Wiedemann et al. (2019) found that clusters of BERT embeddings (Devlin et al., 2019) seem to be related to word senses, while Schuster et al. (2019) observed that clusters of polysemous words correspond to different senses in a cross-lingual alignment of vector representations.
Probing LSTMs on lexical substitution tasks, Aina et al. (2019) showed that these architectures rely on the lexical information from the input embeddings, and that the hidden states are biased towards contextual information. On an exploration of the geometric representations of BERT, Reif et al. (2019) found that different senses of a word tend to appear separated in the vector space, while several clusters seem to correspond to similar senses. Recently, Vulić et al. (2020) evaluated the performance of BERT models on several lexical-semantic tasks in various languages, including semantic similarity or word analogy. The results show that using special tokens ([CLS] or [SEP]) hurts the quality of the representations, and that these tend to improve across layers until saturation. As this study uses datasets of single words (without context), type-level representations are obtained by averaging the contextualized vectors over various sentences.
There are several resources to evaluate word meaning in free contexts, such as the Stanford Contextual Word Similarity (Huang et al., 2012) and CoSimLex (Armendariz et al., 2020), both representing word similarity on a graded scale, or the Word-in-Context datasets (WiC), focused on binary classifications (i.e., each evaluation item contains two sentences with the same word form, having the same or different senses) (Pilehvar and Camacho-Collados, 2019; Raganato et al., 2020). These datasets include some instances of homonymy, but they consist mostly of polysemous words. In this regard, studies on polysemy using Transformers have obtained diverse results: Haber and Poesio (2020) found that BERT embeddings correlate better with human ratings of co-predication than with similarity between word senses, thus suggesting that these representations encode more contextual information than word sense knowledge. Nevertheless, the results of Nair et al. (2020) indicate that BERT representations are correlated with human scores of polysemy. An exploratory experiment of the latter study also shows that BERT discriminates between polysemy and homonymy, which is also suggested by other pilot evaluations reported by Lake and Murphy (2020) and Yu and Ettinger (2020).
Our study follows this research line pursuing objective and unambiguous lexical criteria such as the representation of homonyms and synonyms. In this context, there is a broad consensus in the psycholinguistics literature regarding the representation of homonyms as different entries in the lexicon (in contrast to polysemy, for which there is a long discussion on whether senses of polysemous words are stored as a single core representation or as independent entries (Hogeweg and Vicente, 2020)). In fact, several studies have shown that homonyms are cognitively processed differently from polysemous words (Klepousniotou et al., 2012; Rabagliati and Snedeker, 2013). In contrast to the different senses of polysemous words, which are simultaneously activated, the meanings of homonyms are in conflict during processing, with the irrelevant ones being deactivated by the context (MacGregor et al., 2015). To analyze how vector models represent homonymy and synonymy in context, we have built a new multilingual resource with a strong inter-annotator agreement, presented below.

A New Multilingual Resource of Homonymy and Synonymy in Context
This section briefly describes some aspects of lexical semantics relevant to our study, and then presents the new dataset used in the paper.
Homonymy and homography: Homonymy is a well-known type of lexical ambiguity that can be described as the relation between distinct and unrelated meanings represented by the same word form, such as match, meaning for instance 'sports game' or 'stick for lighting fire'. In contrast to polysemy (where one lexeme conveys different related senses depending on the context, e.g., newspaper as an organization or as a set of printed pages), it is often assumed that homonyms are different lexemes that have the same lexical form (Cruse, 1986), and therefore they are stored as independent entries in the lexicon (Pustejovsky, 1998). There are two main criteria for homonymy identification: Diachronically, homonyms are lexical items that have different etymologies but are accidentally represented by the same word form, while a synchronic perspective stresses unrelatedness in meaning. Even if both approaches tend to identify similar sets of homonyms, there may be ambiguous cases that are diachronically but not synchronically related (e.g., two meanings of banco, 'bench' and 'financial institution', in Portuguese or Spanish could be considered polysemous as they derive from the same origin, but as this is a purely historical association, most speakers are not aware of the common origin of both senses). In this study, we follow the synchronic perspective, and consider homonymous meanings those that are clearly unrelated (e.g., they unambiguously refer to completely different concepts) regardless of their origin.
It is worth mentioning that as we are dealing with written text we are actually analyzing homographs (different lexemes with the same spelling) instead of homonyms. Thus, we discard instances of phonologically identical words which are written differently, such as the Spanish hola 'hello' and ola 'wave', both representing the phonological form /ola/. Similarly, we include words with the same spelling representing different phonological forms, e.g., the Galician-Portuguese sede, which corresponds to both /sede/ 'thirst', and /sɛde/ 'headquarters'.
In this paper, homonymous senses are those unrelated meanings conveyed by the same (homonym) word form. For instance, coach may have two homonymous senses ('bus' and 'trainer'), which can be conveyed by other words (synonyms) in different contexts (e.g., by bus or trainer).

Structure of the dataset:
We have created a new resource to investigate how vector models represent word meanings in context. In particular, we want to observe whether they capture (i) different senses conveyed by the same word form (homonymy), and (ii) equivalent senses expressed by different words (synonymy). The resource contains controlled sentences so that it allows us to observe how the context and word overlap affect word representations.
To allow for different comparisons with the same and different contexts, we have included five sentences for each meaning (see Table 1 for examples): three sentences containing the target word, a synonym, and a word with a different sense, all of them in the same context (sentences 1 to 3), and two additional sentences with the target word and a synonym, representing the same sense (sentences 4 and 5, respectively). Thus, for each sense we have four sentences (1, 2, 4, 5) with a word conveying the same sense (both in the same and in different contexts) and another sentence (3) with a different word in the same context as sentences 1 and 2.

Table 1: Example sentences for two senses of coach in English ('bus' and 'trainer'). Sentences 1 to 3 include, in the same context, the target word, a synonym, and a word with a different sense (in italic), respectively. Sentences 4 and 5 contain the target word and a synonym in different contexts, respectively.
Sense (1)
  Sentence 1: We're going to the airport by coach.
  Sentence 2: We're going to the airport by bus.
  Sentence 3: We're going to the airport by bicycle.
  Sentence 4: [...]
  Sentence 5: They had to travel everywhere by bus.
Sense (2)
  Sentence 1: That man was appointed as the new coach.
  Sentence 2: That man was appointed as the new trainer.
  Sentence 3: That man was appointed as the new president.
  Sentence 4: She has recently joined the amateur team as coach.
  Sentence 5: They need a new trainer for the young athletes.
From this structure, we can create datasets of sentence triples, where the target words of two of them convey the same sense, and the third one has a different meaning. Thus, we can generate up to 48 triples for each pair of senses (24 in each direction: sense 1 vs. sense 2, and vice-versa). These datasets allow us to evaluate several semantic relations at the lexical level, including homonymy, synonymy, and various combinations of homonymous senses. Interestingly, we can control for the impact of the context (e.g., are contextualized models able to distinguish between different senses occurring in the same context, or do they incorporate excessive contextual information into the word vectors?), the word overlap (e.g., can a model identify different senses of the same word form depending on the context, or does it strongly depend on lexical cues?), or the POS-tag (e.g., are homonyms with different POS-tags easily disambiguated?).
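The count of 24 triples per direction can be reproduced with a short sketch. It rests on an assumption that the text does not spell out: two of the four same-sense sentences (1, 2, 4, 5) are paired, and the outlier is one of the four same-sense sentences of the other sense. The function name and the (sense, sentence) tuples are illustrative only.

```python
from itertools import combinations

# Sentences 1, 2, 4 and 5 of a sense convey that sense; sentence 3 does not.
SAME_SENSE_IDS = (1, 2, 4, 5)

def triples_one_direction(sense, other_sense):
    """Pair two same-sense sentences with one outlier from the other sense.

    Assumption: the outlier is drawn from sentences 1, 2, 4, 5 of the other
    sense. Sentences are identified as (sense, sentence_number) tuples.
    """
    triples = []
    for s_a, s_b in combinations(SAME_SENSE_IDS, 2):  # 6 same-sense pairs
        for s_c in SAME_SENSE_IDS:                    # 4 candidate outliers
            triples.append(((sense, s_a), (sense, s_b), (other_sense, s_c)))
    return triples

forward = triples_one_direction(1, 2)
backward = triples_one_direction(2, 1)
print(len(forward), len(backward), len(forward) + len(backward))
```

Under this reading, triples that use sentence 3 as the outlier within the same sense (same-context comparisons) are built separately and are not part of this count.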
Construction of the dataset: We compiled data for four languages: Galician, Portuguese, Spanish, and English. We tried to select sentences compatible with the different varieties of the same language (e.g., with the same meaning in UK and US English, or in Castilian and Mexican Spanish). However, we gave priority to the European varieties when necessary (e.g., regarding spelling variants).
The dataset was built using the following procedure: First, language experts (one per language) compiled lists of homonyms using dedicated resources for language learning, together with WordNet and other lexicographic data (Miller, 1995; Montraveta and Vázquez, 2010; Guinovart, 2011; Rademaker et al., 2014). Only clear and unambiguous homonyms were retained (i.e., those in the extreme of the homonymy-polysemy-vagueness scale (Tuggy, 1993)). These homonyms were then enriched with frequency data from large corpora: Wikipedia and SLI GalWeb (Agerri et al., 2018) for Galician, and a combination of Wikipedia and Europarl for English, Spanish and Portuguese (Koehn, 2005). From these lists, each linguist selected the most frequent homonyms, annotating them as ambiguous at type or token level (absolute homonymy and partial homonymy in Lyons' terms (Lyons, 1995)). As a substantial part were noun-verb pairs, only a few of these were included. For each homonym, the language experts selected from corpora two sentences (1 and 4) in which the target words were not ambiguous. They then selected a synonym that could be used in sentence 1 without compromising grammaticality (thus generating sentence 2), and compiled an additional sentence for it (5), trying to avoid further lexical ambiguities in this process. For each homonym, the linguists selected a word with a different meaning (for sentence 3), trying to maximize the following criteria: (i) to refer unambiguously to a different concept, and to preserve (ii) semantic felicity and (iii) grammaticality. The size of the final datasets varies depending on the initial lists and on the ease of finding synonyms in context.

Table 2: Characteristics of the dataset. First three columns display the number of homonyms (Hom), senses, and sentences (Sent), respectively. Senses in parentheses are the number of homonymous pairs with different POS-tags. Center columns show the size of the evaluation data in three formats: triples, pairs, and WiC-like pairs, followed by the Cohen's κ agreements and their micro-average. The total number of homonyms and senses is the sum of the language-specific ones, regardless of the fact that some senses occur in more than one language.

Results: Apart from the sentence triples explained above, the dataset structure allows us to create evaluation sets with different formats, such as sentence pairs to perform binary classifications as in the WiC datasets. Table 2 shows the number of homonyms, senses, and sentences of the multilingual resource, together with the size of the evaluation datasets in different formats.
As the original resource was created by one annotator per language, we ensured its quality as follows: We randomly extracted sets of 50 sentence pairs and gave them to other annotators (5 for Galician, and 1 for each of the other three varieties, all of them native speakers of the target language). We then computed the Cohen's κ inter-annotator agreement (Cohen, 1960) between the original resource and the outcome of this second annotation (see the right column of Table 2). We obtained a micro-average κ = 0.94 across languages, a result which supports the task's objectivity. Nevertheless, it is worth noting that a few sentences were carefully modified after this analysis, as it showed that several misclassifications were due to the use of an ambiguous synonym. Thus, it is likely that the final resource has higher agreement values.
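For reference, Cohen's κ compares the observed agreement between two annotators with the agreement expected by chance. The following is a minimal sketch of the standard formula; the function name and the toy labels are illustrative and do not come from the dataset.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item set")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one and the same label throughout
    return (p_o - p_e) / (1.0 - p_e)

# Toy labels (hypothetical): 1 = same sense, 0 = different sense.
original = [1, 1, 0, 0]
second = [1, 1, 0, 1]
print(cohen_kappa(original, second))  # 0.5
```

In the toy example the annotators agree on 3 of 4 items (0.75), chance agreement is 0.5, and κ is therefore (0.75 - 0.5) / (1 - 0.5) = 0.5.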

Models and Methods
This section introduces the models and procedures to obtain vector representations followed by the evaluation method.

Models
We have used static embeddings and CWEs based on Transformers, comparing different ways of obtaining the vector representations in both cases:
Monolingual models: For English, we have used the official BERT-Base model (uncased). For Portuguese and Spanish, BERTimbau (Souza et al., 2020) and BETO (Cañete et al., 2020) (both cased). For Galician, we trained two BERT models (with 6 and 12 layers; see Appendix C).

Obtaining the vectors
Static models: These are the methods used to obtain the representations from the static models:
Word vector (WV): Embedding of the target word (homonymous senses with the same word form will have the same representation).
Sentence vector (Sent): Average embedding of the whole sentence.
Syntax (Syn): Up to four different representations obtained by adding the vector of the target word to those of its syntactic heads and dependents. This method is based on the assumption that the syntactic context of a word characterizes its meaning, providing relevant information for its contextualized representation (e.g., in 'He swims to the bank', bank may be disambiguated by combining its vector with that of swim; see footnote 9). Appendix D describes how heads and dependents are selected.
Footnote 6: In preliminary experiments we also used word2vec and GloVe models, obtaining slightly lower results than fastText.
Footnote 7: These Portuguese and Galician models obtained better results (0.06 on average) than the official ones.
Footnote 8: To make a fair comparison we prioritized base models (12 layers), but we also report results for large (24 layers) and 6-layer models when available.
Contextualized models: For these models, we have evaluated the following approaches:
Sentence vector (Sent): [...]
Word vector (WV): Embedding of the target word, combining the vectors of the last 4 layers. We have evaluated two operations: vector concatenation (Cat), and addition (Sum).
Word vector across layers (Lay): Vector of the target word on each layer. This method allows us to explore the contextualization effects on each layer.
Vectors of words split into several sub-words are obtained by averaging the embeddings of their components. Similarly, MWEs vectors are the average of the individual vectors of their components, both for static and for contextualized embeddings.
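The WV combination for contextualized models can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: hidden states are plain nested lists indexed as `hidden_states[layer][token]`, sub-words are averaged within each layer before the layers are combined, and the function names are hypothetical.

```python
def mean_vectors(vectors):
    """Element-wise average of equally sized vectors."""
    n = len(vectors)
    return [sum(dims) / n for dims in zip(*vectors)]

def word_vector(hidden_states, positions, mode="sum", last_n=4):
    """Combine the last `last_n` layers for a word made of several sub-words.

    hidden_states[layer][token] is a list of floats; `positions` holds the
    token indices of the target word's sub-words (or MWE components).
    """
    per_layer = [
        mean_vectors([layer[p] for p in positions])  # average the sub-words
        for layer in hidden_states[-last_n:]
    ]
    if mode == "sum":  # Sum: element-wise addition of the layer vectors
        return [sum(dims) for dims in zip(*per_layer)]
    if mode == "cat":  # Cat: concatenation of the layer vectors
        return [x for vec in per_layer for x in vec]
    raise ValueError("mode must be 'sum' or 'cat'")

# Toy input: 5 layers, 3 tokens, 2 dimensions (values are made up).
toy = [[[float(lay + tok), float(lay * tok)] for tok in range(3)]
       for lay in range(5)]
summed = word_vector(toy, positions=[1, 2], mode="sum")
concat = word_vector(toy, positions=[1, 2], mode="cat")
print(len(summed), len(concat))  # 2 8
```

Because averaging, addition, and concatenation are all linear, averaging the sub-words before or after combining the layers gives the same vector; Sum keeps the original dimensionality while Cat multiplies it by the number of layers.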

Measuring sense similarities
Given a sentence triple where two of the target words (a and b) have the same sense and the third (c) a different one, we evaluate a model as follows (in a similar way as other studies (Kintsch, 2001; Lake and Murphy, 2020)): First, we obtain three cosine similarities between the vector representations: sim1 = cos(a, b); sim2 = cos(a, c); sim3 = cos(b, c). Then, an instance is labeled as correct if those words conveying the same sense (a and b) are closer together than the third one (c), that is, if sim1 > sim2 and sim1 > sim3. Otherwise, the instance is considered incorrect.
Footnote 9: We have also evaluated a contextualization method using selectional preferences inspired by Erk and Padó (2008), but the results were almost identical to those of the WV approach.
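The triple evaluation described in this section can be written as a short sketch. The function names and the toy vectors are illustrative; only the decision rule (sim1 > sim2 and sim1 > sim3) comes from the text.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm

def triple_is_correct(a, b, c):
    """True if the same-sense words (a, b) are closer than the outlier c."""
    sim1, sim2, sim3 = cosine(a, b), cosine(a, c), cosine(b, c)
    return sim1 > sim2 and sim1 > sim3

def accuracy(triples):
    """Share of triples labeled as correct."""
    return sum(triple_is_correct(*t) for t in triples) / len(triples)

# Toy vectors (made up for illustration).
toy_triples = [
    ([1.0, 0.0], [0.9, 0.1], [0.0, 1.0]),  # a and b are close: correct
    ([1.0, 0.0], [0.0, 1.0], [0.9, 0.1]),  # outlier is close to a: incorrect
]
print(accuracy(toy_triples))  # 0.5
```

Since the comparison is strict, a static word vector that is identical in all three sentences yields sim1 = sim2 = sim3 and is counted as incorrect.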

Evaluation
This section presents the experiments performed using the new dataset and discusses their results.

Experiments
Among all the potential analyses of our data, we have selected four evaluations to assess the behavior of a model by controlling factors such as the context and the word overlap:
Homonymy (Exp1): The same word form in three different contexts, two of them with the same sense (e.g., coach in sentences [1:1, 1:4, 2:1] in Table 1, where each index reads sense:sentence). This test evaluates if a model correctly captures the sense of a unique word form in context. Hypothesis: Static embeddings will fail as they produce the same vector in the three cases, while models that adequately incorporate contextual cues should correctly identify the outlier sense.

Synonyms of homonymous senses (Exp2):
A word is compared with its synonym and with the synonym of its homonym, all three in different contexts (e.g., coach = bus ≠ trainer in [1:1, 1:5, 2:2]). This test assesses if there is a bias towards one of the homonymous senses, e.g., the most frequent one (MacGregor et al., 2015). Hypothesis: Models with this type of bias may fail, so as in Exp1, they should also appropriately incorporate contextual information to represent these examples.

Synonymy vs homonymy (Exp3):
We compare a word to its synonym and to a homonym, all in different contexts (e.g., coach = bus ≠ coach in [1:1, 1:5, 2:1]). Here we evaluate whether a model adequately represents both (i) synonymy in context (two word forms with the same sense in different contexts) and (ii) homonymy (one of the former word forms having a different meaning). Hypothesis: Models relying primarily on lexical knowledge are likely to represent homonyms closer than synonyms (giving rise to an incorrect output), but those integrating contextual information will be able to model the three representations correctly.
Synonymy (Exp4): Two synonyms vs. a different word (and sense), all of them in the same context (e.g., [2:1, 2:2, 2:3]). It assesses to what extent the context affects word representations of different word forms. Hypothesis: Static embeddings may pass this test as they tend to represent type-level synonyms closely in the vector space. Highly contextualized models might be puzzled as different meanings (from different words) occur in the same context, so that the models should have an adequate trade-off between lexical and contextual knowledge.
Table 3 displays the number of sentence triples for each experiment as well as the total number of triples of the dataset. To focus on the semantic knowledge encoded in the vectors (rather than on the morphosyntactic information), we have evaluated only those triples in which the target words of the three sentences have the same POS-tag (numbers on the right). Besides, we have also carried out an evaluation on the full dataset.

Results and discussion
Table 4 contains a summary of the results of each experiment in the four languages. For reasons of clarity, we include only fastText embeddings and the best contextualized model (BERT). Results for all models and languages can be seen in Appendix A. BERT models have the best performance overall, both on the full dataset and on the selected experiments, except for Exp4 (in which the three sentences share the context) where the static models outperform the contextualized representations.
In Exp1 and Exp2, where the context plays a crucial role, fastText models correctly labeled between 50% and 60% of the examples (depending on the language and vector type, with better results for Sent and Syn). For BERT, the best accuracy surpasses 0.98 (Exp1 in English), with an average across languages of 0.78, and word vectors outperform sentence representations. These high results and the fact that WVs work better in general than Sent may be indicators that Transformers are properly incorporating contextual knowledge.
Solving Exp3 requires both dealing with contextual effects and homonymy (as two words have the same form but different meaning) so that static embeddings hardly achieve 0.5 accuracy (Sent, with lower results for both WV and Syn). BERT's performance is also lower than in Exp1 and Exp2, with an average of 0.67 and Sent beating WVs in most cases, indicating that the word vectors are not adequately representing the target senses.
Finally, fastText obtains better results than BERT on Exp4 (where the three instances have the same context), reaching 0.81 in Spanish with an average across languages of 0.64 (always with WVs). BERT's best performance is 0.41 (in two languages) with an average of 0.42, suggesting that very similar contexts may confound the model.
To shed light on the contextualization process of Transformers, we have analyzed their performance across layers. Figure 1 shows the accuracy curves (vs. the macro-average Sent and WV vectors of the contextualized and static embeddings) for five Transformers models on Galician, the language with the largest dataset (see Appendix A for equivalent figures for the other languages).
In Exp1 to Exp3 the best accuracies are obtained at upper layers, showing that word vectors appropriately incorporate contextual information. This is true especially for the monolingual BERT versions, as the multilingual models' representations show higher variations. Except for Galician, Exp1 has better results than Exp2, as the former primarily deals with context while the latter combines contextualization with lexical effects. In Exp3 the curves take longer to rise as initial layers rely more on lexical than on contextual information. Furthermore, except for English (which reaches 0.8), the performance is low even in the best hidden layers (≈ 0.4). In Exp4 (with the same context in the three sentences), contextualized models cannot correctly represent the word senses, being surpassed in most cases by the static embeddings.
Finally, we have observed how Transformers representations vary across the vector space. Figure 2 shows the UMAP visualizations (McInnes et al., 2018) of the contextualization processes of Exp1 and Exp3 examples in English. In 2a, the similar vectors of match in layer 1 are being contextualized across layers, producing a suitable representation from layer 7 onwards. However, 2b shows how the model is not able to adequately represent match close to its synonym game, as the vectors seem to incorporate excessive information (or at least limited lexical knowledge) from the context. Additional visualizations in Galician can be found in Appendix B.

Figure 2: UMAP visualizations of word contextualization across layers (1 to 12) in Exp1 and Exp3 in English (BERT-base). In both cases, sentence 1 is "He was watching a football match.", and the target word in sentence 3 is the outlier. (a) Exp1: Sentence 2: "Chelsea have a match with United next week." Sentence 3: "You should always strike a match away from you." (b) Exp3: Sentence 2: "A game consists of two halves lasting 45 minutes, meaning it is 90 minutes long." Sentence 3: "He was watching a football stadium."
In sum, the experiments performed in this study allow us to observe how different models generate contextual representations. In general, our results confirm previous findings which state that Transformers models increasingly incorporate contextual information across layers. However, we have also found that this process may deteriorate the representation of the individual words, as it may be incorporating excessive contextual information, as suggested by Haber and Poesio (2020).

Conclusions and Further Work
This paper has presented a systematic study of word meaning representation in context. Besides static word embeddings, we have assessed the ability of state-of-the-art monolingual and multilingual models based on the Transformers architecture to identify unambiguous cases of homonymy and synonymy. To do so, we have presented a new dataset in four linguistic varieties that allows for controlled evaluations of vector representations.
The results of our study show that, in most cases, the best contextualized models adequately identify homonyms conveying different senses in various contexts. However, as they strongly rely on the surrounding contexts, they misrepresent words having different senses in similar sentences.
In further work, we plan to extend our dataset with multiword expressions of different degrees of idiomaticity and to include less transparent (but still unambiguous) contexts of homonymy. Finally, we also plan to systematically explore how multilingual models represent homonymy and synonymy in cross-lingual scenarios.

Appendices
A Complete results
Figure 3 and Table 5 include the results for all languages and models. We also include large variants (BERT and XLM-RoBERTa) when available. For static embeddings, we report results for the best Syn setting, which combines up to three syntactically related words with the target word (see Appendix D).

In Figure 4a, cálculos is correctly contextualized from layer 3 onwards. In Figure 4b, the outlier sense of queixo is not correctly contextualized in any layer. The second row shows examples of Exp2 (4c) and Exp4 (4d). In Figure 4c, the synonyms banco and cardume are closer to the outlier asento in layer 1 (and from 4 to 7), but the contextualization process is not able to correctly represent the senses in the vector space. In Figure 4d, the result is correct from layer 7 to 11, but in general the representations of words in similar sentences point towards a similar region. The third row includes examples of Exp3. In Figure 4e, the occurrences of the homonym sede are correctly contextualized as the one in the first sentence approaches its synonym localización in upper layers. The equivalent example of Figure 4f is not adequately solved by the model, as both senses of bolo are notably distant from molete, synonym of the first homonymous sense.

C Galician models
Training corpus: We combined the SLI GalWeb (Agerri et al., 2018), CC-100 (Wenzek et al., 2020), the Galician Wikipedia (April 2020 dump), and other news corpora crawled from the web. Following Raffel et al. (2020), we removed duplicates and sentences with a high ratio of punctuation and symbols. The final corpus has 555M words (633M tokens tokenized with FreeLing (Padró and Stanilovsky, 2012; Garcia and Gamallo, 2010)). The corpus was divided into 90%/10% splits for train and development.
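The cleaning and splitting steps described above can be sketched as follows. This is a minimal illustration, not the pipeline used for the released corpus: the threshold `MAX_SYMBOL_RATIO`, the helper names, and the positional (non-shuffled) split are all assumptions, since the text does not report these details.

```python
import unicodedata

# Hypothetical threshold: the ratio actually used is not reported in the text.
MAX_SYMBOL_RATIO = 0.3


def symbol_ratio(sentence: str) -> float:
    """Fraction of non-space characters that are punctuation or symbols."""
    chars = [c for c in sentence if not c.isspace()]
    if not chars:
        return 1.0  # treat empty lines as noise
    noisy = sum(1 for c in chars if unicodedata.category(c)[0] in ("P", "S"))
    return noisy / len(chars)


def clean_corpus(sentences, max_ratio=MAX_SYMBOL_RATIO):
    """Drop noisy sentences and exact duplicates, keeping first occurrences."""
    seen = set()
    kept = []
    for sentence in sentences:
        key = sentence.strip()
        if key in seen or symbol_ratio(key) > max_ratio:
            continue
        seen.add(key)
        kept.append(key)
    return kept


def split_train_dev(sentences, train_fraction=0.9):
    """Split by position into train and development portions (90%/10%)."""
    cut = int(len(sentences) * train_fraction)
    return sentences[:cut], sentences[cut:]
```

Exact-match deduplication is the simplest choice here; near-duplicate detection would need a different approach.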
fastText model: We trained a fastText skip-gram model for 15 iterations with 300 dimensions, window size of 5, negative sampling of 25, and a minimum word frequency of 5. We used the same 90% split used to train the BERT models, but with automatic tokenization (≈ 600M tokens).
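As an illustration, the reported hyperparameters could be passed to the official fastText command-line tool as sketched below. The text does not say which fastText implementation was used, so the choice of CLI, the mapping of "iterations" to `-epoch`, and the file names are assumptions.

```python
# Hyperparameters as reported in the text; the flag names are those of the
# official fastText command-line tool (an assumed choice of implementation).
FASTTEXT_PARAMS = {
    "dim": 300,      # vector dimensions
    "ws": 5,         # context window size
    "neg": 25,       # number of negative samples
    "minCount": 5,   # minimum word frequency
    "epoch": 15,     # assumption: "iterations" means passes over the corpus
}


def fasttext_command(input_path, output_prefix, params=FASTTEXT_PARAMS):
    """Build the argument list for a `fasttext skipgram` training run."""
    cmd = ["fasttext", "skipgram", "-input", input_path, "-output", output_prefix]
    for name, value in params.items():
        cmd += [f"-{name}", str(value)]
    return cmd
```

The resulting list can be handed to `subprocess.run(fasttext_command("train.txt", "gl_skipgram"), check=True)`, where both file names are placeholders.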
BERT models: We used the 90% train split of the corpus (with the original tokenization) to train two BERT models, with 6 and 12 layers:
BERT-small (6 layers): This model was trained from scratch using a vocabulary of 52,000 (sub-)words and a batch size of 208. It was trained for 1M steps (≈ 20 epochs) in 14 days.
BERT-base (12 layers): Following Kuratov and Arkhipov (2019), we initialized the model from the official pre-trained mBERT, therefore having the same vocabulary size (119,547). We trained it on the Galician corpus for 600k steps (≈ 13 epochs in 28 days) with a batch size of 198.
Both models were trained with the Transformers library (Wolf et al., 2020) on a single NVIDIA Titan XP GPU (12GB), with a block size of 128, a learning rate of 0.0001, a masked language modeling (MLM) probability of 0.15, and a weight decay of 0.01. They have been trained only with the MLM objective.
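As a back-of-the-envelope check, the reported steps, batch sizes, and approximate epoch counts can be related as below. The sketch assumes one batch per optimization step (no gradient accumulation), which the text does not state, and the function name is illustrative.

```python
def sequences_per_epoch(steps, batch_size, epochs):
    """Total training sequences seen, divided by the reported epoch count."""
    return steps * batch_size / epochs


# Values reported for the two Galician models (epoch counts are approximate).
bert_small = sequences_per_epoch(steps=1_000_000, batch_size=208, epochs=20)
bert_base = sequences_per_epoch(steps=600_000, batch_size=198, epochs=13)
```

Under that assumption, both runs imply roughly 9 to 10 million training sequences per epoch. The difference between the two figures is within what the rounded epoch counts would allow.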

D Syntax (Syn method)
To get the heads and dependents of each target word we have used the following hierarchies:
For nouns: HeadVerb (the head verb, if any) > DepVerb (dependents of the head verb with one of the following relations: obj, nmod, obl) > DepAdj (a dependent adjective) > DepNoun (a dependent noun).
For verbs: Head (only if it is a verb or a noun) > Obj (its direct object, if any) > Arg (a dependent with one of these relations: nsubj, nmod, obl).
Using these hierarchies we have evaluated representations built by adding from 1 to 4 vectors to the one of each target word. As shown in Table 5, combining 3 syntactically related words with the target one obtains the best results.
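One possible reading of the Syn method is sketched below, not the code used in the experiments. It makes several assumptions: `Token` is a hypothetical minimal container for a UD-parsed word, DepVerb is taken to mean the other dependents of the target's head verb, vectors are looked up by surface form, and "adding" is an element-wise sum.

```python
from dataclasses import dataclass


@dataclass
class Token:
    idx: int      # 1-based position in the sentence
    form: str     # surface form, used here as the vector lookup key
    upos: str     # Universal Dependencies POS tag
    head: int     # idx of the syntactic head (0 = root)
    deprel: str   # dependency relation to the head


def dependents(sent, head_idx):
    """All tokens whose syntactic head is the token at head_idx."""
    return [t for t in sent if t.head == head_idx]


def related_words(sent, target, max_words=3):
    """Rank syntactically related tokens following the two hierarchies."""
    head = next((t for t in sent if t.idx == target.head), None)
    deps = dependents(sent, target.idx)
    ranked = []
    if target.upos == "NOUN":
        if head is not None and head.upos == "VERB":
            ranked.append(head)  # HeadVerb
            ranked += [t for t in dependents(sent, head.idx)
                       if t.idx != target.idx
                       and t.deprel in ("obj", "nmod", "obl")]  # DepVerb
        ranked += [t for t in deps if t.upos == "ADJ"]   # DepAdj
        ranked += [t for t in deps if t.upos == "NOUN"]  # DepNoun
    elif target.upos == "VERB":
        if head is not None and head.upos in ("VERB", "NOUN"):
            ranked.append(head)  # Head
        ranked += [t for t in deps if t.deprel == "obj"]  # Obj
        ranked += [t for t in deps
                   if t.deprel in ("nsubj", "nmod", "obl")]  # Arg
    return ranked[:max_words]


def syn_vector(sent, target, vectors, max_words=3):
    """Add the vectors of the selected words to the target word's vector."""
    result = list(vectors[target.form])
    for tok in related_words(sent, target, max_words):
        vec = vectors.get(tok.form)
        if vec is not None:  # skip out-of-vocabulary words
            result = [a + b for a, b in zip(result, vec)]
    return result
```

With `max_words=3` this corresponds to the best-performing setting reported above; varying it from 1 to 4 covers the configurations that were compared.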
For the experiments, we have parsed the datasets using the 2.5 Universal Dependencies models provided by UDPipe (Straka et al., 2019).

E English translations
The following are English translations of the sentences in Figure 4.
Figure 4a, sentence 1: "There must be some error in the calculations because the result is incorrect". Sentence 2: "According to my calculations we will finish in three days". Sentence 3: "[He/She] had several gallstones".
Figure 4b, sentence 1: "For dessert [he/she] ate cheese with quince". Sentence 2: "We went to a cheese gastronomy event". Sentence 3: "[He/She] approached her and ran his hand over her chin".
Figure 4c, sentence 1: "They were so many that they looked like a school of mackerel". Sentence 2: "From the rock small shoals of sea bass could be seen". Sentence 3: "This stone seat is somewhat uncomfortable".
Figure 4d, sentences 1 and 2: "[He/She] wrote down all the phone numbers in the phone book". Sentence 3: "[He/She] crossed out all the phone numbers in the phone book".
Figure 4e, sentence 1: "The choice of the next venue for the Olympics will take place". Sentence 2: "The location of the event will be decided this week". Sentence 3: "I'll get water from the spring, I am thirsty".
Figure 4f, sentence 1: "[He/She] loves to eat the bread cake before soup". Sentence 2: "The bread had a slightly hard crust". Sentence 3: "They used live sand lance to attract sea bass".