Probing for idiomaticity in vector space models

Contextualised word representation models have been successfully used to capture different word usages, and they may be an attractive alternative for representing idiomaticity in language. In this paper, we propose probing measures to assess whether some of the expected linguistic properties of noun compounds, especially those related to idiomatic meanings, and their dependence on context and sensitivity to lexical choice, are readily available in some standard and widely used representations. To that end, we constructed the Noun Compound Senses Dataset, which contains noun compounds and their paraphrases, in context-neutral and context-informative naturalistic sentences, in two languages: English and Portuguese. Results obtained using four types of probing measures with models like ELMo, BERT and some of its variants indicate that idiomaticity is not yet accurately represented by contextualised models.


Introduction
Contextualised word representation models, like BERT (Devlin et al., 2019) and ELMo (Peters et al., 2018), seem to represent words more accurately than static word embeddings like GloVe (Pennington et al., 2014), as they can encode different usages of a word. In fact, representations of a word in several contexts can be grouped in different clusters, which seem to be related to the various senses of the word (Schuster et al., 2019), and they can be used to match polysemous words in context to specific sense definitions (Chang and Chen, 2019). However, multiword expressions (MWEs) fall into a continuum of idiomaticity (Sag et al., 2002; Fazly et al., 2009; King and Cook, 2017), where we understand idiomaticity as semantic opacity and the continuum as different degrees of opacity (Cruse, 1986), and their meanings may not be directly related to the meanings of their individual words (e.g., graduate student vs. eager beaver as a hardworking person). Therefore, one question is whether and to what extent idiomaticity in MWEs is accurately incorporated by word representation models.
In this paper, we propose a set of probing measures to examine how accurately idiomaticity in MWEs, particularly in noun compounds (NCs), is captured in vector space models, focusing on some widely used representations. Inspired by the semantic priming paradigm (Neely et al., 1989), we have designed four probing tasks to analyse how these models deal with some of the properties of NCs, including non-compositionality (big fish as an important person), non-substitutability (panda car vs. bear automobile), and ambiguity (bad apple as either a rotten fruit or a troublemaker), as well as the influence of context on their representation. To do so, we have created the new Noun Compound Senses (NCS) dataset, containing a total of 9,220 sentences in English and Portuguese. This dataset includes sentence variants with (i) synonyms of the original NCs; (ii) artificial NCs built with synonyms of each component; or (iii) either the head or the modifier of the NC. Moreover, it is composed of naturalistic and controlled sense-neutral sentences, to minimise the possible effect of context words.
We compare five models (one static, GloVe, and four contextualised, ELMo and three BERT-based models) in English and Portuguese. The probing measures suggest that the standard and widely adopted composition operations display a limited ability to capture NC idiomaticity.
Our main contributions are: (i) the design of novel probes to assess the representation of idiomaticity in vector models, (ii) a new dataset of NCs in two languages, and (iii) their application in a systematic evaluation of vector space models examining their ability to display behaviours linked to idiomaticity.
The remainder of this paper is organised as follows: First, Section 2 presents related work. Then, we describe the data and present the probing measures in Section 3. In Section 4, we discuss the results of our experiments. Finally, the conclusions of our study are drawn in Section 5.

Related Work
Priming paradigms have been traditionally used in psycholinguistics to examine how humans process language. For compounds, some findings suggest that idiomatic expressions are processed more slowly than semantically transparent ones, as processing the former may involve a conflict between the non-compositional and the compositional meanings (Gagné and Spalding, 2009; Ji et al., 2011). However, studies using event-related potential (ERP) data showed that idiomatic expressions, especially those with a salient meaning (Giora, 1999), have processing advantages (Laurent et al., 2006; Rommers et al., 2013). In NLP, probing tasks have been useful in revealing to what extent contextualised models are capable of learning different linguistic properties (Conneau et al., 2018). They allow for more controlled settings, removing obvious biases and potentially confounding factors from evaluations, and permitting the use of both artificially constructed but controlled sentences and naturally occurring ones (Linzen et al., 2016; Gulordava et al., 2018). In priming tasks, related stimuli are easier to process than unrelated ones. One assumption is that, for models, related stimuli should achieve greater similarity than unrelated stimuli. These tasks have been used, for instance, to evaluate how neural language models represent syntax (van Schijndel and Linzen, 2018; Prasad et al., 2019), and the preferences that they may display, such as the use of mainly lexical information in a lexical substitution task even when contextual information is available (Aina et al., 2019).
Concerning pre-trained neural language models, which produce contextualised word representations, analyses of their abilities have shown, for instance, that they can encode syntactic information (Liu et al., 2019), including long-distance subject-verb agreement (Goldberg, 2019). Regarding semantic knowledge, the results of various experiments suggest that BERT can somewhat represent semantic roles (Ettinger, 2020). However, its improvements appear mainly in core roles that may be predicted from syntactic representations (Tenney et al., 2019). Moreover, among the representations generated by BERT, ELMo and Flair (Akbik et al., 2018) for word sense disambiguation, only the clusters of BERT vectors seem to be related to word senses (Wiedemann et al., 2019), although in cross-lingual alignment of ELMo embeddings, clusters of polysemous words related to different senses have also been observed (Schuster et al., 2019).
The use of contextualised models for representing MWEs has been reported with mixed results. Shwartz and Dagan (2019) evaluated different classifiers initialised with contextualised and non-contextualised embeddings in five tasks related to lexical composition (including the literality of NCs) and found that contextualised models, especially BERT, obtained better performance across all tasks. However, for capturing idiomaticity in MWEs, static models like word2vec (Mikolov et al., 2013) seem to have better performance than contextualised models (Nandakumar et al., 2019; King and Cook, 2018). These mixed results suggest that a controlled evaluation setup is needed to obtain comparable results across models and languages. Therefore, we have carefully designed probing tasks to assess the representation of NCs in vector space models. As the same word can have different representations even in related paraphrased contexts (Shi et al., 2019), we adopt paraphrases with minimal modifications to compare the idiomatic and literal representations of a given NC.

Noun Compound Senses Dataset
The Noun Compound Senses (NCS) dataset is based on the NC Compositionality dataset, which contains NCs in English (Reddy et al., 2011), Portuguese and French (Cordeiro et al., 2019). Using the protocol by Reddy et al. (2011), human judgments were collected about the interpretation of each NC in 3 naturalistic corpus sentences. The task was to judge, for each NC, how literal the contributions of its components were to its meaning (e.g., "Is climate change truly/literally a change in climate?"). Each NC received a score: the average of the human judgments on a Likert scale from 0 (non-literal/idiomatic) to 5 (literal/compositional). (We averaged the Likert judgments for comparability with previous work, even though the median may better reflect the cases with more disagreement among the annotators. However, mean and median are strongly correlated in our data: ρ = 0.98 for English and ρ = 0.96 for Portuguese, p < 0.001.) For the NCS dataset, a set of probing sentences for the 280 NCs in English and the 180 NCs in Portuguese was added. For each NC, the sentences exemplify two conditions: (i) the naturalistic context provided by the original sentences (NAT), and (ii) a neutral context where the NCs appear in uninformative sentences (NEU). For the latter we use the pattern This is a/an <NC> (e.g., This is an eager beaver) and its Portuguese equivalent Este/a é um(a) <NC>. As some NCs may have both compositional and idiomatic meanings (e.g., fish story as either an aquatic tale or a big lie), these neutral contexts are used to examine the representations that are generated for the NCs (and the sentences) in the absence of any contextual clues about the meaning of the NC. Moreover, they enable examining possible biases in the NC representation, especially when compared to the representation generated for the NAT condition.
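As a minimal illustration of how the NEU sentences can be generated from this pattern, consider the sketch below; the initial-vowel heuristic for the a/an choice is an assumption of the sketch, not necessarily the authors' exact procedure.

```python
# Minimal sketch: building NEU condition sentences from the English pattern.
# The article choice uses a naive initial-vowel-letter heuristic (assumption).
def neutral_sentence(nc: str) -> str:
    article = "an" if nc[0].lower() in "aeiou" else "a"
    return f"This is {article} {nc}."

print(neutral_sentence("eager beaver"))  # This is an eager beaver.
print(neutral_sentence("wet blanket"))   # This is a wet blanket.
```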
For each NC and condition, we created new sentence variants with lexical replacements, using synonyms of the NC as a whole or of each of its components. The synonyms of the NCs are the most frequent synonyms provided by the annotators of the original NC Compositionality dataset (e.g., brain for grey matter). The synonyms of each component were extracted from WordNet (Miller, 1995) for English, and from English and Portuguese dictionaries of synonyms (e.g., alligator for crocodile and sobs for tears). In cases of ambiguity (due to polysemy or homonymy), the most common meaning of each component was used. Experts (native or near-native speakers with a linguistics background) reviewed these new utterances, keeping them as faithful as possible to the originals, with only small modifications to preserve grammaticality after the substitution (e.g., changes to determiners and adjectives related to gender, number and definiteness agreement).
NCS contains a total of 5,620 test items for English and 3,600 for Portuguese, among neutral and naturalistic sentences, and it is freely available.

Probing Measures
This section presents the probing measures defined to assess how accurately idiomaticity is captured in vector space models. For these measures we consider comparisons between three types of embeddings: (i) the embedding for an NC out of context (i.e., the embedding calculated from the NC words alone), represented by NC; (ii) the embedding for an NC in the context of a sentence S, represented by NC⊂S (for non-contextualised embeddings, NC⊂S = NC); and (iii) the embedding for a sentence containing an NC, represented by S⊃NC. Here we use the standard output of some widely used models with no fine-tuning to avoid possible interference. However, in principle, these measures could be applied to any embedding, even after fine-tuning.
The similarities between embeddings are calculated in terms of cosine similarity, cos(x, y), where x and y are embeddings from the same model with the same number of dimensions. In NAT cases, the similarity scores for each of the three available sentences for a given NC are averaged to generate a single score. We use Spearman ρ correlation between similarities and the NC idiomaticity scores (280 for English and 180 for Portuguese) to check for any effects of idiomaticity in the probing measures. We also calculate Spearman ρ correlation between different embedding models to determine how much the models agree, and between the NAT and NEU conditions to see how much the context affects the distribution of similarities. We also analyse the distribution of cosine similarities produced by different models for each of the probing measures. All probing measures are calculated for both NAT and NEU conditions.

Naturalistic sentence | NC | NCsyn | NCsynW
Field work and practical archaeology are a particular focus. | field work | research | area activity
The town centre is now deserted - it's almost like a ghost town! | ghost town | abandoned town | spectre city
How does it feel to experience a close call only to come out alive and kicking? | close call | scary situation | near claim
Eric was being an eager beaver and left work late. | eager beaver | hard worker | restless rodent
No wonder Tom couldn't work with him; he is a wet blanket. | wet blanket | loser | damp cloak

Table 1: Naturalistic examples with their NCsyn and NCsynW counterparts.

P1: Probing the similarity between an NC and its synonym. If a contextualised model captures idiomaticity accurately, the embedding for a sentence containing an NC should be similar to the embedding for the same sentence containing a synonym of the NC (NCsyn, e.g., for grey matter, NCsyn = brain). Thus, sim(P1)_Sent = cos(S⊃NC, S⊃NCsyn) ≈ 1. This should occur regardless of how idiomatic the NC is; that is, similarity scores are not expected to correlate with NC idiomaticity scores (ρ(P1)_Sent ≈ 0). Moreover, this should also hold for the NC and NCsyn embeddings generated in the context of this sentence, which means that ρ(P1)_NC ≈ 0 and sim(P1)_NC ≈ 1, where sim(P1)_NC = cos(NC⊂S, NCsyn⊂S). The baseline similarity scores can be approximated using the out-of-context embeddings for NC and NCsyn.
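As an illustration of how the P1 scores and their correlation with idiomaticity can be computed, consider the following minimal sketch; the random arrays are placeholders standing in for real model embeddings (sizes match the 280 English NCs and a 768-dimensional model).

```python
# Sketch of probe P1: cosine similarities between paired embeddings and
# their Spearman correlation with idiomaticity scores. The random arrays
# below are placeholders for real model output.
import numpy as np
from scipy.stats import spearmanr

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
nc_in_ctx = rng.normal(size=(280, 768))     # NC ⊂ S, one row per compound
syn_in_ctx = rng.normal(size=(280, 768))    # NCsyn ⊂ S
idiom_scores = rng.uniform(0, 5, size=280)  # human Likert means

sim_p1 = np.array([cosine(a, b) for a, b in zip(nc_in_ctx, syn_in_ctx)])
rho, p = spearmanr(sim_p1, idiom_scores)    # expect rho ≈ 0 if idiomaticity
print(f"rho = {rho:.2f} (p = {p:.3f})")     # is well captured
```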
P2: Probing single component meaning preservation. As the meaning of a more compositional compound can be inferred from the meanings of its individual components, we evaluate to what extent an NC can be replaced by one of its component words and still be considered as representing a similar usage in a sentence. We measure sim(P2)_Sent = max_{w_i} cos(S⊃NC, S⊃w_i) (and analogously sim(P2)_NC at the NC level), where w_i is the component word (head or modifier) with the highest similarity, as for some NCs the main meaning may be represented by either the head or the modifier. Similarity scores for idiomatic NCs should be low, as they usually cannot be replaced by any of their components. In contrast, for more compositional NCs, the similarity is expected to be higher. For example, while for a more compositional NC like white wine the head wine would provide a reasonable approximation as w_i, the same would not be the case for grey matter, a more idiomatic NC. Therefore, we expect significant correlations between the similarity values and the NC idiomaticity scores, that is, ρ(P2)_Sent > 0 and ρ(P2)_NC > 0.

P3: Probing model sensitivity to disturbances caused by replacing individual component words by their synonyms. We examine whether vector representations are sensitive to the lack of individual substitutability of the component words displayed by idiomatic NCs (Farahmand and Henderson, 2016). To compare an NC with an expression made from synonyms of its component words (NCsynW, e.g., for grey matter, NCsynW = silvery material), we measure sim(P3)_Sent = cos(S⊃NC, S⊃NCsynW) and sim(P3)_NC = cos(NC⊂S, NCsynW⊂S). These substitutions should provide more similar variants for compositional than for idiomatic cases, and the similarity scores should correlate with the NC idiomaticity scores, that is, ρ(P3)_Sent > 0 and ρ(P3)_NC > 0.

P4: Probing the similarity between the NC in the context of a sentence and out of context. To determine how much, for a given model, an NC in context differs from the same NC out of context, we measure sim(P4)_in-out = cos(NC⊂S, NC). We expect similarity scores to be higher in the NEU condition, given its semantically vague context, than in the NAT condition.
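The probes themselves reduce to a few lines once an embedding function is fixed. The sketch below expresses P2 and P4 against a generic embed(sentence, span) function; the deterministic random-projection embed is only a stand-in that keeps the sketch runnable, and any real model would replace it.

```python
# Hedged sketch of probes P2 and P4 over a generic embedding function.
# `embed` here is a dummy stand-in; a real model would replace it.
import numpy as np

def embed(sentence: str, span: str | None = None) -> np.ndarray:
    seed = abs(hash(span or sentence)) % 2**32   # deterministic dummy vector
    return np.random.default_rng(seed).normal(size=768)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def sim_p2(sentence: str, nc: str, head: str, modifier: str) -> float:
    # max similarity between NC ⊂ S and either component, taken in the
    # variant sentence where the NC is replaced by that component
    return max(cosine(embed(sentence, span=nc),
                      embed(sentence.replace(nc, w), span=w))
               for w in (head, modifier))

def sim_p4(sentence: str, nc: str) -> float:
    # NC in context vs. the same NC out of context
    return cosine(embed(sentence, span=nc), embed(nc))

s = "Eric was being an eager beaver and left work late."
print(sim_p2(s, "eager beaver", "beaver", "eager"), sim_p4(s, "eager beaver"))
```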

Calculating Embeddings
We use as a baseline the static non-contextualised GloVe model (Pennington et al., 2014) and, for contextualised embeddings, four widely adopted models: ELMo (Peters et al., 2018), BERT (Devlin et al., 2019), and two BERT variants, DistilBERT (DistilB) (Sanh et al., 2019) and Sentence-BERT (SBERT) (Reimers and Gurevych, 2019b). For all the contextualised models, we use their pre-trained weights publicly available through the Flair implementation. For GloVe, we use the English and Portuguese models described in Pennington et al. (2014) and Hartmann et al. (2017). For ELMo, we use the small model provided by Peters et al. (2018) for English, and the weights provided by Quinta de Castro et al. (2018) for Portuguese. For all BERT-based models, we used the multilingual models for both English and Portuguese. To obtain a single embedding for the whole sentence or its parts (e.g., the NC representation), we use the standard procedure of averaging the vectors of the involved tokens. For GloVe and ELMo, we average the output embeddings of each word, while for BERT-based models we obtain the final vector by averaging those of the sub-tokens (e.g., 'wet', 'blank' and '##et' for wet blanket).
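The averaging procedure can be sketched with the Hugging Face transformers library as follows. The naive sub-token span matching and the choice of bert-base-multilingual-cased are illustrative assumptions; the paper's own pipeline used Flair.

```python
# Sketch: NC, NC ⊂ S and S ⊃ NC embeddings by averaging sub-token vectors
# over the last four hidden layers of a BERT model.
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained("bert-base-multilingual-cased",
                                  output_hidden_states=True)

def embed(sentence: str, span: str | None = None) -> torch.Tensor:
    enc = tok(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).hidden_states        # tuple of layer outputs
    vecs = torch.stack(hidden[-4:]).mean(0)[0]     # average last 4 layers
    if span is None:
        return vecs.mean(0)                        # S ⊃ NC (whole sentence)
    ids = enc["input_ids"][0].tolist()
    span_ids = tok(span, add_special_tokens=False)["input_ids"]
    for i in range(len(ids) - len(span_ids) + 1):  # locate the NC sub-tokens
        if ids[i:i + len(span_ids)] == span_ids:
            return vecs[i:i + len(span_ids)].mean(0)   # NC ⊂ S
    raise ValueError(f"{span!r} not found in {sentence!r}")

s = "No wonder Tom couldn't work with him; he is a wet blanket."
sim_p4 = torch.cosine_similarity(embed(s, "wet blanket"),
                                 embed("wet blanket", "wet blanket"), dim=0)
```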
Different combinations of the last five layers were probed for BERT-based models. However, they led to qualitatively similar results and, for reasons of presentation clarity, have been omitted from the discussion. We focus on embeddings calculated from a combination of the last four layers, as they were found to be representative of the other combinations. For ELMo, as it is intended to serve as a contextualised baseline, we represent the word embeddings using the concatenation of its three layers, although it is known that separate layers and weighting schemes generate better results in downstream tasks (Reimers and Gurevych, 2019a).

Results
This section discusses our results for each probing measure, using cosine similarities and Spearman ρ correlations. A qualitative analysis is also presented, comparing the BERT and GloVe results for the five NCs in Table 1 (which shows the naturalistic sentences for each NC, together with their respective NCsyn and NCsynW). We also discuss the average results of other NCs in both conditions; these results and other examples can be found in the Appendix.

Can contextualised models capture the similarity between an NC and its synonym?
If a contextualised model successfully captures idiomaticity, we would expect (i) high cosine similarity between a sentence containing an NC and its variant using a synonym of the NC (P1), and (ii) little or no correlation with the NC idiomaticity score. The results confirm high similarity values for all models, as shown in Figure 1a. However, this is not the case if we consider only the embeddings in context for NC and NCsyn, which display a larger spread of similarity values (see Figure 1b). Moreover, contrary to what was expected, a moderate correlation was found between most models and the idiomaticity scores (P1 in Table 2), indicating lower similarity scores for idiomatic than for compositional cases, for both NAT and NEU conditions. Even though the high sim(P1)_Sent values seem to suggest that idiomaticity is captured, the lower sim(P1)_NC values and moderate correlations with idiomaticity scores contradict this. Therefore, a possible explanation for the high similarities for Sent may be the effect of the overlap in words between a sentence and its variant (i.e., the context in Sent). This is also compatible with the larger similarities observed for the NAT than for the NEU condition, since the average sentence length for the naturalistic sentences is 23.39 words for English and 13.03 for Portuguese, while for the neutral ones it is five words in both languages. Moreover, a similar performance was also obtained with GloVe. It is also worth noting that, in contrast to static embeddings, contextualised word representations are anisotropic, occupying a narrow cone in the vector space and therefore tending to produce higher cosine similarities (Ethayarajh, 2019).
The results with the first probing measure show that even though the similarities can be relatively high, they are consistently lower for idiomatic than for compositional cases, suggesting that idiomaticity may not be fully incorporated in the models.

Qualitative analysis: Table 3 (P1) shows the similarity scores between the NCs in Table 1 and their respective NCsyn for the BERT and GloVe models. As expected, BERT shows higher scores than GloVe in all cases, and even if the values for P1 differ, both models follow the same tendency. There is a larger spread for GloVe (e.g., sim(P1) = 0.21 for wet blanket vs. 0.80 for ghost town), which could be explained by the choices of NCsyn. For wet blanket, NCsyn = loser, which probably has a very dissimilar representation from both wet and blanket. On the other hand, ghost town, with NCsyn = abandoned town, not only shares a word with the original NC, but ghost and abandoned are also likely to have similar embeddings. Finally, the average results for P1 show that BERT-based models tend to intensify lexical overlap, resulting in high cosine similarities when the NC and NCsyn share (sub-)words. For instance, 47 (in English) and 49 (in Portuguese) of the 50 compounds with the highest sim(P1)_NC-NAT share surface tokens, whether the NCs are more compositional (e.g., music journalist vs. music reporter) or more idiomatic (e.g., ghost town vs. abandoned town).

Can the lower semantic overlap between idiomatic NCs and their individual components be captured?
We would expect idiomatic NCs not to be similar to either of their individual components, which would be reflected in a larger spread of cosine similarity values for P2 than for P1. However, all models produced high similarities across the idiomaticity spectrum (see Figure 1c for Sent and Figure 1d for NC). The higher average similarities for P2 than for P1 (compare Figures 1a and 1b with Figures 1c and 1d) reinforce the hypothesis that the models prioritise lexical overlap with one of the NC components rather than semantic overlap with a true NC synonym, even for idiomatic cases. Although there is some correlation with idiomaticity, when it exists it is lower than for P1, contrary to what would be expected (see P1 and P2 in Table 2). All of this indicates that these models cannot distinguish the partial semantic overlap between more compositional NCs and their components from the absence of overlap for idiomatic NCs.

Qualitative analysis: The P2 results in Table 3 show the highest similarity scores between each example in Table 1 and one of its components. These high similarity scores highlight the prioritisation of lexical over semantic overlap mentioned above. Furthermore, some idiomatic NCs also show strong similarities with their components, suggesting that the idiomatic meaning is not correctly represented. For instance, poison pill (meaning an emergency exit) has an average similarity of sim(P2)_NAT = 0.94 with its head (pill).

Can they capture the lack of substitutability of individual components for idiomatic NCs?
We do not expect an idiomatic NC to keep its idiomatic meaning when each of its components is individually replaced by a synonym, and this would be reflected in lower similarity values for P3 than for P1. However, high similarity values are found across the idiomaticity spectrum, and for all models and all conditions the average similarities are higher than those for P1 (see Figures 1e and 1f). Contrary to what would be expected, the correlations with idiomaticity scores are mostly absent, and when they do exist they are much lower than for P1 (see P1 and P3 in Table 2).
The overall picture painted by P3 points towards contextualised models being unable to detect the change in meaning that takes place when individual components are replaced by their synonyms.
Qualitative analysis: For P3, Table 3 shows the similarity scores at the NC level between each NC and its NCsynW counterpart. Again, similarity scores for GloVe are considerably lower than for BERT. As expected for GloVe, sim(P3) = 0.69 for wet blanket is noticeably higher than sim(P1) = 0.21, since individually the words damp and cloak are closer in meaning to wet and blanket, respectively, than loser is. Further evidence that contextualised models are not modelling idiomaticity well is, for NAT cases, the considerably higher sim(P3) = 0.91 for wet blanket in comparison with sim(P1) = 0.77, for BERT.
Although for the other NCs sim(P3)_NC and sim(P1)_NC are comparable, the special case of the more idiomatic wet blanket highlights the issues of idiomaticity representation.

Is there a difference between an NC in and out of context?
For contextualised models, the greater the influence of the context, the lower we would expect the similarity to be between an NC in and out of context. However, especially for BERT models, the results (Figure 2) show a high similarity between the NC in and out of context (sim(P4)_in-out > 0.8). Moreover, a comparison with the similarities for the synonyms in P1 shows sim(P4)_in-out-NEU > sim(P1)_NC-NEU and sim(P1)_NC-NAT, which indicates that these models consider the NC out of context to be a better approximation of the NC in context than its synonym. In addition, for BERT models sim(P4)_in-out is only weakly correlated with the idiomaticity score (Table 4), which suggests that the context may not play a bigger role for idiomatic than for more compositional NCs.

Qualitative analysis: For BERT in the neutral sentences, the sim(P4)_in-out values of the examples in Table 1 reach up to 0.90 (eager beaver and wet blanket). Together with these examples, the general results of P4 show large differences that are not explained by the semantic compositionality of the NCs, as suggested by the weak correlation with the idiomaticity scores. In this respect, both the largest and the smallest differences between sim(P4)_in-out in the NAT and NEU conditions appear in compositional NCs (engine room with sim(P4)_in-out-NAT = 0.68 and sim(P4)_in-out-NEU = 0.89, and rice paper with sim(P4)_in-out-NAT = 0.84 and sim(P4)_in-out-NEU = 0.86). Besides, we expected ambiguous compounds such as bad apple or bad hat to have large sim(P4)_in-out differences between the two conditions, as they occur with an idiomatic meaning in the NAT sentences. However, the differences were of just 0.06 in both cases, while other less ambiguous idiomatic NCs showed higher variations (e.g., melting pot, with 0.16). In sum, the results of P4 suggest that contextualised models do not properly represent some NCs.

But how informative are the contexts?
As the neutral sentences do not provide informative contextual clues, if the NCs in the NAT and NEU conditions are similar, this provides an additional indication that, for these models, contexts are not playing an important role in distinguishing usages (in this case, between a neutral and uninformative usage and a naturalistic one). Indeed, the two conditions follow the same trends in the two languages (see Figure 1). Furthermore, there are significant correlations between the NAT and NEU conditions, some of them very strong. For example, for SBERT the correlations between the NC in context in the naturalistic and neutral conditions are ρ(P1,P2,P3)_NC(NAT/NEU) > 0.85 for English and > 0.76 for Portuguese, for probes P1, P2 and P3. This indicates that, to evaluate the effect of the variants in each of these probes, a neutral sentence is as good as a naturalistic one. This reinforces the possibility that these models do not adequately incorporate the context in a way that captures idiomaticity.
In terms of the similarity between a sentence and its variants, as we assume that the representation of a sentence corresponds to the average of its individual components, sentence length may have a strong impact on cosine similarity. This would explain the high values obtained for sentence similarities throughout the probes, as they could be more an effect of the number of words in a sentence than of their semantic similarity. Indeed, the correlation between naturalistic sentence length and the cosine similarities for the first three probes is moderate to strong for all models (Table 5), and higher for some of the contextualised models than for the baseline (e.g., DistilB for P2 in English).

Table 5: Spearman ρ correlation between naturalistic sentence length and cosine similarity, p ≤ 0.001.

Other Operations
As mentioned in Section 3.3, we have used vector averaging to obtain the NC embedding, as it is the standard procedure to represent not only MWEs but also out-of-vocabulary words, which are split into sub-tokens in contextualised models (Nandakumar et al., 2019; Wiedemann et al., 2019). However, we have also explored other methods to represent NCs in a single vector. First, we incorporated type-level vectors of the NCs into a BERT model, inspired by compositionality prediction methods (Baldwin et al., 2003; Cordeiro et al., 2019). To do so, we annotated the target NCs in large English and Portuguese corpora (Baroni et al., 2009; Wagner Filho et al., 2018) and used attentive mimicking with one-token approximation (Schick and Schütze, 2019, 2020b) to learn a vector for each NC from up to 500 contexts. These new vectors encode each NC in a single representation, therefore avoiding possible biases produced by the compositional operations. Then, we used BERTRAM (Schick and Schütze, 2020a) to inject these type-level vectors into the multilingual BERT model. As expected, learning the vectors of the NCs as single tokens improved the representation of idiomatic expressions (see BERTRAM in Tables 2 and 4), decreasing the correlation with idiomaticity in P1 (e.g., ρ(P1)_NC-NAT = 0.30 in English), and increasing it in P2 (ρ(P2)_NC-NAT = 0.45) and P3 (ρ(P3)_NC-NAT = 0.39 > ρ(P1)_NC-NAT). For P4, the correlation also increased in NAT contexts. In sum, these results were in general better and more statistically significant (at the expense of re-training a model).
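BERTRAM itself learns and injects these vectors via attentive mimicking; purely as an illustration of the injection step, a pre-learned single-token vector can be added to a BERT vocabulary along the following lines (the token name grey_matter and the random nc_vector are hypothetical stand-ins).

```python
# Illustrative sketch of injecting a type-level vector as a new single token
# (BERTRAM's actual procedure is more involved). The vector here is random;
# in practice it would come from attentive mimicking.
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained("bert-base-multilingual-cased")

nc_vector = torch.randn(model.config.hidden_size)  # stand-in learned vector

tok.add_tokens(["grey_matter"])                    # one token for the NC
model.resize_token_embeddings(len(tok))
new_id = tok.convert_tokens_to_ids("grey_matter")
with torch.no_grad():
    model.get_input_embeddings().weight[new_id] = nc_vector

ids = tok("This is grey_matter .", return_tensors="pt")  # now a single token
```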
Second, we compared the performance of averaging vs. concatenating the vectors of the NC sub-words. In this case, we selected the utterances in English including NCs with the same number of sub-words as their synonyms (273 sentences), thus allowing for vector concatenation. Using this operation instead of averaging slightly improved the results of the BERT-based models (e.g., ≈0.06 higher correlations on average for P3 NAT) and produced more significant values.
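The difference between the two pooling operations can be seen in a small sketch (random tensors stand in for the sub-token embeddings; concatenation only applies when the two expressions have the same number of sub-words):

```python
# Sketch: averaging vs. concatenating sub-token vectors before comparing
# two expressions. Random tensors stand in for real sub-token embeddings.
import torch

nc = torch.randn(2, 768)     # e.g. sub-tokens of "grey matter"
syn_w = torch.randn(2, 768)  # e.g. sub-tokens of "silvery material"

sim_avg = torch.cosine_similarity(nc.mean(0), syn_w.mean(0), dim=0)
sim_cat = torch.cosine_similarity(nc.flatten(), syn_w.flatten(), dim=0)
# averaging mixes positions; concatenation keeps them aligned, but requires
# an equal number of sub-word vectors on both sides
```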
As the latter approach does not involve re-training a model, in further work we plan to probe other concatenation and pooling methods able to compare MWEs with different numbers of input vectors (e.g., grey matter vs. brain), which have achieved good results in sentence embeddings (Rücklé et al., 2018).

Conclusions
This paper presented probing tasks for assessing the ability of vector space models to retain the idiomatic meaning of NCs in the presence of lexical substitutions and different contexts. For these evaluations, we constructed the NCS dataset, with a total of 9,220 sentences in English and Portuguese, including variants with synonyms of the NC and of each of its components, in neutral and naturalistic sentences. The probing tasks revealed that contextualised models may not detect that idiomatic NCs have a lower degree of substitutability of the individual components when compared to more compositional NCs. This behaviour is similar in the controlled neutral and naturalistic conditions both in English and Portuguese.
The next steps are to extend the probing strategy with additional measures that go beyond similarities and correlations. Moreover, for ambiguous NCs, we intend to add probes for the different senses. Finally, we also plan to apply them to more languages, examining how multilingual information can be used to refine the representation of noun compounds and other MWEs.

Appendices
A Naturalistic Examples in English

Table 6 includes naturalistic examples in English, with the compositionality scores provided by the annotators and the BERT and GloVe results at the NC level (e.g., It was a sorrowful day, a sixth sense alerted me that one bad thing pulls another).

B Naturalistic Examples in Portuguese

Table 7 includes naturalistic examples in Portuguese, with the compositionality scores provided by the annotators and the BERT and GloVe results at the NC level (e.g., Normalmente, os restaurantes encontram-se dentro de centros comerciais; "Restaurants are usually located inside shopping malls", lit. "commercial centres").