Assessing the Representations of Idiomaticity in Vector Models with a Noun Compound Dataset Labeled at Type and Token Levels

Accurate assessment of the ability of embedding models to capture idiomaticity may require evaluation at token rather than type level, to account for degrees of idiomaticity and possible ambiguity between literal and idiomatic usages. However, most existing resources with annotation of idiomaticity include ratings only at type level. This paper presents the Noun Compound Type and Token Idiomaticity (NCTTI) dataset, with human annotations for 280 noun compounds in English and 180 in Portuguese at both type and token level. We compiled 8,725 and 5,091 token-level annotations for English and Portuguese, respectively, which are strongly correlated with the corresponding scores obtained at type level. The NCTTI dataset is used to explore how vector space models reflect the variability of idiomaticity across sentences. Several experiments using state-of-the-art contextualised models suggest that their representations do not capture the idiomaticity of noun compounds in the way human annotators do. This new multilingual resource also contains suggestions for paraphrases of the noun compounds at both type and token levels, with potential uses for lexical substitution or disambiguation in context.


Introduction
Multiword Expressions (MWEs), such as noun compounds (NCs), have long been considered a challenge for NLP (Sag et al., 2002). This is partly due to the wide range of idiomaticity that they display, from more literal to idiomatic combinations (olive oil vs. shrinking violet). The task of identifying the degree of idiomaticity of MWEs has been investigated at type level, to determine the potential of an MWE to be idiomatic in general. Some of these approaches are based on the assumption that the distance between the representation of an MWE as a unit and the representation of the compositional combination of its components is an indication of the degree of idiomaticity: they are closer if the MWE is more compositional. Good performance is obtained even with non-contextualised word embeddings like word2vec (Mikolov et al., 2013) and vector operations like addition and multiplication (Mitchell and Lapata, 2010; Reddy et al., 2011; Cordeiro et al., 2019). Additionally, some MWEs are potentially ambiguous between an idiomatic and a literal sense: the potentially idiomatic MWE brass ring can denote the more literal meaning of a ring made of brass or the more idiomatic sense of a prize. Considering that these MWEs can have both idiomatic and literal senses, a related task of token-level identification evaluates whether an MWE is idiomatic or not in a particular context. For this task, models that incorporate the context in which an MWE occurs tend to be better equipped to distinguish idiomatic from literal occurrences (Sporleder and Li, 2009; King and Cook, 2018; Salton et al., 2016).
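This type-level intuition can be illustrated with a minimal sketch, where toy vectors stand in for word2vec embeddings (the NC names are from the paper's examples, but all values are purely illustrative):

```python
import math

def cosine(u, v):
    # Cosine similarity between two dense vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(a * a for a in v))
    return dot / (nu * nv)

def additive_compositionality(mwe_vec, component_vecs):
    # Type-level proxy: similarity between the MWE-as-a-unit vector
    # (e.g. one learned for "olive_oil") and the additive combination
    # of its component vectors. Higher similarity ~ more compositional.
    combined = [sum(dims) for dims in zip(*component_vecs)]
    return cosine(mwe_vec, combined)

# Toy vectors (stand-ins for real static embeddings).
olive_oil = [0.9, 0.1, 0.4]          # close to olive + oil -> compositional
olive, oil = [0.5, 0.0, 0.2], [0.4, 0.1, 0.2]
shrinking_violet = [0.0, 1.0, -0.5]  # far from its parts -> idiomatic
shrinking, violet = [0.6, 0.1, 0.3], [0.3, 0.2, 0.1]

compo = additive_compositionality(olive_oil, [olive, oil])
idio = additive_compositionality(shrinking_violet, [shrinking, violet])
```

With these toy values the compositional NC scores near 1 while the idiomatic NC scores near 0, mirroring the assumption described above.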
Contextualised embedding models, like BERT (Devlin et al., 2019), brought significant advances to a variety of downstream tasks (e.g. Zhu et al. (2020) for machine translation and Jiang and de Marneffe (2019) for natural language inference). They also seem to benefit tasks like idiomaticity and metaphor identification (Gao et al., 2018), since the interpretation of these phenomena is often dependent on contextual clues. Nonetheless, previous work found that non-contextualised models still provide informative clues for these tasks (King and Cook, 2018), and that their combination with contextualised models can improve results (e.g. for metaphor identification (Mao et al., 2019)). This complementarity between non-contextualised and contextualised models may be an indication that enough core idiomatic information is already available at type level. Moreover, type-based compositionality prediction measures that perform well with static embeddings may also perform well for token-based prediction with contextualised models.
To address these questions, in this paper, we present the Noun Compound Type and Token Idiomaticity (NCTTI) dataset, containing 280 NCs in English and 180 in Portuguese, annotated with the degree of idiomaticity perceived by human annotators, at type and token level. 1 NCTTI contains a total of 8,725 annotations in 840 different sentences in English, and 5,091 annotations in 540 sentences in Portuguese. Moreover, NCTTI has several paraphrases for each NC which are classified as either type level or token level equivalents. To control for the level of idiomaticity, the NCTTI dataset has a balanced amount of compositional, partly compositional and idiomatic items. As the importance of context to determine interpretation may be related to factors like the degree of idiomaticity, association strength or the frequency of an NC, we present an illustrative analysis of their impact for the performance of different models in capturing idiomaticity. We also examine how the performance obtained for human idiomaticity judgments per type differs from the performance obtained per token.
Our contributions can be summarised as: (1) building the NCTTI dataset with information about type and token idiomaticity for NCs in two languages, (2) evaluating to what extent models are able to detect idiomaticity at type and token level, analysing different levels of contextualisation and (3) proposing two new measures of idiomaticity. Moreover, the paraphrases provided for each NC at type and token level make NCTTI a useful resource for enhancing paraphrase datasets (e.g. PPDB (Ganitkevitch et al., 2013)), for tasks involving lexical substitution (McCarthy and Navigli, 2007;Mihalcea et al., 2010), or for improving the results of downstream tasks, such as text simplification (Paetzold, 2016;Alva-Manchego et al., 2020). Such paraphrases may also be useful for improving the task of machine translation, avoiding the need for parallel MWE corpora (Zaninello and Birch, 2020).
Section 2 gives an overview of existing idiomaticity datasets. Section 3 presents the NCTTI dataset and the annotations, and section 4 discusses the evaluation of the performance of different word embeddings in detecting idiomaticity.
Regarding the use of contextualised embeddings to model idiomaticity, Nandakumar et al. (2019) compared different static and contextualised embeddings to predict the compositionality of NCs, obtaining better results with static vectors learnt individually for each NC. Shwartz and Dagan (2019) trained various classifiers initialised with static and contextualised embeddings for different compositionality tasks, achieving the best results with BERT embeddings. Yu and Ettinger (2020), using partially idiomatic expressions of the BiRD dataset (Asaadi et al., 2019), showed that contextualised embeddings from language models rely heavily on word content, missing the additional information provided by compositional operations.
In this paper we take advantage of the NCTTI dataset to observe whether vector representations obtained with different strategies correlate with human annotations at both type and token levels.

The Noun Compound Type and Token Idiomaticity dataset
This section describes the procedure to create the NCTTI dataset and its main characteristics. 2

Source data
We used as basis the English and Portuguese subsets of the NC Compositionality dataset (Cordeiro et al., 2019), which contain compositionality scores for 280 two-word NCs in English (90 of which came from Reddy et al. (2011)), and 180 in Portuguese, all of them labeled at type level: i.e., the annotators provided a compositionality value for a compound (from 0, fully idiomatic, to 5, fully compositional) after reading various sentences with this NC.
To obtain more fine-grained, compatible token-level annotations about the impact of different contexts on the interpretation of NCs, we used the same original sentences as in the source dataset (three sentences per compound with the same sense were selected from the Reddy et al. (2011) dataset). 3 Language experts classified each noun compound regarding its semantic compositionality as idiomatic (e.g., gravy train), partially idiomatic (e.g., grandfather clock), or compositional (e.g., research project). For English, this resulted in 103 idiomatic, 88 partially idiomatic, and 89 compositional compounds. For Portuguese, each class has 60 compounds, as the selection had been balanced when the source dataset was created.

Annotation procedure
We used the same protocol as Reddy et al. (2011) and Cordeiro et al. (2019), asking each participant to give 0 to 5 scores for an NC and its components in a specific sentence (e.g., glass ceiling in "Women are continuing to slowly break through the glass ceiling of UK business [. . . ]"). In particular, we asked participants for: (i) the contribution of the head to the meaning of the NC (e.g., is a glass ceiling literally a ceiling?); (ii) the contribution of the modifier to the meaning of the NC (e.g., is a glass ceiling literally of glass?); and (iii) the degree of compositionality of the compound (i.e., to what extent the meaning of the NC can be seen as a combination of its parts). Additionally, we asked for up to three synonyms of the NC in that particular sentence (e.g., synonyms at token level).
We used Amazon Mechanical Turk to obtain the annotations for English, and a dedicated online platform for the questionnaire in Portuguese, 4 as we could not find a suitable number of annotators for this language in AMT. 5 Taking this into account, the numbers of Portuguese annotations are in general lower than those obtained for English.
For each language, we have included the three sentences of every compound in the dataset (840 sentences in English, and 540 in Portuguese), which were randomly submitted to the annotators.
For English, we compiled at least 10 annotations per sentence, resulting in 8,725 annotations (10.4 annotations per sentence on average). A total of 412 annotators took part in the process, and on average each participant labeled 21 instances. For Portuguese, we set the threshold at 5 annotations per sentence: we obtained 5,091 annotations by 33 participants, so that each sentence has a mean of 9.4 annotations and each annotator labeled on average 154 sentences.

Results
Inter-annotator agreement: we computed inter-annotator agreement for the two and three annotators with the largest number of sentences in common (Table 1). For English, we obtained Krippendorff's α (Krippendorff, 2011) values of 0.30 for two annotators (199 sentences) and 0.22 for three annotators (76 sentences). The α values for Portuguese were 0.52 for two annotators (131 sentences) and 0.44 for three annotators (60 sentences). Overall, using the divisions proposed by Landis and Koch (1977), the agreement can be classified as 'fair' (for English) and 'moderate' (for Portuguese).

Table 3: Mean compositionality scores for each class in English and Portuguese (from 0, fully idiomatic, to 5, fully compositional), and standard deviations. Left columns contain the scores for the whole compound, while the values for the head and modifier are in the middle and right columns, respectively. The type averages for the NCs reported by Cordeiro et al. (2019) are 1.1, 2.4, and 4.2 for English and 1.3, 2.5, and 3.9 for Portuguese.

Type-token correlations: we computed the correlations between the token-level compositionality scores of our dataset and those of the original resource (NC Compositionality dataset). Table 2 contains the correlation results for each language and compositionality class. The strong to very strong significant correlations confirm the robustness between type-level and token-level human compositionality annotations for these two datasets. 6

Idiomaticity values: with regard to the idiomaticity values of each class, Table 3 displays both the average scores and the standard deviations in both languages. As expected, for the whole compounds, partially idiomatic NCs are those with the highest standard deviations, and their mean compositionality values lie in the middle of the scale (2.34 and 2.46). In English, the results for both idiomatic and compositional compounds are more homogeneous, as they are clearly located at the margins of the scale (< 1 and > 4, respectively), with lower deviations.
This is not the case in Portuguese, where the average values are > 1 and < 4 for idiomatic and compositional NCs, respectively, placing even the idiomatic cases closer to the middle of the scale. With respect to the average values for the heads and modifiers, we can highlight the following observations: first, both head and modifier scores are consistently higher than the means for the whole compound in every scenario, also suggesting at least partial compositionality in their token occurrences. Second, for idiomatic NCs, the scores of the modifiers are higher than those of the heads, while for partially compositional NCs the results are the opposite. 7 Finally, at the compositional level, the modifier values are higher in English, while in Portuguese the heads seem to contribute more to the meaning of the NC.

6 Removing annotators with low agreement (Spearman ρ < 0.2, and ρ < 0.4) resulted in almost identical correlations.

7 The results for partially idiomatic compounds are expected to some extent, as the head tends to bear more semantic load in the whole expression (e.g., as in collocations).
Observing the variability across the annotations, we found some divergence in a few compounds (e.g., brass ring labeled as idiomatic for a compositional occurrence "Three drawers, each with a brass ring pull, provide plenty of storage whatever you use it for."), which hints at possible interference from a salient meaning (Giora, 1999). However, further investigation is needed.
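The comparison between token- and type-level annotations described above can be sketched as follows, averaging the token-level means per NC and correlating them with the type-level scores via a hand-rolled Spearman's ρ (all scores here are hypothetical stand-ins for the dataset values):

```python
from statistics import mean

def rankdata(xs):
    # Average ranks with tie handling, as used in Spearman's rho.
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(xs):
        j = i
        while j + 1 < len(xs) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1          # average rank for the tied block
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(xs, ys):
    # Spearman's rho = Pearson correlation of the ranks.
    rx, ry = rankdata(xs), rankdata(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

# Hypothetical scores: token-level means over three sentences per NC,
# and the corresponding type-level score from the source dataset.
token_scores = {"gravy train": [0.4, 0.6, 0.5],
                "grandfather clock": [2.3, 2.6, 2.2],
                "research project": [4.6, 4.8, 4.5]}
type_scores = {"gravy train": 0.5, "grandfather clock": 2.4,
               "research project": 4.7}

ncs = sorted(token_scores)
rho = spearman([mean(token_scores[nc]) for nc in ncs],
               [type_scores[nc] for nc in ncs])
```

With these toy values the two annotation levels rank the NCs identically, giving a ρ of 1; the actual dataset correlations are reported in Table 2.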
Paraphrases: as mentioned, we asked the participants to provide synonyms or paraphrases for the noun compounds in each particular context. In this respect, it is worth noting that while some suggestions may be applicable across all the sentences for an NC (e.g. spun sugar for cotton candy, considered a type-level synonym), others are more dependent on context and differ for specific sentences (e.g. flight recorder and unknown process for black box, which can be considered token-level paraphrases). To organise the large set of paraphrases provided by the annotators (see below), we classified them automatically as type or token level using the following procedure: we labeled as type-level synonyms those paraphrases proposed for all three sentences of a compound, and those suggested for two sentences with a frequency >= 3; token-level synonyms are those proposed for only one sentence with a frequency >= 2.
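A minimal sketch of this classification rule (the function and the data layout are our own; the thresholds follow the procedure described above):

```python
def classify_paraphrase(sentence_counts):
    """Classify one paraphrase of an NC as a type- or token-level synonym.

    `sentence_counts` maps sentence ids (each NC occurs in three sentences)
    to the number of annotators who proposed this paraphrase there.
    Returns 'type', 'token', or None (discarded as too infrequent).
    """
    n_sentences = len(sentence_counts)
    total = sum(sentence_counts.values())
    if n_sentences == 3:
        return "type"                  # proposed for all three sentences
    if n_sentences == 2 and total >= 3:
        return "type"                  # two sentences, frequency >= 3
    if n_sentences == 1 and total >= 2:
        return "token"                 # one sentence, frequency >= 2
    return None

# e.g. "spun sugar" for cotton candy, suggested in every sentence:
spun_sugar = classify_paraphrase({"s1": 4, "s2": 2, "s3": 3})   # "type"
# e.g. "flight recorder" for black box, only in one (aviation) sentence:
flight_recorder = classify_paraphrase({"s2": 3})                # "token"
```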
In English, 9,690 different paraphrases were proposed by the annotators (an average of 34.60 per NC), and 3,554 were suggested by at least 5 participants (an average of 12.70 per NC). Of these, 1,506 were classified as type level (5.4 synonyms per NC, on average), and 353 as token level (0.42 per sentence, 1.3 per NC). Overall, 118 NCs have token-level synonyms for one sentence, 69 for two sentences, and 16 for all three sentences.
For Portuguese, the annotators suggested a total of 6,579 paraphrases (314 by at least 5 participants and 764 by >= 3, average of 4.2 per NC). 743 synonyms were proposed for the 180 compounds (an average of 4.1 per NC) and classified as type level. Concerning token-level synonyms, we collected 192 synonyms (1.1 per NC, on average). In this case the total number of annotations was lower, and the final resource contains 61 NCs with token-level synonyms for one sentence, 38 for two sentences, and 6 compounds with token-level synonyms for all three sentences.

Sentence | Mean | Paraphrase
Keri enjoys music and has turned into a skilled disc jockey. | 1.2 | record player
Quality wedding disc jockey equipment comes at a cost. | 2.5 | broadcaster
Let one of our high energy disc jockeys entertain your next party. | 1.7 | announcer
Idiomaticity score at the type level: 1.25. Most common (type-level) paraphrase: DJ.

Table 4: Annotation example of the English NC disc jockey. Each row includes a sentence with the target NC together with the mean idiomaticity score and a token-level paraphrase. The bottom row shows the most common (type-level) paraphrase and the mean idiomaticity score from the original dataset (also at the type level).
The collection of paraphrases included in the NCTTI makes this dataset a valuable resource for different evaluations, such as lexical substitution tasks and assessments of the ability of embedding models to correctly identify contextualised synonyms of NCs with different degrees of idiomaticity. Table 4 shows an annotation example for the NC disc jockey in English. It includes the three sentences together with the average idiomaticity scores and both token-level and type-level paraphrases.

Experiments
This section presents comparative analyses of the relevance of type- and token-level annotation for idiomaticity detection. First, we adapt the type-level compositionality prediction approaches used with static word vectors (Mitchell and Lapata, 2010) to contextualised models (Nandakumar et al., 2019), here computing the correlation also at token level. In particular, the assumption is that compositionality can be approximated by the distance between the representation of an NC and the representation of the compositional combination of its individual components. Then, we measure whether the vector representations reflect the variability of the human annotators, who capture different nuances of the NCs depending on the sentences in which they occur. Similarly, in a third experiment we use the standard deviations of the idiomaticity scores in the three contexts to observe how the interpretation of the NCs varies across sentences, and whether this correlates with the contextualised representations produced by various models. More specifically, we assume that, if models adequately incorporate contextual information, the standard deviations of the similarities between the NCs in different contexts should be correlated with those of the human annotators.

Models
We evaluate four contextualised models: three BERT variants, based on the Transformer architecture (Vaswani et al., 2017), and ELMo, which learns word vectors using bidirectional LSTMs (Peters et al., 2018). For English we used monolingual models, while for Portuguese we resorted to multilingual versions of the BERT variants. The representations of NCs (and their sentences) were obtained by averaging the word (or subword, if adopted by the model) embeddings. We used the concatenation of the three layers for ELMo and of the last four hidden layers for the BERT models. For GloVe, used as a static baseline, words which are not in the vocabulary were skipped.
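As a rough sketch of how such NC representations can be assembled (the function name and the nested-list layout of the hidden states are our assumptions, not the paper's actual implementation):

```python
def nc_vector(hidden_states, nc_token_idxs, layers=(-4, -3, -2, -1)):
    """Build an NC representation from contextualised subword embeddings.

    `hidden_states[l][t]` is the vector (a list of floats) of token t at
    layer l, as returned by a BERT-style encoder; `nc_token_idxs` are the
    positions of the NC's (sub)word pieces in the sentence. Following the
    setup above, we concatenate the chosen layers (the last four for BERT;
    for ELMo all three layers would be used) and average over the pieces.
    """
    per_token = []
    for t in nc_token_idxs:
        concat = []
        for l in layers:
            concat.extend(hidden_states[l][t])   # layer concatenation
        per_token.append(concat)
    dim = len(per_token[0])
    # Average the concatenated vectors over the NC's (sub)word pieces.
    return [sum(vec[d] for vec in per_token) / len(per_token)
            for d in range(dim)]

# Toy hidden states: 4 layers, 3 tokens, 2 dimensions per vector.
toy_states = [[[float(10 * l + t)] * 2 for t in range(3)] for l in range(4)]
toy_nc = nc_vector(toy_states, [1, 2])   # NC spans tokens 1 and 2
```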

Experiment 1: Compositionality prediction
Unsupervised type-level idiomaticity identification with static non-contextualised word embeddings often assumes that the similarity between the NC embedding and the compositional embedding of the component words (e.g. police car vs. police and car) is an indication of idiomaticity (Mitchell and Lapata, 2010): the more similar they are, the more compositional the NC is. To approximate this with contextualised models, we calculate the cosine similarities between the contextualised vector of the NC in each sentence and two types of non-contextualised vectors. The first evaluates whether, even in the absence of an informative sentence context, each of the component words would be enough of a trigger to cue the NC meaning (e.g. eager for eager beaver). This is implemented as the vector for the NC out of context, obtained by feeding the model only with the compound, dubbed NC out. 9 The second non-contextualised vector evaluates whether the representations for the individual words have enough information to reconstruct the meaning of the NC in the absence of context and of the collocated component. It is implemented as the sum of the individual vectors of the NC components, where each NC component is fed individually to the model as a sentence, referred to as NC out Comp . In each case, we calculate two Spearman correlations with human judgments: at token level, using all the sentences for each language; and at type level, comparing the average cosine similarities of each NC with their compositionality scores at type level. We also compute correlations between the similarities and frequency-based data, namely the NC raw frequency and the PPMI (Church and Hanks, 1990) between its component words, to verify whether they have any impact on these measures of idiomaticity. The frequency data were obtained from ukWaC, with 2.25B tokens in English (Baroni et al., 2009), and brWaC, containing 2.7B tokens in Portuguese (Wagner Filho et al., 2018). The results by Cordeiro et al.
(2019) suggested that if the two components of an NC are processed as a single token unit (for instance, by explicitly linking them with an underscore) the resulting static representation captures the NC's idiomatic meaning. This is not surprising, since by linking the two components we create a new word that is treated by the model as completely independent of the preexisting component words. But such preprocessing may not be desirable or even feasible. In this sense, contextualised models seemed promising, since we expected that, when processing a sentence with an idiomatic NC, the context would be enough to lead the model to link the component words and assign the corresponding idiomatic meaning. Figuratively speaking, the contextualised models would put the underscore for us. Therefore, if contextualised models capture idiomaticity, the similarity between NC and NC out Comp (or NC out) should have strong correlations with the idiomaticity scores of the NCs. Table 5 shows the significant correlations in English (top rows) and Portuguese (bottom). These results indicate at best weak (NC out Comp ) to moderate (NC out) correlations between the models' predictions and human judgments, both at type and token levels. Moreover, the correlations obtained are much smaller than those found with the static models used by Cordeiro et al. (2019). For English, the best correlations (0.37) were obtained by BERT, while ELMo and Sentence-BERT achieved the best performance in Portuguese (0.27 and 0.26, respectively). In both languages, the lowest values were those of DistilBERT. It is worth noting that a direct comparison between the BERT models across the two languages should not be made, as they are monolingual (for English) and multilingual (for Portuguese).
For PPMI, only weak positive correlations were found for ELMo and DistilBERT, indicating that for these models higher cosine values weakly imply NCs with stronger association scores. Moreover, weak to moderate negative correlations with frequency were found for the BERT models, suggesting that cosine similarity is higher for less frequent NCs. The differences between NC out and NC out Comp indicate the importance of some degree of contextualisation (also found by Yu and Ettinger (2020)), even if only as one component contextualising the other in NC out, which may not be retrievable from the combination of the context-independent vectors of the components (NC out Comp ). This is in line with the original strategy used with static embeddings, which learns the distribution of NCs pre-identified as single tokens in corpora, and which resulted in significantly better correlations per type than any of the contextualised models (Cordeiro et al., 2019).
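The NC out setting of this experiment can be sketched as follows, with toy vectors standing in for the contextualised and out-of-context embeddings (the function name is ours):

```python
import math
from statistics import mean

def cosine(u, v):
    # Cosine similarity between two dense vectors.
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def idiomaticity_scores(nc_in_context, nc_out_vec):
    """Token- and type-level predictions for one NC.

    `nc_in_context` holds one contextualised NC vector per sentence
    (toy vectors here; in the paper they come from BERT/ELMo), and
    `nc_out_vec` is the vector of the NC fed to the model on its own
    (the NC out setting). Higher similarity ~ more compositional.
    """
    token_level = [cosine(v, nc_out_vec) for v in nc_in_context]
    type_level = mean(token_level)   # average over the three sentences
    return token_level, type_level

# One NC in three sentences, plus its out-of-context vector.
token_sims, type_score = idiomaticity_scores(
    [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]],   # NC vector in each sentence
    [1.0, 0.0])                              # NC fed to the model alone
```

These per-NC scores would then be correlated (Spearman's ρ) with the human judgments at token level, and their averages with the type-level scores.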
To make a fairer comparison between both approaches, we injected into the BERT models single representations for the NCs, learnt from the aforementioned ukWaC and brWaC corpora. We first annotated as single tokens in the corpus those NCs present in the dataset, and used attentive mimicking with one-token approximation (Schick and Schütze, 2019, 2020b) to learn a vector for each compound from up to 500 contexts. After that, we injected these type-level vectors into the BERT models using BERTRAM (Schick and Schütze, 2020a). For English, these new representations obtained lower results than the original BERT in NC out (e.g., 0.37 vs. 0.28 at type level), but higher in NC out Comp (0.16 vs. 0.33 at type level). For Portuguese, including single representations for the NCs in BERT improved the correlations in three of the four scenarios (except for NC out at token level), but the best results were almost identical to those of ELMo (see the full results in the bottom rows of Table 5).
Regarding the results reported by Nandakumar et al. (2019), for English, our experiments yielded higher correlations for BERT and lower for ELMo (≈ 0.3 in both cases, depending on the setting), which may be due to differences in how the vectors are generated (e.g., the use of different input sentences, hidden layers or compositional operations).
In sum, the results of these evaluations suggest that the use of a straightforward adaptation of a compositionality prediction approach that led to good performance with static models was not as successful with contextualised models.

Experiment 2: Investigating idiomaticity with word embedding models
We analyse whether models are able to capture differences in idiomaticity perceived by human annotators across the sentences in which an NC occurs, that is, whether an NC is found to be more idiomatic in one sentence than in others. For that, we created an annotators' vector for each sentence, combining the human scores into a three-dimensional vector representation, where the first dimension is the average NC compositionality, and the second and third are the average scores of the contributions of the head and of the modifier. To represent a sentence, we obtain an embedding by averaging its (sub)words. For each of the three possible pairs of sentences associated with an NC, we calculated (i) the Euclidean distance between the annotators' vectors and (ii) the cosine similarity between the sentence embeddings. Then, we measured the correlation between these values using Spearman's ρ. We aim to assess whether annotations and models indicate the same relative differences. 10 The results were averaged over the 280 (English) and 180 (Portuguese) NCs. Table 6 shows the results for the whole datasets and divided by compositionality level. As we compare Euclidean distances with cosine similarities, negative values are actually positive correlations and vice versa. The average ρ is close to 0, suggesting that the embedding models do not capture the nuances in idiomaticity perceived by the annotators between the different sentences per NC.
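A minimal sketch of this pairwise comparison for a single NC (annotator scores and sentence embeddings are toy values; because distances are compared with similarities, a ρ of -1 indicates perfect agreement):

```python
import math
from itertools import combinations

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def ranks(xs):
    # Ranks for untied values (sufficient for three distinct pairs).
    order = sorted(xs)
    return [order.index(x) + 1 for x in xs]

def spearman(xs, ys):
    # Classic rank-difference formula for tie-free data.
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# One NC with three sentences: hypothetical [NC, head, modifier] score
# averages from the annotators and toy sentence embeddings from a model.
annotator_vecs = [[1.2, 2.0, 1.5], [2.5, 3.1, 2.8], [1.7, 2.4, 2.0]]
sentence_embs = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]]

pairs = list(combinations(range(3), 2))          # (0,1), (0,2), (1,2)
human = [euclidean(annotator_vecs[i], annotator_vecs[j]) for i, j in pairs]
model = [cosine(sentence_embs[i], sentence_embs[j]) for i, j in pairs]
rho = spearman(human, model)   # with 3 pairs, rho is one of ±0.5 or ±1
```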

Experiment 3: NC idiomaticity across sentences
We also analysed the similarity among the annotations for each NC in the three sentences, computing the standard deviations of the average compositionality scores given by the annotators. In contrast to the previous experiment, here we represent the human annotations using only the idiomaticity scores of the whole NCs, and the models' output as the contextualised embedding of the NCs in each sentence. At token level, most compounds (85.7% in English and 91.1% in Portuguese) have mean idiomaticity scores with a standard deviation below 0.6. Very few NCs have deviations higher than 1: five in English and four in Portuguese. Looking at the contexts in which they occur, the variability seems to be due to the different topics to which the sentences refer. For instance, the annotators identified two senses of firing line: one, more idiomatic, referring to a position in which someone is criticised (mean score of 1.25), and a second one (partially compositional, with an average of 2.7) referring to a specific position in an armed conflict. In Portuguese, céu aberto ('open-air', lit. 'open-sky') was interpreted as less compositional (1.2) when describing urban settings (e.g., open-air shopping centers) than when referring to wild places (e.g., lobas que lutavam a céu aberto, 'wolves fighting in the open'), with a mean idiomaticity score of 3.
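The per-NC variability measure used here can be sketched with the firing line example (the per-sentence means are illustrative, and the compositional NC is added for contrast):

```python
from statistics import pstdev

def sentence_variability(mean_scores):
    """Standard deviation of an NC's mean idiomaticity score across
    its three sentences; higher values suggest sense variation."""
    return pstdev(mean_scores)

# firing line: an idiomatic "position of criticism" reading in two
# sentences vs a more literal armed-conflict reading in the third
# (hypothetical per-sentence means based on the scores above).
firing_line = [1.25, 1.25, 2.7]
# research project: a stable compositional reading in every context.
research_project = [4.6, 4.7, 4.5]

fl_var = sentence_variability(firing_line)
rp_var = sentence_variability(research_project)
```

Ambiguous NCs like firing line yield a markedly higher deviation than NCs with a single stable sense.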
10 Spearman's ρ is not used here as a statistical test but as a measure to evaluate whether the sentence comparisons with two different metrics yield the same relative differences. As there are only three sentences to compare, ρ assumes only four values: ±0.5 or ±1.

To observe whether language models capture these differences across sentences, we calculated the cosine similarities between the NCs in the three sentences and the standard deviation of these three values. We then computed the Spearman correlations between these deviations obtained from the models' representations and those of the human annotations: all correlations were very low and not significant, suggesting that the vector representations do not capture the variability perceived by the annotators. Finally, we also selected two NCs in English with a combination of idiomatic and compositional meanings (brick wall and gold mine). In these examples, we found that for BERT (our best model) the cosine similarities between the idiomatic meanings were higher (0.83 in both cases) than between idiomatic and compositional senses (0.68 and 0.7, respectively), suggesting that it is to some extent identifying the different senses. However, since the highest standard deviations were obtained for NCs representing the same sense in all contexts (e.g., big wig and grass root), further analysis is needed.
As neither the cosine similarities obtained with BERT-based models nor the standard deviations between them were correlated with the variation in the human scores, these analyses suggest that state-of-the-art contextualised models still do not model semantic compositionality as human annotators do.
The experiments performed in this section have shown, on the one hand, some of the possibilities of a multilingual dataset labeled at type and token level; on the other hand, the results also suggest that capturing idiomaticity is a hard task for current language models, as only some of them show moderate correlations with human annotations in some scenarios.

Conclusions and Future Work
This paper presented the NCTTI, a dataset of NCs in English and Portuguese annotated at type and token level with human judgments about idiomaticity, and with suggestions of paraphrases. The very strong correlations found between type and token judgments confirm the robustness of the scores, while the paraphrases provide further validation of the interpretation of the NCs.
Moreover, evaluations involving embedding models with different levels of contextualisation suggest that they are still far from providing accurate estimates of NC idiomaticity, at least using the measures proposed and analysed in the paper. MWEs are still a pain in the neck for NLP, and datasets like the NCTTI can contribute towards finding better representations for them and better measures for idiomaticity identification.
Future work includes using these NCs as seeds in cross-lingual representations for enriching the dataset with NC equivalents in different languages. Besides, we also plan to enlarge the datasets including a subset of sentences with ambiguous NCs having idiomatic and compositional interpretations depending on the context.