Visualizing the Obvious: A Concreteness-based Ensemble Model for Noun Property Prediction

Neural language models encode rich knowledge about entities and their relationships which can be extracted from their representations using probing. Common properties of nouns (e.g., red strawberries, small ant) are, however, more challenging to extract compared to other types of knowledge because they are rarely explicitly stated in texts. We hypothesize this to mainly be the case for perceptual properties which are obvious to the participants in the communication. We propose to extract these properties from images and use them in an ensemble model, in order to complement the information that is extracted from language models. We consider perceptual properties to be more concrete than abstract properties (e.g., interesting, flawless). We propose to use the adjectives' concreteness score as a lever to calibrate the contribution of each source (text vs. images). We evaluate our ensemble model in a ranking task where the actual properties of a noun need to be ranked higher than other non-relevant properties. Our results show that the proposed combination of text and images greatly improves noun property prediction compared to powerful text-based language models.


Introduction
Common properties of concepts or entities (e.g., "These strawberries are red") are rarely explicitly stated in texts, contrary to more specific properties which bring new information to the communication (e.g., "These strawberries are delicious"). This phenomenon, known as "reporting bias" (Gordon and Van Durme, 2013; Shwartz and Choi, 2020), makes it difficult to learn, or retrieve, perceptual properties from text. However, noun property identification is an important task which may allow AI applications to perform commonsense reasoning in a way that matches people's psychological or cognitive predispositions, and can improve agent communication (Lazaridou et al., 2016). Furthermore, identifying noun properties can contribute to better modeling concepts and entities, learning affordances (i.e., defining the possible uses of an object based on its qualities or properties), and understanding models' knowledge about the world. Models that combine different modalities provide a sort of grounding which helps to alleviate the reporting bias problem (Kiela et al., 2014; Lazaridou et al., 2015; Zhang et al., 2022). For example, multimodal models are better at predicting color attributes compared to text-based language models (Paik et al., 2021; Norlund et al., 2021). Furthermore, visual representations of concrete objects improve performance in downstream NLP tasks (Hewitt et al., 2018). Inspired by this line of work, we expect concrete visual properties of nouns to be more accessible through images, and text-based language models to better encode abstract semantic properties. We propose an ensemble model which combines information from these two sources for English noun property prediction.
We frame property identification as a ranking task, where relevant properties for a noun need to be retrieved from a set of candidate properties found in association norm datasets (McRae et al., 2005; Devereux et al., 2014; Norlund et al., 2021). We experiment with text-based language models (Devlin et al., 2019; Radford et al., 2019; Liu et al., 2019) and with CLIP (Radford et al., 2021), which we query using a slot filling task, as shown in Figures 1(a) and (b). Our ensemble model (Figure 1(c)) combines the strengths of the language and vision models by specifically privileging the former or the latter type of representation depending on the concreteness of the processed properties (Brysbaert et al., 2014). Given that concrete properties are characterized by a higher degree of imageability (Friendly et al., 1982), our model trusts the visual model for perceptual and highly concrete properties (e.g., color adjectives: red, green), and the language model for abstract properties (e.g., free, infinite). Our results confirm that CLIP can identify nouns' perceptual properties better than language models, which contain higher-quality information about abstract properties. Our ensemble model, which combines the two sources of knowledge, outperforms the individual models on the property ranking task by a significant margin.

Related Work
Probing has been widely used in previous work for exploring the semantic knowledge that is encoded in language models. A common approach has been to convert the facts, properties, and relations found in external knowledge sources into "fill-in-the-blank" cloze statements, and to use them to query language models. Apidianaki and Garí Soler (2021) do so for nouns' semantic properties and highlight how challenging it is to retrieve this kind of information from BERT representations (Devlin et al., 2019). Furthermore, slightly different prompts tend to retrieve different semantic information (Ettinger, 2020), compromising the robustness of semantic probing tasks. We propose to mitigate these problems by also relying on images.
Features extracted from different modalities can complement the information found in texts. Multimodal distributional models, for example, have been shown to outperform text-based approaches on semantic benchmarks (Silberer et al., 2013; Bruni et al., 2014; Lazaridou et al., 2015). Similarly, ensemble models that integrate multimodal and text-based models outperform models that only rely on one modality in tasks such as visual question answering (Tsimpoukelli et al., 2021; Alayrac et al., 2022; Yang et al., 2021b), visual entailment (Song et al., 2022), reading comprehension, natural language inference (Zhang et al., 2021; Kiros et al., 2018), text generation (Su et al., 2022), word sense disambiguation (Barnard and Johnson, 2005), and video retrieval (Yang et al., 2021a). We extend this investigation to noun property prediction.
We propose a novel noun property retrieval model which combines information from language and vision models, and tunes their respective contributions based on property concreteness (Brysbaert et al., 2014). Concreteness is a graded notion that strongly correlates with the degree of imageability (Friendly et al., 1982; Byrne, 1974); concrete words generally tend to refer to tangible objects that the senses can easily perceive (Paivio et al., 1968). We extend this idea to noun properties and hypothesize that vision models have better knowledge of perceptual, and more concrete, properties (e.g., red, flat, round) than text-based language models, which better capture abstract properties (e.g., free, inspiring, promising). We evaluate our ensemble model using concreteness scores automatically predicted by a regression model (Charbonnier and Wartena, 2019). We compare these results to the performance of the ensemble model with manual (gold) concreteness ratings (Brysbaert et al., 2014). In previous work, concreteness was measured based on the idea that abstract concepts relate to varied and composite situations (Barsalou and Wiemer-Hastings, 2005). Consequently, visually grounded representations of abstract concepts (e.g., freedom) should be more complex and diverse than those of concrete words (e.g., dog) (Lazaridou et al., 2015; Kiela et al., 2014). Lazaridou et al. (2015) specifically measure the entropy of the vectors induced by multimodal models, which serves as an expression of how varied the information they encode is. They demonstrate that the entropy of multimodal vectors strongly correlates with the degree of abstractness of words.

Task Formulation
Given a noun N and a set of candidate properties P, a model needs to select the properties P_N ⊆ P that apply to N. The candidate properties are the set of all adjectives retained from a resource (cf. Section 3.2), which characterize different nouns. A model needs to rank properties that apply to N higher than properties that apply to other nouns in the resource. We consider that a property correctly characterizes a noun if this property has been proposed for that noun by the annotators.

FEATURE NORMS: The McRae et al. (2005) dataset contains feature norms for 541 objects annotated by 725 participants. We follow Apidianaki and Garí Soler (2021) and only use the IS_ADJ features of noun concepts, where the adjective describes a noun property. In total, there are 509 noun concepts with at least one IS_ADJ feature, and 209 unique properties. The FEATURE NORMS dataset contains both perceptual properties (e.g., tall, fluffy) and non-perceptual ones (e.g., intelligent, expensive).

MEMORY COLORS:
The dataset (Norlund et al., 2021) contains 109 nouns, each associated with an image and its corresponding prototypical color. There are 11 colors in total. The data were scraped from existing knowledge bases on the web.
CONCEPT PROPERTIES: This dataset was created at the Centre for Speech, Language and the Brain (Devereux et al., 2014). It contains concept property norm annotations collected from 30 participants. The data comprise 601 nouns with 400 unique properties. We keep aside 50 nouns (which are not in FEATURE NORMS or MEMORY COLORS) as our development set (dev), which we use for prompt selection and hyper-parameter tuning. We call the rest of the dataset CONCEPT PROPERTIES-test and use it for evaluation.

CONCRETENESS DATASET: The Brysbaert et al. (2014) dataset contains manual concreteness ratings for 37,058 English word lemmas and 2,896 two-word expressions, gathered through crowdsourcing. The original concreteness scores range from 0 to 5; we map them to [0, 1] by dividing each score by 5.
Our property ranking setup allows us to consider multi-piece adjectives (properties) which were excluded from open-vocabulary masking experiments (Petroni et al., 2019; Bouraoui et al., 2020; Apidianaki and Garí Soler, 2021). Since the candidate properties are known, we can obtain a score for a property composed of k pieces (P = (w_t, ..., w_{t+k}), k ≥ 1) by taking the average of the scores assigned by the LM to each piece. We report the results in Appendix E.4 and show that our model is better than other models at retrieving multi-piece properties.
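As a minimal sketch of the averaging step (the helper name is ours; in practice each piece's log-probability would come from a masked-LM query):

```python
import math

def score_lm_multi_piece(piece_log_probs):
    """Score a candidate property made of k word pieces by averaging the
    log-probabilities the LM assigns to each piece (hypothetical helper:
    one masked-LM query per piece)."""
    return sum(piece_log_probs) / len(piece_log_probs)

# A two-piece property such as "waterproof" ("water", "##proof") gets the
# mean of its two piece scores; a single-piece property keeps its own score.
score = score_lm_multi_piece([math.log(0.2), math.log(0.05)])
```

Averaging (rather than summing) keeps scores comparable between single-piece and multi-piece candidates during ranking.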

Multimodal Language Models (MLMs)
Vision encoder-decoder MLMs are language models conditioned on modalities other than text, such as images. For each noun N in our datasets, we collect a set of images I from the web. We probe an MLM similarly to LMs, using the same set of prompts. An MLM yields a score for each property given an image i ∈ I using Formula 3.
Score_MLM(P, i) = log P_MLM(w_t = P | W_\t, i),   (3)

where P_MLM(·) is the probability assigned by the multimodal language model. In addition to the context W_\t, the MLM conditions on the image i. We then aggregate over all the images I collected for the noun N to obtain the score for the property:

Score_MLM(P) = (1/|I|) Σ_{i ∈ I} Score_MLM(P, i).   (4)

ViLT: We experiment with the Transformer-based (Vaswani et al., 2017) VILT model (Kim et al., 2021) as an MLM. VILT uses the same tokenizer as BERT and is pretrained on the Google Conceptual Captions (GCC) dataset, which contains more than 3 million image-caption pairs covering about 50k words (Sharma et al., 2018). Most other vision-language datasets contain a significantly smaller vocabulary (around 10k words). In addition, VILT requires minimal image pre-processing and is an open visual vocabulary model. This contrasts with other multimodal architectures which require visual predictions before passing the images on to the multimodal layers (Li et al., 2019; Lu et al., 2019; Tan and Bansal, 2019). These have been shown to only marginally surpass text-only models (Yun et al., 2021).
CLIP: We also use the CLIP model, which is pretrained on 400M image-caption pairs (Radford et al., 2021). CLIP is trained to align the embedding spaces learned from images and text, using a contrastive loss as learning objective. The model integrates a text encoder f_T and a visual encoder f_V which separately encode the text and image into vectors of the same dimension. Given a batch of image-text pairs, CLIP maximizes the cosine similarity for matched pairs while minimizing it for unmatched pairs.

[Figure 2: Most and least relevant properties retrieved for peacock (top-1: showy, bottom-1: kneaded) and sunflower (top-1: yellow, bottom-1: tartan).]
We use CLIP to compute the cosine similarity of an image i ∈ I and the text prompt s_P: "An object with the property of [MASK]", where the [MASK] token is replaced with a candidate property P ∈ P. The score for each property P is the mean similarity between the sentence prompt s_P and all images I collected for a noun:

Score_CLIP(P) = (1/|I|) Σ_{i ∈ I} cos(f_T(s_P), f_V(i)).   (5)

This score serves to rank the candidate properties according to their relevance for a specific noun.
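A minimal sketch of this scoring scheme, assuming the prompt and image encodings are already available as vectors (a real implementation would obtain them from CLIP's f_T and f_V encoders; the function names below are ours):

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def score_clip(prompt_vec, image_vecs):
    """Mean cosine similarity between the encoded prompt s_P and the
    encoded images collected for a noun (one vector per image)."""
    return float(np.mean([cosine(prompt_vec, v) for v in image_vecs]))

def rank_properties(prompt_vecs, image_vecs):
    """Rank candidate properties by descending mean prompt-image similarity.
    `prompt_vecs` maps each property to the encoding of its filled prompt."""
    scores = {p: score_clip(v, image_vecs) for p, v in prompt_vecs.items()}
    return sorted(scores, key=scores.get, reverse=True)
```

With toy vectors, a property whose prompt encoding points in the same direction as the noun's images is ranked first, mirroring the mean-similarity scoring described above.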
Figure 2 shows the most and least relevant properties for the nouns peacock and sunflower.

Concreteness Ensemble Model (CEM)
[Table 2: The prompt template selected for each model.]

The concreteness score for a property guides CEM towards "trusting" the language or the vision model more. We propose two CEM flavors: CEM-PRED and CEM-GOLD. CEM-PRED uses the score (c_P ∈ [0, 1]) proposed by our concreteness prediction model for every candidate property P ∈ P, while CEM-GOLD uses the score assigned to P in the Brysbaert et al. (2014) dataset. If there is no gold score for a property, we use the score of the word with the longest matching subsequence in the dataset. The idea behind this heuristic is that properties without ground truth concreteness scores often have inflected forms or derivations in the dataset (e.g., sharpened/sharpen, invented/invention). We also experimented with GLOVE word embedding cosine similarity, which resulted in suboptimal performance (cf. Section 4). Additionally, sequence matching is much faster than GLOVE similarity (cf. Appendix B).
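One possible implementation of this fallback, using Python's difflib to pick the lexicon entry with the longest matching block (the function name and toy lexicon are illustrative):

```python
from difflib import SequenceMatcher

def concreteness(word, gold_scores):
    """Look up the gold concreteness score for `word`; if it is missing,
    back off to the lexicon entry sharing the longest matching subsequence
    (here approximated with difflib's longest matching block).
    `gold_scores` is a {word: score} dict standing in for the lexicon."""
    if word in gold_scores:
        return gold_scores[word]

    def overlap(candidate):
        matcher = SequenceMatcher(None, word, candidate)
        return matcher.find_longest_match(0, len(word), 0, len(candidate)).size

    return gold_scores[max(gold_scores, key=overlap)]
```

For instance, a missing entry like "sharpened" falls back to "sharpen", and "invented" to "invention", matching the inflection/derivation intuition behind the heuristic.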
Both CEMs combine the rank of P proposed by the language model (Rank_LM) and by CLIP (Rank_CLIP) through a weighted sum which is controlled by the concreteness score c_P:

Score_CEM(P) = c_P · Rank_CLIP(P) + (1 − c_P) · Rank_LM(P).   (6)

The candidates are then re-ranked by this combined value: the more concrete a property, the more weight is given to CLIP's ranking.
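A sketch of the concreteness-weighted combination, assuming rank 1 denotes the most relevant property so that lower combined values win (argument names are ours):

```python
def cem_rank(rank_lm, rank_clip, concreteness):
    """Re-rank candidate properties with the concreteness-weighted sum of
    the LM and CLIP ranks (rank 1 = most relevant, so lower combined
    values win). All three arguments map each property to a number."""
    combined = {p: concreteness[p] * rank_clip[p]
                   + (1 - concreteness[p]) * rank_lm[p]
                for p in rank_lm}
    return sorted(combined, key=combined.get)
```

A concrete property with a good CLIP rank therefore rises to the top even if the LM ranks it poorly, and vice versa for abstract properties.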

Concreteness Prediction Model
We generate concreteness scores using the model of Charbonnier and Wartena (2019) with FastText embeddings (Bojanowski et al., 2017).The model leverages part-of-speech and suffix features to predict concreteness in a classical regression setting.We train the model on the 40k concreteness dataset (Brysbaert et al., 2014), excluding the 425 adjectives found in our test sets.The model obtains a high Spearman ρ correlation of 0.76 with the ground truth scores of the adjectives in our test sets.This result shows that automatically predicted scores are a viable alternative which allows the application of the method to new data and domains where hand-crafted resources might be unavailable.
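As an illustration only, a heavily simplified least-squares stand-in for such a regression (the actual model uses FastText embeddings with POS and suffix features; the one-dimensional toy features below are invented):

```python
import numpy as np

def fit_concreteness(features, scores):
    """Least-squares regression mapping word features to concreteness in
    [0, 1] (a simplified stand-in for the Charbonnier & Wartena model)."""
    X = np.hstack([features, np.ones((len(features), 1))])  # add bias term
    w, *_ = np.linalg.lstsq(X, scores, rcond=None)
    return lambda f: float(np.clip(np.append(f, 1.0) @ w, 0.0, 1.0))

# Toy training data: one feature dimension, scores already scaled to [0, 1].
predict = fit_concreteness(np.array([[0.0], [1.0], [2.0]]),
                           np.array([0.1, 0.4, 0.7]))
```

The clipping step keeps predictions inside the [0, 1] range expected by the ensemble weights.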

Baselines
We compare the predictions of the language, vision, and ensemble models to the predictions of three baseline methods.
RANDOM: Generates a RANDOM property ranking for each noun.
GLOVE: Ranking based on the cosine similarity of the GLOVE embeddings (Pennington et al., 2014) of the noun and the property.
GOOGLE NGRAM: Ranking by the bigram frequency of each noun-property pair in Google Ngrams (Brants and Franz, 2009). If a noun-property pair does not appear in the corpus, we assign it a frequency of 0.
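The frequency baseline can be sketched as follows, with a toy dictionary standing in for Google Ngrams lookups:

```python
def ngram_rank(noun, properties, bigram_freq):
    """Rank candidate properties for a noun by the frequency of the
    "property noun" bigram, with unseen pairs assigned frequency 0.
    `bigram_freq` is a {(property, noun): count} dict standing in for
    Google Ngrams lookups."""
    return sorted(properties,
                  key=lambda p: bigram_freq.get((p, noun), 0),
                  reverse=True)
```

Unseen pairs all receive frequency 0, so their relative order among themselves is arbitrary, which is one reason this baseline degrades on datasets with many unattested pairs.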

Evaluation Metrics
We evaluate the property ranking proposed by each model using top-K Accuracy (A@K), top-K Recall (R@K), and Mean Reciprocal Rank (MRR). A@K is defined as the percentage of nouns for which at least one ground truth property is among the top-K predictions (Ettinger, 2020). R@K is the proportion of ground truth properties retrieved in the top-K predictions; we report the average R@K across all nouns in a test set. MRR is the ground truth properties' average reciprocal rank (i.e., the inverse of the rank, 1/rank). For all three metrics, higher scores are better.
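Per-noun versions of the three metrics can be sketched as follows (the function names are ours; dataset-level scores average these values over all nouns):

```python
def accuracy_at_k(ranked, gold, k):
    """A@K for one noun: 1 if at least one gold property is in the top-K."""
    return int(any(p in gold for p in ranked[:k]))

def recall_at_k(ranked, gold, k):
    """R@K for one noun: fraction of gold properties among the top-K."""
    return sum(p in gold for p in ranked[:k]) / len(gold)

def mrr(ranked, gold):
    """Average reciprocal rank (1/rank, 1-indexed) of the gold properties.
    Assumes every gold property appears somewhere in the ranking."""
    return sum(1 / (ranked.index(p) + 1) for p in gold) / len(gold)
```

For example, with the ranking [red, big, sweet, round] and gold properties {red, round}, A@1 = 1, R@2 = 0.5, and MRR = (1/1 + 1/4) / 2 = 0.625.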

Implementation Details
Prompt Selection. We evaluate the performance of BERT-LARGE, ROBERTA-LARGE, GPT-2-LARGE, and VILT on the dev set (cf. Section 3.2) using the prompt templates proposed by Apidianaki and Garí Soler (2021). For CLIP, we handcraft a set of prompts close to the format recommended in the original paper (Radford et al., 2021) and evaluate their performance on the dev set. For each model, we choose the prompt that yields the highest MRR on the dev set and use it in all our experiments (cf. Appendix A for details). Table 2 lists the prompt template selected for each model.

Image Collection
We collect images for the nouns in our datasets using the Bing Image Search API, an image query interface widely used for research purposes (Kiela et al., 2016; Mostafazadeh et al., 2016). We again use the dev set to determine the number of images needed per noun, and find that good performance can be achieved with only ten images (cf. Figure 7 in Appendix C.1); adding more images increases the computation needed without significantly improving performance. We therefore set the number of images per noun to ten for all vision models and experiments.

[Table 3: Results of all models on FEATURE NORMS and CONCEPT PROPERTIES-test (A@1, A@5, R@5, R@10, MRR) and on MEMORY COLORS (A@1, A@2, A@3).]

CEM-PRED outperforms CEM-GOLD on the CONCEPT PROPERTIES-test dataset. This may be due to the fact that 49 properties in this dataset do not have ground truth concreteness scores (vs. only 15 properties in FEATURE NORMS), indicating that the prediction model probably approximates concreteness better in these cases, contributing to higher scores for CEM-PRED.
As explained in Section 3.3.3, we explore two different heuristics for selecting the score of these properties for CEM-GOLD: longest matching subsequence and GLOVE cosine similarity. The latter results in a drop in performance on FEATURE NORMS and almost identical performance on CONCEPT PROPERTIES-test. We notice that the GOOGLE NGRAM baseline performs well on FEATURE NORMS, with results on par with, or superior to, big LMs. The somewhat lower results obtained on CONCEPT PROPERTIES-test might be due to the higher number of properties in this dataset (cf. Table 1), which makes the ranking task more challenging (the mean number of properties per noun is 6.6 in CONCEPT PROPERTIES vs. 3.1 in FEATURE NORMS). There is also a higher number of noun-property pairs that are not found in Google Bigrams and are assigned a zero score (26% of the pairs in CONCEPT PROPERTIES vs. 15% for FEATURE NORMS).

The MEMORY COLORS dataset associates each noun with a single color, so we only report Accuracy at top-K (last three columns of Table 3). We can compare these scores to a previous baseline, the top-1 Accuracy of 78.5 reported by Norlund et al. (2021) for the CLIP-BERT model. (We cannot calculate the other metrics because CLIP-BERT has not been made available. In this model, a CLIP-encoded image is appended to BERT's tokenized input before fine-tuning with a masked language modeling objective on 4.7M captions paired with 2.9M images; for more details refer to Norlund et al. (2021).) CEM-PRED and CEM-GOLD both do better on this dataset (88.1). GPT-3 gets much higher scores than the other three language models on this task, with a top-1 Accuracy of 74.3, but is outperformed by CLIP and CEM. Note that MRR does not apply to GPT-3 since it generates properties instead of reranking them (cf. Appendix A.3).

The multimodal model with the lowest performance, VILT, is as good as GPT-3. CLIP falls halfway between VILT and CEM-PRED/GOLD. CEM-PRED and CEM-GOLD present a clear advantage compared to language and multimodal models, achieving a top-1 Accuracy of 88.1. Although ROBERTA gets very low Accuracy on MEMORY COLORS, it does not hurt performance when combined with CLIP in our CEM-GOLD model. This is because the color properties in this dataset have high concreteness scores (0.82 on average), so CEM-GOLD relies mainly on CLIP, which works very well in this setting. CEM-GOLD makes the same top-1 predictions as CLIP for 95 nouns (out of 109), while only 50 nouns are assigned the same color by CEM-GOLD and ROBERTA.

Additional Analysis
Concreteness level. We examine the performance of each model for properties at different concreteness levels. From the properties available for a noun in FEATURE NORMS, we keep a single property as our ground truth for this experiment: (a) most concrete: the property with the highest concreteness score in the Brysbaert et al. (2014) lexicon; (b) least concrete: the property with the lowest concreteness score; (c) random: a randomly selected property. Figure 3 shows the top-1 Accuracy of the models for the properties in each concreteness band. Examples of nouns with their most and least concrete properties are given in Table 4. The results of this experiment confirm our initial assumption that MLMs (e.g., CLIP and VILT) are better at capturing concrete properties, and LMs (e.g., ROBERTA and GPT-2) are better at identifying abstract ones. GPT-3 is the only LM that performs better for concrete than for abstract properties, while still falling behind the CEM variations.
Rank Improvement. We investigate the relationship between the performance of CEM and the concreteness score of the properties in CONCEPT PROPERTIES-test. We measure the rank improvement (RI) obtained for a property P when using CEM instead of ROBERTA as follows:

RI(P) = Rank_RoBERTa(P) − Rank_CEM(P),   (7)

so that a high RI score for P means its rank improves with CEM compared to ROBERTA. We calculate the RI for properties at different concreteness levels: we sort the 400 properties in CONCEPT PROPERTIES-test by increasing concreteness score and group them into ten bins of 40 properties each. We find a clear positive relationship between the average RI and the concreteness scores within each bin, as shown in Figure 4. This confirms that both CEM-PRED and CEM-GOLD perform better with concrete properties.

Fixed-weight ensembles. We also run the best performing ROBERTA + CLIP combination with fixed interpolation weights, i.e., without recourse to the properties' concreteness scores as in CEM-PRED and CEM-GOLD, and ensemble ROBERTA with other models in the same way. Note that we do not expect the combination of two text-based LMs to improve Accuracy much compared to ROBERTA alone. Our intuition is confirmed by the results obtained on FEATURE NORMS, shown in Figure 5.
The dashed and dotted straight lines in the figure represent the top-1 Accuracy of CEM-GOLD and CEM-PRED, respectively, when the weights used are not the ones on the x-axis but the gold and predicted concreteness scores (cf. Equation 6). To further highlight the importance of concreteness in interpolating the models, we provide additional results and comparisons in Appendix D.2. Note that CEM-GOLD and CEM-PRED have highly similar performance and output: on average over all nouns, they propose 4.35 identical top-5 properties for nouns in FEATURE NORMS, and 4.41 for nouns in CONCEPT PROPERTIES-test.
We observe a slight improvement in top-1 Accuracy (5%) when ensembling two text-based LMs (ROBERTA + BERT or ROBERTA + GPT-2). Text-based LMs have similar output distributions, hence combining them does not change the final distribution much. The ROBERTA + VILT ensemble achieves higher performance due to the interpolation with an image-based model, but it does not reach the Accuracy of the CEM models (ROBERTA + CLIP). VILT gets lower performance than CLIP when combined with ROBERTA because it was exposed to much less data during training (about 3M image-caption pairs vs. CLIP's 400M). Finally, we notice that the best performance of ROBERTA + CLIP with a fixed weight is slightly lower than that of the CEM models. This indicates that ensembling two models with a fixed weight hurts performance compared to calibrating their mutual contribution using the concreteness score. Another advantage of the concreteness score is that it is more transferable, since it does not require tuning on new datasets.
Properties Quality. Table 5 shows a random sample of the top-3 predictions made by each model for nouns in CONCEPT PROPERTIES-test. We notice that the properties proposed by the two flavors of CEM are both perceptual and abstract, due to their access to both a language and a vision model. We further observe that CEM retrieves rarer and more varied properties for different nouns compared to the language models (details on the frequency of the properties retrieved by each model are reported in Appendix E.1, and more randomly sampled qualitative examples in Appendix E.5). Figure 6 shows the number of nouns for which a model made the exact same top-3 predictions (refer to Appendix E.3 for other values of K). For example, GPT-3 proposed the properties [tart, acidic, sweet, juicy, smooth], in the same order, for 20 different nouns (apple, plum, grapefruit, tangerine, orange, lime, lemon, grape, rhubarb, cherry, cap, cape, blueberry, strawberry, pine, pineapple, prune, raspberry, nectarine, cranberry). Note that better prompt engineering might decrease the number of repeated properties; however, we are already prompting GPT-3 with one shot, whereas the other models, including CEM, are zero-shot. ROBERTA predicted [male, healthy, white, black, small] for both mittens and penguin, and [male, black, white, brown, healthy] for owl and flamingo. We observe that CEM-PRED and CEM-GOLD are less likely than the language models to retrieve the same top-K predictions for different nouns. CEM combines the variability and accuracy of CLIP with the benefits of text-based models, which are exposed to large volumes of text during pre-training.


Conclusion
We propose a new ensemble model for noun property prediction which leverages the strengths of language models and multimodal (vision) models. Our model, CEM, calibrates the contribution of the two types of models in a property ranking task by relying on the properties' concreteness level. The results show that CEM, which combines ROBERTA and CLIP, outperforms powerful text-based language models (such as GPT-3) by significant margins on three evaluation datasets. Additionally, our methodology yields better performance than alternative ensembling techniques, confirming our hypothesis that concrete properties are more accessible through images, and abstract properties through text. The Accuracy scores obtained on the larger datasets show that there is still room for improvement on this challenging task.

Limitations
Our experiments address concreteness at the lexical level, specifically using scores assigned to adjectives in an external resource (Brysbaert et al., 2014) or predicted with the model of Charbonnier and Wartena (2019). Another option would be to use the concreteness of the noun phrases formed by the adjectives and the nouns they modify. We would expect this to differ from the concreteness of the adjectives in isolation, since the concreteness of the nouns would have an impact on that of the resulting phrase (e.g., useful knife vs. useful idea). We were not able to evaluate the impact of noun phrase concreteness on property prediction because the property datasets used in our experiments mostly contain concrete nouns. Another limitation of our methodology is the reliance on pairing images with nouns: we use a search engine to retrieve images corresponding to nouns in order to get grounded predictions from the vision model. Finally, we only evaluate our methodology in English and leave experimenting with other languages to future work, since this would require the collection of multilingual semantic association datasets and/or the translation of existing ones. We did not pursue this extension for this paper as multilingual CLIP model weights only became available very recently.
19-2-0201), the IARPA BETTER Program (contract 2019-19051600004), and the NSF (Award 1928631).Approved for Public Release, Distribution Unlimited.The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of DARPA, IARPA, NSF, or the U.S. Government.

A.1 Language Model Prompts
In our experiments with language models, we use the 11 prompts proposed by Apidianaki and Garí Soler (2021) for retrieving noun properties. As shown in Table 6, these involve nouns in singular and plural forms. The performance achieved by each language model with these prompts on the CONCEPT PROPERTIES development set is given in Table 8. The results show that model performance varies significantly with different prompts, and the best-performing prompt differs across models. For BERT and GPT-2, "most + PLURAL" obtains the highest Recall and MRR scores; the best-performing prompt is "SINGULAR + generally" for ROBERTA-LARGE, and "PLURAL" for VILT.

A.2 CLIP Prompts
For CLIP, we handcraft ten prompts and report their performance on the CONCEPT PROPERTIES development set in Table 7. Similar to what we observed with language models, CLIP's performance is also sensitive to the prompt used. We select for our experiments the prompt "An object with the property of [MASK].", which obtains the highest average Accuracy and MRR scores on the CONCEPT PROPERTIES development set.

A.3 GPT-3 Prompts
Since we do not have complete control over GPT-3, we treat it as a question-answering model, using the following prompt in a one-shot setting:

Use ten adjectives to describe the properties of kiwi:
1. tart
2. acidic
3. sweet
4. juicy
5. smooth
6. fuzzy
7. green
8. brown
9. small
10. round
Use ten adjectives to describe the properties of [NOUN]:

We use the text-davinci-001 engine of GPT-3, which costs $0.06 per 1,000 tokens. On average, it costs $0.007 to generate 10 properties for each noun.
[Table: CEM-GOLD results with different heuristics (Acc@1, R@5, R@10, MRR on FEATURE NORMS and CONCEPT PROPERTIES-test; Acc@1, Acc@3, Acc@5 on MEMORY COLORS).]

We train the concreteness prediction model of Charbonnier and Wartena (2019) using the concreteness scores of the 40k words (all parts of speech) in the Brysbaert et al. (2014) dataset, excluding the 425 adjectives found in the FEATURE NORMS, CONCEPT PROPERTIES, and MEMORY COLORS datasets. The concreteness prediction model uses FastText embeddings (Mikolov et al., 2018) enhanced with POS and suffix features. We evaluate the model on the 425 adjectives that were left out during training and for which we have ground truth scores. Comparing generation and ensembling methods on FEATURE NORMS, CEM achieves the highest performance across all metrics, indicating that concreteness offers a reliable criterion for model ensembling in unsupervised scenarios.

E.1 Unigram Prediction Frequency
In Table 13, we report the mean Google unigram frequency (Brants and Franz, 2009) for all properties in the top-5 predictions of each model. We observe that our CEM model, which achieves the best performance among the tested models (as shown in Table 3), often predicts medium-frequency words. This is a desirable property of our model compared to models which instead predict highly frequent or rare words (highly specific or technical terms). The latter is the case for GPT-3 and CLIP, which propose rarer attributes but obtain lower performance than CEM. It is worth noting that, contrary to CLIP, GPT-3 retrieves properties from an open vocabulary.
Given that Google Ngrams frequencies are computed from text, many common properties might not be reported. For example, FEATURE NORMS proposes loud, white, fast, red, large, and orange as typical attributes of an "ambulance". The frequencies of the corresponding property-noun bigrams (e.g., loud ambulance, white ambulance) are 0, 687, 50, 193, 283, and 0, while bigrams formed with less typical properties (old, efficient, modern, and independent) have higher frequencies (1725, 294, 314, and 457). While language models rely on text and thus suffer from reporting bias, vision-based models can retrieve properties that are rarely stated in text.

E.2 Prototypical Property Retrieval
We carry out an additional experiment aimed at estimating the performance of the models on prototypical vs. non-prototypical properties. Prototypical properties are the ones that apply to most of the objects in the class denoted by the noun (e.g., red strawberries); in contrast, non-prototypical properties describe attributes of a smaller subset of the objects denoted by the noun (e.g., delicious strawberries). We make the assumption that prototypical properties are common and often visual or perceptual; we expect them to be more rarely stated in texts and, hence, harder to retrieve using language models than using images.
We use the split of the FEATURE NORMS dataset performed by Apidianaki and Garí Soler (2021) into prototypical and non-prototypical properties, based on the quantifier annotations found in the Herbelot and Vecchi (2015) dataset. The first split (Prototypical) contains 785 prototypical adjective-noun pairs (for 386 nouns) annotated with at least two ALL labels, or with a combination of ALL and MOST (healthy banana → [ALL-ALL-ALL]). The second set (Non-Prototypical) contains 807 adjective-noun pairs (for 509 nouns) with adjectives in the ground truth that are not included in the Prototypical set. In Table 11, we report the performance of each model in retrieving these properties.
In the ALL, MOST column, we consider properties that have at least two ALL annotations, possibly in combination with a MOST annotation; in the SOME column, we consider all properties that contain no NO or FEW annotations and have at least one SOME annotation. The results confirm our intuition that non-prototypical properties are more frequently mentioned in text, as reflected in the score of the GOOGLE NGRAM baseline for these properties. For prototypical properties, our CEM model outperforms all other models.
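The annotation-based filtering criteria described above can be sketched as follows. This is a sketch of the criteria as stated; the exact procedure used by Apidianaki and Garí Soler (2021) may differ, and the function names are hypothetical.

```python
def is_prototypical(labels):
    """Prototypical criterion (sketch): at least two ALL annotations, and
    any remaining labels are ALL or MOST, e.g. ['ALL', 'ALL', 'ALL']."""
    return labels.count("ALL") >= 2 and all(l in ("ALL", "MOST") for l in labels)

def has_some_reading(labels):
    """SOME-column criterion (sketch): no NO or FEW annotations, and at
    least one SOME annotation."""
    return "NO" not in labels and "FEW" not in labels and "SOME" in labels
```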

E.3 Same Top-K Predictions by Different Nouns
Figure 8 shows the number of nouns in the FEATURE NORMS and CONCEPT PROPERTIES-test datasets for which a model made the exact same top-K predictions. We observe that LMs consistently repeat the same properties for different nouns, while MLMs exhibit higher variation in their predictions.
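Counting identical ordered top-K prediction lists across nouns can be sketched with a simple tally over prediction tuples; the toy data below is illustrative, not from the paper.

```python
from collections import Counter

def repeated_prediction_counts(predictions_by_noun, k=3):
    """Count how many nouns share the exact same ordered top-k predictions.

    predictions_by_noun maps each noun to its ranked property list;
    the result is a Counter keyed by the top-k tuple."""
    return Counter(tuple(props[:k]) for props in predictions_by_noun.values())

preds = {
    "strawberry": ["red", "sweet", "small"],
    "ant": ["red", "sweet", "small"],   # the model repeats itself
    "car": ["fast", "red", "large"],
}
counts = repeated_prediction_counts(preds, k=3)
```

A model that repeats itself, as the LMs do, yields large counts for a few tuples; a model with varied predictions yields mostly counts of one.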

E.4 Multi-piece Performance
Each model splits words into a different number of word pieces. Table 14 shows the number of multi-piece properties for each model, and its performance on these properties. We observe that all models perform worse than average on multi-piece properties (refer to Table 3 for the average performance); however, CEM shows the smallest drop in performance relative to its average. This could be because CEM relies on information from two models with different tokenizers.
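Partitioning properties into single-piece and multi-piece sets under a given subword tokenizer can be sketched as below. The toy tokenizer and vocabulary are hypothetical stand-ins for a real tokenizer (e.g., a Hugging Face WordPiece or BPE tokenizer).

```python
def split_multi_piece(properties, tokenize):
    """Partition properties by whether the tokenizer splits them into
    one word piece or several (`tokenize` returns the list of pieces)."""
    single, multi = [], []
    for prop in properties:
        (single if len(tokenize(prop)) == 1 else multi).append(prop)
    return single, multi

# Toy tokenizer standing in for a real subword vocabulary (hypothetical).
VOCAB = {"red", "fast", "small"}
def toy_tokenize(word):
    return [word] if word in VOCAB else [word[:3], "##" + word[3:]]

single, multi = split_multi_piece(["red", "delicious"], toy_tokenize)
```

Because each model's tokenizer differs, the multi-piece set differs per model, which is why Table 14 reports the counts separately.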

Figure 1 :
Figure 1: Our task is to retrieve relevant properties of nouns from a set of candidates. We tackle the task using (a) cloze-task probing; (b) CLIP, to compute the similarity between the properties and images of the noun; and (c) a Concreteness Ensemble Model (CEM), which ensembles the language-model and CLIP predictions and relies on the properties' concreteness ratings.
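The concreteness-based ensembling in (c) can be sketched as a weighted interpolation between the two sources. The interpolation form and the normalization of ratings to [0, 1] (assuming a 1–5 scale, as in common concreteness norms) are assumptions for illustration, not the paper's exact formulation.

```python
def cem_score(lm_score, clip_score, concreteness, max_conc=5.0):
    """Concreteness Ensemble Model (sketch): combine language-model and
    CLIP scores for a candidate property, weighting the visual (CLIP)
    score more heavily for more concrete properties."""
    w = concreteness / max_conc          # weight in (0, 1]
    return w * clip_score + (1.0 - w) * lm_score

# A concrete property ("red") leans on CLIP; an abstract one
# ("interesting") leans on the language model.
concrete = cem_score(lm_score=0.2, clip_score=0.9, concreteness=4.5)
abstract = cem_score(lm_score=0.8, clip_score=0.1, concreteness=1.5)
```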

FEATURE NORMS: The McRae et al. feature norms dataset.

Figure 2 :
Figure 2: Examples of Top-1 and Bottom-1 prompts ranked by CLIP.

Figure 3 :
Figure 3: Top-1 Accuracy for the FEATURE NORMS properties filtered by concreteness. The average concreteness score for each band is given on the x-axis. The error bars in the "random" category represent the standard deviation over 10 trials.

Figure 4 :
Figure 4: The average Rank Improvement (RI) score for properties in CONCEPT PROPERTIES-test, grouped into ten bins according to their concreteness. The higher the concreteness score of the properties in a bin, the larger the improvement brought by CEM-GOLD and CEM-PRED over ROBERTA.

Figure 5 :
Figure 5: Top-1 Accuracy obtained by different ensemble models on FEATURE NORMS. The x-axis shows the weight used to interpolate the two models. The dashed and dotted horizontal lines mark the Top-1 Accuracy of CEM-GOLD (40.1) and CEM-PRED (39.9), respectively.

Figure 6 :
Figure 6: Number of nouns in FEATURE NORMS and CONCEPT PROPERTIES-test for which a model proposed the same top-3 properties in the same order.

Table 3 :
Results obtained on the three datasets. The best result for each metric is marked in boldface.

Table 4 :
Examples of nouns with their most and least concrete properties in FEATURE NORMS.

Model Implementation. All LMs and MLMs are built on the Hugging Face API. The CLIP model is adapted from the official repository. CEM ensembles the ROBERTA-LARGE and CLIP-ViT/L14 models. The experiments were run on a Quadro RTX 6000 (24GB). All our experiments involve zero-shot and one-shot (for GPT-3) probing; hence, no training of the models is needed. The inference time of CEM is naturally longer than that of the individual models, but it is still very fast and only takes a few minutes per dataset with pre-computed image features. For more details on runtime, refer to Section B, and specifically to Table 10, in the Appendix.

Table 5 :
Top-3 properties proposed by different models for nouns in FEATURE NORMS.

Table 6 :
Prompts used for language models.

Table 7 :
Full results of CLIP-ViT/L14 on the CONCEPT PROPERTIES development set.
Table 10 provides details about the runtime of the experiments. The second column of the table indicates whether a model uses images.
D CEM Variations

D.1 Concreteness Prediction Model

In Table 12, we report the results obtained by the CEM model using predicted concreteness values (instead of gold-standard ones). We predict these values by training the model of Charbonnier and

Table 11 :
Results obtained on the FEATURE NORMS dataset filtered by prototypical and non-prototypical properties.

Table 12 :
Comparison of ensemble methods on the three datasets. The highest score for each metric is bolded and the second-best is underlined.