LMs stand their Ground: Investigating the Effect of Embodiment in Figurative Language Interpretation by Language Models

Figurative language is a challenge for language models since its interpretation is based on the use of words in a way that deviates from their conventional order and meaning. Yet, humans can easily understand and interpret metaphors, similes or idioms as they can be derived from embodied metaphors. Language is a proxy for embodiment and if a metaphor is conventional and lexicalised, it becomes easier for a system without a body to make sense of embodied concepts. Yet, the intricate relation between embodiment and features such as concreteness or age of acquisition has not been studied in the context of figurative language interpretation concerning language models. Hence, the presented study shows how larger language models perform better at interpreting metaphoric sentences when the action of the metaphorical sentence is more embodied. The analysis rules out multicollinearity with other features (e.g. word length or concreteness) and provides initial evidence that larger language models conceptualise embodied concepts to a degree that facilitates figurative language understanding.


Introduction
Infants acquire their first conceptual building blocks by observation and manipulation in the physical world. These primary building blocks enable them to make sense of their perceptions (Mandler and Cánovas, 2014). In return, their embodiment defines the capabilities with which they can explore and understand the world. The early conceptual system is built from spatial schemas, which enables early word understanding (Mandler, 1992). These so-called Image Schemas are recurring cognitive structures shaped by physical interaction with the environment. They emerge from bodily experience and motivate subsequent conceptual metaphor mappings (Johnson, 2013). The metaphorical mapping is visible in our everyday language whenever we use figurative language. For example, to say that she dances like a turtle is to say that she dances poorly. The metaphor in this phrase is readily interpreted by humans, who would favour the interpretation of dances poorly over dances well. The turtle dance example employs a conceptual mapping in which the turtle provides the source domain for the attributes slow and rigid, which turn the dance target domain into poorly dancing. This mapping draws from the human, bodily experience of dancing and therefore enables interpretation.
For a language model (LM), the understanding of figurative language is a great challenge (Liu et al., 2022). By nature of their digital implementation as computer algorithms, LMs are non-embodied and do not ground their conceptualisation in physical interaction with the environment. Instead, LMs learn statistical features of language through deep learning on vast amounts of data (Vaswani et al., 2017). Whether these learned statistical features allow LMs to mirror or copy natural language understanding (NLU) is subject to discussion (Zhang et al., 2022a). Moreover, Tamari et al. (2020) suggest that an embodied language understanding paradigm for LMs can benefit NLU systems through grounding by metaphoric inference.
One can argue that most embodied metaphors are heavily conventional (e.g. UP IS GOOD, DOWN IS BAD, KNOWLEDGE IS LIGHT, IGNORANCE IS DARKNESS) and, as such, they are lexicalised in a language without an inherent need to understand their bodily basis. This lexicalisation should allow LMs to conceptualise and interpret them correctly and more robustly than less conventional metaphors. Conventionality relates to word frequency and age of acquisition (AoA), i.e. more frequent words and words that are acquired early in life are more conventional. We argue that embodiment has a measurable effect on the interpretation of metaphors that differs from the effect of other linguistic features. Moreover, we investigate whether an interpretation of figurative language with more embodied concepts is easier for LMs. In addition, we investigate confounding factors such as AoA, word frequency, concreteness and word length. The relation between embodiment and LMs' ability to interpret figurative language has not yet been investigated and is the key contribution of this research.
Section 2 starts with a review of language model abilities, more specifically figurative language interpretation abilities. The review identifies a suitable data set for our experiment, whose construction is described in Section 3. We use a subset of the Fig-QA data set (Liu et al., 2022), a Winograd-style figurative language understanding task, and correlate the performance of various LMs with the degree of embodiment of the metaphorical actions that the LMs are tasked to interpret. In Section 4, we find that models that reach a certain performance on our Fig-QA subset show a significant and positive correlation between the embodiment rating of the action involved in the metaphorical phrase and the model's ability to interpret the metaphor correctly. An in-depth analysis of additional features, such as AoA, word length or frequency, does not indicate multicollinearity among those features. In Section 5, we conclude that the degree of embodiment of the action within the metaphoric phrase is a predictor of the LMs' ability to correctly interpret the figurative language. Lastly, we discuss the limitations and broader implications of the work.

Language Model Abilities
The presented work investigates the zero-shot capabilities of LMs of different types and sizes. Arguably, LMs' capabilities to solve language-based tasks on which they have not been trained are an emergent property of their complexity and large-scale statistical representation of language. It is a property that makes them unsupervised multitask learners (Radford et al., 2019; Brown et al., 2020). Despite task-agnostic pre-training and a task-agnostic architecture, LMs can perform various NLP tasks without seeing a single example of the task, albeit with mixed results (Srivastava et al., 2022). This raises the question of whether language models mirror the human conceptual understanding encoded in language or whether they "only" learn statistical features from the underlying training distribution, allowing them to generalise and convincingly solve previously unseen tasks.
Several works have tried to assess to what extent LMs are capable of performing more complex NLP tasks (e.g. logical reasoning or metaphoric inference). For example, Zhang et al. (2022a) investigate the logical reasoning capabilities of BERT (Devlin et al., 2019). For this, the authors define a simplistic problem space for logical reasoning and show that BERT learns statistical features from its training distribution, but fails to generalise and drops in performance when presented with other distributions. According to the authors, this implies that BERT does not emulate a correct reasoning function in the same way that humans would conceptualise the problem. Similarly, Sanyal et al. (2022) evaluate whether the RoBERTa model (Liu et al., 2019) or the T5 model (Raffel et al., 2020) can perform logical reasoning by understanding implicit logical semantics. The authors test the models on various logical reasoning data sets whilst introducing minimal logical edits to their rule base. Consequently, Sanyal et al. (2022) show that LMs, even when fine-tuned on logical reasoning, do not sufficiently learn the semantics of some logical operators. Han et al. (2022) present a diverse data set for reasoning in natural language. An evaluation of the GPT-3 model (Brown et al., 2020) on their data set shows a performance that is only slightly better than random. This indicates that there is a fundamental gap between human reasoning and LM reasoning and their conceptualisation capabilities. Yet, language models have demonstrated emergent abilities (Wei et al., 2022), encompassing enhanced skills and capabilities that are absent in smaller language models. Such abilities cannot be accurately predicted by extrapolating the performance of smaller models. Consequently, investigating the influence of model size on different tasks becomes imperative for understanding the potential and constraints of smaller and larger language models.
The related works show that, although LMs seem to mirror an aspect of reasoning, e.g. logical reasoning, a closer look at the underlying conceptualisation of these abilities can reveal that they are not robust and fail to mirror deeper semantics. Both logical reasoning and figurative language interpretation require an understanding of relationships between words and concepts and the ability to make inferences based on that understanding. This overlap in cognitive processes allows for the development of models that can perform both tasks effectively. Moreover, the results of Liu et al. (2022) indicate that, overall, LMs fall short of human performance. On a phrase and word level, the authors find that longer phrases are harder to interpret and that metaphors relying on commonsense knowledge concerning objects' volume, height, mass, brightness or colour are easier to interpret. This indicates that bodily modalities seem to facilitate interpretation success. They also show that larger models (i.e. models with a larger number of parameters) perform better on the task. All of these findings have been reproduced by our experiments. Chakrabarty et al. (2022) present FLUTE, a data set of 8,000 figurative NLI instances. Their data set includes the different figurative language categories of metaphor, simile, and sarcasm. In contrast to Fig-QA, the authors do not create metaphors in a Winograd scheme as a forced-choice task but create natural language explanations (NLE) using GPT-3 (Brown et al., 2020) and human validation. Their experiments with state-of-the-art NLE benchmark models show poor performance in comparison to human performance. The authors do not differentiate the metaphors, similes and sarcastic phrases concerning linguistic features. Moreover, they include a language model in the creation process, which, as far as our study is concerned, introduces a bias to the data set. Hence, we decide to use the

Modelling Embodied Language
It is generally understood that language is grounded in experience based on interaction with the world (Bender and Koller, 2020; Bisk et al., 2020). Hence, there is an interest in leveraging LMs' capabilities in interactions with the environment. For example, Suglia et al. (2021) present EmBERT, which attempts language-guided visual task completion. Their model uses a pre-trained BERT stack fused with an embedding for detecting objects from visual input. The model achieves competitive performance on ALFRED, a benchmark task for interpreting instructions (Shridhar et al., 2020). Huang et al. (2022) investigate whether LMs hold enough embodied knowledge about the world to ground high-level tasks in the procedural planning of instructions for household tasks. For example, the authors pass a prompt, e.g. "Step 1: Squeeze out a glob of lotion", to a pre-trained LM (e.g. GPT-3) and extract actionable knowledge from its response. Their results indicate that large language models (<10B parameters) can produce plausible action plans for embodied agents.
Embodiment. In this study, the term embodiment relates to cognitive sciences: Humans process a linguistic statement such as "to grab an apple" using embodied simulations in the brain. Perceptual experiences activate cortical regions that are dedicated to sensory actions and those regions partially reactivate premotor areas to implement what Barsalou (1999) calls perceptual symbols. Reading of action words such as kick or lick is associated with premotor cortex activation responsible for controlling movements for these actions (Hauk et al., 2004). This effect is diminished by figurative language (Schuil et al., 2013). Therefore, a statement such as "to grasp the idea" does not necessarily rely on premotor cortex simulation. The semantic processing of the linguistic statement is therefore linked to its context and degree of embodiment in the sense that the action can be simulated by a brain in a body (Zwaan, 2014). This understanding of the term embodiment guides the evaluation of how language models, which do not have a brain in a body, can interpret figurative language phrases with a varying degree of embodied actions.

Statistical Evaluation
The review of related works shows that there are abilities of LMs that go beyond mere language generation, e.g. logical reasoning and action planning. It is unclear how LMs conceptualise actions that humans conceptualise using interaction with the environment. Figurative language acts as a test bed to assess metaphorical conceptualisations since they are grounded in embodied experience and interaction with the environment. We take the findings of Liu et al. (2022) as a starting point to focus on the effect of embodiment in figurative language interpretation by language models of various sizes.

Experimental Framework
Embodiment Rating and Data Set. To assess the effect of embodiment on the task, we discuss the effects of embodiment in semantic processing and introduce the simplification underlying our study through an example. The LM is prompted with each combination of sentence completion and interpretation (i.e. A.1+A.I, A.1+A.II, A.2+A.I, A.2+A.II). Notably, Liu et al. (2022) have shown that the addition of "that is to say" as a concatenation between metaphorical phrase and interpretation phrase elicits better model performance, hence we also include this prompt in our studies. Subsequently, the prediction scores of the language modelling head (scores for each vocabulary token) are retrieved and the interpretation with the highest probability becomes the LM's choice (for more details, see Liu et al., 2022). We compare this example with a different Fig-QA item (B). Given our hypothesis that embodiment affects the LMs' ability to interpret these phrases, we score (A) and (B) concerning their embodiment. As a simplification, we limit the rating of embodiment to the actions within the phrase. Every phrase evaluated has at least one word with a score related to an action. Most of the time, these related actions are verbs. Thus, we rate faded for (A) and dances for (B) with respect to their relative embodiment. For this scoring, we consult data by Sidhu et al. (2014).
In their empirical study, Sidhu et al. (2014) characterise a dimension of relative embodiment for verbs. In the construction of the data set, "participants were asked to judge the degree to which the meaning of each verb involved the human body, on a 1-7 scale" (Sidhu et al., 2014). Their resulting data set consists of ratings for 687 English verbs. Our hypothesis is that embodiment is a semantic component which affects the interpretation ability of LMs concerning figurative language. With their data set, the authors provide evidence that the meaning of a verb has a semantic component linked to the human body in the lexical processing of that verb. They assume that more robust semantic activation is generated by more embodied verbs (Sidhu et al., 2014). This provides us with a data set that we can apply to our experiment on figurative language. Moreover, their experiment provides additional control variables such as AoA and word length, which have a known effect on lexical processing (Colombo and Burani, 2002) and are included in our results (Sec. 4).
At the time of conducting our experiment, Fig-QA provided only the training and development data, which we will refer to as train & dev. Hence, we identify all phrases from the train & dev data set that contain at least one word with an embodiment rating from Sidhu et al. (2014). The process of creating the subcorpus with embodiment ratings (C_Emb) begins by identifying verbs using spaCy (Honnibal et al., 2020). The lemmatized versions of the verbs in the metaphorical phrases are then matched with embodiment scores, resulting in a subcorpus (C_Emb) with 1,438 entries. If more than one verb is present in the metaphorical sentence, the average of their ratings is assigned. Future work will assess whether a different heuristic for treating multiple actions influences our results. Analogously, we construct a subcorpus of the same size with metaphorical phrases that do not contain an embodied verb (C_NoE). For both subcorpora, we only keep phrases in which the verb is contained in the Winograd pair. The resulting subcorpora statistics are listed in Table 1.

Hypotheses. The main hypothesis for the statistical evaluation can be summarized as follows:
1. There is a correlation between the LMs' interpretation capabilities of metaphors and the degree of embodiment of the verbs within those metaphorical phrases.
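The subcorpus construction described above (matching verb lemmas against embodiment ratings and averaging when several verbs are rated) can be illustrated with a minimal sketch. It is not the code used in the study: the function names and the `EMBODIMENT` lookup are hypothetical, the verb lemmas are assumed to have been extracted beforehand (e.g. with spaCy), and only the ratings for "fade" and "dance" are taken from the examples (A) and (B); the value for "kick" is a placeholder. The sketch also omits the down-sampling of the unrated phrases to the size of C_Emb.

```python
from statistics import mean

# Hypothetical excerpt of a verb -> embodiment rating lookup (1-7 scale).
# "fade" and "dance" use the ratings quoted for examples (A) and (B);
# "kick" is a made-up placeholder value, not a rating from Sidhu et al.
EMBODIMENT = {"fade": 2.36, "dance": 6.50, "kick": 6.00}


def phrase_embodiment(verb_lemmas, ratings=EMBODIMENT):
    """Return the mean rating of all rated verbs, or None if none is rated."""
    rated = [ratings[lemma] for lemma in verb_lemmas if lemma in ratings]
    return mean(rated) if rated else None


def split_subcorpora(phrases, ratings=EMBODIMENT):
    """Split (text, verb_lemmas) pairs into rated and unrated phrases.

    Rated phrases are returned as (text, score) pairs (a C_Emb-like list),
    unrated phrases as plain texts (a C_NoE-like list).
    """
    c_emb, c_noe = [], []
    for text, lemmas in phrases:
        score = phrase_embodiment(lemmas, ratings)
        if score is None:
            c_noe.append(text)
        else:
            c_emb.append((text, score))
    return c_emb, c_noe
```

Swapping `mean` for `max` or `min` in `phrase_embodiment` would be one way to try a different heuristic for phrases with multiple actions.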
Intuitively, more embodied actions such as kick, move or eat are much more concrete, shorter and more basic when compared to resonate, compartmentalise or misrepresent. Therefore, the analysis of embodied actions must take into account factors such as concreteness, AoA, word length and word frequency. Moreover, common metaphors are conventional and more lexicalised. Consequently, they might simply be more embodied and the effect of embodied verbs might stem from the fact that these verbs are more concrete in the context in which they are presented. Hence, the first hypothesis should not stand alone, but will be evaluated along with two additional null hypotheses:
1.I There is no correlation between the LMs' interpretation capabilities of metaphors and the concreteness of the verbs within those metaphorical phrases, irrespective of their embodiment rating.
In our evaluation, the concreteness of a word in its context will be scored using an open-source predictor based on distributional models and behavioural norms, as explained in Rotaru (2020). Details of the concreteness scoring with the predictor are summarized in Sec. 3.2. Concreteness ratings are often subjective ratings (Brysbaert et al., 2014) or determined by other low-level features, such as AoA, word frequency and word length (Rotaru, 2020). To isolate the effect of embodiment, we add the second null hypothesis:
1.II There is no correlation between the LMs' interpretation capabilities of metaphors and other linguistic features, such as AoA, word frequency and word length.
For AoA, we obtain scores for each of the actions from Kuperman et al. (2012) and for word frequency from Van Heuven et al. (2014). Together with word length and embodiment score, we test these features for variance inflation to address 1.II.

Model Selection
The selection of our models is based on three criteria: First, we want to reproduce the results of Liu et al. (2022) in order to have a comparable measure. Second, we want to check whether the effect generalises to other large LMs. Third, we want a variation of different model sizes to account for varying performance on the task as a result of model size. For the latter two criteria, we start with the smallest available models of each type and check intermediate model sizes. We do not consider it necessary to check whether the largest, scaled versions of each model perform better on the task since this is a general property of LMs (Brown et al., 2020; Srivastava et al., 2022).
In the original Fig-QA study, the authors examine three transformer-based LMs with different parameter sizes: GPT-2 (Radford et al., 2019), GPT-3 (Brown et al., 2020) and GPT-Neo (Black et al., 2021). To reproduce the results by Liu et al. (2022), we include GPT-2, GPT-3 and GPT-Neo LMs and add OPT LMs (Zhang et al., 2022b). An overview of the models and their specifications is shown in Table 2. Notably, we want to examine whether the type and number of parameters play a role when it comes to performance concerning the embodiment. Hence, we include pairs of models from each type that are small (<1 billion) and medium to large (>1 billion) in their number of parameters.

Methodology
We apply the same methodology of evaluation as Liu et al. (2022). In our zero-shot setting, each pretrained LM is prompted with the metaphor sentences combined with one of the interpretation sentences, concatenated with "that is to say". For OpenAI models, the API provides the log probabilities per token as the logprob return value. We access all other models using huggingface.co and its transformers library. To create the same evaluation metric as for the results by Liu et al. (2022), we follow Tunstall et al. (2022) and implement a function that returns the logprob based on the prediction scores of the language modelling head. All code and data are publicly available.

Following Colombo and Burani (2002), we test for the effects of word concreteness, AoA, word frequency and word length. This analysis includes an assessment of the amount of multicollinearity within the regression variables by determination of the variance inflation factor (VIF). Moreover, we conduct linear regressions for all models (and all sizes) with respect to their task performance and the features: embodiment score, AoA, word frequency and word length. For these linear regressions, we include those with and without the embodiment score feature in order to assess whether this feature contributes to a higher coefficient of determination (R²).
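The forced-choice scoring described in this section (turning the prediction scores of the language modelling head into a log probability per candidate and choosing the more probable interpretation) could be sketched as follows. This is an illustrative, dependency-free sketch rather than the published code: the function names are hypothetical, the prediction scores are assumed to be already aligned so that `logits_per_step[i]` predicts `token_ids[i]`, and no length normalisation is applied, which may differ from the actual implementation.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vector of vocabulary scores."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_norm for x in logits]


def sequence_logprob(logits_per_step, token_ids):
    """Sum the log-probabilities of the observed tokens.

    Assumes logits_per_step[i] holds the scores that predict token_ids[i].
    """
    return sum(
        log_softmax(step)[token]
        for step, token in zip(logits_per_step, token_ids)
    )


def choose_interpretation(candidates):
    """Return the index of the candidate with the highest log-probability.

    candidates is a list of (logits_per_step, token_ids) pairs, one per
    metaphor-plus-interpretation prompt.
    """
    scores = [sequence_logprob(logits, ids) for logits, ids in candidates]
    return max(range(len(scores)), key=scores.__getitem__)
```

In practice the prediction scores would come from a pretrained LM; the toy lists used to exercise these functions stand in for that output.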

Embodiment Correlation
The results of all models are listed in Table 3 and visualized in Figure 1. Overall, for each pair of small and larger models, the larger model always performs better on the interpretation task than the smaller version of the model. Moreover, all larger versions of the models show a significant correlation (p < 0.05) between the embodiment rating and task performance. In two instances, GPT-NeoX (20B) and GPT-2 (1.5B), the p value is < 0.01. In the case of GPT-3, both model variants show a significant correlation. In all correlations, the coefficient is positive, albeit small (<0.1), which indicates that embodiment has a positive effect on task performance. None of the smaller models, except for GPT-3 with 350M parameters, shows a significant correlation between embodiment score and task performance.
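The captions of Figure 1 and Table 3 name the point biserial correlation as the test statistic. As a minimal sketch of how that coefficient relates binary task outcomes to embodiment scores, assuming each phrase contributes one 0/1 correctness label and one score (the function name and the toy inputs are hypothetical):

```python
import math
from statistics import mean, pstdev


def point_biserial(correct, scores):
    """Point-biserial r between binary outcomes (0/1) and continuous scores.

    Uses r = (M1 - M0) / s_n * sqrt(n1 * n0 / n^2), where s_n is the
    population standard deviation of all scores. This equals the Pearson
    correlation between the two variables.
    """
    hits = [s for c, s in zip(correct, scores) if c == 1]
    misses = [s for c, s in zip(correct, scores) if c == 0]
    if not hits or not misses:
        raise ValueError("both outcome groups must be non-empty")
    spread = pstdev(scores)
    if spread == 0:
        raise ValueError("scores must not be constant")
    n = len(scores)
    share = math.sqrt(len(hits) * len(misses) / n**2)
    return (mean(hits) - mean(misses)) / spread * share
```

The sketch returns only the coefficient. For significance levels such as those reported here, a statistics package would be needed, for instance `scipy.stats.pointbiserialr`, which also returns a p value.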
Concreteness. Using the concreteness-in-context predictor, we provide a concreteness value for each verb in C_Emb and correlate those predictions with all models' performance. As a result, there is no significant correlation between the concreteness of the action word in its context and the performance of the LM on the interpretation (results in Appendix B). We do not reject hypothesis 1.I.

Regression Analysis
The linear regressions for all models and model sizes, both with and without the embodiment score feature, revealed that the coefficient of determination (R²) was consistently higher for regressions that included the embodiment score feature. Furthermore, for cases without the embodiment feature, none of the other variables, such as age of acquisition (AoA), word frequency, or word length, showed a significant correlation with task performance. Related results and figures are available in Section D of the Appendix.

Variance Inflation. Pairwise correlations between AoA, word frequency, embodiment score and word length are visualized in Figure 2. Intuitively, frequency and AoA are expected to be correlated with each other, because words that are acquired much later in life are often less frequently used words, as they tend to be more complex or specific words. The multicollinearity test through VIF is presented in Table 4. All factors are close to 1.0, which indicates that there is no multicollinearity among predictor variables (if the VIF is between 5 and 10, multicollinearity is likely to be present) (James et al., 2013). Given that there is no multicollinearity between embodiment score and other linguistic features, we do not reject hypothesis 1.II.
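To make the VIF reading concrete, the following sketch computes the variance inflation factor of one predictor from its standard definition, VIF_j = 1 / (1 - R_j²), where R_j² comes from regressing predictor j on the remaining predictors with an intercept. It is an illustration under stated assumptions, not the analysis code of the study: all function names are hypothetical, the least-squares fit uses plain normal equations suited only to a handful of features, and perfectly collinear inputs would cause a division by zero.

```python
def solve(matrix, rhs):
    """Solve a small linear system by Gaussian elimination with pivoting."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    solution = [0.0] * n
    for row in reversed(range(n)):
        tail = sum(aug[row][k] * solution[k] for k in range(row + 1, n))
        solution[row] = (aug[row][n] - tail) / aug[row][row]
    return solution


def r_squared(rows, target):
    """R^2 of a least-squares fit of target on the row features plus intercept."""
    design = [[1.0] + list(row) for row in rows]
    width = len(design[0])
    gram = [[sum(r[i] * r[j] for r in design) for j in range(width)]
            for i in range(width)]
    moment = [sum(r[i] * y for r, y in zip(design, target)) for i in range(width)]
    beta = solve(gram, moment)
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in design]
    centre = sum(target) / len(target)
    residual = sum((y - f) ** 2 for y, f in zip(target, fitted))
    total = sum((y - centre) ** 2 for y in target)
    return 1.0 - residual / total


def vif(rows, index):
    """VIF of predictor `index`: 1 / (1 - R^2) from regressing it on the others."""
    target = [row[index] for row in rows]
    others = [[v for k, v in enumerate(row) if k != index] for row in rows]
    return 1.0 / (1.0 - r_squared(others, target))
```

With each row holding, say, (AoA, frequency, embodiment, length) for one verb, `vif(rows, 2)` would give the factor for the embodiment score. The same `r_squared` helper could also be used to compare fits with and without a feature.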

Interpretation
The results of the correlation analysis indicate that embodiment affects the LMs' ability to interpret figurative language when the LM achieves a certain level of performance, which depends on the size of the model. The correlation coefficient is positive in all significant cases and those significant correlations occur in all larger (>1B parameter) model versions. Since task performance increases with model size, the effect of embodiment becomes more apparent through more successful interpretations in better-performing models. The fact that concreteness, AoA, word length and word frequency do not inflate this effect shows that the embodiment rating is not an arbitrary construct that implicitly models another linguistic feature.
There are slight differences between the model types when it comes to performance. For example, GPT-3 shows a significant correlation for the effect of embodiment for both the small (350M) and large (175B) model sizes. Yet, this effect does not occur in the small GPT-2 (355M) model, but only in the large GPT-2 (1.5B) version. Notably, OpenAI does not explicitly list the Ada model as having 350M parameters, but its performance ranks close to 350M versions on various tasks (Brown et al., 2020). Hence, this difference has only limited relevance. Nonetheless, we assume that the effect correlates with model size and that a reliable effect can be seen in larger models with more than 1 billion parameters.

Contribution to the Field
We successfully reproduce results that are in line with Liu et al. (2022). Moreover, we provide a subcorpus with embodiment ratings for the Fig-QA task. We identify the contribution of embodied verbs to LMs' ability to interpret figurative language. To the best of our knowledge, this study is the first to investigate the psycholinguistic norm of perceived embodiment in an NLP task for LMs.

Discussion
Benchmarks, such as BIG-bench (Srivastava et al., 2022), show that different types and sizes of LMs can be evaluated on many different tasks to identify potential shortcomings or limitations. This paper takes an entirely different approach by zeroing in on a particular task, which has been augmented with a specific semantic evaluation (embodiment ratings of actions), to highlight how difficult tasks, such as figurative language interpretation, benefit not only from model size but from specific embodied semantics.
Figurative language is difficult for LMs because its interpretation is often not conveyed directly by the conventional meaning of its words. Human NLU is embodied and grounded by physical interaction with the environment (Di Paolo et al., 2018). Consequently, it could be expected that LMs struggle when the interpretation of figurative language depends on a more embodied action. Yet, our results show the opposite: more embodied concepts are more lexicalised, and larger LMs can interpret them better in figurative language. Hence, our study raises the question of whether this effect is limited to figurative language or translates to other NLU tasks for LMs.

Limitations and Future Works
The current results are limited to one specific figurative language task (Fig-QA). In future work, we aim to test whether our hypothesis holds for other figurative language interpretation tasks, such as those by Chakrabarty et al. (2022) and Stowe et al. (2022). Moreover, we want to assess BIG-bench (Srivastava et al., 2022) performances on various other tasks concerning embodiment scoring and see whether the bias can be detected in tasks other than figurative language interpretation.
The statistical evaluation has attempted to measure many different linguistic dimensions, e.g. AoA, word length, word frequency and concreteness in context. Empirically, this indicates that the effect of embodiment is not simply explainable by other factors. Theoretically, we argue that this correlation can be causally explained through the lexicalisation of conventional metaphors. We simplify conventionality by assuming that word frequency and age of acquisition (AoA) are indicators of conventionality, i.e. more frequent words and words that are acquired early in life are more conventional. Nonetheless, a thorough explanation of the effect of embodiment on LMs' capabilities for language tasks requires many more studies.
Ethical Consideration. It should be noted that a key component of the experiment is built from Sidhu et al. (2014) and their ratings of relative embodiment. For their study, the authors have sampled data exclusively from (N = 67, 57 female) "graduate students at the University of Calgary who participated in exchange for bonus credit in a psychology course, had a normal or corrected-to-normal vision, and reported English proficiency" (Sidhu et al., 2014). Even though embodiment is supposed to be a general human experience, the pool of participants is relatively homogeneous (mostly female, educated and presumably able-bodied). A broader and more diverse set of ratings, specifically concerning differently-abled participants and cultural backgrounds, should be targeted.
Computing Cost. All model inferences (except OpenAI) have been conducted on University servers with 8x NVIDIA RTX A6000 (300 W). Each experiment for each model lasted at most 10 min with full power consumption. A conservative estimate of 2,400 W (8 GPUs x 300 W) for 20 experiments results in an energy consumption of at most 8 kWh, which equals emissions of at most ∼3.5 kg CO2 for all experiments with model inference.
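The upper-bound estimate in the Computing Cost paragraph can be retraced with a short calculation. The hardware figures, the number of experiments and the 10-minute bound are taken from the text; the carbon intensity is an assumption, since the text does not state which factor was used, and 0.44 kg CO2 per kWh is chosen here only because it reproduces the reported ∼3.5 kg.

```python
GPUS = 8
WATTS_PER_GPU = 300
EXPERIMENTS = 20
MINUTES_PER_EXPERIMENT = 10  # upper bound stated in the text

# Assumed carbon intensity; not stated in the text.
KG_CO2_PER_KWH = 0.44


def energy_kwh(gpus, watts_per_gpu, experiments, minutes_each):
    """Upper-bound energy in kWh, assuming all GPUs draw full power throughout."""
    total_watts = gpus * watts_per_gpu
    total_hours = experiments * minutes_each / 60
    return total_watts * total_hours / 1000


energy = energy_kwh(GPUS, WATTS_PER_GPU, EXPERIMENTS, MINUTES_PER_EXPERIMENT)
emissions = energy * KG_CO2_PER_KWH
print(f"{energy:.1f} kWh, about {emissions:.1f} kg CO2")
```

That is, 2,400 W sustained over 200 minutes (about 3.33 hours) gives 8 kWh.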

Data and Artefact Usage
Liu et al. (2022) are among the first to quantitatively assess the ability of LMs to interpret figurative language. Their Fig-QA data set is publicly available and we discuss the construction of our subcorpus in more detail in Section 3.1. In short, the authors present crowdsourced creative metaphor phrases with two possible interpretations to various LMs and check for which interpretation the model returns the higher probability. The main contribution of Liu et al. (2022) is the Fig-QA task, which consists of 10,256 examples of human-written creative metaphors that are paired in a Winograd schema. The authors also contribute an assessment of various LMs in zero-shot, few-shot and fine-tuned settings on Fig-QA.
Fig-QA data set instead of FLUTE.
Fig-QA provides the following item:
(A) The pants were as faded as ...
    (A.1) ... the memory of pogs
    (A.2) ... the sun in June
with the possible interpretations:
    (A.I) they were very faded
    (A.II) they were bright

Fig-QA item:
(B) She dances like a ...
    (B.1) ... fairy
    (B.2) ... turtle
with the possible interpretations:
    (B.I) she dances well
    (B.II) she dances poorly

Further examples from the subcorpus are presented in the Appendix in Section A. The previous examples (A) and (B) are thus augmented as follows:
(A) The pants were as faded as ... Embodiment Rating: 2.36
(B) She dances like a ... Embodiment Score: 6.50

With the annotated Fig-QA subcorpus C_Emb, we now turn to the models we select to assess whether there is a correlation between embodiment score and LM task performance.

Figure 1: Performance of four language models in two size variations on C_Emb. Significant results of the point biserial correlation between embodiment score and model performance are marked for p < 0.05 with * and for p < 0.01 with **. Colours correspond to the same model type, and x-labels provide model size.

Figure 2: Pairwise correlation for the variables AoA, word frequency, embodiment score and word length. Positive numbers indicate that an increase in one variable correlates with an increase in the other; negative numbers indicate, analogously, a decrease. The value 1 is the perfect correlation of a variable with itself. aoa: age of acquisition, freq: word frequency, embod: embodiment score, word len: word length.
Existing artefacts used in this research are attributed to their creators, and their consent was obtained before the studies. This concerns the embodiment ratings by Sidhu et al. (2014), the Fig-QA corpus by Liu et al. (2022) and the concreteness predictor by Rotaru (2020).

Figure 3: Coefficients of determination (R²) for all models and all model sizes through linear regression with (orange) and without (blue) embodiment score as a feature. For all regressions, the features age of acquisition, word frequency and word length have been included. For all models and sizes, the R² value is lower when embodiment is excluded as a feature. This is in line with the VIF analysis. Details of these linear regressions are exemplified in the results for GPT3_350m in Table 8 (with embodiment score feature) and in Table 9 (without embodiment score feature).

Table 1: C_Liu is 100% of the Fig-QA test set and 11% of the entire Fig-QA data set (Liu et al., 2022). Our selected subsets C_Emb and C_NoE are mutually exclusive and each composes 14% of the entire Fig-QA data set.

C_Emb | Fig-QA train/dev | Phrases that have at least one action with embodiment rating | 1,438
C_NoE | Fig-QA train/dev | Phrases that do not have an action with embodiment rating | 1,438

Table 3: Experimental results of all model pairs (smaller and larger versions) on the C_Emb corpus. The last column marks significant results of the point biserial correlation between embodiment score and model performance for p < 0.05 with * and for p < 0.01 with **.

Table 4: VIF for the four features. Constant denotes the intercept provided for the VIF. A factor close to 1 indicates no correlation, while values above 4 are regarded as moderate correlation.
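For reference, the variance inflation factor of feature j is VIF_j = 1 / (1 - R_j²), where R_j² comes from regressing feature j on the remaining features. The sketch below uses the equivalent formulation as the diagonal of the inverted feature correlation matrix, which corresponds to auxiliary regressions that include an intercept. It is a minimal standard-library illustration under those assumptions, not the tooling used for Table 4, and it does not produce a separate Constant row.

```python
import math
from typing import List, Sequence


def correlation_matrix(columns: Sequence[Sequence[float]]) -> List[List[float]]:
    """Pearson correlation matrix of feature columns (one sequence per feature)."""
    n = len(columns[0])
    centred = [[v - sum(col) / n for v in col] for col in columns]
    norms = [math.sqrt(sum(v * v for v in col)) for col in centred]
    return [[sum(a * b for a, b in zip(ci, cj)) / (ni * nj)
             for cj, nj in zip(centred, norms)]
            for ci, ni in zip(centred, norms)]


def invert(matrix: Sequence[Sequence[float]]) -> List[List[float]]:
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    k = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(k)]
           for i, row in enumerate(matrix)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular: features are perfectly collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(k):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[k:] for row in aug]


def variance_inflation_factors(columns: Sequence[Sequence[float]]) -> List[float]:
    """VIF per feature: the diagonal of the inverse correlation matrix."""
    inverse = invert(correlation_matrix(columns))
    return [inverse[j][j] for j in range(len(columns))]
```

With two features whose correlation is 0.8, both factors equal 1 / (1 - 0.64), roughly 2.78, which would fall below the threshold of 4 mentioned in the caption.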

Table 5: Linguistic examples of more embodied and less embodied metaphors, sampled from C_Emb (derived from Liu et al. (2022)). Every pair of sentences is presented with both target sentences to each model. Embodiment scores are retrieved from Sidhu et al. (2014).

Table 6: Results of the point biserial correlation between the concreteness of the action in context and the performance of the LMs on the interpretation of figurative language phrases (C_Emb). None of the LMs shows a statistically significant correlation between the variables (α = 0.05).

Table 7: Comparison of the zero-shot performance of the GPT-3 models Ada (∼350M parameters) and Davinci (∼175B parameters) on the different corpora (Tab. 2). The comparison includes the variable of "that is to say" suffix prompting.

Table 8: Linear regression results with all features, including embodiment score, for GPT-3 (ada).