Can Language Models Understand Physical Concepts?

Language models~(LMs) are gradually becoming general-purpose interfaces in the interactive and embodied world, where the understanding of physical concepts is an essential prerequisite. However, it is not yet clear whether LMs can understand physical concepts in the human world. To investigate this, we design a benchmark VEC that covers the tasks of (i) Visual concepts, such as the shape and material of objects, and (ii) Embodied Concepts, learned from interaction with the world, such as the temperature of objects. Our zero (few)-shot prompting results show that the understanding of certain visual concepts emerges as LMs scale up, but there are still basic concepts to which the scaling law does not apply. For example, OPT-175B performs close to humans with a zero-shot accuracy of 85\% on the material concept, yet behaves like random guessing on the mass concept. In contrast, vision-augmented LMs such as CLIP and BLIP achieve a human-level understanding of embodied concepts. Analysis indicates that the rich semantics in visual representations can serve as a valuable source of embodied knowledge. Inspired by this, we propose a distillation method to transfer embodied knowledge from VLMs to LMs, achieving a performance gain comparable to that of scaling up LM parameters 134x. Our dataset is available at \url{https://github.com/TobiasLee/VEC}


Introduction
With the emergent capabilities such as arithmetic (Brown et al., 2020; Wei et al., 2022) and multi-step reasoning (Chowdhery et al., 2022) brought by large-scale pre-training, language models (LMs) are gradually becoming unified interfaces (Hao et al., 2022), capable of instructing embodied robots in high-level tasks such as cleaning spilled coke in interactive and embodied environments (Ahn et al., 2022). Understanding physical concepts is an essential prerequisite for these tasks; e.g., producing correct instructions for cleaning the coke requires understanding the visual characteristics of a coke can, as well as physical properties such as hardness. However, it remains unclear whether current LMs can understand basic physical concepts (Driess et al., 2023).
To answer the question, we first define an evaluation suite of physical concepts covering visual and embodied concepts. Specifically, visual concepts examine knowledge that can be gained via visual perception, including generic visual concepts, such as the color, shape, and material of common objects, and spatial perception, which focuses on relationships between visual stimuli, i.e., the relative size and height of objects. The ability to deal with visual concepts serves as the basis for understanding real-world scenes to carry out further instructions. Embodied concepts examine knowledge that requires more interaction and multimodal sensory experience in the embodied world, including knowledge about the mass, temperature, and hardness of objects, e.g., ice is colder than water. Understanding embodied concepts is essential for an embodied agent to make correct choices when translating language into actions (Bisk et al., 2020a). We compose a Visual and Embodied Concepts evaluation benchmark VEC, with examples shown in Table 1.
With the benchmark, we examine a wide range of LMs. We cover masked language models and causal language models among text-only LMs, including BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019b), the GPT (OPT) family (Radford et al., 2019; Zhang et al., 2022b) with parameters ranging from 125M to 175B, LLaMA-1/2 (Touvron et al., 2023a,b), and Vicuna (Chiang et al., 2023). Furthermore, as humans understand the world by learning from multiple modalities, especially the visual modality (Bloom, 2002), we are interested in whether the vision supervision in recent vision-augmented language models (VLMs) (Chen et al., 2019; Radford et al., 2021; Wang et al.; Madureira, 2021) could also facilitate the understanding of embodied concepts. CLIP (Radford et al., 2021) and BLIP (Li et al., 2022a) are chosen as representative VLMs for evaluation, due to their promising performance and their ability to handle text-only inputs. To eliminate the effect of the training corpus (Tan and Bansal, 2020), we train BERT, OPT, and CLIP from scratch on the same caption dataset with a similar Transformer model (Vaswani et al., 2017) for a fair evaluation. Furthermore, as previous studies have shown that prompting methods that fit the pre-training paradigm can better elicit the knowledge learned by LMs (Petroni et al., 2019; Schick and Schütze, 2021a; Brown et al., 2020), we adopt pre-training-objective-style prompting methods to narrow the gap between probing and pre-training.

Table 1: The illustration of the VEC benchmark. We design two forms of probing tasks. The former (Color, Shape, and Material) asks models to make a choice between two tail options given the head object. The latter (Size, Height, and all embodied concepts) requires LMs to judge whether the relation is valid given the head and the tail.
Our zero (few)-shot results on the VEC benchmark show that: (i) moderate-sized LMs such as BERT and RoBERTa exhibit a random-level understanding of both visual and embodied concepts; (ii) a decent understanding of specific visual concepts emerges as LMs scale up, while they still struggle with embodied knowledge, performing only slightly better than random guessing; and (iii) text-only pre-training on image-grounded captions, instruction tuning, and visual supervision can all provide performance gains on visual concepts, yet only the last enhances LMs' understanding of embodied knowledge.
We further investigate the source of embodied knowledge in VLMs. A case study demonstrates that the embodied knowledge in the VLM of CLIP is potentially rooted in the rich semantics of its image representations. We thus devise a knowledge distillation method to transfer the embodied knowledge learned by VLMs into LMs, resulting in an average accuracy gain of 3.38, comparable to the 4.46 gain achieved by scaling the model parameters 134x. Nevertheless, the improved LMs still exhibit large gaps with humans, indicating great potential for further advancement.

VEC Benchmark
Our VEC benchmark aims to evaluate the understanding of physical concepts in LMs. Inspired by the world scope definitions of Bisk et al. (2020a), we divide physical knowledge into visual knowledge and embodied knowledge. The former covers visual properties that can be acquired via visual perception, while the latter covers knowledge that requires multimodal sensory interaction.

Visual Concepts
Perception is necessary for language learning because it forms the basis for many of our semantic axioms (Bisk et al., 2020a). Among the various types of perception, visual concepts model a vast range of experiences in the world that cannot be stated by text alone (Harnad, 1990). In this work, we evaluate the visual understanding ability of LMs by examining their performance on various visual concepts. Specifically, we combine recently proposed visual knowledge probing datasets, including Spatial Commonsense (Liu et al., 2022a) and ViComTe (Zhang et al., 2022a). The combined dataset requires not only understanding generic visual concepts, including the color, shape, and material of common objects, but also understanding relationships between objects, such as size and height. For generic visual concepts, i.e., color, shape, and material identification, we define an answer selection game: selecting the correct value of an attribute from two options, given an object. For example, given a head object banana, the model should pick the ground-truth tail answer yellow instead of an alternative option such as black. For visual relationships, i.e., size and height understanding, we define a comparison game: LMs need to perform a comparison between different objects. For example, given a head entity ant and a tail entity bird, the LM is asked to compare the sizes of the two objects and make a prediction between the correct relation smaller and the false one larger.

Embodied Concepts
Embodied concepts refer to physical properties of objects, e.g., mass and temperature, which infants learn by interacting with the environment (Gopnik et al., 1999). This kind of knowledge is a basis of intelligence and enables agent models to explore challenging tasks in physical environments. We are curious whether current LMs can capture embodied knowledge via large-scale pre-training. In this work, we define embodied knowledge as knowledge that requires multimodal sensory interaction with the environment beyond visual perception. We construct embodied knowledge evaluation datasets for basic physical properties including mass, temperature, and hardness.

Mass Dataset
We build the Mass dataset by transforming the Image2Mass dataset curated by Standley et al. (2017), which annotates 56 common objects with their weights. The lightest object in the dataset is a red Lego brick, weighing 0.026 lbs, and the heaviest is a 2.664 lbs drill. Directly asking an LM for the absolute mass of an object can be challenging (Wallace et al., 2019), so we define the task in a comparison format. Specifically, each comparison pair contains two objects with a weight gap greater than 1 lb. The threshold is set according to the Weber-Fechner laws (Fechner et al., 1966) to guarantee that the mass difference is perceivable by humans. We build 654 triplets such as (hair dryer, heavier than, red Lego brick) for evaluation.
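The pair-construction procedure above can be sketched as follows. This is a minimal illustration, not the authors' released pipeline: the object names and weights below are illustrative stand-ins for the Image2Mass annotations, and only the 1 lb perceivability threshold is taken from the text.

```python
# Build comparison triplets like (hair dryer, heavier than, red Lego brick)
# from absolute weight annotations, keeping only pairs whose weight gap
# exceeds a perceivability threshold (Weber-Fechner motivated).
from itertools import combinations

THRESHOLD_LBS = 1.0  # minimum perceivable weight gap, as described above

def build_mass_triplets(weights: dict) -> list:
    """Return (heavier, relation, lighter) triplets for all object pairs
    whose weight gap exceeds the threshold."""
    triplets = []
    for a, b in combinations(weights, 2):
        if abs(weights[a] - weights[b]) > THRESHOLD_LBS:
            heavy, light = (a, b) if weights[a] > weights[b] else (b, a)
            triplets.append((heavy, "heavier than", light))
    return triplets

# Illustrative annotations (the brick and drill weights appear in the text).
weights = {"red Lego brick": 0.026, "hair dryer": 1.2, "drill": 2.664}
print(build_mass_triplets(weights))
```

The Temperature and Hardness datasets described below follow the same recipe with their own thresholds (10°C and a Mohs-scale gap, respectively).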
Temperature Dataset We design a temperature probing dataset by collecting the temperatures of 22 common objects from Wikipedia. For example, ice is 0°C, and water vapor is 100°C. We convert the objects with temperature annotations into pairs, each containing two objects and the corresponding temperature relation, e.g., (ice, colder than, water vapor). The temperature gap between two objects must be greater than a difference threshold, loosely set to 10°C to ensure the difference is thermally perceivable by humans (Jones, 2009). The final Temperature dataset consists of 422 pairs in total.
Hardness Dataset Hardness is a measure of the resistance to localized plastic deformation in material science. For example, hard metals such as titanium are harder than soft minerals such as talc. Humans can perceive the hardness of different materials when interacting with the environment by using tactile organs such as fingers (Gueorguiev et al., 2016). To investigate whether LMs capture hardness knowledge, we build a Hardness dataset by collecting the Mohs hardness scores of 25 objects from Wikipedia. We again define the task in a comparison format, e.g., (talc, softer than, titanium). Each pair contains two objects whose hardness gap is greater than a threshold ensuring the difference is perceivable by humans. The final dataset contains 1,016 pairs.

Prompting Methods
Recent studies have shown that prompting methods that fit the pre-training paradigm are more effective than other possible prompting methods (Petroni et al., 2019; Schick and Schütze, 2021a). Following these studies, we design specific prompts for LMs with different objectives.
Prompting Masked Language Models Following PET (Schick and Schütze, 2021a,b), we probe masked language models by converting knowledge facts into a question-answering form. For example, a size knowledge fact (coin, smaller than, table) is converted into a sentence with a special mask token: Question: is a coin smaller than a table? Answer: [MASK]. We also explored other prompts, such as Is a coin [MASK] than a table. However, our experiments show that the question-answering form better induces models to generate answers and avoids the influence of the different tokenizations of different LMs. Given the masked input, the model is asked to predict the probabilities of the mask token over two choices,
i.e., yes for confirming that the knowledge fact is valid, or no for an unreasonable assertion. We observe that in specific LMs the prediction can be biased toward certain answers, as investigated by Zhao et al. (2021). We calibrate the prediction by normalizing the probabilities according to a prior estimated following Zhao et al. (2021).
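The calibration step can be sketched as below. This is a minimal instantiation of the idea from Zhao et al. (2021), assuming the prior over {yes, no} is estimated from a content-free prompt; the numbers are illustrative, not measured from any model.

```python
# Contextual-calibration sketch: divide each answer probability by its
# estimated prior and renormalize, so a model biased toward "yes" does not
# dominate the prediction. All probabilities below are illustrative.

def calibrate(probs: dict, prior: dict) -> dict:
    """Divide each answer probability by its estimated prior and renormalize."""
    scaled = {ans: p / prior[ans] for ans, p in probs.items()}
    total = sum(scaled.values())
    return {ans: p / total for ans, p in scaled.items()}

prior = {"yes": 0.8, "no": 0.2}  # bias estimated from a content-free input
probs = {"yes": 0.6, "no": 0.4}  # raw prediction for a knowledge fact
print(calibrate(probs, prior))   # "no" is boosted after removing the prior bias
```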
Prompting Causal Language Models Different from BERT, there is no special [MASK] token in causal language models like GPT (Radford et al., 2019). Therefore, introducing a special token would result in an inconsistency between pre-training and evaluation. To remedy this, for each knowledge fact, we state it in natural sentences according to prompting templates and use sentence perplexity as a proxy metric. Specifically, for size-property evaluation, we convert the fact into a valid knowledge assertion s1 = A coin is smaller than a table, and an invalid one by replacing the relation with its antonym, s2 = A coin is larger than a table. We evaluate the perplexity of each sentence s as PPL(s) = exp(-(1/n) Σ_i log P_M(w_i | w_<i)), where P_M denotes the conditional word probability of the causal language model being probed and n is the number of tokens in s. We compare PPL(s1) and PPL(s2) and choose the sentence with lower perplexity as the more valid assertion, calculating the prediction accuracy accordingly.
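The perplexity comparison above can be sketched as follows. The per-token log-probabilities here are hypothetical placeholders for what a real causal LM would return; only the formula and decision rule follow the text.

```python
# Perplexity-based selection: PPL(s) = exp(-(1/n) * sum_i log P(w_i | w_<i)).
# The assertion with lower perplexity is taken as the valid knowledge fact.
import math

def perplexity(token_logprobs: list) -> float:
    """Compute sentence perplexity from per-token log-probabilities."""
    n = len(token_logprobs)
    return math.exp(-sum(token_logprobs) / n)

# Hypothetical per-token log-probs for the two assertions:
s1 = [-1.2, -0.8, -0.5, -0.9]  # "A coin is smaller than a table"
s2 = [-1.2, -0.8, -2.9, -0.9]  # "A coin is larger than a table"

pred = "smaller" if perplexity(s1) < perplexity(s2) else "larger"
print(pred)  # the lower-perplexity assertion wins
```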
Prompting Vision-augmented Language Models of CLIP Unlike text-only LMs that support word predictions, the text encoder of CLIP produces only a single sentence representation, without any pre-trained language head. To probe the knowledge learned by VLMs such as CLIP, we design a matching-based prompting method. In more detail, for the size fact stated before, we first obtain two object descriptions, o1 = a photo of a coin and o2 = a photo of a table, and encode them into the corresponding object vectors with the CLIP language encoder. We then derive an attribute sentence a = a photo of a small object and encode it into an attribute vector with the same encoder. The prediction is performed by comparing the cosine similarities cos(o1, a) and cos(o2, a). The object with higher similarity to the attribute description is adopted as the answer, i.e., a coin is smaller than a table if cos(o1, a) > cos(o2, a); otherwise, we assume the model holds the reversed relation. We can also adopt the antonym adjective large to obtain the attribute vector. We report the results of the best-performing adjective for CLIP and discuss the influence of adjective options in § 4.3.
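The matching-based decision rule can be sketched as below. The three vectors are toy stand-ins for CLIP text-encoder outputs; only the cosine-comparison logic follows the method described above.

```python
# Matching-based probing: encode object descriptions and an attribute sentence,
# then pick the object whose embedding is closer to the attribute embedding.
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

o1 = [0.9, 0.1, 0.2]  # toy encoding of "a photo of a coin"
o2 = [0.1, 0.9, 0.3]  # toy encoding of "a photo of a table"
a  = [0.8, 0.2, 0.1]  # toy encoding of "a photo of a small object"

# The object with higher similarity to the attribute description is the answer.
answer = "coin" if cosine(o1, a) > cosine(o2, a) else "table"
print(answer)
```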

Experimental Settings
Table 2: Zero-shot probing results on visual datasets. Models with the YFCC-15M subscript are trained from scratch on YFCC-15M data. Scaling the OPT family brings clear improvements on the size and color datasets; the scaling law fails on the height dataset.

Models We cover two kinds of LMs, text-only LMs and vision-augmented LMs. Text-only LMs include BERT-base/large (Devlin et al., 2019),
RoBERTa-base/large (Liu et al., 2019b) for masked language models, and OPT models with parameters ranging from 125M to 175B. We further incorporate recent variants of causal language models into the evaluation, including LLaMA-1/2 (7B and 13B) (Touvron et al., 2023a), Vicuna models (7B and 13B, v1.3) (Chiang et al., 2023) trained with an instruction tuning dataset, and LLaMa-2 Chat models (7B and 13B) (Touvron et al., 2023b) trained with supervised fine-tuning and RLHF (Ouyang et al., 2022). For VLMs, we include the text encoders of CLIP-ViT-B/32 and CLIP-ViT-L/14 (Radford et al., 2021) as a base and a large version, respectively. We also include DeCLIP-ViT-B/32 (Li et al., 2022b), a VLM enhanced with masked language modeling as self-supervision, and BLIP (Li et al., 2022a), a VLM boosted by unifying multi-modal understanding and generation tasks. Since directly comparing VLMs and text-only LMs can be unfair due to differences in model configuration and training corpus (Tan and Bansal, 2020), we re-train CLIP, BERT, and GPT from scratch with a similar Transformer model on the same text corpus, the caption data of the YFCC-15M dataset (Thomee et al., 2016). All models are trained for 32 epochs; the only difference between these models is the pre-training objective. Detailed model and training settings are elaborated in Appendix B.
Prompts We manually write several prompts (at least 4 for each task) to eliminate the side effect of expression variations, and report the accuracy averaged over prompts for all models. Besides, the variance across different prompts also serves as an indicator of the robustness of the learned knowledge facts. All used prompts can be found in Appendix C.

Main Findings
The ability to grasp certain visual concepts emerges as LMs scale up, but there are still basic visual concepts where the scaling law fails. The evaluation results on visual datasets are shown in Table 2. Interestingly, as the OPT family scales up, prediction accuracy increases markedly on specific visual concepts such as color and size. On material and color, the largest OPT-175B model even achieves better results than the CLIP-ViT-L/14 VLM, which is augmented with vision supervision and supposed to perform better (Zhang et al., 2022a; Liu et al., 2022b).
A potential reason is that combinations of color and material occur frequently in raw texts (e.g., red apples), and these co-occurrence statistics are well captured by large LMs. The significant performance improvements after training on the visually grounded text corpus YFCC-15M validate this explanation. Besides, the OPT-13B and LLaMa-1 13B models excel at different visual concepts, with OPT-13B performing well on material concepts and LLaMa-1 13B on relative size comparisons, likely due to differences in their pre-training corpus distributions. However, scaling LMs to 175B brings negligible improvement on the Height dataset, indicating that there remain visual concepts where the scaling law does not hold, even though these concepts are easily captured by humans.

LMs exhibit a poor understanding of embodied concepts. As shown in Table 3, the scaling law fails again on the embodied concepts, as all LMs, including OPT-175B and variants trained on caption data, perform poorly. Among LMs, the LLaMa series shows better performance on embodied concepts, yet still reaches a plateau of around 55% overall accuracy. We further conduct a few-shot prompt evaluation for OPT models by constructing inputs with k = 16 randomly sampled instances, and adopt PET (Schick and Schütze, 2021a) for masked language models. The results are illustrated in Figure 2 and Table 4, respectively. We find that while performance is boosted, the average results are still worse than those of the CLIP-ViT-L/14 model without any demonstration, which uses only 0.08% of the parameters of OPT-175B. These findings show that visual supervision can help learn embodied knowledge, but there is still a large gap between the best results of existing LMs and human performance.
Compared with human annotators, OPT-175B and VLMs achieve competitive performance on visual concepts, yet they exhibit large gaps with humans on embodied concepts. We conduct a human evaluation to better understand the performance of different models. Specifically, we randomly sample 100 examples for each task and ask three volunteers to label these examples.
The annotators achieve substantial agreement on all tasks, with Cohen's kappa (Cohen, 1960) κ larger than 0.7, except for the Hardness dataset with a moderate κ = 0.52. The comparison with the best-performing models, i.e., OPT-175B, CLIP-ViT-L/14, and DeCLIP, is illustrated in Figure 3. We find that (i) on visual concepts, both OPT and CLIP-like models perform closely to human annotators; CLIP and DeCLIP even outperform the human annotators on the shape task, potentially due to the noise introduced by the automatic construction of the dataset (Zhang et al., 2022a); and (ii) on embodied concepts, all models still lag far behind human performance.

Instruction tuning enhances proficiency in both visual and embodied concepts. After post-training with the instruction tuning dataset, Vicuna models display enhanced proficiency in both visual and embodied concepts, with larger LLMs demonstrating a more significant improvement. For instance, when using LLaMa-1 (13B) as a baseline model, the average accuracy on the three embodied tasks rises from 52.8 to 55.2. Moreover, LLaMa-2-Chat models, which are further trained with a supervised instruction tuning dataset and RLHF techniques, show consistent accuracy gains on both visual and embodied concept tasks as well. However, disentangling the influence of instruction tuning and RLHF on these models is challenging, as the two are intertwined. Nevertheless, a clear performance gap remains between more recent LMs and VLMs, indicating the significance of visual supervision.

Analysis
Does BERT behave similarly regarding visual and embodied concepts? The overall prediction results of BERT-like models on both the visual and embodied datasets are at a random level. We investigate this by first checking whether BERT models perform consistently at a guessing level for all entities in the dataset. We compute the per-entity correct ratio across different prompts for the objects in different datasets and compare the distributions across tasks for the BERT model trained on the YFCC-15M dataset. As illustrated in Figure 4, in the Material identification task there are entities for which the model provides consistently correct predictions. However, the distribution on the Hardness dataset in the embodied evaluation exhibits a bell curve, i.e., most entities are predicted correctly at a random-chance level. The distributions of other tasks show similar results and can be found in Appendix D. These results suggest that BERT learns visual knowledge for certain entities yet indeed struggles with embodied concepts.
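The per-entity analysis can be sketched as follows. The prediction records below are illustrative stand-ins; only the correct-ratio computation follows the analysis described above.

```python
# Per-entity correct ratio: for each entity, the fraction of prompts answered
# correctly; the histogram of these ratios distinguishes consistent knowledge
# (mass near 0 or 1) from random guessing (a bell curve around 0.5).
from collections import defaultdict

def entity_correct_ratio(records):
    """records: iterable of (entity, is_correct) across prompts -> {entity: ratio}."""
    hits, totals = defaultdict(int), defaultdict(int)
    for entity, correct in records:
        totals[entity] += 1
        hits[entity] += int(correct)
    return {e: hits[e] / totals[e] for e in totals}

records = [("banana", True), ("banana", True), ("banana", False),
           ("talc", True), ("talc", False), ("talc", False)]
print(entity_correct_ratio(records))  # banana -> 2/3, talc -> 1/3
```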
Exploring learned embodied knowledge in image representations. We are interested in how VLMs like CLIP learn embodied knowledge. A potential answer is that images contain rich semantics regarding embodied knowledge, such as the heat of an object, and this knowledge is propagated to the VLM via the contrastive learning objective. To examine this, we perform a case study by calculating attribute similarities for images. We first take clips from a video of heating a pile of ice, and then perform a binary classification for each frame by calculating the cosine similarities with the text prompts a photo of a hot object and a photo of a cold object. The left of Figure 5 shows that the probability of a hot object increases during the heating procedure. Similarly, we perform a binary classification over heavy and light-weight objects ranging from an elephant to a feather. The right of Figure 5 shows that the image representations are aware of the mass of different objects. This qualitative study shows that image representations are a potential source of embodied knowledge.
(Prompt accuracies, cf. Figure 6: a photo of a soft object, 59.43% (std: 2.00%); a photo of a hard object, 44.00% (std: 1.26%); a photo of a light-weight object, 35.63% (std: 2.68%); a photo of a heavy object, 65.20% (std: 4.75%).)

Transferring embodied knowledge from VLMs to LMs. We further verify whether the embodied knowledge learned by CLIP can be transferred to text-only models. Specifically, we perform knowledge distillation (Hinton et al., 2015) by treating the original text-only language model as a student and the CLIP text encoder as a teacher model providing the learned embodied knowledge. However, our preliminary study in Appendix F shows that vanilla alignment on the predicted word distributions is not effective. Inspired by our case study showing the rich embodied knowledge contained in the representations, we utilize Neuron Selectivity Transfer (Huang and Wang, 2017), which transfers the inner states of teacher neurons, such as spatial activation patterns, to student neurons. Concretely, we align the last-layer token representations between the teacher and the student, implemented as a squared maximum mean discrepancy (MMD) with a polynomial kernel measuring the distance between the activation patterns of student and teacher neurons. The total training objective of the language model is a combination of the original language modeling loss and the MMD loss with a balancing coefficient. We refer readers to
Appendix E for more details. As shown in Table 5, the distillation provides a performance boost on embodied concept understanding; e.g., learning from a CLIP-ViT-L/14 teacher achieves an improvement comparable to that brought by scaling the model parameters 134x from OPT-1.3B to OPT-175B. This validates our assumption and indicates that future studies could utilize the richer representations in VLMs to improve LMs, yet the gap between the distilled LM and the VLM suggests that there is still room for advancement.
VLMs perform poorly when dealing with ambiguous text descriptions. During our preliminary study, we observe that the VLMs of CLIP perform relatively poorly for specific adjectives such as hard. To further investigate this issue, we examine the images retrieved from the CC12M dataset (Changpinyo et al., 2021) using prompts with different attribute adjectives. Our results, illustrated in Figure 6, reveal that for the prompt a photo of a hard object, the retrieved images are mostly about abstract and difficult learning materials, with only one rock image related to the attribute of hardness. Additionally, for the prompt light-weight, the retrieved images are biased toward meanings related to light bulbs and light-toned colors. These observations demonstrate that handling semantic ambiguity remains a challenge for VLMs (Ren et al., 2023), suggesting that future improvements may incorporate more language-side supervision into the text encoder of VLMs (Li et al., 2022b).
Related Work

In this paper, we investigate the ability of LMs to understand physical concepts. Different from PIQA (Bisk et al., 2020b), which consists of questions requiring physical commonsense reasoning, our VEC benchmark examines the understanding of fundamental physical concepts. The evaluation on the VEC benchmark demonstrates that text-only LMs can learn specific visual concepts after scaling up while struggling with embodied concepts.
Vision-Language Pre-training Unifying cross-modal representations via vision-language pre-training has achieved promising progress. Pilot studies adopt masked reconstruction to learn shared representations across modalities from mixed visual and language inputs (Li et al., 2019; Tan and Bansal, 2019; Su et al., 2020; Chen et al., 2019; Li et al., 2020). CLIP (Radford et al., 2021) introduces a contrastive language-image pre-training framework, utilizing language as supervision for learning transferable image representations from large-scale image-text pairs, triggering a series of variants for further improvements (Jia et al., 2021; Li et al., 2022b; Yao et al., 2022; Li et al., 2021, 2022a). Our study uses the VLMs of CLIP and BLIP to investigate the impact of visual supervision on understanding physical concepts, and our results suggest that visual supervision is crucial for LMs to understand embodied concepts and can be utilized to enhance text-only LMs.

Conclusion
In this paper, we introduce VEC for evaluating the understanding of physical concepts in LMs. Our results show that large LMs understand specific visual concepts but struggle with embodied knowledge. VLMs instead perform much better in both the visual and the embodied world, indicating that visual signals are vital for understanding physical concepts. Further analysis suggests that transferring VLM representations to LMs effectively boosts embodied concept understanding, shedding light on directions for improving LMs.

Limitations
Limited Scope of Physical Concepts In this work, we focus on evaluating certain physical properties such as color, mass, temperature, and hardness. These properties are chosen because they can be measured with well-established metrics and are easily sensed by humans. However, this selection introduces a bias into our approximation of embodied knowledge. Despite this bias, our results are sufficient to demonstrate the poor performance of current text-only language models (LMs) in understanding embodied concepts. We suggest that incorporating vision supervision could help improve the understanding of embodied concepts. Additionally, our current benchmark only examines fine-grained understanding of specific physical concepts, while neglecting more complex physical understanding that involves multiple interactions or observations within a single example. Developing a dataset that encompasses compositional physical concepts holds promise for future research.
Limited Adoption of VLMs While many multi-modal models are available, we restrict our investigation to VLMs based on CLIP and its variants. We choose CLIP for its superior image representation performance and its support for text-only encoding: since our evaluation focuses on language-oriented tasks, we require models that can handle pure-text inputs. Consequently, VLMs like UNITER (Chen et al., 2019), which require multimodal inputs, are not considered, and CLIP is selected as a representative VLM for evaluation. However, it is important to note that the findings from CLIP may not readily generalize to other V+L models, as CLIP utilizes a large dataset of millions of image-text pairs collected from the web, which could itself be a significant source of embodied knowledge. Furthermore, VLM models with various architectures and pre-training recipes have recently been proposed, such as SimVLM (Wang et al., 2022), UniT (Hu and Singh, 2021), ViLT (Kim et al., 2021), and FLAVA (Singh et al., 2022), as well as vision-enhanced multimodal agents like InstructBLIP (Dai et al., 2023), Qwen-VL (Bai et al., 2023), and Ying-VLM (Li et al., 2023). These models have shown promising performance in both cross-modal and single-modality tasks, and we look forward to evaluating these advanced models on our benchmark in the future.
E Distillation Details

Given an input text x from the caption dataset D, we transfer the sequential activation patterns of T(x) ∈ R^{|x|×d} to S(x) ∈ R^{|x|×d}, where T(x) and S(x) denote the last hidden representations of the VLM and the LM, respectively, and d is the number of hidden units. The squared maximum mean discrepancy (MMD) with the kernel trick (Huang and Wang, 2017) is adopted to measure the distance between the activation patterns:

L_MMD = (1/|x|^2) Σ_{i,j} ( k(t_i, t_j) + k(s_i, s_j) − 2 k(t_i, s_j) ),

where t_i and s_j denote the i-th and j-th token representations in T(x) and S(x). We adopt a polynomial kernel k(x, y) = (x^⊤ y + c)^p with p = 2 and c = 0. The MMD objective L_MMD is minimized along with the original language modeling objective L_LM:

L = L_LM + β · L_MMD,

where β is a weighting factor set to 20 to achieve a balance between the objectives.
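The squared MMD with the polynomial kernel described above can be sketched in pure Python. The kernel k(x, y) = (x^⊤ y + c)^p with p = 2, c = 0 follows the text; the toy token representations are illustrative, and a real implementation would operate on framework tensors inside the training loop.

```python
# Squared MMD between teacher and student last-layer token representations,
# using the polynomial kernel k(x, y) = (x^T y + c)^p with p = 2 and c = 0.

def poly_kernel(x, y, c=0.0, p=2):
    """Polynomial kernel over two representation vectors."""
    return (sum(a * b for a, b in zip(x, y)) + c) ** p

def mmd_squared(T, S, c=0.0, p=2):
    """Squared MMD between two sets of token representations T and S."""
    n, m = len(T), len(S)
    tt = sum(poly_kernel(t1, t2, c, p) for t1 in T for t2 in T) / (n * n)
    ss = sum(poly_kernel(s1, s2, c, p) for s1 in S for s2 in S) / (m * m)
    ts = sum(poly_kernel(t, s, c, p) for t in T for s in S) / (n * m)
    return tt + ss - 2 * ts

T = [[0.5, 0.1], [0.2, 0.7]]  # toy teacher token representations
S = [[0.5, 0.1], [0.2, 0.7]]  # identical student representations
print(mmd_squared(T, S))  # 0.0 when the activation patterns match exactly
```

During training, this quantity would be added to the language modeling loss with the weighting factor β.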

F Evaluation and Distillation with Oscar
We examine whether vanilla distillation from traditional V+L pre-training methods brings gains regarding visual and embodied knowledge. Specifically, following Zhang et al. (2022a), we distill the knowledge of Oscar (Li et al., 2020) into a BERT model by performing knowledge distillation (Hinton et al., 2015) on an image-caption pair dataset. The paired text and image are fed into the Oscar model to obtain the vision-aware vocabulary distribution, while a student BERT model performs masked language modeling on the text data only and learns from the soft labels provided by the Oscar teacher model. The distillation results in a DistilledOscar model supporting text-only inputs. We also evaluate VLM-BERT learned via Vokenization (Tan and Bansal, 2020), which devises a fine-grained token-voken matching framework to utilize visual supervision. The models are evaluated on the four largest datasets in GLUE, including SST-2 (Socher et al., 2013), QQP (Iyer et al., 2017), QNLI (Rajpurkar et al., 2016), and MNLI (Williams et al., 2018). As shown in Table 8, DistilledOscar performs worse than the vanilla BERT in both NLU tasks and probing tasks regarding visual and embodied knowledge. Besides, while VLM-BERT achieves improvements on NLU tasks, it still performs at the random level on the probing tasks. These results indicate that not all VLMs learn embodied knowledge, and that it is non-trivial to distill visual supervision from VLMs into LMs via language modeling alone.

Color
[Head] can be of the color [Tail].
the [Head] can be of color [Tail].
the color of a(an) [Head] is [Tail].
the color of [Head] is [Tail].
the [Head] is in [Tail].

Figure 1 :
Figure 1: An illustration of prompting methods. For BERT-like models with a masked language head, we convert the knowledge fact to a question and perform prediction with the head over yes or no. For OPT models, we evaluate the perplexity of different assertions and take the one with lower perplexity as a valid fact. For CLIP, we devise a matching-based probing framework.

Figure 2 :
Figure 2: Few-shot results of OPT-175B with 16 instances as demonstration on embodied tasks.

Figure 3 :
Figure 3: Comparison between the best-performing models and human annotators on sampled subsets of VEC. The best-performing LMs and VLMs achieve close-to-human results on visual datasets, yet lag far behind humans on embodied datasets.

Figure 5 :
Figure 5: Case study showing that the image representations in CLIP exhibit embodied knowledge. (Left) The probability of an image being classified as "hot" increases as the ice melts while being heated in a boiler in a video. (Right) The probability of an image being classified as "heavy", along with the corresponding mass annotation.

Figure 6 :
Figure 6: Top-5 retrieved images and the prediction accuracy with different attribute prompts. The accuracy drops when the text inputs contain ambiguous words and compound words, as the retrieved images are biased toward specific meanings.

Figure 7 :
Figure 7: Histogram of entity correct ratio across different prompts on the Size dataset.

Figure 12 :
Figure 12: Histogram of entity correct ratio across different prompts on the Temperature dataset.

Table 3 :
Zero-shot results on embodied datasets. LMs struggle to understand embodied knowledge, including OPT (175B) and visual-augmented LMs, with 71.95 as the best average performance.

Table 4 :
The few-shot results of BERT variants. With 16 instances, the fine-tuned BERT variants are still worse than zero-shot visual-augmented LMs.

Table 5 :
Distillation results on embodied datasets; results are averaged over multiple runs for stable results.

Table 12 :
Prompts used for VLMs of CLIP.