Does Vision-and-Language Pretraining Improve Lexical Grounding?

Linguistic representations derived from text alone have been criticized for their lack of grounding, i.e., connecting words to their meanings in the physical world. Vision-and-Language (VL) models, trained jointly on text and image or video data, have been offered as a response to such criticisms. However, while VL pretraining has shown success on multimodal tasks such as visual question answering, it is not yet known how the internal linguistic representations themselves compare to their text-only counterparts. This paper compares the semantic representations learned via VL vs. text-only pretraining for two recent VL models using a suite of analyses (clustering, probing, and performance on a commonsense question answering task) in a language-only setting. We find that the multimodal models fail to significantly outperform the text-only variants, suggesting that future work is required if multimodal pretraining is to be pursued as a means of improving NLP in general.


Introduction
Large pretrained language models (LMs), e.g., BERT (Devlin et al., 2019) and GPT (Radford et al., 2019; Brown et al., 2020), derive representations of words and sentences by distilling patterns that exist in large text corpora. While such representations have shown strong empirical performance on many benchmark language understanding tasks, they have been criticized for their lack of grounding, i.e., the ability to connect words to the real-world entities, events, and ideas to which they refer. While grounding is obviously necessary for multimodal language understanding tasks (e.g., identifying a dog in an image), it has further been argued to be fundamental for learning semantic representations in general. For example, Bender and Koller (2020) argue that models trained without grounding will ultimately fail on some text-only tasks such as goal-oriented dialogue, and Merrill et al. (2021) argue that an embedding space learned from text alone cannot encode the correct conceptual structure. One proposed solution is to shift from text-only models to multimodal models, which learn to associate language with representations of the non-linguistic world (Bisk et al., 2020a). Such approaches are intuitively appealing, but have not yet been rigorously analyzed in practice.
We test the hypothesis that grounded pretraining yields better linguistic representations (of words and sentences) than does text-only pretraining. For two recently released vision-and-language (VL) models, VideoBERT and VisualBERT, we compare the performance of the multimodal model to a text-only variant. We measure how well the representations encode 1) commonsense inferences about the physical world, 2) the semantic structure of verbs and their arguments, and 3) compositional information about objects and their properties. Overall, we do not find evidence that the linguistic representations learned via multimodal pretraining differ meaningfully from those learned from text alone. We argue that such results do not imply that grounding is unimportant for language understanding, but rather that substantial future work on how to combine modalities is required if multimodal methods are to impact NLP in general. Our code is available at https://github.com/tttyuntian/vlm_lexical_grounding.

Related Work
Analyzing Pretrained LMs. There has been substantial prior work on analyzing pretrained LMs and the linguistic properties of their representations, looking, e.g., at syntactic parse structure (Hewitt and Manning, 2019; Linzen et al., 2016), semantic structure such as semantic roles and coreference (Tenney et al., 2019a), lexical semantics (Chronis and Erk, 2020; Vulić et al., 2020), and lexical composition (Yu and Ettinger, 2020). Particularly relevant to our studies is prior work which has explored how well text-only models capture commonsense knowledge about the physical world via intrinsic (Ettinger, 2020; Forbes et al., 2019) and extrinsic (Zellers et al., 2018, 2019; Bisk et al., 2020b) measures. Despite the interest in representations of the non-linguistic world, such analyses have not, to our knowledge, been run on multimodal LMs.
Vision-and-Language Pretraining. There is a long history of multimodal distributional semantics models (Howell et al., 2005; Lazaridou et al., 2015), to which pretrained transformer-based models are the latest addition (Sun et al., 2019; Li et al., 2020). Evaluations of these recent vision-and-language (VL) models have tended to focus on inherently multimodal tasks, e.g., image and video captioning (Sun et al., 2019), visual question answering (Li et al., 2020), or instruction following in robotics (Majumdar et al., 2020). Cao et al. (2020) describe a series of "probing" analyses for multimodal language representations, but focus on explicit grounding, e.g., where do models attend in the image when processing "dog"? Little work has analyzed whether the presence of grounded training data impacts the linguistic representations in general. Work that does perform exploratory analyses of the multimodal conceptual representations (Tan and Bansal, 2020; Radford et al., 2021) does not include analysis of comparable text-only models, limiting the conclusions that can be drawn.

Vision-and-Language Pretraining
This section describes pretraining approaches that use both vision and language information. In particular, we focus on two that extend BERT (Devlin et al., 2019) pretraining for text: VideoBERT (Sun et al., 2019) and VisualBERT (Li et al., 2020). Both are single-stream models which directly combine visual and text information at the model inputs, and are trained on paired video+speech and image+caption data, respectively.
More specifically, VideoBERT encodes video data by vector quantization, mapping visual features extracted from 1.5-second-long video segments into "visual words" via K-Means clustering. The authors downloaded around 300K publicly available cooking videos from YouTube and obtained the accompanying human speech from YouTube's automatic speech recognition system. Sequences of visual words and speech that are temporally aligned in the original videos are concatenated and fed into a BERT base encoder. Similarly, VisualBERT concatenates image region embeddings, derived from pretrained object detectors, with their corresponding image captions. The model is pretrained on the COCO dataset (Chen et al., 2015), which contains images with five human-annotated captions per image. Both pretraining methods rely on the BERT pretraining objectives, modified for their multimodal setups. Specifically, the objectives have two parts: (1) a masked language modeling (MLM) objective that predicts masked-out tokens (VideoBERT predicts both visual and text tokens, while VisualBERT predicts only text tokens) and (2) a visual-language prediction objective, which predicts whether the visual and language sequences come from the same video/image or not.
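As a concrete illustration of the quantization step, the following toy sketch maps segment features to "visual word" ids with nearest-centroid assignment. All dimensions and the vocabulary size are made up, and a flat KMeans stands in for the hierarchical clustering used in practice:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy stand-ins for per-segment visual features (VideoBERT extracts
# features from 1.5-second video segments; dimensions here are invented).
rng = np.random.default_rng(0)
segment_features = rng.normal(size=(500, 64))

# Quantize features into a small vocabulary of "visual words" by
# clustering and assigning each segment to its nearest centroid.
kmeans = KMeans(n_clusters=32, n_init=10, random_state=0).fit(segment_features)

def to_visual_words(features):
    """Map each segment feature vector to its nearest-centroid token id."""
    return kmeans.predict(features)

visual_tokens = to_visual_words(segment_features[:10])
```

The resulting token ids can then be interleaved with the text tokens of the temporally aligned speech before being fed to the encoder.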
VL pretraining setup. For VideoBERT, we obtained the training data and pretrained checkpoints from the authors. For VisualBERT, we downloaded the public VisualBERT-NLVR checkpoint, pretrained on the Karpathy train split (Karpathy and Fei-Fei, 2015) of COCO (Chen et al., 2015). We refer to these two multimodal pretrained checkpoints as VideoBERT VL and VisualBERT VL . Both VideoBERT VL and VisualBERT VL are based on the BERT base architecture, with the difference that our VideoBERT VL was trained from scratch (to ensure a controlled comparison to the text-only model; see below), while the public VisualBERT VL is initialized from its text-only counterpart.
Text-only pretraining setup. For comparison, we train a text-only counterpart for each model, VideoBERT text and VisualBERT text , using the same text data as the VL model (i.e., the transcribed speech or the captions), while the image data is removed (i.e., the "visual tokens" of a video or an image). Text-only models are pretrained with the masked language modeling and next sentence prediction objectives, the latter being applicable since there are multiple sentences describing each video (VideoBERT) and multiple captions per image (VisualBERT). We follow the multimodal pretraining setups as faithfully as possible: we use the same BERT base encoder with the corresponding initialization method, the same maximum sequence length, and the same optimization hyperparameters such as learning rate and number of training epochs. Therefore, the VL models and their text-only counterparts have the same architecture and the same number of parameters: the VideoBERT models have 125M parameters, while the VisualBERT models have 109M parameters. More details can be found in Appendix A.2.
Limitations. Our experiments are based on two popular variants of VL pretraining frameworks. We picked these two models as they reflect the common trends in VL pretraining for videos and images, and their model architectures and pretraining objectives closely resemble the BERT model, making it easier to compare with their text-only counterparts. However, this comes with the limitation that the models we analyze are trained on data of a different domain than many of our evaluation tasks (e.g., the data for VideoBERT comes from cooking videos on YouTube while the probing tasks are drawn largely from general web text). Thus, absolute results must be interpreted with this domain mismatch in mind. That said, our inclusion of a text-only baseline still allows us to isolate the benefit of the visual modality in an apples-to-apples comparison. Ideally, we would train VL models on multimodal corpora which match the evaluation domains. However, such corpora simply do not exist at the time of writing. Thus, despite the limitations due to domain, our results are representative of the current benefits of VL training.

Experiments and Results
We hypothesize that grounded pretraining leads models to learn better linguistic representations than does text-only pretraining. Specifically, we are interested in whether grounded pretraining yields benefits on NLP tasks that are defined entirely over textual inputs and thus do not require grounded representations (as opposed to tasks like visual question answering, for which grounded representations are indisputably required). We consider three different evaluations of "semantics": commonsense reasoning about the physical world ( §4.1), inferring sentence-level semantic structure ( §4.2), and composing lexical semantic concepts ( §4.3).

Physical Commonsense QA
We first ask whether VL pretraining yields gains on benchmark NLP tasks that intuitively rely on multimodal knowledge, even if they don't explicitly require representing non-text inputs. We use PhysicalQA (PIQA) (Bisk et al., 2020b), a commonsense reasoning benchmark in which models are given a sentence describing a physical goal ("Remove gloss from furniture.") and must select between two candidate solutions ("Rub furniture with steel wool." / "Rub furniture with cotton ball."). Following the setup in PIQA, we consider each solution candidate independently by combining the goal with one solution ([CLS] goal [SEP] solution [SEP]) and using the [CLS] token embedding at the last hidden layer as the representation of the candidate. We train a probing classifier to perform binary classification, with the two candidate representations as its inputs.
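The candidate-selection setup can be sketched as follows. The embeddings here are random stand-ins for the frozen encoder's [CLS] outputs, and the concatenation of the two candidate representations is our assumption; the text states only that the probe takes both candidate representations as input:

```python
import numpy as np

rng = np.random.default_rng(0)
HIDDEN = 768  # BERT base hidden size

# Stand-ins for the [CLS] embeddings of the two candidates,
# each encoded as [CLS] goal [SEP] solution_i [SEP].
sol1_cls = rng.normal(size=HIDDEN)
sol2_cls = rng.normal(size=HIDDEN)

# A linear probe over the concatenated pair of candidate representations
# (hypothetical fusion; an MLP or transformer probe would replace this).
w = rng.normal(scale=0.02, size=2 * HIDDEN)
b = 0.0

def pick_solution(c1, c2):
    """Binary choice: return 0 to select the first solution, 1 the second."""
    logit = float(w @ np.concatenate([c1, c2]) + b)
    return 0 if logit > 0 else 1

choice = pick_solution(sol1_cls, sol2_cls)
```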
We consider linear, MLP, and transformer probing classifiers. For the linear and MLP probes, we freeze the encoder weights and only train the classifiers. For the transformer probe, we finetune the last transformer encoder layer and a linear layer on top of it. See Appendix B.1 for details. Table 1 shows our results. Across all settings, we see that VL pretraining produces consistent but marginal gains. In addition, we see that training on YouTube video captions, even without using the video information itself (e.g., comparing VideoBERT text to the original BERT), yields a few-point improvement. Figure 1 shows the results broken down by word-level edit distance between the two solutions. We see that VL pretraining brings a few points of improvement when the edit distance is low (one or two words), i.e., where picking the right solution hinges on grounded information for single lexical items. On manual inspection of the errors, we do not observe any consistent patterns that reflect different behaviors for VL models and text models. This is true even when we focus only on cooking-related examples for the VideoBERT models (i.e., examples we expect to be in domain and thus most likely to demonstrate gains).
Thus, overall, our results are mixed. We see that VL pretraining can yield improvements on text-only tasks, and that these gains likely come both from differences in the distribution of the language and from the non-linguistic information itself. However, the gains are quite small (only a few points), despite the fact that the task in question (PIQA) is intended to directly probe the type of understanding that one gains from interacting with the physical world. We note, however, that most of the goals and solutions in PIQA are not cooking-related, and thus the limited impact might be due to domain mismatch. Future work on domain-general VL pretraining would offer valuable insight.

Coreference and Semantic Roles
The PIQA results above suggest VL pretraining yields some gains on extrinsic tasks like QA. Such extrinsic tasks, however, make it difficult to attribute differences to specific properties of the representations. We therefore turn to the "edge probing" suite of Tenney et al. (2019b), in which a probing classifier takes as input one or two token spans, each represented as a weighted sum of the layer activations of the token embeddings in the span, and must predict a task-related label (e.g., part of speech, parse information). The evaluation suite includes ten syntactic and semantic tasks. Results for all tasks, along with training details, are given in Appendix B.2. Per the above intuition, we are particularly interested in tasks that probe semantic structure. We focus on the following: Entity Coreference (Coref.), e.g., recognizing that "apples" and "them" refer to the same entity in "After the apples are chopped, put them in the bowl"; Semantic Role Labeling (SRL), which requires encoding semantic agents and patients, e.g., recognizing that "carrots" are the recipient of the pureeing action in "The carrots are then pureed in the food processor"; Semantic Proto-Roles (SPR), which requires predicting features such as awareness or cause for words in context, e.g., recognizing that "the food processor" causes the pureeing event, but is not aware of it; and Semantic Relations (Rel.), which requires predicting relations like entity-destination, e.g., the relation between "apples" and "bowl" in "put the apples in the bowl".

Table 2 shows results. Across the board, we observe extremely marginal gains in performance when comparing VL models to their text-only counterparts. In 7 out of 8 comparisons, the VL model outperforms the text model, versus just 1 comparison in which the text model outperforms. However, the differences that exist do not appear meaningful (∼0.5 percentage points), and we thus do not conclude that VL pretraining leads to any clear improvement in the models' ability to encode abstract semantic structure.

Adjective-Noun Composition
Finally, we investigate whether multimodal pretraining impacts conceptual structure at the lexical level. Arguably, if VL pretraining were to affect linguistic representations in any meaningful way, we would expect it to manifest in the conceptual representations of visually-groundable concepts. To explore this, we focus on adjective-noun composition, as this provides a simple way of defining a space of visually-groundable objects and properties that we expect conceptual representations to encode. For example, we expect that embeddings of the word "knife" from contexts in which the knife is described as "sharp" should be more similar to other instances of sharp knives than to instances of knives that are described as "dull".

Table 3: Summary metrics (range 0 to 1) for clustering noun embeddings (e.g., "apple") according to their adjective modifiers (e.g., "ripe"). Numbers are averaged over five random seeds. We see no significant improvement in any metric when grounded (video or image) data is included during training. Homogeneity of 1 means that every point in a cluster belongs to the same class. Completeness of 1 means that every point belonging to a given class is in the same cluster. V-measure is the harmonic mean of the two.

Encoder          Accuracy
BERT base        0.968 ± 0.002
VideoBERT text   0.992 ± 0.001
VideoBERT VL     0.993 ± 0.001
VisualBERT text  0.984 ± 0.002
VisualBERT VL    0.982 ± 0.001

We focus on the list of visually grounded adjectives introduced in Isola et al. (2015) (e.g., "small", "bright", "sharp"). We then mine the WikiHow dataset (Koupaee and Wang, 2018) for all adjective-noun bigrams involving these adjectives. We chose WikiHow because it does not overlap with the training corpus of either of our models, but contains similarly concrete, descriptive language. We apply several additional filters to remove low-frequency bigrams, described in Appendix B.3, which results in an analysis set of 651 unique adjective-noun bigrams across 11,970 contexts. We test how well each pretrained model's representation of the noun (e.g., "knife") encodes information about the adjective (e.g., "sharp") that modified it. Figure 2 provides a qualitative example of how noun representations cluster when using representations from VideoBERT text vs. VideoBERT VL. Quantitatively (Tables 3 and 4; see Appendix B.3 for experimental details), we do not see significant differences between VL and text-only models. Thus, again, VL pretraining does not appear to produce the desired improvements.

Conclusion
We provide a series of experiments which compare grounded vision-and-language (VL) pretraining to comparable text-only pretraining in terms of the quality of the linguistic representations produced. We find that VL pretraining sometimes produces gains, but that the text-only baselines perform well, and thus the margins are too small to support conclusions that VL pretraining (in its current form) has benefits for NLP in general. While there are good arguments to be made that grounding is necessary for learning general-purpose language representations, we conclude that current methods, which use direct extensions of NLP architectures and are often trained on data from narrow domains, have yet to produce such benefits. Future work is required to explore more domain-general VL training, as well as alternative architectures and losses for combining vision and language signals.

A.1 Domain-specific Masking
Masking tokens uniformly at random, as in BERT, has been found to be suboptimal (Joshi et al., 2019; Levine et al., 2020). In addition, we hypothesize that the benefits of visual-linguistic alignment might be greater if masking occurs on content words (which, in the cooking domain, are likely to be visually-groundable concepts). Thus, we implement a domain-specific masking scheme, which aggressively masks the most frequent cooking-related verbs and nouns. We apply the BERT tokenizer to the cooking corpus and manually pick the 500 most frequent cooking-related tokens. During pretraining data generation, 15% of the tokens are chosen for prediction, where the frequent tokens each have an 80% probability of being chosen, while the other tokens have a 15% probability. The masking strategy itself is the same as in the original BERT: 80%/10%/10% of the chosen tokens are replaced with the [MASK] token, a random token, or the original token, respectively. VideoBERT is pretrained with both random masking and domain-specific masking, while VisualBERT is pretrained with random masking only.
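The selection-then-mask procedure can be sketched as follows. The frequent-token list and vocabulary here are tiny hypothetical stand-ins (the real list contains 500 hand-picked cooking tokens, and random replacement samples from the full WordPiece vocabulary):

```python
import random

random.seed(0)

MASK = "[MASK]"
# Hypothetical stand-ins for the hand-picked frequent cooking tokens
# and for the tokenizer vocabulary.
FREQUENT = {"chop", "stir", "onion", "bake"}
VOCAB = ["chop", "stir", "onion", "bake", "the", "then", "pan", "slowly"]

def domain_specific_mask(tokens):
    """Select each token with probability 0.8 if it is a frequent domain
    token, else 0.15; then apply BERT's 80/10/10 rule (mask / random
    replacement / keep) to every selected token."""
    out = list(tokens)
    for i, tok in enumerate(tokens):
        select_p = 0.8 if tok in FREQUENT else 0.15
        if random.random() >= select_p:
            continue  # token not selected for prediction
        r = random.random()
        if r < 0.8:
            out[i] = MASK                 # 80%: replace with [MASK]
        elif r < 0.9:
            out[i] = random.choice(VOCAB)  # 10%: replace with a random token
        # else: 10%: keep the original token (still predicted)
    return out

masked = domain_specific_mask(["then", "chop", "the", "onion", "slowly"])
```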

A.2 Pretraining Details
We pretrain VideoBERT text from scratch on the same cooking dataset as Sun et al. (2019). We strictly follow the training setup of VideoBERT VL , which is based on BERT base : 12 transformer layers, each with 768 hidden units and 12 self-attention heads. We use 4 Cloud TPUs with a total batch size of 128 and train the model for 400K iterations, using the Adam optimizer with an initial learning rate of 1e-5 and a linear learning rate decay schedule. Training takes around 2 days.
We initialize VisualBERT text with the pretrained BERT base weights released by Devlin et al. (2018). This text-only model has the same configuration as its VL variant: 12 transformer layers, each with 768 hidden units and 12 self-attention heads. Training also largely follows the setup of VisualBERT VL : we use 4 TitanV GPUs with a total batch size of 64 and truncate sequences longer than 128 tokens. VisualBERT text is trained for 10 epochs (roughly 90K iterations) with the Adam optimizer and an initial learning rate of 5e-5; the number of warm-up steps is set to 10% of the total training steps. Training takes around 25 hours.

B Experimental Details B.1 PIQA
We use the [CLS] token embedding e at the last hidden layer as the representation of a candidate solution for a goal. This embedding is passed into the probing classifiers: a single linear layer, an MLP, and a transformer. The MLP probe has a hidden size of 512. We train with a cross-entropy loss using the Adam optimizer (Kingma and Ba, 2014), a batch size of 32, and an initial learning rate of 1e-4. We evaluate a model on the validation set every 1000 steps, halve the learning rate if no improvement is seen in 5 validations, and stop training if no improvement is seen in 20 validations. In this way, we limit the expressive power of the probes (since we are primarily interested in understanding differences in the representations that result directly from pretraining), yet still consider a number of ways (linear/nonlinear) that such information could potentially be encoded.
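A plausible sketch of the MLP probe follows. The 512 hidden size comes from the text, but the exact layer composition (Linear → ReLU → Linear) is our assumption, and the weights here are random stand-ins rather than trained parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
HIDDEN, PROBE_HIDDEN = 768, 512  # BERT base hidden size; probe hidden size

# Randomly initialized probe weights (trained parameters in practice).
W1 = rng.normal(scale=0.02, size=(PROBE_HIDDEN, HIDDEN))
b1 = np.zeros(PROBE_HIDDEN)
W2 = rng.normal(scale=0.02, size=(1, PROBE_HIDDEN))
b2 = np.zeros(1)

def mlp_probe(cls_embedding):
    """Score one candidate's [CLS] embedding with a one-hidden-layer MLP."""
    h = np.maximum(0.0, W1 @ cls_embedding + b1)  # ReLU
    return float(W2 @ h + b2)

score = mlp_probe(rng.normal(size=HIDDEN))
```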

B.2 Syntactic and Semantic "Edge Probing" Tasks
Edge probing formulates probing tasks into a shared format, in which the probing classifier takes a span s1 = [i1, j1) and an optional span s2 = [i2, j2), and must predict a task-related label based on the span representations. A span representation is a weighted sum of the layer activations of the token embeddings in the given span. We train a probing classifier for each task with encoder weights frozen, and follow the probing architecture and training strategy of Tenney et al. (2019c). Figure 6 shows the results on all tasks for all models.

Table 7: Summary metrics (range 0 to 1) for clustering noun embeddings (e.g., "apple") according to their adjective modifiers (e.g., "ripe"). Numbers are averaged over five random seeds. We see no significant improvement in any metric when grounded (video or image) data is included during training. Homogeneity of 1 means that every point in a cluster belongs to the same class. Completeness of 1 means that every point belonging to a given class is in the same cluster. V-measure is the harmonic mean of the two.
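The span representation described above can be sketched in a few lines. All dimensions are toy values, and the mean pooling over span tokens is a simplification of the attention pooling used in the edge-probing architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
N_LAYERS, SEQ_LEN, HIDDEN = 13, 16, 768  # 12 layers + embedding layer

# Toy per-layer activations for one sentence.
layer_acts = rng.normal(size=(N_LAYERS, SEQ_LEN, HIDDEN))

def span_representation(acts, start, end, layer_logits):
    """Weight the layers with softmax (scalar mix), then pool the tokens
    in span [start, end) into a single vector."""
    w = np.exp(layer_logits - layer_logits.max())
    w /= w.sum()
    mixed = np.tensordot(w, acts, axes=1)  # (SEQ_LEN, HIDDEN)
    return mixed[start:end].mean(axis=0)   # (HIDDEN,)

rep = span_representation(layer_acts, 2, 5, np.zeros(N_LAYERS))
```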

B.3 Lexical Composition
We preprocess the WikiHow dataset by tokenizing its 215K instructions into 5 million single sentences. We run a bigram search over all sentences to find adjective-noun pairs. The lower bound on bigram occurrence is set to 10, and bigrams whose nouns do not pair with more than 10 unique adjectives are filtered out. This leaves us with 57,521 bigram occurrences and 651 unique bigrams. The encoders then produce representations of the nouns in these bigrams.
Next, we apply a visually-grounded-adjective filter based on the list of adjectives introduced in Isola et al. (2015). For each unique bigram, up to 20 noun representations are randomly sampled. This leaves 62 unique adjectives, 48 unique nouns, and 11,970 noun representations.
We use K-Means to cluster the representations of each noun, with K equal to the number of unique adjectives that modify the noun in our dataset. We measure the quality of the resulting clusters using three clustering metrics: homogeneity, completeness, and V-measure (Rosenberg and Hirschberg, 2007), which are roughly analogous to precision, recall, and F1 score. We use the adjectives as the ground-truth cluster labels; i.e., scores are higher when the noun representations cluster according to the adjectives which modify the noun in context.
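This clustering evaluation can be illustrated with scikit-learn's built-in metrics. The embeddings below are synthetic stand-ins for contextual noun representations, labeled by the (hypothetical) modifying adjective:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import homogeneity_completeness_v_measure

rng = np.random.default_rng(0)

# Toy contextual embeddings of one noun (e.g., "knife"), labeled by the
# modifying adjective ("sharp" = 0, "dull" = 1); separation is invented.
adjective_labels = np.array([0] * 20 + [1] * 20)
noun_embeddings = np.concatenate([
    rng.normal(loc=0.0, size=(20, 32)),
    rng.normal(loc=3.0, size=(20, 32)),
])

# K = number of unique adjectives modifying this noun (here, 2).
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(noun_embeddings)
hom, comp, vm = homogeneity_completeness_v_measure(adjective_labels, clusters)
```

Homogeneity and completeness play the roles of precision and recall over cluster assignments, and V-measure is their harmonic mean.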
Finally, we carry out a probing experiment to evaluate the adjective information that is linearly encoded in the noun representations produced by the models. Given a noun embedding, a linear probing classifier built on top of each frozen model predicts the adjective that modifies the noun.
Across these quantitative analyses (Tables 7 and 8; Figure 3), we do not see significant differences between VL and text-only models.