Acquiring Linguistic Knowledge from Multimodal Input

In contrast to children, language models (LMs) exhibit considerably inferior data efficiency when acquiring language. In this submission to the BabyLM Challenge (Warstadt et al., 2023), we test the hypothesis that this data efficiency gap is partly caused by a lack of multimodal input and grounding in the learning environment of typical language models. Although previous work looking into this question found that multimodal training can even harm language-only performance, we speculate that these findings can be attributed to catastrophic forgetting of complex language due to fine-tuning on captions data. To test our hypothesis, we perform an ablation study on FLAVA (Singh et al., 2022), a multimodal vision-and-language model, independently varying the volume of text and vision input to quantify how much text data (if any) can be offset by vision at different data scales. We aim to limit catastrophic forgetting through a multitask pretraining regime that includes unimodal text-only tasks and data sampled from WiT, the relatively diverse Wikipedia-based dataset (Srinivasan et al., 2021). Our results are largely negative: Multimodal pretraining does not harm our models' language performance but does not consistently help either. That said, our conclusions are limited by our having been able to conduct only a small number of runs. While we must leave open the possibility that multimodal input explains some of the gap in data efficiency between LMs and humans, positive evidence for this hypothesis will require better architectures and techniques for multimodal training.


Introduction
Children can learn language from a relatively small amount of linguistic input: at most 100 million words (Gilkerson et al., 2017). By contrast, the quantity of training data a language model needs to achieve strong grammar and language performance is on the order of billions or tens of billions of words (Zhang et al., 2021). This data efficiency gap may be due, in part, to innate differences in learning mechanisms between models and humans, but environmental differences likely play a role as well (Warstadt and Bowman, 2022). This work tests the hypothesis that the lack of visual grounding in language models accounts for some of the gap in data efficiency.
The likelihood of finding evidence for this hypothesis rests largely on two factors: (1) its cognitive plausibility and (2) its technological viability. If vision does help children learn language, then there ought to be some way of incorporating vision into text-only language models that improves their learning ability. However, whether or not we can find this approach depends on the present technological capabilities of multimodal models.
To address the first point, one cognitively motivated mechanism for how children integrate nonlinguistic sensory data in language learning is cross-situational learning (XSL) (Smith and Smith, 2012). This theoretical mechanism holds that the learner accumulates statistical evidence about word meanings by observing multiple instances of co-occurring word-object pairs across many different real-world situations (Smith et al., 2011; Kachergis et al., 2014; Zhang et al., 2019). Encouragingly, Nikolaus and Fourtassi (2021) find that, in a highly constrained visual-linguistic domain, computational multimodal models do benefit from cross-situational learning.
To address the second point, prior evidence that vision will improve language models given current technologies is, at best, mixed. Recent approaches have successfully trained Transformer-based multimodal language models using self-supervised objectives resembling those developed originally for the training of unimodal models (Tan and Bansal, 2019). Nevertheless, in comparison to the unimodal models, multimodal LMs often perform relatively poorly on language-only tasks (Iki and Aizawa, 2021). We hypothesize that these shortcomings may be due to the common practice of training multimodal models by fine-tuning pretrained language models on captions data.
Our approach addresses these limitations in two ways that differ from most previous pretraining recipes. First, we use the FLAVA architecture and follow its multitask training procedure (Singh et al., 2022). Second, we train on the Wikipedia-based WiT dataset (Srinivasan et al., 2021), which pairs images with a mixture of strongly aligned (but formulaic) captions and weakly aligned (but syntactically complex) articles. Despite these efforts, our results show that the addition of image data and multimodal training objectives leads to no reliable improvement over text-only baselines on benchmarks for grammar (BLiMP; Warstadt et al., 2020a), understanding (GLUE; Wang et al., 2018), and generalization (MSGS; Warstadt et al., 2020b). We conclude that, to the extent that multimodality is partly responsible for the data-efficiency gap, present multimodal (and multitask) pretraining methods do not benefit from this richer learning signal.
To summarize, this work brings forward three main contributions:
1. We develop a robust codebase1 for pretraining (from scratch) large multimodal LMs under varying text and vision input configurations.
2. We evaluate, in this controlled environment, the effects of the visual signal on the model's textual encoder (hence, its linguistic ability).
3. We investigate plausible mechanisms for how incorporating visual input into the pretraining procedure might affect linguistic behavior.

Background
Prior work on multimodal language model training can be roughly differentiated by whether the main objectives are cognitively oriented or engineering oriented. So far, neither of these directions has produced clear evidence supporting the hypothesis that multimodality aids language learning at the scale of human language acquisition. Many cognitively oriented contributions are limited by a small data scale or a restricted domain. By contrast, engineering-oriented contributions using state-of-the-art Transformer-based architectures achieve more developmentally plausible scale and diversity but emphasize multimodal performance over language learning.

1 https://github.com/amariucaitheodor/acquiring-linguistic-knowledge

Cognitively Oriented Approaches
Infants enter a diverse and abundant visual world where they develop mental models to comprehend and mimic the patterns they encounter. These mental models empower them to grasp and anticipate their surroundings and accomplish objectives by incrementally improving their communicative abilities (Roy and Pentland, 2002). Relatedly, the impact of vision on specific aspects of human language learning has been an important question in human development for decades.
Contemporary research has tried to answer this question through computational simulations of cognitive processes involved in language acquisition. Multimodal models trained on visual question answering or reference games can use cross-situational learning to learn grounded meanings of words (Mao et al., 2019; Wang et al., 2021; Nikolaus and Fourtassi, 2021; Portelance et al., 2023). Nonetheless, computational models show different learning biases than humans in many cases, at least in the absence of specific training or architectural interventions (Gauthier et al., 2018; Vong and Lake, 2022). Ultimately, however, all of these studies are limited in their cognitive plausibility and language learning by a reliance on supervised training on small, artificial datasets in which texts and images correspond to arrangements of a limited set of objects in a simple, usually static scene.
Other studies aim for more naturalistic training. Lazaridou et al. (2017) and Chrupała et al. (2015) are notable for pioneering self-supervised training objectives for multimodal models several years before the advent of Transformer architectures trained on masking objectives. Wang et al. (2023) train LMs on data from the SAYCam dataset (Sullivan et al., 2021), pairing (written) child-directed utterances with visual data from the child's point of view. While this data domain is nearly ideal from a developmental plausibility perspective, the available data is too small to model anything past the first month of development.
Finally, we note that most of the studies in this area focus primarily on word learning. However, the data efficiency gap applies more broadly to language learning. Recent studies evaluating contemporary Transformer-based models have largely reported negative results for the effect of multimodality on semantics (Shahmohammadi et al., 2022), commonsense reasoning (Yun et al., 2021), and learning biases (Kuribayashi, 2023). To the best of our knowledge, ours is the first work to perform targeted syntactic evaluation (Marvin and Linzen, 2018; Warstadt et al., 2020a; Hu et al., 2020) on multimodal models.
These studies, for the most part, share many aspects of a typical recipe: First, they initialize all or some of the model parameters with the pretrained weights of a model such as BERT. Second, they fine-tune (using one or more self-supervised objectives) on a dataset of image-caption pairs.2 Finally, the model is evaluated on multimodal tasks such as visual question answering or image captioning.
While the ability to perform such grounded tasks is the key advantage of multimodal models over unimodal ones, it is critical for our research question to examine whether this advantage comes at the cost of language ability. Unfortunately, few of the works that train new multimodal models evaluate on language-only tasks. Some works perform this evaluation post hoc. Iki and Aizawa (2021) study five multimodal architectures, all initialized with BERT and fine-tuned using identical data and training objectives by Bugliarello et al. (2021). Evaluating on the GLUE benchmark (Wang et al., 2018), they find that, in nearly all cases, the original pretrained BERT outperforms the models with additional multimodal fine-tuning. Similar results are reported by Madasu and Lal (2023) and Yun et al. (2021).
2 For example, in the masked multimodal modeling task (MMM; Tan and Bansal, 2019), regions of an aligned image-text pair are randomly masked before being input into the model and then predicted. As the information from the image presumably helps reconstruct the masked text (Frank et al., 2021), this objective encourages learning text representations that encode information from the visual modality (and vice versa).
From a human development perspective, it may seem unintuitive that additional supervision on images harms language performance. However, from a machine learning perspective, this finding is easy to explain as an example of domain mismatch (Yun et al., 2021), catastrophic forgetting (McCloskey and Cohen, 1989), under-parameterization (Amariucai, 2023), or other similar technical reasons.
BERT's original training data (Wikipedia and books) is diverse in terms of writing style and subject matter. By contrast, captions datasets commonly used to train multimodal LMs, such as MS COCO (Chen et al., 2015) or Visual Genome (Krishna et al., 2017), consist entirely of short, formulaic physical descriptions of objects or scenes. Hence, the text domain that the models were trained on most recently bears little resemblance to the texts in the GLUE tasks, for example. Furthermore, the multimodal tasks incentivize using the models' limited parameters for both text and image processing, potentially sacrificing language ability.
Our experiments, which we describe in Section 3, are designed to address these issues through two complementary approaches: First, we prevent catastrophic forgetting by multitask-training on the language-only masked language modeling (MLM) objective jointly with the multimodal objectives. Second, we lessen the impact of domain mismatch by training on data that pairs images, not just with captions, but also with longer and more complex texts.

Methods
We conduct experiments to uncover differences in how language models' linguistic abilities change as the amount of visual input varies. We pretrain and evaluate multimodal LMs in eight conditions, derived by independently varying the volume of text data (10M or 100M words) and image data (none, 40K, 400K, or 4M images). We perform only one training run for each of the eight conditions due to computing constraints (see Limitations, Section 5). The text quantities are compatible with both human-scale linguistic exposure (Gilkerson et al., 2017) and the BabyLM strict-small and strict tracks (Warstadt et al., 2023).

Dataset
All the data for our experiments comes from WiT, a large, multimodal dataset entirely sourced from Wikipedia (Srinivasan et al., 2021). Our choice of WiT was motivated by its size and the diversity and complexity of its text data. English WiT includes 5.5M image-text pairs,3 making it one of the largest public datasets of its kind. It contains extended passages from Wikipedia articles, offering a more representative sample of sentence types than typical multimodal datasets sourced from captions. Furthermore, WiT features multiple types of text aligned with a given image. From most strongly aligned to most weakly aligned, these include alt text, captions, article text from the same section as the image, and article text from the lead section. Together with the fact that Wikipedia covers many different concepts and real-world entities, we hypothesize that WiT provides an adequately rich environment for supporting cross-situational learning while maintaining strong grammar and language understanding performance.
We subsample from the English portion of WiT to reach the desired data volume for each modality. For training purposes, we use either one (when either modality is 0%) or three (when both modalities are non-zero) data loaders. For example, when training on 100M words and 40K images, we sample the first 10% of the pairs for the text unimodal data loader, the first 1% for the vision unimodal data loader, and the first 1% = min(10%, 1%) for the multimodal data loader (containing paired images and text). Hence, all images in this configuration will be paired with some text, but not all texts will be paired with an image. This logic also implies that some images and texts will be seen both in the multimodal and their corresponding unimodal data loaders.
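This loader-construction logic can be sketched as follows. The code is a simplified illustration in Python; the function name and the fraction-based interface are ours, not taken from the released codebase.

```python
def build_loader_fractions(text_frac: float, vision_frac: float) -> dict:
    """Given the fraction of WiT pairs used for each modality (e.g. 0.10
    for 100M words of text, 0.01 for 40K images), return the dataset
    fraction backing each data loader. The multimodal loader can only
    draw on pairs whose text AND image are both sampled, i.e. the
    minimum of the two fractions."""
    loaders = {}
    if text_frac > 0:
        loaders["text"] = text_frac        # unimodal text loader
    if vision_frac > 0:
        loaders["vision"] = vision_frac    # unimodal vision loader
    if text_frac > 0 and vision_frac > 0:  # both modalities present
        loaders["multimodal"] = min(text_frac, vision_frac)
    return loaders
```

For the 100M-word/40K-image condition above, `build_loader_fractions(0.10, 0.01)` yields three loaders backed by 10%, 1%, and 1% of the pairs, respectively; a text-only condition yields a single text loader.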

Model
For our experiments, we use the FLAVA model architecture and training objectives (Singh et al., 2022). We choose to study FLAVA for two reasons: First, Singh et al. (2022) conduct a controlled comparison between a unimodally trained FLAVA text encoder and a fully multimodal FLAVA, and they report improved performance on language-only tasks from the multimodal model. As such, FLAVA is the only example of a large multimodal model for which prior (anecdotal) evidence supports our hypothesis that vision can help language learning. Second, FLAVA is trained in a multitask setting on a combination of unimodal text, unimodal vision, and multimodal objectives. This methodology addresses our concern (Section 2.2) that other common multimodal training recipes can lead to catastrophic forgetting of linguistic ability.
FLAVA's architecture combines three modality-specific encoders: Text and vision embeddings are fed into unimodal text and vision encoders, respectively, and the hidden states output by these encoders are concatenated before being fed into a multimodal encoder. For the unimodal objectives, task-specific heads can be placed after the corresponding unimodal encoder. All encoders are based on the ViT-B/16 encoder (Dosovitskiy et al., 2021).4 Following the original work, we pretrain models from scratch using multitask learning with the following five objectives: masked image modeling, masked language modeling, masked multimodal modeling for both text and vision, image-text matching, and cross-modal contrastive learning. More details on each objective, as well as the encoder architecture itself, can be found in the original paper (Singh et al., 2022). We use two distinct learning rates because a lower value is commonly recommended for text-only pretraining (Liu et al., 2019; Devlin et al., 2019), while a higher one was originally used for multimodally pretraining FLAVA. Multiple strategies for choosing modality-specific learning rates are treated extensively in Yao and Mihalcea (2022), where the "Keep" strategy (ours) is among the most straightforward. While simple, it outperforms the global learning rate strategy in the authors' empirical study, as it ensures that each unimodal subpart still has effective gradients when training the fusion model.
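The "Keep" strategy can be sketched as routing each parameter to the learning rate of its own recipe by module prefix, with the resulting groups handed to the optimizer. This is our illustrative reading of the strategy, not code from FLAVA or from Yao and Mihalcea (2022); the prefix names and rate values are placeholders.

```python
# Placeholder learning rates: a lower value for the text encoder, as
# commonly used in text-only pretraining, and a higher value for the
# remaining (vision / multimodal) parameters. Values are illustrative.
TEXT_LR = 1e-5
MULTIMODAL_LR = 1e-3

def assign_learning_rate(param_name: str) -> float:
    """Route a parameter to a learning rate by its module prefix
    ("Keep" strategy: each unimodal subpart keeps its own rate)."""
    if param_name.startswith("text_encoder."):
        return TEXT_LR
    # vision encoder, multimodal (fusion) encoder, and task heads
    return MULTIMODAL_LR

# The per-parameter rates would then be collected into optimizer
# parameter groups (e.g. for torch.optim.AdamW).
```

The design point is that text-encoder gradients stay at a scale appropriate for language modeling even while the fusion model trains at a higher rate.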

Software
We use PyTorch Lightning (Falcon and The PyTorch Lightning team, 2019) as the main training framework and Weights and Biases (Biewald, 2020) to track relevant metrics in real time. We use the Hugging Face datasets library (Lhoest et al., 2021) to interleave the modality-specific datasets, and the Hugging Face transformers library (Wolf et al., 2020) to access and train randomly initialized FLAVA models.
Hardware
We run each training job in Distributed Data-Parallel mode, across two NVIDIA Tesla A100 Ampere 40 GB graphics cards, on the same node, in ETH Zürich's Euler datacenter. For each of the two GPUs, there are 4 CPU workers loading data (this number was empirically found to be optimal), with each CPU worker having 10 GB of RAM available. The average runtime for our jobs running on 100M words was six days, and for 10M words, it was three days. Thus, we count a total of (2 GPUs) × (8 jobs) × (108 hours/job on average) = 1728 GPU hours used to train the models reported in this study, not counting our hyperparameter search.
Dataloader Sampling Weights
During multimodal pretraining, we alternate samples from three data loaders with independent weights, initialized (and normalized) proportionally to their sizes. For example, for the condition with 100M words and 40K images (hence, all images and 10M words of text are paired), we would have the following initial sampling weights: 0.833 (text), 0.083 (vision), and 0.083 (multimodal). For maximal text encoder performance, we perform a hyperparameter search and determine a simple rule-based approach to further improve the distribution of the sampling weights: If text is not the predominant modality, we change the weights to the uniform distribution; otherwise, we leave the initial weights unchanged.
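The weight initialization and the rule-based fix can be sketched as follows; the function name and the size-based interface are ours, and the sketch only assumes what the paragraph above describes.

```python
def sampling_weights(n_text: float, n_vision: float, n_multi: float) -> dict:
    """Sampling weights proportional to data loader sizes, with the
    rule-based adjustment: if text is not the predominant modality,
    fall back to a uniform distribution over the three loaders."""
    sizes = {"text": n_text, "vision": n_vision, "multimodal": n_multi}
    total = sum(sizes.values())
    weights = {k: v / total for k, v in sizes.items()}
    if weights["text"] != max(weights.values()):
        # Text is not predominant: use the uniform distribution.
        weights = {k: 1 / len(sizes) for k in sizes}
    return weights
```

For the 100M-word/40K-image condition, the loader sizes stand in a 10:1:1 ratio, so `sampling_weights(10, 1, 1)` reproduces the 0.833/0.083/0.083 split above; a vision-dominated configuration instead yields 1/3 for each loader.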

Modality-Specific Early Stopping
We develop custom logic to prevent the models from overfitting on any given modality.For example, when the sampling rate is 0.083 for the multimodal and text data loaders yet 0.833 for the vision data loader, the former two modalities will likely begin to overfit well before the latter.To avoid this, we detect increases in validation loss and (each time) halve the corresponding task's sampling weight.If the validation loss continues to steadily increase after three validation steps, we set the task weight to 0. To prevent catastrophic forgetting of the multimodal input, we allow the models to restart training on vision and multimodal data after a certain period of inactivity (here, 10 validation phases).
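The scheduling rules above can be condensed into a small state machine per task. This is a sketch of the described behavior, not the released implementation; the class name, the comparison against the previous loss, and the exact reset bookkeeping are our assumptions.

```python
class ModalityScheduler:
    """Per-task sampling-weight schedule: halve the weight on each
    validation-loss increase, zero it after three consecutive
    increases, and revive the task after a period of inactivity."""

    def __init__(self, weight: float, patience: int = 3, revive_after: int = 10):
        self.initial_weight = weight
        self.weight = weight
        self.patience = patience        # consecutive rises before zeroing
        self.revive_after = revive_after  # inactive phases before restart
        self.prev_loss = None
        self.rises = 0
        self.inactive = 0

    def step(self, val_loss: float, revivable: bool = True) -> float:
        """Update the task weight after one validation phase."""
        if self.weight == 0.0:
            self.inactive += 1
            if revivable and self.inactive >= self.revive_after:
                # Restart training on this modality from its initial weight.
                self.weight = self.initial_weight
                self.prev_loss, self.rises, self.inactive = None, 0, 0
            return self.weight
        if self.prev_loss is not None and val_loss > self.prev_loss:
            self.rises += 1
            self.weight = 0.0 if self.rises >= self.patience else self.weight / 2
        else:
            self.rises = 0  # loss stopped rising; keep the current weight
        self.prev_loss = val_loss
        return self.weight
```

A steadily rising loss thus drives the weight 0.4 → 0.2 → 0.1 → 0, after which ten inactive validation phases restore the initial weight (for the vision and multimodal tasks).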
Model Selection
For each of the eight input configurations, we select the best model checkpoints for evaluation based on the lowest recorded masked language modeling loss on the validation set (see Figure 4 for validation losses for every training objective and model). Table 1 shows the number of training steps for the selected checkpoints from each configuration. For additional information, we also regularly evaluate the models' pseudo-perplexity on a held-out test set (see Figure 1).

Results
We evaluate the selected checkpoints from all eight training configurations on the BabyLM evaluation pipeline (Warstadt et al., 2023), including evaluations on benchmarks for grammar (BLiMP; Warstadt et al., 2020a), language understanding (GLUE and SuperGLUE; Wang et al., 2018, 2019), and linguistic generalization (MSGS; Warstadt et al., 2020b). For BLiMP and pseudo-perplexity (Wang and Cho, 2019), we also report intermediate results for all of the training checkpoints.
Our overall results in Table 2 largely confirm earlier work finding that vision is, at best, not consistently helpful to language performance. With a data volume of 10M words, FLAVA does sometimes perform marginally better on grammar-oriented tasks in the presence of visual cues. For other evaluations and with a data volume of 100M words, however, we find no consistent advantages in our experimental setting. For those improvements we do observe, our tests deem it unlikely that they are due to cross-situational learning (see Section 4.4).

Pseudo-perplexity
For validation, Figure 1 shows the pseudo-perplexity (PPPL; Wang and Cho, 2019) per token on a held-out evaluation subset of WiT throughout training. Unsurprisingly, PPPL is lower (better) for the 100M word models than for the 10M word models. Additionally, the metric appears to converge for the 10M word models, while it may still be decreasing for the 100M word models.5 The most unexpected finding is that, throughout training, PPPL is consistently worse as the amount of image data increases for a given amount of text data. This degradation may suggest that our multitask training procedure causes the models to sacrifice MLM performance in favor of other objectives as the proportion of visual and multimodal samples increases.
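Pseudo-perplexity exponentiates the average negative pseudo-log-likelihood, where each token's conditional probability is obtained by masking that token and reading off the MLM's probability of the true token. The sketch below assumes those per-token probabilities have already been extracted from the model; the function name is ours.

```python
import math

def pseudo_perplexity(token_probs: list) -> float:
    """Pseudo-perplexity in the spirit of Wang and Cho (2019):
    exp of the mean negative log of the per-token conditional
    probabilities P(w_t | rest of sentence), each computed by
    masking one position at a time."""
    n = len(token_probs)
    pll = sum(math.log(p) for p in token_probs)  # pseudo-log-likelihood
    return math.exp(-pll / n)
```

For instance, a model that assigns probability 0.25 to every masked token yields a PPPL of exactly 4.0, matching the intuition that perplexity measures the effective number of equally likely choices per token.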

Grammaticality
We evaluate linguistic knowledge using BLiMP (Warstadt et al., 2020a), which tests the ability of models to distinguish grammatical sequences from minimally different ungrammatical ones in a zero-shot setting.
Table 2 shows the overall BLiMP performance for each condition. We notice that text quantity makes a large difference in performance. Changes in vision input, on the other hand, are associated with small variations that are sometimes positive and sometimes negative. Hence, due to the lack of a consistent pattern and the small number of runs, we cannot confidently conclude that vision causes an increase or decrease in performance.
Figure 2 shows the BLiMP results for each validation step throughout training. For most of the duration of training, particularly for the 100M word models, models with less image data perform better. This behavior mostly matches the pattern we observe for pseudo-perplexity, except that the differences seemingly disappear by the end of training. This result confirms earlier findings that (pseudo-)perplexity is not entirely predictive of grammatical knowledge (Hu et al., 2020).

5 Our scheduler triggered early stopping based on validation loss despite the apparent possibility that longer training might have been beneficial. More generally, there are many potential improvements to task scheduling and early stopping for multitask learning that we leave to future work.

Figure 3: Zero-shot accuracies, in percentages, obtained on the BLiMP task for each grammatical category (x12) and FLAVA run configuration of input text volume (10M and 100M words) and input vision volume (0, 40K, 400K, and 4M images). The model checkpoints used to generate these results were selected as described in Table 1.
Individual BLiMP categories are more closely compared in Figure 3. Previous work shows that phenomena related to agreement have the steepest learning curves at the 10M word scale (Zhang et al., 2021). Therefore, if the hypothesis that vision accelerates LM learning is correct, we might expect to see the greatest signs of improvement for 10M word models on this subset of test suites. Figure 3, however, shows conflicting and inconclusive results, with improvements in anaphor agreement but a slight degradation for determiner-noun agreement, and little change for subject-verb agreement.
We observe that multimodal pretraining may have a regularizing effect at smaller data scales: BLiMP performance improves at times even though the pseudo-perplexity (i.e., test loss) is consistently higher (by 1-3 units) for the vision-infused models. Moreover, the vision-infused models run for almost twice as long before starting to overfit (8k vs. 4k steps), gaining accuracy in areas such as anaphor agreement, filler-gap dependencies, and NPI licensing, though not on test suites such as argument structure and subject-verb agreement.

Fine-Tuning Evaluations
In addition to the above, we also use GLUE/SuperGLUE (Wang et al., 2018, 2019) and MSGS (Warstadt et al., 2020b) to fine-tune and evaluate all eight models on a selection of downstream tasks that focus on language understanding and linguistic generalization.
As expected, the results in Table 2 show that overall GLUE performance increases (by around 5%) at the higher text data scale. Within each of the two text volume groups, however, there is no reliable improvement due to the addition of vision, though vision-infused models appear to be slightly better (relatively) at the lower data scale of 10M words. Generally, the models perform similarly on the selected downstream tasks (performance after fine-tuning), in line with the BLiMP results.
Scores on MSGS are negative for all models on all ambiguous subtasks (i.e., those subtasks not in the control condition), as shown in Table 3. This indicates that all of our models are consistently biased towards generalizing based on shallow surface cues rather than linguistic features.

Cross-Situational Learning
To assess the symbolic grounding of our models, for every input configuration checkpoint in Table 1, we evaluate the multimodal text retrieval zero-shot accuracy on ImageNet-1k (Russakovsky et al., 2015). The goal is to select, for every given query image, the best-fitting text caption from a pool of 1000 options. To this end, we compute cosine similarities as matching scores between the queried image's representation and the representations of 1000 template-averaged6 potential captions. Lastly, we retrieve the text caption with the highest matching score for each image query. We follow Radford et al. (2021) to calculate the zero-shot accuracy.
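The retrieval step reduces to an argmax over cosine similarities against template-averaged caption embeddings. The sketch below works on plain embedding vectors and makes no assumption about how they were produced; the function names are ours.

```python
import math

def cosine(u: list, v: list) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_embedding(vectors: list) -> list:
    """Average the embeddings of one caption's prompt templates
    (template averaging in the style of Radford et al., 2021)."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def retrieve(image_emb: list, caption_template_embs: list) -> int:
    """Index of the caption whose template-averaged embedding best
    matches the query image embedding."""
    scores = [cosine(image_emb, mean_embedding(templates))
              for templates in caption_template_embs]
    return max(range(len(scores)), key=scores.__getitem__)
```

Top-1 accuracy is then the fraction of query images whose retrieved caption index matches the gold label; top-5 checks whether the gold index is among the five highest-scoring captions.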
As a baseline, we assess FLAVA pretrained on the PMD corpus and obtain top-1 and top-5 accuracies of 32% and 60%, respectively. The models we pretrain, however, obtain average top-1 and top-5 accuracies of 0.1% and 0.5%, respectively. Possible factors behind this random-guess performance include: 1) the multitask scheduler described in Section 3.3 was misconfigured (this aligns with findings in Amariucai (2023)); 2) the smaller magnitude of the training data (WiT is a subset of PMD); 3) the weak alignment between some of the text (full paragraphs) and the corresponding images (ImageNet-1k only evaluates caption alignments); or 4) the fact that we do not pretrain the vision encoder unimodally on ImageNet-1k, as in Singh et al. (2022).

Conclusion
We perform an ablation study on a state-of-the-art multimodal language model under varying text and vision configurations. Our training recipe avoids the problem of catastrophic forgetting of complex language, which previous approaches fell prey to, by performing multitask training on both multimodal and unimodal tasks in a more diverse domain. Nonetheless, our results largely confirm earlier work finding that vision is (at best) not consistently helpful to language performance. During pretraining at the small 10M word scale, the FLAVA architecture (Singh et al., 2022) does sometimes appear to perform marginally better on grammar-oriented tasks in the presence of visual cues. However, for other evaluations and with a data volume of 100M words, we find no consistent advantages in our experimental setting.
At the small data scales at which we pretrain our models in this study (up to 100M words and the corresponding images), our tests in Section 4.4 deem it unlikely that the models are benefiting from cross-situational learning. Alternatively, the extra parameters in the multimodal encoder could simply be increasing FLAVA's modeling capacity, a hypothesis that we leave for future work. Regardless, multimodal pretraining seems to exhibit a regularizing effect: although pseudo-perplexity is consistently worse for the vision-infused models, grammatical performance fluctuates and is often at least as good.
We conclude that the lack of visual input alone does little to explain the large data efficiency gap between LMs and humans observed in grammar learning, though we leave open the possibility that this conclusion will change with better architectures and techniques for integrating vision and language at training time.

Limitations
The robustness of the observations made in this report is limited by the fact that each configuration (text/vision input volume) was only run once. Future work should provide at least five re-runs per configuration (with different seeds), as there can be considerable variance even across models with the same configuration (McCoy et al., 2020). Due to the computational intensity of performing re-runs, this was not possible in time for this submission.
Significant GPU resources are required to effectively train large language models, partly because of the large batch sizes and the scale of the datasets.In this work, we use ≈ 1728 GPU hours on very recent hardware (further details in Section 3.3).
Finally, there is an architectural difference between the unimodal and multimodal models in our experiments. The unimodal models are trained entirely without the visual or multimodal encoders. Although these parameters are not used by the multimodal model during evaluation on language-only tasks, they are used during training, and so they may have an indirect effect on what the language encoder learns. To test whether the potential performance improvements in grammaticality and language understanding can indeed be attributed to the visual cues, rather than simply to the increased number of parameters in the multimodal encoder, future work should conduct additional baseline experiments, e.g., where the images are replaced with random noise pixels.

Figure 4: Validation losses for every training objective on a held-out set. While the MLM losses (and, to a certain extent, also the MMM (Text) losses) are closely proportional to the pseudo-perplexity metric in Figure 1 (including some occasional spikes associated with checkpoint loading), the other losses are less stable. We point out some issues with the scheduler mechanism in Sections 4.1 and 4.4.

Table 2: Performance for each of the eight models on the BabyLM test suites (detailed version in Table 3).
Figure 2: BLiMP performance for the two data volumes of 10M and 100M words. The training steps on the x-axis are counted across all objectives.