On the (In)Effectiveness of Images for Text Classification

Images are core components of multi-modal learning in natural language processing (NLP), and results have varied substantially as to whether images improve NLP tasks or not. One confounding effect has been that previous NLP research has generally focused on sophisticated tasks (in varying settings), generally applied to English only. We focus on text classification, in the context of assigning named entity classes to a given Wikipedia page, where images generally complement the text and the Wikipedia page can be in one of a number of different languages. Our experiments across a range of languages show that images complement NLP models (including BERT) trained without external pre-training, but when combined with BERT models pre-trained on large-scale external data, images contribute nothing.

While tasks such as VQA and VCR are multimodal in nature, there has been research on traditionally text-based tasks such as text classification (Huang, 2018) and word embedding learning (Bruni et al., 2014) which has demonstrated that the addition of images boosts performance. At the same time, however, there is evidence of images providing no additional information, e.g. Caglayan et al. (2019) show that multimodal machine translation (MMT) models learn to ignore visual content when trained on a parallel corpus of image captions (Elliott et al., 2016). These mixed findings raise the question of when visual context is actually useful in NLP.
In this work, we take a first step towards answering this question by focusing on the task of text classification, which has traditionally been addressed using textual data only. We identify two gaps in the literature on multi-modal NLP: (1) no results for pre-trained language models (LMs); and (2) no results for languages other than English. The first is important in terms of updating the research relative to state-of-the-art approaches, while the second relates to the question of how "language-independent" systems actually are (Bender, 2011). We fill these gaps via a text classification task over Wikipedia articles (Sekine et al., 2019). Our main findings are: (1) while images do help in a traditional supervised learning setting, their utility disappears almost completely when combined with a pre-trained LM; and (2) this phenomenon is not restricted to English, and generalises across a variety of languages from different families.

Task Description
This research is couched in the context of a shared-task dataset released by the SHINRA project (Sekine et al., 2019), aimed at classifying Wikipedia pages into fine-grained entity classes. [1] We chose this benchmark because many Wikipedia documents contain images, and data is provided for a total of 29 typologically diverse languages. [2] The task is not trivial, as it involves classifying Wikipedia documents into a set of 219 classes, with the possibility of multiple labels for a given document. The number of annotated pages for each language in the SHINRA dataset is shown in Table 1 (sorted according to the number of pages). In addition to these annotated datasets, which form the basis of the experiments in this paper, there is a large amount of evaluation data for each language. In an evaluation campaign over these evaluation datasets, we achieved first place across 4 languages: English, Italian, Spanish and Catalan (Yoshikawa et al., 2020).
[1] http://shinra-project.info/shinra2020ml/?lang=en
[2] Data is also provided for Greek, but we do not include it in our experiments because there was no officially preprocessed data available for this language.
The SHINRA dataset contains only textual information from the original documents. In order to add images, we extract the image links from the English Wikipedia dump of June 2020 [4] using the zim library. [5] The extracted images are then linked with image links in the source documents in the SHINRA dataset, [6] resulting in about 88% of pages being augmented with images (noting that the Wikipedia pages for languages other than English generally share the same images).
Out of the 30 languages in the original SHINRA dataset, we experiment primarily with Arabic ("ar"), English ("en"), Finnish ("fi"), Hindi ("hi"), and Mandarin Chinese ("zh"), selected to span five different language families and because their dataset sizes are relatively large. From the SHINRA data, we randomly sample 30k documents for each language, and construct a fixed 80%/10%/10% split for training/development/test in each language. We use a maximum of four images for each document. [7]
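The sampling, splitting and image-capping steps described above can be sketched as follows. This is a minimal illustration rather than the authors' preprocessing code: the function names, the seed, the `"<blank>"` placeholder and the toy document ids are all assumptions introduced here.

```python
import random


def sample_and_split(doc_ids, sample_size=30_000, seed=0):
    """Sample documents, then split them 80/10/10 into train/dev/test.

    The seed is an assumption; it only serves to make the split fixed.
    """
    rng = random.Random(seed)
    sampled = rng.sample(list(doc_ids), sample_size)
    n_train = sample_size * 8 // 10   # integer arithmetic avoids float rounding
    n_dev = sample_size // 10
    train = sampled[:n_train]
    dev = sampled[n_train:n_train + n_dev]
    test = sampled[n_train + n_dev:]
    return train, dev, test


def pad_images(images, max_images=4, blank="<blank>"):
    """Keep at most max_images per document and pad the rest with blanks."""
    kept = list(images[:max_images])
    return kept + [blank] * (max_images - len(kept))


# Toy usage: 100k hypothetical document ids for one language.
train, dev, test = sample_and_split(range(100_000))
print(len(train), len(dev), len(test))
print(pad_images(["a.jpg", "b.jpg"]))
```

With 30k sampled documents, this yields the 24k training instances referred to later in the paper.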

Baseline Experiments
Our first set of experiments is aimed at evaluating the empirical utility of images in the absence of pre-trained models. This is in line with previous work over similar text classification tasks (Huang, 2018).
[4] https://dumps.wikimedia.org/other/kiwix/zim/wikipedia/wikipedia_en_all_maxi_2020-06.zim
[5] https://github.com/openzim/libzim
[6] Because it is quite difficult to find correspondences between images and texts (Hessel et al., 2019), the extracted image links are "document-level" rather than "sentence-level".
[7] When a document has fewer than 4 images, we pad the representation with blank images.

Model and Features
As our basic learner, we use a linear-kernel support vector machine (SVM; Cortes and Vapnik, 1995). For the textual inputs, we experiment with three representations: (1) a binary bag-of-words ("BOW"); (2) sent2vec ("S2V"; Pagliardini et al., 2018); and (3) BERT (Devlin et al., 2019). In this set of experiments, we train both S2V and BERT from scratch on the SHINRA training data only. We simply use the configuration suggested by the developers, without any task-specific hyperparameter tuning. For BERT, we use the [CLS] token as the document representation. For each document, an image representation is generated for each of the (up to) four images. Specifically, following standard practice in the computer vision community, we first use the SIFT algorithm (Lowe, 1999) to extract hundreds of features, then use the K-means algorithm to cluster these features and generate frequency histograms, the so-called visual bag-of-words (VBoW), and finally use an SVM to classify these histogram features. We also experiment with Faster R-CNN (Ren et al., 2015), pre-trained on Visual Genome (Krishna et al., 2017), following the settings of Anderson et al. (2018). We ensure that the dimensionality of the input features for the SVM and Faster R-CNN is the same (both are 1,024), to remove this possible representational confound. Note that this is the externally pre-trained image model used across all experiments, and that none of the text models in this first set of experiments involves pre-training on external resources (something we return to in Section 4).
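The histogram step of the visual bag-of-words pipeline can be illustrated with a small sketch. It assumes that the local descriptors (SIFT in the paper) and the cluster centroids (K-means in the paper) have already been computed, and uses toy 2-dimensional vectors in their place. Normalising the counts to frequencies is also an assumption; the paper does not state how its histograms are scaled.

```python
def nearest_centroid(descriptor, centroids):
    """Return the index of the centroid closest to a descriptor
    (squared Euclidean distance)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(centroids)),
               key=lambda k: sq_dist(descriptor, centroids[k]))


def vbow_histogram(descriptors, centroids):
    """Build a normalised visual-word frequency histogram for one image."""
    counts = [0] * len(centroids)
    for d in descriptors:
        counts[nearest_centroid(d, centroids)] += 1
    total = sum(counts)
    if total == 0:                      # e.g. a blank padding image
        return [0.0] * len(centroids)
    return [c / total for c in counts]


# Toy data standing in for SIFT descriptors and K-means centroids.
centroids = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
descriptors = [(0.1, -0.1), (0.9, 1.2), (1.1, 0.8), (4.7, 5.2)]
hist = vbow_histogram(descriptors, centroids)
print(hist)  # [0.25, 0.5, 0.25]
```

Each image is thereby mapped to a fixed-length vector whose dimensionality equals the number of clusters, which is what allows a linear SVM to consume it regardless of how many local features were detected.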

Results and Analysis
We report F1 scores over the test set in Table 2. The main finding is that images improve performance in all settings, for all languages and both image representations. S2V and BERT both perform worse than the simple bag of words, because of the limited training data in each case. We would, of course, expect the models to perform better with more extensive pre-training, as we return to explore in Section 4, but the focus here is on training of the textual models within the bounds of the training dataset.
Table 2: F1 score of the SVM models without external pre-training of the textual models, across the five languages. "SIFT+V" refers to the combination of SIFT and Visual Bag-of-Words features. "R-CNN" corresponds to features extracted from Faster R-CNN.
Figure 1: VL-BERT architecture applied to the SHINRA2020-ML task. The "opening text" segment is additional textual data obtained from the documents that is optional in our experimental setting.
Strikingly, the SIFT + Visual Bag-of-Words representation results in better performance than the pre-trained Faster R-CNN. A potential explanation is that Faster R-CNN is trained in a supervised way using Visual Genome (unlike the self-supervised setting of pre-trained BERT, for instance), over a set of labels that is not particularly well aligned with SHINRA (SHINRA includes many abstract classes such as RELIGION, NATIONALITY, and OFFENCE, whereas Visual Genome is focused on physical objects and attributes, and relations between objects; even among physical objects, SHINRA distinguishes between MEDICAL INSTITUTION, PUBLIC INSTITUTION, and RESEARCH INSTITUTE, most of which are represented simply as BUILDING in Visual Genome).
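For readers unfamiliar with F1 in a multi-label setting, the following sketch computes a micro-averaged F1 over sets of labels. The averaging scheme is an assumption made for illustration, since the text above does not state which variant is reported, and the label names in the toy data are illustrative only.

```python
def micro_f1(gold, pred):
    """Micro-averaged F1 over per-document label sets."""
    tp = sum(len(g & p) for g, p in zip(gold, pred))   # correctly predicted labels
    n_pred = sum(len(p) for p in pred)                 # all predicted labels
    n_gold = sum(len(g) for g in gold)                 # all gold labels
    if tp == 0:
        return 0.0
    precision = tp / n_pred
    recall = tp / n_gold
    return 2 * precision * recall / (precision + recall)


# Three toy documents; the second has two gold labels.
gold = [{"CITY"}, {"PERSON", "AUTHOR"}, {"RIVER"}]
pred = [{"CITY"}, {"PERSON"}, {"LAKE"}]
score = micro_f1(gold, pred)
print(round(score, 4))  # 0.5714
```

Here precision is 2/3 and recall is 2/4, giving F1 = 4/7.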

Adding a Pre-trained Textual Encoder
We next turn to a setting where we employ pre-trained textual models. This not only better reflects the state of the art in text classification, but also allows us to investigate the effect of images under such conditions.
Model. As the main backbone, we employ VL-BERT (Su et al., 2020), which combines textual inputs and image embeddings within a BERT-style transformer, and has been shown to perform well on multimodal tasks. The visual embeddings are obtained from the combination of pre-trained Faster R-CNN and ResNet-101 (He et al., 2016), as illustrated in Figure 1. For the text modality, the input consists of two parts: the document title, and the opening text of the Wikipedia page in the form of the first 300 tokens. The token embeddings are obtained from a pre-trained BERT model, which is fine-tuned during training. The full model is plugged into a one-layer feed-forward neural network (FFNN) with a 1,024-dimensional hidden layer, and training is performed by minimizing the cross-entropy over the SHINRA category labels. The model predicts one label for each page. For pages with multiple gold labels, we choose one at random as the "correct" label. Table 3 shows the performance of VL-BERT with different combinations of textual (document title = "T" and optionally the document body = "B") and image inputs, based on pre-trained BERT ("BERT_pre").
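The training objective described above (a single target label per page, chosen at random when several gold labels exist, and a cross-entropy loss over the category labels) can be sketched in a few lines. The class names, logits and seed below are hypothetical, and the sketch covers only the loss computation, not VL-BERT itself.

```python
import math
import random


def pick_training_label(labels, rng):
    """For a page with several gold labels, choose one at random."""
    return rng.choice(sorted(labels))   # sorted() makes the choice reproducible


def cross_entropy(logits, target_index):
    """Negative log-probability of the target class under a softmax."""
    m = max(logits)                     # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target_index]


rng = random.Random(0)                  # seed is an assumption
classes = ["CITY", "PERSON", "RIVER"]   # illustrative labels, not the 219 SHINRA classes
label = pick_training_label({"CITY", "RIVER"}, rng)
logits = [2.0, 0.5, 1.0]                # hypothetical classifier outputs
loss = cross_entropy(logits, classes.index(label))
print(label, round(loss, 4))
```

One consequence of this design is that a multi-label page contributes a single training signal per pass, so the classifier is never rewarded for predicting the other valid labels.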

Results and Analysis
The first thing to notice is that the image-only model is well above the majority baseline, but well below the best multimodal model without an externally pre-trained text encoder from Table 2. This shows that images provide useful information for document classification, consistent with the earlier finding that images enhance the various text-only models. However, when combined with the externally pre-trained BERT_pre (over either the title only, or the title + document body), the utility of images is marginal at best. That is, the large-scale pre-training of BERT_pre not only boosts overall performance but, much more surprisingly, removes any advantage from including images.
Table 3: F1 scores for pre-trained VL-BERT. "T" = document title, and "T+B" = document title + body. For comparison, we restate the result for the best non-pre-trained model from Table 2, along with the majority class baseline.
Influence of the size of training data. One hypothesis is that images are not useful due to the size of the training data (24k instances), and that they will improve performance in lower-resource scenarios. To test this, we perform additional experiments varying the training data size, ranging from 4k to 24k training instances, in steps of 4k. Figure 2 plots the F1 performance as the training set size increases. While we observe substantial improvements for the image-only approach (the bottom curve), the differences in the models with textual data are modest, and even in small-data settings, there is no real advantage in including images. We also separated the test data in terms of the number of images, and found no differences. See the Supplementary Material for details.
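The training-size sweep can be sketched as below. Taking each smaller training set as a prefix of the next, so that the subsets are nested, is an assumption made here for illustration; the text does not say how the subsets were drawn.

```python
def training_sizes(start=4_000, stop=24_000, step=4_000):
    """Training-set sizes from 4k to 24k in steps of 4k."""
    return list(range(start, stop + 1, step))


def nested_subsets(train_ids, sizes):
    """Map each size to a prefix of the training ids (nested subsets)."""
    return {n: train_ids[:n] for n in sizes}


sizes = training_sizes()
subsets = nested_subsets(list(range(24_000)), sizes)   # toy document ids
print(sizes)  # [4000, 8000, 12000, 16000, 20000, 24000]
print([len(subsets[n]) for n in sizes])
```

Nested subsets keep the comparison between sizes from being confounded by which documents happen to be sampled at each step.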
Results on the full SHINRA dataset. In the previous experiments, we fixed the dataset size for all languages to control for training data volume. However, the SHINRA dataset includes many more documents for many of the languages. As a final experiment, we apply the VL-BERT models to the full dataset available for each language. The development and test data are also different in this configuration, so the results are not directly comparable with Tables 2 and 3.
In Table 4, we present results for BERT_pre, which mostly corroborate our earlier findings: while we do see improvements when including images in the case of the titles only, their utility decreases when we add the body of text for each document.
What caused the difference? Comparing the results from Sections 3 and 4, we see two main differences: the presence of external pre-training (BERT vs. BERT_pre), and the model architecture. To determine whether the model architecture is a cause of the performance difference, we train VL-BERT from scratch, using only text and images from the 24k training set used in Section 3.
The results in Table 5 show that even for VL-BERT, a neural model that is much more complex than the linear-kernel SVM, images provide a gain in performance when BERT_pre is not used. Hence, having an externally pre-trained text encoder is the predominant determinant of whether visual content has utility in NLP tasks.

Discussion and Conclusion
We investigated the utility of images as a supplementary input for a text classification task, and found that although images have empirical utility in traditional supervised learning, when externally pre-trained language models are utilised, any advantage from the visual modality disappears. The results were remarkably consistent across different languages and different volumes of training data.
It is important to distinguish between "inherently multi-modal tasks" (e.g. VQA) and "potentially multi-modal tasks" (e.g. text classification) in drawing any broader conclusions about the (in)effectiveness of images. Here, a "potentially multi-modal task" in NLP means that the primary modality is text and the task is defined based on that single data modality, but there is potentially the option to include extra modalities such as images.
There remain many open questions in more fully determining the (in)effectiveness of images for NLP tasks, even for text classification, such as:
• Given the seeming redundancy between textual and visual representations of Wikipedia pages, is there any utility in multi-modal inputs for simple NLP tasks such as text classification in the era of large-scale pre-trained language models such as BERT and GPT-3 (Brown et al., 2020)?
• What performance do humans achieve in the single-modal and multi-modal settings? Can we gain insights by comparing the (potentially) different performance of humans and computers?
• Apart from images, what other modalities and forms of input (e.g. audio) could be effective in building better NLP models?
• Although pre-trained image models (e.g. Faster R-CNN) contribute a lot to vision tasks (e.g. object detection) and multi-modal tasks (e.g. VQA), for "pure" NLP tasks (e.g. text classification), they appear to work no better than traditional image feature extractors (e.g. SIFT). Why?
• In our experiments, we use at most 4 images for each page. Could instance selection enhance image utility?
• We focused on the text classification task, in classifying Wikipedia pages into different entity classes. Are our observations NLP task-independent?