Table and Image Generation for Investigating Knowledge of Entities in Pre-trained Vision and Language Models

In this paper, we propose a table and image generation task to verify how knowledge about entities acquired from natural language is retained in Vision & Language (V&L) models. This task consists of two parts: the first is to generate a table containing knowledge about an entity from its related image, and the second is to generate an image from an entity with a caption and a table containing related knowledge of the entity. In both tasks, the model must know the entities to perform the generation properly. To perform the proposed tasks, we created the Wikipedia Table and Image Generation (WikiTIG) dataset from about 200,000 infoboxes in English Wikipedia articles. We evaluated performance on the tasks with respect to the above research question using the V&L model OFA, which has achieved state-of-the-art results in multiple tasks. Experimental results show that OFA forgets part of its entity knowledge during pre-training, even as that pre-training improves performance on image-related tasks.


Introduction
Vision & Language (V&L), the fusion of vision and language tasks, has achieved great success in tasks such as caption generation from images (Xu et al., 2015) and image generation from texts (Reed et al., 2016). This progress has been driven by pre-trained V&L models that are trained on large-scale V&L datasets (Du et al., 2022). To generate appropriate captions and images for an input, pre-trained V&L models need prior knowledge of the features of the objects they generate (Cao et al., 2020; Yun et al., 2021). In particular, these models retain knowledge about entities by inheriting parameters from pre-trained language models used in natural language processing, thereby indirectly utilizing data resources such as Wikipedia.
In this way, V&L models (Lu et al., 2019; Su et al., 2020; Li et al., 2020; Cho et al., 2021; Wang et al., 2022) inherit knowledge acquired from natural language. This learning process raises a number of questions, such as whether the knowledge about entities acquired from natural language is adequately retained in the pre-trained V&L model, or whether it is enhanced by being combined with image features. These questions are important for understanding the limits of what a pre-trained V&L model can generate.
To answer these questions, we propose a task of generating the tables and images of infoboxes in English Wikipedia. Figure 1 shows an example of a target infobox, from which either the table or the image is generated in the proposed task. In both cases, the model must know the entities to generate them properly.
We collected about 200,000 infoboxes to construct the Wikipedia Table and Image Generation (WikiTIG) dataset necessary to perform the proposed task. In addition, we used OFA (Wang et al., 2022), a pre-trained V&L model that has achieved state-of-the-art performance in various V&L tasks.
Our evaluation of the table generation revealed that part of the knowledge in the V&L model acquired from natural language is lost when the model is pre-trained on V&L data. We also found that additional knowledge about entities, which could not be acquired from textual data alone, is acquired by supplementing with image information.
In image generation, we found that OFA can generate more accurate images by using the knowledge expressed in the table. We also found that models trained only on natural language can infer table knowledge, which increases the diversity of the generated images. Our code and dataset will be released at https://github.com/kamigaito/WikiTIG.

Vision & Language Models
Many pre-trained V&L models have achieved state-of-the-art performance on various tasks by inheriting the weights of conventional pre-trained models for natural language and images (Lu et al., 2019; Su et al., 2020; Li et al., 2020; Cho et al., 2021; Wang et al., 2022; Saharia et al., 2022) before learning from V&L datasets. Our study examines how the knowledge represented in a pre-trained model for natural language is transformed through such a learning process. We select OFA, which has achieved state-of-the-art performance in multiple V&L tasks, as our target model.
Figure 2 shows the network structure of OFA and its relation to each dataset (Appendix A describes the data used for pre-training). OFA uses VQGAN (Esser et al., 2020) on the decoder to transform images into discrete sequences so that the same Transformer (Vaswani et al., 2017) can be used for both image and natural language generation. Because OFA inherits parameters from BART (Lewis et al., 2020), which shares a similar Transformer structure, OFA should include knowledge acquired from natural language such as Wikipedia articles. Unlike the decoder, the encoder handles images directly; thus, OFA uses the output of ResNet (He et al., 2016) to embed images in addition to the embedding layer inherited from BART.

Table and Image Generation
In this section, we describe two tasks for verifying knowledge behavior in the V&L model: table generation and image generation. Both tasks are based on infoboxes in Wikipedia articles, which provide summary information for the articles and comprise tables and images. Thus, they are suitable for verifying the knowledge about entities in Wikipedia kept in the pre-trained V&L model.
In the following subsections, we explain the details of each task.

Table Generation
In the table generation task, the target V&L model generates a table from the title and/or image of an infobox. To do this, the model generates linearized tables, similarly to table generation from descriptions (Wu et al., 2022b). In our setting, we linearize tables as shown in Figure 3 using the column separator "|" and the row separator "<>" so as to reuse pre-trained token embeddings. The separator symbols are surrounded by spaces for use in BPE tokenization. We investigate the target model by directly generating such linearized text, under the following settings.
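As a concrete sketch, the linearization can be implemented as follows (the function name and table representation are our own, not taken from the released code):

```python
def linearize_table(rows):
    """Linearize an infobox table: join the cells of each row with the
    column separator "|" and the rows with the row separator "<>",
    keeping a space on each side of the separators so that BPE
    tokenization treats them as standalone tokens."""
    return " <> ".join(" | ".join(cells) for cells in rows)


# A fragment of the table in Figure 3:
rows = [["Course", "Main dish"], ["Place of origin", "England"]]
print(linearize_table(rows))
# Course | Main dish <> Place of origin | England
```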
Generation from titles We investigate the knowledge about entities held by V&L models by comparing tables generated from titles by pre-trained V&L models with those generated by pre-trained models trained only on natural language.
Generation from titles and images We generate tables from titles together with images and compare the results with those generated from titles only. This enables us to investigate the new knowledge transferred from images to pre-trained V&L models.
Metrics For comparison, we use the following evaluation metrics to measure how close the generated tables are to the actual ones.
-ROUGE: Since linearized tables are text data and the infobox plays the role of summarizing the article, we use ROUGE (Lin, 2004), the most widely used evaluation metric for automatic summarization. In our evaluation with ROUGE, we convert the column separator "|" and the row separator "<>" to spaces so that the matched string sequences are not restricted by rows and columns.
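The separator removal applied before ROUGE scoring can be sketched as follows (a minimal illustration; the helper name is ours):

```python
import re


def strip_separators(linearized):
    """Replace the column separator "|" and the row separator "<>" with
    spaces and collapse the whitespace, so that ROUGE matches are not
    restricted by row and column boundaries."""
    text = linearized.replace("<>", " ").replace("|", " ")
    return re.sub(r"\s+", " ", text).strip()


print(strip_separators("Course | Main dish <> Place of origin | England"))
# Course Main dish Place of origin England
```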
-Table-F1: To evaluate the tables with respect to their structure, we divide the cells by their types, evaluate the matches with the reference table in terms of the F1 measure for each type, and average them. When calculating the matches, we apply the clipping used in ROUGE to prevent the score from increasing due to repetition of the same cell in the output. We treat the cells of each type separately as follows:
• Group: The infobox sometimes divides the table into groups, with the first row of each group serving as a header for the group name. The prediction performance for group names is important for verifying what aspects of knowledge the model has about the entities. Since these rows consist of a single column, we assign rows consisting of a single column to this cell type.
• Header: The head of each row consisting of more than one column is usually the header of the subsequent cell in the same row. Therefore, the prediction performance for headers is important for the same reason as for group names.
• Value: The remaining cells contain the values corresponding to the headers. Therefore, the prediction performance for values is important for knowing whether the model has detailed knowledge about the entity. To examine the correspondence between headers and their values, we treat a header and its corresponding value as a pair.
-Corpus-F1: Because the above Table-F1 is calculated for each table separately, it cannot capture the diversity of knowledge across the whole corpus; Corpus-F1 instead counts matches over all generated and reference tables at once (see Appendix B.2).

Image Generation
In the image generation task, the model receives a title, caption, and table to generate the corresponding image. We use the following settings.
Generation from a title and caption By using the minimum input required to generate images, we investigate the difficulty of generating them compared to other datasets.
Generation from a title, caption, and table We investigate the impact of knowledge about entities on image generation by generating images from input, including tables, and compare the results to the setting without tables.
Metrics We use the following three widely used measures for evaluating image generation.
-CLIP: The relevance of the input text to the generated images inferred by the pre-trained V&L model CLIP (Radford et al., 2021).
-Inception Score (IS): Measures how easily a classifier can distinguish each generated image and how varied the generated images are (Salimans et al., 2016). It is computed with the pre-trained image classification model Inception-v3 (Szegedy et al., 2016).
-Fréchet Inception Distance (FID): Measures how close the generated images are to the reference images, estimated with Inception-v3 as for IS. A lower FID is more ideal.

Dataset Creation
We created the Wikipedia Table and Image Generation (WikiTIG) dataset by extracting infoboxes from the HTML dump data of the English Wikipedia. To ensure consistency in the format of infoboxes, we limited extraction to those containing a title in the first row and an image in the second row, as shown in Figure 1.
To use only entities with sufficient information, we targeted entities whose tables were not empty. In addition, to ensure reliable correspondence, only rows one column wide, which often describe groups, and rows two columns wide, which often consist of a header and its value, were targeted for extraction.
The target images are limited to the jpeg, png, and gif formats. Since some captions do not include the title, in such cases we joined the title to the beginning of the caption with a hyphen.
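This joining step can be sketched as follows (the containment check used to decide whether a caption already includes the title is a hypothetical heuristic of ours; the paper does not specify it):

```python
def build_caption(title, caption):
    """Prepend the title with a hyphen when the caption does not
    already mention it (the containment check is an assumption)."""
    if title.lower() in caption.lower():
        return caption
    return f"{title} - {caption}"


print(build_caption("May Lake", "A small alpine lake"))
# May Lake - A small alpine lake
```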
Table 2 shows the size of each dataset. The dataset sizes differ between the two tasks because some infoboxes do not include captions.

Table Generation
Settings We chose OFA (Wang et al., 2022), a pre-trained V&L model, and BART (Lewis et al., 2020), pre-trained only on natural language, as the models for comparison. For both models, we used the base settings with the hyperparameters reported by Wang et al. (2022). We performed training three times with different seeds and report the average scores with their standard deviations.
Results Table 3 shows the results for each setting in the table generation. When only the title is used as input, BART is more accurate than OFA, indicating that part of the knowledge acquired from natural language is lost through the additional training of the V&L model. The use of image information improves Table-F1 for headers, indicating that images reinforce knowledge of what kinds of features an entity has.
In contrast, F1 for cell values did not improve, indicating that information obtained from images does not complement detailed knowledge obtained from natural language, such as the values corresponding to each header.
The BART results for Corpus-F1 also suggest that BART internally contains more diverse knowledge than the other settings. This result reinforces the finding that the V&L model forgot part of its knowledge from natural language through additional training, and that images could not fully compensate for it.

Image Generation
Settings As in the table generation, we chose OFA for the comparison. To investigate the impact of the ability to infer table knowledge, we additionally join the reference tables (Gold) or those generated by the models in §5.1 (OFA, BART) to the input. We again used the base settings with the hyperparameters reported by Wang et al. (2022), performed training three times with different seeds, and report the average scores with their standard deviations.
Results Table 4 shows the results for each setting in the image generation. Since the CLIP value of OFA is close to its result (Wang et al., 2022) on MS COCO (Chen et al., 2015) for image generation, training models on our created dataset is reasonable. In addition, the input of Table (Gold) improves all metrics, indicating that the model produces higher-quality images when provided with complementary knowledge about the entities. This result also indicates that OFA does not retain sufficient knowledge of the entities in English Wikipedia.
In addition, we did not observe any performance improvement in CLIP and FID when the model was fed automatically generated tables from BART or OFA. However, tables generated by BART improve IS with less degradation of FID than those generated by OFA, indicating that automatically generated tables can improve the diversity of the output images, and that accurate tables are more important for improving image generation performance.

Related Work
Following the advancements in V&L models (Du et al., 2022), various studies have investigated V&L models. Cao et al. (2020) conducted a comprehensive analysis of V&L models, including the differences between model structures. Through their analysis, they revealed that text information is more important than image information in V&L tasks.
Several studies have focused on the performance differences between V&L models and text-only models. Yun et al. (2021) investigated the improvement of linguistic representations from pre-training V&L models on PhysicalQA (PIQA) (Bisk et al., 2020) with the probing framework of Tenney et al. (2019). They concluded that the benefit of pre-trained V&L models for text-only tasks is marginal. Iki and Aizawa (2021) and Hagström and Johansson (2022) compared the performance of V&L models and text-only models on the text-only benchmark GLUE (Wang et al., 2018) and found that the text-only model achieved higher scores than the V&L models.
However, even though various V&L models (Lu et al., 2019; Su et al., 2020; Li et al., 2020; Cho et al., 2021; Wang et al., 2022; Saharia et al., 2022) inherit language-related knowledge from pre-trained language-only models, how that knowledge is inherited has yet to be investigated. Our work clarifies this by using our created dataset, Wikipedia Table and Image Generation (WikiTIG).

Conclusion
This paper investigates how knowledge about entities, originally transferred from a pre-trained natural language model, is preserved in a pre-trained V&L model.
We analyzed a pre-trained V&L model by creating the Wikipedia Table and Image Generation (WikiTIG) dataset for generating the images and tables of infoboxes in Wikipedia. WikiTIG consists of about 200,000 infoboxes and their corresponding images from English Wikipedia.
Experimental results on a pre-trained V&L model OFA (Wang et al., 2022) showed that the model forgot part of the knowledge about entities during pre-training, and the image information did not fully compensate for the forgotten knowledge.

Limitations
Regarding the Wikipedia articles used for creating our dataset, Wikipedia Table and Image Generation (WikiTIG), some infoboxes may not follow the defined format and rules, because various users can freely edit infoboxes. Moreover, the HTML dump data published by English Wikipedia is not based on the most recent information.
In image generation, due to the standard settings recommended by Zhang et al. (2021); Ramesh et al. (2021); Wang et al. (2022); Wu et al. (2022a), our image generation task requires generating a cropped fixed-size square image instead of one with the original aspect ratio. In addition, a table in an infobox may contain cells unrelated to image generation and thus may be partly redundant for the task.

Ethical Considerations
In this study, we created our dataset from English Wikipedia.
The editors of English Wikipedia remove unnecessarily offensive content and compile the articles into an encyclopedia (https://en.wikipedia.org/wiki/Wikipedia:Offensive_material).
However, as stated on the official pages (https://en.wikipedia.org/wiki/Wikipedia:Neutral_point_of_view#Bias_in_sources, https://en.wikipedia.org/wiki/Wikipedia:Reliable_sources#Biased_or_opinionated_sources), the current English Wikipedia permits the use of biased information sources. Thus, our created dataset may also inherit the original biases of English Wikipedia.

A Details of the datasets for pre-training OFA
OFA uses various datasets for pre-training tasks in the language, vision, and vision & language modalities, as shown in Table 5. Note that 1.53% of Pile (Gao et al., 2021), listed in Table 5, contains information from English Wikipedia. Therefore, although OFA's pre-training focuses on V&L tasks, it is also designed to prevent the knowledge acquired from natural language data from being forgotten.

B Details of the metric calculation

B.1 Table-F1
Let e be an element of a target cell type. We define a function Match_{r,g}(e) that calculates the clipped exact match of elements between a reference table r and a generated table g as follows:

Match_{r,g}(e) = Min(Count_g(e), Count_r(e)),

where Count_g(e) and Count_r(e) return the frequencies of e in the generated table g and the reference table r, respectively, and Min returns the minimum of its arguments. Using Match_{r,g}(e), we compute the precision P_{r,g} = Σ_e Match_{r,g}(e) / Σ_e Count_g(e) and the recall R_{r,g} = Σ_e Match_{r,g}(e) / Σ_e Count_r(e), and calculate Table-F1 as the average of the per-table F1 scores:

Table-F1 = (1 / |D|) Σ_{(g,r) ∈ (G,R)} 2 P_{r,g} R_{r,g} / (P_{r,g} + R_{r,g}),

where |D| denotes the number of tables, G denotes all generated tables, and R denotes all reference tables.
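As an executable sketch of this metric (our own reconstruction from the description, not the authors' code; the cells of one type are given as lists):

```python
from collections import Counter


def clipped_match(ref_cells, gen_cells):
    """Sum of Match_{r,g}(e) over all elements e: each element
    contributes min(frequency in generated, frequency in reference),
    i.e., the clipping used in ROUGE."""
    ref_counts, gen_counts = Counter(ref_cells), Counter(gen_cells)
    return sum(min(gen_counts[e], ref_counts[e]) for e in gen_counts)


def table_f1(ref_tables, gen_tables):
    """Average the per-table F1 of clipped matches over |D| pairs of
    (reference, generated) tables."""
    scores = []
    for ref, gen in zip(ref_tables, gen_tables):
        m = clipped_match(ref, gen)
        p = m / len(gen) if gen else 0.0
        r = m / len(ref) if ref else 0.0
        scores.append(2 * p * r / (p + r) if p + r else 0.0)
    return sum(scores) / len(scores)
```

For example, a generated table sharing one of two cells with the reference yields precision and recall of 0.5, hence an F1 of 0.5; repeating a correct cell does not raise the match count beyond its reference frequency.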

B.2 Corpus-F1
Instead of Match_{r,g}(e), we define Match_{R,G}(e) as follows:

Match_{R,G}(e) = Min(Count_G(e), Count_R(e)),

where Count_G(e) and Count_R(e) return the frequencies of e in all generated tables G and all reference tables R, respectively. Using Match_{R,G}(e), we calculate Corpus-F1 in the same way as Table-F1, but with precision and recall computed once over the whole corpus rather than per table.

C Groups/Headers/Values in an infobox

Figure 4 shows an example infobox that includes multiple groups. In this example, we can see two groups named "Highest point" and "Naming". The headers "Elevation", "Prominence", "Isolation", "Listing", and "Coordinates" belong to the group "Highest point". The headers "Etymology", "Native name", and "English translation" belong to the group "Naming". Each header has a corresponding value, such as the value "Holy Mother" for the header "English translation". In the evaluation, we treat values as pairs with their corresponding headers, e.g., ("English translation", "Holy Mother") for the last row of the infobox in Figure 4.

D Details of our created dataset
The Wikipedia HTML dump data contains Wikipedia articles in HTML format, so we extracted infoboxes using BeautifulSoup. Since the infoboxes contain links to the references of the main article in the form [#number], we removed them. We also filtered out table rows with more than two columns.
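A minimal extraction sketch (the "infobox" class name and the exact reference-link pattern are assumptions about the dump's markup, and the function name is ours):

```python
import re

from bs4 import BeautifulSoup


def extract_infobox_rows(html):
    """Return the cell texts of each infobox row, dropping reference
    links such as [1] / [#1] and rows wider than two columns."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="infobox")
    rows = []
    for tr in table.find_all("tr"):
        cells = [re.sub(r"\[#?\d+\]", "", cell.get_text(" ", strip=True))
                 for cell in tr.find_all(["th", "td"])]
        if 1 <= len(cells) <= 2:
            rows.append(cells)
    return rows


html = ('<table class="infobox">'
        '<tr><th>Course</th><td>Main dish[1]</td></tr>'
        '<tr><td>Group row</td></tr>'
        '<tr><td>a</td><td>b</td><td>c</td></tr>'
        '</table>')
print(extract_infobox_rows(html))
# [['Course', 'Main dish'], ['Group row']]
```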
In table generation, if the short side of the input image exceeded 480px, we reduced the short side to 480px while maintaining the aspect ratio. In image generation, we resized the short side of the original image to 256px while maintaining the aspect ratio and then center-cropped the image to a 256px square.
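The two resizing schemes can be sketched with Pillow as follows (the function names are ours):

```python
from PIL import Image


def cap_short_side(img, max_short=480):
    """Table generation: shrink only if the short side exceeds
    max_short, keeping the aspect ratio."""
    w, h = img.size
    if min(w, h) <= max_short:
        return img
    scale = max_short / min(w, h)
    return img.resize((round(w * scale), round(h * scale)))


def resize_and_center_crop(img, size=256):
    """Image generation: resize the short side to `size` px, keeping
    the aspect ratio, then center-crop a `size` x `size` square."""
    w, h = img.size
    scale = size / min(w, h)
    img = img.resize((round(w * scale), round(h * scale)))
    w, h = img.size
    left, top = (w - size) // 2, (h - size) // 2
    return img.crop((left, top, left + size, top + size))


print(resize_and_center_crop(Image.new("RGB", (512, 1024))).size)
# (256, 256)
```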
To measure the performance of both small and large models in the future, we also created additional datasets for the table generation with the short side of the image up to 256px and 384px, respectively. Similarly, we also created a dataset for image generation with both sides of the image set to 128px.
For the sake of future expansion and to avoid mixing data between splits, we divided the collected data by the SHA256 value of the title: an instance goes to the test data if the value modulo 20 is 0, to the development data if it is 1, and to the training data otherwise. See Table 2 for the size of the dataset.
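A deterministic split of this kind can be sketched as follows (how the SHA256 digest is turned into an integer, including the text encoding, is our assumption; the paper does not specify it):

```python
import hashlib


def data_split(title):
    """Assign an instance to a split by SHA256(title) mod 20:
    0 -> test, 1 -> dev, otherwise -> train."""
    digest = hashlib.sha256(title.encode("utf-8")).hexdigest()
    remainder = int(digest, 16) % 20
    if remainder == 0:
        return "test"
    if remainder == 1:
        return "dev"
    return "train"


# The assignment is stable across runs, so the split of an article
# never changes when new articles are added to the dataset.
print(data_split("Fish and chips"))
```

Hashing the title rather than drawing random numbers makes the split reproducible without storing an index file.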
Table 6 shows the frequencies of each type of cell used for F1 in §3.1. The result indicates that all cell types occur with high frequency.
Table 7 shows the statistics of the frequencies of values for each header. Note that in Table 7, unlike the F1 calculation in §3.1, we do not take groups into account. From the table, we can see that the frequencies of values for each header have large variances.
Table 8 shows the statistics for the number of cells in each table. The result indicates that the tables in infoboxes have widely varying numbers of cells.
Taking these results into account, predicting cells with only a label classification setting would be difficult due to the diverse characteristics of the infobox tables.
To strictly comply with the license, we will release only the text data to the public. For images, we will provide their URLs and preprocessing scripts for reproducing our dataset.

E Details of experimental settings
For both tasks, we modified the publicly available implementation by the authors of OFA. Since the released OFA uses the number of words after splitting by spaces to determine the maximum token length, we modified OFA to specify the maximum token length in subwords, in the same way as BART. We set the maximum length of the input and output in table and image generation to 1024 subwords. In addition, from the perspective of investigating the characteristics of the model and dataset, we used only maximum likelihood estimation for training and did not perform reinforcement learning. We ran the training of each model three times with seeds 0, 1, and 2.

E.1 Table Generation
To avoid an unfair comparison of BART and OFA due to different implementations, we transferred BART's weight parameters to OFA and ran BART on the OFA implementation. We used OFA's summarization hyperparameters for generation from titles and OFA's captioning hyperparameters for generation from images. For a fair comparison, we used the captioning settings for all inference. When the input includes a title, we used the prompt What is the infobox of " {ENTITY_NAME} "?. When the input includes only an image, we used the prompt What is the infobox of the image?. We performed the text-only experiments with four RTX 3090s in one day and the image-included experiments with four RTX A6000s in one day.

E.2 Image Generation
We basically inherited the hyperparameters used in OFA, but to reduce training time, we set the beam size to 1 when generating images for the development data after each epoch of training. We used a beam size of 24 for testing, the same as in the original setting. We used the prompt What is the complete image? Caption: {CAPTION} to generate images. When using tables, we appended the table to the original input with the delimiter <>. We performed each experiment with four RTX A6000s in two days.

F Generated examples
F.1 Tables

Table 9 shows generated tables for the test data. In the first row, regarding "Low Pike", BART generated a table for the mountain, whereas OFA generated a table for a city in the United Kingdom. This result is consistent with the automatic evaluation, in which BART's prediction performance for values is better than that of the other methods. However, even BART did not specify the detailed location of the mountain. This result indicates the difficulty of storing large amounts of geographic information in a pre-trained model.
In the second row, regarding "Ferruginous Pygmy-owl", BART wrongly recognized it as a bunting ("Emberizidae"), at least a bird, whereas OFA wrongly recognized it as a pterosaur ("Pterodactylidae"). Thus, this is a case in which the forgotten knowledge about the entity was not complemented by the image.
In the third row, regarding "Achlys (plant)", both models recognized it as a plant ("Plantae"), and OFA precisely predicted its division as "Magnoliopsida" from the image. However, neither model could predict further details. This result indicates the difficulty of identifying plants among diverse species.
In the fourth row, regarding "Giant's Castle", BART wrongly recognized it as a video game because of its misleading name, whereas OFA at least recognized it as a building in New York. This is a case in which the image supports table generation by complementing knowledge about the entity. However, this support is not enough to generate precise information.

F.2 Images
Table 10 shows generated images for the test data. In the first row, regarding "Upper Lake (Bhopal)", both settings generated images consistent with the caption. Since such landscape photographs do not require the depiction of details, images can be generated without detailed knowledge.
In the second row, regarding "May Lake", only w/ Tab. generated a lake with a mountain, corresponding to the information in the table showing that the lake is at a high altitude. This result indicates that table information can support generating images based on correct knowledge.
In the third row, regarding "Littoral Rockthrush", both w/ Tab. and w/o Tab. struggled to generate bird images. However, even in this difficult case, w/ Tab. generated a more precise image than w/o Tab. by using the table information. This result is consistent with our automatic evaluation results showing that table information can improve image generation performance.
In the fourth row, regarding "Gießen (region)", we can see that a table alone is insufficient for generating precise images of geographic information.
The fifth row, regarding "Giant's Castle", which is a mountain, shows an interesting result. Both w/o Tab. and w/ Tab. wrongly generated large castles due to the misleading name "Giant's Castle". Furthermore, w/ Tab. generated a large castle that looks like a mountain, based on the 3,315 meters given in the table. This result indicates a limit to disambiguation based solely on the table.

Figure 1 :
Figure 1: An infobox of a Wikipedia article. In this study, we validate the V&L model by generating the images and tables of infoboxes.

Figure 2 :
Figure 2: The learning process of OFA. We investigate how OFA retains knowledge about entities acquired from pre-training on Wikipedia articles.

Task | Input | Output
Table Generation | Title and/or Image | Table
Image Generation | Title, Caption, (Table) | Image

Table 1 :
Outline of each task.See Figure1for the parts of the infobox to which each term refers.
Alternative names | Fish supper / Fish 'n' chips <> Course | Main dish <> Place of origin | England <> Region or state | Northwestern Europe <> Serving temperature | Hot <> Main ingredients | Battered and fried fish with deep-fried chips

Figure 3: A linearized version of the table in Figure 1.

Table 2 :
The data size for each task in the WikiTIG dataset.

Table 3 :
Table generation results. Bold font denotes the highest score, and ↑ denotes that a higher score is more optimal. ± denotes the standard deviation of the score. Both means that the input contains both a title and an image. Underline indicates that the score is statistically significantly higher than the second-highest one (p < 0.05).

Table 4 :
Image generation results. ↓ denotes that a lower score is more optimal. + denotes input used in addition to the title and caption. The parenthesis denotes the origin of the table. Other notations are the same as in Table 3.

Table 5 :
Datasets used for pre-training OFA.

Table 6 :
Frequencies for each type of cell in each data split.

Table 7 :
Statistics of the frequencies of values for each header. Std. denotes standard deviation; Max and Min denote the maximum and minimum frequencies, respectively.

Table 8 :
Statistics for the number of cells in tables. The notations are the same as in Table 7.

Table 9 :
Tables generated by BART and OFA with title and image input and those of references.