Self-Supervised Multimodal Opinion Summarization

Recently, opinion summarization, which is the generation of a summary from multiple reviews, has been conducted in a self-supervised manner by considering a sampled review as a pseudo summary. However, non-text data such as image and metadata related to reviews have been considered less often. To use the abundant information contained in non-text data, we propose a self-supervised multimodal opinion summarization framework called MultimodalSum. Our framework obtains a representation of each modality using a separate encoder for each modality, and the text decoder generates a summary. To resolve the inherent heterogeneity of multimodal data, we propose a multimodal training pipeline. We first pretrain the text encoder–decoder based solely on text modality data. Subsequently, we pretrain the non-text modality encoders by considering the pretrained text decoder as a pivot for the homogeneous representation of multimodal data. Finally, to fuse multimodal representations, we train the entire framework in an end-to-end manner. We demonstrate the superiority of MultimodalSum by conducting experiments on Yelp and Amazon datasets.


Introduction
Opinion summarization is the task of automatically generating summaries from multiple documents containing users' thoughts on businesses or products. This summarization of users' opinions can provide information that helps other users with their decision-making on consumption. Unlike conventional single-document or multiple-document summarization, where annotated summaries are widely available (Nallapati et al., 2016; See et al., 2017; Paulus et al., 2018; Liu et al., 2018; Liu and Lapata, 2019; Perez-Beltrachini et al., 2019), opinion summarization is challenging because summarized opinions of users are difficult to find. Accordingly, studies used an unsupervised approach for opinion summarization (Ku et al., 2006; Paul et al., 2010; Carenini et al., 2013; Ganesan et al., 2010; Gerani et al., 2014). Recent studies (Bražinskas and Titov, 2020; Amplayo and Lapata, 2020; Elsahar et al., 2021) used a self-supervised learning framework that creates a synthetic pair of source reviews and a pseudo summary by sampling a review text from a training corpus and considering it as a pseudo summary, as in Figure 1a.
Figure 1: Self-supervised opinion summarization frameworks. (a) Unimodal framework. (b) Multimodal framework.
Users' opinions are based on their perception of a specific entity, and perceptions originate from various characteristics of the entity; therefore, opinion summarization can use such characteristics. For instance, Yelp provides users with food or menu images and various metadata about restaurants, as in Figure 1b. This non-text information influences the review text generation process of users (Truong and Lauw, 2019). Therefore, using this additional information can help in opinion summarization, especially under unsupervised settings (Su et al., 2019; Huang et al., 2020). Furthermore, the training process of generating a review text (a pseudo summary) based on the images and metadata for self-supervised learning is consistent with the actual process of writing a review text by a user.
This study proposes a self-supervised multimodal opinion summarization framework called MultimodalSum by extending the existing self-supervised opinion summarization framework, as shown in Figure 1. Our framework receives source reviews, images, and a table on the specific business or product as input and generates a pseudo summary as output. Note that images and the table are not aligned with an individual review in the framework, but they correspond to the specific entity. We adopt the encoder-decoder framework and build multiple encoders representing each input modality. However, a fundamental challenge lies in the heterogeneous data of various modalities (Baltrušaitis et al., 2018).
To address this challenge, we propose a multimodal training pipeline. The pipeline regards the text modality as a pivot modality. Therefore, we pretrain the text modality encoder and decoder for a specific business or product via the self-supervised opinion summarization framework. Subsequently, we pretrain modality encoders for images and a table to generate review texts belonging to the same business or product using the pretrained text decoder. When pretraining the non-text modality encoders, the pretrained text decoder is frozen so that the image and table modality encoders obtain homogeneous representations with the pretrained text encoder. Finally, after pretraining input modalities, we train the entire model in an end-to-end manner to combine multimodal information.
Our contributions can be summarized as follows:
• this study is the first work on self-supervised multimodal opinion summarization;
• we propose a multimodal training pipeline to resolve the heterogeneity between input modalities;
• we verify the effectiveness of our model framework and model training pipeline through various experiments on Yelp and Amazon datasets.

Related Work
Generally, opinion summarization has been conducted in an unsupervised manner, which can be divided into extractive and abstractive approaches. The extractive approach selects the most meaningful texts from input opinion documents, and the abstractive approach generates summarized texts that do not appear in the input documents. Most previous works on unsupervised opinion summarization have focused on extractive approaches. Clustering-based approaches (Carenini et al., 2006; Ku et al., 2006; Paul et al., 2010; Angelidis and Lapata, 2018) were used to cluster opinions regarding the same aspect and extract the text representing each cluster. Graph-based approaches (Erkan and Radev, 2004; Mihalcea and Tarau, 2004; Zheng and Lapata, 2019) were used to construct a graph, in which nodes were sentences and edges were similarities between sentences, and to extract the sentences based on their centrality. Although some abstractive approaches were not based on neural networks (Ganesan et al., 2010; Gerani et al., 2014; Di Fabbrizio et al., 2014), neural network-based approaches have been gaining attention recently. Chu and Liu (2019) generated an abstractive summary from a denoising autoencoder-based model. More recent abstractive approaches have focused on self-supervised learning. Bražinskas and Titov (2020) randomly selected N review texts for each entity and constructed N synthetic pairs by sequentially regarding one review text as a pseudo summary and the others as source reviews. Amplayo and Lapata (2020) sampled a review text as a pseudo summary and generated various noisy versions of it as source reviews. Elsahar et al. (2021) selected review texts similar to the sampled pseudo summary as source reviews, based on TF-IDF cosine similarity. We construct synthetic pairs based on Bražinskas and Titov (2020) and extend the self-supervised opinion summarization to a multimodal version.
Multimodal text summarization has been mainly studied in a supervised manner. Text summaries were created by using other modality data as additional input (Li et al., 2018, 2020a), and some studies provided not only a text summary but also other modality information as output (Zhu et al., 2018; Chen and Zhuge, 2018; Zhu et al., 2020; Li et al., 2020b; Fu et al., 2020). Furthermore, most studies summarized a single sentence or document. Although Li et al. (2020a) summarized multiple documents, they used non-subjective documents. Our study is the first unsupervised multimodal text summarization work that summarizes multiple subjective documents.

Problem Formulation
The goal of the self-supervised multimodal opinion summarization is to generate a pseudo summary from multimodal data. Following existing self-supervised opinion summarization studies, we consider a review text selected from an entire review corpus as a pseudo summary. We extend the formulation of Bražinskas and Titov (2020) to a multimodal version. Let $R = \{r_1, r_2, ..., r_N\}$ denote the set of reviews about an entity (e.g., a business or product). Each review, $r_j$, consists of review text, $d_j$, and review rating, $s_j$, that represents the overall sentiment of the review text. We denote images uploaded by a user or provided by a company for the entity as $I = \{i_1, i_2, ..., i_M\}$ and a table containing abundant metadata about the entity as $T$. Here, $T$ consists of several fields, and each field contains its own name and value. We set the $j$-th review text $d_j$ as the pseudo summary and let it be generated from $R_{-j}$, $I$, and $T$, where $R_{-j} = \{r_1, ..., r_{j-1}, r_{j+1}, ..., r_N\}$ denotes source reviews. To help the model summarize what stands out overall in the review corpus, we calculate the loss for all $N$ cases of selecting $d_j$ from $R$, and train the model using the average loss. During testing, we generate a summary from $R$, $I$, and $T$.
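The leave-one-out construction described above can be illustrated with a small sketch. This is only an illustration under stated assumptions: the function names `leave_one_out_pairs` and `average_loss` are hypothetical, reviews are plain strings, and images and the table are omitted because they are shared by all N pairs of an entity.

```python
from typing import List, Tuple


def leave_one_out_pairs(reviews: List[str]) -> List[Tuple[List[str], str]]:
    """Build N (source reviews, pseudo summary) pairs for one entity.

    For each index j, review j is the pseudo summary and the remaining
    N - 1 reviews form the source set (R_{-j} in the text).
    """
    pairs = []
    for j, target in enumerate(reviews):
        sources = reviews[:j] + reviews[j + 1:]
        pairs.append((sources, target))
    return pairs


def average_loss(per_pair_losses: List[float]) -> float:
    """Average the per-pair losses over the N leave-one-out cases."""
    return sum(per_pair_losses) / len(per_pair_losses)


if __name__ == "__main__":
    demo = ["great tacos", "slow service", "nice patio"]
    for sources, target in leave_one_out_pairs(demo):
        print(target, "<-", sources)
```

At test time no review is held out, so the model would receive all of R together with I and T.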

Model Framework
The proposed model framework, MultimodalSum, is designed with an encoder-decoder structure, as in Figure 1b. To address the heterogeneity of three input modalities, we configure each modality encoder to effectively process data in each modality. We set a text decoder to generate summary text by synthesizing encoded representations from the three modality encoders. Details are described in the following subsections.

Text Encoder and Decoder
Our text encoder and decoder are based on BART (Lewis et al., 2020). BART is a Transformer (Vaswani et al., 2017) encoder-decoder pretrained model that is particularly effective when fine-tuned for text generation and has high summarization performance. Furthermore, because the pseudo summary of self-supervised multimodal opinion summarization is an individual review text ($d_j$), we determine that pretraining BART based on a denoising autoencoder is suitable for our framework. Therefore, we further pretrain BART using the entire training review corpus (Gururangan et al., 2020). Our text encoder obtains $e_D$-dimensional encoded text representations $h_{text}$ from $D_{-j}$ and the text decoder generates $d_j$ from $h_{text}$ as follows:
$$h_{text} = \mathrm{BART}_{enc}(D_{-j}), \qquad d_j = \mathrm{BART}_{dec}(h_{text}),$$
where $D_{-j} = \{d_1, ..., d_{j-1}, d_{j+1}, ..., d_N\}$ denotes the set of review texts from $R_{-j}$. Each review text consists of $l_D$ tokens and $h_{text} \in \mathbb{R}^{(N-1) \times l_D \times e_D}$.

Image Encoder
We use a convolutional neural network specialized in analyzing visual imagery. In particular, we use ImageNet pretrained ResNet101 (He et al., 2016), which is widely used as a backbone network. We add an additional linear layer in place of the image classification layer to match feature distribution and dimensionality with text modality representations. Our image encoder obtains encoded image representations $h_{img}$ from $I$ as follows:
$$h_{img} = \mathrm{ResNet101}(I)\, W_{img},$$
where $W_{img} \in \mathbb{R}^{e_I \times e_D}$ denotes the additional linear weights. We obtain $h_{img} \in \mathbb{R}^{M \times l_I \times e_D}$, where $l_I$ represents the size of the flattened image feature map obtained from ResNet101.

Table Encoder
To effectively encode metadata, we design our table encoder based on the framework of data-to-text research (Puduppully et al., 2019). The input to our table encoder, $T$, is a series of field-name and field-value pairs. Each field gets an $e_T$-dimensional representation through a multilayer perceptron after concatenating the representations of field name and field value. The encoded table representation $h_{table}$ is obtained by stacking each field representation into $F$ and adding a linear layer as follows:
$$f_k = \mathrm{ReLU}([n_k; v_k]\, W_f + b_f), \qquad h_{table} = F\, W_{table},$$
where $n_k$ and $v_k$ denote $e_T$-dimensional representations of the name and value of the $k$-th field, respectively, and $W_f \in \mathbb{R}^{2e_T \times e_T}$, $b_f \in \mathbb{R}^{e_T}$ are parameters. By stacking $l_T$ field representations, we obtain $F \in \mathbb{R}^{1 \times l_T \times e_T}$. The additional linear weights $W_{table} \in \mathbb{R}^{e_T \times e_D}$ play the same role as in the image encoder, and $h_{table} \in \mathbb{R}^{1 \times l_T \times e_D}$.

Model Training Pipeline
To effectively train the model framework, we set a model training pipeline, which consists of three steps, as in Figure 2. The first step is text modality pretraining, in which a model learns unsupervised summarization capabilities using only text modality data. Next, during the pretraining for other modalities, an encoder for each modality is trained using the text modality decoder learned in the previous step as a pivot. The main purpose of this step is that other modalities have representations whose distribution is similar to that of the text modality. In the last step, the entire model framework is trained using all the modality data. Details of each step can be found in the next subsections.

Text Modality Pretraining
In this step, we pretrain the text encoder and decoder for self-supervised opinion summarization. As this was an important step for unsupervised multimodal neural machine translation (Su et al., 2019), we apply it to our framework. For the set of reviews about an entity $R$, we train the model to generate a pseudo summary $d_j$ from source reviews $R_{-j}$ for all $N$ cases, using the negative log-likelihood as the loss:
$$\mathrm{loss} = -\sum_{j=1}^{N} \log p(d_j \mid R_{-j}).$$
The text encoder obtains $h_{text} \in \mathbb{R}^{(N-1) \times l_D \times e_D}$ from $D_{-j}$, and the text decoder aggregates the encoded representations of $N-1$ review texts to generate $d_j$. We model the aggregation of multiple encoded representations in the multi-head self-attention layer of the text decoder. To generate a pseudo summary that covers the overall contents of source reviews, we simply average the $N-1$ single-head attention results for each encoded representation ($\mathbb{R}^{l_D \times e_D}$) at each head (Elsahar et al., 2021).
The limitation of the self-supervised opinion summarization is that training and inference tasks are different. The model learns a review generation task using a review text as a pseudo summary; however, the model needs to perform a summary generation task at inference. To close this gap, we use a rating deviation between the source reviews and the target as an additional input feature of the text decoder, inspired by Bražinskas et al. (2020). We define the average rating of the source reviews minus the rating of the target as the rating deviation: $sd_j = \sum_{i \neq j}^{N} s_i / (N-1) - s_j$. We use $sd_j$ to help generate a pseudo summary $d_j$ during training and set it to 0 to generate a summary with the average semantics of the input reviews during inference. To reflect the rating deviation, we modify the way in which a Transformer creates input embeddings, as in Figure 3. We create deviation embeddings with the same dimensionality as token embeddings and add $sd_j \times$ deviation embeddings to the token embeddings in the same way as positional embeddings.
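As a rough illustration of the rating deviation, the sketch below computes it for one target review and adds the scaled deviation embedding to a list of token embeddings. The helper names are hypothetical, and plain Python lists stand in for tensors; the actual model applies this inside the Transformer embedding layer.

```python
from typing import List


def rating_deviation(ratings: List[float], j: int) -> float:
    """Mean rating of the N - 1 source reviews minus the rating of target j."""
    sources = ratings[:j] + ratings[j + 1:]
    return sum(sources) / len(sources) - ratings[j]


def add_deviation(
    token_embeddings: List[List[float]],
    deviation_embedding: List[float],
    sd: float,
) -> List[List[float]]:
    """Add sd * deviation_embedding to every token embedding.

    The deviation embedding has the same dimensionality as a token
    embedding, so the addition mirrors how positional embeddings are added.
    """
    return [
        [t + sd * d for t, d in zip(token, deviation_embedding)]
        for token in token_embeddings
    ]


if __name__ == "__main__":
    sd = rating_deviation([5.0, 3.0, 4.0], j=0)
    print(sd)  # training-time deviation for target 0
    print(add_deviation([[1.0, 2.0]], [0.5, -0.5], sd=0.0))  # inference: sd = 0
```

With `sd = 0.0`, as at inference, the token embeddings are left unchanged, which corresponds to asking for a summary at the average rating of the inputs.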
Our methods to close the gap between training and inference tasks do not require additional modeling or training in comparison with previous works. We achieve noising and denoising effects by simply using rating deviation embeddings, without the variational inference used in Bražinskas and Titov (2020). Furthermore, the information that the rating deviation is 0 plays the role of an input prompt for inference, without the need to train a separate classifier for selecting control tokens to be used as input prompts (Elsahar et al., 2021).

Other Modalities Pretraining
As the main modality for summarization is the text modality, we pretrain the image and table encoders by pivoting on the text modality. Although the data of the three modalities are heterogeneous, each encoder should be trained to obtain homogeneous representations. We achieve this by using the pretrained text decoder as a pivot. We train the image encoder and the table encoder along with the text decoder to generate a review text of the entity to which images or metadata belong: $I$ or $T \rightarrow d_j \in R$. The image and table encoders obtain $h_{img}$ and $h_{table}$ from $I$ and $T$, respectively, and the text decoder generates $d_j$ from $h_{img}$ or $h_{table}$. Note that we aggregate the $M$ encoded representations of $h_{img}$ as in the text modality pretraining, and the weights of the text decoder are frozen. $I$ or $T$ corresponds to all $N$ reviews, which means that $I$ or $T$ has multiple references. We convert the multiple-reference setting to a single-reference setting to match the model output with the text modality pretraining. We simply create $N$ single-reference pairs from each entity and shuffle pairs from all entities to construct the training dataset (Zheng et al., 2018). As the text decoder was trained for generating a review text from text encoded representations, the image and table encoders are bound to produce representations similar to those of the text encoder to generate the same review text. In this way, we can maximize the ability to extract the information necessary for generating the review text.
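The conversion from a multiple-reference to a single-reference setting can be sketched as follows. This is a minimal illustration, assuming that each entity is identified by a hypothetical string key standing in for its images or table; the function name and the fixed seed are not from the paper.

```python
import random
from typing import Dict, List, Tuple


def single_reference_pairs(
    entity_reviews: Dict[str, List[str]], seed: int = 0
) -> List[Tuple[str, str]]:
    """Create one (entity input, review text) pair per review, then shuffle.

    The entity key stands in for the non-text input (I or T) shared by all
    N reviews of that entity, so an entity with N reviews yields N pairs.
    """
    pairs = [
        (entity_id, text)
        for entity_id, texts in entity_reviews.items()
        for text in texts
    ]
    random.Random(seed).shuffle(pairs)  # mix pairs across all entities
    return pairs


if __name__ == "__main__":
    data = {"bakery": ["tasty cake", "long line"], "diner": ["good coffee"]}
    for entity_id, text in single_reference_pairs(data):
        print(entity_id, "->", text)
```

Shuffling across entities means that consecutive training examples rarely share the same image set or table, which is one plausible reason for constructing the dataset this way.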

Training for Multiple Modalities
We train the entire multimodal framework from the pretrained encoders and decoder. The encoder of each modality obtains an encoded representation for each modality, and the text decoder generates the pseudo summary $d_j$ from multimodal encoded representations $h_{text}$, $h_{img}$, and $h_{table}$. To fuse multimodal representations, we aim to meet three requirements. First, the text modality, which is the main modality, is primarily used. Second, the model works even if images or metadata are not available. Third, the model makes the most of the legacy from pretraining. To fulfill the requirements, multi-modality fusion is applied to the multi-head self-attention layer of the text decoder. The text decoder obtains the attention result for each modality at each layer. We fuse the attention results for multiple modalities as follows:
$$ma_{fused} = ma_{text} + \alpha \odot ma_{img} + \beta \odot ma_{table},$$
where $ma_{text}$, $ma_{img}$, and $ma_{table}$ denote each modality attention result from $h_{text}$, $h_{img}$, and $h_{table}$, respectively. $\odot$ symbolizes elementwise multiplication, and the $e_D$-dimensional multimodal gates $\alpha$ and $\beta$ are calculated as follows:
$$\alpha = \phi([ma_{text}; ma_{img}]\, W_{\alpha}), \qquad \beta = \phi([ma_{text}; ma_{table}]\, W_{\beta}).$$
Note that $\alpha$ or $\beta$ obtains the zero vector when images or metadata do not exist. It is common to use sigmoid as an activation function $\phi$. However, it can lead to confusion in the text decoder pretrained using only the text source.
Because the values of $W$ are initialized at approximately 0, the values of $\alpha$ and $\beta$ are initialized at approximately 0.5 when sigmoid is used. To initialize the gate values at approximately 0, we use $\mathrm{ReLU}(\tanh(x))$ as $\phi(x)$. This enables the continuous use of text information, and images or metadata are used selectively.
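The effect of the activation choice at initialization can be checked with a few lines of arithmetic. The sketch below is illustrative only: the scalar functions stand in for elementwise operations on vectors, and the additive form used in the hypothetical `fuse` helper is an assumption consistent with the statement that a zero gate leaves the text-only result untouched.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    """Standard logistic function; equals 0.5 at x = 0."""
    return 1.0 / (1.0 + math.exp(-x))


def relu_tanh(x: float) -> float:
    """ReLU(tanh(x)); equals 0 at x = 0 and stays in [0, 1)."""
    return max(0.0, math.tanh(x))


def fuse(
    ma_text: List[float],
    ma_img: List[float],
    ma_table: List[float],
    alpha: List[float],
    beta: List[float],
) -> List[float]:
    """Elementwise text + alpha * image + beta * table (assumed additive fusion)."""
    return [
        t + a * i + b * tb
        for t, i, tb, a, b in zip(ma_text, ma_img, ma_table, alpha, beta)
    ]


if __name__ == "__main__":
    pre_activation = 0.0  # gate input when the gate weights start near 0
    print(sigmoid(pre_activation), relu_tanh(pre_activation))
    gate = [relu_tanh(pre_activation)] * 2
    print(fuse([1.0, 2.0], [9.0, 9.0], [7.0, 7.0], gate, gate))
```

With gates near 0 at the start of end-to-end training, the decoder initially sees the same attention result it saw during text modality pretraining, and the image and table contributions grow only where training finds them useful.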

Experimental Setup

Datasets
To evaluate the effectiveness of the model framework and training pipeline on datasets with different domains and characteristics, we performed experiments on two review datasets: Yelp Dataset Challenge and Amazon product reviews (He and McAuley, 2016). The Yelp dataset provides reviews based on personal experiences for a specific business. It also provides numerous images (e.g., food and drinks) uploaded by the users. Note that the maximum number of images, $M$, was set to 10 based on the 90th percentile. In addition, the dataset contains abundant metadata of businesses according to the characteristics of each business. In contrast, the Amazon dataset provides reviews with more objective and specific details about a particular product. It contains a single image provided by the supplier, and provides relatively limited metadata for the product. For evaluation, we used the data used in previous research (Chu and Liu, 2019; Bražinskas and Titov, 2020). The data were generated by Amazon Mechanical Turk workers who summarized 8 input review texts. Therefore, we set $N$ to 9 so that a pseudo summary is generated from 8 source reviews during training. For the Amazon dataset, 3 summaries are given per product. Simple data statistics are shown in Table 1, and other details can be found in Appendix A.1.

Experimental Details
All the models² were implemented with PyTorch (Paszke et al., 2019), and we used the Transformers library from Hugging Face (Wolf et al., 2020) as the backbone skeleton. Our text encoder and decoder were initialized using BART-Large and further pretrained using the training review corpus with the same objective as BART. $e_D$, $e_I$, and $e_T$ were all set to 1,024. We trained the entire models using the Adam optimizer (Kingma and Ba, 2014) with a linear learning rate decay on NVIDIA V100s. We applied a weight decay of 0.1. For each step of the training pipeline, we set different batch sizes, epochs, learning rates, and warmup steps according to the amount of learning required at that step. We used label smoothing of 0.1 and set the maximum gradient norm to 1 for other modalities pretraining and multiple-modalities training. During testing, we used beam search with early stopping and discarded hypotheses that contain the same trigram twice. Different beam sizes, length penalties, and maximum lengths were set for Yelp and Amazon. The best hyperparameter values and other details are described in Appendix A.2.
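The trigram constraint used at decoding time can be illustrated with a small stand-alone check. This is a sketch of the criterion only, not the authors' decoding code; the function name is hypothetical, and tokens are whitespace-separated words here rather than subword ids.

```python
from typing import List


def has_repeated_trigram(tokens: List[str]) -> bool:
    """Return True if any trigram occurs more than once in the sequence."""
    seen = set()
    for start in range(len(tokens) - 2):
        trigram = tuple(tokens[start:start + 3])
        if trigram in seen:
            return True
        seen.add(trigram)
    return False


if __name__ == "__main__":
    hypothesis = "the food was good and the food was cheap".split()
    print(has_repeated_trigram(hypothesis))
```

In the Transformers library, a constraint of this kind is commonly expressed through the `no_repeat_ngram_size` argument of `generate`; whether the authors used that argument is not stated in the text.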

Comparison Models
We compared our model to extractive and abstractive opinion summarization models. For extractive models, we used some simple baseline models (Bražinskas and Titov, 2020). Clustroid selects one review that gets the highest ROUGE-L score with the other reviews of an entity. Lead constructs a summary by extracting and concatenating the lead sentences from all review texts of an entity. Random simply selects one random review from an entity. LexRank (Erkan and Radev, 2004) is an extractive model that selects the most salient sentences based on graph centrality.
² Our code is available at https://bit.ly/3bR4yod
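To make the Clustroid baseline concrete, the sketch below selects the review with the highest mean ROUGE-L against the other reviews of the entity. It is a simplified illustration: ROUGE-L is approximated by an F1 score over the longest common subsequence of whitespace tokens, without the stemming or weighting options of the official ROUGE toolkit, and all function names are hypothetical.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for k, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[k - 1] + 1)
            else:
                curr.append(max(prev[k], curr[k - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-L: F1 of LCS-based precision and recall."""
    c, r = candidate.split(), reference.split()
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)


def clustroid(reviews: List[str]) -> str:
    """Pick the review with the highest mean ROUGE-L against the others."""

    def mean_score(j: int) -> float:
        others = reviews[:j] + reviews[j + 1:]
        return sum(rouge_l_f1(reviews[j], o) for o in others) / len(others)

    best = max(range(len(reviews)), key=mean_score)
    return reviews[best]


if __name__ == "__main__":
    print(clustroid([
        "the food was great",
        "the food was great and cheap",
        "food was great",
        "terrible parking lot",
    ]))
```

Ties are broken in favor of the earliest review, which is an arbitrary choice of this sketch.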
For abstractive models, we used non-neural and neural models. Opinosis (Ganesan et al., 2010) is a non-neural model that uses a graph-based summarizer based on token-level redundancy. MeanSum (Chu and Liu, 2019) is a neural model that is based on a denoising autoencoder and generates a summary from mean representations of source reviews. We also used three self-supervised abstractive models. DenoiseSum (Amplayo and Lapata, 2020) generates a summary by denoising source reviews. Copycat (Bražinskas and Titov, 2020) uses a hierarchical variational autoencoder model and generates a summary from mean latent codes of the source reviews. Self & Control (Elsahar et al., 2021) generates a summary from Transformer models and uses some control tokens as additional inputs to the text decoder.

Results
We evaluated our model framework and model training pipeline. In particular, we evaluated the summarization quality compared to other baseline models in terms of automatic and human evaluation, and conducted ablation studies.

Automatic Evaluation
To evaluate the summarization quality, we used two automatic measures: ROUGE-{1,2,L} (Lin, 2004) and BERT-score (Zhang et al., 2020). The former is a token-level measure for comparing 1, 2, and adaptive L-gram matching tokens, and the latter is a document-level measure using pretrained BERT (Devlin et al., 2019). In contrast to ROUGE-score, which is based on exact matching between n-gram words, BERT-score is based on the semantic similarity between word embeddings that reflect the context of the document through BERT. It has been shown that BERT-score is more robust to adversarial examples and correlates better with human judgments than other measures for machine translation and image captioning. We hypothesize that BERT-score is also effective for opinion summarization and that it would complement ROUGE-score.
The results for opinion summarization on two datasets are shown in Table 2. MultimodalSum showed superior results compared with extractive and abstractive baselines for both token-level and document-level measures. From the results, we conclude that the multimodal framework outperformed the unimodal framework for unsupervised opinion summarization. In particular, our model achieved state-of-the-art results on the Amazon dataset and outperformed the comparable model by a large margin in R-L, a representative ROUGE score, on the Yelp dataset. Although Self & Control showed a high R-2 score, we attributed their score to the inferred N-gram control tokens used as additional inputs to the text decoder.
Table 2: Opinion summarization results on Yelp and Amazon datasets. R-1, R-2, R-L, and $F_{BERT}$ refer to ROUGE-{1,2,L} and BERT-score, respectively. The best models are marked in bold, and the second-best models are underlined. * indicates that our model shows significant gains (p < 0.05) over the second-best model based on paired bootstrap resampling (Koehn, 2004). All the reported scores are based on F1.
Table 3: Sample summaries on the Yelp dataset.
Gold: Wow, where to start? Some of the best sweet foods I've ever had. I wasn't sure what to try, so I tried a few things, and oh my goodness they were delicious. That's not all though, they serve drinks too so I got a latte and that was good too. There is a lot of variety here to choose from that'll make any sweet tooth salivate. Definitely a good place!
Copycat: If you're looking for a sweet tooth this is the place to go if you want a delicious dessert. I had the lemon meringue pie and it was delicious. The only thing I didn't like was that I could eat half of it, but it was a little pricey for what you get.
Self & Control: If you're a fan of the Matos Buffet, this is a good place to visit. If you want to have a sweet tooth this is the place to be. The desserts are delicious and they have a good variety of desserts to choose from. The only thing I don't like about this place is that you have to wait in line for a long time to get in. Other than that, you can't really go wrong with any of the desserts in the buffet. The drinks are good and the desserts are yummy too. They also have desserts that are not too sweet. I'm not a huge fan of buffets, but this is one of my favorite buffets.
MultimodalSum: This is a cute little bakery located in the M resort. I had the chocolate croissant and it was very good. The croissants were soft and moist and the filling was delicious. I also had a chocolate chip cookie which was also good. I would definitely recommend this place if you are in the area.
Sample summaries on the Yelp dataset are shown in Table 3. They were generated from source reviews on Baby Cakes bakery. Copycat misused "sweet tooth" and generated "lemon meringue pie", which was not mentioned in the source reviews. Self & Control generated a summary about a buffet by totally misunderstanding one sentence from source reviews: "If you love the desserts in Studio B Buffet in the M Hotel but don't want to wait in the massive buffet line or even eat in the buffet, Baby Cakes in the M Hotel is really nice fix." Furthermore, "Matos Buffet" is a non-existent word. In contrast, MultimodalSum generated a good summary with a rich description of chocolate croissants. Although "chocolate chip cookie" was not found in the source reviews, our model generated it from cookie images. Note that the term can be found in other reviews that were not used as source reviews. Additional sample summaries on two datasets are shown in Appendix A.5.

Human Evaluation
To evaluate the quality of summarization based on human criteria, we conducted a user study. We assessed the quality of summaries using Best-Worst Scaling (BWS; Louviere et al. (2015)). BWS is known to produce more reliable results than rating scales (Kiritchenko and Mohammad, 2017) and is widely used in self-supervised opinion summarization studies. We recruited 10 NLP experts and asked each participant to choose one best and one worst summary from four summaries for three criteria. For each participant's response, the best model received +1, the worst model received -1, and the rest of the models received 0. The final scores were obtained by averaging the scores of all the responses from all participants. For the Overall criterion, Self & Control, Copycat, MultimodalSum, and gold summaries scored -0.527, -0.113, +0.260, and +0.380 on the Yelp dataset, respectively. MultimodalSum showed superior performance in human evaluation as well as automatic evaluation. We note that human judgments correlate better with BERT-score than with ROUGE-score. Self & Control achieved a very low human evaluation score despite its high ROUGE-score in automatic evaluation. We analyzed the summaries of Self & Control and found several flaws such as redundant words, ungrammatical expressions, and factual hallucinations. It generated a non-existent word by combining several subwords, which was particularly noticeable when a proper noun was generated. Furthermore, Self & Control generated an implausible sentence by copying some words from source reviews. From the results, we conclude that a good summarization model should perform well in both automatic and human evaluation, and that BERT-score can complement ROUGE-score in automatic evaluation. Details on human evaluation and full results can be found in Appendix A.3.
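The Best-Worst Scaling scores described above can be computed as in the following sketch. The response format, a list of hypothetical `(best, worst)` system-name pairs with one pair per participant response, is an assumption made for illustration.

```python
from typing import Dict, List, Tuple


def bws_scores(
    responses: List[Tuple[str, str]], systems: List[str]
) -> Dict[str, float]:
    """Average Best-Worst Scaling score per system.

    Each response names the best and the worst system; the best gets +1,
    the worst gets -1, and every other system gets 0 for that response.
    """
    totals = {system: 0 for system in systems}
    for best, worst in responses:
        totals[best] += 1
        totals[worst] -= 1
    return {system: totals[system] / len(responses) for system in systems}


if __name__ == "__main__":
    demo = [("A", "B"), ("A", "C"), ("B", "C"), ("A", "B")]
    print(bws_scores(demo, ["A", "B", "C", "D"]))
```

Because every response adds one +1 and one -1, the scores of all systems sum to zero, which is consistent with the four Overall scores reported above.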

Effects of Multimodality
To analyze the effects of multimodal data on opinion summarization, we analyzed the multimodal gate. Since the multimodal gate is an $e_D$-dimensional vector, we averaged it into a scalar value. Furthermore, as multimodal gates exist for each layer of the text decoder, we averaged them to measure the overall influence of a table or images when generating each token in the decoder. An example of aggregated multimodal gates is shown in Figure 4. It shows the table and images used for generating a summary text, and the multimodal gates for a part of the generated summary are expressed as heatmaps. As we intended, table and image information was selectively used to generate a specific word in the summary. The aggregated value of the table was relatively high for generating "Red Lobster", which is the name of the restaurant. The aggregated value of the images was relatively high when generating "food", which is depicted in two images. Another characteristic of the result is that aggregated values of the table were higher than those of the image: mean values for the table and image in the entire test data were 0.103 and 0.045, respectively. This implies that table information is used more when creating a summary, which is plausible given that the table contains a large amount of metadata. Note that the values displayed on the heatmaps are generally small, as they were aggregated from $e_D$-dimensional vectors.
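The two-stage averaging of the gates can be sketched as follows. The nested-list layout `gates[layer][token][dimension]` and the function name are assumptions for illustration; in practice the same reduction would be a mean over two tensor axes.

```python
from typing import List


def aggregate_gates(gates: List[List[List[float]]]) -> List[float]:
    """Reduce gates[layer][token][dim] to one scalar per generated token.

    Step 1 averages each gate vector over its dimensions; step 2 averages
    the resulting scalars over the decoder layers.
    """
    n_layers = len(gates)
    n_tokens = len(gates[0])
    per_token = []
    for t in range(n_tokens):
        layer_means = [sum(gates[l][t]) / len(gates[l][t]) for l in range(n_layers)]
        per_token.append(sum(layer_means) / n_layers)
    return per_token


if __name__ == "__main__":
    demo = [
        [[0.0, 0.2], [0.4, 0.4]],  # layer 0: two tokens, two dimensions
        [[0.2, 0.0], [0.0, 0.0]],  # layer 1
    ]
    print(aggregate_gates(demo))
```

Because many gate dimensions are exactly zero under ReLU(tanh(x)), averaging over dimensions naturally yields small scalars, in line with the small heatmap values noted above.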

Ablation Studies
For ablation studies, we analyzed the effectiveness of our model framework and model training pipeline in Table 4. To analyze the model framework, we first compared the summarization quality with four versions of the unimodal model framework, as in the first block of Table 4. BART denotes the model framework in Figure 1a, whose weights are the weights of BART-Large. It represents the lower bound of our model framework without any training. BART-Review denotes the model framework whose weights are from further pretrained BART using the entire training review corpus. UnimodalSum refers to the results of the text modality pretraining, and we classified it into two frameworks according to the use of the rating deviation.
Surprisingly, using only BART achieved comparable or better results than many extractive and abstractive baselines in Table 2. Furthermore, further pretraining using the review corpus brought performance improvements. Qualitatively, BART with further pretraining generated more diverse words and rich expressions from the review corpus. This supports our assumption that denoising autoencoder-based pretraining helps in self-supervised multimodal opinion summarization. Based on BART-Review, UnimodalSum achieved superior results. Furthermore, the use of rating deviation improved the quality of summarization. We conclude that learning to generate reviews based on wide ranges of rating deviations, including 0, during training helps to generate a better summary of the average semantics of the input reviews.
To analyze the effect of other modalities in our model framework, we compared the summarization quality with three versions of multimodal model frameworks, as in the second block of Table 4. We removed the image or table modality from MultimodalSum to analyze the contribution of each modality. Results showed that both modalities improved the summarization quality compared with UnimodalSum, and they brought additional improvements when used together. This indicates that using non-text information helps in self-supervised opinion summarization. As expected, the utility of the table modality was higher than that of the image modality. The image modality contains detailed information not revealed in the table modality (e.g., appearance of food, inside/outside mood of business, design of product, and color/texture of product). However, the information is unorganized to the extent that the utility of the image modality depends on the capacity of the image encoder to extract unorganized information. Although MultimodalSum used a representative image encoder because our study is the first work on multimodal opinion summarization, we expect that the utility of the image modality will be greater if unorganized information can be extracted effectively from the image using advanced image encoders.
To analyze the model training pipeline, we removed text modality pretraining and/or other modalities pretraining from the pipeline. Removing either of them decreased the performance of MultimodalSum, and removing all of the pretraining steps caused an additional performance drop. Although MultimodalSum without other modalities pretraining has the capability of text summarization, it showed low summarization performance at the beginning of training due to the heterogeneity of the three modality representations. In contrast, MultimodalSum without text modality pretraining, whose image and table encoders were pretrained using BART-Review as a pivot, showed stable performance from the beginning, but its performance did not improve significantly. From these results, we conclude that both text modality pretraining and other modalities pretraining help the training of the multimodal framework. For the other modalities pretraining, we conducted a further analysis in Appendix A.4.

Conclusions
We proposed the first self-supervised multimodal opinion summarization framework. Our framework can reflect text, images, and metadata together as an extension of the existing self-supervised opinion summarization framework. To resolve the heterogeneity of multimodal data, we also proposed a multimodal training pipeline. We verified the effectiveness of our multimodal framework and training pipeline with various experiments on real review datasets. Self-supervised multimodal opinion summarization can be used in various ways in the future, such as providing a multimodal summary or enabling multimodal retrieval. By retrieving reviews related to a specific image or metadata, controlled opinion summarization will be possible.

A.1 Dataset Preprocessing
We selected businesses and products with a minimum of 10 reviews and removed popular entities above the 90th percentile. The minimum and maximum review lengths in words were set to 35 and 100 for Yelp, and 45 and 70 for Amazon, respectively. We set the maximum number of tokens to 128 using the BART tokenizer for training, and we did not limit the maximum number of tokens for inference. For the Amazon dataset, we selected 4 categories: Electronics; Clothing, Shoes and Jewelry; Home and Kitchen; Health and Personal Care. As the Yelp dataset contains an unlimited number of images for each entity, we did not use images for popular entities above the 90th percentile. On the other hand, the Amazon dataset contains a single image for each entity. Therefore, we excluded images only when they were meaningless, such as a non-image icon or an update icon, or when the image links had expired. For the Yelp dataset, we selected name, ratings, categories, hours, and attributes among the metadata. We used the hours of each day of the week as seven fields and used each item of metadata contained in attributes as a separate field. For some attributes ('Ambience', 'BusinessParking', 'GoodForMeal') that have subordinate attributes, we used each subordinate attribute. Among the fields, we selected the 47 fields used by at least 10% of the entities. We set the maximum number of categories to 6 based on the 90th percentile, and averaged the representations of each category. For ratings, we converted the value to binary notation consisting of 4 digits ($2^2$, $2^1$, $2^0$, $2^{-1}$). For hours, we considered (open hour, close hour) as a 2-dimensional vector, and conducted K-means clustering. We selected four clusters based on silhouette score: (16.5, 23.2), (8.7, 17.1), (6.4, 23), and (10.6, 22.6). Based on the clusters, we converted hours into a categorical type.
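As an illustration of the last step, the conversion of hours into a categorical type can be sketched as a nearest-centre lookup over the four reported cluster centres. This is a minimal sketch, not the authors' code: the function name `hours_to_category`, the use of Euclidean distance, and the ordering of the category indices are assumptions.

```python
import math

# Cluster centres (open hour, close hour) as reported in the text.
HOUR_CLUSTERS = [(16.5, 23.2), (8.7, 17.1), (6.4, 23.0), (10.6, 22.6)]


def hours_to_category(open_hour, close_hour, centres=HOUR_CLUSTERS):
    """Return the index of the nearest cluster centre (Euclidean distance).

    The index order simply follows the order in which the centres are
    listed in the text; it is an assumption, not the authors' labelling.
    """
    distances = [
        math.hypot(open_hour - c_open, close_hour - c_close)
        for c_open, c_close in centres
    ]
    return distances.index(min(distances))


if __name__ == "__main__":
    print(hours_to_category(9.0, 17.0))   # a 9-to-5 business
    print(hours_to_category(17.0, 23.0))  # an evening-only business
```

A lookup like this mirrors how a fitted K-means model assigns a new point to a cluster, so each day's hours field becomes one of four categories.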
For the Amazon dataset, we selected six fields: name, price, brand, categories, ratings, and description. We set the maximum number of categories to 3 based on the 90th percentile, and averaged the representations of each category. Furthermore, as each category consists of a hierarchy with a maximum depth of 8, we averaged the representations of the hierarchy levels to obtain each category representation. For price and ratings, we converted them to binary notation consisting of 11 and 4 digits, respectively, after rounding them to the nearest 0.5 so that the lowest digit corresponds to $2^{-1}$. As some descriptions consist of many tokens, we set the maximum number of tokens to 128. We regarded each token in the description as a separate field, giving a total of 5 + 128 fields.
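The binary notation used for ratings and prices can be made concrete with a short sketch. It assumes that values are rounded half-up to the nearest 0.5 and that digits are ordered from the most significant place down to $2^{-1}$; the tie-breaking rule and the function names are illustrative assumptions, not details given in the text.

```python
import math


def to_binary_places(value, n_digits):
    """Encode a non-negative value as n_digits binary digits, lowest place 2^-1.

    Digits are returned most significant first, i.e. for n_digits = 4 the
    places are (2^2, 2^1, 2^0, 2^-1).
    """
    # Round to the nearest 0.5 (half-up); the tie-breaking rule is an assumption.
    half_units = math.floor(value * 2 + 0.5)
    if not 0 <= half_units < 2 ** n_digits:
        raise ValueError("value does not fit in the requested number of digits")
    return [(half_units >> shift) & 1 for shift in range(n_digits - 1, -1, -1)]


def from_binary_places(digits):
    """Decode the digits back to the rounded value (useful as a sanity check)."""
    n_digits = len(digits)
    return sum(d * 2.0 ** (n_digits - 2 - i) for i, d in enumerate(digits))


if __name__ == "__main__":
    print(to_binary_places(4.5, 4))     # rating 4.5 -> [1, 0, 0, 1]
    print(to_binary_places(19.99, 11))  # price rounded to 20.0
```

With 4 digits the largest representable value is 7.5, which covers ratings up to 5; with 11 digits it is 1023.5.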

A.2 Experimental Details
Our image encoder is based on ResNet101. ResNet101 is composed of 1 convolution layer, 4 convolution layer blocks, and 1 fully connected layer block. Among them, the 4 convolution layer blocks play an important role in analyzing images. Through each convolution layer block, the size of the image feature map is reduced to 1/4, while the features become higher-level. To maintain the ability to extract low-level features of the image, we froze the model weights up to the second convolution layer block. We used the network only up to the third convolution layer block to increase the resolution of the feature maps and to avoid features that are too high-level and specific to image classification. In this way, $l_I$ was set to 14 × 14 and $e_I$ was set to 1,024.
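The reported values of 14 × 14 and 1,024 can be checked with simple shape arithmetic. The sketch below assumes a 224 × 224 input and the standard ResNet-101 stage layout (strides and channel widths as in common implementations); neither the input size nor the stage names appear in the text, so both are assumptions.

```python
# (stage name, spatial stride, output channels) for a standard ResNet-101 layout.
# In this layout the first block keeps the resolution produced by max-pooling.
STAGES = [
    ("conv1", 2, 64),
    ("maxpool", 2, 64),
    ("block1", 1, 256),
    ("block2", 2, 512),
    ("block3", 2, 1024),
    ("block4", 2, 2048),
]


def feature_map_shape(input_size, last_stage):
    """Return (height, width, channels) after running the stages up to last_stage."""
    size = input_size
    for name, stride, out_channels in STAGES:
        size //= stride
        if name == last_stage:
            return size, size, out_channels
    raise ValueError(f"unknown stage: {last_stage}")


if __name__ == "__main__":
    print(feature_map_shape(224, "block3"))  # (14, 14, 1024)
    print(feature_map_shape(224, "block4"))  # (7, 7, 2048)
```

Stopping at the third block therefore yields 196 spatial positions with 1,024 channels each, four times as many positions as the final block would give.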
To use the knowledge of the text modality in the table encoder, we obtained field name embeddings by summing the BART token embeddings for the tokens contained in the field name. Because field values can have various data types, we used a different processing method for each data type. Nominal values were handled in the same way as the field name. Binary and ordinal values were processed by replacing them with nominal values of corresponding meanings: 'true' and 'false' were used for binary values, and 'cheap', 'average', 'expensive', and 'very expensive' were used for 'RestaurantsPriceRange'. Numerical values were converted to binary notation, and we obtained their representations by summing the embeddings of the places whose digit is 1. For other categorical values, we simply trained embeddings corresponding to each category.
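The two summation schemes described here (token embeddings for field names and nominal values, place embeddings for numerical values) can be sketched with toy vectors. All embedding values, the 2-dimensional size, and the function names below are hypothetical; in the actual model the token embeddings would come from BART, and how the place embeddings are obtained is not specified in the text.

```python
def add_vectors(vectors, dim):
    """Element-wise sum of a list of equal-length vectors."""
    total = [0.0] * dim
    for vec in vectors:
        for i, x in enumerate(vec):
            total[i] += x
    return total


def embed_tokens(tokens, token_embeddings, dim):
    """Field names and nominal values: sum the embeddings of their tokens."""
    return add_vectors([token_embeddings[t] for t in tokens], dim)


def embed_binary_digits(digits, place_embeddings, dim):
    """Numerical values: sum the embeddings of the places whose digit is 1."""
    selected = [place_embeddings[i] for i, d in enumerate(digits) if d == 1]
    return add_vectors(selected, dim)


if __name__ == "__main__":
    # Toy 2-dimensional embeddings, purely for illustration.
    token_embeddings = {"price": [1.0, 0.0], "range": [0.0, 2.0]}
    place_embeddings = [[1.0, 1.0], [2.0, 0.0], [0.0, 3.0], [0.5, 0.5]]

    print(embed_tokens(["price", "range"], token_embeddings, 2))   # [1.0, 2.0]
    print(embed_binary_digits([1, 0, 0, 1], place_embeddings, 2))  # [1.5, 1.5]
```

Summing per-place embeddings lets values that share binary digits share parameters, instead of learning one embedding per distinct number.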
We set different hyperparameter values for each step in the model training pipeline, as in Table 5. We set the batch size according to the memory usage and set the other values according to the amount of learning required. Hyperparameter ranges for epochs and lr (learning rate) were [3, 5, 10, 15, 20] and [1e-03, 1e-04, 5e-05, 1e-05, 5e-06], respectively.

A.3 Human Evaluation
For human evaluation, we randomly selected 30 entities from the Yelp test data, and used three criteria: Grammaticality (the summary should be fluent and grammatical), Coherence (the summary should be well structured and well organized), and Overall (based on your own criteria, select the best and the worst summary of the reviews). Results for the three criteria are shown in Table 6. Self & Control achieved very poor performance for all criteria due to flaws that were not revealed in the automatic evaluation. Surprisingly, MultimodalSum outperformed gold summaries for two criteria; however, its overall performance lagged behind Gold. As our model was initialized from BART-Large, which had been pretrained on a large corpus, and was further pretrained on the training review corpus, it may have generated fluent and coherent summaries. It seems that our model lagged behind Gold in Overall due to various criteria other than those two. The fact that Gold scored lower than Copycat in Grammaticality may seem inconsistent with the result from Bražinskas and Titov (2020). However, we assume that this result was due to the particular combination of the four models in the relative evaluation. The ranking of Copycat and Gold might change in an absolute evaluation.

A.4 Analysis on Other Modalities Pretraining
To analyze the various models for the other modalities pretraining, we evaluated their performance on the reference review generation task, which generates the corresponding reviews from images or a table. For evaluation, we used data that were not used for training: we held out 10% of the data for Yelp and 5% for Amazon. We chose two com-  Table 7. For each model, the pretrained decoder generated a review from the image or table encoded representations. We measured the average ROUGE scores between the generated review and the N reference reviews. The first finding was that the results for the table outperformed those for the image. This indicates that the table has more helpful information for generating reference reviews. The second finding was that our method based on the text decoder outperformed Triplet, which is based on the text encoder. In particular, Triplet achieved very poor performance for the image because it is hard to match M images to N reference reviews for metric learning. In contrast, our method achieved much better performance by using the text decoder as a pivot. Triplet showed good performance on the table because it is relatively easy to match 1 table to N reference reviews; however, our method outperformed it. We conclude that our method lets the image and table encoders obtain proper representations for generating reference reviews regardless of the number of inputs.

Tables 8 and 9 show sample summaries generated by our model and baseline models on the Yelp and Amazon datasets. Full summaries from our model are available at https://bit.ly/3bR4yod.
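For the evaluation in A.4, averaging a ROUGE score over the N reference reviews can be sketched as follows. This is a deliberately simplified unigram ROUGE-1 F1 with lowercasing and whitespace tokenization, not the official ROUGE implementation used for the reported numbers; the function names are assumptions.

```python
from collections import Counter


def rouge1_f1(candidate, reference):
    """Simplified ROUGE-1 F1: clipped unigram overlap on whitespace tokens."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped counts of shared unigrams
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def average_rouge1(candidate, references):
    """Average the score of one generated review over N reference reviews."""
    return sum(rouge1_f1(candidate, r) for r in references) / len(references)


if __name__ == "__main__":
    generated = "the catfish was great and the service was great"
    references = [
        "great catfish and great service",
        "the hush puppies are delicious",
    ]
    print(round(average_rouge1(generated, references), 3))
```

Averaging over all N references rewards a generated review that reflects content shared across an entity's reviews rather than matching a single review closely.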

Review 1
The fresh water catfish is probably the best I've every had. The service was outstanding. I would recommend this little secret to everyone.

Review 2
I loved everything about this place!! Great food, great decor, and great service. The best collard greens I have ever had. We had fried oysters for a starter and although I have never had them before so I have nothing to compare them with they were very tasty. The warm hush puppies with the honey butter was delicious!! I had the crab legs which were perfect and plentiful. My sister had the all you can eat fried catfish that was also cooked perfectly. A great experience all around!!

Review 3
Amazing food and great service! The hospitality was out of this world. Will definitely be back soon.
The wait was less than 5 minutes at 7pm on a Friday night, amazing!! The staff was very kind and the waitresses were very attentive and helpful. We tried the frog legs, catfish, alligator bites, crab legs, gumbo and of course the hush puppies! Everything was outstanding. What a hidden gem!

Review 4
I love this place the food amazing the staff helpful ....must try green tomatos ...fresh water fish ;^)

Review 5
We love this place the catfish is good the hush puppies with that honey butter are awesome the french fries the gumbo what else is good there the alligator tail mostly everything on the menu. I guess the only bad thing I can say is sometimes it's like a 20 minute wait in the drive-through but it's well worth it when your food is hot Because tonight I got to go home and warm it up it's not hot enough, Even though they're still open for another hour that was a bummer

Review 6
Really tasty catfish, shrimp and fixin's. Our friend took us to the sister location on Nellis a couple of months ago, but this location was more convenient to our hotel. No worries, this place was just as good! Excellent service, and the salad bar is a nice touch as well. As a Bostonian, I'm pretty particular about seafood. The Hush Puppy fits the bill. Very satisfied!

Review 7
First Time here and the food, staff was awesome. Manager came over and gave us samples of the fried catfish, super nice.

Review 8
I never eat catfish. It's nasty to me until I tasted the saltwater catfish!!! Greens are on point. The hushpuppy are bomb with honey butter!!!!! Gator bites where are ok.

Copycat
This place is awesome! The food was great, the service was great. We had the catfish po'boy and it was delicious. The only reason I didn't give 5 stars is because of the fact that they don't deliver.

Self & Control
I love this place. The service is awesome. The hush puppies are to die for. I love the honey butter. I can't wait to go back and try it again. The only thing I don't like about the place is the wait. It can be a little long, but it's worth it. It's a little on the pricey side, but you're getting what you pay for. Love the hot butter, the hush puppies, the French fries, the gumbo, the catfish and the gumbo. Everything is so yummy and the service is top notch. Try it out, you won't be disappointed.

MultimodalSum
This place is a hidden gem. The food is great and the service is even better. I had the all you can eat catfish and it was delicious. The hush puppies are the best I've ever had. I will definitely be back.

Gold
Yummy and delicious catfish. You gotta try it. Friendly staff and service is good too. You can tell they know their seafood and how to prepare and cook it to perfection. The staff also answered any questions I had. The Hush Puppies are tasty too.

Review 1
I usually wear size 37, but found a 38 feels better in this sandal. I absolutely love this sandal. So supportive and comfortable, although at first I did get a blister on my big toe. Do not let this be the deciding factor. It stretched out and is now fabulous. I love it so much that I bought it in three colors.

Review 2
This is a really cute shoe that feels very comfortable on my high arches. The strap on the instep fits my feet very well, but I have very slim feet. I can see how it would be uncomfortably tight on anyone with more padding on their feet.

Review 3
I love these sandals. The fit is perfect for my foot, with perfect arch support. I don't think the leather is cheap, and the sandals are very comfortable to walk in. They are very pretty, and pair very well with pants and dresses.

Review 4
My wife is a nurse and wears dansko shoes. We were excited to try the new crimson sandal and normally order 39 sandal and 40 closed toe. Some other reviews were right about a narrow width and tight toe box. We gave them a try and passed a great pair of shoes to our daughter with her long narrow feet, and she loves them...

Review 5
Finally, a Dansko sandal that's fashion forward! It was love at first sight! This is my 4th Dansko purchase.
Their sizing, quality and comfort is very consistent. I love the stying of this sandal and I'm pleased they are offering bolder colors. Another feature I love is the Dri-Lex topsole -it's soft and keeps feet dry.

Review 6
I really love these sandals. my only issue is after wearing them for a while my feet started to swell as I have a high instep and they were a little tight across the top. I'm sure they will stretch a bit after a few wears

Review 7
I have several pairs of Dansko clogs that are all size 39 and fit perfectly. So I felt confident when I ordered the Tasha Sandal in size 39. I don't know if a 40 would be too large but the 39 seems a little small. Otherwise, I love them. They are very cushiony and comfortable!

Review 8
I own many Dansko shoes and these are among my favorites. They have ALL the support that Dansko offers in its shoes plus they are very attractive. I love the the heel height and instant comfort. They look great with slacks and dresses, dressed up or not...

Copycat
This is my second pair of Dansko clogs and I love them. They are very comfortable and I can wear them all day without any discomfort. I would recommend them to anyone looking for a comfortable sandal.

MultimodalSum
I love these sandals. They are very comfortable and look great. The only thing I don't like is that they are a little tight across the top of my foot. I have a high instep and the strap is a little too tight. I am hoping they will stretch out a bit.

Gold 1
I love these sandals, Dansko has made a really great product! I had to return my first pair (39) for being a bit tight and small, but I went a size higher (40) and it is perfect, they are so comfortable! If they do stretch out like other reviews say, they will still fit and look great.

Gold 2
I love these Dansko Tasha sandals! They are comfortable and the style is really cute. The only warning I have is that they seem to run narrow: you may want to buy a larger size if you have wide feet. Also, they seem to stretch as you wear them, so don't get discouraged by a few blisters on first wearing.

Gold 3
These Dansko shoes are amazingly comfortable and hug the shape of my feet well, but I did have to wear them for a bit to stretch them out. They felt a little tight at first, but now they are perfect. I feel they're true to size so I'd recommend ordering these in your normal shoe size.