Cross-Modal Commentator: Automatic Machine Commenting Based on Cross-Modal Information

Automatic commenting on online articles can provide additional opinions and facts to readers, improving user experience and engagement on social media platforms. Previous work focuses on automatic commenting based solely on textual content. However, in real scenarios, online articles usually contain content in multiple modalities; graphic news, for instance, contains plenty of images in addition to text. Non-textual content is also vital, because it is not only more attractive to the reader but may also provide critical information. To remedy this, we propose a new task: cross-modal automatic commenting (CMAC), which aims to make comments by integrating content from multiple modalities. We construct a large-scale dataset for this task and explore several representative methods. Going a step further, we present an effective co-attention model to capture the dependency between textual and visual information. Evaluation results show that our proposed model achieves better performance than competitive baselines.


Introduction
Comments on online articles can provide rich supplementary information, which reduces the difficulty of understanding the article and enhances interaction between users. Automatic commenting is therefore worth pursuing, since it can improve user experience and increase the activeness of social media platforms. Owing to this importance, prior work (Qin et al., 2018; Lin et al., 2018) has explored the task. However, these efforts all focus on automatic commenting based solely on textual content. In real scenarios, online articles on social media usually contain content in multiple modalities. Take graphic news as an example: it contains plenty of images in addition to text. Content other than text is also vital to improving automatic commenting, since it may carry information that is critical for generating informative comments. In addition, compared with plain text, content in other modalities is more attractive to the reader, which makes it easily become the focus of comments. To fill this gap, we propose the task of cross-modal automatic commenting (CMAC), which aims to generate comments by integrating information from multiple modalities. We construct a large-scale cross-modal comments dataset consisting of 24,134 pieces of graphic news. Each instance is composed of several news photos, a news title, a news body, and the corresponding high-quality comments. Figure 1 shows a sample from the dataset.

[Figure 1: An example in the constructed dataset, with three sample comments: (1) "Beautiful flowers! I can't move my eyes from them." (2) "Peach blossoms seem to be a little less pretty without any green grass as background." (3) "It would be better if there is more greenness." Red words indicate content that is not included in the text but is depicted in the images.]
Since the comments depend on content from multiple modalities, how to integrate this multimodal information becomes the focus. In fact, there exist intrinsic interactions between the input modalities: each modality can benefit from the others to obtain better representations. For instance, in graphic news, images can help to highlight the important words in the text, while the text contributes to focusing on key regions of the images. Therefore, we present a co-attention model so that the information from multiple modalities can mutually boost each other for better representations. Experiments show that our co-attention model substantially outperforms various baselines from different aspects.
The main contributions of this work are summarized as follows:
• We propose the task of cross-modal automatic commenting (CMAC) and construct a large-scale dataset for it.
• We present a novel co-attention model that captures the intrinsic interactions between contents of multiple modalities.
• Experiments show that our approach achieves better performance than competitive baselines; with multimodal information and co-attention, the generated comments are more diverse and informative.

Cross-Modal Comments Dataset
We introduce our constructed cross-modal comments dataset from the following aspects.
Data Collecting We collect data from the photo channels of Netease News, a popular Chinese news website. The crawled news covers various categories, including entertainment, sports, and more. We tokenize all texts into words using the Python package Jieba. To guarantee the quality of the comments, we retain comments whose length is between 5 and 30 words and remove useless symbols and dirty words. Besides, we filter out short articles with fewer than 10 words or fewer than 3 images in their content, and unpopular articles with fewer than 10 comments are also removed. Finally, we obtain a dataset with 24,134 pieces of news. Each instance contains the news title, the news body, several images, and a list of high-quality comments. On average, each news item in the dataset has about 39 human-written comments.
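As a concrete reference, the filtering rules above can be summarized in a short sketch; the field names (`body_tokens`, `images`, `comments`) are hypothetical, and only the thresholds come from the paper:

```python
def keep_comment(tokens):
    # Retain comments whose length is between 5 and 30 words.
    return 5 <= len(tokens) <= 30

def keep_article(article):
    # Drop short articles (fewer than 10 words or fewer than 3 images)
    # and unpopular ones (fewer than 10 surviving comments).
    return (len(article["body_tokens"]) >= 10
            and len(article["images"]) >= 3
            and len(article["comments"]) >= 10)

def filter_dataset(raw_articles):
    kept = []
    for article in raw_articles:
        comments = [c for c in article["comments"] if keep_comment(c)]
        candidate = {**article, "comments": comments}
        if keep_article(candidate):
            kept.append(candidate)
    return kept
```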
Data Statistics The dataset is split at the level of news articles: comments from the same news item appear solely in the training, development, or testing set, avoiding overlap between training and evaluation. In more detail, we split the data into 19,162, 3,521, and 1,451 news articles for the training, development, and testing sets, respectively, with 746,423, 131,175, and 53,058 corresponding comments. The statistics of the final dataset are presented in Table 1, and Figure 2 shows the distribution of comment lengths at both the word level and the character level.
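A minimal sketch of the article-level split; the paper does not state how articles are assigned to splits, so the random shuffle and seed here are assumptions, while the split sizes are the reported ones:

```python
import random

def split_by_article(articles, seed=0):
    # Shuffle once, then cut into train/dev/test so that all comments
    # of a given news item fall into exactly one split.
    articles = list(articles)
    random.Random(seed).shuffle(articles)
    train = articles[:19162]
    dev = articles[19162:19162 + 3521]
    test = articles[19162 + 3521:]
    return train, dev, test
```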
Data Analysis A high-quality testing set is necessary for faithful automatic evaluation. Therefore, we randomly selected 200 samples from the testing set for quality evaluation. Three annotators with a linguistics background were asked to score the comments; readers can refer to Section 4.3 for the evaluation details. Table 2 shows the evaluation results. The average score for overall quality is 7.6, showing that the testing set is satisfactory.

Textual Encoder and Visual Encoder
The textual encoder aims to obtain representations of the textual content $x$. We implement it as a GRU model (Cho et al., 2014), which computes the hidden representation of each word as follows:
$$h^x_i = \mathrm{GRU}(h^x_{i-1}, e(x_i)),$$
where $e(x_i)$ refers to the embedding of the word $x_i$. Finally, the textual representation matrix is defined as $H^x = [h^x_1, \cdots, h^x_{|x|}] \in \mathbb{R}^{|x| \times d_1}$, where $|x|$ is the total number of textual representations and $d_1$ is the dimension of $h^x_i$. We apply ResNet (He et al., 2016a) as the visual encoder to obtain the visual representation matrix $H^v = [h^v_1, \cdots, h^v_{|v|}] \in \mathbb{R}^{|v| \times d_2}$, where $|v|$ is the number of visual representations (multiple representations can be extracted from a single image) and $d_2$ is the dimension of $h^v_i$.
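The paper releases no code, so the following PyTorch sketch only illustrates the two encoders as described; using the 7×7 region grid of ResNet-152 as the multiple visual representations per image is our assumption:

```python
import torch.nn as nn
import torchvision.models as models

class TextEncoder(nn.Module):
    """GRU textual encoder: word ids -> H_x of shape (|x|, d1)."""
    def __init__(self, vocab_size, emb_dim=512, hidden=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.gru = nn.GRU(emb_dim, hidden, batch_first=True)

    def forward(self, word_ids):                  # (batch, |x|)
        h, _ = self.gru(self.embed(word_ids))
        return h                                   # (batch, |x|, d1)

class VisualEncoder(nn.Module):
    """ResNet-152 visual encoder: images -> H_v of shape (|v|, d2).
    Each image yields a 7x7 grid of 2048-dim region features, i.e.
    49 visual representations per image."""
    def __init__(self):
        super().__init__()
        resnet = models.resnet152(pretrained=True)
        # Keep everything up to, but excluding, the pooling/fc head.
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])

    def forward(self, images):                     # (batch, 3, 224, 224)
        fmap = self.backbone(images)               # (batch, 2048, 7, 7)
        return fmap.flatten(2).transpose(1, 2)     # (batch, 49, 2048)
```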

Co-Attention Mechanism
We use a co-attention mechanism to capture the intrinsic interaction between the visual content and the textual content. The two modalities are connected by calculating the similarity matrix $S \in \mathbb{R}^{|v| \times |x|}$ between $H^v$ and $H^x$. Formally,
$$S = H^v W (H^x)^\top,$$
where $W \in \mathbb{R}^{d_2 \times d_1}$ is a trainable matrix and $S_{ij}$ denotes the similarity between the $i$-th visual representation and the $j$-th textual representation. $S$ is normalized row-wise to produce the vision-to-text attention weights $A^x$, and column-wise to produce the text-to-vision attention weights $A^v$:
$$A^x = \mathrm{softmax}(S), \qquad A^v = \mathrm{softmax}(S^\top),$$
where $\mathrm{softmax}(\cdot)$ denotes row-wise normalization. Hence we can obtain the vision-aware textual representations $\hat{H}^x \in \mathbb{R}^{|v| \times d_1}$ by:
$$\hat{H}^x = A^x H^x.$$
Similarly, the text-aware visual representations $\hat{H}^v \in \mathbb{R}^{|x| \times d_2}$ can be obtained by:
$$\hat{H}^v = A^v H^v.$$
Since $H^x$ and $H^v$ mutually guide each other's attention, the two sources of information can boost each other for better representations.
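A minimal PyTorch sketch of the co-attention above, operating on batched $H^v$ and $H^x$; the bilinear form with a single trainable $W$ follows the similarity matrix defined in the text:

```python
import torch
import torch.nn as nn

class CoAttention(nn.Module):
    """Computes S = H_v W H_x^T, then the vision-aware textual
    representations and the text-aware visual representations."""
    def __init__(self, d1, d2):
        super().__init__()
        self.W = nn.Parameter(torch.empty(d2, d1))
        nn.init.xavier_uniform_(self.W)

    def forward(self, H_v, H_x):
        # H_v: (batch, |v|, d2); H_x: (batch, |x|, d1)
        S = H_v @ self.W @ H_x.transpose(-1, -2)   # (batch, |v|, |x|)
        A_x = S.softmax(dim=-1)                     # vision-to-text
        A_v = S.transpose(-1, -2).softmax(dim=-1)   # text-to-vision
        H_x_hat = A_x @ H_x                         # (batch, |v|, d1)
        H_v_hat = A_v @ H_v                         # (batch, |x|, d2)
        return H_x_hat, H_v_hat
```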

Decoder
The decoder aims to generate the desired comment $y$ via another GRU model. Since there is information from multiple modalities, we equip the decoder with multiple attention mechanisms. The hidden state $g_{t+1}$ of the decoder at time-step $t+1$ is computed as:
$$g_{t+1} = \mathrm{GRU}(g_t, [e(y_t); c^x_t; c^v_t; \hat{c}^x_t; \hat{c}^v_t]),$$
where the semicolon represents vector concatenation, $y_t$ is the word generated at time-step $t$, and $c^x_t$ is obtained by attending to $H^x$ with $g_t$ as the query:
$$c^x_t = \mathcal{A}(g_t, H^x),$$
where $\mathcal{A}$ refers to the attention mechanism; readers can refer to Bahdanau et al. (2015) for the detailed approach. $c^v_t$, $\hat{c}^x_t$, and $\hat{c}^v_t$ are obtained in a similar manner by replacing $H^x$ in the equation above with $H^v$, $\hat{H}^x$, and $\hat{H}^v$, respectively. Finally, the decoder samples a word $y_{t+1}$ from the output probability distribution:
$$p(y_{t+1} \mid y_{\leq t}) = \mathrm{softmax}(U g_{t+1}),$$
where $U$ is a weight matrix. The model is trained by maximizing the log-likelihood of the ground truth $y^* = (y^*_1, \cdots, y^*_n)$, and the loss function is:
$$\mathcal{L} = -\sum_{t=1}^{n} \log p(y^*_t \mid y^*_{<t}),$$
where $y^*_{<t}$ denotes the sequence $(y^*_1, \cdots, y^*_{t-1})$.
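For concreteness, here is a sketch of one Bahdanau-style attention component $\mathcal{A}$ used to build the context vectors; the decoder step would concatenate $e(y_t)$ with the four context vectors produced this way over $H^x$, $H^v$, $\hat{H}^x$, and $\hat{H}^v$ and feed the result to a GRU cell (layer sizes are illustrative):

```python
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    """A(query, memory): additive attention (Bahdanau et al., 2015)."""
    def __init__(self, query_dim, mem_dim, attn_dim=512):
        super().__init__()
        self.q_proj = nn.Linear(query_dim, attn_dim)
        self.m_proj = nn.Linear(mem_dim, attn_dim)
        self.v = nn.Linear(attn_dim, 1)

    def forward(self, query, memory):
        # query: (batch, query_dim); memory: (batch, len, mem_dim)
        scores = self.v(torch.tanh(
            self.q_proj(query).unsqueeze(1) + self.m_proj(memory)
        )).squeeze(-1)                              # (batch, len)
        weights = scores.softmax(dim=-1)
        # Weighted sum of the memory: the context vector c_t.
        return weights.unsqueeze(1).matmul(memory).squeeze(1)
```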

Extension to Transformer
We also extend our approach to the Transformer (Vaswani et al., 2017). In detail, we adopt self-attention to implement the textual encoder, where the representation of each word can be written as
$$h^x_i = \mathrm{MultiHead}(x_i, x, x),$$
which means that the multi-head attention component attends to the text $x$ with the query $x_i$. We refer readers to Vaswani et al. (2017) for the details of self-attention. The decoder is also implemented with the self-attention mechanism. More specifically, the hidden state of the decoder at time-step $t$ is calculated with five multi-head attention components inside the decoder, which use $y_t$ as the query to attend to $y$, $H^x$, $H^v$, $\hat{H}^x$, and $\hat{H}^v$, respectively.
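A sketch of one such decoder block with the five multi-head attention components; how the cross-attention outputs are combined (summed residuals here) and the omission of layer normalization are our simplifications, not details from the paper:

```python
import torch.nn as nn

class CrossModalDecoderBlock(nn.Module):
    """Self-attention over y plus cross-attention to H_x, H_v,
    H_x_hat, and H_v_hat (five attention components in total)."""
    def __init__(self, d_model=512, n_heads=8, ffn_dim=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads)
        self.cross_attns = nn.ModuleList(
            nn.MultiheadAttention(d_model, n_heads) for _ in range(4)
        )
        self.ffn = nn.Sequential(
            nn.Linear(d_model, ffn_dim), nn.ReLU(),
            nn.Linear(ffn_dim, d_model),
        )

    def forward(self, y, memories, y_mask=None):
        # y: (|y|, batch, d_model); memories: the four encoder outputs,
        # each of shape (len, batch, d_model).
        out = y + self.self_attn(y, y, y, attn_mask=y_mask)[0]
        for attn, mem in zip(self.cross_attns, memories):
            out = out + attn(out, mem, mem)[0]
        return out + self.ffn(out)
```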

Settings
The batch size is 64 and the vocabulary size is 15,000. The 512-dimensional embeddings are learned from scratch. The visual encoder is ResNet-152 (He et al., 2016a) pretrained on ImageNet. For the Seq2Seq version of our approach, both the textual encoder and the decoder are 2-layer GRUs with hidden size 512. For the Transformer version, we set the hidden size of the multi-head attention to 512 and the hidden size of the feed-forward layer to 2,048. The number of heads is set to 8, and a Transformer layer consists of 6 blocks. We use the Adam optimizer (Kingma and Ba, 2015) with learning rate $10^{-3}$ and apply dropout (Srivastava et al., 2014) to avoid over-fitting.
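The reported hyperparameters, wired into a minimal optimizer setup; the dropout rate itself is not given in the paper, so the value below is an assumption:

```python
import torch
import torch.nn as nn

# Seq2Seq version: 2-layer GRU with hidden size 512; the inter-layer
# dropout rate of 0.3 is assumed, not reported.
encoder = nn.GRU(input_size=512, hidden_size=512,
                 num_layers=2, dropout=0.3, batch_first=True)
optimizer = torch.optim.Adam(encoder.parameters(), lr=1e-3)

# Transformer version (sizes as reported).
transformer_cfg = dict(d_model=512, ffn_dim=2048, heads=8,
                       blocks=6, batch_size=64, vocab_size=15000)
```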

Baselines
We adopt the following competitive baselines. Seq2Seq: we implement a series of baselines based on Seq2Seq. S2S-V (Vinyals et al., 2015) encodes only images via a CNN as input. S2S-T (Bahdanau et al., 2015) is the standard Seq2Seq that encodes only texts as input. S2S-VT (Venugopalan et al., 2015) adopts two encoders to encode images and texts, respectively. Transformer: we replace the Seq2Seq in the above baselines with the Transformer (Vaswani et al., 2017); the corresponding models are named Trans-V, Trans-T, and Trans-VT, respectively.

Evaluation Metrics
We adopt two kinds of evaluation methods: automatic evaluation and human evaluation.
Automatic evaluation: We use BLEU (Papineni et al., 2002) and ROUGE (Lin, 2004) to evaluate overlap between outputs and references. We also calculate the number of distinct n-grams (Li et al., 2016) in outputs to measure diversity.
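The distinct n-gram (Dist-n) diversity measure can be computed as follows; this is the standard formulation from Li et al. (2016), not code from the paper:

```python
def distinct_n(outputs, n):
    # outputs: list of token lists; returns the ratio of unique
    # n-grams to the total number of n-grams across all outputs.
    ngrams = [
        tuple(tokens[i:i + n])
        for tokens in outputs
        for i in range(len(tokens) - n + 1)
    ]
    return len(set(ngrams)) / max(len(ngrams), 1)
```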
Human evaluation: Three annotators score the 200 outputs of different systems on a scale from 1 to 10. The evaluation criteria are as follows. Fluency measures whether the comment is fluent. Relevance evaluates the relevance between the output and the input. Informativeness measures the amount of useful information contained in the output. Overall is a comprehensive metric. For each metric, the average Pearson correlation coefficient among annotators is greater than 0.6, indicating that the human scores are highly consistent.
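A sketch of how the reported inter-annotator agreement could be computed, averaging the pairwise Pearson correlation over the three annotators' scores; the exact aggregation used in the paper is not specified:

```python
from itertools import combinations
from scipy.stats import pearsonr

def mean_pairwise_pearson(annotator_scores):
    # annotator_scores: one list of scores per annotator, aligned on
    # the same outputs; returns the average pairwise Pearson r.
    rs = [pearsonr(a, b)[0]
          for a, b in combinations(annotator_scores, 2)]
    return sum(rs) / len(rs)
```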

Experimental Results
Table 3 and Table 4 show the results of automatic evaluation and human evaluation, respectively. We perform analysis from the following aspects.

The effectiveness of co-attention Both Table 3 and Table 4 show that our model substantially outperforms the competitive baselines on all metrics. For instance, the Transformer version of our approach achieves a 13% relative improvement in BLEU-1 over Trans-VT. This illustrates that our co-attention contributes to generating high-quality comments. The co-attention mechanism brings bidirectional interactions between visual and textual information, so that the two information sources can mutually boost each other for better representations, leading to improved performance.
The universality of co-attention Results show that both the Seq2Seq and Transformer versions of our approach outperform the various baselines built on the same architecture. This shows that our co-attention generalizes well and can be applied to various model architectures.
The contribution of visual content According to Table 3 and Table 4, although the images contribute less than the texts to generating high-quality comments, they still have a positive impact on generation. This illustrates that visual content contains additional useful information, which facilitates the generation of informative comments. Integrating multimodal information is therefore necessary for generating high-quality comments, which is also an important value of our work.

Related Work
In summary, this paper is mainly related to the following two lines of work.
Automatic article commenting. One task similar to CMAC is automatic article commenting. Qin et al. (2018) are the first to propose this task and construct a large-scale dataset. Lin et al. (2018) propose retrieving information from user-generated data to facilitate the generation of comments. Furthermore, subsequent work introduces a retrieval-based unsupervised model to perform generation from unpaired data. However, unlike article commenting, which only requires extracting textual information for generation, the CMAC task involves not only modeling textual features but also understanding visual images, which poses a greater challenge to intelligent systems.
Co-attention. We are also inspired by related work on the co-attention mechanism. Lu et al. (2016a) introduce a hierarchical co-attention model for visual question answering that jointly attends to images and questions. Xiong et al. (2017) propose a dynamic co-attention network for question answering, and Seo et al. (2017) present a bi-directional attention network to acquire query-aware context representations in machine comprehension. Tay et al. (2018a) propose a co-attention mechanism based on Hermitian products for asymmetrical text matching problems. Zhong et al. (2019) further present a coarse-grain fine-grain co-attention network that combines information from evidence across multiple documents for question answering. In addition, the co-attention mechanism has also been applied to word sense disambiguation (Luo et al., 2018), recommender systems (Tay et al., 2018b), and essay scoring (Zhang and Litman, 2018).

Conclusion
In this paper, we propose the task of cross-modal automatic commenting (CMAC), which aims at enabling an AI agent to make comments by integrating content from multiple modalities. We construct a large-scale dataset for this task and implement a number of representative neural models. Furthermore, we present an effective co-attention model to capture the intrinsic interactions between contents of multiple modalities. Experimental results show that our approach substantially outperforms various competitive baselines. Further analysis demonstrates that with multimodal information and co-attention, the generated comments are more diverse and informative.