Understanding Social Media Cross-Modality Discourse in Linguistic Space

Multimedia communications combining texts and images are popular on social media. However, few studies concern how images are structured with texts to form coherent meanings in human cognition. To fill this gap, we present a novel concept of cross-modality discourse, reflecting how human readers couple image and text understandings. Text descriptions (named subtitles) are first derived from images in multimedia contexts. Five labels -- entity-level insertion, projection, and concretization, and scene-level restatement and extension -- are further employed to shape the structure of subtitles and texts and present their joint meanings. As a pilot study, we also build the very first dataset containing 16K multimedia tweets with manually annotated discourse labels. The experimental results show that a multimedia encoder based on multi-head attention with captions is able to obtain state-of-the-art results.


Introduction
The growing popularity of multimedia is revolutionizing communications on social media. The conventional text-only form has been expanded to cross-modality information exchange involving texts and images. For multimedia messages, language understanding requires more than making sense of both visual and textual semantics; it also matters to figure out what glues them together to exhibit coherent senses in the human mind.
Nevertheless, most progress made in social media language understanding relies on texts to learn message-level semantics (Shen et al., 2018; Nguyen et al., 2020), largely ignoring the rich meanings conveyed in images (Cai et al., 2019a; Wang et al., 2020b). Other recent multimodal studies focus on model designs to combine visual and textual signals (Park et al., 2019; Li et al., 2020; Yu et al., 2021), ignoring the insights from how humans understand the implicit structure underlying a multimedia post.
In light of these concerns, we consider images as an integral part of social media language and propose a novel concept of cross-modality discourse, which defines how human readers structure coherent meanings from the image and text modalities. Our work is inspired by Vempala and Preotiuc-Pietro (2019), who examine the information overlap between images and texts, whereas we take a step further to characterize how multimedia messages make sense to humans, which goes beyond a simple yes-or-no prediction of whether something new is observed. To the best of our knowledge, we are the first to extend discourse -- a purely linguistic concept -- to define the linguistic roles played by images and their pragmatic relations with texts in shaping coherent meanings.
In general, cross-modality discourse is defined by the operations adopted in human perception to couple image and text semantics. Readers may first extract information from the acquired images to complete the cross-modality understanding, either in the form of local objects (entities) or global scenes (Rayner, 2009). Then, the extracted entities or scenes are represented in texts, named the images' subtitles, which further help structure entity-level or scene-level discourse with the matching texts in the multimedia contexts. Concretely, entity-level discourse is detailed into insertion, projection, and concretization, according to whether the entity is omitted, described, or mapped; similarly, scene-level restatement and extension are employed to reflect whether the story in one modality recurs or continues in the other.
To illustrate the definitions above, Figure 1 shows five multimedia Twitter posts. As can be seen from (a), readers may concentrate on the object "strawberry" and insert its name into the texts omitting the entity. As for (b), the "coffee" object should be extracted from the image to concretize the word "coffee" in the text. In (c), the word "court" in the text is linked with the "gavel" object.
The image in (d) helps restate the text's scene (a dog holds hands in the car). In (e), the global scene works as an extension to the texts and completes the story: "We are one step closer to summer following the track towards beach.". By contrast, the image-text relations in Vempala and Preotiuc-Pietro (2019) are limited to whether images add new meanings to texts, which is insufficient to reflect how language is understood in multimedia contexts.
As a pilot study of cross-modality discourse, we also present the very first dataset to explore the task. It is collected from Twitter and contains 16K high-quality multimedia posts with manual annotations of their discourse labels. We believe our task and the associated dataset, being the first of their kind, will be potentially beneficial in helping machines gain the ability to understand social media language with multimodal elements.
To that end, we present a framework to learn the discourse structure across texts and images. Inspired by recent advances in multimodal learning (Wang et al., 2020b; Yu et al., 2020), we employ the multi-head attention mechanism (Vaswani et al., 2017) to explore visual-textual representations reflecting cross-modality interactions. Besides, to characterize subtitles for discourse learning, image captions generated by a model trained on the COCO captioning dataset (Lin et al., 2014b) are leveraged as additional features.
For empirical studies on cross-modality discourse, we conduct comprehensive experiments on our dataset. The comparison results on classification show that it is challenging for machines to infer discourse structure, and it is beyond the capability of advanced multimodal encoders to handle our task well. Nevertheless, exploring correlations of texts, captions, and visual-textual interactions helps exhibit state-of-the-art performance in both the intra-class and overall evaluation. We further examine the effects of varying modalities and text lengths and find that text signals are crucial for discourse inference, while the joint effects of texts, images, and captions present the best results. At last, the qualitative analysis demonstrates how the multi-head attention in our model interprets discourse structure.

Related Work
Our paper crosses the lines of multimedia learning and discourse analysis in natural language processing. More details are given below.
Multimedia Learning. It is seen that most progress made to date in this line focuses on advancing methodology designs for general purposes (Su et al., 2020; Zhou et al., 2020) or specific applications (Wang et al., 2020b) to better capture the matched semantics across varying modalities. However, their effectiveness on social media data would inevitably be compromised as a result of the intricate image-text interactions (Vempala and Preotiuc-Pietro, 2019). We thus borrow insights from human perception to interpret image-text relations from a linguistic viewpoint and propose the task of learning discourse structure in multimedia contexts. It is fundamental research exhibiting the potential to help models gather cross-modality understanding capability, and it might benefit various downstream applications.
Our work also relates to previous categorization tasks on social media that aim to understand image-text relations, such as information overlap (Vempala and Preotiuc-Pietro, 2019), point-of-interest types (Villegas and Aletras, 2021), author purposes (Kruk et al., 2019), object possessions (Chinnappa et al., 2019), and so forth. Besides, interestingly, the "discourse" concept is also employed to examine the image-text relations in cooking recipes (Alikhani et al., 2019). Compared with these studies concatenating visual and textual embeddings in a "common" space, we craft text-formed subtitles to convey visual stories and explore how they shape coherent meanings with the post texts in linguistic space. This allows deep semantic learning to capture the implicit structure holding the image and text modalities together, whereas the existing models might be incapable of gathering senses of language understanding via simple feature concatenation.
Discourse Analysis. This work is related to prior studies on text-level discourse structures. The popular tasks in the styles of either RST (Rhetorical Structure Theory) (Mann and Thompson, 1988; Liu et al., 2019) or PDTB (Penn Discourse Tree Bank) (Prasad et al., 2008; Xu et al., 2018) explore the rhetorical relations of discourse units (e.g., phrases or sentences) that cohesively connect them to form a sense of coherence. These studies have demonstrated their helpfulness in diverse streams of NLP applications (Choubey et al., 2020), such as sentiment analysis (Bhatia et al., 2015), text categorization (Ji and Smith, 2017), and microblog summarization (Li et al., 2018). Nevertheless, limited work examines a social media image as a discourse unit of the pragmatic structure in multimedia contexts, which is a gap to be filled in this work.

Study Design
In this section, we first define the task to predict cross-modality discourse in §3.1. Then, we introduce how we construct the dataset in §3.2, followed by the data analysis in §3.3 and the potential applications in §3.4.

Task Definition
In our task, the input is an image-text pair from a multimedia post on social media, following previous practice (Vempala and Preotiuc-Pietro, 2019). For each pair, the goal is to output a label from a predefined set that covers the major categories of cross-modality discourse on social media. Our intuition is that images are relatively more eye-catching and likely to be processed before the texts. For image understanding, previous findings from psychological experiments (Rayner, 2009) point out that humans may first recognize and extract meanings from global scenes to fill the information gap in contexts; if the gap still exists, they may go back to capture the local objects. Based on that, we first coarsely categorize the discourse label set into the entity (object) level and the scene level, depending on whether an object or a scene is extracted to make sense of the joint meanings of images and texts.
To further elaborate the label design, the extracted information from an image (as an object or scene) is mapped to the text modality to form the subtitle, which allows us to formulate how human senses structure the coherent meaning with subtitles and post texts.
For entity-level discourse, three cases are examined: the entity is omitted, mentioned, or linked in the texts. For an absent entity (e.g., Fig. 1(a)), the subtitle, in the form of an entity name, should be inserted into the post text to complete the meaning of the message, while the entity in Fig. 1(b) is concretized by the object in the image. And the entity in Fig. 1(c) is implicitly projected onto the relevant object. We henceforth design entity-level insertion, concretization, and projection to describe the above three cases, respectively.
Similarly, scene-level discourse can be separated into the restatement and extension categories. The former refers to the image serving as a description of the texts (e.g., Fig. 1(d)); for the latter, posts present image scenes to elaborate the story left as white space in the texts (e.g., Fig. 1(e)).
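The two-level label taxonomy defined above can be summarized in a small illustrative structure (names only; a sketch for clarity, not part of the released dataset):

```python
# Cross-modality discourse labels: two coarse levels, five fine labels.
DISCOURSE_LABELS = {
    "entity": ["insertion", "concretization", "projection"],  # local objects
    "scene": ["restatement", "extension"],                    # global scenes
}

# Flatten into the five-way classification target set.
ALL_LABELS = [lab for labs in DISCOURSE_LABELS.values() for lab in labs]
print(ALL_LABELS)
```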

Data Collection and Annotation
Our dataset is gathered from Twitter, which is drawing attention in research on digital communications (Mozafari et al., 2019; Nikolov and Radivchev, 2019; Müller et al., 2020) and exhibits prominent use of multimedia posts (Vempala and Preotiuc-Pietro, 2019; Wang et al., 2020b). We first crawled the raw data using the Twitter streaming API and removed non-English posts and those with texts only or multiple images. Afterwards, to better model discourse from the noisy Twitter data (Vempala and Preotiuc-Pietro, 2019), we removed samples that might hinder the learning of non-trivial discourse signals. Here, four types of "bad" image-text pairs might introduce tremendous noise into the learning, as shown in Fig. 2.
The first type refers to image portraits with quotes sharing insights about life (henceforth portraits), where images and texts are not coherently related (from a linguistic viewpoint) and discourse structure cannot be defined for them. Moreover, many of them contain authors' selfies, which might raise privacy concerns. The second type of post, namely background, relies on external knowledge to capture the meanings (e.g., Fig. 2(b)), which is beyond the capability of language understanding given only the images and the matching texts. For the third, we consider low-quality images (e.g., low-resolution and blurred ones like Fig. 2(c)), from which it is hard to capture the visual meanings. The last one refers to OCR subtitles (Fig. 2(d)), where the subtitles appear in the images as optical characters. This may result in a degeneration of cross-modality discourse into text-level discourse and render the learning of trivial features.
In the data annotation, we first selected 25 typical examples for each discourse label and provided them together with the annotation guidelines (with a detailed description of each label) for quality control. Then, two postgraduate students majoring in linguistics were recruited to manually label the discourse categories given an image-text pair. "Bad" samples falling into the above four types were also indicated in the annotation process. The inter-annotator agreement is 79.8%, and we only kept the data with labels agreed by both annotators to ensure the feature-learning quality on noisy data. At last, posts of "bad" types were removed, and the final dataset presents 16K multimedia tweets with manual labels in five discourse categories.
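The agreement-based filtering described above can be sketched as follows (a minimal sketch with toy stand-in annotations; the real 79.8% agreement is computed over the full annotation, and the label values here are illustrative):

```python
# Toy annotations from two annotators over the same image-text pairs.
ann_a = ["insertion", "concretization", "projection", "bad", "extension"]
ann_b = ["insertion", "concretization", "restatement", "bad", "extension"]

# Raw inter-annotator agreement: fraction of pairs with identical labels.
agreed = [a for a, b in zip(ann_a, ann_b) if a == b]
agreement = len(agreed) / len(ann_a)

# Keep only agreed samples, then drop the "bad" types from the dataset.
kept = [lab for lab in agreed if lab != "bad"]
print(agreement, kept)
```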

Data Analysis
Here we conduct a preliminary analysis of our dataset and show the statistics in Table 1. There exhibit imbalanced labels, where the concretization and extension labels are relatively more popular than the other three. This indicates the diverse preferences of Twitter users in the way they choose to structure texts and images, and the potential challenge for models handling our task.
For the text length, it is seen that most tweets contain limited words, challenging models to capture essential features from textual signals. Interestingly, we compare our statistics with text-only Twitter datasets in previous work (Wang et al., 2019) and find that our multimedia tweets have 30% fewer words on average. This implies that authors may tend to put less content in the text of multimedia posts and place the missing information in the images as compensation. We also notice that insertion and extension discourse exhibit relatively shorter texts on average, probably because the content omitted in their texts is presented in the images.
To further characterize text length in our dataset, Fig. 3 shows the word number distribution of tweet texts with varying labels. All the curves demonstrate a sparse distribution over text length, owing to the free styles of social media writing. The insertion and extension curves first peak at 8 words while the others peak at 10-12 words; all present long tails afterwards. This again shows that texts in multimedia posts may provide limited content, and those with insertion and extension labels contain fewer words.

Potential Applications
In this subsection, we further discuss the potential downstream applications of our task and dataset, which might inspire the design of future work. A straightforward application is microblog summarization -- an important task to distill the salient content from massive social media data. As many state-of-the-art summarization models only allow textual input while multimedia posts are prominent on social media, it may require the compression of these posts into text for easy processing. This differs from the traditional image captioning task (Anderson et al., 2018; Rennie et al., 2017; Huang et al., 2019), where the generated captions are translated from images. For a social media post, the text cannot trivially be seen as a "translation" of the image, because of the possibly ambiguous image-text interactions therein. Considering the crucial roles played by discourse analysis in summarization (Xu et al., 2020), it is not hard to envision that our cross-modality discourse, describing how image and text structure coherence, would contribute to the research of multimedia summarization. In addition, cross-modality discourse can be viewed as a fundamental task and might be helpful to other downstream tasks on social media (e.g., multimodal NER (Yu et al., 2020), multimodal crisis event classification (Abavisani et al., 2020a), multimodal sarcasm detection (Cai et al., 2019b), multimodal sentiment analysis (Truong and Lauw, 2019), and multimodal hashtag prediction (Wang et al., 2020c)). However, most previous efforts focus on leveraging visual and lingual representations yet ignore the linguistic essence that glues the two modalities. Recently, some work proposes multitask learning to consider image-text relations in multimodal learning. For example, Sun et al. (2021) investigate the relation propagation between text and image to improve the accuracy of NER in tweets. Ju et al. (2021) utilize multimodal relation types as auxiliary labels to explore multimodal aspect-sentiment analysis. The positive results from these studies imply the potential of cross-modality discourse (as a linguistic description of image-text relations) to benefit a wide range of multimodal applications. Besides, the training data of image-text relations used by Sun et al. (2021) and Ju et al. (2021) is the TRC dataset proposed by Vempala and Preotiuc-Pietro (2019). Compared to the TRC dataset, our proposed discourse dataset exhibits a tremendously larger scale (i.e., 16K vs. 4.5K) and fine-grained labels for image-text relations, as shown in Fig. 1. We therefore believe our dataset would also help advance the performance of various multimodal models.

The Discourse Learning Framework
In this section, we describe our framework that couples the signals from images and texts to predict their discourse labels. As shown in Fig. 4, the model architecture leverages representations learned from texts, images, and image captions (to reflect subtitles), which will be introduced in §4.1. Then, we discuss how we combine the multi-modality representations in §4.2. At last, §4.3 presents how we predict the discourse labels and design the training process.

Image Encoding. The input image is encoded with ResNet-101 (He et al., 2016) pre-trained on ImageNet (Russakovsky et al., 2015). The output of the last convolutional layer in ResNet-101 is extracted as the representation of the input image. The size of the feature map is first reduced to M × M × 2048 and then reshaped into M² × 2048. Each 1 × 2048 vector represents the visual features in a corresponding image area and is projected to the same dimension as the text feature h by a linear layer. The post-level visual feature is denoted as H_img = (v_1, ..., v_{M²}), where v_i refers to a 1 × 2048 vector that represents the feature of an area in the image.
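The reshape-and-project step for the image features described above can be sketched as follows (a minimal NumPy sketch; M = 14 follows the later experimental setup, while the 768-dimensional text hidden size and the random weights are illustrative stand-ins for learned parameters):

```python
import numpy as np

M, feat_dim, hidden = 14, 2048, 768  # hidden size 768 is an assumption
rng = np.random.default_rng(0)

# Stand-in for the last conv feature map of ResNet-101: M x M x 2048.
feature_map = rng.standard_normal((M, M, feat_dim))

# Reshape into M^2 region vectors, each of size 2048.
regions = feature_map.reshape(M * M, feat_dim)

# Project each region to the text feature dimension with a linear layer
# (W and b are random stand-ins for the learned projection).
W = rng.standard_normal((feat_dim, hidden)) * 0.01
b = np.zeros(hidden)
H_img = regions @ W + b  # post-level visual feature: M^2 x hidden

print(H_img.shape)
```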
Image Caption Encoding. In order to capture more semantic information from images, we further exploit image captions (henceforth captions) as an additional modality. Our intuition is that captions may inject essential visual semantics underlying images into descriptive language in texts (Xu et al., 2015). They are potentially helpful to reflect the rich interactions between image objects and discover subtitle-style clues as essential discourse indicators. We first employ the model presented by Anderson et al. (2018) to predict the caption of each image. The captioning model is pre-trained on the COCO captioning dataset (Lin et al., 2014b), which mostly consists of natural pictures outside the social media domain. Then, we encode the token sequence of captions following the same process as text encoding and yield the caption representation H_cap = (h_1, ..., h_N), where N indicates the number of tokens in the caption and h_i refers to the i-th hidden state of the BERTweet encoder.

Integrating Multimodal Representations
As pointed out in previous work (Wang et al., 2020b), modalities in social media data exhibit much more intricate interactions compared with the widely-studied vision-language datasets (Lin et al., 2014a; Young et al., 2014). To allow the framework to attend to various types of cross-modality interactions, we employ multi-head attention (Vaswani et al., 2017) to comprehensively explore the interactions between the encoded image features (H_img) and the max-pooled text representation (H̃_text). Concretely, we set the text features as the query Q and the image features as the key and value K, V, and compute the multi-head attention MA(·) as MA(Q, K, V) = [head_1; ...; head_n] W_O, where n is the number of heads, [·] indicates the concatenation operation, and the attention of the j-th head is head_j = θ(Q W_j^Q (K W_j^K)^T / √d_k) V W_j^V. Here d_k is the normalization factor and θ(·) denotes the softmax function. W_O, W_j^Q, W_j^K, and W_j^V are learnable variables. The attended image features (aware of the texts) are denoted as Ĥ_img, which further serve as the context to help explore the discourse clues from captions and texts.
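The text-query / image-key-value attention above can be sketched as follows (a minimal NumPy sketch with 6 heads as in the later setup; the hidden size 96 and the random weight matrices are toy stand-ins for learned parameters):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(Q, K, V, weights, n_heads):
    """Multi-head attention as in Vaswani et al. (2017): each head computes
    softmax(Q W_j^Q (K W_j^K)^T / sqrt(d_k)) V W_j^V; the concatenated
    heads are then projected by W_O."""
    heads = []
    for j in range(n_heads):
        q = Q @ weights["Wq"][j]
        k = K @ weights["Wk"][j]
        v = V @ weights["Wv"][j]
        d_k = q.shape[-1]
        attn = softmax(q @ k.T / np.sqrt(d_k))  # attention over image regions
        heads.append(attn @ v)
    return np.concatenate(heads, axis=-1) @ weights["Wo"]

# Toy setup: the max-pooled text vector is the query; the 196 image
# region features act as keys and values.
rng = np.random.default_rng(0)
d, n_heads = 96, 6
d_head = d // n_heads
weights = {
    "Wq": [rng.standard_normal((d, d_head)) * 0.1 for _ in range(n_heads)],
    "Wk": [rng.standard_normal((d, d_head)) * 0.1 for _ in range(n_heads)],
    "Wv": [rng.standard_normal((d, d_head)) * 0.1 for _ in range(n_heads)],
    "Wo": rng.standard_normal((d, d)) * 0.1,
}
H_text = rng.standard_normal((1, d))   # max-pooled text feature (query)
H_img = rng.standard_normal((196, d))  # image region features (key/value)
H_img_attended = multi_head_attention(H_text, H_img, H_img, weights, n_heads)
print(H_img_attended.shape)
```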
For discourse modeling, the encoded texts (H̃_text) are compared with the captions (carrying subtitle-style features) to infer how the subtitles can be structured with the texts. To that end, we first employ a multi-head attention mechanism to encode the text-aware attended captions Ĥ_cap, which capture salient contents from the captions to indicate discourse categories. Furthermore, Ĥ_cap is concatenated with H̃_text to model their structure; also concatenated are the attended image features Ĥ_img as the image-text interaction contexts for cross-modality discourse learning.

Discourse Prediction and Model Training
The discourse labels are predicted with a multi-layer perceptron (MLP) fed with the integrated feature vector H = [Ĥ_cap; H̃_text; Ĥ_img], which is further activated with a softmax function to predict the likelihood over the five discourse labels. For training, recall from Table 1 that we observe severe label imbalance in our task. To deal with the issue, we adopt a weighted cross-entropy loss, whose weights are set by the proportions of labels in the training data.
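The weighted cross-entropy objective can be sketched as follows (a NumPy sketch; inverse-frequency weighting is one common instantiation of proportion-based weights, and the non-Concretization label proportions below are illustrative assumptions, with only Insertion 5.2%, Concretization 66.0%, and Projection 4.3% taken from the dataset statistics):

```python
import numpy as np

def weighted_cross_entropy(logits, labels, class_weights):
    """Softmax the MLP logits, then weight each sample's negative
    log-likelihood by its class weight so that rare discourse labels
    contribute more to the loss."""
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    nll = -np.log(probs[np.arange(len(labels)), labels])
    w = class_weights[labels]
    return (w * nll).sum() / w.sum()

# Label proportions (Restatement/Extension values are hypothetical).
proportions = np.array([0.052, 0.660, 0.043, 0.090, 0.155])
class_weights = 1.0 / proportions  # one possible proportion-based scheme

logits = np.array([[2.0, 0.1, 0.1, 0.1, 0.1],
                   [0.1, 2.0, 0.1, 0.1, 0.1]])
labels = np.array([0, 1])  # gold labels: insertion, concretization
loss = weighted_cross_entropy(logits, labels, class_weights)
print(float(loss))
```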

Experimental Setup
Model Settings. The lengths of tweet texts (L) and captions (N) are both capped at 20 by truncation. The batch size is set to 100 and the learning rate to 5 × 10^-5. The head number of all multi-head attention layers is set to 6.

Baselines and Comparisons. We first consider two text-level discourse parsers proposed by Qin et al. (2016) and Rutherford and Xue (2016), where we extend their text encoders into multimodal encoders to fit the image-text pairs. Then, we compare with a popular multimodal classifier (Nam et al., 2017) that employs a dual attention network to fuse visual and textual features. Besides, we evaluate varying sets of feature combinations in our model: Text + Image, Text + Caption, and Text + Image + Caption (the full set). Recall that our framework employs multi-head attention to integrate features learned from different modalities. In experiments, we also test the performance of other modality fusion alternatives based on simple feature concatenation (CONCATFUSE), the conventional attention mechanism (ATTENTION), and the co-attention mechanism (CO-ATTENTION).

Experimental Discussions
This section first presents the main comparison results (§6.1). Then, we discuss model sensitivity to varying modalities and text lengths in §6.2. Finally, §6.3 presents a case study to provide more insights.

Main Comparison Results
Table 2 shows the main comparison results of various multimodal encoders. The following observations can be drawn.
First, none of the models exhibit good F1 scores. This indicates that cross-modality discourse prediction is a challenging task: a good understanding cannot be gained by trivially adapting discourse parsers to multimodal settings or by applying existing vision-language encoders. Second, results on the two entity-level discourse labels (i.e., insertion and concretization) are relatively better than the scene-level ones, indicating that local objects are easier to capture than global scenes. Among all the labels, models perform best on concretization, probably attributable to its richer data samples for feature learning (as shown in Table 1), and obtain the worst results on projection. The reason might be that additional knowledge is needed for models to learn the implicit relation between the object and the entity.
Last, images, texts, and captions all contribute to building automatic discourse understanding. Joint modeling of the three modalities enables the corresponding models to outperform their text+image and text+caption counterparts.

Sensitivity to Modalities and Text Length
Varying Modalities.To further examine the effects of varying modalities, we compare the F1 scores of our full model with its caption-only, image-only, and text-only ablations in Fig. 5(a).
It is seen that the text modality contributes relatively more to discourse modeling across all labels, especially for insertion, where named entities are omitted, making the text style easy to recognize. Nevertheless, the joint effects of images, texts, and captions together present the best performance over all labels.
Varying Text Length. As discussed above, text features are crucial to predict cross-modality discourse. Here we further examine the effects of text length on model performance, and the results of our full model are shown in Fig. 5(b). Better scores are observed for longer texts, as richer contents can be captured. This again demonstrates the essential signals provided by texts to infer cross-modality discourse.

Qualitative Analysis
Discussions above mostly concern the caption and text modalities. Here we present a case study to probe how the model reflects discourse indicators over vision signals.
Case Study. Visual features are analyzed via the heatmap (in Fig. 6) visualizing the text-aware attention weights over images (Eq. 3), which are captured from image-text interactions. As can be seen, the attention is able to highlight salient regions that signal essential semantic links with the texts, e.g., the entities (dog and jeep) in (a) and (b). It is also observed that the attention varies in its regional focus: for entity-level discourse, it tends to concentrate on some parts of a salient object (entity), while for scene-level discourse, the attention also examines the background to capture the global view.

Conclusion
We have presented a novel task to learn cross-modality discourse that advances models in gaining social media language understanding capability in multimedia contexts. To handle the intricate image-text interactions, the visual semantics are first converted into text-formed subtitles and then compared with the post texts to explore deep syntactic relations in linguistic space. For empirical studies, we further contribute the first dataset presenting 16K human-annotated tweets with discourse labels for image-text pairs. The main comparison results on our dataset have shown the effectiveness of multi-head attention in exploring interactions among the text, image, and caption modalities. Further discussions demonstrate the potential of our model to produce meaningful representations indicating implicit image-text structure. These discourse features, conveying essential linguistic clues consistent with human senses, may largely benefit future advances of automatic cross-modality understanding on social media.

Limitations
Class imbalance is one of the main limitations of this work. As illustrated in Table 1, Concretization is the majority category, occupying 66.0% of the dataset, while the minority categories, e.g., Projection and Insertion, only account for 4.3% and 5.2%, respectively. Although such an uneven distribution reflects the real scenario of image-text relationships among tweets, future work should acquire larger amounts of the minority categories for better interpretation of image-text relationships.
Cross-lingual and multi-platform studies should also be considered in later work. It would be interesting and insightful to investigate the distribution of cross-modality discourse categories across different languages. Are there cultural traits that affect the use of image and text? Meanwhile, social media platforms can also exhibit preferences in image and text usage. For example, would users on Instagram be more likely than Twitter users to omit named entities (the Insertion category)?
More sophisticated models, e.g., vision-language Transformers, could also be employed to encode the text, caption, and image jointly. Our current model runs efficiently on a single NVIDIA RTX 3080Ti GPU, while training vision-language Transformers could be costly and require larger datasets. Future studies could explore the trade-off between computation cost and classification performance.

Ethical Considerations
We declare that our dataset causes no ethical problems. First, we follow the standard data acquisition process regulated by the Twitter API. We downloaded the data for the purpose of academic research, consistent with the Twitter terms of use. Then, we thoroughly navigated the data and ensured that no content would raise any ethical concerns, e.g., toxic language, human face images, and censored images. Next, we performed data anonymization to protect user privacy. For language use, we only keep posts with English text. For the human annotations, we recruited the annotators as part-time research assistants with a payment of 16 USD/hour.

Figure 1: The five cross-modality discourse labels and their examples. The rows from top to bottom display their texts, images, the image-text relation labels in Vempala and Preotiuc-Pietro (2019), and our cross-modality discourse categories. The labels in Vempala and Preotiuc-Pietro (2019) concern whether new meanings are added by images to texts, whereas ours define the linguistic roles of images and their pragmatic relations with texts for coherence.

Figure 2: Example tweets of the four "bad" types. (a) Portrait image with quotes in the texts. (b) Background knowledge is externally required for understanding (rocket trajectory scenes here). (c) Low-quality image where objects can barely be observed. (d) OCR subtitle ("Thought Of The Day") appears in the image in optical characters.

Figure 3: Text length (token number) distributions of posts with varying discourse labels.

Figure 4: Our framework to learn cross-modality discourse via representations encoded from texts (bottom), captions (upper left), and images (upper right). The encoded captions and texts are compared at the output layer in visual-textual contexts.

Figure 5: Full model performance compared with varying modality ablations in (a) and its results over varying text lengths in (b). X-axis: insertion, concretization, projection, restatement, and extension; Y-axis: F1 scores. For each label, bars from left to right show the caption-only, image-only, and text-only ablations and the full model in (a), and the tweet texts capped at 5, 10, 15, and 20 in (b).
[Figure 6 panel captions: … jeep wrangler sport 2014 sport used; (c) Projection: drilling equipment, T: european oil majors adapt to low oil; break even in 2017; (d) Restatement: moon behind a tree, T: moon rising behind a tree; (e) Extension: beautiful sky and trees with yellow leaves, T: fall in ohio.] Note: T indicates the tweet text. Illuminated areas indicate higher attention weights. Texts in red represent the image content.

Figure 6: Visualization of multi-head attention heatmaps over sample images.

Table 1: Statistics of the total data and of that with each label. Ins: Insertion; Con: Concretization; Pro: Projection; Res: Restatement; Ext: Extension. Len: average word number in texts; Num: tweet number.
Text Encoding. Here we describe how to learn text features. The text encoder is based on the bottom 6 layers of pre-trained BERTweet (Nguyen et al., 2020). It is fed with an L-length token sequence and embeds its representations into sequential hidden states H_text = (h_1, ..., h_L), where each element reflects a token embedding. H_text further goes through a max-pooling layer to produce H̃_text, representing the text.
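The max-pooling step over the token hidden states can be sketched as follows (NumPy stand-ins for the BERTweet hidden states; the 768-dimensional hidden size is an assumption):

```python
import numpy as np

# Toy hidden states from a BERTweet-style encoder: L tokens x hidden dims.
L, hidden = 20, 768  # hidden size 768 is an illustrative assumption
rng = np.random.default_rng(0)
H_text = rng.standard_normal((L, hidden))  # h_1, ..., h_L

# Element-wise max over the token axis yields one post-level vector.
H_text_pooled = H_text.max(axis=0)
print(H_text_pooled.shape)
```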

Table 2: Comparison results of the baselines and our variants. Scores with * represent significance tests of our full model over the baseline models with p-value < 0.05.

The image feature map size M is set to 14. For text and caption encoding, the representations are extracted from the bottom 6 layers of the BERTweet model, which are further fine-tuned in training. In the setup, we randomly split the data into 80%, 10%, and 10% for training, validation, and test. For evaluation, we report F1 scores for the prediction of each label and the weighted F1 to measure the overall results.
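The evaluation protocol above (per-label F1 plus a support-weighted overall F1) can be sketched as follows (a minimal NumPy implementation on toy predictions; labels are illustrative integer ids):

```python
import numpy as np

def f1_per_label(y_true, y_pred, labels):
    """Per-label F1 and weighted F1, with weights equal to label support."""
    scores, supports = [], []
    for lab in labels:
        tp = np.sum((y_pred == lab) & (y_true == lab))
        fp = np.sum((y_pred == lab) & (y_true != lab))
        fn = np.sum((y_pred != lab) & (y_true == lab))
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        scores.append(2 * p * r / (p + r) if p + r else 0.0)
        supports.append(np.sum(y_true == lab))
    weighted = np.average(scores, weights=supports)
    return scores, weighted

# Toy gold labels and predictions over three label ids.
y_true = np.array([0, 0, 1, 1, 1, 2])
y_pred = np.array([0, 1, 1, 1, 1, 2])
scores, weighted = f1_per_label(y_true, y_pred, labels=[0, 1, 2])
print(scores, weighted)
```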