STANKER: Stacking Network based on Level-grained Attention-masked BERT for Rumor Detection on Social Media

Rumor detection on social media puts pre-trained language models (LMs), such as BERT, and auxiliary features, such as comments, into use. However, on the one hand, Chinese rumor detection datasets with comments are rare; on the other hand, intensive interaction of attention in Transformer-based models like BERT may hinder performance improvement. To alleviate these problems, we build a new Chinese microblog dataset named Weibo20 by collecting posts and associated comments from Sina Weibo and propose a new ensemble named STANKER (Stacking neTwork bAsed-on atteNtion-masKed BERT). STANKER adopts two level-grained attention-masked BERT (LGAM-BERT) models as base encoders. Unlike the original BERT, our new LGAM-BERT model takes comments as important auxiliary features and masks co-attention between posts and comments on lower layers. Experiments on Weibo20 and three existing social media datasets showed that STANKER outperformed all compared models, notably beating the previous state of the art on the Weibo dataset.


Introduction
Social media such as Sina Weibo is an indispensable part of life, yet rumors can have severe consequences for political decision-making and can manipulate public opinion (Lazer et al., 2018). Moreover, evil-doers can create and spread rumors on social media conveniently, on a massive scale, and at a low cost (Ma et al., 2020), which motivates a text classification task called rumor detection.
For text classification tasks, including rumor detection, transformer-based models like Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2018) have achieved impressive results but show significant performance variation when fine-tuned on small datasets (Risch and Krestel, 2020). Thus researchers proposed ensembles of multiple BERT models (Risch and Krestel, 2020; Fajcik et al., 2019; Liu et al., 2019a) to provide more robust predictions, but a big ensemble size makes fine-tuning computationally expensive, because the training time and the inference time increase linearly with the ensemble size.
Moreover, the attention mechanism, which is a key component of the Transformer (Vaswani et al., 2017), makes the computational complexity scale quadratically with the input sequence length. It enables complex models to learn the contextual representation of a word by attending to other words. Encouragingly, a few studies indicated that not all attention is necessary (Gordon et al., 2020): partial attention can be pruned (Gordon et al., 2020; Michel et al., 2019) or masked (Liu et al., 2020; Yang et al., 2019) depending on specific tasks, because BERT learns different features at different levels.
Apart from the computation cost, the input length limitation is another obstacle in using pretrained language models (LMs) to detect rumors. The content of social media posts is text shorter than 140 words with rich auxiliary features, e.g., comments and user profiles. Among these features, comments are semantically relevant to a source post and support or deny the original claim (Wei et al., 2019; Bian et al., 2020). However, social media posts often have comments whose total length exceeds the input-length limitation of LMs, demanding pre-processing like truncation. Unfortunately, as the classical pre-processing for inputting long texts into LMs, truncation discards valuable information in the truncated part. Meanwhile, although Longformer (Beltagy et al., 2020) was proposed recently to tackle long input sequences, excessive attention interactions may degrade the overall performance.
To alleviate these problems, we propose STANKER (Stacking neTwork bAsed-on atteNtion-masKed BERT), which adopts two level-grained attention-masked BERT models as base encoders, stacked with a final dense prediction layer with softmax activation that maps the 768-dimensional vectors to two outputs for this binary classification. The contributions of this paper are as follows:
•We build Weibo20, the only Chinese social media rumor detection dataset collected in the most recent five years.
•We devise a new variant of BERT, Level-Grained Attention-Masked BERT (viz. LGAM-BERT), a level-grained technique targeted at linguistic noise, which masks insignificant co-attention between source posts and comments on the lower layers.
•To make full use of comments, we select relatively influential comments according to chronological order and a sentimental intensity ranking, thus producing two different training sets for base learners in the ensemble. Differences in training sets lead to the diversity of base learners, which contributes to the efficiency of the ensemble network.
Experimental results on four datasets show that STANKER outperforms existing methods. STANKER is the best approach for Weibo20, and its accuracy on Ma-Weibo (Ma et al., 2016) is higher than the old SOTA, Ma-RvNN (Ma et al., 2020). Furthermore, STANKER is the best of all compared methods on Twitter15 and Twitter16. Unlike previous ensemble models (Risch and Krestel, 2020; Liu et al., 2019a), the training cost of STANKER is low due to the minimal ensemble size.

Rumor Detection
A rumor is a statement whose authenticity is certified to be false or unverified (Difonzo and Bordia, 2007). Considering the tremendous number of Twitter and Weibo users, even a small improvement in rumor detection accuracy is valuable. Rumor detection, framed as a text classification task, can be tackled by either traditional machine learning approaches (Vicario et al., 2019; Gravanis et al., 2019) or deep neural networks (Meel and Vishwakarma, 2020), and comments or replies, as auxiliary features, are widely used.
Recent deep-learning-based studies include: Wang embedded source posts and comments with sentimental features and then fed them into a two-layer Gated Recurrent Unit (GRU) network (Wang and Guo, 2020); Kumar applied a tree LSTM to predict rumors with tree-structured replies (Kumar and Carley, 2019); Bian fed posts and replies into a Graph Convolution Network (GCN) to take advantage of propagation features, and later extended GCN to a Bi-directional GCN (viz. Bi-GCN) to explore the structures of wide dispersion on rumor detection (Bian et al., 2020); Zhang encoded replies in temporal order through an LSTM component; Riedel used the cosine similarity of news content and comments, setting a similarity threshold to filter out irrelevant comments (Riedel et al., 2017); Lu put user profiles into GCNs to extract propagation features (Lu and Li, 2020). Discouragingly, the only available dataset for Chinese social media rumor detection, Ma-Weibo (Ma et al., 2015), was collected five years ago, unlike similar tasks whose unique datasets were recently proposed (Wang et al., 2021; Mathew et al., 2021; Ana-Cristina et al., 2021).

Ensemble Strategy
An ensemble strategy can achieve better performance than a single model; the diversity of base learners is also crucial (Zhou, 2012). All three types of ensembling algorithms (bagging, boosting, and stacking) improve performance, and recent studies building on BERT further showed their advantages.
Bagging. Risch proposed an ensemble of multiple fine-tuned BERT models based on bagging and found that the F1-score drastically increased when ensembling up to 15 models, but the returns diminished for more models (Risch and Krestel, 2020).
Boosting. Sharma recognized question entailment using two Sci-BERT models, stacked with a gradient boosting classifier (Sharma and Roychowdhury, 2019). Huang integrated multi-class boosting into BERT and used Transformer as the base classifier to choose more challenging training sets to fine-tune NLP tasks.
Stacking. Stacking algorithms were proposed to accelerate BERT training via transferring knowledge (Gong et al., 2019). Liu proposed an architecture by blending 25 BERT models (Liu et al., 2019a). Wu combined feature engineering and an ensemble stacked with SVM, Random Forest, and Naive Bayes (Wu et al., 2020).

Attention Mechanism
The self-attention mechanism is central and indispensable to SOTA Transformer models, including BERT (Vaswani et al., 2017), but not all attention is necessary: Gong found that in most layers, the self-attention distribution concentrates locally around a token's position and the start-of-sentence token (Gong et al., 2019); Jawahar showed that BERT captures a rich hierarchy of linguistic information, with surface features (e.g., the presence of words in the sentence) in lower layers, syntactic features (e.g., the sensitivity to word order) in middle layers, and semantic features (e.g., the tense) in higher layers (Jawahar et al., 2019).
Thus, attention masking or pruning methods have been proposed: 1) Liu introduced a visible matrix to limit the attention area of each token in their knowledge-enabled language representation model (K-BERT) (Liu et al., 2020). 2) Yang trained the permutation language model with two-stream attention: content stream attention, which is the same as the standard self-attention, and query stream attention, which does not have access to information about the content (Yang et al., 2019). 3) Beltagy proposed the Longformer with an attention mechanism that is a drop-in replacement for the standard self-attention and scales linearly with the sequence length (Beltagy et al., 2020). 4) Gordon found that low levels of weight pruning do not affect pre-training loss or transfer to downstream tasks at all (Gordon et al., 2020).

Problem Statement
Let $S = \{s_1, s_2, \ldots, s_{|S|}\}$ be a set of source posts. Each $s_i \in S$ is a short text composed of a word (in English) or character (in Chinese) sequence $\langle w^i_1, w^i_2, \ldots, w^i_{l_i} \rangle$, where $l_i$ is the length of $s_i$. Each $s_i \in S$ is associated with a set of comment texts (viz. replies) $C_i$, each of which is a word or character sequence. Each $s_i$ is also associated with a binary label $y_i \in \{0, 1\}$ to represent its truthfulness, where $y_i = 1$ indicates $s_i$ is a rumor and $y_i = 0$ means $s_i$ is not.
Suppose the dataset is symbolized as $D = \{d_1, d_2, \ldots, d_{|D|}\}$ where each $d_i \in D$ is a tuple $\{s_i, C_i, y_i\}$. Given $d_i$, our goal is to predict the truthfulness $y_i$ of source post $s_i$, i.e., binary classification. Due to the nature of social media, we regard $s_i$ as primary data and $C_i$ as auxiliary data.

Overall Structure
The overall structure of STANKER is shown in Figure 1. We select relatively valuable comments in the pre-processing according to chronological order and a sentimental intensity ranking (see Section 4.4), thus producing two different training sets for base learners. In training, we devise two Level-Grained Attention-Masked BERT (LGAM-BERT) models as base learners, which mask co-attention between source posts and comments on low-layers of BERT (see Section 4.3). Since the first token [CLS] summarizes the information from input tokens using a global attention mechanism, we extract the embedding representation of [CLS] (viz. a 768-dimensional vector) in the last layer of two LGAM-BERT models and concatenate them. The final prediction layer is a dense network with softmax activation that maps the concatenated vector to two outputs for this binary classification.

Stacking Ensemble
The basic idea of stacking is multi-stage training (Wu et al., 2020). In STANKER, the stacking ensemble strategy uses the pre-processed training data to train primary learners at the first stage and then combines their final representations to form a meta data set for training the meta learner at the second stage. The benefit of this stacking strategy is two-fold. On the one hand, BERT is a strong classifier, so integrating it or its variants as primary learners will provide a start-up ensemble with high accuracy. On the other hand, extracting the embedding representation of [CLS], instead of the binary prediction result, will train the meta learner in a high-dimensional feature space.
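The second stage described above can be illustrated with a minimal sketch. It assumes that the two base encoders have already produced their last-layer [CLS] vectors; the random vectors and weights below are placeholders for those outputs and for trained parameters, and the single dense layer and the helper names (`meta_features`, `predict`) are illustrative choices, not the authors' implementation.

```python
import math
import random

HIDDEN = 768  # size of one [CLS] vector in BERT-base

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def meta_features(cls_left, cls_right):
    """Concatenate the two [CLS] vectors into one meta-level feature vector."""
    return list(cls_left) + list(cls_right)

def predict(features, weights, bias):
    """Dense layer followed by softmax over two classes (non-rumor, rumor)."""
    logits = [sum(w * x for w, x in zip(row, features)) + b
              for row, b in zip(weights, bias)]
    return softmax(logits)

# Placeholder inputs: in practice these come from the two frozen base encoders.
rng = random.Random(0)
cls_left = [rng.gauss(0, 1) for _ in range(HIDDEN)]
cls_right = [rng.gauss(0, 1) for _ in range(HIDDEN)]

features = meta_features(cls_left, cls_right)
# Placeholder parameters: a trained meta learner would supply these.
weights = [[rng.gauss(0, 0.01) for _ in range(2 * HIDDEN)] for _ in range(2)]
bias = [0.0, 0.0]
probs = predict(features, weights, bias)
```

Feeding the concatenated representations rather than the two binary predictions gives the meta learner a 1536-dimensional input, which is the high-dimensional feature space the text refers to.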

LGAM-BERT
The detailed design of LGAM-BERT is shown in Figure 2. An attention function can be formulated as querying a dictionary with key-value pairs. The Transformer is a stack of multiple self-attention blocks (Vaswani et al., 2017). Inspired by masking self-attention (Liu et al., 2020; Yang et al., 2019), we present a new mask strategy that masks co-attention at the low levels of BERT. The co-attention concept was first proposed by Lu et al. (2016) for Question-Answer (QA) tasks. Similarly, given a sentence set separated by [SEP], we suggest that self-attention attends to words in the same sentence, while co-attention attends to words in different sentences. See Figure 5 for such an example. After pre-processing, a comment-rich sentence set, where the source-post sentence and all comment sentences are separated by [SEP], is fed to LGAM-BERT. Precisely, for a pre-defined splitting layer k, we mask co-attention from the bottom layer to the k-th layer but calculate the standard attention from the (k + 1)-th layer to the top layer. The value k is a hyperparameter selected empirically.
Figure 1: The overall structure of STANKER
For our problem, since source posts and comments are not coherent texts, BERT may suffer from linguistic noise, via learning basic features (e.g., surface and syntactic features) from nearby texts on lower-layers (Jawahar et al., 2019). The LGAM strategy is novel. It masks co-attention between posts and comments on whole levels (viz. level-grained), while previous strategies only consider some local areas from the single-level aspect (Liu et al., 2020;Yang et al., 2019;Beltagy et al., 2020).
To support this, we conducted an experiment to illustrate attention distance level by level on BERT. We calculated the accumulated distance between a token and its top 10 most-attended tokens and visualized it with heat-maps. We found that tokens prefer to attend to nearer words at the low levels of BERT and to more distant words at the high levels. From the view of the attention mechanism, erroneous predictions occur when the predictor attends to inappropriate words. This phenomenon provides some explainability for our LGAM strategy. See Figure 7 and 8 in Appendix B for the details.

Comment Selection
For a given source post, the input layer first appends relevant comments to it, transforming the original sentence into a comment-rich sentence set. When the length of the sentence set exceeds the input limitation, we select comments using two strategies instead of simple truncation. On the one hand, we sort comments according to their replying time and prioritize comments that respond earlier. On the other hand, we calculate sentiment scores of comments and select those with high scores.
Formally, we adopt a sentiment dictionary Dict to score all comments (Rao et al., 2021). If a word w is in Dict, then $score_w$ is a pre-defined score; otherwise, it is set to 0. Given a comment c, its sentiment score $score_c$ is the average of $score_w$ over all w ∈ c. Then, we sort all comments according to sentiment scores and pick the top ones until the input-length limitation is reached.
Besides, we find that highly similar comments, which are especially common on Weibo datasets, waste the tight input space. Therefore, we use the DBSCAN algorithm (Ester et al., 1996), a density-based spatial clustering algorithm, to reduce redundancy before selection. DBSCAN can remove similar comments and repeated words, making comments more compact. Figure 4 in Appendix A is an example.
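The two selection strategies described in this section can be sketched as follows. This is a minimal illustration under stated assumptions: comments are dictionaries with hypothetical `time` and `tokens` fields, the toy dictionary stands in for a real sentiment lexicon, comments are ranked by their raw average score as the text states, and the DBSCAN de-duplication step is omitted.

```python
def sentiment_score(tokens, sentiment_dict):
    """Average per-token score; tokens missing from the dictionary count as 0."""
    if not tokens:
        return 0.0
    return sum(sentiment_dict.get(t, 0.0) for t in tokens) / len(tokens)

def fill_budget(ranked_comments, budget):
    """Take comments in ranked order and stop at the first one that no longer fits."""
    selected, used = [], 0
    for comment in ranked_comments:
        size = len(comment["tokens"])
        if used + size > budget:
            break
        selected.append(comment)
        used += size
    return selected

def select_chronological(comments, budget):
    """Prioritize comments that replied earlier."""
    return fill_budget(sorted(comments, key=lambda c: c["time"]), budget)

def select_sentimental(comments, sentiment_dict, budget):
    """Prioritize comments with higher sentiment scores."""
    ranked = sorted(comments,
                    key=lambda c: sentiment_score(c["tokens"], sentiment_dict),
                    reverse=True)
    return fill_budget(ranked, budget)

# Toy data (hypothetical): three comments and a three-word sentiment dictionary.
comments = [
    {"id": "a", "time": 3, "tokens": ["so", "sad"]},
    {"id": "b", "time": 1, "tokens": ["really", "?"]},
    {"id": "c", "time": 2, "tokens": ["terrible", "awful", "news"]},
]
toy_dict = {"sad": 0.6, "terrible": 0.9, "awful": 0.8}

chrono = select_chronological(comments, budget=5)
senti = select_sentimental(comments, toy_dict, budget=5)
```

With a budget of five tokens, the chronological branch keeps comments b and c, while the sentimental branch keeps c and a, so the two base learners see different training inputs for the same post.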

Formal Description
Given a source post $s_i$, let $CS_i^{(1)}$ and $CS_i^{(2)}$ be the two comment sets selected for it by chronological order and by sentimental intensity. The two LGAM-BERT base encoders produce the last-layer representations:
$$L_i = \text{LGAM-BERT}_1([s_i; CS_i^{(1)}]) \tag{1}$$
$$R_i = \text{LGAM-BERT}_2([s_i; CS_i^{(2)}]) \tag{2}$$
Then, we extract the first element of $L_i$ and $R_i$ respectively, which is the embedding of [CLS] in the last layer. The contextual post representation $PR_i$ is derived by concatenation:
$$PR_i = L_i[0] \oplus R_i[0] \tag{3}$$
Finally, we feed $PR_i$ to a fully-connected network (FCN) and output the prediction via softmax.
The standard attention mechanism (Vaswani et al., 2017) is defined as:
$$A(Q, K, V) = \text{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V \tag{4}$$
where Q is a query vector, K is a key vector, V is a value vector, and $d_k$ is the dimension of the keys. Inspired by mask-self-attention (Liu et al., 2020), we define a visible matrix M of tokens:
$$M_{ij} = \begin{cases} 0 & \text{if } Q_i \text{ and } K_j \text{ are from the same sentence} \\ -\infty & \text{if } Q_i \text{ and } K_j \text{ are from different sentences} \end{cases} \tag{5}$$
Figure 6 in Appendix A gives an example of the visible matrix. All co-attention is masked except for [CLS], which sees every token and summarizes the global information. Thus, attention-mask (viz. AM) can be:
$$AM(Q, K, V) = \text{softmax}\left(\frac{QK^{\top} + M}{\sqrt{d_k}}\right)V \tag{6}$$
This equation sets an attention weight to zero by adding a negative value to the dot product. Next, level-grained attention-mask can be derived as follows. Suppose that there are n layers on BERT, $H_i$ is the output representation of the i-th layer (1 ≤ i ≤ n), and $H_0 = E_{[s;CS]}$ is the embedding of the input sequence (the concatenation of word embedding and position embedding, following the original BERT), given s is a source post and CS is its comment set. Let k be the number of the splitting layer shown in Figure 2. $H_i$ can be derived by:
$$H_i = \begin{cases} AM(H_{i-1}) & 1 \le i \le k \\ A(H_{i-1}) & k < i \le n \end{cases} \tag{7}$$
where A is the standard attention function in Formula (4) and AM is the attention-mask function in Formula (6), with Q, K, and V computed from $H_{i-1}$. Finally, $H_n$ is L in Formula (1) or R in Formula (2).
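The visible matrix and the level-grained switch at the splitting layer k can be illustrated with a small sketch. It works on plain Python lists and on attention scores that are assumed to be already scaled; token 0 is assumed to be [CLS], and the choice that only the [CLS] row sees every token (other tokens see only their own sentence) is one reading of the text, not a confirmed detail. The function names are hypothetical.

```python
import math

NEG_INF = float("-inf")

def visible_matrix(sentence_ids):
    """M[i][j] is 0 when token i may attend to token j, else -inf.

    Token 0 is assumed to be [CLS] and may attend to every token.
    """
    n = len(sentence_ids)
    return [[0.0 if (i == 0 or sentence_ids[i] == sentence_ids[j]) else NEG_INF
             for j in range(n)]
            for i in range(n)]

def attention_weights(scores, mask=None):
    """Row-wise softmax of the (optionally masked) attention scores."""
    weights = []
    for i, row in enumerate(scores):
        if mask is not None:
            row = [s + m for s, m in zip(row, mask[i])]
        top = max(row)
        exps = [math.exp(s - top) for s in row]  # exp(-inf) evaluates to 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

def layer_mask(layer, k, mask):
    """Layers 1..k use the visible matrix; layers k+1..n use standard attention."""
    return mask if layer <= k else None

# One post (sentence 0, including [CLS]) and two comments (sentences 1 and 2).
sentence_ids = [0, 0, 0, 1, 1, 2]
M = visible_matrix(sentence_ids)
scores = [[0.0] * 6 for _ in range(6)]  # uniform scores make the effect visible
masked = attention_weights(scores, layer_mask(3, 10, M))
full = attention_weights(scores, layer_mask(11, 10, M))
```

With uniform scores, a post token in a masked layer spreads its attention over the three post-sentence positions only, whereas above the splitting layer it attends to all six tokens equally.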

Datasets
The experiments were conducted on four datasets (Weibo20, Ma-Weibo, Twitter15, and Twitter16). Weibo20 is constructed by ourselves, while the other three are widely used in the research line of rumor detection. Table 1 displays the basic statistics. Considering the average length of items, we allow at most 128 tokens for the post area and 384 tokens for the comment area on the two Weibo datasets, and 64 tokens for the post area and 312 tokens for the comment area on the two Twitter datasets.
Notes on Table 1: "#" means "number", "Avg." means "average", "len." means "length", and "cmt." means "comment". The length is the total number of tokens. A token is a word in an English sentence or a character in a Chinese sentence. Recently, Sina added a restriction on the length of collected data via API (viz. at most 200 comments per post). Therefore, the average length of comment sets on Weibo20 is much smaller than that on Ma-Weibo.
•Weibo20 (ours). We collected 6068 Chinese posts published on Sina Weibo in the last five years (i.e., 2016-2020), along with comments. We obtained user information and comments via the Weibo API. The annotation process of Weibo20 is as follows. First, we collected 4411 rumors with their corresponding comments from the official Sina Weibo community management center, which gives a factual basis to testify against each rumor. Then, after data cleaning, which excludes redundant rumors and rumors without comments, only 3034 rumors were left. To balance the corpus, while assuming posts on trending topics that Weibo officially recommends are facts, we collected 3034 recommended posts with their corresponding comments as negative samples (viz. non-rumors). Further, we tried our best to balance the number of rumors and non-rumors on all 15 topics. The topic distribution is shown in Table 9 in Appendix A.
•Twitter15 and Twitter16. We also experimented on two Twitter datasets (Ma et al., 2017). We choose only "true" and "fake" labels as the ground truth. Since the original data does not contain comments, we obtained user information and comments via the Twitter API.
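The per-area token budgets stated at the start of this section could be applied when assembling the comment-rich input, as in the sketch below. It is only an illustration: the helper name `build_input` is hypothetical, and counting [CLS] and [SEP] against the budgets is an assumption made here so that the Weibo setting (128 + 384) stays within 512 tokens.

```python
def build_input(post_tokens, comments, post_budget=128, comment_budget=384):
    """Assemble [CLS] post [SEP] c1 [SEP] c2 [SEP] ... within the two budgets.

    Returns the token list and a parallel list of sentence ids
    (0 for the post area, 1, 2, ... for the comments).
    Assumption: special tokens count against the budgets.
    """
    # Reserve two positions of the post area for [CLS] and [SEP].
    tokens = ["[CLS]"] + list(post_tokens[:post_budget - 2]) + ["[SEP]"]
    sentence_ids = [0] * len(tokens)
    used = 0
    for sid, comment in enumerate(comments, start=1):
        cost = len(comment) + 1  # the comment plus its trailing [SEP]
        if used + cost > comment_budget:
            break
        tokens += list(comment) + ["[SEP]"]
        sentence_ids += [sid] * cost
        used += cost
    return tokens, sentence_ids

# Tiny budgets so the truncation is visible.
tokens, sentence_ids = build_input(
    ["p1", "p2", "p3"], [["a", "b"], ["c"]], post_budget=4, comment_budget=3
)
```

The sentence ids returned here are the kind of per-token information a co-attention mask needs in order to tell the post apart from each comment.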

Experimental Setting
We implemented LGAM-BERT based on pretrained BERT-base. The machine learning platform employed in the experiments is TensorFlow 1.14 with Python 3.6.7. The experiments ran on Ubuntu 18.04.4 LTS with a Xeon E5-2680 (v2) CPU and an RTX 2080/3090 Ti GPU.
The training process of STANKER has two stages. In the first stage, we fine-tune the two LGAM-BERT models, given a dataset. In the second stage, we freeze all the parameters of LGAM-BERT and learn the parameters of the final prediction layer. The learning rate was set to 2e-5 on all datasets. We ran eight epochs on Weibo datasets and 20 epochs on Twitter datasets. We adopted the tokenizer (Che et al., 2020), a Chinese sentiment dictionary (Xu et al., 2008) on Weibo datasets and an English sentiment dictionary (Mohammad and Turney, 2013) on Twitter datasets.

Compared Methods
We compared STANKER with 12 competitive methods on four datasets. These methods can be divided into four categories, as shown in Table 3. We ran the source code of all compared methods, except for GCAN, which neither released a complete version of its source code at the provided link (https://github.com/l852888/GCAN) nor reported any result on Chinese microblog datasets in the original paper. We used the same setting presented in the original papers for a fair comparison. Apart from source post data, auxiliary data used by each method in our experiments is shown in Table 2.
Notes on Table 2: "C" means "chronological", "S" means "sentimental", and "R" means "random". The length limit is the largest allowed number of input tokens; "-" means "no limit". STANKER uses C & S comments with a length limit of 512. Ma-RvNN, CNN, and Bi-GCN use comment contents and propagation paths built by reply-user orders.
•SVM-TS (Ma et al., 2015). An SVM-based method.
•Bagging-BERT(2). We reproduced the idea of Risch and Krestel (2020) by bagging two original BERT models and randomly selecting comments.
•Geng-Ensemble (Geng et al., 2019). An ensemble network composed of three RNN-based learners, aggregating results by majority voting.
•STANKER (ours). We present our best model in the experiments, obtained by probing important design choices of STANKER.

Primary Results
Table 3 shows the primary experimental results of all compared methods on four datasets. We report the average result of 5-fold cross-validation. The best model of STANKER has the following design choices: using chronological comments and sentimental comments, utilizing the attention mask with the best splitting layer k, and stacking with the final FCN composed of 128 hidden units. The ablation study in Section 5.5 and 5.6 will further explain the contribution of these design choices. Preliminary conclusions are:
•Among all tested methods, STANKER achieved the highest classification accuracy and F1 score on all four datasets.
•Both BERT and RoBERTa are SOTA on general text classification tasks. However, compared with BERT, STANKER gained an accuracy improvement of up to 1.4% on Weibo datasets and 3.5% on Twitter datasets.
•Both PLAN and Longformer are good at processing long sequences. However, STANKER performed better than either of them, which indicates that using all comments is not the best option.
•Graph-structured models include Ma-RvNN, CNN, Bi-GCN, and GCAN. Ma-RvNN, the recent SOTA on Ma-Weibo, uses tree structures for propagation paths. CNN jointly learns text and propagation structure representation. Bi-GCN trains graph convolution networks. GCAN proposes graph-aware co-attention networks. Bi-GCN performed best among these four models; however, STANKER was superior to all of them.
•We compared STANKER with related ensemble models proposed in the recent two years. Both STANKER and Bagging-BERT(2) performed better than Wu-Stacking and Geng-Ensemble, which indicates the advantage of integrating BERT models. Further, STANKER performed better than Bagging-BERT(2), which indicates the advantage of adopting our LGAM-BERT models.

Ablation Study
There were two experiment sets in the ablation study. We tested the contribution of design choices of STANKER in two modes: a single-model mode and an ensemble mode. We reported the average accuracy on each dataset.
•In the single-model mode, we designed the "BERT_N" models, where N = 0, 1, 2, 3. We used the training subset that contained only one kind of comment: sentimental (S) or chronological (C). As shown in Table 4, "BERT_1" performed best, which reveals that the LGAM strategy is effective even for a single model. Besides, the result of "BERT_3" showed that the DBSCAN algorithm is more effective on Weibo datasets than on Twitter datasets and more useful for sentimental comments than for chronological comments.
•In the ensemble mode, we utilized two LGAM-BERT models and tested the performance of STANKER by removing a separate component or a combination of components. As shown in Table 5, there were three findings. First, the overall performance degraded most when running "STANKER w/o C+S", which revealed the importance of comments as auxiliary data. Take the Weibo20 dataset as an example. Given only source posts, a STANKER model only achieved an accuracy of 0.9457. However, with C+S comments added, this model got a much higher accuracy of 0.9672. Second, the performance of "STANKER w/o LGAM" was second to last, which indicated that the LGAM strategy contributed more to STANKER than other components. Third, both "STANKER w/o S" and "STANKER w/o C" degraded, which indicated that adopting diverse comments is more effective.

Attention Mask Strategy Analysis
In this experiment, we tested the hyperparameter k, the splitting layer on LGAM-BERT shown in Figure 2. Because the implementation is based on BERT-base, we tested all values of k (viz. from 0 to 12), attempting to find an "oracle" value. Experimental results on four datasets are shown in Table 6. We found that, even though there was some volatility, the accuracy increased as k grew; however, the returns diminished for larger and larger values. Particularly, when k = 10, we got the highest accuracy in six of eight cases. Therefore, we found an approximate "oracle" value, i.e., k = 10. As a result, we set k = 10 whenever adopting the LGAM strategy in STANKER.

Training Efficiency
In this part, we reported the training time of all compared methods. As shown in Table 7, as an ensemble model, the training cost of STANKER was low. It spent a little more time than Bagging-BERT(2) due to the pre-processing. However, our model got up to 0.8% improvement on Weibo datasets and 1.4% on Twitter datasets over Bagging-BERT(2). Also, STANKER ran faster than most non-ensemble models, e.g., Longformer.

Early Detection
The earlier a model can detect rumors, the more practical it is. Therefore, we conducted experiments for early detection. We collected comments every five minutes (viz. a checkpoint) and fed them to each detection model. Figure 3 shows that, as comments accumulated over time, our model was the earliest to reach its maximum classification accuracy. This result reveals the early-detection ability of STANKER.
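The checkpoint protocol above can be sketched as a simple evaluation loop. This is an assumption-laden illustration: the item fields (`post_time`, `comments`, `label`), the `classify` callable, and the toy keyword classifier are all hypothetical stand-ins, and times are taken to be in seconds.

```python
def comments_at_checkpoint(comments, post_time, checkpoint, interval_s=300):
    """Keep only comments posted within `checkpoint` intervals after the post."""
    deadline = post_time + checkpoint * interval_s
    return [c for c in comments if c["time"] <= deadline]

def accuracy_by_checkpoint(dataset, classify, num_checkpoints, interval_s=300):
    """Accuracy of `classify` when it only sees comments up to each checkpoint."""
    results = []
    for checkpoint in range(1, num_checkpoints + 1):
        correct = 0
        for item in dataset:
            visible = comments_at_checkpoint(
                item["comments"], item["post_time"], checkpoint, interval_s
            )
            correct += int(classify(item["post"], visible) == item["label"])
        results.append(correct / len(dataset))
    return results

def toy_classify(post, visible_comments):
    """Hypothetical stand-in for a detector: flags a rumor if a comment says 'fake'."""
    return int(any("fake" in c["text"] for c in visible_comments))

toy_dataset = [
    {"post": "claim A", "post_time": 0, "label": 1,
     "comments": [{"time": 400, "text": "this is fake"}]},
    {"post": "claim B", "post_time": 0, "label": 0,
     "comments": [{"time": 100, "text": "ok"}]},
]
curve = accuracy_by_checkpoint(toy_dataset, toy_classify, num_checkpoints=2)
```

Plotting such a curve per model shows which one reaches its best accuracy at the earliest checkpoint.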

Sentiment Dictionaries
Finally, we reported the results of using different sentiment dictionaries on STANKER. In total, we tested three Chinese dictionaries (Xu's lexicon (Xu et al., 2008), TsingHua lexicon (Li and Sun, 2007), and NTUSD (Ku and Chen, 2007)) and four English dictionaries (EmoLex (Mohammad and Turney, 2013), SentiStrength 11, Bing Liu's lexicon (Hu and Liu, 2004), and HowNet lexicon (Zhu et al., 2006)). These dictionaries have different sizes and sentiment levels. For polarity-only dictionaries (e.g., Bing Liu's lexicon), we set the sentiment value of a positive word to 1 and that of a negative word to -1. Further, with the same sentiment score, a shorter sentence has higher sentimental intensity. The accuracy scores are reported in Table 8. The differences in accuracy across sentiment dictionaries were not significant. However, Xu's lexicon and EmoLex performed best on the Chinese and English datasets, respectively.

Conclusion
Rumor control is one of the principal tasks of the Cybersecurity & Infrastructure Security Agency (CISA) 12. Given the 521 million active Sina Weibo users, even 1% of the rumors they post or forward would be a big event. For rumor detection, existing ensemble models did not realize their full potential. To alleviate this, we build a new Weibo dataset and propose a new ensemble model which achieved the best results on all tested datasets. The novelty of our method lies not in the overall architecture but in the proposal of LGAM-BERT models that use comments on the original post as auxiliary data. We model co-attention between source posts and comments and propose a strategy that masks co-attention on the lower layers of BERT. Unlike previous studies, we employ the masking strategy on the whole attention layer instead of on random text spans. Although the impact of each component is not significant, a convincing set of experiments shows that STANKER has superior performance when compared to numerous other SOTA methods on four different datasets. Our future work includes considering more features as auxiliary data, e.g., user profiles, and testing the LGAM strategy on more NLP tasks, e.g., dialog generation or text summarization.
11 http://sentistrength.wlv.ac.uk/
12 https://www.cisa.gov/rumorcontrol

Appendices

A Examples
Figure 4 shows an example of how the DBSCAN algorithm removes repeated words. Before processing, redundant words exist, e.g., three "Speechless" and two "Gross" (circled in red). After processing, only one copy is kept (circled in green). Figure 5 shows an example of the self-attention and the co-attention, given a source post sentence and two comment sentences separated by [SEP]. In Figure 5, brown lines indicate the self-attention inside the source post sentence; gray lines signal the self-attention inside a comment sentence; blue lines highlight the co-attention between the source post sentence and a comment sentence. Figure 6 shows an example of the visible matrix for masking co-attention, given a source post sentence and four comment sentences. The blank areas indicate the invisible areas. All co-attention is masked, except for that of [CLS]. We keep all co-attention of [CLS] because it has to see each token to summarize the global information.

B Attention Study
We conducted an experiment to illustrate attention distance. We calculated the accumulated distance between a token and its top 10 most-attended tokens and visualized it with heat-maps. Figure 7 and Figure 8 show the heat-maps on Ma-Weibo and Weibo20 as an example. Each figure has two branches: the chronological branch (viz. using the training sub-set containing only chronological comments) and the sentimental branch (viz. using the training sub-set containing only sentimental comments). Given a token t, let $t_1, \ldots, t_{10}$ be its top 10 most-attended tokens. We use a function called Distance to return the distance between two tokens in an input sequence. Then, the average attention-distance sum (ADS) is defined as follows:
$$ADS = \frac{1}{n} \sum_{t} \sum_{j=1}^{10} \text{Distance}(t, t_j)$$
where n is the total number of tokens on a dataset. We list all ADS values layer by layer with the growth of training depth, as shown in Figure 7 and Figure 8. We set every 50 steps as a checkpoint to illustrate training depth. Larger ADS values indicate higher attention weights. Further, the deeper the color is, the farther the attention distance is. This phenomenon reveals that tokens prefer to attend to nearer words at the low levels of BERT and to more distant words at the high levels. This test provides some explainability for our LGAM strategy.
Further, Table 10 lists the top 10 most-attended tokens for each dataset. The list provides evidential words for the prediction and some guidance for the saliency analysis. Another interesting finding is the differences between the word clouds of the four datasets (see Figure 9), which justifies the necessity of building an updated social media rumor detection dataset.
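The ADS computation described above can be sketched for a single attention matrix as follows. The sketch assumes that Distance is the absolute difference between token positions and that the average is taken over the tokens of one input sequence rather than a whole dataset; the function name and the toy matrix are illustrative only.

```python
def attention_distance_sum(attn, top_k=10):
    """Average, over tokens, of the summed distance to each token's top_k
    most-attended tokens. attn[i][j] is the weight from token i to token j.

    Assumption: Distance(i, j) = |i - j| (difference in token positions).
    """
    total = 0
    for i, row in enumerate(attn):
        # Indices of the top_k largest weights in this row.
        top = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:top_k]
        total += sum(abs(i - j) for j in top)
    return total / len(attn)

# Toy 3-token example with top_k=1: token 0 attends most to token 1,
# tokens 1 and 2 attend most to token 0, so the distances are 1, 1, and 2.
toy_attn = [
    [0.1, 0.9, 0.0],
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
]
ads = attention_distance_sum(toy_attn, top_k=1)
```

Computing this value for every layer and training checkpoint yields the grid of numbers that a heat-map such as those in Figure 7 and Figure 8 would display.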