Semantic Oppositeness Assisted Deep Contextual Modeling for Automatic Rumor Detection in Social Networks

Social networks face a major challenge in the form of rumors and fake news, due to their intrinsic nature of connecting users to millions of others, and of giving any individual the power to post anything. Given the rapid, widespread dissemination of information in social networks, manually detecting suspicious news is sub-optimal. Thus, research on automatic rumor detection has become a necessity. Previous works in the domain have utilized the reply relations between posts, as well as the semantic similarity between the main post and its context, consisting of replies, in order to obtain state-of-the-art performance. In this work, we demonstrate that semantic oppositeness can improve the performance on the task of rumor detection. We show that semantic oppositeness captures elements of discord, which are not properly covered by previous efforts, which only utilize semantic similarity or reply structure. We show, with extensive experiments on recent data sets for this problem, that our proposed model achieves state-of-the-art performance. Further, we show that our model is more resistant to the variances in performance introduced by randomness.


Introduction
Social media changed the ecosystem of the World Wide Web by making it possible for any individual, regardless of their level of knowledge of web technologies, to create and maintain profiles online. At the same time, various social media provided these individuals with means to tap into the information disseminated by others (e.g., Facebook by adding friends, Twitter by following). By virtue of other mechanisms, such as Facebook pages and Twitter lists, the reach of each individual was then extended to the range of thousands-to-millions of users. New content, in the form of posts, is created on social media sites each passing second.
The rapidity of this post creation is such that it is possible to claim that social media reflect a near real-time view of events in the real world (Veyseh et al., 2019). While it was, indeed, beneficial in terms of volume of data to have private individuals be content creators and propagators of information, this created significant issues with respect to the veracity of the data. This gave rise to the challenge of detecting fake news and rumors (which, in this study, we refer to as the task of rumor detection). The need for rumor detection has come to the forefront in light of its momentous impacts on political events (Jin et al., 2017) and social (Jin et al., 2014) or economic (Domm, 2013) trends. Manual intervention on this task would require extensive analysis of, and reasoning about, various sources of information, resulting in long response times, which are intolerable given the impact of these rumors and the rate at which they spread. Thus, automatic rumor detection, toward which we contribute in this paper, has become an important area of contemporary research. Cao et al. (2018) define any piece of information whose veracity status was questionable at the time of posting as a rumor. They further claim that a rumor may later be verified to be true or false by other authorized sources. We follow their definition in this work; thus, we define the task of rumor detection as: given a piece of information from a social network, predict whether the piece of information is a rumor or not, using the conversations induced by the said piece of information. The initial piece of information could be a tweet or a user post, and the induced conversation would be the replies from other users (which we use as contextual information). Following the conventions in the literature, in this work, we refer to a main post and its replies as a thread.
In this paper, we utilize the semantic oppositeness measure proposed by de Silva and Dou (2019) to improve the rumor detection task, which has so far been restricted to considering only semantic similarity. We further prove that semantic oppositeness is well-suited to be applied to this domain, under the observation that rumor threads are more discordant than those of non-rumors. We further observe that, within rumor threads, false rumor threads continue to be clamorous, while true rumor threads settle into inevitable acquiescence. We claim that semantic oppositeness can help in distinguishing this behavior as well.
We propose a word-level self-attention mechanism for the semantic oppositeness to augment the tweet-level self-attention mechanism for the semantic similarity. We model the explicit and implicit connections within a thread, using a relevancy matrix. Unlike a regular adjacency matrix, our relevancy matrix recognizes the coherence of each sub-tree of conversation rooted at the main post, while acknowledging that, by definition, for this task, the main tweet must be directly related to all the rest of the tweets, regardless of the degrees of separation that may exist between them. We conduct extensive experiments to compare our proposed model with the state-of-the-art studies conducted on the same topic. To the best of our knowledge, this work is the first to utilize semantic oppositeness in rumor detection. In summary, our contributions in this paper include:
• We introduce a novel method for rumor detection, based on both semantic similarity and semantic oppositeness, utilizing the main post and the contextual replies.
• We model the explicit and implicit connections within a thread, using a relevancy matrix, which is then used to balance the impact semantic similarity and semantic oppositeness have on the overall prediction.
• We conduct experiments on recent rumor detection data sets and compare with numerous state-of-the-art baseline models to show that we achieve superior performance.
The remainder of this paper is organized as follows: Section 2 presents related work, and Section 3 provides a formal definition of the problem, along with our proposed solution. Section 4 then discusses experiments and results. Finally, Section 5 concludes the paper.

Related Work
Semantic oppositeness is the mathematical counterpart of semantic similarity (de Silva and Dou, 2019). While implementations of semantic similarity (Jiang and Conrath, 1997; Wu and Palmer, 1994) are more widely used than those of semantic oppositeness, there are a number of studies which work on deriving or using semantic oppositeness (de Silva et al., 2017; Paradis et al., 1982; Mettinger, 1994; Schimmack, 2001; Rothman and Parker, 2009; de Silva, 2020). However, almost all of these studies reduce oppositeness from a continuous scale to either bipolar scales (Schimmack, 2001) or simple antonymy (Paradis et al., 1982; Jones et al., 2012). The study by de Silva et al. (2017) proves that this reduction is incorrect and proposes an alternative oppositeness function. Their follow-up study, de Silva and Dou (2019), creates a word embedding model for this function. In this study, we use the oppositeness embeddings created by them.
The rumor detection task has been approached on three fronts, according to Cao et al. (2018): feature engineering, propagation-based, and deep learning. In the feature engineering approach, posts are transformed into feature representations by hand-designed features and sent to a statistical model to be classified. In addition to textual information, structural evidence (Castillo et al., 2011; Yang et al., 2012) and media content (Gupta et al., 2012) are also utilized. Given that this approach depends heavily on the quality of the hand-designed feature sets, it is neither scalable nor transferable to other domains. The propagation-based approach is built on the assumption that the propagation pattern of a rumor is significantly different from that of a non-rumor. It has been deployed to detect rumors in social networks (Ma et al., 2017). However, this method does not pay any heed to the information in the post content itself. As expected, the deep learning approach automatically learns effective features (Ma et al., 2016, 2018; Veyseh et al., 2019). Ma et al. (2016) claim that these discovered features capture the underlying representations of the posts and, hence, improve the generalization performance, while making the model easy to adapt to a new domain or social medium for the purpose of rumor detection. Yang et al. (2020) propose a slide window-based system for feature extraction. None of these state-of-the-art works attempts to check rumor veracity, akin to the attempts by Hamidian and Diab (2019a) and Derczynski et al. (2017). Instead, they attempt to do classification on the already established baseline. Thus, our work also follows the approach of the former rather than the latter. The work by Hamidian and Diab (2019b) does focus on rumor detection and classification. However, they do not use the data sets common to the state-of-the-art work mentioned above to evaluate their approach.
Our work is most related to the rumor detection model on Twitter by means of deep learning to capture contextual information (Veyseh et al., 2019). However, we also derive inspiration from earlier work on the same topic (Ma et al., 2018), which utilized the tree-like structures of the posts, and the work by de Silva and Dou (2019), which introduced the oppositeness embedding model. The early work by Ma et al. (2018) uses Recursive Neural Networks (RvNN) for the construction of the aforementioned tree-like structures of the posts, based on their tf-idf representations.
The subsequent work by Veyseh et al. (2019) acknowledges the usefulness of considering the innate similarities between replies, but further claims that considering the replies only along the tree-like structure exploits solely the explicit relations between the main posts and their replies, and thus ignores the implicit relations among the posts from different branches based on their semantics. Under this claim, they disregard the tree-like structure entirely. In our work, we preserve the idea of considering semantic similarities to discover the implicit relationships among posts, as proposed by Veyseh et al. (2019).
However, we augment the model and reintroduce the explicit relationships proposed by Ma et al. (2018) in a balancing of information between implicit and explicit. Further, we note that all these prior works have been solely focused on the similarity between the posts and have ignored the oppositeness metric. To the best of our knowledge, we are the first to utilize oppositeness information in the rumor detection task.

Methodology
We use a recent work (Veyseh et al., 2019) on rumor detection as our baseline. Their work, in turn, was heavily influenced by the earlier work on rumor detection in Twitter (Ma et al., 2018). A tweet set I is defined as shown in Equation 1, where R 0 is the initial tweet and R 1 , R 2 , . . . , R T are replies, such that T is the count of replies. Each tweet R i is a sequence of words W 1 , W 2 , ..., W n , such that n is the count of words. We tokenize the tweets; in this work, tokens and words are used interchangeably. We also define the relevance matrix M, which carries the information of the tree structure of the tweet tree, in Equation 2, where A ∼ B denotes that A and B belong to the same tree in the forest obtained by eliminating the initial tweet. We show the process in Figure 1. The entire data set is represented by D.
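The relevance matrix described above can be sketched as follows. This is a minimal sketch under our own assumptions: the thread is encoded as a parent array, and `build_relevance_matrix` is an illustrative helper name, not from the original papers.

```python
import numpy as np

def build_relevance_matrix(parents):
    """Build the relevance matrix M for one thread.

    parents[i] is the index of the tweet that tweet i replies to
    (parents[0] is None for the main tweet).  M[i, j] = 1 when i or j
    is the main tweet, or when tweets i and j fall in the same subtree
    of the forest obtained by removing the root.
    """
    T = len(parents)
    # Assign each non-root tweet to the subtree rooted at a direct
    # reply of the main tweet.
    subtree = [None] * T
    for i in range(1, T):
        j = i
        while parents[j] != 0:
            j = parents[j]
        subtree[i] = j
    M = np.zeros((T, T), dtype=int)
    for i in range(T):
        for j in range(T):
            if i == 0 or j == 0 or subtree[i] == subtree[j]:
                M[i, j] = 1
    return M

# Thread: tweet 0 is the main tweet; tweets 1 and 2 reply to it;
# tweet 3 replies to tweet 1.  Tweets 1 and 3 share a subtree.
M = build_relevance_matrix([None, 0, 0, 1])
```

Note how the main tweet's row and column are all ones, while tweets 2 and 3, sitting in different subtrees, are not connected to each other.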
Following the convention of Veyseh et al. (2019), which is our baseline, we classify each pair (I, M) into four labels: 1) Not a rumor (NR); 2) False Rumor (FR); 3) True Rumor (TR); and 4) Unrecognizable (UR). It should be noted that the distinction between "False Rumor" and "True Rumor" is drawn from the truthfulness of R 0 .
In this work, what we do with the above-described relevance matrix M is to augment the implicit relationship consideration with the high-level structure of the explicit relationships, hence bringing in the best of both worlds. In summary, the set of edges in the relevancy matrix is a super-set of the set of edges in the adjacency matrix. In addition to the edges that were in the adjacency matrix, the relevancy matrix also has edges that carry implicit connection information. Thus, by definition, the relevancy matrix is more descriptive of the thread than the adjacency matrix.

Figure 1: Relevance matrix building: 1) Original tweet reply tree; 2) Obtain the forest by temporarily removing the root (main tweet); 3) Consider each tree in the forest to be a fully connected graph, and obtain the relevance matrices; 4) Obtain the full relevance matrix by putting together the matrices from the previous step and considering the main tweet to be connected to all the other tweets.

Formal Definition of Tweet Representation
Each tweet will have a different number of words n; thus, we pad the short tweets with a special token, until all the tweets have the same word length N, as defined by Equation 3.
We build the representative oppositeness list O using the oppositeness embeddings created by de Silva and Dou (2019), such that, for the i-th tweet R i , with words W i1 , W i2 , ..., W iN , the oppositeness representations are o i1 , o i2 , ..., o iN . Each word in each tweet is then converted to a representative vector by means of a set of pre-trained word embeddings, such that the i-th tweet R i , with words W i1 , W i2 , ..., W iN , is converted to e i1 , e i2 , ..., e iN . We then apply a max-pooling operation over the word embeddings along each dimension, resulting in a representative vector h i coupled to R i , as shown in Equation 4. At this point, note that the tweet set I of each pair P , which used to be I = (R 0 , R 1 , R 2 , . . . , R T ), has been replaced by I = (h 0 , h 1 , h 2 , . . . , h T ). It is this new representation which is passed to the following steps.
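The max-pooling step of Equation 4 can be sketched as follows, using a toy 2-dimensional embedding (the paper uses 300-dimensional vectors):

```python
import numpy as np

def tweet_representation(word_embeddings):
    # word_embeddings: an (N, d) matrix, one row per (padded) token.
    # Max-pool along each embedding dimension to get h_i (Equation 4).
    return word_embeddings.max(axis=0)

# Three tokens, each with a 2-dimensional embedding.
E = np.array([[0.1, 0.9],
              [0.5, 0.2],
              [0.3, 0.4]])
h = tweet_representation(E)   # -> [0.5, 0.9]
```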

Similarity-Based Contextualization
As discussed earlier, the Twitter data is organized as a tree rooted at the main tweet R 0 in each instance. The earlier work by Ma et al. (2018) showed that, in rumor detection, it is helpful to capture these relations among the main tweet and the replies. The subsequent work by Veyseh et al. (2019) noted that only considering the explicit reply relation between the main tweet and other tweets neglects the implicit relations among the tweets, arising from their semantic similarities (i.e., by virtue of discussing the same topic, tweets in two separate branches may carry mutually useful information). Following this hypothesis, they exploited such implicit semantic relations for the purpose of improving the performance of the rumor detection task. However, in doing so, they abandoned the information garnered from the tree structure. In this work, we propose to continue to use the implicit information, but to augment it with the information derived from the tree structure.
We follow the self-attention mechanism of Veyseh et al. (2019), which was inspired by the transformer architecture of Vaswani et al. (2017), to learn the pairwise similarities among tweets for capturing the semantic relations between them. The process starts with calculating the key (k i ) and query (q i ) vectors for each tweet, based on its representation h i , as shown in Equation 5. (W and b follow the traditional notation of weights and bias.)
With the key and query vectors, we calculate the similarity a ij between the i-th and j-th tweets, using the dot product, as shown in Equation 6, where γ is a normalization factor.
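A minimal sketch of the key/query similarity computation in Equations 5 and 6. The weight matrices here are random stand-ins for learned parameters, and the toy hidden size is an assumption for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                        # toy hidden size (300 in the paper)
T = 3                        # tweets in the thread
H = rng.normal(size=(T, d))  # tweet representations h_i

# Linear layers producing keys and queries (Equation 5); the weights
# would be learned during training.
Wk, bk = rng.normal(size=(d, d)), np.zeros(d)
Wq, bq = rng.normal(size=(d, d)), np.zeros(d)
K, Q = H @ Wk + bk, H @ Wq + bq

gamma = np.sqrt(d)           # normalization factor
A = (Q @ K.T) / gamma        # a_ij: pairwise similarity (Equation 6)
```

`A[i, j]` then holds the (unnormalized) attention score between the i-th and j-th tweets.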

Oppositeness-Based Contextualization
Unlike the similarity vectors, which were reduced to a single dimension at this point, the oppositeness representations are still two-dimensional. Thus, the self-attention of oppositeness between tweets is handled at the word level, rather than at the sentence level. We build key (k i ) and query (q i ) vectors for each word, based on its representation o i , as shown in Equation 7. (W and b follow the traditional notation of weights and bias.)
Since the oppositeness embedding of de Silva and Dou (2019) is based on Euclidean distance, with the key and query vectors, we calculate the oppositeness op ix,jy between the x-th word of the i-th tweet and the y-th word of the j-th tweet using the Euclidean distance, as shown in Equation 8, where k ix is the key vector for the x-th word of the i-th tweet, q jy is the query vector for the y-th word of the j-th tweet, and the Euclidean distance d(,) is calculated across the size of the oppositeness embedding.
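The word-level Euclidean computation of Equation 8 can be sketched as follows, with toy 2-dimensional key and query vectors standing in for the learned projections:

```python
import numpy as np

def word_oppositeness(Ki, Qj):
    """op_{ix,jy}: Euclidean distance between the key vector of the
    x-th word of tweet i and the query vector of the y-th word of
    tweet j (Equation 8, sketched with toy vectors).

    Ki: (N, d) key vectors; Qj: (N, d) query vectors; returns (N, N).
    """
    diff = Ki[:, None, :] - Qj[None, :, :]   # broadcast pairwise differences
    return np.sqrt((diff ** 2).sum(axis=-1))

Ki = np.array([[0.0, 0.0], [1.0, 0.0]])
Qj = np.array([[3.0, 4.0], [1.0, 0.0]])
OP = word_oppositeness(Ki, Qj)   # OP[0, 0] == 5.0 (a 3-4-5 triangle)
```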
To obtain the abstract tweet-level oppositeness, we apply element-wise average-pooling on the OP i,j matrix, as shown in Equation 9, to create the oppositeness matrix O″, where EA is the Element-wise Average operation, δ is the oppositeness embedding count of the i-th tweet, and its counterpart is the oppositeness embedding count of the j-th tweet. Note that the dimensions of the oppositeness matrix O″ are the same as those of the relevance matrix M.
Next, we create the oppositeness mask Ω by average-pooling O″ along rows and columns, as shown in Equation 10, where the definition of EA is the same as in Equation 9, and, similar to Equation 3, n i and n j are the natural lengths of the i-th and j-th tweets, respectively.
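The two pooling steps of Equations 9 and 10 can be sketched as follows. The handling of padding and the exact pooling extents in the paper may differ from this toy version:

```python
import numpy as np

# Word-level oppositeness for one tweet pair (2 x 3 real tokens).
OP = np.array([[1.0, 2.0, 3.0],
               [4.0, 5.0, 6.0]])

# Equation 9 (sketch): element-wise averaging collapses the word grid
# into a single tweet-level oppositeness value o''_{i,j}.
o_ij = OP.mean()                      # 3.5

# Equation 10 (sketch): the mask Omega then average-pools the
# tweet-level matrix O'' along rows and columns.
O2 = np.array([[0.0, 1.0],
               [2.0, 3.0]])           # toy O'' for a 2-tweet thread
T = O2.shape[0]
Omega = np.empty_like(O2)
for i in range(T):
    for j in range(T):
        # Average of row i and column j of O''.
        Omega[i, j] = (O2[i, :].mean() + O2[:, j].mean()) / 2
```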

Deriving Overall Thread Representations
Similar to the oppositeness mask Ω, we create the relevance mask Ψ by sum-pooling M along rows and columns, as shown in Equation 11, where ES is the Element-wise Sum operation, and, similar to Equation 3, n i and n j are the natural lengths of the i-th and j-th tweets, respectively:

ψ i,j = ES(m i,0 , m i,1 , ..., m i,n j ) + ES(m 0,j , m 1,j , ..., m n i ,j )

At this point, we diverge from Veyseh et al. (2019) in two ways and utilize the relevance mask Ψ as a weighting mechanism, with proportion constant α (where 0 < α < 1), as well as the oppositeness mask Ω, to obtain the augmented attention a i,j , as shown in Equation 12.
We utilize the augmented similarity values a i,j for each tweet pair in the thread to compute abstract representations for the tweets based on the weighted sums, as shown in Equation 13.
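A sketch of the augmented attention and weighted-sum steps (Equations 12 and 13). Since the exact combination in Equation 12 is not reproduced in the text, the α-weighted blending below is an assumption for illustration only:

```python
import numpy as np

rng = np.random.default_rng(1)
T, d = 3, 4
A = rng.random((T, T))       # similarity attention a_ij (Equation 6)
Psi = rng.random((T, T))     # relevance mask (Equation 11)
Omega = rng.random((T, T))   # oppositeness mask (Equation 10)
alpha = 0.1                  # proportion constant, 0 < alpha < 1

# Equation 12 (assumed form): blend the similarity scores with the
# relevance and oppositeness masks, balanced by alpha.
A_aug = alpha * (Psi * A) + (1 - alpha) * (Omega * A)

# Equation 13: abstract tweet representations as attention-weighted
# sums over the thread.
H = rng.normal(size=(T, d))                     # tweet vectors h_i
W = A_aug / A_aug.sum(axis=1, keepdims=True)    # row-normalize
H_new = W @ H
```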
Next, we apply the max-pooling operation over the processed tweet representation vectors h i to obtain the overall representation vector h for the input pair P , as shown in Equation 14:

h = ElementwiseMax(h 0 , h 1 , h 2 , ..., h T ) (14)

Finally, the result is sent through a 2-layer feed-forward neural network capped with a softmax layer, with the objective of producing the probability distribution P (y|R 0 , R 1 , R 2 , . . . , R T ; θ) over the four possible labels, where θ is the model parameter. On this, we optimize the negative log-likelihood function, in order to train the model, as shown in Equation 15, where y * is the expected (correct) label for I.

Main Tweet Information Preservation
The Veyseh et al. (2019) study noted that the model by Ma et al. (2018) treats all tweets equally. This was deemed undesirable, given that the main tweet of each thread incites the conversation and, thus, arguably carries the most important content in the conversation, which should be emphasized to produce good performance. To achieve this end, it was proposed to bring forward the information in the main tweet independently of, and separately from, that of the collective Twitter thread, in order to provide a check. We, in this work, also provide this sanity check, to enhance the obtained results. The basic idea is that, by virtue of definition, if a main tweet is a rumor (or not), unique traits and information pertaining to that class should be in the main tweet itself. Thus, the latent label (L thread ) obtained by processing the thread representation h above should be the same as a potential latent label (L main ) obtained by processing the representation of the main tweet h 0 . To calculate L main , we use a 2-layer feed-forward neural network with a softmax layer at the end, which assigns latent labels drawn from K possible latent labels. Next, we use another 2-layer feed-forward neural network with a softmax layer at the end, assigning the same K possible latent labels to the thread representation, and we match the two through a negative log-likelihood term.
Finally, the loss function to train the entire model is defined as in Equation 18, where L label is obtained from Equation 15, and β is a hyperparameter which controls the contribution of the main tweet information preservation loss to the final loss.
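The combined objective of Equation 18 can be sketched as follows. The auxiliary main-tweet preservation term is shown as a stand-in value, since its exact form (Equations 16 and 17) is not reproduced in the text:

```python
import numpy as np

def nll(probs, gold):
    # Negative log-likelihood of the gold label (Equation 15).
    return -np.log(probs[gold])

beta = 1.0  # trade-off hyperparameter (beta = 1 in the experiments)

# Toy predicted distribution over the four labels (NR, FR, TR, UR)
# produced from the full thread representation h.
p_thread = np.array([0.7, 0.1, 0.1, 0.1])
loss_label = nll(p_thread, 0)

# Stand-in for the main-tweet information preservation loss, which in
# the paper is another negative log-likelihood over K latent labels.
loss_main = 0.25

# Equation 18 (sketch): final training loss.
loss = loss_label + beta * loss_main
```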

Experiments
We use the Twitter 15 and Twitter 16 data sets introduced by Ma et al. (2017) for the task of rumor detection. Some statistics of the data sets, as given by Ma et al. (2017), are shown in Table 1. We use GloVe (Pennington et al., 2014) embeddings to initialize the word vectors, and the oppositeness embeddings of de Silva and Dou (2019) to initialize the oppositeness vectors. Both embedding vectors are of size 300. The key and query vectors in Equations 5 and 7 employ 300 hidden units. The rumor classifier feed-forward network has two layers of 200 hidden units. The feed-forward layer in the main tweet information preservation component has two layers, each with 100 hidden units, and it maps to three latent labels. The proportion constant α, which balances the explicit and implicit information, is set at 0.1. The loss function uses a trade-off parameter of β = 1, with an initial learning rate of 0.3 on the Adagrad optimizer. For the purpose of fair results comparison, we follow the convention of using a 5-fold cross-validation procedure to tune the parameters (such as node and layer counts), as set by Ma et al. (2018).

Comparison to the State-of-the-Art Models
We compare the proposed model against the state-of-the-art models on the same data sets. The performance is compared by means of overall accuracy and F1 score per class. We observe that there are two types of models against which we compare. The first type is the feature-based models, which used feature engineering to extract features for Decision Trees (Zhao et al., 2015; Castillo et al., 2011), Random Forest (Kwon et al., 2013), and SVM (Ma et al., 2015; Wu et al., 2015; Ma et al., 2017). The second type is the deep learning models, which used Recurrent Neural Networks or Recursive Neural Networks to learn features for rumor detection. We compare our model to these GRU-based and RvNN-based models; the results on the Twitter 15 and Twitter 16 data sets are shown in Tables 2 and 3, respectively. It is evident from these tables that, in the rumor detection task, the deep learning models outperform feature-based models, showing that automatically learning effective features from data is superior to hand-crafting features. We also note that the Semantic Oppositeness Graph, along with the Semantic Graph and other RvNN models with GRU-RNN, generally does well, which attests to the utility of structural information, be it in the form of reply structure or in the form of semantic relations, in helping to improve performance. We further notice that Veyseh et al. (2019), which uses implicit information, outperforms TD-RvNN (Ma et al., 2018), which only uses explicit information. The Semantic Oppositeness Graph, which uses explicit information, implicit information, and semantic oppositeness, outperforms all the other models in accuracy, while outperforming them in three out of the four classes in terms of F1 score. The one class in which the Semantic Oppositeness Graph loses to Veyseh et al. (2019) is the Unrecognizable (UR) class. We argue that this is not a major issue, given that the unrecognizable class consists of tweets which were too ambiguous for human annotators to tag as one of: not a rumor (NR), false rumor (FR), or true rumor (TR).
We assert that Tables 2 and 3 clearly demonstrate the effectiveness of the proposed Semantic Oppositeness Graph method in the task of rumor detection.

Model Stability Analysis
While comparing our system with Veyseh et al. (2019), which we use as our main baseline, we noticed that their system has a high variance in results, depending on the random weight initialization. This was impactful in such a way that, with some random weight initializations, the accuracy of their system could fall as low as 24% from the high-70% results reported in their paper. Given that we use their system as our baseline and the basis for our model, we decided to do a stability analysis between their system and ours. For this purpose, we created 100 random seeds and trained four models with each seed, resulting in a total of 400 models. The four models were the system of Veyseh et al. (2019) and our Semantic Oppositeness Graph, each trained on the Twitter 15 and Twitter 16 data sets (the same configurations as in Tables 2 and 3). Each result is reported in the format of (µ, σ) for the 5-fold cross-validation, to explore how random weight initialization affects the two models.
From the results in Tables 4 and 5, it is evident that our Semantic Oppositeness Graph has higher mean values for accuracy, not a rumor (NR), false rumor (FR), and true rumor (TR), while having comparably reasonable values for the Unrecognizable (UR) class. But more interesting are the standard deviations, which are consistently lower for our model, indicating that it is less sensitive to random weight initialization.
For a demonstration, consider the subset of three words increase, decrease, and expand from the example given by de Silva and Dou (2019). If the main tweet (R 0 ) were to say "A will increase B", R 1 replied with "A will decrease B", and R 2 replied with "A will expand B", then a purely semantic-similarity-based model will position R 0 and R 1 closer than R 0 and R 2 , given that the word contexts in which increase and decrease are found are more similar than the word contexts in which increase and expand are found. This would result in the neural network having to learn the opposite semantics between increase and decrease by itself, during training, making it more vulnerable to issues of bad initial weight selection. This, in turn, will result in greater variance among the trained models, as is the case with Veyseh et al. (2019). However, a system with an oppositeness component will already have the opposite semantics between increase and decrease, as well as between increase and expand, calculated in advance. Thus, such a system would have pre-knowledge that the word pair increase and decrease, despite being used in more common contexts, is more semantically opposite than the word pair increase and expand, which is used in less common contexts. Hence, the neural network does not have to learn that information from scratch during training, making it less vulnerable to issues of bad initial weight selection. Analogously, this, in turn, will result in less variance among the trained models; hence explaining the better stability demonstrated by the Semantic Oppositeness Graph in comparison to Veyseh et al. (2019) in Tables 4 and 5.

Impact of the Oppositeness Component
Finally, to emphasize the effect the oppositeness component has on the model, we draw t-SNE diagrams for the final representations of the threads. Figure 3a shows the data point clustering when the model is trained without the oppositeness component, and Figure 3b shows the data point clustering when the model is trained with the oppositeness component. Note that all other variables, including the seed for the weight initializer, are the same in the two models. These diagrams show that the oppositeness component helps improve the separability of the classes. Especially note how the False Rumor and True Rumor classes are now more clearly separated. We postulate that this derives from the fact that the oppositeness component helps in distinguishing the continuous discord occurring in a False Rumor thread from the subsequent general agreement in a True Rumor thread.

Conclusion
Rumors and fake news are a significant problem in social networks, due to their intrinsic nature of connecting users to millions of others and giving any individual the power to post anything. We introduced a novel method for rumor detection, based on semantic oppositeness, in this paper. We demonstrated the effectiveness of our method using data sets from Twitter. Compared to previous work, which only used explicit structures in the reply relations or semantic similarity, our model learns both explicit and implicit relations between a main tweet and its replies, by utilizing both semantic similarity and semantic oppositeness. We proved, with extensive experiments, that our proposed model achieves state-of-the-art performance, while being more resistant to the variances in performance introduced by randomness.