Cross-Topic Rumor Detection using Topic-Mixtures

There has been much interest in rumor detection using deep learning models in recent years. A well-known limitation of deep learning models is that they tend to learn superficial patterns, which restricts their generalization ability. We find that this is also true for cross-topic rumor detection. In this paper, we propose a method inspired by the "mixture of experts" paradigm. We assume that the prediction of the rumor class label for an instance depends on the topic distribution of that instance. After deriving a vector representation for each topic, we compute, for each instance, a "topic mixture" vector based on the instance's topic distribution. This topic-mixture vector is combined with the vector representation of the instance itself to make rumor predictions. Our experiments show that our proposed method outperforms two baseline debiasing methods in a cross-topic setting. In a synthetic setting where topic-specific words are removed, our method also works better than the baselines, showing that it does not rely on superficial features.


Introduction
Recently, there has been much interest in detecting online false information such as rumors and fake news. Existing work has explored different features including network structures (Ma et al., 2019a), propagation paths (Liu and Wu, 2018), user credibility (Castillo et al., 2011) and the fusion of heterogeneous data such as image and text (Wang et al., 2018). However, these proposed algorithms still cannot be easily deployed for real-world applications, and one of the key reasons is that, just like many other NLP problems, rumor or fake news detection models may easily overfit the training data and thus cannot perform well on new data. The problem can be more serious with deep learning solutions, because deep neural networks tend to learn superficial patterns that are specific to the training data but do not always generalize well (Wang et al., 2018).
In this work, we study the task of rumor detection and focus on the problem of adapting a rumor detection model trained on a set of source topics to a target topic, which we refer to as cross-topic rumor detection. In a recent study by Khoo et al. (2020), the authors compared the performance of rumor detection in an in-topic setting and an out-of-topic setting. They found that their model could achieve 77.4% macro F-score on the in-topic testing data, but the performance of the same classifier dropped to 39.5% when applied to out-of-topic testing data, which describe events different from the training events.
In this paper, we propose a method inspired by the "mixture of experts" paradigm, abbreviated as "MOE". Understanding that the rumor prediction model may work differently for different topics, we assume that the prediction result on an instance is dependent on the topic distribution of that instance. While a standard method is to train topic-specific classifiers and then use the topic distribution to combine these topic-specific classifiers, we propose a different approach where the topic distribution is used to linearly combine a set of vectors representing different topics. This gives us a "topic-mixture" vector for each example. This topic-mixture vector is concatenated with the vector representation of the example itself and used as the input to a neural network model for rumor label prediction.
We implement our method on top of a state-of-the-art StA-HiTPLAN model and conduct experiments using the PHEME dataset. Compared with two baseline methods that also perform debiasing, we find that our method can achieve clearly better cross-topic performance. We also experiment with modified within-topic data where we intentionally remove topic-specific words. This creates a setting where it is hard for models to rely on topic-specific words to make rumor predictions. We find that our method can also outperform the baselines substantially.

Performance Degradation in Cross-Topic Rumor Detection

In this section, we present a case study on the PHEME dataset to quantify the degree of overfitting of an existing model by analyzing the influence of topic-specific words. Concretely, we use the PHEME dataset, which has five topics. We use four topics during training and the remaining one for out-of-domain testing. After obtaining a trained hierarchical transformer model (Vaswani et al., 2017), we perform post-hoc testing by applying it to different topics, with K topic-specific words masked, to examine the performance drop. The topic-specific words are identified based on the log-odds ratio with a Dirichlet prior (Monroe et al., 2008), a common way to identify words that are statistically over-represented in one population compared to others, and we regard these topic-specific words as possible spurious patterns. For the in-domain testing, we split the data as 7:2:1 for training, testing and validation. Experiments are performed using K ∈ {20, 50, 100, 200}.
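The word-ranking step above can be sketched as follows. This is a minimal, self-contained implementation of the z-scored log-odds ratio with a Dirichlet prior in the spirit of Monroe et al. (2008), using a symmetric prior for simplicity; the function and parameter names are ours, not from the paper:

```python
import math
from collections import Counter

def topic_specific_words(topic_tokens, rest_tokens, alpha=0.01, top_k=20):
    """Rank words by the z-scored log-odds ratio with a symmetric
    Dirichlet prior: words statistically over-represented in
    `topic_tokens` relative to `rest_tokens` score highest."""
    y_i, y_j = Counter(topic_tokens), Counter(rest_tokens)
    n_i, n_j = sum(y_i.values()), sum(y_j.values())
    vocab = set(y_i) | set(y_j)
    a0 = alpha * len(vocab)  # total pseudo-count of the prior
    scores = {}
    for w in vocab:
        # log-odds of w in each corpus, smoothed by the prior
        li = math.log((y_i[w] + alpha) / (n_i + a0 - y_i[w] - alpha))
        lj = math.log((y_j[w] + alpha) / (n_j + a0 - y_j[w] - alpha))
        delta = li - lj
        # approximate variance of the log-odds difference
        var = 1.0 / (y_i[w] + alpha) + 1.0 / (y_j[w] + alpha)
        scores[w] = delta / math.sqrt(var)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:top_k]
```

The top-K words returned by this ranking are the ones we mask in the post-hoc tests.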

Results
The partial results are shown in Table 1. It is noteworthy that in-domain accuracy drops from 67.69% to 36.7% when we mask only the top-20 topic-specific words: the model is highly sensitive to event-specific patterns. In contrast, accuracy drops only slightly in the out-of-domain setting when we mask the top-20 out-of-domain topic words, which may indicate that many of the masked words were unseen during training in the first place. These experiments confirm our hypothesis that the baseline classifier is primarily learning topical correlations, and motivate the need for a debiased classification approach, which we describe next.

Notation
Let x be an input, which is a thread represented as a sequence of tokens. We assume that x consists of a chronologically ordered sequence of posts x = (x_1, x_2, ..., x_T), in which x_1 represents a source post and x_i (i > 1) represents a reply post. Let y be the rumor label (e.g., true rumor, false rumor, etc.) we want to predict. We assume that the training data come from a set of M different topics. Our goal is to train a rumor detection classifier using the labeled data from the M topics such that the classifier works well on examples from a new target topic.

Mixture Of Experts
Our idea is inspired by Mixture of Experts models (Jacobs et al., 1991). Specifically, we assume that each example x has a distribution over the M training topics. Let t be a variable denoting the topic. We model p(y|x) as follows:

p(y|x) = sum_{i=1}^{M} p(t = i|x) p(y|x, t = i).   (1)

Normally, to model p(t|x) and p(y|x, t), we can train parameterized models p(t|x; θ_1) and p(y|x, t; θ_2) using our training data, because our examples have clear topic labels. However, if the number of topics is large, or the number of training instances per topic is small, training such topic-specific models may not work well. Moreover, if we train an independent model for each training topic and combine their out-of-domain predictions as a whole, the result may be unsatisfactory because each model may overfit its specific topic. Our initial experimental observations also verify that the independent training method works well in the in-topic setting but not in the out-of-topic setting. We therefore explore an alternative approach, described below.
We assume that x and t are both represented as vectors (which we will explain later). We can then use the following neural network model to model p(y|x, t):

p(y|x, t) ∝ exp(θ_y · (x ⊕ t)),

where the θ_y are vectors to be learned and ⊕ means vector concatenation. Now we can make an approximation of Eqn. (1) as follows:

p(y|x) ≈ exp(θ_y · (x ⊕ t̄)) / sum_{y'} exp(θ_{y'} · (x ⊕ t̄)),  where t̄ = sum_{i=1}^{M} p(t = i|x) t_i,   (2)

and t_i is a vector representation of topic i. We can see that instead of computing p(y|x, t = i) for each i and then using p(t = i|x) to obtain a weighted sum of these predictions, we first compute a sum of the vector representations of the different topics weighted by p(t|x), and then use this weighted sum (the "topic mixture") to compute p(y|x).
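The approximation above amounts to a few lines of linear algebra. Below is a minimal NumPy sketch of one forward pass, assuming x has already been encoded as a vector; the weight-matrix names (W_topic, W_y) are illustrative, not from the paper:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def topic_mixture_predict(x, topic_vecs, W_topic, W_y):
    """One forward pass of the topic-mixture approximation.

    x:          (d_x,)       encoded instance
    topic_vecs: (M, d_t)     one vector t_i per training topic
    W_topic:    (d_x, M)     top layer of the topic classifier p(t|x)
    W_y:        (d_x+d_t, C) rumor classifier parameters theta_y
    """
    p_t = softmax(x @ W_topic)         # p(t = i | x)
    t_mix = p_t @ topic_vecs           # sum_i p(t = i | x) * t_i
    feat = np.concatenate([x, t_mix])  # x concatenated with the mixture
    return softmax(feat @ W_y)         # p(y | x)
```

Note that the M topic-specific predictions are never materialized; only one classifier is evaluated, on the concatenated feature.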
To obtain a vector representation of x, we can use BERT to process the sequence of tokens in x and then take the vector at the [CLS] token in the top layer as x. For each topic i, since we have the instances x belonging to that topic, we explore two ways of deriving t_i: (1) We use the average of the vectors x belonging to topic i as t_i. We refer to this as Avg. (2) We use the parameters at the top layer of the topic classification model p(t|x) as the vector representations of the different topics. We refer to this as Param. At test time, since an instance x does not come with a topic label, we estimate p(t|x) with a topic classification model trained on the training data, where every example has a gold topic label.
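The two ways of deriving t_i can be sketched as follows (NumPy, with illustrative names; the Param variant simply reuses the topic classifier's top-layer weight matrix):

```python
import numpy as np

def topic_vectors_avg(X, topics, num_topics):
    """Avg: t_i is the mean of the instance vectors assigned to topic i.
    X: (N, d) instance vectors; topics: length-N integer topic labels."""
    topics = np.asarray(topics)
    return np.stack([X[topics == i].mean(axis=0) for i in range(num_topics)])

def topic_vectors_param(W_topic):
    """Param: t_i is row i of the top-layer weight matrix of the
    topic classification model p(t|x), one row per topic."""
    return W_topic
```

Both variants produce an (M, d) matrix of topic vectors that plugs directly into the mixture computation of Eqn. (2).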

Implementation Details
We use the StA-HiTPLAN model architecture of Khoo et al. (2020) as our backbone. StA-HiTPLAN is a hierarchical transformer which contains 12 post-level multi-head attention (MHA) layers and 2 token-level MHA layers. Since Khoo et al. (2020) report that BERT (Devlin et al., 2018) did not improve results and was time-consuming, we apply GloVe-300d (Pennington et al., 2014) to embed each token in a post. The initial learning rate was set to 0.01 with 0.3 dropout, and we used the Adam optimizer with 6000 warm-up steps. Batch size is set to 256 for all cross-validation tasks.

Dataset
We use the public PHEME dataset (Zubiaga et al., 2016) for our evaluation. PHEME was collected based on 9 breaking news stories, and its threads can be categorized into four classes: true rumor, false rumor, unverified rumor and non-rumor. Following the setting in (Kumar and Carley, 2019), we select five breaking events from PHEME and split them into two sets: four events are used for training and in-domain testing, and the remaining one serves as the out-of-domain testing set.

Baselines and Our Methods
We consider a state-of-the-art model and several baselines that also address cross-domain issues.
StA-HiTPLAN: Replicating (Khoo et al., 2020), we train a hierarchical transformer model, which is a state-of-the-art model and serves as the feature extractor in the following experiments.
Ensemble-based model (EM): Following (He et al., 2019; Clark et al., 2019), we take topical words as bias features and introduce an auxiliary bias-only model f_b that takes only these bias features as input. We first obtain the class distribution p_b(y|x) from this bias-only model. We then train a robust model p(y|x) in an ensemble with the bias-only model, combining the two as p̂(y|x) = softmax(log p(y|x) + log p_b(y|x)), so that the robust model is not forced to re-learn what the bias-only model already captures. In the testing stage, only the robust model p(y|x) is used for prediction.

Adversary-based model (AM): This is a common way to learn domain-invariant features. We implement a recent approach (Wang et al., 2018; Ma et al., 2019b) and replace their Bi-LSTM with the backbone of (Khoo et al., 2020) for a fair comparison. The parameter of the gradient reversal layer is set to 1.
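The EM combination can be sketched as a product of experts in log space, in the standard form of Clark et al. (2019); this is a minimal NumPy illustration with our own function names, not the paper's implementation:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def poe_ensemble_logprobs(robust_logits, bias_logprobs):
    """Product-of-experts training combination: the ensemble predicts
    softmax(log p + log p_b), so the training gradient only pushes the
    robust model on what the bias-only model cannot already explain.
    At test time, predictions come from `robust_logits` alone."""
    robust_logprobs = np.log(softmax(robust_logits))
    return np.log(softmax(robust_logprobs + bias_logprobs))
```

When the bias-only model is uninformative (uniform), the ensemble reduces exactly to the robust model's own prediction, which is the intended behavior.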
MOE-Avg and MOE-Param: These are our proposed models, corresponding to the Avg and Param variants of the topic representation described in Section 3.2.

Cross-Topic and In-Topic Settings
We use two settings to evaluate the effectiveness of our method for cross-topic rumor detection. The first setting is the standard setting where we train on a set of source topics and test the performance of the model on a different target topic. For the PHEME dataset, we use 4 topics as training topics and the remaining topic as the test topic. We repeat this 5 times with different splits of training/test topics and report the average performance. We refer to this as the "cross-topic" setting. We also experiment with a second, in-topic setting, where we train and test on the same topic but artificially remove topic-specific words. We refer to this as our "in-topic" setting. In Table 2, these are labeled as Mask-20, Mask-30 and Mask-50, depending on how many topic-specific words we mask (i.e., remove).

Table 2: Average accuracy and macro-F scores (%) of the in-topic setting on PHEME. Orig. refers to the original data. Mask-k refers to the setting where we artificially mask k topic-specific words.

Results and Analysis
We present our experiments on the PHEME dataset in Table 2 and Table 3. Several observations can be made from the results: 1) From Table 3, we can see that MOE-Avg and MOE-Param are both effective strategies for mitigating the topic overfitting problem. Accuracy improves from 34.41% to 41.24% and 41.33%, respectively, even though we only intervene on the input features without modifying the backbone network. 2) The adversarial training model AM works better than the ensemble method EM in the early stage but deteriorates after we mask more than 50 topic-specific words. One reason is that the ensemble-based model depends on the bias-only model: it is sensitive to the choice of bias features, and appears more robust as we mask more irrelevant words. 3) In contrast to the unstable adversarial training method, MOE-Avg and MOE-Param make the model robust to topic bias and increase its generalization ability. 4) Besides averaging the vector representations of the instances x belonging to the same topic (MOE-Avg), we also aggregate the final-layer parameters of the topic classifier (MOE-Param); MOE-Param works slightly better than MOE-Avg. How to better represent a topic embedding deserves more attention in future work.

Conclusion and Future work
In this work, we propose a new approach to cross-topic rumor detection based on mixture of experts, which can reinforce the generalization capacity of a model when adapting to new topics. We suggest that: 1) instead of training an unstable adversarial component or removing bias directly from the semantic content, the mixture of experts provides another way to increase generalization ability; 2) in this work, we use feature concatenation and train one classifier rather than several expert classifiers, with fixed confidence scores. In the future, we can learn adaptive weights to make the model more flexible; for example, variational inference methods could be used to dynamically learn the best mixture of topics for a given held-out topic.

Table 3: Average accuracy and macro-F score (%) on PHEME data for the cross-topic setting.