Bi-Directional Recurrent Neural Ordinary Differential Equations for Social Media Text Classification

Classification of posts in social media such as Twitter is difficult due to the noisy and short nature of texts. Sequence classification models based on recurrent neural networks (RNN) are popular for classifying posts that are sequential in nature. RNNs assume the hidden representation dynamics to evolve in a discrete manner and do not consider the exact time of the posting. In this work, we propose to use recurrent neural ordinary differential equations (RNODE) for social media post classification which consider the time of posting and allow the computation of hidden representation to evolve in a time-sensitive continuous manner. In addition, we propose a novel model, Bi-directional RNODE (Bi-RNODE), which can consider the information flow in both the forward and backward directions of posting times to predict the post label. Our experiments demonstrate that RNODE and Bi-RNODE are effective for the problem of stance classification of rumours in social media.


INTRODUCTION
Information disseminated in social media such as Twitter can be useful for addressing several real-world problems like rumour detection, disaster management, and opinion mining. Most of these problems involve classifying social media posts into different categories based on their textual content. For example, classifying the veracity of tweets as False, True, or Unverified allows one to debunk the rumours evolving in social media [18]. However, social media text is extremely noisy, with informal grammar, typographical errors, and irregular vocabulary. In addition, the character limit (280 characters) imposed by social media such as Twitter makes it even harder to perform text classification.
Social media text classification, such as rumour stance classification [12,13,19], can be addressed effectively using sequence labelling models such as long short-term memory (LSTM) networks [1,5,10,11,16,18,19,20]. (Rumour stance classification helps to identify the veracity of a rumour post by classifying the reply tweets into different stance classes such as Support, Deny, Question, and Comment.) Though these models consider the sequential nature of tweets, they ignore the temporal aspects associated with the tweets. The time gap between tweets varies a lot and LSTMs ignore this irregularity in tweet occurrences. They are discrete state space models where the hidden representation changes from one tweet to another without considering the time difference between the tweets. Considering the exact times at which tweets occur can play an important role in determining the label. If the time gap between tweets is large, then the corresponding labels may not influence each other, but they can have a very high influence if the tweets are closer in time.
We propose to use recurrent neural ordinary differential equations (RNODE) [14] and develop a novel approach, bi-directional RNODE (Bi-RNODE), which can naturally consider the temporal information to perform time-sensitive classification of social media posts. Neural ordinary differential equation (NODE) [2] is a continuous depth deep learning model that performs transformation of feature vectors in a continuous manner using ordinary differential equation solvers. NODEs bring parameter efficiency and address model selection in deep learning to a great extent. Recurrent NODE [14] extends NODE to time-series where hidden states associated with the elements in the sequence are assumed to evolve continuously over time. They generalize RNNs to consider the temporal information present in the sequence data and allow the hidden representation to change according to this temporal information.
We propose RNODE to perform sequence labeling of posts occurring continuously over time in social media. It can consider the varying inter-arrival times of the posts and update the hidden representation accordingly for classifying the posts. In addition, we propose a novel model, bi-directional RNODE (Bi-RNODE), which considers not only information from the past but also from the future in predicting the label of the post. Here, continuously evolving hidden representations in the forward and backward directions in time are combined and used to predict the post label. We show the effectiveness of the proposed models on the rumour stance classification problem in Twitter using the RumourEval-2019 [4] dataset. We found that RNODE and Bi-RNODE can improve social media text classification by effectively making use of the temporal information and are better than LSTMs and gated recurrent units (GRU) with temporal features.

BACKGROUND

Problem Definition
We consider the problem of classifying social media posts into different classes. Let us consider our data set $\mathcal{D}$ to be a collection of $N$ posts, $\mathcal{D} = \{p_i\}_{i=1}^{N}$. Each post $p_i$ is assumed to be a tuple containing details about the post such as the textual content $\mathbf{x}_i$ (one can consider other features as well, such as the number of re-posts and reactions), the time of the post $t_i$ and the label associated with the post $y_i$, thus $p_i = \{(\mathbf{x}_i, t_i, y_i)\}$. Our aim is to develop a sequence classification model which considers the temporal information $t_i$ along with $\mathbf{x}_i$ for classifying a social media post. In particular, we consider the rumour stance classification problem in Twitter where one classifies tweets into different classes such as Support, Query, Deny, and Comment, thus $y_i \in \mathcal{Y} = \{S, Q, D, C\}$.
arXiv:2112.12809v1 [cs.CL] 23 Dec 2021

Neural Ordinary Differential Equations
Neural ordinary differential equations (NODE) [2] were introduced as a continuous depth alternative to Residual Networks (ResNets) [8]. ResNets use skip connections to avoid vanishing gradient problems when networks grow deeper. The residual block output is computed as $\mathbf{h}_{t+1} = \mathbf{h}_t + f(\mathbf{h}_t, \theta_t)$, where $f$ is a neural network parameterized by $\theta_t$ involving stacked layers with non-linear activation functions and $\mathbf{h}_t$ is the hidden representation at depth $t$. This update is similar to a step in the Euler numerical technique used for solving ordinary differential equations (ODE) of the following form.

\[
\frac{d\mathbf{h}(t)}{dt} = f(\mathbf{h}(t), t, \theta) \tag{1}
\]
A sequence of residual block operations in ResNets can be seen as a solution to this ODE, with $\mathbf{h}(t)$ representing the hidden representation at any time $t$ and the ODE trajectories defined through the neural network $f$. Consequently, NODEs can be interpreted as a continuous equivalent of ResNets, modelling the evolution of hidden representations over time.
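The correspondence between the residual update and an Euler step can be made explicit. Discretising the ODE (1) with a step size $\Delta t$ (a symbol introduced here for illustration) gives

\[
\mathbf{h}(t + \Delta t) = \mathbf{h}(t) + \Delta t \, f(\mathbf{h}(t), t, \theta)
\]

Setting $\Delta t = 1$ and letting the parameters depend on the depth index recovers the residual block update $\mathbf{h}_{t+1} = \mathbf{h}_t + f(\mathbf{h}_t, \theta_t)$.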
To solve an ODE, one can use fixed step-size numerical techniques such as Euler and Runge-Kutta, or adaptive step-size methods like Dopri5 [6]. Solving an ODE requires one to specify an initial value $\mathbf{h}(0)$; the value at time $t$ can then be computed using an ODE solver, $\mathrm{ODESolve}(f_{\theta}, \mathbf{h}(0), 0, t)$. We can take the initial value $\mathbf{h}(0)$ to be the input $\mathbf{x}$ or a transformation of $\mathbf{x}$ using a downsampling block. The ODE (1) is solved until some end-time $T$ to obtain the final hidden representation $\mathbf{h}(T)$. A fully connected neural network (FCNN) transforms the final representation $\mathbf{h}(T)$ to the output $\hat{y}$. For classification problems, cross-entropy loss is used to update the weights of NODE using back-propagation. For NODE models, efficient back-propagation and gradient computations were proposed using the adjoint sensitivity method [2,17].
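The forward pass described above can be sketched in a few lines of plain Python. This is only an illustrative sketch under stated assumptions: the one-layer tanh dynamics, the toy weights, and the helper names `euler_solve`, `make_dynamics`, `linear` and `softmax` are hypothetical and not taken from the paper, and a fixed-step Euler solver stands in for whichever solver is actually used.

```python
import math

def euler_solve(f, h0, t0, t1, n_steps=100):
    """Integrate dh/dt = f(h, t) from t0 to t1 with fixed-step Euler updates."""
    h, t = list(h0), t0
    dt = (t1 - t0) / n_steps
    for _ in range(n_steps):
        h = [hi + dt * di for hi, di in zip(h, f(h, t))]
        t += dt
    return h

def make_dynamics(weights, bias):
    """Hypothetical one-layer tanh network standing in for f(h(t), t, theta)."""
    def f(h, t):
        return [math.tanh(sum(w * hj for w, hj in zip(row, h)) + b)
                for row, b in zip(weights, bias)]
    return f

def linear(weights, h):
    """Fully connected output layer without bias (toy stand-in for the FCNN)."""
    return [sum(w * hj for w, hj in zip(row, h)) for row in weights]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Toy example: h(0) is the input x, the ODE is solved until T = 1,
# and the final state h(T) is mapped to class probabilities.
f = make_dynamics([[0.5, -0.3], [0.2, 0.1]], [0.0, 0.1])
h_T = euler_solve(f, h0=[1.0, -1.0], t0=0.0, t1=1.0)
y_hat = softmax(linear([[1.0, 0.0], [0.0, 1.0], [-1.0, 1.0]], h_T))
print(h_T, y_hat)
```

A smaller step size (larger `n_steps`) trades computation for accuracy; adaptive solvers such as Dopri5 choose the step size automatically.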

BI-DIRECTIONAL RECURRENT NODE
Popular techniques for sequence classification such as LSTMs consider the sequential nature of the data but, in their standard setting, ignore the temporal features associated with the data. Posts occur at irregular intervals of time, with more posts occurring in certain periods. The influence of consecutive posts on each other might depend on this time gap, with the influence typically decreasing over time. Instead of an LSTM model, which performs a single-step transformation, it is beneficial to use a model where the number of transformations depends on the time gap.
We propose to use recurrent neural ordinary differential equations (RNODE) [14] to address the drawbacks of RNN based models in classifying irregularly occurring posts in social media. RNODE is developed for time-series data and can naturally consider the time associated with the posts, performing the transformations of the hidden representation to reflect it. In RNODE, the transformation of a hidden representation $\mathbf{h}(t_{i-1})$ at time $t_{i-1}$ to $\mathbf{h}(t_i)$ at time $t_i$ is governed by an ODE similar to (1), with $f$ being a neural network (NN) transformation. Unlike standard LSTMs, which apply a single discrete update between consecutive posts, RNODE first obtains an intermediate representation $\mathbf{h}'(t_i)$ by integrating this ODE over the interval between the posts:
\[
\mathbf{h}'(t_i) = \mathbf{h}(t_{i-1}) + \int_{t_{i-1}}^{t_i} f(\mathbf{h}(t), t, \theta)\, dt
\]
As this integral is intractable, RNODE uses a numerical technique (e.g., Euler method) to obtain the transformation. The number of update steps in the numerical technique is determined by the time gap $t_i - t_{i-1}$ between the consecutive posts.
The hidden representation $\mathbf{h}'(t_i)$ and the input post $\mathbf{x}_i$ at time $t_i$ are passed through a neural network transformation (RNNCell()) to obtain the final hidden representation $\mathbf{h}(t_i)$, i.e., $\mathbf{h}(t_i) = \mathrm{RNNCell}(\mathbf{h}'(t_i), \mathbf{x}_i)$. The process is repeated for every element $(\mathbf{x}_i, t_i)$ in the sequence. The hidden representations associated with the elements in the sequence are then passed to a neural network (NN()) to obtain the sequence of outputs corresponding to the post labels. Figure 1(a) provides the detailed architecture of the RNODE model.
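The recurrence described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the decay dynamics `decay`, the toy `rnn_cell` with fixed weights, and the base step size of 0.05 are all assumptions chosen so that the effect of the time gap is easy to see.

```python
import math

def ode_evolve(f, h, t_prev, t_curr, step=0.05):
    """Euler-integrate dh/dt = f(h, t) from t_prev to t_curr.

    The number of Euler updates grows with the time gap (assumed rule:
    one update per `step` units of time, at least one update)."""
    gap = t_curr - t_prev
    n_steps = max(1, math.ceil(gap / step))
    dt = gap / n_steps
    t = t_prev
    for _ in range(n_steps):
        h = [hi + dt * di for hi, di in zip(h, f(h, t))]
        t += dt
    return h, n_steps

def decay(h, t):
    """Hypothetical dynamics dh/dt = -h standing in for the learned network f."""
    return [-hi for hi in h]

def rnn_cell(h_prime, x, w_h=0.5, w_x=0.5):
    """Toy element-wise RNN cell: h = tanh(w_h * h' + w_x * x)."""
    return [math.tanh(w_h * hp + w_x * xi) for hp, xi in zip(h_prime, x)]

def rnode_forward(xs, ts):
    """Return the hidden state h(t_i) after every post (x_i, t_i)."""
    h, t_prev, hidden = [0.0] * len(xs[0]), 0.0, []
    for x, t in zip(xs, ts):
        h_prime, _ = ode_evolve(decay, h, t_prev, t)  # continuous evolution h'(t_i)
        h = rnn_cell(h_prime, x)                      # discrete update with the post
        hidden.append(h)
        t_prev = t
    return hidden

# Same posts, different spacing: the earlier post influences the later one
# less when the gap between them is larger.
close = rnode_forward([[1.0], [1.0]], [0.1, 0.2])
far = rnode_forward([[1.0], [1.0]], [0.1, 0.9])
print(close[1], far[1])
```

With decaying dynamics, the second hidden state is smaller when the posts are further apart, which mirrors the intuition that the influence of a post fades with time.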
Bi-directional RNNs [15] such as Bi-LSTMs [7] have proven successful in many sequence labeling tasks in natural language processing, such as POS tagging [9]. They use information from both the past and the future to predict the label, while standard LSTMs consider only the past. We propose Bi-directional RNODE (Bi-RNODE), which uses the sequence of input observations from the past and from the future to predict the post label at any time $t$. It assumes the hidden representation dynamics are influenced not only by past posts but also by future posts. Unlike Bi-LSTMs, Bi-RNODE considers the exact times of the posts and their inter-arrival times in determining the transformations of the hidden representations. Bi-RNODE consists of two RNODE blocks, one performing transformations in the forward direction (in the order of posting times) and the other in the backward direction (in the reverse order of posting times). The hidden representations $H$ and $H_b$ computed by the forward and backward RNODE respectively are aggregated either by concatenation or averaging to obtain a final hidden representation, which is passed through a NN to obtain the post labels. Bi-RNODE is useful when a sequence of posts needs to be classified together, and can be restrictive for online classification of individual posts. Algorithm 1 and Figure 1(b) provide an overview of Bi-RNODE for post classification.

Algorithm 1: Pseudo code for the RNODE and Bi-RNODE approach to predict class labels. The input data points $(X, \mathbf{t})$, where $X = \{\mathbf{x}_i\}_{i=1}^{N}$ and $\mathbf{t} = \{t_i\}_{i=1}^{N}$, are sorted in increasing order of their timestamps.
    Initialize: $\mathbf{h}(0) = 0$, $t_0 = 0$, $H = \{\}$
    for each $(\mathbf{x}_i, t_i)$ in $(X, \mathbf{t})$:
        $\mathbf{h}'(t_i)$ = ODESolve($f_{\theta}$, $\mathbf{h}(t_{i-1})$, $t_{i-1}$, $t_i$)
        $\mathbf{h}(t_i)$ = RNNCell($\mathbf{h}'(t_i)$, $\mathbf{x}_i$); add $\mathbf{h}(t_i)$ to $H$
    if bidirectional:
        Set $X'$ to contain $X$ in reverse order. Set $\mathbf{t}'$ to contain $\mathbf{t}$ in reverse order.
        Compute $H_b$ from $(X', \mathbf{t}')$ in the same manner, using the backward network $f'_{\theta'}$
        $H$ = aggregate($H$, $H_b$) // concatenate or average
    return NN($H$) // return predicted post labels
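The bi-directional scheme can be sketched in plain Python as below. It is a hedged illustration only: scalar features, the decay dynamics, the fixed cell weights and the function names `rnode_pass` and `bi_rnode` are hypothetical, and the handling of backward time gaps (absolute differences between consecutive reversed timestamps, with a zero gap before the last post) is an assumption, since the source does not spell it out.

```python
import math

def rnode_pass(xs, gaps, w_h, w_x, step=0.05):
    """One RNODE direction on scalar inputs: Euler-evolve h over each gap
    with stand-in dynamics dh/dt = -h, then apply a tanh cell."""
    h, hidden = 0.0, []
    for x, gap in zip(xs, gaps):
        n_steps = max(1, math.ceil(gap / step))
        for _ in range(n_steps):
            h += (gap / n_steps) * (-h)
        h = math.tanh(w_h * h + w_x * x)
        hidden.append(h)
    return hidden

def bi_rnode(xs, ts, aggregate="concat"):
    """Combine forward and backward hidden states for every post."""
    # Forward gaps are measured from t_0 = 0, as in the initialisation step.
    fwd_gaps = [ts[0]] + [b - a for a, b in zip(ts, ts[1:])]
    # Backward pass: reversed order, gaps taken as positive differences (assumption).
    xs_rev, ts_rev = xs[::-1], ts[::-1]
    bwd_gaps = [0.0] + [a - b for a, b in zip(ts_rev, ts_rev[1:])]
    h_f = rnode_pass(xs, fwd_gaps, w_h=0.5, w_x=0.5)
    # A separate set of weights plays the role of the extra backward network.
    h_b = rnode_pass(xs_rev, bwd_gaps, w_h=0.3, w_x=0.7)[::-1]  # re-align with forward order
    if aggregate == "concat":
        return [[f, b] for f, b in zip(h_f, h_b)]
    return [[(f + b) / 2.0] for f, b in zip(h_f, h_b)]

posts = [1.0, -1.0, 0.5]
times = [0.1, 0.2, 0.9]
print(bi_rnode(posts, times, aggregate="concat"))
print(bi_rnode(posts, times, aggregate="average"))
```

Concatenation doubles the size of the representation passed to the output network, while averaging keeps it unchanged; both options appear in the hyperparameter search described in the experiments.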

EXPERIMENTS
To demonstrate the effectiveness of the proposed approaches, we consider the stance classification problem in Twitter and the RumourEval-2019 [4] data set. This Twitter data set consists of rumours associated with eight events. Each event has a collection of tweets labelled with one of four labels: Support, Query, Deny and Comment. We picked four events, Charliehebdo, Ferguson, Ottawashooting and Sydneysiege, to conduct experiments.
Features: For dataset preparation, each data point $\mathbf{x}_i$ associated with a tweet includes text embedding, retweet count, favourites count, punctuation features, sentiment polarity, negative and positive word count, presence of hashtags, user mentions, URLs, entities, etc. from the tweet information. Using pre-trained word2vec vectors (pre-trained on the Google News dataset: https://code.google.com/p/word2vec), each word is represented as an embedding of size 15. The text embedding of the tweet is obtained by concatenating the word embeddings. Each event's data is split into train, validation, and test datasets with the ratio 60:20:20 in the order of time at which the tweets occurred. Each tweet timestamp is converted to epoch time, and Min-Max normalization is applied over the time stamps associated with each event to keep the duration of the event in the interval [0, 1].
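The timestamp normalization and the chronological 60:20:20 split described above can be sketched as follows. The dictionary layout of a post (`"time"`, `"label"`) and the function names are hypothetical choices for illustration; the paper does not specify a data structure.

```python
def normalise_times(epoch_times):
    """Min-Max normalise the epoch timestamps of one event into [0, 1]."""
    lo, hi = min(epoch_times), max(epoch_times)
    span = hi - lo
    if span == 0:
        return [0.0 for _ in epoch_times]
    return [(t - lo) / span for t in epoch_times]

def chronological_split(posts, train=0.6, val=0.2):
    """Split the posts of one event by time: earliest for training, latest for testing."""
    ordered = sorted(posts, key=lambda p: p["time"])
    n_train = int(len(ordered) * train)
    n_val = int(len(ordered) * val)
    return (ordered[:n_train],
            ordered[n_train:n_train + n_val],
            ordered[n_train + n_val:])

# Toy event with ten posts given out of order (hypothetical epoch times).
raw_times = [1000 + 60 * k for k in (3, 0, 7, 1, 9, 2, 8, 4, 6, 5)]
scaled = normalise_times(raw_times)
posts = [{"time": t, "label": "Comment"} for t in scaled]
train_set, val_set, test_set = chronological_split(posts)
print(len(train_set), len(val_set), len(test_set))
```

Splitting by time rather than at random keeps future tweets out of the training data, which matches the sequential nature of the task.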

Experimental setup
In real time, new rumours arise and propagate at different time periods.Our experiments are conducted to predict stance of social media posts propagating in seen events as well as unseen events.
We conducted the following two experimental setups on the dataset.
• Seen Event: Here we train, validate and test on tweets of the same event. Each event's data is split in a 60:20:20 ratio in the order of time. This setup helps in predicting the stance of unseen tweets of the same event.
• Unseen Event: This setup helps in evaluating performance on an unseen event and in training on a larger dataset. Here we train and validate on 3 events and test on the 4th event. The last 20% of the data of each training event is set aside for validation.
During training, mini-batches are formed only from the posts in each event and are fed to the model in the order they appear in the event.
Baselines: We compared the results of our proposed RNODE and Bi-RNODE models with RNN based baselines such as LSTM [10], Bi-LSTM [1], GRU [3], Bi-GRU, and a Majority (labelling with the most frequent class) baseline. We also use a variant of the LSTM baseline that considers temporal information [20], LSTM-timeGap, where the time gap between consecutive data points is included as part of the input data.
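The unseen event setup and the time-gap feature used by the LSTM-timeGap baseline can be sketched as below. The function names, the dictionary keyed by event name, and the choice of a zero gap for the first post of an event are assumptions made for this illustration.

```python
def unseen_event_split(events, test_event, val_fraction=0.2):
    """Hold out one event for testing; keep the last part of every other
    event (posts assumed sorted by time) for validation."""
    train, val = {}, {}
    for name, posts in events.items():
        if name == test_event:
            continue
        cut = len(posts) - int(len(posts) * val_fraction)
        train[name], val[name] = posts[:cut], posts[cut:]
    return train, val, events[test_event]

def time_gaps(times):
    """Gap to the previous post, to be appended to each input vector.
    The first post has no predecessor, so its gap is set to 0 (assumption)."""
    return [0.0] + [b - a for a, b in zip(times, times[1:])]

# Toy data: each event is a time-ordered list of post identifiers.
events = {
    "charliehebdo": list(range(10)),
    "ferguson": list(range(10)),
    "ottawashooting": list(range(10)),
    "sydneysiege": list(range(10)),
}
train, val, test = unseen_event_split(events, test_event="sydneysiege")
print({name: (len(train[name]), len(val[name])) for name in train}, len(test))
print(time_gaps([0.0, 0.25, 1.0]))
```

Keeping the training and validation data grouped per event makes it straightforward to form mini-batches from a single event and to preserve the posting order within it.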
Evaluation Metrics: We consider the standard evaluation metrics such as precision, recall, F1 and in addition the AUC score to account for the data imbalance.We consider a weighted average of the evaluation metrics to compare the performance of models.
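A support-weighted average of per-class precision, recall and F1 can be computed as in the sketch below. It assumes that "weighted average" means weighting each class by its number of true instances, which is a common convention but is not stated explicitly in the text; the function name `weighted_prf` is hypothetical.

```python
from collections import Counter

def weighted_prf(y_true, y_pred):
    """Per-class precision, recall and F1, averaged with weights
    proportional to the number of true instances of each class."""
    support = Counter(y_true)
    total = len(y_true)
    p_w = r_w = f_w = 0.0
    for label, n_true in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        weight = n_true / total
        p_w += weight * precision
        r_w += weight * recall
        f_w += weight * f1
    return p_w, r_w, f_w

# Toy imbalanced example: three Comment (C) tweets and one Support (S) tweet.
y_true = ["C", "C", "C", "S"]
y_pred = ["C", "C", "S", "S"]
print(weighted_prf(y_true, y_pred))
```

Because frequent classes such as Comment dominate the weights, the weighted scores are best read alongside a measure such as AUC when the classes are imbalanced.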
Hyperparameters: All the models are trained for 50 epochs with a learning rate of 0.01, the Adam optimizer, dropout (0.2) regularization, a batch size of 50 and the cross-entropy loss function. Different hyperparameters, namely the number of neural network layers (1, 2), hidden representation sizes (64, 128), numerical methods (Euler, RK4, Dopri5 for RNODE and Bi-RNODE) and aggregation strategy (concatenation or averaging for Bi-LSTM and Bi-RNODE), are tried for all the models, and the best configuration is selected using the validation data for the different experimental setups and train/test data splits.

Results and Analysis
The results of the seen event and unseen event experiment setups can be found in Table 1, where the first and second rows for each model provide results on seen event and unseen event respectively. We can observe from Table 1 that for both the seen event and unseen event experiment setups, our proposed RNODE and Bi-RNODE models outperformed baseline models for all the four events. For the seen event setup, Bi-RNODE gives the best result, outperforming other models for most of the data sets and measures. For the unseen event setup, RNODE and Bi-RNODE models gave better results when compared to baseline models, except for the Charliehebdo event. Bi-RNODE results are better than RNODE for Charliehebdo and Ferguson, while they are close to RNODE for Ottawashooting and Sydneysiege. Under the seen event experiment on the Sydneysiege event, we plot the ROC curves for all the models in Figure 2.

CONCLUSION AND FUTURE WORK
We proposed RNODE and Bi-RNODE models for sequence classification of social media posts which naturally consider the temporal information and use it to model the dynamics of hidden representations using an ODE. This makes them more effective than LSTMs for social media where posts occur at irregular time intervals. The experimental results on the rumour stance classification problem in Twitter support the superior capability of RNODE and Bi-RNODE in performing tweet classification. As future work, we would like to further improve the sequence modelling capability of the proposed models by combining them with conditional random fields.

Figure 1: Architecture details of (a) RNODE and (b) Bi-RNODE.
For Bi-RNODE, an extra neural network $f'_{\theta'}()$ is required to compute hidden representations $\mathbf{h}_b(t'_i)$ in the backward direction. Training in Bi-RNODE is done in a similar manner to RNODE, with cross-entropy loss and back-propagation to estimate parameters.

Figure 2: ROC curves of the models (a) RNODE (b) LSTM (c) GRU (d) Bi-RNODE (e) Bi-LSTM (f) Bi-GRU trained on the Sydneysiege event for the seen event experiment.
We can observe that the ROC curves in Figures 2(a) and 2(d), corresponding to RNODE and Bi-RNODE respectively, are higher than those of LSTM, GRU, Bi-LSTM, and Bi-GRU. The proposed models are computationally and parametrically efficient: RNODE (0.22M parameters, in millions) and Bi-RNODE (0.33M) require fewer parameters than LSTM (1.70M) and Bi-LSTM (3.40M) models. Visualization of the latent hidden state representations of the proposed models using t-SNE plots (Figure 3(a) and 3(b)) shows that they are capable of separating data points from different classes into different groups. The proposed models

Table 1: Performance of all the models on the RumourEval-2019 [4] dataset. First and second rows of each model represent seen event and unseen event experiment results respectively.