Multi-Output Learning using Task-Wise Attention for Predicting Binary Properties of Tweets: Shared Task on Fighting the COVID-19 Infodemic

In this paper, we describe our system for the shared task on Fighting the COVID-19 Infodemic in the English Language. Our proposed architecture consists of a multi-output classification model for the seven tasks, with a task-wise multi-head attention layer for inter-task information aggregation. This was built on top of the Bidirectional Encoder Representations obtained from the RoBERTa Transformer. We were able to achieve a mean F1 score of 0.891 on the test data, leading us to the second position on the test-set leaderboard.


Introduction
In recent years, the spread of misinformation on social media has grown rapidly, and amid the global pandemic it has had serious consequences. COVID-19 misinformation has caused mass hysteria and panic, reluctance to wear masks and follow social-distancing norms, and even denial of the existence of the virus. There is therefore a need for automatic detection and flagging of tweets spreading misinformation.
Automatic detection of misinformation is an open research problem in NLP. Misinformation is intentionally written to deceive and to pass as factual, which makes detecting it a difficult task. Silva et al. (2020) analyzed a dataset of 505k tweets, building features from sentiment analysis, polarity scores, and LIWC, and used ML models such as Random Forest, AdaBoost, and SVM to classify tweets as factual or misinformation. Predicting answers to multiple questions, which is the setup of our current problem statement, can be modeled as a multi-task learning problem. Crawshaw (2020) surveys methods for sharing information among tasks to obtain task-specific performance boosts, such as cross-stitching and soft parameter sharing, as well as loss-weighting schemes based on task-dependent uncertainty and learning speed. Liu et al. (2019) highlight the use of attention mechanisms for multi-task learning and show that they perform competitively with other approaches.
Inspired by this idea, we propose an attention-based architecture, RoBERTa Multihead Attn, which incorporates inter-task information for task-specific performance enhancement. With a test-set F1 score of 0.891, our approach demonstrates the benefit of combining information among tasks over modeling them independently, and the effectiveness of multi-head attention for this purpose.

Task Description and Dataset
The objective of this task (Shaar et al., 2021) is to predict various binary properties of tweets. We were given a set of tweets for which we had to predict answers to 7 questions. These questions ask whether the tweet is harmful, whether it contains a verifiable claim, whether it may be of interest to the general public, whether it appears to contain false information, etc. There were three language tracks, English, Arabic, and Bulgarian, and a team was free to choose any subset of the languages.
For the English language, the training set consisted of 862 tweets, the dev set of 53 tweets, and the test set of 418 tweets. The training dataset statistics are shown in Table 1.

Methodology
We explored different base embeddings inspired by (Devlin et al., 2019) and (Liu et al., 2020). These are state-of-the-art language models which, when used with task-specific fine-tuning, perform well on a wide variety of tasks. The embeddings are passed to our task-specific architecture for further processing, and the whole model is trained end-to-end. We hypothesize that the prediction for one question may benefit from information needed to predict another question. For instance, questions Q4 ("Harmfulness: To what extent is the tweet harmful to the society/person/company/product?") and Q6 ("Harmful to Society: Is the tweet harmful to the society and why?") have deduction processes that may share several common steps. To model this, we add an inter-task multi-head attention layer before our prediction heads.

Preprocessing
Before feeding the tweets to the RoBERTa tokenizer, we performed the following operations:
1. We observed that there were a lot of non-ASCII characters in the tweets, so we stripped these characters.
2. We then replaced the emojis with their text description using the demoji python package.
3. We replaced all the links in the tweets with "URL" and mentions with "USER" tokens.
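The steps above can be sketched roughly as follows. This is a minimal illustration, not the authors' exact code: the function name and regular expressions are our assumptions, and the demoji step (step 2) is elided here.

```python
import re


def preprocess(tweet: str) -> str:
    """Rough sketch of the described preprocessing pipeline.

    The demoji step (replacing emojis with text descriptions) would run
    before the non-ASCII stripping in the full pipeline; it is omitted
    here, so emojis are simply removed by the ASCII filter.
    """
    # Replace links and user mentions with placeholder tokens.
    tweet = re.sub(r"https?://\S+", "URL", tweet)
    tweet = re.sub(r"@\w+", "USER", tweet)
    # Strip non-ASCII characters.
    tweet = tweet.encode("ascii", "ignore").decode("ascii")
    # Collapse any leftover whitespace.
    return re.sub(r"\s+", " ", tweet).strip()
```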

Data Augmentation
Due to the small size of the dataset, we used data augmentation to improve generalization, specifically Back-Translation. Given an input text, it is first translated into a destination language to obtain a new sentence, which is then translated back into the source language. This process creates a slightly modified version of the original sentence while still preserving its meaning. We carried this out with 3 destination languages (French, Spanish, German) using MarianMT (a Neural Machine Translation model). As an example:

Original Tweet
For the average American the best way to say if you have covid-19 is coughing in a rich person face and waiting for their test results

Augmented Tweet
For the average American the best way to tell if you have covid-19 is to cough in a rich person's face and wait for their test results

Task-Wise Multi-head Attention Architecture
Multi-head attention (Vaswani et al., 2017) has been shown to capture representations across different subspaces, making it more diverse in its modeling than single-head attention. Inspired by this, we added a multi-head attention layer to aggregate information among different tasks.
Our entire architecture is shown in Figure 1. The input sentences are encoded using the RoBERTa tokenizer and forward propagated to obtain the embedding vectors. Each embedding vector is passed through a linear block and then branched out through 7 different linear layers, one for each task. These are further processed to obtain the 7 task-specific vectors (we will refer to these as task vectors).
Each of these vectors is then passed through a multi-head attention layer, with the vector itself as the query and the concatenated task vectors as the keys and values. The attention weights captured this way signify what proportion of information the model propagates from each of the task-specific vectors. The information from all the task vectors is thus aggregated as their weighted sum to obtain the penultimate layer for each task. Note that the multi-head attention projection matrices are independent across tasks. A final linear layer maps these to the prediction layers, on which a softmax is applied for per-task prediction.
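A minimal PyTorch sketch of this task-wise attention block follows. The hidden size, layer names, and the placement of the per-task branches are our assumptions for illustration; the actual model also includes the preceding linear blocks.

```python
import torch
import torch.nn as nn


class TaskWiseAttention(nn.Module):
    """Sketch of the task-wise multi-head attention head (dimensions and
    names are our assumptions, not the authors' exact configuration)."""

    def __init__(self, hidden: int = 768, n_tasks: int = 7, n_heads: int = 3):
        super().__init__()
        self.n_tasks = n_tasks
        # One linear branch per task produces the task vectors.
        self.branches = nn.ModuleList(
            nn.Linear(hidden, hidden) for _ in range(n_tasks)
        )
        # Independent multi-head attention per task: a task's own vector is
        # the query; the stack of all task vectors is the key/value.
        self.attn = nn.ModuleList(
            nn.MultiheadAttention(hidden, n_heads, batch_first=True)
            for _ in range(n_tasks)
        )
        # Final per-task classifiers (binary questions -> 2 logits each).
        self.heads = nn.ModuleList(
            nn.Linear(hidden, 2) for _ in range(n_tasks)
        )

    def forward(self, emb: torch.Tensor):
        # emb: (batch, hidden) pooled transformer embedding.
        task_vecs = torch.stack([b(emb) for b in self.branches], dim=1)
        logits = []
        for t in range(self.n_tasks):
            query = task_vecs[:, t:t + 1, :]              # (batch, 1, hidden)
            agg, _ = self.attn[t](query, task_vecs, task_vecs)
            logits.append(self.heads[t](agg.squeeze(1)))  # (batch, 2)
        return logits
```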

Loss Weighting
The input data is skewed in the distribution of labels for each question. A natural approach to tackle this issue is to use a weighted loss. For each task, we assign the weight for class c as

w_c = N_samples / (N_classes × N_c)

where N_samples is the number of input data samples, N_classes is the number of classes, and N_c is the number of samples for class c for a particular task. This weighting was done independently for each task, based on the label distribution for that particular task.
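This per-task weighting scheme can be computed directly from the label counts; a minimal sketch (the function name is ours):

```python
from collections import Counter


def class_weights(labels):
    """Balanced per-class weights for one task:
    w_c = N_samples / (N_classes * N_c)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * n) for c, n in counts.items()}
```

Rare classes receive proportionally larger weights, so the weighted loss is not dominated by the majority label.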

Experiments
We conducted our experiments with BERT BASE and RoBERTa BASE. All our code is open-source and available on GitHub. The different architectures are explained below:
• Vanilla BERT: The input sentence is passed through the BERT BASE-UNCASED model. The output is first processed through a couple of linear blocks and finally branched out into task-wise linear layers that produce predictions for each of the 7 questions.
• BERT Multihead Attn: Figure 1 shows this architecture, with RoBERTa replaced by BERT. The input sentence is passed through the BERT transformer to obtain the bidirectional encoder representations. These are passed through a couple of linear blocks and then branched out into 7 linear layers, one per task. For each branch, the outputs of the seven linear layers are fed into a separate multi-head attention layer with 3 heads. The output of the multi-head attention for each task is finally passed through a linear layer and a softmax to obtain predictions.
• RoBERTa Multihead Attn : This model is the same as Bert Multihead Attn except that the transformer used is RoBERTa BASE .
All models were fine-tuned end-to-end, with weights also being updated in the embedding layers of BERT and RoBERTa. The loss function was weighted cross-entropy for each task, and the final loss was the sum of the losses for the 7 tasks. The learning rates followed a linear decay schedule from an initial learning rate. The models were trained in PyTorch with the HuggingFace library (Wolf et al., 2020).
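The summed multi-task objective described above can be sketched as follows; the function name, tensor shapes, and the pre-built weight tensors are our assumptions:

```python
import torch
import torch.nn.functional as F


def total_loss(logits_per_task, targets_per_task, weights_per_task):
    """Sum of per-task weighted cross-entropy losses.

    logits_per_task:  list of (batch, n_classes) tensors, one per task.
    targets_per_task: list of (batch,) integer label tensors.
    weights_per_task: list of (n_classes,) class-weight tensors
                      (see the Loss Weighting section).
    """
    losses = [
        F.cross_entropy(logits, targets, weight=w)
        for logits, targets, w in zip(
            logits_per_task, targets_per_task, weights_per_task
        )
    ]
    return torch.stack(losses).sum()
```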
Regarding the results in Table 3, we see that RoBERTa Multihead Attn performs best overall on the development set. We obtain a significant boost in performance over Vanilla BERT by using our proposed multi-head attention layers, and using RoBERTa embeddings brings a further slight improvement. We thus finalized RoBERTa Multihead Attn as our model for submission.
For this particular experiment, we used a learning rate of 5e-5 for the task-specific layers and 5e-6 for the RoBERTa fine-tuning layers. The model was trained for 60 epochs with the number of attention heads set to 3. All layers except the penultimate, attention, and RoBERTa layers had a dropout probability of 0.1. RoBERTa Multihead Attn beats the baselines by a significant margin overall, as shown in Table 2, and ranks 2nd on the test-set leaderboard for the English-language sub-task with a mean test-set F1 score of 0.891.
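The two-tier learning-rate setup described above might be wired up as follows. The choice of AdamW and the assumption that encoder parameter names begin with "roberta" are ours, not the authors' confirmed configuration:

```python
import torch


def build_optimizer(model: torch.nn.Module) -> torch.optim.AdamW:
    """Two parameter groups: a small LR (5e-6) for the pretrained encoder
    and a larger LR (5e-5) for the task-specific layers.

    Assumes encoder parameters live under a submodule named ``roberta``;
    adjust the prefix for a different model layout.
    """
    encoder, heads = [], []
    for name, param in model.named_parameters():
        (encoder if name.startswith("roberta") else heads).append(param)
    return torch.optim.AdamW(
        [
            {"params": encoder, "lr": 5e-6},
            {"params": heads, "lr": 5e-5},
        ]
    )
```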
Upon request from the reviewers, we also show development-set results for the given architecture on the Arabic and Bulgarian datasets. The bert-base-multilingual-cased embeddings were used as the base embeddings for these experiments. With reference to Table 4, we see that our proposed architecture outperforms the Vanilla BERT architecture on both the Arabic and Bulgarian datasets, further illustrating its effectiveness across languages.

Conclusion and Future Work
In this paper, we have described our system for predicting different binary properties of a tweet on the English sub-task of the Shared Task on Fighting the COVID-19 Infodemic at NLP4IF'21. Our approach uses the RoBERTa BASE architecture for the initial embeddings and builds on top of it using task-wise multi-head attention layers. Our results show that using a multi-head attention approach to aggregate information from different tasks leads to an overall improvement in performance.
Possible developments in this task include incorporating additional contextual information into our models using tweet-related features such as the number of URLs, user mentions, and hashtags in a tweet. User-related features such as the number of followers, account age, and account type (verified or not) could also be included. These features contain a lot of auxiliary information that can aid fine-grained classification. Sociolinguistic analysis such as Linguistic Inquiry and Word Count (LIWC) could also be used to capture emotional and cognitive components of the tweet.