ComplexDataLab at W-NUT 2020 Task 2: Detecting Informative COVID-19 Tweets by Attending over Linked Documents

Given the global scale of COVID-19 and the flood of social media content related to it, how can we find informative discussions? We present Gapformer, which effectively classifies content as informative or not. It reformulates the problem as graph classification, drawing not only on the tweet but also on connected web pages and entities. We leverage a pre-trained language model as well as the connections between nodes to learn a pooled representation for each document network. We show it outperforms several competitive baselines and present ablation studies supporting the benefit of the linked information. Code is available on GitHub.


Introduction
COVID-19 is a critical public health crisis, with over 25 million cases and 840 thousand deaths worldwide and counting (WHO, 2020). One important way to fight the pandemic is through understanding and leveraging the vast amount of information humans produce about it on social media. Success here can facilitate individual and population-level case detection, contact tracing, case prediction, effective information dissemination, and more (Qin et al., 2020). However, the data is far too large (for example, hundreds of millions of tweets (Qazi et al., 2020)) to extract all the useful information by hand. There is also a huge amount of useless and even outright false information (Alam et al., 2020). Therefore, algorithms that extract useful information effectively are crucial. W-NUT 2020 Task 2 (Nguyen et al., 2020) is to build such a system on a particular Twitter dataset and predict which tweets are informative "about recovered, suspected, confirmed and death cases as well as location or travel history of the cases." We present here GAPFORMER, which combines multiple sources of information (the tweet, the content of linked web pages, and information about named entities related to the tweet) to improve detection of informative content. Our code is available at https://github.com/ComplexData-MILA/gapformer.
Our experiments show GAPFORMER outperforms 7 baselines. It also performs best when incorporating all three types of data, showing the efficacy of using the proposed graph structure. We believe this method can also be applied in other related settings, such as misinformation detection, and have work in progress to investigate possible extensions.
However, COVID-19 is unprecedented in its global scale and impact. While there is also a historic amount of research taking place to counter it (Brainard, 2020), more is needed; there are still lives to be saved. In addition, while people have tackled related COVID-19 Twitter data tasks (Boulos and Geraghty, 2020; Alam et al., 2020), to our knowledge this particular task formulation had not been thoroughly investigated prior to this shared task competition. With the success of transformer-based models (Vaswani et al., 2017; Devlin et al., 2018), it is natural to consider applying them here. BERTweet (Dat Quoc Nguyen and Nguyen, 2020) and COVID-Twitter-BERT (CT-BERT; Müller et al., 2020), BERT language models fine-tuned on Twitter data (the latter on tweets related to COVID-19 specifically), are particularly relevant. However, pure language models struggle to consider additional contextual information beyond the raw content of the tweet. As our experiments show, this is an important limitation.
Because this limitation applies to many other applications as well, a rapidly growing area of research is embedding additional contextual knowledge into language models, especially with the use of graph structures (Zhang, 2020). This is often done at the language-modelling level, such as how Lu et al. (2020) use normalized mutual information to inject global information into each layer of BERT. That said, other work has focused on injecting knowledge into an existing pre-trained language model. In particular, Transformer-XH (Zhao et al., 2020) proposes a way to leverage the network structure of documents alongside a pre-trained model for a classification task. GAPFORMER builds on this line of work by proposing a simple yet effective architecture which contextualizes each node before pooling the graph into a single fixed embedding.

Symbols
G = {V, E}    A graph and its vertex/edge sets
t             A tweet
E_t           Entities mentioned by t
A_t           Articles linked to by t

Our work presents the GrAph Pooling Transformer (GAPFORMER), which combines powerful semantic representations from pre-trained language models with structured information from graph neural networks by incorporating additional context from the entities and web documents within each tweet. We reformulate the task as graph classification, where each instance is a single graph comprised of one tweet as well as the extracted entities and documents.

Graph Construction
For each instance, we construct a graph G = {V, E} with the documents and entities in each tweet t as nodes and t itself as a supernode. We describe the construction process below.

Node Selection
We extract each entity mentioned in a given tweet t and link it to its respective Wikipedia entry. We then represent each entity with its associated Wikipedia document. The full set of entities extracted from t forms E_t. We use the Python packages FLAIR (Akbik et al., 2019) for named entity recognition and BLINK (Wu et al., 2019) for entity linking.
We also retrieve the content of any news articles linked to by the tweet, with the intuition that tweets which cite their sources are more likely to be informative, and that the content of said sources is likely to provide a useful signal as well. We use the newspaper3k Python package (Lucas Ou-Yang, 2013) to retrieve the summaries for each article, as full articles tend to be particularly long and may contain superfluous information. This set of articles forms A t .
Finally, to form the vertex set, we take V = A t ∪ E t ∪ {t}.
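As an illustrative sketch, the construction above can be written as follows. The tweet text, article summaries, entity names, and the `npmi` score table are placeholders; in the real pipeline they come from the tweet itself, newspaper3k, FLAIR/BLINK, and the dataset-wide NPMI computation described in the next section.

```python
def build_graph(tweet, articles, entities, npmi, threshold=0.15):
    """Return (nodes, edges) for one instance, with V = A_t ∪ E_t ∪ {t}."""
    nodes = list(articles) + list(entities) + [tweet]
    # Edge from the tweet supernode to every node, including itself,
    # which forms the self-loop on t.
    edges = [(tweet, n) for n in nodes]
    # Connect entity pairs whose NPMI exceeds the threshold.
    for i, e1 in enumerate(entities):
        for e2 in entities[i + 1:]:
            score = npmi.get((e1, e2), npmi.get((e2, e1), 0.0))
            if score > threshold:
                edges.append((e1, e2))
    return nodes, edges

# Toy instance: one article, two co-occurring entities.
nodes, edges = build_graph(
    tweet="t",
    articles=["a1"],
    entities=["e1", "e2"],
    npmi={("e1", "e2"): 0.4},
)
```

This yields four nodes, four supernode edges (including the self-loop on `t`), and one entity-entity edge.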

Edge Selection
To begin, we draw an edge between t and each node n ∈ V to embed t with additional information (including t itself, forming a self-loop). To better contextualize this information, we also draw edges for pairs of entities, as the intent of an entity mention may vary between different contexts. However, we must also be wary of simply adding as many connections as possible, which would decrease efficiency. We select which entity pairs to connect by calculating the normalized pointwise mutual information (NPMI) (Lu et al., 2020) between each pair. Using NPMI as the selection criterion allows us to connect nodes which frequently co-occur (mentioned in the same tweet) throughout the dataset, suggesting a more meaningful relationship. Empirically, we observed 0.15 to be a sensible threshold.
For entities e_i and e_j:

npmi(e_i, e_j) = pmi(e_i, e_j) / (− log p(e_i, e_j))

The full algorithm for edge selection is described in Algorithm 1.
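The NPMI computation above can be sketched as follows, using co-occurrence counts over the whole dataset (two entities co-occur when they are mentioned in the same tweet). The function name and input format are illustrative, not the paper's implementation.

```python
import math
from collections import Counter
from itertools import combinations

def npmi_scores(tweet_entity_sets):
    """Map each entity pair (sorted tuple) to its NPMI, where
    p(.) is estimated from per-tweet entity mentions."""
    n = len(tweet_entity_sets)
    single = Counter()
    pair = Counter()
    for ents in tweet_entity_sets:
        ents = set(ents)
        single.update(ents)
        pair.update(combinations(sorted(ents), 2))
    scores = {}
    for (e1, e2), c in pair.items():
        p_xy = c / n
        p_x, p_y = single[e1] / n, single[e2] / n
        pmi = math.log(p_xy / (p_x * p_y))
        # NPMI normalizes PMI into [-1, 1]; 1 means perfect co-occurrence.
        scores[(e1, e2)] = pmi / (-math.log(p_xy))
    return scores
```

A pair of entities that always appear together gets an NPMI of 1, while independent entities get a score near 0, so thresholding at 0.15 keeps only pairs with a meaningfully positive association.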

GAPFORMER
For each instance in the dataset, we construct a graph G as described above. Each node n_i ∈ G is represented by a text document (either a tweet, entity description, or article). We tokenize each node and obtain word embeddings using a pre-trained transformer language model. We pool the word embeddings by selecting the embedding of the [CLS] token from the final output layer, forming the graph's node embeddings.
After embedding each node with the pre-trained language model, we apply k layers of mean-pooling GraphSAGE convolutions (Hamilton et al., 2017) to contextualize each node with respect to its neighbors. We then aggregate the graph into a single embedding using max pooling, attention, or another pooling mechanism. Empirically, we find max pooling to be most effective on this dataset. Finally, we use a linear layer to predict output logits.
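As a dependency-free illustration of this graph stage, the sketch below applies one mean-aggregation step per node (GraphSAGE-style, with the learned weight matrix and nonlinearity omitted; the node vector and its neighborhood mean are simply averaged), followed by element-wise max pooling over nodes. All names and the toy vectors are hypothetical.

```python
def mean_sage_layer(embeddings, neighbors):
    """One simplified mean-aggregation step.
    embeddings: {node: [float]}, neighbors: {node: [node]}."""
    out = {}
    for node, vec in embeddings.items():
        nbr_vecs = [embeddings[m] for m in neighbors.get(node, [])] or [vec]
        mean = [sum(vals) / len(nbr_vecs) for vals in zip(*nbr_vecs)]
        # A real GraphSAGE layer would apply W·[vec ; mean] and a
        # nonlinearity here; we just average the two for illustration.
        out[node] = [(a + b) / 2 for a, b in zip(vec, mean)]
    return out

def max_pool(embeddings):
    """Element-wise max over all node embeddings: one vector per graph."""
    return [max(vals) for vals in zip(*embeddings.values())]

# Toy graph: tweet supernode "t" (with self-loop) and one entity "e".
emb = {"t": [1.0, 0.0], "e": [0.0, 1.0]}
nbrs = {"t": ["t", "e"], "e": ["t"]}
graph_embedding = max_pool(mean_sage_layer(emb, nbrs))
```

Stacking k such layers before pooling lets information propagate k hops through the graph, which is what allows the tweet node to absorb signal from linked articles and entities.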
The full algorithm is presented in algorithm 2. The model is trained end-to-end using cross entropy loss, and optimized with AdamW (Loshchilov and Hutter, 2017). Full implementation details are available in our GitHub repository.

Data
The dataset provided for this task (Nguyen et al., 2020) consists of ten thousand tweets, labeled "informative" or "uninformative." A default split into train, validation, and test sets is also provided. Since we do not have access to the actual test set for this task (it will only be published after the camera-ready deadline for this paper), we treat the provided validation set as the test set in our reported results. We also randomly split 1000 tweets from the train set to serve as the validation set used for hyper-parameter tuning of the deep-learning-based models (i.e., early stopping), so as to avoid corrupting the actual validation set (our test set).
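The held-out split described above can be sketched as follows; the function name and fixed seed are illustrative, not the paper's exact code.

```python
import random

def split_train(train, n_val=1000, seed=0):
    """Hold out n_val tweets from the training set for early stopping,
    leaving the task's provided validation set untouched so it can
    serve as the test set for reported results."""
    rng = random.Random(seed)
    idx = list(range(len(train)))
    rng.shuffle(idx)
    held_out = set(idx[:n_val])
    new_train = [t for i, t in enumerate(train) if i not in held_out]
    tuning_val = [t for i, t in enumerate(train) if i in held_out]
    return new_train, tuning_val
```

Fixing the seed keeps the split reproducible across runs, which matters when comparing models tuned with early stopping.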

Baselines
We compare our model with several baselines: • Naive Bayes (NB).
• Base BERT. This is the original BERT model with a classification head. The entire model is fine-tuned during training. It is implemented using Huggingface Transformers (Wolf et al., 2019) and trained using PyTorch Lightning (Falcon, 2019).
• BERTweet (Dat Quoc Nguyen and Nguyen, 2020). This is a BERT model fine-tuned on Twitter data. Our implementation fine-tunes the weights provided by the authors, with training identical to Base BERT.
The first 3 all use term frequency-inverse document frequency (TF-IDF) as features and the default Scikit-learn (Pedregosa et al., 2011) implementation. The last 3 (BERT-based models), as well as GAPFORMER, all use the AdamW optimizer (Loshchilov and Hutter, 2017) with a linear warmup scheduler, a learning rate of 7e-6, and are trained on an NVIDIA RTX8000 GPU. For GAPFORMER, we find CT-BERT to be the most effective pre-trained language model, and use it in all experiments. We train it for 8 epochs with batch size 2, accumulating gradients over 8 batches, because CT-BERT is too large to fit a larger batch in GPU memory. We also use 16-bit precision, a maximum sequence length of 128, dropout of 0.5 for the graph stage, and max pooling.

From the results table, we can see that using all three information sources performs best in terms of overall accuracy and F1 score. Although using just the tweet gives particularly high recall, its overall performance is significantly worse than the others, about 3% lower in accuracy and F1 compared with using all three. This strongly suggests that GAPFORMER leverages the articles and entities effectively to achieve better accuracy.
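The gradient-accumulation setting above (batch size 2 with 8 accumulation steps) gives an effective batch size of 16. As a minimal, dependency-free stand-in for the PyTorch behaviour, the sketch below averages a fixed number of hypothetical per-batch gradient vectors before a single optimizer step would be taken.

```python
def accumulate(gradients, accum_steps=8):
    """Average `accum_steps` per-batch gradients into the single
    gradient that one optimizer step would use. Each gradient is a
    toy list of floats standing in for a parameter tensor."""
    assert len(gradients) == accum_steps
    return [sum(vals) / accum_steps for vals in zip(*gradients)]
```

This is why a batch size of 2 remains usable with a model as large as CT-BERT: memory scales with the per-step batch, while the optimizer still sees the statistics of the larger effective batch.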

Conclusions
We presented GAPFORMER, which detects informative COVID-19 tweets using a graph classification system that leverages linked articles and entities. This is done by first building graphs, using normalized pointwise mutual information to determine related nodes, and then combining pre-trained language models with GraphSAGE convolutions and a pooling mechanism to classify them.
As shown in the experiments, this way of incorporating additional data is effective. It enables this system to outperform all seven baselines tested.
In future work, we plan to extend and upgrade this system in several ways: • Enable broader input data, e.g. tweet replies and user information. This will include heterogeneous data rather than only text.
• Improve performance by collecting and using more labeled data.
• Adapt it to other tasks. For example, we believe GAPFORMER can be used to effectively detect COVID-19 misinformation.
We are also looking at real-world applications of this tool. Since accuracy and other metrics are already above 90%, we believe it can have practical value, especially with the improvements above. We therefore aim to find direct ways to use GAPFORMER to help mitigate the COVID-19 pandemic.