A Dashboard for Mitigating the COVID-19 Misinfodemic

This paper describes the current milestones achieved in our ongoing project that aims to understand the surveillance of, impact of, and interventions against the COVID-19 misinfodemic on Twitter. Specifically, it introduces a public dashboard which, in addition to displaying case counts on an interactive map and a navigation panel, also provides several unique features not found elsewhere. In particular, the dashboard uses a curated catalog of COVID-19 related facts and debunks of misinformation, and it displays the most prevalent information from the catalog among Twitter users in user-selected U.S. geographic regions. The paper explains how BERT models are used to match tweets with the facts and misinformation and to detect their stance toward such information. It also discusses the results of preliminary experiments on analyzing the spatio-temporal spread of misinformation.

1 Introduction

As fear grows, false information related to the pandemic goes viral on social media and threatens an already overwhelmed population. Such misinformation misleads the public about how the virus is transmitted, how authorities and people are responding to the pandemic, its symptoms, treatments, and so on. This onslaught exacerbates the vicious impact of the virus, as the misinformation drowns out credible information, interferes with measures to contain the outbreak, depletes resources needed by those at risk, and overloads the health care system. Although health misinformation is not new (Oyeyemi et al., 2014), such a dangerous interplay between a pandemic and a misinfodemic is unprecedented. It calls for studying not only the outbreak but also the related misinformation; the fight on these two fronts must go hand in hand.
This demo paper describes the current milestones achieved in our ongoing project that aims to understand the surveillance of, impact of, and effective interventions against the COVID-19 misinfodemic. 1) For surveillance, we seek to discover the patterns by which different types of COVID-19 misinformation spread. 2) To understand the impact of misinformation, we aim to compare the spread of the SARS-CoV-2 virus with the spread of misinformation and derive their correlations. 3) To understand what types of interventions are effective in containing misinformation, we will contrast the spread of misinformation before and after debunking efforts. 4) To understand whether the outcomes related to 1), 2), and 3) differ across geographic locations and demographic groups, we will study the variability of misinformation and debunking efforts across such groups.
While we continue to pursue these directions, we have built an online dashboard at https://idir.uta.edu/covid-19/ to directly benefit the public. A screencast video of the dashboard is at bit.ly/3c6v5xf. The dashboard provides a map, a navigation panel, and timeline charts for looking up the numbers of cases, deaths, and recoveries, similar to a number of other COVID-19 tracking dashboards. However, our dashboard also provides several features not found elsewhere. 1) It displays the most prevalent factual information among Twitter users in any user-selected U.S. geographic region. 2) The "factual information" comes from a catalog that we manually curated. It includes statements from authoritative organizations; verdicts, debunks, and explanations of (potentially false) factual claims from fact-checking websites; and FAQs from credible sources. The catalog's entries are further organized into a taxonomy. For simplicity, we refer to it as the catalog and taxonomy of COVID-19 facts, or just facts, in the ensuing discussion. 3) The dashboard displays COVID-19 related tweets from local authorities of user-selected geographic regions. 4) It embeds a chatbot built specifically for COVID-19 related questions. 5) It shows case statistics from several popular sources, which sometimes differ.
The codebase of the dashboard's frontend, backend, and data collection tools is open-sourced at https://github.com/idirlab/covid19. All collected data are at https://github.com/idirlab/covid19data. In particular, the catalog and taxonomy of facts are also available through a SPARQL endpoint at https://cokn.org/deliverables/7-covid19-kg/, and the corresponding RDF dataset can be requested there.
What is particularly worth noting about the underlying implementation of the dashboard is the adaptation of state-of-the-art textual semantic similarity and stance detection models. Tweets are first passed through a claim-matching model, which selects the tweets that semantically match the facts in our catalog. Then, the stance detection model determines whether the tweets agree with, disagree with, or merely discuss these facts. This enables us to pinpoint pieces of misinformation (i.e., tweets that disagree with known facts) and analyze their spread.
A few studies have analyzed and quantified the spread of COVID-19 misinformation on Twitter (Kouzy et al., 2020; Memon and Carley, 2020; Al-Rakhami and Al-Amri, 2020) and other social media platforms (Brennen et al., 2020). However, these studies relied mostly on manual inspection of small datasets, whereas our system automatically sifts through millions of tweets and matches them against our catalog of facts.
2 The Dashboard

Figure 1 shows the dashboard's user interface, with its components highlighted.
Geographic region selection panel (Component 1). A user can select a specific country, a U.S. state, or a U.S. county using this panel or the interactive map (Component 2). Once a region is selected, the panel shows the counts of confirmed cases, deaths, and recovered cases for the region in collapsed or expanded mode. When the user expands a region, counts from all available sources are displayed; when it is collapsed, only counts from the default data source (which the user can customize) are displayed. These sources do not provide identical numbers.
Interactive map (Component 2). For each country and each U.S. state, a red circle is displayed whose area is proportional to its number of confirmed cases. When a state is selected, the circle is replaced by its counties' polygons in shades of red proportional to the counties' confirmed cases.
Timeline chart (Component 3). It plots the counts of the selected region over time and can be viewed in linear or logarithmic scale.
Panel of facts (Component 4). For the selected region, this panel displays facts from our catalog, and the distribution of people discussing, agreeing, or disagreeing with them on Twitter. A large number of people refuting these facts would indicate wide spread of misinformation. To avoid repeating misconceptions, the dashboard displays facts from authoritative sources only.
Government tweets (Component 5). It displays COVID-19 related tweets in the past seven days from officials of the user-selected geographic region. These tweets are from a curated list of 3,744 Twitter handles that belong to governments, officials, and public health authorities at U.S. federal and state levels.
Chatbot (Component 6). This component embeds the Jennifer Chatbot built by the New Voices project of the National Academies of Sciences, Engineering and Medicine (Li et al., 2020), which was built specifically for answering COVID-19 related questions. As part of the collaborative team behind this chatbot, we are expanding it using the aforementioned catalog.

3 The Datasets
The dashboard uses the following three datasets.
1) Counts of confirmed cases, deaths, and recoveries. We collected these counts daily from Johns Hopkins University, the New York Times (NYT), and the COVID Tracking Project. These sources provide statistics at various geographic granularities (country, state, county).
2) Tweets. We use a collection of approximately 250 million COVID-19 related tweets from January 1, 2020 to May 16, 2020, obtained from (Banda et al., 2020) (version 10.0). We removed tweets without location information, as well as Twitter handles (and their tweets) without location information, leaving 34.6 million tweets. We then randomly selected 10.4% of each month's tweets, leaving 3.6 million tweets. We used the OpenStreetMap (Quinion et al., 2020) API to map the locations of Twitter accounts from user-entered free text to U.S. county names, and the ArcGIS API to map the locations of geotagged tweets from longitude/latitude to counties; a geocoding sketch is given after this list of datasets.
3) A catalog and a taxonomy of COVID-19 related facts.
The manually curated catalog currently has 9,512 entries from 21 credible websites, including statements from authoritative organizations (e.g., WHO, CDC); verdicts, debunks, and explanations of factual claims (whose truthfulness varies) from fact-checking websites (e.g., the IFCN CoronaVirusFacts Alliance Database, PolitiFact); and FAQs from both credible sources (e.g., FDA, NYT) and a dataset curated by Wei et al. (2020).
We organized the entries in this catalog into a taxonomy of categories by integrating and consolidating the available categories from a number of the source websites, placing entries from other websites into these categories or creating new categories, and organizing the categories into a hierarchical structure based on their inclusion relationships. The catalog and taxonomy are also published as an RDF dataset, in which each entry of the catalog is identified by a unique resource identifier (URI) and is connected to a mediator node that represents the multiary relation associated with the entry. For example, Figure 3 shows a question about COVID-19, its answer and source, and the lowest-level taxonomy nodes that the entry belongs to, all connected to a mediator node. This RDF dataset, with 12 relations and 78,495 triples, is published in four popular RDF formats: N-Triples, Turtle, N3, and RDF/XML. Furthermore, we have set up a SPARQL query endpoint at https://cokn.org/deliverables/7-covid19-kg/ using OpenLink Virtuoso.
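To make the location mapping for the tweet dataset concrete, below is a minimal geocoding sketch. It uses the geopy wrapper around OpenStreetMap's Nominatim service purely for illustration; the dashboard's own pipeline calls the OpenStreetMap and ArcGIS APIs directly, so the library choice, the application name, and the rate-limiting settings here are assumptions.

```python
# Geocoding sketch: free-text profile locations and tweet coordinates -> U.S. counties.
# Assumes the geopy package; the dashboard itself uses the OpenStreetMap and ArcGIS APIs directly.
from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter

geocoder = Nominatim(user_agent="covid19-misinfo-dashboard")   # application name is illustrative
geocode = RateLimiter(geocoder.geocode, min_delay_seconds=1)   # respect Nominatim's usage policy
reverse = RateLimiter(geocoder.reverse, min_delay_seconds=1)

def county_from_profile(location_text):
    """Map a user-entered profile location (free text) to a U.S. county name."""
    hit = geocode(location_text, addressdetails=True, country_codes="us")
    return hit.raw.get("address", {}).get("county") if hit else None

def county_from_coordinates(lat, lon):
    """Map a geotagged tweet's latitude/longitude to a U.S. county name."""
    hit = reverse((lat, lon), addressdetails=True)
    return hit.raw.get("address", {}).get("county") if hit else None

print(county_from_profile("Arlington, Texas"))     # e.g., "Tarrant County"
print(county_from_coordinates(32.7357, -97.1081))
```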
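As a small illustration of how the mediator-node structure described above can be queried, the sketch below loads a local copy of the RDF dump with Python's rdflib and runs a SPARQL query. The file name, namespace, and predicate names (hasQuestion, hasAnswer, hasSource) are placeholders rather than the dataset's actual vocabulary, which should be taken from the endpoint's documentation.

```python
# Sketch of querying the catalog's RDF dump with rdflib.
# The file name, namespace, and predicates below are illustrative placeholders.
from rdflib import Graph

g = Graph()
g.parse("covid19_catalog.nt", format="nt")   # hypothetical local copy of the N-Triples dump

query = """
PREFIX ex: <http://example.org/covid19/>     # placeholder namespace
SELECT ?question ?answer ?source WHERE {
  ?mediator ex:hasQuestion ?question ;       # the mediator node ties the n-ary relation together
            ex:hasAnswer   ?answer ;
            ex:hasSource   ?source .
}
LIMIT 10
"""
for row in g.query(query):
    print(row.question, "|", row.answer, "|", row.source)
```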

4 Matching Tweets with Facts and Stance Detection
Given the catalog of COVID-19 related facts $F$ and the tweets $T$, we first employ claim matching to locate, for each fact $f \in F$, the set of tweets $t_f \subseteq T$ that discuss it. Next, we apply stance detection to the pairs $p_f = \{(t, f) \mid t \in t_f\}$ to determine whether each $t$ agrees with, disagrees with, or neutrally discusses $f$. Finally, aggregate results are displayed in Component 4 of the dashboard to summarize the public's view on each fact. Figure 2 depicts this pipeline.

Claim matching. We generate sentence embeddings $s_t$ and $s_f$ for $t$ and $f$, respectively, using the mean-tokens pooling strategy in Sentence-BERT (Reimers and Gurevych, 2019). The relevance between $t$ and $f$ is then calculated as the cosine similarity of the two embeddings:

$$R_{t,f} = \frac{s_t \cdot s_f}{\lVert s_t \rVert \, \lVert s_f \rVert}.$$

Given $R_{t,f}$, we model claim matching as a ranking task on the relevance between facts and tweets. The output of this stage is $t_f = \{t \in T \mid R_{t,f} \geq \theta\}$ for each fact $f \in F$, where the threshold $\theta$ is 0.8 in our implementation.
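A minimal sketch of this claim-matching step using the sentence-transformers library is shown below; the checkpoint name and the example texts are assumptions, while the mean-tokens pooling and the threshold of 0.8 follow the description above.

```python
# Claim-matching sketch: Sentence-BERT embeddings + cosine relevance with threshold 0.8.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("bert-base-nli-mean-tokens")   # a mean-tokens Sentence-BERT checkpoint
THETA = 0.8

facts = ["Drinking or injecting bleach does not cure or prevent COVID-19."]
tweets = [
    "people really out here saying bleach cures covid...",
    "Stay home and wash your hands.",
]

fact_emb = model.encode(facts, convert_to_tensor=True)
tweet_emb = model.encode(tweets, convert_to_tensor=True)

R = util.cos_sim(tweet_emb, fact_emb)   # R[t, f]: relevance between tweet t and fact f

for f, fact in enumerate(facts):
    t_f = [tweets[t] for t in range(len(tweets)) if R[t][f] >= THETA]
    print(fact, "->", t_f)
```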
Stance detection. Given $t_f$, we detect the stance that each tweet $t$ takes toward fact $f$. There are 3 classes of stance: agree ($t$ supports $f$), discuss ($t$ neutrally discusses $f$), and disagree ($t$ refutes $f$). For this task, we obtained a pre-trained BERT Base model (https://github.com/google-research/bert) and trained it on the Fake News Challenge Stage 1 (FNC-1) dataset (http://www.fakenewschallenge.org/). We denote this model Stance-BERT.
We first pre-process $p_f$ to conform with BERT input conventions by 1) applying $W(\cdot)$, the WordPiece tokenizer (Wu et al., 2016), 2) applying $C(a_1, a_2, \dots, a_n)$, a function that concatenates its arguments in order, and 3) inserting the special BERT tokens [CLS] and [SEP]. Since BERT has a maximum input length of $M = 512$ and some facts can exceed this limit, we propose a sliding-window approach inspired by (Devlin et al., 2019) to form the input $x_f$. For each pair $(t, f) \in p_f$, its element of $x_f$ is the set of windows

$$\big\{\, C([\mathrm{CLS}],\; W(t),\; [\mathrm{SEP}],\; W(f)[\,iS : iS + L\,],\; [\mathrm{SEP}]) \;\big|\; i = 0, 1, 2, \dots \big\},$$

where $S$ defines the distance between successive windows and $L = M - (|W(t)| + 3)$ is the sequence length available for each fact. If $iS + L$ is an out-of-bounds index for $W(f)$, the extra space is padded with null tokens. Each element $w \in x_f$ thus contains the set of windows representing one tweet-fact pair. Each window $w_i \in w$ is passed into Stance-BERT, which returns a probability distribution $\hat{y}_{f w_i}$ over the 3 classes for that window.
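To make the windowing concrete, here is a sketch that builds the windows for a single tweet-fact pair with a Hugging Face WordPiece tokenizer; the stride S and the tokenizer checkpoint are illustrative choices, not the values used in the paper.

```python
# Sliding-window construction of Stance-BERT inputs for one tweet-fact pair (M = 512).
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")   # WordPiece tokenizer W(.)
M = 512
S = 128   # stride between successive windows (an illustrative choice)

def build_windows(tweet, fact):
    w_t = tokenizer.tokenize(tweet)            # W(t)
    w_f = tokenizer.tokenize(fact)             # W(f)
    L = M - (len(w_t) + 3)                     # room left for the fact after [CLS] and two [SEP]s
    windows, i = [], 0
    while True:
        chunk = w_f[i * S : i * S + L]
        chunk += ["[PAD]"] * (L - len(chunk))  # pad out-of-bounds space with null tokens
        tokens = ["[CLS]"] + w_t + ["[SEP]"] + chunk + ["[SEP]"]
        windows.append(tokenizer.convert_tokens_to_ids(tokens))
        if i * S + L >= len(w_f):              # the last window reached the end of W(f)
            return windows                     # each window has exactly M token ids
        i += 1

fact = "COVID-19 is caused by a coronavirus and cannot be transmitted through 5G networks. " * 40
print([len(w) for w in build_windows("5G towers spread the coronavirus!!", fact)])
```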
Stance aggregation. For each fact $f$, the stance detection results are aggregated into scores $S_f^C$, where $C \in \{\text{agree}, \text{discuss}, \text{disagree}\}$, denoting the fraction (displayed as a percentage) of tweets that agree with, discuss, or disagree with $f$:

$$S_f^C = \frac{1}{|t_f|} \sum_{t \in t_f} \Big[\, \arg\max_{C'} \, \sigma(\hat{y}_{fw})_{C'} = C \,\Big],$$

where $[\cdot]$ is the Iverson bracket ($[P] = 1$ if $P$ is true, else 0) and $\sigma(\cdot)$ averages the model's output scores for each class across all windows of a tweet-fact pair. The 3 final stance scores are passed to the dashboard's panel of facts (Component 4) for display.
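The aggregation step can be sketched as follows; the per-window probabilities are dummy numbers, and the stance classification itself (running Stance-BERT over each window) is assumed to have happened upstream.

```python
# Stance aggregation sketch: average per-window class probabilities per tweet-fact pair (sigma),
# then count each pair's winning class and convert the counts to percentages (S_f^C).
import numpy as np

CLASSES = ["agree", "discuss", "disagree"]

def aggregate_stances(window_probs_per_pair):
    """Each item is an array of shape (num_windows, 3) with window-level class probabilities."""
    counts = {c: 0 for c in CLASSES}
    for probs in window_probs_per_pair:
        sigma = np.asarray(probs).mean(axis=0)      # average scores across the pair's windows
        counts[CLASSES[int(sigma.argmax())]] += 1   # Iverson bracket: 1 for the winning class
    n = max(len(window_probs_per_pair), 1)
    return {c: 100.0 * counts[c] / n for c in CLASSES}

# Toy example: 4 tweet-fact pairs for one fact, with 1-2 windows each.
pairs = [
    [[0.7, 0.2, 0.1]],
    [[0.1, 0.3, 0.6], [0.2, 0.2, 0.6]],
    [[0.2, 0.6, 0.2]],
    [[0.8, 0.1, 0.1]],
]
print(aggregate_stances(pairs))   # {'agree': 50.0, 'discuss': 25.0, 'disagree': 25.0}
```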

5 Performance of Claim Matching
To evaluate the performance of the claim matching component, we first formed the Cartesian product of the 3.6 million tweets with 500 "facts" from the catalog (see Section 3 for a description of the datasets), and then randomly selected 800 tweet-fact pairs from it. To obtain a roughly balanced sample, 400 pairs were drawn from the pairs scored above 0.8 by the claim matching component, and another 400 pairs were drawn from the rest. Three human annotators produced ground-truth labels for these 800 pairs: 183 pairs were labeled "matched" (i.e., the tweet and the fact have matching topics) and 617 pairs "unmatched". Table 2 shows the claim matching component's performance on these 800 pairs, measured by precision@k and nDCG@k (normalized Discounted Cumulative Gain at k) for k = 5, 10, 20, 50, and 100. Both are widely used ranking metrics; nDCG@k takes the order of the top-k predictions into account, whereas precision@k does not.
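For readers unfamiliar with the two metrics, the toy computation below contrasts them on a hypothetical ranking of labeled tweet-fact pairs; the labels and scores are made up.

```python
# Toy contrast of precision@k and nDCG@k on a ranked list of tweet-fact pairs.
import numpy as np
from sklearn.metrics import ndcg_score

# 1 = annotators labeled the pair "matched", 0 = "unmatched",
# listed in the order the claim-matching component ranked them (most relevant first).
ranked_labels = np.array([1, 1, 0, 1, 0, 0, 1, 0, 0, 0])
scores = np.linspace(1.0, 0.1, num=len(ranked_labels))    # monotone stand-in for the R_{t,f} scores

k = 5
precision_at_k = ranked_labels[:k].mean()                 # order within the top k does not matter
ndcg_at_k = ndcg_score([ranked_labels], [scores], k=k)    # order within the top k does matter
print(f"precision@{k} = {precision_at_k:.2f}, nDCG@{k} = {ndcg_at_k:.2f}")
```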
Table 3 shows Stance-BERT's performance on the FNC-1 competition test dataset and on our tweet-fact pairs, using F1 scores for all 3 classes as well as macro-F1. On FNC-1, we tested 2 variants of the same model: Stance-BERT_window, which uses the sliding-window approach (Section 4), and Stance-BERT_trunc, which truncates (discards) all input after M tokens but is otherwise identical to Stance-BERT_window. Both variants significantly outperformed the method of (Xu et al., 2018), one of the recent competitive methods on FNC-1.
Note that FNC-1 also includes a fourth class, "unrelated", which we discarded since we already have a claim-matching component. Because other recent stance detection methods (Fang et al., 2019) only reported macro-F1 scores calculated over all four classes including "unrelated", we cannot report a direct comparison with their methods. However, we argue that our macro-F1 of 0.65 remains highly competitive. The model of (Xu et al., 2018) achieved a 0.98 F1 score on "unrelated", which suggests that "unrelated" (i.e., separating related from unrelated pairs) is far easier than the other 3 classes (i.e., discerning among different classes of related pairs). Given that Stance-BERT significantly outperformed (Xu et al., 2018) on all 3 of these classes, it is plausible that Stance-BERT would remain a top performer under all four classes.
To evaluate Stance-BERT's performance on our tweet-fact pairs, the three human annotators produced ground-truth labels for another set of 481 randomly selected tweet-fact pairs, of which 200 pairs were labeled "matched". These 200 pairs were further labeled "agree"/"discuss"/"disagree", with a distribution of 110/73/17 pairs. Ultimately, we found that Stance-BERT performs remarkably well on the "agree" and "disagree" classes but falters on "discuss".

6 Misinformation Analysis
Figure 4 shows the cumulative timeline for the top-6 countries with the most COVID-19 misinformation tweets in the dataset. "Misinformation tweets" refers to tweets that go against known facts, as judged by our stance detection model. We also conducted a study on the correlation between misinformation tweet counts and COVID-19 case counts. We looked at the percentage of cases relative to a country's population size, and the percentage of misinformation tweets relative to the total number of tweets from that country. The Pearson correlation coefficients between them are shown in Table 4. We find that the number of misinformation tweets correlates most positively with the number of confirmed cases. In contrast, its correlation with the number of recovered cases is weaker.

Table 4: Correlation between the percentage of confirmed/deceased/recovered cases and the percentage of misinformation tweets. (The number of recovered cases in the U.K. after April 13th is missing from the data source.)
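The correlation computation itself is straightforward; the sketch below shows it with SciPy on made-up per-country percentages.

```python
# Pearson correlation between the share of misinformation tweets and the share of confirmed cases.
# The per-country percentages below are made up for illustration.
from scipy.stats import pearsonr

pct_confirmed = [0.31, 0.12, 0.25, 0.08, 0.18, 0.27]        # % of population confirmed positive
pct_misinfo   = [0.042, 0.021, 0.035, 0.015, 0.028, 0.039]  # % of the country's tweets flagged as misinformation

r, p_value = pearsonr(pct_confirmed, pct_misinfo)
print(f"Pearson r = {r:.3f} (p = {p_value:.3f})")
```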
Finally, we manually categorized the misinformation tweets based on the taxonomy (Section 3). Table 5 lists the five most frequent categories of misinformation tweets. These five categories make up 49.9% of all misinformation tweets, with the remaining 50.1% spread across the other 33 categories.

7 Conclusion
This paper introduces an information dashboard constructed in the context of our ongoing project on the COVID-19 misinfodemic. Going forward, we will focus on scaling up the dashboard, including more comprehensive tweet collection and catalog discovery and curation. We will also add more functions to the dashboard that are aligned with our project goals of studying the surveillance of, impact of, and interventions against the COVID-19 misinfodemic.