Mega-COV: A Billion-Scale Dataset of 100+ Languages for COVID-19

We describe Mega-COV, a billion-scale dataset from Twitter for studying COVID-19. The dataset is diverse (covers 268 countries), longitudinal (goes back as far as 2007), multilingual (comes in 100+ languages), and has a significant number of location-tagged tweets (~169M tweets). We release tweet IDs from the dataset. We also develop two powerful models, one for identifying whether or not a tweet is related to the pandemic (best F1=97%) and another for detecting misinformation about COVID-19 (best F1=92%). A human annotation study reveals the utility of our models on a subset of Mega-COV. Our data and models can be useful for studying a wide host of phenomena related to the pandemic. Mega-COV and our models are publicly available.


Introduction
The seeds of the coronavirus disease 2019 (COVID-19) pandemic are reported to have started as a local outbreak in Wuhan (Hubei, China) in December 2019, but soon spread around the world (WHO, 2020). As of January 24, 2021, the number of confirmed cases around the world exceeded 99.14M and the number of confirmed deaths exceeded 2.13M. In response to this ongoing public health emergency, researchers are mobilizing to track the pandemic and study its impact on all types of life on the planet. Clearly, the many ways the pandemic has left its footprint on human life will be studied for years to come. Enabling scholarship on the topic by providing relevant data is an important endeavor. Toward this goal, we collect and release Mega-COV, a billion-scale multilingual Twitter dataset with geo-location information. As a result of the pandemic, most countries around the world went into lockdown and the public health emergency has restricted physical aspects of human communication considerably. As hundreds of millions of people spend more time sheltering in place, communication over social media has become more important than ever. In particular, the content of social media communication promises to capture significant details about the lives of tens of millions of people. Mega-COV is intended as a repository of such content.
There are several ongoing efforts to collect Twitter data, and our goal is to complement these. More specifically, we designed our methods to harvest a dataset that is unique in multiple ways, as follows:
Massive Scale: Very large datasets lend themselves to analyses that are not possible with smaller data. Given the global nature of COVID-19, we realize that a large-scale dataset will be most useful, as the scale allows for slicing and dicing the data across different times, communities, languages, and regions in ways that are not possible otherwise. For this reason, we dedicated significant resources to harvesting and preparing the dataset. Mega-COV has solid international coverage and brings together data from 1M users across 268 countries (see Section 3.1). Overall, our dataset has ∼1.5B tweets (Section 2). This is one order of magnitude larger than #COVID-19 (Chen et al., 2020), the largest dataset we know of (∼144M tweets as of June 1, 2020). Both our own dataset and that of Chen et al. (2020) are growing over time; all statistics in the current paper are based on our collection as of May 15, 2020. As of October 6, 2020, the authors of #COVID-19 report 649.9M tweets on their GitHub (https://github.com/echen102/COVID-19-TweetIDs), and our own dataset has exceeded 5B tweets.
Topic Diversity: We do not restrict our collection to tweets carrying certain hashtags. This makes the data general enough to cover content and topics directly related to COVID-19, regardless of the existence of accompanying hashtags. It also allows for investigating themes that may not be directly linked to the pandemic but on which the pandemic may have some bearing, which should be taken into account when investigating such themes. This is important because users can, and indeed do, post about activities impacted by the health crisis without using any hashtags. In fact, users may not mention COVID-19 at all, even though what they are posting about could be affected by the pandemic one way or another (e.g., "eating habits", "shopping behavior"). Section B and Section C in the Appendix provide a general overview of issues discussed in the dataset.
Longitudinal Coverage: We collect multiple data points (up to 3,200) from each user, with the goal of allowing comparisons between the present and the past across the same users, communities, and geographical regions (Section 3.2). Again, this is desirable since, without data from pre-COVID-19 time, it would be challenging to make any such comparisons. For example, some users may have stopped posting about "exercising" during the pandemic, but we cannot definitively identify this without access to these users' previous data, where they may have been posting about their physical activities.
Language Diversity: Since our collection method targets users, rather than hashtag-based content, Mega-COV is linguistically diverse. In theory, any language posted to Twitter by a user whose data we have collected should be represented. Based on Twitter-assigned language codes, we identify a total of 65 languages. However, applying two different language detection tools to the whole dataset, we identify more than 100 languages (Section 3.3).
No Distribution Shift: Related to the two previous points, but from a machine learning perspective, by collecting the data without conditioning on the existence of specific (or any) hashtags, we avoid introducing distribution bias. In other words, the data can be used to study various phenomena in-the-wild.
This warrants more generalizable findings and models. A dataset as large as Mega-COV can be hard to navigate; an informative description of the dataset is therefore necessary. In this paper, we provide an explanation of a number of global aspects of the dataset, including its geographic, temporal, and linguistic coverage. We also provide a high-level content analysis of the data, and explore user sharing of content from particular web domains, with a focus on news media. In the context of our investigation of Mega-COV, we make an array of important discoveries. For example, we strikingly discover that, perhaps for the first time in Twitter history, users address one another and retweet more than they post tweets. We also find a noticeable rise in ranks for news sites (based on how frequently their URLs are shared) during 2020 as compared to 2019, with a shift toward global (rather than local) news media. A third finding is that use of the Twitter platform surged in March, perhaps making it the busiest time in the history of the network. Furthermore, we develop two groups of effective neural models: (1) COVID-relevance models (for detecting whether a tweet is related to COVID-19 or not).
(2) COVID-misinformation models (for detecting whether a text carries fake information or not). In addition to releasing our best models, we also apply them to a total of 30M tweets from Mega-COV and release our tags to accelerate further research on the topic.
The rest of the paper is organized as follows: In Section 2, we describe our data collection methods. Section 3 is where we investigate geographic, linguistic, and temporal dimensions of our data. We describe our models for detecting COVID-19 tweets and COVID-misinformation in Section 4. Section 5 is where we apply our relevance and misinformation models to a large sample of Mega-COV. Section 6 is about data release and ethics. We provide a literature review in Section 7, and conclude in Section 8.

Data Collection
To collect a sufficiently large dataset, we deployed crawlers using the Twitter streaming API over Africa, Asia, Australia, Europe, North America, and South America, starting in early January 2020. This allows us to acquire a diverse set of tweets from which we can extract a random set of user IDs whose timelines (up to 3,200 tweets each) we then iteratively crawl every two weeks. This gave us data going back from July 30, 2020, depending on how prolific a poster each user is (see Table 4a for a breakdown). In this paper, we describe and analyze the version of Mega-COV collected up to May 15, 2020 and use the term Mega-COV to refer to it. Mega-COV comprises a total of 1,023,972 users who contribute 1,487,328,805 tweets. For each tweet, we collect the whole JSON object. This gives us access to various types of information, such as user location and the language tag (including "undefined") Twitter assigns to each tweet. We then use the data streaming and processing engine Spark to merge all user files and run our analyses. To capture a wide range of behaviors, we keep tweets, retweets, and responses (i.e., direct user-to-user interactions) as independent categories. Table 1 offers a breakdown of the distribution of the different types of posts in Mega-COV. Tweet IDs of the dataset are publicly available at our GitHub and can be downloaded for research. To the extent possible, we intend to provide semi-regular updates to the dataset repository.
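As an illustration of the timeline-harvesting step described above, the following is a minimal sketch using tweepy against the Twitter API v1.1; the credential strings and the crawl_timeline helper are placeholders, not the authors' released crawler.

```python
import tweepy

# Placeholder credentials; a real deployment would also persist results and
# re-crawl each user's timeline every two weeks, as described above.
auth = tweepy.OAuth1UserHandler("API_KEY", "API_SECRET", "ACCESS_TOKEN", "ACCESS_SECRET")
api = tweepy.API(auth, wait_on_rate_limit=True)

def crawl_timeline(user_id, max_tweets=3200):
    """Fetch up to ~3,200 of a user's most recent tweets (the API's hard limit)."""
    tweets = []
    # Cursor pages through the timeline 200 tweets at a time.
    for status in tweepy.Cursor(api.user_timeline, user_id=user_id,
                                count=200, tweet_mode="extended").items(max_tweets):
        tweets.append(status._json)  # keep the whole JSON object, as in Mega-COV
    return tweets
```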

Geographic Diversity
A region from which a tweet is posted can be associated with a specific 'point' location or with a Twitter place whose 'bounding box' describes a larger area such as a city, town, or country. We refer to tweets in this category as geo-located. A smaller fraction of tweets are also geo-tagged with longitude and latitude. As Table 2 shows, Mega-COV has ∼187M geo-located tweets from ∼740K users and ∼31M geo-tagged tweets from ∼267K users. Table 2 also shows the distribution of tweets and users over the top two countries represented in the dataset, the U.S. and Canada (North America), and other locations (summed up as one category, but see also Table 3 for countries in the data by continent). As explained, to allow comparisons over time (including behavioral changes during COVID-19), we include pre-2020 data in Mega-COV. For the year 2020, Mega-COV has ∼66M geo-located tweets from ∼670K users and ∼3M geo-tagged tweets from ∼109K users. We note that significant parts of the data may still originate from these various countries but simply lack geo-location information in the original JSON objects retrieved from Twitter. Figure 2 shows actual point co-ordinates of locations from which the data were posted. Figure 3 shows the geographical diversity in Mega-COV based on geo-located data. We show the distribution in terms of the number of cities over the 20 countries from which we retrieved the highest number of locations in the dataset, broken down by all-time and the year 2020. Overall, Mega-COV has data posted from a total of 167,202 cities that represent 268 countries. Figure A.2 in Appendix A shows the distribution of data over countries. The top 5 countries in the data are the U.S., Canada, Brazil, the U.K., and Japan. As we mention earlier, other top countries in the data across the various continents are shown in Table 3.

Figure 2: World map coverage of Mega-COV. Each dot is a point co-ordinate (longitude and latitude) from which at least one tweet was posted. Clearly, users tweet while traveling, whether by air or sea.
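To make the geo-located vs. geo-tagged distinction concrete, below is a minimal sketch assuming the standard Twitter v1.1 tweet JSON layout with "coordinates" and "place" fields; the geo_info helper is illustrative.

```python
def geo_info(tweet):
    """Classify one tweet JSON object as geo-tagged, geo-located, or neither."""
    coords = tweet.get("coordinates")
    if coords:  # exact point: GeoJSON order is [longitude, latitude]
        return "geo-tagged", tuple(coords["coordinates"])
    place = tweet.get("place")
    if place:   # Twitter place with a bounding box (city, town, or country)
        return "geo-located", (place.get("full_name"), place.get("country_code"))
    return None, None
```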

Temporal Coverage
Our goal is to make it possible to exploit Mega-COV for comparing user social content over time. Since we crawl user timelines, the dataset comprises content going back as early as 2007. We observe a clear increase in platform use during the first months of 2020 compared to the same period in 2019. This is expected, both due to physical distancing and a wide range of human activity (e.g., "work", "shopping") moving online. More precisely, moving activities online causes users to be on their machines for longer and hence have easier access to social media. The clear spike in the month of March 2020 is striking. It is particularly so given a shifted pattern of use: retweeting and replying (to others) are both observably more frequent than tweeting itself. This especially takes place during the month of March, and somewhat continues in April, as shown in Figure 4b. Figure 4a and Figure 4b also show a breakdown of tweets, retweets, and replies.
A striking discovery is that, for 2020, users are engaged in conversations with one another more than they tweet directly to the platform. To our knowledge, this may be the first time this pattern has occurred in the history of the network, at least as far as our massive dataset can tell. In addition, for 2020, we also see users retweeting more than tweeting. Based on Mega-COV, this too is happening for the first time.

Linguistic Diversity
We perform the language analysis based on all tweets (n ≈ 1.5B), including retweets and replies. Twitter assigns 65 language IDs to ∼1.4B tweets, while the rest are tagged as "und" (for "undefined").
Mega-COV has ∼104M (∼7%) tweets tagged as "und". We run two language identification tools, langid (Lui and Baldwin, 2012) and the Compact Language Detector (Ooms and Sites, 2018), on the whole dataset (including tweets tagged "und" by Twitter). After merging the language tags from Twitter and the two tools, we acquire a total of 104 labels. This makes Mega-COV very linguistically rich. Table 4 shows the top 20 languages identified by Twitter (left) and the top 20 languages tagged by one of the two tools, langid (Lui and Baldwin, 2012), after removing the 65 Twitter languages (right).
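A sketch of how this two-tool pass could be run over the text of each tweet is shown below; we use pycld2, a Python binding of the Compact Language Detector, as a stand-in for the CLD tool cited above, and the detect_languages helper is illustrative.

```python
import langid
import pycld2

def detect_languages(text, twitter_lang=None):
    """Collect the language tags assigned to one tweet by Twitter, langid, and CLD2."""
    tags = set()
    if twitter_lang and twitter_lang != "und":
        tags.add(twitter_lang)                # Twitter-assigned tag
    tags.add(langid.classify(text)[0])        # langid returns (lang, score)
    is_reliable, _, details = pycld2.detect(text)
    if is_reliable:
        tags.add(details[0][1])               # top CLD2 guess, ISO 639 code
    return tags
```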

Models
We develop two groups of models suited for answering important questions related to COVID-19, including making use of Mega-COV. These are (1) COVID-relevance, where a classifier labels a tweet as relevant to COVID-19 or not, and (2) COVID-misinformation, where a model predicts text veracity pertaining to COVID-19 (i.e., whether a text carries true or fake information related to the pandemic). We now describe our methods.

Methods
For all our models, we fine-tune 3 popular pre-trained multilingual Transformer language models, including mBERT and XLM-R Large.

Hyper-Parameters and Optimization
For each model, we use the same pre-processing as in the respective code released by the authors. For all models, we typically use a sequence length of 50 tokens. We use a learning rate of 5e−6 and a batch size of 32. We train each model for 20 epochs and identify the best epoch on a development set. We report performance on both development and test sets. We describe our baseline for each of the relevance and misinformation models in the respective sections below. We now introduce each of these two model groups.
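A minimal fine-tuning sketch with these hyper-parameters, using Hugging Face transformers, is given below. The train_ds/dev_ds objects are assumed datasets with "text" and "label" columns, argument names may differ slightly across transformers versions, and this is not the authors' released training code.

```python
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

model_name = "xlm-roberta-large"   # one of the multilingual models we fine-tune
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

def tokenize(batch):
    # sequence length of 50 tokens, as described above
    return tokenizer(batch["text"], truncation=True, max_length=50, padding="max_length")

args = TrainingArguments(
    output_dir="mega_cov_model",
    learning_rate=5e-6,
    per_device_train_batch_size=32,
    num_train_epochs=20,
    evaluation_strategy="epoch",     # evaluate on DEV after every epoch
    save_strategy="epoch",
    load_best_model_at_end=True,     # keep the best epoch as identified on DEV
)

trainer = Trainer(model=model, args=args,
                  train_dataset=train_ds.map(tokenize, batched=True),
                  eval_dataset=dev_ds.map(tokenize, batched=True))
trainer.train()
```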

COVID-Relevance Models
Data. Our COVID-relevance models predict whether a tweet is related to COVID-19 or not (i.e., not related). To train the models, we sample ∼2.3M multilingual tweets (65 languages) from the hashtag-based collection of Chen et al. (2020) as our COVID-related class, paired with pre-pandemic tweets from Mega-COV as the negative class (see Appendix E.1), and split the combined data into 80% TRAIN (n=3,146,334), 10% DEV (n=393,567), and 10% TEST (n=392,918). We then remove all hashtags that were used by Chen et al. (2020) for collecting the data and fine-tune each of the 3 language models on TRAIN.
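The distant-supervision setup can be sketched as follows; COLLECTION_HASHTAGS is a placeholder for the actual seed list used by Chen et al. (2020), and the two helper functions are illustrative.

```python
COLLECTION_HASHTAGS = {"#covid19", "#coronavirus"}   # placeholder seed hashtags

def strip_seed_hashtags(text):
    """Remove the collection hashtags so the model cannot simply memorize them."""
    return " ".join(t for t in text.split() if t.lower() not in COLLECTION_HASHTAGS)

def build_example(tweet_text, from_covid_collection):
    # 1 = related to COVID-19 (hashtag-collected), 0 = pre-pandemic Mega-COV tweet
    label = 1 if from_covid_collection else 0
    return {"text": strip_seed_hashtags(tweet_text), "label": label}
```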
Results. As shown in Table 5, XLM-R Large acquires the best results, with 97.95% accuracy and 97.93 macro F1 on TEST. These results are significantly better than a majority-class baseline (based on TRAIN) and another arbitrarily chosen (yet quite competitive) baseline model that chooses the related class (the majority class in TRAIN) 75% of the time.
Model Generalization. Our COVID-relevance models are trained with distant supervision (hashtags as surrogate labels). It is conceivable that content related to COVID-19 would still occur in the real world without accompanying hashtags. To test the extent to which our best model performs on external data, we evaluate it on two external Twitter datasets, CoAID (Cui and Lee, 2020) and ReCOVery (Zhou et al., 2020), both of which are claimed by their authors to be completely (100%) related to COVID-19. As Table 6 shows, we do observe a drop in model performance as compared to our best model on our own TEST set in Table 5 (accuracy drops on average by 15.5% and F1 by 7.6%). However, the best model is still highly effective: it acquires an average accuracy of 82.46% and F1 of 90.38% on the CoAID and ReCOVery datasets. Each of the two datasets is also labeled for fake news (true vs. fake) focused on COVID-19, but our focus here is exclusively on using the two datasets as gold-labeled TEST sets for evaluating our COVID-relevance model; we use them again for misinformation modeling in Section 4.4. We now introduce our misinformation models.

COVID-Misinformation Models
To train models for detecting the veracity of news related to COVID-19, we exploit two recent and publicly available fake news datasets (in English): CoAID (Cui and Lee, 2020) and ReCOVery (Zhou et al., 2020). We now describe each of these datasets.
CoAID. Cui and Lee (2020) present the COVID-19 heAlthcare mIsinformation Dataset (CoAID), with diverse COVID-19 healthcare misinformation, including fake news on websites and social platforms, along with related user engagements (i.e., tweets and replies) about such news. CoAID includes 3,235 news articles and claims, 294,692 user engagements, and 851 social platform posts about COVID-19. The dataset was collected from December 1, 2019 to July 1, 2020. Table 7 shows the class distribution of news articles and tweets in CoAID.
More information about CoAID is in Appendix F.
ReCOVery. Zhou et al. (2020) choose 60 news publishers with 'extreme' levels of credibility (i.e., true vs. fake classes) from an original list of ∼2,000 to collect a total of 2,029 news articles on COVID-19, published between January and May 2020. They also collect 140,820 tweets related to the news articles, considering tweets related to true articles to be true and vice versa. Table 7 shows the class distribution of news articles and tweets in ReCOVery.
Splits and Cleaning. Table 8 shows the distribution of tweets in CoAID and ReCOVery before and after the de-duplication process. As Table 8 shows, de-duplication significantly reduces the sizes of the DEV and TEST sets in the two resources. The distribution of news articles is shown in Table F.1 (Appendix F).
Training. We use both CoAID and ReCOVery after de-duplication for training neural models to detect fake news related to COVID-19. Using the same hyper-parameters and training setup as the COVID-relevance models, we fine-tune the pre-trained language models on the Twitter dataset and the news dataset independently. (Even though we could have used the monolingual versions of the Transformer-based language models, i.e., BERT and RoBERTa, we stick to the multilingual versions for consistency.) Since Mega-COV is a social media dataset, we focus on training Twitter models here and provide the news models in Appendix F. For the Twitter models, we develop one model on CoAID, another on ReCOVery, and a third on CoAID+ReCOVery (concatenated). Again, for each of these 3 datasets, we fine-tune on TRAIN and identify the best model on DEV. We then report the best model on both DEV and TEST.
Results. Since our focus is on detecting fake texts, we show results on the positive class only in Table 9. We report results in terms of precision, recall, and F1. Our baseline is a small LSTM with 2 hidden layers, each of which has 50 nodes. We add a dropout of 0.2 after the first layer and arbitrarily train the LSTM for 3 epochs. As Table 9 shows, our best results for fake tweet detection on TEST are 90% F1 for CoAID (mBERT/XLM-R Large), 68% for ReCOVery (mBERT), and 92% for the two combined. All results are above the LSTM baseline. We show results of the COVID-misinformation news models in Table F.2 (Appendix F).
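A sketch of this LSTM baseline (two 50-unit layers, dropout 0.2 after the first, 3 training epochs) is shown below; the vocabulary size and embedding dimension are assumed values not stated in the paper.

```python
import tensorflow as tf
from tensorflow.keras import layers

VOCAB_SIZE, EMBED_DIM = 50_000, 128   # assumed values

model = tf.keras.Sequential([
    layers.Embedding(VOCAB_SIZE, EMBED_DIM),
    layers.LSTM(50, return_sequences=True),   # first hidden layer (50 nodes)
    layers.Dropout(0.2),                      # dropout after the first layer
    layers.LSTM(50),                          # second hidden layer (50 nodes)
    layers.Dense(1, activation="sigmoid"),    # fake vs. true
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(X_train, y_train, validation_data=(X_dev, y_dev), epochs=3, batch_size=32)
```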

Applications on Mega-COV
Now that we have developed two highly effective models, one for COVID-relevance and another for COVID-misinformation, we can employ them to make discoveries using Mega-COV. Since our misinformation models are focused only on English (the external gold data we used for training being English only), we restrict this analysis to the English language, though we emphasize the multilingual capacity of our COVID-relevance model. We were curious whether model predictions would have different distributions over the different types of Twitter posts (i.e., tweets, retweets, and replies). Hence, to enable such comparisons, we extract a random sample of 10M posts of each of these post types (for a total of 30M) from the year 2020 in Mega-COV.
We then apply the XLM-R Large relevance and misinformation models to the extracted samples. Table 10 shows the distribution of predicted labels from each of the two models across the 3 posting types (tweets, retweets, and replies). Strikingly, as the top half of the table shows, while only 7.77% of tweets are predicted as COVID-related, almost all retweets (99.84%) are predicted as related. This shows that users' retweets were focused almost exclusively on COVID-19. The table (bottom half) also shows that retweets are the highest carriers of content predicted as fake (3.67%), followed by tweets (2.3%). From the table, we can also deduce that only 2.45% of all English-language Twitter content (averaged across the 3 posting types) is predicted as fake. Given the global use of English, and the large volume of English posts Twitter receives daily, this percentage of fake content is still problematically high.

Table 10: Distribution of predicted labels from our COVID-relevance and COVID-misinformation models on 30M randomly selected English samples from Mega-COV data.
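The large-scale tagging step can be sketched as follows, assuming the released relevance checkpoint is loaded from a placeholder path and the 10M-post samples are available as lists of strings.

```python
from collections import Counter
from transformers import pipeline

relevance = pipeline("text-classification", model="path/to/relevance-xlmr-large")

def label_distribution(texts):
    """Fraction of texts assigned to each label by the relevance classifier."""
    preds = relevance(texts, truncation=True, max_length=50, batch_size=256)
    counts = Counter(p["label"] for p in preds)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

# e.g., label_distribution(sampled_retweets) for each of the three posting types
```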

Annotation Study
We perform a human annotation study on a small sample of 150 random posts from those the model predicted as both COVID-related and fake. Two annotators labeled the 150 samples for two types of tags: relevance and veracity. For relevance, all 150 posts were found relevant by the two annotators (perfect agreement). For veracity, since the veracity of some posts can be very challenging to identify, we asked annotators to assign one of 3 tags in the set {true, fake, unknown}. We did not ask annotators to consult any outside sources (e.g., Wikipedia or independent fact-checking sites) to identify the veracity of the samples. Inter-annotator agreement is at Kappa (K)=77.81%, indicating substantial agreement. On average, annotators assigned the fake class 39.39% of the time, the true class 3.02%, and the unknown class 57.05%. While these findings show that it is hard for humans to identify data veracity without resorting to external sources, they also demonstrate the utility of the model in detecting actual fake stories in the wild. We provide a number of samples from the posts that were automatically tagged as COVID-related and either true or false by our misinformation/veracity model in Table 11.
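For reference, the agreement figure above corresponds to Cohen's kappa, which can be computed as follows; the two label lists here are toy placeholders, not the actual annotations.

```python
from sklearn.metrics import cohen_kappa_score

annotator_1 = ["fake", "unknown", "fake", "true", "unknown"]
annotator_2 = ["fake", "unknown", "unknown", "true", "unknown"]
print(f"Cohen's kappa: {cohen_kappa_score(annotator_1, annotator_2):.2f}")
# The study reports K = 0.7781 over the 150 annotated posts.
```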

Data Release and Ethics
Data Distribution. The size of the data makes it an attractive object of study. Collection and exploration of the data required significant computing infrastructure and use of powerful data streaming and processing tools. To facilitate use of the dataset, we organize the tweet IDs we release by time (month and year) and language. This should enable interested researchers to work with the exact parts of the data related to their research questions even if they do not have large computing infrastructure.
Ethical Considerations. We collect Mega-COV from the public domain (Twitter). In compliance with Twitter policy, we do not publish hydrated tweet content. Rather, we only publish publicly available tweet IDs. All Twitter policies, including respect and protection of user privacy, apply. We decided not to assign geographic region tags to the tweet IDs we distribute, but these already exist in the JSON objects retrievable from Twitter. Still, location information should be used with caution. Twitter does not allow deriving or inferring, or storing derived or inferred, potentially sensitive characteristics about users. Sensitive user attributes identified by Twitter include health (e.g., pregnancy), negative financial status or condition, political affiliation or beliefs, religious or philosophical affiliation or beliefs, sex life or sexual orientation, trade union membership, and alleged or actual commission of a crime. If they decide to use Mega-COV, we expect researchers to review Twitter policy and applicable laws, including the European Union's General Data Protection Regulation (GDPR), beforehand. We encourage use of Mega-COV for social good, including applications that can improve health and well-being and enhance online safety.

Table 11 (sample posts automatically tagged as COVID-related, with predicted veracity):
"Is COVID-19 airborne contagious? New study shows that coronavirus may be caught from the air *3 hours* after it has been exposed." (True)
"A close relative of SARS-CoV-2 found in bats offers more evidence it evolved naturally URL" (True)
"Antiviral remdesivir prevents disease progression in monkeys with COVID-19 - National Institutes of Health (NIH) URL" (True)
"COVID Surges Among Young Adults URL" (True)

Related Work
Prior work on COVID-19 and Twitter includes Osho et al. (2020), Pierri et al. (2020), and Koubaa (2020), as well as studies estimating the rate of misinformation in COVID-19 associated tweets (Kouzy et al., 2020), the use of bots (Ferrara, 2020), predicting whether a user is COVID-19 positive or negative (Karisani and Karisani, 2020), and the quality of shared links (Singh et al., 2020).

Conclusion
We presented Mega-COV, a billion-scale dataset of 104 languages for studying the COVID-19 pandemic. In addition to being large and highly multilingual, our dataset comprises data pre-dating the pandemic. This allows for comparative and longitudinal investigations. We provided a global description of Mega-COV in terms of its geographic and temporal coverage, gave an overview of its linguistic diversity, and provided an analysis of its content based on hashtags and top domains. We also provided a case study of how the data can be used to track global human mobility. The scale of Mega-COV has also allowed us to make a number of striking discoveries, including (1) the shift toward retweeting and replying to other users rather than tweeting in 2020 and (2) the role of international news sites as key sources of information during the pandemic. In addition, we developed effective models for detecting COVID relevance and COVID misinformation and applied them to a large sample of our dataset. Our dataset and models are publicly available.

B Hashtag Content Analysis
Hashtags usually correlate with the topics users post about. We provide the top 30 hashtags in the data; beyond pandemic-related hashtags, these include hashtags about TV shows (e.g., Big Boss), doctors, and even fake news.
An interesting observation from the Chinese language word cloud is the use of hashtags such as ChinaPneumonia and WuhanPneumonia to refer to the pandemic. We did not observe these same hashtags in any of the other languages. Additionally, for some reason, Apple seems to be trending during the first 4 months of 2020 in China owing to hashtags such as appledaily and appledailytw. Some languages, such as Romanian and Vietnamese, involve discussions of bitcoin and crypto-currency. This was also seen in the Chinese language word cloud, but not as prominently.

Figure B.1: Word clouds for hashtags in tweets from the top 10 languages in the data. We note that tweets in non-English languages can still carry English hashtags or employ Latin script.

C Domain Sharing Analysis
Domains in URLs shared by users also provide a window on what is share-worthy. We perform an analysis of the top 200 domains shared in each of 2019 and 2020. The major observation we reach is the surge in tweets involving news websites, and the rise in ranks for the majority of these websites compared to 2019. Table C.1 shows the top 40 news domains in the 2020 data and their change in rank compared to 2019. Such heavy sharing of news domains reflects users' needs: intuitively, at times of global disruption, people need more frequent updates on ongoing events. Of particular importance, especially relative to other ongoing political polarization in the U.S., is the striking rise of the conservative news network Fox News, which has moved from a rank of 118 in 2019 to 67 in 2020, a jump of 51 positions. We also note the rank of some news sites (e.g., The Globe and Mail and The Star) going down. This is perhaps due to people resorting to international (and more diverse) sources of information to remain informed about countries other than their own. Other domains: Other noteworthy domain activities include those related to gaming, video and music, and social media tools. Ranks of these domains have not necessarily shifted higher than in 2019, but they remain prominent, showing that these themes are still relevant in 2020. In spite of the economic impact of the pandemic, shopping domains such as etsy.me and poshmark.com have markedly risen in rank as people moved to shopping online in more significant ways. We now introduce a case study of how our data can be used for mobility tracking.

D Case Study: Mapping Human Mobility with Mega-COV
Geolocation information in Mega-COV can be used to characterize and track human mobility in various ways. We investigate some of these next.
Inter-Region Mobility. Mega-COV can be exploited to generate responsive maps where end users can check mobility patterns between different regions over time. In particular, geolocation information can show mobility patterns between regions.
Intra-Region Mobility. We also use information in Mega-COV to map each user to a single home region (i.e., city, state/province, and country). We follow the geolocation literature (Roller et al., 2012; Graham et al., 2014; Han et al., 2016; Do et al., 2018) in setting a condition that a user must have posted at least 10 tweets from a given region. However, we also require that at least 60% of all the user's tweets have been posted from the same region; a sketch of this rule appears below. We use the resulting set of users whose home location we can verify to map user weekly mobility within their own city, state, and country, using Canada and the U.S. as illustrative examples. We provide the related visualization in the supplementary material under "User Weekly Intra-Region Mobility". Here, due to increased posting in 2020, we normalize the number of visits between states by the total number of all tweets posted during a given month.
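A minimal sketch of the home-region rule (at least 10 tweets from the region and at least 60% of the user's tweets posted there) is given below, assuming a pandas DataFrame with one row per geo-located tweet and 'user_id' and 'region' columns; the helper name is illustrative.

```python
import pandas as pd

def assign_home_regions(df, min_tweets=10, min_share=0.6):
    """Map each user to a home region only when both conditions are satisfied."""
    homes = {}
    for user_id, group in df.groupby("user_id"):
        counts = group["region"].value_counts()
        top_region, top_count = counts.index[0], counts.iloc[0]
        if top_count >= min_tweets and top_count / len(group) >= min_share:
            homes[user_id] = top_region
    return pd.Series(homes, name="home_region")
```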

D.1 User Weekly Intra-Region Mobility
We can also visualize user mobility as a distance from an average mobility score on a weekly basis. Namely, we calculate an average weekly mobility score for the year 2019 using geo-tag information (longitude and latitude) and use it as a baseline against which we plot user mobility for each week of 2019 and 2020 up until April. In general, we observe a drop in user mobility in Canada starting from mid-March. For U.S. users, we notice a very high mobility surge starting around the end of February and early March, only waning in the last week of March, with the decline continuing into April, as shown in Figure D.8. For both the U.S. and Canada, we hypothesize that the surge in early March (much more noticeable in the U.S.) is a result of people moving back to their hometowns, returning from travel, moving to stock up on basic needs, etc.
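One way to operationalize such a weekly mobility score is sketched below: per-user weekly spread around the weekly centroid of their geo-tags, averaged over users and compared to a 2019 baseline. This is an illustrative reconstruction under assumptions, not the authors' exact formula.

```python
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = np.sin((lat2 - lat1) / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))

def weekly_mobility(df):
    """df: one row per geo-tagged tweet, with user_id, week, lat, lon columns."""
    centroids = df.groupby(["user_id", "week"])[["lat", "lon"]].transform("mean")
    dist = haversine_km(df["lat"], df["lon"], centroids["lat"], centroids["lon"])
    per_user_week = dist.groupby([df["user_id"], df["week"]]).mean()
    return per_user_week.groupby(level="week").mean()

# baseline = weekly_mobility(df_2019).mean()          # 2019 average weekly mobility
# deviation = weekly_mobility(df_2020) - baseline     # distance from the baseline
```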

E.1 Dataset
We randomly sample 200K tweets from the English data in Chen et al. (2020) and a maximum of 100K from each of the remaining languages. For languages with fewer than 100K tweets, we take all the data. For the negative class, we extract data from January to November 2019 from Mega-COV. For each language, we take roughly the same number of tweets as we sampled for the positive class. Table E.1 provides a breakdown of this dataset.
CoAID components include the following (see Appendix F):
• User Engagement. Queries based on the true and fake articles and claims were used to build a dataset of user engagement from Twitter, where the goal was to acquire the tweets discussing the news in question and related Twitter replies.