Weibo-COV: A Large-Scale COVID-19 Social Media Dataset from Weibo

With the rapid spread of COVID-19, people are asked to maintain "social distance" and "stay at home". As a result, more and more social interaction moves online, especially to social media platforms such as Twitter and Weibo. People post tweets to share information, express opinions and seek help during the pandemic, and these tweets are valuable for studies against COVID-19, such as early warning and outbreak detection. Therefore, in this paper, we release a large-scale COVID-19 social media dataset from Weibo called Weibo-COV, covering more than 30 million tweets from 1 November 2019 to 30 April 2020. Moreover, the fields of the dataset are rich, including basic tweet information, interaction information, location information and the retweet network. We hope this dataset can promote studies of COVID-19 from multiple perspectives and enable better and faster research to suppress the spread of this disease.


Introduction
At the time of writing, COVID-19, an infectious disease caused by a coronavirus discovered in December 2019 and known as Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2), has infected 4,517,399 individuals globally with a death toll of 308,515 (Doctor, 2020). Under these circumstances, physical contact and communication with people outside the household are considerably limited, and people depend mainly on digital devices such as mobile phones and computers (Abdul-Mageed et al., 2020). In this scenario, people stay at home and spend more time communicating on social media. Social media plays an important role in sharing information, expressing opinions and seeking help (Lopez et al., 2020), which makes platforms such as Weibo, Twitter, Facebook and YouTube even more crucial sources of information during the pandemic.
In previous studies, social media has been considered a valuable data source for research against disease, for example for uncovering the dynamics of an emerging outbreak (Zhang and Centola, 2019), predicting flu activity and conducting disease surveillance (Jeremy et al., 2009). Some studies improve influenza surveillance through early warning and outbreak detection (Kostkova et al., 2014; De Quincey and Kostkova, 2009), forecast influenza activity (Santillana et al., 2015) and predict the actual number of infected cases (Lampos and Cristianini, 2010; Szomszor et al., 2010). Therefore, it is necessary to make relevant social media datasets freely accessible in order to facilitate studies of COVID-19 and achieve better public outcomes.
In this paper, we release a large-scale COVID-19 social media dataset from Weibo, one of the most popular Chinese social media platforms. The dataset is named Weibo-COV and covers more than 30 million tweets from 1 November 2019 to 30 April 2020. Specifically, unlike the conventional API-based method of retrieving data, which limits large-scale access, we first build a high-quality Weibo active user pool of 20 million active users from over 250 million users, then collect all tweets posted by these active users during the time period and keep those related to COVID-19 using 179 selected keywords. Moreover, the fields of tweets in the dataset are rich, including basic tweet information, interaction information, location information and the retweet network. We hope this dataset can promote studies of COVID-19 from multiple perspectives and enable better and faster research to suppress the spread of this disease.
arXiv:2005.09174v4 [cs.SI] 28 May 2020
At present, given specified keywords and a specified period, there are two methods for constructing Weibo public opinion datasets: (1) using the advanced search API provided by Weibo; (2) traversing all Weibo users, collecting all their tweets in the specified period, and then filtering the tweets by the specified keywords.
However, for method (1), due to the limitations of the Weibo search API, a single search returns at most 1,000 tweets, which makes it difficult to build large-scale datasets. As for method (2), although it can build large-scale datasets with almost no omissions, traversing every Weibo user requires a very long time and a large amount of bandwidth. In addition, a large number of Weibo users are inactive, and traversing their homepages is wasteful because they may not have posted any tweets in the specified period.
To alleviate these limitations, we propose a novel method for constructing Weibo public opinion datasets that can build large-scale datasets efficiently. Specifically, we first build and dynamically maintain a high-quality Weibo active user pool (a small fraction of all users), and then traverse only these users and collect all of their tweets that contain the specified keywords in the specified period.
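The strategy above can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' crawler: the `Tweet` record, its field names and the `fetch_tweets` callable (a stand-in for whatever component downloads one user's timeline) are hypothetical, and plain substring matching is assumed for the keyword test.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Callable, Iterable, Iterator, Sequence


@dataclass(frozen=True)
class Tweet:
    user_id: str          # hypothetical field name
    created_at: datetime  # hypothetical field name
    content: str


def collect_keyword_tweets(
    active_users: Iterable[str],
    fetch_tweets: Callable[[str], Iterable[Tweet]],  # hypothetical timeline fetcher
    keywords: Sequence[str],
    start: datetime,
    end: datetime,
) -> Iterator[Tweet]:
    """Yield tweets of pooled users that fall in [start, end] and contain a keyword."""
    for user_id in active_users:              # traverse only the active user pool
        for tweet in fetch_tweets(user_id):   # all tweets of that user
            if not (start <= tweet.created_at <= end):
                continue                      # outside the specified period
            if any(keyword in tweet.content for keyword in keywords):
                yield tweet                   # assumed rule: substring match
```

Because the function is a generator, tweets can be written out as they are found rather than held in memory, which matters when the traversal covers millions of users.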

Weibo Active User Pool
As shown in Figure 1, starting from initial seed users and continuously expanding through social relationships, we first collect more than 250 million Weibo users. We then define a Weibo active user as one who meets the following two criteria: (1) the numbers of followees (accounts followed), fans (followers) and tweets all exceed 50; (2) the latest tweet was posted within the last 30 days. On this basis, we build and dynamically maintain a Weibo active user pool from all collected Weibo users. The resulting pool contains 20 million Weibo users, accounting for 8% of the collected users.

Table 1 (excerpt): description of tweet fields.
content: the content of the tweet
origin weibo: the id of the origin tweet; non-empty only when the tweet is a retweet
geo info: latitude and longitude; non-empty only when the tweet contains location information
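The two activity criteria can be written as a small predicate. The sketch below is a reading of the definition rather than the authors' code: the `UserProfile` type and its field names are hypothetical, the three counters are taken to be the per-account counts named in the text, and treating exactly 30 days of inactivity as still active is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class UserProfile:
    follow_count: int          # hypothetical name: accounts this user follows
    fan_count: int             # hypothetical name: accounts following this user
    tweet_count: int           # hypothetical name: tweets posted
    latest_tweet_at: datetime  # time of the most recent tweet


def is_active(user: UserProfile, now: datetime,
              min_count: int = 50,
              max_idle: timedelta = timedelta(days=30)) -> bool:
    """Criterion (1): all three counts exceed 50. Criterion (2): tweeted within 30 days."""
    counts_ok = min(user.follow_count, user.fan_count, user.tweet_count) > min_count
    recently_posted = now - user.latest_tweet_at <= max_idle  # inclusive bound is assumed
    return counts_ok and recently_posted


# Pool share reported in the text: 20 million active users out of 250 million collected.
POOL_SHARE = 20_000_000 / 250_000_000  # 0.08, i.e. 8%
```

Re-running the predicate periodically with a fresh `now` is one way to keep such a pool dynamically maintained, since users drop out once their latest tweet is older than the idle limit.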

COVID-19 Tweets Collection
According to the collection strategy described in section 2.1, we set the time period from 0:00 on December 1, 2019 (GMT+8, the date of the first diagnosis) to 23:59 on April 30, 2020 (GMT+8) and design a total of 179 COVID-19-related keywords. These keywords are comprehensive, covering related terms such as coronavirus and pneumonia, as well as specific locations (e.g., "Wuhan"), drugs (e.g., "remdesivir"), preventive measures (e.g., "mask"), experts and doctors (e.g., "Zhong Nanshan"), government policies (e.g., "postpone the reopening of school") and others (see Appendix 1 for the complete list). Specifically, from the pool of 20 million active Weibo users, we first collect a total of 569,829,866 tweets posted by these users in the specified period. We then filter these tweets by keyword and obtain 33,519,644 tweets, which constitute our final dataset.
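To make the collection window and the size of the keyword funnel concrete, the following sketch encodes the stated period as timezone-aware timestamps and computes the fraction of collected tweets that survive the keyword filter. The constant and function names are illustrative only, and treating both window bounds as inclusive is an assumption.

```python
from datetime import datetime, timedelta, timezone

GMT8 = timezone(timedelta(hours=8))  # the paper states times in GMT+8

# Collection window as given in the text (bounds treated as inclusive here).
WINDOW_START = datetime(2019, 12, 1, 0, 0, tzinfo=GMT8)
WINDOW_END = datetime(2020, 4, 30, 23, 59, tzinfo=GMT8)


def in_window(posted_at: datetime) -> bool:
    """True if a timezone-aware timestamp falls inside the collection window."""
    return WINDOW_START <= posted_at <= WINDOW_END


# Funnel sizes reported in the text.
COLLECTED_TWEETS = 569_829_866  # all tweets by pooled users in the window
KEPT_TWEETS = 33_519_644        # tweets remaining after keyword filtering

retained_fraction = KEPT_TWEETS / COLLECTED_TWEETS
print(f"retained after keyword filtering: {retained_fraction:.2%}")
```

Roughly one collected tweet in seventeen matches a keyword. Timestamps passed to `in_window` must carry a timezone; comparing a naive `datetime` with the aware bounds raises `TypeError`.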

Data Structure
As shown in Table 1, the fields of tweets in the dataset are rich, covering basic information (id, crawl time, content), interaction information (like num, repost num, comment num), location information (geo info) and the retweet network (origin weibo). Therefore, many kinds of studies related to infectious diseases can be conducted on this dataset, such as studies of the impact on people's daily life, the early characteristics of the disease and government anti-epidemic measures.
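One way to picture a single record is as a small typed structure. The sketch below is illustrative only: the underscore spellings, the Python types and the representation of geo info as a (latitude, longitude) pair are assumptions, since the text gives the field names and their meaning but not their storage format.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class TweetRecord:
    # Basic information
    id: str
    crawl_time: str        # storage format is not specified in the text
    content: str
    # Interaction information
    like_num: int
    repost_num: int
    comment_num: int
    # Location information: empty unless the tweet carries a location
    geo_info: Optional[Tuple[float, float]] = None  # assumed (latitude, longitude)
    # Retweet network: id of the origin tweet, empty unless this is a retweet
    origin_weibo: Optional[str] = None

    @property
    def is_retweet(self) -> bool:
        return bool(self.origin_weibo)

    @property
    def has_location(self) -> bool:
        return self.geo_info is not None
```

Following `origin_weibo` from each retweet back to its origin tweet is what allows the retweet network to be rebuilt as a directed graph.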

Basic Statistics
As shown in Table 2, Weibo-COV contains a total of 33,519,644 tweets, including 895,012 tweets with geographic location information and 6,586,969 original tweets. The number of unique users in the entire dataset is 8,876,036.
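Statistics of this kind can be recomputed from the records in a single pass. The sketch below is a hypothetical helper, not the authors' script: it assumes each record is a mapping with `geo_info`, `origin_weibo` and a `user_id` key (a user identifier is implied by the unique-user count but its field name is not given in the text), and that an original tweet is one whose `origin_weibo` is empty.

```python
from typing import Any, Dict, Iterable, Mapping


def basic_stats(tweets: Iterable[Mapping[str, Any]]) -> Dict[str, int]:
    """Count tweets, tweets with location, original tweets and unique users."""
    total = with_geo = original = 0
    users = set()
    for tweet in tweets:
        total += 1
        if tweet.get("geo_info"):          # non-empty only when a location is attached
            with_geo += 1
        if not tweet.get("origin_weibo"):  # empty origin id: assumed to mean original
            original += 1
        users.add(tweet["user_id"])        # hypothetical key name
    return {"tweets": total, "with_geo": with_geo,
            "original": original, "unique_users": len(users)}


# Shares implied by the totals reported in the text.
TOTAL = 33_519_644
geo_share = 895_012 / TOTAL          # about 2.7% of tweets carry a location
original_share = 6_586_969 / TOTAL   # about 19.7% are original tweets
```

The shares show that most of the dataset consists of retweets and that only a small minority of tweets are geotagged, which is worth keeping in mind when interpreting location-based analyses.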

Daily Distribution
The distribution of the number of tweets by day is shown in Figure 2. From December 1, 2019 to January 18, 2020, the number of COVID-19-related tweets is very small (fewer than 5,000 per day) and may include some noise. From January 19, 2020 onwards, the number of COVID-19-related tweets increases rapidly and stays at or above 200,000 per day.
Note that April 4, 2020 is particularly striking: the number of tweets on that day exceeds 1.6 million. That day was the Chinese Tomb Sweeping Festival, and a national mourning was held for the compatriots who died in the epidemic, so people posted or reposted a large number of mourning tweets on Weibo.
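A daily distribution like the one described can be obtained by bucketing post times into calendar days. The sketch below is a minimal illustration, assuming timezone-aware post timestamps are available (the name of the post-time field is not given in the text) and that days are counted in GMT+8, the timezone the paper uses for its collection window.

```python
from collections import Counter
from datetime import date, datetime, timedelta, timezone
from typing import Dict, Iterable

GMT8 = timezone(timedelta(hours=8))


def tweets_per_day(post_times: Iterable[datetime]) -> Dict[date, int]:
    """Count tweets per calendar day in GMT+8, returned in chronological order."""
    counts = Counter(ts.astimezone(GMT8).date() for ts in post_times)
    return dict(sorted(counts.items()))
```

Converting to GMT+8 before taking the date matters near midnight: a tweet stored as 17:00 UTC on April 3 belongs to April 4 in China. Naive timestamps should be given a timezone first, because `astimezone` interprets them in the machine's local time.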

GEO Distribution
As shown in Figure 3, we plot the location distribution of tweets with location information on April [...]. The tweets are spread all over the world, including major countries in Asia, Europe, Australia and America. Although Weibo is a Chinese social media platform, with the development of economic globalization more and more Chinese people go abroad and more and more foreigners start to use Weibo.
Therefore, our dataset supports studying the impact of the disease on the whole world, not only on China.

Word Cloud
As shown in Figure 4, we select four days of tweet data at different stages of the epidemic and draw word clouds. In the early days, people did not know the characteristics of the virus and the government began to take preliminary actions (e.g., "unexplained pneumonia" and "health committee" on 2019-12-31). People then learned that the virus is a new coronavirus and looked into prevention methods and medicines (e.g., "new coronavirus", "N95 mask" and "ShuangHuangLian" on 2020-01-31). Later, the control of COVID-19 became a problem that the whole world needed to face, and governments took strict prevention measures (e.g., "isolated at home" and "American COVID-19" on 2020-03-31). At present, the virus has not been effectively controlled globally and has had many impacts on people's lives (e.g., "Cirque du Soleil in Canada" on 2020-04-30).
Therefore, our dataset spans the whole course of COVID-19 so far and reflects the impact of the disease on many aspects of society.

Related Work
Several works have focused on creating social media datasets to enable COVID-19 research. Chen et al. (2020), Lopez et al. (2020) and Abdul-Mageed et al. (2020) have already released datasets collected from Twitter. However, these datasets are mainly in English; Chinese tweets are also valuable and can provide a useful supplement for research.
Only one dataset, proposed by Gao et al. (2020), includes tweets from Weibo, but their method is based on the Weibo advanced search API, so it cannot collect tweets from Weibo at large scale. Compared with our dataset, theirs is much smaller in overall size (fewer than 200K tweets), time span (from January 20, 2020 to March 24, 2020) and number of keywords (only 4).

Conclusion
In this paper, we release Weibo-COV, the first large-scale COVID-19 tweet dataset from Weibo. The dataset contains over 30 million tweets from 1 November 2019 to 30 April 2020, and each tweet comes with rich field information. We hope this dataset can promote and facilitate related studies on COVID-19.