Collecting Verified COVID-19 Question Answer Pairs

We release a dataset of over 2,200 COVID-19-related frequently asked question-answer pairs scraped from over 40 trusted websites. We include an additional 24,000 questions pulled from online sources that have been aligned by experts with existing answered questions from our dataset. This paper describes our efforts in collecting the dataset and summarizes the resulting data. Our dataset is automatically updated daily and available at https://covid-19-infobot.org/data/. So far, this data has been used to develop a chatbot that provides users with information about COVID-19. We encourage others to build analytics and tools upon this dataset as well.


Introduction
With the quick spread of COVID-19, misinformation has spread rapidly as well. Misinformation about the use of certain drugs for the prevention of COVID-19 has had fatal outcomes, and stigmatization, guided by misinformation, of certain communities as vectors of the virus undermines the long-term welfare of our society. We are developing a natural language processing (NLP)-backed informational chatbot targeted at comprehensive COVID-19 information and misinformation. Users can interact with our chatbot on different platforms to access information about COVID-19, available care, and other topics of interest. To aid in this effort, we aggregate factual information in the form of verified questions and answers to help answer frequently asked questions about the pandemic. We employ three main aggregation efforts in tandem: 1) generating high-quality and accurate information from domain experts, i.e., public health researchers at Johns Hopkins University; 2) automatically scraping frequently asked questions and answers from trusted online sources, e.g., newspapers and government agencies; and 3) automatically ranking and manually aligning additional questions from social media with the scraped questions and answers in our dataset. This paper primarily describes our efforts to extract high-quality content from trustworthy websites and domain experts. Our effort has resulted in a publicly available dataset that currently contains over 2,100 questions and answers from more than 40 webpages. The dataset is available at https://covid-19-infobot.org/data/. Since we are actively scraping more websites and re-scrape all sites at least once a day, these numbers are updated daily.

Creating our FAQ Dataset
We create our publicly available dataset of over 2,100 question-answer pairs by aggregating FAQs from trusted news sources. We choose websites to scrape based on three broad criteria: 1) the informativeness and trustworthiness of the website; 2) the ease of scraping frequently asked question-answer pairs from the website; and 3) the number of questions and answers on the website.
We use a straightforward scraping process that enables undergraduate students to contribute to our efforts. We developed a Python library for students to easily add scrapers to our project. As demonstrated in the example in Figure 1, our library requires each question-answer pair (and its metadata) to be stored as a simple dictionary. The library automatically adds this information to our set of question-answer pairs. Additionally, the library handles updating answers to questions in our dataset if a previously scraped website updates its information. (The dataset statistics described in this paper are based on a snapshot of the data as of June 25th, 2020, corresponding to https://github.com/JHU-COVID-QA/scraping-qas/tree/a446c00c318e02cad5188cec359b9d649d8c4933.)
This has enabled students to join the project and contribute immediately. Further documentation is available at https://github.com/JHU-COVID-QA/scraping-qas and we encourage others to join our efforts.
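To illustrate the process described above, the sketch below shows what a contributed scraper might look like. The function name, field names, and example source are our own illustrative assumptions, not the actual API of the library; a real scraper would parse the pairs out of a page's HTML.

```python
from datetime import date

# Hypothetical scraper sketch; field names and the example source are
# illustrative assumptions, not the project's actual schema.

def scrape_example_faq():
    """Return one dictionary per question-answer pair, plus its metadata."""
    # In a real scraper these pairs would be parsed from the page's HTML.
    faqs = [
        ("Can my pet get COVID-19?",
         "A small number of animals have tested positive; consult a vet."),
    ]
    return [
        {
            "question": q,
            "answer": a,
            "sourceName": "Example Health Agency",   # name of the source
            "sourceUrl": "https://example.org/faq",  # page it was scraped from
            "lastScraped": date(2020, 6, 25).isoformat(),
            "region": "US",  # optional geographic scope of the answers
        }
        for q, a in faqs
    ]
```

The library would then merge such dictionaries into the shared dataset, replacing entries whose answers have changed since the last scrape.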

Metadata
For each scraped question-answer pair, we extract relevant metadata for our chatbot and other NLP analytics. The metadata includes information about the source of each question-answer pair (we include both the source name and the URL) and the date when the question-answer was last scraped from or updated on the website. Additionally, if the information on the website is targeted for a specific geographic area, we include that in our metadata as well.

Leveraging existing scrapers
We leverage existing scrapers for collecting question-answer pairs for COVID-19. 874 of our examples come from scrapers released by deepset. Following deepset's lead, we open-source our scrapers as well.

Continuous scraping
As our understanding of COVID-19 rapidly evolves, trustworthy sources update the information they release. Therefore, each day, we automatically re-run the web scrapers to find new information. This enables us to add new question-answer pairs or update answers to existing questions in our dataset.
If a previously scraped question-answer pair is removed from a website, we remove that example from our dataset. Questions and answers removed from our dataset are still available in our history, since we archive each day's dataset. In turn, the quality of our dataset is constantly evolving and improving.
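The daily update step can be sketched as a simple merge between yesterday's dataset and today's scrape. The function below is our own minimal illustration of that logic, not the library's actual implementation; keying pairs on the source URL and question text is an assumption.

```python
# Minimal sketch of the daily update step: add new pairs, refresh changed
# answers, and drop pairs that vanished from their website. The real
# library's logic and key choice are assumptions on our part.

def daily_update(previous, scraped_today):
    """Merge a fresh scrape into the dataset.

    Both arguments map (source_url, question) -> answer. Pairs no longer
    present on their website are dropped; new or changed answers overwrite
    old ones.
    """
    updated = dict(scraped_today)  # adds new pairs, refreshes changed answers
    removed = {k: v for k, v in previous.items() if k not in scraped_today}
    # `removed` would be written to a dated archive so history is preserved.
    return updated, removed
```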

Data
The described effort resulted in a dataset that is evolving daily. The June 15th version contains over 2,100 questions and answers scraped from 40 websites. We list the number of question-answer pairs extracted from each source in Table 1. Our dataset contains some examples in languages besides English, owing to deepset scraping websites in multiple languages. Figure 2 plots the number of question-answer pairs in each of the five languages: English, German, Polish, Italian, and Swedish. Roughly 70% of our examples are in English. As we release more data, we will include further analysis of the growing dataset.
Websites might update or change how they store information. This is why the current version of our dataset contains just 1 example from the Delaware State Government webpage, whereas the May 20th version of our dataset contained 22 examples from this website.

Manually Aligning Additional Questions and Answers
Since the internet contains many more questions that remain unanswered, we additionally collected such questions and aligned them with the question-answer pairs in our dataset. We leverage information retrieval techniques to match these unanswered questions with questions in our dataset and then rely on domain experts to verify each aligned question-question-answer (QQA) pair. In this section, we provide details for each of these steps.

Online Question Extraction
We downloaded 28 million tweets from the COVID-19 Twitter Dataset (Chen et al., 2020), Qorona, 7 and CovidFaq, 8 extracted the questions from those resources, 9 sorted them by frequency, and discarded the questions that occurred fewer than four times. Then, we grouped semantically similar questions into 9,200 clusters. Next, we extracted the centers of the clusters and, using a state-of-the-art sentence re-writer (Hu et al., 2019), we generated three high-quality paraphrases of each question. This resulted in a collection of over 27,000 unanswered questions about COVID-19.
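The extraction-and-filtering step above can be sketched in a few lines. This is a simplified illustration: the sentence splitter and the exact list of question-starter words we used are assumptions here, not the paper's actual configuration.

```python
import re
from collections import Counter

# Sketch of the question-extraction heuristic: keep sentences that end with
# a question mark or start with a question word, then drop rare questions.
# The starter list and the naive sentence splitter are assumptions.
QUESTION_STARTERS = ("who", "what", "when", "where", "why", "how", "can", "is")

def extract_questions(tweets, min_count=4):
    counts = Counter()
    for tweet in tweets:
        # naive sentence split after ., !, or ? (delimiter stays attached)
        for sent in re.split(r"(?<=[.!?])\s+", tweet.strip()):
            s = sent.strip()
            if not s:
                continue
            if s.endswith("?") or s.lower().split()[0] in QUESTION_STARTERS:
                counts[s] += 1
    # keep questions occurring at least `min_count` times, sorted by frequency
    return [q for q, c in counts.most_common() if c >= min_count]
```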

Aligning Extracted Questions with Existing Questions and Answers
We worked with public health experts to align these unanswered questions with our verified question-answer pairs (section 3). For each of these 27,000 questions, we used a BM25 model (Robertson and Walker, 1994; Robertson et al., 1996) to determine the most similar answered questions in our dataset. Following the EASL annotation protocol (Sakaguchi and Van Durme, 2018), for each unanswered Twitter question, we presented public health experts with the five most similar QA pairs from our dataset. Based on a formal protocol developed by a senior public health researcher on our team (Figure 4), we asked the experts to rate, on a scale from 0 to 100, how relevant or similar each QA pair from our dataset is to the unanswered question.

7 https://github.com/allenai/Qorona
8 https://github.com/dialoguemd/covidfaq
9 Qorona and CovidFaq specifically contain questions. We extract questions from the Twitter dataset by determining whether a sentence from a tweet either ends with a question mark or starts with a word from a provided list (e.g., "who", "when", "where").
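The retrieval step can be illustrated with a small self-contained BM25 scorer. This is a sketch, not our production system: the tokenization (lowercased whitespace split) and the parameters k1 and b (set to common defaults) are assumptions.

```python
import math
from collections import Counter

# Minimal BM25 sketch of the retrieval step: score each answered question
# against an unanswered query and return the most similar ones. Tokenization
# and the k1/b defaults are assumptions, not the system's actual settings.

def bm25_rank(query, documents, k1=1.5, b=0.75, top_n=5):
    """Return indices of the `top_n` documents most similar to `query`."""
    docs = [d.lower().split() for d in documents]
    avgdl = sum(len(d) for d in docs) / len(docs)
    n_docs = len(docs)
    df = Counter()  # document frequency of each term
    for d in docs:
        df.update(set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            score += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(score)
    return sorted(range(n_docs), key=lambda i: scores[i], reverse=True)[:top_n]
```

For each unanswered question, the five top-ranked answered questions would then be shown to the expert annotators.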
For this annotation effort, we leveraged Turkle, an open-source, locally hosted clone of Amazon Mechanical Turk developed by the JHU Human Language Technology Center of Excellence. Figure 5 and Figure 6 illustrate our annotation interface.
As part of this protocol, expert annotators could indicate whether a question was not relevant to COVID-19 or whether an existing answer was no longer correct. We removed such labeled examples from our set. This effort resulted in 24,240 annotated QQAs. Figure 3 plots the distribution of labels annotated for QQAs. Over 18,000 examples were judged to be less than 1% relevant, indicating that the majority of the questions extracted from Twitter are irrelevant to the answered questions in our dataset. These additional examples can be used to further train a chatbot to answer questions about COVID-19.

Conclusion
We have presented our growing dataset of over 2,100 question-answer pairs created by scraping over 40 websites. We also discussed other data we collected and annotated that may be beneficial to others in the community. Our evolving dataset is complementary to other recent COVID-19 QA datasets, e.g. Tang