ForecastQA: A Question Answering Challenge for Event Forecasting with Temporal Text Data

Event forecasting is a challenging yet important task, as humans constantly seek to plan for the future. Existing automated forecasting studies rely mostly on structured data, such as time series or event-based knowledge graphs, to help predict future events. In this work, we aim to formulate a task, construct a dataset, and provide benchmarks for developing methods for event forecasting with large volumes of unstructured text data. To simulate the forecasting scenario on temporal news documents, we formulate the problem as a restricted-domain, multiple-choice, question-answering (QA) task. Unlike existing QA tasks, our task limits accessible information, so a model has to make a forecasting judgement. To showcase the usefulness of this task formulation, we introduce ForecastQA, a question-answering dataset consisting of 10,392 event forecasting questions, which have been collected and verified via crowdsourcing efforts. We present our experiments on ForecastQA using BERT-based models and find that our best model achieves 61.0% accuracy on the dataset, which still lags behind human performance by about 19%. We hope ForecastQA will support future research efforts in bridging this gap.


Introduction
Forecasting globally significant events, such as outcomes of policy decisions, civil unrest, or the economic ramifications of global pandemics, is a consequential but arduous problem. In recent years there have been significant advances in applying machine learning (e.g., time-series prediction methods) to generate forecasts for various types of events, including conflict zones (Schutte, 2017), duration of insurgency (Pilster and Böhmelt, 2014), civil unrest (Ramakrishnan et al., 2014a), and terrorist events (Raghavan et al., 2013).

[Figure 1: An example ForecastQA question. Q: Will primary schools in Europe admit non-vaccinated children around September 2019? Supporting "past" articles: (3/8/18) Public officials and health experts had given several warnings: do not allow a student in school if they had not been vaccinated against measles. (6/27/19) Fines for parents refusing measles jab. Parents will be fined up to €2,500 if they don't vaccinate their children against measles under draft legislation in Germany, which also threatens exclusion from crèches, nurseries and schools.]

Current automated forecasting methods perform well on problems for which there are sufficient structured data (e.g., knowledge graphs), but are not well suited for events for which such data may not exist. Humans, though, can often accurately forecast outcomes by leveraging their judgement, domain knowledge, and prior experience (Tetlock and Gardner, 2016), along with the vast amounts of unstructured text data available to us (e.g., news articles). We are able to identify and retrieve salient facts from the near-endless pool of unstructured information, synthesize those facts into coherent beliefs, and generate probabilistic forecasts. Unfortunately, this process does not scale well in terms of the amount of information that must be processed and the number of events one has to forecast.
Here we address the above problem by formalizing a forecasting task, creating a dataset, and providing benchmarks to develop methods for the task. Specifically, we formulate the forecasting problem as a multiple-choice Question Answering (QA) task, where the input is a news corpus, questions, choices and timestamps associated with each question, and the output is one of the given choices per question. Our approach is rooted in the observation that both forecasting and QA follow a similar process: digesting massive amounts of textual data, identifying supporting pieces of evidence from text, and chaining different pieces to generate answers/forecasts.
Forecast Question Answering (FORECASTQA) introduces a novel timestamp constraint per question that prohibits the model from accessing new articles published after the timestamp. By doing so, FORECASTQA simulates a forecasting scenario; each question's timestamp is chosen to ensure that the question is about the outcome of a future event.
To illustrate this, consider the question, "Will primary schools in Europe admit non-vaccinated children around September 2019?" in Figure 1, and the fact that models only have access to articles published before "2019-09-01." With the addition of this timestamp constraint, our query becomes a question about a future event in "September 2019" based on articles from the "past"; the model is now being tested for its forecasting ability, i.e., the ability to predict the outcome of future events based on unstructured text describing past events, without access to an extracted sequence of historical event triples or a fixed set of possible relations between events, as is the case with human forecasters. To answer the question, the model must find pertinent events in "past" information, resolve the temporal and causal relations between them, and finally make a forecasting judgement based on its interpretation of past information. Our task differs from that of other works that require an understanding of temporal relationships (Ning et al., 2020) or temporal commonsense reasoning (Zhou et al., 2019), as our task forces a model to make a forecasting judgement.

In support of the proposed FORECASTQA formulation, we construct a dataset of 10,392 yes-no and multiple-choice questions. This data is collected via crowdsourcing based on news articles: workers are shown articles and asked to come up with yes-no and multiple-choice questions. We also crowdsourced appropriate timestamps for each question. Finally, we design a method based on pre-trained language models to deal with retrieved articles for our task. In our experiments, the methods using retrieved articles slightly outperform closed-book models, suggesting that our task is still challenging in that finding relevant information for forecasting and making a judgement are not straightforward. Our best attempt achieves 61.0% accuracy on our dataset, a significant gap from human performance of 19.3%.

[Figure 2: An example multiple-choice question and its reasoning process. Q: Who will drop Japan as a trading partner in August 2019? Choices: South Korea (answer), South Africa, Syria, Portugal. Article: "Why Japan and South Korea just can't get along." (1/1/19) Apart from the fact of being one another's closest neighbours, the people of South Korea and Japan have a remarkable amount in common. Economically, they are among one another's biggest trading partners. And yet, time and again, relations between Seoul and Tokyo are marked, not by mutual support and co-operation but by anger, reproach and exasperation. Reasoning process: Seoul is in South Korea, Tokyo is in Japan (commonsense: world knowledge). Seoul and Tokyo are big trading partners (language understanding: lexical variations). The relations between Seoul and Tokyo are marked by anger, reproach and exasperation, and these relations might cause trading relations to cease (forecasting skills: causal relations; we can infer the answer from this part).]

Related Work
Event Forecasting. Several types of approaches exist for event forecasting. One approach learns from highly structured event-coded data such as ICEWS (Boschee et al., 2015) and GDELT (Leetaru and Schrodt, 2013). When these datasets are used for forecasting, they are often represented as a time series (Morstatter et al., 2019; Ramakrishnan et al., 2014b), in which each data point is associated with a timestamp. Another approach is script learning, in which a model is provided with a chain of events and a subsequent event and is asked to predict the relation between the chain and the "future" event (Hu et al., 2017; Li et al., 2018; Lv et al., 2019). These approaches require converting text data into event triples and translating the questions and answer choices into their format, which limits the expressiveness of natural text. Unlike these datasets and approaches, FORECASTQA does not provide any structured data to a model. The model must learn how to extract, keep track of, and link pertinent events from unstructured text to solve forecasting questions.
QA and Temporal Reasoning on Text. There are several approaches for QA using unstructured text. Extractive QA approaches rely on finding answer spans from the text that best answer a question (Rajpurkar et al., 2016, 2018; Yang et al., 2018; Kwiatkowski et al., 2019; Huang et al., 2019).
Multiple-Choice QA requires a model to pick the best answer from a set (Talmor et al., 2019;Sap et al., 2019;Zhou et al., 2019), and generative QA prompts the machine to produce its own answer (Khashabi et al., 2020). Our dataset is a type of multiple-choice QA, but it differentiates itself from other QA datasets (all formats) in that the required answer does not exist in the provided text, nor is sufficient evidence provided to be able to answer a question with 100% certainty; a forecast is required. We could convert our questions into alternative query formats such as a text-to-text format, but instead we stick to multiple-choice questions as humans often weigh the benefits of multiple choices when making a forecasting judgement.
QA datasets often exist to test certain types of reasoning. One pertinent example of a reasoning type that QA tasks test is the understanding of temporal and causal relations (Jia et al., 2018a,b; Sun et al., 2018; Ning et al., 2020). However, FORECASTQA requires more than just extraction and understanding of relations; a model must be able to extract and understand the relations present in the text with the goal of making a forecasting judgement about an event whose outcome is not found in the text. Another type of reasoning tested in QA tasks is commonsense reasoning (Talmor et al., 2019) and even temporal commonsense reasoning (Zhou et al., 2019). While questions in FORECASTQA often require commonsense to answer correctly, not all do; event outcomes do not always follow common sense. Furthermore, our questions test forecasting abilities, which often include various types of reasoning in addition to commonsense.

The FORECASTQA Task
FORECASTQA is a question answering task whose goal is to test a machine's forecasting ability. We consider forecasting as the process of anticipating the outcome of future events based on past and present data (Tetlock and Gardner, 2016). We focus on forecasting outcomes of news-based events from topics such as politics, sports, and economics. Training a machine to make forecasting decisions is inherently difficult, as the ground-truth label of an event outcome (e.g., whether an event will occur), so often required for model training, is only obtainable "in the future". To make progress toward our goal, we devise a way to simulate the forecasting scenario by introducing a novel time constraint, allowing us to validate machine predictions by obtaining the desired ground-truth labels. There is also the difficulty of ensuring the quality of question generation via crowdsourcing (necessary when building a dataset of scale), due to possible human errors in question formation (Tetlock et al., 2017). We have taken steps to ensure our questions cannot be answered with certainty using "past" data given the time constraint or commonsense knowledge, but are tractable to answer with an educated guess (see Sec. 4.1).

Task Definition. Formally, the input of the FORECASTQA task is a forecasting question Q with a corresponding ending timestamp t_Q, the last possible date on which Q remains a forecasting question. In addition, we have a set of possible choices, C, and a corpus of news articles, A; the output is a choice C ∈ C. Our task has a novel constraint that any retrieved article A ∈ A must satisfy t_A < t_Q. In other words, models have access only to articles that are published before t_Q. We have ensured that the information required to solve the question deterministically appears in an article, the gold article, published after t_Q, i.e., t_gold ≥ t_Q.
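The time constraint above amounts to a simple filter over the corpus. The following is a minimal, hypothetical sketch in Python; the article records and field names are our own illustration, not the paper's implementation:

```python
from datetime import date

def accessible_articles(corpus, t_q):
    """Return only articles published strictly before the question's
    ending timestamp t_Q, enforcing the constraint t_A < t_Q."""
    return [a for a in corpus if a["published"] < t_q]

# Toy corpus: each article carries its publication date.
corpus = [
    {"id": "a1", "published": date(2019, 6, 27)},
    {"id": "a2", "published": date(2019, 9, 15)},  # after t_Q: must stay hidden
]
t_q = date(2019, 9, 1)
visible = accessible_articles(corpus, t_q)  # only "a1" remains accessible
```

The gold article, published at or after t_Q, is excluded by construction, which is exactly what makes the question a forecasting question rather than a reading-comprehension one.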
Another way to think of our setup is that we are asking Q on the day before t_Q, knowing that the information required to solve Q is not available yet. This formulation makes our task both a constrained open-domain QA task and a forecasting problem, distinct from existing QA tasks.

[Figure 3: FORECASTQA generation process. The input of FORECASTQA creation is a news article corpus and the output is yes-no/multiple-choice questions.]

Challenges in FORECASTQA. Due to the constrained open-domain setting and forecasting properties, testing a model's forecasting ability encompasses the following challenges: information retrieval (IR) over limited sources, understanding of temporal and causal relations between events, and finally a forecasting judgement. Our time constraint limits the accessible articles and creates more challenges than in standard open-domain QA; effective IR methods are necessary to anticipate what knowledge will be useful for predictions from past information sources. Once useful articles have been retrieved, models should understand these articles and reason over pertinent facts from them. Finally, these models use the gleaned knowledge to infer the outcome of a future event. Unlike in other reading comprehension tasks, models cannot rely on the existence of an answer within the text, but must make an educated guess as to what will happen in the future. While our task does encompass reasoning abilities tested in other datasets, no other tasks investigate these reasoning abilities in the context of predicting future events. More analysis on reasoning types can be found in Sec. 4.2.

Dataset Construction and Analysis
In this section, we describe how we construct our FORECASTQA dataset and analyze it.

Construction Details
The data collection is broken down into three stages: (1) gathering a news corpus, (2) generating question-answer-timestamp triples with distractor choices, and (3) verifying the triples' quality. The data generation process is summarized in Fig. 3.
News Corpus Collection. We started by gathering English news articles from LexisNexis. We then curated a list of 21 trusted news sources and filtered articles based on their publishers; we also filtered out non-English articles. Finally, we selected the five-year period 2015-2019 and filtered out articles outside this period, leaving us with 509,776 articles. This corpus is also used for retrieval in our task setting (i.e., constrained open-domain).

Question-Answer-Timestamp Triple Creation. Once we assembled the news corpus, we built (question, answer, timestamp) triples to accompany the news corpus as inputs for our task. To generate the needed triples we turned to crowdsourcing via Amazon Mechanical Turk. Our generation task consists of the following steps: (1) we selected a random news article from 2019 from the collected news corpus (these articles are gold articles and are hidden during experiments); (2) workers created questions which, if posed before the respective article's publication date, would be seen as forecasting questions; (3) workers indicated the answer, along with the supporting evidence the question was constructed from (to ensure the correctness of the true answer); (4) workers were asked to make multiple-choice distractors with their own knowledge and/or access to search engines; and (5) we ensured that a temporal phrase is present in the questions, for example "After May of 2020..." or "... in June of 2021?", to provide a temporal context (constraint) for each question, yielding more precise and well-defined forecasting questions. Completion of this task results in the desired triple: a forecasting question, an answer to the question (with distractor choices), and a timestamp as our temporal constraint. The timestamp is set as the first day of the month in which the gold article was published.
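The timestamp rule in step (5) can be stated in one line; the helper name below is hypothetical, used only to make the convention concrete:

```python
from datetime import date

def question_timestamp(publication_date):
    """t_Q: the first day of the month in which the gold article was
    published (so the gold article always satisfies t_gold >= t_Q)."""
    return publication_date.replace(day=1)

question_timestamp(date(2019, 9, 15))  # -> date(2019, 9, 1)
```

This choice guarantees that the gold article itself is never accessible under the t_A < t_Q constraint.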
To diversify the questions in the dataset, we created two kinds of questions: binary yes-no questions and multiple-choice questions with four choices. Multiple-choice questions start with one of the six Ws (i.e., who, what, when, where, why, and how) and are more challenging as they require determining the correctness of each choice.

Question Quality Verification. We performed a separate crowdsourcing data verification step to test and enforce the following criteria: (1) is answering the question a tractable problem given (relevant) "past" articles?, and (2) is the question deterministically answerable given any article adhering to the question's temporal constraint? If a question is too difficult, i.e., an educated guess at the answer (when given relevant, constraint-adhering articles) is not possible, then we filter the question out. On the other hand, if a question is answerable with certainty using "past" articles or commonsense/world knowledge, then it is not considered to be a forecasting question. The desired response (majority vote from 3 annotators) is "yes" for criterion (1) and "no" for criterion (2), as that would show that the tuple of question and time constraint simulates the desired forecasting scenario.

[Figure 4: Reasoning skills (types) and their frequency (in %) in the sampled data. As each question can be labeled with multiple types, the total frequency does not sum to 100%. On average, 3 reasoning skills are required for each question. Example: Q: Which celebrations of China will the pro-democracy protests of demonstrators spoil in Hong Kong in September 2019? Evidence sentence: China's leaders will not want overshadowed by protests in Hong Kong, which have grown in intensity since mass demonstrations began in June. Examples of other reasoning types can be found in Fig. 11 in the appendix.]
With the above method, we filtered out 31% of the questions collected in the triple creation step and were left with 5,704 yes-no questions and 4,513 multiple-choice questions. More details about the verification step are included in Sec. A of the appendix.

Dataset Analysis
To better understand the properties of the questions in FORECASTQA, we examine: 1) summary statistics, 2) the types of questions asked, and 3) the types of reasoning required to answer our questions.
Summary Statistics. The FORECASTQA dataset is composed of 10,392 questions, divided into an 80/10/10 split of train, dev, and test data. Our 10k questions are roughly evenly split between multiple-choice and yes-no binary questions (Table 2). Over 17K distinct words were used to construct our questions, and we have 218 unique time constraints associated with them; time constraints range from 2019-01-11 to 2019-11-12. We include additional statistics in Sec. D of the appendix.

Types of Questions.
To understand the types of questions in FORECASTQA, we examined the popular beginnings of sentences and created a tree-map plot (see Fig. 2). As shown, nearly half the questions start with the word will (44%), a result of over half of the questions being yes-no questions.
Reasoning Types. To examine the types of reasoning required to answer our questions, we sampled 100 questions and manually annotated them with reasoning types. Due to the forecasting nature of our dataset, we are particularly interested in questions requiring forecasting ability, and thus spent more time looking into these questions. Our condensed results can be found in Figure 4, and more results from our cataloguing effort can be found in Sec. C of the appendix. Note that most questions require more than one reasoning type.

Methods
To evaluate the forecasting capabilities of recent multiple-choice/binary QA model architectures on FORECASTQA, we provide a comprehensive benchmarking analysis in this work. We run experiments in two settings: (1) a closed-book setting and (2) a constrained open-domain setting. In the closed-book setting only Q (question) and C (answer choices) are provided to the model, (Q, C), while A (news articles) is additionally provided in setting (2), (Q, C, A). We run these settings to understand the difficulty of both the closed-book and open-domain challenges presented by the questions in FORECASTQA. For both settings, we explore several baseline models, but all follow a general architecture of a text encoder f and an optional context aggregation module g that aggregates information from a set of retrieved articles. Fig. 5 shows the architectures used. We model both yes-no and multiple-choice questions as a binary classification task; a model's prediction is the class with the largest probability. Below we introduce the details of our baselines.
Text Encoder. We use a pre-trained language model, BERT (Devlin et al., 2019), as our text encoder (f from above). We did not include more recent pre-trained language models (e.g., RoBERTa (Liu et al., 2019b), ALBERT (Lan et al., 2020), T5 (Raffel et al., 2020)) or pre-trained QA models like UnifiedQA (Khashabi et al., 2020), as these models are trained on text data published after the earliest timestamp in our dataset (2019-01-01), meaning information leakage could occur, violating the forecasting setup; we test more LMs in Sec. E.5 of the appendix. f is designed to deal with (Q, C) and (Q, C, A) inputs, where A is a set of timestamped articles retrieved from A to answer the question. (1) Sequential Aggregation: the [CLS] token representations of the retrieved articles are fed, in their temporal order, to a GRU, and the resulting representation is passed to an MLP layer to make a prediction. (2) Set Aggregation: alternatively, we ignore the temporal ordering of articles and use a max-pooling operation on the [CLS] token representations of each article. This pooled representation is passed to an MLP layer to make a prediction. Comparing these aggregations helps us understand the effect of modeling the temporal order of evidence. The two aggregation modules are denoted "AGG (GRU)" and "AGG (Maxpool)," respectively.
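To make the set-aggregation step concrete, here is a toy sketch of elementwise max-pooling over [CLS] vectors in plain Python. In the actual models the [CLS] representations come from BERT and the pooled vector feeds an MLP; the tiny hand-written vectors here are illustrative only:

```python
def maxpool(cls_vectors):
    """Elementwise max over article [CLS] representations (set aggregation):
    for each dimension, keep the largest value across all articles."""
    return [max(dims) for dims in zip(*cls_vectors)]

cls_vectors = [
    [0.1, 0.9, -0.2],   # article 1
    [0.4, 0.2,  0.7],   # article 2
    [0.3, 0.5,  0.1],   # article 3
]
pooled = maxpool(cls_vectors)  # -> [0.4, 0.9, 0.7]
```

Because the max is order-invariant, this aggregation discards the temporal ordering of evidence, which is exactly what the GRU variant preserves.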
Multi-document Summarization (MDS). Rather than conducting context aggregation of the retrieved articles, we consider an MMR summarizer (Carbonell and Goldstein, 1998), which performs extractive multi-document summarization of the text to generate a summary A_summ (rightmost architecture in Fig. 5). The summary article A_summ is treated as if it were an A_i ∈ A and fed into a text encoder along with Q and C, which then produces the [CLS] embedding used to make a prediction. We name this method "MDS."

Integrated Approach. To take the best of both worlds in the (Q, C) and (Q, C, A) settings, we integrate two architectures (the leftmost and middle ones in Fig. 5). We concatenate the last hidden representations of each architecture before passing the concatenated representation through a shared MLP layer. We use BERT LARGE as f in both architectures and AGG (GRU) for g, and call this model "BERT LARGE ++ (integrated)" in Table 3.
Other Baselines. We also consider other baselines: ESIM (Chen et al., 2017b), BIDAF++ (Clark and Gardner, 2018), prepending extracted open event triples (Liu et al., 2019a) to BERT input, and a script learning approach, SAM-Net (Lv et al., 2019). We modify the approaches to fit into our setup. Detailed descriptions of each baseline method are included in Sec. E.3 of appendix.

Experimental Setup
We adopt two types of settings: the closed-book setting (Q, C) and the constrained open-domain setting (Q, C, A).
In the constrained open-domain setting, we use BM25 (Robertson et al., 1995; Qi et al., 2019) as our IR method to obtain A, the set of 10 retrieved articles. We also explore other IR methods in a later section. Note that we only retrieve articles that do not violate the time constraint: we feed the question Q as the query and limit our access to articles in A published before t_Q. Additionally, we validate the answerability of our questions by providing gold articles instead of retrieved articles (Sec. 6.3).
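For illustration, a toy BM25 ranker over time-admissible articles might look like the following. This is a simplified sketch, not the paper's actual retrieval pipeline; the parameter defaults k1=1.5 and b=0.75 are conventional choices, and real systems would use a proper tokenizer and inverted index:

```python
import math
from collections import Counter

def bm25_rank(question, articles, k1=1.5, b=0.75, top_k=10):
    """Rank articles against the question with a toy BM25 scorer.
    `articles` is assumed to already satisfy the t_A < t_Q constraint."""
    docs = [a["text"].lower().split() for a in articles]
    avgdl = sum(len(d) for d in docs) / len(docs)   # average document length
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))   # document frequencies

    def score(q_terms, d):
        tf = Counter(d)
        s = 0.0
        for t in q_terms:
            if t not in tf:
                continue
            idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        return s

    q_terms = question.lower().split()
    ranked = sorted(range(n), key=lambda i: score(q_terms, docs[i]), reverse=True)
    return [articles[i] for i in ranked[:top_k]]
```

The question text serves directly as the query, mirroring the setup described above.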
Evaluation Metrics. Because forecasting is uncertain, a system's prediction probabilities indicate its confidence in answering the question. In addition to accuracy, we consider the Brier score (Brier, 1950), which measures the mean squared error of the probabilities assigned to the sets of answer choices (outcomes). Formally, Brier = (1/N) Σ_{i=1}^{N} Σ_{c=1}^{C} (p_{ic} − y_{ic})^2, where p_{ic} is the predicted probability of class c for instance i, y_{ic} is the label indicator for class c of instance i (1 or 0), N is the number of prediction instances, and C is the number of classes (2 or 4). The best possible Brier score is 0 (probability 1 for the correct class, probability 0 elsewhere), while the worst possible Brier score is 2 (probability 1 for a wrong class, probability 0 elsewhere). A model that is both confident and correct gets a low Brier score.
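The Brier score above can be computed directly; a small self-contained Python sketch:

```python
def brier_score(probs, labels):
    """Brier = (1/N) * sum_i sum_c (p_ic - y_ic)^2, where `probs` holds
    predicted class probabilities and `labels` holds one-hot indicators."""
    n = len(probs)
    return sum(
        sum((p - y) ** 2 for p, y in zip(p_i, y_i))
        for p_i, y_i in zip(probs, labels)
    ) / n

# A fully confident, correct forecast scores 0; a fully confident,
# wrong forecast scores 2 (regardless of the number of classes).
brier_score([[1.0, 0.0]], [[1, 0]])  # -> 0.0
brier_score([[0.0, 1.0]], [[1, 0]])  # -> 2.0
brier_score([[0.7, 0.3]], [[1, 0]])  # -> 0.18, i.e. (0.3^2 + 0.3^2)
```

Note that, unlike accuracy, the score rewards well-calibrated probabilities rather than just the argmax choice.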

Human Performance
To benchmark human performance, seven annotators (computer science graduate students) who were not involved in question generation were asked to answer 150 randomly sampled questions from the test set. We consider two scenarios: 1) annotators are provided with the retrieved articles, A; and 2) annotators can access any article published before the timestamp via Google Search. Moreover, as annotators live in the "future" with respect to the timestamp of a question, they might already know the actual answer. To avoid over-estimation of accuracy, we asked the annotators not to use their "future" knowledge. If they felt this was not possible, we asked them to skip the question. On average, 28.3% of questions were skipped. Given this setup, humans achieve 71.2% and 79.4% accuracy, respectively, for the two scenarios when taking a majority vote for each question; we also observed good inter-annotator agreement. The two scenarios are referred to as "(α)" and "(β)" in Table 3.

Results and Performance Analysis
Results on the Constrained Open-domain Setting. Table 3 shows the results of the baseline methods. We compare pre-trained language models with different context aggregators and other baselines. The integrated model, BERT LARGE ++, shows the best performance in terms of accuracy, while BERT LARGE (closed-book) shows the best Brier score. Unlike the accuracy metric, the Brier score penalizes over- and under-confident forecasts (Mellers et al., 2014); thus the best model under each metric can differ. The marginal differences in performance between the two settings suggest that access to information (text evidence) alone does not solve the forecasting problem. We hypothesize that an inability to encode salient relations for forecasting purposes prevents the additional information from proving useful. Among the aggregators for BERT BASE, the GRU aggregator outperforms the other aggregators and summarizers. This suggests that utilizing the articles' temporal order helps the reasoning. Overall, baselines fall behind human performance by over 10 percentage points given the same retrieved articles.

Study of Different IR Methods. We further test several retrieval methods: BM25 (Robertson et al., 1995; Qi et al., 2019), TF-IDF (Chen et al., 2017a), and a pre-trained dense passage retriever (DPR) (Karpukhin et al., 2020). As shown in Table 4, BERT LARGE with the DPR retriever and the Maxpool aggregator performs best among the combinations. However, DPR does not achieve the best accuracy for all methods. This implies that 1) stronger retrieval methods are required to identify useful evidence; and 2) complex forecasting abilities may be a bottleneck for current systems.

[Table 6: Answerability study on the test set. Instead of retrieved articles, we provide BERT BASE with ground-truth context: a gold article or evidence sentence. We thus convert FORECASTQA into a reading comprehension task and examine the answerability of the questions.]
Ablation on Timestamp Modeling. We conduct an ablation study on modeling the time information (publication date) of the retrieved articles, as seen in Table 5. We test: a) prepending the date string to the BERT input, b) using binary encodings of dates (following the temporenc format, https://temporenc.org) concatenated with the article encoding before aggregation, and c) using a char-RNN (Goyal and Durrett, 2019) to encode the date string before aggregation (details in Sec. E.4 of the appendix). We find that using binary encodings of dates improves accuracy for the maxpool aggregator. However, the GRU aggregator's accuracy decreases when given date information. We conjecture that our modeling of each article's time information is not strong enough to help forecasting. We leave more sophisticated modeling for future work.
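As a rough illustration of variant (b), a date can be packed into a fixed-width bit vector and appended to the article encoding before aggregation. The field widths below are a simplified invention for illustration and do not reproduce the exact temporenc layout:

```python
def date_bits(year, month, day):
    """Binary-encode a date as a fixed-width 0/1 vector:
    12 bits for the year, 4 for the month, 5 for the day
    (simplified layout; the paper uses the temporenc format)."""
    def bits(value, width):
        return [(value >> i) & 1 for i in reversed(range(width))]
    return bits(year, 12) + bits(month, 4) + bits(day, 5)

article_encoding = [0.1, -0.4, 0.7]            # stand-in for a [CLS] vector
dated = article_encoding + date_bits(2019, 6, 27)  # 3 + 21 dimensions
```

The concatenated vector then feeds the aggregator, letting the model condition on publication dates without parsing date strings.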

Answerability of Questions.
To validate that the questions in FORECASTQA are indeed answerable, we convert our setup into a machine reading comprehension (MRC) task: find an answer given an assumed appropriate context. We provide the model with the gold article or the evidence sentence (Sec. 4.1). Since pre-trained models have achieved high performance on MRC tasks (Rajpurkar et al., 2016), we expect adequate performance when provided the correct context. As seen in Table 6, we observe that in the closed-book setting, BERT is able to beat a random baseline but still does not perform well, implying that our questions are not trivial for BERT and that context is required to answer them correctly. When given the gold article, BERT achieves 76.9% (+22%), and it performs even better (84.4%) given the evidence sentence. This all implies that given the right information, our forecasting questions can be answered correctly.
Study of Data Efficiency. To examine how models might perform with less/more training data, we evaluate BERT BASE (closed-book) on the test set after training it with varying amounts of labeled data. Fig. 6a shows the resulting "learning curve." Extrapolating, we expect the accuracy of the model to reach 70% given 100k training examples, which would still be 9 percentage points lower than human performance.
Results on Different Question Types. We test BERT BASE (closed-book) on the different types of multiple-choice questions in our development set (Fig. 6b). We find that the accuracy of the model varies across question types: "how" questions are the most difficult, while higher accuracy is achieved on "why" questions. For yes-no questions, the method achieves 69.5% on "yes" questions and 62.9% on "no" questions, indicating that there is no significant bias towards either type of binary question.
Error Analysis. We observe 4 main categories of errors produced by the methods in our analysis: (1) retrieving irrelevant articles, (2) incorrect reasoning on relevant evidence, (3) lacking (temporal) common sense, and (4) lacking numerical knowledge. Please refer to Sec. E.7 of appendix for examples and in-depth discussions of these errors.

Conclusion
Forecasting is a difficult task that requires every possible advantage to do well, and it would be wise to harness the vast pool of unstructured text data for training automatic event forecasting agents. To utilize this form of data for forecasting, we proposed FORECASTQA, a question-answering task that requires forecasting skills to solve, and provided the accompanying dataset. Various baseline methods did not perform well, but this is not surprising given the inherent difficulty of forecasting. We hope our benchmark dataset will benefit future research beyond natural language understanding and that forecasting performance will be significantly improved.

A Detailed Dataset Creation
In this section, we present detailed explanations of the dataset creation. We first selected the news sources, as described in the following section.

A.1 List of News Sources
The

A.2 Dataset Creation
Turking Guidelines. Figs. 7 and 8 show the instructions and interface for creating our multiple-choice questions. Workers made multiple-choice distractors with their own knowledge, but they were encouraged to find good distractors using search engines. To ensure the answerability of the created questions, we asked workers to indicate the answer along with the supporting evidence that the question was made from. We omit the remaining interfaces due to the space limit.
Initial Screening. The ideal results of our crowdsourcing task are forecasting questions that are tractable but not trivial, and by definition not answerable with certitude using currently available information. Thus, to avoid undesirable questions, we asked two additional questions to help screen out poorly constructed ones. As shown in Fig 8, we try to determine the difficulty of each question and whether it is answerable using "current" or "past" information. Question 1 attempts to establish whether the question is indeed tractable, asking whether there exists some qualified group of people who could reason about the question and make an educated guess at its answer. Question 2, on the other hand, tries to determine whether the question is too easy or is definitively answerable given "current" and "past" information. Thus, the desired responses are "yes" and "no" for Questions 1 and 2, respectively; we filtered out created questions that did not satisfy this condition.
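The screening rule above amounts to a simple filter over worker responses. The sketch below is a minimal illustration, assuming hypothetical field names for the two screening answers (the actual crowdsourcing output format is not specified in the paper):

```python
def passes_screening(q):
    """Keep a question only if Q1 (tractable by some qualified group)
    is 'yes' and Q2 (already answerable from current/past info) is 'no'.
    Field names are hypothetical stand-ins for the worker responses."""
    return q["q1_tractable"] == "yes" and q["q2_answerable_now"] == "no"

questions = [
    {"id": 1, "q1_tractable": "yes", "q2_answerable_now": "no"},   # kept
    {"id": 2, "q1_tractable": "yes", "q2_answerable_now": "yes"},  # too easy
    {"id": 3, "q1_tractable": "no",  "q2_answerable_now": "no"},   # intractable
]
kept = [q["id"] for q in questions if passes_screening(q)]
print(kept)  # -> [1]
```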

A.3 Additional Question Quality Checks
We asked the same two questions from our initial quality screening, plus an additional question to help adjust the timestamp associated with the question if needed. Per question, we had 3 crowd workers answer the three questions and took the majority vote for Questions 1 and 2, while selecting the earliest selected timestamp for Question 3. We dropped a question if the majority vote was "no" for Question 1 or "yes" for Question 2. Moreover, if at least one worker selected "e" for Question 3 ("There is no appropriate recent time stamp"), then we filtered out the question. Additionally, if the created ques-

B Example of Reasoning
Table 7 shows an example of the reasoning process required to solve a question. Reasoning Process: The Corolla Wild Horses Protection Act will make people protect the wild horses (forecasting skills - causal relations). If people start to protect the wild horses from January, the wild horses will be found in September (forecasting skills - inferring based on past events - we can find the answer from this part). A horse is an animal (commonsense - world knowledge). "The Outer Banks of North Carolina" = "North Carolina and the Outer Banks" (language understanding - paraphrase).

C Additional Reasoning Types
Figure 11 shows additional reasoning types.
Language Understanding. We introduce lexical variations and syntactic variations following Rajpurkar et al. (2016, 2018). Lexical variations represent synonyms or coreferences between the question and the evidence sentence. When the question is paraphrased into another syntactic form and the evidence sentence matches that form, we call it a syntactic variation. We find that many questions require language understanding; lexical variations account for 46% of questions and syntactic variations for 66%.
Multi-hop Reasoning. Some questions require multi-hop reasoning (Yang et al., 2018), such as checking multiple properties (9%) or bridge entities (5%). The former requires finding multiple properties in an article to determine the answer. In the latter, one entity works as a bridge between two others: one must identify the bridge entity and then find the answer in the second hop.
Numerical Reasoning. To answer our questions, one also needs numerical reasoning (Dua et al., 2019). The answer is found by adding or subtracting two numbers (5%), or comparing two numbers (8%) in the given articles.
Commonsense Reasoning. Some questions require world knowledge (Talmor et al., 2019), social commonsense (Sap et al., 2019), or temporal commonsense (Zhou et al., 2019). To solve these questions, an AI agent must leverage assumed common knowledge in addition to what it finds in the news corpus. We find that 36% of questions need world knowledge and 7% require social commonsense; 9% involve temporal commonsense, which is related to temporal knowledge (Zhou et al., 2019). Tables 8 and 9 show the statistics and answer types in FORECASTQA.

E.1 Details on a Text Encoder
We use Huggingface's code 11 . We chose the best learning rate among {3e−5, 1e−5, 5e−6}; the number of epochs is 3 and the maximum sequence length is 512.
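The selection above is a small grid search over learning rates. A minimal sketch follows; `evaluate_dev` is a hypothetical stand-in for fine-tuning BERT with the stated settings and returning development-set accuracy (the dummy scores are illustrative, not results from the paper):

```python
# Hyperparameter grid from the paper; epochs and sequence length are fixed.
LEARNING_RATES = [3e-5, 1e-5, 5e-6]
NUM_EPOCHS = 3
MAX_SEQ_LENGTH = 512

def evaluate_dev(lr):
    """Placeholder: in practice, fine-tune for NUM_EPOCHS with this
    learning rate and MAX_SEQ_LENGTH, then return dev accuracy."""
    dummy_scores = {3e-5: 0.58, 1e-5: 0.61, 5e-6: 0.55}
    return dummy_scores[lr]

# Pick the learning rate with the highest dev accuracy.
best_lr = max(LEARNING_RATES, key=evaluate_dev)
print(best_lr)  # -> 1e-05
```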

E.2 Details on IR methods
We index the English news articles with Elasticsearch (Gormley and Tong, 2015), following the setup in Qi et al. (2019). We use Elasticsearch's simple analyzer, which performs basic tokenization and lowercasing, for the title, and the standard analyzer, which also removes punctuation and stop words, for the body of articles. At retrieval time, we issue a multi_match query against all fields, which performs a full-text query employing the BM25 ranking function (Robertson et al., 1995) on all fields and returns the score of the best field for ranking. To promote documents whose title matches the search query, we boost the search score of any such result by 1.25, which results in better recall for entities with common names.
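The retrieval setup above can be sketched as an Elasticsearch query body. The field names ("title", "body") and the exact boost syntax are assumptions for illustration; the paper does not give the index schema:

```python
# Sketch of the multi_match retrieval query described above.
# "title^1.25" applies the 1.25 boost to title matches;
# "best_fields" ranks by the score of the best-matching field.
query = {
    "query": {
        "multi_match": {
            "query": "primary schools vaccination Europe",
            "fields": ["title^1.25", "body"],
            "type": "best_fields",
        }
    }
}

# With an elasticsearch-py client this would be run as, e.g.:
# results = es.search(index="news", body=query)
print(query["query"]["multi_match"]["fields"])  # -> ['title^1.25', 'body']
```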

E.3 Details on Baselines.
We consider the following baselines: (1) Event-based approaches: we adapt two event-based approaches to our setup, BERT with event triples (two entities and a relation between them) and a model based on SAM-Net (Lv et al., 2019). It is non-trivial to apply event-based approaches to our setup, so we preprocess the retrieved news articles into event triples (subject, relation, object) using Liu et al. (2019a). Regarding the triples simply as text, we concatenate them and feed them into BERT; we call this BERT with event triples. In addition, we apply a script-learning approach, SAM-Net (Lv et al., 2019), to our setup. Since the question and choices are not used in the original method, we encode them using BERT and concatenate the encodings with the approach's final representation; this representation is fed into a linear layer that predicts whether the choice is correct. We used BERT LARGE for the former and BERT BASE for the latter.
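The "regard the triples as text" step can be sketched as a simple linearization; the exact concatenation format (separators, ordering) is an assumption, since the paper only says the triples are concatenated and fed to BERT:

```python
def triples_to_text(triples):
    """Linearize (subject, relation, object) event triples into a
    single string that can be fed to BERT as ordinary text."""
    return " ".join(f"{s} {r} {o}" for s, r, o in triples)

# Illustrative triples (not from the actual extraction output).
triples = [("Seoul", "meets_with", "Tokyo"),
           ("leaders", "discuss", "trade")]
print(triples_to_text(triples))
# -> Seoul meets_with Tokyo leaders discuss trade
```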
(2) ESIM (Chen et al., 2017b). An NLI model, whose output layer we change so that the model outputs a probability for each answer choice via a softmax layer. We use ELMo (Peters et al., 2018) for word embeddings.
(3) BIDAF++ (Clark and Gardner, 2018). The model requires context, so we use the top-1 article retrieved by an IR method. We augment the model with a self-attention layer and ELMo representations. To adapt it to the multiple-choice setting, we choose the answer with the highest probability. The input to ESIM is a question and a set of choices (Q, C), while the input to BIDAF++ is a question, a set of choices, and retrieved articles (Q, C, A). 12
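The multiple-choice adaptation shared by these baselines (score each choice, softmax over the scores, pick the argmax) can be sketched as follows; `score_choice` is a hypothetical stand-in for whichever underlying model produces a per-choice score:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def answer(question, choices, score_choice):
    """Choose the answer with the highest softmax probability."""
    probs = softmax([score_choice(question, c) for c in choices])
    return choices[probs.index(max(probs))]

# Toy scores standing in for model outputs.
scores = {"yes": 2.0, "no": 0.5}
pred = answer("Will X happen?", ["yes", "no"], lambda q, c: scores[c])
print(pred)  # -> yes
```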

E.4 Time Modeling
We conduct an ablation study on modeling the time information of the retrieved articles. We test the following models: a) prepending the date string, in "YYYY-MM-DD" format, to the input as in BERT; b) using binary encodings of dates: we first encode the time into a binary encoding using "Temporenc 13 " and concatenate the encoding with the article encoding before aggregation; c) using a char-RNN (Goyal and Durrett, 2019) to encode the date string before aggregation.
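Variant b) can be illustrated with a simplified stand-in for a Temporenc-style binary encoding (this is not the actual Temporenc wire format, just a fixed-length bit vector packing year/month/day); the article vector is a placeholder:

```python
def encode_date(year, month, day):
    """Pack a date into a fixed-length bit vector:
    13 bits for the year, 4 for the month, 5 for the day (22 total)."""
    bits = f"{year:013b}{month:04b}{day:05b}"
    return [int(b) for b in bits]

vec = encode_date(2019, 9, 1)
article_encoding = [0.1, 0.2, 0.3]    # placeholder article vector
combined = article_encoding + vec     # concatenation before aggregation
print(len(vec), len(combined))  # -> 22 25
```

A fixed-length encoding like this keeps the date features aligned across articles, so they can be concatenated uniformly before the aggregation step.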

E.5 Experiments with Recent LMs.
As mentioned in Sec 5, we did not report results for more recent pre-trained language models (e.g., RoBERTa (Liu et al., 2019b), ALBERT (Lan et al., 2020)) because they are trained on text data published after the earliest timestamp in our dataset.

12 We did not include existing event forecasting methods since they are designed for modeling structured event data (Fawaz et al., 2019) and thus are not directly applicable to FORECASTQA, which requires modeling of unstructured text.