TIMEDIAL: Temporal Commonsense Reasoning in Dialog

Everyday conversations require understanding everyday events, which, in turn, requires understanding temporal commonsense concepts interwoven with those events. Despite recent progress with massive pre-trained language models (LMs) such as T5 and GPT-3, their capability of temporal reasoning in dialogs remains largely under-explored. In this paper, we present the first study to investigate pre-trained LMs for their temporal reasoning capabilities in dialogs by introducing a new task and a crowd-sourced English challenge set, TimeDial. We formulate TimeDial as a multiple choice cloze task with over 1.1K carefully curated dialogs. Empirical results demonstrate that even the best performing models struggle on this task compared to humans, with 23 absolute points of gap in accuracy. Furthermore, our analysis reveals that the models fail to reason about dialog context correctly; instead, they rely on shallow cues based on existing temporal patterns in context, motivating future research for modeling temporal concepts in text and robust contextual reasoning about them. The dataset is publicly available at https://github.com/google-research-datasets/timedial.


Introduction
Humans can effortlessly reason about temporal concepts of everyday events such as their duration, frequency, or relative ordering (Allen, 1984; Radvansky and Zacks, 2014) based on rich commonsense knowledge about how the world works, especially in relation to time. However, reasoning about such concepts has been challenging for machines (Kahn and Gorry, 1977; Kozareva and Hovy, 2011) since it requires both understanding the local temporal expressions and reasoning about their global contexts such as their relative ordering and relations (UzZaman et al., 2013; Ning et al., 2018b; Pustejovsky, 2017). The problem becomes even more challenging in dialogs, where explicit and implicit inter-dependencies among temporal concepts can appear across conversation turns. For instance, for the first dialog in Table 1, one must understand the context, i.e., selling wine, and use world knowledge of minimum legal drinking age in order to reason about correct answers to fill in the blank. Similarly, in the second conversation, commonsense about the durations of a summer, a month, a week, and a day and their relations, plus numerical reasoning, is necessary to make the inference.
* Work done during an internship at Google.
Table 1: Examples from our TIMEDIAL challenge set, demonstrating the need for commonsense knowledge and arithmetic reasoning over the context to infer the correct answers. Key contextual information for reasoning success is highlighted.
Although previous works have studied temporal reasoning in natural language, they have either focused on specific time-related concepts in isolation, such as temporal ordering and relation extraction (Leeuwenberg and Moens, 2018; Ning et al., 2018a), or dealt with limited context, such as single-sentence-based question answering (Zhou et al., 2019) and natural language inference (Vashishtha et al., 2020; Mostafazadeh et al., 2016).
In this work, we make the first systematic study of temporal commonsense reasoning in a multi-turn dialog setting. The task involves complex reasoning that requires operations like comparison and arithmetic reasoning over temporal expressions and the need for commonsense and world knowledge. We design a new task for dialog-based temporal reasoning and present a new challenge set in English, called TIMEDIAL, to evaluate language understanding models on the task. We formulate the problem as a crowd-sourced cloze task with multiple choices based on dialogs in the DailyDialog dataset (Li et al., 2017). Given a dialog with one temporal span masked out, the model is asked to find all correct answers from a list of four options to fill in the blank (Table 1).
The challenge set requires the models to demonstrate understanding of the context and use temporal commonsense to make right choices. Our final challenge set consists of 1.1K carefully curated dialog instances.
We then study the performance of several state-of-the-art pre-trained language models on TIMEDIAL along several dimensions including modeling paradigms (classification, mask filling, and generation), the scope of dialog contexts, in-domain vs. out-of-domain training, dependence on shallow text matching for reasoning, and the types of reasoning required. Our experiments demonstrate that off-the-shelf, pre-trained language models cannot effectively reason about temporal aspects in a dialog, even with domain-specific finetuning. Our findings indicate that large-scale pre-trained models even after fine-tuning may not be sufficient for robust temporal reasoning in dialogs, and motivate future research toward modeling temporal concepts over diverse everyday events, and contextual reasoning about them.

Task: Temporal Reasoning in Dialog
We formulate the dialog-based temporal commonsense reasoning problem as a cloze task (Taylor, 1953). Formally, given a multi-turn dialog context of n conversational turns between two speakers A and B, where a temporal word span within the context is masked out, the task is to predict the suitable temporal expression(s) for the masked-out span from a list of options. That is, we want the conversation model to select all the correct answers from the options based on the dialog context. Following similar cloze-style challenge datasets, we use accuracy as the evaluation metric (Mostafazadeh et al., 2016; Onishi et al., 2016; Mihaylov and Frank, 2018).
Having a non-trivial set of options is crucial to build a challenge set and to avoid accidental spurious biases (Geirhos et al., 2020;Gururangan et al., 2018;Le Bras et al., 2020). We ensure this via the following filtering process.
(1) For each masked span, there is more than one correct answer in the options. This makes the task more challenging for models since more comprehensive understanding of the context is required to recognize all the correct choices. In our dataset (§3) we guarantee two correct answers for each masked span.
(2) Some incorrect options are selected to be spuriously correlated with the dialog context. For example, we include temporal spans in the dialog context as negative options, which will challenge models that rely primarily on shallow pattern matching without correct temporal reasoning. We present more information in §3 about how the negative options were created by human annotators.

Dataset: TIMEDIAL
The TIMEDIAL dataset is derived from DailyDialog data (Li et al., 2017), which is a multi-turn dialog corpus containing over 13K English dialogs. Dialogs in this dataset consist of turn-taking between two people on topics over 10 broad categories, ranging from daily lives to financial topics.

Data Collection
Our data collection process involves two steps: (1) identifying dialogs that are rich in temporal expressions, and (2) asking human annotators to provide correct and incorrect options for cloze instances derived from these dialogs. We now describe these steps in detail.
Temporal expression identification. Here, we select dialogs that are rich with temporal information, in order to focus on complex temporal reasoning that arises in natural dialogs. Temporal expressions are automatically identified with SUTime (Chang and Manning, 2012), an off-the-shelf temporal expression detector (footnote 1). We keep only the dialogs with more than 3 temporal expressions and at least one expression that contains numerals like "two weeks" (as opposed to non-numeric spans, like "summer", "right now", and "later"). In our initial experiment, we observe that language models can often correctly predict these non-numerical temporal phrases. We note that temporal expressions containing numerals serve as more challenging sets of options than non-numerical ones. This filtering step results in 1,127 unique dialogs for further processing.
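The filtering criterion above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the small regular expression below is a hypothetical stand-in for SUTime and recognizes far fewer expressions than a real temporal tagger, and the names `temporal_spans` and `keep_dialog` are ours, not part of any released code.

```python
import re

# Hypothetical, deliberately tiny stand-in for SUTime (assumption, not the real tool).
NUMERAL = r"(?:\d+|one|two|three|four|five|six|seven|eight|nine|ten)"
NUMERIC_SPAN = rf"{NUMERAL}\s+(?:minute|hour|day|week|month|year)s?"
NON_NUMERIC_SPAN = r"summer|winter|right now|later|today|tomorrow"
TEMPORAL_RE = re.compile(rf"\b(?:{NUMERIC_SPAN}|{NON_NUMERIC_SPAN})\b", re.IGNORECASE)
NUMERAL_RE = re.compile(rf"\b{NUMERAL}\b", re.IGNORECASE)


def temporal_spans(dialog):
    """Return every temporal expression found across the turns of a dialog."""
    return [m.group(0) for turn in dialog for m in TEMPORAL_RE.finditer(turn)]


def keep_dialog(dialog, min_spans=4):
    """Keep dialogs with more than 3 temporal spans, at least one containing a numeral."""
    spans = temporal_spans(dialog)
    has_numeral = any(NUMERAL_RE.search(span) for span in spans)
    return len(spans) >= min_spans and has_numeral


rich = [
    "A: I will be away for two weeks this summer.",
    "B: Can you come back in three days? We open later today.",
]
sparse = ["A: See you later.", "B: Sure, right now is fine."]
print(temporal_spans(rich))
print(keep_dialog(rich), keep_dialog(sparse))
```

The second dialog is dropped both because it has too few temporal spans and because none of them contains a numeral.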
Human annotated options. Next, we mask spans in the dialogs. For a dialog, we mask out each temporal expression that contains numerals, each resulting in a cloze question that is then sent for human annotation.
This resulted in 1,526 instances for annotation. For each masked span in each dialog, we obtain human annotation to derive a fixed set of correct and incorrect options given the context. Concretely, given a masked dialog and a seed correct answer (i.e., the original text) for the masked span, the annotators were asked to (1) come up with an alternative correct answer that makes sense in the dialog adhering to commonsense, and (2) formulate two incorrect answers that have no possibility of making sense in the dialog context. We highlight all time expressions in the context to make it easier for annotators to select reasonable time expressions.
1 https://nlp.stanford.edu/software/sutime.shtml
To ensure that the annotated incorrect options are not too trivially distinguishable by the models (as discussed in §2), we define three rules for the annotators to follow.
• Rule 1: Phrase Matching. The rater should first try to pick another temporal span from the dialog context that makes syntactic/semantic sense (e.g., when the span is of the appropriate type, such as duration, for the masked span) but is still incorrect according to commonsense.
• Rule 2: Numeral Matching. If Rule 1 does not apply, raters should follow a relaxed version of Rule 1, whereby the incorrect option should contain any numeral occurring in the dialog context.
• Rule 3: Open-ended. If neither of the above rules is applicable, then raters can come up with an incorrect option using their own judgment. The two incorrect options are required to differ from each other as much as possible.
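To make the three rules concrete, the sketch below guesses which rule an already-written incorrect option follows, given the temporal spans present in the dialog context. It is an illustrative heuristic only: the function names are hypothetical, the numeral list is intentionally short, and the actual rule labels in the dataset come from the human annotators rather than from any such script.

```python
import re

NUMBER_WORDS = r"one|two|three|four|five|six|seven|eight|nine|ten"


def numerals(text):
    """Digits and a few spelled-out numbers found in the text (lowercased)."""
    return set(re.findall(rf"\b(?:\d+|{NUMBER_WORDS})\b", text.lower()))


def negative_option_rule(option, context_spans):
    """Guess which annotation rule an incorrect option follows.

    context_spans: temporal spans that already occur in the dialog context.
    """
    normalized = {span.lower() for span in context_spans}
    if option.lower() in normalized:
        return "phrase_matching"   # Rule 1: reuses a whole span from the context
    context_numerals = set().union(*(numerals(span) for span in context_spans))
    if numerals(option) & context_numerals:
        return "numeral_matching"  # Rule 2: shares only a numeral with the context
    return "open_ended"            # Rule 3: written from the annotator's own judgment


context = ["three o'clock", "two weeks"]
for option in ["two weeks", "three days", "five years"]:
    print(option, "->", negative_option_rule(option, context))
```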
Rules 1 and 2 are designed to confuse models that rely on shallow pattern matching. Finally, to ensure the quality of the human-annotated options, we perform a subsequent round of human validation on the gathered data. The validators identify and fix issues such as duplicate options, unreasonable or obscure annotations w.r.t. natural usage, or ungrammatical annotations that do not fit in the context.
Table 3 shows statistics of TIMEDIAL. The dataset contains over 1.1K test instances. Each dialog contains 11.7 turns and 3 temporal expressions on average, presenting richer and more complex context compared to the recent single-sentence-based temporal question answering benchmarks (e.g., Zhou et al., 2019; Vashishtha et al., 2020). As above, each test instance contains two correct answers and two incorrect ones. Over half of the incorrect options are annotated based on phrase and numeral matching from context, which pose a significant challenge for models relying on shallow text matching, as we show in our experimental analysis (§5).
Answering different instances in the dataset requires different types of core reasoning abilities, such as comparison, arithmetic inference, or reasoning based on world knowledge or general commonsense. To facilitate fine-grained analysis, we also annotate the reasoning categories for a randomly sampled set of 100 dialogs. Though each instance can involve multiple reasoning types, we associate it with one predefined category label that indicates the primary type of reasoning it requires. Table 2 shows the category distribution and examples in each category. We observe that the dataset requires general commonsense for 60% of the dialogs, making it the most common reasoning type.

Modeling
We consider a broad set of methods and evaluate their performance on our challenge TIMEDIAL dataset. These methods vary in terms of the modeling paradigms, the scope of the dialog contexts, and training settings. In particular, they encompass the major ways pre-trained LMs are currently used in downstream tasks ( §4.1) which often outperform earlier specialized non-pretrained models. We also consider different lengths of context used in reasoning, varying by their vicinity to the masked span ( §4.2). Finally, we study different training settings, including zero-shot, in-domain, and out-of-domain training ( §4.3).

Modeling Paradigms
We experiment across three major modeling paradigms: (i) Binary Classification, (ii) Mask Filling, and (iii) Generation. Figure 1 shows the different architectures. For each test instance, the model takes as input a pair of (masked dialog context, candidate), and outputs a score measuring how likely the candidate is to be a correct answer. Based on the prediction scores of all options, the model then chooses the top two candidates as the predicted answers for the instance. Each paradigm of models is finetuned using training data from different domains, as discussed in §4.3.

Binary Classification
In this setting, we formulate the task as a binary classification problem, i.e., we use a classifier to measure the probability of the candidate in the (masked dialog context, candidate) pair being a correct answer. Any powerful LM, e.g., BERT, can be used as the classifier. This method's key challenge is the lack of annotated training data for direct supervision. We generate weak supervision training data as follows.
In an unlabeled corpus, we use the SUTime tool to annotate temporal spans. We mask each temporal span in this corpus and use the masked text, paired with the original span, as one positive example for binary classification. To generate a negative example, we randomly sample another temporal span from the dialog context and use it as a negative candidate for the masked temporal span. The resulting data is noisy because the randomly sampled temporal span can also logically fit in the masked span in the given context; however, we assume the likelihood of that happening is low. We leave drawing harder negative instances using heuristics to future work.
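The weak supervision procedure described above can be sketched as follows. This is a minimal illustration, not the exact pipeline: the spans are assumed to have been found already by a temporal tagger, the `<MASK>` placeholder and the function name `weak_examples` are hypothetical, and only the first occurrence of each span is masked for simplicity.

```python
import random

MASK = "<MASK>"  # hypothetical placeholder for the masked span


def weak_examples(text, spans, rng):
    """Build (masked_text, candidate, label) triples from tagged temporal spans.

    text:  a dialog flattened into one string
    spans: temporal expressions found in `text` (e.g., by a temporal tagger)
    rng:   a random.Random instance, so that sampling is reproducible
    """
    examples = []
    for span in spans:
        masked = text.replace(span, MASK, 1)   # mask the first occurrence only
        examples.append((masked, span, 1))     # original span: positive example
        others = [s for s in spans if s != span]
        if others:
            # A random other span from the same context: a noisy negative example.
            examples.append((masked, rng.choice(others), 0))
    return examples


dialog = "A: I waited for two hours. B: The bus comes every ten minutes."
examples = weak_examples(dialog, ["two hours", "ten minutes"], random.Random(0))
for masked, candidate, label in examples:
    print(label, candidate, "|", masked)
```

As the text notes, some sampled negatives may in fact be plausible fillers, so the labels are only approximately correct.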

Mask Filling
We also use the mask-filling approach of BERT-like masked language models (MLMs). For each dialog context and a candidate temporal span of m tokens, we replace the blank in the dialog context with m masked tokens. We then evaluate the likelihood of predicting the temporal span tokens at those masked positions and average across the positions. A key advantage of this method is that we can directly apply a BERT model in a zero-shot manner, since the model was pre-trained in the same way, i.e., to fill in [MASK] tokens. Additionally, we finetune BERT's MLM to learn task-specific properties.
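The scoring step can be illustrated with a small sketch. Several things here are assumptions made for the example: `token_prob` is a hypothetical callable standing in for a masked LM, the toy model ignores context entirely, the `<BLANK>` marker is ours, and averaging is done over log-probabilities, which is one plausible reading of "average across the positions".

```python
import math


def mask_filling_score(context_tokens, candidate_tokens, token_prob, blank="<BLANK>"):
    """Average log-probability of the candidate tokens at the masked positions.

    token_prob(tokens, position, token) is a hypothetical stand-in for a masked
    LM: it returns P(token at position | tokens).
    """
    i = context_tokens.index(blank)
    m = len(candidate_tokens)
    # Replace the single blank with m [MASK] tokens, one per candidate token.
    masked = context_tokens[:i] + ["[MASK]"] * m + context_tokens[i + 1:]
    log_probs = [
        math.log(token_prob(masked, i + k, token))
        for k, token in enumerate(candidate_tokens)
    ]
    return sum(log_probs) / m


def toy_token_prob(tokens, position, token):
    """Toy model (assumption): favors 'two' and 'hours' regardless of context."""
    return {"two": 0.5, "hours": 0.4}.get(token, 0.05)


context = "I waited for <BLANK> yesterday".split()
good = mask_filling_score(context, ["two", "hours"], toy_token_prob)
bad = mask_filling_score(context, ["nine", "years"], toy_token_prob)
print(good > bad)
```

Averaging over positions keeps candidates of different token lengths comparable, since a longer candidate is not penalized merely for having more tokens.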

Generation
The third method is a fully generative approach using the text-to-text paradigm of T5 (Raffel et al., 2020). Given a masked dialog context, the model is trained to generate the masked text in an encoder-decoder framework. The likelihood of generating the given temporal span (normalized by the length of the span) is then used as the probability of it being correct. Similar to mask filling, we use T5 either in a zero-shot manner or with additional fine-tuning.
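One way to write the length-normalized generation score is given below. The notation is ours and is an assumption: x denotes the masked dialog context, c = (c_1, ..., c_m) the candidate span of m tokens, and p_theta the encoder-decoder model. The text only states that the likelihood is normalized by the span length, so normalizing in log space is an illustrative choice rather than a confirmed detail.

```latex
\mathrm{score}(c \mid x) \;=\; \frac{1}{m} \sum_{t=1}^{m} \log p_\theta\!\left(c_t \mid c_{<t},\, x\right)
```

Under this form, candidates of different lengths are compared by their average per-token log-likelihood.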

Dialog Context
We aim to study the influence of context on a model's temporal reasoning in dialog by incorporating varying scopes of dialog context based on their vicinity to the target span. Since the dialogs in TIMEDIAL are rich in temporal concepts, we want to evaluate LMs' dependence on shallow text matching vs. the ability to accurately understand the causal relations between those concepts (see Table 6). We use the following three settings:
• Full context, where the model is presented with the complete available dialog to reason on. Due to our design of challenging negatives, the full context can often confuse models that rely on shallow cues.
• Local context, where we provide the model with only the utterances that immediately precede and follow the target utterance.
• Target context, where the context is restricted to only the particular utterance that contains the masked span.
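The three context scopes can be sketched as a simple selection over dialog turns. The function name and the scope labels are hypothetical, and the sketch assumes that the local context includes the target utterance itself together with its immediate neighbors.

```python
def select_context(turns, target_index, scope):
    """Return the turns visible to the model under a given context scope.

    turns:        list of utterances, one of which contains the masked span
    target_index: position of the utterance that contains the masked span
    scope:        "full", "local", or "target"
    """
    if scope == "full":
        return list(turns)
    if scope == "local":
        start = max(0, target_index - 1)
        return turns[start:target_index + 2]  # previous, target, and next turn
    if scope == "target":
        return [turns[target_index]]
    raise ValueError(f"unknown scope: {scope!r}")


turns = ["A: t0", "B: t1", "A: t2 <MASK>", "B: t3", "A: t4"]
for scope in ("full", "local", "target"):
    print(scope, select_context(turns, 2, scope))
```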

Training Details
For all models, we consider two common training settings, i.e., in-domain training, where the data is typically small, and out-of-domain training, where a large amount of data is available. Table 4 shows training data statistics. For mask-filling and generation, we also evaluate in a zero-shot setup with no finetuning.
In-domain training. Our challenge TIMEDIAL test set is derived from contextually rich dialogs from the DailyDialog dataset, selected based on the number of temporal spans. This leaves the remaining dialogs, which have fewer than 3 temporal spans or no numeric span. By masking each temporal span in each of these dialogs, we obtain 14.5K training instances to use in our domain-specific fine-tuning.
Out-of-domain training. In this setting, we consider a much larger corpus from a general domain. Specifically, we use the large-scale training set based on the Meena dataset (Adiwardana et al., 2020), which is mined and filtered from public domain social media conversations over 341GB of text (40B words). Compared to the above in-domain data from DailyDialog, which were manually written by human annotators in a clean and consistent way, the dialogs in the Meena corpus tend to be noisy, casual, and usually short. Like our DailyDialog processing, we identify all temporal expressions for dialogs in Meena using SUTime.

Experiments and Analyses
Using the proposed TIMEDIAL challenge set, we next conduct extensive experiments and analyses on the different model variants and context settings. We use either 4x4 or 8x8 Cloud TPU V3 pod slices for fine-tuning and one V100 GPU for inference. We provide more details of the experiment configurations in the appendix.
Evaluation. Since each example of TIMEDIAL contains two correct answers, we report the metric 2-best accuracy, which measures whether both of the model's top-ranked answers are correct. In other words, if the model erroneously ranks an incorrect answer over a correct one, we consider it to be an error case. Note that we use this ranking-based metric as opposed to classification-based ones (for example, asking the model to classify whether each individual candidate answer is correct or not (e.g., Zhou et al., 2019)) for two reasons: it presents a stricter measure that penalizes any incorrect answer being ranked over a correct answer, and it is not influenced by the specific choice of the threshold hyperparameter that separates positive from negative predictions.
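The 2-best accuracy metric can be sketched as below. The function names and the toy scores are hypothetical; the sketch assumes each instance provides one score per option and the indices of its two gold answers.

```python
def top_two(scores):
    """Indices of the two highest-scoring options."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(ranked[:2])


def two_best_accuracy(all_scores, all_gold):
    """Fraction of instances whose top-2 predictions are exactly the two gold answers."""
    hits = sum(top_two(scores) == set(gold) for scores, gold in zip(all_scores, all_gold))
    return hits / len(all_gold)


# Two toy instances with four options each; options 0 and 1 are the gold answers.
scores = [
    [0.9, 0.8, 0.1, 0.2],  # both gold answers ranked on top: counted as correct
    [0.9, 0.1, 0.8, 0.2],  # an incorrect option outranks a gold answer: an error
]
gold = [[0, 1], [0, 1]]
print(two_best_accuracy(scores, gold))
```

Because the metric depends only on the ranking of the four options, no decision threshold needs to be tuned.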

Model Performance
Overall. The generation model based on T5-LARGE and finetuned on the in-domain DailyDialog data achieves the best performance. However, its 2-best accuracy (74.8) lags far behind the human performance, demonstrating the difficulty of the TIMEDIAL challenge set.
Table 6: Example prediction errors made by different models for cases with challenging options, based on the phrase and numeral matching rules (§3). GOLD denotes the true labels. The model predictions show that the models get confused by learning shallow text matching in terms of pre-existing temporal concepts (marked by bold faced text) in the context.

Zero-shot vs. out-of-domain vs. in-domain. When comparing the different training data setups, we observe that models with in-domain training using the DailyDialog data (e.g., LARGE-IN) consistently outperform those trained on the large out-of-domain Meena dataset (e.g., LARGE-OUT). Both setups outperform the zero-shot models without any fine-tuning (e.g., LARGE-ZERO). The results show that the large LMs still highly depend on in-domain or at least dialog data to grasp and enhance their temporal reasoning ability in dialog context. Further, we see increasing performance with increasing model size, which is not unexpected given the complexity of the task.

Error Analysis
Next, we analyze the different types of errors based on the different rules for negative option creation in the annotation process. In particular, the phrase matching rule picks an exact time span from the dialog context, and numeral matching picks numerals from the dialog context. Thus, a model that picks those incorrect options is relying on spurious shallow text matching features. Figure 2 shows the percentage of errors in terms of the different rules. For example, the BERT-based classification model CLS-IN erroneously picks 52% of negative options created by the phrase matching rule as correct answers (i.e., by ranking those negative options over the true correct options). We observe that the various models are all most vulnerable to the phrase matching options compared to other types of negative options, showing that they rely on spurious text matching to a significant extent. Between BERT and T5, we find T5 to be more robust to shallow text matching.
Table 6 provides further examples of prediction errors, illustrating confusions due to shallow text matching. In the first dialog, both incorrect answers already partially occur in the context or are related to pre-existing concepts (i.e., "three" to "three o'clock", and "nine" to "September"). All three models were confused and chose either of the two as the top prediction for the blank, even though the options clearly violate the context. Interestingly, the mask filling model was completely confused and ranked both incorrect answers over the correct ones. Similarly, in the second example, the models fail to capture the contextual semantics.

Influence of Dialog Context
Table 7 shows how different scopes of dialog context (§4.2) affect model performance. First, the most restrictive target-only context is insufficient for accurate reasoning, producing the weakest performance for most models. This highlights the importance of context information for temporal commonsense reasoning in dialog, which differs from previous temporal reasoning studies based on limited context (e.g., single-sentence question answering). Second, we note that the full dialog context does not always lead to the best performance. In 5 out of the 12 cases, using the local context yields equal or higher reasoning accuracy. The results show that the LMs still fall short of properly modeling the rich dialog contexts and making effective use of all information to do reasoning.

Related Work
Some recent work has focused on building challenging benchmarks for temporal commonsense reasoning. Story Cloze Test focuses on stereotypical temporal and causal relations between events (Mostafazadeh et al., 2016). Vashishtha et al. (2020) recast temporal reasoning datasets for event duration and event ordering into the natural language inference (NLI) format. Turque is a reading comprehension dataset where the model needs to answer questions such as "what happens before/after [event]". Most related to our work is McTaco (Zhou et al., 2019), a dataset for evaluating temporal commonsense in the form of multiple-choice reading comprehension, where the context usually consists of a single sentence. Our work instead studies temporal commonsense reasoning in dialogs, which often require significant commonsense and world knowledge to reason over rich context (Qin et al., 2019b; Dinan et al., 2018).
Commonsense reasoning with LMs. With the recent success of large pre-trained language models (LMs) (Devlin et al., 2019; Brown et al., 2020), it is an open question whether these models, pretrained on large amounts of data, capture commonsense knowledge. Several works have been proposed to assess the ability of LMs for commonsense or numerical reasoning (Zhang et al., 2020; Bouraoui et al., 2020), or to mine commonsense knowledge from LMs (Davison et al., 2019). Lin et al. (2020) showed that state-of-the-art LMs such as BERT and RoBERTa perform poorly on numerical reasoning tasks without any finetuning. Works have also been proposed to improve language models' commonsense reasoning (Qin et al., 2020, 2019a) and numerical reasoning abilities (Geva et al., 2020). In our work, we study several modeling approaches and finetuning settings of large LMs, and establish strong baselines for temporal commonsense reasoning in dialogs.

Conclusions
We introduced TIMEDIAL, a challenge set consisting of 1.1K multiple-choice cloze questions for temporal commonsense reasoning in dialog. The dataset is carefully curated to evaluate a model's ability to do temporal commonsense/numerical reasoning over dialog context. In order to establish strong baselines and provide information on future model development, we conducted extensive experiments with state-of-the-art language models with different settings: the scope of context, weak supervision strategies, and learning objectives. While humans can easily answer these questions (97.8% accuracy), even our best model variant (T5-large with in-domain training) struggles on this challenge set (74.8%). Moreover, our qualitative error analyses show that these large language models often rely on shallow, spurious features (particularly text matching) when answering these questions, instead of truly doing reasoning over the context.