Time-Aware Representation Learning for Time-Sensitive Question Answering

Time is a crucial factor in real-world question answering (QA) problems. However, language models have difficulty understanding the relationships between time specifiers, such as 'after' and 'before', and numbers, since existing QA datasets do not include sufficient time expressions. To address this issue, we propose a Time-Context aware Question Answering (TCQA) framework. We introduce a Time-Context dependent Span Extraction (TCSE) task and build a time-context dependent data generation framework for model training. Moreover, we present a metric that uses TCSE to evaluate the time awareness of QA models. Each TCSE instance consists of a question and four sentence candidates classified as correct or incorrect with respect to time and context. The model is trained to extract the answer span from the sentence that is correct in both time and context. A model trained with TCQA outperforms baseline models by up to 8.5 F1 points on the TimeQA dataset. Our dataset and code are available at https://github.com/sonjbin/TCQA


Introduction
Question Answering (QA) models (Devlin et al., 2019; Clark et al., 2020) have achieved significant success in recent years. However, most existing QA models fail to understand time (Chen et al., 2021), since most QA datasets (Rajpurkar et al., 2018; Kwiatkowski et al., 2019) lack temporal information. Ignoring temporal constraints when answering questions can lead to inaccurate or unreliable results (Chen et al., 2022). For instance, as shown in Figure 1, neglecting time while extracting the answer may lead to the selection of an incorrect entity, 'Katie'.
To overcome this limitation, language models must be able to incorporate temporal information into their comprehension of the context in which a question is asked. This requires the model to recognize temporal expressions within the text and to understand the relationship between time specifiers and numerical values. For example, asking about anything that happened 'after 2020' versus 'before 2020' are entirely different questions, even though they include the same number. Therefore, models must be capable of comprehending the connection between time specifiers and numbers beyond simple numerical comparison.
This study investigates methods for enhancing the performance of QA models on time-sensitive tasks. Specifically, we aim to develop a model that can process temporal information and utilize it to answer time-sensitive questions precisely. Injecting time awareness and numeracy into QA models is challenging, since there are many possible temporal expressions and the model must treat time information as an independent part of the context. We therefore propose the Time-Context aware Question Answering (TCQA) framework to address this issue. We train the model through a Time-Context dependent Span Extraction (TCSE) task and contrastive time representation learning.
In this paper, our contributions are:
• We propose the TCQA framework, which combines TCSE and contrastive time representation learning, and generate synthetic data to enhance the model's ability to understand time expressions.
• We demonstrate that training the model with TCQA can improve the time awareness of QA models.
• We introduce a new metric to evaluate QA models in terms of time and context awareness.

Related Work
Several previous works have addressed temporal reasoning in question answering using knowledge graphs. Shang et al. (2022) proposed a novel framework for handling complex temporal questions that involve time ordering. Saxena et al. (2021) jointly train the model using text with timestamps. However, these approaches may not be sufficient for time-sensitive QA tasks, as temporal knowledge graphs typically handle only structured time information such as (Barack Obama, position held, President of USA, [2009, 2017]). Despite these efforts, there remains a gap in research on handling diverse time expressions and numerical reasoning in time-sensitive QA tasks. Chen et al. (2021) attempted to address this gap by constructing a dataset of time-sensitive question-passage pairs. Their analysis revealed that existing language models often fail to adequately consider temporal constraints in such tasks, resulting in significantly lower performance than humans.

Method
We propose the Time-Context aware QA (TCQA) framework to improve the performance of models on time-sensitive QA tasks.

Synthetic Time-Sensitive Data Generation
Data generation for TCSE involves constructing question-context templates. A question-context template is a pair consisting of a question in which the time constraint is masked and a context in which the time information and target entity are masked, as shown in Figure 2.
We extract time-related sentences from Wikipedia articles. Then, a question is generated for each extracted sentence using a generation model (Raffel et al., 2020). We create a template from each question-sentence pair by replacing the person entity and the time expression with the special tokens '[NAME]' and '[TIME]', respectively.
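The templating step can be sketched as follows. This is a minimal illustration, not the paper's exact implementation: the person entity is assumed to be already identified (e.g., by an NER pass), and the regular expressions are a simplification of the actual time-expression matching.

```python
import re

def make_template(sentence: str, person: str) -> str:
    """Mask the person entity and the time expression with special tokens.

    `person` is assumed known from a prior NER step; the year regex is a
    simplified stand-in for full time-expression matching.
    """
    templ = sentence.replace(person, "[NAME]")
    # Replace "specifier + four-digit year" first, keeping the specifier.
    templ = re.sub(r"\b(in|after|since|before|until|from)\s+\d{4}\b",
                   r"\1 [TIME]", templ)
    # Catch any remaining bare years.
    templ = re.sub(r"\b\d{4}\b", "[TIME]", templ)
    return templ

print(make_template("Harry joined Acme Corp in 2011.", "Harry"))
# "[NAME] joined Acme Corp in [TIME]."
```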
To obtain time-sensitive question-context pairs, we utilize a time pair generation process and employ the 'names' Python module, which randomly generates a person's name.
We generate random time pairs through rule-based matching of time specifiers and years. To simplify template generation, we assume that all events continue indefinitely when generating year numbers. We adopt seven time specifiers: {in, after, since, before, until, between, from}. We generate positive time expressions that match the time range of the question and negative time expressions that do not. We exclusively use the time specifier 'in' when generating time expressions for the context, to facilitate model training. For example, if the question time is 'before 1995', a positive time is a year earlier than 1995 and a negative time is a year later than 1995. We randomly select one of the context templates to obtain a negative context.
As depicted in Figure 2, we obtain a positive and a negative context and time for each question. This allows us to produce sentences that are correct in both context and time (BC), correct only in context (CC), correct only in time (TC), or incorrect in both (BI) for the corresponding question.
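The rule-based generation of positive/negative years and the assembly of the four candidate types can be sketched as below. The year range, the boundary handling per specifier, and the function names are our assumptions; only a few of the seven specifiers are shown ('between' would need two years).

```python
import random

def sample_years(specifier: str, q_year: int, lo: int = 1900, hi: int = 2020):
    """Return (positive_year, negative_year) for a question's time constraint.

    A positive year falls inside the question's time range, a negative year
    outside it. Boundary handling here is a simplification.
    """
    if specifier == "before":
        return random.randint(lo, q_year - 1), random.randint(q_year, hi)
    if specifier == "after":
        return random.randint(q_year + 1, hi), random.randint(lo, q_year)
    if specifier in ("since", "from"):
        return random.randint(q_year, hi), random.randint(lo, q_year - 1)
    # 'in', 'until', 'between' would need their own rules.
    return q_year, random.randint(lo, hi)

def make_candidates(pos_ctx: str, neg_ctx: str, pos_t: int, neg_t: int):
    """Fill context templates to obtain the four candidate types.

    Context-side time expressions always use the specifier 'in'.
    """
    fill = lambda ctx, t: ctx.replace("[TIME]", f"in {t}")
    return {"BC": fill(pos_ctx, pos_t),   # correct context, correct time
            "CC": fill(pos_ctx, neg_t),   # correct context only
            "TC": fill(neg_ctx, pos_t),   # correct time only
            "BI": fill(neg_ctx, neg_t)}   # incorrect in both

pos_y, neg_y = sample_years("before", 1995)
print(pos_y < 1995, neg_y >= 1995)
```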

Time-Context dependent Span Extraction
We train the model in a multi-task setting using both the reading comprehension and TCSE tasks. The loss for the reading comprehension task, denoted L_RC, is the sum of the cross-entropy losses between the ground-truth and predicted distributions of the start and end indices. The TCSE task adopts the same loss function, but with the answer span set to the target entity in the 'BC' context.
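The span-extraction loss described above can be sketched in plain numpy; this is an illustrative stand-in for the cross-entropy typically computed over a QA head's start/end logits, not the paper's code.

```python
import numpy as np

def span_loss(start_logits, end_logits, start_idx, end_idx):
    """Sum of cross-entropy losses for the start and end positions.

    The same function serves L_RC and L_TCSE; for TCSE the gold span
    (start_idx, end_idx) is the target entity in the BC sentence.
    """
    def cross_entropy(logits, gold):
        logits = np.asarray(logits, dtype=float)
        shifted = logits - logits.max()          # numerical stability
        log_probs = shifted - np.log(np.exp(shifted).sum())
        return -log_probs[gold]
    return cross_entropy(start_logits, start_idx) + cross_entropy(end_logits, end_idx)
```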

Contrastive Representation Learning
In order to enhance time awareness, we employ contrastive learning. We construct contrastive samples from the TCSE data by pairing questions and contexts. For each sample in the TCSE data, only the BC context among the four contexts corresponds to the question. Therefore, each question yields one positive pair (BC) and three negative pairs (CC, TC, and BI). The contrastive loss is calculated from the cosine similarity between the embeddings of the question (v_q) and the context (v_c) for each pair. The distance between positive pairs should be minimized, while the distance between negative pairs should be maximized. Consequently, a contrastive label Y is assigned: Y=1 signifies that the context is a positive sample for the question, whereas Y=0 indicates a negative sample. The contrastive loss is then computed as follows:

L_Contrast = w_p · Y · (1 − cos(v_q, v_c)) + w_n · (1 − Y) · max(0, cos(v_q, v_c))   (1)

where w_p and w_n are the weights of the positive and negative samples, respectively.
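A numpy sketch of a cosine-similarity contrastive loss consistent with this description follows; since Equation (1) is garbled in the source, the exact weighting and the (optional) margin are our assumptions.

```python
import numpy as np

def contrastive_loss(v_q, v_c, y, w_p=1.0, w_n=1.0, margin=0.0):
    """Contrastive loss over one question/context embedding pair.

    y = 1: positive pair (BC context)  -> pull cosine similarity toward 1.
    y = 0: negative pair (CC, TC, BI)  -> push cosine similarity down.
    w_p, w_n weight positive/negative terms; `margin` is an assumption.
    """
    v_q = np.asarray(v_q, dtype=float)
    v_c = np.asarray(v_c, dtype=float)
    sim = v_q @ v_c / (np.linalg.norm(v_q) * np.linalg.norm(v_c))
    return y * w_p * (1.0 - sim) + (1 - y) * w_n * max(0.0, sim - margin)
```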

Joint Training
The final loss is defined as a weighted sum of the answer-span prediction loss (L_RC), the TCSE loss (L_TCSE), and the contrastive loss (L_Contrast): L = λ_1 L_RC + λ_2 L_TCSE + λ_3 L_Contrast, where λ_1, λ_2, and λ_3 are scalar weights.

Evaluation Metric of Time Awareness
To evaluate the TC-score of a model, we generate a test set for the TCSE task using time-related sentences. The TC-score allows for a comprehensive evaluation of a model's performance in terms of both time and context awareness.

Dataset
TimeQA (Chen et al., 2021) is a reading comprehension dataset that involves complex temporal reasoning. TimeQA consists of two subsets, easy and hard mode, which differ in the level of temporal reasoning required. We use the hard-mode subset, as it involves reasoning with more complex time expressions.

Baselines
BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), and ALBERT (Lan et al., 2019) are large pre-trained language models widely used in QA tasks. In our experiments, we use the base models fine-tuned on SQuAD 2.0 (Rajpurkar et al., 2018). BigBird (Zaheer et al., 2020) is a language model developed to handle long input sequences.

Time-Sensitive Question Answering
We evaluate time-sensitive QA performance on the TimeQA (Chen et al., 2021) dataset. The results in Table 1 demonstrate that models trained with TCQA outperform the baseline models. The BERT model further trained with TCQA shows a significant improvement of 8.5 F1 points over the model trained only on TimeQA. This improvement results from the model learning to distinguish correct time expressions. The performance gap between BigBird and the other models can be attributed to the difference in their maximum input lengths.

Time and Context Awareness
We evaluate the model's time awareness (TA) and context awareness (CA) using the TC-score. The results suggest the importance of learning time expressions while maintaining contextual understanding. We use the TC-score to provide an overall assessment of the model's performance.
We found that the model's contextual awareness decreased but its time awareness improved significantly, resulting in an improved TC-score. We do not compute the TC-score for models trained with TCQA, because those models have already learned the TCSE task. Instead, TCQA is assessed using an alternative approach in Appendix 5.5.

Analysis on Time Specifier
We analyze model performance on TimeQA according to the time specifier included in the question. Figure 3 shows the EM-score difference for four time specifiers: {in, between, after, before}. There are comparatively substantial improvements on the time specifiers 'after' and 'before'. This demonstrates that TCQA effectively trains the model to understand time ranges. However, the improvement on the time specifier 'between' is comparatively small, since 'between' is more difficult: it requires simultaneous consideration of two distinct time boundaries.

Ablation Study
We conducted an ablation study to assess the impact of contrastive representation learning. Table 3 compares the performance of models trained with and without the contrastive loss. Adding the contrastive loss improved the performance of the BERT, RoBERTa, and ALBERT models. However, it led to a slight decrease in performance for the BigBird model. A possible reason for this discrepancy is that the embedding size of the BigBird model is eight times greater than that of the other models; we can therefore infer that the data used for contrastive learning was insufficient for it.

Reading Comprehension Performance
We conduct a comparative analysis of models on the SQuAD v2 dataset to investigate the effect of TCQA on context awareness (Table 4).

Conclusion
In this paper, we demonstrated that existing QA models are inadequate at understanding time expressions. To address this problem, we proposed TCQA, which enables models to learn time expressions while maintaining their understanding of context. We constructed question-context templates to generate time-context dependent data for TCSE and contrastive learning, and jointly trained the model. Our experimental results showed that TCQA improves the performance of QA models on TimeQA. Additionally, we proposed a new evaluation metric, the TC-score, and showed a gap in performance between models in terms of time and contextual understanding. Future research should focus on advancing temporal reasoning capabilities beyond the comprehension of simple temporal expressions.

Ethical Consideration
This paper presents a synthetic data generation framework that modifies time information and names while retaining the original text. Notably, this approach does not produce any unintended harmful effects, as it does not alter the semantic content of the original text beyond the specified modifications.

Limitations
A limitation of our approach is that TCSE does not cover all kinds of time expressions, because we construct the data with only seven time specifiers. Although adding more time expressions could further enhance the model's time awareness, our experiments showed that including only these seven already leads to a performance improvement.

Figure 1 :
Figure 1: Example of the Time-Context dependent Span Extraction (TCSE) task. The passage consists of four types of sentences, depending on whether each sentence matches the time and context of the question. The target span is 'Harry' in this example.

Figure 2 :
Figure 2: Four types of candidates, namely BC, CC, TC, and BI, are derived from question-context templates via time-expression generation and random name substitution.

Table 1 :
Table 1: Performance of the baseline models, models trained with TimeQA data, and models trained with the proposed method. We evaluate on the TimeQA test set; all results are averaged over three runs. Our method outperforms the baseline models.

Table 2 :
Comparison among the F1-score on TimeQA and the scores on the TCSE task: Time Awareness (TA), Context Awareness (CA), and Time-Context awareness score (TC-score) of the BigBird_RoBERTa model, according to the data used for fine-tuning. Table 2 indicates that the F1-score and TA exhibit similar trends, implying that TA is a reliable indicator of time awareness. We observed that training with TimeQA resulted in a decrease in contextual understanding, as evidenced by a 9.46-point drop in CA.


Table 3 :
Results of the ablation study on contrastive representation learning.

Table 4 :
Performance of BigBird model on SQuAD v2 development set.