The challenges of temporal alignment on Twitter during crises

Language use changes over time, and this impacts the effectiveness of NLP systems. This phenomenon is even more pronounced in social media data during crisis events, where the meaning and frequency of word usage may change over the course of days. Contextual language models fail to adapt temporally, emphasizing the need for temporal adaptation in models that must be deployed over extended periods of time. While existing approaches consider data spanning large periods of time (from years to decades), shorter time spans are critical for crisis data. We quantify temporal degradation in this scenario and propose methods to cope with performance loss by leveraging techniques from domain adaptation. To the best of our knowledge, this is the first effort to explore the effects of rapid language change, and to counter them via adversarially driven adaptation, particularly during natural and human-induced disasters. Through extensive experimentation on diverse crisis datasets, we analyze under what conditions our approaches outperform strong baselines, while highlighting the current limitations of temporal adaptation methods in scenarios where access to unlabeled data is scarce.


Introduction
Patterns of language use change constantly over time, often in predictable and analyzable ways (Hamilton et al., 2016a; Kulkarni et al., 2015; Sommerauer and Fokkens, 2019). As language changes, the performance of NLP systems can be negatively impacted (Lazaridou et al., 2021). In most scenarios, training corpora are derived from a snapshot of data at some moment in the past, which puts the reliability of model performance on future data into question. Yet there is little concrete reasoning or evidence on whether temporal adaptation improves model performance. Despite the popularity of large language models and their usefulness in many NLP domains (Devlin et al., 2019), the representation of temporal knowledge in those models so far remains an open challenge.
The increased interest in temporal adaptation (i.e. scenarios in which the training and test datasets are drawn from different periods of time) has led to the curation of a number of datasets, such as the NYT Annotated Corpus (Sandhaus, 2008) and Amazon Reviews (Ni et al., 2019), that have been the focus of most of the recent work in this area. However, these benchmark datasets are curated in such a way that they can only capture temporal change of language over long periods of time (from years to decades), giving access to a large amount of data. On the contrary, on social media, language changes can happen rapidly (Kulkarni et al., 2015; Eisenstein, 2013). Word usage and topics can even change over the span of a single day (Golder and Macy, 2011), especially during very dynamic scenarios like crises or disastrous events (Reynolds and Seeger, 2005; Del Tredici et al., 2019). We denote these phenomena induced by linguistic and semantic changes over time as temporal drift.
Accounting for temporal drift is critical in crisis situations, in which information patterns can vary greatly between the phases of emergency management. For this purpose, we study short text classification in crisis situations. Given the time-critical nature of crisis scenarios, gathering annotations is too time-consuming, and transfer learning is challenging due to the innate differences among the types of events (hurricane vs. earthquake) and the respective information needs. Thus, we offer a study investigating the impact of temporal drift on crisis datasets spanning shorter time periods (days/weeks), as well as datasets with relatively few samples (ranging from ∼1k to 22k).
In summary, we make the following contributions: 1. We investigate temporal drift during crisis events and its adverse effect on task performance. To the best of our knowledge, this is the first study of temporal effects on text classification performance in crisis scenarios, where temporal drift is rapid and access to data is scarce.
2. We investigate the role of the domain of data in temporal drift and propose a simple metric to quantify the impact of temporal degradation on task performance.
3. We propose methods that adapt future data to known models, improving performance with no additional labeled data.
4. Through experiments on a multitude of diverse text classification datasets collected during crisis events, we analyze the effectiveness of our proposed methods over strong baselines.

Related Work
Analyzing semantic change of text over time has been of great interest since the pioneering work by Hamilton et al. (2016b) and others (Kutuzov et al., 2018; Rudolph and Blei, 2018; Martinc et al., 2020; Gonen et al., 2020). However, its influence on downstream task performance has only recently gained attention. Most importantly, the advent of contextualized word embeddings and large pretrained language models has led researchers to re-evaluate the role of temporality in language modeling (Jawahar and Seddah, 2019; Lazaridou et al., 2021; Hofmann et al., 2021; Kulkarni et al., 2021) and text classification (Bjerva et al., 2020; Florio et al., 2020; Röttger and Pierrehumbert, 2021; Agarwal and Nenkova, 2022).
Performance degradation due to temporal factors has been confirmed in several studies and across multiple domains. Jaidka et al. (2018) analyzed the temporal performance degradation of age and gender classification models based on users' social media posts. Based on features derived from Latent Dirichlet Allocation and word embeddings, they find that models perform best if test and training data come from the same time span. Florio et al. (2020) investigated temporal effects on hate speech detection in Italian social media over a period of five months. Their results suggest that transformer-based models trained on data temporally closer to the test data perform better. Loureiro et al. (2022b) studied semantic shifts in social media and proposed a dataset annotated with words that have undergone a semantic shift over the past two years. Loureiro et al. (2022a) focus on Twitter as text domain and contribute pretrained language models which have been further trained on time-specific data from Twitter. Bjerva et al. (2020) propose sequential subspace alignment (SSA) to adapt contextualized word embeddings to language change over time. Their results suggest that SSA applied to past data is able to outperform baselines which have access to data from all time steps. Röttger and Pierrehumbert (2021) compared time-agnostic domain adaptation with temporal domain adaptation, which considers the temporal order of the data. They found that, while temporal adaptation clearly outperforms domain adaptation in language modeling, this does not necessarily translate into downstream classification performance, since the updated tokens may not be relevant for the task. Agarwal and Nenkova (2022) found temporal performance deterioration to be less significant when using language representations pretrained on temporally closer data.
Finally, Luu et al. (2022) conducted a large-scale study of temporal misalignment, the generalized scenario where training and evaluation data are drawn from different periods of time. Across multiple NLP classification tasks and domains, they identify performance degradation to varying degrees, with social media and news being the most affected domains.
We contribute to this line of work by quantifying the temporal effects on downstream task performance over short time periods (days and weeks) during crisis events. In such a scenario, and in contrast to previous work, we do not assume access to large corpora of unlabeled data for temporal adaptation via continuous pretraining. Our proposed approaches temporally adapt pretrained contextualized embeddings to learn time-aware embeddings, and we evaluate their effects on downstream classification tasks.

Methods Overview
Luu et al. (2022) describe three distinct stages of a typical NLP system: a pretraining stage, a domain (or temporal) adaptation stage, and a fine-tuning stage. Separating the adaptation and fine-tuning stages makes the implicit assumption that there is access to unlabeled data from the (temporal) target distribution, which has been proven beneficial for temporal adaptation (Luu et al., 2022). In contrast, we are looking at the dynamic setting during crisis events. Temporal alignment through continuous pre-training is not feasible due to the lack of unlabeled data and the time constraints imposed by the application scenario (e.g. crisis monitoring). The latter also limits the feasibility of an online learning setup, which requires new annotations in a continuous stream. Finally, transfer learning is difficult due to inherent differences in information needs (i.e. the type of labels) and domains (e.g. hurricane vs. earthquake).
Therefore, in this section we adapt and evaluate methods which are specifically designed for combining temporal adaptation and fine-tuning. Their training procedures are adapted to incorporate temporal information about the data along with the textual input. We describe each approach in the following:

Adapted Language Modelling (ALM)
Similar to previous work (see Section 2), we explore temporal adaptation via pretraining, but use only the available training data. We therefore continue with the language modeling objective of our respective pretrained language model on the training data and use the resulting fine-tuned model (FT) for downstream task training. Following Dhingra et al. (2022), we investigate a variation for temporal modelling (TM) by concatenating time as textual information to the input, to encourage the language model to learn temporally relevant features during pretraining.
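As a minimal illustration, the TM input augmentation can be sketched as follows. The function name and the exact timestamp rendering are our own assumptions, not the paper's implementation:

```python
from datetime import datetime

def prepend_timestamp(text: str, timestamp: datetime) -> str:
    """Prefix a tweet with a textual rendering of its timestamp (TM variant).

    The date format is an illustrative assumption; the idea, following
    Dhingra et al. (2022), is that the LM can attend to the temporal
    context during continued masked language modeling.
    """
    return f"{timestamp.strftime('%Y-%m-%d')} {text}"

# The augmented string is then fed to the usual MLM objective.
augmented = prepend_timestamp("Power is out across the borough",
                              datetime(2012, 10, 29))
```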

DCWE: Dynamic Contextualized Word Embeddings
Hofmann et al. (2021) introduced a principled way to impart extra-linguistic knowledge into contextualized word embeddings by involving a prior distribution. This enables us to integrate temporal information into the embeddings during training. More specifically, for each temporal snapshot (e.g. days, months, years, etc.) present in the training data, an additional set of parameters is learned which acts as a temporal offset added to the original word embeddings. This way the model is able to maintain the semantic meaning of a word embedded in its temporal context. We adapt this idea to our setting by introducing additional parameters for shifting the pre-trained contextualized embeddings. Given a sequence of words/tokens W = [w_1, w_2, ..., w_n] and their corresponding pre-trained embeddings H = [h_1, h_2, ..., h_n], to account for the temporal effect on word meanings, we model the word embeddings as a function of the temporal context t associated with W.
Since the meanings of most words in the vocabulary are temporally stable, we can place a Normal prior on h*_i. Hence, we write h*_i = h_i + d_i, where the offset d_i is normally distributed as d_i ~ N(0, λ^{-1} I). Crucially, pre-trained LMs make this temporal adaptation easily applicable to any task, since we only add a regularization term L_temporal on top of the task-specific loss L_task. For training the model, the overall loss L = L_task + L_temporal is minimized. Similarly to Hofmann et al. (2021), we use K = 103 from Bamler and Mandt (2017) to enforce that the h*_i change smoothly over time.
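The combined loss can be sketched in NumPy as follows. This is a simplified illustration: the smoothness term from Bamler and Mandt (2017) is omitted, and the quadratic penalty corresponds to the negative log of the Normal prior on the offsets up to additive constants:

```python
import numpy as np

def dcwe_loss(task_loss: float, offsets: np.ndarray, lam: float = 1.0) -> float:
    """Total DCWE-style loss L = L_task + L_temporal.

    `offsets` has shape (n_tokens, dim) and holds the temporal offsets d_i
    in h*_i = h_i + d_i. The prior d_i ~ N(0, lam^-1 I) contributes
    (lam / 2) * ||d_i||^2 per token, which regularizes the offsets
    toward zero (i.e. toward temporally stable meanings).
    """
    l_temporal = 0.5 * lam * float(np.sum(offsets ** 2))
    return task_loss + l_temporal
```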

LMSOC: Socio-temporally Sensitive Language Modeling
Similar to DCWE, Kulkarni et al. (2021) propose to learn extra-linguistic context using graph representation learning algorithms and then prime language models with it to generate language representations grounded in a socio-temporal context. We model the temporal order information as a linear chain graph and adapt this method to our setting by appending temporal graph embeddings to the initial layers of the pre-trained language model. During fine-tuning of the language model, the graph embeddings are kept frozen to inductively yield temporally-aware embeddings.
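The mechanics of appending a frozen per-snapshot embedding can be sketched as follows. LMSOC learns the node embeddings with a graph representation learner; the sinusoidal features below are only a toy stand-in for embeddings of a linear temporal chain, and both function names are our own:

```python
import numpy as np

def linear_chain_time_embeddings(num_snapshots: int, dim: int) -> np.ndarray:
    """Toy stand-in for graph embeddings of a linear temporal chain.

    Nearby snapshots get similar vectors, mimicking the smoothness a
    graph representation learner would induce on a chain graph.
    """
    positions = np.arange(num_snapshots) / max(num_snapshots - 1, 1)
    freqs = np.arange(1, dim + 1)
    return np.sin(np.outer(positions, freqs))  # shape (num_snapshots, dim)

def append_time_embedding(token_embs: np.ndarray, time_emb: np.ndarray) -> np.ndarray:
    """Concatenate one frozen snapshot embedding to every token embedding."""
    tiled = np.tile(time_emb, (token_embs.shape[0], 1))
    return np.concatenate([token_embs, tiled], axis=1)
```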

TAPH: Time Aware Projection on Hyperplanes
Time adds an additional context, or dimension, to knowledge, making temporal scoping imperative when deriving contextual embeddings. Therefore, we model temporal information as a hyperplane and define a projection operation (Wang et al., 2014) on it. To build a time-invariant classification model, we project the sentence embedding (Reimers and Gurevych, 2019) of each text onto a hyperplane to obtain a time-aware sentence embedding. We describe the method in more detail below.
Let X = [x_1, x_2, ..., x_n] be a given sequence of words and H be its sentence embedding. Since the temporal span of our data is short, we assume that a single temporal hyperplane w_t represents the time frame of the training data. We derive the time-aware sentence embedding H_t using the projection operation H_t = H − (w_t^T H) w_t, with w_t of unit norm. While training the model, we learn the hyperplane representation w_t in addition to fine-tuning the pre-trained embeddings in an end-to-end fashion. During inference, we assume that we can 'teleport' the data to the past by projecting their sentence embeddings onto the hyperplane w_t in order to revert their temporal changes. We then use these embeddings in the downstream tasks.
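Assuming the standard TransH-style projection from Wang et al. (2014), the operation can be sketched as:

```python
import numpy as np

def project_on_hyperplane(h: np.ndarray, w_t: np.ndarray) -> np.ndarray:
    """Project a sentence embedding h onto the temporal hyperplane w_t.

    Follows the TransH-style projection h_t = h - (w^T h) w with w of
    unit norm (normalized here for safety). The result has no component
    along the hyperplane normal w_t, i.e. the temporal direction is
    removed from the embedding.
    """
    w = w_t / np.linalg.norm(w_t)
    return h - np.dot(w, h) * w
```

By construction, the projected embedding is orthogonal to the hyperplane normal, which is what makes it "time-invariant" in this simplified reading.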

TDA: Temporal Domain Adaptation
Temporal adaptation can also be interpreted as a variant of domain adaptation, with the difference that the language change happens within the same domain, e.g. induced by external events or the general dynamic characteristics of the source infrastructure (e.g. social media platforms or news outlets). We adapt a widely used domain adaptation method (Ramponi and Plank, 2020) to our setting. We learn time-aware word representations by adding an additional classification layer during training to predict the time of each text and apply the Gradient Reversal method (Ganin et al., 2016). In this way, the input does not change during the forward pass, but the additional layer affects the model parameters during back-propagation through an additional penalizing factor. This acts as an adversarial training objective, forcing the model to adapt to the temporal structure of the data.
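The core of the Gradient Reversal method can be illustrated by its two passes. This is a schematic sketch only; in practice it is implemented as a custom autograd function (e.g. `torch.autograd.Function`), not as two standalone NumPy functions:

```python
import numpy as np

def grl_forward(x: np.ndarray) -> np.ndarray:
    """Gradient Reversal Layer: identity in the forward pass."""
    return x

def grl_backward(grad_output: np.ndarray, lam: float = 1.0) -> np.ndarray:
    """Backward pass: flip the sign of the incoming gradient (scaled by lam).

    The time-prediction head tries to minimize its loss, while the
    reversed gradient pushes the shared feature extractor AWAY from
    encoding the time bin, yielding time-invariant features.
    """
    return -lam * grad_output
```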

Data
We identify a collection of social media data during crises with observable temporal phases (pre-, acute- and post-crisis), rapid change in language, and a natural change in distribution over time, enabling us to evaluate how well temporally adapted models generalize over time. We use three datasets sampled from Twitter: Sandy, T26, and Humaid. We provide an overview here and refer to Appendix A for dataset details.
Sandy The dataset by Stowe et al. (2018), collected during hurricane Sandy in 2012, contains approximately 22,000 tweets spanning 17 days centered on landfall in New York City, annotated for binary relevance to the storm and its effects. The tweets were collected by first identifying users impacted by the event, then retroactively pulling their data from before, during, and after the event. As opposed to keyword collection, this provides a relatively broad collection of both relevant and non-relevant tweets and a more complete dataset for evaluating temporal drift, as each tweet does not necessarily contain the same keyword(s).
T26 The CrisisLex T26 (T26) dataset (Olteanu et al., 2015) includes labeled tweets for 26 different crisis events, labeled by informativeness into four different categories: (1) related to the crisis and informative, (2) related to the crisis but not informative, (3) not related to the crisis, and (4) not applicable. This collection reflects a wide variety of events covering natural and human-created emergencies, with the added difficulty that the individual datasets are relatively small, each event containing only approximately 1,000 tweets.
Humaid The Humaid dataset (Alam et al., 2021) is similar to T26, containing data about 19 different events, with dataset sizes ranging from 575 to 9,467 tweets. The tweets are annotated with 11 different classes designed to capture fine-grained information related to disaster events.

Data Splits
We follow previous work (Lazaridou et al., 2021; Agarwal and Nenkova, 2022) and create time-based data splits to assess the temporal performance degradation. Specifically, we use three variants of dataset splits: CONTROL, TEMPORAL and PROGRESSIVE. We illustrate this in Figure 2.
TEMPORAL Setup First, we split the entire data into two halves which cover equally sized time periods. We call these the first temporal half and the second temporal half, respectively. In the TEMPORAL setting, we use all the data from the first temporal half as training data and a test set comprised of a randomly sampled 50% of the data from the second temporal half of a dataset. This evaluates the model's temporal generalization capabilities on test data from a distribution temporally distant from the training data.
CONTROL Setup To assess whether the TEMPORAL setup constrains a model's generalization capabilities, we compare its performance with a CONTROL setup. Here, we evenly spread the training data over time frames, exposing the model to knowledge from all time periods. In this setting, the training data comprises 50% of the instances from the first temporal half along with 50% of the instances from the second temporal half, matching the total training data from the TEMPORAL setup. We use the same test set as in the TEMPORAL setup, while ensuring that there is no overlap between the train and test splits from the second temporal half.
Under the assumption that a temporal gap between training and target distribution leads to performance decay, we expect that the CONTROL setup will yield better scores, as the model has access to training instances from the same temporal distribution as the test data.
PROGRESSIVE Setup As described previously, semantic changes are likely to occur in short time spans within crisis-related data streams. Therefore, to enable a more fine-grained analysis of temporal performance decay, we simulate a scenario in which an event is progressing, we have access to all the previous data, and need to make decisions about the incoming data. In this setup, we split the entire dataset into ten temporally ordered bins with an even number of samples. Then, for each test bin B_t, we use all preceding bins B_0 to B_{t−2} for training. To identify the best performing model across all training epochs, we use bin B_{t−1} for development.
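The PROGRESSIVE splitting scheme can be sketched as follows. The function name and the assumption of evenly sized, timestamp-sorted bins are our own simplifications:

```python
def progressive_splits(items, num_bins=10):
    """Yield (train, dev, test) triples for the PROGRESSIVE setup.

    `items` must be sorted by timestamp. For each test bin B_t (t >= 2),
    train on B_0..B_{t-2} and use B_{t-1} as the development set.
    """
    size = len(items) // num_bins
    bins = [items[i * size:(i + 1) * size] for i in range(num_bins)]
    for t in range(2, num_bins):
        train = [x for b in bins[:t - 1] for x in b]
        yield train, bins[t - 1], bins[t]
```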

Baselines
For a consistent performance comparison, all proposed models use bert-base-cased as their underlying backbone model for deriving pretrained embeddings.
For the FT setup (see Section 3), we use the available training data for each dataset to run masked language modelling for three epochs to adapt the model to the data. We then fine-tune for the downstream task on the relevant training data using the updated pre-trained model. This indicates whether the domain is the issue, or whether there are additional temporal effects. In the temporal modeling (TM) setup, we follow Dhingra et al. (2022) and prepend the textual representation of the timestamp of each tweet to the tweet text, then train an additional three epochs of masked language modelling. We then fine-tune for the downstream task on the relevant training data.
Finally, we apply another baseline where we use the timestamp text as a second input to the model during supervised training, separated via a special token (i.e. [SEP] for BERT). We refer to this baseline as SEP.

Hyperparameters and Infrastructure
For a fair comparison, we run all experiments using the same hyperparameters and data splits. We use a learning rate of 1e-4, a batch size of 64, a weight decay of 1e-3, and no warmup due to the limited amount of training data. We use AdamW (Loshchilov and Hutter, 2019) as the optimization algorithm and train for three epochs. Based on the performance on the development split, we load the best performing model at the end of the training procedure.
We repeat each experiment using five different seeds and take the most frequent prediction across all runs as the final prediction of a model. All models are implemented in Python 3.6 using PyTorch 1.10.2 (Paszke et al., 2019) and the HuggingFace framework, version 4.18 (Wolf et al., 2020), as model backend. We used a computation cluster containing a mixture of NVIDIA Tesla P100 (16GB), NVIDIA A100 (40GB) and NVIDIA V100 (32GB) GPUs.

Evaluation
We report binary F1 score for Sandy and macro F1 score for the multi-class classification tasks on the T26 and Humaid datasets. The comparison of the CONTROL and TEMPORAL settings serves two purposes: first, to quantify the degradation of model performance due to temporal drift, and second, to estimate the temporal adaptation ability of our approaches. We expect that models considering temporal information should experience less performance degradation between these two settings compared to the baseline model.
Additionally, we evaluate the mean model performance in the PROGRESSIVE setting for a more fine-grained analysis of temporal degradation.
Temporal Rigidity: While analyzing the effects of temporal drift on model performance, it is necessary to quantify the resulting degradation. We quantify the temporal adaptability of a model using a metric called the Temporal Rigidity (TR) score, which summarizes the performance deterioration of a model from aligned to misaligned test data. Higher values of TR imply that the model is not able to adapt itself temporally.
We denote by f_M(B_i, B_j) the F1 score of a model M when trained on data sampled from bin B_i and evaluated on data sampled from bin B_j. We define TR as:

TR(M) = (1/N) · Σ_{i≠j} ( f_M(B_i, B_i) − f_M(B_i, B_j) ) / |i − j|,     (5)

where the normalization factor is given as N = |{(i, j) : i ≠ j}|. Unlike Luu et al. (2022), we take the temporal proximity of bins into account: the factor 1/|i − j| penalizes the model more strongly when training and test bins are temporally close but the performance degradation is nevertheless significant.

Crisis Phases: Additionally, we utilize the well-known temporal structure of crisis events (Reynolds and Seeger, 2005; Yang et al., 2013) to analyze model performance. The temporal structure of the Sandy dataset is annotated using pre-, acute- and post-crisis labels. For each model, we cluster the time-aware embeddings using the K-Means algorithm (k=3) and report the Normalized Mutual Information (NMI) score. NMI measures the correlation between the time-aware embeddings and the temporal structure of the underlying data.
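Given a matrix of per-bin scores, the TR computation can be sketched as follows, assuming the aligned score f_M(B_i, B_i) as the reference point for the deterioration on each misaligned pair (an assumption consistent with the definition above):

```python
def temporal_rigidity(f):
    """Temporal Rigidity from a score matrix f, where f[i][j] is the F1
    score of a model trained on bin i and evaluated on bin j.

    The aligned-vs-misaligned drop f[i][i] - f[i][j] is weighted by
    1/|i-j|, so degradation between temporally close bins counts more,
    and the sum is normalized by the number of misaligned pairs.
    """
    n = len(f)
    pairs = [(i, j) for i in range(n) for j in range(n) if i != j]
    total = sum((f[i][i] - f[i][j]) / abs(i - j) for i, j in pairs)
    return total / len(pairs)
```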

Results and Analysis
In this section, we attempt to answer the following questions: Q1. To what degree is temporal performance degradation present in short-term Twitter data during crisis events? (Section 5.1) Q2. Does temporal adaptation improve model performance? (Section 5.2) Q3. Does the domain of the data play a role in temporal drift? (Section 5.3) Q4. How do the proposed models perform when trained continually? (Section 5.4)

Temporal Performance Degradation
In order to estimate the degree of temporal performance degradation in the crisis scenario, we compare the classification performance of the baseline model in the CONTROL and TEMPORAL setting.
Table 1 provides the averaged performance difference for all datasets. Given that we only change the temporal distribution of the training data, the effect is substantial, with a difference in F1 of up to 6.52 points for the Sandy dataset and slightly less pronounced on the T26 (4.37) and Humaid (4.10) dataset collections. Therefore, we conclude that, even in short-term scenarios like crisis events on Twitter, the temporal distribution of the training data influences classification performance.

Data                  Sandy   T26    Humaid
CONTROL − TEMPORAL    6.52    4.37   4.10

Table 1: Temporal Performance Degradation: Averaged F1 performance difference of the CONTROL to TEMPORAL setting for the BERT baseline model. Overall, results show that contextualized language models fail to adapt temporally. Refer to Section 5.1 for details.

Performance Comparison
We summarize the results on Sandy in Table 2. Overall, we find that TDA outperforms all other methods in the TEMPORAL setting, with an absolute increase of around 1.6% over the baseline. We also observe that the difference between model performance in the CONTROL and TEMPORAL settings (DIFF) is lowest for TDA (30.8% lower than the baseline), indicating the higher robustness of the model. TAPH achieves a 1% absolute improvement over the baseline in the TEMPORAL setting (DIFF is lower by 16.9%). The T26 and Humaid datasets contain data for a multitude of events. Therefore, we aggregate model performances in Table 3 and provide detailed results per event in Appendix A.2. We see that model performance varies greatly between the Sandy dataset and the others. This is due to two main reasons: (i) Data size: most of the event datasets in T26 and Humaid are very small, so the temporal adaptation methods do not get enough training data to learn the parameters involved in temporal reasoning. To support this argument, we observe that on the "Boston Bombings (2013)" dataset of T26, which contains 81,172 annotated tweets, TDA outperforms the baseline by an absolute increase of 6.17% and TAPH comes second with an absolute improvement of 2.9% in the TEMPORAL setting, a performance pattern similar to the Sandy dataset. (ii) Data quality: unlike Sandy, T26 and Humaid have been collected using keyword-based search. This data collection technique has two main drawbacks: it restricts the data size, and it harms the completeness of the dataset, since only tweets containing the same keywords are collected. All the improvements we report are statistically significant (p < 0.05, McNemar's test).

Learning from Temporal Information: To understand the cause of the models' performance improvements, we utilize the annotated temporal structure of the Sandy dataset. In Table 4 we report two additional metrics, the TR score and NMI, in the TEMPORAL setting. Compared to the baseline, the TR score of TDA is the lowest (a 15.74% decrease), which suggests that TDA performs most robustly over time across all models. TAPH comes in second with a 9.26% decrease in TR score from the baseline. NMI scores show similar patterns, with TDA achieving the highest score. We conclude that TDA learns the most meaningful time-aware embeddings.

Effect of Domain of Data
To understand whether the data domain is the main issue behind performance degradation or whether temporal effects indeed play a significant role, we perform additional experiments. We fine-tune the initial bert-base-cased embeddings for an additional three epochs with the masked language modeling (MLM) task on the training data, before applying the temporal adaptation methods. We report the results for the Sandy dataset in Table 2. For all models, there remains a substantial performance difference between the CONTROL and TEMPORAL settings, which demonstrates the influence of temporal drift on performance. Similar to previous work (Agarwal and Nenkova, 2022), we observe that additional pre-training improves performance for most of the models. Still, TDA outperforms the baseline and TAPH comes in second.

Effect of Continual Learning:
Continual learning requires continuous annotation of incoming data, which is not feasible during crisis events. However, for the analytical completeness of this paper, we simulate continual learning in the PROGRESSIVE setting to show the effectiveness of our proposed methods. In this setting, the models initially get access to only a very small amount of data to learn from, which affects model performance.
Performance improves as the size of the training data gradually increases. In Table 5 we report the model performance averaged over all bins. The results show that TDA performs best, improving over the BERT baseline by 1.2%.

Discussion
Adapting temporally by training on timestamp patterns as text prepended to the input (BERT+TM) underperforms in all experiments.We argue that the added information affects all tokens equally via the self-attention mechanism although only some tokens will experience a semantic shift relevant for text classification in the crisis scenario.
Similarly, the LMSOC and DCWE adaptation approaches cannot outperform the baseline, which uses no temporal adaptation. Their additional parameters for computing the temporal offset are not well-tuned for predicting temporal distributions which have not been observed during training.
Figure 3 shows that TDA correctly learns to put maximum attention weight on the word Katrina (i.e. a reference to a previous hurricane) in the temporal context of the hurricane. We provide representative examples of tweets in Appendix A that all models but TDA fail to classify correctly. Forcing the model to learn time-invariant embeddings during training using an adversarial signal leads to TDA performing better than all other approaches. Although TAPH does not fall far behind, it approximates temporal information by creating time-static bins; this discrete approximation is the main reason behind its performance drop.

Conclusion
The usage of natural language inevitably changes over time, which influences the performance of text classification models applied to data from different temporal distributions. We show that this effect is also prevalent for rapid temporal drift, using social media during crisis events as an example. With the rise of pretrained contextualized embeddings, a dominant approach is to continue language modeling on data temporally closer to the target distribution. However, during crisis events such data is not available and annotated data is often scarce.
We investigate approaches which work without any additional data besides the input text and its temporal metadata. Our results show that under ideal conditions, i.e. high data quality and sufficient annotated instances, they outperform strong baselines. Most crucially, however, our work highlights a critical gap of temporal adaptation under rapid temporal drift, namely when unlabeled data for alignment is missing and annotated data is scarce. Our work opens the door for future research on methods which do not rely on pretraining on unlabeled target-domain data. In this sense, crisis data provides an interesting use case for evaluation. We release all our code and models, fostering future work in this area.

Limitations
While existing approaches account for temporal change of language over long periods of time, in social media this change can happen over the span of a single day during dynamic scenarios like crises or disastrous events. In this work we study the rapid temporal drift prevalently observed in social media during a crisis. Data from social media are often collected using keyword-based search and sampling techniques, in which only data containing the same set of keywords are collected. Since data collected this way are limited both in size and vocabulary, as well as by the issues inherent in keyword collection, the datasets naturally affect the performance of the methods described in this paper. Moreover, there exist differences among the types of crisis events (hurricane vs. earthquake) and their respective information needs. Hence, it is difficult to find a solution that works in all scenarios. Additionally, we highlight that the evaluation of all models was done on datasets annotated in the presence of a crisis, which may not exactly reflect their performance in a real-world setting without annotated data, especially when differences among the types of crises are relevant. In a nutshell, we observe that during real-world crises, pre-trained language models turn out to be a good solution when access to unlabeled data is scarce and sufficient annotated data is unavailable.

Figure 3: Representative example showing that, in comparison with other models, TDA correctly puts maximum attention weight on the word katrina (another storm) in the temporal context of the hurricane while computing the contextual embeddings. Refer to Section 6 for details.
Figure 2: Overview of the data splits used in our experiments. Bins in blue are used during training, bins in yellow for testing; grey bins are not used. The PROGRESSIVE setting comprises multiple experiments with increasing training data size and a single test data bin moving forward temporally.

Table 2: Temporal Adaptation Evaluation on Sandy.

Table 3: Performance Comparison on T26 and Humaid: The number of datasets for which the specific temporal adaptation method outperforms its baseline counterpart in the TEMPORAL setting. Refer to Sections 5.2 and 5.3 for details.

Table 4: Temporal Information Learning: Comparison of methods on TR (lower is better) and NMI scores (higher is better). Refer to Section 5.2 for details.

Table 5: Continual Learning Effects: Average model performance across all bins in the PROGRESSIVE setting, in terms of F1 score. Refer to Section 5.4 for details.