Efficient Continual Training of Temporal Language Models with Structural Information



Introduction
While Pre-trained Language Models (PLMs) have achieved remarkable success on most NLP tasks (Qiu et al., 2020), they neglect the time variable because they are trained on snapshots of data collected at a particular point in time (Devlin et al., 2019; Liu et al., 2019). However, our language is constantly evolving, with new words being created (Huang et al., 2014; Rudolph and Blei, 2018; Amba Hombaiah et al., 2021) and existing words changing in meaning and usage (Labov, 2011; Eisenstein et al., 2014; Giulianelli et al., 2020; Montanelli and Periti, 2023). This "static" training paradigm prevents PLMs from generalizing over time and from modeling language change (Lazaridou et al., 2021; Søgaard et al., 2021; Loureiro et al., 2022).

Figure 1: Representative examples of syntactic role changes across time. We track the two most frequent syntactic roles of the words "oval" and "justify" from 2015 to 2018; the syntactic role distribution of "justify" is stable across years, while that of "oval" varies dramatically over time.
To make model training more "dynamic", existing studies have explored temporal language models (TLMs), which model temporality by incorporating timestamps directly into representations when pre-training LMs, e.g., TempoBERT (Rosin et al., 2022) and TempoT5 (Dhingra et al., 2022). These methods prepend a special time token to each sequence in the training data. Through stacking multiple attention layers, each token at any position can capture the temporal information carried by the time token, and with different time prefixes, PLMs can adaptively compute the corresponding temporal representations. Moreover, TLMs have better generalization capability over time, as standard LMs are unaware of which data is "new" and which is "old" due to the absence of timestamps during training (Dhingra et al., 2022).
Though the methods above capture temporal information to a certain degree, they merely incorporate the "superficial" temporal information provided by the time prefix under the Masked Language Model (MLM) objective. It is therefore natural to leverage more temporal-specific information captured by textual tokens, e.g., a small set of lexicons with salient lexical semantic change (Hamilton et al., 2016; Giulianelli et al., 2020; Tahmasebi et al., 2021), as in the very recent LMLM method (Su et al., 2022). In this paper, we launch a thorough study of different lexicon selection methods based on statistical patterns or linguistic attributes, and find that the distributional change of syntactic roles (Kutuzov et al., 2021) is the most effective strategy for temporal-specific lexicon selection. As shown in Figure 1, the discrete syntactic role distribution of the word "oval" changes dramatically over time, while that of "justify" is stable across years. Based on these observations, we propose a Syntax-Guided Temporal Language Model (SG-TLM), which consists of two masking strategies: Syntax-Guided Masking (SGM) and Temporal-Aware Masking (TAM). Experimental results demonstrate that our method significantly outperforms other TLMs on two datasets and three tasks. Extensive results further confirm the efficiency of our method over the state-of-the-art lexicon selection solution based on semantic change, its remarkable transferability across model frameworks, and its positive impact on adaptation to future data.

Summary of Contributions: (i) We explore efficient syntax-guided lexicon selection, whose selected lexicons are more challenging for static PLMs to predict on time-stratified data. (ii) We propose a simple yet effective Syntax-Guided Temporal Language Model (SG-TLM). (iii) SG-TLM outperforms other TLMs in terms of memorization and generalization on downstream tasks. (iv) Our method demonstrates superior efficiency to the SOTA solution, high transferability across different model frameworks, and positive adaptability to future data.

Temporal Language Model
Previous works have explored temporal language models to enhance the capability of PLMs in modeling language change and generalizing over time.
One popular method is to prepend a timestamp in different forms to a textual sequence, e.g., "<2015> Sydney defeats Paris by points at Oval." (Rosin et al., 2022) or "year: 2015 text: Sydney defeats Paris by points at Oval." (Dhingra et al., 2022), and to use the MLM objective to capture the temporal information carried by the time prefix. Because each time prefix interacts equally with all correlated textual tokens, temporal information is injected into the pre-trained representation while ignoring the different degrees of diachronic change across tokens, e.g., time-specific versus time-agnostic tokens. It is therefore natural to enhance existing temporal language models by "accurately" injecting time information into the time-specific tokens; i.e., the core of building better temporal language models is to select the tokens that carry the time attribute. Normally, it is difficult to directly compute or estimate a token's degree of diachronic change or its time attribute, so existing works mainly leverage the discrepancy of data across different periods to approximate it. For instance, Su et al. (2022) measure the statistical distance of token representations across years to select the tokens (i.e., a lexicon) with salient lexical semantic change. Though effective, such semantic-based lexicon selection requires forwarding all training data through a large-scale language model during data processing and neglects structural information within language. To accelerate lexicon selection and leverage this structural information, we explore the potential of syntactic role changes of tokens, which benefits from the speed of off-the-shelf syntactic parsing tools. We elaborate on syntax-based lexicon selection below.
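As a concrete illustration, the two time-prefix formats above can be reproduced with a few lines of code. This is a sketch only: the function name and the `style` flag are ours for illustration, not part of any released implementation.

```python
# Sketch of time-prefix preprocessing for a temporal LM, covering the
# "<2015> ..." format (TempoBERT-style) and the "year: 2015 text: ..."
# format (TempoT5-style). Names here are illustrative assumptions.

def add_time_prefix(text: str, year: int, style: str = "t5") -> str:
    """Prepend a timestamp so the LM can condition on time."""
    if style == "t5":        # "year: YYYY text: ..." prefix
        return f"year: {year} text: {text}"
    elif style == "bert":    # "<YYYY> ..." special-token prefix
        return f"<{year}> {text}"
    raise ValueError(f"unknown style: {style}")

print(add_time_prefix("Sydney defeats Paris by points at Oval.", 2015))
print(add_time_prefix("Sydney defeats Paris by points at Oval.", 2015, style="bert"))
```

Under either format, the MLM objective is applied unchanged; the time prefix is simply part of the input context.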

Syntax-Guided Lexicon Selection
We construct "syntax-guided lexicons" based on diachronic differences in syntax. Following Su et al. (2022), we first adopt YAKE! (Campos et al., 2018), a feature-based and unsupervised system, to extract the candidate keywords W^t = {w^t_1, w^t_2, ..., w^t_k} from the texts D^t of time t. Then, we utilize the off-the-shelf Stanza parser (Manning et al., 2014) to automatically parse the syntax of each sentence in D^t, count the frequency of syntactic roles for each candidate word w_i, and store the counts in the set R^t = {r^t_{w_1}, r^t_{w_2}, ..., r^t_{w_k}}. This set is a collection of dictionaries, each representing a word's syntactic roles and their frequencies, structured as r^t_{w_i} = {k^t_1 : v^t_1, k^t_2 : v^t_2, ...}, where k^t_j is a syntactic role of word w_i at time t and v^t_j is its frequency. For example, if the candidate word "oval" has the syntactic roles "amod" and "nmod" at time t with frequencies 150 and 100, respectively, the corresponding dictionary in R^t is r^t_oval = {amod: 150, nmod: 100}. Using these syntactic dictionaries, we create feature vectors a^t and a^t' to represent the syntactic profiles of the candidate words in different periods. The size of a^t and a^t' may vary across words, since we create a separate feature list for each word covering its observed syntactic roles; to align the vectors across times, we pad them with 0 for any missing roles. Finally, we calculate the cosine distance between a^t and a^t' to measure the difference between the syntactic profiles of the candidate words W^t. We use a hyper-parameter k to control the degree of syntactic change: we rank the candidate words W^t by their cosine values and select the top-k words as the syntax-guided lexicon, i.e., the tokens whose syntactic roles change most significantly across periods. Our proposed lexicon selection is much faster than that of LMLM (Su et al., 2022); the computation costs are compared in Section 4.4.
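The selection procedure above can be sketched in a few lines. The role counts below are toy values standing in for Stanza's output, and all function names are illustrative assumptions, not the paper's released code.

```python
# Minimal sketch of syntax-guided lexicon selection: build per-word
# syntactic-role frequency vectors for two times, pad missing roles with 0,
# score each word by cosine distance, and keep the top-k.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def syntactic_change(r_t, r_t2):
    """Cosine distance between a word's role distributions at two times,
    padding missing roles with 0 so both vectors share one role axis."""
    roles = sorted(set(r_t) | set(r_t2))
    a = [r_t.get(k, 0) for k in roles]
    b = [r_t2.get(k, 0) for k in roles]
    return 1.0 - cosine(a, b)

def select_lexicon(roles_t, roles_t2, k):
    """Top-k candidate words from time t, ranked by syntactic role change."""
    scored = {w: syntactic_change(roles_t.get(w, {}), roles_t2.get(w, {}))
              for w in roles_t}
    return sorted(scored, key=scored.get, reverse=True)[:k]

# Toy role dictionaries echoing the "oval" vs. "justify" example.
roles_2015 = {"oval": {"amod": 150, "nmod": 100}, "justify": {"root": 90, "xcomp": 40}}
roles_2018 = {"oval": {"nmod": 20, "obj": 180},   "justify": {"root": 85, "xcomp": 45}}
print(select_lexicon(roles_2015, roles_2018, k=1))  # "oval" changes most
```

In practice the role dictionaries would be populated by running a dependency parser over D^t and D^t', which is the step that makes this method far cheaper than forwarding the corpus through a large PLM.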

Discussions and Observations
In Section 2.2, we propose a direct and efficient syntax-guided approach for obtaining lexicons that have undergone significant syntactic change over time. Following Su et al. (2022), we mask the tokens in the selected lexicons and use perplexity (ppl.) as a quantitative measure to compare the influence of different lexicon selection methods on static PLMs. To this end, we build a time-stratified corpus from the publicly released News Crawl datasets, which contain 1M English news articles for each year between 2014 and 2018. We post-tune the BERT model on the data from 2014 and evaluate on the four test sets after 2015.
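For reference, perplexity over the masked positions is just the exponentiated mean negative log-likelihood. A minimal sketch, with made-up token probabilities standing in for the probabilities a masked LM such as BERT would assign:

```python
# Perplexity from per-token probabilities of the masked positions.
# In the experiment above, these probabilities would come from BERT's
# MLM head; the values here are illustrative.
import math

def perplexity(token_probs):
    """exp of the mean negative log-likelihood over masked tokens."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# A static PLM that struggles with the selected lexicon words (low
# probabilities) yields higher perplexity than on easy random words.
hard = perplexity([0.05, 0.02, 0.10])
easy = perplexity([0.40, 0.50, 0.30])
print(hard > easy)  # True
```

Higher ppl. on a lexicon's masked tokens thus indicates that the lexicon is harder for the static PLM to predict, which is exactly the comparison made in Figure 2.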

Methods for Comparison
We introduce six approaches to lexicon construction: (i) We first adopt two baseline methods for extracting lexicons: random selection and frequency-aware selection. (ii) We then introduce two approaches for selecting lexicons with salient linguistic change: lexical semantic change, i.e., LMLM (Su et al., 2022), and syntactic role change, i.e., our proposed syntax-guided method (SMLM) introduced in Section 2.2.
(iii) Additionally, we consider words that depend on the tokens with significant diachronic change, as identified during syntactic parsing, i.e., via the head node in the dependency parsing tree (McDonald et al., 2005), and propose two methods that include these dependent words in the lexicons extracted by LMLM and SMLM, named L-DMLM and S-DMLM, respectively.

The Influence of Lexicon Selection Methods
The results are shown in Figure 2. We can see that: (i) Without adding dependent words, the ppl. of SMLM is much higher than that of the other three lexicon selection methods, indicating that it is more challenging for static PLMs to predict the lexicons selected from the syntactic perspective. (ii) After adding dependency information, both SMLM and LMLM show an apparent increase in their ppl. values, i.e., as L-DMLM and S-DMLM, suggesting the positive impact of adding dependent words from syntactic parsing to lexicon selection.
(iii) Above all, S-DMLM achieves the highest ppl. values among the six methods; it selects the diachronic-change lexicons that impose the most significant challenge on static LMs.

Syntax-Guided TLM

The Masked Language Model (MLM) objective (Devlin et al., 2019) is a widely adopted self-supervised training method that randomly masks a certain percentage of the tokens in a text sequence and trains a model to predict the masked tokens from their context. Previous TLMs add a timestamp token at the beginning of the input sequence and utilize the MLM objective to predict the randomly masked tokens based on the context and the timestamp. However, these methods disregard the inherent temporal-specific information provided by lexicons of tokens with salient change. Based on the aforementioned observations, we propose the Syntax-Guided Temporal Language Model (SG-TLM), which consists of two main components: a Syntax-Guided Masking (SGM) scheme and a Temporal-Aware Masking (TAM) method.
Our proposed model is illustrated in Figure 3.

Syntax-Guided Masking (SGM)
We construct the syntax-guided lexicons based on the distributional change of syntactic roles across timestamps. Formally, given the text set at time t, we first rank the candidate words by the cosine values of their syntactic change. Then, we select the k (k ∈ {100, 200, ..., 500}) words with the highest scores as the masking candidates W^t_mask. Given the effectiveness of adding dependent words, we additionally treat the words on which the candidates in W^t_mask depend within the sentence as temporal information. Specifically, given the masking ratio α, we prioritize masking the words in W^t_mask and their corresponding words in the dependency relationship, and randomly mask other tokens from the sequence if there are not enough candidates to meet the required number of masked tokens. Assuming m tokens are masked in total and the masked sequence at time t is d^t'_i, the optimization objective of SGM can be written as

L_SGM = − Σ_{j=1}^{m} log P(w_j | d^t'_i; θ),

where w_j is the j-th masked token.

Temporal-Aware Masking (TAM)

Unlike previous work (Rosin et al., 2022), we predict the masked tokens with salient syntactic role change together with the time token, given the remaining unmasked words in the sequence. Formally, given a sequence d_i at time t, we denote its timestamp token by d^(0)_i and prepend it to d_i. The model then predicts the time t from the whole input text:

L_TAM = − log P(t | d_i; θ).

As for the granularity of t, different values can be used depending on the use case; in this work, we use a year for the WMT dataset and a month for the RTC dataset.

Evaluation

Following the pre-training period, we evaluate the models' memorization (Dhingra et al., 2022) and generalization (Lazaridou et al., 2021) abilities by measuring their performance on downstream tasks. To evaluate memorization, the model is tested on the same time steps as the pre-training data S_1...T = {S_1, S_2, ..., S_T}. To evaluate generalization, performance is measured on future times (S_{T+1}, ..., S_{T+n}) that are invisible during the post-tuning stage.
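The Syntax-Guided Masking step can be sketched as follows. The tokens, dependency heads, and function name are illustrative assumptions, not taken from a released implementation; heads are given as token indices, as a dependency parser would provide.

```python
# Sketch of Syntax-Guided Masking (SGM): mask lexicon words and their
# dependency heads first, then fall back to random tokens until the
# masking budget (ratio alpha) is met.
import random

def sgm_mask(tokens, lexicon, heads, alpha=0.3, mask_token="[MASK]", seed=0):
    rng = random.Random(seed)
    budget = max(1, int(len(tokens) * alpha))
    # Priority positions: lexicon words, plus the words they depend on.
    priority = [i for i, tok in enumerate(tokens) if tok in lexicon]
    priority += [heads[i] for i in priority if heads[i] is not None]
    chosen = []
    for i in priority:                      # dedupe, respect budget
        if i not in chosen and len(chosen) < budget:
            chosen.append(i)
    rest = [i for i in range(len(tokens)) if i not in chosen]
    rng.shuffle(rest)                       # random fill-up if needed
    chosen += rest[: budget - len(chosen)]
    return [mask_token if i in chosen else tok for i, tok in enumerate(tokens)]

tokens = ["Sydney", "defeats", "Paris", "by", "points", "at", "Oval"]
heads = [1, None, 1, 4, 1, 6, 1]            # toy dependency heads (indices)
print(sgm_mask(tokens, lexicon={"Oval"}, heads=heads))
```

With α = 0.3 and one lexicon word, the budget is filled entirely by "Oval" and its head "defeats", so no random masking is needed; on longer sequences the random fill-up supplies the remainder.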
After continual training on the WMT dataset, two tasks are used for model evaluation: political affiliation classification (POLIAFF) (Luu et al., 2022) and named entity recognition (TWINER) (Rijhwani and Preoţiuc-Pietro, 2020). The POLIAFF task involves fine-tuning the model on 10,000 labeled sentences from 2015. Testing covers {S_2015, S_2016, S_2017, S_2018} for memorization and {S_2019, S_2020} for generalization, with 2,000 sentences per year. For the TWINER task, 2,000 labeled tweets from 2015 are selected for fine-tuning. We evaluate memorization using {S_2016, S_2017, S_2018} and generalization using {S_2019}. Following pre-training on the RTC dataset, the model is evaluated on the political subreddit prediction (PSP) task (Röttger and Pierrehumbert, 2021). Specifically, the model is fine-tuned on 20,000 labeled samples from April 2017, extracted from the same dataset used for pre-training. Memorization and generalization are tested on {S_2017-04, S_2018-08, S_2019-08} and {S_2020-01, S_2020-02}, respectively. We report the F1-score for all experiments. More details about the tasks and our model's training are given in Appendix A.

Baselines
We establish several baselines that encapsulate different continual pre-training strategies. First, we consider two naive baselines that do not incorporate timestamps during pre-training: (i) BERT (w/o) (Devlin et al., 2019), which is directly fine-tuned on the downstream task without continued pre-training; (ii) Uniform, which trains the model on the mixed pre-training data. Additionally, we adopt three up-to-date TLMs that utilize timestamps during pre-training: (iii) TAda (Röttger and Pierrehumbert, 2021), which continually pre-trains on specific time buckets to obtain separate models specialized for different periods; (iv) Temporal (Dhingra et al., 2022), which prepends time t as a prefix to the input during pre-training, with temporal-specific lexicons randomly generated; (v) LMLM (Su et al., 2022), the SOTA in temporal adaptation, which strengthens PLMs' generalization using salient lexical semantic change; we follow their lexicon construction method. STLM and SG-TLM are our methods, distinguished by whether dependent words are added during lexicon construction. Appendix B offers additional training details on the compared TLMs.

Main Results
Table 1 presents the results on the WMT dataset. TLMs demonstrate superior performance compared to models that do not consider timestamps. However, previous TLMs such as TAda and Temporal show only marginal improvements over the Uniform model, indicating that little temporality information is learned from the timestamps alone. Conversely, incorporating linguistic information into TLM training significantly improves both memorization and generalization. Among the evaluated baselines, SG-TLM achieves the highest average F1-scores on both tasks, i.e., 66.77 and 66.67, highlighting the effectiveness of leveraging syntax and dependency information within languages.

Table 1: Results on the WMT dataset: memorization and generalization performance on the POLIAFF and TWINER tasks. The timestamps in italics (2019 and 2020 for POLIAFF; 2019 for TWINER) are not visible during the post-tuning stage; we use these test sets to evaluate generalization and the rest to test memorization. STLM (marked with †) and SG-TLM (marked with ‡) are our methods; the difference is whether dependent words are included in the masking scheme. Our proposed SG-TLM achieves the highest average F1-score on both tasks. Each number is the average of 5 runs with different seeds.

However, the performance differences among the methods on the RTC dataset are relatively minor compared to the WMT dataset, which can be attributed to its shorter time intervals and relatively stable, slightly dynamic temporality. Moreover, SG-TLM also shows a notable drop in future years on both datasets, reflecting the inherent uncertainty of future language change (Lazaridou et al., 2021). In Section 4.5, we investigate approaches to refreshing the models as new data arrives and compare the temporal adaptation (Röttger and Pierrehumbert, 2021; Luu et al., 2022) performance of different methods.

Detailed Analysis
Efficiency Comparison: Table 3 compares the lexicon construction efficiency of LMLM and SG-TLM. LMLM uses a fairly complex and time-consuming procedure to select semantic-based lexicons, i.e., 200 minutes for representation and 360 minutes for clustering. In contrast, SG-TLM benefits from the speed of syntactic parsing tools rather than large-scale PLMs, yielding a speedup of 5.5× over the representation step and 180× over the measuring step. Furthermore, SG-TLM outperforms LMLM on the TWINER task with a 1.45-point higher F1-score, demonstrating its superior effectiveness.
Ablation Study: To investigate the impact of each component of SG-TLM, we remove individual components from the complete model and observe the resulting performance. The results are shown in Table 4. Notably, excluding the SGM objective leads to the largest decline in performance, highlighting its pivotal role within the SG-TLM framework. Furthermore, every component contributes positively to the overall performance, indicating the utility of all SG-TLM components in improving the model's effectiveness.

Scale Effects in Performance
We also explore whether our proposed SG-TLM remains effective over random masking as the amount of data increases. Using the WMT dataset, we successively expand the training data for both models, multiplying the volume by factors of 2, 3, 4, and 5. The results of this scaling experiment are summarized in Table 5. SG-TLM consistently outperforms the uniform masking approach across all years and data scales, demonstrating the robustness of our approach with increasing data.

Table 5: Results of scale effects in performance (average of 5 runs). In method names, "4M", "8M", etc., denote the use of 4 million, 8 million, etc., samples for pre-training. SG-TLM and Uniform represent our proposed method and the random masking strategy, respectively.
Hyper-Parameter Analysis: Considering the correlation between the masking ratio and the model's performance, we conduct experiments to find the most suitable masking ratio α and word count k for the SGM objective. The results are shown in Figure 5. Our SG-TLM achieves the best performance when the masking ratio α is set to 30% and the number of candidate words k is 200. To better understand our presented SG-TLM, we conduct a token-level analysis of the selected lexicons in Appendix D.

Temporal Adaptation to New Data
Unlike domain adaptation, temporal adaptation (Röttger and Pierrehumbert, 2021) updates models with current data to mitigate temporal misalignment. In this section, we consider the scenario where we already have a model trained on the 2015-18 slices and new unlabeled data arrives from the 2019 slice. We update the model by continuing to pre-train the TLMs on the target-year data. To compare the adaptability of different TLMs to the target year, we evaluate their performance on the TWINER dataset: we fine-tune the adapted models on the labeled source-year data and then test them on the 2019 data. We adapt three TLMs with three lexicon construction methods, totaling nine settings. Results are shown in Figure 4.

Transferability Across PLM Frameworks
To verify the transferability of our methods across different model frameworks, we implement our method in both encoder-only and decoder-only models and utilize random lexicon construction as the baseline for comparison.

Effectiveness on Encoder-only PLMs
We implement our method on two popular encoder-only PLMs, BERT and RoBERTa. As shown in Table 6, each PLM improves significantly with our SG-TLM: on average, BERT improves by 0.77 points and RoBERTa by 1.31 points. These findings illustrate the versatility of our proposed SG-TLM, which enhances the performance of various PLMs by incorporating syntax information.
Effectiveness on Decoder-only PLM: We also conduct experiments on a large-scale decoder-only language model, LLaMA-7B (Touvron et al., 2023). In these experiments, we extract the word selection component from our method and compare two data selection methods: one based on our Syntax-Guided (SyG.) approach and the other on random selection. The perplexity of LLaMA-7B is evaluated on 2,000 sentences selected with each method. As shown in Table 7, SG-TLM yields higher perplexity than random selection on the RTC dataset. This highlights the complexity and diversity of our selected data, indicating the effectiveness of incorporating syntax into data selection and the potential to enhance temporal capture in large language models. (Since the training data for LLaMA is cut off at 2022, we crawl the latest Reddit data from https://files.pushshift.io to complete this experiment.)

Related work
Temporal Language Model: Several works have explored temporal effects in language models (Huang and Paul, 2018, 2019; Rijhwani and Preoţiuc-Pietro, 2020; Lazaridou et al., 2021; Søgaard et al., 2021; Agarwal and Nenkova, 2022; Loureiro et al., 2022; Cao and Wang, 2022; Cheang et al., 2023). Recently, temporal language models have been investigated to model temporality and generalize over time, e.g., by prepending time prefixes to input sequences during pre-training (Dhingra et al., 2022). We extend this line of work with a detailed study using syntactic role changes to harness temporal-specific information efficiently.

Conclusion
In this paper, we enhance the temporal language model from a syntactic perspective and discover that predicting syntax-guided lexicons is more challenging for static PLMs than predicting lexicons selected by other methods. Building on these findings, we propose a syntax-guided temporal language model (SG-TLM) that injects time information into tokens with significant syntactic change. Our SG-TLM achieves SOTA performance, reduces the computational cost of lexicon construction, and demonstrates excellent transferability to new data and model frameworks compared to other baselines.

Limitation
There are still some limitations in our work, listed below:

• While we introduce a data selection strategy that incorporates syntactic changes to identify time-specific sentences, we only conduct a preliminary validation of our method's transferability to Large Language Models (LLMs), without involving the training and inference stages. Recent studies highlight that LLMs continue to struggle to generalize to emerging data (Wang et al., 2023). As a potential solution, our future work aims to integrate our method with in-context learning (Dong et al., 2022) to enhance the temporal generalization capabilities of LLMs.

• Recent studies (Kutuzov et al., 2021; Giulianelli et al., 2022) show the effectiveness of syntactic features in detecting lexical semantic change, which prompts us to investigate the compatibility of our method with the semantic lexicon solution. Given the lexicon constructed by semantic change, W_l, and by syntactic change, W_s, we test three straightforward combinations: W_l ∩ W_s, W_l ∪ W_s, and W_l \ W_s. Consistent with the previous experiments, the masking ratio α is 30% and k is 200. The results are shown in Table 8. Contrary to our expectations, the combined models perform worse than the original model, suggesting that these simple combinations fail to merge information effectively across dimensions. In future work, we will explore more suitable methods to integrate semantic and syntactic information.
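For clarity, the three combination schemes tested above are plain set operations over the two lexicons; the toy words below are illustrative, not from the actual lexicons.

```python
# The three lexicon-combination schemes from the limitation study,
# expressed as Python set operations over toy lexicons.
W_l = {"oval", "cloud", "stream"}   # semantic-change lexicon (toy)
W_s = {"oval", "tweet", "stream"}   # syntactic-change lexicon (toy)

print(W_l & W_s)   # intersection: words flagged by both criteria
print(W_l | W_s)   # union: words flagged by either criterion
print(W_l - W_s)   # difference W_l \ W_s: semantic-only words
```

The combined lexicon then replaces W^t_mask in the SGM masking step, with α and k unchanged.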

B Implementation of the Baselines
This section provides further experimental details of the TLMs used as baselines in Section 4.2: TAda, Temporal, and LMLM. Following the strategy of Su et al. (2022), LMLM utilizes 500 candidate words and a mask ratio of 0.3. For the remaining baselines, the mask ratio aligns with standard BERT (Devlin et al., 2019).

Temporal: The Temporal (Dhingra et al., 2022) baseline is trained on the entire dataset as a single model. The unique feature of this setup is the way it incorporates time into the input: the model takes a concatenation of the time t and the input x, that is, P(y|x, t; θ) = P(y|t ⊕ x; θ). This is achieved by prefixing the input with a simple string representation of time, such as "year: 2014". The model is thus trained to generate outputs based on both the input and its associated time, allowing it to develop temporal sensitivity.
LMLM In the lexical-based Masked Language Model (LMLM) (Su et al., 2022) setup, the model is trained to account for semantic shifts in words over time.This is achieved by constructing a lexicon of words that have exhibited significant semantic changes over time and then using this lexicon in a masked language model setup during pre-training.The model is thus trained to predict the original word based on the context and the time token, allowing it to capture temporal dynamics in word semantics.
For all these baselines, we perform fine-tuning on the same downstream task data, ensuring fair comparisons among the models.Moreover, we use the same BERT-BASE-UNCASED pre-trained model as the foundational language model for all baselines, setting a level playing field.
As a variant of LMLM and SG-TLM, we also consider an additional baseline in which the timestamp is treated as a prefix at the beginning of the sentence and is randomly replaced with the <MASK> token for prediction, while the remaining tokens within the sentence are randomly masked. This baseline is equivalent to employing Temporal-Aware Masking (TAM) with random token masking, corresponding to the "−SMLM" experiment presented in Section 4.4. Notably, despite these different masking conditions, our proposed SG-TLM consistently outperforms the other models.

C Parsing Toolkit Analysis
Since there is a strong correlation between the parsing toolkit's capability and the performance of our SG-TLM, we compare the selected Stanza with another commonly adopted parsing tool, UDPipe (Straka and Straková, 2017), a fast transition-based neural dependency parser that follows the same annotation schemes as Stanza. As shown in Table 9, using Stanza as the parsing method outperforms UDPipe on all timestamps, with an average accuracy of 68.27 versus 64.78 for UDPipe, indicating that a more accurate parsing tool can significantly improve the model's performance. Although UDPipe does not depend on GPU resources, it is less suitable for parsing syntactic roles in the lexicon selection process.

D Token-level Analysis
From Figure 5, it is surprising that there is no positive correlation between the word count and the model's performance. To understand this phenomenon, we select the top 500 words from the candidates W^t_mask by cosine value. The distribution of these words is shown in Table 10, which indicates that only about 20% of the words have relatively significant syntactic change (cosine value ≥ 0.01). This suggests that performance mainly comes from correctly predicting a small number of keywords, such as topic words and newly emerging words, which have relatively salient syntactic change. Furthermore, we also show the distribution of these lexicons by Part of Speech (POS) across years in Table 11. The POS distribution remains relatively consistent year over year, and nouns dominate it, implying their higher propensity for syntactic variation.

E Superiority in Adapting Temporal Change
This section demonstrates the superior performance of the SG-TLM model in adapting to temporal changes compared to other Temporal Language Models (TLMs). Specifically, we compare SG-TLM against established baselines, i.e., Uniform, Temporal, and LMLM, as introduced in Section 4.2. All models are further pre-trained from BERT on 4 million instances spanning 2015 to 2018 in the WMT dataset and evaluated on predicting masked tokens at different timestamps, using 2,000 samples from the same source for each of the six years 2015-2020. This experimental setting was also used to verify the ability to adapt to temporal changes in Röttger and Pierrehumbert (2021). The results of this comparison are presented in Figure 6. SG-TLM consistently outperforms the other models across all examined years, achieving the lowest perplexity scores, which illustrates its superior adaptability to temporal shifts. Notably, SG-TLM also exhibits superior generalization on the 2019 and 2020 data, which were not present during training, further demonstrating its robustness and reliability in handling temporal shifts.

Figure 2: Results of the ppl. score. S-DMLM achieves the highest ppl. values among the six selection methods.

Datasets
We conduct continual pre-training on two datasets: WMT NEWS CRAWL (WMT) and REDDIT TIME CORPUS (RTC). The WMT dataset, an open-domain dataset, consists of 4 million news articles published between 2015 and 2018: {D_2015, D_2016, D_2017, D_2018}. The RTC (Röttger and Pierrehumbert, 2021), on the other hand, is a monthly time-stratified dataset from March 2017 to February 2020. Unlike WMT, it focuses specifically on the political domain, enabling us to explore temporal dynamics within a specific domain. We select three months {D_2017-04, D_2018-08, D_2019-08} for pre-training, each containing 1 million unlabeled comments.

Figure 4: Results of different TLMs' adaptability on the target year (average of 5 runs). The horizontal axis indicates the different TLMs, and the vertical axis shows the F1-score on the TWINER task. Our proposed SG-TLM achieves the highest F1-score among the nine settings.

Figure 5: Results of the different masking strategies of SG-TLM (average of 5 runs). The horizontal axis indicates the candidate word count k and the vertical axis represents the F1-score on the TWINER task.
Dhingra et al. (2022) and Rosin et al. (2022) directly prefix the time token to text sequences and fine-tune on time-stratified data. Hofmann et al. (2021) and Rosin and Radinsky (2022) modify the structure of the language model to create time-specific contextualized word representations. Su et al. (2022) recently proposed a semantic-based lexical masking strategy to enhance PLMs' temporal generalization.

Figure 6: Results of adapting to the temporal shift. SG-TLM consistently achieves the lowest ppl. scores compared to the Uniform, Temporal, and LMLM models from 2015 to 2020.

Table 3: Efficiency comparison between LMLM and SG-TLM. SG-TLM is faster and more effective.

Table 6: Results of different PLMs on the RTC task (average of 5 runs). Our proposed SG-TLM achieves the highest F1-score on both BERT and RoBERTa.

Table 7: Comparative perplexity results of LLaMA-7B on the RTC dataset using Syntax-Guided (SyG.) and random selection methods.

Table 8: The compatibility of our method with the semantic lexicon solution. W_l represents the lexical semantic solution, while W_s represents our syntactic role solution.

Evaluation Metric: We utilize the F1 score as the evaluation metric in all experiments.

Table 9: Results of different syntactic methods under the time-stratified setting (average of 5 runs).

Table 10: Distribution of the syntactically changed words, where SyC. stands for Syntactic Change.

Table 11: The distribution of the selected lexicons by Part of Speech (POS) across years in the WMT pre-training dataset.