TemporalWiki: A Lifelong Benchmark for Training and Evaluating Ever-Evolving Language Models

Language Models (LMs) become outdated as the world changes; they often fail to perform tasks requiring recent factual information that was absent or different during training, a phenomenon called temporal misalignment. This is an especially challenging problem because the research community still lacks a coherent dataset for assessing the adaptability of LMs to frequently updated knowledge corpora such as Wikipedia. To this end, we introduce TemporalWiki, a lifelong benchmark for ever-evolving LMs that utilizes the difference between consecutive snapshots of English Wikipedia and English Wikidata for training and evaluation, respectively. The benchmark hence allows researchers to periodically track an LM's ability to retain previous knowledge and acquire updated or new knowledge at each point in time. We also find that training an LM on the diff data through continual learning methods achieves similar or better perplexity than training on the entire snapshot in our benchmark, with 12 times less computational cost, which verifies that factual knowledge in LMs can be safely updated with minimal training data via continual learning.


Introduction
Large Language Models (LMs) pretrained on vast amounts of text have been shown to be highly effective when finetuned or prompted to perform various downstream tasks (Raffel et al., 2019; Brown et al., 2020; Sanh et al., 2022; Wei et al., 2022). However, most of the datasets used to evaluate these LMs are static benchmarks; the train and test data are both from similar points in time. In the real world, on the other hand, factual knowledge is frequently changed, added, or deprecated. For example, suppose a language model is asked what the most dominant coronavirus variant is (Figure 1).
The answer would have been the Delta variant in the fall of 2021 but changed to the Omicron variant near the end of 2021. If LMs remain unchanged and are not periodically trained to cope with the changing world, they will become outdated very quickly. This means downstream tasks that directly depend on or are finetuned from the LM will suffer from temporal misalignment (Luu et al., 2021; Lazaridou et al., 2021), which refers to the misalignment in time between the train and test data.
Temporal misalignment becomes a critical problem especially when using language models for knowledge-intensive tasks such as closed-book question answering (Roberts et al., 2020; Petroni et al., 2021; Jang et al., 2022), since they rely solely on the knowledge stored in their parameters. Furthermore, LMs augmented with retrieval mechanisms (Guu et al., 2020; Lewis et al., 2020; Borgeaud et al., 2021) often suffer from hallucination even if they successfully retrieve up-to-date information (Zhang and Choi, 2021; Chen et al., 2021; Longpre et al., 2021). This means that the implicit knowledge stored in the model parameters has to be updated as well, because it may conflict with the explicit knowledge retrieved from external sources such as up-to-date knowledge bases and ultimately cause the LM to hallucinate.
Recently, Lazaridou et al. (2021) and Jang et al. (2022) have explored updating the internal knowledge of LMs through continual pretraining on new and updated data as a solution for mitigating temporal misalignment. However, these datasets are still static in nature: as the world changes, they will eventually become outdated as well. In order to comprehensively measure the capability of ever-evolving LMs to address temporal misalignment, automated periodic evaluation of the LMs is crucial.
In this paper, we introduce TEMPORALWIKI, a lifelong benchmark for training and evaluating ever-evolving LMs in a periodic and automated manner, shown in Figure 1. The corpora used for updating LMs are constructed by comparing articles from consecutive English Wikipedia snapshots and retrieving only changed information; we name these TWIKI-DIFFSETS. The evaluation datasets are constructed in a similar manner by comparing English Wikidata snapshots that correspond in time to the Wikipedia snapshots and categorizing each factual instance as UNCHANGED or CHANGED. Since Wikidata updates may not exactly align with Wikipedia updates, we only retain factual instances that can be grounded to articles in Wikipedia, ensuring the quality of the data; we name the resulting evaluation datasets TWIKI-PROBES. The entire benchmark creation process is done without any human annotation, thus allowing it to be automated and lifelong as new English Wikipedia and English Wikidata snapshots are released by Wikimedia 1 on a monthly basis.
Through TEMPORALWIKI, we aim to tackle the following research questions: How can we train ever-evolving LMs efficiently and automate the evaluation of each update? How does updating LMs only on updated data from Wikipedia compare to updating LMs on entire Wikipedia snapshots, especially in scenarios requiring multiple updates? How problematic is catastrophic forgetting (McCloskey and Cohen, 1989) when LMs are updated only on new data, and how can we effectively mitigate it? Our main contributions are summarized as follows:

• We introduce TEMPORALWIKI, a lifelong benchmark for ever-evolving LMs. Unlike previous static benchmarks, TEMPORALWIKI is responsive to the dynamic changes in the world and can be utilized to automatically train and evaluate ever-evolving LMs on each English Wikipedia and English Wikidata snapshot update.

1 https://commons.wikimedia.org/
• We find that continually training LMs only on the updated portion of English Wikipedia, which we call temporal language modeling, is much more efficient than updating LMs on entire English Wikipedia snapshots in terms of both computation and the stability-plasticity trade-off. It is still a challenging task, however, especially when multiple updates are required, due to catastrophic forgetting.
• As competitive baselines for temporal language modeling, we implement previous continual learning approaches that mitigate forgetting while bolstering the learning of new knowledge, thus providing an overall enhancement in terms of both stability and plasticity.
We hope that TEMPORALWIKI will foster future research towards training ever-evolving LMs.

Background
Recent works have introduced the need to tackle the issue of temporal misalignment, which refers to neural networks showing poor performance due to misalignment in time between the train and test data. Temporal misalignment can be caused either by (1) the dynamic nature of language (Röttger and Pierrehumbert, 2021; Hombaiah et al., 2021; Rosin et al., 2021; Loureiro et al., 2022) or (2) the update of factual information (Chen et al., 2021; Dhingra et al., 2022; Jang et al., 2022). Luu et al. (2021) have emphasized the effect of temporal misalignment on eight different NLP downstream tasks, asserting that misalignment between the train and test sets of the downstream tasks causes severe performance degradation that can be mitigated by finetuning on the corpus from the target period. Agarwal and Nenkova (2021) have argued this to be less of a concern when utilizing representations from pretrained LMs and show that self-labeling on the downstream task is more effective than continued pretraining on more recent data for temporal adaptation. Note that these works have focused on misalignment caused by the dynamic nature of language on tasks that are not knowledge-intensive, such as text classification.
Others have tackled the problem caused by the update of factual knowledge. Lazaridou et al. (2021) have shown that LMs deteriorate significantly in performance when there is a misalignment in time between the pretraining data and the downstream task and argued that ever-evolving LMs are necessary. Dhingra et al. (2022) have proposed explicitly including time information during pretraining as a potential solution. Jang et al. (2022) and Jin et al. (2022) have implemented continual learning methods to mitigate catastrophic forgetting that occurs during continued pretraining on new data.
Despite the recent interest in ever-evolving LMs, the community lacks widely available resources to train and evaluate such models. Previous works have introduced benchmarks comprised of data sources from Twitter feeds (Osborne et al., 2014; Yogatama et al., 2014; Loureiro et al., 2022), recent news articles (Jang et al., 2022), and arXiv papers (Lazaridou et al., 2021), where the temporal adaptability of LMs and the effectiveness of different methodologies for updating LMs can be evaluated. However, these data sources are domain-specific and inherently static.
On the other hand, Wikipedia and Wikidata are known to be great sources of general world knowledge and thus have been widely used by the community (Dinan et al., 2019; Thorne et al., 2018; Kwiatkowski et al., 2019; Piktus et al., 2021). 120K volunteer editors make 120 updates to English Wikipedia per minute and add hundreds of new article entries every day (Logan IV et al., 2021). Even though not every Wikipedia and Wikidata update corresponds to an actual change in the real world, TEMPORALWIKI leverages the dynamic nature of Wikipedia and Wikidata to provide a lifelong benchmark for developing and maintaining ever-evolving LMs.

TemporalWiki
In this section, we delve into the process of creating TEMPORALWIKI, which is comprised of training corpora (TWIKI-DIFFSETS) and evaluation datasets (TWIKI-PROBES) sourced from English Wikipedia and English Wikidata, respectively. For brevity, "English" is omitted when referring to English Wikipedia and English Wikidata throughout the rest of the paper. Moreover, we clarify that not all Wikipedia/Wikidata updates equate to actual updates of world knowledge. In Section 3.1, we first describe the process of constructing the training corpora from Wikipedia snapshots. Then in Section 3.2, we describe the process of generating the evaluation datasets from Wikidata snapshots. In Section 3.3, we describe the quality control applied to the evaluation datasets.

Generating Training Corpora from Wikipedia
It is highly computationally expensive to train an LM on an entire Wikipedia snapshot every time the LM requires updates, since most of Wikipedia is unchanged from the previous snapshot. Moreover, it is not certain whether training on the whole snapshot is the best approach for updating the factual knowledge stored in the LM. Therefore, we compare the differences between consecutive Wikipedia snapshots in order to use only updated and new text for training. We call these subsets TWIKI-DIFFSETS. Algorithm 1 shows the procedure for generating them.
As shown in Algorithm 1, a single TWIKI-DIFFSET is generated by computing the differences (similarly to git diff) between two consecutive Wikipedia snapshots.

Algorithm 1 Generating TWIKI-DIFFSETS
Require: Wikipedia snapshots WP_prev and WP_recent, where WP_recent is more recent. D := an empty array to store new and updated data. (Every article in a snapshot has attributes id and text.)
  for all articles a_r ∈ WP_recent do
    if a_r.id = a_p.id for some article a_p ∈ WP_prev then
      D.append(GETDIFF(a_p, a_r))
    else
      D.append(a_r)
    end if
  end for

  function GETDIFF(a_p, a_r)
    Diff := an empty string to accumulate the differences between the text of the two articles.
    for all paragraphs p_r ∈ a_r.text do
      if p_r has no matching sentences with any paragraph p_p ∈ a_p.text then
        Diff ← Diff + p_r
      else if p_r has some matching and some differing sentences with a paragraph p_p ∈ a_p.text then
        Diff ← Diff + the sentences that differ between p_r and p_p
      end if
    end for
    return Diff

If an article with a new unique id is included in the recent snapshot, we append the entire article to the TWIKI-DIFFSET. For an article whose id exists in the previous snapshot, we compare the two versions paragraph by paragraph and add new or updated sentences to the TWIKI-DIFFSET. Examples of TWIKI-DIFFSETS are shown in Figure 5, and detailed statistics are given in Section 4.
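Under the simplifying assumption that each article is already split into sentences, the diff procedure above can be sketched as follows. This is a minimal, set-based approximation: the paper's Algorithm 1 matches at the paragraph level, and the dictionary-based snapshot format here is hypothetical.

```python
def generate_diffset(prev_articles, recent_articles):
    """Collect new and updated text between two Wikipedia snapshots.

    Each snapshot is assumed to be a dict mapping article id -> list of
    sentences (a simplification of the paper's paragraph-level matching).
    """
    diffset = []
    for article_id, sentences in recent_articles.items():
        if article_id not in prev_articles:
            # Entirely new article: keep all of its text.
            diffset.extend(sentences)
        else:
            # Existing article: keep only sentences absent from the old version.
            old_sentences = set(prev_articles[article_id])
            diffset.extend(s for s in sentences if s not in old_sentences)
    return diffset
```

Articles that are unchanged between snapshots contribute nothing, which is what keeps the resulting corpus roughly 12 times smaller than a full snapshot.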

Generating Evaluation Datasets from Wikidata
The success of an LM update in a continual pretraining setting can be evaluated through the lens of the stability-plasticity dilemma (Mermillod et al., 2013): the dilemma of neural models having to sacrifice either stability, the ability to retain learned knowledge, or plasticity, the ability to acquire new knowledge. In order to evaluate whether each update is successful, we need evaluation datasets that can quantify the amount of changed (updated or new) knowledge successfully gained (plasticity) and the amount of knowledge that remains unchanged as intended (stability). Therefore, we categorize factual instances from Wikidata snapshots that are temporally aligned with the Wikipedia snapshots and call the resulting datasets TWIKI-PROBES. Wikidata snapshots are structured knowledge graphs that store factual information in the form of (Subject, Relation, Object) triples, such as (Barack Obama, born-in, Hawaii). These factual instances can be used to probe the LM for factual knowledge (Petroni et al., 2019).
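The categorization of factual instances into UNCHANGED and CHANGED (detailed in Algorithm 2) can be sketched as follows: a triple is UNCHANGED if the identical triple existed in the previous snapshot, and CHANGED if its subject, relation, or object is new. This is a minimal sketch; representing snapshots as plain lists of triples is an assumption.

```python
def categorize_facts(prev_facts, recent_facts):
    """Split (subject, relation, object) triples from the recent Wikidata
    snapshot into UNCHANGED and CHANGED, following the logic of Algorithm 2."""
    prev_by_subject = {}
    for s, r, o in prev_facts:
        prev_by_subject.setdefault(s, set()).add((r, o))

    unchanged, changed = [], []
    for s, r, o in recent_facts:
        if (r, o) in prev_by_subject.get(s, set()):
            unchanged.append((s, r, o))   # identical triple existed before
        else:
            changed.append((s, r, o))     # new subject, relation, or object
    return unchanged, changed
```

Algorithm 2 distinguishes the cases of a new subject, a new relation, and a new object, but all three end up in CHANGED, so the collapsed condition above produces the same partition.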

Quality Control for Evaluation Data
We apply several quality control steps to the categorized factual instances from Section 3.2 to reflect the actual knowledge change from the LM update.
Alignment with TWIKI-DIFFSETS We ensure correct alignment of CHANGED instances with articles in TWIKI-DIFFSETS and UNCHANGED instances with articles from the entire Wikipedia since Wikidata updates do not necessarily entail Wikipedia updates and vice versa.In order to do this, we take three steps.
Step #1: We crawl information from each Wikipedia article page to find the mapping to the corresponding Wikidata entity id and store these mappings as a dictionary.

Step #2: Then, for each factual instance from CHANGED, we check whether the Subject id can be mapped to an article from TWIKI-DIFFSETS using the dictionary of id mappings. Likewise, for each instance from UNCHANGED, we check whether the Subject id can be mapped to an article from Wikipedia.
Step #3: Lastly, among the factual instances successfully mapped in Step #2, we keep only those where the Object appears in the text of the article.
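Assuming the id-to-article dictionary from Step #1 has already been built, Steps #2 and #3 can be sketched as below; the instance and dictionary formats here are hypothetical.

```python
def align_instances(instances, id_to_article):
    """Keep only factual instances whose Subject maps to an article
    (Step #2) and whose Object appears in that article's text (Step #3).

    instances: list of (subject_id, subject, relation, object) tuples.
    id_to_article: dict mapping Wikidata entity id -> article text (Step #1).
    """
    kept = []
    for subject_id, subject, relation, obj in instances:
        article_text = id_to_article.get(subject_id)
        if article_text is not None and obj in article_text:
            kept.append((subject, relation, obj))
    return kept
```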
Heuristic Filtering In addition to the alignment with TWIKI-DIFFSETS, we apply three heuristic filtering rules to further strengthen the quality of the evaluation datasets. Rule #1: We remove instances where either the SUBJECT or the OBJECT is a substring of the other. Rule #2: We remove instances where the OBJECT contains more than 5 words. Rule #3: We limit any single SUBJECT to at most 1% of the total instances, and any single RELATION or OBJECT to at most 5%. Table 4 shows some examples of TWIKI-PROBES after quality control.
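The three rules can be sketched as below. The paper does not specify how instances are chosen when a value exceeds its cap in Rule #3, so the first-come-first-kept policy here is an assumption.

```python
from collections import Counter

def heuristic_filter(instances):
    """Apply the three filtering rules to (subject, relation, object) triples."""
    # Rule #1: drop substring overlaps; Rule #2: drop objects longer than 5 words.
    stage1 = [(s, r, o) for s, r, o in instances
              if s not in o and o not in s and len(o.split()) <= 5]

    # Rule #3: cap each subject at 1% and each relation/object at 5% of the
    # total (first-come-first-kept is an assumption, not the paper's method).
    total = len(stage1)
    subject_cap, other_cap = 0.01 * total, 0.05 * total
    subjects, relations, objects = Counter(), Counter(), Counter()
    kept = []
    for s, r, o in stage1:
        if (subjects[s] < subject_cap and relations[r] < other_cap
                and objects[o] < other_cap):
            kept.append((s, r, o))
            subjects[s] += 1
            relations[r] += 1
            objects[o] += 1
    return kept
```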

Dataset Statistics
In this paper, we construct TEMPORALWIKI from 08.2021 to 12.2021, and its statistics are discussed below.
Training Corpora Statistics Statistics of the Wikipedia snapshots and TWIKI-DIFFSETS are shown in Table 1. An interesting aspect of TWIKI-DIFFSETS is that the amount of information being updated and added (i.e., the number of tokens in each subset) is similar for each month.

Evaluation Dataset Statistics
The statistics of TWIKI-PROBES after the initial categorization from Algorithm 2 and after quality control are shown in Table 2. For further analysis, we break down the entity types of Subject and Object and observe a similar proportion of each entity category for each month of TWIKI-PROBES (Appendix B).
We also show the distribution of the top 30 most frequent Relation of UNCHANGED and CHANGED (Appendix C).

Experiments with TEMPORALWIKI
In this section, we train and evaluate ever-evolving LMs with TEMPORALWIKI. Section 5.1 describes the experimental settings. Section 5.2 describes the baseline methodologies for updating LMs. Section 5.3 shows evaluation results on the training corpora. Section 5.4 presents the experimental results on TWIKI-PROBES.

Experimental Settings
For our baseline language model (LM), we continue pretraining GPT-2 Large (Radford et al., 2019) (774M parameters). We first compare the baseline performances of updating GPT-2 with TWIKI-DIFFSETS against updating it with entire Wikipedia snapshots, and evaluate each update using TWIKI-PROBES. We also implement continual learning methods from the literature known for mitigating the catastrophic forgetting that occurs when updating GPT-2 with only TWIKI-DIFFSETS. Further details of the experimental settings are provided in Appendix D.

Baseline Models
RecAdam We implement a regularization-based continual learning method for training large LMs called RECADAM (Chen et al., 2020), which places a stronger independence assumption among the model parameters, overcoming the limitations of applying traditional methods such as EWC (Kirkpatrick et al., 2017) to training large language models. We set the hyperparameters of the optimizer identical to the original implementation.
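At its core, this family of regularization methods adds a quadratic penalty that pulls the updated parameters toward their pretrained values. The sketch below shows only that core idea, with a uniform weight; it omits RecAdam's annealing schedule between the task loss and the penalty as well as per-parameter importance weights.

```python
def regularized_loss(task_loss, params, pretrained_params, lam):
    """Task loss plus a quadratic penalty anchoring parameters to their
    pretrained values (the core idea behind EWC/RecAdam-style methods)."""
    penalty = sum((p - p0) ** 2 for p, p0 in zip(params, pretrained_params))
    return task_loss + lam * penalty
```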

Mix-review
We implement a rehearsal-based continual learning method for training large LMs called MIX-REVIEW (He et al., 2021), which mixes in random subsets of the initial pretraining data (the 08.2021 Wikipedia data). We fix the mix-ratio to 2 in our experiments.
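Rehearsal with a fixed mix-ratio can be sketched as below: for every example from the new corpus, a multiple of that count is sampled from the original pretraining corpus. This is a simplification; the original Mix-Review also decays the ratio over training, which is omitted here.

```python
import random

def mix_review_data(new_data, pretrain_data, mix_ratio=2, seed=0):
    """Combine the new corpus with `mix_ratio` times as many examples
    sampled from the initial pretraining corpus, then shuffle."""
    rng = random.Random(seed)
    n_old = min(len(pretrain_data), mix_ratio * len(new_data))
    mixed = list(new_data) + rng.sample(pretrain_data, n_old)
    rng.shuffle(mixed)
    return mixed
```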
LoRA We implement a parameter-expansion-based continual learning method called LORA (Hu et al., 2022), which freezes the original parameters while adding trainable rank-decomposition matrices into each layer. We use hyperparameters identical to the optimal setting of the original implementation.
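The rank-decomposition idea can be illustrated with a toy, dependency-free forward pass: the frozen weight W is augmented by a low-rank product B·A, and only A and B would be trained. The dimensions and scaling below are illustrative, not the paper's configuration.

```python
def lora_forward(x, W, B, A, alpha=1.0):
    """Compute y = x @ (W + alpha * B @ A) with plain Python lists.

    x: input vector of length d_in; W: frozen d_in x d_out weight;
    B: d_in x r and A: r x d_out are the trainable low-rank factors.
    """
    d_in, d_out, r = len(W), len(W[0]), len(A)
    # Low-rank update delta = alpha * B @ A, same shape as W.
    delta = [[alpha * sum(B[i][k] * A[k][j] for k in range(r))
              for j in range(d_out)] for i in range(d_in)]
    return [sum(x[i] * (W[i][j] + delta[i][j]) for i in range(d_in))
            for j in range(d_out)]
```

With B initialized to zeros, as in LoRA, the adapted layer initially reproduces the frozen layer's output exactly, so training starts from the original model's behavior.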

K-Adapter
We implement another parameter-expansion-based continual learning method, K-ADAPTER (Wang et al., 2021), which freezes the original parameters while adding additional adapters (an increase of 103M parameters) to the LM.5

5 We add the additional parameters once for the updates from 08.2021. Exploring the optimal interval at which to add parameters for ever-evolving LMs is left for future work.

Intrinsic Evaluation
We first perform intrinsic evaluation by measuring the perplexity of the baseline models on their training corpora. For each month, we measure the model's perplexity on TWIKI-DIFFSETS and NON-TWIKI-DIFFSETS, where the latter refers to the subset of the month's entire Wikipedia snapshot that does not include the data from TWIKI-DIFFSETS. We sample 10,000 input instances from each subset with a fixed length of 512 and measure the perplexity on proper noun tokens determined by a Part-of-Speech (POS) tagger (Honnibal and Montani, 2017), as in Lazaridou et al. (2021), which can be considered a proxy for tokens containing factual knowledge. Therefore, the result on NON-TWIKI-DIFFSETS indicates performance on unchanged knowledge, while the result on TWIKI-DIFFSETS corresponds to updated and new knowledge. Figure 2 shows the relative perplexity of each baseline method compared to INITIAL (i.e., each model's perplexity divided by that of INITIAL; lower is better).
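Given per-token log-probabilities from the LM and a boolean mask marking proper-noun positions from the POS tagger, the masked perplexity can be computed as below. This is a minimal sketch; obtaining the log-probabilities and POS tags is assumed to happen upstream.

```python
import math

def proper_noun_perplexity(token_logprobs, is_proper_noun):
    """Perplexity over proper-noun tokens only, a proxy for factual tokens.

    token_logprobs: natural-log probabilities the LM assigned to each token.
    is_proper_noun: boolean mask of the same length from a POS tagger.
    """
    selected = [lp for lp, keep in zip(token_logprobs, is_proper_noun) if keep]
    return math.exp(-sum(selected) / len(selected))
```

The relative perplexity reported in Figure 2 would then be this value divided by the same quantity computed for INITIAL.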
Results on NON-TWIKI-DIFFSETS show that the relative perplexity of DIFF increases over time while that of FULL remains constant, which implies that forgetting occurs when the LM is trained only on TWIKI-DIFFSETS. The relative perplexities of continual learning methods increase less rapidly than DIFF, which means that applying continual learning mitigates catastrophic forgetting. MIX-REVIEW, especially, shows the least amount of forgetting among the continual learning methods, which indicates that training on the past corpus is effective for retaining performance on previous training corpora in terms of perplexity.
On the other hand, the results on TWIKI-DIFFSETS show the opposite trend: the relative perplexity of DIFF is much lower than that of FULL. One thing to note is that the perplexity of FULL is very similar to that of INITIAL on TWIKI-DIFFSETS, which suggests that updating LMs on entire Wikipedia snapshots hinders the effective learning of changed data compared to DIFF, despite both having seen the same instances of TWIKI-DIFFSETS during training for the same number of iterations. Among continual learning methods, K-ADAPTER and LORA show higher overall perplexities than DIFF, while MIX-REVIEW and RECADAM show similar perplexity.

Extrinsic Evaluation on TWIKI-PROBES
Performing only intrinsic evaluation on the training corpora is not sufficient because the intrinsic evaluation itself only tests the capability of the LMs for memorization (McCoy et al., 2021). Through extrinsic evaluation with TWIKI-PROBES (Section 3.2), we specifically focus on evaluating the factual knowledge of the LMs after each update. Placing equal importance on stability (UNCHANGED) and plasticity (CHANGED), we show the average of the perplexities on UNCHANGED and CHANGED as well as the individual perplexities in Table 3, and show a bar graph of the average perplexities in Figure 3.
As shown in Table 3, DIFF and all continual learning methods show better overall performance on CHANGED factual instances than INITIAL in all months, bolstering the results from the intrinsic evaluation. For UNCHANGED, however, DIFF suffers from catastrophic forgetting, showing consistent performance degradation as the number of updates increases. In contrast, continual learning methods effectively mitigate much of the catastrophic forgetting during temporal language modeling, resulting in lower perplexity on UNCHANGED, except RECADAM, which performs worse as the number of updates increases. K-ADAPTER, especially, shows surprising results on UNCHANGED, outperforming even FULL throughout all of the months. Moreover, all continual learning methods surpass or are on par with DIFF on CHANGED factual instances, showing that the ability to learn new knowledge (plasticity) is not sacrificed to preserve previous knowledge (stability).
Moreover, as shown in the average perplexity column of Table 3 and in Figure 3, K-ADAPTER shows the most robust performance throughout the time periods. It is important to note that K-ADAPTER is around 12 times more computationally efficient than FULL in terms of total training time, under the same computational constraint. DIFF also outperforms FULL in all months except 1011 (the October-to-November interval), showing that temporal language modeling itself is an effective approach for the overall stability-plasticity trade-off. We note that, as also shown in previous works (Lazaridou et al., 2021), the results in Table 3 present an overall high perplexity (>200) because the sentences in TWIKI-PROBES are not natural sentences; they are factual phrases synthetically generated from a naive concatenation of Subject, Relation, and Object. We address this issue via light-tuning in Appendix E.

Effect of Temporal Misalignment
We quantify the effect of temporal misalignment on each method by training the LMs and evaluating their zero-shot perplexity on CHANGED instances of TWIKI-PROBES with various time intervals between training and evaluation. Among continual learning methods, we select K-ADAPTER since it shows the most robust performance in extrinsic evaluation across all time periods. As shown in Figure 4, the FULL method is mostly influenced by the number of training updates and not much by whether there is temporal alignment. Since FULL is continuously pretrained on the entire Wikipedia corpus each month, it has likely seen the data containing CHANGED factual instances multiple times, leading to lower perplexity as the number of training steps increases.7 For DIFF and K-ADAPTER, there is a general trend of strong performance when there is temporal alignment (diagonal entries), outperforming FULL with much fewer global training steps. It is important to note that K-ADAPTER shows robustness against temporal misalignment, i.e., the perplexity does not increase much even when the training and evaluation months do not match, compared to DIFF, which suffers from a more severe perplexity spike.

7 Although directly training INITIAL on the whole Wikipedia corpus of a specific month could be an alternative, we exclude it here because it would only learn the knowledge of that specific month and is thus inappropriate for a truly ever-evolving setting.

Conclusion
In this paper, we provide answers to the questions posed in Section 1. (1) How can we train ever-evolving LMs efficiently and automate the evaluation of each update? We introduce TEMPORALWIKI, a lifelong benchmark that can be used for training and evaluating ever-evolving LMs in an automated manner. It consists of TWIKI-DIFFSETS as the training corpora for temporal language modeling and TWIKI-PROBES as the evaluation datasets for measuring the stability-plasticity trade-off. (2) How does updating LMs only on new and updated data from Wikipedia compare to updating LMs on entire Wikipedia snapshots, especially in scenarios with multiple updates? Through experiments on TEMPORALWIKI, we show that updating LMs on TWIKI-DIFFSETS leads to better acquisition of new and updated knowledge than updating on entire Wikipedia snapshots, at much lower computational cost (12 times less). (3) How serious is catastrophic forgetting when LMs are updated only on new and updated data? We observe that temporal language modeling is a challenging problem, especially as the number of LM updates increases. However, the results still show an overall enhancement in terms of stability-plasticity compared to updating with entire Wikipedia snapshots, showing that temporal language modeling is an effective alternative. (4) How can we mitigate catastrophic forgetting? We find that continual learning methods (regularization, rehearsal, and parameter-expansion) for large language model training effectively mitigate forgetting and show robust performance in enhancing the overall trade-off between stability and plasticity on TWIKI-PROBES.

Limitations
As mentioned in Section 3, each Wikipedia and Wikidata update does not ensure an actual update of real-world knowledge. For example, the addition of a new Wikipedia page does not necessarily mean that all the information on the new page is new world knowledge. Likewise, existing factual knowledge may be added to Wikidata because Wikipedia and Wikidata do not cover all of the world's knowledge and may have some missing information about the world. Moreover, one aspect that is not covered in this work is knowledge deletion. While maintaining Wikipedia and Wikidata, volunteer editors not only update or add new information but also delete information that is incorrect or misinformed. As removing the misinformation and bias stored in LMs is an important issue and necessary for truly ever-evolving LMs, future work should address this aspect by utilizing deleted information from general knowledge sources such as Wikipedia.

A Examples of TWIKI-DIFFSETS and TWIKI-PROBES
Figure 5 shows examples of TWIKI-DIFFSETS, each of which is either an updated or a new piece of information. By comparing consecutive snapshots of the Wikipedia corpus, we keep track of changed information. Table 4 shows examples of CHANGED factual instances in TWIKI-PROBES, each aligned with the corresponding sentence in TWIKI-DIFFSETS.

B Details of Entity Types of Subject and Object
Figure 6 shows the ratio of different entity types of Subject and Object of UNCHANGED and CHANGED.

C Details of Relation Distribution
The distribution of Relation for UNCHANGED, CHANGED factual instances in TWIKI-PROBES is shown in Figure 7.

D Continual Pretraining and Light Tuning Configuration
For each LM update, we use 8 32GB V100 GPUs with a global batch size of 64 and a fixed input sequence length of 512. We use a maximum learning rate of 1e-4 and a one-cycle learning rate scheduling policy (Smith, 2018). For light-tuning, training is done for only one epoch with a learning rate of 1e-5 and a batch size of 32. Input and output sequence lengths are set to 25. For continual learning-based methods, we unfreeze all of the parameters during light-tuning, following Jang et al. (2022).

E Light-tuning results with TWIKI-PROBES
Using the pre-defined templates of LAMA (Petroni et al., 2019) seems to be an option, but we find that those templates do not fit our experiments well because there is a considerable distribution gap between LAMA and TWIKI-PROBES; over half of the instances of TWIKI-PROBES, especially for CHANGED, are filtered out when applying the templates. Instead, to alleviate the distributional shift that causes high zero-shot perplexity, we light-tune the LMs on 500 instances randomly sampled from Wikidata that do not overlap with instances from TWIKI-PROBES (details in Appendix F). Unlike finetuning, light-tuning lets the LM learn only the input and output distribution of the task, avoiding the problem of test-train overlap pointed out by Lewis et al. (2021). Table 5 shows the results of light-tuning, which demonstrate a similar trend to the zero-shot performance. Although light-tuning avoids the problem of test-train overlap, the results are largely affected by the sampled instances used for tuning, so a zero-shot evaluation setting is preferred for reliability.
Many knowledge-intensive tasks such as closed-book question answering (Roberts et al., 2020; Petroni et al., 2021; Jang et al., 2022) or slot filling (Petroni et al., 2021) use accuracy, EM, or F1 score for evaluation. We also show the F1 score on TWIKI-PROBES in Table 6. The overall trend is consistent with the zero-shot perplexity metric; K-ADAPTER shows robust performance for both UNCHANGED and CHANGED.

F Light-Tuning Data
We sample 500 instances from Wikidata for each time step that do not overlap with instances from TWIKI-PROBES.

Table 4: Examples of successful alignment between CHANGED factual instances from TWIKI-PROBES-0910 and articles from TWIKI-DIFFSETS-0910. The alignment is considered successful because, for the given factual instance, the Subject matches the title of the Wikipedia page and the Object exists in the article.

Figure 1 :
Figure 1: An overview of using TEMPORALWIKI, consisting of TWIKI-DIFFSETS and TWIKI-PROBES to train and evaluate ever-evolving LMs, respectively. Differences between Wikipedia snapshots at different points in time are used for temporal language modeling, and categorized factual instances in the corresponding Wikidata snapshots are used for temporal evaluation.
We describe the baseline methods used for training and evaluation, namely INITIAL, FULL, DIFF, RECADAM, MIX-REVIEW, K-ADAPTER, and LORA.

Initial As the starting checkpoint for all of the experiments, we continually pretrain the pretrained GPT-2 from Radford et al. (2019) on the 08.2021 Wikipedia snapshot for four epochs in total (around 546K global steps), so that the initial GPT-2 used for all of the experiments is updated with the last two years of world knowledge. We denote this checkpoint as INITIAL; it serves as the initial checkpoint for all of the other methods.

Full We start from INITIAL and continue pretraining it on the entire Wikipedia snapshot of each month in a sequential manner. For example, after training on the 09.2021 Wikipedia snapshot from INITIAL, we continue training on the 10.2021 Wikipedia snapshot and then move on to the next snapshot. We denote the resulting model as FULL. We iterate through the training data only once, which corresponds to an average of 4.6 billion token updates (140K global steps) for each month.

Diff We start from INITIAL and continue pretraining it on TWIKI-DIFFSETS in a sequential manner. We denote the resulting model as DIFF. As with FULL, we iterate through the training data only once, an average of 347 million token updates (12K global steps) for each month.

Figure 2 :
Figure 2: Relative proper noun perplexity of FULL, DIFF, K-ADAPTER, LORA, RECADAM, and MIX-REVIEW compared to INITIAL on TWIKI-DIFFSETS and NON-TWIKI-DIFFSETS for each month. A lower ratio indicates better performance. The performance of DIFF (orange) and RECADAM (yellow) in (b) is almost identical.

Figure 3 :
Figure 3: Average overall perplexity on TWIKI-PROBES. We average the perplexities of UNCHANGED and CHANGED with equal importance placed on stability and plasticity. The x-axis depicts the two-month intervals. A lower score indicates better performance.

Figure 4 :
Figure 4: The zero-shot perplexity of the LMs updated and evaluated on various time intervals of CHANGED of TWIKI-PROBES, showing the effect of temporal misalignment. The better the results, the darker the cell is colored. The vertical axis represents the training month.

Figure 5 :
Figure 5: Examples of TWIKI-DIFFSETS constructed by comparing the November 2021 and December 2021 Wikipedia dumps. (a) shows an instance of updated information and (b) shows an instance of new information.

Figure 6 :
Figure 6: Entity types of Subject and Object in TWIKI-PROBES.
Example rows of Table 4, each a (Subject, Relation, Object) triple with its aligned sentence from TWIKI-DIFFSETS:
- (Shang-Chi and the Legend of the Ten Rings, instance of, Film): "[...] Shang-Chi and the Legend of the Ten Rings is a 2021 American superhero film based on Marvel Comics featuring the character Shang-Chi. [...]"
- (Out of Shadows, language of work or name, Spanish): "[...] It was later translated into Portuguese, Turkish and Spanish. [...]"
- (Mario Chalmers, member of sports team, Indios de Mayagüez): "[...] On September 27, 2021, Chalmers signed with Indios de Mayagüez of the Baloncesto Superior Nacional. [...]"

Figure 7 :
Figure 7: TWIKI-PROBES distribution of the top 30 Relation.
Through Algorithm 2, we categorize each factual instance as either UNCHANGED or CHANGED.

Algorithm 2 Generating TWIKI-PROBES
Require: Wikidata snapshots WD_prev and WD_recent, where WD_recent is more recent. Un, C := arrays that store UNCHANGED and CHANGED factual instances, respectively.
  for all facts (s_r, r_r, o_r) ∈ WD_recent do
    P ← {(s, r, o) ∈ WD_prev | s = s_r}
    if P = ∅ then
      C.append((s_r, r_r, o_r))
    else if r_r does not appear as a relation in P then
      C.append((s_r, r_r, o_r))
    else if r = r_r and o = o_r for some (s, r, o) ∈ P then
      Un.append((s_r, r_r, o_r))
    else
      C.append((s_r, r_r, o_r))
    end if
  end for

As shown in Algorithm 2, given two consecutive Wikidata snapshots, a single TWIKI-PROBE is constructed, which is used to evaluate an LM updated with the corresponding TWIKI-DIFFSET. Algorithm 2 categorizes instances with a new Relation, or instances with the same Relation but a new Object, as CHANGED, and unchanged instances as UNCHANGED.

Table 1 :
Statistics of TWIKI-DIFFSETS. The two digits indicate the month of the year 2021 from which the Wikipedia snapshot was obtained. The four digits for a TWIKI-DIFFSET indicate the months of the two snapshots being compared. For instance, TWIKI-DIFFSET-0809 indicates the difference between August (08) and September (09).

Table 2 :
Detailed statistics of TWIKI-PROBES during construction. Un and C represent UNCHANGED and CHANGED factual instances, respectively.

Table 7 :
Statistics of the data used for Light-Tuning