Time-Aware Language Modeling for Historical Text Dating



Introduction
The temporal dimension of texts is critical to many natural language processing (NLP) tasks, such as information retrieval (Kanhabua and Nørvåg, 2016), question answering (Shang et al., 2022; Stricker, 2021), text summarization (Cao and Wang, 2022; Martschat and Markert, 2018), event detection (Sprugnoli and Tonelli, 2019), and sentiment analysis (Ren et al., 2016). Timestamps of documents provide essential clues for understanding and reasoning in time-sensitive tasks, but they are not always available (Chambers, 2012). One way to solve this problem is to predict when a document was written from its content, a task known as automatic text dating (Dalli, 2006).
Text dating has been widely used in computational sociology and digital humanities studies. One typical application is dating historical documents for the construction of digital libraries (Baledent et al., 2020). This task is also called historical text dating (Boldsen and Wahlberg, 2021), diachronic text evaluation (Popescu and Strapparava, 2015), or period classification (Tian and Kübler, 2021). Compared to other dating tasks, historical text dating is more challenging because explicit temporal mentions (e.g., time expressions) that help to determine the written date of a document usually do not appear in it (Toner and Han, 2019). Current research on historical text dating therefore focuses on document modeling, trying to find the relationship between time and linguistic features (Boldsen and Wahlberg, 2021).
There are two main issues in historical text modeling. One is learning word representations from diachronic documents. Current research either learns static word embeddings over the whole corpus (Liebeskind and Liebeskind, 2020; Yu and Huangfu, 2019) or learns dynamic word representations using pre-trained models (Tian and Kübler, 2021). However, neither takes into account the relation between time and word meaning. For example, broadcast usually refers to sowing seeds before the 20th century; after that, it mostly means transmitting by radio or TV. Studies on language evolution help to relate the same words across different time periods, since they often map them into the same semantic space (Ferri et al., 2018). However, how to apply such methods to document modeling for historical text dating is still unexplored.
The other is document modeling for historical texts.
Initial work on neural network-based document modeling employs convolutional neural networks (CNN) or recurrent neural networks (RNN) (Liebeskind and Liebeskind, 2020; Yu and Huangfu, 2019), while recent research turns to pre-trained models like BERT (Tian and Kübler, 2021) or RoBERTa (Li et al., 2022). However, these studies treat time as a prediction target rather than a variable in modeling, which does not help to capture the temporal characteristics of diachronic documents. In fact, evidence from research on language modeling shows that time-based models help to capture the semantic change of language, and consequently improve the performance of time-sensitive tasks (Agarwal and Nenkova, 2022; Rosin et al., 2022; Röttger and Pierrehumbert, 2021). To this end, some studies attempt to incorporate the time factor into language modeling, e.g., learning explicit temporal expressions via language models (Dhingra et al., 2022; Rosin et al., 2022). However, as mentioned above, these methods are not suitable for historical text dating due to the lack of explicit time expressions in historical texts.
In this paper, we present a time-aware language model named TALM, which introduces the time factor into the modeling procedure for historical text dating. Inspired by work in language evolution, we propose to learn word representations over documents of different time periods separately. In this way, each word has time-specific variants in the vocabulary. This matters because a document should be represented by word embeddings that are temporally consistent with it. We also apply an alignment approach that maps all the time-specific variants of a word into the same semantic space to make them comparable. In particular, we propose temporal adaptation, a representation learning approach that learns word representations whose time periods are consistent with the documents they belong to. This approach learns temporal word representations from two kinds of knowledge: time-specific variants of words and their contexts, which depict the temporal attribute of words from multiple perspectives. We also build a hierarchical model for long document modeling, where temporal adaptation is applied for word representation. We validate our model on a large-scale Chinese diachronic corpus and an English diachronic corpus. Experimental results show that our model effectively captures implicit temporal information of words and outperforms state-of-the-art approaches in historical text dating. Our contributions can be summarized as follows:
• We propose a temporal adaptation approach that enables word representation models to capture both temporal and contextualized information by learning the distributed representations of diachronic documents;
• We propose a time-aware language model for historical texts, which uses the temporal adaptation approach to obtain time-specific variants of words that are temporally consistent with the documents they belong to, thereby improving the ability to model diachronic documents;
• We report the superior performance of our model compared to state-of-the-art models on the historical text dating task, showing its effectiveness in capturing implicit temporal information of words.

Related Work
Automatic text dating follows the research roadmap from traditional machine learning to deep learning technologies, like many other NLP tasks. Early studies employ hand-crafted features to recognize temporal expressions within documents (Dalli, 2006; Kanhabua and Nørvåg, 2016; Niculae et al., 2014), which suffer from problems of generalization and coverage. Traditional machine learning methods focus on statistical features and learning models, such as Naïve Bayes (Boldsen and Wahlberg, 2021), SVM (Garcia-Fernandez et al., 2011) and Random Forests (Ciobanu et al., 2013). Recent studies turn to deep learning methods, and experiments show their superior performance compared to traditional machine learning approaches (Kulkarni et al., 2018; Liebeskind and Liebeskind, 2020; Yu and Huangfu, 2019; Ren et al., 2022). Pre-trained models are also leveraged to represent texts for the dating task, such as Sentence-BERT (Massidda, 2020; Tian and Kübler, 2021) and RoBERTa (Li et al., 2022). Pre-trained models show state-of-the-art performance on the text dating task; however, few of them consider the time attribute of words. Language evolution studies explore this issue by modeling words from different time periods. Such work can be categorized into three classes: 1) learning word embeddings for each time period separately, then mapping them into the same space via alignment methods (Alvarez-Melis and Jaakkola, 2018; Hamilton et al., 2016); 2) learning word embeddings for one time period first, then using them as the initial values to train word embeddings for other time periods (Di Carlo et al., 2019); 3) learning unified word embeddings by introducing temporal variables into the learning procedure (Tang, 2018; Yao et al., 2018). However, static word embedding has drawbacks in dealing with polysemous words (Asudani et al., 2023), which may make it unsuitable for temporal word representation learning. In addition, there are still few studies on language evolution modeling for historical text dating.
On the other hand, research on temporal pre-trained models shows that encoding temporal information in language modeling is beneficial in both upstream and downstream tasks (Agarwal and Nenkova, 2022; Röttger and Pierrehumbert, 2021). Efforts have been made to build models that can learn temporal information, such as incorporating timestamps of texts into the input of the model (Pokrywka and Graliński, 2022; Pramanick et al., 2021), prefixing the input with a temporal representation (Dhingra et al., 2022; Chang et al., 2021), or learning masked temporal information (Rosin et al., 2022; Su et al., 2022). However, there is still insufficient discussion of document modeling without explicit temporal mentions.

The Approach
In this section, we explain the mechanism of the time-aware language modeling approach based on temporal adaptation, and how it is applied to the task of historical text dating. Figure 1 shows the overall architecture of the proposed model, which consists of three main modules: word representation learning and alignment, temporal adaptation, and diachronic document modeling.

Word Representation Learning and Alignment
Word representation alignment is a pipeline: word representations are first learned in an unsupervised manner, and then aligned in the same semantic space. Note that a separate word representation model is trained on the documents of each time period. In our method we use BERT (Devlin et al., 2018) as the language model. Let D^t be the document collection of the t-th time period. Let X^t = {x^t_1, x^t_2, ..., x^t_n} be a sentence, where X^t ∈ D^t and n is its length, and let E^t = {e^t_1, e^t_2, ..., e^t_n}, where e^t_i is the distributed representation of x^t_i. We train the model with the masked language modeling task on each period separately, and extract the embedding layers as word representations. We follow the idea of Hamilton et al. (2016), mapping the word representations of different time periods into the same semantic space with the orthogonal Procrustes algorithm, which makes them comparable. Let W^t ∈ R^{d×|V|} be the word vector matrix at time period t. We use the singular value decomposition to solve for an orthogonal matrix Q, so that the cosine similarities within each word vector matrix remain unchanged while the distance between the matrices of different time periods is minimized in the vector space.
The objective for alignment is shown below:

R^t = argmin_{Q: Q^T Q = I} ‖Q W^t − W^{t+1}‖_F,

where R^t ∈ R^{d×d}.
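As a concrete reading of this objective, the orthogonal Procrustes problem above has a closed-form solution via the SVD. The sketch below (an illustration under the stated shapes, not the authors' released code; the function name is ours) solves for the rotation aligning one period's d × |V| word vector matrix to the next period's:

```python
import numpy as np

def align_procrustes(W_t, W_next):
    """Orthogonal Procrustes alignment of period-t word vectors to the
    next period's space: R = argmin_{Q^T Q = I} ||Q W_t - W_next||_F.
    An orthogonal Q preserves cosine similarities within each period.
    Both inputs are d x |V| word vector matrices."""
    # The optimal rotation is the orthogonal polar factor of W_next W_t^T.
    U, _, Vt = np.linalg.svd(W_next @ W_t.T)
    return U @ Vt  # d x d orthogonal matrix
```

Because the solution is orthogonal, distances and angles inside each period's space are untouched; only the coordinate frames are brought into agreement.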

Temporal Adaptation
The meaning of words always changes over time.
Hence it is necessary to consider the model's ability to adapt to different meanings of words over time. This problem is similar to domain adaptation (Ramponi and Plank, 2020), where the model should be able to transfer from the source semantic space to the target one. In the historical text dating task, the time period of the document to be predicted is not known in advance, so the model needs to determine the word representations of a time period that are adaptive to the document. We call this temporal adaptation. This subsection describes the temporal adaptation approach. The middle part of Figure 1 represents the temporal adaptation module. Its input is a document, which consists of segmented blocks. We encode each word with three embeddings: a token embedding, a position embedding, and a block embedding. The block embedding of a word indicates the sequential position of the sentence it belongs to. The word embedding is defined as follows:

e_i = TokenEmb(x_i) + PosEmb(i) + BlockEmb(b_i),

where b_i is the index of the block containing x_i. The main part of the temporal adaptation module is a Transformer encoder; h_i denotes the hidden vector of each token obtained from it. One purpose of this module is to adapt h_i to e^t_i, so that the hidden representations fit the temporal domain.
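The three-part input encoding above can be sketched as a sum of lookup tables. This is an illustrative sketch only: the table sizes and the random initialization are our assumptions, and in the real model the tables are learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
d, vocab_size, max_len, max_blocks = 8, 100, 30, 10

# Hypothetical lookup tables; in the real model these are learned.
token_emb = rng.normal(size=(vocab_size, d))
pos_emb = rng.normal(size=(max_len, d))
block_emb = rng.normal(size=(max_blocks, d))

def embed(token_ids, block_id):
    """Input encoding sketch: e_i = TokenEmb(x_i) + PosEmb(i) + BlockEmb(b),
    where b is the index of the block (sentence group) the word sits in."""
    positions = np.arange(len(token_ids))
    return token_emb[token_ids] + pos_emb[positions] + block_emb[block_id]
```

The block embedding is shared by all words of a block, which is what lets the later block transformer recover the order of blocks.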
Figure 1: The architecture of our model. The left part is the word representation and alignment module, which learns and aligns the temporal word representations in each time period; the middle part is the temporal adaptation module, which learns the time-specific variants of words and their contextual information; the right part is the hierarchical model for diachronic documents, which incorporates the results of temporal adaptation into language modeling.

On the other hand, contexts also indicate temporal information, which means modeling contextual information helps to determine the time period a text belongs to. To this end, we design a masked language modeling task to learn the contextual information of the input texts. Unlike other temporal masked language models, we do not mask temporal information, as it does not explicitly appear in the text. Instead, we use a simple method that masks a portion of words at random, based on the assumption that the contexts of a word also indicate temporal information, since they come from the same time period. Hence we follow the masked language modeling task of the BERT training process.
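The random masking step can be sketched as below. This is a minimal BERT-style sketch under our own simplifications (a single `[MASK]` symbol, no 80/10/10 replacement split; the function name is ours):

```python
import random

def mask_tokens(tokens, mask_rate=0.15, mask_sym="[MASK]", seed=0):
    """BERT-style random masking sketch: no temporal mentions are targeted;
    a random portion of words is masked, on the assumption that ordinary
    contexts from the same period carry the temporal signal."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            targets[i] = tok        # positions the model must predict
            masked.append(mask_sym)
        else:
            masked.append(tok)
    return masked, targets
```

The model is then trained to predict the `targets` entries from the unmasked context, which is exactly where the period-specific usage signal enters.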

Diachronic document modeling
The aim of diachronic document modeling is to obtain a document-level representation for long historical texts. The novelty of this approach lies in the combination of hierarchical document modeling and temporal adaptation. By doing so, temporal features are incorporated into each layer of the transformer encoder, allowing the hierarchical model to learn implicit temporal information according to the knowledge gained from temporal adaptation. Experiments show that this method improves the performance of the text dating task.
Inspired by Grail et al. (2021), we propose a hierarchical model for long diachronic document modeling. The hierarchical structure consists of a word transformer and a block transformer. The word transformer is responsible for learning local token representations in each block, while the block transformer learns the contextual information of blocks. Since the transformer encoder at the block encoding layer lacks the sequential information of the blocks, we employ three types of embedding to indicate word order and block order, as mentioned in equation 3. After block transformer encoding, the model obtains the global block-aware representations and passes them to the next layer for further learning.
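The two-level word/block scheme can be sketched as follows. The encoders are stand-ins (here passed as callables) and mean pooling over each block is our simplifying assumption, not necessarily the pooling the paper uses:

```python
import numpy as np

def encode_document(blocks, word_encoder, block_encoder):
    """Hierarchical encoding sketch: word_encoder and block_encoder stand
    in for the word and block transformers. Each block is an (n, d) array
    of word embeddings; blocks are pooled into one vector each, then the
    block transformer contextualizes the pooled vectors jointly."""
    block_vecs = np.stack([word_encoder(b).mean(axis=0) for b in blocks])
    return block_encoder(block_vecs)  # (num_blocks, d) block-aware states
```

The point of the split is computational: attention runs over block-length sequences at the word level and over the number of blocks at the document level, instead of over the full token sequence at once.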
In order to make use of the temporal representations of words in diachronic document modeling, we explore bridging approaches within the Transformer encoder layer to incorporate the temporal feature representations during hierarchical modeling. Specifically, three bridging approaches are built:
• Feature Aggregation: This approach directly adds the temporal representations of words to the input at each layer.
• Dot Attention: The document input representation is used as the query, while the temporal feature representations of the words are employed as the key and value. Dot-product attention is applied to each input sample to integrate the temporal information.
• Additive Attention: Similar to dot attention, but an additional attention parameter matrix is introduced. The document input representation serves as the query, while the temporal feature representations of the words are used as the key and value. Using the additive attention mechanism, the temporal feature information is weighted and merged into the original input at each step.
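The additive-attention bridge above can be sketched as below. The residual merge (adding the attended temporal features back onto the input) and the parameter names `Wq`, `Wk`, `v` are our assumptions about how the extra attention matrices are used:

```python
import numpy as np

def softmax(x, axis=-1):
    z = np.exp(x - x.max(axis=axis, keepdims=True))
    return z / z.sum(axis=axis, keepdims=True)

def additive_attention_bridge(H, E, Wq, Wk, v):
    """Additive-attention bridging sketch: H (n, d) holds document input
    representations (queries), E (n, d) the temporal word representations
    (keys/values). score(i, j) = v^T tanh(Wq h_i + Wk e_j); the weighted
    temporal information is then merged back into the input residually."""
    # (n, 1, a) + (1, n, a) -> (n, n, a) pairwise pre-activations
    pre = np.tanh((H @ Wq.T)[:, None, :] + (E @ Wk.T)[None, :, :])
    A = softmax(pre @ v, axis=-1)   # (n, n) attention weights over E
    return H + A @ E
```

Compared with dot attention, the learned matrices Wq, Wk and vector v let the model control how much temporal information is selected, which matches the paper's explanation of why this variant performs best.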
After N-layer encoding, we utilize an attention mechanism to determine the importance of the document blocks. Attention scores represent the importance of each block and are used to weight its feature information. Finally, we obtain the representation of the document, which is then used for historical text dating.

Training objective
The overall training objective is a combination of the losses from two modules: temporal adaptation and diachronic document modeling.
For the temporal adaptation module, we have two training objectives. The first is to transform the hidden representations learned by the Transformer encoder into the target temporal domain, i.e., to minimize the distance between h_i and ē^t_i. Hence we adopt the mean squared error as our loss function, defined as:

L_MSE = (1/N) Σ_{i=1}^{N} ‖h_i − ē^t_i‖²,

where N is the number of tokens. By doing so, the model maps the word representations of the input text to the semantic space of the corresponding time period, making the representations more appropriate.
The other training objective of the temporal adaptation module is masked language modeling (MLM), which maximizes the prediction probability of the masked words. Let Π = {π_1, π_2, ..., π_K} denote the indexes of the masked tokens in the sentence X, where K is the number of masked tokens. Let X_Π be the collection of masked tokens in X, and X_{−Π} the collection of unmasked tokens. The learning objective of MLM is defined as:

L_MLM = − Σ_{k=1}^{K} log P(x_{π_k} | X_{−Π}).

For the diachronic document modeling module, we use the cross-entropy loss as the training objective for the historical text dating task. In this loss function, y_{i,c} ∈ {0, 1} represents whether the true class of document i is class c, and C is the number of classes. The cross-entropy is computed as:

L_CLS = − Σ_i Σ_{c=1}^{C} y_{i,c} log p_{i,c},

where p_{i,c} is the predicted probability that document i belongs to class c. Finally, we aggregate the three losses to form the mixed training objective of our model:

L = L_CLS + L_MSE + L_MLM,

where L_CLS is the loss function of text dating, and L_MSE and L_MLM are the loss functions of temporal adaptation.
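The three losses and their unweighted sum can be sketched numerically as below. This is an illustration of the loss shapes only (the function names are ours); in practice each term is computed by the corresponding module during training:

```python
import numpy as np

def mse_loss(H, E_bar):
    """L_MSE: mean squared distance between hidden states h_i and the
    aligned temporal embeddings, averaged over the N tokens."""
    return float(np.mean(np.sum((H - E_bar) ** 2, axis=-1)))

def nll_loss(log_probs, targets):
    """Negative log-likelihood over the given positions; the same form
    serves both L_MLM (masked tokens) and L_CLS (period labels, where a
    one-hot y_{i,c} picks out a single log-probability per document)."""
    return float(-np.mean(log_probs[np.arange(len(targets)), targets]))

def total_loss(l_cls, l_mse, l_mlm):
    """Unweighted sum of the three objectives, as in the formula above."""
    return l_cls + l_mse + l_mlm
```

Note the sum uses no weighting coefficients, matching the formula in the text; a weighted combination would be a natural extension but is not what the paper states.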

Experiments
This section presents the experimental results and a discussion of the proposed model. Specifically, we first give the details of the experimental settings, then introduce the datasets and their preprocessing. In subsection 4.3, we show the results of the comparative experiments, the ablation tests, and the parameter evaluations.
Finally, a discussion of our model is given.

Dataset
Our experiments are conducted on two datasets. One is a Chinese dataset containing the documents of the Chinese official dynastic histories (a.k.a. the Twenty-Four Histories), spanning from 2500 B.C. to 1600 A.D.; the other is an English dataset, the Royal Society Corpus (Kermes et al., 2016), spanning from 1660 to 1880. Both datasets have a long time span, making them suitable for exploring the performance of temporal modeling. Considering that some historical records in the Twenty-Four Histories were not written in the periods they describe but compiled and summarized by later writers, we assign each document a timestamp based on the time it was finished. We divide the corpus into smaller blocks for the convenience of analysis. Specifically, we extract every thirty sentences from the original corpus as one block, with each sentence containing a maximum of thirty Chinese characters or English words. These samples are labeled with their corresponding time periods and randomly split into training, validation, and testing sets in an 8:1:1 ratio. Statistics of the datasets are shown in Table 1. As shown in Table 2, on the Twenty-Four Histories Corpus our proposed model achieves the best performance in the text dating task with an F1 score of 84.99%, outperforming all baseline models.
We can see that methods based on pre-trained language models outperform the traditional neural network-based approaches. Longformer, which is specifically designed to handle long text input, achieves the best F1 score among the baseline models at 81.6%. However, it is challenging to incorporate temporal information into Longformer, as common attention mechanisms have high computational complexity for long document inputs. Hence, in this study, we adopt a hierarchical document modeling structure, which reduces the computational complexity for long input sequences. Among the baseline models, both HAN and Hierarchical BERT utilize hierarchical document structures and achieve competitive performance. Our proposed model, which leverages temporal adaptation and bridging structures in the learning process, outperforms these hierarchical models by 6.9% and 3.39% in F1 score respectively, demonstrating the effectiveness of incorporating time information in text dating tasks.
Among the methods used for dating, SBERT and RoBERTa achieve similar performance, while LSTM performs relatively poorly due to its limited modeling capability. However, these baseline models primarily focus on the language features of the text itself and do not incorporate the temporal information associated with the text. In contrast, our proposed model incorporates the corresponding temporal information of the input through different mechanisms. The additive attention variant achieves the best performance, outperforming SBERT and RoBERTa by 4.28% and 4.24% in F1 score, respectively, demonstrating the advantage of TALM in learning temporal information.
On the Royal Society Corpus, our model also achieves the best performance among all models, similar to the results on the Chinese corpus. Specifically, our model with additive attention gains 2.58% in F1 score over the best baseline, SBERT. This suggests that our model learns temporal information and incorporates it effectively into text dating in both Chinese and English texts.
Comparison of Bridging Methods. We use different bridging approaches to give the model the ability of time-awareness. Here we discuss the impact of these approaches on model performance. As shown in Table 2, each of these approaches leads to improved performance, showing the effectiveness of integrating temporal information through bridging mechanisms. Overall, the additive attention approach achieves the best results. This attention mechanism introduces additional attention matrix parameters, which control how much temporal word representation information is selected during training. As a result, the model can learn word vector representations corresponding to specific time periods more effectively.
Flexible Evaluation Criteria. This study focuses on the historical text dating task. However, in historical corpora word meaning changes gradually, and the semantic similarity between texts from adjacent periods may be higher than between texts with a longer time span. Inspired by Li et al. (2022), we define a more flexible evaluation metric called Acc@K, which treats a prediction adjacent to the correct time period as correct. N_acc@K represents the number of correct cases, where predictions with a period class distance of ±⌊K/2⌋ from the gold label are also counted. N_all represents the total number of samples in the dataset. The metric is defined as:

Acc@K = N_acc@K / N_all.

We use Acc, Acc@3 and Acc@5 to evaluate our model. As can be seen from Table 2: 1) our model outperforms the baselines in most cases, showing its robustness under both rigid and relaxed evaluation criteria; 2) our model can better distinguish cases of adjacent time periods compared with the other baselines.
Ablation Study. In the ablation experiments, we analyze the contributions of the model components by evaluating TALM without time-specific variants of words, TALM without context learning, and TALM without temporal attention.
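The Acc@K criterion defined above can be made concrete with a short sketch (the function name is ours; period classes are assumed to be integer indices in chronological order):

```python
def acc_at_k(pred_periods, gold_periods, k):
    """Acc@K sketch: a prediction within floor(k/2) period classes of the
    gold label counts as correct, so Acc@1 reduces to plain accuracy and
    Acc@3 additionally accepts the two immediately adjacent periods."""
    tol = k // 2
    hits = sum(abs(p - g) <= tol for p, g in zip(pred_periods, gold_periods))
    return hits / len(gold_periods)
```

This makes explicit why Acc@3 and Acc@5 are always at least as high as Acc: the tolerance only relaxes the matching condition.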
When the temporal attention module is removed, the bridging module is also eliminated, so temporal adaptation and time-aware language modeling cannot be integrated with the dating task. This variant performs worst, with an F1 score of 76.69%. When the time-specific variants of words are removed, the model cannot learn temporal word representations well, leading to a decrease of 3.42% in F1 score. Similarly, when the context learning module is removed, the model cannot learn contextual information for temporal word representations well, resulting in a decrease of 6.31% in F1 score. These results indicate that all the modules play crucial roles in the text dating task.

Analysis
Case Study. In this section, we discuss the performance of our model across different time periods, analyze cases that our model fails to recognize, and investigate the impact of the temporal adaptation module through visualization. Figure 2 shows the confusion matrix of our model on the test set. We can see that it is much more difficult to predict the time period of the Southern Liang. In fact, most samples are falsely predicted as the Tang Dynasty,

[Table 2 spans here: P, R, F1, Acc, Acc@3 and Acc@5 for each method on the Twenty-Four Histories Corpus and the Royal Society Corpus; HAN (Yang et al., 2016) is among the baselines.]

with a few samples predicted as the Northern Qi. This could be attributed to the relatively short duration of each period in the Southern and Northern Dynasties; in other words, the semantic gap between these three dynasties is not very pronounced. Furthermore, the Later Jin period lasted only about ten years in ancient Chinese history, which may lead to low performance in identifying whether a document belongs to this period. This indicates that the Later Jin serves as a transitional period, with semantics and writing styles similar to the Tang and Song Dynasties. Table 4 shows that our model outperforms the state-of-the-art model RoBERTa on cases with a short historical duration. Specifically, our model gains F1 improvements of 5.28%, 9.00% and 13.40% for the Southern Liang, the Northern Qi, and the Later Jin, respectively. It can also be seen from Table 2 that the Acc@5 score of our model is significantly higher than its Acc score. This suggests that texts of adjacent periods have a great impact on the performance of the model, so error-predicted periods are more likely to be close to the gold labels than to other periods. To analyze such errors in depth, we sample the false-predicted cases of the Southern Liang, which got the lowest recall score on the test set. As shown in Table 5, sentences 1-3 belong to the Southern Liang period, but our model misclassifies them into the Northern Qi or Tang classes, which are adjacent to the gold label. Likewise, sentences 4 and 5 are both texts from the Later Jin period, yet the model misjudges them as texts of the Song period. As a matter of fact, there is a very short time interval between the Later Jin and the Song, nearly 13 years. The main reason for these errors may lie in the high similarity of language usage as such
periods are very close, and the language changes are not striking enough for more precise prediction. In other words, there is no significant change of words in these adjacent periods, making it difficult to provide discriminative features for temporal language modeling.
Time-aware Visualization. In this section, we further investigate the impact of the temporal semantic learning module on the text dating task and compare it with the RoBERTa model. To avoid the influence of contexts on word representations during hierarchical document modeling, we select the representations obtained after the first layer of additive attention in the document encoding module. After encoding with this attention layer, the model's input incorporates word representations with temporal information. We take the average of the word representations as the document vector and use the t-SNE method to visualize the document vectors of the test set. Similarly, we select the word vector input layer of the RoBERTa model for comparison. Figure 3 shows the visualized results, indicating that our model better captures the temporal semantic relationships among documents.

Conclusion
In this paper, we propose to incorporate the time attribute into language modeling for historical text dating by building TALM, a time-aware language model. To address the problem of missing temporal expressions in historical texts, we propose a temporal adaptation model that gives the model the ability of time-awareness. We also propose a hierarchical modeling method that combines temporal adaptation with long document modeling. We conduct experiments with different attention mechanisms, showing the effectiveness of integrating temporal information in the historical text dating task.
Our study provides new evidence supporting the view that encoding temporal information contributes to improving the performance of language modeling.

Limitations
In this study, we conduct our experiments on datasets in two languages, Chinese and English. To further validate the effectiveness of the proposed model, datasets in more languages are needed to investigate the model's performance across multiple languages. Additionally, regarding the division and definition of historical periods, this study adopts a coarse-grained labeling standard. Due to the chronological order of the corpus, coarse-grained labeling may not accurately represent the exact time of the texts. Therefore, in the future, we plan to collect or construct fine-grained textual corpora to improve the performance of temporal information learning and enhance the accuracy of the text dating task.

Figure 2: The confusion matrix of our model on the Twenty-Four Histories Corpus.

Figure 3: The document distribution representation visualization of the two models.
The dimension of the word vectors is 768; both the temporal adaptation module and the hierarchical model have 6 layers, and the dropout rate is 0.1. The training batch size is 8 and the learning rate is 1e-5. The optimizer is AdamW. All systems are evaluated with 5-fold cross validation.
Performance Comparison. We compare our proposed model TALM with the baseline models; the results are shown in Table 2. We use macro precision, macro recall and macro F1 as our metrics. The best-performing values are in bold.

Table 2: Performances of our model and baselines on the Twenty-Four Histories Corpus and the Royal Society Corpus. The evaluation metrics include P, R, F1, Acc, Acc@3 and Acc@5.

Table 3: Results of the ablation study on the Twenty-Four Histories Corpus.

Table 5: Examples of errors. The translated examples include: "Therefore, our military's decision to cease the offensive and refrain from occupying the southeastern region does not contradict the promises of our predecessors." and "Success can only be achieved through honesty, while being confused and ignorant of worldly matters will be the beginning of a disadvantage."