Effects of Pre- and Post-Processing on type-based Embeddings in Lexical Semantic Change Detection

Lexical semantic change detection is a new and innovative research field. The optimal fine-tuning of models including pre- and post-processing is largely unclear. We optimize existing models by (i) pre-training on large corpora and refining on diachronic target corpora tackling the notorious small data problem, and (ii) applying post-processing transformations that have been shown to improve performance on synchronic tasks. Our results provide a guide for the application and optimization of lexical semantic change detection models across various learning scenarios.


Introduction
In recent years Lexical Semantic Change Detection (LSCD), i.e. the detection of word meaning change over time, has seen considerable developments (Tahmasebi et al., 2018; Kutuzov et al., 2018; Hengchen et al., 2021). The recent publication of multi-lingual human-annotated evaluation data from SemEval-2020 Task 1 now makes it possible to compare LSCD models in a variety of scenarios. The task shows a clear dominance of type-based embeddings, although these are strongly influenced by the size of the training corpora. To mitigate this problem we propose pre-training models on large corpora and refining them on diachronic target corpora. We further improve the obtained embeddings with several post-processing transformations which have been shown to have positive effects on performance in semantic similarity and analogy tasks (Mu et al., 2017; Artetxe et al., 2018b; Raunak et al., 2019) as well as term extraction (Hätty et al., 2020). Extensive experiments are performed on the German and English LSCD datasets from SemEval-2020 Task 1. According to our findings, pre-training is advisable when the target corpora are small and should be done using diachronic data. We further show that pre-training on large corpora strongly interacts with vector dimensionality and propose a simple solution to avoid drastic performance drops. Post-processing often yields further improvements. However, it is hard to find a reliable parameter that performs well across the board. Our experiments suggest that it is possible to use simple pre- and post-processing techniques to improve the state of the art in LSCD.

Related Work
The field of LSCD is currently dominated by Vector Space Models (VSMs), which can be divided into type-based (static) (Turney and Pantel, 2010) and token-based (contextualized) (Schütze, 1998) models. Prominent type-based models include low-dimensional embeddings such as Global Vectors (GloVe, Pennington et al., 2014) and Skip-Gram with Negative Sampling (SGNS, Mikolov et al., 2013a,b). However, as these models come with the deficiency that they aggregate all senses of a word into a single representation, token-based embeddings have been proposed (Peters et al., 2018; Devlin et al., 2019). According to Hu et al. (2019) these models can ideally capture complex characteristics of word use, and how they vary across linguistic contexts. The results of SemEval-2020 Task 1, however, show that contrary to this, the token-based embedding models (Beck, 2020; Kutuzov and Giulianelli, 2020) are heavily outperformed by the type-based ones (Pražák et al., 2020; Asgari et al., 2020). The SGNS model was not only widely used, but also performed best among the participants in the task. This result was recently reproduced in the DIACR-Ita shared task (Basile et al., 2020; Laicher et al., 2020; Kaiser et al., 2020b). Its fast implementation and combination possibilities with different alignment types further solidify SGNS as the standard in LSCD (Schlechtweg et al., 2019a; Shoemark et al., 2019). Hence, the embeddings used in this work are SGNS-based. Further increases in performance of type-based VSMs can be achieved by various post-processing transformations. This has been shown for semantic similarity and analogy tasks (Mu et al., 2017; Artetxe et al., 2018b; Raunak et al., 2019) as well as term extraction (Hätty et al., 2020). It is still an open question whether these transformations improve performance in the special setting of LSCD where we typically have several corpora and vector spaces which have to be transformed simultaneously.
An indication is given by Schlechtweg et al. (2019a) showing that for a simple LSCD model mean centering leads to consistent performance improvements on two German data sets. Whether this result is reproducible on further data sets, more complex models and further post-processing techniques has not been determined yet.
Post-processing methods operate on information already contained in a VSM, rather than adding additional information. Further semantic information can be introduced by pre-training vectors on a larger unspecific collection of text (Kutuzov and Kuzmenko, 2016) or by training a separate matrix on such text and concatenating the two VSMs (Limsopatham and Collier, 2016). This is especially helpful for cases where only smaller specialized corpora are given. Combining the information from two models is also found in Kim et al. (2014), where it is used for alignment purposes. We operate similarly to Kim et al. but with the motivation of Limsopatham and Collier and Kutuzov and Kuzmenko, as we aim to enrich a VSM prior to the training process.

Data and Tasks
We train SGNS-based VSMs on various corpora and use a word similarity task and an LSCD task for evaluation. The two tasks share a common aspect: the vector representations of two words need to be compared with some metric (e.g. cosine similarity), and word pairs need to be ranked according to that metric. In the word similarity task, we have the vectors of two different words in the same vector space $(w_i, w_j)$, while for LSCD we have the vectors of the same word but from different vector spaces representing different time periods $(w_i^{t_1}, w_i^{t_2})$.
Modern Data. We use two large modern English and German corpora, PUKWAC (Baroni et al., 2009) and SDEWAC (Faaß and Eckart, 2013), to validate the post-processing methods on the word similarity task and to create pre-trained embeddings for the LSCD task. PUKWAC and SDEWAC are web-crawled corpora from the .uk and .de domains respectively, resulting in fairly large corpora of 2B and 750M tokens (see Table 1). We evaluate vector representations created on the two corpora on a standard dataset of human similarity judgments, WordSim353 (Finkelstein et al., 2002), by measuring Spearman's rank correlation coefficient of the cosine similarity of vectors for target word pairs with human judgments.
Diachronic Data. We utilize the English and German datasets provided by SemEval-2020 Task 1 Subtask 2. Each dataset contains two target corpora from different time periods, $t_1$ and $t_2$, as well as a list of target words. The corpora originate mostly from newspaper articles and books. Their biggest difference to PUKWAC and SDEWAC is their approximately 10 to 100 times smaller size, according to token counts (see Table 1). The task is to rank the list of target words according to their degree of word sense divergence, ranging from 0 (no change) to 1 (total change). The rank predictions are compared against gold data which is based on human judgments. Once again Spearman's rank correlation coefficient is used to measure performance on the task.
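The evaluation described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' evaluation code: the function names, the toy vectors and the gold scores are all made up, and the two spaces are assumed to be already aligned.

```python
import numpy as np

def cosine_distance(u, v):
    """Cosine distance (1 - cosine similarity) between two vectors."""
    return 1.0 - float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(len(values), dtype=float)
    i = 0
    while i < len(values):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(values) and values[order[j + 1]] == values[order[i]]:
            j += 1
        ranks[order[i:j + 1]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    return float(np.corrcoef(average_ranks(x), average_ranks(y))[0, 1])

def lscd_scores(space_t1, space_t2, targets):
    """Cosine distance of each target word between two aligned spaces."""
    return [cosine_distance(space_t1[w], space_t2[w]) for w in targets]

# Hypothetical toy data: two aligned 2-d spaces and invented gold change scores.
t1 = {"plane": np.array([1.0, 0.0]), "tree": np.array([0.0, 1.0]), "cell": np.array([1.0, 1.0])}
t2 = {"plane": np.array([0.0, 1.0]), "tree": np.array([0.0, 2.0]), "cell": np.array([1.0, 0.0])}
targets = ["plane", "tree", "cell"]
gold = [0.9, 0.1, 0.5]

pred = lscd_scores(t1, t2, targets)
rho = spearman(pred, gold)
```

The same `spearman` helper would serve the word similarity task, with cosine similarities of word pairs in place of the cross-period distances.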

Models
Following the popular approach taken for type-based vector space models in LSCD, we combine three sub-systems: (i) creating semantic word representations, (ii) aligning them across corpora, and (iii) measuring differences between the aligned representations (Schlechtweg et al., 2019a; Dubossarsky et al., 2019; Shoemark et al., 2019). Alignment is needed as columns from different vector spaces may not correspond to the same coordinate axes, due to the stochastic nature of many low-dimensional word representations (Hamilton et al., 2016). Additionally, we aim to refine sub-system (i) by adding pre-trained semantic word representations and using post-processing methods to improve the quality of the created semantic word representations.¹ We use SGNS (Mikolov et al., 2013a,b) to create type-based word representations in combination with three different alignment methods, Orthogonal Procrustes (OP), Vector Initialization (VI), and Word Injection (WI). The three alignment methods combined with SGNS have been shown to be state-of-the-art, even when competing against token-based embeddings (Kaiser et al., 2020a; Basile et al., 2020). Cosine Distance (CD) is used to measure differences between word vectors.²

Alignment
Vector initialization (VI). In VI we first train SGNS on one corpus and then use the learned word and context vectors to initialize the model for training on the second corpus (Kim et al., 2014; Kaiser et al., 2020a). The motivation is that the vector of a word with similar contexts across both corpora will not deviate much from its initialized value. Vectors of words with different contexts across both corpora, on the other hand, will be updated to accommodate the new semantic properties. Words which only appear in the second corpus are initialized with random vectors.
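The initialization step of VI can be illustrated as follows. This is a sketch under assumptions: `init_from_previous` is a hypothetical helper, only the word matrix is shown (the context matrix would be handled the same way), and the random scheme for unseen words mirrors the usual word2vec default rather than anything stated above. The SGNS training itself is not shown.

```python
import numpy as np

def init_from_previous(prev_vectors, new_vocab, dim, seed=0):
    """Build an initial embedding matrix for training on the second corpus.

    Rows of words already seen in the first corpus copy the trained vectors;
    rows of unseen words get small random values (assumed scheme).
    """
    rng = np.random.default_rng(seed)
    init = np.empty((len(new_vocab), dim))
    for row, word in enumerate(new_vocab):
        if word in prev_vectors:
            init[row] = prev_vectors[word]
        else:
            # Uniform values in [-0.5/dim, 0.5/dim), as in common word2vec code.
            init[row] = (rng.random(dim) - 0.5) / dim
    return init

# Hypothetical toy vectors learned on the first corpus.
prev = {"walk": np.array([0.1, 0.2, 0.3]), "gay": np.array([0.4, 0.5, 0.6])}
vocab_t2 = ["gay", "internet", "walk"]  # "internet" only occurs in the second corpus
init = init_from_previous(prev, vocab_t2, dim=3)
```

Initializing two separate models from one shared matrix in this way gives both training runs a common starting point.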
Orthogonal Procrustes (OP). SGNS is trained on each corpus separately, resulting in word matrices A and B. To align them, we follow Hamilton et al. (2016) and calculate an orthogonally constrained matrix $W^*$: [...] Prior to this alignment step both matrices are length-normalized and mean-centered (Artetxe et al., 2017; Schlechtweg et al., 2019a).
¹ Find a comprehensive overview of type-based LSCD models including semantic representations, alignments and measures in Schlechtweg et al. (2019a).
² We provide our code at: https://github.com/Garrafao/LSCDetection.
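The OP alignment can be sketched with NumPy as below. The sketch assumes the standard orthogonal Procrustes objective, $W^* = \arg\min_W \lVert BW - A \rVert_F$ subject to $W^T W = I$, which has a closed-form solution via the SVD of $B^T A$. It also assumes that row i of A and row i of B belong to the same word. The function names are illustrative and not taken from the authors' repository.

```python
import numpy as np

def normalize_and_center(M):
    """Length-normalize each row, then subtract the column means."""
    M = M / np.linalg.norm(M, axis=1, keepdims=True)
    return M - M.mean(axis=0, keepdims=True)

def orthogonal_procrustes(A, B):
    """Orthogonal W minimizing ||B W - A||_F (assumed standard objective)."""
    U, _, Vt = np.linalg.svd(B.T @ A)
    return U @ Vt

# Toy check: B is an exactly rotated copy of A, so alignment should undo the rotation.
rng = np.random.default_rng(0)
A = rng.normal(size=(50, 5))
R, _ = np.linalg.qr(rng.normal(size=(5, 5)))  # random orthogonal matrix
B = A @ R.T

An, Bn = normalize_and_center(A), normalize_and_center(B)
W = orthogonal_procrustes(An, Bn)
B_aligned = Bn @ W
```

Because W is orthogonal, cosine relations inside B are preserved; only the orientation of the space changes.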
Word Injection (WI). The sentences of both corpora are shuffled into one joint corpus, but all occurrences of target words are substituted by the target word concatenated with a tag indicating the corpus it originated from (Ferrari et al., 2017;Schlechtweg et al., 2019a;Dubossarsky et al., 2019). This leads to the creation of two vectors for each target word in one vector space, while non-target words receive only one vector encoding information from both corpora.
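The corpus manipulation behind WI is easy to sketch in plain Python. The tag format (`_t1`, `_t2`) and the function names are assumptions chosen for illustration; any tagging scheme that keeps the two variants distinct would do.

```python
import random

def inject(sentences, targets, tag):
    """Replace every target word by '<word>_<tag>'; other tokens stay unchanged."""
    return [[f"{tok}_{tag}" if tok in targets else tok for tok in sent]
            for sent in sentences]

def word_injection(corpus_t1, corpus_t2, targets, seed=0):
    """Tag target words per corpus, then shuffle all sentences into one corpus."""
    joint = inject(corpus_t1, targets, "t1") + inject(corpus_t2, targets, "t2")
    random.Random(seed).shuffle(joint)
    return joint

# Hypothetical tokenized toy corpora.
c1 = [["the", "plane", "is", "flat"], ["a", "tree"]]
c2 = [["the", "plane", "landed"]]
joint = word_injection(c1, c2, targets={"plane"})
tokens = [tok for sent in joint for tok in sent]
```

Training a single SGNS model on `joint` would then yield separate vectors for `plane_t1` and `plane_t2` in one shared space.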
No Alignment (NO). Comparing two vector spaces without aligning them results in poor performance on LSCD (Schlechtweg et al., 2019a). As VI shows, initializing the model with weights from a previous run results in aligned vector spaces. We expand on this concept by initializing two models with the same pre-trained weights, assuming that the resulting vector spaces are aligned to one another. The difference to VI is that instead of initializing model B with the weights from model A, the weights from a third pre-trained model C are used to initialize both models A and B.

Pre-training
The corpora used in the context of LSCD are often small, as they are restricted by the length of time periods or the availability of historical data. For example, the English corpora of SemEval-2020 Task 1 only have 6.6M tokens each, compared to the 1.9B tokens of PUKWAC. This reduced corpus size limits the amount of semantic information encoded into VSMs trained on the corpus. Pre-training addresses this problem by first training SGNS on a large, possibly external corpus, and then using these vectors to initialize the model for training on the smaller diachronic target corpora. The idea is that the model first learns very broad and general semantic properties and is then trained on the target corpora, where corpus- and time-specific details are picked up, i.e., a form of refinement. This procedure is applicable to all alignment types. We use PUKWAC and SDEWAC for pre-training, referred to below as MODERN. However, pre-training on modern corpora is only advisable if the assumption can be made that the meanings of words in the pre-training corpus roughly correspond to the meanings of words in the target corpora. It is unclear to which extent this assumption holds for our data. Hence, we also combine the two target corpora into a bigger corpus, referred to as DIACHRON, which is then used for pre-training.

Post-processing (PP)
Similarity Order Transformation (SOT). In 2nd order similarity, the similarity of two words is assessed in terms of how similar they are to a third word (Schütze and Pedersen, 1993; Artetxe et al., 2018b; Schlechtweg et al., 2019b). This can analogously be done for higher (3rd, 4th, etc.) orders. According to Artetxe et al. (2018b) these orders capture different aspects of language. Artetxe et al. propose a linear transformation deriving higher or lower orders of similarity from a given matrix $X$. For this, the product with the transpose matrix is split into its eigendecomposition $X^T X = Q \lambda Q^T$, so that $\lambda$ is a positive diagonal matrix whose entries are the eigenvalues of $X^T X$ and $Q$ is an orthogonal matrix with their respective eigenvectors as columns. The linear transformation matrix is then defined as $W_\alpha = Q \lambda^\alpha$, where $\alpha$ is the parameter that adjusts the desired similarity order. Applying this to the original embeddings $X$ yields the transformed embeddings $X' = X W_\alpha$.
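A minimal NumPy sketch of the transformation follows directly from the definitions above. The function name and the small eigenvalue floor are assumptions added for numerical safety. The toy check uses the fact that with $\alpha = 0.5$ the dot products of the transformed vectors equal $(XX^T)^2$, i.e. second order similarity, while $\alpha = 0$ leaves all dot products unchanged.

```python
import numpy as np

def similarity_order_transform(X, alpha):
    """Return X' = X W_alpha with W_alpha = Q lambda^alpha, where X^T X = Q lambda Q^T."""
    eigvals, Q = np.linalg.eigh(X.T @ X)
    # Floor tiny or slightly negative eigenvalues caused by round-off
    # (assumption: needed only for numerical stability with negative alpha).
    eigvals = np.clip(eigvals, 1e-12, None)
    W = Q * eigvals ** alpha  # scales column j of Q, same as Q @ diag(eigvals**alpha)
    return X @ W

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 4))   # hypothetical toy embedding matrix
sim1 = X @ X.T                 # first order similarity (dot products)

X_second = similarity_order_transform(X, 0.5)
X_same = similarity_order_transform(X, 0.0)
```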
Mean Centering (MC). The centroid of a matrix is the average vector over all vectors in a matrix: $c = \frac{1}{|V|} \sum_{i=1}^{|V|} w_i$. MC refers to subtracting $c$ from each $w_i$ in the matrix. MC alters all dimensions so that the mean of every column is zero. Artetxe et al. provide the intuitive motivation for MC that it moves randomly similar vectors further apart, and Mu and Viswanath (2018) consider mean centering as an operation making vectors "more isotropic", i.e., more uniformly distributed across the vector space. Mu and Viswanath indicate that isotropy of word vectors is positively correlated to performance.
Principal Component Removal (PCR). Given an n-dimensional matrix $X$, Principal Component Analysis (PCA, Pearson, 1901) returns n vectors where each vector describes a best fitting line for the data while being orthogonal to all preceding vectors. Thus, the first PC describes the direction of greatest variance, the second PC the direction of second greatest variance, and the nth PC the direction of nth greatest variance. Mu and Viswanath (2018) use PCA to compute the top m PCs from a mean centered word embedding $\tilde{M}$: $p_1, \dots, p_m = \mathrm{PCA}(\tilde{M})$. Subsequently these PCs are used to project each vector $\tilde{v} \in \tilde{M}$ onto the subspace spanned by the PCs. This projection is then subtracted from the original mean centered word vector $\tilde{v}$ by $v' = \tilde{v} - \sum_{i=1}^{m} (p_i^T \tilde{v}) p_i$, which results in nullifying the top m PCs in $\tilde{M}$. This is similar to the approach of Bullinaria and Levy (2012). Mu and Viswanath combine both MC and PCR into one PP transformation (MC+PCR).
As for MC, Mu and Viswanath's main motivation for PCR is to make vectors more isotropic. They also demonstrate empirically that the top PCs encode word frequency and offer the removal of this noise from the matrix as an alternative explanation for observed performance improvements.
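The combined MC+PCR transformation can be sketched as follows. The function name is illustrative, and the principal components are obtained from an SVD of the centered matrix, which is one common way to compute PCA and is an implementation choice made here.

```python
import numpy as np

def mc_pcr(M, m):
    """Mean-center M, then remove the projections onto its top m principal components.

    With m = 0 only mean centering is applied.
    """
    centered = M - M.mean(axis=0, keepdims=True)
    if m == 0:
        return centered
    # Rows of Vt are the principal directions, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    P = Vt[:m]
    return centered - (centered @ P.T) @ P

rng = np.random.default_rng(0)
M = rng.normal(size=(30, 6)) + 2.0  # hypothetical toy embeddings with a non-zero centroid
only_mc = mc_pcr(M, 0)
mc_pcr_2 = mc_pcr(M, 2)
```

After the call with m = 2, every row of `mc_pcr_2` is orthogonal to the two removed directions, so the rank of the matrix drops by two.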
Stacking. VI and OP alignment result in two matrices, and hence, a proper way of applying PP to both of them is needed. The naïve way of simply post-processing both matrices separately (SEP) may violate the assumption that they are represented in the same space. Therefore, in a second approach, we apply PP to both matrices simultaneously by stacking them vertically beforehand (STA). Preliminary experiments showed that the naïve way of PP (SEP) led to a severe decrease in performance for SOT (but not for MC+PCR). Hence, applying SOT to two matrices separately is followed by an orthogonal post-alignment (SEP+PA).
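The difference between STA and SEP can be made concrete with mean centering as the transformation. The helper names below are hypothetical. The toy example shows that stacking subtracts one shared centroid from both matrices, so differences between the two spaces survive, whereas separate processing subtracts two different centroids.

```python
import numpy as np

def post_process_stacked(A, B, transform):
    """STA: stack both matrices vertically, transform once, split again."""
    out = transform(np.vstack([A, B]))
    return out[:len(A)], out[len(A):]

def post_process_separately(A, B, transform):
    """SEP: transform each matrix on its own."""
    return transform(A), transform(B)

def mean_center(M):
    return M - M.mean(axis=0, keepdims=True)

# Hypothetical toy spaces; row i of A and row i of B belong to the same word.
A = np.array([[1.0, 0.0], [3.0, 0.0]])
B = np.array([[1.0, 4.0], [3.0, 8.0]])

A_sta, B_sta = post_process_stacked(A, B, mean_center)
A_sep, B_sep = post_process_separately(A, B, mean_center)
```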

Experiments
For the most part, we chose common model hyperparameter settings in order to keep our results comparable to previous research (Hamilton et al., 2016; Schlechtweg et al., 2019a; Kaiser et al., 2020a). We fine-tune for different alignment methods and datasets by varying dimensionality d, window size w and number of training epochs e. [...] (see Figure 1a). For MC+PCR we observe the greatest performance improvement when the number of removed PCs is around $m = d/100$ (see Figure 1b). This fits the rule of thumb as stated by Mu and Viswanath.

Pre-training
We tune SGNS models for each alignment method with and without pre-training (baseline), see Table 2. Recall from Section 4.2 that we use the corpora MODERN and DIACHRON for pre-training. Table 2 lists the maximum and mean performances of the baseline and pre-training with different alignment methods, as well as the standard deviation (for a visual representation of the max values see Figure 2). The mean is calculated across different d, e and w, giving the expected performance in a realistic scenario where fine-tuning hyper-parameters is not possible (Basile et al., 2020). For German, the baseline max and mean scores could not be significantly improved by pre-training across alignments. For English, pre-training on DIACHRON results in better max and mean scores for OP and WI, with max improvements up to .10. Also, the overall best result is achieved with OP and pre-training on DIACHRON. The usage of MODERN does not improve on the maximum, while reducing the mean. The overall lower performance on English, as well as the observed performance improvements compared to German, may be attributed to the roughly 10 times smaller target corpora. That is, pre-training is helpful on the smaller target corpora.

Post-processing
For every combination of alignment and pre-training method, the matrix with the highest performance across parameters is chosen as the baseline. SOT and MC+PCR are applied individually to these matrices within a wide parameter range (see Appendix B) for both stacking methods (STA and SEP/SEP+PA). Table 3 presents the mean optimal performance gains after PP, which are calculated by extracting the best performance after PP for every matrix, subtracting the baseline values and averaging the values per language. Averaging the respective parameter values yields the mean argmax. Figures 3a and 3d show the highest performances for every baseline matrix after SOT and MC+PCR respectively.
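The two summary statistics just described, mean optimal gain and mean argmax, can be computed as in the following sketch. The data layout, the function name and all numbers are invented for illustration and are not results from the paper.

```python
def mean_optimal_gain(baselines, pp_scores):
    """Return (mean optimal gain, mean argmax) over all baseline matrices.

    baselines: {matrix_id: baseline score}
    pp_scores: {matrix_id: {parameter value: score after PP}}
    """
    gains, argmaxes = [], []
    for matrix_id, base in baselines.items():
        scores = pp_scores[matrix_id]
        best_param = max(scores, key=scores.get)  # parameter with the best score
        gains.append(scores[best_param] - base)
        argmaxes.append(best_param)
    n = len(gains)
    return sum(gains) / n, sum(argmaxes) / n

# Made-up scores for two hypothetical baseline matrices and three alpha values.
baselines = {"OP": 0.70, "VI": 0.60}
pp_scores = {
    "OP": {-0.2: 0.71, 0.0: 0.72, 0.2: 0.69},
    "VI": {-0.2: 0.64, 0.0: 0.61, 0.2: 0.58},
}
gain, argmax = mean_optimal_gain(baselines, pp_scores)
```

Note that the optimal gain is an upper bound: it picks the best parameter per matrix after the fact, which is why fixed parameter values are examined separately.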
SOT. As we see in Figure 3a, SEP+PA and STA perform similarly. We find small mean performance gains across the board (.013 for GER+STA, .008 for GER+SEP+PA, .013 for ENG+STA), except for ENG+SEP+PA where a minuscule decrease (-.005) can be seen. Overall, STA outperforms SEP+PA slightly. We now further examine the effect of SOT+STA on individual matrices. In general, the data can approximately be described as a downward opening parabola (see Figure 3b), with different peaks for both languages and slight differences between alignment methods. Averaging the argmax for α shows us where these peaks are. The calculations yield a mean optimal α of 0 for GER+STA, and -0.[...] et al. (2018b). In order to predict a high-performing parameter, independent of the underlying matrix, we calculate mean performance gains for fixed parameter values. The values are chosen according to the above-described peak intervals for the respective languages. However, on average, using a fixed parameter results in slight performance losses, regardless of the α value, and hence, finding a high-performing fixed parameter value was not possible. We observe similar findings for individual alignment methods and varying dimensionality. However, GER+VI alignment represents an interesting exception: with high dimensionality (> 300) base performance drops heavily (Kaiser et al., 2020a), and is then "repaired" by the PP, bringing it close to the baseline of the best performing dimension (see Figure 3c).

MC+PCR.
As we see in Figure 3d, MC+PCR yields small improvements over the baselines for German. This is also reflected in the mean gain in Table 3. We find that no single value for m yields consistent improvements. However, we find that for m=0 (only MC) MC+PCR consistently improves the baseline slightly (see [...]). [...] Figure 3f, 3d and mean gain in Table 3. A range of parameters shows improvements with m=3 yielding the highest (.0175). This can also be seen in Figure 3f where several parameters yield improvements. We conclude that predicting a parameter for likely performance improvement is possible for English, but not for German. However, if this PP should be used, we recommend using a parameter space of m ∈ [0, 5], as this parameter space is most likely to produce improvements on English, while not harming performance too much on German. This also roughly corresponds to the recommendation of Mu and Viswanath (2018), as they predict that the parameter should be chosen around $d/100$. Furthermore, we suggest using STA, as on average it shows better performance than SEP for the aforementioned parameter space. We see that the effects of SOT as well as MC+PCR are highly dependent on the underlying matrix.
Analysis
Test Statistics. The effects of pre-training and PP methods on word embeddings are not limited to performance differences in word similarity or LSCD tasks. We use two test statistics to further analyse vector spaces: (i) isotropy (Mu and Viswanath, 2018), i.e., uniformity of vector distribution and (ii) frequency bias (Dubossarsky et al., 2017; Kaiser et al., 2020a), i.e., correlation between cosine distance and frequency.
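As an illustration of the first statistic, the sketch below follows the approximation of isotropy proposed by Mu and Viswanath (2018): the ratio of the smallest to the largest value of the partition function $Z(c) = \sum_w \exp(c^T w)$, with $c$ ranging over the eigenvectors of $M^T M$. Whether the experiments here use exactly this formulation is an assumption, and the function name is illustrative.

```python
import numpy as np

def isotropy(M):
    """Approximate isotropy in (0, 1]: min_c Z(c) / max_c Z(c).

    Z(c) = sum over rows w of exp(c . w); c ranges over the eigenvectors
    of M^T M (assumed approximation after Mu and Viswanath, 2018).
    """
    _, eigvecs = np.linalg.eigh(M.T @ M)
    Z = np.exp(M @ eigvecs).sum(axis=0)  # one partition value per eigenvector
    return float(Z.min() / Z.max())

# Hypothetical toy spaces: a symmetric one and the same vectors shifted off-centre.
symmetric = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
shifted = symmetric + np.array([3.0, 0.0])

iso_symmetric = isotropy(symmetric)
iso_shifted = isotropy(shifted)
```

Values close to 1 indicate a uniform spread of vectors around the origin; the shifted space scores much lower because all vectors share a common direction.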

Pre-training
On the German dataset it is noticeable that pre-training on DIACHRON often results in a slight drop in performance at higher d. This behaviour is more pronounced, consistent and even visible on the English dataset when pre-training on MODERN, see Figure 2. Such a drop in performance after initializing on pre-trained vectors has already been observed by Kaiser et al. (2020a). The authors relate the drop to an increased frequency bias and reduce it by increasing e/w. It is noteworthy that the drop is much more pronounced for pre-training on MODERN compared to DIACHRON. This can be attributed to a difference in word vector lengths of the SGNS model used for initialization. We make the following observation: average word vector length increases with the number of training word pairs. The difference more training data makes is amplified at higher d, see Figure 4c. By length-normalizing the word vectors between the initialization and training step, the drop in performance can be completely circumvented. Additionally, the frequency bias is reduced to 0, see Figure 4b. For English, we expected a higher performance gain from pre-training when using MODERN because of the small data size. However, we observe no improvements over the baseline. Using length-normalized word vectors for initialization does result in slightly improved max and mean values for MODERN but these are still lower than max and mean values of DIACHRON.
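The length normalization applied to pre-trained vectors before they are used for initialization amounts to one line of NumPy. The sketch below uses a hypothetical helper name and made-up vectors; the guard for zero vectors is an added assumption.

```python
import numpy as np

def length_normalize(M):
    """Scale every row to unit Euclidean length (all-zero rows are left untouched)."""
    norms = np.linalg.norm(M, axis=1, keepdims=True)
    return M / np.where(norms == 0, 1.0, norms)

# Hypothetical pre-trained vectors with very different lengths.
pretrained = np.array([[3.0, 4.0], [0.0, 10.0]])
avg_len_before = float(np.linalg.norm(pretrained, axis=1).mean())

normalized = length_normalize(pretrained)
avg_len_after = float(np.linalg.norm(normalized, axis=1).mean())
```

After normalization the vectors keep their directions, so cosine relations learned during pre-training are unchanged while the length differences are removed.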

SOT
SOT has a clear effect on isotropy, which has not been described in previous research. Isotropy shows the same behaviour across both languages and all models, and is best described as a vertically mirrored S-curve (see Figure 5a). Decreasing α increases isotropy close to 1, while increasing α decreases isotropy close to 0. The average correlation (Pearson) between α and isotropy over all matrices is -.89 for both languages. However, the performance correlates only slightly with isotropy (-.25, .35). Moreover, α correlates only weakly with frequency bias (.19, -.12, however with high variance). In order to explain the above-described "repair" effect we take a closer look at the three GER+VI models. Applying SOT brings large performance increases, as stated in Section 5.2.2. For all three models a considerably higher baseline frequency bias for d=500 is visible. SOT strongly reduces this bias for MODERN, and results in a huge performance gain (see Figure 5b).

MC+PCR
Since Mu and Viswanath's (2018) main motivation behind MC+PCR is to increase the isotropy of a vector space and to remove word frequency noise through PCR, we examine how isotropy and frequency bias develop with m. While PCR has the predicted effect on frequency bias (GER: -.94, ENG: -.6), PCR does in fact not increase isotropy, contrary to Mu and Viswanath's motivation of "rounding towards isotropy", but has a consistent reducing effect (GER: -.75, ENG: -.7). Thus, we believe that rounding towards isotropy is not suitable for explaining performance. Furthermore, we observe that MC not only exhibits effects on isotropy, but also acts on frequency bias, thus Mu and Viswanath's PCR motivation can be extended to MC.

Conclusion
Figure 5: Representative plot for the isotropy after SOT+STA (5a). Performance and frequency bias after SOT+STA for GER+VI+BIG (5b).
We tested the effects of pre-training and post-processing on a variety of LSCD models. We performed extensive experiments on a German and an English LSCD dataset. According to our findings, pre-training is advisable when the target corpora are small and should be done using diachronic data. The size of the pre-training corpus is crucial, as a large number of training pairs leads to performance drops, which are probably caused by their effect on vector length. Length-normalization may be used on pre-trained vectors to counteract this effect. Further performance improvements may be reached by post-processing. While SOT+STA yielded moderate improvements for both languages, MC+PCR showed larger improvements, but only on English. However, for neither were we able to find a reliable parameter that performed well across the board. Instead, we found that a well-performing parameter value is highly dependent on the underlying matrix. Both post-processing methods affect isotropy and frequency bias.
The methods we tested are particularly helpful when tuning data is available, as performance can be optimized and becomes more predictable. Hence, we recommend obtaining a small annotated sample of target words for the target corpora and tuning pre-training, model and post-processing parameters on the sample before performing predictions for semantic changes on unseen data. With the recent upsurge of digitized historical corpora and diachronic semantic annotation efforts (Tahmasebi and Risse, 2017; Schlechtweg et al., 2018; Basile et al., 2020), this may often be a likely and feasible scenario.

A Corpus details
The corpora are lemmatized and contain no punctuation. Further pre-processing on our part is limited to removing low-frequency words. All words with a frequency below the value listed in the row "min word freq." of Table 1 are removed from the corpora. This is done to reduce noise and unwanted artifacts.
B Parameter settings
SGNS. We use common hyper-parameter settings: initial learning rate of 0.025, number of negative samples k=5 and no sub-sampling. Vector dimensionality d, window size w and number of training epochs e are varied in order to fine-tune model and methods. This is important as alignment methods like VI are highly dependent on the choice of e and d (Kaiser et al., 2020a). The following values are used: w ∈ {5, 10}, e ∈ {5, 10, 20, 30}, d ∈ {25, 50, 100, 200, 300, 500}. Due to the immense number of possible parameter combinations we ran each setting only once. PP was performed on the highest-scoring matrices of each language, differentiating between combinations of alignment and pre-training as well as between STA and SEP post-processing.
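The SGNS parameter grid listed above can be enumerated as in the sketch below, which also shows how max, mean and standard deviation over the grid could be summarized. The variable and function names are illustrative, the scores are invented, and the use of the population standard deviation is an assumption.

```python
from itertools import product
from statistics import mean, pstdev

WINDOW_SIZES = [5, 10]
EPOCHS = [5, 10, 20, 30]
DIMENSIONS = [25, 50, 100, 200, 300, 500]

# One dictionary per SGNS setting: 2 * 4 * 6 = 48 combinations.
grid = [{"w": w, "e": e, "d": d}
        for w, e, d in product(WINDOW_SIZES, EPOCHS, DIMENSIONS)]

def summarize(scores):
    """Max, mean and standard deviation over the scores of all settings.

    Assumption: population standard deviation; which variant is reported
    in the tables is not stated.
    """
    return max(scores), mean(scores), pstdev(scores)

# Made-up Spearman scores for two settings, only to show the call.
best, avg, std = summarize([0.5, 0.7])
```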
SOT. As stated in Section 4.3, SEP is used in combination with post-alignment. We apply SOT with α values ranging from -1 to 1 in 0.1 increments on every baseline matrix with d ∈ {25, 50, 100, 200, 300, 500}.
MC+PCR. MC+PCR is performed using a parameter space of [0,25] in order to examine the performance development over a growing number of PCs removed. It is important to note that using the parameter 0 results in only applying MC.