JCT at SemEval-2020 Task 1: Combined Semantic Vector Spaces Models for Unsupervised Lexical Semantic Change Detection

In this paper, we present our contribution to SemEval-2020 Task 1: Unsupervised Lexical Semantic Change Detection, where we systematically combine existing models for unsupervised capturing of lexical semantic change across time in text corpora of German, English, Latin and Swedish. In particular, we analyze the score distribution of existing models. Then we define a general threshold, adjust it independently to each of the models and measure the models’ score reliability. Finally, using both the threshold and score reliability, we aggregate the models for the two sub-tasks: binary classification and ranking.


Introduction
Over the last decade, research on the detection of lexical semantic change has increased. Many studies have been performed on various languages, corpora and periods. Two years ago, two literature surveys on computational approaches to Lexical Semantic Change (LSC) (Kutuzov et al., 2018; Tahmasebi et al., 2018) were published. Last year, a broad variety of LSC detection models was systematically compared for the first time on two data sets of different periods and domains, and Shoemark et al. (2019) proposed a new evaluation framework for semantic change detection using word embeddings.
To facilitate the comparison of different systems, SemEval-2020 Task 1 (Schlechtweg et al., 2020) introduced a simple evaluation framework for unsupervised lexical semantic change detection in text corpora of German, English, Latin and Swedish. The task relies on the comparison of two time periods for each language. We participated in two sub-tasks: a classification task, where we decide which words lost or gained senses between the periods, and a ranking task, where we rank a set of target words according to their degree of lexical semantic change between the periods.
Given the large number of models that have already been explored, we built a system that systematically combines existing models in the unsupervised setting of the LSC detection task. Since no tuning data is available, we minimized the number of parameters of our system. This paper is organized as follows. First, in Section 2, we describe the existing LSC detection models that we have combined. Then, in Section 3, we analyze the score distribution of the models in order to learn the general behaviour of words in our corpora; we aim to estimate the number of words that changed their meaning between the periods. Next, we define a general classification threshold percentile (CT) parameter and adapt it to each model separately. We also use the CT parameter to measure the certainty of the models' scores, and we filter out models whose certainty is lower than a minimal required decision certainty (MRC), which is also a parameter of our system. Finally, we present our aggregation methods for the two sub-tasks: classification and ranking. Our system results are detailed in Section 4, followed by conclusions in Section 5.

Related Work
In this section, we briefly describe prior work that covers a wide range of LSC detection models. Then, we summarize the models that we have combined in our system. That work made a comprehensive comparison between the results of the diverse existing LSC Detection models and was the first to run the models under one evaluation task and data. It tested the methods for semantic change detection both across time (Diachronic) and across domains (Synchronic). For the Diachronic task, it used the evaluation framework and the DURel German corpora (introduced in Schlechtweg et al. (2018)). The framework was expanded for the Synchronic task using the SURel German corpora.
This work is licensed under a Creative Commons Attribution 4.0 International License. License details: http://creativecommons.org/licenses/by/4.0/.
Existing LSC Detection models are based on three methods for meaning representation: semantic vector spaces (Hamilton et al., 2016a; Hamilton et al., 2016b; Hellrich and Hahn, 2016; Rosenfeld and Erk, 2018), topic distributions (Cook et al., 2014; Frermann and Lapata, 2016), and sense clusters (Mitra et al., 2015). In semantic vector spaces, each term is represented as two vectors indicating its co-occurrence statistics at different eras. Then, the semantic change is commonly measured by either similarity measures, such as Cosine similarity, or contextual measures. In topic distributions, each term is modeled as a probability distribution over different topics, or term senses. Then, the semantic change is measured based on the senses' frequency of use. Sense clustering and topic models are similar in their mapping (uses to senses) and in their semantic change measures, but in sense clustering some contextual property is used to assign all uses of a term to sense clusters. The comparison focused on two meaning representations: semantic vector spaces and topic distributions. It tested a list of known LSC Detection models with different combinations of semantic representations, alignment methods and detection measures, and it experimented with various parameter settings for comparing the models' predictions with the true results. It concluded that the same modelling methods can be used for both Diachronic and Synchronic LSC Detection.
In addition, all model predictions had a strong positive correlation with the true results. The model with the best performance was Skip-Gram with Orthogonal Procrustes alignment and Cosine Distance (SGNS+OP+CD).
Since topic distribution models (SCAN) (Frermann and Lapata, 2016) were observed to have poor and unstable performance, our system integrates only the Semantic Vector Spaces models (as described in Schlechtweg et al. (2019)).
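The Cosine Distance (CD) measure mentioned above can be illustrated with a minimal sketch. It assumes that the two vectors of a word have already been aligned into a common space; the function name and the toy vectors are hypothetical and are not taken from our system scripts.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cos(u, v) for two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)

# Hypothetical aligned vectors of one target word in the two periods.
vec_period_1 = [1.0, 0.0, 1.0]
vec_period_2 = [1.0, 1.0, 0.0]
change_score = cosine_distance(vec_period_1, vec_period_2)
```

A larger distance is read as a larger degree of semantic change between the two periods.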

System Description
Two time-specific corpora for each of the four languages, German, English, Latin and Swedish, were provided by the task organizers. Each line contains one sentence, where the punctuation was eliminated and each token was replaced by its lemma. Within each corpus, sentences were shuffled randomly. One-word sentences were removed from the Latin corpus and, for the other languages, sentences with fewer than 10 tokens were removed. Due to the large size of the corpora, we removed low-frequency words to improve the efficiency of the models. The input of the system for each language is a corpus pair and a list of target words. Our system scripts are publicly available on GitHub: https://github.com/efratiamar/CombinedModelsLSC.
Table 1: Models' combinations integrated by our system (for example, RI + SRV + CD).

Analyzing the score distribution of the LSC detection models
Since we were not given any information on the number of words that changed their meaning, we analyzed the score distribution of the LSC detection models to learn the general behavior of words in our corpora. First, for each language, we randomly selected n = 200 words with more than 30 occurrences in each of the two periods. Then, for each of the LSC detection models, we applied the following steps:
1. Calculate the scores for all the 200 words.
2. Draw a histogram of the scores, as illustrated in Figure 1.
3. Calculate the skewness of the score distribution.
Skewness characterizes the degree of asymmetry of a distribution around its mean. Positive skewness indicates a distribution with an asymmetric tail extending toward more positive values. Negative skewness indicates a distribution with an asymmetric tail extending toward more negative values. Skewness is computed from the sample average x̄ and the standard deviation s.
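One common sample estimator that uses exactly these two quantities is the adjusted Fisher-Pearson coefficient. It is given here as an assumption, since other skewness estimators exist; n denotes the sample size (here n = 200) and x_i the score of the i-th sampled word, and s is taken to be the sample standard deviation.

```latex
\[
\mathrm{skewness} = \frac{n}{(n-1)(n-2)} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s} \right)^{3}
\]
```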
Our exploration revealed that most of the models have positive skewness. This implies that there are more words that preserved their meaning than words that changed their meaning. The percentages of models with positive skewness for English, German, Latin, and Swedish are 62.96%, 81.48%, 74.07% and 62.96%, respectively. Thus, in every language, more than 62% of the models have positive skewness. Therefore, in our system, we combined only models with positive skewness.
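The selection of positively skewed models can be sketched as follows. This is a minimal illustration, assuming the adjusted Fisher-Pearson sample skewness (other estimators exist); the function names, model names and scores are hypothetical.

```python
import statistics

def sample_skewness(scores):
    """Adjusted Fisher-Pearson sample skewness (an assumed estimator)."""
    n = len(scores)
    if n < 3:
        raise ValueError("need at least three scores")
    mean = statistics.fmean(scores)
    s = statistics.stdev(scores)  # sample standard deviation (n - 1 denominator)
    if s == 0:
        raise ValueError("skewness is undefined when all scores are equal")
    return n / ((n - 1) * (n - 2)) * sum(((x - mean) / s) ** 3 for x in scores)

def keep_positively_skewed(model_scores):
    """Keep only models whose score distribution on the random sample is positively skewed."""
    return {name: scores for name, scores in model_scores.items()
            if sample_skewness(scores) > 0}

# Hypothetical scores of two models on a tiny random sample of words.
sample = {
    "model_a": [0.1, 0.1, 0.2, 0.2, 0.9],  # long tail toward high scores
    "model_b": [0.1, 0.8, 0.8, 0.9, 0.9],  # long tail toward low scores
}
kept = keep_positively_skewed(sample)
```

In this toy input only `model_a` is kept, because most of its scores are low and only a few are high.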

Setting thresholds in an unsupervised setting
The setting of the LSC detection task is unsupervised: there is neither labeled data nor a training set. To eliminate the need for tuning parameters for each LSC model separately, we defined a general classification threshold (CT) parameter in terms of a percentile and adjusted it to the various LSC models. For example, if the threshold parameter is set to 90%, then for each model we calculate the score below which 90% of the scores in our random sample (n = 200) fall. This value is the model's numeric classification threshold (NCT), which we set for each model separately. The NCT is used both for deciding whether a word has changed its meaning and for measuring the certainty of that decision, as detailed next.
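The mapping from the shared CT percentile to a per-model NCT can be sketched as follows. The linear interpolation between neighbouring sample scores is an assumption (other percentile conventions exist), and the function names and toy scores are hypothetical.

```python
def percentile_value(sample_scores, pct):
    """Score below which roughly `pct` percent of the sample lies (linear interpolation)."""
    if not sample_scores:
        raise ValueError("sample must not be empty")
    if not 0 <= pct <= 100:
        raise ValueError("pct must be between 0 and 100")
    ordered = sorted(sample_scores)
    pos = (len(ordered) - 1) * pct / 100.0
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    frac = pos - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * frac

def numeric_thresholds(model_sample_scores, ct):
    """One numeric classification threshold (NCT) per model from a single CT percentile."""
    return {name: percentile_value(scores, ct)
            for name, scores in model_sample_scores.items()}

# Two hypothetical models whose scores live on different scales.
sample_scores = {
    "model_a": [i / 10 for i in range(11)],         # 0.0 .. 1.0
    "model_b": [float(i * 10) for i in range(11)],  # 0 .. 100
}
nct = numeric_thresholds(sample_scores, ct=90)
```

The same CT value yields a different numeric threshold for each model, which is what lets a single parameter serve models with different score scales.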

Measuring the score's decision certainty
After setting a numeric classification threshold (NCT) for each model based on our random sample and the CT parameter, we took the input of the LSC detection task, a list of target terms, and calculated the models' scores. Then, for each target and model, we compared its score with the model's numeric classification threshold. If the score was higher than the numeric classification threshold, we assumed that the target term has changed its meaning. Next, for each target and model, we measured the decision certainty of each score in the following way:
1. Calculate the percentile of the model score (percentile(score)) based on our random sample.
2. Calculate the distance between the score percentile and the model's classification threshold (CT).
3. Divide the distance by the range size, where the range depends on whether the score percentile falls above or below the classification threshold (CT).
The impact of this division is illustrated in Figure 2. Since we uniformly divided all the values on the same side of the classification threshold, scores above the threshold were weakly affected, while scores below the threshold were strongly affected. In future work, we plan to apply a differential division method.
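The three steps can be sketched as follows. The range sizes are an assumed reading of step 3: 100 - CT for percentiles at or above the threshold and CT for percentiles below it. The percentile convention (share of sample scores not exceeding the score) and all names are likewise hypothetical.

```python
from bisect import bisect_right

def score_percentile(sample_scores, score):
    """Percentage of sample scores that do not exceed `score`."""
    if not sample_scores:
        raise ValueError("sample must not be empty")
    ordered = sorted(sample_scores)
    return 100.0 * bisect_right(ordered, score) / len(ordered)

def decision_certainty(score_pct, ct):
    """Distance from the CT percentile, scaled by the size of the side it falls on."""
    if not 0 < ct < 100:
        raise ValueError("ct must be strictly between 0 and 100")
    if score_pct >= ct:
        return (score_pct - ct) / (100.0 - ct)  # assumed range above the threshold
    return (ct - score_pct) / ct                # assumed range below the threshold

# Hypothetical random sample with scores 1..100 and CT = 90.
sample = [float(i) for i in range(1, 101)]
certainty_above = decision_certainty(score_percentile(sample, 95.0), ct=90)
certainty_below = decision_certainty(score_percentile(sample, 45.0), ct=90)
```

Under this reading, a score 5 percentile points above a CT of 90 receives the same certainty as a score 45 points below it, since each side is scaled by its own range.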

Binary Classification
As detailed in the previous section, the binary classification for each model was determined by comparing the model score with the model's numeric classification threshold. Additionally, for each classification, we calculated its decision certainty. The minimal required decision certainty is a parameter of our system, termed MRC. First, for each target term, we ran all the models and obtained a binary classification from each of them. Then, we filtered out models with decision certainty below the MRC parameter. Finally, we applied the majority rule, a decision rule that selects the alternative with a majority, i.e. more than half of the votes.
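The filtering and voting steps can be sketched as follows. The handling of the case in which no model passes the MRC filter (defaulting to "no change") is an assumption, as are the function name and the toy decisions.

```python
def classify_target(model_decisions, mrc):
    """Majority vote over models whose decision certainty reaches the MRC.

    model_decisions: list of (changed, certainty) pairs, one per model.
    """
    votes = [changed for changed, certainty in model_decisions if certainty >= mrc]
    if not votes:
        return False  # assumption: with no confident model, predict "no change"
    return sum(votes) > len(votes) / 2  # strictly more than half of the votes

# Hypothetical decisions of five models for one target term.
decisions = [(True, 0.8), (True, 0.6), (False, 0.9), (False, 0.2), (False, 0.1)]
filtered_vote = classify_target(decisions, mrc=0.5)
unfiltered_vote = classify_target(decisions, mrc=0.0)
```

In this toy input the certainty filter flips the outcome: two of the three confident models vote "changed", whereas only two of all five models do.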

Ranking
For the ranking task, to rank a set of target words according to their degree of lexical semantic change, we applied a similar approach.
First, for each target term, we ran all the models, obtained a score from each of them and normalized the models' scores to values between 0 and 1. Then, we filtered out models with certainty below the MRC parameter. Finally, we calculated a weighted average. Our weighted average takes into account the scores' certainty by multiplying each score by its certainty. Thus, models with higher certainty had more effect on the target term ranking.
Since the scores of each model had a different scale, it was essential to normalize the scores before averaging them.
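The ranking procedure can be sketched as follows. Min-max scaling as the normalization, division by the sum of the retained certainties, and a combined score of 0 when no model passes the filter are all assumptions; the names and toy values are hypothetical.

```python
def min_max_normalize(scores_by_target):
    """Map one model's scores over all targets to the interval [0, 1]."""
    lo, hi = min(scores_by_target.values()), max(scores_by_target.values())
    if hi == lo:
        return {t: 0.0 for t in scores_by_target}
    return {t: (s - lo) / (hi - lo) for t, s in scores_by_target.items()}

def weighted_change_score(entries, mrc):
    """Certainty-weighted average of (normalized_score, certainty) pairs above the MRC."""
    kept = [(s, c) for s, c in entries if c >= mrc]
    total = sum(c for _, c in kept)
    if total == 0:
        return 0.0  # assumption: no confident model means the lowest combined score
    return sum(s * c for s, c in kept) / total

def rank_targets(model_scores, model_certainties, mrc):
    """Return targets ordered from most to least changed, plus the combined scores."""
    normalized = {m: min_max_normalize(s) for m, s in model_scores.items()}
    targets = list(next(iter(model_scores.values())))
    combined = {
        t: weighted_change_score(
            [(normalized[m][t], model_certainties[m][t]) for m in model_scores], mrc)
        for t in targets
    }
    return sorted(combined, key=combined.get, reverse=True), combined

# Hypothetical raw scores (different scales) and certainties for three targets.
scores = {"model_a": {"w1": 2.0, "w2": 4.0, "w3": 6.0},
          "model_b": {"w1": 0.9, "w2": 0.1, "w3": 0.5}}
certainties = {"model_a": {"w1": 0.8, "w2": 0.6, "w3": 0.9},
               "model_b": {"w1": 0.2, "w2": 0.7, "w3": 0.9}}
ranking, combined = rank_targets(scores, certainties, mrc=0.5)
```

Without the normalization step, `model_a` would dominate the average simply because its raw scores are larger.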

Results
Given the unsupervised LSC detection task, our concept was to use a minimum number of parameters beyond the models' parameters. Our system used two parameters (see Section 3): general classification threshold (CT) and minimal required decision certainty (MRC). In Table 2 we report our system results in the post-evaluation phase for the two sub-tasks with different configuration settings. For task 1, binary classification, we report the Accuracy (ACC) and for task 2, ranking, we report the Spearman correlation (SPR).
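For illustration, the two evaluation measures can be computed as in the following minimal sketch. It is not the official scoring script of the task; the function names and toy inputs are hypothetical, and in practice a library routine such as `scipy.stats.spearmanr` would typically be used for the correlation.

```python
import math

def accuracy(predicted, gold):
    """Share of targets whose binary label matches the gold label."""
    if len(predicted) != len(gold) or not gold:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)

def average_ranks(values):
    """1-based ranks, where tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the two rank vectors."""
    if len(x) != len(y):
        raise ValueError("inputs must have equal length")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical system output and gold data for four target words.
acc = accuracy([1, 0, 1, 1], [1, 0, 0, 1])
spr = spearman([0.1, 0.4, 0.3, 0.9], [1.0, 3.0, 2.0, 4.0])
```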
In the evaluation phase, our system achieved its highest score with configuration number 4 (CT=90, MRC=0.5). Our system was ranked 8th in task 1 (ACC=0.636) and 13th in task 2 (SPR=0.254). As seen in Table 2, the score for this configuration in the post-evaluation phase is higher (ACC=0.647, SPR=0.283). The reason for this gap is that in the post-evaluation phase we improved our system by normalizing the models' scores to values between 0 and 1, as explained in Section 3.5.
In the post-evaluation phase, we found that lower CT and MRC values achieved higher accuracy but lower Spearman correlation. As seen in Table 2, the best configuration for task 1 is configuration no. 8 (CT=80, MRC=0.4), with ACC=0.664 and SPR=0.321, whereas a different configuration is best for task 2. We also examined what the results would have been had we fine-tuned the system parameters (CT and MRC) and selected, for each language, the configuration with the highest performance. In this case, our system obtains an average accuracy of 0.685 and an average Spearman correlation of 0.436. This can be deduced from Table 2 as follows: if for each language we select the bolded configuration with the maximum ACC, their averaged ACC is 0.685. In the same way, an average of 0.436 is obtained in the SPR column.
To test the performance of our system, we analyzed the results of the best configurations (no. 8 and no. 12). We compared the score of each model separately to the score of our system, which weighs all the models' scores together. We found no consistent behaviour: no single model consistently outperformed all the other models or our weighted score. In the case of the German and the Swedish languages, the SGNS-based models produced better results than the other models. In the other languages, other models performed better. For example, in task 1, the PPMI-WI-LND and COUNT-TD models produced the best results in English, while in the Swedish language, the best results were obtained by the PPMI-WI-LND and COUNT-TD models. In task 2, the SGNS-WI-CD model produced the best results in English, while in the Swedish language, it was the SGNS-OP-CD model.
We also noticed that models based on dispersion measures, which strongly rely on frequency, had low performance. This could result from the fact that the organizers controlled each test set for frequency (which we could not know before they published the task description paper).

Conclusions and Future Work
We have implemented a system that systematically incorporates existing models to identify LSC over time in text corpora of four languages. We evaluated the score distribution of existing models, suggested a general classification threshold and applied it to each of the models individually. We calculated the models' score certainty and used it to aggregate the models. In the evaluation phase of the SemEval-2020 Task 1, our system was ranked 8th and 13th in the classification and ranking sub-tasks, respectively.
We plan to investigate additional aggregation methods and explore the impact of the individual models on the combined system to improve our system results. We also plan to try our system on languages of other families, such as Semitic languages (Liebeskind and Liebeskind, 2020), and to use LSC models to construct a diachronic thesaurus, which bridges the lexical gap between modern and ancient language (Zohar et al., 2013; Liebeskind and Dagan, 2015; Liebeskind et al., 2016; Liebeskind et al., 2019).