UoB at SemEval-2020 Task 1: Automatic Identification of Novel Word Senses

Much as the social landscape in which languages are spoken shifts, language too evolves to suit the needs of its users. Lexical semantic change analysis is a burgeoning field of semantic analysis which aims to trace changes in the meanings of words over time. This paper presents an approach to lexical semantic change detection based on Bayesian word sense induction, suitable for novel word sense identification. The approach is used for a submission to SemEval-2020 Task 1, the results of which show it to be capable of performing the task. The same approach is also applied to a corpus gleaned from 15 years of Twitter data, the results of which are used to identify words which may be instances of slang.


Introduction
Automatic lexical semantic change detection is a field of semantic analysis which aims to discern how the meanings of words change over time. As interest in the field has increased, a variety of procedures, languages and corpora have been used, which makes it difficult to compare different sets of results. SemEval-2020 Task 1, "Unsupervised Lexical Semantic Change Detection" (Schlechtweg et al., 2020), addresses this difficulty by providing a single unified framework within which approaches can be compared on a standardised dataset. The task involves determining whether a set of target words have changed meaning between two corpora, each of which corresponds to a different time period. Corpora are provided for German, English, Latin and Swedish. The task consists of two subtasks: a binary classification of target words into those which have or have not changed meaning, and a ranking of words according to their degree of change.
This paper describes our submission to SemEval-2020 Task 1. To produce this submission, we use an approach based on the Hierarchical Dirichlet Process (HDP) (Teh et al., 2004), an extension of Latent Dirichlet Allocation (LDA) which allows the number of senses to be unbounded. We also apply our method to a corpus constructed from Twitter data in order to explore the possibility of detecting semantic change over a shorter period of time, with a focus on whether such detection can be used to identify slang. The next section presents work related to methods for identifying novel word senses, which is followed by a description of our system in Section 3 and the results of applying the system in Section 4. Section 5 provides an overview of our conclusions and possible directions for future work.
Related Work

Context vectors are used by Gulordava and Baroni (2011) to identify semantic change over time. Context vectors can also be used as part of a different representation: Yao et al. (2017) use a PPMI matrix for each time period to learn dynamic embeddings, an approach to identifying lexical change which reduces the need to align vectors that arises when using the static embeddings traditionally employed. Bamler and Mandt (2017) also use dynamic embeddings, presenting a probabilistic version of word2vec (Mikolov et al., 2013), an architecture for static embeddings. Embeddings are also used by Asgari et al. (2020), who use domain-specific embeddings to detect domain shifts. A survey of the use of embeddings for semantic shift detection is given in Kutuzov et al. (2018).
Context vector and embedding approaches to identifying change of meaning cannot recover the underlying word senses. Brody and Lapata (2009) give an approach to the problem of word sense induction which is based on the Latent Dirichlet Allocation (LDA) model (Blei et al., 2003) of text generation. Zhai and Boyd-Graber (2013) extend the LDA model from the parametric to the nonparametric setting by allowing an infinite vocabulary. Lau et al. (2012) instead use a Hierarchical Dirichlet Process (HDP) model (Teh et al., 2004) for the task of word sense induction. A model based on the HDP is similarly used by Yao and Van Durme (2011) to extend the flexibility of word sense induction models so that they adapt to varying degrees of polysemy. An HDP is also used by Cook et al. (2014), who apply word sense induction to dictionaries and newspapers. A similar approach is taken by Cook et al. (2013) in the context of updating dictionaries. Frermann and Lapata (2016) use a Bayesian model for tracking gradual sense changes over time.
Graph-based models for finding lexical semantic change are proposed by Tahmasebi et al. (2011), who track language evolution by clustering word senses over various time periods, an approach echoed by Mitra et al. (2014), who study word sense change in digitised books from 1520 to 2008. The most comprehensive survey of the field of lexical semantic change detection is presented by Tahmasebi et al. (2018).

System Overview
The data from SemEval-2020 Task 1 consist of two corpora for each language, each comprising a number of short contexts of words. Though data are provided in several languages, only the English data were used. Of the two corpora, one is a reference corpus and the other a focus corpus: the reference corpus is taken to represent standard usage, and the focus corpus contains newer texts. Each corpus is partitioned into short pseudo-documents, which may be treated as documents; 'pseudo-document' and 'document' will hereafter be used interchangeably. The creation of these pseudo-documents rests on the assumption that each corpus takes the form of short contexts of a small number of sentences of text. All stopwords and low-frequency terms are removed.
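The preprocessing step above can be sketched as follows. Note that the stopword list and frequency floor shown here are illustrative placeholders, not the values used in the actual system.

```python
from collections import Counter

# Illustrative placeholders; the real system would use a full stopword
# list and a corpus-specific frequency floor.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "it"}
MIN_FREQ = 2

def preprocess(documents):
    """Remove stopwords and low-frequency terms from tokenised documents."""
    counts = Counter(tok for doc in documents for tok in doc)
    return [
        [tok for tok in doc
         if tok not in STOPWORDS and counts[tok] >= MIN_FREQ]
        for doc in documents
    ]

docs = [["the", "bank", "of", "the", "river"],
        ["the", "bank", "lends", "money"]]
print(preprocess(docs))  # only "bank" survives: frequent and not a stopword
```

In practice a complete stopword list (e.g. from NLTK) would be substituted for the toy set above.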

Table 1: Notation. n_jt denotes the number of words in document j at table t; n_jt^{-ji} denotes the same count excluding observation i.

A Hierarchical Dirichlet Process (HDP) (Teh et al., 2004) is used to model the senses for each word. Notation is given in Table 1. An HDP is selected over a Latent Dirichlet Allocation (LDA) model, as used in Brody and Lapata (2009), because the HDP allows the number of latent factors to grow with the data.
An initial random set of senses is induced following the generative process of the HDP, which corresponds to a partition of words inside documents. The process is based on the Chinese Restaurant Franchise (CRF) representation of a two-level HDP (Teh et al., 2004), which partitions customers at the group level and dishes at the top level.

Figure 1: A depiction of a Chinese Restaurant Franchise (Teh et al., 2004). Customers θ_i sit at tables ψ_t, which each order menu items φ_k.

Each document j may be considered a group. Its document-level Chinese Restaurant Process (CRP) generates a table index t_ji for each observation i according to

p(t_ji = t) ∝ n_jt for an existing table t,    p(t_ji = t_new) ∝ α,

where α is a concentration parameter. This partitions document j into tables. After all words are assigned to a table, the topic index k_jt of each table is generated by the corpus-level CRP. Each table is assigned a topic index according to

p(k_jt = k) ∝ m_k for an existing topic k,    p(k_jt = k_new) ∝ γ,

where m_k is the number of tables across all documents serving topic k and γ is a concentration parameter for the base distribution. This assigns tables to topics, which completes the initial partition. Each observation x_ji is associated with a table index t_ji, which is in turn associated with a topic index k_jt; this topic index links the table to one of the corpus topics φ_k.

This partition is updated using the sampling scheme outlined in Teh et al. (2004), which we summarise here. Each word in each document is assigned to a per-document table t_ji, with each of these tables associated with a factor k_jt. The conditional distribution of t_ji is calculated by combining the conditional prior distribution for t_ji with the likelihood of generating x_ji. The prior probability of t_ji taking a previously used value t is proportional to n_jt^{-ji}, and the probability that it takes a new value is proportional to the parameter α. The likelihood due to x_ji for a previously used t is f_{k_jt}^{-x_ji}(x_ji), the conditional density of x_ji under topic k_jt given all other observations assigned to that topic; for a new topic, f_{k_new}^{-x_ji}(x_ji) = ∫ f(x_ji | φ) h(φ) dφ is the prior density of x_ji, where h is the density of the base distribution.
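As a minimal illustration of the document-level seating step described above (a simplified sketch, not our actual implementation), a sequential CRP sampler chooses an occupied table with probability proportional to its occupancy and a new table with probability proportional to α:

```python
import random

def crp_partition(n_items, concentration, rng):
    """Seat n_items customers by the Chinese Restaurant Process: an
    existing table t is chosen with probability proportional to its
    occupancy, a new table with probability proportional to the
    concentration parameter."""
    tables = []       # tables[t] = number of customers at table t
    assignments = []  # table index chosen for each customer
    for _ in range(n_items):
        # weight of opening a new table sits at index len(tables)
        weights = tables + [concentration]
        t = rng.choices(range(len(weights)), weights=weights)[0]
        if t == len(tables):
            tables.append(0)  # open a new table
        tables[t] += 1
        assignments.append(t)
    return assignments, tables

rng = random.Random(0)
alpha = 1.0
seats, occupancy = crp_partition(10, alpha, rng)
assert sum(occupancy) == 10  # every customer is seated exactly once
```

The corpus-level CRP that assigns dishes to tables has the same form, with the table counts m_k in place of the occupancies n_jt and γ in place of α.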
From this, the conditional distribution of t_ji can be formed as

p(t_ji = t | t^{-ji}, k) ∝ n_jt^{-ji} f_{k_jt}^{-x_ji}(x_ji) for a previously used t,
p(t_ji = t_new | t^{-ji}, k) ∝ α p(x_ji | t^{-ji}, t_ji = t_new, k).

This provides the distribution according to which the tables for words are sampled; the PPMI values in the co-occurrence matrix are used for estimation. If the sampled value is t_new, then a new table is created for the restaurant and the word is assigned to it. A dish must then be selected for this table: a sample for k_{jt_new} is obtained according to

p(k_{jt_new} = k | t, k^{-jt_new}) ∝ m_k f_k^{-x_ji}(x_ji) for a previously used k,
p(k_{jt_new} = k_new | t, k^{-jt_new}) ∝ γ f_{k_new}^{-x_ji}(x_ji).

This gives an assignment of each word to a cluster inside a document, and of each cluster to a corpus-level topic. Senses are formed from each topic by considering the distribution of assigned words. These topics represent the senses across both corpora, inferred on pooled instances of each word. Since the topics are jointly modelled, the discovered topics are applicable to both corpora, meaning there is no need to reconcile senses across corpora (Cook et al., 2014). Words with multiple senses are represented by their presence in multiple topics.

Once sampling has concluded, the word distributions over topics are calculated: based on the topics assigned to each observed word x_ji in each document, a distribution over topics is determined separately for each word for the reference and the focus instances. From these, the Jensen-Shannon distance between each target word's distributions in the reference and focus corpora is calculated.

Additionally, for the application to the Twitter corpus, a method based on the Novelty Diff method of Cook et al. (2014) is used, which we now describe. The distribution of words over senses is categorised according to which instances originated in the reference corpus and which in the focus corpus. For each sense s, the novelty score is calculated as

Novelty_Diff(s) = p_f(s) - p_r(s),

where p_f(s) and p_r(s) are the proportions of usages of a given word corresponding to sense s in the focus and reference corpus, respectively. The score for each word is the maximum over its induced senses.
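The two word-level scores described above, the Jensen-Shannon distance between a word's sense distributions in the two corpora and the per-sense novelty score of Cook et al. (2014), can be sketched as follows; the sense distributions shown are invented for illustration.

```python
from math import log2

def js_distance(p, q):
    """Jensen-Shannon distance: the square root of the JS divergence
    (base-2) between two discrete distributions over the same senses."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    def kl(a, b):
        return sum(ai * log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    return ((kl(p, m) + kl(q, m)) / 2) ** 0.5

def novelty_diff(p_focus, p_ref):
    """Word score: the maximum over senses of p_f(s) - p_r(s)."""
    return max(pf - pr for pf, pr in zip(p_focus, p_ref))

p_ref = [0.7, 0.3, 0.0]    # sense distribution in the reference corpus
p_focus = [0.4, 0.2, 0.4]  # sense distribution in the focus corpus
print(js_distance(p_ref, p_focus))
print(novelty_diff(p_focus, p_ref))  # third sense is novel: 0.4 - 0.0
```

A word whose third sense appears only in the focus corpus, as above, receives a high novelty score, which is the behaviour exploited for slang identification.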
Once the novelty scores have been calculated for all words, the words with novel senses can be identified as those with the highest Novelty Diff scores. This method is used when applying the model to the Twitter data rather than to the SemEval task, in order to identify which words have gained senses, a narrower focus which is useful in the applied context of identifying slang.

Results
The model's performance on SemEval-2020 Task 1 is presented in Table 2, alongside results from the baselines for each subtask. The model was run only on the English data for each subtask, with the sample files used for all other languages to form a complete submission. To reflect this, we report both the overall score and the score on the English component in Table 2. The frequency difference baseline is the absolute difference in normalised target word frequencies between the corpora. The count vector baseline is the cosine distance between vector representations of words in the two corpora. The values for these baselines are provided by the task organisers. The model was run for 1 iteration. The window size was set to 2, the floor for determining low-frequency terms was set to 1, and the concentration parameters α and γ were both set to 1. To produce the submission, the Jensen-Shannon distance for each target was calculated; this provides the submission for Subtask 2. For Subtask 1, lemmas were determined to have changed meaning if their Jensen-Shannon distance was above a threshold value of 0.6. In the SemEval competition (see Table 2 for complete results), for which we only produced a submission for the English component, our system placed 7th of 9 on the English data for the classification subtask and 16th of 21 for the ranking subtask. It was more accurate than both baselines for the ranking subtask and than one of the baselines for the classification subtask, indicating that the model works as expected to a reasonable degree of accuracy. The lacklustre performance on the first subtask relative to the second may have been influenced by the choice of threshold parameter, which was never altered and was chosen with little empirical evidence, resulting in a threshold too low to appropriately identify instances of meaning change.
Experimenting with this value, as well as those of other hyperparameters, would likely have improved performance.
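For concreteness, the thresholding used to derive the Subtask 1 labels from the Subtask 2 distances can be sketched as follows; the lemmas and scores shown are invented.

```python
# A lemma is classed as changed (1) when its Jensen-Shannon distance
# exceeds the threshold of 0.6 used for the submission.
THRESHOLD = 0.6

def classify(js_scores):
    """Map each target lemma to 1 (changed) or 0 (unchanged)."""
    return {word: int(score > THRESHOLD) for word, score in js_scores.items()}

scores = {"plane": 0.72, "graft": 0.31}  # invented example distances
print(classify(scores))  # {'plane': 1, 'graft': 0}
```

Tuning THRESHOLD on held-out data, rather than fixing it in advance, is the adjustment suggested above.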

Table 3: Identified slang words and their novel senses as defined by Urban Dictionary.

ill: cool, tight, sweet
hater: a person that simply cannot be happy for another person's success, so rather than be happy they make a point of exposing a flaw in that person
like: a meaningless word teenagers insert liberally into both colloquial and formal speech in order to maintain a steady stream of words
roll: used to describe the effects of Ecstasy
safe: a cool person; to signify agreement; to signify something is good
checked: the process by which someone puts another individual in their place verbally or physically, either in a joking manner or a serious beatdown
bars: (1) sentences in lyrical hiphop songs; (2) slang name for a 2mg Xanax tablet

In addition to the SemEval competition, the same model was used to experiment with data from Twitter, with the goal of determining whether the same procedure can detect sense change over a shorter timeframe, with a particular focus on identifying slang through words which have gained novel senses. The corpus was constructed from a set of 500,000 tweets spanning a period of 15 years. Using the approach detailed in Section 3, the model was run for 1 iteration against the Twitter corpora in order to identify the lemmas whose senses differed most between the older and the more recent corpus, and which are therefore most likely to have gained a novel sense. The 25 words with the highest Novelty Diff values were extracted. The online slang dictionary Urban Dictionary was used to verify whether each of these words had a meaning aside from its standard usage or was generated erroneously. Of the identified words, 7 could be verified as genuine instances of slang; these are listed alongside their novel senses as defined by Urban Dictionary in Table 3.
The full set of identified words is available on GitHub.

Conclusions and Future Work
This paper used Bayesian word sense induction methods as the basis for identifying lexical semantic change, producing a model for determining sense differences which was then evaluated on Task 1 of the SemEval-2020 workshop; the evaluation confirmed that the method is appropriate to the task and works as expected. The same approach was also used with data from Twitter with the goal of identifying slang, which produced a set of candidate words. Whilst some of these were genuine instances of slang, a significant number were erroneously identified, or amounted to mere lexical innovation rather than any true semantic change.
The system was only tested against the English data for each subtask. Future work would involve extending the system to cover all of the competition's languages.