Data-driven Identification of Idioms in Song Lyrics

The automatic recognition of idioms poses a challenging problem for NLP applications. Whereas native speakers can intuitively handle multiword expressions whose overall meaning is hard to trace back to the semantics of their individual words, there is still ample scope for improvement regarding computational approaches. We assume that idiomatic constructions can be characterized by gradual intensities of semantic non-compositionality, formal fixedness, and unusual usage context, and introduce a number of measures for these characteristics, comprising count-based and predictive collocation measures together with measures of context (un)similarity. We evaluate our approach on a manually labelled gold standard derived from a corpus of German pop lyrics. To this end, we apply a Random Forest classifier to analyze the individual contribution of features for automatically detecting idioms, and study the trade-off between recall and precision. Finally, we evaluate the classifier on an independent dataset of idioms extracted from a list of Wikipedia idioms, achieving state-of-the-art accuracy.


Introduction
Traditional accounts of idiomaticity distinguish idiomatic use of language from literal use, claiming that idioms are multiword expressions (MWEs) which do not conform to Frege's principle, i.e. whose meaning as a whole cannot fully be derived from the aggregated meaning of their components (Gibbon, 1982). In other words, the definition refers to non-compositionality and non-transparency: idiomatic MWEs seem semantically opaque; Baldwin and Kim (2010) consider this "lexical idiomaticity" to be one of five sub-types of idiomaticity. Classifying idioms is not trivial: With reference to recent findings in discourse analysis and psycholinguistics, Wulff (2008) describes idiomaticity as a non-binary, multifactorial concept for a "continuum ranging from clearly non-idiomatic patterns to core idioms"; Pradhan et al. (2018) support this observation experimentally. At least core idioms are considered to be (mentally) lexicalized: Schneider et al. (2014) describe them as "lexicalized combinations of two or more words" which, though often syntactically diverse, "are exceptional enough to be considered as individual units in the lexicon". This corresponds to Sinclair's idiom principle (Sinclair, 1991), postulating that text is often constructed from ready-made phrases. Due to morphological and syntactic variation, the degree of formal fixedness ranges from semi- to fully fixed. However, idiomaticity should be verifiable on a corpus basis, as e.g. Gries (2008, p. 22) states that "researchers interested in phraseologisms use frequencies and other more elaborated statistics" to identify "symbolic units and constructions". Some of these statistics may relate to local contexts, because one can reasonably argue that words that are not used literally will probably be somehow surprising in their context.
Against this background, we regard idioms as a subcategory of MWEs that are conspicuous in function, form and distribution, with fuzzy boundaries to other multiword units like metaphors (Stefanowitsch and Gries, 2007) or proverbs. Our objective is to cover idiom characteristics with an innovative set of quantitative features, taking up some ideas described in the subsequent section, and to apply and evaluate machine-learning classifiers on a specialized corpus that is presumably rich in idioms.

Related work
Idioms are a key concern and pose challenging problems for NLP applications such as information extraction, retrieval, summarization and translation, as well as for lexicographical studies or language learning; see Constant et al. (2017). Sag et al. (2002) refer to them as "a pain in the neck for NLP"; consequently, their machine-supported recognition constitutes an ideal testbed for a variety of methodical approaches and is the subject of shared tasks; see, e.g., Markantonatou et al. (2020). Fazly and Stevenson (2006) propose measures that quantify the degree of lexical and syntactic fixedness. Verma and Vuppuluri (2015) rely on lexical features in order to identify MWEs whose meanings differ from their components' meanings. Sporleder and Li (2009) include the collocational contexts of idiomatic MWEs in their computation; they model semantic relatedness with the help of lexical chains and cohesion graphs, and, based on this, compare supervised with unsupervised approaches for token-based idiom classification. Katz and Giesbrecht (2006) use latent semantic analysis in order to verify whether context word vector similarity between idiomatic MWEs and their constituents helps with the calculation. Muzny and Zettlemoyer (2013) achieve a precision level of 65% for the distinction between idiomatic and literal Wiktionary phrases, using lexical and graph-based features in order to quantify the assumption that literal phrases are more likely to have closely related words in their definition clause than idiomatic phrases. Salton et al. (2016) investigate whether sentential distributed semantics of idiomatic verb-noun (VN) combinations show significant differences from non-idiomatic usage, and therefore train Sent2Vec models for sentence-level contexts. Using the same dataset,  compute local context differences between word vector matrices on the basis of the Frobenius norm. Senaldi et al. (2019) train vector-based models on a gold standard of VN constructions that has been annotated regarding idiomaticity on a 1-7 Likert scale. Hashempour and Villavicencio (2020) use contextualized word embeddings in order to distinguish between literal and idiomatic senses of MWEs that are treated as individual tokens in training and testing, producing average F1-scores of more than 70%.
We take up the idea of evaluating different context representations, expand corresponding measures with syntagmatic and other statistical features, and analyze how they complement each other to characterize idioms. Furthermore, we broaden the scope by extending the dataset beyond VN combinations, including all kinds of MWEs without morphosyntactic restrictions.

Dataset and features
The aim of this study is to evaluate quantitative features of MWEs with regard to their suitability for detecting idiomatic MWEs in a given text corpus. Contemporary pop song lyrics, an as yet sparsely examined register, seem intrinsically promising for two reasons: Firstly, lyrics combine qualities of spoken and written language (Werner, 2012) with wordplay creativity (Kreyer, 2012) and can thus be expected to constitute a valuable source of both well-known and innovative idiomatic constructions. Secondly, on account of their formal structure, catchy and often idiomatic phrases tend to be repeated in choruses, so that there should be good prospects for empirical evidence. We use the freely available Corpus of German Song Lyrics (Schneider, 2020), covering a period of five decades and a broad range of artists, in order to ensure that our findings can be reproduced and compared by future studies. The general approach should also be applicable to languages other than German.
Although the corpus comes with XML-coded multi-layer annotations, we mainly work on the raw data and do not rely on linguistic preprocessing like parsing or lemmatization. To avoid reference to lexica or pre-defined syntactic template lists (like V-NP constructions), we include any ngram, spanning a minimum of two word tokens and a maximum of six word tokens within sentence boundaries. This yields a dataset of more than six million ngrams. From these we randomly select a sample of 10,000 ngrams.
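For illustration, the candidate extraction over one tokenized sentence can be sketched in a few lines of Python (the function name and interface are our own illustrative choices, not part of the corpus tooling):

```python
def extract_ngrams(sentence_tokens, min_n=2, max_n=6):
    """Collect all ngrams spanning 2 to 6 tokens within one
    sentence, mirroring the candidate extraction described above."""
    ngrams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(sentence_tokens) - n + 1):
            ngrams.append(tuple(sentence_tokens[start:start + n]))
    return ngrams
```

Applied to every sentence of the corpus, this enumeration yields the candidate pool from which the 10,000-ngram sample is drawn.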
This dataset is manually annotated by a native speaker in order to serve as a gold standard. To cope with the above-mentioned fact that idiomatic status cannot always be described as either clearly idiomatic or clearly literal, we allow for three categories and mark idiom candidates as either literal, idiomatic, or partly idiomatic, where the latter comprises ngrams with both idiomatic and non-idiomatic content and is excluded from our analysis; see Table 4 in Section 4 for exact numbers.
As a starting point for our evaluation, each dataset entry is automatically annotated with a number of features. We distinguish between three main groups of features to characterize idioms; for a detailed breakdown see Table 5.
Syntagmatic features (SY) measure collocation strength between all word pairs within an idiom candidate. Context features (CO) measure semantic similarity between the words within an idiom candidate and the words in its left/right context. Finally, other features (O) represent a variety of counts to assess the amount of evidence available, such as the number of words in an idiom candidate.
SY_C1 and SY_C2 comprise a number of count-based collocation measures between a word and its neighbours within a window of +/-5 (Evert, 2008). SY_C1 are based on the counts in DeReKo (Kupietz et al., 2010), whereas SY_C2 are based on the counts in the pop lyrics corpus. These count-based measures all aim at identifying MWEs that occur more often than randomly expected. We expect that idioms, like other MWEs, are characterized by high SY_C.
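As an illustration of this family of measures, the sketch below counts co-occurrences within a +/-5 window and computes pointwise mutual information from such counts; the actual SY_C features aggregate several count-based measures, so this is a simplified stand-in:

```python
import math
from collections import Counter

def window_cooccurrences(tokens, window=5):
    """Count ordered word pairs co-occurring within a +/-5 window."""
    pairs = Counter()
    for i, w in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if i != j:
                pairs[(w, tokens[j])] += 1
    return pairs

def pmi(pair_count, w1_count, w2_count, total):
    """Pointwise mutual information: log2 of observed vs. expected
    co-occurrence probability; positive values indicate pairs that
    occur more often than randomly expected."""
    return math.log2((pair_count / total) /
                     ((w1_count / total) * (w2_count / total)))
```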
SY_W comprises a number of predictive collocation measures. These are all calculated by aggregating the output activations in a three-layer neural network using the structured skip-gram variant (Ling et al., 2015) of word2vec (Mikolov et al., 2013), again with a window size of +/-5. As shown by Levy and Goldberg (2014), these output activations approximate the shifted pointwise mutual information. These predictive measures generalize from actually observed collocations by means of dimensionality reduction in the hidden layer and thus can also predict unseen but meaningful collocations. However, due to this generalization they are typically biased towards the dominant, usually literal usage. Thus, we expect that idioms, unlike other MWEs, are characterized by low SY_W.
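Following Levy and Goldberg (2014), such a predictive score can be read off a trained skip-gram model as the dot product between a word's input embedding and a context word's output embedding, which approximates PMI(w, c) - log k for k negative samples. A minimal sketch with toy vectors (the dictionary lookup and vector values are purely illustrative, not the trained model):

```python
import math

def predictive_collocation(in_vecs, out_vecs, w, c, k=5):
    """Dot product of a word's input embedding and a context word's
    output embedding, shifted by log k back towards an (approximate)
    PMI scale (Levy and Goldberg, 2014)."""
    dot = sum(a * b for a, b in zip(in_vecs[w], out_vecs[c]))
    return dot + math.log(k)
```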
Tables 1 and 2 exemplify the interplay between count-based and predictive collocations. Among the top 10 count-based collocates of 'Kuh' (cow), there are 6 collocates (in bold) stemming from idiomatic use, for example, 'die Kuh vom Eis kriegen', literally 'getting the cow off the ice', meaning 'resolving a tricky situation'. In contrast, the predictive collocates all pertain to the literal meaning of cow as a domestic animal; e.g., 'Eis' does not occur among the top 400 predictive collocates.
The count-based and predictive collocates of 'Versuch' ('attempt'), on the other hand, show no such difference: both refer to the literal meaning of 'Versuch' (see http://corpora.ids-mannheim.de/openlab/derekovecs). However, also here we can observe a bias of the predictive collocates towards 'failed attempts'.
SY_R comprises non-parametric variants of some collocation measures, obtained by taking their ranks, to account for the different scales of SY_C1 and SY_W. This includes SY_C1_R, SY_W_R1, SY_W_R2, and the rank difference SY_R_D.
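The underlying rank transformation can be sketched as follows (illustrative only; in the actual features, a pair's score is ranked among all collocates of a word, and ties need proper handling):

```python
def to_ranks(scores):
    """Rank scores in descending order (1 = strongest collocate);
    ties are broken by position for simplicity."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0] * len(scores)
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    return ranks

def rank_difference(count_rank, predictive_rank):
    """SY_R_D-style difference between a count-based and a
    predictive collocation rank."""
    return count_rank - predictive_rank
```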
As depicted in Equation 1, for each syntagmatic collocation measure m we take the average over all pairs of words w_i, w_j in an idiom candidate c of size |c|:

SY_m(c) = 2 / (|c| * (|c| - 1)) * Σ_{i<j} m(w_i, w_j)    (1)

Null values, which occur when there is no pair with measures from DeReKo, are transformed to a default value.

The context features CO_VEC and CO_VEC_LEX aim at identifying idioms based on the heuristic that they occur within unusual thematic contexts. Idiomatic ngrams such as 'Perlen vor die Säue werfen' ('to cast pearls before swine') are often found in local contexts that are thematically rather untypical for non-idiomatic uses of the individual ngram words: the expression can be expected in a theatre review or a political speech, but hardly in texts explicitly dealing with jewellery or livestock. To this end, CO_VEC uses cosine similarity between word vectors, which identifies paradigmatically related words occurring in similar usage contexts, comprising (near) synonyms, but also hyponyms, meronyms, etc.
More specifically, CO_VEC is calculated as the mean cosine similarity between all pairs consisting of a word in the idiom candidate and a word in its left/right context (in the present case we include five context words to the left and right; see Figure 1 and Equation 2). CO_VEC_LEX is calculated like CO_VEC, but only takes lexical words into account, i.e. nouns, verbs, adverbs and adjectives. If the idiom candidate appears at several places within the corpus, an average is calculated.
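A sketch of this computation for a single occurrence (the embedding lookup table 'vectors' is an assumption here; out-of-vocabulary words are simply skipped):

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense word vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def co_vec(candidate, context, vectors):
    """Mean cosine similarity between every candidate word and every
    context word, in the spirit of CO_VEC (Equation 2)."""
    sims = [cosine(vectors[w], vectors[c])
            for w in candidate for c in context
            if w in vectors and c in vectors]
    return sum(sims) / len(sims) if sims else 0.0
```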
The last group O comprises O_GRAM, the number of words in an idiom candidate, O_NSTOPW [5], the number of non-stopwords, and O_DEREKO, the number of words for which a word embedding is available.
In summary, the syntagmatic features (SY) analyze idiom candidates for frequent (SY_C), but unusual (SY_W) collocations along the syntagmatic axis to assess their phraseness and non-transparency. The context features (CO) analyze their surrounding context for dissimilar words along the paradigmatic axis as a complementary measure of non-transparency. Both feature sets utilize the observation that word embeddings are typically biased towards the dominant/transparent meaning.

Methods and results
To evaluate our feature set, we have trained a Random Forest classifier [6]. Unless stated explicitly otherwise, all results have been obtained using 5-fold cross validation. To avoid overlap between training and test sets, we have removed all duplicates after lower-casing and stopword removal, leaving a dataset with 542 idioms and 8,697 non-idioms.
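The deduplication step can be sketched as follows (the normalization key, lower-casing plus stopword removal, follows the description above; the function itself is illustrative):

```python
def deduplicate(ngrams, stopwords):
    """Keep only the first occurrence of each ngram after
    lower-casing and stopword removal, so that no normalized
    form appears in both training and test folds."""
    seen, kept = set(), []
    for ngram in ngrams:
        key = tuple(w for w in (t.lower() for t in ngram)
                    if w not in stopwords)
        if key not in seen:
            seen.add(key)
            kept.append(ngram)
    return kept
```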
Because this dataset is highly unbalanced, we have systematically varied the Random Forest's cutoff hyperparameter (default 0.5). As shown in Figure 2, a cutoff of 0.3 achieves the best F1-Score of 61.9%, balancing recall and precision around 62%. The best balanced accuracy of 83% is achieved at a much smaller cutoff of about 0.05. This may be a more appropriate cutoff for explorative idiom detection, where sensitivity (recall) is more important than precision.
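The effect of the cutoff can be reproduced with any classifier that outputs class probabilities: predictions with an idiom probability above the cutoff count as idioms, and the F1-Score is computed per cutoff. A plain-Python sketch (not the original evaluation code):

```python
def f1_at_cutoff(probs, labels, cutoff):
    """F1-Score for the idiom class (label 1) when every candidate
    with predicted probability >= cutoff is classified as idiom;
    lowering the cutoff trades precision for recall."""
    tp = sum(1 for p, y in zip(probs, labels) if p >= cutoff and y == 1)
    fp = sum(1 for p, y in zip(probs, labels) if p >= cutoff and y == 0)
    fn = sum(1 for p, y in zip(probs, labels) if p < cutoff and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```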
To assess the contribution of the individual feature sets, we compare classification performance between using all features, each feature set individually, and subsets of features obtained by excluding individual feature sets. Table 3 summarizes the results [7]: all individual feature sets except O contribute to classification performance. The biggest contribution comes from the collocation features based on DeReKo counts (SY_C1), followed by the collocation features based on the (much smaller) pop lyrics corpus (SY_C2) and the predictive collocation features SY_W.

[5] SY_C1 and SY_W features are calculated on the idiom candidate after stopword removal.
[6] Support Vector Machines yield similar accuracies and scores.
[7] The standard deviation of balanced accuracy, measured over 10 runs of 5-fold cross validation with different seeds, is around 0.5 for all feature combinations.

The bottom half of the table analyzes how much performance is lost when excluding a feature set. The relative order is largely consistent with the upper half. In particular, also from this perspective, the count-based collocations SY_C1 (including their rank variants) turn out to be most important, i.e., excluding them leads to the largest loss in performance.
Interestingly, omitting the other features (O) also decreases performance, even though they do not contribute individually. This may be due to the fact that they do not model intrinsic characteristics of idioms, but just the number of word pairs available for estimating SY and CO feature sets, i.e., essentially the amount of evidence available. Thus they are only useful in combination with other feature sets.
For SY_R the effect is the other way around. SY_R achieves a remarkable F1-Score of 29.5% when taken alone, but the overall performance increases when the classifier is trained on all feature sets but SY_R. The lack of loss in performance may be due to the fact that SY_R is highly correlated with SY_C1 and SY_W by construction, and thus does not add information; the slight increase seems to be a random effect.

Table 4 details the classification performance for the best feature set (w/o SY_R). Interestingly enough, when inspecting the false positives, we find that our approach identifies full idioms overlooked by the manual dataset annotation, such as 'in meine Fußstapfen treten' ('follow in my footsteps') or 'hinter Gitterstäben' (lit. 'behind bars', meaning: 'in prison'). We also see partly idiomatic MWEs like 'süßes Gift' ('sweet poison'), as well as supposedly incomplete idioms like 'nur ein leeres [Versprechen?]' ('only an empty [promise?]'). The automatic classification even detects previously hidden teenage slang idioms such as 'Optik schieben' (lit. 'to push optics', approximately: 'to be under the influence of hallucinogenic drugs'). Besides, related phenomena like metaphors ('fahren in Richtung Gold', lit. 'drive towards gold') and allegories ('das ganze Leben ist ein Quiz', lit. 'all of life is a quiz') are labelled as well. Indeed, approximately 8% of the false positives show idiomatic or figurative use.
In order to better understand the interplay between features, Table 5 lists for each individual feature the mean decrease in accuracy (MDA), IGain, the information gain (*1000), TTest, the degree of significance by a Welch two-sample t-test for confidence levels 0.95 (*), 0.99 (**), and 0.999 (***), and Δ, the sign of the difference between the mean of a feature for idioms vs. non-idioms. The context features CO_VEC and CO_VEC_LEX have the highest MDA, followed by the other features O and the count-based collocation features estimated from the pop lyrics corpus SY_C2. All collocation (and rank) features estimated from DeReKo are in a similar range. Note, however, that MDA tends to be shared among correlated features.
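The TTest column can be reproduced with Welch's two-sample t statistic, which does not assume equal variances in the two groups; the sketch below computes only the statistic itself (degrees of freedom and p-value lookup are omitted):

```python
import math

def welch_t(xs, ys):
    """Welch's two-sample t statistic for comparing the mean of a
    feature between idioms (xs) and non-idioms (ys)."""
    nx, ny = len(xs), len(ys)
    mean_x, mean_y = sum(xs) / nx, sum(ys) / ny
    var_x = sum((x - mean_x) ** 2 for x in xs) / (nx - 1)
    var_y = sum((y - mean_y) ** 2 for y in ys) / (ny - 1)
    return (mean_x - mean_y) / math.sqrt(var_x / nx + var_y / ny)
```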
IGain assesses the individual (univariate) contribution of the features for classification. The two estimates of the overall frequency of an idiom candidate O_C2_N and O_C2_SGT have the highest IGain, closely followed by the count-based collocation features SY_C2 and SY_C1. The predictive collocation features SY_W and context features CO have slightly smaller IGain. This largely corroborates the results of the analysis of feature sets above.
With the exception of CO_VEC and two of the predictive collocation features, the difference between the means of all features in idioms vs. non-idioms is highly significant.
To better understand the contribution of the individual features, it is helpful to look at the difference Δ between their means: Compared to all non-idioms, words within idioms have a lower cosine similarity CO_VEC (but still higher CO_VEC_LEX) to their left and right neighbours, i.e., indeed they occur in unusual contexts. On the other hand, they have a higher count-based and predictive collocation strength among each other (SY_C1, SY_C2, SY_W) with some exceptions (SY_C1_LL,SY_W_CON,SY_W_NSUMAF). Consequently, they also have a smaller rank for these measures (SY_C1_R, SY_W_R1, SY_W_R2), although we would expect larger ranks.
However, non-idioms comprise random ngrams that do not occur more often than expected as well as frequent MWEs with high collocation strength. Thus it is instructive to constrain the comparison as follows: Δ′ gives the sign of the difference between the mean for idioms and all those non-idioms with SY_C1_LD larger than the mean of SY_C1_LD of all non-idioms, i.e., only the non-idiomatic but still frequent MWEs. Incidentally, all these differences are highly significant (at least 0.99), with the exception of CO_VEC. In this comparison, the context features CO and both the count-based and predictive collocation features estimated from DeReKo (SY_C1 and SY_W, except SY_C1_MI) are smaller, and accordingly the corresponding rank features are larger for idioms. In particular, the rank difference SY_R_D between count-based and predictive collocation is larger, i.e., co-occurring words in an idiom tend to be less well represented by the predictive collocations, which are biased towards the dominant meaning.
In summary, idioms, like non-idiomatic MWEs, are characterized by high collocation strength in comparison to randomly selected ngrams. However, in comparison with non-idiomatic but frequent MWEs, they are characterized by occurring in unusual contexts (low CO_VEC), and by low predictive collocation strength SY_W; or, put more bluntly, idiomatic MWEs occur frequently but are unusual.
To demonstrate the transferability of our approach, we have applied it to a dataset of German idioms extracted from the German Wikipedia. After removing duplicates (72) with respect to our gold standard, and all idioms that consist of fewer than two words after stopword removal, this set comprises 760 idioms.
As the training set for this out-of-domain scenario, we use a sample of 80% of the non-idioms and all idioms of our base dataset. The test set consists of the remaining 20% of the non-idioms and the Wikipedia idioms. We train the classifier on the feature ensemble SY_C1 + SY_W + SY_R + O (without the feature O_DEREKO), because the feature sets SY_C2 and CO are calculated on the ngram context within the pop lyrics corpus and are consequently not available for out-of-domain data. Figure 3 shows the trade-off curves of the predictions on the Wikipedia dataset for a range of cutoff thresholds. The obtained results are rather convincing: with a cutoff threshold of 0.05, the classifier achieves an F1-Score of 71.0% and a recall of 80.3%, which means that it is able to detect the majority of the unknown Wikipedia idioms. While not directly comparable due to different datasets and classification tasks, these results are in the same ballpark as, e.g., those of Hashempour and Villavicencio (2020), who report F1-Scores of 70%. Table 6 gives the confusion matrix of the prediction on the unknown idioms.

Conclusions
The aim of this study was to model well-studied idiom characteristics with quantitative features and to evaluate them on suitable datasets. Our evaluations show that count-based collocation measures indeed characterize idioms' frequent usage and stable occurrence, i.e. phraseness. The predictive collocation measures and the context features on the other hand are able to model uncommon usage, that is, non transparency. By applying our model, trained on an annotated dataset that was sampled from a pop lyrics corpus, to an out-of-domain dataset of idioms crawled from Wikipedia, we demonstrated the generalizability of our approach.
The introduced features do not require sophisticated or knowledge-intensive preprocessing, and need only minimal context. Even when no context is available, as for the out-of-domain dataset, we achieve state-of-the-art classification performance.
However, the feature set also has limitations. For idioms that consist of only one content word, possibly accompanied by stopwords, the collocation measures do not produce very meaningful results; in this case we need to rely entirely on the context features. In a similar vein, count-based collocation strength obviously does not apply to novel idioms. Moreover, when idiomatic use constitutes the overwhelmingly dominant use, as for 'kenne meine Pappenheimer' (lit. 'know my Pappenheimers', roughly: 'know the weak people (in my team)'), neither CO nor SY_W features can contribute.
In sum, however, all evaluation results, together with the detailed analysis of how the count-based and predictive features complement each other for discriminating between idioms and non-idioms, shed additional empirical light on the linguistically intricate and multifaceted phenomenon of idiomaticity. Waiving limitations on morphosyntactic templates (like, e.g., VN constructions), our approach should work well for any potentially idiomatic MWEs.
For future work, we intend to apply the approach to bigger datasets; attractive candidates might be the corpora of the PARSEME (PARsing and Multiword Expressions) network (Savary et al., 2018) or the COLF-VID dataset of verbal idioms (Ehren et al., 2020). We will also experiment with additional features, in particular to better capture the fixedness of idioms and to cope with non-transparent idiomatic compound words.
All data and source code is publicly available under a Creative Commons license at http:// songkorpus.de/data/.