Evaluating language models for the retrieval and categorization of lexical collocations

Lexical collocations are idiosyncratic combinations of two syntactically bound lexical items (e.g., “heavy rain” or “take a step”). Understanding their degree of compositionality and idiosyncrasy, as well as their underlying semantics, is crucial for language learners, lexicographers and downstream NLP applications. In this paper, we perform an exhaustive analysis of current language models for collocation understanding. We first construct a dataset of occurrences of lexical collocations in context, categorized into 16 representative semantic categories. Then, we perform two experiments: (1) unsupervised collocate retrieval using BERT, and (2) supervised collocation classification in context. We find that most models perform well in distinguishing light verb constructions, especially if the collocation’s first argument acts as subject, but often fail to distinguish, first, different syntactic structures within the same semantic category, and second, fine-grained semantic categories which restrict the use of small sets of valid collocates for a given base.


Introduction
Language models (LMs) such as BERT (Devlin et al., 2018), and its variants SpanBERT (Joshi et al., 2020), ALBERT (Lan et al., 2019), RoBERTa (Liu et al., 2019), etc. have proven extremely flexible, as they behave as unsupervised multitask learners (Radford et al., 2019), and can be leveraged in a wide array of NLP tasks almost out-of-the-box (see, e.g., the GLUE and SuperGLUE results in Wang et al. (2019b) and Wang et al. (2019a), respectively). They have also been harnessed as supporting resources for knowledge-based NLP (Petroni et al., 2019), as they capture a wealth of linguistic phenomena (Rogers et al., 2020). Recently, a great deal of research has analyzed the degree to which they encode, e.g., morphological (Edmiston, 2020), syntactic (Hewitt and Manning, 2019), or lexico-semantic structures (Joshi et al., 2020). However, less work has so far explored how LMs interpret phraseological units at various degrees of compositionality. This is crucial for understanding the suitability of different text representations (e.g., static vs contextualized word embeddings) for encoding different types of multiword expressions (Shwartz and Dagan, 2019), which, in turn, can be useful for extracting latent world or commonsense information (Zellers et al., 2018).
One central type of phraseological units are lexical collocations, defined as restricted co-occurrences of two syntactically bound lexical items (Kilgarriff, 2006), such that one of the items (the base) conditions the selection of the other item (the collocate) to express a specific meaning. For instance, the base lecture conditions the collocates give or deliver to express the meaning 'perform', the base applause conditions the selection of the collocate thunderous to express the meaning 'intense', and so on. Lexical collocations are of high relevance to lexicography, NLP and second language learning alike, and constitute a challenge for computational models because of their heterogeneity in terms of idiosyncrasy and degree of semantic composition (Mel'čuk, 1995).
In this paper, we analyze a suite of LMs in the context of two tasks that involve lexical collocation modeling. First, unsupervised collocate retrieval, where we mask a collocation's collocate (e.g., "heavy" in "heavy rain"), and quantify how well an LM of choice (BERT in particular) predicts, via its masked language modeling (MLM) objective, a valid collocate for that particular base ({"heavy", "torrential", "violent", . . . } for the base "rain" and the meaning 'intense'). Second, supervised in-context collocation categorization, where we fine-tune LMs on the task of predicting the semantic category of a collocation in terms of its lexical function (LF), given its sentential context; cf. Section 3.1. Modeling, recognizing, and classifying collocations in corpora has obvious applications for automatically creating and expanding lexicographic resources, as well as for various downstream NLP applications, among them, e.g., machine translation (Seretan, 2014), word sense disambiguation (Maru et al., 2019), or natural language generation (Wanner and Bateman, 1990). The two main contributions of this paper thus are:

1. A "collocations-in-context" dataset, with instances of collocations of 16 different semantic categories (in terms of LFs) in context, and with a fixed and lexical (i.e., non-overlapping) train/dev/test split (Section 3).

2. An evaluation framework for assessing the degree of compositionality of lexical collocations, pivoting around two tasks: unsupervised collocate retrieval (Section 4) and in-context collocation categorization (Section 5).
Our results suggest that modeling collocations in context is a challenge, even for widely used LMs, and that this is particularly true for less semantic (and thus less compositional and more idiosyncratic) collocations. We also find that jointly recognizing the semantics and the syntactic structure (e.g., whether the collocate acts as subject or object in verbal constructions) of a collocation constitutes a non-trivial challenge for current architectures. Moreover, as a byproduct of our analysis, we find an interesting behaviour of LMs when modeling antonymy in adjectives: their representations undergo substantial transformations as they flow through BERT's transformer layers, with many contextualized embeddings clustered together at the tip of a narrow cone that seems to represent adjectives in collocations denoting intensity ("heavy" rain) and weakness ("minor" issue).

Related Work
In this section, we discuss related work in two methodological areas that are relevant to this paper, namely conditioning MLMs (Section 2.1) and the recognition of multiword expressions (MWEs) (Section 2.2).

Conditioning MLMs
A masked language model (MLM) can be used as a proxy for gaining insights into how language is encoded by the weights of the (usually transformer-based) LM architecture. Moreover, simply asking an LM to predict words in context (without task-specific fine-tuning) has proved useful in NLP applications dealing with lexical items (affixes, words or phrases). For example, BERT's MLM has been used for augmenting training data in sentiment analysis tasks; Qiang et al. (2019) use BERT for lexical simplification by conditioning the predictions over the [MASK] token on the original sentence provided as context; and Zhou et al. (2019) obtain SotA results in lexical substitution by conditioning BERT via embedding dropout on the target (unmasked) word. Inspired by the findings in these works (especially Qiang et al. (2019)), we will explore the predictions of BERT over masked lexical collocations (with and without conditioning) in Section 4, with the aim of understanding whether these predictions can be used to measure the idiosyncrasy of the underlying semantics of a lexical collocation, i.e., whether the restrictions imposed by a collocation's base are due to the frozenness of the phrase itself or whether, on the contrary, sentential context is necessary.

Distributional Lexical Composition
Building representations that account for non-compositional meanings within the broader spectrum of encoding semantic relations between words is a long-standing problem in computational semantics (Baroni and Zamparelli, 2010; Mitchell and Lapata, 2010; Boleda et al., 2013). Interestingly, there seems to be little agreement on how these representations should be defined, with recent attempts focusing on verbal multiword expressions (see an overview of approaches in Ramisch et al. (2018)), phrases of variable length encoded via LSTMs based on their definitions (Hill et al., 2016), or arbitrary lexical and commonsense relations between word pairs for downstream NLP. As a testimony to the breadth of methods explored in the most recent literature, let us refer to, for instance, the combination of word vector averages with conditional autoencoders (Espinosa-Anke and Schockaert, 2018), expectation maximization (Camacho-Collados et al., 2019), LSTMs for predicting word pair contexts, and the explicit encoding of generalized lexico-syntactic patterns (Washio and Kato, 2018). Parting ways with the above works, in this paper we follow the experimental setting described in Shwartz and Dagan (2019), based on injecting multiword expressions (in our case, only lexical collocations) into sentential contexts to leverage the contextual nature of current LMs. However, our goal is not to compare different combinations of feature-extraction and training/fine-tuning methods, but rather to understand lexical collocations' learnability and idiosyncrasy, and their internal vector-space representations.

Lexical Collocations
Let us first introduce the notions of lexical collocation and LF. The term collocation has been used in computational linguistics research to denote two different concepts. On the one hand, following Firth (1957), it denotes combinations of words that co-occur more frequently than would be expected by chance. On the other hand, works such as Garcia et al. (2017) adopt the definition that is common in lexicography and phraseology (Hausmann, 1985; Cowie, 1994; Mel'čuk, 1995), according to which a collocation is an idiosyncratic combination of two lexical items, the base and the collocate, as defined above in Section 1. Under this interpretation, collocations are phraseological units, although their degree of compositionality can vary. For instance, win [a] war is perceived to possess a higher degree of (free) composition than, e.g., hold [a] meeting, and heavy rain is less compositional than [a] well-justified argument. We adopt this definition of the notion of collocation and, in order to avoid any confusion, we refer to it, following Krenn (2000), as lexical collocation.
Lexical collocations can be typified with respect to the meaning of the collocate and the syntactic structure formed by the base and the collocate. LFs provide a fine-grained typology of this kind (Mel'čuk, 1996). An LF can be considered a function f(L) that delivers for a base L a set of synonymous collocates that express the meaning of f. Where pertinent, f also codifies the subcategorization structure of the base+collocate combination. LFs are assigned Latin acronyms as names; cf., e.g., "Oper1" ('operare'), which means 'perform' and realizes the first argument of the base as subject: Oper1(lecture) = {deliver, give, hold}; or "Magn" ('magnum'), which stands for 'intense': Magn(applause) = {thunderous}. The encoding of LFs in NLP research has in recent years revolved around applying word embedding-based techniques, e.g., in terms of linear projections and semantic generalizations. Recently, Shwartz and Dagan (2019) analyzed, from the perspective of "static" vs. "contextualized" representations and their applicability to studying compositional phenomena like "meaning shift", one specific type of lexical collocation, namely light verb constructions (LVCs), which are well illustrated by the LFs Oper1 and Oper2 (and also, partially, by Real1 and Real2).
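Computationally, an LF can be sketched as a mapping from bases to sets of collocates; a minimal illustration, where the entries shown are the examples given in the text and the function name is ours:

```python
# Minimal sketch of lexical functions (LFs) as mappings from a base
# lexeme to the set of collocates expressing the LF's meaning.
# The entries below are illustrative examples taken from the text.
LEXICAL_FUNCTIONS = {
    "Oper1": {"lecture": {"deliver", "give", "hold"}},    # 'perform'
    "Magn": {"rain": {"heavy", "torrential", "violent"},  # 'intense'
             "applause": {"thunderous"}},
}

def apply_lf(lf_name, base):
    """Return the set of valid collocates for `base` under LF `lf_name`."""
    return LEXICAL_FUNCTIONS.get(lf_name, {}).get(base, set())
```

In this view, a collocation resource is simply a (partial) extensional definition of each LF, which is what the dataset described below provides.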
LF             semantic gloss                                       example
Oper1          'perform'; 1st argument → subject                    Oper1(support) = lend
IncepOper1     'begin to perform'; 1st argument → subject           IncepOper1(impression) = gain
Oper2          'undergo'; 2nd argument → subject                    Oper2(support) = find
Real1          'realize'; 1st argument → subject                    Real1(accusation) = prove
Real2          'apply'; 2nd argument → subject                      Real2(support) = enjoy
AntiReal2      'fail to apply'; 2nd argument → subject              AntiReal2(war) = lose
CausFunc0      'cause the existence'                                CausFunc0(hope) = raise
Caus1Func0     'cause the existence'; 1st argument → subject        Caus1Func0(hope) = gain
LiquFunc0      'cause termination of the existence'                 LiquFunc0(hope) = destroy
IncepPredPlus  'increase'
Magn           'intense'
AntiMagn       antonym of Magn ('intense')
Ver            'genuine'
AntiVer        antonym of Ver ('genuine')
Bon            'positive'
AntiBon        antonym of Bon ('positive')                          AntiBon(performance) = poor

Table 2: LFs used in this paper. The 'semantic gloss' column provides both a definition and the actantial structure, which is required in cases where one LF may express the same semantics but with a different syntactic structure (e.g., Real1 vs. Real2).

While we find that the above research directions (i.e., embeddings-based and contextualized representations for modeling MWEs) are complementary, in this work, we
specifically focus on the existing (and learnable) knowledge LMs have concerning lexical collocations, and whether they can be used to recognize and categorize LFs in free text.
For our experiments, we use, as initial lexical collocation source, a collocations dataset, LEXFUNC, which we have extended to cover a wider range of LFs (listed in Table 2). The original LEXFUNC dataset and this extended version are both the result of an initial collection of collocations categorized into LFs made available by Igor Mel'čuk. Each collocation has been manually lemmatized, and bases and collocates have been manually annotated with part-of-speech tags and their syntactic dependency relation.
With the lexical collocations of the extended LEXFUNC dataset at hand, we first compile from the English Gigaword a collocations corpus, which contains the occurrences of these lexical collocations. In principle, the identification of a given collocation in corpora is a straightforward procedure, since we know its elements (base and collocate) and the syntactic dependency relation between them. However, automatic dependency parsing is far from perfect, which complicates the task. Therefore, and in order not to lose any relevant collocation occurrence in the GigaWord corpus, we apply a cascaded procedure for their identification on the lemmatized, PoS-tagged and head-modifier-relation-tagged corpus. In the first stage, we identify sentences in which one of the relevant syntactic dependency relations holds between the collocation elements in question. In the second (more relaxed) stage, we match adjacent lemmatized collocation elements and their PoS tags. In the third stage, finally, we match lemmatized collocation elements and their PoS tags within a distance of up to 5 tokens. While this procedure inevitably introduces some noise (we might retrieve sentences where base and collocate co-occur, but not as a collocation), we performed a manual inspection on a random sample and calculated the precision of our collocation retrieval strategy, which resulted in >0.95. This confirms the quality of our retrieval strategy and, hence, of our resource.
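A minimal sketch of such a cascaded matcher, under the assumption that each sentence is a list of (lemma, pos, head_index, deprel) tuples; the representation and function name are ours, and the PoS checks of the second and third stages are elided for brevity:

```python
# Sketch of the cascaded collocation-identification procedure described
# above. Assumes pre-lemmatized, dependency-parsed sentences given as
# lists of (lemma, pos, head_index, deprel) tuples; head_index is None
# for the root or when no head is available.

def find_collocation(sentence, base, collocate, max_dist=5):
    """Return the matching stage (1, 2 or 3), or None if no match."""
    lemmas = [tok[0] for tok in sentence]
    # Stage 1: base and collocate linked by a syntactic dependency
    # (direction-agnostic: either element may be the head).
    for lemma, pos, head, rel in sentence:
        if head is not None and {lemma, lemmas[head]} == {base, collocate}:
            return 1
    # Stage 2 (more relaxed): adjacent lemmatized elements.
    for i in range(len(lemmas) - 1):
        if {lemmas[i], lemmas[i + 1]} == {base, collocate}:
            return 2
    # Stage 3: co-occurrence within a window of up to `max_dist` tokens.
    if base in lemmas and collocate in lemmas:
        if abs(lemmas.index(base) - lemmas.index(collocate)) <= max_dist:
            return 3
    return None
```

Running the stages in this order guarantees that each sentence is credited to the most reliable matching criterion it satisfies.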
In terms of corpus statistics, Table 3 indicates the number of sentences for each LF distributed across training (70% of the sentences), development (15%) and test (15%) sets. The split was done maintaining this proportion across all LFs. Note that these splits are constructed such that there are no overlapping collocations, in an effort to avoid the well-known phenomenon of lexical memorization (Levy et al., 2015), which may artificially inflate the results on the test set. The number of different collocations per split, globally and for each LF, also maintains the same proportions (70/15/15 ±1%), such that, e.g., AntiReal2 has 55 different collocations in the 942 sentences of the training set and 11 different collocations in the development and test sets, distributed across 205 and 200 sentences, respectively. In the overall corpus, there is an average of 18 samples per collocation (work hard being the most frequent one, with 102 samples). Hope, attack, criticism, fire and threat are bases that each co-occur with more than 30 different collocates, across most LFs. These bases are also among the ones with the most samples in the corpus. On the other hand, half of the bases are combined with one single collocate only. Overall, the statistical properties of our dataset arguably make it a faithful replica of the distribution of collocations in, at least, newswire corpora. At the same time, it is a challenging dataset, as the results we report in this paper suggest.
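The non-overlapping split can be obtained by partitioning at the level of collocations rather than individual sentences; a sketch of this strategy, with hypothetical field names (the exact procedure used to build the dataset may differ):

```python
import random

def lexical_split(instances, seed=0, ratios=(0.7, 0.15, 0.15)):
    """Split (sentence, base, collocate, lf) instances into train/dev/test
    such that no (base, collocate, lf) collocation appears in two splits,
    avoiding lexical memorization (Levy et al., 2015). Sketch only."""
    by_colloc = {}
    for inst in instances:
        key = (inst["base"], inst["collocate"], inst["lf"])
        by_colloc.setdefault(key, []).append(inst)
    keys = sorted(by_colloc)
    random.Random(seed).shuffle(keys)
    n = len(keys)
    cut1, cut2 = int(ratios[0] * n), int((ratios[0] + ratios[1]) * n)
    parts = (keys[:cut1], keys[cut1:cut2], keys[cut2:])
    # Flatten each partition of collocation keys back into sentence lists.
    return [sum((by_colloc[k] for k in part), []) for part in parts]
```

Because whole collocations (not sentences) are shuffled, a collocation with many occurrences lands in exactly one split, which is what keeps the test set free of memorized pairs.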

Experiment 1: Collocate retrieval

Setup
In the first experiment, we aim to analyze how well an MLM retrieves valid collocates for a given base when provided with the original (sentence-level) context. We use BERT (bert-base) (Devlin et al., 2018), as it is the de-facto base model upon which most specialized and distilled/quantized language models are built. Its behaviour should thus be a good proxy for the general distributional behaviour of lexical collocations. This experiment serves, first, as an opportunity to understand how much of the semantics underlying LFs can be encoded via an MLM pretraining objective, and second, as a testbed for exploring conditioning strategies often used in tasks involving data augmentation, lexical substitution and lexical simplification (cf. Section 2.1). Since this is an "in-context collocate retrieval" task, we consider it a ranking problem. Intuitively, if BERT is able to retrieve a base's valid collocates (e.g., {heavy, torrential, violent, . . . } for rain as base for Magn) in the position of a masked token, this could mean that: (1) the sentence provides enough context for the model to "know" the lexical restrictions involved in that collocation, and/or (2) the LF is sufficiently frozen, and therefore the base alone may restrict which collocates are acceptable. For the first point, and continuing with the heavy rain example, consider the following sentence.
Intuitively, we would expect the sentential context to be informative enough for the model to select heavy or any other collocate denoting the notion of intensity, restricted by the presence of the base rain. In fact, here, BERT predicts heavy with 79.5% probability. However, in example (2),

(2) Policeman earns applause for staying on duty in [MASK] rain.

there is much less evidence for the rain being 'intense', and in fact BERT here predicts 'the' with 85.1% probability. This disparity lets us investigate ways to prompt BERT to select heavy or any other valid collocate for example (2). Thus, in addition to simply passing one masked sentence, we explore an approach based on passing the masked sentence concatenated with the original sentence, which is a natural way to encode not only the context surrounding the word, but also the meaning of the target word itself. This strategy was successfully used for the task of unsupervised lexical simplification (Qiang et al., 2019). Recall that we consider as valid hits all the collocates for a given base and its corresponding LF. In practice, this means that for bases that have just one valid collocate, both metrics (MRR and MAP) yield the same score. We lemmatize BERT's predictions using SpaCy's lemmatizer (https://spacy.io/api/lemmatizer).
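Under this ranking view, and given that all valid collocates for a base-LF pair count as hits, the two metrics can be sketched as follows (function names are ours; an illustration, not the authors' evaluation code):

```python
def mrr(ranked, valid):
    """Reciprocal rank of the first valid collocate in the ranked list."""
    for rank, word in enumerate(ranked, start=1):
        if word in valid:
            return 1.0 / rank
    return 0.0

def average_precision(ranked, valid):
    """MAP component for one query: precision averaged over the ranks
    holding a valid collocate, normalized by the number of valid hits."""
    hits, precisions = 0, []
    for rank, word in enumerate(ranked, start=1):
        if word in valid:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(valid) if valid else 0.0
```

Note that when `valid` contains a single collocate, both functions reduce to the reciprocal of its rank, which is exactly the property remarked on above.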

Results and Discussion
Our results (Table 4) show, first, that conditioning BERT's MLM by passing the original sentence as additional context for the [MASK] token is useful for predicting an embedding whose semantics is more closely related to the original collocate. The improvements are particularly relevant for LVCs (Oper1 and, to a certain extent, Real1 and Real2), suggesting that these LFs, while perhaps easy to distinguish from others (cf. Section 5.1), do benefit from additional context to be well represented. Interestingly, the Magn LF shows only small gains in both MRR and MAP, clearly indicating that additional context helps little, and thus highlighting a strong semantic dependency between the sentence meaning and the collocation's base.
A potential limitation of this setup, however, is that we cannot possibly include all possible collocates for all the bases in our resource. An estimate of the quality of BERT's predictions can be obtained by measuring the semantic similarity (for instance, via cosine similarity) between the original masked collocate and the predicted collocates. In the heavy rain example referred to above, the similarity between 'the' and 'heavy' is low, whereas, if the model predicts hard or even any other adjective, it should be considered less wrong. We obtain a broad picture of the quality of BERT's predictions by plotting a histogram (Figure 1) of the similarities between the GloVe embeddings (Pennington et al., 2014) of the original collocate and of BERT's prediction, under both settings (MASKED and CONDITIONED), for the same three LFs as in Table 1, namely Magn, Oper1 and Real1. The conditioning strategy is helpful; it contributes not only to retrieving the original collocate (which would be trivial if we did not mask it), but also candidates with clearly similar meanings. We see, for instance, more cases for Oper1 and Real1 where the correct verb is predicted, whereas for Magn we see a more sustained improvement across all similarities, but not necessarily for retrieving the original collocate.
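The similarity estimate boils down to plain cosine similarity between the two collocates' GloVe vectors; a minimal, self-contained sketch:

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors (e.g., the GloVe
    vectors of the original and the predicted collocate)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0
```

Applied to every (original collocate, prediction) pair, these scores are what the histogram in Figure 1 aggregates.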

Experiment 2: Collocation categorization
In the second experiment, we test the performance of a number of well-known LMs on the task of LF categorization, using the train/test splits we sampled and annotated from GigaWord (Section 3). This experiment serves two purposes. First, we expect to learn about the predictability of LFs in context, which is a long-standing problem in computational lexicography and the cornerstone of the automatic construction of collocation resources. Second, previous work has shown that some LFs are quite easy to distinguish, both without and with sentential context (Shwartz and Dagan, 2019). However, it is still unclear whether, by focusing exclusively on the phenomenon of collocations, and excluding, e.g., idiomatic expressions or non-compositional phrasal verbs (which are not only semantically but, more importantly, syntactically different from collocations), an LM can indeed be used to construct a resource for second language learners, or whether (and to what extent) an LM can be trained to select appropriate collocates. Our setting is essentially a sentence-pair classification problem, where the second sentence is the lexical collocation itself. Specifically, a training instance is a tuple <sentence, collocation, label>, as in Example (3). We use as labels the LFs listed in Table 2, with their respective training/test splits, and train all LMs with the same hyperparameters. The considered LMs are BERT (base and large, uncased) (Devlin et al., 2018), RoBERTa (base and large) (Liu et al., 2019), DistilBERT (Sanh et al., 2019), ALBERT (Lan et al., 2019) and XLNet (base and large) (Yang et al., 2019). We use the implementation in the Transformers Python library.
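A sentence-pair instance can be assembled by treating the collocation itself as the second segment; a hedged sketch using BERT-style special tokens (in practice, each model's tokenizer inserts its own markers automatically, so this string construction is illustrative only):

```python
def make_pair_instance(sentence, collocation, lf_label,
                       cls="[CLS]", sep="[SEP]"):
    """Format a <sentence, collocation, label> tuple as a BERT-style
    sentence pair, where the second 'sentence' is the collocation.
    The special tokens assume a BERT-family tokenizer."""
    text = f"{cls} {sentence} {sep} {collocation} {sep}"
    return {"text": text, "label": lf_label}
```

The classifier is then fine-tuned on these pairs, with the LF name as the target label.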

Results and discussion
The results of this experiment (cf. Table 5) clearly highlight what was already pointed out by Shwartz and Dagan (2019): the prototypical LVCs (as modeled by Oper1) can be identified with rather high quality. Interestingly enough, this is not true for the LVCs captured by Oper2, whose only difference from Oper1 is the subcategorization frame: while in Oper1, it is the 1st argument of the base that is realized as the grammatical subject, in Oper2, it is the 2nd argument. Frequency cannot explain this discrepancy since, e.g., IncepOper1, which appears in our corpus in nearly the same number of sentences as Oper2, is categorized with significantly higher quality.
Results are also lower for some other verbal LFs with more semantic load, among them, e.g., Real1/2 and Caus1Func0, suggesting that the semantics expressed by the notions of 'realize' and 'cause', especially when the 2nd argument of the collocation functions as subject, are more challenging. Again, these results cannot be fully explained by the amount of training data, nor by the semantic load. Thus, the categorization of IncepPredPlus achieves the highest score (the best model on IncepPredPlus obtains an average F1 of 95.21, with little variability across runs), and it is clearly an LF with a semantic load, namely 'increase'. Interestingly, Ver ('genuine') and Bon ('positive') are the worst categorized LFs in our sample, while their antonyms AntiVer and AntiBon are categorized considerably better.
As for the considered LMs, the best overall performance is achieved by the RoBERTa family, with an overall F1 score of 71.19% for RoBERTa-base and 70.6% for RoBERTa-large, the two models together accounting for the best results on 7 of the 16 target LFs. The second best results are achieved by XLNet (base and large), with XLNet-large being the best model on both Magn and AntiMagn, two LFs which have traditionally been challenging to tell apart, due to the fact that the representations of antonyms are clustered together in distributional spaces. We also note that, interestingly, DistilBERT is the best at categorizing Oper1 and Real2, which may suggest that small models can be sufficient to obtain good performance on categorizing LVCs.
In order to gain further insights into why an LM may err in the task of in-context collocation categorization, we display a confusion matrix obtained from random runs for the two LMs with the highest average score for Oper1 (DistilBERT) and Oper2 (XLNet-base) (Figure 2), i.e., the two LVC LFs that differ only in terms of their subcategorization patterns (cf. above). We may hypothesize that the categorization of a collocation based mainly on its actantial structure is challenging, and indeed, we observe that for these two models, syntax-based categorization over the same semantics proves hard. Specifically, XLNet-base's greatest source of confusion for Oper1 is, precisely, Oper2; and the same occurs with Real1 vs. Real2 (which also differ only with respect to their subcategorization pattern). The results for DistilBERT show a greater spread among the misclassifications of Oper1, namely across Oper2, Caus1Func0 and IncepOper1, and for Oper2 across Oper1 and CausFunc0. Caus1Func0 and IncepOper1 have the same subcategorization pattern as Oper1 (but different semantics). In the case of CausFunc0 ('cause the existence', e.g., CausFunc0(hope) = raise), the subcategorization pattern is very similar to that of Caus1Func0, only that the grammatical subject of the corresponding syntactic construction is not an argument of the base. As we can observe, CausFunc0 is easily miscategorized as a full LVC. Finally, let us highlight the fact that while Magn is generally well categorized, the few misclassifications come, as would be expected, from collocations which convey a similar notion of amplification (e.g., Bon), but, interestingly, also from collocations that convey the opposite semantics, such as AntiMagn or AntiBon.
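A confusion matrix such as the one in Figure 2 reduces to counting (gold, predicted) LF pairs over the test set; a minimal sketch:

```python
from collections import Counter

def confusion_matrix(gold, predicted):
    """Count (gold LF, predicted LF) pairs, e.g., to quantify how often
    Oper1 instances are miscategorized as Oper2."""
    return Counter(zip(gold, predicted))
```

Inspecting the off-diagonal entries of this counter is what reveals the Oper1/Oper2 (and Real1/Real2) confusions discussed above.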

Subspace Analysis
In this section, we further explore the semantics of some selected LFs. We generate visualizations of PCA-projected BERT vectors for all collocation mentions of Magn, AntiMagn, Oper1 and Oper2. These four LFs are sufficiently frequent, and they encode different morphosyntactic structures. We can see that antonymy (Ono et al., 2015; Schwartz et al., 2015; Nguyen et al., 2016) is relatively well captured in contextualized models, although the subspaces are clearly different between the embedding layer and the last transformer layer. More specifically, as the representations of collocates for Magn and AntiMagn undergo the self-attention-based transformations through BERT's layers, many of these contextualized embeddings tend to group in a narrow cone, with many antonymic collocates indistinguishably overlapping with each other. Similarly, we also observe a tendency towards representation overlap in the Oper1 vs. Oper2 case, with the embeddings in the last transformer layer showing a cluttered distribution, suggesting that there is little inherent knowledge in BERT to categorize a collocation into the syntactic typification of an LF.

Figure 3: Oper1 (red) and Oper2 (blue) collocate embeddings for BERT's embedding layer (top row, left), and for the 1st (top row, right), and 5th and 12th transformer layers (second row, left and right, respectively). The bottom quadrant corresponds to Magn (blue) vs AntiMagn (red), with the same arrangement (embedding, 1st, 5th and 12th layer).

Table 5: Average Precision, Recall and F1 results for the collocation classification experiment, computed by averaging the results of three independent runs. We also report standard deviation figures. Results are provided per LF as well as the average over each metric (Average).
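The PCA projections behind these visualizations can be sketched via SVD on mean-centered per-mention collocate embeddings; a generic implementation (not the authors' exact pipeline), applied separately to each BERT layer:

```python
import numpy as np

def pca_project(embeddings, n_components=2):
    """Project a (n_samples, dim) matrix of contextualized collocate
    embeddings (one row per collocation mention, extracted from a given
    BERT layer) onto its top principal components via SVD."""
    X = np.asarray(embeddings, dtype=float)
    X = X - X.mean(axis=0)                      # center the cloud
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:n_components].T              # coordinates in PC space
```

Plotting the resulting 2-D coordinates per layer, colored by LF (e.g., Magn vs. AntiMagn), reproduces the kind of layer-by-layer comparison shown in Figure 3.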

Conclusions
We have analyzed LMs in tasks revolving around modeling, recognizing and categorizing lexical collocations. We conclude that some prominent types of LVCs require little context to be well encoded, as opposed to other LFs involving, e.g., nouns and adjectives; that the predictability of LFs is challenging and not simply a function of the amount of training data; and that syntax plays a major role.

Future Work
In the future, we will make this work multilingual, using linguistic equivalences as anchors, in the spirit of cross-lingual embedding research, in order to align collocations of the same LF across languages (e.g., in English and Norwegian we take a nap, in German, we 'make' it, in Portuguese we 'pull' it, in Spanish, we 'throw' it, etc.). We would also like to explore the idea of "semantic masking" for collocate discovery, where we would train models to dynamically mask (or remove) idiosyncratic information such that only the semantics of the collocate remain, thus largely corresponding to a latent abstraction over the LF. This approach has been applied recently to the lexical substitution task, with the limitation, however, that the dropout rate was tuned on a validation set, whereas a promising avenue to explore would be to automatically learn the embedding dropout in a fully supervised setting. Finally, motivated by the observed large gap in performance between the categorization of, e.g., Oper1 and Oper2, Bon and AntiBon, Ver and AntiVer, we plan to investigate in more depth the codification of collocational information in pretrained LMs.