Lexical Inference over Multi-Word Predicates: A Distributional Approach

Representing predicates in terms of their argument distribution is common practice in NLP. Multi-word predicates (MWPs) in this context are often either disregarded or considered as fixed expressions. The latter treatment is unsatisfactory in two ways: (1) identifying MWPs is notoriously difficult, and (2) MWPs show varying degrees of compositionality and could benefit from taking into account the identity of their component parts. We propose a novel approach that integrates the distributional representation of multiple sub-sets of the MWP’s words. We assume a latent distribution over sub-sets of the MWP, and estimate it relative to a downstream prediction task. Focusing on the supervised identification of lexical inference relations, we compare against state-of-the-art baselines that consider a single sub-set of an MWP, obtaining substantial improvements. To our knowledge, this is the first work to address lexical relations between MWPs of varying degrees of compositionality within distributional semantics.


Introduction
Multi-word expressions (MWEs) constitute a large part of the lexicon and account for much of its growth (Jackendoff, 2002; Seaton and Macaulay, 2002). However, despite their importance, MWEs remain difficult to define and model, and consequently pose serious difficulties for NLP applications (Sag et al., 2001). Multi-word Predicates (MWPs; sometimes termed Complex Predicates) form an important and much addressed subclass of MWEs and are the focus of this paper.
MWPs are informally defined as multiple words that constitute a single predicate (Alsina et al., 1997). MWPs encompass a wide range of phenomena, including causatives, light verbs, phrasal verbs, serial verb constructions and many others, and pose considerable challenges to both linguistic theory and NLP applications (see Section 2). Part of the difficulty in treating them stems from their position on the borderline between syntax and the lexicon. It is therefore often unclear whether they should be treated as fixed expressions, as compositional phrases that reflect the properties of their component parts or as both.
This work addresses the modelling of MWPs within the context of distributional semantics (Turney and Pantel, 2010), in which predicates are represented through the distribution of arguments they may take. In order to collect meaningful statistics, the predicate's lexical unit should be sufficiently frequent and semantically unambiguous.
MWPs pose a challenge to such models, as naïvely collecting statistics over all instances of highly ambiguous verbs is likely to result in noisy representations. For instance, the verb "take" may appear in MWPs as varied as "take time", "take effect" and "take to the hills". This heterogeneity of "take" is likely to have a negative effect on downstream systems that use its distributional representation. For instance, while "take" and "accept" are often considered lexically similar, the high frequency in which "take" participates in non-compositional MWPs is likely to push the two verbs' distributional representations apart.
A straightforward approach to this problem is to represent the predicate as a conjunction of multiple words, thereby trading ambiguity for sparsity. For instance, the verb "take" could be conjoined with its object (e.g., "take care", "take a bus"). This approach, however, raises the challenge of identifying the sub-set of the predicate's words that should be taken to represent it (henceforth, its lexical components or LCs).
We propose a novel approach that addresses this challenge in the context of identifying lexical inference relations between predicates (Lin and Pantel, 2001; Schoenmackers et al., 2010; Melamud et al., 2013a, inter alia). A (lexical) inference relation $p_L \rightarrow p_R$ is said to hold if the relation denoted by $p_R$ generally holds between a set of arguments whenever the relation $p_L$ does. For instance, an inference relation holds between "annex" and "control" since if a country annexes another, it generally controls it. Most approaches to this task use distributional similarity, either as their main component (Szpektor and Dagan, 2008; Melamud et al., 2013b), or as part of a more comprehensive system (Berant et al., 2011; Lewis and Steedman, 2013). The choice of LC matters for such systems. For example, consider the verb "take". While the inference relation "have → take" does not generally hold, it does hold in the case of some light verbs, such as "have a look → take a look", underscoring the importance of taking more inclusive LCs into account. On the other hand, the predicate "likely to give a green light" is unlikely to appear often even within a very large corpus, and could benefit from taking its lexical sub-units (e.g., "likely" or "give a green light") into account.
We present a novel approach to the task that models the selection and relative weighting of the predicate's LCs using latent variables. This approach allows the classifier that uses the distributional representations to take into account the most relevant LCs in order to make the prediction. By doing so, we avoid the notoriously difficult problem of defining and identifying MWPs and account for predicates of various sizes and degrees of compositionality. To our knowledge, this is the first work to address lexical relations between MWPs of varying degrees of compositionality within distributional semantics.
We conduct experiments on the dataset of Zeichner et al. (2012) and compare our methods with analogous ones that select a fixed LC, using state-of-the-art feature sets. Our method obtains substantial performance gains across all scenarios.
Finally, we note that our approach is cognitively appealing. Significant cognitive findings support the claim that a speaker's lexicon consists of partially overlapping lexical units of various sizes, of which several can be evoked in the interpretation of an utterance (Jackendoff, 2002; Wray, 2008).

Background and Related Work
Inference Relations. The detection of inference relations between predicates has become a central task over the past few years (Sekine, 2005; Zanzotto et al., 2006; Schoenmackers et al., 2010; Berant et al., 2011; Melamud et al., 2013a, inter alia). Inference rules are used in a wide variety of applications including Question Answering (Ravichandran and Hovy, 2002), Information Extraction (Shinyama and Sekine, 2006), and as a main component in Textual Entailment systems (Dinu and Wang, 2009; Dagan et al., 2013). Most approaches to the task used distributional similarity as a major component within their system. Lin and Pantel (2001) introduced DIRT, an unsupervised distributional system for detecting inference relations. The system is still considered a state-of-the-art baseline (Melamud et al., 2013a), and is often used as a component within larger systems. Schoenmackers et al. (2010) presented an unsupervised system for learning inference rules directly from open-domain web data. Melamud et al. (2013a) used topic models to combine type-level predicate inference rules with token-level information from their arguments in a specific context. Melamud et al. (2013b) used lexical expansion to improve the representation of infrequent predicates. Lewis and Steedman (2013) combined distributional and symbolic representations, evaluating on a Question Answering task, as well as on a quantification-focused entailment dataset. Several studies tackled the task using supervised systems. Weisman et al. (2012) used a set of linguistically motivated features, but evaluated their system on a corpus that consists almost entirely of single-word predicates. Mirkin et al. (2006) presented a system for learning inference rules between nouns, using distributional similarity and pattern-based features. Hagiwara et al. (2009) identified synonyms using a supervised approach relying on distributional and syntactic features. Berant et al. (2011) used distributional similarity between predicates to weight the edges of an entailment graph. By imposing global constraints on the structure of the graph, they obtained a more accurate set of inference rules.
Previous work used simple methods to select the predicate's LC. Some filtered out frequent, highly ambiguous verbs (Lewis and Steedman, 2013), others selected a single representative word (Melamud et al., 2013a), while yet others used multi-word LCs but treated them as fixed expressions (Lin and Pantel, 2001; Berant et al., 2011).
The goals of the above studies are largely complementary to ours. While previous work focused either on improving the quality of the distributional representations themselves or on their incorporation into more elaborate systems, we focus on the integration of the distributional representation of multiple LCs to improve the identification of inference relations between MWPs.

MWP Extraction and Identification. MWPs have received considerable attention over the years in both theoretical and applicative contexts. Their position on the crossroads of syntax and the lexicon, their varying degrees of compositionality, as well as the wealth of linguistic phenomena they exhibit, made them the object of ongoing linguistic discussion (Alsina et al., 1997; Butt, 2010).
In NLP, the discovery and identification of MWEs in general and MWPs in particular has been the focus of much work over the years (Lin, 1999; Baldwin et al., 2003; Biemann and Giesbrecht, 2011). Despite wide interest, the field has yet to converge on a general and widely agreed-upon method for identifying MWPs. See Ramisch et al. (2013) for an overview.
Most work on MWEs emphasized idiosyncratic or non-compositional expressions. Other lines of work focused on specific MWP classes such as light verbs (Tu and Roth, 2011; Vincze et al., 2013) and phrasal verbs (McCarthy et al., 2003; Pichotta and DeNero, 2013). Our work proposes a uniform treatment of MWPs of varying degrees of compositionality, and avoids defining MWPs explicitly by modelling their LCs as latent variables.

Compositional Distributional Semantics. Much work in recent years has concentrated on the relation between the distributional representations of composite phrases and the representations of their component sub-parts (Widdows, 2008; Mitchell and Lapata, 2010; Baroni and Zamparelli, 2010; Coecke et al., 2010). Several works have used compositional distributional semantics (CDS) representations to assess the compositionality of MWEs, such as noun compounds (Reddy et al., 2011) or verb-noun combinations (Kiela and Clark, 2013). Despite significant advances, previous work has mostly been concerned with highly compositional cases and does not address the distributional representation of predicates of varying degrees of compositionality.

Our Proposal: A Latent LC Approach
This section details our approach for distributionally representing MWPs by leveraging their component LCs. Section 3.1 describes our general approach, Section 3.2 presents our model and Section 3.3 details the feature set.

General Approach and Notation
We propose a method for addressing MWPs of varying degrees of compositionality through the integration of the distributional representation of multiple sub-sets of the predicate's words (LCs). We use it to tackle a supervised prediction task that represents predicates distributionally. Our model assumes a latent distribution over the LCs, and estimates its parameters so as to best conform to the goals of the target prediction task.
Formally, given a predicate $p$, we denote the set of words comprising it as $W(p)$. The set of allowable LCs for $p$ is denoted with $H_p \subset 2^{W(p)}$. $H_p$ contains all sub-sets of $p$ that we consider as a priori possible to represent $p$. For instance, if $p$ is "likely to give a green light", $H_p$ may include LCs such as "likely" or "give light". As our method is aimed at discovering the most relevant LCs, we do not attempt to analyze the MWPs in advance, but rather take an inclusive $H_p$, allowing the model to estimate the relative weights of the LCs.
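To make the notion of an inclusive $H_p$ concrete, the following is a minimal sketch of how candidate LCs could be enumerated for a single predicate. It assumes the bound of at most two words per LC that is used in the experiments; the function name `candidate_lcs` and the hand-picked list of content words are illustrative assumptions, not part of the original system.

```python
from itertools import combinations


def candidate_lcs(content_words, max_size=2):
    """Enumerate candidate LCs: every sub-set of the content words with
    between 1 and max_size members, kept in left-to-right order."""
    lcs = []
    for size in range(1, max_size + 1):
        # combinations() yields tuples that preserve the input word order.
        lcs.extend(combinations(content_words, size))
    return lcs


# Hand-picked content words for "likely to give a green light" (an assumption;
# a real pipeline would select them with a POS tagger).
words = ["likely", "give", "green", "light"]
h_p = candidate_lcs(words)
print(len(h_p))  # 4 single words + 6 word pairs
```

The resulting set contains both narrow LCs such as `("likely",)` and wider ones such as `("give", "light")`, leaving it to the model to weight them.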
The task we use as a testbed for our approach is the lexical inference identification task between predicates. Given a pair of predicates $p = (p_L, p_R)$, the task is to predict whether an inference relation holds between them. For instance, if $p_L$ is "devour" and $p_R$ is "eat greedily", the classifier should use the similarity between "devour" and "eat" in order to correctly predict an inference relation in this case. Selecting the wider LC "eat greedily" might result in sparser statistics. In other examples, however, taking a wider LC is potentially beneficial. For instance, the dissimilarity between "take" and "make" should not prevent the classifier from identifying the inference relation between "take a step" and "make a step".
Our statistical model aims at predicting the correct label by making use of partially overlapping LCs of various sizes, both for the premise left-hand side (LHS) predicate $p_L$ and the hypothesis right-hand side (RHS) predicate $p_R$. More formally, we take the space of values for our latent LC variables to be $H_p = H_{p_L} \times H_{p_R}$. Our evaluation dataset consists of pairs $p^{(i)} = (p_L^{(i)}, p_R^{(i)})$, each with a label $y^{(i)} \in \{1, -1\}$; we write $H^{(i)}$ for $H_{p^{(i)}}$. We also assume the existence of a feature function $\Phi(p, y, h)$ which maps a triplet of a predicate pair $p$, an inference label $y$, and a latent state $h \in H_p$ to $\mathbb{R}^d$ for some integer $d$. We denote the training set by $D$.

The Model
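We address the task with a latent variable log-linear model, representing the LCs of the predicates. We choose this model for its generality, conceptual simplicity, and because it allows us to easily incorporate various feature sets and sets of latent variables. Under this model, $P(y \mid p; w) = \sum_{h \in H_p} P(y, h \mid p; w)$, with $P(y, h \mid p; w) \propto \exp(w \cdot \Phi(p, y, h))$. We introduce $L_2$ regularization, with coefficient $\lambda$, to avoid over-fitting. We use maximum likelihood estimation, and arrive at the following objective function:

$$L(w) = \sum_{(p^{(i)}, y^{(i)}) \in D} \log \sum_{h \in H^{(i)}} P(y^{(i)}, h \mid p^{(i)}; w) - \frac{\lambda}{2} \|w\|_2^2$$

We maximize $L$ using the BFGS algorithm (Nocedal and Wright, 1999). The gradient (with respect to $w$) is the following:

$$\nabla L(w) = \sum_{(p^{(i)}, y^{(i)}) \in D} \left( E_{h \mid y^{(i)}, p^{(i)}}\left[\Phi(p^{(i)}, y^{(i)}, h)\right] - E_{y, h \mid p^{(i)}}\left[\Phi(p^{(i)}, y, h)\right] \right) - \lambda w$$

$H_p$ can be defined to be any sub-set of $2^{W(p)}$, given that taking an expectation over $H_p$ can be done efficiently. It is therefore possible to use prior linguistic knowledge to consider only sub-sets of $p$ that are likely to be non-compositional (e.g., verb-preposition or verb-noun pairs).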
In our experiments we attempt to keep the approach maximally general, and define $H_p$ to be the set of all subsets of size 1 or 2 of content words in $W(p)$. (We use a POS tagger to identify content words; prepositions are considered content words under this definition.) We bound the size of $h \in H_p$ in order to retain computational efficiency and a sufficient frequency of the LCs in $H_p$. MWPs of length greater than 2 are effectively approximated by their set of subsets of sizes 1 and 2.
Each $h$ can therefore be written as a 4-tuple $(h_L^A, h_L^B, h_R^A, h_R^B)$, where $h_L^A$ ($h_R^A$) denotes the first word of the LC of the LHS (RHS) predicate and $h_L^B$ ($h_R^B$) denotes the (possibly empty) second word of the predicate. Inference is carried out by maximizing $P(y \mid p^{(i)})$ over $y$. As $|H_p| = O(k^4)$, where $k$ is the number of content words in $p$, and as the number of content words is usually small, inference can be carried out by directly summing over $H^{(i)}$.

Initialization. The introduction of latent variables into the log-linear model leads to a non-convex objective function. Consequently, BFGS is not guaranteed to converge to the global optimum, but rather to a stationary point. The result may therefore depend on the parameter initialization. Indeed, preliminary experiments showed that both initializing $w$ to be zero and using a random initializer result in lower performance.
Instead, we initialize our model with a simplified convex model that fixes the LCs to be the pair of left-most content words comprising each of the predicates. This is a common method for selecting the predicate's LC (e.g., Melamud et al., 2013a). Once h has been fixed, the model collapses to a convex log-linear model. The optimal w is then taken as an initialization point for the latent variable model. While this method may still not converge to the global maximum, our experiments show that this initialization technique yields high quality values for w (see Section 6).
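The inference step of a latent-variable log-linear model can be sketched as follows. This is a minimal illustration rather than the system's implementation: it assumes the feature convention $\Phi(p, y, h) = y \cdot \Phi(p, h)$ given in the feature section, takes one already-computed feature vector per latent state, and sums directly over the latent states. The function name `label_posterior` and the toy numbers are assumptions made for the example.

```python
import math


def label_posterior(w, feats_by_state):
    """P(y | p) for y in {1, -1} under a latent-variable log-linear model.

    feats_by_state holds one feature vector Phi(p, h) per latent state h,
    and the joint features are taken to be Phi(p, y, h) = y * Phi(p, h).
    """
    scores = {}
    for y in (1, -1):
        # Unnormalised mass of label y: sum over h of exp(w . Phi(p, y, h)).
        scores[y] = sum(
            math.exp(y * sum(wk * fk for wk, fk in zip(w, phi)))
            for phi in feats_by_state
        )
    z = scores[1] + scores[-1]  # partition function over labels and states
    return {y: s / z for y, s in scores.items()}


# Toy pair with two latent states and a single similarity feature.
posterior = label_posterior(w=[2.0], feats_by_state=[[0.9], [0.1]])
prediction = max(posterior, key=posterior.get)
print(posterior, prediction)
```

Fixing a single latent state per pair, as in the initialization described above, corresponds to passing a one-element `feats_by_state`, in which case the model reduces to ordinary binary logistic regression. A production implementation would compute the sums in log space for numerical stability.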

Feature Set
This section lists the features used for our experiments. We intentionally select a feature set that relies on either completely unsupervised or shallow processing tools that are available for a wide variety of languages and domains.
Given a predicate pair $p^{(i)}$, a label $y \in \{1, -1\}$ and a latent state $h \in H^{(i)}$, we define their feature vector as $\Phi(p^{(i)}, y, h) = y \cdot \Phi(p^{(i)}, h)$. The computation of $\Phi(p^{(i)}, h)$ requires a reference corpus $R$ that contains triplets of the type $(p, x, y)$ where $p$ is a binary predicate and $x$ and $y$ are its arguments. We use the Reverb corpus as $R$ in our experiments (Fader et al., 2011; see Section 4). We refrain from encoding features that directly reflect the vocabulary of the training set. Such features are not applicable beyond that set's vocabulary, and as available datasets contain no more than a few thousand examples, these features are unlikely to generalize well.

Table 1 presents the set of features we use in our experiments. The features can be divided into two main categories: similarity features between the LHS and the RHS predicates, and features that describe the selected LCs themselves.

Distributional Similarity Features. The distributional similarity features are based on the DIRT system (Lin and Pantel, 2001). The score defines for each predicate $p$ and for each argument slot $s \in \{L, R\}$ (corresponding to the arguments to the left and right of that predicate) a vector $v_p^s$ which represents the distribution of arguments appearing in that slot. We take $v_p^s(x)$ to be the number of times that the argument $x$ appeared in the slot $s$ of the predicate $p$. Given these vectors, the similarity between the predicates $p_1$ and $p_2$ is defined as:

$$\mathrm{score}(p_1, p_2) = \sqrt{\mathrm{sim}(v_{p_1}^L, v_{p_2}^L) \cdot \mathrm{sim}(v_{p_1}^R, v_{p_2}^R)}$$

where sim is some vector similarity measure. We use two common similarity measures: the vector cosine metric, and the BInc (Szpektor and Dagan, 2008) similarity measure. These measures give complementary perspectives on the similarity between the predicates, as the cosine similarity is symmetric between the LHS and RHS predicates, while BInc takes into account the directionality of the inference relation. Preliminary experiments with other measures, such as those of Lin (1998) and Weeds and Weir (2003), did not yield additional improvements.
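As an illustration of the DIRT-style score, the sketch below builds the two slot vectors from raw (predicate, left argument, right argument) triples and combines the per-slot cosine similarities. It assumes the score is the geometric mean of the two slot similarities, as in DIRT; the function names and the toy triples are hypothetical, and BInc is not implemented here.

```python
import math
from collections import Counter


def slot_vectors(triples, predicate):
    """Count the arguments seen in the left and right slot of a predicate."""
    left, right = Counter(), Counter()
    for p, x, y in triples:
        if p == predicate:
            left[x] += 1
            right[y] += 1
    return left, right


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * v[x] for x, count in u.items() if x in v)
    norm = math.sqrt(sum(c * c for c in u.values())) * \
        math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0


def dirt_score(triples, p1, p2, sim=cosine):
    """Geometric mean of the left-slot and right-slot similarities."""
    l1, r1 = slot_vectors(triples, p1)
    l2, r2 = slot_vectors(triples, p2)
    return math.sqrt(sim(l1, l2) * sim(r1, r2))


# Toy reference corpus (invented for illustration only).
triples = [
    ("annex", "germany", "austria"),
    ("annex", "russia", "crimea"),
    ("control", "germany", "austria"),
    ("control", "russia", "crimea"),
    ("control", "pilot", "plane"),
    ("eat", "cat", "fish"),
]
print(dirt_score(triples, "annex", "control"))
print(dirt_score(triples, "annex", "eat"))
```

Because the two slot similarities are multiplied, a pair of predicates scores highly only when both argument slots are distributionally similar; a directional measure can be swapped in through the `sim` parameter.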
We encode the similarity of all measures for the pair $h_L$ and $h_R$ as well as the pair $h_L^A$ and $h_R^A$. The latter feature is an approximation to the similarity between the heads of the predicates, as heads in English tend to be to the left of the predicates. These two features coincide for $h$ values of size 1.

Word and Pair Features. These features encode the basic properties of the LC. The motivation behind them is to allow a more accurate leveraging of the similarity features, as well as to better determine the relative weights of $h \in H^{(i)}$.
The feature set is composed of four analogous sets corresponding to $h_L^A$, $h_L^B$, $h_R^A$ and $h_R^B$, as well as two sets of features that capture relations between $h_L^A$, $h_L^B$ and $h_R^A$, $h_R^B$ (in cases where $h$ is of size 2). The features include the ordinal index of the word within the predicate, the lemma's frequency according to $R$, and a feature that indicates whether that word's lemma appears in both predicates of the pair. For instance, when considering the predicates "likely to come" and "likely to leave", "likely" appears in both predicates, while "come" and "leave" appear only in one of them.
In addition, we use POS-based features that encode the most frequent POS tag for the word lemma and the second most frequent POS tag (according to R). Information about the second most frequent POS tag can be important in identifying light verb constructions, such as "take a swim" or "give a smile", where the object is derived from a verb. It can thus be interpreted as a generalization of the feature that indicates whether the object is a deverbal noun, which is used by some light verb identification algorithms (Tu and Roth, 2011).
In cases where $h_L$ is of size 2, we additionally encode features that apply to the conjunction of $h_L^A$ and $h_L^B$. We encode the conjunction of their POS and the number of times the two lemmas occurred together in $R$. We also introduce features that capture the statistical correlation between the words of $h_L$. To do so, we use point-wise mutual information, and the conditional probabilities $P(h_L^A \mid h_L^B)$ and $P(h_L^B \mid h_L^A)$. Similar measures have often been used for the unsupervised detection of MWEs (Villavicencio et al., 2007; Fazly and Stevenson, 2006). We also include the analogous set of features for $h_R$.

LDA-based Features. We further incorporate features based on a Latent Dirichlet Allocation (LDA) topic model (Blei et al., 2003). Several recent works have underscored the usefulness of using topic models to model a predicate's selectional preferences (Ritter et al., 2010; Dinu and Lapata, 2010; Ó Séaghdha, 2010; Lewis and Steedman, 2013; Melamud et al., 2013a). We adopt the approach of Lewis and Steedman (2013), and define a pseudo-document for each LC in the evaluation corpus. We populate the pseudo-documents of an LC with its arguments according to $R$. We then train an LDA model with 25 topics over these documents. This yields a probability distribution $P(\mathrm{topic} \mid h)$ for each LC $h$, reflecting the types of arguments $h$ may take.
We further include a feature for the entropy of the topic distribution of the predicate, which reflects its heterogeneity. This feature is motivated by the assumption that a heterogeneous predicate is more likely to benefit from selecting a more inclusive LC than a homogeneous one.

Technical Issues. All features used, except the similarity ones and the topic distribution features, are binary. Frequency features are binned into 4 bins of equal frequency. We conjoin some of the feature sets by multiplying their values. Specifically, we add the cross product of the features of the category "Similarity" (see Table 1) with the rest of the features. In addition, we conjoin all LHS (RHS) features with an indicator feature that indicates whether $h_L$ ($h_R$) is of size two. This results in 1605 non-constant features.
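Two of the steps above are simple enough to sketch directly: the entropy of an LC's topic distribution, and the binning of frequency features into four equal-frequency bins. The sketch below is an illustration under assumptions: the topic distribution is taken as an already-estimated list of probabilities, entropy is measured in nats, and ties in the binning are broken by sort order. Both function names are hypothetical.

```python
import math


def topic_entropy(topic_probs):
    """Shannon entropy (in nats) of a distribution P(topic | h)."""
    return -sum(p * math.log(p) for p in topic_probs if p > 0)


def equal_frequency_bins(values, n_bins=4):
    """Assign each value a bin index so that bins hold roughly equal counts."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        # The rank of a value, not its magnitude, decides its bin.
        bins[i] = rank * n_bins // len(values)
    return bins


uniform = [1.0 / 25] * 25          # maximally heterogeneous LC over 25 topics
peaked = [1.0] + [0.0] * 24        # LC whose arguments fall in a single topic
print(topic_entropy(uniform), topic_entropy(peaked))

lemma_counts = [5, 1, 100, 20, 3, 50, 8, 2]
print(equal_frequency_bins(lemma_counts))
```

A heterogeneous LC yields an entropy close to the maximum of log 25, while a homogeneous one yields an entropy close to zero, which is the contrast the entropy feature is meant to expose.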
We further note that some LCs that appear in the evaluation corpus do not appear at all in R. In our experiments they amounted to 0.2% of the LCs in our evaluation dataset. While previous work often discarded predicates below a certain frequency from the evaluation, we include them in order to facilitate comparison to future work. We assign the similarity features of such examples a 0 value, and assign their other numerical features the mean value of those features.
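The association features for two-word LCs described earlier in this section (point-wise mutual information and the two conditional probabilities) can be computed from corpus counts along the following lines. This is a sketch under assumptions: the counts are taken as given, probabilities are simple relative frequencies, and unseen words or pairs back off to zero, which is a choice made for the example rather than one stated in the text. The function name `pair_association` is hypothetical.

```python
import math


def pair_association(count_a, count_b, count_ab, total):
    """PMI and conditional probabilities for the two words of an LC.

    count_a, count_b: occurrences of each lemma in the reference corpus
    count_ab:         co-occurrences of the two lemmas
    total:            number of observations the counts are drawn from
    """
    if min(count_a, count_b, count_ab) <= 0:
        # Assumed back-off for unseen words or pairs.
        return {"pmi": 0.0, "p_a_given_b": 0.0, "p_b_given_a": 0.0}
    # PMI = log [ P(a, b) / (P(a) P(b)) ], written in terms of raw counts.
    pmi = math.log((count_ab * total) / (count_a * count_b))
    return {
        "pmi": pmi,
        "p_a_given_b": count_ab / count_b,
        "p_b_given_a": count_ab / count_a,
    }


# Invented counts for a verb-noun pair.
features = pair_association(count_a=100, count_b=50, count_ab=25, total=1000)
print(features)
```

The two conditional probabilities are asymmetric by design: a light verb such as "take" co-occurs with many objects, so the probability of the verb given the noun can be high while the probability of the noun given the verb stays low.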

Experimental Setup
Corpora and Preprocessing. As a reference corpus $R$, we use Reverb (Fader et al., 2011), a web-based corpus consisting of 15M web extractions of binary relations. Each relation is a triplet of a predicate and two arguments, one preceding it and one following it. Relations were extracted using regular expressions over the output of a POS tagger and an NP chunker. Each predicate may consist of a single verb, a verb and a preposition, or a sequence of words starting with a verb and ending in a preposition, between which there may be nouns, adjectives, adverbs, pronouns, determiners and verbs. The verb may also be a copula. Examples of predicates are "make the most of", "could be exchanged for" and "is happy with".
Reverb is an appealing reference corpus for this task for several reasons. First, it uses fairly shallow preprocessing technology which is available for many domains and languages. Second, Reverb applies considerable noise filtering, which results in extractions of fair quality. Third, our evaluation dataset is based on Reverb extractions.
We evaluate our algorithm on the dataset of Zeichner et al. (2012). This publicly available corpus provides pairs of Reverb binary relations and an indication of whether an inference relation holds between them within the context of a specific pair of argument fillers. The corpus was compiled using distributional methods to detect pairs of relations in Reverb that are likely to have an inference relation between them. Annotators, employed through Amazon Mechanical Turk, were then asked to determine whether each pair is meaningful, and if so, to determine whether an inference relation holds. Further measures were taken to monitor the accuracy of the annotation.
For example, the pair of predicates "make the most of" and "take advantage of" appears in the corpus as a pair between which an inference relation holds. The arguments in this case are "students" and "their university experience". An example of a pair between which an inference relation does not hold is "tend to neglect" and "underestimate the importance of", where the arguments are "Robert" and "his family".
The dataset contains 6,565 instances in total. We use 5,411 of these pairs, discarding instances that were deemed meaningless by the annotators. We also discard cases where the set of arguments is reversed between the LHS and RHS predicates. In these examples, $p_R(x, y)$ is inferable from $p_L(y, x)$, rather than from $p_L(x, y)$. As there are fewer than 150 reversed instances in the corpus, experimenting on this sub-set is unlikely to be informative.
The average length of a predicate in the corpus is 2.7 words (including function words). In 87.3% of the predicate pairs, there was more than one LC (i.e., $|H_p| > 1$), underscoring the importance of correctly leveraging the different LCs. We randomly partition the corpus into a training set which contains 4,343 instances (∼80%), and a test set that contains 1,068 instances, maintaining the same positive to negative label ratio in both datasets. Development was carried out using cross-validation on the training data (see below).
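A label-preserving random partition of this kind can be sketched as follows. The procedure shown (shuffling each label group separately and cutting it at 80%) is one common way to maintain the label ratio and is an assumption; the text does not specify how the split was implemented. The function name, the seed and the toy data are hypothetical.

```python
import random


def stratified_split(examples, train_fraction=0.8, seed=0):
    """Split (item, label) examples so both parts keep the label ratio."""
    rng = random.Random(seed)
    by_label = {}
    for example in examples:
        by_label.setdefault(example[1], []).append(example)
    train, test = [], []
    for items in by_label.values():
        rng.shuffle(items)  # random order within each label group
        cut = round(len(items) * train_fraction)
        train.extend(items[:cut])
        test.extend(items[cut:])
    return train, test


# Toy dataset: 30 positive and 70 negative predicate pairs.
examples = [("pair%d" % i, 1) for i in range(30)] + \
           [("pair%d" % i, -1) for i in range(30, 100)]
train, test = stratified_split(examples)
print(len(train), len(test))
```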
We use a Maximum Entropy POS Tagger, trained on the Penn Treebank, and the WordNet lemmatizer, both implemented within the NLTK package (Loper and Bird, 2002). To obtain a coarse-grained set of POS tags, we collapse the tag set to 7 categories: nouns, verbs, adjectives, adverbs, prepositions, the word "to" and a category that includes all other words. A Reverb argument is represented as the conjunction of its content words that appear more than 10 times in the corpus. Function words are defined according to their POS tags and include determiners, possessive pronouns, existential "there", numbers and coordinating conjunctions. Auxiliary verbs and copulas are also considered function words.
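The collapsing of the tag set to seven coarse categories could look like the sketch below. The exact mapping from Penn Treebank tags is not given in the text, so the prefix rules here are an assumption chosen to match the seven categories listed; for example, treating only the tag `IN` as a preposition is a guess. The function name `coarse_pos` is hypothetical.

```python
def coarse_pos(tag):
    """Map a Penn Treebank POS tag to one of seven coarse categories.

    Assumed mapping: the original mapping is not specified in the text.
    """
    if tag == "TO":
        return "to"
    if tag.startswith("NN"):
        return "noun"   # NN, NNS, NNP, NNPS
    if tag.startswith("VB"):
        return "verb"   # VB, VBD, VBG, VBN, VBP, VBZ
    if tag.startswith("JJ"):
        return "adj"    # JJ, JJR, JJS
    if tag.startswith("RB"):
        return "adv"    # RB, RBR, RBS
    if tag == "IN":
        return "prep"
    return "other"      # determiners, pronouns, modals, and the rest


tagged = [("take", "VB"), ("advantage", "NN"), ("of", "IN")]
print([(word, coarse_pos(tag)) for word, tag in tagged])
```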
To compute the LDA features, we use the online variational Bayes algorithm of Hoffman et al. (2010) as implemented in the Gensim software package (Rehurek and Sojka, 2010).
Evaluated Algorithms. The only two previous works on this dataset (Melamud et al., 2013a; Melamud et al., 2013b) are not directly comparable, as they used unsupervised systems and evaluated on sub-sets of the evaluation dataset. Instead, we use several baselines to demonstrate the usefulness of integrating multiple LCs, as well as the relative usefulness of our feature sets.
The simplest baseline is ALLNEG, which predicts the most frequent label in the dataset (in our case: "no inference"). The other evaluated systems are formed by taking various subsets of our feature set. We experiment with 4 feature sets. The smallest set, SIM, includes only the similarity features. This feature set is related to the compositional distributional model of Mitchell and Lapata (2010) (see Section 6). We note that despite recent advances in identifying predicate inference relations, the DIRT system (Lin and Pantel, 2001) remains a strong baseline, and is often used as a component in state-of-the-art systems (Berant et al., 2011), and specifically in the two aforementioned works that used the same evaluation corpus.
The next feature set BASIC includes the features found to be most useful during the development of the model: the most frequent POS tag, the frequency features and the feature Common. More inclusive is the feature set NO-LDA, which includes all features except the LDA features. Experiments with this set were performed in order to isolate the effect of the LDA features. Finally, ALL includes our complete set of features.
The more direct comparison is against partial implementations of our system where the LC $h$ is deterministically selected. Determining $h$ for each predicate yields a regular log-linear binary classification model. We use two variants of this baseline. The first, LEFTMOST, selects the left-most content word for each predicate. A similar selection strategy was used by Melamud et al. (2013a). The second, VPREP, selects $h$ to be the verb along with its following preposition. In cases where the predicate contains multiple verbs, the one preceding the preposition is selected, and where the predicate does not contain any non-copula verbs, it falls back to LEFTMOST. This LC selection method approximates a baseline that includes subcategorized prepositions. Such cases are highly frequent and account for a large portion of the MWPs in English. Including a verb's preposition in its LC was commonly done in previous work (e.g., Lewis and Steedman, 2013).
We also attempted to identify verb-preposition constructions using a dependency parser. Unfortunately, our evaluation dataset is only available in a lemmatized version, which posed a difficulty for the parser. Due to the low quality of the resulting parses, we implemented VPREP using POS-based regular expressions as defined above.
The full model is denoted with LATENTLC. For each system and feature set, we report results using 10-fold cross-validation on the training set, as well as results on the test set. Both cases use the same set of parameters determined by cross-validation on the training set. As the task at hand is a binary classification problem, we use accuracy scores to rate the performance of our systems. Table 2 presents the results of our experiments. Rows correspond to the evaluated algorithms, while columns correspond to the feature sets used and the evaluation scenarios (i.e., training set cross-validation or test set evaluation). Our experiments make the first use of this dataset in its fullest form for the problem of supervised learning of inference relations, and may serve as a starting point for further exploration of this dataset.

Results
For all feature sets and settings, LATENTLC scored highest, often with a considerable margin of up to 3.0% in the cross-validation and up to 4.6% on the test set relative to the LEFTMOST baseline, and 5.1% (cross-validation) and 6.8% (test) margins relative to VPREP.
The best scoring result of our LATENTLC model in the cross-validation scenario is 65.72%, obtained by the feature set All. The best scoring result by any of the baseline models in this scenario is 62.7%, obtained by the same feature set. For the test set scenario, LATENTLC obtained its highest accuracy, 65.73%, when using the feature set Basic. This is a substantial improvement over the highest scoring baseline model in this scenario that obtained 61.6% accuracy, using the feature set All. This performance gap is substantial when taking into consideration that the improvements obtained by the highly competitive DIRT similarity features using the stronger LEFTMOST baseline, result in an improvement of 3.1% and 5.3% over the trivial ALLNEG baseline in the test set and cross-validation scenarios respectively.
Comparing the different feature sets on our proposed model, we find that the Basic feature set gives a consistent and substantial increase over the Sim feature set. Improvements are of 2.8% (test) and 2.2% (cross-validation). Introducing more elaborate features (i.e., the feature sets NoLDA and All) yields some improvements in the cross-validation, but these improvements are not replicated on the test set. This may be due to idiosyncrasies in the test set that are averaged out in the cross-validation scenario.
For a qualitative analysis, we took the best performing model of the data set (i.e., with the Basic feature set), and extracted the set of instances where it made a correct prediction while both baselines made an error. This set contains many verb-preposition pairs, such as "list as → report as" or "submit via → deliver by", underscoring the utility of leveraging multiple LCs rather than considering only a head word (as with LEFTMOST) or the entire phrase (as with VPREP). Other examples in this set contain more complex patterns. These include the positive pairs "talk much about → have much to say about" and "increase with → go up with", and the negative "make prediction about → meet the challenge of" and "enjoy watching → love to play".

Discussion
Relation to CDS. Much recent work subsumed under the title Compositional Distributional Semantics addressed the distributional representation of multi-word phrases (see Section 2). This line of work focuses on compositional predicates, such as "kick the ball" and not on idiosyncratic predicates such as "kick the bucket".
A variant of the CDS approach can be framed within ours. Assume we wish to compute the similarity of the predicates $p_L = (w_1, \ldots, w_n)$ and $p_R = (w'_1, \ldots, w'_m)$. Let us denote the vector space representations of the individual words as $v_1, \ldots, v_n$ and $v'_1, \ldots, v'_m$ respectively. A standard approach in CDS is to compose distributional representations by taking their vector sum, $v_L = v_1 + v_2 + \cdots + v_n$ and $v_R = v'_1 + \cdots + v'_m$ (Mitchell and Lapata, 2010). One of the most effective similarity measures is the cosine similarity, which is a normalized dot product. The distributional similarity between $p_L$ and $p_R$ under this model is

$$\mathrm{sim}(p_L, p_R) = \frac{v_L \cdot v_R}{\|v_L\| \, \|v_R\|} = \frac{\sum_{i=1}^{n} \sum_{j=1}^{m} v_i \cdot v'_j}{\|v_L\| \, \|v_R\|}.$$

This similarity score is similar in spirit to a simplified version of our statistical model that restricts the set of allowable LCs $H_p$ to be $\{(\{w_i\}, \{w'_j\}) \mid i \le n, j \le m\}$, i.e., only LCs of size 1. Indeed, taking $H_p$ as above, and cosine similarity as the only feature (i.e., $w \in \mathbb{R}$), yields the distribution [displayed equation not recoverable from the source text].

Table 2: Results for the various evaluated systems. Accuracy results are presented in percents, followed in the cross-validation scenario by the standard deviation over the folds. The rows correspond to the various systems as defined in Section 4. LATENTLC is our proposed model. The columns correspond to the various feature sets, from the least to the most inclusive. SIM includes only similarity features. BASIC additionally includes POS-based and frequency features. NOLDA includes all features except LDA-based features. ALL is the full feature set. ALLNEG is the classifier that invariably predicts the label "no inference". Bold marks best overall accuracy per column, and * marks figures that are not significantly worse (McNemar's test, p < 0.05). The same positive to negative label ratio was maintained in both the cross-validation and test set scenarios. In all cases, LATENTLC obtains substantial improvements over the baseline systems.
This derivation highlights the relation of a simplified version of our approach to the additive CDS model, as both approaches effectively average over the similarities of all pairs of words in $p_L$ and $p_R$. The derivation also highlights a few advantages of our approach. First, our approach makes it straightforward to introduce additional features and to weight them in the way most consistent with the task at hand. Second, it allows much more flexibility in defining the set of allowable LCs, $H_p$. Specifically, $H_p$ may contain LCs of sizes greater than 1. Third, our approach uses standard probabilistic modelling, and therefore has a natural statistical interpretation.
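As an illustration of the additive model discussed above, the following minimal sketch checks numerically that the cosine between summed word vectors equals a normalized sum of dot products over all word pairs. The toy vectors and all function names are hypothetical and are not taken from our implementation.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def vec_sum(vectors):
    """Component-wise sum of a list of equal-length vectors."""
    return [sum(components) for components in zip(*vectors)]


def additive_similarity(left_vectors, right_vectors):
    """Cosine between the summed word vectors of the two predicates."""
    v_l, v_r = vec_sum(left_vectors), vec_sum(right_vectors)
    return dot(v_l, v_r) / (math.sqrt(dot(v_l, v_l)) * math.sqrt(dot(v_r, v_r)))


def pairwise_similarity(left_vectors, right_vectors):
    """The same quantity, written as a sum over all word pairs (i, j)."""
    v_l, v_r = vec_sum(left_vectors), vec_sum(right_vectors)
    norm = math.sqrt(dot(v_l, v_l)) * math.sqrt(dot(v_r, v_r))
    pair_total = sum(dot(u, v) for u in left_vectors for v in right_vectors)
    return pair_total / norm


# Hypothetical toy word vectors (assumed non-zero) for a 2-word and a 3-word predicate.
left = [[1.0, 0.0, 2.0], [0.5, 1.0, 0.0]]
right = [[0.0, 1.0, 1.0], [2.0, 0.0, 1.0], [1.0, 1.0, 0.0]]

sim_additive = additive_similarity(left, right)
sim_pairwise = pairwise_similarity(left, right)
```

The two values agree up to floating-point error, which is the sense in which the additive model aggregates over all pairs of size-1 LCs.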
In order to appreciate the effect of these advantages, we perform an experiment that takes $H_p$ to be the set of all LCs of size 1, and uses a single similarity measure. We run a 10-fold cross-validation on our training data, obtaining 61.3% accuracy using COSINE and 62.2% accuracy using BInc. The performance gap between these results and the accuracy obtained by our full model (65.7%) underscores the latter's effectiveness in integrating multiple features and LCs.

Effectiveness of Optimization Method. Our maximization of the log-likelihood function is not guaranteed to converge to a global optimum. Therefore, the quality of the learned parameters may be sensitive to the initialization point. We describe an experiment that tests the sensitivity of our approach to such variance.
Selecting the highest scoring feature set on our test set (i.e., BASIC), we ran the model with multiple initializers, by randomly perturbing our standard convex initializer (see Section 3). Concretely, given a convex initializer $w$, we select the starting point to be $w + \eta$, where $\eta_i \sim \mathcal{N}(0, \alpha |w_i|)$. We ran this experiment 400 times with $\alpha = 0.8$.
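The perturbation scheme can be sketched as follows. This is only an illustration: the initializer values are hypothetical, and the sketch assumes that the second argument of the Gaussian denotes the variance (if it is meant as the standard deviation, the square root should be dropped).

```python
import math
import random


def perturb(weights, alpha, rng):
    """Return w + eta, where eta_i is zero-mean Gaussian noise.

    Assumption: alpha * |w_i| is the variance of eta_i, so the standard
    deviation passed to the sampler is its square root.
    """
    perturbed = []
    for w_i in weights:
        sigma = math.sqrt(alpha * abs(w_i))
        perturbed.append(w_i + rng.gauss(0.0, sigma))
    return perturbed


rng = random.Random(0)  # fixed seed so the sketch is reproducible
w_init = [0.5, -1.2, 0.0, 2.0]  # hypothetical initializer, not our actual weights
starting_points = [perturb(w_init, alpha=0.8, rng=rng) for _ in range(400)]
```

Note that under this scheme a weight that is exactly zero in the initializer receives no noise, since its noise variance is zero.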
To combine the resulting weight vectors into a single classifier, we apply two standard approaches: a Product of Experts (Hinton, 2002) and a voting approach that selects the most frequently predicted label. Neither of these experiments yielded any significant performance gain. This suggests that our optimization method is robust to the initialization point.
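The two combination schemes can be illustrated with a minimal sketch for a binary label. It assumes that each trained model outputs a probability strictly between 0 and 1 for the positive label; the function names and the example probabilities are hypothetical, and the sketch is not our actual implementation.

```python
import math
from collections import Counter


def product_of_experts(positive_probs):
    """Multiply the per-model label distributions and renormalize.

    Works in log space for numerical stability; every probability is
    assumed to lie strictly between 0 and 1.
    """
    log_pos = sum(math.log(p) for p in positive_probs)
    log_neg = sum(math.log(1.0 - p) for p in positive_probs)
    shift = max(log_pos, log_neg)
    pos = math.exp(log_pos - shift)
    neg = math.exp(log_neg - shift)
    return pos / (pos + neg)


def majority_vote(positive_probs, threshold=0.5):
    """Return the label (1 or 0) predicted by the most models."""
    labels = [1 if p > threshold else 0 for p in positive_probs]
    return Counter(labels).most_common(1)[0][0]


# Hypothetical positive-label probabilities from three differently initialized models.
probs = [0.9, 0.4, 0.45]
poe_positive = product_of_experts(probs)
voted_label = majority_vote(probs)
```

In this toy case the two schemes disagree: one confident model dominates the product, while the vote follows the two models that lean negative. With an even number of models, a tie-breaking rule for the vote would also need to be chosen.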

Conclusion
We have presented a novel approach to the distributional representation of multi-word predicates. Since MWPs demonstrate varying levels of compositionality, a uniform treatment of MWPs either as fixed expressions or through head words is lacking. Instead, our approach integrates multiple lexical units contained in the predicate. The approach takes into account both multi-word LCs, which address low-compositionality cases, and single-word LCs, which address compositional cases and are more frequent. It assumes a latent distribution over the LCs of the predicates, and estimates it relative to a target application task.
We addressed the supervised inference identification task, obtaining substantial improvement over state-of-the-art baseline systems. In future work we intend to assess the benefit of this approach in MWP classes that are well-known from the literature. We believe that a permissive approach that integrates multiple analyses would perform better than standard single-analysis methods in a wide range of applications.