Combining Word Patterns and Discourse Markers for Paradigmatic Relation Classification

Distinguishing between paradigmatic relations such as synonymy, antonymy and hypernymy is an important prerequisite in a range of NLP applications. In this paper, we explore discourse relations as an alternative set of features to lexico-syntactic patterns. We demonstrate that statistics over discourse relations, collected via explicit discourse markers as proxies, can be utilized as salient indicators for paradigmatic relations in multiple languages, outperforming patterns in terms of recall and F1-score. In addition, we observe that markers and patterns provide complementary information, leading to significant classification improvements when applied in combination.


Introduction
Paradigmatic relations (such as synonymy, antonymy and hypernymy; cf. Murphy, 2003) are notoriously difficult to distinguish automatically, as first-order co-occurrences of the related words tend to be very similar across the relations. For example, in The boy/girl/person loves/hates the cat, the nominal co-hyponyms boy, girl and their hypernym person as well as the verbal antonyms love and hate occur in identical contexts, respectively. Vector space models, which represent words by frequencies of co-occurring words to enable comparisons in terms of distributional similarity (Schütze, 1992; Turney and Pantel, 2010), hence perform below their potential when inferring the type of relation that holds between two words. This distinction is crucial, however, in a range of tasks: in sentiment analysis, for example, words of the same and opposing polarity need to be distinguished; in textual entailment, systems further need to identify hypernymy because of directional inference requirements. Accordingly, while there is a rich tradition of identifying word pairs of a single paradigmatic relation, there is little work that has addressed the distinction between two or more paradigmatic relations (cf. Section 2 for details).

In more general terms, previous approaches to distinguishing between several semantic relations have predominantly relied on manually created knowledge sources, or on lexico-syntactic patterns that can be automatically extracted from text. Each option comes with its own shortcomings: knowledge bases, on the one hand, are typically developed for a single language or domain, meaning that they might not generalize well; word patterns, on the other hand, are noisy and can be sparse for infrequent word pairs.
In this paper, we propose to strike a balance between availability and restrictedness by making use of discourse markers. This approach has several advantages: markers are frequently found across genres (Webber, 2009), they exist in many languages (Jucker and Ziv, 1998), and they capture various semantic properties (Hutchinson, 2004). We implement discourse markers within a vector space model that aims to distinguish between the three paradigmatic relations synonymy, antonymy and hypernymy in German and in English, across the three word classes of nouns, verbs and adjectives. We examine the performance of discourse markers as vector space dimensions in isolation and also explore their contribution in combination with lexical patterns.

Related Work
As mentioned above, there is a rich tradition of research on identifying a single paradigmatic relation. Work on synonyms includes Edmonds and Hirst (2002), who employed a co-occurrence network and second-order co-occurrence, and Curran (2003), who explored word-based and syntax-based co-occurrence for thesaurus construction.
Van der Plas and Tiedemann (2006) compared a standard distributional approach against cross-lingual alignment; Erk and Padó (2008) defined a vector space model to identify synonyms and the substitutability of verbs. Most computational work on hypernyms was performed for nouns, cf. the lexico-syntactic patterns by Hearst (1992) and an extension of the patterns by dependency paths (Snow et al., 2004). Weeds et al. (2004), Lenci and Benotto (2012) and Santus et al. (2014) identified hypernyms in distributional spaces. Computational work on antonyms includes approaches that tested the co-occurrence hypothesis (Charles and Miller, 1989; Fellbaum, 1995), and approaches driven by text understanding efforts and contradiction frameworks (Harabagiu et al., 2006; Mohammad et al., 2008; de Marneffe et al., 2008).
Among the few approaches that distinguished between paradigmatic semantic relations, Lin et al. (2003) used patterns and bilingual dictionaries to retrieve distributionally similar words, and relied on clear antonym patterns such as 'either X or Y' in a post-processing step to distinguish synonyms from antonyms. The study by Mohammad et al. (2013) on the identification and ranking of opposites also included synonym/antonym distinction. Yih et al. (2012) developed an LSA approach incorporating a thesaurus to distinguish the same two relations. Chang et al. (2013) extended this approach to induce vector representations that can capture multiple relations. Whereas the above-mentioned approaches rely on additional knowledge sources, Turney (2006) developed a corpus-based approach to model relational similarity, addressing (among other tasks) the distinction between synonyms and antonyms. More recently, Schulte im Walde and Köper (2013) proposed to distinguish between the three relations antonymy, synonymy and hyponymy based on automatically acquired word patterns.
Regarding pattern-based approaches to identify and distinguish lexical semantic relations in more general terms, Hearst (1992) was the first to propose lexico-syntactic patterns as empirical pointers towards relation instances, focusing on hyponymy. Girju et al. (2003) applied a single pattern to distinguish pairs of nouns that are in a causal relationship from those that are not, and Girju et al. (2006) extended the work towards part-whole relations, applying a supervised, knowledge-intensive approach. Chklovski and Pantel (2004) were the first to apply pattern-based relation extraction to verbs, distinguishing five non-disjoint relations (similarity, strength, antonymy, enablement, happens-before). Pantel and Pennacchiotti (2006) developed Espresso, a weakly-supervised system that exploits patterns in large-scale web data to distinguish between five noun-noun relations (hypernymy, meronymy, succession, reaction, production). Similarly to Girju et al. (2006), they used generic patterns, but relied on a bootstrapping cycle combined with reliability measures, rather than manual resources. Whereas each of the aforementioned approaches considers only one word class and clearly disjoint categories, we distinguish between paradigmatic relations that can be distributionally very similar and propose a unified framework for nouns, verbs and adjectives.

Baseline Model and Data Set
The task addressed in this work is to distinguish between synonymy, antonymy and hypernymy. As a starting point, we build on the approach and data set used by Schulte im Walde and Köper (2013, henceforth just S&K). In their work, frequency statistics over automatically acquired co-occurrence patterns were found to be good indicators for the paradigmatic relation that holds between two given words of the same word class. They further experimented with refinements of the vector space model, for example, by only considering patterns of a specific length, weighting by pointwise mutual information and applying thresholds based on frequency and reliability.
Baseline Model. We re-implemented the best model from S&K with the same setup: word pairs are represented by vectors, with each entry corresponding to one out of almost 100,000 patterns of lemmatized word forms (e.g., X affect how you Y). Each value is calculated as the log frequency of the corresponding pattern occurring between the word pairs in a corpus, based on exact match. For English, we use the ukWaC corpus (Baroni et al., 2009); for German, we rely on the COW corpus instead of deWaC, as it is larger and better balanced (Schäfer and Bildhauer, 2012).
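To make the representation concrete, the following is a minimal sketch of the pattern-based vector construction. The pattern list and function names are hypothetical, and treating unseen patterns as zero cells is our simplifying assumption, not necessarily S&K's exact implementation.

```python
import math
from collections import Counter

# Hypothetical feature list; the actual model uses almost 100,000 patterns.
PATTERNS = ["X affect how you Y", "X or Y", "X but not Y"]

def pattern_vector(observed, patterns=PATTERNS):
    """Build a log-frequency vector for one word pair.

    observed: list of pattern strings found between the two words in the
    corpus (exact matches). Unseen patterns are set to 0 (our assumption)."""
    counts = Counter(observed)
    return [math.log(counts[p]) if counts[p] > 0 else 0.0 for p in patterns]

vec = pattern_vector(["X affect how you Y", "X affect how you Y", "X or Y"])
# vec[0] == log(2); vec[1] == log(1) == 0.0; vec[2] == 0.0 (unseen)
```

Note that under plain log frequency, a pattern seen once is indistinguishable from an unseen one; a log(1 + f) variant would avoid this, but we stay with the log frequency described above.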
Data Set. The evaluation data set by S&K is a collection of target and response words in German that has been collected via Amazon Mechanical Turk. The data contains a balanced amount of instances across word categories and relations, also taking into account corpus frequency, degree of ambiguity and semantic classes. In total, the data set consists of 692 pairs of instances, distributed over three word classes (nouns, verbs, adjectives) and three paradigmatic relations (synonymy, antonymy, hypernymy).

Table 1: Results of Schulte im Walde and Köper (2013) and our reimplementation. All numbers in percent.
Intermediate Evaluation. We compare our reimplementation to the model by S&K using their 80% training and 20% test split, focusing on 2-way classifications involving synonymy. The results, summarized in Table 1, confirm that our reimplementation achieves similar results. Observed differences are probably an effect of the distinct corpora applied to induce patterns and counts.
We notice that the performance of both models strongly depends on the affected pair of relations and word category. For example, precision varies in the 2-way classification between synonymy and antonymy from 70.6% to 94.1%. Given the small amount of test data, some of the 80/20 splits might be better suited for the model than others. To avoid resulting bias effects, we perform our final evaluation using 5-fold cross-validation on a merged set of all training and test instances. To illustrate the performance of models in multiple languages, we further conduct experiments on a data set for English relation pairs that has been collected by Giulia Benotto and Alessandro Lenci, following the same methodology as the German collection. The English data set consists of 648 pairs of instances, also distributed over nouns, verbs, adjectives, and covering synonymy, antonymy, hypernymy.

Markers for Relation Classification
The aim of this work is to establish corpus statistics over discourse relations as a salient source of information to distinguish between paradigmatic relations. Our approach is motivated by linguistic studies that indicated a connection between discourse relations and lexical relations of words occurring in the respective discourse segments: Murphy et al. (2009) have shown, for example, that antonyms frequently serve as indicators for contrast relations in English and Swedish. More generally, pairs of word tokens have been identified as strong features for classifying discourse relations when no explicit discourse markers are available (Pitler et al., 2009; Biran and McKeown, 2013).

Table 2: Examples of discourse relations/markers.
CONTRAST: but, although, rather, ...
RESTATEMENT: indeed, specifically, ...
INSTANTIATION: (for) example, instance, ...
Whereas word pairs have frequently been used as features for disambiguating discourse relations, to the best of our knowledge, our approach is novel in that we are the first to apply discourse relations as features for classifying lexical relations. One reason for this might be that discourse relations in general are only available in manually annotated corpora. Previous work has shown, however, that such relations can be classified reliably given the presence of explicit discourse markers. We hence rely on such markers as proxies for discourse relations (for examples, cf. Table 2).

Model and Hypothesis
We propose a vector space model that represents pairs of words using as features the discourse markers that occur between them. The underlying hypothesis of this model is as follows: if two phrases frequently co-occur with a specific discourse marker, then the discourse relation expressed by the corresponding marker should also indicate the relation between the words in the affected phrases. Following this hypothesis, contrast relations might indicate antonymy, whereas elaborations may indicate synonymy or hyponymy. Although such relations will not hold between every pair of words in two connected discourse segments, we hypothesize that correct instances (of all considered word classes) can be identified based on high relative frequency.
In our model, frequency statistics are computed over sentence-internal co-occurrences of word pairs and discourse markers. Since discourse relations are typically directed, we take into consideration whether a word occurs to the left or to the right of the respective marker. Accordingly, the features of our model are special cases of single-word patterns with an arbitrary number of wild card tokens (e.g., the marker feature 'though' corresponds to the pattern "X * though * Y"). Yet, our specific choice of features has several advantages: whereas strict and potentially long patterns can be rare in text, discourse markers such as "however", "for example" and "additionally" are frequently found across genres (Webber, 2009). Although combinations of tokens could also be replaced by wild cards in any automatically acquired pattern, this would generally lead to an exponentially growing feature space. In contrast, the set of discourse markers in our work is fixed: for English, we use 61 markers annotated in the Penn Discourse TreeBank 2.0 (Prasad et al., 2008); for German, we use 155 one-word translations of the English markers, as obtained from an online dictionary. Taking directionality into account, our vector space model consists of 2×61 and 2×155 features, respectively.
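A minimal sketch of this directional feature extraction, assuming pre-tokenized sentences and single-token markers (the function name and the exact counting scheme are our own simplification; multi-word markers such as "for example" would need extra handling):

```python
def marker_features(sentences, pair, markers):
    """Count sentence-internal co-occurrences of a word pair with
    discourse markers, keeping two features per marker depending on
    whether the first word of the pair appears to the Left or to the
    Right of the marker (directionality)."""
    x, y = pair
    feats = {(m, side): 0 for m in markers for side in ("L", "R")}
    for tokens in sentences:
        if x not in tokens or y not in tokens:
            continue
        for m in markers:
            if m in tokens:
                side = "L" if tokens.index(x) < tokens.index(m) else "R"
                feats[(m, side)] += 1
    return feats

feats = marker_features(
    [["good", "but", "bad"], ["bad", ",", "but", "good"]],
    ("good", "bad"),
    ["but"],
)
# "good" precedes "but" in the first sentence and follows it in the second
```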

Development Set and Hyperparameters
We select the hyperparameters of our model using an independent development set, which we extract from the lexical resource GermaNet (Hamp and Feldweg, 1997). For each considered word category, we extract instances of synonymy, antonymy and hypernymy. In total, 1502 instances are identified, with 64 of them overlapping with the evaluation data set described in Section 3. Note though that the development set is not used for evaluation but only to select the following hyperparameters.
We experimented with different vector values (absolute frequency, log frequency, pointwise mutual information (PMI)), distance measures (cosine, euclidean) and normalization schemes. In contrast to S&K, who did not observe any improvements using PMI, we found it to perform best, combined with euclidean distance and no additional normalization. This finding might be an immediate effect of discourse markers being generally more frequent than strict word patterns, which also leads to more reliable PMI values.
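The selected PMI weighting and Euclidean distance can be sketched as follows; the dictionary-of-counts layout and the example cells are our own illustration, not the paper's implementation.

```python
import math

def pmi_weight(counts):
    """Reweight a {(pair, marker): count} co-occurrence matrix by
    pointwise mutual information:
        PMI(p, m) = log( P(p, m) / (P(p) * P(m)) )
    Unseen cells simply stay absent from the result."""
    total = sum(counts.values())
    pair_totals, marker_totals = {}, {}
    for (pair, marker), c in counts.items():
        pair_totals[pair] = pair_totals.get(pair, 0) + c
        marker_totals[marker] = marker_totals.get(marker, 0) + c
    return {
        (pair, marker): math.log(c * total / (pair_totals[pair] * marker_totals[marker]))
        for (pair, marker), c in counts.items()
    }

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Toy matrix: each pair co-occurs exclusively with one marker,
# so each observed cell receives PMI = log(2).
pmi = pmi_weight({("gut/schlecht", "aber"): 4, ("gut/prima", "also"): 4})
```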

Evaluation
In our evaluation, we assess the performance of the marker-based model and demonstrate the benefits of incorporating discourse markers into a pattern-based model, which we apply as a baseline. We evaluate on several data sets: the collection of target-response pairs in German from previous work, and a similar data set that was collected for English target words (cf. Section 3); for comparison reasons, we also apply our models to the balanced data set of related and unrelated noun pairs by Yap and Baldwin (2009). We perform 3-way and 2-way relation classification experiments, using 5-fold cross-validation and a nearest centroid classifier (as applied by S&K).
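A nearest centroid classifier of this kind can be sketched in a few lines (the toy two-dimensional vectors are purely illustrative; real vectors have one dimension per pattern or marker feature):

```python
import math

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def nearest_centroid(train, test_vec):
    """train: {relation: [vectors]}; assign test_vec to the class
    whose centroid is closest under Euclidean distance."""
    def dist(u, v):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
    centroids = {rel: centroid(vs) for rel, vs in train.items()}
    return min(centroids, key=lambda rel: dist(centroids[rel], test_vec))

# Illustrative training data: two classes in a 2-dimensional space.
train = {
    "antonymy": [[1.0, 0.0], [0.8, 0.2]],
    "synonymy": [[0.0, 1.0], [0.1, 0.9]],
}
```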
Results. The 3-way classification results of the baseline and our marker-based model are summarized in Table 3, with best results for each setting marked in bold. On the German data set, our model always outperforms a random baseline (33% F1-score). The results on the English data set are overall a bit lower, possibly due to corpus size. In almost all classification tasks, our marker-based model achieves a higher recall and F1-score than the pattern-based approach. The precision results of the marker-based model are overall below the pattern-based model. This drop in performance does not come as a surprise though, considering that the model only makes use of 122 and 310 features, in comparison to tens of thousands of features in the pattern approach.
A randomized significance test over classified instances (cf. Yeh, 2000) revealed that only two differences in results are significant. We hypothesize that one reason for this outcome might be that both models cover complementary sets of instances. To verify this hypothesis, we apply a combined model, which is based on a weighted linear combination of distances computed by the two individual models. As displayed in Table 3, the combined model improves over the individual models in recall and F1-score, leading to the best 3-way classification results. All gains in recall are significant, confirming that the single models indeed contribute complementary information. For example, only the pattern-based model classifies "intentional"-"accidental" as antonyms, and only the marker-based model predicts the correct relation for "double"-"multiple" (hypernymy). The combined model classifies both pairs correctly.

A final experiment is performed on the data set by Yap and Baldwin (2009) to see whether our models can also distinguish word pairs of individual relations from unrelated pairs of words. The results, listed in Table 5, show that the marker-based model cannot perform this task as well as the pattern-based model. The combined model, however, outperforms both individual models in 2 out of 3 cases. Despite their simplicity, our models achieve results close to the F1-scores reported by Yap and Baldwin (2009), who employed syntactic pre-processing and an SVM-based classifier, and experimented with different corpora.
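A weighted linear combination of the two models' distances can be sketched as follows. The parameter name `alpha` and the example distances are hypothetical; in practice the weight would be tuned on held-out data, and the two distance scales may need normalization before combining.

```python
def combined_predict(dist_pattern, dist_marker, alpha=0.5):
    """dist_*: {relation: distance to class centroid} from each
    individual model. Return the relation minimizing the weighted
    combination alpha * d_pattern + (1 - alpha) * d_marker."""
    return min(dist_pattern,
               key=lambda rel: alpha * dist_pattern[rel]
                               + (1 - alpha) * dist_marker[rel])

pred = combined_predict(
    {"synonymy": 0.2, "antonymy": 0.6, "hypernymy": 0.5},  # pattern model
    {"synonymy": 0.7, "antonymy": 0.1, "hypernymy": 0.4},  # marker model
    alpha=0.5,
)
# combined distances: synonymy 0.45, antonymy 0.35, hypernymy 0.45
# -> "antonymy" wins, although each single model preferred another class
```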

Conclusions
In this paper, we proposed to use discourse markers as indicators for paradigmatic relations between words and demonstrated that a small set of such markers can achieve higher recall than a pattern-based model with tens of thousands of features. Combining patterns and markers can further improve results, leading to significant gains in recall and F1-score. As our new model only relies on a raw corpus and a fixed list of discourse markers, it can easily be extended to other languages.