Proceedings of the 13th Workshop on Multiword Expressions (MWE 2017)

Proceedings of the 13th Workshop on Multiword Expressions (MWE 2017) Stella Markantonatou Carlos Ramisch Agata Savary Veronika Vincze April 2017

Valencia, Spain

Association for Computational Linguistics http://www.aclweb.org/anthology/W17-17 book MWE2017:2017 ParaDi: Dictionary of Paraphrases of Czech Complex Predicates with Light Verbs Petra Barancikova Václava Kettnerová Proceedings of the 13th Workshop on Multiword Expressions (MWE 2017) April 2017

Valencia, Spain

Association for Computational Linguistics 1–10 http://www.aclweb.org/anthology/W17-1701 We present a new freely available dictionary of paraphrases of Czech complex predicates with light verbs, ParaDi. Candidates for single predicative paraphrases of selected complex predicates have been extracted automatically from large monolingual data using word2vec. They have been manually verified and further refined. We demonstrate one of many possible applications of ParaDi in an experiment with improving machine translation quality. inproceedings barancikova-kettnerova:2017:MWE2017 Multi-word Entity Classification in a Highly Multilingual Environment Sophie Chesney Guillaume Jacquet Ralf Steinberger Jakub Piskorski Proceedings of the 13th Workshop on Multiword Expressions (MWE 2017) April 2017

Valencia, Spain

Association for Computational Linguistics 11–20 http://www.aclweb.org/anthology/W17-1702 This paper describes an approach for the classification of millions of existing multi-word entities (MWEntities), such as organisation or event names, into thirteen category types, based only on the tokens they contain. In order to classify our very large in-house collection of multilingual MWEntities into an application-oriented set of entity categories, we trained and tested distantly-supervised classifiers in 43 languages based on MWEntities extracted from BabelNet. The best-performing classifier was the multi-class SVM using a TF.IDF-weighted data representation. Interestingly, one unique classifier trained on a mix of all languages consistently performed better than classifiers trained for individual languages, reaching an averaged F1-value of 88.8%. In this paper, we present the training and test data, including a human evaluation of its accuracy, describe the methods used to train the classifiers, and discuss the results. inproceedings chesney-EtAl:2017:MWE2017 Using bilingual word-embeddings for multilingual collocation extraction Marcos Garcia Marcos García-Salido Margarita Alonso-Ramos Proceedings of the 13th Workshop on Multiword Expressions (MWE 2017) April 2017

Valencia, Spain

Association for Computational Linguistics 21–30 http://www.aclweb.org/anthology/W17-1703 This paper presents a new strategy for multilingual collocation extraction which takes advantage of parallel corpora to learn bilingual word-embeddings. Monolingual collocation candidates are retrieved using Universal Dependencies, while the distributional models are then applied to search for equivalents of the elements of each collocation in the target languages. The proposed method extracts not only collocation equivalents with direct translation between languages, but also other cases where the collocations in the two languages are not literal translations of each other. Several experiments (evaluating collocations with three syntactic patterns) in English, Spanish, and Portuguese show that our approach can effectively extract large pairs of bilingual equivalents with an average precision of about 90%. Moreover, preliminary results on comparable corpora suggest that the distributional models can be applied for identifying new bilingual collocations in different domains. inproceedings garcia-garciasalido-alonsoramos:2017:MWE2017 The PARSEME Shared Task on Automatic Identification of Verbal Multiword Expressions Agata Savary Carlos Ramisch Silvio Cordeiro Federico Sangati Veronika Vincze Behrang QasemiZadeh Marie Candito Fabienne Cap Voula Giouli Ivelina Stoyanova Antoine Doucet Proceedings of the 13th Workshop on Multiword Expressions (MWE 2017) April 2017

Valencia, Spain

Association for Computational Linguistics 31–47 http://www.aclweb.org/anthology/W17-1704 Multiword expressions (MWEs) are known as a "pain in the neck" for NLP due to their idiosyncratic behaviour. While some categories of MWEs have been addressed by many studies, verbal MWEs (VMWEs), such as to take a decision, to break one’s heart or to turn off, have been rarely modelled. This is notably due to their syntactic variability, which hinders treating them as "words with spaces". We describe an initiative meant to bring about substantial progress in understanding, modelling and processing VMWEs. It is a joint effort, carried out within a European research network, to elaborate universal terminologies and annotation guidelines for 18 languages. Its main outcome is a multilingual 5-million-word annotated corpus which underlies a shared task on automatic identification of VMWEs. This paper presents the corpus annotation methodology and outcome, the shared task organisation and the results of the participating systems. inproceedings savary-EtAl:2017:MWE2017 USzeged: Identifying Verbal Multiword Expressions with POS Tagging and Parsing Techniques Katalin Ilona Simkó Viktória Kovács Veronika Vincze Proceedings of the 13th Workshop on Multiword Expressions (MWE 2017) April 2017

Valencia, Spain

Association for Computational Linguistics 48–53 http://www.aclweb.org/anthology/W17-1705 The paper describes our system submitted for the Workshop on Multiword Expressions’ shared task on automatic identification of verbal multiword expressions. It uses POS tagging and dependency parsing to identify single- and multi-token verbal MWEs in text. Our system is language independent and competed on nine of the eighteen languages. Our paper describes how our system works and gives its error analysis for the languages it was submitted for. inproceedings simko-kovacs-vincze:2017:MWE2017 Parsing and MWE Detection: Fips at the PARSEME Shared Task Luka Nerima Vasiliki Foufi Eric Wehrli Proceedings of the 13th Workshop on Multiword Expressions (MWE 2017) April 2017

Valencia, Spain

Association for Computational Linguistics 54–59 http://www.aclweb.org/anthology/W17-1706 Identifying multiword expressions (MWEs) in a sentence in order to ensure their proper processing in subsequent applications, like machine translation, and performing the syntactic analysis of the sentence are interrelated processes. In our approach, priority is given to parsing alternatives involving collocations, and hence collocational information helps the parser through the maze of alternatives, with the aim of achieving substantial improvements in the performance of both tasks (collocation identification and parsing), and in that of a subsequent task (machine translation). In this paper, we present our system and the procedure that we followed in order to participate in the open track of the PARSEME shared task on automatic identification of verbal multiword expressions (VMWEs) in running texts. inproceedings nerima-foufi-wehrli:2017:MWE2017 Neural Networks for Multi-Word Expression Detection Natalia Klyueva Antoine Doucet Milan Straka Proceedings of the 13th Workshop on Multiword Expressions (MWE 2017) April 2017

Valencia, Spain

Association for Computational Linguistics 60–65 http://www.aclweb.org/anthology/W17-1707 In this paper we describe the MUMULS system that participated in the 2017 shared task on automatic identification of verbal multiword expressions (VMWEs). The MUMULS system was implemented using a supervised approach based on recurrent neural networks using the open source library TensorFlow. The model was trained on a data set containing annotated VMWEs as well as morphological and syntactic information. The MUMULS system performed the identification of VMWEs in 15 languages; it was one of the few systems that could categorize VMWE types in nearly all languages. inproceedings klyueva-doucet-straka:2017:MWE2017 Factoring Ambiguity out of the Prediction of Compositionality for German Multi-Word Expressions Stefan Bott Sabine Schulte im Walde Proceedings of the 13th Workshop on Multiword Expressions (MWE 2017) April 2017

Valencia, Spain

Association for Computational Linguistics 66–72 http://www.aclweb.org/anthology/W17-1708 Ambiguity represents an obstacle for distributional semantic models (DSMs), which typically subsume the contexts of all word senses within one vector. While individual vector space approaches have been concerned with sense discrimination (e.g., Schütze 1998, Erk 2009, Erk and Pado 2010), such discrimination has rarely been integrated into DSMs across semantic tasks. This paper presents a soft-clustering approach to sense discrimination that filters sense-irrelevant features when predicting the degrees of compositionality for German noun-noun compounds and German particle verbs. inproceedings bott-schulteimwalde:2017:MWE2017 Multiword expressions and lexicalism: the view from LFG Jamie Y. Findlay Proceedings of the 13th Workshop on Multiword Expressions (MWE 2017) April 2017

Valencia, Spain

Association for Computational Linguistics 73–79 http://www.aclweb.org/anthology/W17-1709 Multiword expressions (MWEs) pose a problem for lexicalist theories like Lexical Functional Grammar (LFG), since they are prima facie counterexamples to a strong form of the lexical integrity principle, which entails that a lexical item can only be realised as a single, syntactically atomic word. In this paper, I demonstrate some of the problems facing any strongly lexicalist account of MWEs, and argue that the lexical integrity principle must be weakened. I conclude by sketching a formalism which integrates a Tree Adjoining Grammar into the LFG architecture, taking advantage of this relaxation. inproceedings findlay:2017:MWE2017 Understanding Idiomatic Variation Kristina Geeraert R. Harald Baayen John Newman Proceedings of the 13th Workshop on Multiword Expressions (MWE 2017) April 2017

Valencia, Spain

Association for Computational Linguistics 80–90 http://www.aclweb.org/anthology/W17-1710 This study investigates the processing of idiomatic variants through an eye-tracking experiment. Four types of idiom variants were included, in addition to the canonical form and the literal meaning. Results suggest that modifications to idioms, modulo obvious effects of length differences, are not more difficult to process than the canonical forms themselves. This fits with recent corpus findings. inproceedings geeraert-baayen-newman:2017:MWE2017 Discovering Light Verb Constructions and their Translations from Parallel Corpora without Word Alignment Natalie Vargas Carlos Ramisch Helena Caseli Proceedings of the 13th Workshop on Multiword Expressions (MWE 2017) April 2017

Valencia, Spain

Association for Computational Linguistics 91–96 http://www.aclweb.org/anthology/W17-1711 We propose a method for joint unsupervised discovery of multiword expressions (MWEs) and their translations from parallel corpora. First, we apply independent monolingual MWE extraction in source and target languages simultaneously. Then, we calculate translation probability, association score and distributional similarity of co-occurring pairs. Finally, we rank all translations of a given MWE using a linear combination of these features. Preliminary experiments on light verb constructions show promising results. inproceedings vargas-ramisch-caseli:2017:MWE2017 Identification of Multiword Expressions for Latvian and Lithuanian: Hybrid Approach Justina Mandravickaite Tomas Krilavičius Proceedings of the 13th Workshop on Multiword Expressions (MWE 2017) April 2017

Valencia, Spain

Association for Computational Linguistics 97–101 http://www.aclweb.org/anthology/W17-1712 We discuss an experiment on automatic identification of bi-gram multi-word expressions in parallel Latvian and Lithuanian corpora. Raw corpora, lexical association measures (LAMs) and supervised machine learning (ML) are used owing to the scarcity and limited quality of lexical resources and tools (e.g., POS-tagger, parser). Combining LAMs with ML, which is rather effective for other languages, yields good results for Lithuanian and Latvian as well. With this combination we achieved 92.4% precision and 52.2% recall for Latvian and 95.1% precision and 77.8% recall for Lithuanian. inproceedings mandravickaite-krilavivcius:2017:MWE2017 Show Me Your Variance and I Tell You Who You Are - Deriving Compound Compositionality from Word Alignments Fabienne Cap Proceedings of the 13th Workshop on Multiword Expressions (MWE 2017) April 2017

Valencia, Spain

Association for Computational Linguistics 102–107 http://www.aclweb.org/anthology/W17-1713 We use word alignment variance as an indicator for the non-compositionality of German and English noun compounds. Our work-in-progress results are on their own not competitive with state-of-the-art approaches, but they show that alignment variance is correlated with compositionality and thus worth a closer look in the future. inproceedings cap:2017:MWE2017 Semantic annotation to characterize contextual variation in terminological noun compounds: a pilot study Melania Cabezas-García Antonio San Martín Proceedings of the 13th Workshop on Multiword Expressions (MWE 2017) April 2017

Valencia, Spain

Association for Computational Linguistics 108–113 http://www.aclweb.org/anthology/W17-1714 Noun compounds (NCs) are semantically complex and not fully compositional, as is often assumed. This paper presents a pilot study regarding the semantic annotation of environmental NCs with a view to accessing their semantics and exploring their domain-based contextual variation. Our results showed that the semantic annotation of NCs afforded important insights into how context impacts their conceptualization. inproceedings cabezasgarcia-sanmartin:2017:MWE2017 Detection of Verbal Multi-Word Expressions via Conditional Random Fields with Syntactic Dependency Features and Semantic Re-Ranking Alfredo Maldonado Lifeng Han Erwan Moreau Ashjan Alsulaimani Koel Dutta Chowdhury Carl Vogel Qun Liu Proceedings of the 13th Workshop on Multiword Expressions (MWE 2017) April 2017

Valencia, Spain

Association for Computational Linguistics 114–120 http://www.aclweb.org/anthology/W17-1715 A description of a system for identifying Verbal Multi-Word Expressions (VMWEs) in running text is presented. The system mainly exploits universal syntactic dependency features through a Conditional Random Fields (CRF) sequence model. The system competed in the Closed Track at the PARSEME VMWE Shared Task 2017, ranking 2nd in most languages on full VMWE-based evaluation and 1st in three languages on token-based evaluation. In addition, this paper presents an option to re-rank the 10 best CRF-predicted sequences via semantic vectors, boosting its scores above other systems in the competition. We also show that all systems in the competition would struggle to beat a simple lookup baseline system and argue for a more purpose-specific evaluation scheme. inproceedings maldonado-EtAl:2017:MWE2017 A data-driven approach to verbal multiword expression detection. PARSEME Shared Task system description paper Tiberiu Boroş Sonia Pipa Verginica Barbu Mititelu Dan Tufiş Proceedings of the 13th Workshop on Multiword Expressions (MWE 2017) April 2017

Valencia, Spain

Association for Computational Linguistics 121–126 http://www.aclweb.org/anthology/W17-1716 "Multiword expressions" are groups of words acting as a morphologic, syntactic and semantic unit in linguistic analysis. Verbal multiword expressions represent the subgroup of multiword expressions, namely that in which a verb is the syntactic head of the group considered in its canonical (or dictionary) form. All multiword expressions are a great challenge for natural language processing, but the verbal ones are particularly interesting for tasks such as parsing, as the verb is the central element in the syntactic organization of a sentence. In this paper we introduce our data-driven approach to verbal multiword expressions which was objectively validated during the PARSEME shared task on verbal multiword expressions identification. We tested our approach on 12 languages, and we provide detailed information about corpora composition, feature selection process, validation procedure and performance on all languages. inproceedings borocs-EtAl:2017:MWE2017 The ATILF-LLF System for Parseme Shared Task: a Transition-based Verbal Multiword Expression Tagger Hazem Al Saied Matthieu Constant Marie Candito Proceedings of the 13th Workshop on Multiword Expressions (MWE 2017) April 2017

Valencia, Spain

Association for Computational Linguistics 127–132 http://www.aclweb.org/anthology/W17-1717 We describe the ATILF-LLF system built for the MWE 2017 Shared Task on automatic identification of verbal multiword expressions. We participated in the closed track only, for all the 18 available languages. Our system is a robust greedy transition-based system, in which MWEs are identified through a MERGE transition. The system was meant to accommodate the variety of linguistic resources provided for each language, in terms of accompanying morphological and syntactic information. Using per-MWE F-score, the system was ranked first for all but two languages (Hungarian and Romanian). inproceedings alsaied-constant-candito:2017:MWE2017 Investigating the Opacity of Verb-Noun Multiword Expression Usages in Context Shiva Taslimipoor Omid Rohanian Ruslan Mitkov Afsaneh Fazly Proceedings of the 13th Workshop on Multiword Expressions (MWE 2017) April 2017

Valencia, Spain

Association for Computational Linguistics 133–138 http://www.aclweb.org/anthology/W17-1718 This study investigates the supervised token-based identification of Multiword Expressions (MWEs). This is ongoing research that exploits the information contained in the contexts in which different instances of an expression could occur. This information is used to investigate the question of whether an expression is literal or an MWE. Lexical and syntactic context features derived from vector representations are shown to be more effective than traditional statistical measures for identifying tokens of MWEs. inproceedings taslimipoor-EtAl:2017:MWE2017 Compositionality in Verb-Particle Constructions Archna Bhatia Choh Man Teng James Allen Proceedings of the 13th Workshop on Multiword Expressions (MWE 2017) April 2017

Valencia, Spain

Association for Computational Linguistics 139–148 http://www.aclweb.org/anthology/W17-1719 We are developing a broad-coverage deep semantic lexicon for a system that parses sentences into a logical form expressed in a rich ontology that supports reasoning. In this paper we look at verb-particle constructions (VPCs), and the extent to which they can be treated compositionally vs idiomatically. First we distinguish between the different types of VPCs based on their compositionality and then present a set of heuristics for classifying specific instances as compositional or not. We then identify a small set of general sense classes for particles when used compositionally and discuss the resulting lexical representations that are being added to the lexicon. By treating VPCs as compositional whenever possible, we attain broad coverage in a compact way, and also enable interpretations of novel VPC usages not explicitly present in the lexicon. inproceedings bhatia-teng-allen:2017:MWE2017 Rule-Based Translation of Spanish Verb-Noun Combinations into Basque Uxoa Iñurrieta Itziar Aduriz Arantza Diaz de Ilarraza Gorka Labaka Kepa Sarasola Proceedings of the 13th Workshop on Multiword Expressions (MWE 2017) April 2017

Valencia, Spain

Association for Computational Linguistics 149–154 http://www.aclweb.org/anthology/W17-1720 This paper presents a method to improve the translation of Verb-Noun Combinations (VNCs) in a rule-based Machine Translation (MT) system for Spanish-Basque. Linguistic information about a set of VNCs is gathered from the public database Konbitzul, and it is integrated into the MT system, leading to an improvement in BLEU, NIST and TER scores, and to results that are evidently better according to human evaluators. inproceedings inurrieta-EtAl:2017:MWE2017 Verb-Particle Constructions in Questions Veronika Vincze Proceedings of the 13th Workshop on Multiword Expressions (MWE 2017) April 2017

Valencia, Spain

Association for Computational Linguistics 155–160 http://www.aclweb.org/anthology/W17-1721 In this paper, we investigate the behavior of verb-particle constructions in English questions. We present a small dataset that contains questions and verb-particle construction candidates. We demonstrate that there are significant differences in the distribution of WH-words, verbs and prepositions/particles in sentences that contain VPCs and sentences that contain only verb + prepositional phrase combinations both by statistical means and in machine learning experiments. Hence, VPCs and non-VPCs can be effectively separated from each other by using a rich feature set, containing several novel features. inproceedings vincze:2017:MWE2017 Simple Compound Splitting for German Marion Weller-Di Marco Proceedings of the 13th Workshop on Multiword Expressions (MWE 2017) April 2017

Valencia, Spain

Association for Computational Linguistics 161–166 http://www.aclweb.org/anthology/W17-1722 This paper presents a simple method for German compound splitting that combines a basic frequency-based approach with a form-to-lemma mapping to approximate morphological operations. With the exception of a small set of hand-crafted rules for modeling transitional elements, this approach is resource-poor. In our evaluation, the simple splitter outperforms a splitter relying on rich morphological resources. inproceedings wellerdimarco:2017:MWE2017 Identification of Ambiguous Multiword Expressions Using Sequence Models and Lexical Resources Manon Scholivet Carlos Ramisch Proceedings of the 13th Workshop on Multiword Expressions (MWE 2017) April 2017

Valencia, Spain

Association for Computational Linguistics 167–175 http://www.aclweb.org/anthology/W17-1723 We present a simple and efficient tagger capable of identifying highly ambiguous multiword expressions (MWEs) in French texts. It is based on conditional random fields (CRF), using local context information as features. We show that this approach can obtain results that, in some cases, approach more sophisticated parser-based MWE identification methods without requiring syntactic trees from a treebank. Moreover, we study how well the CRF can take into account external information coming from a lexicon. inproceedings scholivet-ramisch:2017:MWE2017 Comparing Recurring Lexico-Syntactic Trees (RLTs) and Ngram Techniques for Extended Phraseology Extraction Agnès Tutin Olivier Kraif Proceedings of the 13th Workshop on Multiword Expressions (MWE 2017) April 2017

Valencia, Spain

Association for Computational Linguistics 176–180 http://www.aclweb.org/anthology/W17-1724 This paper aims at assessing to what extent a syntax-based method (Recurring Lexico-syntactic Trees (RLT) extraction) allows us to extract large phraseological units such as prefabricated routines, e.g. "as previously said" or "as far as we/I know" in scientific writing. In order to evaluate this method, we compare it to the classical ngram extraction technique, on a subset of recurring segments including speech verbs in a French corpus of scientific writing. Results show that the RLT extraction technique is far more efficient for extended MWEs such as routines or collocations but performs more poorly for surface phenomena such as syntactic constructions or fully frozen expressions. inproceedings tutin-kraif:2017:MWE2017 Benchmarking Joint Lexical and Syntactic Analysis on Multiword-Rich Data Matthieu Constant Héctor Martínez Alonso Proceedings of the 13th Workshop on Multiword Expressions (MWE 2017) April 2017

Valencia, Spain

Association for Computational Linguistics 181–186 http://www.aclweb.org/anthology/W17-1725 This article evaluates the extension of a dependency parser that performs joint syntactic analysis and multiword expression identification. We show that, given sufficient training data, the parser benefits from explicit multiword information and improves overall labeled accuracy score in eight of the ten evaluation cases. inproceedings constant-martinezalonso:2017:MWE2017 Semi-Automated Resolution of Inconsistency for a Harmonized Multiword Expression and Dependency Parse Annotation King Chan Julian Brooke Timothy Baldwin Proceedings of the 13th Workshop on Multiword Expressions (MWE 2017) April 2017

Valencia, Spain

Association for Computational Linguistics 187–193 http://www.aclweb.org/anthology/W17-1726 This paper presents a methodology for identifying and resolving various kinds of inconsistency in the context of merging dependency and multiword expression (MWE) annotations, to generate a dependency treebank with comprehensive MWE annotations. Candidates for correction are identified using a variety of heuristics, including an entirely novel one which identifies violations of MWE constituency in the dependency tree, and resolved by arbitration with minimal human intervention. Using this technique, we identified and corrected several hundred errors across both parse and MWE annotations, representing changes to a significant percentage (well over 10%) of the MWE instances in the joint corpus. inproceedings chan-brooke-baldwin:2017:MWE2017 Combining Linguistic Features for the Detection of Croatian Multiword Expressions Maja Buljan Jan Šnajder Proceedings of the 13th Workshop on Multiword Expressions (MWE 2017) April 2017

Valencia, Spain

Association for Computational Linguistics 194–199 http://www.aclweb.org/anthology/W17-1727 As multiword expressions (MWEs) exhibit a range of idiosyncrasies, their automatic detection warrants the use of many different features. Tsvetkov and Wintner (2014) proposed a Bayesian network model that combines linguistically motivated features and also models their interactions. In this paper, we extend their model with new features and apply it to Croatian, a morphologically complex language with relatively free word order, achieving a satisfactory performance of 0.823 F1-score. Furthermore, by comparing against (semi)naive Bayes models, we demonstrate that manually modeling feature interactions is indeed important. We make our annotated dataset of Croatian MWEs freely available. inproceedings buljan-vsnajder:2017:MWE2017 Complex Verbs are Different: Exploring the Visual Modality in Multi-Modal Models to Predict Compositionality Maximilian Köper Sabine Schulte im Walde Proceedings of the 13th Workshop on Multiword Expressions (MWE 2017) April 2017

Valencia, Spain

Association for Computational Linguistics 200–206 http://www.aclweb.org/anthology/W17-1728 This paper compares a neural network DSM relying on textual co-occurrences with a multi-modal model integrating visual information. We focus on nominal vs. verbal compounds, and zoom into lexical, empirical and perceptual target properties to explore the contribution of the visual modality. Our experiments show that (i) visual features contribute differently for verbs than for nouns, and (ii) images complement textual information, if (a) the textual modality by itself is poor and appropriate image subsets are used, or (b) the textual modality by itself is rich and large (potentially noisy) images are added. inproceedings koper-schulteimwalde:2017:MWE2017